Population‐based cancer registries have improved dramatically over the last 2 decades. These central cancer registries provide a critical framework that can elevate the science of cancer research. There have also been important technical and scientific advances that help to unlock the potential of population‐based cancer registries. These advances include improvements in probabilistic record linkage, refinements in natural language processing, the ability to perform genomic sequencing on formalin‐fixed, paraffin‐embedded (FFPE) tissue, and improvements in the ability to identify activity levels of many different signaling molecules in FFPE tissue. This article describes how central cancer registries can provide a population‐based sample frame that will lead to studies with strong external validity, how central cancer registries can link with public and private health insurance claims to obtain complete treatment information, how central cancer registries can use informatics techniques to provide population‐based rapid case ascertainment, how central cancer registries can serve as a population‐based virtual tissue repository, and how population‐based cancer registries are essential for guiding the implementation of evidence‐based interventions and measuring changes in the cancer burden after the implementation of these interventions.
Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence; for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks: site, subsite, laterality, histology, behavior, and grade. We expect that with minimal modifications, our add-on can be applied towards a wide range of other clinical text-based tasks.
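As a rough illustration of what case-level context can mean in practice, the sketch below appends a case-wide summary vector to each report's own embedding before classification. The mean over all reports is a deliberately simple stand-in chosen for this example; the abstract does not describe the add-on's internals, and the function name and toy vectors are hypothetical.

```python
def add_case_context(report_vectors):
    """Append the case-wide mean vector to every report vector.

    report_vectors: list of equal-length lists of floats, one per report
    in a single cancer case. Mean pooling is an assumption made for this
    sketch, not the aggregation used in the paper.
    """
    n_reports = len(report_vectors)
    dim = len(report_vectors[0])
    # Element-wise mean across all reports in the case.
    case_mean = [
        sum(vec[j] for vec in report_vectors) / n_reports for j in range(dim)
    ]
    # Each report keeps its own features and gains the shared context.
    return [list(vec) + case_mean for vec in report_vectors]


# Toy case with three reports, each embedded in two dimensions.
case = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = add_case_context(case)
print(len(augmented), len(augmented[0]))
```

A downstream classifier would then take the widened vectors as input, which is why an add-on of this kind can sit on top of most per-report architectures.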
Bidirectional Encoder Representations from Transformers (BERT) and BERT-based approaches are the current state-of-the-art in many natural language processing (NLP) tasks; however, their application to document classification on long clinical texts is limited. In this work, we introduce four methods to scale BERT, which by default can only handle input sequences up to approximately 400 words long, to perform document classification on clinical texts several thousand words long. We compare these methods against two much simpler architectures (a word-level convolutional neural network and a hierarchical self-attention network) and show that BERT often cannot beat these simpler baselines when classifying MIMIC-III discharge summaries and SEER cancer pathology reports. In our analysis, we show that two key components of BERT, pretraining and WordPiece tokenization, may actually be inhibiting BERT's performance on clinical text classification tasks where the input document is several thousand words long and where correctly identifying labels may depend more on identifying a few key words or phrases than on understanding the contextual meaning of sequences of text.
Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to effectively train a model. In this study, we analyze the effectiveness of 11 active learning algorithms on classifying subsite and histology from cancer pathology reports using a Convolutional Neural Network as the text classification model.
We compare the performance of each active learning strategy using two differently sized datasets and two different classification tasks. Our results show that on all tasks and dataset sizes, all active learning strategies except diversity-sampling strategies outperformed random sampling, i.e., no active learning. On our large dataset (15K initial labelled samples, adding 15K additional labelled samples each iteration of active learning), there was no clear winner between the different active learning strategies. On our small dataset (1K initial labelled samples, adding 1K additional labelled samples each iteration of active learning), marginal and ratio uncertainty sampling performed better than all other active learning techniques. We found that compared to random sampling, active learning strongly helps performance on rare classes by focusing on underrepresented classes.
Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. Our results show that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.
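The marginal and ratio uncertainty sampling strategies named above can be sketched in a few lines. This is a minimal illustration using the usual textbook definitions (the gap between, and the ratio of, the two highest predicted class probabilities); the function names and toy probabilities are hypothetical, and the study's own implementation may differ.

```python
def margin_score(probs):
    """Gap between the two highest class probabilities (small = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def ratio_score(probs):
    """Second-highest over highest probability (close to 1 = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return second / top


def select_for_labelling(pool_probs, k, strategy="margin"):
    """Return indices of the k most uncertain unlabelled samples."""
    if strategy == "margin":
        key = lambda i: margin_score(pool_probs[i])   # ascending margin
    elif strategy == "ratio":
        key = lambda i: -ratio_score(pool_probs[i])   # descending ratio
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(range(len(pool_probs)), key=key)[:k]


# Toy predicted class probabilities for four unlabelled reports.
pool = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
]
print(select_for_labelling(pool, 2, "margin"))  # [2, 1]
print(select_for_labelling(pool, 2, "ratio"))   # [2, 1]
```

In each active learning iteration the selected samples would be sent to annotators, added to the labelled set, and the classifier retrained before scoring the remaining pool again.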
Abstract
Background
Cancer is a leading cause of death by disease among children and adolescents in the United States. This study updates cancer incidence rates and trends using the most recent and comprehensive US cancer registry data available.
Methods
We used data from US Cancer Statistics to evaluate counts, age-adjusted incidence rates, and trends among children and adolescents younger than 20 years of age diagnosed with malignant tumors between 2003 and 2019. We calculated the average annual percent change (AAPC) and annual percent change (APC) using joinpoint regression. Rates and trends were stratified by demographic and geographic characteristics and by cancer type.
Results
With 248 749 cases reported between 2003 and 2019, the overall cancer incidence rate was 178.3 per 1 million; incidence rates were highest for leukemia (46.6), central nervous system neoplasms (30.8), and lymphoma (27.3). Rates were highest for males, children 0 to 4 years of age, Non-Hispanic White children and adolescents, those in the Northeast census region, the top 25% of counties by economic status, and metropolitan counties with a population of 1 million people or more. Although the overall incidence rate of pediatric cancer increased 0.5% per year on average between 2003 and 2019, the rate increased between 2003 and 2016 (APC = 1.1%), and then decreased between 2016 and 2019 (APC = –2.1%). Between 2003 and 2019, rates of leukemia, lymphoma, hepatic tumors, bone tumors, and thyroid carcinomas increased, while melanoma rates decreased. Rates of central nervous system neoplasms increased until 2017, and then decreased. Rates of other cancer types remained stable.
Conclusions
Incidence of pediatric cancer increased overall, although increases were limited to certain cancer types. These findings may guide future public health and research priorities.
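The relationship between the segment-specific APCs and the overall average change reported above can be checked with a short calculation. The sketch below uses the standard joinpoint definitions: each APC corresponds to a slope on the log-rate scale, and the average over the full period is the length-weighted mean of those slopes, transformed back to a percentage. The function names are illustrative, and the segment boundaries are taken directly from the Results.

```python
import math


def slope_from_apc(apc_percent):
    """Log-scale slope implied by an annual percent change."""
    return math.log(1 + apc_percent / 100)


def apc_from_slope(slope):
    """Annual percent change implied by a log-scale slope."""
    return (math.exp(slope) - 1) * 100


def average_annual_percent_change(segments):
    """Length-weighted average of segment slopes, expressed as a percent.

    segments: list of (start_year, end_year, apc_percent) tuples.
    """
    total_years = sum(end - start for start, end, _ in segments)
    weighted_slope = sum(
        (end - start) * slope_from_apc(apc) for start, end, apc in segments
    )
    return apc_from_slope(weighted_slope / total_years)


# Segments reported in the Results: +1.1% per year, then -2.1% per year.
segments = [(2003, 2016, 1.1), (2016, 2019, -2.1)]
print(round(average_annual_percent_change(segments), 1))  # 0.5
```

The result agrees with the reported average increase of 0.5% per year between 2003 and 2019.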
Recent reports indicate that thoracoscopic lobectomy for lung cancer may be associated with lower rates of surgical upstaging. We queried a statewide cancer registry for differences in upstaging rates and survival by surgical approach.
The Kentucky Cancer Registry (KCR) collects data, including centralized pathology reporting, on cancer patients treated statewide. We performed a retrospective review from 2010 to 2012 to examine clinical and pathologic stage. We assessed rates of upstaging and whether the surgical approach, thoracotomy (THOR) versus minimally invasive techniques (video-assisted thoracic surgery; VATS), had an impact on final pathologic stage and survival.
The KCR database from 2010 to 2012 contained information on 2830 lung cancer cases, 1964 having THOR procedures and 500 having VATS resections. Preoperatively, 36.4% of THOR were clinically stage 1a versus 47.4% of VATS (p = 0.0002). Of these, final pathologic stage remained stage 1a in 30.5% of THOR procedures and 38.0% of VATS (p = 0.0002). The overall nodal upstaging rate was 9.9% for THOR and 4.8% for VATS (p = 0.002). Decreased nodal upstaging was found with VATS, independent of tumor size and extent of resection (odds ratio 0.6, 95% confidence interval [CI]: 0.387 to 0.985, p = 0.04). However, improved survival was found with VATS compared with THOR (hazard ratio 0.733, 95% CI: 0.592 to 0.907, p = 0.0042).
Consistent with other reports, we report a lower upstaging rate with VATS. Nevertheless, there is a survival advantage in VATS patients. Although selection bias may play a role in these observed differences, the improved quality of life measures associated with VATS may explain survival improvement despite lower surgical upstaging.
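The hazard ratio, confidence interval, and p value reported above for survival can be checked against one another. The sketch below assumes the interval is a Wald interval computed on the log scale, which is the usual convention for Cox models but is not stated in the abstract; the function name is illustrative.

```python
import math


def wald_check(estimate, lower, upper, z_crit=1.959964):
    """Recover the log-scale standard error, z statistic, and two-sided p value.

    Assumes a symmetric 95% Wald interval on the log scale.
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Survival with VATS versus THOR: hazard ratio 0.733, 95% CI 0.592 to 0.907.
se, z, p = wald_check(0.733, 0.592, 0.907)
print(f"se={se:.3f}  z={z:.2f}  p={p:.4f}")
```

Under that assumption the recovered p value is roughly 0.004, in line with the reported p = 0.0042; small differences are expected because the published estimate and bounds are rounded.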
Background
Depression is common among breast cancer patients and can affect concordance with guideline‐recommended treatment plans. Yet, the impact of depression on cancer treatment and survival is understudied, particularly in relation to the timing of the depression diagnosis.
Methods
The Kentucky Cancer Registry data was used to identify female patients diagnosed with primary invasive breast cancer who were 20 years of age or older in 2007–2011. Patients were classified as having no depression, depression pre‐cancer diagnosis only, depression post‐cancer diagnosis only, or persistent depression. The impact of depression on receiving guideline‐recommended treatment and survival was examined using multivariable logistic regression and Cox regression, respectively.
Results
Of 6054 eligible patients, 4.1%, 3.7%, and 6.2% of patients had persistent depression, depression pre‐diagnosis only, and depression post‐diagnosis only, respectively. A total of 1770 (29.2%) patients did not receive guideline‐recommended cancer treatment. Compared to patients with no depression, the odds of receiving guideline‐recommended treatment were decreased in patients with depression pre‐diagnosis only (odds ratio [OR], 0.75; 95% confidence interval [CI], 0.54–1.04) but not in patients with post‐diagnosis only or persistent depression. Depression post‐diagnosis only (hazard ratio, 1.51; 95% CI, 1.24–1.83) and depression pre‐diagnosis only (hazard ratio, 1.26; 95% CI, 0.99–1.59) were associated with worse survival. No significant difference in survival was found between patients with persistent depression and patients with no depression (p > .05).
Conclusions
Neglecting depression management after a breast cancer diagnosis may result in poorer cancer treatment concordance and worse survival. Early detection and consistent management of depression are critical in improving patient survival.
Depression impacts cancer care and survival for breast cancer patients. Study results suggest depression screening and treatment are an important part of overall care for breast cancer patients.
Real-world evidence for radiation therapy (RT) is limited because it is often documented only in the clinical narrative. We developed a natural language processing system for automated extraction of detailed RT events from text to support clinical phenotyping.
A multi-institutional data set of 96 clinician notes, 129 North American Association of Central Cancer Registries cancer abstracts, and 270 RT prescriptions from HemOnc.org was used and divided into train, development, and test sets. Documents were annotated for RT events and associated properties: dose, fraction frequency, fraction number, date, treatment site, and boost. Named entity recognition models for properties were developed by fine-tuning BioClinicalBERT and RoBERTa transformer models. A multiclass RoBERTa-based relation extraction model was developed to link each dose mention with each property in the same event. Models were combined with symbolic rules to create a hybrid end-to-end pipeline for comprehensive RT event extraction.
Named entity recognition models were evaluated on the held-out test set with F1 results of 0.96, 0.88, 0.94, 0.88, 0.67, and 0.94 for dose, fraction frequency, fraction number, date, treatment site, and boost, respectively. The relation model achieved an average F1 of 0.86 when the input was gold-labeled entities. The end-to-end system F1 result was 0.81. The end-to-end system performed best on North American Association of Central Cancer Registries abstracts (average F1 0.90), which are mostly copy-paste content from clinician notes.
We developed methods and a hybrid end-to-end system for RT event extraction, which is the first natural language processing system for this task. This system provides proof-of-concept for real-world RT data collection for research and is promising for the potential of natural language processing methods to support clinical care.
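For readers less familiar with the metric, the sketch below shows how an F1 score is computed from true positive, false positive, and false negative counts, and how the six per-property scores reported above could be summarized. The counts in the usage example are hypothetical, and the unweighted mean is computed here purely for illustration; it is not a figure reported in the abstract.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one entity type (not taken from the study).
print(round(f1_score(tp=8, fp=2, fn=2), 2))  # 0.8

# Per-property named entity recognition F1 scores as reported above.
reported_f1 = {
    "dose": 0.96,
    "fraction frequency": 0.88,
    "fraction number": 0.94,
    "date": 0.88,
    "treatment site": 0.67,
    "boost": 0.94,
}
# Unweighted (macro) mean across the six properties.
macro_f1 = sum(reported_f1.values()) / len(reported_f1)
print(round(macro_f1, 2))  # 0.88
```

The mean makes plain that treatment site, at 0.67, is the outlier among otherwise high per-property scores.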
Summary
Lung cancer carries a poor prognosis and is the most common cause of cancer-related death worldwide. The integrin α6β4, a laminin receptor, promotes carcinoma progression in part by cooperating with various growth factor receptors to facilitate invasion and metastasis. In carcinoma cells with mutant TP53, the integrin α6β4 promotes cell survival. TP53 mutations and integrin α6β4 overexpression co-occur in many aggressive malignancies. Due to the high frequency of TP53 mutations in lung squamous cell carcinoma (SCC), we sought to investigate the association of integrin β4 expression with clinicopathologic features and survival in non-small cell lung cancer (NSCLC). We constructed a lung cancer tissue microarray and stained sections for integrin β4 subunit expression using immunohistochemistry. We found that integrin β4 expression is elevated in SCC compared to adenocarcinoma (P < .0001), which was confirmed in external gene expression datasets (P < .0001). We also determined that integrin β4 overexpression associates with the presence of venous invasion (P = .0048), and with reduced overall patient survival (hazard ratio 1.46, 95% confidence interval 1.01 to 2.09, P = .0422). Elevated integrin β4 expression was also shown to associate with reduced overall survival in lung cancer gene expression datasets (hazard ratio 1.49, 95% confidence interval 1.31 to 1.69, P < .0001). Using cBioPortal, we generated a network map demonstrating the 50 most highly altered genes neighboring ITGB4 in SCC, which included laminins, collagens, CD151, genes in the EGFR and PI3K pathways, and other known signaling partners. In conclusion, we demonstrate that integrin β4 is overexpressed in NSCLC, where it is an adverse prognostic marker.
Abstract
Background
Public data commons (PDC) have been highlighted in the scientific literature for their capacity to collect and harmonize big data. On the other hand, local data commons (LDC), located within an institution or organization, have been underrepresented in the scientific literature, even though they are a critical part of research infrastructure. Being closest to the sources of data, LDCs can collect and maintain the most up-to-date, high-quality data within an organization. As data providers, LDCs face many challenges in both collecting and standardizing data; moreover, as consumers of PDC, they face problems of data harmonization stemming from the monolithic harmonization pipeline designs commonly adopted by many PDCs. Unfortunately, existing guidelines and resources for building and maintaining data commons focus exclusively on PDC and provide very little information on LDC.
Results
This article focuses on four important observations. First, there are three different types of LDC service models that are defined based on their roles and requirements. These can be used as guidelines for building new LDC or enhancing the services of existing LDC. Second, the seven core services of LDC are discussed, including cohort identification and facilitation of genomic sequencing, the management of molecular reports and associated infrastructure, quality control, data harmonization, data integration, data sharing, and data access control. Third, instead of commonly developed monolithic systems, we propose a new data sharing method for data harmonization that combines both divide-and-conquer and bottom-up approaches. Finally, an end-to-end LDC implementation is introduced with real-world examples.
Conclusions
Although LDCs are an optimal place to identify and address data quality issues, they have traditionally been relegated to the role of passive data provider for much larger PDC. Indeed, many LDCs limit their functions to only conducting routine data storage and transmission tasks due to a lack of information on how to design, develop, and improve their services using limited resources. We hope that this work will be the first small step in raising awareness among the LDCs of their expanded utility and to publicize to a wider audience the importance of LDC.