•Obtain a pre-trained BERT model of Chinese clinical records, public and available for the community.
•Incorporate dictionary features and radical features into a deep learning model, BERT + BiLSTM + CRF.
•Outperform all other methods on the CCKS-2017 and CCKS-2018 clinical named entity recognition datasets.
Clinical Named Entity Recognition (CNER) is a critical task that aims to identify and classify clinical terms in electronic medical records. In recent years, deep neural networks have achieved significant success in CNER. However, these methods require high-quality, large-scale labeled clinical data, which are challenging and expensive to obtain, especially for Chinese clinical records. To tackle the Chinese CNER task, we pre-train a BERT model on unlabeled Chinese clinical records, which leverages unlabeled domain-specific knowledge. Additional layers, such as Long Short-Term Memory (LSTM) and Conditional Random Field (CRF), are used to extract text features and decode the predicted tags, respectively. In addition, we propose a new strategy to incorporate dictionary features into the model, and radical features of Chinese characters are used to further improve performance. To the best of our knowledge, our ensemble model outperforms state-of-the-art models, achieving an 89.56% strict F1 score on the CCKS-2018 dataset and a 91.60% F1 score on the CCKS-2017 dataset.
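The CRF decoding step mentioned above can be made concrete. The sketch below is not the paper's implementation; it shows standard Viterbi decoding for a linear-chain CRF over per-token emission scores of the kind a BiLSTM head would produce, with toy scores and a toy two-tag set.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence under a linear-chain CRF.

    emissions  : (seq_len, n_tags) per-token tag scores (e.g. from a BiLSTM).
    transitions: (n_tags, n_tags) score of moving from tag i to tag j.
    Returns the best tag-index sequence as a list.
    """
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()                 # best score ending in each tag
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # candidate[i, j]: end at t-1 in tag i, then transition to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

With zero transition scores the decode reduces to per-token argmax; a transition matrix that rewards staying in the same tag smooths the sequence, which is exactly why a CRF layer helps over greedy tagging.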
The incubation period and generation time are key characteristics in the analysis of infectious diseases. The commonly used contact-tracing-based estimation of the incubation distribution is highly influenced by individuals' judgment of the possible date of exposure and can lead to significant errors. On the other hand, interval-censoring-based methods can utilize a much larger set of travel data but may encounter biased-sampling problems. The distribution of the generation time is usually approximated by observed serial intervals; however, this may result in a biased estimate of the generation time, especially when the disease is infectious during incubation. In this paper, theory from renewal processes is partially adopted by treating the incubation period as the interarrival time, and the duration between departure from Wuhan and onset of symptoms as a mixture of the forward time and the interarrival time with censored intervals. In addition, a consistent estimator of the generation-time distribution based on the incubation period and the serial interval is proposed for incubation-infectious diseases. A real-case application to the current outbreak of COVID-19 is implemented. We find that the incubation period has a median of 8.50 days (95% confidence interval [CI] 7.22, 9.15). The basic reproduction number in the early phase of the COVID-19 outbreak, based on the proposed generation-time estimation, is estimated to be 2.96 (95% CI 2.15, 3.86).
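The bias from approximating generation time by serial intervals can be illustrated numerically. A serial interval equals the generation time plus the infectee's incubation period minus the infector's, so with i.i.d. incubation periods the two agree in expectation but the serial interval is more dispersed. The distributions below are hypothetical, chosen purely for illustration, not the paper's fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical distributions for illustration only (not the paper's estimates).
gen_time = rng.gamma(shape=4.0, scale=1.25, size=n)        # true generation times
inc_infector = rng.lognormal(mean=1.8, sigma=0.5, size=n)  # infector incubation
inc_infectee = rng.lognormal(mean=1.8, sigma=0.5, size=n)  # infectee incubation

# Serial interval = generation time + (infectee incubation - infector incubation).
# The incubation terms cancel in the mean but inflate the variance.
serial = gen_time + inc_infectee - inc_infector
```

Because the incubation difference has mean zero, the observed serial intervals match the generation times on average yet are noticeably more spread out (and can even be negative), which is why substituting one distribution for the other biases downstream quantities such as the reproduction number.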
The ICH E9 (R1) addendum proposes five strategies for defining estimands by addressing intercurrent events. However, mathematical forms of these targeted quantities are lacking, which might lead to discordance between the statisticians who estimate these quantities and the clinicians, drug sponsors, and regulators who interpret them. To improve concordance, we provide a unified four-step procedure for constructing the mathematical estimands. We apply the procedure to each strategy to derive the mathematical estimands and compare the five strategies in terms of practical interpretation, data collection, and analytical methods. Finally, using two real clinical trials, we show that the procedure can ease the task of defining estimands in settings with multiple types of intercurrent events.
In diagnostic radiology, the multireader multicase (MRMC) design and the free-response receiver operating characteristic (FROC) method are often used in combination. The cross-correlated data generated by an MRMC-FROC study complicate the corresponding analysis, and the need to include covariates in the model complicates it further. In this paper, we propose a regression approach based on three new measures with good interpretability. The correlation structure of the original test results is taken directly into account in the estimation procedure, and the proposed method allows the inclusion of continuous or discrete covariates. Consistent and asymptotically normal estimators are derived for the new measures. Simulation studies are conducted to evaluate the performance of the proposed approach, and the results show that it performs well under a wide range of scenarios. We also apply the proposed regression approach to a diagnostic study of computer-aided diagnosis in lung cancer.
The National Alzheimer's Coordinating Center Uniform Data Set includes test results from a battery of cognitive exams. Motivated by the need to model the cognitive ability of low-performing patients, we create a composite score from ten tests and propose to model this score using a partially linear quantile regression model for longitudinal studies with non-ignorable dropout. Quantile regression allows for modeling non-central tendencies, and the partially linear model accommodates nonlinear relationships between some of the covariates and cognitive ability. The data set includes patients who leave the study before its conclusion; ignoring such dropouts results in biased estimates if the probability of dropout depends on the response. To handle this challenge, we propose a weighted quantile regression estimator in which the weights are inversely proportional to the estimated probability that a subject remains in the study. We prove that this weighted estimator is a consistent and efficient estimator of both the linear and nonlinear effects.
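The inverse-probability-weighting idea behind the proposed estimator can be sketched in a stripped-down setting: a single weighted quantile with no covariates and known dropout probabilities, rather than the paper's full partially linear longitudinal model. When the chance of remaining in the study falls as the response rises, the naive median of the observed responses is biased low, while reweighting each observation by the inverse of its retention probability recovers the true median.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile of `values` under `weights` (here, inverse retention probabilities)."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cdf, q)]

rng = np.random.default_rng(1)
n = 200_000
y = rng.normal(size=n)                       # outcome; true median is 0

# Retention probability decreases with the response (response-dependent dropout).
p_stay = 1.0 / (1.0 + np.exp(y))             # P(subject remains in study)
observed = rng.random(n) < p_stay

naive = np.median(y[observed])               # biased: high responders drop out
ipw = weighted_quantile(y[observed], 1.0 / p_stay[observed], 0.5)
```

In practice the retention probabilities are unknown and must themselves be estimated, which is part of what the paper's estimator and its consistency proof address.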
As a fundamental component of health care, disease screening is of high importance. Oftentimes, two screening tests for a specific disease are compared in order to determine an optimal screening policy; for example, the digital rectal examination (DRE) and the serum prostate-specific antigen (PSA) level are compared for screening prostate cancer. Ideally, if a gold-standard test were given to every screened individual to establish their true disease status, the difference in accuracy measures between the two tests could be evaluated. In practice, however, it is common that only individuals who test positive on at least one screening test receive the gold-standard test, which is often invasive and, for ethical reasons, cannot be applied to those with negative results on both tests. Under such circumstances, the difference in accuracy measures between the two tests cannot be point-identified, which makes inference within this framework challenging. In this article, using sensitivity and specificity as measures of test accuracy, we show that their difference between the two tests is interval-identified, bounded by estimable sharp bounds. We establish asymptotic normality for the estimators of the bounds and construct confidence intervals for the difference using methods for inference on partially identified parameters. The performance of the constructed confidence intervals for the difference and its sharp bounds is evaluated via simulation studies. We also apply the proposed method to the prostate cancer example to compare the accuracy of DRE and PSA.
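A toy sketch of why the sensitivity difference is only interval-identified: subjects negative on both tests are never verified, yet they are negative on both tests, so the numerator of Se_A - Se_B = [#(A+, D+) - #(B+, D+)] / #D+ is fully observed and only the total number of diseased subjects is unknown. The parameterization and function name below are ours, for illustration only; the paper's sharp bounds and inference procedure are more general (covering specificity and sampling variability as well).

```python
def sens_diff_bounds(n_pp_dis, n_pn_dis, n_np_dis, n_both_neg):
    """Bounds on Se_A - Se_B when only double-negative subjects go unverified.

    n_pp_dis   : diseased among verified subjects positive on both tests
    n_pn_dis   : diseased among verified subjects positive on test A only
    n_np_dis   : diseased among verified subjects positive on test B only
    n_both_neg : subjects negative on both tests (never verified)
    """
    # Double negatives contribute to neither #(A+, D+) nor #(B+, D+),
    # so the numerator of the difference is known exactly:
    numer = n_pn_dis - n_np_dis
    d_verified = n_pp_dis + n_pn_dis + n_np_dis
    # The unknown is how many double negatives are diseased: anywhere in
    # [0, n_both_neg]. The difference is monotone in that count, so the
    # extremes give the bounds.
    candidates = (numer / d_verified, numer / (d_verified + n_both_neg))
    return min(candidates), max(candidates)
```

For instance, with 60 diseased double positives, 30 diseased A-only positives, 10 diseased B-only positives, and 100 unverified double negatives, the difference is bounded between 0.1 and 0.2: nonzero in either case, but its magnitude is not point-identified.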
This study proposes novel estimation and inference approaches for heterogeneous local treatment effects using high-dimensional covariates and observational data without a strong ignorability assumption. To achieve this, with a binary instrumental variable, the parameters of interest are identified on an unobservable subgroup of the population (the compliers). Lasso estimation under a non-convex objective function is developed for a two-stage generalized linear model, and a debiased estimator is proposed to construct confidence intervals for treatment effects conditional on covariates. Notably, this approach simultaneously corrects the biases due to high-dimensional estimation at both stages. The finite-sample performance is evaluated via simulation studies, and a real data analysis of the Oregon Health Insurance Experiment illustrates the feasibility of the proposed procedure.
In prognosis studies evaluating the association between a continuous biomarker and a survival outcome, investigators often classify subjects into high- and low-expression groups and apply simple survival analysis techniques, namely the Kaplan-Meier method and the log-rank test. High and low expression are defined according to whether the biomarker observation exceeds a cut-off value, which is heterogeneous across studies. These heterogeneous cut-off definitions make it difficult to apply standard meta-analysis techniques. We propose a method to estimate the concordance index for a survival outcome by synthesizing published prognosis studies in which the Kaplan-Meier estimates for the high- and low-expression groups are reported. We illustrate the proposed method with a real dataset for a meta-analysis of prognosis studies evaluating Ki-67 in early breast cancer and evaluate its performance with a simulation study.