Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study.
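As a rough illustration of how the group LASSO mentioned above selects whole groups, the sketch below applies group-wise soft-thresholding, the proximal operator of the penalty lam * sum_g sqrt(p_g) * ||beta_g||_2. The function name, the sqrt(p_g) group weights, and the toy numbers are assumptions made for illustration and are not taken from the review itself.

```python
import math

def group_soft_threshold(beta, groups, lam):
    """Shrink each coefficient group toward zero; hypothetical helper.

    beta   : list of coefficients
    groups : list of index lists, one per group (assumed non-overlapping)
    lam    : penalty level; each group is thresholded at lam * sqrt(group size)
    """
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        thresh = lam * math.sqrt(len(idx))
        # A group whose Euclidean norm is below the threshold is zeroed as a whole.
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        for j in idx:
            out[j] = scale * beta[j]
    return out

beta = [3.0, 4.0, 0.1, -0.1]      # toy coefficients
groups = [[0, 1], [2, 3]]         # two groups of two variables each
shrunk = group_soft_threshold(beta, groups, lam=1.0)
print(shrunk)
```

In this toy run the first group is shrunk but kept, while the second group is set to zero together, which is the group-level selection behavior the abstract refers to.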
Clustering is a critical component of single-cell RNA sequencing (scRNA-seq) data analysis and can help reveal cell types and infer cell lineages. Despite considerable successes, there are few methods tailored to investigating cluster-specific genes contributing to cell heterogeneity, which can promote biological understanding of cell heterogeneity. In this study, we propose a zero-inflated negative binomial mixture model (ZINBMM) that simultaneously achieves effective scRNA-seq data clustering and gene selection. ZINBMM conducts a systemic analysis on raw counts, accommodating both batch effects and dropout events. Simulations and the analysis of five scRNA-seq datasets demonstrate the practical applicability of ZINBMM.
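To make the zero-inflated negative binomial building block concrete, here is a minimal sketch of its probability mass function for a single count. The mean/dispersion parameterization (mu, theta), the name pi_zero for the dropout probability, and the example values are assumptions; the sketch does not reproduce the mixture, batch-effect, or gene-selection parts of ZINBMM.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu > 0 and dispersion (size) theta > 0."""
    log_p = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
             + theta * math.log(theta / (theta + mu))
             + y * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi_zero):
    """Zero-inflated NB: with probability pi_zero the count is a structural zero."""
    base = (1.0 - pi_zero) * nb_pmf(y, mu, theta)
    return pi_zero + base if y == 0 else base

# Hypothetical parameter values for one gene in one cell.
probs = [zinb_pmf(y, mu=2.0, theta=1.5, pi_zero=0.3) for y in range(200)]
print(probs[0], nb_pmf(0, 2.0, 1.5))
```

The extra mass at zero relative to the plain negative binomial is what lets such a model accommodate dropout events in raw counts.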
We study the asymptotic properties of bridge estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase to infinity with the sample size. We are particularly interested in the use of bridge estimators to distinguish between covariates whose coefficients are zero and covariates whose coefficients are nonzero. We show that under appropriate conditions, bridge estimators correctly select covariates with nonzero coefficients with probability converging to one and that the estimators of nonzero coefficients have the same asymptotic distribution that they would have if the zero coefficients were known in advance. Thus, bridge estimators have an oracle property in the sense of Fan and Li [J. Amer. Statist. Assoc. 96 (2001) 1348-1360] and Fan and Peng [Ann. Statist. 32 (2004) 928-961]. In general, the oracle property holds only if the number of covariates is smaller than the sample size. However, under a partial orthogonality condition in which the covariates of the zero coefficients are uncorrelated or weakly correlated with the covariates of nonzero coefficients, we show that marginal bridge estimators can correctly distinguish between covariates with nonzero and zero coefficients with probability converging to one even when the number of covariates is greater than the sample size.
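The selection behavior described above can be seen in a toy one-dimensional version of the bridge criterion, (z - b)^2 + lam * |b|^gamma with 0 < gamma < 1. The sketch below minimizes it by brute-force grid search; the function names, the grid, and the values of lam and gamma are assumptions chosen for illustration, not the estimator or tuning used in the paper.

```python
def bridge_objective(b, z, lam, gamma):
    """Squared-error loss plus bridge penalty for a single coefficient b."""
    return (z - b) ** 2 + lam * abs(b) ** gamma

def bridge_1d(z, lam, gamma, half_width=5.0, steps=5000):
    """Grid-search minimizer over [-half_width, half_width]; the grid contains 0 exactly."""
    grid = [half_width * k / steps for k in range(-steps, steps + 1)]
    return min(grid, key=lambda b: bridge_objective(b, z, lam, gamma))

small = bridge_1d(0.5, lam=1.0, gamma=0.5)   # weak signal
large = bridge_1d(3.0, lam=1.0, gamma=0.5)   # strong signal
print(small, large)
```

With these hypothetical settings the weak signal is estimated as exactly zero while the strong signal is only mildly shrunk, which mirrors the zero/nonzero separation the abstract discusses.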
To develop a reference of population-based gestational age-specific birth weight percentiles for contemporary Chinese.
Birth weight data was collected by the China National Population-based Birth Defects Surveillance System. A total of 1,105,214 live singleton births aged ≥28 weeks of gestation without birth defects during 2006-2010 were included. The lambda-mu-sigma method was utilized to generate percentiles and curves.
Gestational age-specific birth weight percentiles for male and female infants were constructed separately. Significant differences were observed between the current reference and other references developed for Chinese or non-Chinese infants.
There have been moderate increases in birth weight percentiles for Chinese infants of both sexes and most gestational ages since the 1980s, suggesting the importance of utilizing an updated national reference for both clinical and research purposes.
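For readers unfamiliar with the lambda-mu-sigma (LMS) method named in the abstract above, the sketch below shows how a percentile is obtained once the three curves have been fitted at a given gestational age: centile = M * (1 + L * S * z)^(1/L), with the limit M * exp(S * z) when L = 0. The L, M and S values used here are hypothetical placeholders, not estimates from the study.

```python
import math
from statistics import NormalDist

def lms_centile(L, M, S, p):
    """Return the 100*p-th percentile from LMS parameters at one gestational age.

    L: Box-Cox power (lambda), M: median (mu), S: coefficient of variation (sigma).
    """
    z = NormalDist().inv_cdf(p)          # standard normal quantile for probability p
    if L == 0:
        return M * math.exp(S * z)
    base = 1.0 + L * S * z
    if base <= 0:
        raise ValueError("percentile undefined for these L, S and p")
    return M * base ** (1.0 / L)

# Hypothetical values (grams) for illustration only.
L_hyp, M_hyp, S_hyp = 0.5, 3300.0, 0.13
p10 = lms_centile(L_hyp, M_hyp, S_hyp, 0.10)
p50 = lms_centile(L_hyp, M_hyp, S_hyp, 0.50)
p90 = lms_centile(L_hyp, M_hyp, S_hyp, 0.90)
print(round(p10), round(p50), round(p90))
```

By construction the 50th percentile equals M, and the Box-Cox power L controls how asymmetric the lower and upper percentiles are around it.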
The effects of thyroid-stimulating hormone (TSH) and thyroid hormones on the development of human papillary thyroid cancer (PTC) remain poorly understood.
The study population consisted of 741 (341 women, 400 men) histologically confirmed PTC cases and 741 matched controls with prediagnostic serum samples stored in the Department of Defense Serum Repository. Concentrations of TSH, total T3, total T4, and free T4 were measured in serum samples. Conditional logistic regression models were used to calculate ORs and 95% confidence intervals (CI).
The median time between blood draw and PTC diagnosis was 1,454 days. Compared with the middle tertile of TSH levels within the normal range, serum TSH levels below the normal range were associated with an elevated risk of PTC among women (OR, 3.74; 95% CI, 1.53-9.19) but not men. TSH levels above the normal range were associated with an increased risk of PTC among men (OR, 1.96; 95% CI, 1.04-3.66) but not women. The risk of PTC decreased with increasing TSH levels within the normal range among both men and women (P for trend = 0.0005 and 0.041, respectively).
We found a significantly increased risk of PTC associated with TSH levels below the normal range among women and with TSH levels above the normal range among men. An inverse association between PTC and TSH levels within the normal range was observed among both men and women.
These results could have significant clinical implications for physicians who are managing patients with abnormal thyroid functions and those with thyroidectomy.
In the analysis of bioinformatics data, a unique challenge arises from the high dimensionality of measurements. Without loss of generality, we use a genomic study with gene expression measurements as a representative example but note that the analysis techniques discussed in this article are also applicable to other types of bioinformatics studies. Principal component analysis (PCA) is a classic dimension reduction approach. It constructs linear combinations of gene expressions, called principal components (PCs). The PCs are orthogonal to each other, can effectively explain variation of gene expressions, and may have a much lower dimensionality. PCA is computationally simple and can be realized using many existing software packages. This article consists of the following parts. First, we review the standard PCA technique and its applications in bioinformatics data analysis. Second, we describe recent 'non-standard' applications of PCA, including accommodating interactions among genes, pathways and network modules and conducting PCA with estimating equations as opposed to gene expressions. Third, we introduce several recently proposed PCA-based techniques, including the supervised PCA, sparse PCA and functional PCA. The supervised PCA and sparse PCA have been shown to have better empirical performance than the standard PCA. The functional PCA can analyze time-course gene expression data. Last, we raise awareness of several critical but unsolved problems related to PCA. The goal of this article is to make bioinformatics researchers aware of the PCA technique and, more importantly, its most recent developments, so that this simple yet effective dimension reduction technique can be better employed in bioinformatics data analysis.
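The construction of a principal component as a linear combination of gene expressions can be sketched without any external package. The example below centers a small samples-by-genes matrix, forms the sample covariance, and extracts the leading PC by power iteration. The function name, the toy data, and the choice of power iteration (rather than the eigen or singular value decomposition routines that software packages typically use) are assumptions for illustration; the starting vector is assumed not to be orthogonal to the leading PC and the data are assumed not to be constant.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (loadings, variance explained by PC1, PC1 scores) for a samples x genes matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the genes (p x p).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0 / math.sqrt(p)] * p             # starting direction (assumption)
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]            # renormalize to unit length
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, eigenvalue, scores

# Toy data: 5 samples, 3 "genes"; gene 2 is twice gene 1 and gene 3 is constant.
data = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0],
        [4.0, 8.0, 0.0], [5.0, 10.0, 0.0]]
loadings, variance, scores = first_principal_component(data)
print(loadings, variance)
```

In this toy case a single PC carries all of the variation in three measured variables, which is the dimension reduction the abstract describes.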
In survival analysis, when a subset of subjects has extremely long survival, the two-part cure rate model has been commonly adopted. In the two-part model, the first part is for a binary response and describes the probability of cure. The second part is for a survival response and describes the probability of survival. Despite their intuitive interconnections, most of the existing works estimate the two parts without any constraint. The existing works on proportionality promote similarity in magnitudes (i.e. quantitative similarity) and can be too restrictive. In this study, for the two-part cure rate model, we propose imposing a sign-based penalty to promote similarity in signs (i.e. qualitative similarity). The proposed strategy can be more informative than those that neglect the two-part interconnections and be less restrictive than the existing proportionality works. A penalty is also imposed to select relevant variables and accommodate high-dimensional data. Numerical studies, including simulation and two data analyses, demonstrate the advantageous performance of the proposed approach.
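The two-part structure described above can be sketched as a mixture: population survival equals the cure probability plus the uncured fraction times the survival of uncured subjects. The logistic form for the cure part, the exponential proportional-hazards form for the survival part, and all names and numbers below are assumptions for illustration; the sign-based penalty proposed in the study is not implemented here.

```python
import math

def cure_probability(x, alpha0, alpha):
    """Part 1 (binary): logistic model for the probability of being cured."""
    eta = alpha0 + sum(a * xi for a, xi in zip(alpha, x))
    return 1.0 / (1.0 + math.exp(-eta))

def uncured_survival(t, x, base_hazard, beta):
    """Part 2 (survival): exponential baseline with proportional hazards."""
    hazard = base_hazard * math.exp(sum(b * xi for b, xi in zip(beta, x)))
    return math.exp(-hazard * t)

def population_survival(t, x, alpha0, alpha, base_hazard, beta):
    """Mixture of cured subjects (never fail) and uncured subjects."""
    p_cure = cure_probability(x, alpha0, alpha)
    return p_cure + (1.0 - p_cure) * uncured_survival(t, x, base_hazard, beta)

# Hypothetical covariates and coefficients.
x = [1.0, 0.5]
params = dict(alpha0=-0.5, alpha=[0.8, -0.4], base_hazard=0.1, beta=[-0.6, 0.3])
s_start = population_survival(0.0, x, **params)
s_late = population_survival(1e6, x, **params)
print(s_start, s_late, cure_probability(x, params["alpha0"], params["alpha"]))
```

The survival curve starts at one and levels off at the cure probability rather than at zero, which is the plateau that motivates cure rate models.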
We report on whole-exome sequencing (WES) of 213 melanomas. Our analysis established NF1, encoding a negative regulator of RAS, as the third most frequently mutated gene in melanoma, after BRAF and NRAS. Inactivating NF1 mutations were present in 46% of melanomas expressing wild-type BRAF and RAS, occurred in older patients and showed a distinct pattern of co-mutation with other RASopathy genes, particularly RASA2. Functional studies showed that NF1 suppression led to increased RAS activation in most, but not all, melanoma cases. In addition, loss of NF1 did not predict sensitivity to MEK or ERK inhibitors. The rebound pathway, as seen by the induction of phosphorylated MEK, occurred in cells both sensitive and resistant to the studied drugs. We conclude that NF1 is a key tumor suppressor lost in melanomas, and that concurrent RASopathy gene mutations may enhance its role in melanomagenesis.
High-throughput technologies have been used to generate a large amount of omics data. In the past, single-level analysis has been extensively conducted where the omics measurements at different levels, including mRNA, microRNA, CNV and DNA methylation, are analyzed separately. As the molecular complexity of disease etiology exists at all different levels, integrative analysis offers an effective way to borrow strength across multi-level omics data and can be more powerful than single-level analysis. In this article, we focus on reviewing existing multi-omics integration studies by paying special attention to variable selection methods. We first summarize published reviews on integrating multi-level omics data. Next, after a brief overview of variable selection methods, we review existing supervised, semi-supervised and unsupervised integrative analyses within parallel and hierarchical integration studies, respectively. The strengths and limitations of the methods are discussed in detail. No existing integration method can dominate the rest. Computational aspects are also investigated. The review concludes with possible limitations and future directions for multi-level omics data integration.
Testicular cancer (TC) is the most common malignancy in young adult men, and in many countries the incidence rates of testicular cancer have been increasing since the middle of the twentieth century. Since disease presentation and tumor progression patterns are often heterogeneous across racial groups, there may be important racial differences in recent TC trends.
In this study, Surveillance, Epidemiology, and End Results (SEER) data on TC patients diagnosed between 1973 and 2015 were analyzed, including the following racial/ethnic groups: non-Hispanic whites (NHW), Hispanic whites (HW), blacks, and Asians and Pacific Islanders (API). Patient characteristics, age-adjusted incidence rates, and survival were compared across racial groups. A multivariate Cox model was used to analyze the survival data of TC patients, in order to evaluate racial differences across several relevant factors, including marital status, age group, histologic type, treatment, stage, and tumor location.
NHWs had the highest incidence rates, followed by blacks, HWs, and APIs. There were significant survival differences among the racial groups, with NHWs having the highest survival rates and blacks having the lowest.
An analysis of SEER data showed that racial differences existed among TC patients in the United States with respect to patient characteristics, incidence, and survival. The results can be useful to stakeholders interested in reducing the burden of TC morbidity and mortality.