Biclustering is increasingly used in biomedical data analysis, recommendation tasks, and text mining, with hundreds of biclustering algorithms proposed. When assessing the performance of these algorithms, real datasets alone are not sufficient, as they do not offer a solid ground truth. Synthetic data overcome this limitation by providing reference solutions against which the discovered patterns can be compared. However, generating synthetic datasets is challenging, since the generated data must ensure reproducibility, pattern representativity, and resemblance to real data.
We propose G-Bic, a dataset generator conceived to produce synthetic benchmarks for the normative assessment of biclustering algorithms. Beyond expanding on aspects of pattern coherence, data quality, and positioning properties, it further handles specificities of mixed-type datasets and time-series data. G-Bic has the flexibility to replicate real-data regularities from diverse domains. We provide default configurations to generate reproducible benchmarks for evaluating and comparing diverse aspects of biclustering algorithms. Additionally, we discuss empirical strategies to simulate the properties of real data.
G-Bic is a parametrizable generator for biclustering analysis, offering a robust means to assess biclustering solutions according to both internal and external metrics.
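A minimal Python sketch of what such a benchmark generator produces, assuming a single constant bicluster planted in a noisy background; this only illustrates the idea and is not the G-Bic tool or its API, and all sizes, the pattern type, and the noise level are arbitrary choices for the example.

import numpy as np

def plant_constant_bicluster(n_rows=200, n_cols=50, bic_rows=20, bic_cols=8,
                             value=5.0, noise=0.1, seed=0):
    # Background matrix with Gaussian noise.
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, (n_rows, n_cols))
    # Hidden pattern: randomly chosen rows and columns sharing a constant value.
    rows = rng.choice(n_rows, bic_rows, replace=False)
    cols = rng.choice(n_cols, bic_cols, replace=False)
    data[np.ix_(rows, cols)] = value + rng.normal(0.0, noise, (bic_rows, bic_cols))
    # Return the data together with the ground-truth bicluster indices.
    return data, (np.sort(rows), np.sort(cols))

data, truth = plant_constant_bicluster()
# `truth` is the reference solution against which an algorithm's recovered
# rows/columns can be scored with external metrics.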
In many contexts, confidentiality constraints severely restrict access to unique and valuable microdata. Synthetic data, which mimic the original observed data and preserve the relationships between variables but do not contain any disclosive records, are one possible solution to this problem. The synthpop package for R, introduced in this paper, provides routines to generate synthetic versions of original data sets. We describe the methodology and its consequences for the data characteristics. We illustrate the package features using a survey data example.
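synthpop itself is an R package; the following Python sketch only illustrates, under assumptions, the sequential variable-by-variable synthesis idea behind its default CART method (fit a model for each variable on the observed data, then draw synthetic values conditionally on the variables already synthesized), not its actual routines or API.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def synthesize(df, order, seed=0):
    # Assumes all-numeric columns for brevity; categorical data would need a classifier.
    rng = np.random.default_rng(seed)
    synth = pd.DataFrame(index=df.index)
    # First variable: resample from its observed marginal distribution.
    first = order[0]
    synth[first] = rng.choice(df[first].to_numpy(), size=len(df), replace=True)
    # Remaining variables: fit a tree on the real data, then resample observed
    # values within the leaf that each synthetic record falls into.
    for col in order[1:]:
        predictors = list(synth.columns)
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=seed)
        tree.fit(df[predictors], df[col])
        real_leaves = tree.apply(df[predictors])
        synth_leaves = tree.apply(synth[predictors])
        values = np.empty(len(df))
        for leaf in np.unique(synth_leaves):
            donors = df[col].to_numpy()[real_leaves == leaf]
            mask = synth_leaves == leaf
            values[mask] = rng.choice(donors, size=mask.sum(), replace=True)
        synth[col] = values
    return synth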
The impact of software vulnerabilities on everyday software systems is concerning. Although deep learning-based models have been proposed for vulnerability detection, their reliability remains a significant concern. While prior evaluations of such models report impressive recall/F1 scores of up to 99%, we find that these models underperform in practical scenarios, particularly when evaluated on entire codebases rather than only the fixing commit. In this paper, we introduce a comprehensive dataset (Real-Vul) designed to accurately represent real-world scenarios for evaluating vulnerability detection models. We evaluate the DeepWukong, LineVul, ReVeal, and IVDetect vulnerability detection approaches and observe a surprisingly significant drop in performance, with precision declining by up to 95 percentage points and F1 scores dropping by up to 91 percentage points. A closer inspection reveals a substantial overlap in the embeddings the models generate for vulnerable and uncertain samples (non-vulnerable or vulnerability not yet reported), which likely explains the large increase in the quantity and rate of false positives. Additionally, we observe fluctuations in model performance based on vulnerability characteristics (e.g., vulnerability type and severity). For example, the studied models achieve F1 scores that are 26 percentage points higher when vulnerabilities relate to information leaks or code injection than when they relate to path resolution or predictable return values. Our results highlight the substantial performance gap that must be bridged before deep learning-based vulnerability detection is ready for deployment in practical settings. We dive deeper into why models underperform in realistic settings, and our investigation reveals overfitting as a key issue. We address this by introducing an augmentation technique that potentially improves performance by up to 30%. We contribute (a) an approach to creating a dataset that future research can use to improve the practicality of model evaluation; (b) Real-Vul, a comprehensive dataset that adheres to this approach; and (c) empirical evidence that deep learning-based models struggle to perform in a real-world setting.
In recent years, researchers have designed several artificial intelligence solutions for healthcare applications, which often evolve into functional solutions for clinical practice. Furthermore, deep learning (DL) methods are well suited to processing the large amounts of data acquired by wearable devices, smartphones, and other sensors employed in different medical domains. Conceived to serve as a diagnostic tool and surgical guidance, hyperspectral imaging has emerged as a non-contact, non-ionizing, and label-free technology. However, the lack of large datasets to efficiently train the models limits DL applications in the medical field, so its use with hyperspectral images is still at an early stage. We propose a deep convolutional generative adversarial network to generate synthetic hyperspectral images of epidermal lesions, targeting skin cancer diagnosis and overcoming the challenge of training DL architectures on small datasets. Experimental results show the effectiveness of the proposed framework, which is capable of generating synthetic data to train DL classifiers.
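Since the paper's exact architecture is not given here, the following PyTorch sketch shows only a generic DCGAN-style generator for image-like hyperspectral patches; the patch size (64x64), number of spectral bands, and feature widths are illustrative assumptions.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, bands=16, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            # Project the latent vector to a 4x4 feature map, then upsample to 64x64.
            nn.ConvTranspose2d(latent_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),
            # One output channel per spectral band.
            nn.ConvTranspose2d(feat, bands, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

# Sample a batch of synthetic hyperspectral patches from noise.
fake = Generator()(torch.randn(8, 100))  # shape: (8, bands, 64, 64)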
Object detection in traffic scenes has recently attracted considerable attention from both academia and industry. Modern detectors achieve excellent performance in simple, constrained environments while performing poorly in actual complex and open traffic environments. Therefore, the capability to adapt to new and unseen domains is a key factor for the large-scale application and proliferation of detectors in autonomous driving. To this end, this paper proposes a novel category-induced coarse-to-fine domain adaptation approach (C2FDA) for cross-domain object detection, which consists of three pivotal components: (1) an attention-induced coarse-grained alignment module (ACGA), which strengthens the distribution alignment of foreground features across disparate domains in a category-agnostic way through minimax optimization between the domain classifier and the backbone feature extractor; (2) an attention-induced feature selection module, which helps the model emphasize the crucial foreground features and enables the ACGA to focus on relevant, discriminative foreground features without being affected by the distribution of inconsequential background features; and (3) a category-induced fine-grained alignment module (CFGA), which reduces domain shift in a category-aware way by minimizing the distance between centroids of the same category from different domains and maximizing that between centroids of disparate categories. We evaluate the performance of our approach on various source/target domain pairs, and comprehensive results demonstrate that C2FDA significantly outperforms the state of the art in multiple domain adaptation scenarios, i.e., synthetic-to-real adaptation, weather adaptation, and cross-camera adaptation.
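The following PyTorch sketch shows one plausible reading of the CFGA objective described above: same-category centroids from the two domains are pulled together, while centroids of different categories are pushed at least a margin apart; the exact loss form, margin, and use of pseudo-labels on the target domain are assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

def centroid_alignment_loss(src_feats, src_labels, tgt_feats, tgt_labels,
                            num_classes, margin=1.0):
    # tgt_labels would typically be pseudo-labels in unsupervised adaptation.
    loss = src_feats.new_zeros(())
    cents = {}
    for c in range(num_classes):
        s_mask, t_mask = src_labels == c, tgt_labels == c
        if s_mask.any() and t_mask.any():
            cents[c] = (src_feats[s_mask].mean(0), tgt_feats[t_mask].mean(0))
    classes = list(cents)
    for i, c in enumerate(classes):
        s_c, t_c = cents[c]
        # Same category, different domains: minimize centroid distance.
        loss = loss + F.mse_loss(s_c, t_c)
        # Different categories: keep all centroid pairs at least `margin` apart.
        for d in classes[i + 1:]:
            s_d, t_d = cents[d]
            for a, b in ((s_c, s_d), (s_c, t_d), (t_c, s_d), (t_c, t_d)):
                loss = loss + F.relu(margin - torch.norm(a - b))
    return loss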
Remote photoplethysmography (rPPG) has been used to measure vital signs such as heart rate, heart rate variability, blood pressure (BP), and blood oxygen. Recent studies adopt features developed for photoplethysmography (PPG) to achieve contactless BP measurement via rPPG. These features fall into two groups: time or phase differences from multiple signals, and waveform features from a single signal. Here we devise a solution to extract time-difference information from the rPPG signal captured at 30 FPS. We also propose a deep learning model architecture to estimate BP from the extracted features. To prevent overfitting and compensate for the lack of data, we leverage a multi-model design and generate synthetic data. We also use subject information related to BP to assist model learning. For real-world usage, the subject information is replaced with values estimated from face images, with performance that still exceeds the state of the art. To the best of our knowledge, the improvements stem from: 1) model selection with estimated subject information, 2) replacing the estimated subject information with the real values, 3) InfoGAN-assisted training (synthetic data generation), and 4) the time-difference features used as model input. To evaluate the performance of the proposed method, we conduct a series of experiments, including dynamic BP measurement for many individual subjects and nighttime BP measurement under infrared lighting. Our approach reduces the MAE from 15.49 to 8.78 mmHg for systolic blood pressure (SBP) and from 10.56 to 6.16 mmHg for diastolic blood pressure (DBP) on a self-constructed rPPG dataset. On the Taipei Veterans General Hospital (TVGH) dataset for nighttime applications, the MAE is reduced from 21.58 to 11.12 mmHg for SBP and from 9.74 to 7.59 mmHg for DBP, improvement ratios of 48.47% and 22.07%, respectively.
There is increasing interest in the potential of synthetic data to validate and benchmark machine learning algorithms, as well as to reveal biases in the real-world data used for algorithm development. This paper discusses the key requirements of synthetic data for such purposes and proposes an approach to generating and evaluating synthetic data that meets these requirements. We propose a framework to generate and evaluate synthetic data with the aim of preserving the complexities of ground-truth data in the synthetic data while also ensuring privacy. As a case study, we include a proof-of-concept synthetic dataset modelled on UK primary care data to demonstrate the application of this framework.
Deep learning (DL) research has made remarkable progress in recent years. Natural language processing and image generation have made the leap from computer science journals to open-source communities and commercial services. Pre-trained DL models built on massive datasets, also known as foundation models, such as GPT-3 and BERT, have led the way in democratizing artificial intelligence (AI). However, their potential use as research tools has been overshadowed by fears of how this technology can be misused. Some have argued that AI threatens scholarship, suggesting such models should not replace human collaborators. Others have argued that AI creates opportunities, suggesting that AI-human collaborations could speed up research. Taking a constructive stance, this editorial outlines ways to use foundation models to advance science. We argue that DL tools can be used to create realistic experiments and make specific types of quantitative studies feasible, or safer by relying on synthetic rather than real data. All in all, we posit that the use of generative AI and foundation models as a tool in information systems research is at a very early stage. Still, if we proceed cautiously and develop clear guidelines for using foundation models and generative AI, their benefits for science and scholarship far outweigh their risks.
• The editorial discusses the opportunities and challenges of using foundation models.
• Generative AI can be used to develop data or content for various forms of studies.
• Generative AI content and data can be used to avoid privacy concerns.
• A process for safely using generated content in research is proposed.
Nuclear magnetic resonance (NMR) spectroscopy is a powerful tool for quantitative metabolomics; however, quantification of metabolites from NMR data is often a slow and tedious process requiring user input and expertise. In this study, we propose a neural network approach for rapid, automated lipid identification and quantification from NMR data. Multilayer perceptron (MLP) networks were developed with NMR spectra as input and lipid concentrations as output. Three large synthetic datasets were generated, each with 55,000 spectra derived from an original 30 scans of reference standards, by taking linear combinations of the standards and simulating experimental-like modifications (line broadening, noise, peak shifts, baseline shifts) and common interference signals (water, tetramethylsilane, extraction solvent); these were used to train MLPs for robust prediction of lipid concentrations. The performance of the MLPs was first validated on various synthetic datasets to assess how incorporating the different modifications affected their accuracy. The MLPs were then evaluated on experimentally acquired data from complex lipid mixtures. The MLP-derived lipid concentrations showed high correlations and slopes close to unity, relative to ground-truth concentrations, for most of the quantified lipid metabolites in the experimental mixtures. The most accurate and robust MLP was used to profile lipids in lipophilic hepatic extracts from a rat metabolomics study. The MLP lipid results, analyzed by two-way ANOVA for dietary and sex differences, were similar to those obtained with a conventional NMR quantification method. In conclusion, this study demonstrates the potential and feasibility of a neural network approach for improving speed and automation in NMR lipid profiling, and the approach can easily be tailored to other quantitative, targeted spectroscopic analyses in academia or industry.
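As a rough illustration of the training-data idea (random linear combinations of reference spectra plus experimental-like corruptions, then an MLP regressing concentrations), the following Python sketch uses stand-in Lorentzian peaks and scikit-learn; the shapes, corruption models, and network size are assumptions, not the study's actual pipeline.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_points, n_lipids, n_samples = 1024, 5, 5000

# Stand-in reference spectra: one Lorentzian-like peak per lipid standard.
axis = np.linspace(0, 10, n_points)
centers = rng.uniform(1, 9, n_lipids)
refs = 1.0 / (1.0 + ((axis[None, :] - centers[:, None]) / 0.05) ** 2)

# Synthetic mixtures: random concentrations, then peak shifts, baseline drift, and noise.
conc = rng.uniform(0, 1, (n_samples, n_lipids))
spectra = conc @ refs
shifts = rng.integers(-3, 4, n_samples)
spectra = np.stack([np.roll(s, k) for s, k in zip(spectra, shifts)])
spectra += rng.uniform(0, 0.05, (n_samples, 1)) * (axis / 10)   # baseline drift
spectra += rng.normal(0, 0.01, spectra.shape)                    # additive noise

# Train an MLP mapping spectra -> concentrations and check it on held-out mixtures.
mlp = MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=200, random_state=0)
mlp.fit(spectra[:4000], conc[:4000])
print("R^2 on held-out synthetic spectra:", mlp.score(spectra[4000:], conc[4000:]))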