There is a strong need for synthetic yet realistic distribution system test data sets that are as diverse, large, and complex to solve as real systems. Such data sets can facilitate the development of advanced algorithms and the assessment of emerging distributed energy resources while avoiding the need to acquire proprietary critical infrastructure or private data. Such synthetic data sets, however, are useful only if they are realistic enough to look and behave similarly to actual systems. This paper presents a comprehensive framework for validating synthetic distribution data sets using a three-pronged statistical, operational, and expert validation approach. It also presents a set of statistical and operational metric targets for achieving realistic data sets based on detailed characterization of more than 10,000 real U.S. utility feeders. The paper demonstrates the use of the proposed validation approach to validate three large-scale synthetic data sets developed by the authors representing Santa Fe, New Mexico; Greensboro, North Carolina; and the San Francisco Bay Area, California.
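As one concrete illustration of the statistical prong of such validation, the sketch below compares the distribution of a single feeder-level metric between a real and a synthetic feeder population using a two-sample Kolmogorov–Smirnov test. The metric, placeholder samples, and acceptance threshold are assumptions for illustration only, not the paper's published realism targets.

```python
# Hedged sketch: compare one feeder-level metric between real and synthetic
# feeder populations. The metric, data, and threshold are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder samples standing in for per-feeder metrics (e.g., total
# line length in km) computed from real and synthetic data sets.
real_metric = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
synthetic_metric = rng.lognormal(mean=3.05, sigma=0.55, size=2_000)

# Two-sample Kolmogorov-Smirnov test: a small statistic (and large p-value)
# indicates the synthetic distribution tracks the real one.
ks_stat, p_value = stats.ks_2samp(real_metric, synthetic_metric)
print(f"KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")

# Illustrative acceptance rule (the threshold is an assumption, not the
# paper's target): flag the metric if the distributions differ too strongly.
if ks_stat > 0.1:
    print("Metric flagged for review against the realism targets.")
```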
Crowd analysis via computer vision techniques is an important topic in the field of video surveillance, with widespread applications including crowd monitoring, public safety, and space design. Pixel-wise crowd understanding is the most fundamental task in crowd analysis because it yields finer results for video sequences or still images than other analysis tasks. Unfortunately, pixel-level understanding needs a large amount of labeled training data, and annotating it is expensive, which is why current crowd datasets are small. As a result, most algorithms suffer from over-fitting to varying degrees. In this paper, taking crowd counting and segmentation as examples of pixel-wise crowd understanding, we attempt to remedy these problems from two aspects, namely data and methodology. Firstly, we develop a free data collector and labeler to generate synthetic, labeled crowd scenes in a computer game, Grand Theft Auto V. We then use it to construct a large-scale, diverse synthetic crowd dataset, named the "GCC Dataset". Secondly, we propose two simple methods to improve the performance of crowd understanding by exploiting the synthetic data. Specifically: (1) supervised crowd understanding: pre-train a crowd analysis model on the synthetic data, then fine-tune it using the real data and labels, which makes the model perform better on real-world scenes; (2) crowd understanding via domain adaptation: translate the synthetic data into photo-realistic images, then train the model on the translated data and labels; as a result, the trained model works well in real crowd scenes. Extensive experiments verify that the supervised algorithm outperforms the state of the art on four real datasets: UCF_CC_50, UCF-QNRF, and Shanghai Tech Part A/B. These results show the effectiveness and value of the synthetic GCC dataset for pixel-wise crowd understanding. The data collection/labeling tools, the proposed synthetic dataset, and the source code for the counting models are available at https://gjy3035.github.io/GCC-CL/.
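The pre-train-then-fine-tune recipe described above can be sketched as follows. The TinyCounter model, data loaders, learning rates, and epoch counts are placeholder assumptions rather than the paper's actual architecture or settings (the released counting code is at the URL above).

```python
# Hedged sketch of "supervised crowd understanding": pre-train a counting
# model on synthetic (GCC-style) data, then fine-tune it on real data.
import torch
import torch.nn as nn

class TinyCounter(nn.Module):
    """Minimal density-map regressor, used only to illustrate the workflow."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(16, 1, 1)  # predicted density map

    def forward(self, x):
        return self.head(self.features(x))

def run_epochs(model, loader, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for images, density in loader:          # (B,3,H,W), (B,1,H,W)
            opt.zero_grad()
            loss = loss_fn(model(images), density)
            loss.backward()
            opt.step()

model = TinyCounter()
# synthetic_loader / real_loader are assumed DataLoaders yielding
# (image, density map) pairs; they are not defined here.
# run_epochs(model, synthetic_loader, lr=1e-4, epochs=50)   # pre-train
# run_epochs(model, real_loader, lr=1e-5, epochs=20)        # fine-tune
```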
Image processing plays a major role in neurologists' clinical diagnosis in the medical field. Several types of imagery are used for diagnostics, tumor segmentation, and classification. Magnetic resonance imaging (MRI) is favored among all modalities due to its noninvasive nature and better representation of internal tumor information. Indeed, early diagnosis may increase the chances of saving a life. However, the manual dissection and classification of brain tumors based on MRI is an error-prone, time-consuming, and formidable task. Consequently, this article presents a deep learning approach to classify brain tumors using MRI data analysis to assist practitioners. The recommended method comprises three main phases: preprocessing, brain tumor segmentation using k-means clustering, and finally, classification of tumors into their respective categories (benign/malignant) from the MRI data through a fine-tuned VGG19 (i.e., 19-layer Visual Geometry Group) model. Moreover, for better classification accuracy, the synthetic data augmentation concept is introduced to increase the available data size for classifier training. The proposed approach was evaluated on the BraTS 2015 benchmark data sets through rigorous experiments. The results endorse the effectiveness of the proposed strategy, and it achieved better accuracy compared to previously reported state-of-the-art techniques.
Following preprocessing, the region of interest (ROI) is extracted using k-means clustering, a fine-tuned VGG19 model is applied for tumor classification (benign/malignant), and accuracy is improved using synthetic data augmentation.
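A minimal sketch of this kind of pipeline is shown below: k-means clustering isolates a bright ROI from an MRI slice, and an ImageNet-pretrained VGG19 with a replaced final layer serves as the benign/malignant classifier. The cluster count, ROI-selection rule, and classifier setup are illustrative assumptions, not the exact published configuration.

```python
# Hedged sketch: k-means ROI extraction plus a fine-tuned VGG19 classifier.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from torchvision import models

def kmeans_roi(slice_2d: np.ndarray, n_clusters: int = 3) -> np.ndarray:
    """Cluster pixel intensities and keep the brightest cluster as the ROI."""
    pixels = slice_2d.reshape(-1, 1).astype(np.float32)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(pixels)
    brightest = np.argmax([pixels[labels == k].mean() for k in range(n_clusters)])
    return (labels == brightest).reshape(slice_2d.shape)

# VGG19 pre-trained on ImageNet, with the final classifier layer replaced
# for the two-class (benign/malignant) problem.
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg19.classifier[6] = nn.Linear(4096, 2)

# Example: mask a slice with the ROI, replicate to 3 channels, classify.
slice_2d = np.random.rand(224, 224).astype(np.float32)   # placeholder MRI slice
roi = kmeans_roi(slice_2d)
masked = torch.from_numpy(slice_2d * roi).repeat(3, 1, 1).unsqueeze(0)
logits = vgg19(masked)   # fine-tuning on labeled (augmented) data is omitted
```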
Electronic healthcare record data have been used to study risk factors of disease, treatment effectiveness and safety, and to inform healthcare service planning. There has been increasing interest in utilizing these data for new purposes such as for machine learning to develop predictive algorithms to aid diagnostic and treatment decisions. Synthetic data could potentially be an alternative to real‐world data for these purposes as well as reveal any biases in the data used for algorithm development. This article discusses the key requirements of synthetic data for multiple purposes and proposes an approach to generate and evaluate synthetic data focused on, but not limited to, cross‐sectional healthcare data. To our knowledge, this is the first article to propose a framework to generate and evaluate synthetic healthcare data with the aim of simultaneously preserving the complexities of ground truth data in the synthetic data while also ensuring privacy. We include findings and new insights from synthetic datasets modeled on both the Indian liver patient dataset and UK primary care dataset to demonstrate the application of this framework under different scenarios.
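A minimal sketch of the kind of dual evaluation such a framework implies is shown below: per-column distributional fidelity plus downstream predictive utility on a held-out real test set. The KS-based fidelity measure, the "outcome" column name, and the random-forest utility check are illustrative assumptions, not the article's prescribed procedure.

```python
# Hedged sketch: (1) fidelity - do per-column distributions of the synthetic
# table match the real one; (2) utility - does a model trained on synthetic
# data predict the real outcome comparably to one trained on real data.
import pandas as pd
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def fidelity_report(real: pd.DataFrame, synth: pd.DataFrame) -> pd.Series:
    """Kolmogorov-Smirnov statistic per numeric column (lower is closer)."""
    return pd.Series({c: stats.ks_2samp(real[c], synth[c]).statistic
                      for c in real.select_dtypes("number").columns})

def utility_gap(real_train, synth_train, real_test, target="outcome"):
    """Train-on-real vs. train-on-synthetic AUC on a held-out real test set."""
    def auc(train):
        clf = RandomForestClassifier(random_state=0)
        clf.fit(train.drop(columns=target), train[target])
        return roc_auc_score(real_test[target],
                             clf.predict_proba(real_test.drop(columns=target))[:, 1])
    return auc(real_train) - auc(synth_train)
```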
Synthetic data consists of artificially generated data. When data are scarce or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples following the underlying data distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, to the best of our knowledge, none in the relevant literature has explicitly combined these two topics. This survey aims to fill this gap and provide useful material to new researchers in this field. That is, we aim to provide a survey that combines synthetic data generation and GANs, and that can act as a strong starting point for new researchers in the field, so that they have a general overview of the key contributions and useful references. We have conducted a review of the state of the art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and the ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. GANs are thoroughly reviewed, as well as their most common training problems and their most important breakthroughs, with a focus on GAN architectures for tabular data. Further, the main algorithms for generating synthetic data and their applications are described, along with our thoughts on these methods. Finally, we review the main techniques for evaluating the quality of synthetic data (especially tabular data) and provide a schematic overview of the information presented in this paper.
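For readers new to the area, the textbook adversarial training loop is sketched below: a generator maps noise to feature vectors and a discriminator scores real versus generated rows. The tiny fully connected networks, placeholder data, and hyper-parameters are illustrative and do not correspond to any specific architecture reviewed in the survey.

```python
# Hedged minimal GAN sketch for tabular-style data.
import torch
import torch.nn as nn

latent_dim, n_features = 16, 8
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_data = torch.randn(1024, n_features)        # placeholder for a real table
for step in range(200):
    real = real_data[torch.randint(0, 1024, (64,))]
    fake = G(torch.randn(64, latent_dim))

    # Discriminator: push real scores toward 1 and generated scores toward 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator into scoring generated rows as real.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(64, 1))
    g_loss.backward()
    opt_g.step()
```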
Supervised deep learning with pixel-wise training labels has achieved great success on multi-person part segmentation. However, data labeling at the pixel level is very expensive. To address this, researchers have been exploring the use of synthetic data to avoid manual labeling. Although it is easy to generate labels for synthetic data, the results are much worse compared to those using real data and manual labeling. The degradation in performance is mainly due to the domain gap, i.e., the discrepancy in pixel value statistics between real and synthetic data. In this paper, we observe that real and synthetic humans both have a skeleton (pose) representation. We find that the skeletons can effectively bridge the synthetic and real domains during training. Our proposed approach takes advantage of the rich and realistic variations of the real data and the easily obtainable labels of the synthetic data to learn multi-person part segmentation on real images without any human-annotated labels. Through experiments, we show that without any human labeling, our method performs comparably to several state-of-the-art approaches that require human labeling on the Pascal-Person-Parts and COCO-DensePose datasets. On the other hand, if part labels are also available for the real images during training, our method outperforms the supervised state-of-the-art methods by a large margin. We further demonstrate the generalizability of our method by predicting novel keypoints in real images where no real labels are available for the novel keypoint detection. Code and pre-trained models are available at https://github.com/kevinlin311tw/CDCL-human-part-segmentation.
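One plausible reading of this skeleton-bridging idea is a shared backbone with two heads, trained with a part-segmentation loss on synthetic images (where labels are free) and a keypoint loss on real images (where pose is the easy-to-obtain cue shared by both domains). The sketch below follows that reading; the layer sizes, heads, and loss terms are illustrative assumptions, not the authors' exact architecture (see the released code at the URL above).

```python
# Hedged sketch: shared backbone, part-segmentation head supervised on
# synthetic data, keypoint head supervised on real-image pose targets.
import torch
import torch.nn as nn

class SharedBackboneHeads(nn.Module):
    def __init__(self, n_parts=7, n_keypoints=17):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.part_head = nn.Conv2d(32, n_parts, 1)       # part-segmentation logits
        self.pose_head = nn.Conv2d(32, n_keypoints, 1)   # keypoint heatmaps

    def forward(self, x):
        f = self.backbone(x)
        return self.part_head(f), self.pose_head(f)

model = SharedBackboneHeads()
seg_loss, pose_loss = nn.CrossEntropyLoss(), nn.MSELoss()

def training_step(synthetic_batch, real_batch, optimizer):
    # synthetic_batch: (images, part label map); real_batch: (images, heatmaps)
    syn_img, syn_parts = synthetic_batch
    real_img, real_heatmaps = real_batch
    part_logits, _ = model(syn_img)
    _, pose_pred = model(real_img)
    loss = seg_loss(part_logits, syn_parts) + pose_loss(pose_pred, real_heatmaps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```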
Deep learning (DL) research has made remarkable progress in recent years. Natural language processing and image generation have made the leap from computer science journals to open-source communities and commercial services. Pre-trained DL models built on massive datasets, also known as foundation models, such as GPT-3 and BERT, have led the way in democratizing artificial intelligence (AI). However, their potential use as research tools has been overshadowed by fears of how this technology can be misused. Some have argued that AI threatens scholarship and should not replace human collaborators. Others have argued that AI creates opportunities, suggesting that AI-human collaborations could speed up research. Taking a constructive stance, this editorial outlines ways to use foundation models to advance science. We argue that DL tools can be used to create realistic experiments and make specific types of quantitative studies feasible or safer with synthetic rather than real data. All in all, we posit that the use of generative AI and foundation models as a tool in information systems research is in its very early stages. Still, if we proceed cautiously and develop clear guidelines for using foundation models and generative AI, their benefits for science and scholarship far outweigh their risks.
•The editorial discusses the opportunities and challenges of using foundation models.
•Generative AI can be used to develop data or content for various forms of studies.
•Generative AI content and data can be used to avoid privacy concerns.
•A process for safely using generated content in research is proposed.
Synthetic data generation (SDG) research has been ongoing for some time with promising results in different application domains, including healthcare, biometrics and energy consumption. The need for a robust SDG solution to capitalise on advances in Big Data and AI technology has never been greater to enable access to useful data while ensuring reasonable privacy protections. This paper presents a systematic review from the last 5 years (2016–2021) to analyse and report on recent approaches in synthetic tabular data generation (STDG) with a focus on the healthcare application context to preserve patient privacy, paying special attention to the contribution of Generative Adversarial Networks (GAN). In total 34 publications have been retrieved and analysed. A classification of approaches has been proposed and the performance of GAN-based approaches has been extensively analysed. From the systematic review it has been concluded that there is no universal method or metric to evaluate and benchmark the performance of various approaches and that further research is needed to improve the generalisability of GANs to find a model that works optimally across tabular healthcare data.
Today, the cutting edge of computer vision research greatly depends on the availability of large datasets, which are critical for effectively training and testing new methods. Manually annotating visual data, however, is not only a labor-intensive process but also prone to errors. In this study, we present NOVA, a versatile framework to create realistic-looking 3D rendered worlds containing procedurally generated humans with rich pixel-level ground truth annotations. NOVA can simulate various environmental factors such as weather conditions or different times of day, and bring an exceptionally diverse set of humans to life, each having a distinct body shape, gender and age. To demonstrate NOVA's capabilities, we generate two synthetic datasets for person tracking. The first one includes 108 sequences, each with different levels of difficulty like tracking in crowded scenes or at nighttime, and aims to test the limits of current state-of-the-art trackers. A second dataset of 97 sequences with normal weather conditions is used to show how our synthetic sequences can be utilized to train and boost the performance of deep-learning-based trackers. Our results indicate that the synthetic data generated by NOVA represents a good proxy of the real world and can be exploited for computer vision tasks.