Crowd analysis via computer vision is an important topic in video surveillance, with widespread applications including crowd monitoring, public safety, and space design. Pixel-wise crowd understanding is the most fundamental task in crowd analysis because it yields finer results for video sequences or still images than other analysis tasks. Unfortunately, pixel-level understanding needs a large amount of labeled training data, and annotating it is expensive, so current crowd datasets are small. As a result, most algorithms suffer from over-fitting to varying degrees. In this paper, taking crowd counting and segmentation as examples of pixel-wise crowd understanding, we attempt to remedy these problems from two aspects, namely data and methodology. First, we develop a free data collector and labeler to generate synthetic, labeled crowd scenes in a computer game, Grand Theft Auto V. We then use it to construct a large-scale, diverse synthetic crowd dataset, named the "GCC Dataset". Second, we propose two simple methods to improve the performance of crowd understanding by exploiting the synthetic data. To be specific: (1) supervised crowd understanding: pre-train a crowd analysis model on the synthetic data, then fine-tune it using the real data and labels, which makes the model perform better on real-world scenes; (2) crowd understanding via domain adaptation: translate the synthetic data into photo-realistic images, then train the model on the translated data and labels, so that the trained model works well in real crowd scenes. Extensive experiments verify that the supervised algorithm outperforms the state of the art on four real datasets: UCF_CC_50, UCF-QNRF, and Shanghai Tech Part A/B. These results show the effectiveness and value of the synthetic GCC Dataset for pixel-wise crowd understanding.
The data collection/labeling tools, the proposed synthetic dataset, and the source code for the counting models are available at https://gjy3035.github.io/GCC-CL/.
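The first strategy above (pre-train on abundant synthetic labels, then fine-tune on scarce real labels) can be sketched in miniature. The toy linear model, data distributions, and hyper-parameters below are illustrative assumptions, not the paper's actual CNN pipeline; the "domain gap" is modeled as a small bias in the synthetic labelling function.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" labelling function and a slightly biased synthetic proxy of it
# (the bias stands in for the synthetic-to-real domain gap).
w_real = np.array([1.0, -2.0, 0.5])
w_syn = w_real + np.array([0.1, 0.1, -0.1])

def make_data(w, n, noise):
    X = rng.normal(size=(n, 3))
    return X, X @ w + rng.normal(scale=noise, size=n)

X_syn, y_syn = make_data(w_syn, 2000, 0.05)    # large labelled synthetic set
X_real, y_real = make_data(w_real, 8, 0.05)    # tiny labelled real set
X_test, y_test = make_data(w_real, 500, 0.0)   # clean real test set

def gd(X, y, w, lr=0.05, steps=200):
    """Plain gradient descent on the mean-squared error."""
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

w_pre = gd(X_syn, y_syn, np.zeros(3))                  # pre-train on synthetic
w_ft = gd(X_real, y_real, w_pre, steps=20)             # then fine-tune on real
w_scratch = gd(X_real, y_real, np.zeros(3), steps=20)  # baseline: real only

def mse(w):
    return float(np.mean((X_test @ w - y_test) ** 2))

print(mse(w_ft) < mse(w_scratch))  # True: pre-training helps
```

The pre-trained weights land close to the real solution, so a few fine-tuning steps on eight real samples beat training from scratch on those same samples.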
There is a strong need for synthetic yet realistic distribution system test data sets that are as diverse, large, and complex to solve as real systems. Such data sets can facilitate the development of advanced algorithms and the assessment of emerging distributed energy resources while avoiding the need to acquire proprietary critical-infrastructure or private data. Such synthetic data sets, however, are useful only if they are realistic enough to look and behave like actual systems. This paper presents a comprehensive framework for validating synthetic distribution data sets using a three-pronged statistical, operational, and expert validation approach. It also presents a set of statistical and operational metric targets for achieving realistic data sets, based on a detailed characterization of more than 10,000 real U.S. utility feeders. The paper demonstrates the proposed approach by validating three large-scale synthetic data sets developed by the authors, representing Santa Fe, New Mexico; Greensboro, North Carolina; and the San Francisco Bay Area, California.
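The statistical prong of such a validation can be sketched as follows: compare the distribution of a per-feeder metric in the synthetic set against the real characterization with a two-sample Kolmogorov-Smirnov statistic. The metric, the log-normal distributions, and the 0.1 acceptance threshold are all illustrative assumptions, not the paper's actual targets.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(1)
# Hypothetical per-feeder metric (e.g. total line length in km), taken as
# log-normal, which many feeder statistics roughly are.
real_feeders = rng.lognormal(mean=2.0, sigma=0.5, size=5000)
good_synth = rng.lognormal(mean=2.0, sigma=0.5, size=1000)
bad_synth = rng.lognormal(mean=2.6, sigma=0.5, size=1000)

TARGET = 0.1   # illustrative acceptance threshold, not the paper's target
print(ks_statistic(real_feeders, good_synth) < TARGET)  # True: passes
print(ks_statistic(real_feeders, bad_synth) < TARGET)   # False: fails
```

In a full validation pipeline this test would be repeated per metric (line length, customer count, transformer sizing, and so on), with each metric's target threshold derived from the real-feeder characterization.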
Generative data augmentation (GDA) has emerged as a promising technique to alleviate data scarcity in machine learning applications. This thesis presents a comprehensive survey and unified framework of the GDA landscape. We first provide an overview of GDA, discussing its motivation, taxonomy, and key distinctions from synthetic data generation. We then systematically analyze the critical aspects of GDA: selection of generative models, techniques to utilize them, data selection methodologies, validation approaches, and diverse applications. Our proposed unified framework categorizes the extensive GDA literature, revealing gaps such as the lack of universal benchmarks. The thesis summarizes promising research directions, including effective data selection, theoretical development for applying large-scale models in GDA, and the establishment of a GDA benchmark. By laying a structured foundation, this thesis aims to nurture more cohesive development and accelerate progress in the vital arena of generative data augmentation.
• Extensive and Latest Compilation: Drawing from 230 seminal works from the last three years, this survey presents the most comprehensive review of GDA.
• Unified Framework Proposal: We introduce a structured GDA framework, offering researchers a systematic guideline for improving and implementing GDA.
• Deep Dive into Selection & Validation: Our survey delves deeply into synthetic data selection and validation, which have received little attention in previous research.
• Future Roadmap: Benefiting from the extensive literature review, we discuss the existing challenges and potential breakthrough avenues.
Synthetic data consists of artificially generated data. When data are scarce or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples following the underlying data distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, to the best of our knowledge, none in the relevant literature has explicitly combined these two topics. This survey aims to fill that gap and provide useful material to new researchers in the field. That is, we aim to provide a survey that combines synthetic data generation and GANs and that can act as a strong starting point for new researchers, giving them a general overview of the key contributions and useful references. We conducted a review of the state of the art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and the ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. GANs are thoroughly reviewed, along with their most common training problems and most important breakthroughs, with a focus on GAN architectures for tabular data. Further, the main algorithms for generating synthetic data and their applications are presented, along with our thoughts on these methods. Finally, we review the main techniques for evaluating the quality of synthetic data (especially tabular data) and provide a schematic overview of the information presented in this paper.
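The adversarial training loop this survey reviews can be illustrated with a deliberately tiny example: a one-parameter generator (a learnable shift of Gaussian noise) against a logistic discriminator, with hand-derived gradients. The architecture, losses, and hyper-parameters are toy assumptions and are far simpler than the tabular GANs discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
MU_REAL = 3.0        # mean of the "real" 1-D data distribution
theta = 0.0          # generator parameter: g(z) = z + theta
a, b = 0.1, 0.0      # discriminator: D(x) = sigmoid(a * x + b)
lr = 0.1

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

thetas = []
for _ in range(4000):
    x_real = rng.normal(MU_REAL, 1.0, 64)
    x_fake = rng.normal(0.0, 1.0, 64) + theta

    # Discriminator step: descend -log D(real) - log(1 - D(fake)),
    # with a small L2 penalty on `a` to damp oscillations.
    gr = sigmoid(a * x_real + b) - 1.0   # d(-log D)/d(logit) on real samples
    gf = sigmoid(a * x_fake + b)         # d(-log(1-D))/d(logit) on fakes
    a -= lr * (np.mean(gr * x_real) + np.mean(gf * x_fake) + 0.01 * a)
    b -= lr * (np.mean(gr) + np.mean(gf))

    # Generator step (non-saturating loss): descend -log D(fake).
    x_fake = rng.normal(0.0, 1.0, 64) + theta
    theta -= lr * np.mean((sigmoid(a * x_fake + b) - 1.0) * a)
    thetas.append(theta)

tail_mean = float(np.mean(thetas[-1000:]))
print(round(tail_mean, 2))   # drifts toward MU_REAL = 3.0
```

Even in this toy setting the training problems the survey catalogues are visible: without the small weight penalty on the discriminator, the generator parameter tends to orbit the optimum rather than settle on it.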
Electronic healthcare record data have been used to study risk factors of disease, treatment effectiveness and safety, and to inform healthcare service planning. There has been increasing interest in utilizing these data for new purposes, such as machine learning to develop predictive algorithms that aid diagnostic and treatment decisions. Synthetic data could potentially be an alternative to real-world data for these purposes, as well as reveal any biases in the data used for algorithm development. This article discusses the key requirements of synthetic data for multiple purposes and proposes an approach to generate and evaluate synthetic data focused on, but not limited to, cross-sectional healthcare data. To our knowledge, this is the first article to propose a framework for generating and evaluating synthetic healthcare data that simultaneously preserves the complexities of the ground-truth data while also ensuring privacy. We include findings and new insights from synthetic datasets modeled on both the Indian liver patient dataset and a UK primary care dataset to demonstrate the application of this framework under different scenarios.
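The twin requirements named above, fidelity to the ground-truth data and privacy, can be sketched with a minimal generate-and-evaluate loop. The cohort variables, the Gaussian generator, and both check thresholds are illustrative assumptions, not the article's actual framework.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "real" cohort: age, BMI, systolic BP (correlated Gaussians).
mean = np.array([55.0, 27.0, 130.0])
cov = np.array([[100.0, 15.0, 40.0],
                [15.0, 16.0, 12.0],
                [40.0, 12.0, 144.0]])
real = rng.multivariate_normal(mean, cov, size=1500)

# Generator: fit a multivariate Gaussian to the cohort and resample.
synth = rng.multivariate_normal(real.mean(axis=0),
                                np.cov(real, rowvar=False), size=1500)

# Fidelity check: the correlation structure should be preserved.
fidelity_gap = float(np.max(np.abs(np.corrcoef(real, rowvar=False)
                                   - np.corrcoef(synth, rowvar=False))))

# Privacy proxy: synthetic records should not sit suspiciously close to
# real records (which would hint at memorisation / re-identification risk).
def nn_dist(a, b, exclude_self=False):
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    if exclude_self:
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

ratio = float(np.median(nn_dist(synth, real))
              / np.median(nn_dist(real, real, exclude_self=True)))
print(round(fidelity_gap, 3), round(ratio, 2))
```

A small fidelity gap says the synthetic cohort reproduces the joint structure; a nearest-neighbour distance ratio near 1 says synthetic records are no closer to real patients than real patients are to each other.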
Supervised deep learning with pixel-wise training labels has achieved great success on multi-person part segmentation. However, data labeling at the pixel level is very expensive. To address this problem, researchers have explored using synthetic data to avoid manual labeling. Although it is easy to generate labels for synthetic data, the results are much worse than those obtained with real data and manual labeling. The degradation in performance is mainly due to the domain gap, i.e., the discrepancy in pixel-value statistics between real and synthetic data. In this paper, we observe that real and synthetic humans both have a skeleton (pose) representation, and we find that skeletons can effectively bridge the synthetic and real domains during training. Our proposed approach takes advantage of the rich and realistic variations of real data and the easily obtainable labels of synthetic data to learn multi-person part segmentation on real images without any human-annotated labels. Through experiments, we show that without any human labeling, our method performs comparably to several state-of-the-art approaches that require human labeling on the Pascal-Person-Parts and COCO-DensePose datasets. On the other hand, if part labels are also available for the real images during training, our method outperforms the supervised state-of-the-art methods by a large margin. We further demonstrate the generalizability of our method by predicting novel keypoints in real images where no real labels are available for novel keypoint detection. Code and pre-trained models are available at https://github.com/kevinlin311tw/CDCL-human-part-segmentation.
Today, the cutting edge of computer vision research depends greatly on the availability of large datasets, which are critical for effectively training and testing new methods. Manually annotating visual data, however, is not only a labor-intensive process but also prone to errors. In this study, we present NOVA, a versatile framework to create realistic-looking 3D rendered worlds containing procedurally generated humans with rich pixel-level ground-truth annotations. NOVA can simulate various environmental factors such as weather conditions or different times of day, and bring an exceptionally diverse set of humans to life, each with a distinct body shape, gender, and age. To demonstrate NOVA's capabilities, we generate two synthetic datasets for person tracking. The first includes 108 sequences with different levels of difficulty, such as tracking in crowded scenes or at nighttime, and aims to test the limits of current state-of-the-art trackers. A second dataset of 97 sequences with normal weather conditions is used to show how our synthetic sequences can be utilized to train and boost the performance of deep-learning-based trackers. Our results indicate that the synthetic data generated by NOVA is a good proxy for the real world and can be exploited for computer vision tasks.
Synthetic data generation (SDG) research has been ongoing for some time, with promising results in different application domains, including healthcare, biometrics, and energy consumption. The need for a robust SDG solution to capitalise on advances in Big Data and AI technology has never been greater, enabling access to useful data while ensuring reasonable privacy protections. This paper presents a systematic review of the last 5 years (2016–2021) to analyse and report on recent approaches in synthetic tabular data generation (STDG), with a focus on the healthcare application context and preserving patient privacy, paying special attention to the contribution of Generative Adversarial Networks (GANs). In total, 34 publications were retrieved and analysed. A classification of approaches is proposed, and the performance of GAN-based approaches is extensively analysed. The systematic review concludes that there is no universal method or metric to evaluate and benchmark the performance of the various approaches, and that further research is needed to improve the generalisability of GANs and to find a model that works optimally across tabular healthcare data.
Cryo-electron tomography (cryo-ET) allows the cellular context to be visualized at the macromolecular level. To date, the impossibility of obtaining a reliable ground truth has limited the application of deep-learning-based image processing algorithms in this field. As a consequence, there is a growing demand for realistic synthetic datasets for training deep learning algorithms. In addition, besides assisting the acquisition and interpretation of experimental data, synthetic tomograms are used as reference models for analyzing cellular organization from cellular tomograms. Current simulators in cryo-ET focus on reproducing distortions from image acquisition and tomogram reconstruction; however, they cannot generate many of the low-order features present in cellular tomograms. Here we propose several geometric and organization models to simulate low-order cellular structures imaged by cryo-ET: specifically, clusters of any known cytosolic or membrane-bound macromolecules, membranes with different geometries, and different filamentous structures such as microtubules or actin-like networks. Moreover, we use parametrizable stochastic models to generate a high diversity of geometries and organizations, simulating representative and generalized datasets that include very crowded environments like those observed in native cells. These models have been implemented in a multiplatform open-source Python package, including scripts to generate cryo-tomograms with adjustable sizes and resolutions. These scripts also provide distortion-free density maps alongside the ground truth, in different file formats, for efficient access and advanced visualization. We show that such a realistic synthetic dataset can readily be used to train generalizable deep learning algorithms.
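The core idea of a stochastic geometric simulator, placing parametrized structures in a volume and emitting both a distortion-free density map and per-voxel ground truth, can be sketched in a few lines. The spherical "macromolecules", class radii, counts, and noise level below are toy assumptions, much simpler than the package's membrane and filament models.

```python
import numpy as np

rng = np.random.default_rng(4)
SIZE = 64   # voxels per side of the toy tomogram

density = np.zeros((SIZE, SIZE, SIZE))                  # distortion-free map
labels = np.zeros((SIZE, SIZE, SIZE), dtype=np.int32)   # per-voxel ground truth
zz, yy, xx = np.indices((SIZE, SIZE, SIZE))

# Stochastic placement of spherical "macromolecules" of two classes:
# class id -> (radius in voxels, number of instances).
classes = {1: (3, 20), 2: (5, 8)}
for cls, (radius, count) in classes.items():
    for _ in range(count):
        cz, cy, cx = rng.integers(radius, SIZE - radius, size=3)
        mask = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        density[mask] += 1.0
        labels[mask] = cls

# Mimic acquisition noise on top of the clean density map.
tomogram = density + rng.normal(scale=0.5, size=density.shape)
print(np.unique(labels).tolist())   # -> [0, 1, 2]
```

Pairs of `tomogram` (noisy input) and `labels` (exact segmentation target) generated this way are exactly the kind of training data that is impossible to obtain from experimental tomograms alone.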