Synthetic data
Patel, Preeti
Business Information Review, 06/2024, Volume 41, Issue 2
Journal Article
The rise of data-driven businesses poses a number of significant challenges for contemporary organisations. These include legal and ethical considerations arising from the use of personal data, the growing challenges of information security, and the difficulty of managing the volume of data generated in business transactions of different kinds. The exponential growth of data continues unabated, with global data volumes expected to reach 181 zettabytes by 2025 and with 90% of the world’s data generated in the last two years alone. This massive growth can be attributed mainly to data gathered by Internet of Things (IoT) and related sensory devices, in addition to data generated through the human use of digital tools and applications. Given this abundance of real-world data, in what context could synthetic data be necessary? This paper highlights the growing organisational use of synthetic data and explores where and how it can be optimally used. It examines the ethical aspects of synthetic data usage, the need to garner public perception and acceptance, and the key aspects of traceability, accountability and risk mitigation.
Infrastructure scene understanding from image data aids diverse applications in construction and maintenance. Recently, deep learning models have been employed to extract information regarding infrastructure from visual data. The performance of these models depends significantly on the volume of training data. However, preparing the training data is time-consuming and laborious, as it entails labeling numerous images. To address this issue, this paper proposes a method for generating high-quality synthetic data that includes the automatic annotation of infrastructure scenes. The method consists of three steps: 1) translating building information model (BIM) images into real-world images, 2) automatically labeling them using the spatial information contained in the BIM to generate various synthetic datasets, and 3) splicing the selected synthetic datasets together to form the final synthetic dataset. The Mask R-CNN models trained with building and bridge synthetic data achieved average precisions of 71.6% and 84.9%, respectively.
•The proposed method generates high-quality synthetic data on infrastructure scenes.
•Annotation of the synthetic images is performed automatically.
•Synthetic images are generated by transforming BIM images using CycleGAN.
•Two-stage synthetic data generation is proposed for better segmentation performance.
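The automatic-annotation step above exploits the fact that a BIM render already knows which element every pixel belongs to. A minimal sketch of that idea, assuming a hypothetical flat-colour class coding in the render (the colour map and class names below are illustrative, not taken from the paper):

```python
# Sketch: derive pixel-accurate segmentation masks from a BIM render in
# which each element class is drawn in a unique flat colour (assumption).

# Hypothetical class-colour map -- not from the paper.
CLASS_COLOURS = {
    (255, 0, 0): "column",
    (0, 255, 0): "beam",
    (0, 0, 255): "slab",
}

def masks_from_bim_render(image):
    """image: 2-D grid of RGB tuples -> dict mapping class name to binary mask."""
    h, w = len(image), len(image[0])
    masks = {name: [[0] * w for _ in range(h)] for name in CLASS_COLOURS.values()}
    for y in range(h):
        for x in range(w):
            name = CLASS_COLOURS.get(image[y][x])
            if name is not None:
                masks[name][y][x] = 1
    return masks

# Tiny 2x3 render: one column pixel, two beam pixels, one slab pixel.
render = [
    [(255, 0, 0), (0, 255, 0), (0, 255, 0)],
    [(0, 0, 0), (0, 0, 0), (0, 0, 255)],
]
masks = masks_from_bim_render(render)
```

Because the labels are generated, not drawn by hand, the cost of annotation does not grow with dataset size, which is the point of the pipeline.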
Data sparsity is one of the challenges for low-resource language pairs in Neural Machine Translation (NMT). Previous works have presented different approaches to data augmentation, but they mostly require additional resources and obtain low-quality dummy data in low-resource settings. This paper proposes a simple and effective novel method for generating synthetic bilingual data without using external resources, unlike previous approaches. Moreover, some recent works have shown that multilingual translation or transfer learning can boost translation quality in low-resource situations. However, for logographic languages such as Chinese or Japanese, this approach is still limited due to differences in the translation units of the vocabularies. Although Japanese texts contain Kanji characters that are derived from Chinese characters and are quite similar in shape and meaning, the word orders of sentences in these languages diverge significantly. Our study investigates these impacts in machine translation. In addition, a combined pre-trained model is also leveraged to demonstrate the efficacy of translation tasks in a higher-resource scenario. Our experiments show performance improvements of up to +6.2 and +7.8 BLEU over bilingual baseline systems on two low-resource translation tasks, Chinese to Vietnamese and Japanese to Vietnamese.
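One common, resource-free way to expand a bilingual corpus (a generic illustration, not the paper's actual algorithm) is to substitute an aligned word pair inside an existing sentence pair, so the result stays parallel by construction. A minimal sketch with placeholder tokens:

```python
# Hedged sketch of aligned-pair substitution for synthetic bilingual data.
# The rule format and token names are illustrative assumptions.

def substitute(src, tgt, rule):
    """rule = ((old_src, old_tgt), (new_src, new_tgt)); applies only when
    both old words occur, so the synthetic pair remains parallel."""
    (a, b), (a2, b2) = rule
    if a in src and b in tgt:
        return ([a2 if w == a else w for w in src],
                [b2 if w == b else w for w in tgt])
    return None  # rule not applicable to this pair

src, tgt = ["s1", "s2", "s3"], ["t1", "t2", "t3"]
rule = (("s2", "t2"), ("s2x", "t2x"))
new_pair = substitute(src, tgt, rule)
```

Each applicable rule yields one new sentence pair whose fluency matches the original except at the swapped position, which is why such dummy data can be low quality when rules are noisy.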
Algorithmic evaluation is a vital step in developing new approaches to machine learning and relies on the availability of existing datasets. However, real-world datasets often do not cover the necessary complexity space required to understand an algorithm’s domains of competence. As such, the generation of synthetic datasets to fill gaps in the complexity space has gained attention, offering a means of evaluating algorithms when data is unavailable. Existing approaches to complexity-focused data generation are limited in their ability to generate solutions that invoke similar classification behaviour to real data. The present work proposes a novel method (Sy:Boid) for complexity-based synthetic data generation, adapting and extending the Boid algorithm that was originally intended for computer graphics simulations. Sy:Boid embeds the modified Boid algorithm within an evolutionary multi-objective optimisation algorithm to generate synthetic datasets which satisfy predefined magnitudes of complexity measures. Sy:Boid is evaluated and compared to labelling-based and sampling-based approaches to data generation to understand its ability to generate a wide variety of realistic datasets. Results demonstrate Sy:Boid is capable of generating datasets across a greater portion of the complexity space than existing approaches. Furthermore, the produced datasets were observed to invoke classification behaviours very similar to those of real data.
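The core idea of complexity-targeted generation can be illustrated with a toy stand-in (a greedy repair loop, not Sy:Boid's Boid/evolutionary machinery): adjust a synthetic two-class dataset until a simple complexity measure, here class overlap, hits a predefined target value.

```python
# Toy sketch of complexity-targeted data generation. The overlap measure
# and the greedy repair strategy are illustrative assumptions.

def overlap(class_a, class_b):
    """Fraction of class-A points lying inside class B's value range."""
    lo, hi = min(class_b), max(class_b)
    return sum(lo <= x <= hi for x in class_a) / len(class_a)

def generate(target, n=50):
    a = [i / n for i in range(n)]        # class A starts in [0, 1)
    b = [2 + i / n for i in range(n)]    # class B fixed in [2, 3)
    lo, hi = min(b), max(b)
    want = round(target * n)
    for _ in range(n):
        inside = [x for x in a if lo <= x <= hi]
        if len(inside) == want:
            break                        # target complexity reached
        if len(inside) < want:
            # pull the outside point nearest the window into the overlap
            outside = [x for x in a if not lo <= x <= hi]
            x = min(outside, key=lambda v: min(abs(v - lo), abs(v - hi)))
            a[a.index(x)] = (lo + hi) / 2
        else:
            a[a.index(inside[0])] = lo - 1.0   # push one point back out
    return a, b

a, b = generate(target=0.3)
```

Sy:Boid replaces this single scalar target with multiple complexity measures and searches with an evolutionary multi-objective optimiser, but the objective has the same shape: minimise the distance between measured and requested complexity.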
Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance in infrared small target detection. However, the limited amount of public training data restricts the performance improvement of CNN-based methods. To handle the scarcity of training data, we propose a method that can generate synthetic training data for infrared small target detection. We adopt the generative adversarial network framework, in which synthetic background images and infrared small targets are generated in two independent processes. In the first stage, we synthesize infrared images by transforming visible images into infrared ones. In the second stage, target masks are implanted on the transformed images. Then, the proposed intensity modulation network synthesizes realistic target objects that can be further diversified through image processing. Experimental results on a recent public dataset show that when we train various detection networks using a dataset composed of both real and synthetic images, the detection networks yield better performance than when using real data only.
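The second stage, target implantation, can be sketched under simplifying assumptions (a fixed Gaussian intensity profile and a local-contrast rule, neither of which is the paper's learned intensity modulation network):

```python
# Toy sketch of implanting a small infrared target onto a background.
# Radius, contrast rule, and Gaussian profile are illustrative assumptions.
import math

def implant_target(background, cy, cx, radius=2, contrast=1.5):
    """background: 2-D list of floats; returns a copy with a dim blob at (cy, cx)."""
    h, w = len(background), len(background[0])
    peak = background[cy][cx] * contrast   # modulate vs. local intensity
    out = [row[:] for row in background]
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            out[y][x] = max(out[y][x], peak * math.exp(-d2 / radius ** 2))
    return out

bg = [[10.0] * 16 for _ in range(16)]      # flat synthetic background
img = implant_target(bg, 8, 8)
```

Varying the position, radius, and contrast per implanted target is what gives the synthetic dataset its diversity; the paper's network learns this modulation rather than fixing it by hand.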
Use of synthetic data has provided a potential solution for addressing unavailable or insufficient training samples in deep learning-based magnetic resonance imaging (MRI). However, the challenge brought by the domain gap between synthetic and real data is usually encountered, especially under complex experimental conditions. In this study, by combining Bloch simulation and general MRI models, we propose a framework for addressing the lack of training data in supervised learning scenarios, termed MOST-DL. A challenging application is demonstrated to verify the proposed framework: achieving motion-robust T2 mapping using single-shot overlapping-echo acquisition. We decompose the process into two main steps: (1) calibrationless parallel reconstruction for the ultra-fast pulse sequence and (2) intra-shot motion correction for T2 mapping. To bridge the domain gap, realistic textures from a public database and simulations of various imperfections were explored. The neural network was first trained with pure synthetic data and then evaluated on in vivo human brain data. Both simulation and in vivo experiments show that the MOST-DL method significantly reduces ghosting and motion artifacts in T2 maps in the presence of unpredictable subject movement and has the potential to be applied to motion-prone patients in the clinic. Our code is available at https://github.com/qinqinyang/MOST-DL .
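The synthetic-training idea can be illustrated with a much simpler signal model than the paper's full Bloch simulation: a mono-exponential decay S = S0·exp(-TE/T2) generates (signal, ground-truth T2) training pairs for free, and any estimator can be checked against the known truth. A minimal sketch (parameter ranges and noise level are illustrative assumptions):

```python
# Sketch: synthetic training-pair generation for T2 mapping under a
# simple mono-exponential model (not the paper's Bloch simulation).
import math, random

def make_sample(rng, echo_times):
    t2 = rng.uniform(20.0, 200.0)      # ms, hypothetical tissue range
    s0 = rng.uniform(0.5, 1.0)         # arbitrary proton-density scale
    signal = [s0 * math.exp(-te / t2) + rng.gauss(0, 0.002)
              for te in echo_times]
    return signal, t2                  # (input, label) training pair

def fit_t2(signal, echo_times):
    """Log-linear least-squares fit of S = S0 * exp(-TE / T2)."""
    xs = echo_times
    ys = [math.log(max(s, 1e-6)) for s in signal]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -1.0 / slope

rng = random.Random(0)
tes = [10.0, 30.0, 50.0, 70.0, 90.0]   # echo times in ms (illustrative)
sig, t2_true = make_sample(rng, tes)
t2_hat = fit_t2(sig, tes)
```

In MOST-DL the same principle applies at full scale: because labels come from simulation, supervision is exact; the hard part, which the paper addresses with realistic textures and imperfection simulation, is making the synthetic inputs match the real-data distribution.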
We develop metrics for measuring the quality of synthetic health data for both education and research. We use novel and existing metrics to capture a synthetic dataset’s resemblance, privacy, utility, and footprint. Using these metrics, we develop an end-to-end workflow based on our generative adversarial network (GAN) method, HealthGAN, that creates privacy-preserving synthetic health data. Our workflow meets the privacy specifications of our data partner: (1) the HealthGAN is trained inside a secure environment; (2) the HealthGAN model is used outside of the secure environment by external users to generate synthetic data. This second step facilitates data handling for external users by avoiding de-identification, which may require special user training, be costly, or cause loss of data fidelity. This workflow is compared against five other baseline methods. While maintaining resemblance and utility comparable to other methods, HealthGAN provides the best privacy and footprint. We present two case studies in which our methodology was put to work in classroom and research settings. We evaluate utility in the classroom through a data analysis challenge given to students, and in research by replicating three different medical papers with synthetic data. Data, code, and the challenge that we organized for educational purposes are available.
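One simple privacy check in the spirit of the metrics above (illustrative only, not HealthGAN's actual metric) is to flag synthetic records that sit suspiciously close to a real record, a crude signal of memorisation or re-identification risk:

```python
# Sketch: nearest-real-record distance check for synthetic data privacy.
# Threshold and record encoding are illustrative assumptions.

def nearest_dist(rec, dataset):
    """Euclidean distance from rec to its nearest record in dataset."""
    return min(sum((a - b) ** 2 for a, b in zip(rec, other)) ** 0.5
               for other in dataset)

def too_close_fraction(synthetic, real, threshold):
    """Fraction of synthetic records closer than threshold to a real one."""
    flags = [nearest_dist(s, real) < threshold for s in synthetic]
    return sum(flags) / len(flags)

real = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
synthetic = [[0.01, 0.0], [5.0, 5.0]]   # first row nearly copies a real row
frac = too_close_fraction(synthetic, real, threshold=0.1)
```

A generator that simply memorised its training data would score near 1.0 here, which is why privacy metrics must be evaluated alongside resemblance and utility rather than in isolation.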
Hyperspectral image (HSI) denoising has been widely utilized to improve HSI quality. Recently, learning-based HSI denoising methods have shown their effectiveness, but most of them are trained on synthetic datasets and lack generalization capability on real test HSIs. Moreover, there is still no public paired real HSI denoising dataset with which to train HSI denoising networks and quantitatively evaluate HSI denoising methods. In this paper, we mainly focus on how to produce realistic datasets for learning and evaluating HSI denoising networks. On the one hand, we collect a paired real HSI denoising dataset, which consists of short-exposure noisy HSIs and the corresponding long-exposure clean HSIs. On the other hand, we propose an accurate HSI noise model which matches the distribution of real data well and can be employed to synthesize realistic datasets. On the basis of the noise model, we present an approach to calibrate the noise parameters of a given hyperspectral camera. Besides, based on the observation that the mean image of all spectral bands has a high signal-to-noise ratio, we propose a guided HSI denoising network with guided dynamic nonlocal attention, which calculates dynamic nonlocal correlation on the guidance information, i.e., the mean image of the spectral bands, and adaptively aggregates spatial nonlocal features for all spectral bands. Extensive experimental results show that a network learned with only synthetic data generated by our noise model performs as well as one learned with paired real data, and that our guided HSI denoising network outperforms state-of-the-art methods in both quantitative metrics and visual quality.
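The calibration idea can be sketched with a common signal-dependent noise model (illustrative; the paper's model is more detailed): variance grows linearly with signal level, var = a·mean + b, i.e. shot noise plus read noise, and (a, b) are recovered by fitting sample variance against sample mean over flat patches.

```python
# Sketch: calibrating a linear mean-variance noise model from flat patches.
# Model form, patch levels, and sample counts are illustrative assumptions.
import random

def simulate_patch(rng, mean, a, b, n=4000):
    """Flat-field patch under a signal-dependent model: var = a*mean + b."""
    return [mean + rng.gauss(0, (a * mean + b) ** 0.5) for _ in range(n)]

def calibrate(patches):
    """Least-squares fit of sample variance against sample mean."""
    pts = []
    for p in patches:
        m = sum(p) / len(p)
        v = sum((x - m) ** 2 for x in p) / (len(p) - 1)
        pts.append((m, v))
    mx = sum(m for m, _ in pts) / len(pts)
    my = sum(v for _, v in pts) / len(pts)
    a = (sum((m - mx) * (v - my) for m, v in pts)
         / sum((m - mx) ** 2 for m, _ in pts))
    return a, my - a * mx              # slope a, intercept b

rng = random.Random(0)
patches = [simulate_patch(rng, m, a=0.5, b=2.0) for m in (10, 40, 80, 160)]
a_hat, b_hat = calibrate(patches)
```

Once (a, b) are calibrated for a camera, realistic noisy training images can be synthesized from any clean HSI by sampling from the fitted model, which is how a purely synthetic training set can approach paired real data.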
The Glottal Area Waveform (GAW) is an important component in quantitative clinical voice assessment, providing valuable insights into vocal fold function. In this study, we introduce a novel method employing Variational Autoencoders (VAEs) to generate synthetic GAWs. Our approach enables the creation of synthetic GAWs that closely replicate real-world data, offering a versatile tool for researchers and clinicians. We elucidate the process of manipulating the VAE latent space using the Glottal Opening Vector (GlOVe). The GlOVe allows precise control over the synthetic closure and opening of the vocal folds. By utilizing the GlOVe, we generate synthetic laryngeal biosignals. These biosignals accurately reflect vocal fold behavior, allowing for the emulation of realistic glottal opening changes. This manipulation extends to the introduction of arbitrary oscillations in the vocal folds, closely resembling real vocal fold oscillations. The range of factor coefficient values enables the generation of diverse biosignals with varying frequencies and amplitudes. Our results demonstrate that this approach yields highly accurate laryngeal biosignals, with Normalized Mean Absolute Error values ranging from 9.6 ⋅ 10⁻³ to 1.20 ⋅ 10⁻² across the tested frequencies, alongside a remarkable training effectiveness, reflected in reductions of up to approximately 89.52% in key loss components. This proposed method may have implications for downstream speech synthesis and phonetics research, offering the potential for advanced and natural-sounding speech technologies.
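The latent-space manipulation can be illustrated with a toy stand-in for the trained VAE (the decoder weights and the direction vector below are invented for illustration, not the paper's GlOVe): adding a scaled direction vector to a latent code and decoding should change the output attribute, here glottal area, monotonically with the scale factor.

```python
# Toy sketch of GlOVe-style latent arithmetic with a stand-in decoder.
# The latent dimension, direction, and decoder weights are assumptions.
import math

GLOVE = [1.0, 0.0, -0.5]   # hypothetical "glottal opening" latent direction

def decode_area(z):
    """Stand-in decoder: maps a latent code to a scalar glottal area in (0, 1)."""
    w = [0.8, 0.1, -0.4]
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, z))))

z = [0.0, 0.0, 0.0]        # base latent code (e.g. a closed glottis)
areas = [decode_area([zi + alpha * gi for zi, gi in zip(z, GLOVE)])
         for alpha in (0.0, 0.5, 1.0, 1.5)]
```

Sweeping the scale factor periodically is what turns this static control into synthetic oscillations of the vocal folds in the paper's setting.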