Why data analytics is an art Charles, Vincent; Emrouznejad, Ali; Gherman, Tatiana ...
Significance (Oxford, England), December 2022 (2022-12-01), Volume 19, Issue 6
Journal Article
Open access
Data analytics projects can be like throwing darts in the dark. Problem‐centric thinking is vital, argue Vincent Charles, Ali Emrouznejad, Tatiana Gherman, and James Cochran
The growing field of large-scale time domain astronomy requires methods for probabilistic data analysis that are computationally tractable, even with large data sets. Gaussian processes (GPs) are a popular class of models used for this purpose, but since the computational cost scales, in general, as the cube of the number of data points, their application has been limited to small data sets. In this paper, we present a novel method for GP modeling in one dimension where the computational requirements scale linearly with the size of the data set. We demonstrate the method by applying it to simulated and real astronomical time series data sets. These demonstrations are examples of probabilistic inference of stellar rotation periods, asteroseismic oscillation spectra, and transiting planet parameters. The method exploits structure in the problem when the covariance function is expressed as a mixture of complex exponentials, without requiring evenly spaced observations or uniform noise. This form of covariance arises naturally when the process is a mixture of stochastically driven damped harmonic oscillators (providing a physical motivation for, and interpretation of, this choice), but we also demonstrate that it can be a useful effective model in some other cases. We present a mathematical description of the method and compare it to existing scalable GP methods. The method is fast and interpretable, with a range of potential applications within astronomical data analysis and beyond. We provide well-tested and documented open-source implementations of this method in C++, Python, and Julia.
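The scaling contrast described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or its released code: it assumes the simplest member of the kernel family, a single real exponential k(τ) = a·exp(−c|τ|), and swaps in a scalar Kalman filter as the linear-time evaluator. The function names and the parameters `a` and `c` are hypothetical. Both routines compute the same Gaussian log-likelihood for unevenly spaced times and non-uniform noise; the first builds the full covariance matrix (cubic cost), the second makes one pass over the data (linear cost).

```python
import numpy as np

def dense_loglike(t, y, yerr, a, c):
    """Reference O(N^3) log-likelihood via Cholesky of the full covariance."""
    tau = np.abs(t[:, None] - t[None, :])
    K = a * np.exp(-c * tau) + np.diag(yerr**2)   # kernel plus per-point noise
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)                 # alpha = L^{-1} y
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(t) * np.log(2.0 * np.pi))

def linear_loglike(t, y, yerr, a, c):
    """O(N) log-likelihood for the same model using a scalar Kalman filter.

    Assumes t is sorted. The exponential kernel is Markovian, so the latent
    process at t[n] depends on the past only through its value at t[n-1].
    """
    mean, var = 0.0, a        # stationary prior on the latent process
    ll = 0.0
    for n in range(len(t)):
        if n > 0:
            phi = np.exp(-c * (t[n] - t[n - 1]))  # decay over the (uneven) gap
            mean = phi * mean
            var = phi**2 * var + a * (1.0 - phi**2)
        s = var + yerr[n]**2                      # predictive variance of y[n]
        resid = y[n] - mean
        ll += -0.5 * (np.log(2.0 * np.pi * s) + resid**2 / s)
        gain = var / s                            # update latent state with y[n]
        mean += gain * resid
        var *= (1.0 - gain)
    return ll

# Illustrative data: uneven sampling and heteroscedastic error bars.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, 200))
yerr = rng.uniform(0.1, 0.3, 200)
y = rng.normal(size=200)

print(dense_loglike(t, y, yerr, a=1.0, c=0.5))
print(linear_loglike(t, y, yerr, a=1.0, c=0.5))
```

The two printed values should agree to numerical precision. Extending this idea to sums of complex exponentials, as the abstract describes, requires the structured solver presented in the paper rather than this scalar filter.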
Here, we present two galaxy shape catalogues from the Dark Energy Survey Year 1 data set, covering 1500 square degrees with a median redshift of 0.59. The catalogues cover two main fields: Stripe 82, and an area overlapping the South Pole Telescope survey region. We also describe our data analysis process and in particular our shape measurement using two independent shear measurement pipelines, METACALIBRATION and IM3SHAPE. The METACALIBRATION catalogue uses a Gaussian model with an innovative internal calibration scheme, and was applied to riz bands, yielding 34.8M objects. The IM3SHAPE catalogue uses a maximum-likelihood bulge/disc model calibrated using simulations, and was applied to r-band data, yielding 21.9M objects. Both catalogues pass a suite of null tests that demonstrate their fitness for use in weak lensing science. Finally, we estimated the 1σ uncertainties in multiplicative shear calibration to be 0.013 and 0.025 for the METACALIBRATION and IM3SHAPE catalogues, respectively.
Data Feminism D'Ignazio, Catherine; Klein, Lauren F
03/2020
eBook
Open access
A new way of thinking about data science and data ethics that is informed by the ideas of intersectional feminism.
The open access edition of this book was made possible by generous funding from the MIT Libraries.
Today, data science is a form of power. It has been used to expose injustice, improve health outcomes, and topple governments. But it has also been used to discriminate, police, and surveil. This potential for good, on the one hand, and harm, on the other, makes it essential to ask: Data science by whom? Data science for whom? Data science with whose interests in mind? The narratives around big data and data science are overwhelmingly white, male, and techno-heroic. In Data Feminism, Catherine D'Ignazio and Lauren Klein present a new way of thinking about data science and data ethics—one that is informed by intersectional feminist thought.
Illustrating data feminism in action, D'Ignazio and Klein show how challenges to the male/female binary can help challenge other hierarchical (and empirically wrong) classification systems. They explain how, for example, an understanding of emotion can expand our ideas about effective data visualization, and how the concept of invisible labor can expose the significant human efforts required by our automated systems. And they show why the data never, ever “speak for themselves.”
Data Feminism offers strategies for data scientists seeking to learn how feminism can help them work toward justice, and for feminists who want to focus their efforts on the growing field of data science. But Data Feminism is about much more than gender. It is about power, about who has it and who doesn't, and about how those differentials of power can be challenged and changed.
ABSTRACT
Astronomers have typically set out to solve supervised machine learning problems by creating their own representations from scratch. We show that deep learning models trained to answer every Galaxy Zoo DECaLS question learn meaningful semantic representations of galaxies that are useful for new tasks on which the models were never trained. We exploit these representations to outperform several recent approaches at practical tasks crucial for investigating large galaxy samples. The first task is identifying galaxies of similar morphology to a query galaxy. Given a single galaxy assigned a free text tag by humans (e.g. ‘#diffuse’), we can find galaxies matching that tag for most tags. The second task is identifying the most interesting anomalies to a particular researcher. Our approach is 100 per cent accurate at identifying the most interesting 100 anomalies (as judged by Galaxy Zoo 2 volunteers). The third task is adapting a model to solve a new task using only a small number of newly labelled galaxies. Models fine-tuned from our representation are better able to identify ring galaxies than models fine-tuned from terrestrial images (ImageNet) or trained from scratch. We solve each task with very few new labels; either one (for the similarity search) or several hundred (for anomaly detection or fine-tuning). This challenges the longstanding view that deep supervised methods require new large labelled data sets for practical use in astronomy. To help the community benefit from our pretrained models, we release our fine-tuning code zoobot. Zoobot is accessible to researchers with no prior experience in deep learning.
This paper presents and summarizes a software package ("LPipe") for completely automated, end-to-end reduction of both bright and faint sources with the Low Resolution Imaging Spectrometer (LRIS) at Keck Observatory. It supports all gratings, grisms, and dichroics, and also reduces imaging observations, although it does not include multislit or polarimetric reduction capabilities at present. It is suitable for on-the-fly quicklook reductions at the telescope, for large-scale reductions of archival data sets, and (in many cases) for science-quality post-run reductions of PI data. To demonstrate its capabilities the pipeline is run in fully automated mode on all LRIS longslit data in the Keck Observatory Archive acquired during the 12-month period between 2016 August and 2017 July. The reduced spectra (of 675 single-object targets, totaling ∼200 hours of on-source integration time in each camera), and the pipeline itself, are made publicly available to the community.