ABSTRACT
Elucidating the connection between the properties of galaxies and the properties of their hosting haloes is a key element in galaxy formation. When the spatial distribution of objects is ...also taken under consideration, it becomes very relevant for cosmological measurements. In this paper, we use machine-learning techniques to analyse these intricate relations in the IllustrisTNG300 magnetohydrodynamical simulation, predicting baryonic properties from halo properties. We employ four different algorithms: extremely randomized trees, K-nearest neighbours, light gradient boosting machine, and neural networks, along with a unique and powerful combination of the results from all four approaches. Overall, the different algorithms produce consistent results in terms of predicting galaxy properties from a set of input halo properties that include halo mass, concentration, spin, and halo overdensity. For stellar mass, the Pearson correlation coefficient is 0.98, dropping down to 0.7–0.8 for specific star formation rate (sSFR), colour, and size. In addition, we apply, for the first time in this context, an existing data augmentation method, synthetic minority oversampling technique for regression with Gaussian noise (SMOGN), designed to alleviate the problem of imbalanced data sets, showing that it improves the overall shape of the predicted distributions and the scatter in the halo–galaxy relations. We also demonstrate that our predictions are good enough to reproduce the power spectra of multiple galaxy populations, defined in terms of stellar mass, sSFR, colour, and size with high accuracy. Our results align with previous reports suggesting that certain galaxy properties cannot be reproduced using halo features alone.
ABSTRACT
The relationship between galaxies and haloes is central to the description of galaxy formation and a fundamental step towards extracting precise cosmological information from galaxy maps. ...However, this connection involves several complex processes that are interconnected. Machine Learning methods are flexible tools that can learn complex correlations between a large number of features, but are traditionally designed as deterministic estimators. In this work, we use the IllustrisTNG300-1 simulation and apply neural networks in a binning classification scheme to predict probability distributions of central galaxy properties, namely stellar mass, colour, specific star formation rate, and radius, using as input features the halo mass, concentration, spin, age, and the overdensity on a scale of 3 h−1 Mpc. The model captures the intrinsic scatter in the relation between halo and galaxy properties, and can thus be used to quantify the uncertainties related to the stochasticity of the galaxy properties with respect to the halo properties. In particular, with our proposed method, one can define and accurately reproduce the properties of the different galaxy populations in great detail. We demonstrate the power of this tool by directly comparing traditional single-point estimators and the predicted joint probability distributions, and also by computing the power spectrum of a large number of tracers defined on the basis of the predicted colour–stellar mass diagram. We show that the neural networks reproduce clustering statistics of the individual galaxy populations with excellent precision and accuracy.
ABSTRACT
In this series of papers, we employ several machine learning (ML) methods to classify the point-like sources from the miniJPAS catalogue, and identify quasar candidates. Since no ...representative sample of spectroscopically confirmed sources exists at present to train these ML algorithms, we rely on mock catalogues. In this first paper, we develop a pipeline to compute synthetic photometry of quasars, galaxies, and stars using spectra of objects targeted as quasars in the Sloan Digital Sky Survey. To match the same depths and signal-to-noise ratio distributions in all bands expected for miniJPAS point sources in the range 17.5 ≤ r < 24, we augment our sample of available spectra by shifting the original r-band magnitude distributions towards the faint end, ensure that the relative incidence rates of the different objects are distributed according to their respective luminosity functions, and perform a thorough modelling of the noise distribution in each filter, by sampling the flux variance either from Gaussian realizations with given widths, or from combinations of Gaussian functions. Finally, we also add in the mocks the patterns of non-detections which are present in all real observations. Although the mock catalogues presented in this work are a first step towards simulated data sets that match the properties of the miniJPAS observations, these mocks can be adapted to serve the purposes of other photometric surveys.
ABSTRACT
Astrophysical surveys rely heavily on the classification of sources as stars, galaxies, or quasars from multiband photometry. Surveys in narrow-band filters allow for greater discriminatory ...power, but the variety of different types and redshifts of the objects present a challenge to standard template-based methods. In this work, which is part of a larger effort that aims at building a catalogue of quasars from the miniJPAS survey, we present a machine learning-based method that employs convolutional neural networks (CNNs) to classify point-like sources including the information in the measurement errors. We validate our methods using data from the miniJPAS survey, a proof-of-concept project of the Javalambre Physics of the Accelerating Universe Astrophysical Survey (J-PAS) collaboration covering ∼1 deg2 of the northern sky using the 56 narrow-band filters of the J-PAS survey. Due to the scarcity of real data, we trained our algorithms using mocks that were purpose-built to reproduce the distributions of different types of objects that we expect to find in the miniJPAS survey, as well as the properties of the real observations in terms of signal and noise. We compare the performance of the CNNs with other well-established machine learning classification methods based on decision trees, finding that the CNNs improve the classification when the measurement errors are provided as inputs. The predicted distribution of objects in miniJPAS is consistent with the putative luminosity functions of stars, quasars, and unresolved galaxies. Our results are a proof of concept for the idea that the J-PAS survey will be able to detect unprecedented numbers of quasars with high confidence.
Abstract
Errors in measurements are key to weighting the value of data, but are often neglected in machine learning (ML). We show how convolutional neural networks (CNNs) are able to learn about the ...context and patterns of signal and noise, leading to improvements in the performance of classification methods. We construct a model whereby two classes of objects follow an underlying Gaussian distribution, and where the features (the input data) have varying, but known, levels of noise—in other words, each data point has a different error bar. This model mimics the nature of scientific data sets, such as those from astrophysical surveys, where noise arises as a realization of random processes with known underlying distributions. The classification of these objects can then be performed using standard statistical techniques (e.g. least squares minimization), as well as ML techniques. This allows us to take advantage of a maximum likelihood approach to object classification, and to measure the amount by which the ML methods are incorporating the information in the input data uncertainties. We show that, when each data point is subject to different levels of noise (i.e. noises with different distribution functions, which is typically the case in scientific data sets), that information can be learned by the CNNs, raising the ML performance to at least the same level of the least squares method—and sometimes even surpassing it. Furthermore, we show that, with varying noise levels, the confidence of the ML classifiers serves as a proxy for the underlying cumulative distribution function, but only if the information about specific input data uncertainties is provided to the CNNs.
The miniJPAS survey quasar selection Pérez-Ràfols, Ignasi; Abramo, Luis Raul; Martínez-Solaeche, Ginés ...
Astronomy and astrophysics (Berlin),
10/2023, Letnik:
678
Journal Article
Recenzirano
Odprti dostop
Aims
. Quasar catalogues from photometric data are used in a variety of applications including those targeting spectroscopic follow-up, measurements of supermassive black hole masses, Baryon Acoustic ...Oscillations, or non-Gaussianities. Here, we present a list of quasar candidates including photometric redshift estimates from the miniJPAS Data Release constructed using SQUEzE. miniJPAS is a small proof-of-concept survey covering 1 deg
2
with the full J-PAS filter system, consisting of 54 narrow filters and 2 broader filters covering the entire optical wavelength range.
Methods
. This work is based on the machine-learning classification of photometric data of quasar candidates using SQUEzE. It has the advantage that its classification procedure can be explained to some extent, making it less of a ‘black box’ when compared with other classifiers. Another key advantage is that the use of user-defined metrics means the user has more control over the classification. While SQUEzE was designed for spectroscopic data, we have adapted it for multi-band photometric data; that is we treat multiple narrow-band filters as very low-resolution spectra. We trained our models using specialised mocks. We estimated our redshift precision using the normalised median absolute deviation,
σ
NMAD
, applied to our test sample.
Results
. Our test sample returns an
f
1
score (effectively the purity and completeness) of 0.49 for high-
z
quasars (with
z
≥ 2.1) down a to magnitude of
r
= 24.3 and 0.24 for low-
z
quasars (with
z
< 2.1), also down to a magnitude of
r
= 24.3. For high-
z
quasars, this goes up to 0.9 for magnitudes of
r
< 21.0. We present two catalogues of quasar candidates including redshift estimates: 301 from point-like sources and 1049 when also including extended sources. We discuss the impact of including extended sources in our predictions (they are not included in the mocks), as well as the impact of changing the noise model of the mocks. We also give an explanation of SQUEzE reasoning. Our estimates for the redshift precision using the test sample indicate a
σ
NMAD
= 0.92% for the entire sample, reduced to 0.81% for
r
< 22.5 and 0.74% for
r
< 21.3. Spectroscopic follow-up of the candidates is required in order to confirm the validity of our findings.
The relationship between galaxies and haloes is central to the description of galaxy formation, and a fundamental step towards extracting precise cosmological information from galaxy maps. However, ...this connection involves several complex processes that are interconnected. Machine Learning methods are flexible tools that can learn complex correlations between a large number of features, but are traditionally designed as deterministic estimators. In this work, we use the IllustrisTNG300-1 simulation and apply neural networks in a binning classification scheme to predict probability distributions of central galaxy properties, namely stellar mass, colour, specific star formation rate, and radius, using as input features the halo mass, concentration, spin, age, and the overdensity on a scale of 3 \(h^{-1}\) Mpc. The model captures the intrinsic scatter in the relation between halo and galaxy properties, and can thus be used to quantify the uncertainties related to the stochasticity of the galaxy properties with respect to the halo properties. In particular, with our proposed method, one can define and accurately reproduce the properties of the different galaxy populations in great detail. We demonstrate the power of this tool by directly comparing traditional single-point estimators and the predicted joint probability distributions, and also by computing the power spectrum of a large number of tracers defined on the basis of the predicted colour-stellar mass diagram. We show that the neural networks reproduce clustering statistics of the individual galaxy populations with excellent precision and accuracy.
Errors in measurements are key to weighting the value of data, but are often neglected in Machine Learning (ML). We show how Convolutional Neural Networks (CNNs) are able to learn about the context ...and patterns of signal and noise, leading to improvements in the performance of classification methods. We construct a model whereby two classes of objects follow an underlying Gaussian distribution, and where the features (the input data) have varying, but known, levels of noise. This model mimics the nature of scientific data sets, where the noises arise as realizations of some random processes whose underlying distributions are known. The classification of these objects can then be performed using standard statistical techniques (e.g., least-squares minimization or Markov-Chain Monte Carlo), as well as ML techniques. This allows us to take advantage of a maximum likelihood approach to object classification, and to measure the amount by which the ML methods are incorporating the information in the input data uncertainties. We show that, when each data point is subject to different levels of noise (i.e., noises with different distribution functions), that information can be learned by the CNNs, raising the ML performance to at least the same level of the least-squares method -- and sometimes even surpassing it. Furthermore, we show that, with varying noise levels, the confidence of the ML classifiers serves as a proxy for the underlying cumulative distribution function, but only if the information about specific input data uncertainties is provided to the CNNs.
Elucidating the connection between the properties of galaxies and the properties of their hosting haloes is a key element in galaxy formation. When the spatial distribution of objects is also taken ...under consideration, it becomes very relevant for cosmological measurements. In this paper, we use machine learning techniques to analyse these intricate relations in the IllustrisTNG300 magnetohydrodynamical simulation, predicting baryonic properties from halo properties. We employ four different algorithms: extremely randomized trees, K-nearest neighbours, light gradient boosting machine, and neural networks, along with a unique and powerful combination of the results from all four approaches. Overall, the different algorithms produce consistent results in terms of predicting galaxy properties from a set of input halo properties that include halo mass, concentration, spin, and halo overdensity. For stellar mass, the Pearson correlation coefficient is 0.98, dropping down to 0.7-0.8 for specific star formation rate (sSFR), colour, and size. In addition, we apply, for the first time in this context, an existing data augmentation method, synthetic minority over-sampling technique for regression with Gaussian noise (SMOGN), designed to alleviate the problem of imbalanced data sets, showing that it improves the overall shape of the predicted distributions and the scatter in the halo-galaxy relations. We also demonstrate that our predictions are good enough to reproduce the power spectra of multiple galaxy populations, defined in terms of stellar mass, sSFR, colour, and size with high accuracy. Our results align with previous reports suggesting that certain galaxy properties cannot be reproduced using halo features alone.