

We present an analysis of anomaly detection for machine learning redshift estimation. Anomaly detection allows the removal of poor training examples, which can adversely influence redshift estimates. Anomalous training examples may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies with one or more poorly measured photometric quantities. We select 2.5 million ‘clean’ SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730 ‘anomalous’ galaxies with spectroscopic redshift measurements which are flagged as unreliable. We contaminate the clean base galaxy sample with galaxies with unreliable redshifts and attempt to recover the contaminating galaxies using the Elliptical Envelope technique. We then train four machine learning architectures for redshift analysis on both the contaminated sample and on the preprocessed ‘anomaly-removed’ sample and measure redshift statistics on a clean validation sample generated without any preprocessing. We find an improvement on all measured statistics of up to 80 per cent when training on the anomaly-removed sample as compared with training on the contaminated sample for each of the machine learning routines explored. We further describe a method to estimate the contamination fraction of a base data sample.

Catalogues, surveys, galaxies: distances and redshifts

INTRODUCTION

Photometric surveys can be maximally exploited for large-scale structure analyses once galaxies have been identified and their positions on the sky and in redshift space have been measured. Measuring accurate spectroscopic redshifts is costly and time intensive, and is typically only performed for a small subsample of all galaxies. For this subsample of galaxies, one may learn a mapping between the measured photometric properties and the spectroscopic redshift. This mapping can then be applied to all photometrically identified galaxies to estimate redshifts. This is the basis of machine learning, and inherently assumes that the galaxies used to construct the mapping form an unbiased and uncontaminated sample of the final data set. Recent work by the current authors shows that if the base training sample is biased compared to the final sample, it may be augmented, e.g. by adding galaxies from simulations, to make the data sets appear more similar (Hoyle et al.). The data augmentation process has been shown to improve the redshift estimate of the final test sample. In this paper, we examine the problem of identifying poorly measured galaxy properties which contaminate the base training set. The contamination may be due to incorrectly measured spectroscopic redshifts, or unreliable photometric properties.

Photometric redshifts are also estimated by parametric techniques, for example from galaxy spectral energy distribution templates. Some templates encode our knowledge of stellar population models, which result in predictions for the evolution of galaxy magnitudes and colours. This parametric encoding of the complex stellar physics, coupled with the uncertainty in the parameters of the stellar population models, produces redshift estimates which are little better than many non-parametric techniques (see e.g. 2013 for an overview of different techniques). When a representative training sample is available, machine learning methods offer an alternative to template methods to estimate galaxy redshifts.
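
The contaminate-and-recover test described above can be illustrated with a toy example. The sketch below builds a two-feature sample (a colour and a redshift), adds a few objects with unreliable redshifts, and flags the points that lie furthest from a fitted ellipse. It is a simplified, pure-Python stand-in for an Elliptical Envelope and not the implementation used in this work: the toy data, the function names (`mean_cov`, `mahalanobis_sq`, `flag_anomalies`), the iterative refitting on inliers, and the `contamination` value are all assumptions made for illustration.

```python
import random


def mean_cov(points):
    """Sample mean and covariance terms (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the explicit 2x2 inverse."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def flag_anomalies(points, contamination, n_iter=5):
    """Flag the `contamination` fraction of points furthest from the ellipse.

    The ellipse is refitted a few times on the current inliers, so that
    strong outliers do not inflate the covariance estimate.
    """
    n_flag = int(round(contamination * len(points)))
    n_keep = len(points) - n_flag
    kept = list(points)
    order = list(range(len(points)))
    for _ in range(n_iter):
        mean, cov = mean_cov(kept)
        d2 = [mahalanobis_sq(p, mean, cov) for p in points]
        order = sorted(range(len(points)), key=d2.__getitem__)
        kept = [points[i] for i in order[:n_keep]]
    flagged = set(order[n_keep:])
    return [i in flagged for i in range(len(points))]


# Toy data (hypothetical): 1000 clean objects whose redshift follows a
# tight colour-redshift relation, plus 50 objects with unreliable redshifts.
rng = random.Random(0)
clean = []
for _ in range(1000):
    colour = rng.gauss(1.0, 0.3)
    clean.append((colour, 0.1 + 0.3 * colour + rng.gauss(0.0, 0.02)))
bad = [(rng.gauss(1.0, 0.3), rng.uniform(1.0, 3.0)) for _ in range(50)]

sample = clean + bad
is_bad = [False] * len(clean) + [True] * len(bad)

flags = flag_anomalies(sample, contamination=0.06)
recovered = sum(1 for f, b in zip(flags, is_bad) if f and b)
false_pos = sum(1 for f, b in zip(flags, is_bad) if f and not b)
print("flagged:", sum(flags))
print("contaminants recovered:", recovered, "of", len(bad))
print("clean objects removed:", false_pos)
```

In this sketch the assumed contamination fraction is set slightly above the true one, so a small number of clean objects are removed along with the contaminants. The objects that are not flagged would then form the ‘anomaly-removed’ training sample.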
