Abstract
Study question
Can calibrated AI ploidy screening test results provide reliable, biologically-justified estimates of embryo euploidy?
Summary answer
The AI ploidy screening test, rooted in the embryo’s morphokinetic profile and clinical metadata, provides reliable probabilistic estimates of euploidy.
What is known already
Published ploidy algorithms typically provide a binary classification of embryo ploidy (aneuploidy/euploidy). Algorithmic outputs are thresholded; a value above or below the threshold indicates a euploid or aneuploid label, respectively, and the confidence in the label prediction is tested. Experience shows, however, that decision-makers have difficulty interpreting how well these algorithm predictions match the true prevalence of euploidy in their clinics, especially when taking into consideration patient age and embryo quality. An AI embryo ploidy screening tool that uses biologically-relevant inputs to provide reliable euploidy probabilities is needed.
Study design, size, duration
The AI tool was trained on 5,000 time-lapse video sequences, along with associated clinical parameters: biological age at time of retrieval, Day-5 embryo quality, and morphokinetic parameters ranging from time of pronuclear fading to time to blastulation. Probability calibration was applied and its performance was evaluated using a blind test dataset (N = 708 embryos; euploid=352; aneuploid=356) with known prevalences of euploidy, aneuploidy, and live birth. Mean ± SD patient age: 35.9 ± 5.4 years.
Participants/materials, setting, methods
The AI ploidy screening tool used known embryo ploidy status as ground-truth labels; biological age and visual quality parameters were incorporated as continuous input features to maintain biological validity. Reliability curve analysis, which plots the observed frequency of euploidy in the clinical input data (y-axis) against the probability predicted by the screening test (x-axis), was used to assess model confidence. Odds ratios (OR) were used to confirm the significance of associations.
Main results and the role of chance
Logistic regression analysis shows that AI scores are robustly associated with euploidy probability (OR = 2.79; 95% CI = 2.04-3.81 at a threshold of 0.5 when comparing euploid likelihood for high-versus-low AI scores). For cohorts in the blind test set containing at least one aneuploid and at least one euploid embryo (N = 57 cohorts), the highest AI-ranked embryo was euploid in 64% of the cohorts. Embryos were divided into four predefined brackets according to their scores (1-32, 33-49, 50-66, 67-99) and the euploidy rate per bracket was determined: 28%, 44%, 58%, and 71%, respectively. There was a linear association between ascending AI scores and the percentage of euploid embryos, with the highest level of model confidence achieved at the tail ends of the score range; embryos with a score above 66 were 2.5 times more likely to be euploid than embryos with a score below 33.
Limitations, reasons for caution
This analysis used historical time-lapse sequences. Moving forward, we must evaluate prospective AI use for ploidy screening. Genetic status in utero or at birth was not evaluated.
Wider implications of the findings
A novel AI approach for preimplantation embryo screening using clinical metadata and time-lapse videos can improve our ability to non-invasively predict euploid likelihood prior to confirmatory diagnostic preimplantation genetic testing.
Trial registration number
Not applicable
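The reliability curve described in the methods can be illustrated with a minimal sketch. It assumes equal-width probability bins and uses toy data; the function name `reliability_curve`, the bin count, and the example values are illustrative assumptions, not details reported in the abstract.

```python
def reliability_curve(probs, labels, n_bins=4):
    """Return (mean predicted probability, observed euploid fraction, count)
    for every non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    curve = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        curve.append((mean_pred, observed, len(members)))
    return curve


# Toy data (hypothetical): predicted euploidy probabilities and true labels
# (1 = euploid, 0 = aneuploid).
toy_probs = [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]
toy_curve = reliability_curve(toy_probs, toy_labels)

# Ratio of the reported euploidy rates in the top and bottom score brackets.
bracket_ratio = 0.71 / 0.28

for mean_pred, observed, count in toy_curve:
    print(f"predicted={mean_pred:.2f} observed={observed:.2f} n={count}")
print(f"top/bottom bracket ratio: {bracket_ratio:.1f}")
```

A well-calibrated model yields points close to the diagonal (observed fraction roughly equal to mean predicted probability). The last line reproduces the arithmetic behind the 2.5-fold figure from the reported bracket rates of 71% and 28%.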
Abstract
Study question
Can an AI-based embryo ploidy screening tool effectively provide a genetic risk assessment on Day 5 embryos?
Summary answer
AI screening results showed strong predictive value for early detection of at-risk embryos warranting confirmatory diagnostic testing, based only on time-lapse video assessment.
What is known already
Euploid embryo transfer is the dogma of IVF success. Embryo evaluation relies extensively on diagnostic preimplantation genetic testing for aneuploidy (PGT-A) by means of trophectoderm biopsy. Though pivotal, this presents substantial practical and financial limitations. There is demand for an alternative genetic screening method that provides a quantitative embryo genetic risk assessment for patients who opt out of invasive diagnostic testing, or who intend to undergo fresh transfer. Such results can inform patients’ understanding of their aneuploidy risk, supporting more precise prognosis counseling and testing. This is important since diagnostic tests may incur excess financial burden, inconclusive outcomes, and delays in time-to-results.
Study design, size, duration
Time-lapse videos (N = 5,000) were used for AI training. The AI core architecture was a video transformer model for spatio-temporal video sampling. Known ploidy and live-birth results were used as ground truth, with maternal age and embryo quality scores incorporated as input parameters during calibrated logistic regression. A blind test dataset (N = 708 embryos; euploid=352; aneuploid=356; mean ± SD patient age: 35.9 ± 5.4 years) was used to evaluate predictive accuracy.
Participants/materials, setting, methods
The AI outputs a scalar (1-99) that associates with euploidy likelihood. The AI’s screening value was determined by its specificity (correct prediction of aneuploid), sensitivity (correct prediction of euploid), false positive rate, and false negative rate. We determined optimal AI score thresholds that identified categories of high aneuploidy risk and high euploidy likelihood (AI score <33 and AI score >66, respectively).
Main results and the role of chance
Specificity of the AI screening test was 83.0%, with a sensitivity of 56.0%. The false positive rate was 26.7% and the false negative rate was 29.3%. The AI’s high specificity relative to its sensitivity shows its clinical value for deselecting embryos with the strongest risk of genetic abnormality and for prioritizing embryos for confirmatory PGT. Boxplot visualization of the full AI score distribution showed successful discrimination between aneuploid and euploid embryos (mean AI score for aneuploid and euploid were 51.3 and 43.5, respectively) with minimal overlap of the interquartile ranges. The optimal AI score thresholds for detecting embryos with the highest and lowest likelihoods of euploidy were >66 and <33, respectively, representing a 71% likelihood of euploidy for the >66 group and a 72% likelihood of aneuploidy for the <33 group.
Limitations, reasons for caution
The screening does not provide diagnostic genetic testing at the chromosome level. The influence of mosaicism on the false positive and false negative rates was not assessed.
Wider implications of the findings
Results support the use of an AI ploidy screening test for effective decision-making and triage of at-risk embryos for transfer or confirmatory diagnostic testing via PGT-A.
Trial registration number
Not applicable
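For reference, the screening metrics named above can be written out from a 2x2 confusion matrix. This is a sketch under stated assumptions: euploid is treated as the positive class (matching the abstract's definition of sensitivity), the counts are hypothetical rather than reconstructed from the study, and the function name `screening_metrics` is illustrative. The abstract does not state how its false positive and false negative rates were defined, so the sketch lists both the per-class rates and the per-prediction rates.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts.

    tp: euploid embryos predicted euploid
    fn: euploid embryos predicted aneuploid
    tn: aneuploid embryos predicted aneuploid
    fp: aneuploid embryos predicted euploid
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Per-class error rates: complements of specificity and sensitivity.
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        # Per-prediction error rates: share of each predicted label that is wrong.
        "false_discovery_rate": fp / (fp + tp),
        "false_omission_rate": fn / (fn + tn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = screening_metrics(tp=60, fp=20, tn=80, fn=40)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions the false positive rate is always one minus the specificity, and the false negative rate is one minus the sensitivity, whereas the per-prediction rates also depend on the euploid/aneuploid mix of the test set.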
Abstract
Study question
What is the impact of AI-based embryo triage on the number of IVF cycles needed to reach fetal heartbeat (FH)?
Summary answer
Simulated cohort ranking via the AIVF Day-5 AI model demonstrates a 27.5% reduction in the number of IVF cycles needed to achieve FH.
What is known already
Published reports show FH rates per first-time IVF/single euploid embryo transfer ranging from 26% to 43% (age <38 yrs). Likewise, the average number of autologous retrieval cycles required for FH is 2.25 ± 0.8, depending on oocyte age and demographic. While conventional embryo morphology ranking is used as a proxy indicator for IVF success, its use is associated with observer variability and manual workflow. This leads to an inefficient cycle-per-FH ratio, limiting the clinic’s ability to serve more patients.
Study design, size, duration
A single-center database (N = 10,000 embryos) was used to generate 1,288 simulated cohorts reflecting known maternal age/cohort size distributions. Only cohorts containing ≥1 FH+ embryo were included. The model ranked embryos within each cohort by scalar output (1-9.9). The position of the first FH+ embryo within each AI-ranked cohort was calculated and averaged across cohorts to compute the number of cycles required to reach FH.
Participants/materials, setting, methods
Data were stratified into 5 age categories; top/good/fair graded embryos were included in each cohort according to their known prevalence (%) in each age group. Data were randomized into 1,288 cohorts, and the position of the first FH+ embryo in each cohort was calculated (for example, if the 2nd AI-ranked embryo was FH+, a value of 2 was assigned). Values were averaged and compared to conventional IVF.
Main results and the role of chance
A mean ± SD of 1.63 ± 1.06 cycles was needed to reach the first FH+ embryo with AI cohort ranking (pooled mean across all age groups <38 years), with a mean cohort size of N = 7.8 embryos. The relative reduction in cycles needed to reach FH when compared to conventional IVF without AI ranking (2.25 ± 0.8; age <38 yrs) was 27.5%. Ranking and electronic recording time per embryo was 186.0 seconds without AI versus 30.9 seconds with AI, a relative reduction in ranking time of 83%.
Limitations, reasons for caution
Results were pooled across age groups and not stratified per age. Prospective validation is ongoing to show the influence of demographics/clinical characteristics on AI utility.
Wider implications of the findings
This study demonstrates the potential of shortening the time to pregnancy by using AI-based quantitative embryo assessment for cohort ranking.
Trial registration number
Not applicable
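The rank-position calculation described in the methods can be sketched as follows. The cohorts and scores below are toy values, and the function names (`first_fh_positive_rank`, `mean_first_rank`, `relative_reduction`) are illustrative assumptions rather than the study's implementation; only the baseline, mean, and timing figures passed to `relative_reduction` are taken from the abstract.

```python
def first_fh_positive_rank(scores, fh_positive):
    """1-based position of the first FH+ embryo when a cohort is sorted
    by descending AI score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, i in enumerate(order, start=1):
        if fh_positive[i]:
            return rank
    raise ValueError("cohort contains no FH+ embryo")


def mean_first_rank(cohorts):
    """Average first-FH+ position across cohorts of (scores, fh_positive)."""
    ranks = [first_fh_positive_rank(s, y) for s, y in cohorts]
    return sum(ranks) / len(ranks)


def relative_reduction(baseline, observed):
    """Percent reduction of `observed` relative to `baseline`."""
    return (baseline - observed) / baseline * 100.0


# Two toy cohorts (hypothetical scores on the 1-9.9 scale).
toy_cohorts = [
    ([9.1, 4.2, 6.5], [True, False, False]),   # top-ranked embryo is FH+ -> 1
    ([3.0, 8.8, 7.1], [False, False, True]),   # second-ranked embryo is FH+ -> 2
]
toy_mean = mean_first_rank(toy_cohorts)

cycle_reduction = relative_reduction(2.25, 1.63)   # reported cycle means
time_reduction = relative_reduction(186.0, 30.9)   # reported seconds per embryo

print(f"toy mean rank: {toy_mean}")
print(f"cycle reduction: {cycle_reduction:.1f}%")
print(f"ranking time reduction: {time_reduction:.1f}%")
```

With the rounded means quoted in the abstract, the cycle reduction evaluates to roughly 27.6%, in line with the reported figure of about 27.5%, and the ranking time reduction to roughly 83%.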
Abstract
Study question
Can the EMA™ (AIVF, Israel) artificial intelligence (AI) platform provide personalized success estimates based on the patient’s metadata and embryonic development?
Summary answer
Individual patients can be given an accurate estimation of their chances for a clinical pregnancy using AI-based embryo evaluation and patient metadata.
What is known already
AI models for embryo evaluation are trained on diverse datasets using data from multiple clinics and from patients with varying ages and clinical history. Precision medicine can be attained by AI models that provide individual patients with personalized success rates per embryo transfer based on their characteristics.
Study design, size, duration
A large dataset (9,812 embryos) from 3 geographically diverse IVF clinics (Israel, Spain, USA), with known clinical pregnancy (fetal heartbeat) outcomes and patient characteristics obtained from Electronic Medical Records (EMR), was used to evaluate the importance of individual features contributing to pregnancy.
Participants/materials, setting, methods
Machine learning models were trained to predict clinical pregnancy on a large training and feature set that combines AI scores (EMA™, AIVF) with additional information obtained from EMR systems and intrinsic features that can be derived from the embryo cohort. The importance of each feature to the predictive ability of the models was evaluated.
Main results and the role of chance
AI embryo score, maternal age and cohort features were found to be the major contributing factors for the prediction models, thus improving the accuracy of the estimates for pregnancy probability. Significant contributing factors were: cohort size, number of viable embryos, patient age, number of past treatments, BMI and EMA score. The AUC of the AI embryo score model was 0.7; when patient metadata and cohort features were factored in, the AUC of the combined model was 0.75.
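As a reminder of what the AUC values measure, the sketch below computes the area under the ROC curve as the probability that a randomly chosen embryo with clinical pregnancy receives a higher score than a randomly chosen embryo without one (ties count as half). The function name `roc_auc` and the toy data are illustrative assumptions and do not reproduce the study's models.

```python
def roc_auc(scores, outcomes):
    """AUC via pairwise comparison of positive and negative scores
    (equivalent to the normalized Mann-Whitney U statistic)."""
    positives = [s for s, y in zip(scores, outcomes) if y]
    negatives = [s for s, y in zip(scores, outcomes) if not y]
    if not positives or not negatives:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores; 1 = clinical pregnancy, 0 = none).
toy_auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
perfect_auc = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(toy_auc, perfect_auc)
```

On this scale 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a move from 0.7 to 0.75 means the combined model orders a larger share of pregnancy/no-pregnancy pairs correctly.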
Limitations, reasons for caution
This study is limited by its retrospective design. A prospective study is needed to validate the results.
Wider implications of the findings
This study validates the use of AI-based scores for embryo evaluation, for relative grading of the embryos within the cohort, and also to estimate the true pregnancy odds of each embryo based on individual patient features. This can provide a superior decision support tool for doctors, embryologists, and patients.
Trial registration number
Not applicable
Abstract
Study question
Can we decipher the underlying visual properties that drive image-based AI embryo classification models to assist clinical decisions and biological discovery?
Summary answer
Our framework interpreted which annotated and non-explicitly-annotated phenotypes impact model predictions and ranked their importance. These discoveries were aligned with known blastocyst quality criteria.
What is known already
Deep learning models have shown great promise for complex pattern recognition when applied to embryo images. The success of these models relies on their ability to perform non-linear optimization of feature extraction during model construction. However, this entangles multiple classification-driving image properties, thereby producing ‘black-box’ systems that lack user confidence, trust and interpretability. Therefore, there is an urgent need for an interpretability method that can uncover the semantic image properties that contribute to ‘black-box’ embryo image-based AI classification model predictions to assist in blastocyst selection.
Study design, size, duration
11,211 time-lapse videos were retrospectively collected from three IVF centers. A deep convolutional neural network was first trained to discriminate high-versus-low quality blastocysts. We then developed DISCOVER, a general-purpose interpretability method designed to discover underlying visual properties driving the classifier. DISCOVER encodes an image to an interpretable lower-dimensional representation that is correlated to the classifier and encapsulates a distinct phenotype in each of its dimensions.
Participants/materials, setting, methods
The encoding of embryo images to low-dimensional representations enables interpretability globally and locally. Globally, the embryo images are synthetically altered by amplifying subtle properties that affect the classification decision. With our method this can be done one property at a time, therefore separating confounding properties. By evaluating the altered images, embryologists can decipher their meaning. Locally, each of the discovered properties can be ranked by its importance for a specific embryo instance.
Main results and the role of chance
Using DISCOVER, we interpreted the features driving the classification model. Based on embryologists’ evaluation and annotations, we quantitatively linked the top two classification features to blastocyst size (as a proxy for degree of expansion and development) and trophectoderm quality. We then asked whether DISCOVER can identify non-explicitly annotated latent features that encode morphologic properties not defined by ASEBIR/Gardner criteria. Expert embryologists interpreted the third top classification feature to be the blastocoel. DISCOVER interpreted high-quality embryos as having denser and more granular blastocoelic regions, suggesting that this change in the blastocoel appearance is one of the encoded classification-driving morphologic properties. This visualization indicates that there are additional parameters of the blastocoel beyond its volume expansion associated with its quality. We showed how embryo properties can be weighted differently by the classifier on a per-embryo basis, giving clinical insight into which properties influence the classification of a specific instance. These results indicate that DISCOVER enables expert-in-the-loop interpretation of the classification model both globally, discovering the overall main properties driving the classifier, and locally, showing a per-instance explanation.
Limitations, reasons for caution
DISCOVER failed to interpret the inner cell mass (ICM) as a classification-driving feature in its latent representation, though it was explicitly used to label the data for training the classification model. It is possible that other properties collectively contained the discriminative information encoded in the ICM.
Wider implications of the findings
This deep analysis demonstrates the feasibility of providing interpretability for biomedical image-based classification models for clinical use in the IVF clinic.
Trial registration number
Not applicable
Abstract
Study question
Is there a selection bias against embryos placed in higher-numbered wells inside a multi-well IVF culture dish, and does this selection bias alone impact implantation outcomes?
Summary answer
Top-quality embryos present in higher-numbered wells are statistically less likely to be selected for transfer, independent of any differences in quality or development between wells.
What is known already
Substantial intra- and inter-observer variability in embryo selection, as well as differences in quality assessment and laboratory environment, have been shown to affect IVF success. Many clinics have adopted stringent guidelines to control for human errors and workflow variation. Still, the impact of errors in laboratory and medical procedures has been reported to be as high as 12%. This is particularly relevant for the IVF lab, where high workload and stress influence the rate of errors and patient outcome. This study emphasizes how cognitive tendencies are inherent to the embryo selection process.
Study design, size, duration
This study used a retrospective dataset from three highly experienced fertility clinics (1 US and 2 European clinics). A total of 4,275 fresh IVF cycles were analyzed. For each treatment cycle, embryo quality grades, corresponding embryo well numbers, day 5 selection and implantation outcomes were documented. All cycles were performed using the EmbryoSlide 12-well culture dish and a time-lapse system. The three datasets were analyzed both separately and combined.
Participants/materials, setting, methods
For each dataset, three analyses were conducted: (1) the total number of selected embryos was calculated for each well number; (2) the proportion of implanted embryos, relative to the total number of selected embryos, was quantified to calculate the “success rate” for each well number; (3) the distribution of top-quality embryos between wells was quantified and compared. Results were normalized by the total number of transferred embryos and the IVF implantation success rates reported for each clinic.
Main results and the role of chance
A negative trend was found between well number (1-12) and the number of embryos selected for transfer. This trend was significant (p < 0.05) and occurred independently in each dataset. Odds ratios (OR) for selecting embryos for transfer from wells 1-5 versus wells 8-12 were 2.16, 1.78, and 2.45 for Clinics A, B, and C, respectively. Alternative hypotheses were tested: (1) top-quality embryos are clustered in lower-numbered wells during culture; (2) enhanced embryo quality and conditions are found in lower-numbered wells, which should manifest in higher rates of implantation. Results for each clinic showed a statistically even distribution of top-quality embryos between wells (within 2 standard deviations from the mean; not significant), yet the ‘success rate’ for transferred embryos increased with well number (by 12-30% between wells 1-5 and wells 8-12; OR = 1.19, 1.06, and 1.08 for Clinics A, B, and C, respectively). An inverse trend existed between an embryo’s likelihood of being selected for transfer and its likelihood of implanting. We conclude that embryologists may tend to select the first acceptable embryo for transfer. Embryos from higher-numbered wells were significantly more likely to implant, since they overcame this bias when equitably evaluated and selected for transfer.
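The odds ratios above compare two groups of wells on a binary outcome. As an illustration, the sketch below computes an odds ratio and an approximate 95% confidence interval (Woolf's log method) from a 2x2 table. The counts are hypothetical, since the abstract does not report the underlying tables, and the function name `odds_ratio_ci` is an assumption for this example.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate confidence interval from a 2x2 table.

    a: selected for transfer, wells 1-5
    b: not selected, wells 1-5
    c: selected for transfer, wells 8-12
    d: not selected, wells 8-12
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under Woolf's approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only.
or_value, ci_low, ci_high = odds_ratio_ci(a=40, b=60, c=20, d=80)
print(f"OR = {or_value:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An interval that excludes 1 indicates that the difference in odds between the two well groups is unlikely to be due to chance at the chosen confidence level.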
Limitations, reasons for caution
Though our findings were significant, they need to be repeated on larger datasets with more inter-centre variation and with key embryo culture and outcome variables recorded.
Wider implications of the findings
This study emphasizes the inherent human error that exists inside IVF clinics. Machine learning systems that reduce human bias and increase objective standardization, even if they are not inherently better than embryologists, would improve implantation rates. Future studies should be directed toward AI based technologies that can accomplish this.
Trial registration number
Not Applicable
Abstract
Study question
Can an AI deselection model identify distinct morphokinetic patterns in top-quality blastocysts with unknown ploidy that fail to implant?
Summary answer
An AI-based deselection model was able to predict implantation failure based on morphokinetic features previously found to associate with aneuploidy.
What is known already
Aneuploidy is the most common explanation for implantation failure of high-quality blastocysts. Yet, high-quality blastocysts with unknown ploidy that fail to implant are often morphologically indistinguishable from blastocysts that implant successfully. Our previously published results (ESHRE 2021) demonstrated that aneuploid blastocysts were more likely to reach developmental events (t2-t8) later, and that the timing between each event was statistically longer (p < 0.001), when compared to euploid embryos. Given that delayed morphokinetic rates are tightly linked to ploidy, we investigated whether similar known morphokinetic features were associated with implantation failure in top-graded embryos.
Study design, size, duration
Time-lapse sequences of 3,259 top-quality blastocysts from fresh single embryo transfer cycles with known implantation outcomes were analyzed using an AI-based algorithm. The algorithm utilized temporal features extracted by a convolutional neural network, based on multiple morphokinetic parameters known to associate with ploidy.
Participants/materials, setting, methods
Time-lapse sequences and morphokinetic events were algorithmically analyzed to measure the rate of mitotic division events and to compare the number of embryos in each category (implanted/non-implanted) that reached each developmental event at least one standard deviation (SD) later than the mean for implanted embryos.
Main results and the role of chance
Results showed statistical differences in the following morphokinetic features between the two categories: t2, t3, t4, and t3-t4 (p < 0.05). Implanted top-graded blastocysts reached t2, t3, and t4 at 25.23 ± 3.8, 36.06 ± 3.4, and 37.14 ± 3.6 hours (mean ± SD), respectively. The time gap between t3 and t4 was found to be 12.25 ± 5.31 hours (mean ± SD). Given this, we followed the methodology described above to propose cutoff values (in hours) that differentiated between non-implanted and implanted top-graded blastocysts based on their morphokinetic profiles. Implantation failure was found to be associated with reaching t2 after 28.61 hours (OR = 2.36, CI 0.96-5.77), t3 after 39.46 hours (OR = 3.48, CI 1.62-7.47), and t4 after 40.79 hours (OR = 2.23, CI 1.09-4.53). A time gap between t3 and t4 of more than 17.56 hours was also associated with implantation failure (OR = 2.48, CI 1.12-5.48), indicating perturbed mitotic activity. The cutoff values proposed here were incorporated into the algorithm for optimized deselection of morphologically similar top-quality blastocysts with delayed morphokinetic profiles.
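The cutoff rule described in the methods (an event reached at least one SD later than the implanted-embryo mean) can be sketched as below. The means and SDs are the rounded values quoted above; the dictionary and function names (`IMPLANTED_STATS`, `late_cutoff`, `is_delayed`) are illustrative assumptions, not the study's code.

```python
# Mean and SD (hours) for implanted top-graded blastocysts, as quoted above.
IMPLANTED_STATS = {
    "t2": (25.23, 3.8),
    "t3": (36.06, 3.4),
    "t4": (37.14, 3.6),
    "t3-t4": (12.25, 5.31),
}


def late_cutoff(mean, sd, k=1.0):
    """Cutoff (hours) set k standard deviations above the implanted mean."""
    return mean + k * sd


def is_delayed(feature, hours):
    """True if an embryo reached `feature` later than the mean + 1 SD cutoff."""
    mean, sd = IMPLANTED_STATS[feature]
    return hours > late_cutoff(mean, sd)


cutoffs = {name: round(late_cutoff(m, s), 2) for name, (m, s) in IMPLANTED_STATS.items()}
for name, value in cutoffs.items():
    print(f"{name}: delayed if later than {value} h")
```

With these rounded inputs the rule gives 39.46 h for t3 and 17.56 h for the t3-t4 gap, matching the reported cutoffs; t2 and t4 come out at 29.03 h and 40.74 h, slightly different from the reported 28.61 h and 40.79 h, which may reflect rounding of the published means and SDs or unreported details of the method.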
Limitations, reasons for caution
This study needs to be validated on a larger, multi-centric dataset that takes into account more morphokinetic features associated with ploidy in order to increase the robustness of our algorithm.
Wider implications of the findings
The algorithmic model proposed here demonstrates, for the first time, the utility of an AI tool to deselect top-graded blastocysts that would otherwise be selected for transfer based on conventional morphologic assessment alone.
Trial registration number
Not Applicable