Deep learning may transform health care, but model development has largely been dependent on the availability of advanced technical expertise. Herein we present the development of a deep learning model by clinicians without coding, which predicts reported sex from retinal fundus photographs. A model was trained on 84,743 retinal fundus photos from the UK Biobank dataset. External validation was performed on 252 fundus photos from a tertiary ophthalmic referral center. For internal validation, the area under the receiver operating characteristic curve (AUROC) of the code-free deep learning (CFDL) model was 0.93. Sensitivity, specificity, positive predictive value (PPV) and accuracy (ACC) were 88.8%, 83.6%, 87.3% and 86.5%, and for external validation were 83.9%, 72.2%, 78.2% and 78.6% respectively. Clinicians are currently unaware of distinct retinal feature variations between males and females, highlighting the importance of model explainability for this task. The model performed significantly worse when foveal pathology was present in the external validation dataset (ACC 69.4%, compared with 85.4% in healthy eyes; OR 0.36, 95% CI 0.19 to 0.70, p = 0.0022), suggesting the fovea is a salient region for model performance. Automated machine learning (AutoML) may enable clinician-driven automated discovery of novel insights and disease biomarkers.
Deep learning offers considerable promise for medical diagnostics. We aimed to evaluate the diagnostic accuracy of deep learning algorithms versus health-care professionals in classifying diseases using medical imaging.
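The odds ratio and confidence interval reported above compare the odds of a correct prediction in eyes with and without foveal pathology. The abstract does not state how the interval was computed or give the underlying counts, so the following is only a minimal sketch of one common approach (a Wald interval on the log odds ratio). The function name and the counts are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: correct predictions, pathology present
    b: incorrect predictions, pathology present
    c: correct predictions, healthy eyes
    d: incorrect predictions, healthy eyes
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for a Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only (not the study's data).
estimate, lower, upper = odds_ratio_ci(a=50, b=25, c=100, d=20)
print(f"OR {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An odds ratio below 1 with an interval that excludes 1 would indicate lower odds of a correct prediction when pathology is present, which is the pattern the abstract describes.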
In this systematic review and meta-analysis, we searched Ovid-MEDLINE, Embase, Science Citation Index, and Conference Proceedings Citation Index for studies published from Jan 1, 2012, to June 6, 2019. Studies comparing the diagnostic performance of deep learning models and health-care professionals based on medical imaging, for any disease, were included. We excluded studies that used medical waveform data graphics material or investigated the accuracy of image segmentation rather than disease classification. We extracted binary diagnostic accuracy data and constructed contingency tables to derive the outcomes of interest: sensitivity and specificity. Studies undertaking an out-of-sample external validation were included in a meta-analysis, using a unified hierarchical model. This study is registered with PROSPERO, CRD42018091176.
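As an illustration of how sensitivity and specificity follow from a binary contingency table, here is a minimal sketch. The function name and the example counts are hypothetical; the hierarchical meta-analysis model used to pool studies is not reproduced here.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Derive sensitivity and specificity from a 2x2 contingency table.

    tp: diseased cases classified as diseased
    fp: non-diseased cases classified as diseased
    fn: diseased cases classified as non-diseased
    tn: non-diseased cases classified as non-diseased
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("table needs at least one diseased and one non-diseased case")
    sensitivity = tp / (tp + fn)  # proportion of diseased cases detected
    specificity = tn / (tn + fp)  # proportion of non-diseased cases cleared
    return sensitivity, specificity


# Hypothetical table, for illustration only.
sens, spec = sensitivity_specificity(tp=90, fp=20, fn=10, tn=80)
print(f"sensitivity {sens:.1%}, specificity {spec:.1%}")
```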
Our search identified 31 587 studies, of which 82 (describing 147 patient cohorts) were included. 69 studies provided enough data to construct contingency tables, enabling calculation of test accuracy, with sensitivity ranging from 9·7% to 100·0% (mean 79·1%, SD 0·2) and specificity ranging from 38·9% to 100·0% (mean 88·3%, SD 0·1). An out-of-sample external validation was done in 25 studies, of which 14 made the comparison between deep learning models and health-care professionals in the same sample. Comparison of the performance of deep learning models and health-care professionals in these 14 studies, when restricting the analysis to the contingency table for each study reporting the highest accuracy, found a pooled sensitivity of 87·0% (95% CI 83·0-90·2) for deep learning models and 86·4% (79·9-91·0) for health-care professionals, and a pooled specificity of 92·5% (95% CI 85·1-96·4) for deep learning models and 90·5% (80·6-95·7) for health-care professionals.
Our review found the diagnostic performance of deep learning models to be equivalent to that of health-care professionals. However, a major finding of the review is that few studies presented externally validated results or compared the performance of deep learning models and health-care professionals using the same sample. Additionally, poor reporting is prevalent in deep learning studies, which limits reliable interpretation of the reported diagnostic accuracy. New reporting standards that address specific challenges of deep learning could improve future studies, enabling greater confidence in the results of future evaluations of this promising technology.
None.
Abstract
A number of large technology companies have created code-free cloud-based platforms that allow researchers and clinicians without coding experience to create deep learning algorithms. In this study, we comprehensively analyse the performance and feature set of six platforms, using four representative cross-sectional and en-face medical imaging datasets to create image classification models. The mean (s.d.) F1 scores across platforms for all model–dataset pairs were as follows: Amazon, 93.9 (5.4); Apple, 72.0 (13.6); Clarifai, 74.2 (7.1); Google, 92.0 (5.4); MedicMind, 90.7 (9.6); Microsoft, 88.6 (5.3). The platforms demonstrated uniformly higher classification performance with the optical coherence tomography modality. Potential use cases given proper validation include research dataset curation, mobile ‘edge models’ for regions without internet access, and baseline models against which to compare and iterate bespoke deep learning approaches.
To apply a deep learning algorithm for automated, objective, and comprehensive quantification of OCT scans to a large real-world dataset of eyes with neovascular age-related macular degeneration (AMD) and make the raw segmentation output data openly available for further research.
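For readers unfamiliar with the summary statistic above, the following minimal sketch shows how an F1 score is derived from precision and recall, and how a mean (s.d.) can be summarised over several model and dataset pairs. The precision and recall values are hypothetical, and the use of the sample standard deviation is an assumption; the abstract does not say which form was used.

```python
from statistics import mean, stdev


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for one platform, illustration only.
hypothetical_pairs = [(0.90, 0.90), (0.80, 1.00), (0.95, 0.85), (0.70, 0.75)]
scores = [100 * f1_score(p, r) for p, r in hypothetical_pairs]

# Sample standard deviation (n - 1 denominator); an assumption, see lead-in.
print(f"F1 mean (s.d.): {mean(scores):.1f} ({stdev(scores):.1f})")
```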
Retrospective analysis of OCT images from the Moorfields Eye Hospital AMD Database.
A total of 2473 first-treated eyes and 493 second-treated eyes that commenced therapy for neovascular AMD between June 2012 and June 2017.
A deep learning algorithm was used to segment all baseline OCT scans. Volumes were calculated for segmented features such as neurosensory retina (NSR), drusen, intraretinal fluid (IRF), subretinal fluid (SRF), subretinal hyperreflective material (SHRM), retinal pigment epithelium (RPE), hyperreflective foci (HRF), fibrovascular pigment epithelium detachment (fvPED), and serous PED (sPED). Analyses included comparisons between first- and second-treated eyes by visual acuity (VA) and race/ethnicity and correlations between volumes.
Volumes of segmented features (mm³) and central subfield thickness (CST) (μm).
In first-treated eyes, the majority had both IRF and SRF (54.7%). First-treated eyes had greater volumes for all segmented tissues, with the exception of drusen, which was greater in second-treated eyes. In first-treated eyes, older age was associated with lower volumes for RPE, SRF, NSR, and sPED; in second-treated eyes, older age was associated with lower volumes of NSR, RPE, sPED, fvPED, and SRF. Eyes from Black individuals had higher SRF, RPE, and serous PED volumes compared with other ethnic groups. Greater volumes of the majority of features were associated with worse VA.
We report the results of large-scale automated quantification of a novel range of baseline features in neovascular AMD. Major differences between first- and second-treated eyes, with increasing age, and between ethnicities are highlighted. In the coming years, enhanced, automated OCT segmentation may assist personalization of real-world care and the detection of novel structure–function correlations. These data will be made publicly available for replication and future investigation by the AMD research community.
Objective: Implementing teleophthalmology into the optometric referral pathway may ease the current pressures on hospital eye services caused by over-referrals from some optometrists. This study aimed to understand the practical implications of implementing teleophthalmology by analysing lived experiences and perceptions of teleophthalmology in the optometric referral pathway for suspected retinal conditions.
Design: Qualitative in-depth interview study.
Setting: Fourteen primary care optometry practices and four secondary care hospital eye services from four NHS Foundation Trusts across the UK.
Participants: We interviewed 41 participants: patients (17), optometrists (18), and ophthalmologists (6) who were involved in the HERMES study. Through thematic analysis, we collated and present their experiences of implementing teleophthalmology.
Results: All participants interviewed were positive towards teleophthalmology as it could enable efficiencies in the referral pathway and improve feedback and communication between patients and healthcare professionals. Concerns included setup costs for optometrists and anxieties from patients about not seeing an ophthalmologist face to face. However, reducing unnecessary visits and increasing the availability of resources and capacity were seen as significant benefits.
Conclusions: Overall, we report positive experiences of implementing teleophthalmology into the optometric referral pathway for suspected retinal conditions. Successful implementation will require appropriate investment to set up and integrate new technology and remunerate services, and continued evaluation to ensure timely feedback to patients and between healthcare professionals is received.
Trial registration number: ISRCTN18106677.
Artificial intelligence (AI) has great potential in ophthalmology. We investigated how ambiguous outputs from an AI diagnostic support system (AI-DSS) affected diagnostic responses from optometrists when assessing cases of suspected retinal disease. Thirty optometrists (15 more experienced, 15 less) assessed 30 clinical cases. For ten, participants saw an optical coherence tomography (OCT) scan, basic clinical information and retinal photography ('no AI'). For another ten, they were also given AI-generated OCT-based probabilistic diagnoses ('AI diagnosis'); and for ten, both AI diagnosis and AI-generated OCT segmentations ('AI diagnosis + segmentation') were provided. Cases were matched across the three types of presentation and were selected to include 40% ambiguous and 20% incorrect AI outputs. Optometrist diagnostic agreement with the predefined reference standard was lowest for 'AI diagnosis + segmentation' (204/300, 68%) compared to 'AI diagnosis' (224/300, 75%, p = 0.010), and 'no AI' (242/300, 81%, p < 0.001). Agreement with AI diagnosis consistent with the reference standard decreased (174/210 vs 199/210, p = 0.003), but participants trusted the AI more (p = 0.029) with segmentations. Practitioner experience did not affect diagnostic responses (p = 0.24). More experienced participants were more confident (p = 0.012) and trusted the AI less (p = 0.038). Our findings also highlight issues around reference standard definition.
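The agreement percentages above follow directly from the reported counts. A minimal sketch that recomputes them, using only the figures given in the abstract (the variable names are illustrative, and the p-values are not recomputed because the statistical test is not described here):

```python
# (agreements with reference standard, total assessments) per presentation type,
# as reported in the abstract.
agreement = {
    "no AI": (242, 300),
    "AI diagnosis": (224, 300),
    "AI diagnosis + segmentation": (204, 300),
}

# Percentage agreement rounded to the nearest whole number.
percentages = {
    label: round(100 * agreed / total)
    for label, (agreed, total) in agreement.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```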
Introduction: Sickle cell disease (SCD) is one of the most common genetic disorders in the UK, with over 15 000 people affected. Proliferative sickle cell retinopathy (SCR) is a well-described complication of SCD and can result in significant sight loss, although the prevalence in the UK is not currently known. There are currently no national screening guidelines for SCR, with wide variations in the management of the condition across the UK.
Methods and analysis: The Sickle Eye Project is an epidemiological, cross-sectional, non-interventional study to determine the prevalence of visual impairment due to SCR and/or maculopathy in the UK. Haematologists in at least 16 geographically dispersed hospitals in the UK linked to participating eye clinics will offer study participation to consecutive patients meeting the inclusion criteria attending the sickle cell clinic. The following study procedures will be performed: (a) best corrected visual acuity with habitual correction and pinhole, (b) dilated slit lamp biomicroscopy and funduscopy, (c) optical coherence tomography (OCT), (d) OCT angiography where available, (e) ultrawide fundus photography, (f) National Eye Institute Visual Function Questionnaire-25 and (g) acceptability of retinal screening questionnaire. The primary outcome is the proportion of people with SCD with visual impairment defined as logarithm of the minimum angle of resolution ≥0.3 in at least one eye. Secondary outcomes include the prevalence of each stage of SCR and presence of maculopathy by age and genotype; correlation of stage of SCR and maculopathy to severity of SCD; the impact of SCR and presence of maculopathy on vision-related quality of life; and the acceptability to patients of routine retinal imaging for SCR and maculopathy.
Ethics and dissemination: Ethical approval was obtained from the South Central–Oxford A Research Ethics Committee (REC 23/SC/0363). Findings will be reported through academic journals in ophthalmology and haematology.
The fovea is a depression in the center of the macula and is the site of the highest visual acuity. Optical coherence tomography (OCT) has contributed considerably in elucidating the pathologic changes in the fovea and is now being considered as an accompanying imaging method in drug development, such as antivascular endothelial growth factor and its safety profiling. Because animal numbers are limited in preclinical studies and automatized image evaluation tools have not yet been routinely employed, essential reference data describing the morphologic variations in macular thickness in laboratory cynomolgus monkeys are sparse to nonexistent. A hybrid machine learning algorithm was applied for automated OCT image processing and measurements of central retina thickness and surface area values. Morphological variations and the effects of sex and geographical origin were determined. Based on our findings, the fovea parameters are specific to the geographic origin. Despite morphological similarities among cynomolgus monkeys, considerable variations in the foveolar contour, even within the same species but from different geographic origins, were found. The results of the reference database show that not only the entire retinal thickness, but also the macular subfields, should be considered when designing preclinical studies and in the interpretation of foveal data.
Deep learning has the potential to transform health care; however, substantial expertise is required to train such models. We sought to evaluate the utility of automated deep learning software to develop medical image diagnostic classifiers by health-care professionals with no coding, and no deep learning, expertise.
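The primary outcome definition can be expressed as a simple rule. The sketch below assumes each participant is represented by a pair of logMAR values, one per eye; the function names, the data layout and the example cohort are hypothetical, and which acuity measurement feeds the definition is not specified in this abstract.

```python
def has_visual_impairment(logmar_right, logmar_left, threshold=0.3):
    """True if logMAR is at or above the threshold in at least one eye."""
    return logmar_right >= threshold or logmar_left >= threshold


def impairment_proportion(participants, threshold=0.3):
    """Proportion of participants meeting the visual impairment definition."""
    if not participants:
        raise ValueError("no participants supplied")
    impaired = sum(
        has_visual_impairment(right, left, threshold)
        for right, left in participants
    )
    return impaired / len(participants)


# Hypothetical (right eye, left eye) logMAR values, for illustration only.
cohort = [(0.0, 0.1), (0.3, 0.0), (0.2, 0.5), (0.1, 0.1)]
print(f"proportion with visual impairment: {impairment_proportion(cohort):.0%}")
```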
We used five publicly available open-source datasets: retinal fundus images (MESSIDOR); optical coherence tomography (OCT) images (Guangzhou Medical University and Shiley Eye Institute, version 3); images of skin lesions (Human Against Machine HAM 10000), and both paediatric and adult chest x-ray (CXR) images (Guangzhou Medical University and Shiley Eye Institute, version 3 and the National Institute of Health NIH dataset, respectively) to separately feed into a neural architecture search framework, hosted through Google Cloud AutoML, that automatically developed a deep learning architecture to classify common diseases. Sensitivity (recall), specificity, and positive predictive value (precision) were used to evaluate the diagnostic properties of the models. The discriminative performance was assessed using the area under the precision recall curve (AUPRC). In the case of the deep learning model developed on a subset of the HAM10000 dataset, we did external validation using the Edinburgh Dermofit Library dataset.
Diagnostic properties and discriminative performance from internal validations were high in the binary classification tasks (sensitivity 73·3-97·0%; specificity 67-100%; AUPRC 0·87-1·00). In the multiple classification tasks, the diagnostic properties ranged from 38% to 100% for sensitivity and from 67% to 100% for specificity. The discriminative performance in terms of AUPRC ranged from 0·57 to 1·00 in the five automated deep learning models. In an external validation using the Edinburgh Dermofit Library dataset, the automated deep learning model showed an AUPRC of 0·47, with a sensitivity of 49% and a positive predictive value of 52%.
All models, except the automated deep learning model trained on the multilabel classification task of the NIH CXR14 dataset, showed comparable discriminative performance and diagnostic properties to state-of-the-art performing deep learning algorithms. The performance in the external validation study was low. The quality of the open-access datasets (including insufficient information about patient flow and demographics) and the absence of measurement for precision, such as confidence intervals, constituted the major limitations of this study. The availability of automated deep learning platforms provides an opportunity for the medical community to enhance their understanding in model development and evaluation. Although the derivation of classification models without requiring a deep understanding of the mathematical, statistical, and programming principles is attractive, comparable performance to expertly designed models is limited to more elementary classification tasks. Furthermore, care should be taken to adhere to ethical principles when using these automated models to avoid discrimination and causing harm. Future studies should compare several application programming interfaces on thoroughly curated datasets.
National Institute for Health Research and Moorfields Eye Charity.
To evaluate the impact of injection frequency on yearly visual outcomes of patients treated with intravitreal aflibercept for neovascular age-related macular degeneration (nAMD) over a period of 5 years in a tertiary ophthalmic centre.
Single centre, retrospective cohort study.
Consecutive treatment-naive nAMD patients initiated on aflibercept injections 5 years ago.
The Moorfields OpenEyes database was searched for consecutive patients who were initiated on intravitreal aflibercept for nAMD in 2013-14, and the visual acuity (VA) in Early Treatment Diabetic Retinopathy Study (ETDRS) letters and injection records per year were recorded for a period of 5 years. Analyses of the whole cohort and a sub-sample of 5-year completers were done. The cohort was further grouped into Group A (on continuous treatment), Group B (early cessation of treatment) and Group C (interrupted treatment) to evaluate the relation between treatment frequency and visual outcomes.
The primary end point was change in VA at 5 years; secondary outcomes included proportion of eyes that gained or maintained VA, number of injections received and the effect of treatment frequency.
Data were collected on 468 patients (512 eyes). Sixty-six percent of the patients completed 5-year follow-up. The mean age of the whole cohort was 79.5 ± 8.5 years and the mean baseline VA was 58.3 ± 15.4 letters. Amongst the completers, final VA change was -2.9 (SD 23.4) ETDRS letters and the cumulative number of injections over 5 years was 24.2 (10.6). Group A had a three-letter gain and received a significantly higher cumulative number of injections over 5 years than Groups B and C (31.8, 14.6 and 18.4 respectively, p = 0.001). After adjusting for age and baseline VA, on average, final VA was 8.0 letters higher in the ≥20 injections group than in the <20 group (p = 0.001).
Aflibercept therapy results in sustained good visual outcome over 5 years in neovascular AMD eyes when early and persistent treatment is given.