•ISO-FOOD ontology – a new way of representing isotopic data for food research.•Defines metadata needed for isotopic characterization.•Describes a powerful technique for organizing and sharing stable isotope data across Food Science.
To link and harmonize different knowledge repositories with respect to isotopic data, we propose the ISO-FOOD ontology as a domain ontology for describing isotopic data within Food Science. The ISO-FOOD ontology consists of metadata and provenance data that need to be stored together with the data elements in order to describe isotopic measurements with all the information required for future analysis. The new domain ontology has been linked with existing ontologies, such as the Units of Measurements Ontology, Food, Nutrient and the Bibliographic Ontology. To show how such an ontology can be used in practice, it was populated with 20 isotopic measurements of Slovenian food samples. Describing data in this way offers a powerful technique for organizing and sharing stable isotope data across Food Science.
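To illustrate the kind of record the abstract describes, the sketch below expresses one isotopic measurement, with its unit and bibliographic provenance links, as subject-predicate-object triples. All identifiers and property names are invented placeholders, not terms from the actual ISO-FOOD ontology.

```python
# Hypothetical triples for one isotopic measurement of a food sample.
# Prefixes and property names are illustrative only.
triples = [
    ("sample:SI-001", "rdf:type", "isofood:FoodSample"),
    ("sample:SI-001", "isofood:hasMeasurement", "meas:d13C-001"),
    ("meas:d13C-001", "isofood:isotopeRatio", "13C/12C"),
    ("meas:d13C-001", "isofood:value", -27.4),
    ("meas:d13C-001", "uo:unit", "uo:permil"),              # link to a units ontology
    ("meas:d13C-001", "dcterms:references", "bibo:Article-42"),  # bibliographic provenance
]

def describe(subject, graph):
    """Collect all predicate/object pairs recorded for a subject."""
    return {p: o for s, p, o in graph if s == subject}

record = describe("meas:d13C-001", triples)
```

Storing the unit and the bibliographic reference alongside the value is what makes the measurement reusable in later analyses.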
Assessing nutritional content is highly relevant for patients suffering from various diseases and for professional athletes, and for health reasons it is becoming part of everyday life for many. However, it is a very challenging task, as it requires complete and reliable sources. We introduce a machine learning pipeline for predicting the macronutrient values of foods using vector representations learned from short text descriptions of food products. On a dataset obtained from health specialists, containing short descriptions of foods and their macronutrient values, we generate paragraph embeddings, cluster foods into groups using graph-based vector representations that incorporate food domain knowledge, and train regression models for each cluster. Predictions are made for four macronutrients: carbohydrates, fat, protein and water. The highest accuracy was obtained for carbohydrate predictions – 86%, compared to baselines of 27% and 36%. The protein predictions yielded the best results across all clusters, with 53%–77% of the values falling within the tolerance-level range. These results were obtained using short descriptions; the embeddings could be improved if learned on longer descriptions, which would lead to better predictions. Since calculating macronutrients exactly requires the precise quantities of ingredients, these results, obtained from short descriptions alone, are a considerable step forward.
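The embed-cluster-regress pipeline above can be sketched in miniature. The sketch stands in bag-of-words vectors for the paragraph embeddings and a per-cluster mean for the regression models; the vocabulary, descriptions and carbohydrate values are invented toy data, not the paper's.

```python
import math
from collections import defaultdict

def embed(text, vocab):
    """Toy stand-in for a paragraph embedding: bag-of-words counts."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def nearest(vec, centroids):
    """Assign a vector to the closest cluster centroid."""
    return min(centroids, key=lambda c: math.dist(vec, centroids[c]))

vocab = ["bread", "wheat", "cheese", "milk"]
train = [                                  # (description, carbohydrate g/100 g)
    ("wheat bread", 49.0), ("white bread wheat", 51.0),
    ("soft cheese milk", 3.0), ("hard cheese", 2.0),
]
centroids = {"cereal": embed("bread wheat", vocab),
             "dairy": embed("cheese milk", vocab)}

# per-cluster "regression model": mean of the targets assigned to the cluster
grouped = defaultdict(list)
for text, carbs in train:
    grouped[nearest(embed(text, vocab), centroids)].append(carbs)
models = {c: sum(v) / len(v) for c, v in grouped.items()}

def predict(text):
    """Embed the description, pick its cluster, apply that cluster's model."""
    return models[nearest(embed(text, vocab), centroids)]
```

For example, `predict("rye bread")` lands in the cereal cluster and returns the cereal-cluster estimate of 50.0 g/100 g.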
Being both a poison and a cure for many lifestyle and non-communicable diseases, food is moving into the prime focus of precision medicine. Monitoring a few groups of nutrients is crucial for some patients, and methods for easing their calculation are emerging. Our proposed machine learning pipeline deals with nutrient prediction based on vector representations learned from short texts – recipe names. In this study, we explored how the prediction results change when, instead of using the vector representation of the recipe description, we use the embeddings of the list of ingredients. The nutrient content of a food depends on its ingredients; therefore, the text of the ingredient list contains more relevant information. We define a domain-specific heuristic for merging the embeddings of the ingredients, which incorporates the quantity of each ingredient, in order to use the merged embeddings as features in machine learning models for nutrient prediction. The results of the experiments indicate that prediction improves when using the domain-specific heuristic. The models for protein prediction were highly effective, with accuracies up to 97.98%. Implementing a domain-specific heuristic for combining multi-word embeddings yields better results than conventional merging heuristics, with up to 60% higher accuracy in some cases.
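One plausible reading of the quantity-aware merging heuristic is a weighted average of ingredient embeddings, with each ingredient weighted by its share of the recipe's total quantity. The 2-d embeddings and gram amounts below are invented for illustration; the paper's actual heuristic may differ.

```python
def merge(ingredients):
    """Merge per-ingredient embeddings into one recipe vector.

    ingredients: list of (embedding, quantity_in_grams) pairs.
    Each ingredient contributes proportionally to its quantity.
    """
    total = sum(q for _, q in ingredients)
    dim = len(ingredients[0][0])
    out = [0.0] * dim
    for vec, q in ingredients:
        w = q / total
        for i, x in enumerate(vec):
            out[i] += w * x
    return out

flour = ([1.0, 0.0], 200)   # hypothetical 2-d embedding, 200 g
butter = ([0.0, 1.0], 50)   # hypothetical 2-d embedding, 50 g
recipe_vec = merge([flour, butter])
```

With 200 g of flour and 50 g of butter, the recipe vector is pulled four times as strongly toward the flour embedding, which is exactly the information a plain unweighted average would discard.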
This paper addresses the problem of missing data in food composition databases (FCDBs). Data can be missing either for selected foods or for specific components only. Most often, the problem is solved by human experts subjectively borrowing data from other FCDBs for estimation or imputation. Such an approach is not only time-consuming but may also lead to wrong decisions, as the value of a component in a given food may vary from database to database due to differences in analytical methods. To ease missing-data borrowing and increase the quality of the selected data, we propose a new computer-based methodology, named MIGHT (Missing Nutrient Value Imputation UsinG Null Hypothesis Testing), that enables optimal selection of missing data from different FCDBs. The evaluation on a subset of European FCDBs, available through EuroFIR and compliant with the food data structure and format standard BS EN 16104, published in 2012, shows that in more than 80% of the selected cases MIGHT gives more accurate results than the techniques currently applied for missing-value imputation in FCDBs. MIGHT deals with missing data in FCDBs by introducing rules for imputation based on the idea that proper statistical analysis can decrease the error of data borrowing.
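The core idea, borrowing from the candidate database whose value is least surprising given the target database's own data, can be sketched as follows. A z-score against similar foods stands in for MIGHT's formal null-hypothesis test, and all values are invented.

```python
import statistics

def pick_source(similar_values, candidates):
    """Choose which FCDB to borrow a missing value from.

    similar_values: component values for similar foods in the target FCDB.
    candidates: {database_name: borrowed_value} from other FCDBs.
    Picks the candidate deviating least from the similar foods, a crude
    proxy for 'highest p-value under the null hypothesis'.
    """
    mu = statistics.mean(similar_values)
    sd = statistics.stdev(similar_values)
    z = {db: abs(v - mu) / sd for db, v in candidates.items()}
    return min(z, key=z.get)

# protein (g/100 g) of similar cheeses in the target FCDB
similar = [24.0, 25.5, 23.8, 26.1]
source = pick_source(similar, {"FCDB-A": 31.0, "FCDB-B": 24.9})
```

Here the value from FCDB-B (24.9 g) is statistically consistent with the similar cheeses, while FCDB-A's 31.0 g is an outlier, so the heuristic borrows from FCDB-B.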
Despite the numerous studies in the last decade involving food and nutrition data, this domain remains low-resourced. Annotated corpora are very useful tools for researchers and experts of the domain in question, as well as for data scientists performing analysis. In this paper, we present the process of annotating food consumption data (recipes) with semantic tags from different semantic resources – the Hansard taxonomy, the FoodOn ontology, the SNOMED CT terminology and the FoodEx2 classification system. FoodBase is an annotated corpus of food entities – recipes – which includes a curated version of 1000 instances, considered a gold standard. In this study, we use the curated version of FoodBase and two different annotation approaches – the NCBO annotator (for the FoodOn and SNOMED CT annotations) and the semi-automatic StandFood method (for the FoodEx2 annotations). The end result is a new version of the gold-standard FoodBase corpus, called the CafeteriaFCD (Cafeteria Food Consumption Data) corpus. This corpus contains food consumption data – recipes – annotated with semantic tags from the aforementioned four external semantic resources. With these annotations, data interoperability is achieved between five semantic resources from different domains. This resource can be further utilized for developing and training information extraction pipelines using state-of-the-art NLP approaches for tracing knowledge about food safety applications.
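Conceptually, each recipe's food entities end up carrying one tag per semantic resource. The dictionary-lookup sketch below illustrates that outcome only; the real pipeline uses the NCBO annotator and StandFood, and every code shown is a fake placeholder, not a real FoodOn/SNOMED CT/FoodEx2 identifier.

```python
# Invented lexicon: each food entity maps to one tag per semantic resource.
lexicon = {
    "apple": {"hansard": "Food:Fruit",  "foodon": "FOODON:FAKE-1",
              "snomedct": "SCT:FAKE-1", "foodex2": "FX2:FAKE-1"},
    "milk":  {"hansard": "Food:Dairy", "foodon": "FOODON:FAKE-2",
              "snomedct": "SCT:FAKE-2", "foodex2": "FX2:FAKE-2"},
}

def annotate(recipe_text):
    """Return {entity: {resource: tag}} for every lexicon hit in the text."""
    return {w: lexicon[w] for w in recipe_text.lower().split() if w in lexicon}

tags = annotate("Warm milk with apple slices")
```

Because every matched entity carries tags from all four external resources at once, the annotations themselves act as the cross-resource links that make the corpora interoperable.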
Missing data are a common problem in most research fields and introduce an element of ambiguity into data analysis. They can arise for different reasons: mishandling of samples, measurement error, deletion of aberrant values or simply a lack of analysis. The nutrition domain is no exception. This paper addresses the problem of missing data in food composition databases (FCDBs). Missing data result in incomplete FCDBs, which have limited usage, because any dietary assessment can be performed only on a complete dataset. Most often, the problem is resolved by calculating means/medians from existing data in the same database or by borrowing data from other FCDBs. These solutions introduce significant error. We focus on imputation techniques that substitute missing values with statistical predictions – Non-Negative Matrix Factorization (NMF), Multiple Imputation by Chained Equations (MICE), Nonparametric Missing Value Imputation using Random Forest (MissForest) and K-Nearest Neighbors (KNN) – and compare them with the commonly used approaches: fill-in with mean and fill-in with median. The data used were from national FCDBs collected by EuroFIR (European Food Information Resource Network). The results show that the state-of-the-art imputation methods yield better results than the traditional approaches.
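The contrast between a column mean and a KNN-style imputation can be seen on a tiny food-composition table (rows are foods, columns are components, `None` marks the missing value). All numbers are invented; the paper's experiments use the full EuroFIR databases and richer models.

```python
import math
import statistics

table = [
    [3.5, 4.7, 87.0],    # similar food A
    [3.3, 4.9, 88.0],    # similar food B
    [25.0, 1.2, 40.0],   # dissimilar food (outlier for this column)
    [3.4, None, 87.5],   # food with the missing component value
]

def mean_impute(rows, col):
    """Traditional approach: fill in with the column mean."""
    vals = [r[col] for r in rows if r[col] is not None]
    return statistics.mean(vals)

def knn_impute(rows, target, col, k=2):
    """KNN approach: average the column over the k most similar foods."""
    dims = [i for i, v in enumerate(target) if v is not None]
    dist = lambda r: math.dist([r[i] for i in dims], [target[i] for i in dims])
    donors = sorted((r for r in rows if r[col] is not None), key=dist)[:k]
    return statistics.mean(r[col] for r in donors)

by_mean = mean_impute(table, 1)          # dragged down by the outlier food
by_knn = knn_impute(table, table[3], 1)  # uses only the two similar foods
```

The column mean (3.6) is distorted by the dissimilar food, while the KNN estimate (4.8) comes from the two foods whose other components match the incomplete row, which is why distance-based methods tend to win in the comparison.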
•Missing food composition data.•Statistical methods for imputation.•Better-quality food composition databases.
In this study, we estimate how well previously proposed predictive models for nutrient value prediction generalize across different recipe datasets. For this purpose, we introduce a quantitative indicator of how well a developed predictive model generalizes to new, unseen data not presented during training. On a predefined corpus of recipe embeddings from six publicly available recipe datasets (i.e., projecting them into the same meta-feature vector space), we train predictive models on one of the six recipe datasets and test them on the rest. In parallel, we define and calculate generalizability indexes: numbers that indicate how well a predictive model learned on one dataset will perform on another not involved in training. The evaluation results confirm the validity of these indexes, i.e., their relation to the accuracy of the predictions. Further, we define three sampling techniques for selecting representative data instances that cover all parts of the feature space uniformly (involving data from all datasets) and thereby improve the generalization of a predictive model. We train predictive models on these generalized datasets and test them on instances from the six recipe datasets that were not selected for the generalized datasets. The evaluation of these predictive models shows improvement over the models trained on one recipe dataset and tested on the others separately.
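One simple way to turn cross-dataset evaluation into an index is the ratio of cross-dataset accuracy to within-dataset hold-out accuracy: values near 1 suggest the model transfers well, values near 0 suggest it does not. This definition and the numbers below are illustrative assumptions, not necessarily the indexes defined in the study.

```python
def generalizability_index(acc_within, acc_cross):
    """Ratio of accuracy on an unseen dataset to accuracy on a hold-out
    split of the training dataset. Illustrative definition only."""
    return acc_cross / acc_within

# hypothetical results for a model trained on recipe dataset A
acc_on_A_holdout = 0.82   # within-dataset hold-out accuracy
acc_on_B = 0.61           # accuracy on unseen dataset B
gi_A_to_B = generalizability_index(acc_on_A_holdout, acc_on_B)
```

Computing such an index for every ordered pair of the six datasets yields a matrix that exposes which datasets' feature distributions overlap, the same signal the sampling techniques exploit when building the generalized training sets.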
•Generalization in ML is highly related to the distribution of the training data.•Transferring a predictive model learned on one dataset to another dataset.•Similarity between distributions of training and unseen data in the feature space.•Representative datasets are needed in the training process of predictive modeling.•Uniform coverage of the feature space improves predictive modeling generalization.
In this paper, we propose a new pipeline for landscape analysis of time-series machine learning datasets that enables us to better understand a benchmarking problem landscape, allows us to select a diverse portfolio of benchmark datasets, and reduces performance-assessment bias via bootstrap evaluation. Combining a large multi-domain corpus of time-series-specific feature representations with the results of a large empirical study of a time-series classification (TSC) benchmark, we showcase the capability of the pipeline to point out issues with non-redundancy and representativeness in the benchmark. Observing a discrepancy between the empirical results of the bootstrap evaluation and practices recently adopted in the TSC literature when introducing novel methods, we warn of the potentially harmful effects of tuning methods on certain parts of the landscape (unless this is an explicit and desired goal of the study). Finally, we propose a set of datasets uniformly distributed across the landscape space that one should consider when benchmarking novel TSC methods.
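The bootstrap-evaluation idea, resampling the benchmark datasets with replacement instead of trusting a single fixed collection, can be sketched as below. The per-dataset accuracies are invented, and this is only the generic bootstrap scheme, not the paper's exact evaluation protocol.

```python
import random
import statistics

random.seed(0)  # reproducible resampling

# hypothetical accuracy of two TSC methods on six benchmark datasets
acc_a = [0.81, 0.77, 0.90, 0.66, 0.72, 0.85]
acc_b = [0.79, 0.74, 0.88, 0.70, 0.71, 0.83]

n_boot = 2000
wins = 0
for _ in range(n_boot):
    # resample dataset indices with replacement
    idx = [random.randrange(len(acc_a)) for _ in acc_a]
    if statistics.mean(acc_a[i] for i in idx) > statistics.mean(acc_b[i] for i in idx):
        wins += 1

support_for_a = wins / n_boot  # share of resamples where method A leads
```

If `support_for_a` hovers near 0.5, the apparent advantage of method A depends on which datasets happen to be in the benchmark, exactly the assessment bias the pipeline aims to expose.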
•Complementary landscape analysis of time-series datasets.•Selecting an unbiased benchmark dataset portfolio for comparison studies.•Bootstrap evaluation for reproducible statistical outcomes.
In the last decades, a great amount of work has been done on predictive modeling of issues related to human and environmental health. Resolving healthcare-related issues is made possible by the existence of several biomedical vocabularies and standards, which play a crucial role in understanding health information, together with a large amount of health-related data. However, despite the large number of available resources and the work done in the health and environmental domains, there is a lack of semantic resources that can be utilized in the food and nutrition domain, as well as of their interconnections. For this purpose, in the European Food Safety Authority-funded project CAFETERIA, we have developed the first annotated corpus of 500 scientific abstracts, consisting of 6407 food entities annotated with regard to the Hansard taxonomy, 4299 for FoodOn and 3623 for SNOMED CT. The CafeteriaSA corpus will enable the further development of natural language processing methods for extracting food information from scientific textual data. Database URL: https://zenodo.org/record/6683798#.Y49wIezMJJF