Abstract
Deep learning has driven dramatic progress in molecular property prediction, which is crucial in drug discovery, using various representations such as fingerprints, SMILES, and graphs. In particular, SMILES is used in many deep learning models via character-based approaches. However, SMILES has a limitation in that it hardly reflects chemical properties. In this paper, we propose a new self-supervised method that learns SMILES and the chemical context of molecules simultaneously while pre-training a Transformer. The key ideas of our model are learning molecular structure through adjacency matrix embedding and learning chemical logic that can infer descriptors through Quantitative Estimation of Drug-likeness (QED) prediction during pre-training. As a result, our method improves generalization and achieves the best average performance on benchmark downstream tasks. Moreover, we developed a web-based fine-tuning service so that our model can be applied to various tasks.
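The adjacency matrix embedding mentioned above can be illustrated with a minimal sketch. All names, shapes, and the toy embedding table below are illustrative assumptions, not the paper's actual code: bond connectivity is encoded as a 0/1 matrix, and each atom-pair entry selects an embedding that could bias Transformer attention.

```python
# Hypothetical sketch: build an adjacency matrix for a tiny molecule
# (ethanol's heavy atoms, C-C-O) and map each atom-pair entry to an
# embedding vector, as one might do to inject structural information
# into Transformer attention during pre-training.

def adjacency_matrix(n_atoms, bonds):
    """Return an n_atoms x n_atoms 0/1 adjacency matrix from a bond list."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1  # molecular graphs are undirected
    return adj

# Ethanol heavy atoms: 0=C, 1=C, 2=O, with bonds C-C and C-O.
adj = adjacency_matrix(3, [(0, 1), (1, 2)])

# Each entry (bonded / not bonded) selects one of two "learned" vectors;
# here the embeddings are fixed toy values of dimension 2.
pair_embedding = {0: [0.0, 0.0], 1: [1.0, -1.0]}
bias = [[pair_embedding[a] for a in row] for row in adj]

print(adj)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

In a real model the two vectors would be trainable parameters and the bias would be added to the attention scores for each token pair.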
Summary
With the recent emergence of new paradigms, i.e., open science and big data, the need for data sharing and collaboration is becoming important in computational science as well. The EDISON-DATA platform aims to provide services through which computational simulation data can easily be published, preserved, shared, reused, discovered, and analyzed. This paper first analyzes platform-related issues, encountered during the development of the EDISON-DATA platform, regarding the sharing and reuse of computational science data. These issues include data complexity, diversity, reliability, and heterogeneity. To resolve these issues and support data analysis in an efficient and integrated manner, this study proposes several ideas used in the EDISON-DATA platform. First, we suggest an automated preprocessing framework to handle the complexity of computational science data. Second, to address the diversity issue, we present ways to develop preprocessing logic and data-presentation logic customized for each data type. Third, to improve the reliability of computational science data, quality control and provenance management techniques are presented. Fourth, we propose a way to manage related data in groups. Fifth, to resolve the data heterogeneity problem and to analyze data in an integrated way, we have the preprocessing framework use controlled vocabularies to express descriptive metadata. Lastly, we demonstrate the feasibility and usability of the proposed ideas by presenting a case study of building a research portal service in the materials field on top of the EDISON-DATA platform.
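The controlled-vocabulary idea (the fifth point above) amounts to a normalization step over descriptive metadata. A minimal sketch follows; the vocabulary entries and field names are illustrative assumptions, not the platform's actual schema.

```python
# Hypothetical sketch: normalize heterogeneous descriptive-metadata keys
# onto a controlled vocabulary so that records produced by different
# simulation codes can be searched and analyzed together.

CONTROLLED_VOCAB = {
    "temp": "temperature",
    "Temperature(K)": "temperature",
    "press": "pressure",
    "P": "pressure",
}

def normalize_metadata(record):
    """Map raw metadata keys onto controlled-vocabulary terms."""
    normalized = {}
    for key, value in record.items():
        term = CONTROLLED_VOCAB.get(key, key)  # pass unknown keys through
        normalized[term] = value
    return normalized

raw = {"temp": 300, "P": 1.0}
print(normalize_metadata(raw))  # {'temperature': 300, 'pressure': 1.0}
```

With keys normalized this way, an integrated query such as "all records with temperature above 250" can run across data from every source.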
The importance of kernel-level security mechanisms such as file systems and access control has been increasingly emphasized as weaknesses in user-level applications have been exposed. However, when using only access control, including role-based access control (RBAC), a system is vulnerable to low-level or physical attacks. Likewise, when using only a cryptographic file system, a system cannot protect itself. To overcome these vulnerabilities, we integrated a cryptographic file system with access control and developed a prototype.
The Korean National Long-Term Ecological Research (KNLTER) project was launched in 2004 and has predicted long-term ecological changes and developed ecosystem conservation policy for the Korean Peninsula, making a contribution to the LTER. However, the data collected through the KNLTER project are yet to be shared because of problems such as data fragmentation. To solve these problems, the Korean government has promoted the development of 'K-ecohub', a pilot repository for national long-term ecological data. K-ecohub provides i) a data processing & service model based on predefined, standardized protocols and ii) efficient metadata search & international linkage functions. K-ecohub also suggests iii) a quality assurance plan for the collected data and iv) an integrated analysis & visualization model. This study covers the data model, requirements, features, conceptual design, and data processing methods of K-ecohub.
In this paper, we address performance and scalability issues that arise when AMGA (ARDA Metadata Grid Application) is used as a metadata service for task retrieval in the WISDOM (Wide In Silico Docking On Malaria) environment, and propose optimization techniques to deal with them. First, to deal with the performance problem caused by the communication overhead of jobs calling a series of AMGA operations to retrieve a task from the AMGA server in the WISDOM environment, we propose a new AMGA operation that allows jobs deployed on the Grid to retrieve a task in a single operation instead of calling the series of existing operations. According to our performance study, the throughput of task retrieval using the new AMGA operation can be as much as 70 times higher than with the existing AMGA operations. Second, to address the scalability problem that arises when thousands of running jobs access a single AMGA server concurrently in an attempt to grab available tasks, we propose the use of multiple AMGA servers for task retrieval. Our test results demonstrate that throughput improves linearly in proportion to the number of AMGA servers set up for load balancing.
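The benefit of collapsing a multi-call retrieval protocol into one atomic server-side operation can be sketched as follows. The class and method names are illustrative, not AMGA's actual API; the point is that "find a free task" and "mark it taken" happen in a single call under one lock, instead of two client round trips.

```python
import threading

# Hypothetical sketch of server-side task retrieval. In the multi-call
# pattern, a client issues separate "find" and "mark taken" operations
# (two round trips plus coordination overhead); the single-call pattern
# does both atomically in one server-side operation, in the spirit of
# the proposed AMGA operation.

class TaskServer:
    def __init__(self, tasks):
        self._tasks = {t: "free" for t in tasks}
        self._lock = threading.Lock()

    def retrieve_task(self):
        """Find a free task and mark it taken in one atomic operation."""
        with self._lock:
            for task, state in self._tasks.items():
                if state == "free":
                    self._tasks[task] = "taken"
                    return task
        return None  # no free tasks remain

# Load balancing across several servers (the second optimization above)
# can be as simple as round-robin assignment of clients to servers.
servers = [TaskServer(["dock-1", "dock-2"]), TaskServer(["dock-3"])]
grabbed = [servers[i % len(servers)].retrieve_task() for i in range(3)]
print(grabbed)  # ['dock-1', 'dock-3', 'dock-2']
```

Because each server holds a disjoint partition of tasks, adding servers divides the lock contention and request load, which is consistent with the near-linear throughput scaling reported above.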