Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges for its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework based on pre-trained Large Language Models, fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
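Two of the pipeline's ideas, keyword-based publication filtering and a confidence factor for extracted relations, can be sketched in miniature. This is a hedged toy illustration only: the keyword list, the `is_relevant` helper, and the length-based down-weighting rule in `confidence` are assumptions for illustration, not GIX's actual filtering or scoring.

```python
# Hedged toy sketch (not GIX itself): minimal-keyword relevance filtering
# and a simple confidence factor attached to an extracted relation.
# Keyword list and down-weighting rule are illustrative assumptions.

def is_relevant(abstract, keywords=("gene", "interaction", "regulat")):
    """Flag a publication as relevant if any minimal keyword appears."""
    text = abstract.lower()
    return any(k in text for k in keywords)

def confidence(model_prob, sentence_len, max_len=60):
    """Down-weight the classifier probability for very long sentences,
    which are harder to infer relations from correctly."""
    penalty = min(sentence_len / max_len, 1.0)
    return round(model_prob * (1.0 - 0.3 * penalty), 3)
```

A usage pattern would be to run `is_relevant` over candidate abstracts first, and report each extracted relation together with its `confidence` score so curators can triage low-confidence outputs.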
Relation extraction from biological publications plays a pivotal role in accelerating scientific discovery and advancing medical research. While vast amounts of this knowledge are stored within the published literature, extracting it manually from this continually growing volume of documents is becoming increasingly arduous. Recently, attention has focused on automatically extracting such knowledge using pre-trained Large Language Models (LLMs) and deep-learning algorithms for automated relation extraction. However, the complex syntactic structure of biological sentences, with nested entities and domain-specific terminology, together with insufficient annotated training corpora, poses major challenges to accurately capturing entity relationships from unstructured data. To address these issues, in this paper we propose a Knowledge-based Intelligent Text Simplification (KITS) approach focused on the accurate extraction of biological relations. KITS precisely captures the relational context among the various binary relations within a sentence, while preventing any change in meaning for the sentences it simplifies. The experiments show that the proposed technique, evaluated using well-known performance metrics, yielded a 21% increase in precision with only 25% of sentences simplified in the Learning Language in Logic (LLL) dataset. Combined with BioBERT, a popular pre-trained LLM, the proposed method was able to outperform other state-of-the-art methods.
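The role that simplification plays upstream of relation extraction can be illustrated with a toy sketch. This is emphatically not the KITS algorithm, which is knowledge-based and parse-driven; the regex clause-splitter and trigger-word matcher below are illustrative stand-ins, and the gene names in the example are just sample entities of the kind found in the LLL corpus.

```python
# Hedged toy illustration (not the KITS algorithm): splitting a complex
# sentence into simpler clauses before scanning each clause for a binary
# relation. Real knowledge-based simplification uses parse trees and
# domain rules; this regex stand-in only shows where simplification sits
# in the pipeline.
import re

def simplify(sentence):
    """Split on coordinating conjunctions that join full clauses."""
    clauses = re.split(r",\s*(?:and|while|whereas)\s+", sentence)
    return [c.strip().rstrip(".") + "." for c in clauses if c.strip()]

def extract_relations(sentence, trigger_words=("activates", "represses")):
    """Scan each simplified clause for <entity> <trigger> <entity> patterns."""
    relations = []
    for clause in simplify(sentence):
        for trigger in trigger_words:
            m = re.search(r"(\w+)\s+%s\s+(\w+)" % trigger, clause)
            if m:
                relations.append((m.group(1), trigger, m.group(2)))
    return relations
```

Splitting first means each relation is matched within a short, simple clause, which is exactly where pattern- and model-based extractors are most reliable.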
Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. For this, we propose a new generic methodology to derive a diverse set of sentence vectors by combining and extracting various linguistic characteristics from the latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets, in order to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks for studying noisy text comprehension. Experiments are carried out on classification accuracy by deriving the sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers of the BERT model. We show that the initial and middle layers of BERT capture the key linguistic characteristics of noisy texts better than its later layers. With complex predictive models, we further show that sentence vector length matters less for capturing linguistic information, and that the proposed sentence vectors for noisy texts outperform existing state-of-the-art sentence vectors.
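The core mechanic of probing different hidden layers can be sketched as follows. The hidden states here are toy numbers; with a real model, e.g. BERT loaded through the transformers library with `output_hidden_states=True`, each layer would supply a (tokens x dimension) array in place of the toy lists. Mean pooling is just one of several pooling choices one might probe.

```python
# Hedged sketch: deriving one sentence vector per hidden layer by mean-
# pooling that layer's token embeddings, so early, middle and late layers
# can be compared on probing tasks. Toy values stand in for real
# model hidden states.

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def sentence_vectors_by_layer(hidden_states):
    """hidden_states: list over layers, each a (tokens x dim) list of lists.
    Returns one pooled sentence vector per layer."""
    return [mean_pool(layer) for layer in hidden_states]
```

Feeding each layer's pooled vector to the same probing classifier is what lets one conclude, as above, that BERT's initial and middle layers carry more of the linguistic signal in noisy text.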
The dynamic Bayesian network (DBN) is among the mainstream approaches for modeling various biological networks, including the gene regulatory network (GRN). Most current methods for learning DBNs employ either local search, such as hill-climbing, or a stochastic global optimization metaheuristic, such as a genetic algorithm or simulated annealing, which are only able to locate sub-optimal solutions. Further, current DBN applications have essentially been limited to small-sized networks.
To overcome the above difficulties, we introduce here a deterministic global optimization based DBN approach for reverse engineering genetic networks from time-course gene expression data. For DBN models that consist only of inter-time-slice arcs, we show that there exists a polynomial-time algorithm for learning the globally optimal network structure. The proposed approach, named GlobalMIT+, employs the recently proposed information-theoretic scoring metric named the mutual information test (MIT). GlobalMIT+ is able to learn high-order time-delayed genetic interactions, which are common to most biological systems. Evaluation of the approach using both synthetic and real data sets, including a cyanobacterial gene expression data set of 733 genes, shows significantly improved performance over other techniques.
Our studies demonstrate that deterministic global optimization approaches can infer large scale genetic networks.
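The key structural insight above, that with only inter-time-slice arcs the network score decomposes over target genes, so optimizing each gene's parent set independently yields the global optimum, can be sketched as follows. This is a hedged sketch under stated assumptions: plain mutual information with a linear size penalty stands in for the MIT metric, the lag is fixed at one slice, and the toy data is illustrative.

```python
# Hedged sketch: exact per-gene parent-set search for a DBN restricted to
# inter-time-slice (lag-1) arcs. Because the score decomposes per target
# gene, an exhaustive search over small parent sets for each gene gives
# the globally optimal structure. Mutual information with a size penalty
# stands in for the MIT scoring metric used by GlobalMIT+.
from itertools import combinations
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X;Y) in nats from two aligned discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def best_parents(data, target, max_parents=2, penalty=0.05):
    """Exhaustively score every candidate parent set for one target gene.

    data: dict gene -> list of discrete expression values over time.
    Parents are read at time t, the target at time t+1 (lag-1 arcs only).
    """
    child = data[target][1:]                    # target shifted by one slice
    best, best_score = (), 0.0
    for k in range(1, max_parents + 1):
        for parents in combinations(sorted(g for g in data if g != target), k):
            # joint parent state at time t, encoded as a tuple per time point
            joint = list(zip(*(data[p][:-1] for p in parents)))
            score = mutual_information(joint, child) - penalty * k
            if score > best_score:
                best, best_score = parents, score
    return best, best_score
```

Running `best_parents` once per gene and taking the union of the chosen arcs is the decomposition that makes a deterministic, globally optimal search tractable; higher-order delays would simply add further shifted copies of each candidate parent.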
For dispute resolution in daily life, tamper-proof storage and retrieval of log data are important, along with trustworthy access control for the related users and devices; meanwhile, giving relevant users access to confidential data and maintaining data persistency are two major challenges in information security. This research uses a blockchain data structure to maintain data persistency. In addition, we propose protocols for the authentication of users (persons and devices) to the edge server, and of the edge server to the main server. Our proposed framework also provides access to forensic users according to their relevant roles and privilege attributes. For the access control of forensic users, a hybrid attribute- and role-based access control (ARBAC) module is added to the framework. The proposed framework is thus composed of an immutable blockchain-based data store with endpoint authentication and an attribute- and role-based user access control system. We simulate the framework's authentication protocols in AVISPA. Our analysis shows that several security issues can be dealt with efficiently by the proposed framework.
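The tamper-evidence property that the blockchain data structure provides can be illustrated with a minimal hash-chained log. This is a toy sketch, not the paper's implementation: it omits consensus, signatures, and the authentication and ARBAC layers, and only shows why editing a stored entry breaks verification.

```python
# Hedged toy sketch (not the paper's system): an append-only, hash-chained
# log. Each block stores the SHA-256 hash of its predecessor, so altering
# any stored entry invalidates every later link and is detectable.
import hashlib
import json

def _hash(block):
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

class ChainedLog:
    def __init__(self):
        genesis = {"index": 0, "data": "genesis", "prev": "0" * 64}
        self.blocks = [genesis]

    def append(self, data):
        block = {"index": len(self.blocks),
                 "data": data,
                 "prev": _hash(self.blocks[-1])}   # link to predecessor
        self.blocks.append(block)

    def verify(self):
        """True iff every block's `prev` still matches its predecessor's hash."""
        return all(b["prev"] == _hash(self.blocks[i])
                   for i, b in enumerate(self.blocks[1:]))
```

In a real deployment the chain would be replicated and blocks signed by authenticated devices; the data-persistency argument, however, already rests on this one-way linkage.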
Unicellular diazotrophic cyanobacteria such as Cyanothece sp. ATCC 51142 (henceforth Cyanothece) temporally separate oxygen-sensitive nitrogen fixation from oxygen-evolving photosynthesis, not only under diurnal cycles (LD) but also in continuous light (LL). However, recent reports demonstrate that the oscillations in LL occur with a shorter cycle time of ~11 h. We find that, indeed, the majority of genes oscillate in LL with this cycle time. Genes that are upregulated at a particular time of day under the diurnal cycle are also upregulated at the equivalent metabolic phase under LL, suggesting tight coupling of various cellular events with each other and with the cell's metabolic status. A number of metabolic processes are upregulated in a coordinated fashion during the respiratory phase under LL, including glycogen degradation, glycolysis, the oxidative pentose phosphate pathway, and the tricarboxylic acid cycle. These precede nitrogen fixation, apparently to ensure the sufficient energy and anoxic environment needed by the nitrogenase enzyme. The photosynthetic phase sees upregulation of photosystem II, carbonate transport, the carbon concentrating mechanism, RuBisCO, glycogen synthesis, and light-harvesting antenna pigment biosynthesis. In Synechococcus elongatus PCC 7942, a non-nitrogen-fixing cyanobacterium, the expression of a relatively smaller fraction of genes oscillates under the LL condition, with the major periodicity being 24 h. In contrast, the entire cellular machinery of Cyanothece orchestrates coordinated oscillation in anticipation of the ensuing metabolic phase in both LD and LL. These results may have important implications for understanding the timing of various cellular events and for engineering cyanobacteria for biofuel production.
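Distinguishing an ~11 h LL cycle from a 24 h diurnal cycle comes down to estimating the dominant period of each gene's expression time series. A standard way to do this, sketched below with synthetic data, is to locate the autocorrelation peak; this is a generic illustration of the analysis, not the study's actual pipeline, and the sampling interval and cosine test signal are assumptions.

```python
# Hedged sketch: dominant-period detection for an expression time series
# via the autocorrelation function. With hourly sampling, a peak at lag 11
# corresponds to an ~11 h oscillation. The cosine input is synthetic.
from math import cos, pi

def autocorr(xs, lag):
    """Autocorrelation of the series at a given lag, normalized by variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    return sum((xs[i] - mean) * (xs[i + lag] - mean)
               for i in range(n - lag)) / var

def dominant_period(xs, min_lag=2):
    """Lag (in samples) with the highest autocorrelation."""
    lags = range(min_lag, len(xs) // 2)
    return max(lags, key=lambda k: autocorr(xs, k))
```

Applied genome-wide, the distribution of detected periods is what separates the behaviour described above: most Cyanothece genes peaking near 11 h in LL versus a 24 h majority in Synechococcus elongatus.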
Blockchain technology (BCT) has been gaining popularity due to its benefits for almost every industry. However, despite its benefits, the organizational adoption of BCT is rather limited. This lack of uptake motivated us to identify the factors that influence the adoption of BCT from an organizational perspective. In doing this, we reviewed the BCT literature, interviewed BCT experts, and proposed a research model based on the TOE framework. Specifically, we theorized the role of technological (perceived benefits, compatibility, information transparency, and disintermediation), organizational (organizational innovativeness, organizational learning capability, and top management support), and environmental (competition intensity, government support, trading partner readiness, and standards uncertainty) factors in the organizational adoption of BCT in Australia. We confirmed the model with a sample of adopter and potential adopter organizations in Australia. The results show a significant role of the proposed factors in the organizational adoption of BCT in Australia. Additionally, we found that the relationship between the influential factors and BCT adoption is moderated by “perceived risks”. The study extends the TOE framework by adding factors that were ignored in previous studies on BCT adoption, such as perceived information transparency, perceived disintermediation, organizational innovativeness, organizational learning capability, and standards uncertainty.
Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification, while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence do not take into account correlations between genes, or wrapper-based, which incurs high computational cost and often yields difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures that are less than meticulous, resulting in overly optimistic estimates of accuracy.
We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets.
For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
Protein Structure Prediction (PSP) from the primary amino acid sequence, even using a simplified Hydrophobic-Polar (HP) lattice model, continues to be extremely challenging. Finding an optimal conformation, even for a small sequence, by any of the currently known evolutionary approaches is computationally expensive and time consuming. Although Memetic Algorithms (MAs) have shown success in finding optimal solutions for PSP, no significant work has been reported on incorporating domain- or problem-specific knowledge into the search process to significantly improve their performance. In this paper, we present an approach that incorporates such knowledge into the initial population to enhance the effectiveness of MAs for PSP. The domain knowledge we propose to use is based on the concept of maximal ‘core’ formation, exploiting the fundamental property of H residues to lie at the core of the minimum-energy optimal protein structure. A generic technique is proposed for estimating the maximal Hydrophobic core (H-core) of a protein sequence for the 2D Square, 3D Cubic, and the more complex and realistic 3D FCC (Face Centred Cubic) lattice models. Subsequently, the knowledge of this estimated core is incorporated into an MA. Experiments conducted using HP benchmark sequences for the 2D Square, 3D Cubic and 3D FCC lattice models show that the proposed MA, with the new core-based population initialization technique, outperforms existing methods in terms of both convergence speed and minimum energy.
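The H-core idea can be made concrete for the simplest case, the 2D square lattice: pack the sequence's H residues into the most square-like rectangle that can hold them and count the internal contacts, which bounds how compact any conformation's hydrophobic core can be. This is a hedged illustration of the core concept, not the paper's estimator, which also covers the 3D Cubic and 3D FCC lattices.

```python
# Hedged sketch: estimating the maximal hydrophobic core of an HP string
# on the 2D square lattice. H residues are packed into the most
# square-like rectangle; the contact count of that packing bounds the
# H-H contacts achievable by any conformation.
from math import ceil, sqrt

def h_core_bound_2d(sequence):
    """Return (core_width, core_height, max_hh_contacts) for an HP string."""
    n = sequence.upper().count("H")
    if n == 0:
        return 0, 0, 0
    width = ceil(sqrt(n))                 # most square-like bounding rectangle
    height = ceil(n / width)
    full_rows, rem = divmod(n, width)     # full_rows >= 1 since width <= n
    # contacts inside and between the completely filled rows
    contacts = full_rows * (width - 1) + (full_rows - 1) * width
    if rem:
        # partial last row: (rem - 1) contacts along it, rem up to the row above
        contacts += (rem - 1) + rem
    return width, height, contacts
```

Seeding the MA population with conformations whose H residues already sit inside this estimated rectangle is what the core-based initialization exploits: the search starts near the compact-core region where minimum-energy structures lie.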