Depression is a significant global health challenge. Still, many people suffering from depression remain undiagnosed, and the assessment of depression can be subject to human bias. Natural Language Processing (NLP) models offer a promising solution. We investigated the potential of four NLP models (BERT, Llama2-13B, GPT-3.5, and GPT-4) for depression detection in clinical interviews. Participants (N = 82) underwent clinical interviews and completed a self-report depression questionnaire. The NLP models inferred depression scores from interview transcripts, and questionnaire cut-off values served as the ground truth for depression classification. GPT-4 showed the highest classification accuracy (F1 score 0.73), while zero-shot GPT-3.5 initially performed poorly (0.34), improved to 0.82 after fine-tuning, and achieved 0.68 with clustered data. GPT-4 estimates of symptom severity (PHQ-8 scores) correlated strongly (r = 0.71) with true symptom severity. These findings demonstrate the potential of AI models for depression detection. However, further research is necessary before widespread deployment can be considered.
•Exploring the potential of LLMs for depression detection in clinical data.
•Models compared in depression classification: BERT, GPT-3.5, GPT-4, and Llama2-13B.
•Approaches explored: zero-shot testing, fine-tuning, and a novel clustering approach.
•Zero-shot GPT-4 and fine-tuned GPT-3.5 exhibited superior performance.
•Llama2-13B, as an open-source model, shows significant potential.
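The evaluation described above — dichotomizing PHQ-8 scores at a cut-off, scoring classification with F1, and scoring severity estimates with Pearson's r — can be sketched as follows. All scores below are hypothetical illustrations, not the study's data; PHQ-8 ≥ 10 is a commonly used cut-off for at least moderate depression.

```python
import math

def f1_score(y_true, y_pred):
    """F1 for the positive (depressed) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pearson_r(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

CUTOFF = 10  # PHQ-8 >= 10: commonly used threshold for depression

true_phq8  = [3, 12, 7, 18, 9, 15, 2, 11]   # hypothetical self-report scores
model_phq8 = [5, 10, 8, 16, 11, 13, 4, 9]   # hypothetical model estimates

# Dichotomize both at the cut-off, then score classification and severity.
y_true = [s >= CUTOFF for s in true_phq8]
y_pred = [s >= CUTOFF for s in model_phq8]
print(f"F1 = {f1_score(y_true, y_pred):.2f}")
print(f"r  = {pearson_r(true_phq8, model_phq8):.2f}")
```

This mirrors the abstract's two-level evaluation: a binary classification metric (F1) on top of a continuous severity correlation (r).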
•An automated data mining framework is proposed for building energy conservation.
•Maximal frequent itemset mining is applied to extract building operation patterns.
•GPT-3.5 and GPT-4 are used to analyze the extracted building operation patterns.
•The framework detects most building energy waste patterns effectively.
•The response time and cost of GPT are 6747.60 s and $17.68, respectively.
Data mining technologies have shown promising capabilities in extracting building operation patterns from massive amounts of building operational data for energy conservation. However, the number of extracted operation patterns is often large, making it time-consuming and tedious for users to find valuable patterns among them. To overcome this barrier, this study proposes an automated data mining framework based on maximal frequent itemset mining and generative pre-trained transformers (GPT). An improved maximal frequent itemset mining-based method is developed to extract non-redundant operation patterns from massive building operational data, reducing the number of extracted patterns. A template-based prompt generation method then transforms the extracted operation patterns into prompts, which are input into GPT to determine whether energy waste patterns are hidden among them, liberating humans from the tedious work of analyzing the extracted patterns. The framework is applied to one year of operational data from a real-world building chiller plant system to verify its performance. Most of the energy waste patterns in this system are detected successfully, including valve faults, low chilled water outlet temperature, small chilled and cooling water temperature differences, and improperly coordinated control among devices. The detection accuracy of GPT is 89.17% for energy waste patterns and 99.48% for normal operation patterns. The response time and cost of GPT are 6747.60 s and $17.68, respectively.
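The two core steps of the framework — mining maximal frequent itemsets from discretized operating states and turning each pattern into a prompt via a template — can be illustrated with a brute-force sketch. The transactions, state labels, and template wording below are hypothetical; the paper's improved miner prunes the search space rather than enumerating exhaustively.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force maximal frequent itemset mining.

    A frequent itemset is maximal if no frequent superset exists; reporting
    only maximal sets is what reduces the number of extracted patterns.
    Real miners (Apriori/FP-growth variants) prune candidates; this
    exhaustive version is for illustration only.
    """
    items = sorted({i for t in transactions for i in t})
    frequent = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = set(cand)
            if sum(1 for t in transactions if s <= t) >= min_support:
                frequent.append(s)
    # keep only itemsets with no frequent proper superset
    return [f for f in frequent if not any(f < g for g in frequent)]

def pattern_to_prompt(pattern):
    """Template-based prompt generation (template wording is hypothetical)."""
    return ("A building chiller plant frequently operates in the state: "
            + ", ".join(sorted(pattern))
            + ". Does this operation pattern indicate energy waste?")

# Hypothetical discretized hourly operating states of a chiller plant.
transactions = [
    {"chiller_on", "valve_open", "low_chw_temp"},
    {"chiller_on", "valve_open", "low_chw_temp"},
    {"chiller_on", "valve_open"},
    {"chiller_on", "pump_on"},
]

for pattern in maximal_frequent_itemsets(transactions, min_support=3):
    print(pattern_to_prompt(pattern))
```

With `min_support=3`, the singletons `{chiller_on}` and `{valve_open}` are frequent but subsumed by the maximal pattern `{chiller_on, valve_open}`, so only one prompt is generated instead of three.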
•The knowledge capabilities of LLMs in the HVAC industry are evaluated.
•The knowledge performance of LLMs is revealed.
•Limitations of LLMs in the field of HVAC systems are noted.
•Future research directions for LLMs in the HVAC industry are proposed.
Large language models (LLMs) have shown human-level capabilities in solving various complex tasks. However, it is still unknown whether state-of-the-art LLMs master sufficient knowledge related to heating, ventilation and air conditioning (HVAC) systems. It would be inspiring if LLMs could think and learn like professionals in the HVAC industry. Hence, this study investigates how well LLMs master the knowledge and skills of the HVAC industry by having them take the ASHRAE Certified HVAC Designer examination, an authoritative examination in the field. Three key knowledge capabilities are explored: recall, analysis, and application. Twelve representative LLMs are tested, including GPT-3.5, GPT-4, and LLaMA. According to the results, GPT-4 passes the ASHRAE Certified HVAC Designer examination with scores from 74 to 78, outperforming about half of the human examinees, and GPT-3.5 passes the examination in two of five attempts. This demonstrates that some LLMs, such as GPT-4 and GPT-3.5, have great potential to assist or replace humans in designing and operating HVAC systems. However, they still make occasional mistakes due to lack of knowledge, poor reasoning capabilities, and unsatisfactory equation-calculation abilities. Accordingly, four future research directions are proposed for utilizing and improving LLMs in the HVAC industry: teaching LLMs to use HVAC design tools or software, enabling LLMs to read and analyze operational data from HVAC systems, developing tailored corpora for the HVAC industry, and assessing the performance of LLMs in real-world HVAC design and operation scenarios.
Large language models (LLMs) are a special class of pretrained language models (PLMs) obtained by scaling up model size, pretraining corpus, and computation. Because of their large size and pretraining on large volumes of text data, LLMs exhibit special abilities that allow them to achieve remarkable performance in many natural language processing tasks without any task-specific training. The era of LLMs started with OpenAI's GPT-3 model, and their popularity has increased exponentially since the introduction of models like ChatGPT and GPT-4. We refer to GPT-3 and its successor OpenAI models, including ChatGPT and GPT-4, as GPT-3 family large language models (GLLMs). With the ever-rising popularity of GLLMs, especially in the research community, there is a strong need for a comprehensive survey that summarizes recent research progress along multiple dimensions and can guide the research community with insightful future research directions. We start the survey with foundation concepts such as transformers, transfer learning, self-supervised learning, pretrained language models, and large language models. We then present a brief overview of GLLMs and discuss their performance in various downstream tasks, specific domains, and multiple languages. We also discuss the data labelling and data augmentation abilities of GLLMs, their robustness, and their effectiveness as evaluators, and conclude with multiple insightful future research directions. In summary, this comprehensive survey will serve as a good resource for both academics and industry practitioners to stay updated with the latest research on GLLMs.
•First survey paper to present a comprehensive review of GLLMs with 350+ papers.
•Discusses various foundation concepts from transformers to large language models.
•Presents GPT-3 family large language models in detail.
•Discusses the performance of GLLMs in various downstream tasks.
•Presents multiple insightful future research directions to improve GLLMs further.
This study assessed how different prompt engineering techniques, specifically direct prompts, Chain of Thought (CoT), and a modified CoT approach, influence the ability of GPT-3.5 to answer clinical and calculation-based medical questions, particularly those styled like the USMLE Step 1 exam. To achieve this, we analyzed GPT-3.5's responses to two distinct sets of questions: a batch of 1000 questions generated by GPT-4, and another set comprising 95 real USMLE Step 1 questions. These questions spanned a range of medical calculations and clinical scenarios across various fields and difficulty levels. Our analysis revealed no significant differences in the accuracy of GPT-3.5's responses across direct prompts, CoT, and modified CoT methods. For instance, in the USMLE sample, the success rates were 61.7% for direct prompts, 62.8% for CoT, and 57.4% for modified CoT, with a p-value of 0.734. Similar trends were observed in the responses to the GPT-4-generated questions, both clinical and calculation-based, with p-values above 0.05 indicating no significant difference between prompt types. We conclude that CoT prompt engineering does not significantly alter GPT-3.5's effectiveness in handling medical calculations or clinical scenario questions styled like those in USMLE exams. This finding is crucial because it suggests that the performance of ChatGPT remains consistent whether a CoT technique or a direct prompt is used. This consistency could simplify the integration of AI tools like ChatGPT into medical education, enabling healthcare professionals to use these tools with ease and without complex prompt engineering.
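Comparisons of success rates like those above (61.7% vs. 62.8%, p > 0.05) can be reproduced with a standard two-proportion z-test. The abstract does not specify which test the authors used, and the counts below are hypothetical; this is only a sketch of how such a p-value arises from two accuracy figures on the same number of questions.

```python
import math

def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided z-test for H0: the two success proportions are equal."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)           # pooled proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 58/95 correct with direct prompts vs. 60/95 with CoT.
z, p = two_proportion_z_test(58, 95, 60, 95)
print(f"z = {z:.3f}, p = {p:.3f}")
```

With samples this small, a two-question difference in correct answers yields a p-value far above 0.05, consistent with the study's finding of no significant effect of prompt style.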
Students' inappropriate use of ChatGPT is a concern. There is also, however, the potential for academics to use ChatGPT inappropriately. After explaining ChatGPT's "hallucinations" regarding citing and referencing, this commentary illustrates the problem by describing the detection of the first known Medical Teacher submission that used ChatGPT inappropriately, the lessons journal editors, reviewers, and teachers can draw from it, and the wider implications if this problem is left unchecked.
ChatGPT is a new artificial intelligence-powered, language model-based chatbot able to help otolaryngologists in clinical practice and research. We investigated the ability of ChatGPT-4 to edit a manuscript in otolaryngology. Four papers were written by a nonnative-English-speaking otolaryngologist and edited by a professional editing service. ChatGPT-4 was used to detect and correct errors in the manuscripts. Of the 171 errors in the manuscripts, ChatGPT-4 detected 86 (50.3%), including vocabulary (N = 36), determiner (N = 27), preposition (N = 24), capitalization (N = 20), and number (N = 11) errors. ChatGPT-4 proposed appropriate corrections for 72 (83.7%) of the detected errors, while some errors were poorly detected (e.g., 5% of capitalization and 44.4% of vocabulary errors). In 82 cases, ChatGPT-4 claimed to change something that was already correct. ChatGPT-4 demonstrated usefulness in identifying some types of errors but not all, and nonnative English researchers should be aware of its current limits in the proofreading of manuscripts.
English speakers use probabilistic phrases such as "likely" to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey; if communication is successful, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts, investment advice and medical advice. We then had GPT-4 (OpenAI), a large language model, complete the same tasks as the human participants. We found that GPT-4's probability estimates in both the investment and medical contexts were as close to the human participants' estimates as those estimates were to one another, or closer. However, further analyses of residuals disclosed small but significant differences between human and GPT-4 performance: human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.
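One simple way to quantify the "compression" of probability estimates noted above is an ordinary least-squares fit: regressing one group's estimates on the other's, a slope below 1 means the estimates are pulled toward the middle of the probability scale. The phrase values below are hypothetical and do not reproduce the study's data; this is only a sketch of the diagnostic.

```python
def ols_fit(xs, ys):
    """Ordinary least squares: fit ys ~ slope * xs + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical mean probability estimates for five probabilistic phrases
# (ordered roughly from "very unlikely" to "almost certain").
gpt4_estimates  = [0.05, 0.25, 0.50, 0.75, 0.95]
human_estimates = [0.15, 0.30, 0.50, 0.70, 0.85]  # pulled toward 0.5

slope, intercept = ols_fit(gpt4_estimates, human_estimates)
print(f"slope = {slope:.3f}")  # slope < 1 indicates compression
```

A slope of 1 with zero intercept would mean the two sets of estimates agree in scale; the compressed hypothetical human estimates yield a slope well below 1, the pattern the residual analysis detected.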