In the context of the rapidly evolving AI-generated translation landscape, this study explores the viewer experience of GPT-4-generated humor translations compared to human translations for dubbing purposes. Despite advances in AI technology, challenges persist in capturing the nuances of humor and cultural references, which are essential for effective humor translation. A group of cinephiles was provided with humor dialogues translated using both GPT-4 and human translators and was then asked to fill out a questionnaire evaluating five constructs: quality, comprehension, enjoyment, perception, and suggestions for improvement. The results indicate that GPT-4-generated translations can offer an equivalent or superior viewer experience to human translations. The results also highlight the potential of GPT-4 in humor translation and emphasize the need to refine AI algorithms, foster collaboration between AI and human translators, and establish quality standards for AI-generated translations based on audience preferences and evolving trends in humor.
•This study compares the viewer experience of GPT-4-generated humor translations to human translations for dubbing purposes.
•Participants evaluated translations based on quality, comprehension, enjoyment, perception, and suggestions for improvement.
•GPT-4-generated translations were found to offer an equivalent or superior viewer experience compared to human translations.
•Collaboration between AI and human translators and the establishment of quality standards are recommended for improved translation outcomes.
•GPT-4 API as a Complementary Reviewer: Novel integration method for systematic literature reviews, enhancing efficiency and rigor.
•Feasibility and Reliability Evaluation: Assessing GPT-4's potential as a primary screening tool in healthcare information retrieval.
•Comprehensive Inter-rater Agreement Analysis: Utilizing Cohen's kappa to assess agreement between human reviewers, GPT-4, and consensus, employing distinct methodologies for various parameter types.
•Full-text Extraction Advancements: Overcoming limitations to extend AI-based reviewer capabilities in evidence synthesis.
PRISMA-based literature reviews require meticulous scrutiny of extensive textual data by multiple reviewers, which is associated with considerable human effort.
To evaluate the feasibility and reliability of using the GPT-4 API as a complementary reviewer in systematic literature reviews based on the PRISMA framework.
A systematic literature review on the role of natural language processing and large language models (LLMs) in automatic patient–trial matching was conducted using human reviewers and an AI-based reviewer (GPT-4 API). A retrieval-augmented generation (RAG) methodology with LangChain integration was used to process full-text articles. Agreement levels were evaluated between the two human reviewers and the GPT-4 API for abstract screening, and between a single reviewer and the GPT-4 API for full-text parameter extraction.
An almost perfect GPT–human reviewer agreement in the abstract screening process (Cohen’s kappa > 0.9) and a lower agreement in the full-text parameter extraction were observed.
Since GPT-4 performed on a par with human reviewers in abstract screening, we conclude that it has exciting potential as a primary screening tool for systematic literature reviews, replacing at least one of the human reviewers.
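The agreement statistic this study relies on, Cohen's kappa, corrects raw agreement for agreement expected by chance. A minimal dependency-free sketch, using made-up include/exclude screening decisions rather than data from the study:

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(ca[l] * cb[l] for l in labels) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical screening decisions by a human reviewer and by GPT-4
human = ["include", "exclude", "exclude", "include", "exclude",
         "exclude", "include", "exclude", "exclude", "exclude"]
gpt4  = ["include", "exclude", "exclude", "include", "exclude",
         "exclude", "include", "exclude", "include", "exclude"]
print(round(cohen_kappa(human, gpt4), 3))  # prints 0.783
```

On the common Landis–Koch scale, a kappa above 0.9 (as reported for abstract screening here) falls in the "almost perfect" band.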
Depression is a significant global health challenge. Still, many people suffering from depression remain undiagnosed. Furthermore, the assessment of depression can be subject to human bias. Natural Language Processing (NLP) models offer a promising solution. We investigated the potential of four NLP models (BERT, Llama2-13B, GPT-3.5, and GPT-4) for depression detection in clinical interviews. Participants (N = 82) underwent clinical interviews and completed a self-report depression questionnaire. NLP models inferred depression scores from interview transcripts. Questionnaire cut-off values for depression were used as a classifier for depression. GPT-4 showed the highest accuracy for depression classification (F1 score 0.73), while zero-shot GPT-3.5 initially performed with low accuracy (0.34), improved to 0.82 after fine-tuning, and achieved 0.68 with clustered data. GPT-4's estimates of symptom severity (PHQ-8 scores) correlated strongly (r = 0.71) with true symptom severity. These findings demonstrate the potential of AI models for depression detection. However, further research is necessary before widespread deployment can be considered.
•Exploring the Potential of LLMs for Depression Detection in Clinical Data.
•Models compared: BERT, GPT-3.5, GPT-4, and Llama2-13B — in depression classification.
•Various approaches explored: zero-shot testing, fine-tuning, and a novel clustering idea.
•Zero-shot GPT-4 and fine-tuned GPT-3.5 exhibited superior performance.
•Llama2-13B, as an open-source model, showcases significant potential.
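The classification setup described above — binarizing severity scores at a questionnaire cut-off and scoring with F1 — can be sketched in a few lines. The numbers below are illustrative, not the study's data, and the cut-off of 10 is the commonly used PHQ-8 threshold (an assumption, since the paper's exact cut-off is not stated here):

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels given as booleans."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

CUTOFF = 10  # common PHQ-8 depression cut-off (assumption)
true_scores  = [3, 12, 15, 7, 11, 2, 18, 9]   # hypothetical questionnaire scores
model_scores = [4, 14, 10, 8, 6, 1, 20, 12]   # hypothetical NLP-inferred scores
y_true = [s >= CUTOFF for s in true_scores]
y_pred = [s >= CUTOFF for s in model_scores]
print(round(f1_score(y_true, y_pred), 2))  # prints 0.75
```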
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning, like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss on semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (the lower the SOTA performance), the higher the ChatGPT loss. This applies especially to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI.
Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool’s usefulness to society and how the learning and validation procedures for such systems should be established.
•The results of ChatGPT and GPT-4 evaluation on 25 tasks using 48k+ prompts.
•Context-awareness and personalization are valuable capabilities of ChatGPT.
•ChatGPT and GPT-4 consistently underperform SOTA methods, by 4% to over 70%.
•ChatGPT's loss tends to be higher for more difficult reasoning problems.
•ChatGPT can boost AI development and change our daily lives.
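The "loss" relative to SOTA quoted in these highlights is a relative drop in quality score. A minimal sketch with hypothetical task scores (not the paper's actual numbers) also illustrates the reported pattern that harder tasks, where SOTA itself is lower, show larger losses:

```python
def relative_loss(sota, model):
    """Percentage quality drop of a model's score versus the SOTA score."""
    return 100 * (sota - model) / sota

# Hypothetical (SOTA, ChatGPT) scores per task, e.g. macro-F1 — illustration only
tasks = {
    "sentiment": (0.92, 0.80),   # easier task: high SOTA, smaller loss
    "emotion":   (0.65, 0.38),   # harder task: low SOTA, larger loss
}
for name, (sota, chatgpt) in tasks.items():
    print(name, round(relative_loss(sota, chatgpt), 1))
```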
Large‐scale artificial intelligence (AI) models such as ChatGPT have the potential to improve performance on many benchmarks and real‐world tasks. However, it is difficult to develop and maintain these models because of their complexity and resource requirements. As a result, they are still inaccessible to healthcare industries and clinicians. This situation might soon be changed because of advancements in graphics processing unit (GPU) programming and parallel computing. More importantly, leveraging existing large‐scale AIs such as GPT‐4 and Med‐PaLM and integrating them into multiagent models (e.g., Visual‐ChatGPT) will facilitate real‐world implementations. This review aims to raise awareness of the potential applications of these models in healthcare. We provide a general overview of several advanced large‐scale AI models, including language models, vision‐language models, graph learning models, language‐conditioned multiagent models, and multimodal embodied models. We discuss their potential medical applications in addition to the challenges and future directions. Importantly, we stress the need to align these models with human values and goals, such as using reinforcement learning from human feedback, to ensure that they provide accurate and personalized insights that support human decision‐making and improve healthcare outcomes.
This review provides an overview of large‐scale AI models, including language models (e.g., ChatGPT), vision‐language models, and language‐conditioned multiagent models, and discusses their potential applications in medicine, as well as their limitations and future trends. We also propose how large‐scale AI models can be integrated into various scenarios of clinical applications.
ChatGPT is an artificial intelligence model that has the potential to revolutionize the field of endoscopy. It can rapidly summarize medical records, assist with diagnosis, provide patient communication support, and even understand endoscopic images. However, there are limitations, including the risk of inaccurate or inappropriate responses, privacy and security issues, and the potential to limit doctors' thinking. Further research and improvements are needed to ensure ChatGPT's safe use in the medical field. Overall, the potential benefits of ChatGPT in endoscopy are vast, and it has the ability to greatly improve the efficiency of diagnosis and treatment.
English speakers use probabilistic phrases such as likely to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey and, if communication is successful, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts, investment advice and medical advice. We then had GPT-4 (OpenAI), a Large Language Model, complete the same tasks as the human participants. We found that GPT-4's estimates of probability in both the Investment and Medical Contexts were as close to, or closer to, the human participants' estimates as those estimates were to one another. However, further analyses of residuals disclosed small but significant differences between human and GPT-4 performance. Human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.
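The "compression" of human probability estimates relative to GPT-4's can be illustrated as a regression slope below 1 when human estimates are fit against GPT-4's: extreme probabilities are pulled toward the middle of the scale. The sketch below uses invented estimates purely for illustration, not the study's data:

```python
def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

# Hypothetical mean probability estimates for five phrases
gpt4_est  = [0.05, 0.20, 0.50, 0.80, 0.95]
human_est = [0.15, 0.28, 0.50, 0.72, 0.85]  # compressed toward 0.5
print(round(slope(gpt4_est, human_est), 2))  # a slope < 1 indicates compression
```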