In the context of the rapidly evolving AI-generated translation landscape, this study explores the viewer experience of GPT-4-generated humor translations compared with human translations for dubbing purposes. Despite advances in AI technology, challenges persist in capturing the nuances of humor and cultural references, which are essential for effective humor translation. A group of cinephiles was shown humor dialogues translated both by GPT-4 and by human translators and was then asked to fill out a questionnaire evaluating five constructs: quality, comprehension, enjoyment, perception, and suggestions for improvement. The results indicate that GPT-4-generated translations can offer a viewer experience equivalent or superior to human translations. The results also highlight the potential of GPT-4 in humor translation and emphasize the need to refine AI algorithms, foster collaboration between AI and human translators, and establish quality standards for AI-generated translations based on audience preferences and evolving trends in humor.
•This study compares the viewer experience of GPT-4-generated humor translations to human translations for dubbing purposes.
•Participants evaluated translations based on quality, comprehension, enjoyment, perception, and suggestions for improvement.
•GPT-4-generated translations were found to offer an equivalent or superior viewer experience compared to human translations.
•Collaboration between AI and human translators and establishing quality standards are recommended for improved translation outcomes.
Exploring the integration of artificial intelligence in clinical settings, this study examined the feasibility of using Generative Pretrained Transformer 4 (GPT-4), a large language model, as a consultation assistant in a hand surgery outpatient clinic.
The study involved 10 simulated patient scenarios with common hand conditions, in which GPT-4, enhanced through specific prompt engineering techniques, conducted medical history interviews and assisted in diagnostic processes. A panel of board-certified hand surgeons evaluated GPT-4's responses on a Likert scale across five criteria, with scores ranging from 1 (lowest) to 5 (highest).
Generative Pretrained Transformer 4 achieved an average score of 4.6, reflecting good performance in documenting a medical history, as evaluated by the hand surgeons.
These findings suggest that GPT-4 can effectively document medical histories to meet the standards of hand surgeons in a simulated environment. The findings indicate potential for future application in patient care, but the actual performance of GPT-4 in real clinical settings remains to be investigated.
This study provides a preliminary indication that GPT-4 could be a useful consultation assistant in a hand surgery outpatient clinic, but further research is required to explore its reliability and practicality in actual practice.
In the face of escalating oral cancer rates, the application of large language models like Generative Pretrained Transformer (GPT)-4 presents a novel pathway for enhancing public awareness about prevention and early detection. This research aims to explore the capabilities and possibilities of GPT-4 in addressing open-ended inquiries in the field of oral cancer.
Sixty questions with reference answers, covering concepts, causes, treatments, nutrition, and other aspects of oral cancer, were used, and evaluators from diverse backgrounds were selected to assess the capabilities of GPT-4 and a customized version. A P value under .05 was considered significant.
Analysis revealed that GPT-4 and its adaptations notably excelled at answering open-ended questions, with the majority of responses receiving high scores. Although the median score for standard GPT-4 was marginally higher, statistical tests showed no significant difference in capability between the two models (P > .05). Although the evaluators' diverse backgrounds produced a statistically significant difference in scores (P < .05), a post hoc test and comprehensive analysis demonstrated that both editions of GPT-4 were equivalently capable of answering questions concerning oral cancer.
GPT-4 has demonstrated its capability to furnish responses to open-ended inquiries concerning oral cancer. Utilizing this advanced technology to boost public awareness about oral cancer is viable and holds much potential. When the model is unable to locate pertinent information, it resorts to its inherent knowledge base or recommends consulting professionals after offering some basic information. Therefore, it cannot supplant the expertise and clinical judgment of surgical oncologists and should instead be used as an adjunctive evaluation tool.
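The two-model comparison above rests on a nonparametric significance test at the .05 threshold. The abstract does not name the test used; a Mann-Whitney U test with a normal approximation is one common choice for comparing evaluator scores of this kind. A minimal stdlib-only Python sketch, with invented sample scores for illustration:

```python
import math

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p). Ties receive average ranks; the tie correction to
    the variance is omitted for brevity, so p is approximate.
    """
    combined = sorted((x, 0 if i < len(a) else 1)
                      for i, x in enumerate(a + b))
    # Assign average ranks to tied values.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j
    r1 = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 0)
    n1, n2 = len(a), len(b)
    u = r1 - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    return u, p

# Hypothetical evaluator scores for the two GPT-4 editions:
standard = [4, 5, 4, 3, 5]
custom = [5, 4, 4, 4, 3]
u, p = mann_whitney_u(standard, custom)
print(f"U = {u}, p = {p:.3f}")
```

With similar score distributions, as here, p exceeds .05 and the null hypothesis of equivalent capability is not rejected, mirroring the study's reported outcome.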
GPT-4 passes the bar exam
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences,
04/2024
Journal Article
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. First contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even for humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning, such as word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of the results with available state-of-the-art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss on semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss; this applies especially to pragmatic NLP problems such as emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization and obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI.
Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool’s usefulness to society and how the learning and validation procedures for such systems should be established.
•The results of ChatGPT and GPT-4 evaluation on 25 tasks using 48k+ prompts.
•Context-awareness and personalization are valuable capabilities of ChatGPT.
•ChatGPT and GPT-4 are always worse compared to SOTA methods, from 4% to over 70%.
•ChatGPT loss tends to be higher for more difficult reasoning problems.
•ChatGPT can boost AI development and change our daily lives.
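The zero-shot and few-shot regimes compared in the study above differ only in whether labeled demonstrations precede the query. The prompt wording below is hypothetical (the paper's actual templates are not reproduced here); the sketch shows only the structural difference between the two regimes, using an invented sentiment task:

```python
def build_prompt(task_instruction, text, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt
    (instruction followed by labeled demonstrations, then the query)."""
    parts = [task_instruction]
    for ex_text, ex_label in examples:  # empty tuple => zero-shot
        parts.append(f"Text: {ex_text}\nLabel: {ex_label}")
    parts.append(f"Text: {text}\nLabel:")  # query; model fills in the label
    return "\n\n".join(parts)

# Hypothetical sentiment-analysis task:
instruction = "Classify the sentiment of the text as positive or negative."
zero_shot = build_prompt(instruction, "I loved this movie.")
few_shot = build_prompt(
    instruction,
    "I loved this movie.",
    examples=[("Terrible plot.", "negative"), ("Great acting!", "positive")],
)
```

Automating an evaluation at the paper's scale then amounts to looping such prompts over each task's test set and scoring the model's completions against gold labels.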
Virtual assistants, broadly defined as digital services designed to simulate human conversation and provide personalized responses based on user input, have the potential to improve health care by supporting clinicians and patients in diagnosing and managing disease, performing administrative tasks, and supporting medical research and education. These tasks are particularly helpful in vascular surgery, where the clinical and administrative burden is high due to the rising incidence of vascular disease, the medical complexity of the patients, and the potential for innovation and care advancement. The rapid development of artificial intelligence, machine learning, and natural language processing techniques has facilitated the training of large language models, such as GPT-4 (OpenAI), which can support the development of increasingly powerful virtual assistants. These tools may support holistic, multidisciplinary, and high-quality vascular care delivery throughout the pre-, intra-, and postoperative stages. Importantly, it is critical to consider the design, safety, and challenges related to virtual assistants, including data security, ethical, and equity concerns. By combining the perspectives of patients, clinicians, data scientists, and other stakeholders when developing, implementing, and monitoring virtual assistants, there is potential to harness the power of this technology to care for vascular surgery patients more effectively. In this comprehensive review article, we introduce the concept of virtual assistants, describe potential applications of virtual assistants in vascular surgery for clinicians and patients, highlight the benefits and drawbacks of large language models, such as GPT-4, and discuss considerations around the design, safety, and challenges associated with virtual assistants in vascular surgery.
ChatGPT, an artificial intelligence generated content (AIGC) model developed by OpenAI, has attracted worldwide attention for its capability of handling challenging language understanding and generation tasks in the form of conversations. This paper provides a brief overview of the history, status quo, and potential future development of ChatGPT, helping to provide an entry point for thinking about ChatGPT. Specifically, from the limited openly accessible resources, we summarize the core techniques of ChatGPT, mainly including large-scale language models, in-context learning, reinforcement learning from human feedback, and the key technical steps for developing ChatGPT. We further analyze the pros and cons of ChatGPT and rethink the duality of ChatGPT in various fields. Although it has been widely acknowledged that ChatGPT brings plenty of opportunities for various fields, mankind should still treat and use ChatGPT properly to avoid potential threats, e.g., to academic integrity and safety. Finally, we discuss several open problems bearing on the future development of ChatGPT.
Interest surrounding generative large language models (LLMs) has rapidly grown. Although ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized examinations and the factors affecting accuracy remain unclear. This study aims to assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written board examination.
The Self-Assessment Neurosurgery Examinations (SANS) American Board of Neurological Surgery Self-Assessment Examination 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests were used to assess performance differences in relation to question characteristics.
ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% CI: 69.3%-77.2%) and 83.4% (95% CI: 79.8%-86.5%), respectively, relative to the user average of 72.8% (95% CI: 68.6%-76.6%). Both LLMs exceeded last year's passing threshold of 69%. Although scores between ChatGPT and question bank users were equivalent (P = .963), GPT-4 outperformed both (both P < .001). GPT-4 correctly answered every question that ChatGPT answered correctly, as well as 37.6% (50/133) of the questions ChatGPT missed. Among 12 question categories, GPT-4 significantly outperformed users in each but performed comparably with ChatGPT in 3 (functional, other general, and spine) and outperformed both users and ChatGPT for tumor questions. Increased word count (odds ratio = 0.89 of answering correctly per additional 10 words) and higher-order problem-solving (odds ratio = 0.40, P = .009) were associated with lower accuracy for ChatGPT, but not for GPT-4 (both P > .005). Multimodal input was not available at the time of this study; hence, on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.
LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
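The headline gap above (GPT-4 at 83.4%, i.e., 417/500 correct, vs ChatGPT at 73.4%, i.e., 367/500) can be checked with a 2×2 Pearson chi-square test. In practice one would reach for scipy's `chi2_contingency`; the closed-form 2×2 formula (without continuity correction) and the 1-degree-of-freedom p-value via `erfc` are shown here only to keep the sketch self-contained:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]],
    returning (statistic, p-value) for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, the chi-square survival function reduces to erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# GPT-4: 417 correct / 83 incorrect; ChatGPT: 367 correct / 133 incorrect.
chi2, p = chi_square_2x2(417, 83, 367, 133)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")  # p < .001, consistent with the abstract
```

The resulting p-value falls well below .001, consistent with the reported significance of GPT-4's outperformance.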