News now spreads rapidly through the internet. Because fake news stories are designed to attract readers, they tend to spread faster. For most readers, detecting fake news is challenging, and such readers often end up believing that the fake story is fact. Because fake news can be socially problematic, a model that automatically detects it is required. In this paper, we focus on data-driven automatic fake news detection methods. We first apply the Bidirectional Encoder Representations from Transformers (BERT) model to detect fake news by analyzing the relationship between the headline and the body text of a news story. To further improve performance, additional news data are gathered and used to pre-train this model. We determine that the deep-contextualizing nature of BERT is best suited for this task and improves the F-score by 0.14 over older state-of-the-art models.
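As a concrete illustration of the headline–body approach described above, the following is a minimal sketch using the Hugging Face transformers library. The checkpoint, the four-way label set, and the example strings are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: classify a (headline, body) pair with BERT.
# The checkpoint and label set are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

headline = "Scientists discover miracle cure"
body = "Full article text goes here."

# Encoding the headline and body as a sentence pair lets BERT's
# self-attention model the relationship between the two segments.
inputs = tokenizer(headline, body, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()  # e.g. agree / disagree / discuss / unrelated
```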
The most commonly used algorithm in recommendation systems is collaborative filtering. However, despite its wide use, the prediction accuracy of this algorithm is unexceptional. Furthermore, whether quantitative data such as product ratings or purchase history reflect users’ actual tastes is questionable. In this article, we propose a method to utilise user review data, extracted with opinion mining, in product recommendation systems. To evaluate the proposed method, we perform a product recommendation test on Amazon product data, with and without the additional opinion mining results on Amazon purchase review data. The performance of these two variants is compared by means of precision, recall, true positive recommendation (TPR) and false positive recommendation (FPR). In this comparison, a large improvement in prediction accuracy was observed when the opinion mining data were taken into account. Based on these results, we answer two main questions: ‘Why is the collaborative filtering algorithm not effective?’ and ‘Do quantitative data such as product ratings or purchase history reflect users’ actual tastes?’
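To make the idea concrete, here is a hedged sketch of blending a mined opinion score with the star rating before collaborative filtering. The VADER-based scoring and the blending weight `alpha` are illustrative assumptions, not the article's exact opinion-mining pipeline.

```python
# Sketch: blend a review's mined opinion score with its star rating.
# VADER and the weight `alpha` are illustrative choices, not the article's method.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def opinion_score(review_text: str) -> float:
    """Map review text to the [1, 5] scale so it is comparable with star ratings."""
    polarity = sia.polarity_scores(review_text)["compound"]  # in [-1, 1]
    return 3.0 + 2.0 * polarity

def blended_rating(star_rating: float, review_text: str, alpha: float = 0.5) -> float:
    """Feed this blended value, instead of the raw rating, into collaborative filtering."""
    return alpha * star_rating + (1.0 - alpha) * opinion_score(review_text)
```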
Recommendation systems are very important in various applications and e-commerce environments. A representative method is collaborative filtering (CF), which models user preference by means of feedback from the user. CF-based methods make better recommendations than earlier approaches because CF captures the interactions between the user and the item. However, despite the advantages of working with high-density data, these methods are vulnerable to the data sparsity that often exists in real data sets. To address this issue, we combine similarity-based approaches, which recommend products similar to those a user already prefers, with knowledge-based similarity, and provide individualized top-N recommendations. This approach, called UK (Unifying user preference and item knowledge-based similarity models), further exploits knowledge-based similarity ideas along with user preferences to extend the item interactions. We assume strong independence between the various factors. By applying our method to real data sets of various sizes and types, we quantitatively demonstrate that UK outperforms cutting-edge methods. In terms of qualitative discovery, UK also captures individual interactions and can provide meaningful recommendations according to the goal.
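The following is a minimal sketch of the unification idea: combine an interaction-derived item similarity with a knowledge-derived one and rank unseen items. The linear combination and the weight `lam` are assumptions, not the exact UK formulation.

```python
# Sketch: unify preference-based and knowledge-based item similarities for top-N.
# The linear blend and weight `lam` are assumptions, not the exact UK model.
import numpy as np

def unified_similarity(pref_sim: np.ndarray, know_sim: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """pref_sim: item-item similarity from user-item interactions;
    know_sim: item-item similarity from item knowledge (e.g. shared attributes)."""
    return lam * pref_sim + (1.0 - lam) * know_sim

def top_n(user_items: list, sim: np.ndarray, n: int = 10) -> np.ndarray:
    scores = sim[user_items].sum(axis=0)  # aggregate similarity to the user's items
    scores[user_items] = -np.inf          # never recommend items already seen
    return np.argsort(scores)[::-1][:n]
```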
Most textual analysis-based trading approaches in cryptocurrency (crypto) involve lexical, rule-based methods for extracting news sentiment. Furthermore, general purpose language models (LMs) are not always suitable for the crypto domain because of jargon that is not covered in general purpose texts. This study addresses the question "Can LMs profit by effectively combining the sentiment score from a natural language processing task with a chart score in a BTC trading system?" by focusing on the effectiveness of both scores, which significantly affect the profit of the trading system. We introduce CBITS (Cryptocurrency BERT Incorporated Trading System), based on pre-trained LMs for Korean crypto sentiment analysis, to aid Bitcoin (BTC) trading models. We pre-trained crypto-specific LMs, which are transformer encoder-based architectures. Along with our pre-trained LMs, we also present the custom fine-tuning dataset used to train our LMs as a BTC sentiment classifier, and show that using sentiment scores along with BTC chart data boosts the performance of BTC trading models and allows us to create a market-neutral trading strategy.
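A hedged sketch of fusing the two scores into a trading decision is shown below; the weighting and thresholds are illustrative assumptions, not the CBITS strategy itself.

```python
# Sketch: fuse a sentiment score with a chart score into one trading signal.
# The weight and thresholds are illustrative assumptions, not CBITS itself.
def trade_signal(sentiment_score: float, chart_score: float,
                 w: float = 0.5, long_at: float = 0.6, short_at: float = 0.4) -> str:
    """Both scores are assumed to be normalized to [0, 1]."""
    combined = w * sentiment_score + (1.0 - w) * chart_score
    if combined >= long_at:
        return "LONG"
    if combined <= short_at:
        return "SHORT"  # paired with a hedge, this supports a market-neutral book
    return "HOLD"
```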
In this study, we conduct a pioneering and comprehensive examination of ChatGPT’s (GPT-3.5 Turbo) capabilities within the realm of Korean Grammatical Error Correction (K-GEC). Given the Korean language’s agglutinative nature and its rich linguistic intricacies, the task of accurately correcting errors while preserving Korean-specific sentiments is notably challenging. Utilizing a systematic categorization of Korean grammatical errors, we delve into a meticulous, case-specific analysis to identify the strengths and limitations of a ChatGPT-based correction system. We also critically assess influential parameters like temperature and specific error criteria, illuminating potential strategies to enhance ChatGPT’s efficacy in K-GEC tasks. Our findings offer valuable contributions to the expanding domain of NLP research centered on the Korean language.
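For reference, a minimal sketch of querying GPT-3.5 Turbo for K-GEC with an explicit temperature is given below, assuming the official openai Python client; the prompt wording and temperature value are illustrative, not the study's exact setup.

```python
# Sketch: ask gpt-3.5-turbo to correct a Korean sentence at a given temperature.
# Prompt wording and the temperature value are illustrative, not the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def correct_korean(sentence: str, temperature: float = 0.2) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,  # lower values make corrections more deterministic
        messages=[
            {"role": "system",
             "content": "Correct the grammar of the given Korean sentence. "
                        "Return only the corrected sentence."},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()
```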
Video scene segmentation is an important research topic in the field of computer vision because it enables efficient storage, indexing and retrieval of videos. This kind of scene segmentation cannot be achieved by merely calculating the similarity of low-level features present in the video; high-level features should also be considered to achieve better performance. Even though much research has been conducted on video scene segmentation, most of these studies have failed to semantically segment a video into scenes. Thus, in this study, we propose a Deep-learning Semantic-based Scene-segmentation model (called DeepSSS) that uses image captioning to segment a video into scenes semantically. First, DeepSSS performs shot boundary detection by comparing colour histograms and then applies maximum-entropy-based keyframe extraction. Second, for semantic analysis, it generates a semantic text description of each keyframe using deep-learning-based image captioning. Finally, by comparing and analysing the generated texts, it assembles the keyframes into scenes grouped under a semantic narrative. In this way, DeepSSS considers both low- and high-level features of videos to achieve a more meaningful scene segmentation. By applying DeepSSS to data sets from MS COCO for caption generation and evaluating its semantic scene-segmentation results on data sets from TRECVid 2016, we demonstrate quantitatively that DeepSSS outperforms existing scene-segmentation methods that use shot boundary detection and keyframes. Furthermore, we compared scenes segmented by humans with those segmented by DeepSSS; the results verified that DeepSSS’s segmentation resembles that of humans. This result was enabled by semantic analysis and would be impossible using only low-level features of videos.
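As an illustration of the first stage of the pipeline, here is a minimal sketch of histogram-based shot boundary detection with OpenCV; the HSV binning and correlation threshold are assumptions, not DeepSSS's exact parameters.

```python
# Sketch: shot boundary detection by comparing colour histograms of frames.
# The HSV bins and threshold are assumptions, not DeepSSS's exact parameters.
import cv2

def shot_boundaries(video_path: str, threshold: float = 0.5) -> list:
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        # Histogram correlation drops sharply when the shot changes.
        if prev_hist is not None and cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
            boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```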
Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering that retains only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post-editing and the application of various strategies during decoding in the translation process. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when utilizing PFA to process low-resource languages, as PFA requires large amounts of data, and the data for low-resource languages are often insufficient. Building on the premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, which relies on a low-resource language pair. Through comparative experiments, we demonstrated that translation performance can be enhanced without changes to the model. We experimentally examined how performance changed in response to beam size changes and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results showed that these decoding strategies enhance performance and compare well with previous Korean–English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models without the use of PFA, presenting a new perspective for improving machine translation performance.
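The decoding strategies examined (beam size, n-gram blocking, length penalty) map directly onto generation arguments in, for example, the Hugging Face transformers API; the checkpoint below is a placeholder, not the system used in the paper.

```python
# Sketch: the examined decoding strategies expressed as generate() arguments.
# The checkpoint is a placeholder, not the paper's model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ko-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ko-en")

inputs = tokenizer("번역할 한국어 문장입니다.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,             # beam size
    no_repeat_ngram_size=3,  # n-gram blocking
    length_penalty=1.2,      # > 1.0 favours longer hypotheses
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```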
Fake news (disinformation with malicious intent) has emerged as a major social problem. To address this issue, previous studies mainly utilized a single source of information, the news content, to detect fake news. However, using only news content in training is insufficient. Moreover, most studies did not consider the propagation aspect of fake news as a training feature. Thus, to incorporate the ability to learn representations from both textual information and social context, this study proposes K-FANG, a fake news detection algorithm that thoroughly utilizes the user graph for Korean fake news, together with dataset construction methods. In addition, a training strategy for utilizing the user graph in Korean fake news detection was developed through comparative and ablation studies. The experimental results showed that K-FANG outperformed the baseline in detecting fake news. Moreover, user engagements were found to be useful for detecting fake news even when the data contained hate speech. Finally, the validity of using stance information by expanding its classes and controlling class imbalance was also verified. This study provides useful implications for utilizing user information in fake news detection.
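To illustrate learning over a user-engagement graph, here is a hedged sketch of a graph convolutional classifier using PyTorch Geometric; the two-layer GCN and mean pooling are assumptions, not the K-FANG architecture.

```python
# Sketch: classify a news propagation graph of user engagements with a GCN.
# The two-layer GCN and mean pooling are assumptions, not K-FANG itself.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class EngagementGCN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, num_classes: int = 2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.out = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        # x: node features (users and the news item); edge_index: engagement edges
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        h = global_mean_pool(h, batch)  # one vector per propagation graph
        return self.out(h)              # logits: real vs. fake
```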
Translation of the languages of ancient times can serve as a source of content for various digital media and can be helpful in fields such as the study of natural phenomena, medicine, and science. Owing to these needs, there has been a global movement to translate ancient languages, but experts are required for this purpose. It is difficult to train language experts, and more importantly, manual translation is a slow process. Consequently, the recovery of ancient characters using machine translation has recently been investigated, but there is currently no literature on the machine translation of ancient Korean. This paper proposes the first ancient Korean neural machine translation model, based on a Transformer. This model can improve the efficiency of a translator by quickly providing draft translations for a number of untranslated ancient documents. Furthermore, a new subword tokenization method called Share Vocabulary and Entity Restriction Byte Pair Encoding is proposed based on the characteristics of ancient Korean sentences. The proposed method improves on conventional subword tokenization methods such as byte pair encoding by 5.25 BLEU points. In addition, various decoding strategies such as n-gram blocking and ensemble models further improve the performance by 2.89 BLEU points. The model has been made publicly available as a software application.
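A hedged approximation of the shared-vocabulary, entity-restricted BPE idea with SentencePiece is sketched below; the file names and entity list are placeholders, and `user_defined_symbols` only approximates the proposed entity restriction.

```python
# Sketch: shared-vocabulary BPE that keeps named entities unsplit.
# File names and entities are placeholders; `user_defined_symbols` only
# approximates the proposed entity restriction.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="ancient_and_modern_korean.txt",    # source and target mixed: shared vocabulary
    model_prefix="shared_bpe",
    vocab_size=32000,
    model_type="bpe",
    user_defined_symbols=["世宗", "訓民正音"],  # entities kept whole, never split
)

sp = spm.SentencePieceProcessor(model_file="shared_bpe.model")
print(sp.encode("世宗이 訓民正音을 창제하였다", out_type=str))
```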
A commonsense question answering (CSQA) system predicts the right answer based on a comprehensive understanding of the question. Previous research has developed models that use QA pairs, the corresponding evidence, or a knowledge graph as input. Each method executes QA tasks using representations from pre-trained language models. However, the ability of pre-trained language models to comprehend questions completely remains debatable. In this study, adversarial attack experiments were conducted on question understanding. We examined the restrictions on the question-reasoning process of the pre-trained language model and then demonstrated the need for models to use the logical structure of abstract meaning representations (AMRs). Additionally, the experimental results demonstrated that the method performed best when the AMR graph was extended with ConceptNet. With this extension, our proposed method outperformed the baseline in diverse commonsense-reasoning QA tasks.
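An illustrative sketch of extending an AMR graph with ConceptNet edges is given below; the triple format and the `conceptnet_neighbors` helper are hypothetical, introduced only to show the shape of the extension.

```python
# Sketch: extend an AMR graph with ConceptNet edges.
# `conceptnet_neighbors` is a hypothetical helper returning (relation, concept)
# pairs for a node; the (head, relation, tail) triple format is also assumed.
import networkx as nx

def extend_with_conceptnet(amr_triples, conceptnet_neighbors, per_node: int = 5):
    graph = nx.DiGraph()
    for head, relation, tail in amr_triples:
        graph.add_edge(head, tail, relation=relation)
    # Attach external commonsense edges to every AMR concept node.
    for node in list(graph.nodes):
        for relation, neighbor in conceptnet_neighbors(node)[:per_node]:
            graph.add_edge(node, neighbor, relation=f"conceptnet:{relation}")
    return graph
```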