We propose a method to rank and retrieve image sequences from a natural language text query, consisting of multiple sentences or paragraphs. One of the method's key applications is to visualize visitors' text-only reviews on TRIPADVISOR or YELP, by automatically retrieving the most illustrative image sequences. While most previous work has dealt with the relations between a natural language sentence and an image or a video, our work extends to the relations between paragraphs and image sequences. Our approach leverages the vast user-generated resource of blog posts and photo streams on the Web. We use blog posts as text-image parallel training data that co-locate informative text with representative images that are carefully selected by users. We exploit large-scale photo streams to augment the image samples for retrieval. We design a latent structural SVM framework to learn the semantic relevance relations between text and image sequences. We present both quantitative and qualitative results on the newly created DISNEYLAND dataset.
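To make the ranking formulation concrete, here is a minimal sketch of scoring a (paragraph, image-sequence) pair with a latent alignment; the bilinear compatibility features and the per-sentence argmax are illustrative assumptions, not the paper's exact feature map.

```python
import numpy as np

def relevance_score(w, text_feats, image_feats):
    """Score a (paragraph, image-sequence) pair under a latent-alignment
    linear model: each sentence is assigned to its best-matching image
    (the latent variable), and the per-sentence scores are summed.

    text_feats:  (n_sentences, d) sentence feature vectors
    image_feats: (n_images, d)    image feature vectors
    w:           (d, d)           bilinear compatibility weights (assumed)
    """
    compat = text_feats @ w @ image_feats.T      # (n_sentences, n_images)
    return compat.max(axis=1).sum()              # max over latent alignments

def rank_sequences(w, text_feats, candidate_seqs):
    """Return indices of candidate image sequences, most relevant first."""
    scores = [relevance_score(w, text_feats, seq) for seq in candidate_seqs]
    return np.argsort(scores)[::-1]
```

Training would then tune w so that ground-truth sequences outscore the rest by a margin, in the usual structural-SVM fashion.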
We propose an approach that utilizes large collections of photo streams and blog posts, two of the most prevalent sources of data on the Web, for joint story-based summarization and exploration. Blogs consist of sequences of images and associated text; they portray events and experiences with concise sentences and representative images. We leverage blogs to help achieve story-based semantic summarization of collections of photo streams. In the opposite direction, blog posts can be enhanced with sets of photo streams by showing interpolations between consecutive images in the blogs. We formulate the problem of joint alignment from blogs to photo streams and photo stream summarization in a unified latent ranking SVM framework. We alternate between solving the two coupled latent SVM problems, by first fixing the summarization and solving for the alignment from blog images to photo streams and vice versa. On a newly collected large-scale Disneyland dataset of 10K blogs (120K associated images) and 6K photo streams (540K images), we demonstrate that blog posts and photo streams are mutually beneficial for summarization, exploration, semantic knowledge transfer, and photo interpolation.
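A minimal sketch of the alternating scheme follows; solve_alignment, solve_summarization, and init_summary are hypothetical callables standing in for the two latent ranking-SVM subproblem solvers, which are not spelled out here.

```python
def joint_align_and_summarize(blogs, photo_streams,
                              solve_alignment, solve_summarization,
                              init_summary, n_iters=10):
    """Alternate between the two coupled latent problems: fix the
    summaries to solve for the blog-to-stream alignment, then fix the
    alignment to re-solve the summarization, and repeat."""
    summaries = [init_summary(s) for s in photo_streams]
    alignment = None
    for _ in range(n_iters):
        # Step 1: with summaries fixed, align each blog image to a photo.
        alignment = solve_alignment(blogs, summaries)
        # Step 2: with the alignment fixed, re-select summary photos so
        # that blog-supported photos are ranked higher.
        summaries = solve_summarization(photo_streams, alignment)
    return alignment, summaries
```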
Proactive learning extends active learning by considering multiple labelers with different accuracies and costs, thus optimizing labeler selection as well as instance selection. In this paper, we propose a novel method to estimate labeler accuracy per class and to select labelers based on both cost and estimated accuracy, combined with an ensemble approach called multi-class information density (MCID) as a selection criterion. Our approach relaxes the common assumption found in past work that labeler accuracy is independent of class for multi-class learning, and, by estimating class-conditional accuracy, better assigns instances to labelers. Results on several datasets with both real and simulated experts strongly demonstrate the efficacy of these methods.
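As an illustration of such a selection rule, the sketch below picks the (instance, labeler) pair with the best expected value per unit cost; the utility term stands in for the MCID information-density criterion, and the array shapes are assumptions for the example.

```python
import numpy as np

def select_instance_and_labeler(utilities, acc_est, costs, pred_class):
    """Pick the (instance, labeler) pair maximizing expected value per
    unit cost (a sketch, not the paper's exact objective).

    utilities:  (n_instances,)          informativeness of each instance
    acc_est:    (n_labelers, n_classes) estimated per-class labeler accuracy
    costs:      (n_labelers,)           cost of querying each labeler
    pred_class: (n_instances,)          current model's predicted class
    """
    best, best_val = None, -np.inf
    for i, u in enumerate(utilities):
        c = pred_class[i]
        for j in range(len(costs)):
            # Class-conditional accuracy: a labeler may be reliable for
            # some classes and unreliable for others.
            val = u * acc_est[j, c] / costs[j]
            if val > best_val:
                best, best_val = (i, j), val
    return best
```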
This paper introduces the Tenth Dialog System Technology Challenge (DSTC-10). This edition of the DSTC focuses on applying end-to-end dialog technologies for five distinct tasks in dialog systems, namely 1. Incorporation of Meme images into open domain dialogs, 2. Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations, 3. Situated Interactive Multimodal dialogs, 4. Reasoning for Audio Visual Scene-Aware Dialog, and 5. Automatic Evaluation and Moderation of Open-domain Dialogue Systems. This paper describes the task definition, provided datasets, baselines, and evaluation setup for each track. We also summarize the results of the submitted systems to highlight the general trends of the state-of-the-art technologies for the tasks.
This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems, namely, 1. Task-oriented dialog Modeling with Unstructured Knowledge Access, 2. Multi-domain task-oriented dialog, 3. Interactive evaluation of dialog, and 4. Situated interactive multimodal dialog. This paper describes the task definition, provided datasets, baselines, and evaluation setup for each track. We also summarize the results of the submitted systems to highlight the general trends of the state-of-the-art technologies for the tasks.
Inertial Measurement Units (IMUs) are small, low-cost sensors that can measure accelerations and angular velocities, making them valuable tools for a variety of applications, including robotics, virtual reality, and healthcare. With the advent of deep learning, there has been a surge of interest in using IMU data to train DNN models for various applications. In this paper, we survey state-of-the-art machine learning models, including deep neural network models, and their applications for IMU sensors. We first provide an overview of IMU sensors and the types of data they generate. We then review the most popular models for IMU data, including convolutional neural networks, recurrent neural networks, and attention-based models. We also discuss the challenges associated with training deep neural networks on IMU data, such as data scarcity, noise, and sensor drift. Finally, we present a comprehensive review of the most prominent applications of deep neural networks for IMU data, including human activity recognition, gesture recognition, gait analysis, and fall detection. Overall, this survey provides a comprehensive overview of the state-of-the-art deep neural network models and applications for IMU sensors and highlights the challenges and opportunities in this rapidly evolving field.
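For concreteness, here is a minimal PyTorch sketch of one of the surveyed model families, a 1-D CNN classifying windowed IMU signals; it is a generic example, not a specific model from the survey, and the channel/window sizes are assumptions.

```python
import torch
import torch.nn as nn

class IMUConvNet(nn.Module):
    """Small 1-D CNN for windowed IMU data. Input: (batch, channels, time),
    e.g. 6 channels for a 3-axis accelerometer plus a 3-axis gyroscope."""
    def __init__(self, in_channels=6, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over time -> fixed-size vector
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).squeeze(-1))

# e.g. a batch of 2-second windows sampled at 50 Hz:
logits = IMUConvNet()(torch.randn(8, 6, 100))   # -> (8, 6) class scores
```

Recurrent or attention-based variants would replace the convolutional feature extractor while keeping the same windowed-input interface.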
Large Language Models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to improve through non-expert interactions with their environments. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the connecting bridge between both domains. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summarization of the scene, eliminating the need for human involvement in the learning loop and making it more practical for real-world robot learning tasks. Our proposed paradigm consists of two modules: the SUM module, which interprets the environment using visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies based on the natural language descriptions provided by the SUM module. We demonstrate that our proposed method can employ two fine-tuning strategies, including imitation learning and reinforcement learning approaches, to adapt to the target test tasks effectively. We conduct extensive experiments involving various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.
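The interaction loop between the two modules can be sketched as follows; sum_model.summarize, apm_model.act, and the environment interface are assumed for illustration and are not the paper's released API.

```python
def act_in_environment(env, sum_model, apm_model, max_steps=50):
    """Run the SUM -> APM loop: SUM turns the current visual observation
    into a text summary, APM maps that summary to an executable action."""
    obs = env.reset()                             # image-based observation
    for _ in range(max_steps):
        summary = sum_model.summarize(obs)        # e.g. "a mug is on the table"
        action = apm_model.act(summary)           # e.g. "[grab] <mug>"
        obs, done = env.step(action)              # assumed (obs, done) interface
        if done:
            break
    return obs
```

Either module can then be fine-tuned, by imitation on expert trajectories or by reinforcement from task rewards, without changing this loop.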
Recent years have seen a rapid increase in the volume of personal media captured by users, thanks to the advent of smartphones and smart glasses, resulting in large media collections. Although conversation is an intuitive human-computer interface, current efforts focus mostly on single-shot natural language based media retrieval to help users query their media and re-live their memories. This severely limits the search functionality, as users can neither ask follow-up queries nor obtain information without first formulating a single-turn query. In this work, we propose dialogs for connected memories as a powerful tool to empower users to search their media collection through a multi-turn, interactive conversation. Towards this, we collect a new task-oriented dialog dataset, COMET, which contains 11.5k user-assistant dialogs (totaling 103k utterances), grounded in simulated personal memory graphs. We employ a resource-efficient, two-phase data collection pipeline that uses: (1) a novel multimodal dialog simulator that generates synthetic dialog flows grounded in memory graphs, and (2) manual paraphrasing to obtain natural language utterances. We analyze COMET, formulate four main tasks to benchmark meaningful progress, and adopt state-of-the-art language models as strong baselines, in order to highlight the multimodal challenges captured by our dataset.
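A toy sketch of the first pipeline phase is given below; the memory-graph schema and attribute names are invented for illustration and do not reflect the COMET data format.

```python
import random

def simulate_dialog_flow(memory_graph, n_turns=4):
    """Generate a synthetic dialog flow grounded in a memory graph: each
    turn queries one attribute of one memory, and the answer is read off
    the graph. Flows like these would later be manually paraphrased."""
    turns = []
    for _ in range(n_turns):
        memory = random.choice(list(memory_graph))
        attr = random.choice(["time", "location", "participants"])
        user = f"ask({attr}) about {memory}"            # templated user turn
        assistant = memory_graph[memory].get(attr, "unknown")
        turns.append((user, assistant))
    return turns

# e.g. with a trivial single-memory graph:
graph = {"beach_trip": {"time": "July 2021", "location": "Santa Cruz"}}
print(simulate_dialog_flow(graph, n_turns=2))
```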
Virtual Intelligent Assistants take user requests by voice, perform actions such as setting an alarm, turning on a light, or answering a question, and provide answers or confirmations by voice or through other channels such as a screen. Assistants have become prevalent in the past decade, and users have come to rely on assistants like Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana.
The emergence of AR/VR devices has raised many new challenges for building intelligent assistants. The unique requirements have inspired new research directions such as (a) understanding users' situated multi-modal contexts (e.g., vision, sensor signals) as well as language-oriented conversational contexts, (b) personalizing the assistant services by grounding interactions on growing public and personal knowledge graphs and online search engines, and (c) on-device model inference and training techniques that satisfy strict resource and privacy constraints.
In this tutorial, we will provide an in-depth walk-through of techniques in the aforementioned areas from the recent literature. We aim to introduce these techniques to researchers and practitioners who are building intelligent assistants, and to inspire research that will bring us one step closer to realizing the dream of building an all-day accompanying assistant. Additionally, we will highlight the significant role that Large Language Models (LLMs) play in enhancing these strategies, underscoring their potential to reshape the future landscape of intelligent assistance.