Learning from audio-visual data offers many ways to express the correspondence between aural and visual content, mirroring how human perception relates the two. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than audio-visual correspondence (AVC). In addition to correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selecting visual objects with an object detector and beamforming the audio signal towards the detected objects, in an attempt to learn the spatial alignment between objects and the sounds they produce. We investigate the use of spatial audio features to represent the audio input, as well as different audio formats: Ambisonics, mono, and stereo. Experimental results show a 10% improvement on AVSA for the first-order ambisonics intensity vector (FOA-IV) compared with log-mel spectrogram features; adding object-oriented crops also brings significant performance gains on the human action recognition downstream task. A number of audio-only downstream tasks are devised to test the effectiveness of the learnt audio representation, obtaining performance comparable to state-of-the-art methods on acoustic scene classification from ambisonic and binaural audio.
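As a rough illustration of the FOA-IV feature mentioned above, the sketch below computes a first-order-ambisonics intensity vector from B-format STFTs. The channel ordering, normalisation, and function name are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from scipy.signal import stft

def foa_intensity_vector(b_format, fs=48000, n_fft=1024):
    """Sketch of first-order-ambisonics intensity vector (FOA-IV) features.

    b_format: array of shape (4, n_samples), channels assumed ordered
    W, X, Y, Z. Returns a per-bin direction feature of shape
    (3, n_freq, n_frames).
    """
    # STFTs of the omni (W) and the three dipole (X, Y, Z) channels
    _, _, spec = stft(b_format, fs=fs, nperseg=n_fft)  # (4, n_freq, n_frames)
    w, xyz = spec[0], spec[1:]

    # Active acoustic intensity: real part of conj(W) times each dipole channel
    intensity = np.real(np.conj(w)[None] * xyz)

    # Normalise each time-frequency bin to unit length so the feature
    # encodes direction of arrival rather than signal level
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + 1e-8
    return intensity / norm
```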
For classical Philips-style audio fingerprint retrieval, the short duration of inserted template audio and its long silent periods pose a major challenge to robustness in real environments. In this study, a novel audio retrieval method is proposed to address this challenge by modifying both the fingerprinting stage and the matching stage. During fingerprint extraction, silent segments are first detected; a reserved fingerprint is then assigned to these segments so they can be distinguished. In the matching stage, a window-by-window search is performed to locate the inserted audio templates. Moreover, each search window is divided into several segments for a more precise comparison between the template audio and the test audio. A test dataset is constructed by randomly varying the duration of the inserted template audio between 3 and 5 s. Experimental results show that the proposed method significantly improves mean average precision and recall.
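The two modifications lend themselves to a compact sketch. The following is a minimal illustration assuming 32-bit Philips-style frame fingerprints; the reserved silence code, thresholds, and function names are hypothetical rather than the authors' implementation.

```python
import numpy as np

SILENCE_FP = 0xFFFFFFFF  # reserved 32-bit code marking a silent frame (illustrative)

def frame_fingerprints(fingerprints, energies, silence_thresh=1e-4):
    """Assign each frame a fingerprint; silent frames get the reserved code.

    fingerprints: precomputed 32-bit Philips-style codes, one per frame
    energies: per-frame signal energy used for silence detection
    """
    fps = fingerprints.copy()
    fps[energies < silence_thresh] = SILENCE_FP
    return fps

def _bit_error_rate(a, b):
    xor = np.bitwise_xor(a, b)
    bits = np.unpackbits(xor.astype(">u4").view(np.uint8))  # popcount
    return bits.sum() / (32 * len(a))

def windowed_search(template, test, n_segments=4, max_bit_errors=0.35):
    """Slide the template over the test stream window by window; each window
    is split into segments that are compared by Hamming distance."""
    t_len = len(template)
    seg_bounds = np.linspace(0, t_len, n_segments + 1, dtype=int)
    hits = []
    for start in range(len(test) - t_len + 1):
        window = test[start:start + t_len]
        # a window matches only if every segment stays under the bit-error rate
        ok = all(
            _bit_error_rate(template[a:b], window[a:b]) < max_bit_errors
            for a, b in zip(seg_bounds[:-1], seg_bounds[1:])
        )
        if ok:
            hits.append(start)
    return hits
```

Comparing segment by segment keeps a badly mismatched region of the window from being averaged away by well-matched regions elsewhere, which is what makes short, partly silent templates detectable.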
The term "immersive audio" is frequently used to describe an audio experience that provides the listener the sensation of being fully immersed or "present" in a sound scene. This can be achieved via ...different presentation modes, such as surround sound (several loudspeakers horizontally arranged around the listener), 3D audio (with loudspeakers at, above, and below listener ear level), and binaural audio to headphones. This article provides an overview of two recent standards that support the bitrate-efficient carriage of high-quality immersive sound. The first is MPEG-H 3D audio, which is a versatile standard that supports multiple immersive sound signal formats (channels, objects, and higher order ambisonics) and is now being adopted in broadcast and streaming applications. The second is MPEG-I immersive audio, an extension of 3D audio, currently under development, which is targeted for virtual and augmented reality applications. This will support rendering of fully user-interactive immersive sound for three degrees of user movement three degrees of freedom (3DoF), i.e., yaw, pitch, and roll head movement, and for six degrees of user movement six degrees of freedom (6DoF), i.e., 3DoF plus translational <inline-formula> <tex-math notation="LaTeX">{x} </tex-math></inline-formula>, <inline-formula> <tex-math notation="LaTeX">{y} </tex-math></inline-formula>, and <inline-formula> <tex-math notation="LaTeX">{z} </tex-math></inline-formula> user position movements.
Short-range audio channels have appealing characteristics: ease of use, low deployment cost, and easily tunable frequencies, to cite a few. Moreover, thanks to their seamless adaptability to the security context, many techniques and tools based on audio signals have recently been proposed. However, while the most promising solutions are turning into valuable commercial products, acoustic channels are also increasingly used to launch attacks against systems and devices, leading to security concerns that could thwart their adoption. To provide a rigorous, scientific, security-oriented review of the field, in this paper we survey and classify methods, applications, and use cases rooted in short-range audio channels for the provisioning of security services, including two-factor authentication techniques, pairing solutions, device authorization strategies, defense methodologies, and attack schemes. Moreover, we point out the strengths and weaknesses deriving from the use of short-range audio channels. Finally, we identify open research issues in the security of short-range audio channels, calling for contributions from both academia and industry.
Wav2CLIP: Learning Robust Audio Representations from CLIP Wu, Ho-Hsiang; Seetharaman, Prem; Kumar, Kundan ...
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2022-May-23
Conference Proceeding
Open access
We propose Wav2CLIP, a robust audio representation learning method based on distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks, including classification, retrieval, and generation, and show that it can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, enabling multimodal applications such as zero-shot classification and cross-modal retrieval. Furthermore, Wav2CLIP needs just ∼10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods since it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as a qualitative assessment of the shared embedding space. Our code and model weights are open-sourced and made available for further applications.
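As a sketch of what a shared audio-text embedding space enables, the snippet below performs zero-shot classification by cosine similarity between an audio embedding and text prompt embeddings. The encoder interfaces and the prompt template are assumptions for illustration, not the released Wav2CLIP API.

```python
import numpy as np

def zero_shot_classify(audio_embedding, class_names, text_encoder):
    """Zero-shot audio classification in a shared audio-text embedding space.

    audio_embedding: 1-D vector from an audio encoder distilled against CLIP
    text_encoder: callable mapping a string to a vector in the same space
    (both are assumed interfaces, not the released Wav2CLIP code).
    """
    prompts = [f"the sound of a {name}" for name in class_names]
    text_emb = np.stack([text_encoder(p) for p in prompts])

    # cosine similarity between the audio clip and every class prompt
    a = audio_embedding / np.linalg.norm(audio_embedding)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    scores = t @ a
    return class_names[int(np.argmax(scores))]
```

Because no audio-labelled training data is needed for a new label set, the same trick extends to cross-modal retrieval by ranking candidates with the same similarity scores.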
Institutions that have collected video testimonies from the few remaining Holocaust survivors are grappling with how to continue their mission to educate and commemorate. Noah Shenker calls attention to the ways that audiovisual testimonies of the Holocaust have been mediated by the institutional histories and practices of their respective archives. Shenker argues that testimonies are shaped not only by the encounter between interviewer and interviewee, but also by technical practices and the testimony process. He analyzes the ways in which interview questions, the framing of the camera, and curatorial and programming preferences impact how Holocaust testimony is molded, distributed, and received.