Synchronization attacks are one of the key issues in digital audio watermarking. In this paper, a robust digital audio watermarking algorithm in the DWT (Discrete Wavelet Transform) and DCT (Discrete Cosine Transform) domain is presented, which can resist synchronization attacks effectively. The features of the proposed algorithm are as follows: (1) a more stable synchronization code and a new embedding strategy are adopted to resist synchronization attacks effectively; (2) the multi-resolution characteristics of the DWT and the energy-compaction characteristics of the DCT are combined to improve the transparency of the digital watermark; (3) the algorithm can extract the watermark without the help of the original digital audio signal.
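As a rough illustration of the generic DWT+DCT embedding idea described above (a sketch of the common technique, not the authors' exact algorithm), the following embeds one watermark bit per frame by quantizing a DCT coefficient of the coarsest DWT approximation band. PyWavelets and SciPy are assumed available; the wavelet, decomposition level, coefficient index, and quantization step are illustrative assumptions. Extraction reads only the coefficient's parity, so no original signal is needed, which matches the blind-extraction property claimed in the abstract.

```python
# Sketch of the generic DWT+DCT embedding idea (illustrative parameters,
# not the paper's exact algorithm).
import numpy as np
import pywt
from scipy.fft import dct, idct

def embed_bit(frame, bit, wavelet="db4", level=3, idx=10, step=0.05):
    coeffs = pywt.wavedec(frame, wavelet, level=level)  # multi-resolution DWT
    spec = dct(coeffs[0], norm="ortho")                 # energy compaction
    # Quantization index modulation: snap the chosen coefficient to an
    # even or odd multiple of `step` depending on the bit value.
    q = np.round(spec[idx] / step)
    if int(q) % 2 != bit:
        q += 1
    spec[idx] = q * step
    coeffs[0] = idct(spec, norm="ortho")
    return pywt.waverec(coeffs, wavelet)[: len(frame)]

def extract_bit(frame, wavelet="db4", level=3, idx=10, step=0.05):
    spec = dct(pywt.wavedec(frame, wavelet, level=level)[0], norm="ortho")
    return int(np.round(spec[idx] / step)) % 2  # blind: no original needed
```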
The recent rise of adversarial machine learning highlights the vulnerabilities of various systems relevant to a wide range of application domains. This paper focuses on the important domain of automatic space surveillance based on the acoustic modality. After setting up a state-of-the-art solution using a log-Mel spectrogram modeled by a convolutional neural network, we systematically investigate the following four types of adversarial attacks: a) Fast Gradient Sign, b) Projected Gradient Descent, c) Jacobian Saliency Map, and d) Carlini & Wagner ℓ∞. Experimental scenarios aiming at inducing false positives or negatives are considered, while the attacks' efficiency is thoroughly examined. It is shown that several attack types are able to reach high success rates by injecting relatively small perturbations into the original audio signals. This underlines the need for suitable and effective defense strategies, which would boost the reliability of machine learning based solutions.
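For reference, the Fast Gradient Sign attack listed in a) reduces to a single gradient step, x' = x + ε·sign(∇ₓL). A minimal sketch in PyTorch (an assumed framework; `model`, `x`, `y`, and `eps` are placeholders, not the paper's setup):

```python
# Single-step l_inf-bounded FGSM attack on a spectrogram classifier.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)  # loss w.r.t. the true labels
    loss.backward()
    # x' = x + eps * sign(grad_x L)
    return (x_adv + eps * x_adv.grad.sign()).detach()
```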
This study presents an approach for emotion classification of speech utterances based on an ensemble of support vector machines. We considered feature-level fusion of MFCCs, total energy, and F0 as input feature vectors, and chose the bagging method for classification. Additionally, we present a new emotional dataset based on a popular animated film, Finding Nemo, in which emotions are strongly emphasized to attract the attention of spectators. Speech utterances were extracted directly from the video's audio channel, including all background noise. In total, 2054 utterances from 24 speakers were annotated by a group of volunteers according to seven emotion categories. We concentrated on perceived emotion. Our approach was tested on our newly developed dataset as well as the publicly available DES and EmoDB datasets. Experiments showed that our approach achieved 77.5% and 66.8% overall accuracy for four- and five-class classification on the EFN dataset, respectively. In addition, we achieved 67.6% accuracy on the DES dataset (five classes) and 63.5% on the EmoDB dataset (seven classes) using an ensemble of SVMs with 10-fold cross-validation.
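A minimal sketch of the bagged-SVM setup with scikit-learn (an assumed library; the `estimator` keyword follows scikit-learn ≥ 1.2, and the feature files and hyperparameters are hypothetical placeholders rather than the study's configuration):

```python
# Bagged SVM ensemble over fused features, evaluated with 10-fold CV.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one row per utterance, concatenating MFCC statistics, total energy,
# and F0 statistics (feature-level fusion); y: emotion labels.
# Hypothetical precomputed feature files.
X = np.load("features.npy")
y = np.load("labels.npy")

clf = BaggingClassifier(estimator=SVC(kernel="rbf"), n_estimators=10)
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f}")
```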
The extraction of information from recorded meetings is a very important yet challenging task. The problem lies in the inability of speech recognition systems to be applied directly to meeting speech data, mainly because meeting participants speak concurrently and head-mounted microphones record more than just their wearers' utterances: crosstalk from their neighbours is inevitably recorded as well. As a result, a degree of preprocessing of these recordings is needed. The current work presents an approach to segment meetings into four audio classes: single speaker, crosstalk, single speaker plus crosstalk, and silence. For this purpose, we propose two-layer cascaded subband filters, which are spread according to the pitch and formant frequency scales. These filters are able to detect the presence or absence of pitch and formants in an audio signal. In addition, the filters can determine how many pitches and formants are present in an audio signal based on the output subband energies. Experiments conducted on the ICSI meeting corpus show that although an overall recognition rate of up to 57% was achieved, rates for the crosstalk and silence classes are as high as 80%. This indicates the positive effect and potential of this subband feature in meeting segmentation tasks.
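The subband-energy idea can be illustrated roughly as follows (a generic sketch, not the exact two-layer cascade; the band edges and filter order are illustrative assumptions):

```python
# Band-pass the signal in pitch- and formant-range subbands and compare
# the output energies; high energy in a band suggests activity there.
import numpy as np
from scipy.signal import butter, sosfilt

def subband_energies(x, fs, bands=((80, 300), (300, 1000), (1000, 3500))):
    energies = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        y = sosfilt(sos, x)
        energies.append(float(np.sum(y ** 2)))
    return energies
```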
Various features generated from raw audio signals can be used as the input of a deep learning model. They include hand-crafted features such as mel-frequency cepstral coefficients, two-dimensional time-frequency representations, and raw audio data. In most cases, the time-frequency representations are so-called spectrogram-based images. Having an image at the deep learning input makes it possible to apply the performance improvements accumulated in image and video processing. However, spectrogram-based images have some specific properties that should be taken into account when a deep learning model is designed. This paper deals with the mapping of audio signals into the most common spectrogram-based images. Some unique properties of these images, as well as the way they are generated, are analyzed here for the particular case of fridge sounds.
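A minimal sketch of such a mapping using librosa (an assumed library; the file name and the frame/hop/mel settings are illustrative defaults, not necessarily those analyzed in the paper):

```python
# Map an audio signal to a log-Mel spectrogram "image".
import numpy as np
import librosa

y, sr = librosa.load("fridge.wav", sr=None)  # hypothetical recording
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                   hop_length=512, n_mels=128)
img = librosa.power_to_db(S, ref=np.max)     # log scaling -> dB values
# `img` is a 2-D array (mel bands x frames) that can be fed to a CNN
# like an ordinary grayscale image after normalization.
```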
In this paper, to automatically generate musical thumbnails that contain the main part of the original tune, we propose a new estimation method for identifying structure changes in stereo tunes based on localization information. The proposed method can estimate the main parts of a musical tune by analyzing the specific times when the localization information changes, under the assumption that the time at which the localization changes approximately corresponds to the timing of a musical structure change. We evaluate the effectiveness of the proposed method through objective and subjective assessments. The experimental results show that the proposed method is effective in automating musical structure analysis for generating musical thumbnails.
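One plausible realization of such a localization cue (an assumption for illustration, not the paper's exact definition) is to track the inter-channel level difference per frame and flag large jumps as candidate structure-change times:

```python
# Flag frames where the left/right level balance jumps sharply.
import numpy as np

def localization_changes(left, right, frame=4096, thresh_db=3.0):
    n = min(len(left), len(right)) // frame
    ild = []
    for i in range(n):
        l = left[i * frame:(i + 1) * frame]
        r = right[i * frame:(i + 1) * frame]
        # Inter-channel level difference in dB (eps avoids log(0)).
        ild.append(10 * np.log10((np.sum(l**2) + 1e-12) /
                                 (np.sum(r**2) + 1e-12)))
    ild = np.array(ild)
    return np.where(np.abs(np.diff(ild)) > thresh_db)[0] + 1  # frame indices
```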
Audio Secret Sharing for 1-Bit Audio. Nishimura, Ryouichi; Fujita, Norihiro; Suzuki, Yôiti. In: Knowledge-Based Intelligent Information and Engineering Systems (peer-reviewed book chapter).
In this paper, we propose a new secret sharing scheme (SSS) for audio signals, called "Binary Audio Secret Sharing (BASS)." SSS is an encryption method that produces n shares from original data in order to hide useful information. Applying SSS to audio communications on the Internet can help make them more robust against theft and eavesdropping. We therefore focused on the 1-bit audio format and applied SSS to 1-bit audio signals to realize audio secret sharing. Moreover, we propose a method that makes each share audible as an intended sound.
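For intuition, a plain (n, n) XOR secret-sharing scheme over a bitstream can be sketched as follows; this illustrates generic SSS applied to 1-bit audio samples, not the paper's BASS construction with audible shares:

```python
# (n, n) XOR secret sharing: n-1 random bitstreams plus one XOR
# "remainder" reconstruct the secret only when ALL shares are combined.
import numpy as np

def make_shares(bits, n=3, rng=np.random.default_rng()):
    # bits: array of 0/1 samples (dtype uint8) from a 1-bit audio stream.
    shares = [rng.integers(0, 2, size=bits.shape, dtype=np.uint8)
              for _ in range(n - 1)]
    last = bits.copy()
    for s in shares:
        last ^= s
    return shares + [last]

def reconstruct(shares):
    out = np.zeros_like(shares[0])
    for s in shares:
        out ^= s
    return out
```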
An audio watermarking scheme using a neural network is presented in this paper. The hidden watermark, which is the combination of a chaotic sequence and a watermark sequence derived from the original watermark image, is embedded into the DCT coefficients of the original audio signal. Meanwhile, to improve robustness against de-synchronization attacks, a synchronization code is embedded into the original audio signal in the time domain. To extract the watermark sequence, we first select the DCT coefficients corresponding to the pseudorandom sequence as the training samples, which are used to train the neural network. The DCT coefficients relating to the watermark sequence are then taken as the validation samples. Experimental results show that the proposed method has good robustness under common signal manipulations.
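The time-domain synchronization step can be illustrated by a generic sliding-correlation detector (a sketch of the common technique, not the paper's embedding rule; the threshold is an illustrative assumption):

```python
# Slide a known synchronization code over the signal and mark positions
# where the normalized correlation exceeds a threshold.
import numpy as np

def find_sync(signal, code, thresh=0.8):
    code = (code - code.mean()) / (code.std() + 1e-12)
    hits = []
    for i in range(len(signal) - len(code) + 1):
        seg = signal[i:i + len(code)]
        seg = (seg - seg.mean()) / (seg.std() + 1e-12)
        if np.dot(seg, code) / len(code) > thresh:  # correlation in [-1, 1]
            hits.append(i)
    return hits
```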
Recently, many audio search sites, led by Google, have used audio fingerprinting technology to search for the same audio and protect music copyright using one part of the audio data. However, if many fingerprints are generated per audio file, the amount of query data for the audio search increases. In this paper, we propose a novel method that can reduce the number of fingerprints while providing a level of performance similar to that of existing methods. The proposed method uses the difference of Gaussians, which is often used for feature extraction in image signal processing. In our experiments, using the proposed method together with dynamic time warping, we searched for the same audio and achieved a success rate of 90%. The proposed method can therefore be used for effective audio search.
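A minimal sketch of a difference-of-Gaussians step on a spectrogram (SciPy assumed; the sigma values and peak count are illustrative assumptions, and keeping only the strongest responses is what yields a small fingerprint set):

```python
# Difference of Gaussians on a spectrogram: keep only the strongest
# band-pass responses as sparse fingerprint points.
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_peaks(spec_db, s1=1.0, s2=2.0, top_k=200):
    dog = gaussian_filter(spec_db, s1) - gaussian_filter(spec_db, s2)
    flat = np.argsort(dog.ravel())[::-1][:top_k]  # strongest responses
    # Return (freq bin, time frame) coordinates of the selected points.
    return np.column_stack(np.unravel_index(flat, dog.shape))
```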