Cepstral voices activation

A large number of modern mobile devices, embedded devices and smart home devices are equipped with voice control. Automatic recognition of the entire audio stream, however, is undesirable for reasons of resource consumption and privacy. Therefore, most of these devices use a voice activation system, whose task is to find a word or phrase specified in advance (for example, "Ok, Google") in the audio stream and to activate the voice request processing system when it is found. A voice activation system must have the following properties: high accuracy, the ability to work entirely on the device (without using remote servers), low resource consumption (primarily CPU and RAM), robustness to noise and to variability of speech, and a small delay between the pronunciation of the key phrase and the system activation. This work is a systematic literature review of voice activation systems that satisfy the above properties. We describe the operating principle of various voice activation systems and the characteristic representation of sound in such systems, consider acoustic modelling in detail and, finally, describe the approaches used to assess model quality. In addition, we point to a number of open questions in this problem.

The history of voice activation models has gone through several important stages in parallel with the more general problem of automatic speech recognition. We would like to highlight the following milestones: the first use of hidden Markov models in 1989 (Rohlicek et al., 1989); the use of neural networks since 1990 (Morgan et al., 1990, 1991; Naylor et al., 1992); the use of pattern matching approaches, in particular dynamic time warping (Zeppenfeld and Waibel, 1992); optimization of loss functions specific to voice activation, as opposed to common metrics such as accuracy, which makes the system more attractive in terms of user experience (Chang and Lippmann, 1994; Szöke et al., 2010); attempts to get rid of the garbage model (Junkawitsch et al., 1997); building voice activation systems for non-English languages such as Chinese (Zheng et al., 1999; Hao and Li, 2002), Japanese (Ida and Yamasaki, 1998) and Persian (Shokri et al., 2011); construction of discriminative systems (Keshet et al., 2009; Tabibian et al., 2011, 2013); publications describing voice activation systems in mass-market products (Chen et al., 2014a; Gruenstein et al., 2017; Guo et al., 2018; Wu et al., 2018); and the publication of open datasets for comparing different approaches (Warden, 2018).

First of all, it is worth mentioning approaches based on comparison with a template, for example using dynamic time warping (DTW). In such systems, the user first records one or several pronunciations of the keyword; incoming sound fragments are then compared with these recordings, and a trigger is announced if the selected similarity measure exceeds a prespecified threshold. The advantages of this approach include the simplicity of both learning (memorization) and operation. In addition, personalization is natural in this approach: one can argue that the recorded patterns reflect specific features of the user's pronunciation, which make it possible to distinguish that user from others if an appropriate similarity metric is used. However, in practice this approach is not very robust. The quality of its operation depends on how well the similarity measure is chosen and on what features are used. Eliminating all noise and dissimilarity between environments through an appropriate choice of features and similarity measure has proven to be difficult. DTW is one way to calculate a measure of similarity between two time series, possibly of different lengths. Systems using such approaches are described in Morgan et al. (1992), Zeppenfeld and Waibel (1992), Kosonocky and Mammone (1995), and Kurniawati et al. The different underlying similarity metrics that can be used in the DTW framework are discussed in a 2014 study, and a 2015 study discusses the possibility of using DTW even when a keyword is subject to declension, conjugation, or even word-order permutation.

Finally, we would like to mention discriminative keyword spotting, an approach introduced by Keshet et al. In this approach, instead of using an HMM or a similar model, the audio track is embedded in a feature space.
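To make the template-matching idea discussed above more concrete, the following is a minimal sketch of DTW-based triggering. It is an illustration under stated assumptions, not the method of any of the cited systems. The function names (`frame_distance`, `dtw_distance`, `is_triggered`), the Euclidean local distance, and the normalisation by the combined sequence length are all hypothetical choices made for this example. Because the sketch computes a distance rather than a similarity, the trigger fires when the distance falls below a threshold, which is equivalent to a similarity exceeding one.

```python
import math
from typing import Sequence

# One feature vector per audio frame (for example, cepstral coefficients).
Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Euclidean distance between two feature vectors (an assumed local metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(s: Sequence[Frame], t: Sequence[Frame]) -> float:
    """Length-normalised DTW alignment cost between two frame sequences."""
    n, m = len(s), len(t)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] is the minimal accumulated cost of aligning s[:i] with t[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(s[i - 1], t[j - 1])
            # Allowed steps: advance in s only, in t only, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Illustrative normalisation so that the threshold depends less on duration.
    return cost[n][m] / (n + m)


def is_triggered(
    fragment: Sequence[Frame],
    templates: Sequence[Sequence[Frame]],
    max_distance: float,
) -> bool:
    """Trigger if the fragment is close enough to any recorded template."""
    return min(dtw_distance(fragment, tpl) for tpl in templates) <= max_distance


if __name__ == "__main__":
    template = [[0.0], [1.0], [2.0], [1.0], [0.0]]
    # The same contour pronounced more slowly (time-stretched).
    slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]
    print(is_triggered(slow, [template], max_distance=0.5))
```

In this toy example the time-stretched sequence aligns with the template at zero cost, which is the property that makes DTW suitable for comparing pronunciations of different length. In a real system the threshold and the features would have to be tuned, which is where the robustness problems noted above arise.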