Authors: Shlomo Dubnov <dubnov@cs.huji.ac.il>, Shai Fine <fshai@cs.huji.ac.il>
Address: Dept. of Computer Science, The Hebrew University, 91904 Jerusalem, ISRAEL
Phone (office): +972 (2) 6585775 / 6584933
Fax: +972 (2) 6585439

Title: Stochastic Modeling and Recognition of Solo Instruments using Harmonic and Multi-Band Noise Features

Keywords: Markov chains, universal compression, sinusoidal modeling, multi-band voicing features, instrument recognition

Abstract:

In this work we extend previous results on modeling the spectro-temporal behavior of musical sounds, with applications to solo musical instrument recognition. Sound dynamics are important for instrument characterization, especially for sounds in real performance conditions, e.g. recordings of actual musical pieces. Matching dynamic sounds is a difficult problem, since standard sequence matching methods (e.g. DTW) require approximately similar temporal sequences; this condition does not hold for excerpts from different musical pieces, even when played by the same instrument.

Statistical modeling of the temporal behavior is done by effectively "quantizing" the sound into a reduced representation, i.e. a sequence of features, and treating the sound dynamics as a stochastic process over the feature domain. Given such sequences of quantized features, temporal statistics are obtained by counting the appearances of different substrings in the original sequence. Applying information-theoretic methods, random time series can then be matched by estimating the cross entropy between the corresponding stochastic models.

We conduct our research along the following milestones:

1. Better Representation using High Quality Harmonic and Noise Coding
----------------------------------------------------------------------
In earlier work we considered instantaneous sound features, such as cepstral envelopes and cepstral derivatives.
However, cepstral features capture the spectral envelope dynamics while neglecting important sound characteristics related to the excitation, or noise, part. In this work we consider sinusoidal modeling with multi-band voicing features. Complex distance measures that take phase information into account were developed and applied for quantization. These new features improve both classification and resynthesis performance.

2. (Observable) Markov Modeling and Universal Sequence Matching
----------------------------------------------------------------------
In recent years Hidden Markov Models (HMMs) have formed the basis of the most successful speech recognition systems. However, learning HMMs (i.e. estimating their parameters) suffers from several deficiencies. To name a few:

(*) The structure of the model (at least the number of states) must be known a priori.
(*) The large number of parameters that must be reliably estimated calls for a large training set, which in turn increases the training time.
(*) Learning HMMs is known to be hard, and indeed the standard learning algorithm (EM) converges only to a local extremum.

These deficiencies are often considered technicalities, and as such they are overcome by heuristic means or simply ignored. We suggest working around these problems by modeling the recognition task with (observable) Markov chains. Although the expressive power of Markov chains is inferior to that of HMMs, we argue that in some cases it is sufficient; moreover, the learning task becomes efficiently solvable using smaller training sets. By applying universal compression techniques to sequences of the harmonic features described above, we demonstrate that Markov chains suffice for reliable modeling of solo instruments.
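The matching scheme described above can be illustrated with a minimal sketch: each sound is reduced to a sequence of quantized feature symbols, a first-order (observable) Markov chain is estimated from the training sequence by counting symbol transitions, and a test excerpt is matched to the model giving the lowest cross entropy. This is an illustrative toy, not the authors' implementation; the codebook, the symbol sequences, and the instrument names are invented for the example.

```python
# Toy sketch: Markov-chain modeling of quantized feature sequences and
# matching by cross entropy. Symbol sequences here are invented data.
import math
from collections import defaultdict

def train_markov(seq, alphabet, smooth=1.0):
    """Estimate transition probabilities P(b|a) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smooth * len(alphabet)
        model[a] = {b: (counts[a][b] + smooth) / total for b in alphabet}
    return model

def cross_entropy(seq, model):
    """Average negative log-likelihood (bits per symbol) of seq under model."""
    nll = 0.0
    for a, b in zip(seq, seq[1:]):
        nll -= math.log2(model[a][b])
    return nll / (len(seq) - 1)

alphabet = "ABCD"                 # hypothetical feature codebook
flute = "ABABABCABABAB" * 5       # toy quantized training sequences
cello = "ACDCDCDACDCDC" * 5
test  = "ABABCABABABAB" * 3       # unknown excerpt with flute-like dynamics

models = {"flute": train_markov(flute, alphabet),
          "cello": train_markov(cello, alphabet)}
scores = {name: cross_entropy(test, m) for name, m in models.items()}
print(min(scores, key=scores.get))   # the lower cross entropy wins
```

Note that the test excerpt need not be temporally aligned with the training sequence, as a DTW-style matcher would require; only its transition statistics matter.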
3. Applications for Resynthesis
----------------------------------------------------------------------
Having at hand a high quality analysis/resynthesis representation, the stochastic modeling can also be used for resynthesis purposes. A natural sound exhibits a great amount of spectro-temporal variation, which is difficult to capture and model. Our method allows resynthesis using stochastic models of the sound's temporal evolution. Manipulations such as morphing and hybridization between different instruments can be performed by mixing their stochastic models, where mixing means blending the temporal behavior as well as the spectral characteristics.

4. Conclusion
----------------------------------------------------------------------
We consider advanced coding methods both for feature extraction (lossy coding) and for modeling feature dynamics (lossless coding). Testing our method on recordings of several solo musical pieces played by different instruments, we obtained excellent matching results. The synthesis results, although still suffering from quantization-related distortion, reveal interesting sound effects and give a "new meaning" to the problem of sound dynamics.
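One way the model mixing mentioned in the resynthesis section could work is sketched below: the transition tables of two Markov chains over the same feature codebook are blended by a convex combination, and a new feature sequence is sampled from the hybrid chain. This is an assumed illustration, not the authors' implementation; the transition tables are hand-written toy values.

```python
# Toy sketch: morphing by mixing two Markov chains and sampling the hybrid.
import random

def mix_models(p, q, alpha):
    """Blend two transition tables: alpha*p + (1 - alpha)*q."""
    return {a: {b: alpha * p[a][b] + (1 - alpha) * q[a][b] for b in p[a]}
            for a in p}

def sample(model, start, length, rng):
    """Generate a symbol sequence by a random walk on the chain."""
    seq, state = [start], start
    for _ in range(length - 1):
        symbols, probs = zip(*model[state].items())
        state = rng.choices(symbols, weights=probs)[0]
        seq.append(state)
    return "".join(seq)

# Hand-written toy transition tables over a two-symbol codebook {"A", "B"}:
# one chain alternates symbols, the other tends to repeat them.
flute = {"A": {"A": 0.1, "B": 0.9}, "B": {"A": 0.9, "B": 0.1}}
cello = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}

hybrid = mix_models(flute, cello, alpha=0.5)
print(sample(hybrid, "A", 20, random.Random(0)))
```

Because each hybrid row is a convex combination of probability distributions, it remains a valid distribution, so the blended chain can be sampled directly; feature-domain quantization centroids could be blended analogously to morph the spectral side.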