Authors: Shlomo Dubnov <dubnov@cs.huji.ac.il>, Shai Fine <fshai@cs.huji.ac.il>
Address: Dept. of Computer Science, The Hebrew University, 91904 Jerusalem, ISRAEL
Phone (office): +972 (2) 6585775 / 6584933
Fax: +972 (2) 6585439

Title: Stochastic Modeling and Recognition of Solo Instruments using Harmonic and Multi-Band Noise Features

Keywords: Markov chains, universal compression, sinusoidal modeling, multi-band voicing features, instrument recognition

Abstract:

In this work we extend previous results on modeling the spectro-temporal behavior of musical sounds, with applications to solo musical instrument recognition. Sound dynamics are important for instrument characterization, especially for sounds in real performance conditions, e.g. recordings of actual musical pieces. Matching dynamic sounds is a difficult problem, since standard sequence matching methods (e.g. DTW) require approximately similar temporal sequences; this condition does not hold for excerpts from different musical pieces, even when played by the same instrument.

Statistical modeling of the temporal behavior is done by effectively "quantizing" the sound into a reduced representation, i.e. a sequence of features, and treating the sound dynamics as a stochastic process over the feature domain. Given such sequences of quantized features, temporal statistics are obtained by counting the appearances of different substrings in the original sequence. Applying information-theoretic methods, random time series can then be matched by estimating the cross entropy between the corresponding stochastic models.

We conduct our research along the following milestones:

1. Better Representation using High Quality Harmonic and Noise Coding
----------------------------------------------------------------------
In earlier work we considered instantaneous sound features, such as cepstral envelopes and cepstral derivatives.
However, cepstral features capture the spectral envelope dynamics while neglecting important sound characteristics related to the excitation, or noise, part. In this work we consider sinusoidal modeling with multi-band voicing features. Complex distance measures that take phase information into account were developed and applied for quantization. These new features improve both classification and resynthesis performance.

2. (Observable) Markov Modeling and Universal Sequence Matching
----------------------------------------------------------------------
In recent years Hidden Markov Models (HMMs) have formed the basis of the most successful speech recognition systems. However, learning HMMs (i.e. estimating their parameters) suffers from several deficiencies. To name a few:

(*) The structure of the model (at least the number of states) must be known a priori.
(*) The large number of parameters that must be reliably estimated calls for a large training set, which in turn increases the training time.
(*) Learning HMMs is known to be hard, and indeed the standard learning algorithm (EM) converges only to a local extremum.

These deficiencies are often considered technicalities, and as such they are overcome by heuristic means or simply ignored. We suggest working around these problems by modeling the recognition task with (observable) Markov chains. Although the expressive power of Markov chains is inferior to that of HMMs, we argue that in some cases it is sufficient; moreover, the learning task becomes efficiently solvable using smaller training sets. By applying universal compression techniques to sequences of the harmonic features described above, we demonstrate that Markov chains suffice for reliable modeling of solo instruments.
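The matching scheme described above can be illustrated with a minimal sketch: each sound is reduced to a sequence of quantized feature symbols, a first-order (observable) Markov chain is estimated from the training sequence by counting symbol transitions, and a test excerpt is matched to the model giving the lowest cross entropy. This is an illustrative toy, not the authors' implementation; the codebook, the symbol sequences, and the instrument names are invented for the example.

```python
# Toy sketch: Markov-chain modeling of quantized feature sequences and
# matching by cross entropy. Symbol sequences here are invented data.
import math
from collections import defaultdict

def train_markov(seq, alphabet, smooth=1.0):
    """Estimate transition probabilities P(b|a) with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1.0
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + smooth * len(alphabet)
        model[a] = {b: (counts[a][b] + smooth) / total for b in alphabet}
    return model

def cross_entropy(seq, model):
    """Average negative log-likelihood (bits per symbol) of seq under model."""
    nll = 0.0
    for a, b in zip(seq, seq[1:]):
        nll -= math.log2(model[a][b])
    return nll / (len(seq) - 1)

alphabet = "ABCD"                 # hypothetical feature codebook
flute = "ABABABCABABAB" * 5       # toy quantized training sequences
cello = "ACDCDCDACDCDC" * 5
test  = "ABABCABABABAB" * 3       # unknown excerpt with flute-like dynamics

models = {"flute": train_markov(flute, alphabet),
          "cello": train_markov(cello, alphabet)}
scores = {name: cross_entropy(test, m) for name, m in models.items()}
print(min(scores, key=scores.get))   # the lower cross entropy wins
```

Note that the test excerpt need not be temporally aligned with the training sequence, as a DTW-style matcher would require; only its transition statistics matter.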
3. Applications for Resynthesis
----------------------------------------------------------------------
Having at hand a high quality analysis/resynthesis representation, the stochastic modeling can also be used for resynthesis purposes. A natural sound exhibits a great amount of spectro-temporal variation, which is difficult to capture and model. Our method allows resynthesis using stochastic models of the sound's temporal evolution. Manipulations such as morphing and hybridization between different instruments can be performed by mixing their stochastic models, where mixing means blending the temporal behavior as well as the spectral characteristics.

4. Conclusion
----------------------------------------------------------------------
We consider advanced coding methods both for feature extraction (lossy coding) and for modeling feature dynamics (lossless coding). Testing our method on recordings of several solo musical pieces played by different instruments, we obtained excellent matching results. The synthesis results, although still suffering from quantization-related distortion, reveal interesting sound effects and give a "new meaning" to the problem of sound dynamics.
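One way the model mixing mentioned in the resynthesis section could work is sketched below: the transition tables of two Markov chains over the same feature codebook are blended by a convex combination, and a new feature sequence is sampled from the hybrid chain. This is an assumed illustration, not the authors' implementation; the transition tables are hand-written toy values.

```python
# Toy sketch: morphing by mixing two Markov chains and sampling the hybrid.
import random

def mix_models(p, q, alpha):
    """Blend two transition tables: alpha*p + (1 - alpha)*q."""
    return {a: {b: alpha * p[a][b] + (1 - alpha) * q[a][b] for b in p[a]}
            for a in p}

def sample(model, start, length, rng):
    """Generate a symbol sequence by a random walk on the chain."""
    seq, state = [start], start
    for _ in range(length - 1):
        symbols, probs = zip(*model[state].items())
        state = rng.choices(symbols, weights=probs)[0]
        seq.append(state)
    return "".join(seq)

# Hand-written toy transition tables over a two-symbol codebook {"A", "B"}:
# one chain alternates symbols, the other tends to repeat them.
flute = {"A": {"A": 0.1, "B": 0.9}, "B": {"A": 0.9, "B": 0.1}}
cello = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.2, "B": 0.8}}

hybrid = mix_models(flute, cello, alpha=0.5)
print(sample(hybrid, "A", 20, random.Random(0)))
```

Because each hybrid row is a convex combination of probability distributions, it remains a valid distribution, so the blended chain can be sampled directly; feature-domain quantization centroids could be blended analogously to morph the spectral side.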