Folk song segmentation and transcription

Western folk music usually has a simple structure, typically a single melodic part (a stanza) that is repeated throughout the song. As part of our research, we developed an automatic approach for segmenting folk music recordings into these parts. We also use this approach to improve the transcription of a representative stanza from a recording.


The approach is based on a distance measure that uses dynamic time warping to cope with tempo variations and a dynamic programming approach to handle pitch drifting. The measure is used to find similarities and to estimate the length of the repeating segment. A probabilistic framework based on a hidden Markov model (HMM) is used to find segment boundaries by searching for the optimal match between the expected segment length, between-segment similarities, and likely locations of segment beginnings. We evaluate several current state-of-the-art approaches for segmentation of commercial music and expose their weaknesses when dealing with folk music, such as intolerance to pitch drift and variable tempo. The proposed method is evaluated, and its performance analyzed, on a collection of 206 folk songs of different ensemble types: solo, two- and three-voiced, choir, instrumental, and instrumental with singing. It outperforms current commercial music segmentation methods on non-instrumental music and is on a par with the best of them on instrumental recordings. The method is also comparable to a more specialized method for segmentation of solo singing folk music recordings.
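To illustrate the kind of distance measure described above, here is a minimal Python sketch. It is not the published method: it assumes each segment is already reduced to a sequence of pitch values in semitones, uses textbook dynamic time warping to absorb tempo differences, and stands in for the dynamic programming treatment of pitch drift with a much cruder search over constant pitch offsets. The function names and the `max_shift` and `step` parameters are hypothetical.

```python
def dtw_distance(a, b):
    """Textbook DTW cost between two pitch sequences (values in semitones)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # A frame may be matched, or either sequence may be stretched.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def drift_tolerant_distance(a, b, max_shift=1.0, step=0.25):
    """Minimum DTW cost over a grid of constant pitch offsets applied to b.

    Assumption: drift is approximated by one constant offset per segment,
    which is a simplification of the approach described in the text.
    """
    steps = int(round(max_shift / step))
    best = float("inf")
    for k in range(-steps, steps + 1):
        shifted = [x + k * step for x in b]
        best = min(best, dtw_distance(a, shifted))
    return best


if __name__ == "__main__":
    stanza = [60, 62, 64, 62, 60]
    # Same melody at half tempo, sung half a semitone flat.
    repeat = [59.5, 59.5, 61.5, 61.5, 63.5, 63.5, 61.5, 61.5, 59.5, 59.5]
    print(dtw_distance(stanza, repeat))
    print(drift_tolerant_distance(stanza, repeat))
```

In this toy example the plain DTW cost stays above zero because of the pitch offset, while the drift-tolerant variant can align the two renditions exactly. A low distance between two stretches of a recording is what would mark them as repetitions of the same stanza.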


We also developed a transcription method for folk music that exploits its specifics to improve transcription accuracy. In contrast to most commercial music, folk music recordings may contain various inaccuracies, as they are usually performed by amateur musicians and recorded in the field. With standard transcription approaches, these inaccuracies are reflected in erroneous pitch estimates. On the other hand, the structure of Western folk music is usually simple, as songs are often composed of repeated melodic parts. Our approach makes use of these repetitions to increase the robustness of transcription and improve its accuracy. The proposed method fuses three sources of information: (1) frame-based multiple F0 estimates, (2) song structure, and (3) pitch drift estimates. It first selects a representative segment of the analyzed song and aligns all the other segments to it, considering temporal as well as frequency deviations. Information from all segments is then summarized and used in a two-layer probabilistic model based on explicit-duration HMMs to segment the frame-based information into notes. The method is compared with state-of-the-art transcription methods, and we show that a significant improvement in accuracy can be achieved.
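The idea of summarizing information across repetitions can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method: it assumes every segment has already been time-aligned to the representative segment (so all pitch tracks have the same number of frames), that each segment has a single pitch drift estimate in semitones, and that one F0 value per frame is available. It replaces the two-layer explicit-duration HMM with a plain per-frame median. The function name `fuse_aligned_segments` is hypothetical.

```python
from statistics import median


def fuse_aligned_segments(segments, drifts):
    """Combine time-aligned pitch tracks of repeated stanzas into one track.

    segments: list of pitch tracks (semitones), all of equal length
    drifts:   estimated pitch drift of each segment relative to the
              representative segment (semitones)
    """
    if not segments or len(segments) != len(drifts):
        raise ValueError("need one drift estimate per segment")
    length = len(segments[0])
    if any(len(seg) != length for seg in segments):
        raise ValueError("segments must be aligned to the same length")
    # Remove each segment's drift so that all tracks share one reference.
    corrected = [[p - d for p in seg] for seg, d in zip(segments, drifts)]
    # The per-frame median suppresses errors that occur in only a few
    # repetitions.
    return [median(frame) for frame in zip(*corrected)]


if __name__ == "__main__":
    tracks = [
        [60.0, 62.0, 64.0],   # representative stanza
        [59.5, 61.5, 63.5],   # sung half a semitone flat
        [60.0, 67.0, 64.0],   # one erroneous pitch estimate
    ]
    print(fuse_aligned_segments(tracks, drifts=[0.0, -0.5, 0.0]))
```

The median is chosen here only because it is a simple way to show how a mistake made in one repetition can be outvoted by the others. A probabilistic model can additionally weigh note durations and competing F0 candidates.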


The collections used in both studies are publicly available:

For more details, see:

  • C. Bohak and M. Marolt, "Transcription of polyphonic vocal music with a repetitive melodic structure," AES, vol. 64, iss. 9, pp. 664-672, 2016.
  • C. Bohak and M. Marolt, "Probabilistic segmentation of folk music recordings," Mathematical Problems in Engineering, iss. 2016, pp. 1-11, 2016.