The Magic of Denoise: subjective methods of audio quality evaluation

Machine Intelligence Laboratory
Dec 11, 2020


All of us deal with a lot of audio every day, from voice messages in Telegram to lectures recorded by classmates. Wouldn’t it be nice to “turn off” all the irrelevant noise and enjoy the essence? Well, that’s possible: programs based on machine learning algorithms can do it. How can we evaluate their performance, though?

By the end of the article you will know about:

  • the advantages of manual annotation for audio quality evaluation over automatic methods;
  • two main subjective metrics applied in the audio denoising task;
  • the features of noise intensity level evaluation;
  • the ways to fight bots and dishonest annotators.

Why are subjective methods sometimes better than “objective” ones?

There are many quantitative metrics for audio evaluation that are claimed to provide objective results. However, in some cases, e.g. when we need to evaluate the intelligibility and naturalness of speech, the values these metrics produce are far from human perception. Moreover, because there are so many quantitative metrics, it is hard to decide how much weight any single one deserves. Subjective evaluation based on manual annotation helps to solve this problem.


Manual data annotation is a labour-intensive task. However, crowdsourcing platforms such as Yandex.Toloka make it possible to recruit as many annotators as needed. A large user base and a high level of automation and customization make it possible to annotate large volumes of data quickly and with high quality.

Main subjective metrics

Manual annotation relies on subjective metrics of audio quality. In the denoising task, the most significant subjective metric is considered to be the Mean Opinion Score (MOS), where:

  • Opinion Score — a subjective audio quality score from 1 to 5 points given by an annotator;
  • Mean Opinion Score — the average of the opinion scores given by N annotators.

The more annotators evaluate an audio recording, the more objective the average score becomes.
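To make the definition concrete, here is a minimal sketch of how MOS could be computed from raw opinion scores. The function name `mean_opinion_score` and the use of the standard error as a rough measure of how much the average can be trusted are our assumptions for illustration, not part of the original evaluation pipeline.

```python
import math
import statistics


def mean_opinion_score(scores):
    """Return (MOS, standard error) for a list of 1-5 opinion scores."""
    if not scores:
        raise ValueError("at least one opinion score is required")
    if any(s < 1 or s > 5 for s in scores):
        raise ValueError("opinion scores must lie between 1 and 5")
    mos = statistics.mean(scores)
    # The standard error of the mean shrinks roughly as 1 / sqrt(N),
    # which is why more annotators give a more stable average.
    if len(scores) > 1:
        sem = statistics.stdev(scores) / math.sqrt(len(scores))
    else:
        sem = float("nan")  # spread is undefined for a single score
    return mos, sem


# Hypothetical scores for one recording: 4 annotators vs. 16 annotators.
few = [4, 5, 3, 4]
many = few * 4
mos_few, sem_few = mean_opinion_score(few)
mos_many, sem_many = mean_opinion_score(many)
print(mos_few, round(sem_few, 3))
print(mos_many, round(sem_many, 3))
```

Both groups give the same MOS, but the larger group has a smaller standard error, which matches the intuition that more annotators make the average more trustworthy.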

Another subjective metric is a comparative one, referred to below as the dominance metric. It is based on a paired comparison of audio recordings processed by two different models. The annotator’s goal is to decide which recording sounds better. This way the more effective model is selected.

How and why is noise intensity evaluated?
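A paired comparison can be summarised as the share of votes each model receives. The sketch below assumes every vote is simply the label of the preferred model; the `preference_rates` helper and the `"A"`/`"B"` labels are hypothetical names used only for this example.

```python
from collections import Counter


def preference_rates(votes, model_a="A", model_b="B"):
    """Return the share of paired-comparison votes won by each model."""
    counts = Counter(votes)
    total = counts[model_a] + counts[model_b]
    if total == 0:
        raise ValueError("no votes for either model")
    return {
        model_a: counts[model_a] / total,
        model_b: counts[model_b] / total,
    }


# Hypothetical votes from eight annotators comparing the same pair of outputs.
votes = ["A", "B", "A", "A", "B", "A", "A", "A"]
rates = preference_rates(votes)
print(rates)
```

With real data it is worth checking how many votes stand behind each rate before declaring one model the winner.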

Noise intensity level plays an important role in the denoising task. It is usually expressed as the signal-to-noise ratio, which relates the following quantities:

  • Signal — the level of the desired signal (the desired audio information);
  • Noise — the level of background noise;
  • SNR — Signal-to-Noise Ratio.

High SNR values, around 10–15, correspond to noise that is barely audible or detectable. Low SNR values (-10 and lower) indicate more intense noise that makes the desired signal difficult to detect.
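As an illustration, the sketch below uses the common decibel definition, SNR = 10 · log10(P_signal / P_noise), where P is the mean squared amplitude. This is an assumption on our side: it is the usual convention and it is consistent with the negative SNR values discussed in this article, but the exact formula used in the experiments may differ.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """SNR in decibels, assuming the 10 * log10(power ratio) convention."""
    p_signal = power(signal)
    p_noise = power(noise)
    if p_noise == 0:
        return math.inf  # no noise at all
    if p_signal == 0:
        return -math.inf  # nothing but noise
    return 10 * math.log10(p_signal / p_noise)


# Toy example: noise with one tenth of the signal amplitude.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(signal, noise), 2))
```

Under this convention an SNR of 0 means the signal and the noise have equal power, and negative values mean the noise is stronger than the signal.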

A model must be able to remove both loud, intense noise and less obvious noise while preserving the quality of the audio. In other words, the model should not mistake the desired signal for noise. Here manual annotation helps as well, since it reveals drops in quality on audio with imperceptible or absent noise.

How can we protect the data from bots and dishonest annotators?

Before annotators get to work, a well-balanced dataset should be created. It consists of preprocessed audio recordings with SNR between -10 and 15 (in steps of 5) and “clean” audio recordings without any noise. In addition, “obvious” examples are added, called control examples or fillers. For MOS, these are noise-free audio with a known score of 5 and audio with SNR = -60 and a known score of 1. For the dominance metric, the control task is to compare “clean” audio with a noisy one. To prevent annotators from guessing the correct answers by chance, the control examples are additionally shuffled.

Fillers, or control examples, help to reveal bots and unscrupulous annotators who answer at random. Their annotations are then discarded and the data are passed to other annotators.
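One possible way to obtain recordings at each SNR level is to scale a noise track before adding it to a clean one. The sketch below is an assumption about how such a dataset could be built, not a description of the actual pipeline: it uses the decibel definition of SNR, a synthetic sine wave instead of speech, and hypothetical helper names (`mean_square`, `mix_at_snr`).

```python
import math
import random

SNR_LEVELS = range(-10, 16, 5)  # -10, -5, 0, 5, 10, 15


def mean_square(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(clean, noise, target_snr_db):
    """Add noise to clean audio, scaled so the mix has the target SNR."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = mean_square(clean)
    p_noise = mean_square(noise)
    if p_clean == 0 or p_noise == 0:
        raise ValueError("clean and noise must both be non-silent")
    # Choose gain so that 10 * log10(p_clean / (gain**2 * p_noise)) == target.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (target_snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


# Synthetic stand-ins: one second of a 440 Hz tone and Gaussian noise at 16 kHz.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
dataset = {snr: mix_at_snr(clean, noise, snr) for snr in SNR_LEVELS}
print(sorted(dataset))
```

The same function with a target of -60 would produce the kind of heavily corrupted control example mentioned above.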
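The filtering step can be sketched as follows. The data layout (dictionaries keyed by task id), the one-point tolerance and the 80% accuracy threshold are all hypothetical choices made for this example; they are not Yandex.Toloka settings or the thresholds used in the study.

```python
def control_accuracy(answers, expected, tolerance=1):
    """Share of control tasks an annotator scored within the tolerance.

    answers:  {task_id: score given by the annotator}
    expected: {task_id: known score of a control example}
    Returns None if the annotator saw no control tasks.
    """
    checked = [task for task in expected if task in answers]
    if not checked:
        return None
    correct = sum(
        abs(answers[task] - expected[task]) <= tolerance for task in checked
    )
    return correct / len(checked)


def split_annotators(all_answers, expected, min_accuracy=0.8):
    """Split annotators into trusted and rejected by control accuracy."""
    trusted, rejected = [], []
    for annotator, answers in all_answers.items():
        accuracy = control_accuracy(answers, expected)
        if accuracy is not None and accuracy >= min_accuracy:
            trusted.append(annotator)
        else:
            rejected.append(annotator)
    return trusted, rejected


# Control examples: clean audio should get 5, SNR = -60 audio should get 1.
expected = {"clean_1": 5, "noisy_1": 1}
all_answers = {
    "ann_1": {"clean_1": 5, "noisy_1": 1, "task_7": 3},
    "ann_2": {"clean_1": 2, "noisy_1": 4, "task_7": 5},
}
trusted, rejected = split_annotators(all_answers, expected)
print(trusted, rejected)
```

Tasks answered by rejected annotators would then be sent out again, as described above.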


As a result of using Yandex.Toloka and applying subjective methods of audio quality evaluation, we obtain:

  1. the MOS of every model in the studied SNR range;
  2. the best model for each SNR range, selected with the dominance metric;
  3. in the ideal case, a dominant model that also has the highest MOS in the studied SNR range.
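The first item in this list could be assembled roughly like this. The sketch assumes the trusted annotations are available as (model, SNR, score) triples; the tuple layout, the model names and the helper functions are hypothetical, and only the MOS side is shown, not the dominance metric.

```python
from collections import defaultdict
from statistics import mean


def mos_table(ratings):
    """Average the scores for every (model, SNR) pair.

    ratings: iterable of (model, snr, score) triples.
    """
    grouped = defaultdict(list)
    for model, snr, score in ratings:
        grouped[(model, snr)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


def best_model_per_snr(table):
    """For each SNR level, pick the model with the highest MOS."""
    best = {}
    for (model, snr), mos in table.items():
        if snr not in best or mos > best[snr][1]:
            best[snr] = (model, mos)
    return best


# Hypothetical ratings for two models at two SNR levels.
ratings = [
    ("model_a", -10, 2), ("model_a", -10, 3),
    ("model_b", -10, 4), ("model_b", -10, 3),
    ("model_a", 15, 5), ("model_a", 15, 4),
    ("model_b", 15, 4), ("model_b", 15, 4),
]
table = mos_table(ratings)
best = best_model_per_snr(table)
print(best)
```

In this toy data a different model wins at each SNR level, which is exactly the situation where the paired comparison helps to make the final choice.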

In this article you discovered:

  • why subjective metrics may be better than objective ones;
  • the usefulness of Yandex.Toloka in audio quality evaluation;
  • two main subjective metrics of audio quality evaluation;
  • what noise intensity is and why it matters;
  • how we can protect ourselves from bots and cheating annotators;
  • the way to discover the best denoise model.

Text by Alexander Markelov, MIL.Researcher

Ilya Jarikov, MIL.Tech Lead

Vasilisa Dyomina, MIL.copywriter



Machine Intelligence Laboratory

MIL. Team is a united, professional group of researchers, developers and engineers conducting R&D projects in the field of Artificial Intelligence.