I can't prove it, but my intuition is that 10, 20, or even 100 samples can't reliably represent the whole population of music.
Even when the scores across the 'extremely huge and diverse' population of music fluctuate between 1.0 = Very Annoying and 5.0 = Imperceptible, if we randomly pick 100 samples from that population, we can reliably determine its average to within a standard error of about 0.1, without ever testing the whole population.
Correct me if I'm wrong.
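To make the claim above concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the population size of 100,000, the uniform spread of "true" scores over 1.0 to 5.0, and the names `population` and `sample_mean`. It does not model real listening-test data; it only shows how much the mean of 100 randomly picked items wanders when the experiment is repeated.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 tracks whose "true" scores are spread
# uniformly over the 1.0..5.0 scale (an assumption, not measured data).
population = [rng.uniform(1.0, 5.0) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)


def sample_mean(n):
    """Mean score of n tracks picked at random, without replacement."""
    return statistics.fmean(rng.sample(population, n))


# Repeat the 100-track experiment many times and measure how much
# the resulting sample means scatter around the population mean.
means = [sample_mean(100) for _ in range(2000)]
spread = statistics.stdev(means)

print(f"population mean            = {pop_mean:.3f}")
print(f"population SD              = {pop_sd:.3f}")
print(f"spread of 100-track means  = {spread:.3f}")
print(f"predicted by SD / sqrt(n)  = {pop_sd / 100 ** 0.5:.3f}")
```

Under these assumptions the scatter of the 100-track means should come out close to SD/sqrt(100), regardless of how large the population is.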
(1) The variance of the overall means comes from two sources: the variance of the listeners' grades and the variance across sound samples.
(2) To determine an appropriate number of sound samples, we should run an analysis of variance on the per-sample means for each codec.
(3) A rough estimate of whether the number is adequate can be obtained by comparing the confidence intervals of the means of the per-sample means.
(4) More precisely, the required number of samples can be determined from, for example, Cohen's tables, starting from the desired power of the test and the significance level.
Was your rough estimate obtained using (4)? If not, could you do the rough calculation? I'm not sure I can do it correctly.
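For item (4), a power-based calculation can be sketched with a normal approximation instead of looking values up in tables. This is only a rough stand-in: Cohen's tables are built on t-type distributions and would give slightly larger numbers, and the function name `required_samples`, the assumed SD of 1.0, and the example differences are all hypothetical choices, not values from this thread's data.

```python
import math
from statistics import NormalDist


def required_samples(delta, sd, alpha=0.05, power=0.80):
    """Approximate number of samples needed to detect a mean difference
    `delta` when the per-sample differences have standard deviation `sd`.

    Uses the two-sided z-test approximation:
        n = ((z_{1 - alpha/2} + z_{power}) * sd / delta) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)          # desired power
    return math.ceil(((z_alpha + z_power) * sd / delta) ** 2)


# Assumed SD of 1.0 grade; smaller differences need many more samples.
for delta in (0.5, 0.3, 0.1):
    print(f"delta = {delta}: n = {required_samples(delta, sd=1.0)}")
```

The required number grows with the inverse square of the difference to be detected, so halving the difference roughly quadruples the number of samples.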
(2) We won't know the variance of the means before the test. Instead, consider how much accuracy we need. The scale is 3.0 = Slightly Annoying, 4.0 = Perceptible but not annoying, 5.0 = Imperceptible, so I feel it is accurate enough if we determine the average score to within an error margin of 0.1. (Can we even imagine the difference between 3.3 and 3.4?)
(4) Rather, we want the SEM to be small enough to meet that requirement.
The rough estimate goes like this. First, the score lies between 1.0 and 5.0, so the standard deviation (SD) can't exceed 2.0. An SD of 2.0 is highly unlikely, because it would require the score to be 1.0 half the time and 5.0 the other half, and since 1.0 = Very Annoying, the developers would be getting tons of bug reports. Let's say SD = 1.0. The standard error of the mean is SEM = SD/sqrt(sample size). If we get 100 independent results, SEM = 1.0/sqrt(100) = 0.1, which is small enough.
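The arithmetic above can be checked in a few lines. The helper names `sem` and `samples_for_sem` are made up for this sketch, and SD = 1.0 is the same guess as in the text, not a measured value.

```python
import math
import statistics


def sem(sd, n):
    """Standard error of the mean for n independent results."""
    return sd / math.sqrt(n)


def samples_for_sem(sd, target_sem):
    """Smallest n for which SD / sqrt(n) is at most target_sem."""
    # round() guards against floating-point fuzz before taking the ceiling
    return math.ceil(round((sd / target_sem) ** 2, 9))


# Worst case on a 1.0..5.0 scale: half the scores at 1.0, half at 5.0.
worst_case_sd = statistics.pstdev([1.0, 5.0])

print("worst-case SD:", worst_case_sd)
print("SEM with SD = 1.0, n = 100:", sem(1.0, 100))
print("n for SEM 0.1 at SD = 1.0:", samples_for_sem(1.0, 0.1))
print("n for SEM 0.1 at worst-case SD:", samples_for_sem(worst_case_sd, 0.1))
```

Even in the worst case of SD = 2.0, the same formula gives 400 independent results for an SEM of 0.1, so the estimate is not very sensitive to the guessed SD.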