The value of content filters
My main frustration with content-based sites like YouTube is that it’s hard to filter the content that interests me from all the rest. This led me to start thinking about two things: How valuable filtering is to users, and how to make better filters. I’ll try to explain some ideas I had about the first question.
I’ve been playing with the following simple model of preferences and filtering. Suppose that a user’s preferences for content on a site are represented by a number between 0 and 1. The number, call it x, represents the user’s ideal content, if only she could locate it on the site. However, the site has a range of different content. Without any filter, a user watches or listens to a random piece of content, also represented by a number between 0 and 1, call it y, which is equally likely to fall anywhere in that range. Receiving content that doesn’t exactly match the user’s preference makes her unhappy. She receives disutility (unhappiness) measured by the absolute distance between x and y.
Now we can calculate, for a user with a given x, the expected disutility from a random piece of content in the absence of any filter. For example, consider the user with preference x = 0.5. She has a 50% chance of receiving y < 0.5 and getting disutility 0.5 - y, and a 50% chance of receiving y > 0.5 and getting disutility y - 0.5. Each half contributes 1/8, so averaging over all possible values of y between 0 and 1, her expected disutility is 1/4.
For a user with an arbitrary x between 0 and 1, the expected disutility from a random piece of content turns out to be 0.5(x² + (1 - x)²). Here’s a graph of this for users with different preferences:

From this we can observe one thing about the value of filters:
Those with the most to gain from filtering content are those with the most extreme preferences.
In other words, users with x close to 0 or 1 suffer a lot from receiving a randomly chosen piece of content. This is because the content they receive can be quite far from their actual preference. Those in the middle, on the other hand, are not likely to suffer as much. Because their preferences are “middle of the road”, a random piece of content is less likely to be very far from their ideal.
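As a quick check on the no-filter calculation, here is a minimal Python sketch. It assumes that y is equally likely to be anywhere between 0 and 1 and that disutility is |x - y|; integrating that gives 0.5(x² + (1 - x)²). The function names are just illustrative, and the simulation is only there to confirm the closed form.

```python
import random


def expected_disutility(x: float) -> float:
    """Closed form of E|x - y| when y is uniform on [0, 1]."""
    return 0.5 * (x ** 2 + (1 - x) ** 2)


def simulated_disutility(x: float, trials: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same expectation, used as a sanity check."""
    rng = random.Random(seed)
    return sum(abs(x - rng.random()) for _ in range(trials)) / trials


if __name__ == "__main__":
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"x={x:.2f}  exact={expected_disutility(x):.4f}  "
              f"simulated={simulated_disutility(x):.4f}")
```

The extremes (x = 0 or 1) come out at 0.5 and the middle (x = 0.5) at 0.25, so a user at the edge suffers twice as much from random content as a user in the middle.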
Now suppose that we use a filter to split the content up into equal-sized chunks. Then suppose the users can choose the chunk that most closely matches their preference, but still receive a random piece of content within that chunk. Obviously if we could split the content into infinitely many chunks, each user would be able to receive her ideal content, and the gross value of this filtering to a user would equal the red line in the graph above.
However, it is likely to be costly (in terms of time and effort) for users to compare the segments that the filter separates out. For example, if a music site separated its content into 1,000 different genres, the burden on users of comparing these choices would be high. Therefore:
Too fine filtering can also be bad, if it imposes too much evaluation cost on users.
We can combine the previous two conclusions to reach another one. The value of filtering differs across users and is highest for those with more extreme preferences. To the extent that it’s possible, we should customise our filtering to suit these preferences:
If possible, present finer segmentation to users with extreme preferences, and coarser segmentation to others.
Users with extreme preferences have more to gain from better segmentation, and so are willing to spend more effort evaluating a finer segmentation to find something that’s closer to their preferences.
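To make the trade-off concrete, here is a rough sketch that subtracts an evaluation cost from the gross value of an n-way filter. The linear cost (a fixed `cost_per_segment` for each extra segment the user has to compare) is purely an assumption for illustration, as are the function names and the value 0.005; the model doesn't pin down what the cost really looks like.

```python
def no_filter_disutility(x: float) -> float:
    """E|x - y| for y uniform on [0, 1]."""
    return 0.5 * (x ** 2 + (1 - x) ** 2)


def segment_disutility(x: float, n: int) -> float:
    """E|x - y| when y is uniform on the one of n equal segments containing x."""
    width = 1.0 / n
    k = min(int(x * n), n - 1)  # index of the user's segment (x = 1 goes in the last)
    lo, hi = k * width, (k + 1) * width
    return ((x - lo) ** 2 + (hi - x) ** 2) / (2 * width)


def net_value(x: float, n: int, cost_per_segment: float) -> float:
    """Gross gain from the filter minus a hypothetical linear evaluation cost."""
    gross = no_filter_disutility(x) - segment_disutility(x, n)
    return gross - cost_per_segment * (n - 1)  # assumption: no cost when n = 1


def best_n(x: float, cost_per_segment: float, max_n: int = 200) -> int:
    """Number of segments (1..max_n) that maximises net value for this user."""
    return max(range(1, max_n + 1), key=lambda n: net_value(x, n, cost_per_segment))


if __name__ == "__main__":
    cost = 0.005  # hypothetical cost of comparing one extra segment
    for x in (0.0, 0.5):
        n = best_n(x, cost)
        print(f"x={x}: best n={n}, net value={net_value(x, n, cost):.4f}")
    print(f"x=0.0 with 1,000 segments: {net_value(0.0, 1000, cost):.4f}")
```

With these made-up numbers, the user at x = 0 does best with 10 segments and the user at x = 0.5 with 7, while 1,000 segments leaves even the extreme user worse off than having no filter at all.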
Now let’s look at exactly what happens in the model I described above when we segment the content. Suppose we split the content into two groups, those located between 0 and 0.5, and those between 0.5 and 1. Then a user with x between 0 and 0.5 will choose content from the first group, and a user with x between 0.5 and 1 will choose content from the second group. Considering all random pieces of content from within each group, we can calculate the expected disutility of users with this filter applied.
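The two-group calculation can be sketched in a few lines of Python. Here I assume each user picks the half that contains her own x and then receives a piece of content drawn uniformly from that half; `interval_disutility` and the other names are just illustrative.

```python
def interval_disutility(x: float, lo: float, hi: float) -> float:
    """E|x - y| for y uniform on [lo, hi], with lo <= x <= hi."""
    return ((x - lo) ** 2 + (hi - x) ** 2) / (2 * (hi - lo))


def no_filter_disutility(x: float) -> float:
    """No filter: content is drawn from the whole range [0, 1]."""
    return interval_disutility(x, 0.0, 1.0)


def two_group_disutility(x: float) -> float:
    """Two groups: the user picks [0, 0.5] or [0.5, 1], whichever contains x."""
    if x <= 0.5:
        return interval_disutility(x, 0.0, 0.5)
    return interval_disutility(x, 0.5, 1.0)


if __name__ == "__main__":
    for x in (0.0, 0.1, 0.25, 0.4, 0.5):
        before = no_filter_disutility(x)
        after = two_group_disutility(x)
        print(f"x={x:.2f}  no filter={before:.4f}  two groups={after:.4f}  "
              f"value of filter={before - after:.4f}")
```

For x at or below 0.5 the value of the filter simplifies to 0.25 - x², so it is largest at the edge (0.25 at x = 0) and falls to zero at x = 0.5.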
Here’s a graph showing the expected disutility with the content divided into two groups (blue line) versus the disutility with no filtering (red line). The green line shows the difference between the two, i.e. the value of the filter to users:

From this we see a slightly surprising thing: those with “middle of the road” preferences gain the least from this filter, and a user at exactly x = 0.5 gains nothing at all. The reason is that users who sit in the middle when there is no filter end up at the extreme edge of their group once the content is split in two. For the user at x = 0.5, a random piece of content from her chosen half is, on average, exactly as far from her preference as a random piece of content from the whole site. Nobody is made worse off in this model, but the benefit shrinks to zero as x approaches 0.5.
However, this effect goes away if we apply finer segmentation. Here are the results with the content filtered into three groups:

And here are 10 groups:

So we can say:
Coarse filtering can leave users with non-extreme preferences little or no better off.
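To compare segmentations more generally, the sketch below scans a grid of preferences and reports the smallest value any user gets from an n-way filter. It assumes n equal-sized segments, with each user choosing the segment that contains her x and receiving uniformly random content from it; the grid size and function names are arbitrary choices for illustration.

```python
def no_filter_disutility(x: float) -> float:
    """E|x - y| for y uniform on [0, 1]."""
    return 0.5 * (x ** 2 + (1 - x) ** 2)


def segment_disutility(x: float, n: int) -> float:
    """E|x - y| when y is uniform on the one of n equal segments containing x."""
    width = 1.0 / n
    k = min(int(x * n), n - 1)  # index of the user's segment
    lo, hi = k * width, (k + 1) * width
    return ((x - lo) ** 2 + (hi - x) ** 2) / (2 * width)


def filter_value(x: float, n: int) -> float:
    """Reduction in expected disutility from an n-way filter for user x."""
    return no_filter_disutility(x) - segment_disutility(x, n)


def smallest_value(n: int, grid: int = 1200):
    """Return (x, value) for the grid point that gains least from n segments."""
    xs = [i / grid for i in range(grid + 1)]  # 1200 is divisible by 2, 3 and 10
    return min(((x, filter_value(x, n)) for x in xs), key=lambda pair: pair[1])


if __name__ == "__main__":
    for n in (2, 3, 10):
        x, value = smallest_value(n)
        print(f"n={n:2d}: smallest value of filter = {value:.4f} (at x = {x:.4f})")
```

On this grid the smallest value is 0 for two groups (at x = 0.5), about 0.11 for three groups (at the segment boundaries x = 1/3 and 2/3) and 0.2 for ten groups (at x = 0.5).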
Thus, taking all the analysis together, it’s probably better to lean towards finer filtering, but not so fine that users’ costs of comparing the segments are too high. And customise the segmentation to people’s preferences, if that’s possible.