The statistics in this chart come from a sample of 9,612,469 messages found via NNTP in the latter half of 2011, each of which had either an identifiable user-agent or an identifiable character set. The exact biases in the input are not known.
Labels from charsets are normalized only by converting to lowercase, and the columns are ordered in decreasing order of charset occurrence. The blank column refers to messages without an identified charset, which notably includes all messages that were not of text/* type.
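As a rough illustration of the charset labeling and column ordering described above, here is a minimal Python sketch. It is not the script used to build the chart: the function names charset_label and column_order are hypothetical, and the sketch assumes the charset is read from the Content-Type header of the top-level message only.

```python
from collections import Counter
from email import message_from_string


def charset_label(raw_message: str) -> str:
    """Return the lowercased charset label, or "" for the blank column."""
    msg = message_from_string(raw_message)
    # Assumption: anything that is not text/* lands in the blank column.
    if msg.get_content_maintype() != "text":
        return ""
    charset = msg.get_param("charset")
    # get_param may return None (no charset) or a tuple (RFC 2231 encoding);
    # this sketch treats both as "no identified charset".
    return charset.lower() if isinstance(charset, str) else ""


def column_order(labels):
    """Order distinct labels by decreasing number of occurrences."""
    return [label for label, _ in Counter(labels).most_common()]
```

Lowercasing is the only normalization applied here, so aliases of the same encoding would still appear as separate columns.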
User-Agent labels were normalized by a script that attempted to match either the User-Agent or the X-Newsreader header to determine the user agent. Messages in the "Unattributed" category had neither header. "Outlook Express" includes results from Windows Mail and Windows Live Mail (its replacements in Windows Vista and Windows 7, respectively). "Thunderbird" includes essentially any User-Agent that contains "Mozilla", and thus covers Thunderbird, SeaMonkey, Netscape, Icedove, and Iceape. Attempting to be more specific here is futile, due to UA spoofing and known extensions that fix the UA string to incorrect values.
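The grouping rules above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the original script: classify_user_agent is a hypothetical name, the substrings used to detect the Outlook Express family are guesses at what those clients send, and the fallback for all other agents (taking the text before the first "/") is only a placeholder for the unspecified normalization.

```python
from email import message_from_string

# Assumed substrings identifying the Outlook Express family of clients.
OUTLOOK_FAMILY = ("outlook express", "windows mail", "windows live mail")


def classify_user_agent(raw_message: str) -> str:
    """Map a raw message to a coarse user-agent label."""
    msg = message_from_string(raw_message)
    # Try User-Agent first, then fall back to X-Newsreader.
    ua = msg.get("User-Agent") or msg.get("X-Newsreader")
    if not ua:
        return "Unattributed"
    lowered = str(ua).lower()
    # Anything mentioning Mozilla is lumped together as Thunderbird.
    if "mozilla" in lowered:
        return "Thunderbird"
    if any(name in lowered for name in OUTLOOK_FAMILY):
        return "Outlook Express"
    # Placeholder normalization (assumption): keep the product name only.
    return str(ua).split("/")[0].strip()
```

Matching on "Mozilla" alone is deliberately coarse, for the reason given above: spoofed or extension-modified UA strings make finer distinctions unreliable.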
Other than the three groups explicitly mentioned, User-Agents are normalized according to standard formatting techniques for these headers (many of which fail to follow the specification). The initial order of User-Agents is in decreasing order of prevalence (G2 refers to Google Groups).
Squares are colored by the percentage of messages from a given User-Agent that use a given charset, in effect representing the distribution of charsets per User-Agent. The darker the color, the higher the percentage, although the darkness is exaggerated due to the presence of very long tails for the most popular user-agents.
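To make the per-square value concrete, the following Python sketch computes, for each User-Agent, the percentage of its messages that use each charset. It assumes the input is a sequence of (user_agent, charset) pairs that have already been labeled; the function name charset_percentages is hypothetical, and the color scaling used in the chart is not reproduced here.

```python
from collections import Counter, defaultdict


def charset_percentages(pairs):
    """Return {user_agent: {charset: percent of that agent's messages}}."""
    counts = defaultdict(Counter)
    for user_agent, charset in pairs:
        # Charset labels are lowercased; "" stands for the blank column.
        counts[user_agent][charset.lower()] += 1

    table = {}
    for user_agent, per_charset in counts.items():
        total = sum(per_charset.values())
        table[user_agent] = {
            charset: 100.0 * n / total for charset, n in per_charset.items()
        }
    return table
```

Each row of the resulting table sums to 100, so rows are comparable across User-Agents regardless of how many messages each one contributed.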