Why web browser statistics lie and don't really say anything

Recent web server statistics say that Microsoft's web browser has a share of more than 90% of all browsers used on the World Wide Web. They want to make one believe that even Netscape is no longer relevant at all. Yet most people I know do not use MSIE but Mozilla, Netscape or other alternative browsers; even the majority of the Windows users I know don't use MSIE.

So what might be the reason for this discrepancy between the available statistics and personal experience? One reason, of course, is that the people I know are mostly more technically experienced than the average person surfing the net. It's a fact that technically more experienced people don't like to use Internet Explorer, both because of its many known security holes and because it ignores web standards such as style sheets.

Web servers' User-Agent statistics also depend very much on which site is chosen. Statistics from www.heise.de, for example, show a much higher share of Netscape user agents than many other sites; one of the reasons is that Heise targets more advanced users with technical skills, who more often use secure browsers and dislike MSIE for the reasons mentioned.

Can that be the only reason for the overwhelming numbers of IEs? There are a number of purely technical reasons which also have to be taken into account. Some sites do in fact require an MSIE User-Agent header to allow access, though other browsers would display the site just fine, too. As a result, many browsers have a feature to fake the User-Agent they send to the server and simply say that they are an MSIE, though they might be an Opera or Konqueror browser, just to get access to sites whose webmasters are focused on Microsoft only. This has a strong influence on the statistics: at the moment roughly 10% of the browsers out there claim to be an MSIE but really aren't. Tools like Webalizer, which produce the browser usage statistics, don't handle such faked log entries correctly in their default configuration, though it's possible to detect them in many cases. An Opera browser which claims to be MSIE, for example, sends a User-Agent header like this:

Mozilla/4.0 (compatible; MSIE 5.0; Linux 2.5.68 i686) Opera 6.02 [en] .

As you can see, Opera says at the end of the string what it really is, but Webalizer doesn't care about that. To teach Webalizer to subtract such log entries from the other MSIE log entries, one has to use a Webalizer configuration like this:

GroupAgent    Opera    Opera (grouped)
HideAgent Opera
GroupAgent    MSIE     MSIE (grouped)
HideAgent MSIE
# and turn off MangleAgents !

This way all log entries which match "Opera" are subtracted first and counted together as Opera, and after that the rest of the entries which match "MSIE" are counted. You will see that MSIE loses several percent after this change to the Webalizer config. This is just one example of how to make Webalizer's statistics a bit more correct. Other browsers, Konqueror for example, are not as nice as Opera: their faked User-Agent string does not differ from the original MSIE one, so it's almost impossible to subtract their faked entries. There are also many crawlers (address collectors etc.) that pretend to be MSIE, most probably to be as inconspicuous as possible. However, they are not inconspicuous enough: such nasty crawlers do not care about the robots.txt file at all, and you can see typical robot behaviour like hitting every link on the site (hidden ones too), mostly within a short time. That makes many hits by these spammers for the statistically leading browser.
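The matching order described above (Opera first, then MSIE) can be sketched outside of Webalizer as well. The following is a minimal Python illustration, not Webalizer's actual implementation; the function name `classify_agent` is hypothetical, and the second and third sample strings are made up for the example.

```python
from collections import Counter


def classify_agent(user_agent: str) -> str:
    """Return a browser family, testing the Opera token before MSIE."""
    # Order matters: an Opera that masquerades as MSIE contains both
    # tokens, so the more specific token has to be checked first.
    for token in ("Opera", "MSIE"):
        if token in user_agent:
            return token
    return "other"


# First string is the header quoted in the text; the others are
# illustrative samples, not taken from real logs.
agents = [
    "Mozilla/4.0 (compatible; MSIE 5.0; Linux 2.5.68 i686) Opera 6.02 [en]",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624",
]

counts = Counter(classify_agent(a) for a in agents)
for family, hits in sorted(counts.items()):
    print(family, hits)
```

As in the text, this only catches browsers that leave a trace of their real name in the string; an agent that copies the MSIE string exactly would still be counted as MSIE.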

Another technical reason is the way IE behaves. For a long time IE has requested a file called "favicon.ico", which is used to display a shortcut icon in the URL line and in the bookmarks. The file is requested in any case, whether it exists or not. This gives just another hit in the logs for MSIE, while most other browsers do not request this file and so cannot produce this hit. For a request of one HTML page without any images, this inflates the statistic by 100% (two hits instead of one). Mozilla and Opera (since version 7) also support shortcut icons, but they require the HTML code to contain a special tag pointing to an icon:

<link href="/favicon.ico" rel="shortcut icon">

Netscape and Opera have had a feature to disable the loading of pictures for many years. That often makes browsing much more convenient, because the big parts of a site don't have to be loaded and only the plain HTML code is fetched. MSIE lacks such a feature. (I was told that there is a possibility to disable images, and even a so-called Toggle Images power toy which makes this a one-click operation, but these possibilities are used very rarely.) Again, all these pictures that Opera etc. do not load are hits which never happen and do not count in the statistics, while IE users in practice don't turn off pictures and always download them (and make log entries pushing that browser in the stats). A site containing just 10 pictures, loaded with Opera and pictures turned off, produces one hit in the logs. The same site loaded with IE always produces 11 hits in the logs. Webalizer's statistic will show that MSIE has a share of more than 90% (11 of 12 hits), though it was just one of two browsers. That means the browser statistic is distorted by 1000%! Text-only browsers like w3m or lynx usually never download any images; to get the count of these browsers right you would have to multiply their hits by the average number of images on your sites. To get more accurate results which are less manipulated by this factor, make your Webalizer ignore any image files and watch how the statistics change after that!

IgnoreURL       *.ico
IgnoreURL       *.ICO
IgnoreURL       *.gif
IgnoreURL       *.GIF
IgnoreURL       *.jpg
IgnoreURL       *.JPG
IgnoreURL       *.png
IgnoreURL       *.PNG
IgnoreURL       *.css
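To make the arithmetic of the 10-picture example concrete, here is a small Python sketch that computes the MSIE share of hits with and without the file types ignored above. It is only an illustration under simplifying assumptions: the request list is constructed by hand to mirror the example, and the helper names `is_ignored` and `share` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Same suffixes as in the IgnoreURL lines; compared case-insensitively.
IGNORED_SUFFIXES = (".ico", ".gif", ".jpg", ".png", ".css")


def is_ignored(url: str) -> bool:
    """True if the URL path ends in one of the ignored suffixes."""
    return urlsplit(url).path.lower().endswith(IGNORED_SUFFIXES)


def share(requests, browser, skip_ignored):
    """Fraction of counted hits that belong to the given browser."""
    counts = Counter(
        agent
        for agent, url in requests
        if not (skip_ignored and is_ignored(url))
    )
    total = sum(counts.values())
    return counts[browser] / total if total else 0.0


# One Opera visit with pictures off, one MSIE visit: page plus 10 pictures.
requests = [("Opera", "/index.html"), ("MSIE", "/index.html")]
requests += [("MSIE", f"/img/pic{i}.gif") for i in range(10)]

raw = share(requests, "MSIE", skip_ignored=False)       # 11 of 12 hits
filtered = share(requests, "MSIE", skip_ignored=True)   # 1 of 2 hits
print(f"raw: {raw:.1%}, images ignored: {filtered:.1%}")
```

With the pictures ignored, each of the two visitors counts once, which is what actually happened.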

In the same category of stat manipulation is a "feature" of newer IEs. They can start multiple parallel downloads of one and the same file (mostly zipped archives), using HTTP status code 206 (partial content). IE then downloads a file in, for example, 5 chunks at a time, which also produces 5 access log entries and inflates the statistic by 400% in this case. Starting multiple downloads for one file is, by the way, a good way to put even more stress on web servers which are already at their limit. Since I saw that, I just call MSIE the "Egoistbrowser", because it does this to grab more download bandwidth without respect for other users downloading stuff from that server. A great way to make the internet even slower than it already is for the rest of the world, thanks Microsoft! Some download managers also misuse code 206 to retrieve files faster, and many of them pretend to be IE. Webmasters of high-traffic sites like mirror servers should consider banning such agents, which flood the web servers with unnecessary requests.
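One way to take the chunked downloads out of the count is to treat all 206 responses for the same client, agent and file as a single download. The Python sketch below shows the idea on hand-made log tuples; the function name `count_downloads` and the sample entries are hypothetical, and the sketch deliberately ignores timestamps, so a client that really fetches the same file twice via 206 would also be counted only once.

```python
from collections import Counter


def count_downloads(entries):
    """Count requests per agent, collapsing repeated 206 chunks into one."""
    seen_partial = set()
    counts = Counter()
    for host, agent, url, status in entries:
        if status == 206:
            key = (host, agent, url)
            if key in seen_partial:
                continue  # another chunk of a download already counted
            seen_partial.add(key)
        counts[agent] += 1
    return counts


# Five parallel chunks from one MSIE client, one plain download by Mozilla.
entries = [("10.0.0.1", "MSIE", "/pub/file.zip", 206)] * 5
entries.append(("10.0.0.2", "Mozilla", "/pub/file.zip", 200))

counts = count_downloads(entries)
print(dict(counts))
```

In this made-up log both clients end up with one download each, instead of five to one.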

Last but not least, one more important factor is this: as mentioned above, technically more experienced users often don't use MSIE but other browsers instead. Taking a look at the visit path length, one can see that IE users surf around much more to see what's hosted on the server, without really knowing what they are looking for. Visit paths of non-IE users are much shorter. What does this say? Well, the qualified users go straight to the place they wanted to go and leave the site after that, while the less qualified users often surf around on the site aimlessly. No, that is not a joke; this can be shown from httpd log entries. This again produces many hits from MSIE User-Agents, pushing the stats for IE enormously. Things which produce very many hits, like web discussion forums (where more or less nonsense talk goes on), are also frequented much more by IE users than by others.

As I already mentioned, most of the clueless people are only able to use the preinstalled browser. What do you think: how many of them are able to configure a proxy server, or know what it would be good for? Less clueless people, who are able to download other browsers and use them, usually also know what proxies are for and use them. What is characteristic of a proxy? ... Right! Many clients get the page, but the page is loaded just once from the web server. This is, by the way, not pure speculation of mine. I analyzed some web server log files, and the ratio of Netscape clients which came via a host with "proxy" or "cache" in its name to Netscape clients which came directly was always higher than the same proxy/direct ratio for MSIE clients. Do you see that there are lots of reasons why MSIE has so many hits in web server log files?
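The proxy/direct comparison can be reproduced with a few lines of Python. This is only a sketch of the heuristic described above (a hostname containing "proxy" or "cache"), not the analysis actually used; the helper names and all hostnames in the sample are invented for illustration.

```python
from collections import defaultdict


def via_proxy(hostname: str) -> bool:
    """Heuristic from the text: 'proxy' or 'cache' in the client hostname."""
    name = hostname.lower()
    return "proxy" in name or "cache" in name


def proxy_ratio(entries):
    """Map each browser to (proxied requests) / (direct requests)."""
    tally = defaultdict(lambda: [0, 0])  # browser -> [proxied, direct]
    for hostname, browser in entries:
        tally[browser][0 if via_proxy(hostname) else 1] += 1
    return {
        browser: (proxied / direct if direct else float("inf"))
        for browser, (proxied, direct) in tally.items()
    }


# Invented sample data, only to show the shape of the calculation.
entries = [
    ("proxy1.example.net", "Netscape"),
    ("dialup-17.example.net", "Netscape"),
    ("cache.example.org", "MSIE"),
    ("dialup-18.example.net", "MSIE"),
    ("dialup-19.example.net", "MSIE"),
    ("dialup-20.example.net", "MSIE"),
]

ratios = proxy_ratio(entries)
for browser, ratio in sorted(ratios.items()):
    print(f"{browser}: {ratio:.2f}")
```

The hostname test misses proxies with unremarkable names, so the ratio is best read as a lower bound for every browser.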

Some people might think this is an anti-IE rant. You may think so, okay. I argue a lot about IE because I want to explain the reasons for the apparent statistical superiority of this browser. I try to give some hints on how to make web server statistics more meaningful by keeping a few things in mind, such as including only certain parts of the logs in the statistic. People who do that will see that the numbers and graphs (you cannot really call them statistics) that tools like Webalizer give back look quite different.

What do we conclude from all this? The number of hits in a log file says nothing about how many people are using a certain browser. Ready-made statistics published by so-called "analysts" say even less: they lie. To get statistics which are at least a little bit close to reality, it's not enough to have a program which analyzes a log file; it takes some mathematical background and a good understanding of what is going on there at all. I never saw a web server statistic that was not totally dumb. The fact that MSIE is the only preinstalled browser on MS Windows makes it the default browser for less qualified users; if you want to reach qualified users, you should make sure your sites conform to standards so that non-IE browsers can display them, too. Do not use proprietary extensions of HTML, and try to avoid JavaScript and Flash. Blind people and other handicapped people also use the WWW but have to use text-only browsers like Lynx. Design your sites in a way that such browsers can also be used to navigate through them. Use CSS, avoid frames, use ALT attributes for images and take a look at the Web Content Accessibility Guidelines.

b j o e r n [at] j 3 e . d e