Velocity Reviews - Computer Hardware Reviews

Interpreting Web statistics

 
 
JDS
 
      02-20-2006
Hi, all. I am constantly butting heads with others in my department about
the interpretation of web log statistics regarding viewership of a
website. "Page views" "path through a site" "exit points" and that sort
of thing.

On the web, there are two diametrically opposing views on the value of web
server log stats.

1) Web server log analysis is very useful and can provide detailed,
usable, accurate statistics

AND

2) It can't

Well, which is it?

Typically, companies (e.g. Webtrends) that sell analysis software say
the first. However, there are a number of articles pointing to the
second. Notably, the author of "analog", one of the original web log
analysis tools, says that you can't *really* get too much meaningful
analysis out of your server logs.

Well, the problem that I see is that the articles pointing to the
uselessness of web log analysis tend to be OLD. REALLY REALLY old in
internet years -- ca. 1994 and 1995!

examples:
http://www.analog.cx/docs/webworks.html
http://www.goldmark.org/netrants/webstats/#whykeep
http://www.ario.ch/etc/webstats.html


Now, technology has moved along since the WWW first hit the streets, so to
speak, and my question(s) is(are) simple:

What techniques exist to overcome the problems inherent in Web Server Log
Analysis?

I know there *must* be some techniques! Things like tracking users via
cookies and using "tracker" URLs (a server script that gets URLs and
redirects the browser, thus writing a log of what was clicked and where),
that sort of thing.
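The "tracker URL" idea can be sketched in a few lines. This is a hypothetical illustration, not any vendor's implementation: the endpoint name, query parameter, and log format are all made up for the example.

```python
# Minimal "tracker URL" sketch: log what was clicked, then redirect the
# browser to the real destination. Endpoint name (/click), the "url"
# parameter, and the log format are illustrative assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
import time

class ClickTracker(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", ["/"])[0]
        # One line per click: timestamp, client IP, destination.
        with open("clicks.log", "a") as log:
            log.write(f"{time.time()}\t{self.client_address[0]}\t{target}\n")
        self.send_response(302)              # temporary redirect
        self.send_header("Location", target)
        self.end_headers()

# To run: HTTPServer(("", 8000), ClickTracker).serve_forever()
# Links on the site would then point at /click?url=/real/page.html
```

The price, as discussed below in the thread, is an extra round trip per click and one more thing that defeats caching.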

If anyone can provide some insight on the following, that'd be great:

What techniques exist to improve Web Server Log analysis?

How good are they?

What can I do to implement them?

How do different log analysis tools compare? (Examples I have considered
using are Analog, AWStats, Webtrends, and Sawmill.)

(All these factors are important to me in gauging a tool's quality:
accuracy and usefulness of reports, "prettiness" of reports, ease of use,
flexibility, speed, and cost)

Golly, thanks, web denizens. I look forward to your responses. Have a
nice day.

--
JDS | http://www.velocityreviews.com/forums/(E-Mail Removed)lid
| http://www.newtnotes.com
DJMBS | http://newtnotes.com/doctor-jeff-master-brainsurgeon/

 
Mark Parnell
 
      02-20-2006
Deciding to do something for the good of humanity, JDS
<(E-Mail Removed)> declared in
comp.infosystems.www.servers.unix,alt.html:

> Well, the problem that I see is that the articles pointing to the
> uselessness of web log analysis tend to be OLD.


http://karlcore.com/articles/article.php?id=26

--
Mark Parnell

Now implementing http://blinkynet.net/comp/uip5.html
 
Steve Pugh
 
      02-20-2006
JDS wrote:
> Hi, all. I am constantly butting heads with others in my department about
> the interpretation of web log statistics regarding viewership of a
> website. "Page views" "path through a site" "exit points" and that sort
> of thing.


Those things are useful if set up and interpreted with care, but are
not 100% definitive.

> On the web, there are two diametrically opposing views on the value of web
> server log stats.
>
> 1) Web server log analysis is very useful and can provide detailed,
> usable, accurate statistics
>
> AND
>
> 2) It can't
>
> Well, which is it?


Both.

> Typically, companies (e.g. Webtrends) that sell analysis software say
> the first.


Read the small print. WebTrends use cookies and JavaScript instead
of/as well as server logs. They have a number of products and services
which offer differing levels of accuracy. But at the end of the day
they can not be 100% accurate. Think of them as providing information
on general trends rather than precise detail on every user (if a user
has a static IP and/or accepts and keeps cookies and enables JavaScript
then you can study them very accurately).
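The cookie half of that approach is easy to sketch. This is a generic illustration of assigning a persistent visitor ID, not WebTrends' actual mechanism; the cookie name and lifetime are assumptions.

```python
# Sketch: tie repeat visits together with a persistent visitor-ID cookie,
# independent of IP churn. Plain WSGI; cookie name and Max-Age are
# illustrative, and of course it only works if the user keeps cookies.
import uuid
from http.cookies import SimpleCookie

def track_visitor(environ, start_response):
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if "visitor_id" in cookie:
        visitor = cookie["visitor_id"].value   # returning visitor
    else:
        visitor = uuid.uuid4().hex             # first visit: mint an ID
    headers = [
        ("Content-Type", "text/plain"),
        ("Set-Cookie", f"visitor_id={visitor}; Max-Age=31536000; Path=/"),
    ]
    start_response("200 OK", headers)
    return [f"hello {visitor}".encode()]
```

A user who blocks or clears cookies simply mints a fresh ID each time, which is exactly why the counts can never be exact.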

> However, there are a number of articles pointing to the
> second. Notably, the author of "analog", one of the original web log
> analysis tools, says that you can't *really* get too much meaningful
> analysis out of your server logs.


Yes, Analog reads server logs alone. It doesn't try to do anything with
JavaScript, cookies, etc.
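For context, a pure log analyser has nothing to work with but lines in Common Log Format. A rough parser sketch (the regex is a simplification of the real grammar, and the sample line is invented):

```python
import re

# Rough parser for one Common Log Format line - the only raw material a
# logs-only tool like Analog has. The regex is a simplification.
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_clf(line):
    m = CLF.match(line)
    return m.groupdict() if m else None

sample = '203.0.113.7 - - [20/Feb/2006:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120'
```

Notice what is *not* in that line: no person, no session, no intent - just a host, a time, a request, and a status.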

> Well, the problem that I see is that the articles pointing to the
> uselessness of web log analysis tend to be OLD. REALLY REALLY old in
> internet years -- ca. 1994 and 1995!


Server logs haven't changed.

> Now, technology has moved along since the WWW first hit the streets, so to
> speak, and my question(s) is(are) simple:
>
> What techniques exist to overcome the problems inherent in Web Server Log
> Analysis?


Cookies, JavaScript, guesswork.

> I know there *must* be some techniques! Things like tracking users via
> cookies and using "tracker" URLs (a server script that gets URLS and
> redirects the browser, thus writing a log of what was clicked and where),
> that sort of thing.
>
> If anyone can provide some insight on the following, that's be great:
>
> What techniques exist to improve Web Sever Log analysis?
>
> How good are they?
>
> What can I do to implement them?
>
> How do different log analysis tools compare? (Examples I have considered
> using are Analog, AWStats, Webtrends, and Sawmill.)


How much money do you have to spend?

Steve

 
Alan J. Flavell
 
      02-20-2006
On Mon, 20 Feb 2006, Steve Pugh wrote:

> Read the small print. WebTrends use cookies and JavaScript instead
> of/as well as server logs.


Both of which, discerning users have been selectively blocking for
many years. What was that program we had back in Win95 days, which
blocked such things from any browser? I've actually forgotten its
name, and my old '95 PC has long since gone to the knacker's yard, but
it was definitely there; and nowadays such functions come built-in to
any decent browser.

However, servers which insist on using such techniques are inhibiting
cacheability, and thus ensuring a less responsive web, and thus are
interfering in a negative way with the results which their users
experience (*all* of their users - not only those discerning users who
block these attempts to peek into their activities).

This is, in effect, the Heisenberg law of web statistics - the harder
you try to get accurate answers, the more you interfere with the way
that the web works (recalling that HTTP was quite deliberately
designed to be "stateless"), and the worse you are able to serve the
requests of your users. And so, you end up getting more-accurate
measurements of something that would be working much better if only
you'd stop trying so hard to measure it.
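The trade-off is visible in the response headers themselves. A hedged illustration of the two stances (the values are examples, not recommendations):

```python
# Two contrasting header sets. The first lets intermediate and browser
# caches help your users; the second defeats caching so that every
# request reaches your log. Values are illustrative.
cache_friendly = {
    "Cache-Control": "public, max-age=3600",  # caches may serve it for an hour
    "ETag": '"v1"',                           # cheap revalidation on expiry
}
measurement_friendly = {
    "Cache-Control": "no-store, no-cache, must-revalidate",
    "Expires": "Thu, 01 Jan 1970 00:00:00 GMT",  # the "expires in 1970" trick
    "Pragma": "no-cache",
}
```

Every request the second set forces back to the origin is a request a cache could have answered faster.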

> They have a number of products and services which offer differing
> levels of accuracy. But at the end of the day they can not be 100%
> accurate.


Worse than that: they aren't just "inaccurate", they are seriously
"biased", but you have no way of estimating the bias.

For example, if you improved your cacheability, your users would get
faster responses, and you might get more users sticking around to read
your site, whereas your server statistics would show fewer hits thanks
to all those folks who were getting the pages out of an intermediate
cache. And would show gaps in your statistics because they revisit
pages in their *own* browser cache, whereas previously they were
having to wait to re-fetch the same page from your server on every
revisit.

> Think of them as providing information on general trends


Yeah, such as when a certain large ISP deployed a new bank of cache
servers, and the "trends" apparently showed that users had
mysteriously lost interest in the web site in question. Strangely,
each popular page that was hit on the server was being hit exactly
once every 24 hours, after which nothing was heard again from that ISP
for another 24 hours. Yup, that ISP was callously ignoring everything
that the server told it in terms of this page is uncacheable, expires
in January 1970, etc. etc., and was cacheing each page for 24 hours
without appeal. No, I'm sorry: those "trends" don't really show very
much, unless and until you really know what's happening OUT THERE.
But your server statistics have no way to tell you what's happening
out there. They're selective, and biased, and, often enough, if
interpreted to show what people demand to know - rather than
interpreted in terms of the information they really contain - can
appear to show the opposite of the truth.

Let us consider for example those misguided folks who notice that >70%
of their users appear (according to the logged user agent) to be using
MSIE, so they "optimise" their site specifically for MSIE, and,
surprise surprise, the proportion of MSIE users rises. So would you
say they acted correctly, when most everyone else reports that the
proportion appearing to use MSIE is falling? For one thing, Opera
users are starting to stand up for themselves - many of them are no
longer willing to hide behind a user agent string which pretends to be
MSIE.
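The user-agent bias is easy to demonstrate. A sketch of a naive tally, with the one ordering rule that matters here (the strings are invented examples):

```python
from collections import Counter

# Sketch: bucket user-agent strings into families. Opera historically
# included "MSIE" in its string, so it must be checked *first* -
# a naive substring match over-counts MSIE, exactly the bias above.
def ua_family(user_agent):
    ua = user_agent.lower()
    if "opera" in ua:
        return "Opera"   # may also claim MSIE; check before MSIE
    if "msie" in ua:
        return "MSIE"
    if "gecko" in ua:
        return "Gecko"
    return "other"

def tally(agents):
    return Counter(ua_family(a) for a in agents)
```

And of course the string is client-supplied and freely forgeable, so even the corrected tally is only suggestive.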

Many other changes are happening "out there", which make those numbers
viewed down the wrong end of the telescope at your server log into
highly misleading indicators of anything - except your server load,
and possibly a handy way to identify broken links.

> > However, there are a number of articles pointing to the second.
> > Notably, the author of "analog", one of the original web log
> > analysis tools, says that you can't *really* get too much
> > meaningful analysis out of your server logs.


The author of Analog works in statistics, AIUI, and is determined to
tell the truth about web servers, no matter how much some web server
operators insist that they prefer to be fooled by convincing-looking
numbers about the behaviour of their visitors. Good for him.
 
Ed Mullen
 
      02-21-2006
JDS wrote:
> Hi, all. I am constantly butting heads with others in my department about
> the interpretation of web log statistics regarding viewership of a
> website. "Page views" "path through a site" "exit points" and that sort
> of thing.
>


The very simplest thing that occurs to me is from a user's standpoint.
As one user, I sometimes hit a given page many times but for a variety
of reasons. Stats won't tell you /why/ I hit that page. It could be
because I got distracted and went somewhere else for some totally
different purpose. It could be because the page didn't load fully
(images, etc.) and I left and came back. Maybe I looked at it on Tuesday
and thought "Crap, I just don't have time now, I'll bookmark it in my
"temps" folder and check it tomorrow (or in a month). Perhaps I landed
there by accident, by clicking on the wrong link in a Google result page
or the wrong link in someone else's page. Or, maybe, I did a Google
search, went to that particular page and thought: Ohmigod! just what I
was looking for!!! Or not. How do any of the page stats tell you that?

I look at some stats for my site and don't take them all that seriously
other than aggregate changes from one month to the next, figuring that,
given all the variables, at least I can see what page is the most
accessed, the second-most accessed, the third-most, from month to month
... but that's about it. Heck, my checking my own site can skew the
stats depending on the total number of hits. At some point it becomes a
bit silly to chase after the chimera.
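That modest month-to-month ranking, with one's own visits filtered out, is a few lines of code. The IP addresses and paths here are made up for illustration:

```python
from collections import Counter

# Sketch of the modest analysis described above: rank pages for one
# month, excluding your own hits. Addresses and paths are invented.
OWN_IPS = {"192.0.2.1"}   # illustrative: your own address(es)

def top_pages(hits, n=3):
    """hits: iterable of (client_ip, path) tuples for one month."""
    counts = Counter(path for ip, path in hits if ip not in OWN_IPS)
    return counts.most_common(n)

feb = [
    ("203.0.113.5", "/index.html"),
    ("203.0.113.9", "/index.html"),
    ("192.0.2.1",   "/index.html"),   # own hit, excluded from the count
    ("203.0.113.5", "/about.html"),
]
```

Comparing the resulting rankings across months is about as far as the data honestly stretches.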

"There are lies, damned lies, and then there are statistics."

--
Ed Mullen
http://edmullen.net
http://mozilla.edmullen.net
http://abington.edmullen.net
 
Nick Kew
 
      02-21-2006
Steve Pugh wrote:

> Those things are useful if set up and interpreted with care, but are
> not 100% definitive.


Treat them as you would viewing figures for a TV show.

>>Typically, companies (e.g. Webtrends) that sell analysis software say
>>the first.

>
>
> Read the small print. WebTrends use cookies and JavaScript instead
> of/as well as server logs.


Spammers. 'nuff said (or should be - Alan expanded on some
more technical reasons).

>> However, there are a number of articles pointing to the
>>second. Notably, the author of "analog", one of the original web log
>>analysis tools, says that you can't *really* get too much meaningful
>>analysis out of your server logs.

>
>
> Yes, Analog reads server logs alone. It doesn't try to do anything with
> JavaScript, cookies, etc.


No spam, no snake oil. No surprise.

I rather suspect the author of analog may even understand the subject.
Unlike those outfits where anyone who understands the issues is firmly
ignored and probably laughed at as a nerdy loser behind their backs.

>>What techniques exist to improve Web Server Log analysis?
>>
>>How good are they?
>>
>>What can I do to implement them?


Hire a statistician. And make it someone who understands the
infrastructure of the Web. There are very few people who
qualify on both counts.

Now you need to add *knowledge* of the web's infrastructure.
That's different from the *principles*, and much harder to
collect. In fact it's impossible to collect at the level
that would be required for the likes of webtrends to work -
you have to apply the kind of techniques that broadcasters
use. I haven't worked for a broadcaster myself, but I
strongly suspect *they* rely on some pretty ropey assumptions,
too[1].

[1] I have worked as a statistician, and I've seen how things
happen when there is *no data* to validate some part of the
underlying model used. It goes like this:
- Someone picks a figure effectively at random on a 'seems
reasonable' basis just to have something to work with.
That enables them to derive numbers from the model.
- They also try the model with different figures, to test
the effect of varying the unknown. This leads to a perfectly
valid set of "if [value1] then [result1]" results.
- BUT that's too complex for a soundbite culture, so only the
first figure gets reported as a headline conclusion.
- Now, a future practitioner has NO DATA to validate this part
of the model, but has the first paper as a reference to cite.
The assumption is peripheral to the study, so the 'headline'
figure is simply used without question.
- Over time it is much-cited because nobody wants to get involved
in something that cannot be verified. The first researcher's
still totally untested working hypothesis becomes common knowledge,
and 'obviously correct' because everyone uses it.
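The second bullet, varying the unknown to see how much it matters, can be sketched in a few lines. The model and all the figures here are entirely made up; the point is reporting the spread rather than one headline number:

```python
# Sensitivity sketch: suppose "true visitors = logged hits / duplication
# factor", where the duplication factor is an unvalidated guess. The
# honest output is the range across plausible guesses, not any single
# value. All figures are illustrative.
def estimated_visitors(logged_hits, duplication_factor):
    return logged_hits / duplication_factor

logged = 10_000
scenarios = {f: estimated_visitors(logged, f) for f in (1.5, 2.0, 3.0)}
# A soundbite would report only scenarios[2.0]; the defensible claim is
# "somewhere between scenarios[3.0] and scenarios[1.5]".
```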

--
Nick Kew
 
JDS
 
      02-21-2006
On Tue, 21 Feb 2006 11:17:31 +0000, Nick Kew wrote:

> [1] I have worked as a statistician, and I've seen how things happen when
> there is *no data* to validate some part of the underlying model used. It
> goes like this:
> - Someone picks a figure effectively at random on a 'seems
> reasonable' basis just to have something to work with. That enables
> them to derive numbers from the model.
> - They also try the model with different figures, to test
> the effect of varying the unknown. This leads to a perfectly valid
> set of "if [value1] then [result1]" results.
> - BUT that's too complex for a soundbite culture, so only the
> first figure gets reported as a headline conclusion.
> - Now, a future practitioner has NO DATA to validate this part
> of the model, but has the first paper as a reference to cite. The
> assumption is peripheral to the study, so the 'headline' figure is
> simply used without question.
> - Over time it is much-cited because nobody wants to get involved
> in something that cannot be verified. The first researcher's still
> totally untested working hypothesis becomes common knowledge, and
> 'obviously correct' because everyone uses it.



Riiiight. Well, that's what I am afraid of, although that scenario sounds
all too realistic.

Well, I've come to the conclusion (and my new boss agrees) that we can
(and will) use (read: "shamelessly manipulate") server log stats to help
justify any direction we decide to take with our web presence. Frankly,
it's not like we are going to be making such huge mistakes or life and
death decisions based on server log stats, so a smidge of data
manipulation in our favor isn't too much of a problem. A lot of the
marketing, user interface, and design decisions to be made for the web are
often common-sensical[1] anyways.

But I guess an important point is that server log statistics are only one
part of a complex whole when trying to make decisions about one's web
presence or infrastructure.

Allrighty, all, thanks for the information! This was a helpful Usenet
dialog.

Later...



[1] Not that "common" sense is all that common.
--
JDS | (E-Mail Removed)lid
| http://www.newtnotes.com
DJMBS | http://newtnotes.com/doctor-jeff-master-brainsurgeon/

 
Richard Sexton
 
      02-21-2006
>Well, I've come to the conclusion (and my new boss agrees) that we can
>(and will) use (read: "shamelessly manipulate") server log stats to help
>justify any direction we decide to take with our web presence.


Then you need the book "How to Lie with Statistics"
by Darrell Huff. ISBN: 0393310728

For example, the average daily temperature in Death Valley
is 72F. (It's 0 at night and 144 by day).

(Yes, I know the difference between average, median, and mean, but it's
a joke; work with me here)

--
Need Mercedes parts ? - http://parts.mbz.org
Richard Sexton | Mercedes stuff: http://mbz.org
1970 280SE, 72 280SE | Home page: http://rs79.vrx.net
633CSi 250SE/C 300SD | http://aquaria.net http://killi.net
 
Mark Parnell
 
      02-21-2006
Deciding to do something for the good of humanity, JDS
<(E-Mail Removed)> declared in
comp.infosystems.www.servers.unix,alt.html:

> This was a helpful Usenet dialog.


Isn't that an oxymoron?

--
Mark Parnell

Now implementing http://blinkynet.net/comp/uip5.html
 
Pete Gray
 
      02-24-2006
In article <(E-Mail Removed). ac.uk>,
(E-Mail Removed) says...

> However, servers which insist on using such techniques are inhibiting
> cacheability, and thus ensuring a less responsive web, and thus are
> interfering in a negative way with the results which their users
> experience (*all* of their users - not only those discerning users who
> block these attempts to peek into their activities).
>
> This is, in effect, the Heisenberg law of web statistics - the harder
> you try to get accurate answers, the more you interfere with the way
> that the web works (recalling that HTTP was quite deliberately
> designed to be "stateless"), and the worse you are able to serve the
> requests of your users. And so, you end up getting more-accurate
> measurements of something that would be working much better if only
> you'd stop trying so hard to measure it.
>


[snipped]

Audit Scotland didn't listen when I said as much in the consultation on
the new Statutory Performance Indicator for museums in Scotland:
<http://www.scottishmuseums.org.uk/areas_of_work/spi_intro.asp>

I believe they're also sending us the 'Magic Eye' mind-reader plugin so
we'll know what the purpose of a visit to the web site was. And what can
you say about an indicator that talks about 'hits' as a measure? Idiots.

--
Pete Gray

notes from a small bedroom
<http://www.redbadge.co.uk/notes>
 