Saturday, August 13, 2011

Assignment 7: Link Checking and Log Analysis

If you are happy with the assignments you have completed so far, then you do not need to complete this assignment. I have made a "site" (a haystack, http://www.courses.fas.harvard.edu/~cscie12/haystack/) that has approximately 200 pages with a total of approximately 2000 links. Five of these links do not work -- you need to find these 5 needles in the haystack.

Your task is to find the links that result in a non-200 HTTP status response (e.g. that result in 404 Not Found, 301 Moved Permanently, 302 Moved Temporarily). I recommend that you use checkbot to do this.

  ice% ~cscie12/bin/checkbot --help
  Checkbot 1.66 command line options:

     --debug            Debugging mode: No pauses, stop after 25 links.
     --verbose          Verbose mode: display many messages about progress.
     --url url          Start URL
     --match match      Check pages only if URL matches `match'
                        If no match is given, the start URL is used as a match
     --exclude exclude  Exclude pages if the URL matches 'exclude'
     --ignore ignore    Do not list error messages for pages that the URL matches 'ignore'
     --file file        Write results to file, default is checkbot.html
     --mailto address   Mail brief synopsis to address when done.
     --note note        Include Note (e.g. URL to report) along with Mail message.
     --proxy URL        URL of proxy server for external http and ftp requests.
     --internal-only    Only check internal links, skip checking external links.
     --sleep seconds    Sleep for secs seconds between requests (default 2)
     --timeout seconds  Timeout for http requests in seconds (default 120)
     --interval seconds Maximum time interval between updates (default 10800)
     --dontwarn codes   Do not write warnings for these HTTP response codes
     --enable-virtual   Use only virtual names, not IP numbers for servers

  Options --match, --exclude, and --ignore can take a perl regular expression as their argument.
  Use 'perldoc checkbot' for more verbose documentation.
  Checkbot WWW page : http://degraaff.org/checkbot/
  Mail bugs and problems: checkbot@degraaff.org

Checkbot will produce output in HTML (filename of "checkbot.html") in the directory in which you start checkbot.
So, as an example, you could do the following:

  1. make a directory for your checkbot results
  2. cd to the directory
  3. change permissions for the directory
  4. run checkbot (--verbose will let you see what checkbot is doing; --sleep 0 will cause checkbot not to pause between requests, so it will finish faster)
  5. change permissions on the HTML files that checkbot produced
  6. view the results from a web browser

  ice% mkdir ~/public_html/checkbot
  ice% cd ~/public_html/checkbot
  ice% chmod a+rx ./
  ice% ~cscie12/bin/checkbot \
         --verbose --sleep 0 \
         --url http://www.courses.fas.harvard.edu/~cscie12/haystack/
  ...output not shown...
  ice% ls
  checkbot-www.courses.fas.harvard.edu.html
  checkbot.html
  ice% chmod a+r *.html

And now, view the "checkbot.html" page with a web browser. The report details will be in the "checkbot-www.courses.fas.harvard.edu.html" page.
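To make it concrete what checkbot is doing for each link, here is a minimal sketch (not checkbot's actual code) of checking a single URL and reporting its raw HTTP status. It uses Python's standard http.client rather than urllib so that 301/302 redirects are reported as-is instead of being silently followed; the helper name check_url is hypothetical.

```python
# Hypothetical sketch of a single link check, similar in spirit to what
# checkbot does per link. Redirects are NOT followed, so 301/302 show up
# as such -- exactly the responses the assignment asks you to report.
import http.client
from urllib.parse import urlparse

def check_url(url, timeout=10):
    """Return the raw HTTP status code for url (redirects not followed)."""
    parts = urlparse(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # HEAD avoids downloading the body; a checker that needs to parse
        # the page for further links would use GET instead.
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()
```

A real crawl would apply this to every link found on every fetched page and record any URL whose status is not 200, along with the page that linked to it.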

For each link that does not give a "200 OK" HTTP response, you will need to report:

  - the URL that gave a non-200 response
  - the status code
  - the URL of the page that contained the link

The haystack is located at: http://www.courses.fas.harvard.edu/~cscie12/haystack/

You will analyze the log file of this course (/home/c/s/cscie12/logs/cscie12.log.gz) from September through November and provide an "executive" summary (i.e. be short and to the point) of your analysis. Where appropriate, you should link to any reports generated from Analog. Draw some conclusions about the use of the site; don't simply cite numbers.

Note that this log does not contain log entries for the Discussion Group or the Lecture Videos. Also, the hostnames of the machines have been changed to protect privacy -- for example, heitmeyer.mediaone.net might be changed to something like gibnax.mediaone.net

Questions to address:

  - How much was the site used?
  - What areas of the site were most heavily used? Does this correspond to how you used the site?
  - What weeks, days, times of the day, and days of the week showed the most activity? Why?
  - What sites and pages outside of Harvard referred people to the site?
  - What browsers and versions (and operating systems) were widely used? Based on these statistics, would you recommend reliance on CSS?

Log files: /home/c/s/cscie12/logs/cscie12.log.gz ("combined log format")

I have already generated an Analog Report from the above log file. (Note: the only report you will need to generate is for the "browser summary".)

Hints and reminders:

  - You should run analog on the "ice" machines only.
  - I have configured analog (~cscie12/bin/analog) to know where the log file for the course is located -- you do not need to specify a location.
  - The logs are in gzipped format -- analog knows how to handle decompressing these files.
  - If you are curious (good for you!) and want to look at the contents of the file, you can do that with the "zcat" command. Be careful: these files are roughly a quarter million lines long, so you'll want to pipe them through "more" if you just want to look at a few pages.

  ice% zcat ~cscie12/logs/cscie12.log.gz | more

  Simply hit "CTRL-C" when you've had enough.
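If you do peek at the raw log, each line is in the combined log format: host, identity, user, timestamp, request, status, bytes, referer, and user agent. Analog parses all of this for you, but as an illustration, here is a hedged sketch of reading the gzipped log and counting requests per day; the regex and the helper name requests_per_day are my own, not part of analog.

```python
# Illustrative sketch of parsing a gzipped combined-format access log
# and tallying requests per day. Analog does this (and much more) for you.
import gzip
import re
from collections import Counter

# Combined log format:
#   host ident user [date:time zone] "request" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^:]+):(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def requests_per_day(path):
    """Count log entries per calendar day; path is a gzipped log file."""
    days = Counter()
    with gzip.open(path, "rt", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m:
                days[m.group("date")] += 1
    return days
```

The same loop, keyed on other fields (request path, referer, agent), would answer the rest of the questions above by hand, which is why letting analog generate the reports is the sensible route.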
Analog is in ~cscie12/bin/analog.

  - You do not need to copy Analog to your directory.
  - You do not need to copy the log file to your directory.
  - The "-A" option turns off all analog reports.
  - The "+A" option turns on all analog reports.

Analog reports and command line options: for example, to turn off all reports ("-A") and produce text output ("+a") of the "browser summary" report ("+b"; see http://www.analog.cx/docs/output.html), the following command would work:

  ice% ~cscie12/bin/analog -A +a +b | more
  ...output not shown...

To turn off all reports and produce HTML output ("-a") of the "browser summary" report, and to direct the HTML output to a file called "browser_summary.html":

  ice% ~cscie12/bin/analog -A -a +b > browser_summary.html
  ice% chmod a+r browser_summary.html
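Conceptually, the browser-summary report is just a frequency count of the User-Agent field (the last quoted string on each log line). As a rough sketch of that idea, assuming combined-format lines (this is not analog's implementation, and browser_summary is a made-up name):

```python
# Sketch of the idea behind a browser-summary report: tally the final
# quoted field (the user agent) across combined-format log lines.
import re
from collections import Counter

AGENT_RE = re.compile(r'"([^"]*)"\s*$')  # last quoted field on the line

def browser_summary(lines):
    """Count user-agent strings over an iterable of log lines."""
    agents = Counter()
    for line in lines:
        m = AGENT_RE.search(line)
        if m:
            agents[m.group(1)] += 1
    return agents
```

Analog's real report additionally groups raw agent strings into browser families and versions, which is what makes it useful for judging whether your audience's browsers can handle CSS.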
