Web-Crawling the State of Washington, Office of Superintendent of Public Instruction web site.


I’ve been meaning to spend some time clicking around through the OSPI web site, to get some idea of the scope and depth of the information therein.  My assumption is that there must be a lot of good and useful stuff there.  However, random clicking on links seemed a little hit-or-miss.  That’s not a good way to reach my goal, namely a comprehensive view of what is on that site.

Then it occurred to me that a web-crawler program might help.  The basic idea is that you point the web crawler at the root of the web site, and it follows the links down every branch, page by page, until it has hit every leaf in the tree, building a list of the pages as it goes.
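To make the idea concrete, here is a minimal sketch of such a crawler in Python, using only the standard library. It is not how the tool I used works internally (I don't know that). The names `LinkParser`, `fetch_html` and `crawl` are my own, and the rules it applies (stay on the same host, drop `#fragment` suffixes, stop after `max_pages` fetches) are assumptions I picked to keep it short.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page; return "" on failure or for non-HTML content."""
    try:
        with urlopen(url, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(root, fetch=fetch_html, max_pages=1000):
    """Breadth-first walk of same-host links; returns a sorted list of URLs."""
    host = urlparse(root).netloc
    seen = {root}            # every URL discovered so far
    queue = deque([root])    # URLs still waiting to be fetched
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" suffix.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

You would call it as `crawl("http://example.test/")`, with the placeholder address swapped for the real root of the site. A real crawler should also pause between requests and respect the site's robots.txt, which this sketch leaves out.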

Here’s the file (PDF) of all the web pages on the OSPI web site, as of 10/24/2010.

And here’s a quick sample.  I was looking for “Race to the Top” and quickly figured out that RTTT was being used as the abbreviation.  So I filtered the Excel file by those terms and came up with these URLs.  Note that this is just a search of the URL names, not a “usual” search of the content of the web pages.


Note:  you could also do this with the Search function on the OSPI web site, which searches content.  But sometimes it helps to know in advance that you will find something *if* you search, i.e. to know how the web site is organized and how its pages are named.
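The URL-name filter I did in Excel can also be sketched in a few lines of Python. This is only an illustration: `filter_urls` is a name I made up, and the URLs in the example are invented placeholders, not actual OSPI pages.

```python
def filter_urls(urls, terms):
    """Return the URLs whose text contains any of the terms, ignoring case."""
    lowered = [t.lower() for t in terms]
    return [u for u in urls if any(t in u.lower() for t in lowered)]


# Hypothetical crawler output, for illustration only.
urls = [
    "http://example.test/rttt/default.aspx",
    "http://example.test/RaceToTheTop/faq.aspx",
    "http://example.test/assessment/",
]
matches = filter_urls(urls, ["rttt", "racetothetop"])
```

Because URLs rarely contain spaces, a phrase like “Race to the Top” has to be tried in the run-together or abbreviated forms that a site actually uses in its addresses.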

The tool that I used to create the map is http://www.winwebcrawler.com (V2.0 is a 15-day free trial).

One serendipitous thing I found during this little exercise is a web search engine that combines search results from Yahoo-Google-Bing.  Give it a try sometime!
