Note: I KNOW this is all rather pointless, because you can always get a complete database dump from Wikimedia to play with. But for my needs, this would have been just overkill.
This started out with me being disappointed by all the uninteresting pages that cropped up whenever I used the
Random article feature in the English (and German) Wikipedia. Mostly it returned pages about, e.g.,
some insignificant politician,
some village in Pakistan,
some short canal closed 100 years ago,
some dragonfly subspecies or
some obscure music album (probably a classic).
But I was not interested in stubs. I wanted to procrastinate properly. I remembered that, periodically, Wikipedia articles found to be both interesting and well written are given the predicate "featured". So I headed to the page listing them all... Whoa! Hours... no, days of mindless surfing, er, general education. Still, a random article function for featured articles would be great, and there seems to be no such thing. I decided to write a short Python script to parse the page and spit out one article at random.
The first obstacle was that requests made by Python's urllib are not served by Wikipedia. That makes sense, because they don't want to serve crawlers in general. But I did not intend to crawl. At least for now. I therefore circumvented the blocking by subclassing urllib's URL opener and setting my own HTTP user agent string to something resembling that of a browser. That opened the door to Wikipedia access via Python.
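For illustration, here is roughly what the user agent trick looks like in present-day Python 3. This is a minimal sketch, not the downloadable script: it sets the header on a `urllib.request.Request` instead of subclassing anything, and the names `BROWSER_USER_AGENT`, `build_request` and `fetch` as well as the user agent value itself are made up for this example.

```python
import urllib.request

# Assumed, illustrative value; any browser-like user agent string would do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download a page and return its HTML as text (this hits the network)."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        # Fall back to UTF-8 if the server does not announce a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch(url)` with the address of the featured articles page would then return the HTML to parse. Building the request is kept separate from sending it so the header can be inspected without any network access.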
Next, I needed a way to parse the page. I used the excellent BeautifulSoup HTML parser for the first time to do this. Fun. Nuff said. The regular expression to exclude internal Wikipedia pages (those whose URLs contain a colon) was also easy enough.
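The parsing and filtering step can be sketched as follows. To keep the example free of dependencies it swaps BeautifulSoup for the standard library's `html.parser`, so it is not the code from the script. The pattern `ARTICLE_HREF` and the names `LinkCollector`, `article_links` and `random_article` are assumptions made for this example, as is the idea that article links have the form `/wiki/Title`.

```python
import random
import re
from html.parser import HTMLParser

# Assumed link shape: /wiki/Title. Internal pages (Help:, File:, ...) contain
# a colon, so the pattern rejects any path with a colon (or a # fragment).
ARTICLE_HREF = re.compile(r"^/wiki/[^:#]+$")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that looks like an article link."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and ARTICLE_HREF.match(href):
            self.hrefs.append(href)


def article_links(html):
    """Return all article-like hrefs found in the HTML, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.hrefs


def random_article(html, rng=random):
    """Pick one article link at random (raises IndexError if there are none)."""
    return rng.choice(article_links(html))
```

Fed with the HTML of the featured articles listing, `random_article` returns one relative link; prefixing it with the wiki's host name gives the page to open.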
The result is this:
Edit: Whoops. Apparently I am too stupid to embed proper Python code in here, so download the script instead.
Wow, that was easy.
Then I went ahead and expanded this script into a full-blown Wikipedia crawler.
Note: Currently there seems to be a bug in Python's sgmllib... I used the fix locally before getting the following to work.
At first my crawler just followed random links and printed the page titles (disregarding any robots.txt, in case you ask. EDIT: It is not intended to be malicious in any way and is only used on a very small scale.). On one occasion the trajectory seemed rather uncannily meaningful, as my crawler appeared to research Nazi war tactics and weapons.
But what if the crawler only traverses those pages that link back to the page it came from? Wouldn't it generate more meaningful results? Couldn't we map Wikipedia in this way? Automatically generate associative maps like
visuwords.com? As I stated above, the use of crawlers is probably inefficient. But playing around with the database could lead to most interesting semantic maps. Could we perhaps create a partially artificial mind by crowdsourcing the immense knowledge and interconnections exhibited by Wikipedia articles?
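The backlink rule itself is easy to sketch on a toy example. The code below is not the crawler: it walks a small in-memory link graph instead of fetching pages, and the graph `LINKS`, its page names and the functions `backlinked_neighbours` and `backlink_walk` are all hypothetical.

```python
import random

# Hypothetical toy link graph: page title -> set of titles it links to.
LINKS = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"A"},
}


def backlinked_neighbours(graph, page):
    """Pages that `page` links to and that also link back to `page`."""
    return sorted(
        target for target in graph.get(page, ()) if page in graph.get(target, ())
    )


def backlink_walk(graph, start, steps, rng=random):
    """Random walk that only ever moves along mutual (two-way) links."""
    trail = [start]
    for _ in range(steps):
        candidates = backlinked_neighbours(graph, trail[-1])
        if not candidates:
            break  # dead end: nothing links back to the current page
        trail.append(rng.choice(candidates))
    return trail
```

In this graph a walk starting at "D" stops immediately, because "A" does not link back to "D", while a walk starting at "A" can only move to "B". A real crawler would have to fetch each candidate page to check for the backlink, which is one more reason the database dump would be the more efficient route.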
Anyway, my new crawler makes more meaningful trajectories than the old one.
Here is the code: I am sorry, here is the code; I am still too stupid to embed it properly.