ScraperWiki.com and You

This is a child page of http://wiki.pyconuk.org/This%20is%20what%20we%20did

*ScraperWiki.com*

An example of a good scrape is gathering building planning applications from the websites of all of the UK's Local Authorities. These are presented in many different forms on the authorities' sites. What could we do with such data in a database? :-)

http://www.planningalerts.com aggregates them all with a web scraper, integrates the data, and converts each postcode to a latitude and longitude. You can then subscribe to an RSS feed of all applications within a given distance of your postcode.
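The "within a given distance" step can be sketched with the haversine formula. This is only an illustration, not how planningalerts.com does it: the function names, the record layout and the sample coordinates are all made up, and the postcode lookup itself is assumed to have happened already.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def applications_within(applications, centre, radius_km):
    """Return the applications located within radius_km of centre (lat, lon)."""
    lat0, lon0 = centre
    return [app for app in applications
            if haversine_km(lat0, lon0, app["lat"], app["lon"]) <= radius_km]


# Made-up sample records (hypothetical layout); real ones would come from a scraper.
SAMPLE_APPLICATIONS = [
    {"ref": "A/001", "lat": 51.5010, "lon": -0.1420},
    {"ref": "B/002", "lat": 52.4862, "lon": -1.8904},
]
```

Calling `applications_within(SAMPLE_APPLICATIONS, (51.5074, -0.1278), 5)` keeps only the records within 5 km of that point; an RSS feed would then be built from the result.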

Web scrapers are NEEDED for all public information. There is a clear public remit for public database-level access to that data. The core idea is focused on public sector data, and there was a "Young Rewired State" event at Google's London campus recently about this. Obviously, though, ScraperWiki can be used for any site.

ScraperWiki.com is principally by Julian Todd - http://www.goatchurch.org.uk/ - who wrote much of the initial web scraper for the Parliament website used at http://www.theyworkforyou.com

This is a Django site for making scrapers together: a wiki for Python programmers who collaborate on writing scrapers, and a hosting service for scraped databases.

Once someone has written a scraper script and run it successfully against a site, the data is stored on the scraperwiki.com site for any other user to write queries against, in SPARQL (RDF) or whatever.

The wiki-style collaboration is important because scrapers break every time the sites they target are updated and their URL trees change, and often only a small change is needed to get a scraper working again.

Anyone can write a scraper using the site. You can clone scrapers, as on ning.com, and all data sets are tied to specific version-stamped scrapers. The quality control process is like Wikipedia's: as freeform as possible.
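One way to picture a data set tied to a version-stamped scraper is to tag every record with a hash of the scraper's source code. This is a guess at the idea, not ScraperWiki's actual scheme; the function names and the field names `scraper` and `scraper_version` are hypothetical.

```python
import hashlib


def version_stamp(scraper_source):
    """Short, stable identifier derived from the scraper's source text."""
    return hashlib.sha1(scraper_source.encode("utf-8")).hexdigest()[:12]


def stamp_records(records, scraper_name, scraper_source):
    """Return copies of the records tagged with the scraper name and version."""
    stamp = version_stamp(scraper_source)
    return [dict(record, scraper=scraper_name, scraper_version=stamp)
            for record in records]
```

If someone edits the scraper, the stamp changes, so rows produced by the old and new versions can be told apart.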

Channel 4's "4IP Ventures" funded the project; it started just last month and will launch in February 2010, and it's for Python hackers like you! :) The funding from 4ip.org.uk will all be gone by February and is going entirely on code, none on PR, so if you like web scraping, help with PR is badly needed!

Examples of sites that could be usefully scraped:

Are you legally allowed to republish data?

British government data, served on sites with a gov.uk TLD, is now pretty much a free-for-all. Public data in the UK is typically published under Crown Copyright. eGIF - [http://www.govtalk.gov.uk/schemasstandards/egif.asp Electronic Government Interoperability Framework] (is this the right eGIF? e-Government Metadata Standard (e-GMS) is part of eGIF: is that what's being referred to?) - is a Dublin Core style metadata schema, and it says that all UK public web pages should be accessible, in XHTML, and semantically marked up so the public can make full use of them. But not all sites meet this, and some civil servants don't know the public's rights here.

The "Re-use of Public Sector Information Regulations 2005" [http://www.opsi.gov.uk/si/si2005/20051515 statutory instrument] says that government-published information is basically reusable. If civil servants complain, well, they must have a policy explaining how they will charge, what the justification for the charge is, that they charge the same amount to everyone, and so on. Almost always this isn't the case, and it's more pain for them than for you if they try to stop you.

Also, the Parliament scraper was perhaps breaking the "parliamentary copyright" that restricts the data on the Parliament website, but if they had tried to stop the republishing on a site which is better than theirs, they would have had to explain why their site wasn't that good, which is obviously never going to happen ;-)

Many websites are published by private organisations that stake a proprietary claim on the data they publish. For example, UK train fares are bizarrely organised, so two tickets, A to B and B to C, may be cheaper than a single A to C fare; a scraper would reveal such things. Some sites try to block scrapers with IP blacklists, speed throttling, and so on, but that is very rare; many of these organisations have barely produced a website at all ;-) Fortunately, most sites are not actively trying to block scraping; Companies House, for example, just has an opaque CMS.
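The split-ticket point is easy to check once fares are in a table. Here is a minimal sketch, assuming the scraped fares end up as a dict mapping (origin, destination) to a price in pence; the function name and the sample fares are invented for illustration.

```python
def find_split_savings(fares):
    """Find journeys where two tickets via a third station beat the direct fare.

    fares maps (origin, destination) -> price in pence.
    Returns sorted tuples (origin, via, destination, direct_price, split_price).
    """
    savings = []
    for (origin, dest), direct in fares.items():
        for (leg_origin, via), first_leg in fares.items():
            # Only consider first legs that start where the journey starts
            # and stop somewhere other than the final destination.
            if leg_origin != origin or via == dest:
                continue
            second_leg = fares.get((via, dest))
            if second_leg is not None and first_leg + second_leg < direct:
                savings.append((origin, via, dest, direct, first_leg + second_leg))
    return sorted(savings)


# Invented fares, in pence.
SAMPLE_FARES = {
    ("A", "B"): 1000,
    ("B", "C"): 1200,
    ("A", "C"): 2500,
}
```

With the sample table, `find_split_savings(SAMPLE_FARES)` reports that A to C via B costs 2200 against 2500 direct. It only looks at a single split point; longer chains would need a shortest-path search.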

You need not install anything to use this; it's a true web app. It is Affero GPL licensed, so you CAN install a copy on a local server, such as your laptop, if you want to. The Channel 4 sponsored site will not be hosting any controversial scrapers, but other sites running the codebase will probably appear.

There are businesses that make money from scraping; http://www.kapowtech.com charges around $3,200/month, and Google sells indexes of websites, but these do not provide database-style data access. Also, Django has a plugin for a Google XML Scraping system, so you can invite Google to scrape your data raw in a certain XML form.

Shown was a flow diagram explaining the process that the program walks through as it is used: a "reader" object is a page of code, version controlled, and edited by users. This executes on a VM that connects to the web, sucks down the start pages, and starts the "Detector" that applies the scraper to them, parses the output and sticks it into the "Collector" …
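The flow in the diagram can be roughed out in a few lines of Python. The names "Detector" and "Collector" come from the diagram, but everything about how they behave here is an assumption: the `LinkDetector` class, the `fetch` callable, the `run_scraper` function and the table layout are all hypothetical, and the real ScraperWiki code may look nothing like this.

```python
import sqlite3
from html.parser import HTMLParser


class LinkDetector(HTMLParser):
    """Hypothetical 'Detector': pulls (href, text) pairs out of <a> tags."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.records.append((self._href, "".join(self._text).strip()))
            self._href = None


class Collector:
    """Hypothetical 'Collector': stores scraped records in an SQLite table."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE records (source TEXT, href TEXT, title TEXT)")

    def save(self, source, records):
        self.db.executemany(
            "INSERT INTO records VALUES (?, ?, ?)",
            [(source, href, title) for href, title in records])
        self.db.commit()

    def all(self):
        return self.db.execute(
            "SELECT source, href, title FROM records ORDER BY href").fetchall()


def run_scraper(start_urls, fetch, collector):
    """Fetch each start page, run the detector on it, hand results to the collector."""
    for url in start_urls:
        detector = LinkDetector()
        detector.feed(fetch(url))
        detector.close()
        collector.save(url, detector.records)
```

Passing `fetch` in as a parameter keeps the sketch testable without touching the network: a dict lookup can stand in for it, and a function that really downloads the page can be swapped in later.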

It's all Python, Django-based. The text editor part of the application uses "CodeMirror", like a lightweight Bespin - http://marijn.haverbeke.nl/codemirror/

