Some of you may know that I’ve been spending the summer interning at ScraperWiki, an interesting data-science startup in Liverpool, England. I started at ScraperWiki with nothing but a cursory understanding of how to scrape web pages. Perhaps that’s because there are very few places online where you can safely learn about data science and scraping.
To me, this is completely bizarre. If you want to learn how to code, you have a wealth of options available to you. There is a plethora of sandboxed environments where you can invent and experiment without any adverse effects. Code School, Codecademy and Khan Academy are three great examples of this. Surprisingly, there’s nothing of the sort for aspiring data scientists. I decided to change that.
Over the weekend, I’ve been working on a new pet project: Really Scrapable Web App (RSWA). Borrowing concepts from Project Euler and Codecademy, RSWA runs locally on the user’s computer and contains a number of challenges which aim to gently introduce core skills used by data scientists.
I decided to write RSWA using Flask, which is a Sinatra-like web framework for Python. In addition to that, it makes heavy use of LESS, a rather pleasant styling language that compiles down to CSS. RSWA is licensed under the permissive, open-source MIT license and can be grabbed from GitHub.
The front-end work was done by Nadil Bourkadi, a rather talented web designer, WordPress developer, and overall nice guy who is based in Essex and works as an intern at Drift Innovation. I’m incredibly grateful to him, as he spent half of his weekend working on this project with me. On a nice weekend, too. You should check out his blog.
Challenges and Progress
Before I continue, it’s worth stressing that RSWA is not a finished product. There’s an awful lot of room for improvement. There are a great many bugs to be fixed. There are challenges to be added. In short, it’s in alpha and is therefore liable to change a lot over the next few months.
Challenges start off really easily. The first warm-up exercise requires the user to grab the contents of an ‘h1’ element and then print it to the screen. This is a task that can be accomplished in just a few lines of Python.
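To give a flavour of that warm-up, here is a minimal sketch that uses only Python’s standard library. The page markup in SAMPLE_PAGE is made up for illustration and is not the actual RSWA challenge page, and H1Extractor and first_h1 are names I’ve invented for this example. A dedicated parsing library such as Beautiful Soup would likely make it shorter still.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collect the text inside the first <h1> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # True while we are between <h1> and </h1>
        self.done = False    # True once the first <h1> has been closed
        self.parts = []      # text fragments found inside the heading

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def first_h1(html):
    """Return the stripped text of the first <h1>, or '' if there is none."""
    parser = H1Extractor()
    parser.feed(html)
    return "".join(parser.parts).strip()


# Hypothetical markup; the real challenge page will differ.
SAMPLE_PAGE = "<html><body><h1> Hello, scraper! </h1><p>Intro</p></body></html>"

print(first_h1(SAMPLE_PAGE))
```

In a real attempt at the challenge, the HTML string would come from the locally running app rather than from a hard-coded sample.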
After that, things start to get harder. Much harder. The user is introduced to scraping tables and data sources that need to be cleaned before being stored. There are problems that can only be solved with Selenium and regular expressions. There are components that simulate public APIs and return JSON and XML. This introduces the user to the requests library.
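As a rough illustration of the table-scraping and cleaning steps, here is a sketch that again sticks to the standard library. The table, the JSON payload, and the names TableParser, clean_price and scrape_table are all invented for this example; none of them come from RSWA itself.

```python
import json
import re
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = None    # cells of the row being read, if any
        self.current_cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.current_row = []
        elif tag in ("td", "th") and self.current_row is not None:
            self.current_cell = []

    def handle_data(self, data):
        if self.current_cell is not None:
            self.current_cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.current_cell is not None:
            self.current_row.append("".join(self.current_cell).strip())
            self.current_cell = None
        elif tag == "tr" and self.current_row is not None:
            self.rows.append(self.current_row)
            self.current_row = None


def clean_price(text):
    """Pull the first number out of a messy string, dropping thousands separators."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def scrape_table(html):
    """Turn a table into a list of dicts keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical data; the real challenges use their own tables and endpoints.
SAMPLE_TABLE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td> Widget </td><td>$1,250.00</td></tr>
  <tr><td>Gadget</td><td> 3 USD </td></tr>
</table>
"""
SAMPLE_JSON = '{"items": [{"name": "Widget", "price": "1250.00"}]}'

records = scrape_table(SAMPLE_TABLE)
for record in records:
    record["Price"] = clean_price(record["Price"])
print(records)

# With the requests library, response.json() would do this decoding step.
api_items = json.loads(SAMPLE_JSON)["items"]
print(api_items[0]["name"], float(api_items[0]["price"]))
```

The cleaning is kept in its own function so that the same rule can be applied to values from the HTML table and from an API response alike.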
The aim is to offer a number of challenges that get incrementally more difficult, and that introduce new concepts and technologies with each level.
The next step in the development process is to add some problems that are kept behind authentication. I started work on that this weekend, but I came across some pretty awful cross-platform bugs in Flask that made this impossible.
Are you interested in this project? I would be honored if you would be willing to have a look at my work. The code is stored on GitHub. To run it, just create a new virtualenv, install the dependencies listed in ‘requirements.txt’ and execute ‘run.py’.
If you have any questions about the project, I’d love to hear them! Getting in touch is easy. Just leave a comment below or send me an email.
Are you interested in the development of RSWA? Want to keep in touch? You can sign up to get each blog post delivered to your inbox. Just pop your email address in the box on the top right of the page.