Amnesty International is without a doubt my favorite charity. Since the 60s, they’ve perfected the art of making life difficult for anyone who would seek to infringe upon basic human rights. Sadly, winding up dictators and despots costs a fair bit of money, and they’re dependent on donations of time and money from their supporters.
Next month, six members of Liverpool Hope University’s Amnesty International society (of which I am a member) will take to the streets of Merseyside, where they’ll run five kilometers dressed as everyone’s favorite Turkish gift-giver, with the aim of raising a good chunk of cash so that Amnesty International can continue doing their essential work.
Whilst I won’t be joining them on the run*, I’d be really honored if my friends could chuck a few quid in the way of this JustGiving page. It’d mean a lot to me, as this is a cause I’m really passionate about.
p.s. If you do donate, please send me an email with your postal address. I’d love to send you a little thank-you note.
Some of you may know that I’ve been spending the summer interning at ScraperWiki, an interesting data-science startup in Liverpool, England. I started at ScraperWiki with nothing but a cursory understanding of how to scrape web pages. Perhaps that’s because there are very few places online where you can safely learn about data science and scraping.
To me, this is completely bizarre. If you want to learn how to code, you have a wealth of options available to you. There are a plethora of sandboxed environments where you can invent and experiment without any adverse effects. Code School, Codecademy and Khan Academy are three great examples of this. Surprisingly, there’s nothing of the sort for aspiring data scientists. I decided to change that.
Over the weekend, I’ve been working on a new pet project: Really Scrapable Web App (RSWA). Borrowing concepts from Project Euler and Codecademy, RSWA runs locally on the user’s computer and contains a number of challenges which aim to gently introduce core skills used by data scientists.
I decided to write RSWA using Flask, a Sinatra-like web framework for Python. It also makes heavy use of LESS, a rather pleasant styling language that compiles down to CSS. RSWA is released under the permissive, open-source MIT license and can be grabbed from GitHub.
The front-end work was done by Nadil Bourkadi, a rather talented web designer, WordPress developer, and all-round nice guy based in Essex, currently interning at Drift Innovation. I’m incredibly obliged to him, as he spent half of his weekend working on this project with me. On a nice weekend, too. You should check out his blog.
Challenges and Progress
Before I continue, it’s worth stressing that RSWA is not a finished product. There’s an awful lot of room for improvement. There are a great many bugs to be fixed. There are challenges to be added. In short, it’s in alpha and is therefore liable to change a lot over the next few months.
Challenges start off really easy. The first warm-up exercise requires the user to grab the contents of an ‘h1’ element and then print it onto the screen, a task that can be accomplished in just a few lines of Python.
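That warm-up might look something like this. BeautifulSoup and the hard-coded page are my assumptions here; RSWA doesn’t prescribe a particular parsing library, and the real challenge would fetch the page from the locally running app:

```python
from bs4 import BeautifulSoup

# A stand-in for the page RSWA would serve locally.
html = "<html><body><h1>Welcome to RSWA</h1></body></html>"

# Parse the document and pull out the text of the first h1 element.
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").get_text())
```

Swap the hard-coded string for the response body of a request to the local server and you’ve solved the first challenge.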
After that, things start to get harder. Much harder. The user is introduced to scraping tables and to data sources that need to be cleaned before being stored. There are problems that depend on Selenium and regular expressions in order to be solved. There are components that simulate public APIs, shooting out JSON and XML; these introduce the user to the requests library.
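To give a flavour of those later challenges, here’s the kind of JSON-and-cleaning work they ask for. The payload below is made up for illustration, not RSWA’s actual data:

```python
import json
import re

# A simulated API response of the sort RSWA's fake endpoints emit.
payload = '{"results": [{"name": "Alice", "phone": "Tel: 0151-496-0000"}]}'

# Parse the JSON, then clean the phone field down to just the
# number before it gets stored anywhere.
data = json.loads(payload)
record = data["results"][0]
phone = re.search(r"[\d-]+", record["phone"]).group()
print(record["name"], phone)
```

The real challenges layer Selenium and the requests library on top of this, but the parse-then-clean rhythm stays the same.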
The aim is to offer a number of challenges that get incrementally more difficult, and that introduce new concepts and technologies with each level.
The next step in the development process is to add some problems that are kept behind authentication. I started work on that this weekend, but I came across some pretty awful cross-platform bugs in Flask that made this impossible.
Are you interested in this project? I would be honored if you would be willing to have a look at my work. The code is stored on GitHub. To run it, just create a new virtualenv, install the dependencies in ‘requirements.txt’ and execute ‘run.py’.
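Assuming you’ve already cloned the repository, those steps boil down to something like this (the environment name is arbitrary):

```
virtualenv env
source env/bin/activate
pip install -r requirements.txt
python run.py
```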
If you have any questions about the product, I’d love to hear them! Getting in touch is easy. Just leave a comment below or send me an email.
Are you interested in the development of RSWA? Want to keep in touch? You can sign up to get each blog post delivered to your inbox. Just pop your email address in the box on the top right of the page.
HTML5 is something I’m very deeply interested in. So, when MakeUseOf asked me to write an introductory guide to it, I jumped at the opportunity. The end result is something that I feel gives readers a brief overview of what HTML5 is all about, and why it’s changing the way we use the internet.
The social media wars are over. Facebook and Twitter won. Everyone else lost.
Or did they? There are some places in the world where Twitter and Facebook don’t reign supreme. In Brazil, you’ll find that Orkut is the place where pimply-faced teens post their ‘selfies’. In Russia, VKontakte is dominant and that doesn’t seem like it’s going to change any time soon.
In a huge swathe of Asia including India, Taiwan, the Philippines and Indonesia, you’ll find that it’s not Twitter which has won the affections of local internet users but rather Plurk, a rather curious microblogging site.
The history of Plurk is actually pretty interesting. Released in 2008, it lets people post their usual 140-character whinges. In addition, all messages are posted on a timeline, support threaded comments and carry a verb modifier such as ‘feels’, ‘loves’ and ‘says’. As a result, all postings feel inherently more conversational than Twitter’s. Whilst it sadly failed to find much success in the West, it became hugely popular elsewhere and to this day has a large contingent of die-hard users who log on to the service regularly.
Since starting work at ScraperWiki, I’ve rapidly become addicted to writing scrapers and tools. It’s really, really damn good fun. You get to see the idiosyncrasies found in various APIs, and you start to understand how the applications that use them work. If you’ve not tried it yet, I highly recommend giving it a go.
I wanted to write a scraper for Plurk because I want to discover what its millions of users are up to. Here’s how I did it, and what I learned.
The Plurk API is the basis of everything I talk about in this article. It’s a curious mix of good and bad. Firstly, getting an API key is really, really easy and ‘web scraper’ is one of the acceptable categories for applications that make use of the Plurk API. There are also a wealth of libraries and bindings for whatever language you prefer, including Python, Perl, C# and PHP.
It does have its share of warts and carbuncles, however. Firstly, searching Plurk with the API only returns 20 results per query. This means that if you want to build a large dataset, you have to run your scraper periodically. You can do this by putting your scraper in a cron job which executes as often as necessary.
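A crontab entry along these lines would do the trick; the path, script name and frequency are all illustrative:

```
# Re-run the Plurk scraper at the top of every hour.
0 * * * * python /path/to/plurk_scraper.py "search term"
```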
Another problem I encountered was a number of libraries for Plurk that are inadequately documented or documented in languages such as Traditional Chinese that I unfortunately do not speak. This is understandable, given Plurk’s popularity in East Asia. However, it made the process of scraping Plurk that bit harder.
I’ve uploaded the source code to my GitHub account. The code itself is reasonably easy to understand, thanks to the expressive nature of the Python programming language and the overwhelming simplicity of the actions that I was performing.
I start off by importing a bunch of libraries I want to use. These include plurk-oauth, json, sys, traceback and most crucially ScraperWiki, which provides a simple way to shoot the output of my query into an SQLite file. I also define my API key and secret as variables and instantiate plurk-oauth with them.
Next, I define some behavior which checks whether a parameter was passed when the code was executed. If one was, ‘search_plurk’ is called, which sends our request to the Plurk API. If the query fails, this code catches the error and lets me know what went wrong.
Finally, we’re going to define ‘search_plurk’ and iterate over the results that are returned. For each result, we select the attributes we want to keep and ignore the ones we don’t. In this case, we want the ID of the individual post, the ID of the user, the verb used and the content of the posting. We then shove them into an ordered dictionary and write it to an SQLite file.
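The heart of that step looks roughly like this. The field names are my reading of the Plurk API’s response format, so treat them as assumptions rather than gospel:

```python
from collections import OrderedDict

def shape_plurks(raw_plurks):
    """Keep only the attributes we care about from a Plurk search result."""
    rows = []
    for plurk in raw_plurks:
        rows.append(OrderedDict([
            ("plurk_id", plurk["plurk_id"]),    # ID of the individual post
            ("owner_id", plurk["owner_id"]),    # ID of the user who posted it
            ("qualifier", plurk["qualifier"]),  # the verb modifier, e.g. 'feels'
            ("content", plurk["content_raw"]),  # the text of the posting
        ]))
    return rows

# A fake result, shaped like one entry from a Plurk search response.
sample = [{"plurk_id": 1, "owner_id": 42, "qualifier": "feels",
           "content_raw": "happy", "lang": "en"}]
rows = shape_plurks(sample)
print(rows)
```

Each shaped row then goes straight into ScraperWiki’s SQLite helper, `scraperwiki.sqlite.save(['plurk_id'], rows)`, which handles table creation and deduplication on the unique key.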
Presenting Our Findings
So, we’ve got all the data we want and we’ve stored it in a database. What does it look like in ScraperWiki? Well, first there’s the search page itself, as described above.
We also have the ‘View in a table’ tool. This rather nicely places all the data you slurped from Plurk into a nice table, which you can sort by any of its columns. It’s also worth noting that any links to photos present in the dataset will be shown here.
The ‘Summarize this data’ tool is pretty handy too! It goes through your dataset and finds trends, which are then visualized in a number of rather beautiful ways. This happens automatically, without any further input from the user.
Scraping Plurk was great fun. Despite being small fry in the microblogging world, it has a well-documented API and a plethora of user-generated libraries that make interacting with it a doddle. It does have its annoyances, however, namely its absurdly low search result limit.
In the future, I’d love to create a massive dataset and see what Plurk’s usage of verb qualifiers means for sentiment analysis on the platform. How about you? Are there any websites you’d like to scrape? Let me know in the comments below.
Today, I used a Ruby library that I did not personally create.
The library itself was incredibly complete. The documentation was clear, well written and thorough. The library solved a hugely challenging problem. It was also entirely free, and the author expected no financial recompense. Downloading it meant that I was able to complete a difficult task in just a few lines of code, saving me a great deal of time and effort.
We often take open source software for granted. We’re happy to accept the cost benefits of FLOSS. We love that we get to use robust, powerful libraries and programs without having to open our wallets. On the whole, though, we’re less keen to express gratitude to the men and women who sacrifice their free time to create them.
When was the last time you donated to someone’s GitTip or emailed a developer just to express your gratitude?
Next time you use someone else’s code, say thank you. It’s only polite.