When I first started learning to code, I thought writing a program to scrape, or fetch information, from a website was the coolest thing ever. With just a few simple commands, I could systematically gather pretty much any piece of text or data from any website I wanted. I still think it’s just as cool and useful today. And it’s a great way to get your hands on a fun, new dataset to explore and analyze.
For example, I love ordering food from Seamless.com. It has a huge selection of restaurants, and placing and paying for an order is quick and easy.
But there is one feature on Seamless that bugs me – when you first start searching for restaurants near your place, the default ordering of the results is alphabetical. I know a lot of things in life are sorted this way, but Seamless has better signals to use than the first letter of the restaurant name, like estimated delivery time, average rating, number of reviews or, even better, relevance based on my prior orders.
That said, maybe this default ordering is moot if users immediately filter down to a particular cuisine. Or perhaps users are even reassured when they see 10 restaurants that all start with ‘A’: it implies a large offering on Seamless, which inspires confidence and leads to more exploring and more completed orders.
This got me wondering: do restaurants that begin with letters at the start of the alphabet get more sales on Seamless just because they’re higher on the page? I wanted to try and find out, but Seamless doesn’t release that kind of restaurant sales data. But maybe I could estimate it – Seamless does tell you how many times each restaurant has been reviewed. This definitely isn’t a perfect proxy for sales (e.g., it assumes that all restaurants have been on the site for the same amount of time, and that the likelihood of leaving a review is the same across cuisines and price ranges), but it’s probably fairly decent and I think it’s the best I’m gonna do. Time to scrape the 51 pages of results for Brooklyn restaurants and gather and store the relevant data. I’m using Ruby to do this, but there are a lot of great helper libraries out there for other languages too.
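A scraper for this kind of job can be pretty small. The sketch below is only an illustration of the general shape (loop over the result pages, pull out each listing, write the rows to a CSV), not the exact script. The URL, the HTML markup matched by `LISTING_PATTERN`, and the column names are all made-up placeholders, so they would need to be swapped for whatever the real pages contain. It sticks to Ruby’s standard library; for real-world HTML, a proper parser such as Nokogiri is usually a sturdier choice than a regular expression.

```ruby
require "csv"
require "net/http"
require "uri"

# Hypothetical URL pattern for the paged search results (an assumption,
# not the site's real address).
BASE_URL = "https://example.com/brooklyn/restaurants?page=%d"
TOTAL_PAGES = 51 # number of result pages mentioned above

# Assumed markup for a single listing. Real pages will look different,
# so this pattern has to be adapted after inspecting the actual HTML.
LISTING_PATTERN = %r{
  <li\s+class="restaurant"\s+
  data-rating="([\d.]+)"\s+
  data-reviews="(\d+)"\s+
  data-price="(\d)">
  ([^<]+)
  </li>
}x

# Turn one page of HTML into an array of hashes, one per restaurant.
def parse_listings(html)
  html.scan(LISTING_PATTERN).map do |rating, reviews, price, name|
    { name: name.strip, rating: rating.to_f, reviews: reviews.to_i, price: price.to_i }
  end
end

# Download the HTML for one page of results.
def fetch_page(page)
  Net::HTTP.get(URI(format(BASE_URL, page)))
end

# Serialize the rows as CSV text with a header line.
def to_csv(rows)
  CSV.generate do |csv|
    csv << %w[name rating reviews price]
    rows.each { |row| csv << row.values_at(:name, :rating, :reviews, :price) }
  end
end

# Walk every results page, pausing between requests, and save one file.
def scrape_all(path)
  rows = (1..TOTAL_PAGES).flat_map do |page|
    sleep 1 # be polite to the server
    parse_listings(fetch_page(page))
  end
  File.write(path, to_csv(rows))
end

# Example call (hypothetical output filename):
#   scrape_all("brooklyn_restaurants.csv")
```

Keeping the parsing in its own method means it can be tried out on a saved copy of one page before hitting the site 51 times. It’s also worth checking a site’s terms of use before scraping it.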
With that, I now have a file with 1,260 entries, one for each Brooklyn restaurant on Seamless, along with its average rating (1-5), total number of reviews, and price (1-5). The data file looks like this:
Now it’s time to drop it into R, or your data analysis tool of choice, and do some exploring! Pretty plots and answers (or observations at least) to come in a future post. :)