9 Feb 2015

My first scrape: what I learned from getting my own data

I've just produced my first report based on a scrape: I pulled data from Bloomberg's company profiles to find out the demographic composition of company boards.

It was for a post on Project Ada, a blog covering women in tech. We wanted to find out how the UK's top tech companies were performing in terms of women's representation.

I scraped the data from Bloomberg's profiles of the sixty companies on techMARK - the part of the London Stock Exchange that focuses on innovative technology companies.


I used OutWit Hub to get the data, plugging the relevant selectors into its simple GUI to scrape the data I wanted.

First of all, I needed a list of URLs to process with OutWit Hub - one URL for each company I wanted to scrape data for. Then I could process the list with one scraper and get all of the information. 

This was easy enough to get, as the URLs of Bloomberg's company pages change only with the company code. I could use a simple Excel formula to plug each company's code into the URL.
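The same list-building step can be sketched in Python. The URL pattern and ticker codes below are illustrative placeholders, not Bloomberg's actual scheme:

```python
# Build one profile URL per company code.
# The base URL here is a made-up example, not Bloomberg's real pattern.
BASE = "https://www.example.com/quote/{code}:LN"

codes = ["BT.A", "SGE", "MONY"]  # illustrative ticker codes
urls = [BASE.format(code=code) for code in codes]
```

Each code is simply substituted into the template, exactly as the Excel formula concatenated the code into the rest of the address.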


The right tags

So then I just needed to find the tags enclosing the data I wanted: OutWit scrapes each page for the data between the tags you specify.

Using one of the company pages as a test case to make sure the scraper worked, I examined the source code to find the right tags. The sector was easy enough to find - the <div>s were marked with a simple 'sub_industry' class.

The names, however, were slightly harder. I needed to grab the 'Michael Rake' from the line below:
<td class='name'><a href="/profiles/people/1725655-michael-rake">Michael Rake</a></td>
As the link changed with every name and ID number, I initially thought I'd need a regular expression to disregard the parts that varied with each name. However, after experimenting with the scraper on one of the pages I needed to scrape, I found that a simple opening marker worked: <td class='name'><a href. Much better.

[Screenshot: the OutWit Hub scraper]

This scraper then trawled through the list of 60 URLs and grabbed the sector and names of all the board members for each company profile.

One issue: I only had the light version of OutWit Hub, which caps each run at 100 rows of data. I had to run the scraper eight times to get all the data; in future, I'll have to invest in the paid-for version.


I then brought the data together in one spreadsheet, using Excel's VLOOKUP function to join the tables on a shared key: the company code. Next, I needed to work out whether each board member was male or female - with the help of the ONS.

Using VLOOKUP again, I could compare the names against the ONS's data on boys' and girls' names. This looked up each name in my spreadsheet against the government's list, giving me the gender of the majority of the board members.
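The lookup logic can be sketched in Python with a dictionary standing in for the ONS list (the names and codes below are invented for illustration; the real list is far larger):

```python
# Tiny stand-in for the ONS names list: first name -> gender.
ons_names = {"Michael": "M", "Jane": "F", "Sarah": "F"}

board = ["Michael Rake", "Jane Doe", "Alex Smith"]

# Like VLOOKUP: take each person's first name and look it up.
# Names the list doesn't cover are flagged for checking by hand.
genders = {name: ons_names.get(name.split()[0], "unknown")
           for name in board}
```

As with the spreadsheet version, ambiguous or missing names fall through as "unknown" and have to be resolved manually.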

[Screenshot: Excel VLOOKUP data]

Once I'd sorted out the gender column, the spreadsheet was complete and ready to analyse.

This was the end product: a spreadsheet with all the relevant data I needed - 673 rows scraped from 60 web pages. With a little cleaning, I could count the total number of men and women on top tech companies' executive boards, and find out which companies were the most representative.
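The final tally is a simple count over the gender column. A minimal sketch, with an invented five-row column standing in for the real 673 rows:

```python
from collections import Counter

gender_column = ["M", "M", "F", "M", "F"]  # illustrative data, not the real column

counts = Counter(gender_column)
share_women = counts["F"] / len(gender_column)
```

The same numbers come straight out of a pivot table or COUNTIF in Excel; the point is just that once the gender column exists, the headline figure is one aggregation away.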

[Screenshot: the finished Excel spreadsheet]

I found out that fewer than one in five tech company bosses were women, which led to this post on Project Ada.
