Monday, February 24, 2014

Scraping Citation Counts from a Google Scholar Profile

Google Scholar has no published API.  This means that if you want to extract information from someone's public Google Scholar profile, you have to write a script to scrape the web page.

I was recently asked to come up with a way to extract citation counts by year from a Google Scholar profile. This information is visible in a chart on the profile page, and it wasn't too hard to write a Python script to extract it. You can see my script at

Note that the citation counts are actually plotted as percentages of the citation count in the highest year, and these percentages are rounded to one digit after the decimal point. For example, if an author's maximum citation year had 70 citations and a particular year had only 3 citations, then that year is encoded in the table as 4.3%. The script converts back to counts by rounding to the nearest integer, but the result can easily be off when the citation count in the highest year is large: once that count exceeds about 1,000, the rounding of the percentage alone can shift the recovered count by a citation or more.
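To make the arithmetic concrete, here is a minimal sketch of the conversion from percentages back to counts. It is not the script mentioned above: the function names are made up for illustration, and it assumes the citation count of the peak year is already known from somewhere else.

```python
def percentages_to_counts(percentages, peak_count):
    """Convert percent-of-peak values back to integer citation counts.

    peak_count (the number of citations in the highest year) is assumed
    to be known; the percentages alone do not reveal it.
    """
    return [round(p / 100 * peak_count) for p in percentages]


def max_rounding_error(peak_count, decimals=1):
    """Worst-case error, in citations, from rounding the percentage."""
    half_step = 0.5 * 10 ** (-decimals)  # 0.05 percentage points for one decimal
    return half_step / 100 * peak_count


# The example from the text: 3 citations out of a peak of 70 is shown as 4.3%.
print(percentages_to_counts([4.3, 100.0], 70))   # [3, 70]

# With a large peak the rounding starts to matter: 1234 citations out of a
# peak of 3000 would be shown as 41.1%, which converts back to 1233.
print(percentages_to_counts([41.1], 3000))       # [1233]
print(max_rounding_error(3000))                  # about 1.5 citations
```

The worst-case error is 0.05% of the peak count, so the recovered counts are only guaranteed exact while the peak stays below roughly 1,000 citations.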

Similar plots are produced by Thomson Reuters' "MyResearcherID" profile, but in that case the chart is produced by an opaque process: you can get at the resulting GIF, but in order to extract counts by year you'd have to run the image through a "graph digitizer" tool.

It's a sad commentary on the state of the World Wide Web that information like this is presented in web pages in formats that are so hard to decode.