Building a scientific article abstract database for free

The amount of knowledge we produce as a species grows every year. With so much content, it is becoming impossible to absorb and digest it all, which limits the inferences we can draw compared with what would be possible if we could actually hold everything we have produced in our heads. Systematically studying the scientific literature with machine learning tools is going to be essential if we want to exploit our current scientific findings to their full potential.

import urllib.parse
from time import sleep

import numpy as np
from bs4 import BeautifulSoup
from requests import get


class GoogleScholarCrawler:
    """
    @brief This class searches Google Scholar (http://scholar.google.com)
    """
    def __init__(self):
        """
        @brief Empty constructor.
        """
        pass

    def crawl(self, terms, limit):
        """
        @brief Search Google Scholar for the specified terms and return up
        to `limit` result links.
        """

        headers = {
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'referer': 'https://google.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.9',
            'Pragma': 'no-cache',
        }

        i = 0
        results = []

        while i < limit:

            params = urllib.parse.urlencode({'q': "+".join(terms)})

            # "start" is the pagination offset; Scholar serves 10 results per page
            url = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C13&start={}&{}&num=10".format(i, params)
            print(url)

            resp = get(url, headers=headers, allow_redirects=False)
            status = resp.status_code
            print(status)

            if status == 200:
                soup = BeautifulSoup(resp.content, "html.parser")
                # each result title sits inside an element with class "gs_rt"
                for record in soup.findAll(attrs={'class': 'gs_rt'}):
                    try:
                        link = record.findAll('a', href=True)[0]["href"]
                        results.append(link)
                    except IndexError:
                        # some records (e.g. citation-only entries) carry no link
                        continue
                # random pause between pages to avoid hammering the server
                sleep(np.random.randint(1, 10))
            else:
                print("ERROR:")
                print(status)
                return results

            i += 10

        return results

#example code
crawler = GoogleScholarCrawler()
results = crawler.crawl(['asteroid mining'], 10)
for r in results:
    print(r)

Mining the scientific literature for insights is not a new concept. However, we now have the computational learning tools needed to start extracting really meaningful insights. Scientific articles like this recent one show how we can use papers to make predictions that go beyond what scientists have imagined up until now. Using only abstracts – not even the full papers – these scientists managed to make meaningful forecasts about potentially new thermoelectric materials. Even more impressive is the fact that all these inferences were made on a purely linguistic basis, without ever inserting actual knowledge about physics or chemistry into the machine learning process.
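
To give a flavour of what that "purely linguistic" approach looks like, here is a minimal sketch, assuming we already have a list of abstracts in memory: it trains word embeddings with gensim and asks which words sit closest to a query term. The placeholder abstracts, the query word and the gensim (version 4+) parameters are illustrative assumptions, not the pipeline used in the paper.

# Toy illustration: learn word embeddings purely from abstract text
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

abstracts = [
    "Thermoelectric materials convert heat into electricity ...",
    "We report a new chalcogenide with a high figure of merit ...",
]  # placeholder abstracts; a real database would hold thousands

# tokenize and lowercase each abstract
sentences = [simple_preprocess(a) for a in abstracts]

# train a small word2vec model on the corpus (parameters are illustrative)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20)

# with a real corpus, the nearest neighbours of a query term like
# "thermoelectric" would start to suggest candidate materials
print(model.wv.most_similar("thermoelectric", topn=5))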

There are many problems where we could use a tool like this, but even constructing a database to perform these searches is incredibly difficult and restrictive. Journal and indexing websites that hold this information restrict its usage to paid subscribers, sometimes even charging extra for API access that would facilitate the above process. This limits any effort of this kind to those who have the access and resources to carry it out, leaving a ton of potential advances out of the question, simply because only a few privileged people have access to the data.

To try to solve this problem, I have thought about the tools we might use to create an actual database of scientific abstracts for a given topic, in a manner that is completely free. The first free tool that came to mind was Google Scholar, which allows us to perform rather extensive searches of the scientific literature and obtain the URLs of the relevant papers. Taking some inspiration from a couple of older Google Scholar Python scripts available online, I created the new code you can see above.

[Cartoon: the discovery of the inverse impact law, exhibited in the pre-modern hall of the science history museum]

This very simple crawler queries the Google Scholar website for the specified keywords and returns the number of results requested by the user. The code above also includes an example that returns the first 10 results for “asteroid mining”. In this first step the crawler returns only the links to the result pages; the next step will be to build a second crawler that uses these URLs to extract the abstract information, at least from the most popular journal websites. With the URL information we can filter by publisher or indexer, which should make that task a lot easier (a rough sketch of this idea follows below).
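
As a very rough sketch of that second step, assuming (purely for illustration) that the abstract can be found in a common meta tag such as citation_abstract or description, something like the following could filter the collected URLs by publisher domain and attempt to pull out the abstract text. The domain list and tag names are assumptions; real publisher pages differ and would each need their own parsing rules.

# Second-stage sketch: keep known publishers and try to pull the abstract
from urllib.parse import urlparse

from bs4 import BeautifulSoup
from requests import get

KNOWN_PUBLISHERS = ("sciencedirect.com", "springer.com", "nature.com")  # example domains

def filter_by_publisher(urls, domains=KNOWN_PUBLISHERS):
    """Keep only the result URLs hosted by publishers we know how to parse."""
    return [u for u in urls if urlparse(u).netloc.endswith(domains)]

def fetch_abstract(url):
    """Best-effort abstract extraction from common meta tags (assumed layout)."""
    resp = get(url, headers={'user-agent': 'Mozilla/5.0'}, timeout=10)
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.content, "html.parser")
    for name in ("citation_abstract", "description", "dc.description"):
        tag = soup.find("meta", attrs={"name": name})
        if tag and tag.get("content"):
            return tag["content"]
    return None

# usage with the results from the crawler above
# for url in filter_by_publisher(results):
#     print(fetch_abstract(url))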

The above crawler might also be too simple for practical use, since Google may easily ban your requests given their rather obvious crawling pattern; the use of proxies and other tactics (sketched below) might be necessary to build a database with 10K+ results for a given topic. However, it is a first step, and hopefully it will allow me to start constructing databases to tackle some scientific topics that I find very interesting.
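
As an illustration of those tactics, the sketch below rotates user agents and proxies between requests and pauses randomly between calls. The proxy addresses and the second user-agent string are placeholders I made up; you would have to supply real ones, and this is only one of several possible approaches.

# Sketch: rotate user agents and (placeholder) proxies between requests
from itertools import cycle
from time import sleep

import numpy as np
from requests import get

USER_AGENTS = cycle([
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
])

PROXIES = cycle([
    {'https': 'http://USER:PASS@proxy1.example.com:8080'},  # hypothetical proxy
    {'https': 'http://USER:PASS@proxy2.example.com:8080'},  # hypothetical proxy
])

def polite_get(url):
    """Fetch a URL with a rotated user agent, a rotated proxy and a random pause."""
    resp = get(url,
               headers={'user-agent': next(USER_AGENTS)},
               proxies=next(PROXIES),
               allow_redirects=False,
               timeout=10)
    sleep(np.random.randint(5, 15))  # spread requests out to look less like a bot
    return resp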
