Navigating published literature with full control

LLM

Tracking a field in real time, cheaply

Author

Akshay Balsubramani

Literature searches and how to anchor them

The success of the “deep research” product offerings from the large LLM providers has made it clear how easy it is to perform bespoke tasks that involve internet research. Such tools are very useful, and the self-consistency achieved by grounding them in real-world context can significantly improve the quality and interpretability of the research results.

Deep research work is underdetermined, with many admissible answers. There are many coherent narratives that can be synthesized after scouring the internet for grounded information from a brief prompt. Therefore, running the same “deep research” search twice can lead to results which are similar in the basic gist, but substantially different in specific details on a sentence-by-sentence level.

That’s fine when mapping big trends in the literature, which loom large no matter which specific papers ground the results. But it can be problematic to ensure that individual papers are included, unless the papers are already known and referenced in the prompt. And this leads to some uncertainty and lack of coverage in the work products of deep research.

Tools for traversing collaboration graphs

There are typically two forms of uncertainty about the paper coverage of LLM-grounded tools.

Not enough breadth: Comprehensive longitudinal coverage of literature across journals and preprint servers in a field is not always easy. And interdisciplinary work straddles many different fields and preprints.
Not current enough: Preprint servers like arXiv have become a de facto standard for sharing research, and they are updated within days of manuscript submission. So this is the pace that the research literature moves at, and other research communication like peer-reviewed publications lags in comparison.

Google Scholar (GS) is a comprehensive interdisciplinary search engine for the research literature, including paywalled content and preprint servers alike. It is a common one-stop shop for literature search due to its broad coverage (Martı́n-Martı́n et al. 2018) and wide accessibility. It is also unique among major indices in the scope of its automated indexing, normally registering preprints within a few days of release for most scholarly fields. Aside from social communication channels like media, it remains a primary way that researchers update their knowledge of the literature by traversing it.

There are actually two distinct graphs that can be accessed with Google Scholar. The citation graph is a directed network of papers, each pointing to other papers that it cites. The author collaboration graph is an undirected network of authors, each pair connected if they have co-authored a paper together. The graphs are interleaved – edges in the author graph correspond to papers – so it’s crucial to understand both of them together. GS makes all this information easily searchable with time in a dynamically updated way.

These graphs aren’t domain-specific – the domains may be largely separate components, but are connected by relatively thin but significant strands of interdisciplinary work, on both the author and citation levels. So ranging over these graphs is essential to explore the literature for trends and other meta-searches, for even interdisciplinary and evolving fields. I have also found this to be indispensable in LLM workflows.

Here I demonstrate a few tools that are commonly useful in traversing the above citation and collaboration graphs across domains, in a fully exposed modular system. They build lightweight, customizable, comprehensive neighborhoods of the graphs.

Uses of these tools

These tools have been very useful in my work experience, serving several purposes:

Citation tracking and reference management: This is the original purpose of services like Google Scholar. It serves researchers who are not looking to explore but rather to produce metadata for some already-existing work.
Supporting RAG pipelines involving the content itself: Especially for preprint-heavy areas of literature, the content of the paper is often available for each paper entry. Literature reviews, reports, and related work sections can be drafted on the basis of this content; we discuss this a bit more in a later section below.
Working with the tacit knowledge of active scientists: Particularly for quickly evolving areas, such tacit knowledge is often encoded in recent papers, which through their joint citation network define the recent center of gravity of any subfield. Further tacit knowledge can be found in author collaboration relationships, especially when fed with the latest knowledge of preprints as GS is. Academic publication does not keep up with the accumulation of such knowledge. And although citation counts are often a lagging indicator of the pulse of a field, they provide a unique prior that is predictive of the future success of recent preprints, the current pulse of the field.

Outline

This post describes an automated system to track papers, authors, and their interrelations, using Google Scholar for literature search. The approach generalizes to any scholarly field, so we go through the implementation in code at first, with general LLM-based examples for illustration. These highlight the authoritative recall of Google Scholar, updated in nearly real time. Then, we fix a couple of detailed examples that we flesh out for the first time in developing this tool, within fast-exploding areas of biochem research that are particularly interesting to track.

Some of this can in principle be done by tools or bibliography managers, both open- and closed-source. These rely on a variety of literature databases and searches, and wrap them to abstract away complexity and capture value. These are commendable ways of augmenting GS, in many cases finding more works than GS can (Gusenbauer and Haddaway 2020) in specific domains (Martı́n-Martı́n et al. 2018). They do not allow for the level of customization required for all the above uses, and as closed efforts they cannot be freely recommended to the unresourced user like the tools we explore here. The approach in this post allows us to control and economize API calls, and keep a persistent querying system reliably up with minimal effort.

Basic tools

The first step to implement is to obtain a list of papers, given a topic query or some text from a manuscript. Here we will assume that the user has provided a query string, but this could also be extracted from a manuscript using keyword extraction or other methods. Google Scholar renders pages as static HTML, and therefore can be parsed using standard requests functionality, without using an automated headless browser like Selenium.

Our first focus here is to leverage every last drop of information available in the search results page, which is a static HTML page that can be parsed using standard Python libraries. The forward links in the page, contained in the HTML tags defining it, contain a wealth of information about a neighborhood in the citation graph. So we can relatively cheaply and quickly sketch out areas of the citation network that are relevant to the query.

We have two goals here:

Build a local neighborhood of the citation graph that is as comprehensive as possible.
Do this efficiently with a light footprint, minimizing API calls to Google Scholar.

With Google Scholar, this can be done very cheaply because:

It involves static HTML pages, not JavaScript.
The way the API is structured makes building the citation graph an algorithmically efficient task – it can be done over \(n\) papers with an order of magnitude fewer calls than \(n\).

Our implementation scrapes the raw HTML; what is required here is fairly minimal and may change with updates to the way Google Scholar formats its HTML pages.

Using the structure of the results page

LXML is the standard way to do these things and is a minimal robust HTML/XML processor, which we use here. BeautifulSoup is also a more feature-rich option that runs on top of LXML; it’s convenient in general, so the code is annotated where BeautifulSoup could be alternatively used.

First we get the contents of the page using requests, with optional proxy parameters.

Setup notes: To avoid running into rate limits, a proxy service can be used; details depend on the user, and your mileage may vary. Alternatives can invoke proxies during the programmatic search.¹

CODE

import time, requests, re, json, os
import numpy as np
from lxml import html



def get_scraper_params_url(url, proxy_config=None):
    if proxy_config is None:
        return {}, url
    SCRAPER_API_URL = ""  # Address for API server of the scraper proxy
    SCRAPER_API_KEY = ""  # Get from SCRAPER_API_URL
    scraper_params = {
        'api_key': SCRAPER_API_KEY,
        'url': url
    }
    return scraper_params, SCRAPER_API_URL


def get_raw_page(
    url, pause_sec=0
):
    time.sleep(pause_sec)   # to avoid hammering the server with requests
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.post(url, headers=headers)
    return html.fromstring(response.content)    # or return BeautifulSoup(response.text, 'html.parser')

CODE

def get_page(url, pause_sec=0):
    """Gets page content using a scraper API and returns lxml tree"""
    time.sleep(pause_sec)
    try:
        scraper_params, scraper_api_url = get_scraper_params_url(url)
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
        }
        response = requests.post(scraper_api_url, headers=headers, params=scraper_params, timeout=60)
        response.raise_for_status()
        # Check if we hit rate limits
        if response.status_code == 429:
            time_limit = 20
            print(f"Rate limit hit, waiting {time_limit} seconds...")
            time.sleep(time_limit)
            return get_page(url, pause_sec=5)  # Retry with longer pause
        return html.fromstring(response.content)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        # Implement exponential backoff for retries
        if hasattr(e, 'response') and e.response.status_code == 500:
            print("Server error, retrying in 10 seconds...")
            return get_page(url, pause_sec=5)
        raise


def get_page_with_retry(url, max_retries=3, pause_sec=0):
    """Wrapper with retry logic for resilience"""
    for attempt in range(max_retries):
        try:
            return get_page(url, pause_sec=pause_sec)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = (attempt + 1) * 3
            print(f"Retry {attempt + 1}/{max_retries} after {wait_time}s...")
            time.sleep(wait_time)

Google Scholar consists of a couple of different types of pages, each using static HTML. The primary search results page for any query lists each paper entry sequentially with hyperlinks to related information, like this:

A paper as displayed in Google Scholar search results.

We can traverse the known structure of every search results page to usefully mine its contents. Each GET request displays a page consisting of about 20 papers as search results, each displayed as above. It’s important to glean all the information possible from this page, using the website’s markup schema.

This is really the heart of our implementation. Each entry in the Scholar page is represented by a dictionary containing the paper’s metadata, which can all be read from the HTML of the page.

CODE

BASE_URL = "https://scholar.google.com"


def base_scrape_search_results(url, pause_sec=0):
    page_contents = get_page_with_retry(url, pause_sec=pause_sec)
    results = page_contents.xpath('//div[contains(@class, "gs_r gs_or gs_scl")]')
    papers = []
    for r in results:
        gs_ri_entries = r.xpath('.//div[contains(@class, "gs_ri")]')
        file_tag = r.xpath('.//div[contains(@class, "gs_ggs")]')
        for entry in gs_ri_entries:
            this_paper = {
                'title': None, 'url': None, 'file_url': None,
                'citations': None, 'citedby_url': None, 'authors': None,
                'year': None, 'pubinfo': None, 'snippet': None
            }

            title_link = entry.xpath('.//h3[contains(@class, "gs_rt")]/a')
            if title_link:
                this_paper['title'] = title_link[0].text_content().strip()
                this_paper['url'] = title_link[0].get('href')

            # Extract citations
            cited_links = entry.xpath('.//a[contains(text(), "Cited by")]')
            if cited_links:
                cited_text = cited_links[0].text_content()
                citations_match = re.search(r'Cited by (\d+)', cited_text)
                if citations_match:
                    this_paper['citations'] = int(citations_match.group(1))
                    this_paper['citedby_url'] = BASE_URL + cited_links[0].get('href')
            else:
                this_paper['citations'] = 0
            # Extract bibliographic ID
            try:
                related_links = entry.xpath('.//a[contains(text(), "Related articles")]')
                if not related_links:
                    continue
                href = related_links[0].get('href')
                if ':' not in href:
                    continue
                cite_id = href.split(':')[1]
                bib_url = f"{BASE_URL}/scholar?hl=en&q=info:{cite_id}:scholar.google.com/&output=cite&scirp=0"
                this_paper['bib_url'] = bib_url
            except Exception as e:
                print(f"Error getting bibliographic URL for entry: {e}")
                continue

            if file_tag:
                file_link = file_tag[0].xpath('.//a')
                if file_link:
                    this_paper['file_url'] = file_link[0].get('href')

            pub_info_tag = entry.xpath('.//div[contains(@class, "gs_a")]')
            if not pub_info_tag:
                continue
            pub_text = pub_info_tag[0].text_content().strip()
            this_paper['pubinfo'] = pub_text

            # Parse authors and year
            parts = pub_text.replace('\u00A0', ' ').split(' - ')
            if parts:
                auth_text = parts[0].strip()
                authors = [x.strip('… ') for x in auth_text.split(',') if x.strip('… ')]
                if len(parts) > 1:
                    year_match = re.search(r'\b(19|20)\d{2}\b', parts[1])
                    if year_match:
                        this_paper['year'] = year_match.group(0)
                authID_dict = {a: None for a in authors}
                author_links = pub_info_tag[0].xpath('.//a')
                for link in author_links:
                    author_name = link.text_content().strip()
                    href = link.get('href')
                    if href and 'user=' in href:
                        user_id = href.split('user=')[1].split('&')[0]
                        if author_name in authID_dict:
                            authID_dict[author_name] = user_id
                this_paper['authors'] = authID_dict

            # Extract snippet
            snippet_tag = entry.xpath('.//div[contains(@class, "gs_rs")]')
            if snippet_tag:
                this_paper['snippet'] = snippet_tag[0].text_content().strip()
        papers.append(this_paper)
    return papers

A basic use of this functionality is to list all the papers that we would otherwise get by typing this search query into Google Scholar’s web interface.

CODE

u = 'https://scholar.google.com/scholar?start=0&q=%22graph+foundation+models%22&hl=en&num=10&as_sdt=0,5'

CODE

w = base_scrape_search_results(u)
w

[{'title': 'Graph foundation models: Concepts, opportunities and challenges',
  'url': 'https://ieeexplore.ieee.org/abstract/document/10915556/',
  'file_url': 'https://smufang.github.io/paper/TPAMI25_GFM.pdf',
  'citations': 15,
  'citedby_url': 'https://scholar.google.com/scholar?cites=1438884396902639289&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'J Liu': 'ufius3MAAAAJ',
   'C Yang': 'OlLjVUcAAAAJ',
   'Z Lu': 'YHjKBWQAAAAJ',
   'J Chen': None,
   'Y Li': 'kdd-ohUAAAAJ'},
  'year': '2025',
  'pubinfo': 'J Liu, C Yang, Z Lu, J Chen, Y Li…\xa0- …\xa0on Pattern Analysis\xa0…, 2025 - ieeexplore.ieee.org',
  'snippet': '… To this end, this article introduces the concept of Graph Foundation Models (GFMs), and \noffers an exhaustive explanation of their key characteristics and underlying technologies. We …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:ueY7Rrzx9xMJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Towards graph foundation models: A survey and beyond',
  'url': 'https://arxiv.org/abs/2310.11829',
  'file_url': 'https://arxiv.org/pdf/2310.11829',
  'citations': 131,
  'citedby_url': 'https://scholar.google.com/scholar?cites=9531235727869616567&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'J Liu': 'ufius3MAAAAJ',
   'C Yang': 'OlLjVUcAAAAJ',
   'Z Lu': 'YHjKBWQAAAAJ',
   'J Chen': 'IUXgyO8AAAAJ',
   'Y Li': 'kdd-ohUAAAAJ',
   'M Zhang': None},
  'year': '2023',
  'pubinfo': 'J Liu, C Yang, Z Lu, J Chen, Y Li, M Zhang…\xa0- arXiv preprint arXiv\xa0…, 2023 - arxiv.org',
  'snippet': '… introduces the concept of Graph Foundation Models (GFMs), … graph foundation models, this \npaper surveys some related … first survey towards graph foundation models. Existing surveys …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:tyEuVmXARYQJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Position: Graph foundation models are already here',
  'url': 'https://openreview.net/forum?id=Edz0QXKKAo',
  'file_url': 'https://openreview.net/pdf?id=Edz0QXKKAo',
  'citations': 63,
  'citedby_url': 'https://scholar.google.com/scholar?cites=6351192675309565134&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'H Mao': None,
   'Z Chen': '6hUny38AAAAJ',
   'W Tang': 'KpjOK18AAAAJ',
   'J Zhao': 'Pb_UoYwAAAAJ',
   'Y Ma': 'wf9TTOIAAAAJ'},
  'year': '2024',
  'pubinfo': 'H Mao, Z Chen, W Tang, J Zhao, Y Ma…\xa0- …\xa0on Machine Learning, 2024 - openreview.net',
  'snippet': '… In this paper, we provide principle guidance for the development of graph foundation models, \n… at developing next-generation graph foundation models with better versatility and fairness. …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:zrTjjxb4I1gJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Opengraph: Towards open graph foundation models',
  'url': 'https://arxiv.org/abs/2403.01121',
  'file_url': 'https://arxiv.org/pdf/2403.01121',
  'citations': 51,
  'citedby_url': 'https://scholar.google.com/scholar?cites=14599675307877519589&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'L Xia': 'fDDjoUEAAAAJ',
   'B Kao': 'TwSParMAAAAJ',
   'C Huang': 'Zkv9FqwAAAAJ'},
  'year': '2024',
  'pubinfo': 'L Xia, B Kao, C Huang\xa0- arXiv preprint arXiv:2403.01121, 2024 - arxiv.org',
  'snippet': '… Our study serves as an initial exploration of graph foundation models, focusing on distilling \nthe generalization capabilities from LLMs without relying on textual features. However, it is …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:5dAUEld3nMoJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Graph foundation models',
  'url': 'https://openreview.net/forum?id=WRH2HngHhb',
  'file_url': None,
  'citations': 41,
  'citedby_url': 'https://scholar.google.com/scholar?cites=5419560511653876122&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'H Mao': None,
   'Z Chen': '6hUny38AAAAJ',
   'W Tang': 'KpjOK18AAAAJ',
   'J Zhao': 'Pb_UoYwAAAAJ',
   'Y Ma': 'wf9TTOIAAAAJ',
   'T Zhao': '05cRc-MAAAAJ',
   'N Shah': 'Qut69OgAAAAJ'},
  'year': '2024',
  'pubinfo': 'H Mao, Z Chen, W Tang, J Zhao, Y Ma, T Zhao, N Shah…\xa0- CoRR, 2024 - openreview.net',
  'snippet': 'Graph Foundation Models (GFMs) are emerging as a significant research topic in the graph \ndomain, aiming to develop graph models trained on extensive and diverse data to enhance …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:mhHdP5IlNksJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Graph Foundation Models: A Comprehensive Survey',
  'url': 'https://arxiv.org/abs/2505.15116',
  'file_url': 'https://arxiv.org/pdf/2505.15116',
  'citations': 2,
  'citedby_url': 'https://scholar.google.com/scholar?cites=17289336296020729368&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'Z Wang': '-qXxOv0AAAAJ',
   'Z Liu': 'NLA-nSUAAAAJ',
   'T Ma': 'B_M9WPsAAAAJ',
   'J Li': 'uxxcR34AAAAJ',
   'Z Zhang': 'qJURp_AAAAAJ',
   'X Fu': 'BSnB9GQAAAAJ',
   'Y Li': 'Z4VYGCYAAAAJ'},
  'year': '2025',
  'pubinfo': 'Z Wang, Z Liu, T Ma, J Li, Z Zhang, X Fu, Y Li…\xa0- arXiv preprint arXiv\xa0…, 2025 - arxiv.org',
  'snippet': '… a holistic and systematic review of graph foundation models. We begin by outlining the … \nWe identify and categorize the fundamental challenges in building graph foundation models …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:GF6VJsIP8O8J:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Graph Foundation Models for Recommendation: A Comprehensive Survey',
  'url': 'https://arxiv.org/abs/2502.08346',
  'file_url': 'https://arxiv.org/pdf/2502.08346',
  'citations': 5,
  'citedby_url': 'https://scholar.google.com/scholar?cites=10727456275026868021&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'B Wu': 'qCf-504AAAAJ',
   'Y Wang': 'Hj2S-bIAAAAJ',
   'Y Zeng': None,
   'J Liu': 'ufius3MAAAAJ',
   'J Zhao': None,
   'C Yang': 'OlLjVUcAAAAJ'},
  'year': '2025',
  'pubinfo': 'B Wu, Y Wang, Y Zeng, J Liu, J Zhao, C Yang…\xa0- arXiv preprint arXiv\xa0…, 2025 - arxiv.org',
  'snippet': '… Recent research has focused on graph foundation models (GFMs), which integrate the \nstrengths of GNNs and LLMs to model complex RS problems more efficiently by leveraging the …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:NfeCSKWU35QJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Graph foundation model',
  'url': 'https://link.springer.com/article/10.1007/s11704-024-40046-0',
  'file_url': None,
  'citations': 4,
  'citedby_url': 'https://scholar.google.com/scholar?cites=5168487345685845860&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'C Shi': 'tUq_v90AAAAJ',
   'J Chen': 'IUXgyO8AAAAJ',
   'J Liu': 'ufius3MAAAAJ',
   'C Yang': None},
  'year': '2024',
  'pubinfo': 'C Shi, J Chen, J Liu, C Yang\xa0- Frontiers of Computer Science, 2024 - Springer',
  'snippet': '… , the impact of graph foundation models on graph tasks … graph foundation models with \nLLMs could enhance performance in open-ended tasks. Notably, graph foundation models show …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:ZNe0adsnukcJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Text-space graph foundation models: Comprehensive benchmarks and new insights',
  'url': 'https://proceedings.neurips.cc/paper_files/paper/2024/hash/0e0b39c69663e9c073739adf547ed778-Abstract-Datasets_and_Benchmarks_Track.html',
  'file_url': 'https://proceedings.neurips.cc/paper_files/paper/2024/file/0e0b39c69663e9c073739adf547ed778-Paper-Datasets_and_Benchmarks_Track.pdf',
  'citations': 29,
  'citedby_url': 'https://scholar.google.com/scholar?cites=3991282203754037766&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'Z Chen': '6hUny38AAAAJ',
   'H Mao': None,
   'J Liu': 'OgTXF4cAAAAJ',
   'Y Song': '17z7IcgAAAAJ',
   'B Li': None},
  'year': '2024',
  'pubinfo': 'Z Chen, H Mao, J Liu, Y Song, B Li…\xa0- Advances in\xa0…, 2024 - proceedings.neurips.cc',
  'snippet': '… Graph Foundation Models. GFMs extend the traditional GML setting across different \ndatasets and tasks. Despite the more diverse settings, most GFMs follow a unified paradigm: …',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:BgIZgQTiYzcJ:scholar.google.com/&output=cite&scirp=0'},
 {'title': 'Graphclip: Enhancing transferability in graph foundation models for text-attributed graphs',
  'url': 'https://dl.acm.org/doi/abs/10.1145/3696410.3714801',
  'file_url': 'https://arxiv.org/pdf/2410.10329',
  'citations': 18,
  'citedby_url': 'https://scholar.google.com/scholar?cites=16778566805568364536&as_sdt=2005&sciodt=0,5&hl=en',
  'authors': {'Y Zhu': '60HqQsQAAAAJ',
   'H Shi': 'JKwP43sAAAAJ',
   'X Wang': 'r8k-ywkAAAAJ',
   'Y Liu': 'qYQHl4sAAAAJ',
   'Y Wang': 'TIcYz4gAAAAJ',
   'B Peng': 'IPIZUVAAAAAJ'},
  'year': '2025',
  'pubinfo': 'Y Zhu, H Shi, X Wang, Y Liu, Y Wang, B Peng…\xa0- Proceedings of the\xa0…, 2025 - dl.acm.org',
  'snippet': '… the development of graph foundation models with strong … challenges by learning graph \nfoundation models with strong … learning, to enhance graph foundation models with strong cross-…',
  'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:-B8Q7pxx2egJ:scholar.google.com/&output=cite&scirp=0'}]

Answering text queries

Using these basic tools, we can start more advanced scraping routines that navigate the clusters of connected pages comprising the database. A natural place to start is by using the core search functionality to turn a text search query into a list of papers.

In general, such searches could have thousands of search results or more. Even if this number is far too many to actually process, ascertaining it from the search results page is useful to guide the search. This can be written as a useful subroutine get_results_count.

The main function here is get_top_papers_for_query, which takes a query and returns a list of papers. It uses the basic_search_url function to get the URL of the search results page, and then uses the base_scrape_search_results and get_results_count functions to scrape the page. This continues for each successive page of search results.

Sometimes, there are more search results than is desirable to process. In such cases, citation counts are a useful component of stopping heuristics that determine when to terminate. Below we stop when the papers being shown are no longer drawing sufficient attention from the community, i.e. when the median citations on the page are below a certain threshold.

CODE

def basic_search_url(query, start_num=0, results_per_page=100):
    query_str = '+'.join(query.split())
    url = f"{BASE_URL}/scholar?start={start_num}&q={query_str}&hl=en&num={results_per_page}&as_sdt=0,5"
    return url


def get_results_count(query, pause_sec=0):
    url = basic_search_url(query)
    tree = get_page_with_retry(url, pause_sec=pause_sec)
    # pick the `.gs_ab_mdw` node that contains “[About] … results”
    results_divs = tree.xpath('//div[@class="gs_ab_mdw"]')
    for div in results_divs:
        text = div.text_content()
        if 'results' in text:
            match = re.search(r'(?:About\s+)?([\d,\.]+)\s+results', text, re.IGNORECASE)
            if match:
                return int(match.group(1).replace(',', ''))
    return 0


def get_top_papers_for_query(query, max_papers=np.inf, min_citations=1, pause_sec=0, verbose=False):
    start_num = 0
    num_queries = 0
    results_count = get_results_count(query, pause_sec=pause_sec)
    num_queries += 1
    papers = []
    while len(papers) < np.min([max_papers, results_count]):
        url = basic_search_url(query, start_num=start_num)
        new_papers = base_scrape_search_results(url, pause_sec=pause_sec)
        num_queries += 1
        if not new_papers:
            print(f"No new papers at {url}")
            break
        start_num += len(new_papers)
        page_count = num_queries - 1
        papers.extend(new_papers)
        citations = [p['citations'] for p in new_papers if p['citations'] is not None]
        if citations:
            medcit = np.median(citations)
            if verbose:
                print(f"Median citations on page {page_count} of results: {medcit}")
            if medcit < min_citations:
                print(f"Stopping search, median citations {medcit} is below threshold {min_citations}.")
                break
    return papers, results_count, num_queries

Running this in practice

This function get_top_papers_for_query makes an API call to Google Scholar every 20 search results – Scholar will only display this many results on a single page – and analyzes the resulting HTML. So it takes just a few dozen calls to get hundreds of paper results, the scope of a single detailed literature review.

There are many examples of this in post-LLM AI, which is seeing an explosion of activity. A simple example to start with is the emerging area of “graph foundation models”; this term is very specific, as it was almost never used before 2024, even where such models existed. The corresponding search query gives a manageable number of results that we can readily parse. By default, we consider papers with at least one citation, which permissively includes even recent preprints to more completely observe the current pulse of the field.

CODE

gfm_query = '"graph foundation models"'

print(
    f"Query: {gfm_query}" + 
    f"\nNumber of search results: {get_results_count(gfm_query)}"
)

Query: "graph foundation models"
Number of search results: 403

CODE

g = get_top_papers_for_query(gfm_query)
print(
    f"{len(g[0])} papers retrieved, of {g[1]} total results" + 
    f"\nNumber of queries: {g[2]}"
)

Stopping search, median citations 0.0 is below threshold 1.
160 papers retrieved, of 403 total results
Number of queries: 9

Note the get_results_count function here, which ascertains the approximate total number of queries. This is a very useful way of knowing when to stop, which takes just one API call even for massive text searches. We can see this with a generic query about LLMs.

CODE

llm_query = 'large language models'

print(
    f"Query: {llm_query}" + 
    f"\nNumber of search results: {get_results_count(llm_query)}"
)

Query: large language models
Number of search results: 5920000

Advanced queries on Google Scholar

Google Scholar also has many more advanced options, such as searching for papers by a specific set of authors with conjunctions and disjunctions of phrases, and within date ranges. These amount to simple URL parameters that can be added to the search query.

CODE

def advanced_search_url(
    query, start_num=0, results_per_page=100, 
    phrase_str='', words_some='', words_none='', 
    scope='any', authors='', pub='', 
    ylo='', yhi=''
):
    """Generates advanced search URL"""
    query_str = '+'.join(query.split())
    url = (f"{BASE_URL}/scholar?start={start_num}&as_q={query_str}&as_epq={phrase_str}"
            f"&as_oq={words_some}&as_eq={words_none}&as_occt={scope}"
            f"&as_sauthors={authors}&as_publication={pub}"
            f"&as_ylo={ylo}&as_yhi={yhi}&btnG=&hl=en&num={results_per_page}&as_sdt=0%2C5")
    return url

The previous code can be modified to add these options by straightforwardly transforming the query URL string to the new URL, replacing the query URL generation function basic_search_url with this one. This leads to a few code changes which are reflected below, rewriting the retrieval functions to use the advanced_search_url function.

CODE

def search_url(
    query, mode='basic', 
    start_num=0, results_per_page=100, 
    phrase_str='', words_some='', words_none='', 
    scope='any', authors='', pub='', 
    ylo='', yhi=''
):
    if mode == 'basic':
        url = basic_search_url(query, start_num=start_num, results_per_page=results_per_page)
    elif mode == 'advanced':
        url = advanced_search_url(
            query, start_num=start_num, results_per_page=results_per_page, 
            phrase_str=phrase_str, words_some=words_some, words_none=words_none, 
            scope=scope, authors=authors, pub=pub, ylo=ylo, yhi=yhi
        )
    else:
        raise ValueError(f"Invalid mode: {mode}")
    return url


def get_results_count(query, pause_sec=0, search_mode='basic', **kwargs):
    url = search_url(query, mode=search_mode, **kwargs)
    # Modify the above line to pass into search URL any more arguments that this function receives. 
    tree = get_page_with_retry(url, pause_sec=pause_sec)
    # pick the `.gs_ab_mdw` node that contains “[About] … results”
    results_divs = tree.xpath('//div[@class="gs_ab_mdw"]')
    for div in results_divs:
        text = div.text_content()
        if 'results' in text:
            match = re.search(r'([\d,]+)\s+results', text)
            if match:
                return int(match.group(1).replace(',', ''))
    return 0


def write_paper_list_as_json(paper_list, fname):
    init_papers = {x['title']: x for x in paper_list}  # deduplicate by title
    init_papers_dict = json.dumps(init_papers, indent=2, ensure_ascii=False)
    with open(fname, 'w', encoding='utf-8') as f:
        f.write(init_papers_dict)


def get_top_papers_for_query(
    query, max_papers=np.inf, min_citations=1, out_file_name=None, verbose=False, 
    pause_sec=0, start_num=0, mode='basic', **kwargs
):
    num_queries = 0
    results_count = get_results_count(query, pause_sec=pause_sec, search_mode=mode, **kwargs)
    num_queries += 1
    papers = []
    while len(papers) < np.min([max_papers, results_count]):
        url = search_url(query, start_num=start_num, mode=mode, **kwargs)
        new_papers = base_scrape_search_results(url, pause_sec=pause_sec)
        num_queries += 1
        if not new_papers:
            print(f"No new papers at {url}")
            break
        start_num += len(new_papers)
        page_count = num_queries - 1
        papers.extend(new_papers)
        citations = [p['citations'] for p in new_papers if p['citations'] is not None]
        if citations:
            medcit = np.median(citations)
            if verbose:
                print(f"Median citations on page {page_count} of results: {medcit}")
            if medcit < min_citations:
                print(f"Stopping search, median citations {medcit} is below threshold {min_citations}.")
                break
    if out_file_name is not None:
        write_paper_list_as_json(papers, out_file_name)
    return papers, results_count, num_queries

We can illustrate this new search functionality using some more complex search queries.

A great example is the burgeoning study of systems of multiple LLM-based agents. This is a topic of dramatically increasing interest, that has arisen to rapid popularity quickly in LLM work.

In this context, the field essentially did not exist before 2023. As such, it makes for a good way to test our advanced search functionality. The following is an example showing the allowed options:

start_num: The starting index of the results to fetch.
results_per_page: The number of results to fetch per page.
phrase_str: A phrase to search for in the title or abstract.
words_some: Results containing any of these words.
words_none: Results not containing any of these words.
scope: The scope of the search ('any' for anywhere in the article, 'title' for title only).
authors: The authors to search for.
pub: The publication (journal, conference, etc.) name to search for.
ylo: Earliest year of search results.
yhi: Latest year of search results.

Here is one way to look for recent papers on this topic that don’t concern economics.

CODE

query_magent = '"large language model" tool multiagent'
multiagent_papers = get_top_papers_for_query(
    query_magent, mode='advanced', 
    out_file_name='../../files/multiagent_papers.json', 
    start_num=0, 
    results_per_page=100, 
    phrase_str='', 
    words_some='', 
    words_none='economics', 
    scope='any', 
    authors='', 
    min_citations=1, 
    pub='', 
    ylo='2024', 
    yhi=''
)
print(
    f"{len(multiagent_papers[0])} papers found, of {multiagent_papers[1]} total results" + 
    f"\nNumber of queries: {multiagent_papers[2]}"
)

Stopping search, median citations 0.5 is below threshold 1.
280 papers found, of 8590 total results
Number of queries: 15

As usual, we start to see the benefits of a clean implementation with scale. Hundreds of papers have been retrieved here, and saved conveniently, for later use in building the citation graph.

CODE

from itertools import islice

with open('../../files/multiagent_papers.json', 'r') as file:
    init_papers_dict = json.load(file)
print(
    f"{len(init_papers_dict)} papers considered as search results. First 5 papers: \n" + 
    '\n'.join([str(x) for x in list(islice(init_papers_dict.items(), 5))])
)

242 papers considered as search results. First 5 papers: 
('Large language model enhanced multi-agent systems for 6G communications', {'title': 'Large language model enhanced multi-agent systems for 6G communications', 'url': 'https://ieeexplore.ieee.org/abstract/document/10638533/', 'file_url': 'https://arxiv.org/pdf/2312.07850', 'citations': 85, 'citedby_url': 'https://scholar.google.com/scholar?cites=12439253059233971415&as_sdt=2005&sciodt=0,5&hl=en&num=20', 'authors': {'F Jiang': 'G-sluOMAAAAJ', 'Y Peng': 'DviEJD4AAAAJ', 'L Dong': None, 'K Wang': '4VwoNj0AAAAJ'}, 'year': '2024', 'pubinfo': 'F Jiang, Y Peng, L Dong, K Wang…\xa0- IEEE Wireless\xa0…, 2024 - ieeexplore.ieee.org', 'snippet': '… The rapid development of the large language model (LLM) presents huge opportunities \nfor … , a multi-agent system with customized communication knowledge and tools for solving …', 'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:1_zUWsAaoawJ:scholar.google.com/&output=cite&scirp=0'})
('ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning', {'title': 'ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning', 'url': 'https://pubs.rsc.org/en/content/articlehtml/2024/dd/d4dd00013g', 'file_url': 'https://pubs.rsc.org/zh-hans/content/articlepdf/2024/dd/d4dd00013g', 'citations': 76, 'citedby_url': 'https://scholar.google.com/scholar?cites=13800110467043355978&as_sdt=2005&sciodt=0,5&hl=en&num=20', 'authors': {'A Ghafarollahi': 'VXdIb40AAAAJ', 'MJ Buehler': 'hWBTSksAAAAJ'}, 'year': '2024', 'pubinfo': 'A Ghafarollahi, MJ Buehler\xa0- Digital Discovery, 2024 - pubs.rsc.org', 'snippet': '… a multi-agent strategy to the protein design problems by introducing ProtAgents, a multi-agent … \nIt is worth mentioning that all the tools implemented in our multi-agent system are fixed, …', 'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:Su1EiVrXg78J:scholar.google.com/&output=cite&scirp=0'})
('MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge', {'title': 'MechAgents: Large language model multi-agent collaborations can solve mechanics problems, generate new data, and integrate knowledge', 'url': 'https://www.sciencedirect.com/science/article/pii/S2352431624000117', 'file_url': 'https://www.sciencedirect.com/science/article/am/pii/S2352431624000117', 'citations': 75, 'citedby_url': 'https://scholar.google.com/scholar?cites=13635405768788113747&as_sdt=2005&sciodt=0,5&hl=en&num=20', 'authors': {'B Ni': 'dg9WKX8AAAAJ', 'MJ Buehler': 'hWBTSksAAAAJ'}, 'year': '2024', 'pubinfo': 'B Ni, MJ Buehler\xa0- Extreme Mechanics Letters, 2024 - Elsevier', 'snippet': 'Solving mechanics problems using numerical methods requires comprehensive intelligent \ncapability of retrieving relevant knowledge and theory, constructing and executing codes, …', 'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:U3ntoE2xOr0J:scholar.google.com/&output=cite&scirp=0'})
('Graphteam: Facilitating large language model-based graph analysis via multi-agent collaboration', {'title': 'Graphteam: Facilitating large language model-based graph analysis via multi-agent collaboration', 'url': 'https://arxiv.org/abs/2410.18032', 'file_url': 'https://arxiv.org/pdf/2410.18032', 'citations': 6, 'citedby_url': 'https://scholar.google.com/scholar?cites=9145904115587407543&as_sdt=2005&sciodt=0,5&hl=en&num=20', 'authors': {'XS Li': 'pHPTHHwAAAAJ', 'Q Chu': 'NjztroAAAAAJ', 'Y Chen': 'Rnsawl0AAAAJ', 'Y Liu': None, 'Z Yu': None}, 'year': '2024', 'pubinfo': 'XS Li, Q Chu, Y Chen, Y Liu, Y Liu, Z Yu…\xa0- arXiv preprint arXiv\xa0…, 2024 - arxiv.org', 'snippet': '… external knowledge or tools for problem solving. By simulating human problem-solving \nstrategies such as analogy and collaboration, we propose a multi-agent system based on LLMs …', 'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:t95eB1bH7H4J:scholar.google.com/&output=cite&scirp=0'})
('Multi-agent large language model frameworks: Unlocking new possibilities for optimizing wastewater treatment operation', {'title': 'Multi-agent large language model frameworks: Unlocking new possibilities for optimizing wastewater treatment operation', 'url': 'https://www.sciencedirect.com/science/article/pii/S0013935125006528', 'file_url': None, 'citations': 2, 'citedby_url': 'https://scholar.google.com/scholar?cites=7293116657822658993&as_sdt=2005&sciodt=0,5&hl=en&num=20', 'authors': {'S Rothfarb': None, 'M Friday': None, 'X Wang': 'eiPLCDEAAAAJ', 'A Zaghi': 'IX1BPogAAAAJ', 'B Li': '3Ec1wOIAAAAJ'}, 'year': '2025', 'pubinfo': 'S Rothfarb, M Friday, X Wang, A Zaghi, B Li\xa0- Environmental Research, 2025 - Elsevier', 'snippet': '… LLM, necessitating a multi-agent framework where specialized agents … This perspective \npaper highlights how multi-agent, tool-… validation, human oversight, and interpretability tools. …', 'bib_url': 'https://scholar.google.com/scholar?hl=en&q=info:sfEAr_VaNmUJ:scholar.google.com/&output=cite&scirp=0'})

We now have a list of hundreds or thousands of papers that are relevant to the query, each of which can be stored along with a page listing the articles that cite it. This information is what the get_top_papers_for_query function outputs.

Authors → papers

In the context of the search, particularly into specific research-oriented material, the user’s intent is often most easily expressible by referring to specific authors or papers that the user has in mind.

Subsets of relevant papers are therefore often closely viewed by individual researchers, by looking at the recent publication history of a particular author. This needs a little further code, to get the relevant author information (identifying papers written by the author).

CODE

def papers_by_author(author_scholar_id, max_papers=20):
    """
    Retrieves papers by a specific author using their Google Scholar ID.
    
    Args:
        author_scholar_id: Google Scholar author ID
        max_papers: Maximum number of papers to retrieve
    
    Returns:
        List of paper dictionaries
    """
    auth_url = f"{BASE_URL}/citations?user={author_scholar_id}&hl=en&cstart=0&pagesize={max_papers}"
    tree = get_page_with_retry(auth_url)
    papers = []
    # Extract paper entries from author's page
    entries = tree.xpath('//tr[@class="gsc_a_tr"]')
    for entry in entries:
        paper = {}
        # Title and link
        title_elem = entry.xpath('.//a[@class="gsc_a_at"]')
        if title_elem:
            paper['title'] = title_elem[0].text_content().strip()
            paper['url'] = BASE_URL + title_elem[0].get('href')
        # Authors and publication info
        authors_elem = entry.xpath('.//div[@class="gs_gray"][1]')
        if authors_elem:
            paper['authors'] = authors_elem[0].text_content().strip()
        pub_elem = entry.xpath('.//div[@class="gs_gray"][2]')
        if pub_elem:
            paper['pubinfo'] = pub_elem[0].text_content().strip()
            # Extract year
            year_match = re.search(r'\b(19|20)\d{2}\b', paper['pubinfo'])
            if year_match:
                paper['year'] = year_match.group(0)
        # Citations
        cite_elem = entry.xpath('.//a[@class="gsc_a_ac gs_ibl"]')
        if cite_elem and cite_elem[0].text_content().strip():
            paper['citations'] = int(cite_elem[0].text_content().strip())
        else:
            paper['citations'] = 0
        papers.append(paper)
    return papers

Building a comprehensive bibliography (BibTeX file)

Now that we have a set of relevant papers in our working set, we need to collect their bibliographic details in order to cite them properly in the Related Work text. Fortunately, Google Scholar stores each of its papers’ citations in BibTeX form. With appropriate knowledge of the structure of the Google Scholar database, this can be retrieved.

(Alternatively, if a DOI is available, we could use CrossRef or Semantic Scholar to get polished metadata (or even directly query CrossRef’s citation API to get BibTeX). )

Below is a code snippet to retrieve the .bib file contents for any paper:

Compiling a BibTeX file

Now it only remains to cycle over the papers in the citation graph, as implied by a text query and/or a query of similar papers. This is essentially a version of what Google Scholar is doing in weighting papers by citation relevance. However, abstracting this functionality out allows us to customize it. The final payoff is a programmatically generated BibTeX file which serves as a .bib reference for any text that uses these papers as grounding.

CODE

def base_scrape_bibtex(url, pause_sec=None):
    """
    Get the Bibtex citation for this paper. 
    This involves constructing the citation ID from the search result page, 
    traversing the citation page, and fetching the actual BibTeX content. 
    """
    tree = get_page_with_retry(url, pause_sec=pause_sec)
    bibtex_citations = []
    entries = tree.xpath('//div[contains(@class, "gs_ri")]')
    for entry in entries:
        try:
            related_links = entry.xpath('.//a[contains(text(), "Related articles")]')
            if not related_links:
                continue
            href = related_links[0].get('href')
            if ':' not in href:
                continue
            cite_id = href.split(':')[1]
            cite_url = f"{BASE_URL}/scholar?hl=en&q=info:{cite_id}:scholar.google.com/&output=cite&scirp=0"
            cite_tree = get_page_with_retry(cite_url, pause_sec=pause_sec)
            bibtex_links = cite_tree.xpath('//div[@id="gs_citi"]//a[contains(text(), "BibTeX")]')
            if bibtex_links:
                bibtex_url = BASE_URL + bibtex_links[0].get('href')
                bibtex_response = requests.get(bibtex_url)
                bibtex_text = bibtex_response.text.strip()
                if bibtex_text and bibtex_text.startswith('@'):
                    bibtex_citations.append(bibtex_text)
        except Exception as e:
            print(f"Error extracting BibTeX for entry: {e}")
            continue
    return bibtex_citations


def get_bibtex_for_paper(paper):
    """
    Gets BibTeX citation for a single paper.
    
    Args:
        paper: Paper dictionary containing title
    
    Returns:
        BibTeX string or None
    """
    if not paper.get('title'):
        return None
    # Search for the exact paper title
    search_query = f'"{paper["title"]}"'
    search_url = basic_search_url(search_query, results_per_page=1)
    bibtex_entries = base_scrape_bibtex(search_url)
    return bibtex_entries[0] if bibtex_entries else None


def get_bibtex_for_query(query, max_papers=100, pause_sec=None):
    bibtex_citations = []
    start_num = 0
    while len(papers) < max_papers:
        url = basic_search_url(query, start_num=start_num)
        new_bibtex = base_scrape_bibtex(url, pause_sec=pause_sec)
        start_num += len(new_bibtex)
        bibtex_citations.extend(new_bibtex)
    return bibtex_citations

After generating all entries, we write them to related_work.bib. This BibTeX file can later be used in a LaTeX document to provide the reference list.

CODE

def compile_bibliography(papers, output_file="related_work.bib"):
    """
    Compiles a BibTeX bibliography for all papers in the working set.
    
    This function attempts to get BibTeX entries for each paper and writes
    them to a .bib file for use in LaTeX documents.
    
    Args:
        papers: List of paper dictionaries or dictionary of papers
        output_file: Path to output .bib file
    
    Returns:
        Number of successfully retrieved BibTeX entries
    """
    # Handle both list and dict inputs
    if isinstance(papers, dict):
        papers = list(papers.values())
    bib_entries = []
    failed_papers = []
    print(f"Compiling bibliography for {len(papers)} papers...")
    for i, paper in enumerate(papers):
        if i % 10 == 0:
            print(f"  Processing paper {i+1}/{len(papers)}...")
        try:
            bibtex = get_bibtex_for_paper(paper)
            if bibtex:
                bib_entries.append(bibtex)
            else:
                failed_papers.append(paper)
            # Delay if necessary to avoid spamming
            time.sleep(0)
        except Exception as e:
            print(f"  Error getting BibTeX for '{paper.get('title', 'Unknown')[:40]}...': {e}")
            failed_papers.append(paper)
    # Write successful entries to file
    with open(output_file, "w", encoding='utf-8') as bibfile:
        for entry in bib_entries:
            bibfile.write(entry + "\n\n")
    print(f"\nWrote {len(bib_entries)} BibTeX entries to {output_file}")
    if failed_papers:
        print(f"Failed to get BibTeX for {len(failed_papers)} papers")
        # Optionally save failed papers for manual processing
        with open("failed_papers.json", "w", encoding='utf-8') as f:
            json.dump(failed_papers, f, indent=2, ensure_ascii=False)
    return len(bib_entries)

Alternatives: Instead of relying on Google Scholar for metadata (which might have inconsistencies), one could use the CrossRef API or Semantic Scholar to get high-quality metadata. For example, if DOIs are known, CrossRef can return BibTeX or JSON data for the paper. Semantic Scholar’s data (via their Graph API) includes fields like title, authors, venue, year, DOI, and even formatted citation strings. If high precision is needed (especially in biomedical citations where PubMed IDs might be used), one could also query PubMed or use reference managers like Zotero via their API for metadata.

Implementation notes

This is where things may get expensive to run exhaustively. Using our methods so far, tracking down the BibTeX entry for a single paper takes at least two GET requests, perhaps three if the paper’s citation ID itself needs to be determined. The rest of the paper metadata covered so far costs a fraction of a GET request per query when amortized over the many search results displayed on a single retrieved page. So we may end up using many API calls to fill in BibTeX information for an entire citation graph.

CODE

with open('comb_IL_1.json', 'r') as file:
    init_papers_dict = json.load(file)
#init_papers_dict
init_papers_list = list(init_papers_dict.values())
print(f"{len(init_papers_list)} papers in the initial set")

# Sort init_papers_dict.values() in descending order of citations
sorted_papers = sorted(init_papers_dict.values(), key=lambda x: x['citations'], reverse=True)
print(f"Top 5 papers by citations: \n{json.dumps(sorted_papers[:5], indent=2, ensure_ascii=False)}")

916 papers in the initial set
Top 5 papers by citations: 
[
  {
    "title": "Engineering precision nanoparticles for drug delivery",
    "url": "https://www.nature.com/articles/s41573-020-0090-8",
    "file_url": "https://www.nature.com/articles/s41573-020-0090-8.pdf",
    "citations": 6671,
    "citedby_url": "https://scholar.google.com/scholar?cites=4756991422206245943&as_sdt=2005&sciodt=0,5&hl=en&num=20",
    "authors": {
      "MJ Mitchell": "cc7DoNsAAAAJ",
      "MM Billingsley": "uG650rcAAAAJ",
      "RM Haley": "1jhdTmQAAAAJ"
    },
    "year": "2021",
    "pubinfo": "MJ Mitchell, MM Billingsley, RM Haley… - Nature reviews drug …, 2021 - nature.com",
    "snippet": "… and optimized dosing or combinatorial strategies. However, … components: cationic or \nionizable lipids that complex with … However, despite these advantages, LNP systems can still …",
    "bib_url": "https://scholar.google.com/scholar?hl=en&q=info:N_wusH46BEIJ:scholar.google.com/&output=cite&scirp=0"
  },
  {
    "title": "Lipid nanoparticles for mRNA delivery",
    "url": "https://www.nature.com/articles/s41578-021-00358-0",
    "file_url": "https://www.nature.com/articles/s41578-021-00358-0.pdf",
    "citations": 2934,
    "citedby_url": "https://scholar.google.com/scholar?cites=17748629404528907077&as_sdt=2005&sciodt=0,5&hl=en&num=20",
    "authors": {
      "X Hou": "2hiInR0AAAAJ",
      "T Zaks": null,
      "R Langer": "5HX--AYAAAAJ",
      "Y Dong": "VdjKyiUAAAAJ"
    },
    "year": "2021",
    "pubinfo": "X Hou, T Zaks, R Langer, Y Dong - Nature Reviews Materials, 2021 - nature.com",
    "snippet": "… Cationic lipids, ionizable lipids and other types of lipid have … A combinatorial library has \nbeen designed that contains lipid-like … the chemical diversity of ionizable lipids 86 . Many …",
    "bib_url": "https://scholar.google.com/scholar?hl=en&q=info:RZP_OmjMT_YJ:scholar.google.com/&output=cite&scirp=0"
  },
  {
    "title": "Delivery materials for siRNA therapeutics",
    "url": "https://www.nature.com/articles/nmat3765",
    "file_url": "https://www.researchgate.net/profile/Robert-Dorkin-2/publication/258036544_Delivery_materials_for_siRNA_therapeutics/links/563bc57308ae45b5d2869cce/Delivery-materials-for-siRNA-therapeutics.pdf",
    "citations": 2242,
    "citedby_url": "https://scholar.google.com/scholar?cites=10346749194725180430&as_sdt=2005&sciodt=0,5&hl=en&num=20",
    "authors": {
      "R Kanasty": null,
      "JR Dorkin": "wF4BwVAAAAAJ",
      "A Vegas": "HCJJn10AAAAJ",
      "D Anderson": "NM1dXVYAAAAJ"
    },
    "year": "2013",
    "pubinfo": "R Kanasty, JR Dorkin, A Vegas, D Anderson - Nature materials, 2013 - nature.com",
    "snippet": "… Since then, a number of lipid nanoparticle (LNP) RNAi drugs … of lipid pK a on in vivo gene \nsilencing using 53 ionizable lipids… Several combinatorial libraries have been generated using …",
    "bib_url": "https://scholar.google.com/scholar?hl=en&q=info:DkSnu5IJl48J:scholar.google.com/&output=cite&scirp=0"
  },
  {
    "title": "Rational design of cationic lipids for siRNA delivery",
    "url": "https://www.nature.com/articles/nbt.1602",
    "file_url": null,
    "citations": 2123,
    "citedby_url": "https://scholar.google.com/scholar?cites=6097447847758729327&as_sdt=2005&sciodt=0,5&hl=en&num=20",
    "authors": {
      "SC Semple": null,
      "A Akinc": "vTSZuDQAAAAJ",
      "J Chen": null,
      "AP Sandhu": null,
      "BL Mui": null
    },
    "year": "2010",
    "pubinfo": "SC Semple, A Akinc, J Chen, AP Sandhu, BL Mui… - Nature …, 2010 - nature.com",
    "snippet": "… An empirical, combinatorial chemistry–based approach recently identified novel … the LNP \nrapidly upon intravenous injection. As our goal was to identify novel ionizable cationic lipids for …",
    "bib_url": "https://scholar.google.com/scholar?hl=en&q=info:bxAhCoN8nlQJ:scholar.google.com/&output=cite&scirp=0"
  },
  {
    "title": "Lipid nanoparticles─ from liposomes to mRNA vaccine delivery, a landscape of research diversity and advancement",
    "url": "https://pubs.acs.org/doi/abs/10.1021/acsnano.1c04996",
    "file_url": "https://pubs.acs.org/doi/pdf/10.1021/acsnano.1c04996",
    "citations": 1782,
    "citedby_url": "https://scholar.google.com/scholar?cites=14382414878584419374&as_sdt=2005&sciodt=0,5&hl=en&num=20",
    "authors": {
      "R Tenchov": "sMs6Z3YAAAAJ",
      "R Bird": null,
      "AE Curtze": null,
      "Q Zhou": null
    },
    "year": "2021",
    "pubinfo": "R Tenchov, R Bird, AE Curtze, Q Zhou - ACS nano, 2021 - ACS Publications",
    "snippet": "… Ionizable lipids which are positively charged only inside the … frequently used cationic lipids \nin LNP formulations according … comprising a combination of imaging lipid nanoparticles and …",
    "bib_url": "https://scholar.google.com/scholar?hl=en&q=info:LjRReiGamMcJ:scholar.google.com/&output=cite&scirp=0"
  }
]

Exporting code

For future such reference, it’s useful to collect all the code from this post in a .py file.

CODE

all_code = r'''import time, requests, re, json, os
import numpy as np
from lxml import html


def get_scraper_params_url(url, proxy_config=None):
    if proxy_config is None:
        return {}, url
    SCRAPER_API_URL = ""  # Address for API server of the scraper proxy
    SCRAPER_API_KEY = ""  # Get from SCRAPER_API_URL
    scraper_params = {
        'api_key': SCRAPER_API_KEY,
        'url': url
    }
    return scraper_params, SCRAPER_API_URL


def get_page(url, pause_sec=0):
    """Gets page content using a scraper API and returns lxml tree"""
    time.sleep(pause_sec)
    try:
        scraper_params, scraper_api_url = get_scraper_params_url(url)
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
        }
        response = requests.post(scraper_api_url, headers=headers, params=scraper_params, timeout=60)
        response.raise_for_status()
        # Check if we hit rate limits
        if response.status_code == 429:
            time_limit = 20
            print(f"Rate limit hit, waiting {time_limit} seconds...")
            time.sleep(time_limit)
            return get_page(url, pause_sec=5)  # Retry with longer pause
        return html.fromstring(response.content)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        # Implement exponential backoff for retries
        if hasattr(e, 'response') and e.response.status_code == 500:
            print("Server error, retrying in 10 seconds...")
            return get_page(url, pause_sec=5)
        raise


def get_page_with_retry(url, max_retries=3, pause_sec=0):
    """Wrapper with retry logic for resilience"""
    for attempt in range(max_retries):
        try:
            return get_page(url, pause_sec=pause_sec)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = (attempt + 1) * 3
            print(f"Retry {attempt + 1}/{max_retries} after {wait_time}s...")
            time.sleep(wait_time)


BASE_URL = "https://scholar.google.com"


def base_scrape_search_results(url, pause_sec=0):
    page_contents = get_page_with_retry(url, pause_sec=pause_sec)
    results = page_contents.xpath('//div[contains(@class, "gs_r gs_or gs_scl")]')
    papers = []
    for r in results:
        gs_ri_entries = r.xpath('.//div[contains(@class, "gs_ri")]')
        file_tag = r.xpath('.//div[contains(@class, "gs_ggs")]')
        for entry in gs_ri_entries:
            this_paper = {
                'title': None, 'url': None, 'file_url': None,
                'citations': None, 'citedby_url': None, 'authors': None,
                'year': None, 'pubinfo': None, 'snippet': None
            }

            title_link = entry.xpath('.//h3[contains(@class, "gs_rt")]/a')
            if title_link:
                this_paper['title'] = title_link[0].text_content().strip()
                this_paper['url'] = title_link[0].get('href')

            # Extract citations
            cited_links = entry.xpath('.//a[contains(text(), "Cited by")]')
            if cited_links:
                cited_text = cited_links[0].text_content()
                citations_match = re.search(r'Cited by (\d+)', cited_text)
                if citations_match:
                    this_paper['citations'] = int(citations_match.group(1))
                    this_paper['citedby_url'] = BASE_URL + cited_links[0].get('href')
            else:
                this_paper['citations'] = 0
            # Extract bibliographic ID
            try:
                related_links = entry.xpath('.//a[contains(text(), "Related articles")]')
                if not related_links:
                    continue
                href = related_links[0].get('href')
                if ':' not in href:
                    continue
                cite_id = href.split(':')[1]
                bib_url = f"{BASE_URL}/scholar?hl=en&q=info:{cite_id}:scholar.google.com/&output=cite&scirp=0"
                this_paper['bib_url'] = bib_url
            except Exception as e:
                print(f"Error getting bibliographic URL for entry: {e}")
                continue

            if file_tag:
                file_link = file_tag[0].xpath('.//a')
                if file_link:
                    this_paper['file_url'] = file_link[0].get('href')

            pub_info_tag = entry.xpath('.//div[contains(@class, "gs_a")]')
            if not pub_info_tag:
                continue
            pub_text = pub_info_tag[0].text_content().strip()
            this_paper['pubinfo'] = pub_text

            # Parse authors and year
            parts = pub_text.replace('\u00A0', ' ').split(' - ')
            if parts:
                auth_text = parts[0].strip()
                authors = [x.strip('… ') for x in auth_text.split(',') if x.strip('… ')]
                if len(parts) > 1:
                    year_match = re.search(r'\b(19|20)\d{2}\b', parts[1])
                    if year_match:
                        this_paper['year'] = year_match.group(0)
                authID_dict = {a: None for a in authors}
                author_links = pub_info_tag[0].xpath('.//a')
                for link in author_links:
                    author_name = link.text_content().strip()
                    href = link.get('href')
                    if href and 'user=' in href:
                        user_id = href.split('user=')[1].split('&')[0]
                        if author_name in authID_dict:
                            authID_dict[author_name] = user_id
                this_paper['authors'] = authID_dict

            # Extract snippet
            snippet_tag = entry.xpath('.//div[contains(@class, "gs_rs")]')
            if snippet_tag:
                this_paper['snippet'] = snippet_tag[0].text_content().strip()
        papers.append(this_paper)
    return papers


def basic_search_url(query, start_num=0, results_per_page=100):
    query_str = '+'.join(query.split())
    url = f"{BASE_URL}/scholar?start={start_num}&q={query_str}&hl=en&num={results_per_page}&as_sdt=0,5"
    return url


def advanced_search_url(
    query, start_num=0, results_per_page=100, 
    phrase_str='', words_some='', words_none='', 
    scope='any', authors='', pub='', 
    ylo='', yhi=''
):
    """Generates advanced search URL"""
    query_str = '+'.join(query.split())
    url = (f"{BASE_URL}/scholar?start={start_num}&as_q={query_str}&as_epq={phrase_str}"
            f"&as_oq={words_some}&as_eq={words_none}&as_occt={scope}"
            f"&as_sauthors={authors}&as_publication={pub}"
            f"&as_ylo={ylo}&as_yhi={yhi}&btnG=&hl=en&num={results_per_page}&as_sdt=0%2C5")
    return url


def search_url(
    query, mode='basic', 
    start_num=0, results_per_page=100, 
    phrase_str='', words_some='', words_none='', 
    scope='any', authors='', pub='', 
    ylo='', yhi=''
):
    if mode == 'basic':
        url = basic_search_url(query, start_num=start_num, results_per_page=results_per_page)
    elif mode == 'advanced':
        url = advanced_search_url(
            query, start_num=start_num, results_per_page=results_per_page, 
            phrase_str=phrase_str, words_some=words_some, words_none=words_none, 
            scope=scope, authors=authors, pub=pub, ylo=ylo, yhi=yhi
        )
    else:
        raise ValueError(f"Invalid mode: {mode}")
    return url


def get_results_count(query, pause_sec=0, search_mode='basic', **kwargs):
    url = search_url(query, mode=search_mode, **kwargs)
    # Modify the above line to pass into search URL any more arguments that this function receives. 
    tree = get_page_with_retry(url, pause_sec=pause_sec)
    # pick the `.gs_ab_mdw` node that contains “[About] … results”
    results_divs = tree.xpath('//div[@class="gs_ab_mdw"]')
    for div in results_divs:
        text = div.text_content()
        if 'results' in text:
            match = re.search(r'([\d,]+)\s+results', text)
            if match:
                return int(match.group(1).replace(',', ''))
    return 0


def write_paper_list_as_json(paper_list, fname):
    init_papers = {x['title']: x for x in paper_list}  # deduplicate by title
    init_papers_dict = json.dumps(init_papers, indent=2, ensure_ascii=False)
    with open(fname, 'w', encoding='utf-8') as f:
        f.write(init_papers_dict)


def get_top_papers_for_query(
    query, max_papers=np.inf, min_citations=1, out_file_name=None, verbose=False, 
    pause_sec=0, start_num=0, mode='basic', **kwargs
):
    num_queries = 0
    results_count = get_results_count(query, pause_sec=pause_sec, search_mode=mode, **kwargs)
    num_queries += 1
    papers = []
    while len(papers) < np.min([max_papers, results_count]):
        url = search_url(query, start_num=start_num, mode=mode, **kwargs)
        new_papers = base_scrape_search_results(url, pause_sec=pause_sec)
        num_queries += 1
        if not new_papers:
            print(f"No new papers at {url}")
            break
        start_num += len(new_papers)
        page_count = num_queries - 1
        papers.extend(new_papers)
        citations = [p['citations'] for p in new_papers if p['citations'] is not None]
        if citations:
            medcit = np.median(citations)
            if verbose:
                print(f"Median citations on page {page_count} of results: {medcit}")
            if medcit < min_citations:
                print(f"Stopping search, median citations {medcit} is below threshold {min_citations}.")
                break
    if out_file_name is not None:
        write_paper_list_as_json(papers, out_file_name)
    return papers, results_count, num_queries


def papers_by_author(author_scholar_id, max_papers=20):
    """
    Retrieves papers by a specific author using their Google Scholar ID.
    
    Args:
        author_scholar_id: Google Scholar author ID
        max_papers: Maximum number of papers to retrieve
    
    Returns:
        List of paper dictionaries
    """
    auth_url = f"{BASE_URL}/citations?user={author_scholar_id}&hl=en&cstart=0&pagesize={max_papers}"
    tree = get_page_with_retry(auth_url)
    papers = []
    # Extract paper entries from author's page
    entries = tree.xpath('//tr[@class="gsc_a_tr"]')
    for entry in entries:
        paper = {}
        # Title and link
        title_elem = entry.xpath('.//a[@class="gsc_a_at"]')
        if title_elem:
            paper['title'] = title_elem[0].text_content().strip()
            paper['url'] = BASE_URL + title_elem[0].get('href')
        # Authors and publication info
        authors_elem = entry.xpath('.//div[@class="gs_gray"][1]')
        if authors_elem:
            paper['authors'] = authors_elem[0].text_content().strip()
        pub_elem = entry.xpath('.//div[@class="gs_gray"][2]')
        if pub_elem:
            paper['pubinfo'] = pub_elem[0].text_content().strip()
            # Extract year
            year_match = re.search(r'\b(19|20)\d{2}\b', paper['pubinfo'])
            if year_match:
                paper['year'] = year_match.group(0)
        # Citations
        cite_elem = entry.xpath('.//a[@class="gsc_a_ac gs_ibl"]')
        if cite_elem and cite_elem[0].text_content().strip():
            paper['citations'] = int(cite_elem[0].text_content().strip())
        else:
            paper['citations'] = 0
        papers.append(paper)
    return papers


def base_scrape_bibtex(url, pause_sec=None):
    """
    Get the Bibtex citation for this paper. 
    This involves constructing the citation ID from the search result page, 
    traversing the citation page, and fetching the actual BibTeX content. 
    """
    tree = get_page_with_retry(url, pause_sec=pause_sec)
    bibtex_citations = []
    entries = tree.xpath('//div[contains(@class, "gs_ri")]')
    for entry in entries:
        try:
            related_links = entry.xpath('.//a[contains(text(), "Related articles")]')
            if not related_links:
                continue
            href = related_links[0].get('href')
            if ':' not in href:
                continue
            cite_id = href.split(':')[1]
            cite_url = f"{BASE_URL}/scholar?hl=en&q=info:{cite_id}:scholar.google.com/&output=cite&scirp=0"
            cite_tree = get_page_with_retry(cite_url, pause_sec=pause_sec)
            bibtex_links = cite_tree.xpath('//div[@id="gs_citi"]//a[contains(text(), "BibTeX")]')
            if bibtex_links:
                bibtex_url = BASE_URL + bibtex_links[0].get('href')
                bibtex_response = requests.get(bibtex_url)
                bibtex_text = bibtex_response.text.strip()
                if bibtex_text and bibtex_text.startswith('@'):
                    bibtex_citations.append(bibtex_text)
        except Exception as e:
            print(f"Error extracting BibTeX for entry: {e}")
            continue
    return bibtex_citations


def get_bibtex_for_paper(paper):
    """
    Gets BibTeX citation for a single paper.
    
    Args:
        paper: Paper dictionary containing title
    
    Returns:
        BibTeX string or None
    """
    if not paper.get('title'):
        return None
    # Search for the exact paper title
    search_query = f'"{paper["title"]}"'
    search_url = basic_search_url(search_query, results_per_page=1)
    bibtex_entries = base_scrape_bibtex(search_url)
    return bibtex_entries[0] if bibtex_entries else None


def get_bibtex_for_query(query, max_papers=100, pause_sec=None):
    bibtex_citations = []
    start_num = 0
    while len(papers) < max_papers:
        url = basic_search_url(query, start_num=start_num)
        new_bibtex = base_scrape_bibtex(url, pause_sec=pause_sec)
        start_num += len(new_bibtex)
        bibtex_citations.extend(new_bibtex)
    return bibtex_citations


def compile_bibliography(papers, output_file="related_work.bib"):
    """
    Compiles a BibTeX bibliography for all papers in the working set.
    
    This function attempts to get BibTeX entries for each paper and writes
    them to a .bib file for use in LaTeX documents.
    
    Args:
        papers: List of paper dictionaries or dictionary of papers
        output_file: Path to output .bib file
    
    Returns:
        Number of successfully retrieved BibTeX entries
    """
    # Handle both list and dict inputs
    if isinstance(papers, dict):
        papers = list(papers.values())
    bib_entries = []
    failed_papers = []
    print(f"Compiling bibliography for {len(papers)} papers...")
    for i, paper in enumerate(papers):
        if i % 10 == 0:
            print(f"  Processing paper {i+1}/{len(papers)}...")
        try:
            bibtex = get_bibtex_for_paper(paper)
            if bibtex:
                bib_entries.append(bibtex)
            else:
                failed_papers.append(paper)
            # Delay if necessary to avoid spamming
            time.sleep(0)
        except Exception as e:
            print(f"  Error getting BibTeX for '{paper.get('title', 'Unknown')[:40]}...': {e}")
            failed_papers.append(paper)
    # Write successful entries to file
    with open(output_file, "w", encoding='utf-8') as bibfile:
        for entry in bib_entries:
            bibfile.write(entry + "\n\n")
    print(f"\nWrote {len(bib_entries)} BibTeX entries to {output_file}")
    if failed_papers:
        print(f"Failed to get BibTeX for {len(failed_papers)} papers")
        # Optionally save failed papers for manual processing
        with open("failed_papers.json", "w", encoding='utf-8') as f:
            json.dump(failed_papers, f, indent=2, ensure_ascii=False)
    return len(bib_entries)
'''

CODE

file_pfx = "../../files/litsearch/"

# If this directory doesn't exist, create it
if not os.path.exists(file_pfx):
    os.makedirs(file_pfx)

with open(file_pfx + "gs_literature_tools.py", "w", encoding="utf-8") as f:
    f.write(all_code)

References

Gusenbauer, Michael, and Neal R Haddaway. 2020. “Which Academic Search Systems Are Suitable for Systematic Reviews or Meta-Analyses? Evaluating Retrieval Qualities of Google Scholar, PubMed, and 26 Other Resources.” Research Synthesis Methods 11 (2): 181–217.

Martı́n-Martı́n, Alberto, Enrique Orduna-Malea, Mike Thelwall, and Emilio Delgado López-Cózar. 2018. “Google Scholar, Web of Science, and Scopus: A Systematic Comparison of Citations in 252 Subject Categories.” Journal of Informetrics 12 (4): 1160–77.

Footnotes

The scholarly Python package has been used in the past as an unofficial Google Scholar API. There have been other now-defunct attempts under out-of-date versions of Google Scholar to do this as well. In trials across an extensive range of use cases and subfields of academic/patent literature, we have found these open-source alternatives unsatisfactory because they do not sufficiently optimize API calls and therefore use an order of magnitude more calls than desired or necessary under certain common usage patterns. Also, programmatic customizability in other ways (such as browsing from different devices) is not possible with such packages. The code in this post is a modular implementation of the scraping process, which is much more flexible.↩︎

Reuse

CC BY 4.0