How to Scrape Google SERP with Selenium in Python

Today, I’m going to share how you can use Selenium to scrape Google’s Search Engine Results Pages (SERP) from scratch in Python. And guess what? We’ll be doing this on a DigitalOcean server, inside a Jupyter Notebook, without relying on any scraping API. If you’re ready for an adventure in web scraping, let’s get started!

Setting Up Our DigitalOcean Server

First things first, we need a server. I’m opting for a DigitalOcean droplet because of its ease of use and reliability. For this project, I’ve set up a droplet with 2 vCPU, 2 GB RAM, and a 60GB SSD. This configuration runs on Ubuntu 22.04 (LTS) x64, offering a stable and robust environment for our scraping project.

SSH into Your Server

Once your droplet is ready, let’s SSH into it. If you’re familiar with SSH, this should be a breeze. Make sure you have your private key ready. Here’s the command I use:

ssh -L 8888:localhost:8888 -i /path/to/private/key root@{ip-address}

Notice the port tunneling in the command? That’s because we’re going to use Jupyter Notebook, which runs on port 8888 by default.

Setting Up the Environment

With access to our server, it’s time to set up our Python environment along with everything Selenium needs, including Chrome and ChromeDriver. Here are the commands to get everything ready:

1. Update the Server:
Before installing anything, it’s a good practice to update your server’s package list:

sudo apt update

2. Install Google Chrome:
Google Chrome isn’t included in the default Ubuntu repositories. To install it, first download the Debian package from the Chrome website:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

Then install it using dpkg:

sudo dpkg -i google-chrome-stable_current_amd64.deb

If you encounter any errors, fix them by running:

sudo apt-get install -f

3. Install ChromeDriver:
For Selenium to control Chrome, you need ChromeDriver, and its major version has to match the Chrome you just installed (check yours with google-chrome --version). The commands below fetch version 2.41 from the legacy download site as an example; swap in the release that matches your browser, and note that drivers for Chrome 115 and newer are published on the Chrome for Testing downloads page instead. Download and install it with:

wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver

4. Install Additional Dependencies:
You’ll also need a few extra system packages: libxi6, Xvfb (required by PyVirtualDisplay, which we’ll use later), and pip for installing Python packages:

sudo apt-get install -y libxi6 xvfb python3-pip

5. Install Python Packages:
Now, install the necessary Python packages, including Selenium, BeautifulSoup (the bs4 package), Jupyter Notebook, and PyVirtualDisplay:

pip install selenium bs4 notebook lxml cchardet pyvirtualdisplay

With these dependencies installed, your DigitalOcean server is now ready for web scraping using Python.

6. Start Jupyter Notebook:
Now, start Jupyter Notebook with this command:

jupyter notebook --allow-root --no-browser

Now you can access Jupyter Notebook from your local browser (through the SSH tunnel we set up earlier) by navigating to http://localhost:8888 and pasting in the token shown in the terminal output.

Setting Up Selenium with a Proxy

For scraping Google SERP, we’ll use Selenium with a proxy. I’m using Oxylabs’ Datacenter Proxies, but you can choose any other reliable proxy service. This setup assumes your server’s IP address is whitelisted with the provider, since Chrome’s --proxy-server flag doesn’t accept a username and password. Here’s the sample code to set up Selenium with the proxy in a Jupyter Notebook:

import time
import lxml, cchardet  # not used directly, but ensures the faster lxml parser and cchardet are available to BeautifulSoup
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from pyvirtualdisplay import Display

# Start a virtual display so Chrome has a screen to render into
screen_display = Display(visible=0, size=(800, 800))
screen_display.start()

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=ddc.oxylabs.io:8011')
chrome_options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # skip loading images
chrome_options.add_argument('--headless=new')

# Point Selenium at the ChromeDriver binary we installed earlier
with webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=chrome_options) as driver:
    driver.maximize_window()

    driver.get('https://www.google.com/search?q=restaurants+near+me')
    time.sleep(3)  # Wait 3 seconds for the page to load

    thesoup = BeautifulSoup(driver.page_source, 'lxml')

screen_display.stop()

In this code, we’re setting up the Chrome driver with the necessary options, including the proxy settings and a preference that disables image loading to speed things up. The --headless=new option runs the browser in the background, while the PyVirtualDisplay virtual display gives Chrome a screen to render into on our display-less server.

Once everything is set up, the actual scraping is straightforward. We navigate to the Google search page for our query and then parse the page source with BeautifulSoup. Here, we’re looking for “restaurants near me”, but you can modify the query as needed.
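If you want to search for something else without hand-encoding the query string, one option is to build the URL with urllib.parse.quote_plus and wait for the results container to appear instead of sleeping a fixed three seconds. Here’s a minimal sketch that reuses chrome_options from above; it assumes Google’s results still live in a div with id "search", which is true at the time of writing but may change:

from urllib.parse import quote_plus

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

query = "best pizza in new york"  # any query you like
url = f"https://www.google.com/search?q={quote_plus(query)}"

with webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=chrome_options) as driver:
    driver.get(url)
    # Wait up to 10 seconds for the results container (id="search") to appear
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'search')))
    thesoup = BeautifulSoup(driver.page_source, 'lxml')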

Extracting Data with BeautifulSoup

Assuming we have navigated to our desired Google search page and have the page source stored in thesoup, our next step is to extract useful information. In this case, let’s extract the titles and URLs of the search results. Here’s how you can do it:

for result in thesoup.find_all('div', class_='tF2Cxc'):
    title = result.find('h3').get_text()
    link = result.find('a')['href']
    print(f"Title: {title}\nLink: {link}\n")

This code iterates through each search result (which, in Google’s HTML structure, is typically contained within div tags with the class ‘tF2Cxc’). For each result, it finds the title (<h3> tag) and the corresponding URL (the href attribute in the <a> tag).
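If you’d rather keep the results than just print them, a simple option is to collect each title and link into a list of dictionaries and write them out with Python’s built-in csv module. A quick sketch, using the same ‘tF2Cxc’ class as above and an arbitrary output filename:

import csv

results = []
for result in thesoup.find_all('div', class_='tF2Cxc'):
    results.append({
        'title': result.find('h3').get_text(),
        'link': result.find('a')['href'],
    })

# Write the collected results to a CSV file for later analysis
with open('serp_results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(results)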

Advanced Data Extraction

If you want to get more sophisticated, you could also extract other pieces of information like the brief description (snippet) that Google provides for each search result:

for result in thesoup.find_all('div', class_='tF2Cxc'):
    title = result.find('h3').get_text()
    link = result.find('a')['href']
    snippet = result.find('div', class_='VwiC3b').get_text()
    print(f"Title: {title}\nLink: {link}\nSnippet: {snippet}\n")

This code is similar to the previous one but also looks for a div with the class ‘VwiC3b’, which typically contains the search result snippet.
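One caveat: Google rotates these class names fairly often, and not every result includes a snippet, so result.find() can return None and crash the loop. A slightly more defensive sketch guards against missing elements:

for result in thesoup.find_all('div', class_='tF2Cxc'):
    title_tag = result.find('h3')
    link_tag = result.find('a')
    snippet_tag = result.find('div', class_='VwiC3b')

    # Skip results that don't have the pieces we expect
    if not title_tag or not link_tag:
        continue

    title = title_tag.get_text()
    link = link_tag.get('href', '')
    snippet = snippet_tag.get_text() if snippet_tag else ''
    print(f"Title: {title}\nLink: {link}\nSnippet: {snippet}\n")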

Final Thoughts Before Wrapping Up

Extracting data from Google SERP can yield useful insights for SEO analysis, market research, or academic work. While BeautifulSoup makes it easy to parse and extract data, remember to use this power responsibly and abide by legal and ethical considerations.

Happy data extraction!