Today, I’m going to share how you can use Selenium to scrape Google’s Search Engine Results Pages (SERP) from scratch in Python. And guess what? We’ll be doing this on a DigitalOcean server using a Jupyter Notebook without using any scraping API. If you’re ready for an adventure in web scraping, let’s get started!
Setting Up Our DigitalOcean Server
First things first, we need a server. I’m opting for a DigitalOcean droplet because of its ease of use and reliability. For this project, I’ve set up a droplet with 2 vCPUs, 2 GB RAM, and a 60 GB SSD. This configuration runs on Ubuntu 22.04 (LTS) x64, offering a stable and robust environment for our scraping project.
SSH into Your Server
Once your droplet is ready, let’s SSH into it. If you’re familiar with SSH, this should be a breeze. Make sure you have your private key ready. Here’s the command I use:
ssh -L 8888:localhost:8888 -i /path/to/private/key root@{ip-address}
Notice the port tunneling in the command? That’s because we’re going to use Jupyter Notebook, which runs on port 8888 by default.
Setting Up the Environment
With access to our server, it’s time to set up our Python environment with all the necessary dependencies, including Chrome and ChromeDriver. Here are the commands to get everything ready:
1. Update the Server:
Before installing anything, it’s a good practice to update your server’s package list:
sudo apt update
2. Install Google Chrome:
Google Chrome isn’t included in the default Ubuntu repositories. To install it, first download the Debian package from the Chrome website:
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
Then install it with dpkg:
sudo dpkg -i google-chrome-stable_current_amd64.deb
If you encounter any errors, fix them by running:
sudo apt-get install -f
3. Install ChromeDriver:
For Selenium to control Chrome, you need ChromeDriver, and its version must match the Chrome version you just installed (check yours with google-chrome --version). For Chrome 115 and newer, grab the matching build from the Chrome for Testing downloads page (https://googlechromelabs.github.io/chrome-for-testing/); the commands below show the general pattern using an older example version:
wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
4. Install Additional Dependencies:
You’ll also need a few system packages to ensure everything works smoothly: pip for Python and Xvfb for the virtual display we’ll use later:
sudo apt-get install -y libxi6 python3-pip xvfb
5. Install Python Packages:
Now, install the necessary Python packages, including Selenium, BeautifulSoup, Jupyter Notebook, and PyVirtualDisplay (lxml and cchardet simply make BeautifulSoup’s parsing faster):
pip install selenium bs4 notebook lxml cchardet pyvirtualdisplay
With these dependencies installed, your DigitalOcean server is now ready for web scraping using Python.
6. Start Jupyter Notebook:
Next, start Jupyter Notebook with this command:
jupyter notebook --allow-root
You can now access Jupyter Notebook from your local browser by navigating to http://localhost:8888 (the SSH port tunnel we set up earlier forwards this to the server) and entering the token shown in the terminal output.
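Before we bring a proxy into the picture, it’s worth confirming that Chrome, ChromeDriver, and Selenium can all talk to each other. Here’s a minimal smoke test you can run in a notebook cell; it assumes ChromeDriver ended up at /usr/bin/chromedriver as in step 3:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Quick check: launch headless Chrome and print a page title
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')
options.add_argument('--no-sandbox')  # Chrome refuses to run as root without this
with webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=options) as driver:
    driver.get('https://example.com')
    print(driver.title)  # should print "Example Domain"
If this prints a title without errors, the browser stack is good to go.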
Setting Up Selenium with Proxy
For scraping Google SERPs, we’ll use Selenium with a proxy. I’m using OxyLabs’ Datacenter Proxy, but you can choose any other reliable proxy service. Here’s the sample code to set up a headless Chrome browser with Selenium and a proxy in a Jupyter Notebook:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from pyvirtualdisplay import Display

# Start a virtual display (Xvfb) so Chrome has a screen to render to
screen_display = Display(visible=0, size=(800, 800))
screen_display.start()

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=ddc.oxylabs.io:8011')  # assumes an IP-whitelisted proxy endpoint
chrome_options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # skip loading images
chrome_options.add_argument('--headless=new')
chrome_options.add_argument('--no-sandbox')  # required when running Chrome as root

# Point Selenium at the ChromeDriver binary we installed in the setup steps
with webdriver.Chrome(service=Service('/usr/bin/chromedriver'), options=chrome_options) as driver:
    driver.maximize_window()
    driver.get('https://www.google.com/search?q=restaurants+near+me')
    time.sleep(3)  # Wait 3 seconds for the page to load
    thesoup = BeautifulSoup(driver.page_source, 'lxml')
In this code, we’re setting up the Chrome driver with the necessary options, including the proxy settings. The --headless option allows us to run the browser in the background.
Once everything is set up, the actual scraping is straightforward. We navigate to the Google search page for our query and then parse the page source with BeautifulSoup. Here, we’re looking for “restaurants near me”, but you can modify the query as needed.
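The query is hard-coded into the URL above. If you want to parametrize it, a small sketch like this builds a properly encoded search URL from any string (the helper name here is just for illustration):
from urllib.parse import quote_plus

def google_search_url(query: str) -> str:
    # URL-encode the query so spaces and special characters are handled safely
    return f'https://www.google.com/search?q={quote_plus(query)}'

# e.g. inside the with-block above:
# driver.get(google_search_url('best sushi in Brooklyn'))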
Extracting Data with BeautifulSoup
Assuming we have navigated to our desired Google search page and have the page source parsed into thesoup, our next step is to extract useful information. In this case, let’s extract the titles and URLs of the search results. Here’s how you can do it:
for result in thesoup.find_all('div', class_='tF2Cxc'):
    title = result.find('h3').get_text()
    link = result.find('a')['href']
    print(f"Title: {title}\nLink: {link}\n")
This code iterates through each search result (which, in Google’s HTML structure, is typically contained within div tags with the class ‘tF2Cxc’). For each result, it finds the title (the <h3> tag) and the corresponding URL (the href attribute of the <a> tag). Keep in mind that Google changes these class names from time to time, so you may need to inspect the page and update them.
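If you’d rather collect the results into a data structure than just print them, here’s a minimal sketch (reusing thesoup from above) that builds a list of dictionaries you could later write out as JSON or CSV:
results = []
for result in thesoup.find_all('div', class_='tF2Cxc'):
    h3 = result.find('h3')
    a = result.find('a')
    if not h3 or not a:
        continue  # skip blocks that don't look like a standard organic result
    results.append({'title': h3.get_text(), 'link': a['href']})

print(f"Collected {len(results)} results")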
Advanced Data Extraction
If you want to get more sophisticated, you could also extract other pieces of information like the brief description (snippet) that Google provides for each search result:
for result in thesoup.find_all('div', class_='tF2Cxc'):
    title = result.find('h3').get_text()
    link = result.find('a')['href']
    snippet = result.find('div', class_='VwiC3b').get_text()
    print(f"Title: {title}\nLink: {link}\nSnippet: {snippet}\n")
This code is similar to the previous one but also looks for a div with the class ‘VwiC3b’, which typically contains the search result snippet.
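Not every result exposes a snippet in that exact container, and calling get_text() on a missing element raises an AttributeError, so a guarded version of the loop is a bit safer:
for result in thesoup.find_all('div', class_='tF2Cxc'):
    title = result.find('h3').get_text()
    link = result.find('a')['href']
    snippet_div = result.find('div', class_='VwiC3b')
    snippet = snippet_div.get_text() if snippet_div else ''  # fall back to an empty string
    print(f"Title: {title}\nLink: {link}\nSnippet: {snippet}\n")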
Final Thoughts Before Wrapping Up
In this article, I showed you how to scrape Google search results data for purposes like SEO analysis, market research, or academic research. While Selenium and BeautifulSoup make it easy to fetch, parse, and extract the data, remember to use this power responsibly and abide by legal and ethical considerations.
Happy data extraction!