Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. In simpler terms, web scraping is a technique to automatically extract large amounts of data from websites. It involves using a script or tool to load a web page and retrieve specific pieces of data for analysis.
Legal and Ethical Considerations: before scraping a website, check its terms of service and its robots.txt file, and avoid flooding the server with requests.
We will use Python and learn about the requests and beautifulsoup libraries, installing them with virtual environments and the pip tool (i.e. similar to Ubuntu's apt package manager).
Let's get started! But first, some quick theory…
HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. It's a protocol, or set of rules, that allows web browsers and servers to communicate with each other, enabling the transfer of data over the internet.
When you visit a website, your browser (like Chrome or Firefox) acts as a client, and it sends an HTTP request to a web server where the website is hosted. The server processes this request, retrieves the requested data (like a web page, image, or video), and sends it back to the browser in the form of an HTTP response.
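To make the request/response cycle concrete, here is a minimal sketch that drives it by hand using Python's built-in http.client module (example.com is just a placeholder host):

import http.client

# Open a connection to the web server (port 443, since we use HTTPS)
conn = http.client.HTTPSConnection("example.com")

# Send a GET request for the root page, just like a browser would
conn.request("GET", "/")

# Read the server's HTTP response
response = conn.getresponse()
print(response.status, response.reason)    # e.g. 200 OK
print(response.getheader("Content-Type"))  # e.g. text/html; charset=UTF-8

body = response.read()  # the raw HTML document, as bytes
conn.close()

This works, but it is quite low-level, which is exactly why we will reach for a friendlier library.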
If we want to automate this process, Python has a popular library: requests, allowing us to do what the browser (HTTP client) does using simple lines of code!
However, for web scraping, we're mainly interested in the information contained in the page. This information is presented using the HyperText Markup Language (HTML). The raw HTML data sent by the server is often messy and challenging to parse, so we need to clean and structure it for easier processing (using the BeautifulSoup library for Python).
BeautifulSoup parses and organizes the HTML code, making it easier for us as developers to navigate and locate specific elements.
We still need to inspect this improved HTML "soup" to find the data we're looking for, so we first need to understand the basic HTML tags (a short code example using them follows below):
<body> – Contains all the visible content on the page.
<div> – Groups sections together, often with class or id attributes to help identify different parts.
<p> – Paragraphs of text; useful for finding blocks of text content.
<h1>, <h2>, etc. – Headings that structure content by importance, from largest to smallest.
<a> – Links. Check the href attribute to find URLs.
<img> – Images. The src attribute contains the image's URL.

Python libraries are collections of reusable code that provide functionality for a wide range of tasks, from data analysis and machine learning to web development and automation. Libraries are often hosted on the Python Package Index (PyPI) and can be easily installed using package managers like pip.
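To see how the tags listed above look from Python, here is a minimal sketch that parses a tiny hand-written HTML snippet with BeautifulSoup (the snippet and its values are made up; installing the library is covered below):

from bs4 import BeautifulSoup

# A tiny hand-written HTML document, just to illustrate the tags above
html = """
<body>
  <div class="intro">
    <h1>Hello</h1>
    <p>Some text.</p>
    <a href="https://example.com">A link</a>
    <img src="https://example.com/cat.png">
  </div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)       # Hello
print(soup.p.text)        # Some text.
print(soup.a["href"])     # https://example.com
print(soup.img["src"])    # https://example.com/cat.png
print(soup.div["class"])  # ['intro'] -- class attributes come back as a list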
As you work on different Python projects, you may need different versions of the same module, perhaps even of a module that you already have installed system-wide. For this, we use virtual environments. These environments allow you to install specific module versions in a local directory and alter your shell's environment to prioritize using them. Switching between environments can be as easy as sourcing another setup script.
The problem with virtual environments is that they don't mesh well with apt. Instead of apt, we will use a Python module manager called pip3. Our suggestion is to use pip only in virtual environments. Yes, it can also install modules system-wide, but most modules can be found as apt packages anyway. Generally, it is not a good idea to mix package managers (and modern Linux distributions outright refuse this, with an explicit error, when you try to install Python packages outside a virtualenv)!
First things first, we need python3, the venv module, and pip. These we can get with apt:
$ sudo apt install python3 python3-venv python3-pip
Assuming that you are in your project's root directory already, we can set up the virtual environment:
$ python3 -m venv .venv
The -m flag tells the Python interpreter to run a module. python3 searches its known install paths for said module (in this case, venv) and runs it as a script. .venv is the script's argument and represents the name of the storage directory. Take a look at its internal structure:
$ tree -L 3 .venv
Notice that in .venv/bin/ we have both binaries and activation scripts. These scripts, when sourced, will force the current shell to prioritize using the binaries in this directory. The modules you install will be placed in .venv/lib/python3.*/site-packages/. Try to activate your environment now. Once active, you will have access to the deactivate command that will restore your previous environment state:
$ source .venv/bin/activate
$ deactivate
When the environment is active, the activation script normally prefixes your prompt with (.venv). Depending on your zsh theme, this can result in a prompt that looks nothing like that in the GIF above. To disable the default prefix and let your theme handle the display instead, add this to your .zshrc:
VIRTUAL_ENV_DISABLE_PROMPT="yes"
The display function depends on your selected theme. For agnoster, you can fiddle with the prompt_virtualenv() function in the agnoster.zsh-theme source file.
Just like apt, pip used to have a search command for modules. Unfortunately, this feature was removed due to the high volume of queries it generated. Now, to search for modules, you will need to use the PyPI web interface.
Let us install the modules needed for this laboratory. After that, let us also check the versions of all modules (some will be added implicitly). Can you also find their installation path in the .venv/ directory?
$ pip3 install requests beautifulsoup4
$ pip3 list
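As one way to answer the installation-path question above, you can also ask the modules themselves from inside the activated environment; a minimal sketch:

import requests
import bs4

# The version each module reports about itself
print(requests.__version__)
print(bs4.__version__)

# Where the modules were loaded from -- these paths should point
# inside .venv/lib/python3.*/site-packages/
print(requests.__file__)
print(bs4.__file__)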
You might also want to read the official VSCode documentation about virtual environments.
To snapshot the modules installed in an environment and recreate it elsewhere, you can freeze the list into a requirements.txt file and install from it later:
$ pip3 freeze > requirements.txt
$ pip3 install -r requirements.txt
We can use the python3 interpreter in interactive mode to quickly test whether our modules installed correctly:
$ python3
Python 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> help(requests.get)
If you ever need to debug your Python program, one approach is to open up a shell and run your program line by line, which is a bit tedious (copy-pasting each line of the program). However, there's a better method:
import IPython
IPython.embed(colors='neutral')
$ sudo apt install ipython3
This will stop the execution of your script and open an ipython shell that has access to a copy of all your local & global data. Use this to inspect the state of your variables, their types, etc. Exit the shell (Ctrl + D) to continue the execution.
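For instance, here is a minimal sketch of dropping into such a shell in the middle of a hypothetical script (the URL is just a placeholder):

import IPython
import requests

def scrape(url):
    response = requests.get(url)
    # Execution pauses here: inspect url, response, response.status_code, ...
    IPython.embed(colors='neutral')
    return response.text

scrape("https://example.com")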
Now that we have the requests library, we can easily send HTTP requests to any URL. This prompts the server to respond with the information we need. When the request is successful, the server will reply with the standard status code 200 (OK), indicating everything went smoothly. Simply replace the URL with the desired website (try a Wikipedia page), and you're ready to go!
import requests  # Imports the library in the script

url = "https://example.com"
response = requests.get(url)  # Send request to website
html_content = response.text  # Get the HTML content of the page

# optionally, print it on console (see its ugliness)
print(html_content)
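Since we just talked about status codes, it is worth checking them in code as well. A minimal sketch of guarding against failed requests, using the same placeholder URL:

import requests

url = "https://example.com"
response = requests.get(url)

print(response.status_code)  # 200 if everything went smoothly

# Raise an exception on error codes (4xx / 5xx) instead of
# silently parsing an error page
response.raise_for_status()

html_content = response.text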
Now, let’s parse through the HTML we just received. We’ll use BeautifulSoup, a powerful library commonly used in web scraping. BeautifulSoup helps us navigate and work with the HTML content of any page, making it easy to locate specific data we want to extract.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
html_content = response.content  # HTML content of the page

soup = BeautifulSoup(html_content, "html.parser")  # this is the first step in using BeautifulSoup
The soup object in BeautifulSoup is a complex object with many built-in methods that allow us to interact with and extract data from the HTML. We use these methods with the following syntax:
soup.method_name(arguments)
(in object-oriented programming terms, this is a method call on an object)
Here are some examples of those methods (functions):
find is used to retrieve the first occurrence of a specific HTML tag (this can be any tag!).
# ...
soup = BeautifulSoup(html_content, "html.parser")

title = soup.find("h1")  # finds the first primary title (this can be any number from 1->6)
title2 = soup.title

paragraph = soup.find("p")  # the content of the first paragraph

image = soup.find("img")  # Finds the first <img>; its attributes can be accessed like a dictionary
print(image['src'])       # extract the URL of the image

link = soup.find("a")     # Finds the first <a> (anchor) tag
print(link['href'])       # Prints the URL of the link

list_item = soup.find("li")  # The first list item
title = soup.find("h1")
print(title)       # -> <h1>Example Title</h1>
print(title.text)  # -> Example Title
The find_all method searches for all occurrences of a specified tag and returns a list (array) of all matching tags. Each item in the list is a full HTML tag, allowing you to work with multiple instances of the same element on a page.
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = "https://example.com"
response = requests.get(url)     # Send a request to the website
html_content = response.content  # Retrieve the HTML content of the page

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Extract all paragraph (<p>) elements
paragraphs = soup.find_all("p")  # Finds all <p> tags on the page
# paragraphs is a list!

# Iterate over each paragraph and print its text content
for paragraph in paragraphs:
    print(paragraph.text)
Now here's the task: try to inspect Wikipedia's HTML and fetch the content of the table row (<tr>) elements containing each president's info! Note that you should filter the parent table before that.
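If you get stuck, here is a rough sketch of one possible approach. The URL and the "wikitable" class are assumptions for illustration only; inspect the actual page in your browser and adjust the selectors accordingly:

import requests
from bs4 import BeautifulSoup

# NOTE: placeholder URL -- use the page from your task
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Filter the parent table first, then iterate over its rows
table = soup.find("table", class_="wikitable")  # the class name is an assumption
if table is not None:
    for row in table.find_all("tr"):
        print(row.text.strip())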
Most websites have multiple pages, so our scraper should be capable of handling this by navigating through pagination. Pagination is typically controlled through the URL, so we’ll need to make multiple requests for each page. By identifying the pattern in the website’s subpage URLs, we can scan each page separately to gather all the data we need.
import requests
from bs4 import BeautifulSoup

# Base URL for pagination, with a placeholder for the page number
base_url = 'https://example.com/page/'

# Loop through pages 1 to 5
for page_num in range(1, 6):
    # Construct the URL for the current page by appending the page number
    url = f'{base_url}{page_num}'

    # Send a request to the current page URL
    response = requests.get(url)

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Now, you can use BeautifulSoup methods to extract data from the current page
    items = soup.find_all("div", class_="item")  # Modify the selector based on the target data
    for item in items:
        print(item.text)

# This loop will go through pages 1 to 5, fetching and parsing each page's HTML.
# Adjust the page range as needed for your specific case.
Final task: fetch data from more Wikipedia pages, e.g. the birthdays of the last 5 presidents (let's keep it simple: manually explore the website, take note of the URL for each one and hardcode them as a Python list).
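A minimal sketch of the overall shape, with placeholder URLs (the real pages, and the element that actually holds each birthday, are yours to find while exploring the site):

import requests
from bs4 import BeautifulSoup

# Placeholder URLs -- replace them with the pages you noted down manually
president_urls = [
    "https://en.wikipedia.org/wiki/President_1",
    "https://en.wikipedia.org/wiki/President_2",
]

for url in president_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # The page title is a quick sanity check that we fetched the right page
    print(soup.find("h1").text)

    # The birthday usually sits somewhere in the page's infobox;
    # inspect the HTML in your browser to find the exact element.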