Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. In simpler terms, web scraping is a technique to automatically extract large amounts of data from websites. It involves using a script or tool to load a web page and retrieve specific pieces of data for analysis.
Legal and Ethical Considerations: before scraping a website, check its terms of service and its robots.txt file, and avoid flooding the server with requests.
We will use Python and learn about the requests and beautifulsoup libraries, installing them with virtual environments and the pip tool (i.e. similar to Ubuntu's apt package manager).
Let's get started! But first, some quick theory…
HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. It's a protocol, or set of rules, that allows web browsers and servers to communicate with each other, enabling the transfer of data over the internet.
When you visit a website, your browser (like Chrome or Firefox) acts as a client, and it sends an HTTP request to a web server where the website is hosted. The server processes this request, retrieves the requested data (like a web page, image, or video), and sends it back to the browser in the form of an HTTP response.
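To make the request/response cycle concrete, here is a minimal sketch that drives it by hand using Python's built-in http.client module (example.com is just a placeholder host):

import http.client

# Open a connection to the web server (port 443, since we use HTTPS)
conn = http.client.HTTPSConnection("example.com")

# Send a GET request for the root page, just like a browser would
conn.request("GET", "/")

# Read the server's HTTP response
response = conn.getresponse()
print(response.status, response.reason)    # e.g. 200 OK
print(response.getheader("Content-Type"))  # e.g. text/html; charset=UTF-8

body = response.read()  # the raw HTML document, as bytes
conn.close()

This works, but it is quite low-level, which is exactly why we will reach for a friendlier library.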
If we want to automate this process, Python has a popular library: requests, allowing us to do what the browser (HTTP client) does using simple lines of code!
However, for web scraping, we're mainly interested in the information contained in the page. This information is presented using the HyperText Markup Language (HTML). The raw HTML data sent by the server is often messy and challenging to parse, so we need to clean and structure it for easier processing (using the BeautifulSoup library for Python).
BeautifulSoup parses and organizes the HTML code, making it easier for us as developers to navigate and locate specific elements.
We still need to inspect this improved HTML "soup" to find the data we're looking for, so we first need to understand the basic HTML tags (a short code example using them follows below):
<body> – Contains all the visible content on the page.
<div> – Groups sections together, often with class or id attributes to help identify different parts.
<p> – Paragraphs of text; useful for finding blocks of text content.
<h1>, <h2>, etc. – Headings that structure content by importance, from largest to smallest.
<a> – Links. Check the href attribute to find URLs.
<img> – Images. The src attribute contains the image's URL.

Python libraries are collections of reusable code that provide functionality for a wide range of tasks, from data analysis and machine learning to web development and automation. Libraries are often hosted on the Python Package Index (PyPI) and can be easily installed using package managers like pip.
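To see how the tags listed above look from Python, here is a minimal sketch that parses a tiny hand-written HTML snippet with BeautifulSoup (the snippet and its values are made up; installing the library is covered below):

from bs4 import BeautifulSoup

# A tiny hand-written HTML document, just to illustrate the tags above
html = """
<body>
  <div class="intro">
    <h1>Hello</h1>
    <p>Some text.</p>
    <a href="https://example.com">A link</a>
    <img src="https://example.com/cat.png">
  </div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)       # Hello
print(soup.p.text)        # Some text.
print(soup.a["href"])     # https://example.com
print(soup.img["src"])    # https://example.com/cat.png
print(soup.div["class"])  # ['intro'] -- class attributes come back as a list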
As you work on different Python projects, you may need different versions of the same module, perhaps even of a module that you already have installed system-wide. For this, we use virtual environments. These environments allow you to install specific module versions in a local directory and alter your shell's environment to prioritize using them. Switching between environments can be as easy as sourcing another setup script.
The problem with virtual environments is that they don't mesh well with apt. Instead of apt, we will use a Python module manager called pip3. Our suggestion is to use pip only in virtual environments. Yes, it can also install modules system-wide, but most modules can be found as apt packages anyway. Generally, it is not a good idea to mix package managers (and modern Linux distributions outright refuse this, with an explicit error, when you try to install Python packages outside a virtualenv)!
First things first, we need python3, the venv module, and pip. These we can get with apt:
$ sudo apt install python3 python3-venv python3-pip
Assuming that you are in your project's root directory already, we can set up the virtual environment:
$ python3 -m venv .venv
The -m flag tells the Python interpreter to run a module. python3 searches its known install paths for said module (in this case, venv) and runs it as a script. .venv is the script's argument and represents the name of the storage directory. Take a look at its internal structure:
$ tree -L 3 .venv
Notice that in .venv/bin/ we have both binaries and activation scripts. These scripts, when sourced, will force the current shell to prioritize using the binaries in this directory. The modules you install will be placed in .venv/lib/python3.*/site-packages/. Try to activate your environment now. Once active, you will have access to the deactivate command that will restore your previous environment state:
$ source .venv/bin/activate
$ deactivate
When the environment is active, the activation script normally prefixes your prompt with (.venv). Depending on your zsh theme, this can result in a prompt that looks nothing like that in the GIF above. To disable the default prefix and let your theme handle the display instead, add this to your .zshrc:
VIRTUAL_ENV_DISABLE_PROMPT="yes"
The display function depends on your selected theme. For agnoster, you can fiddle with the prompt_virtualenv() function in the agnoster.zsh-theme source file.
Just like apt, pip used to have a search command for modules. Unfortunately, this feature was removed due to the high volume of queries it generated. Now, to search for modules, you will need to use the PyPI web interface.
Let us install the modules needed for this laboratory. After that, let us also check the versions of all modules (some will be added implicitly). Can you also find their installation path in the .venv/ directory?
$ pip3 install requests beautifulsoup4
$ pip3 list
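As one way to answer the installation-path question above, you can also ask the modules themselves from inside the activated environment; a minimal sketch:

import requests
import bs4

# The version each module reports about itself
print(requests.__version__)
print(bs4.__version__)

# Where the modules were loaded from -- these paths should point
# inside .venv/lib/python3.*/site-packages/
print(requests.__file__)
print(bs4.__file__)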
You might also want to read the official VSCode documentation about virtual environments.
To snapshot the modules installed in an environment and recreate it elsewhere, you can freeze the list into a requirements.txt file and install from it later:
$ pip3 freeze > requirements.txt
$ pip3 install -r requirements.txt
We can use the python3 interpreter in interactive mode to quickly test whether our modules installed correctly:
$ python3
Python 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> help(requests.get)
If you ever need to debug your Python program, one approach is to open up a shell and run your program line by line, which is a bit tedious (copy-pasting each line of the program). However, there's a better method:
import IPython
IPython.embed(colors='neutral')
$ sudo apt install ipython3
This will stop the execution of your script and open an ipython shell that has access to a copy of all your local & global data. Use this to inspect the state of your variables, their types, etc. Exit the shell (Ctrl + D) to continue the execution.
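For instance, here is a minimal sketch of dropping into such a shell in the middle of a hypothetical script (the URL is just a placeholder):

import IPython
import requests

def scrape(url):
    response = requests.get(url)
    # Execution pauses here: inspect url, response, response.status_code, ...
    IPython.embed(colors='neutral')
    return response.text

scrape("https://example.com")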
Now that we have the requests library, we can easily send HTTP requests to any URL. This prompts the server to respond with the information we need. When the request is successful, the server will reply with the standard status code 200 (OK), indicating everything went smoothly. Simply replace the URL with the desired website (try a Wikipedia page), and you're ready to go!
import requests  # Imports the library in the script

url = "https://example.com"
response = requests.get(url)  # Send request to website
html_content = response.text  # Get the HTML content of the page

# optionally, print it on console (see its ugliness)
print(html_content)
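Since we just talked about status codes, it is worth checking them in code as well. A minimal sketch of guarding against failed requests, using the same placeholder URL:

import requests

url = "https://example.com"
response = requests.get(url)

print(response.status_code)  # 200 if everything went smoothly

# Raise an exception on error codes (4xx / 5xx) instead of
# silently parsing an error page
response.raise_for_status()

html_content = response.text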
Now, let’s parse through the HTML we just received. We’ll use BeautifulSoup, a powerful library commonly used in web scraping. BeautifulSoup helps us navigate and work with the HTML content of any page, making it easy to locate specific data we want to extract.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
html_content = response.content  # HTML content of the page

soup = BeautifulSoup(html_content, "html.parser")  # this is the first step in using BeautifulSoup
The soup object in BeautifulSoup is a complex object with many built-in methods that allow us to interact with and extract data from the HTML. We use these methods with the following syntax:
soup.method_name(arguments)
(in object-oriented programming terms, this is a method call on an object)
Here are some examples of those methods (functions):
find is used to retrieve the first occurrence of a specific HTML tag (this can be any tag!).
# ...
soup = BeautifulSoup(html_content, "html.parser")

title = soup.find("h1")  # finds the first primary title (this can be any number from 1->6)
title2 = soup.title

paragraph = soup.find("p")  # the content of the first paragraph

image = soup.find("img")  # Finds the first <img>; its attributes can be accessed like a dictionary
print(image['src'])       # extract the URL of the image

link = soup.find("a")     # Finds the first <a> (anchor) tag
print(link['href'])       # Prints the URL of the link

list_item = soup.find("li")  # The first list item
title = soup.find("h1")
print(title)       # -> <h1>Example Title</h1>
print(title.text)  # -> Example Title
The find_all method searches for all occurrences of a specified tag and returns a list (array) of all matching tags. Each item in the list is a full HTML tag, allowing you to work with multiple instances of the same element on a page.
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = "https://example.com"
response = requests.get(url)     # Send a request to the website
html_content = response.content  # Retrieve the HTML content of the page

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Extract all paragraph (<p>) elements
paragraphs = soup.find_all("p")  # Finds all <p> tags on the page
# paragraphs is a list!

# Iterate over each paragraph and print its text content
for paragraph in paragraphs:
    print(paragraph.text)
Now here's the task: try to inspect Wikipedia's HTML and fetch the content of the table row (<tr>) elements containing each president's info! Note that you should filter the parent table before that.
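If you get stuck, here is a rough sketch of one possible approach. The URL and the "wikitable" class are assumptions for illustration only; inspect the actual page in your browser and adjust the selectors accordingly:

import requests
from bs4 import BeautifulSoup

# NOTE: placeholder URL -- use the page from your task
url = "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Filter the parent table first, then iterate over its rows
table = soup.find("table", class_="wikitable")  # the class name is an assumption
if table is not None:
    for row in table.find_all("tr"):
        print(row.text.strip())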
Most websites have multiple pages, so our scraper should be capable of handling this by navigating through pagination. Pagination is typically controlled through the URL, so we’ll need to make multiple requests for each page. By identifying the pattern in the website’s subpage URLs, we can scan each page separately to gather all the data we need.
import requests
from bs4 import BeautifulSoup

# Base URL for pagination, with a placeholder for the page number
base_url = 'https://example.com/page/'

# Loop through pages 1 to 5
for page_num in range(1, 6):
    # Construct the URL for the current page by appending the page number
    url = f'{base_url}{page_num}'

    # Send a request to the current page URL
    response = requests.get(url)

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Now, you can use BeautifulSoup methods to extract data from the current page
    items = soup.find_all("div", class_="item")  # Modify the selector based on the target data
    for item in items:
        print(item.text)

# This loop will go through pages 1 to 5, fetching and parsing each page's HTML.
# Adjust the page range as needed for your specific case.
Final task: fetch data from more Wikipedia pages, e.g. the birthdays of the last 5 presidents (let's keep it simple: manually explore the website, take note of the URL for each one and hardcode them as a Python list).
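A minimal sketch of the overall shape, with placeholder URLs (the real pages, and the element that actually holds each birthday, are yours to find while exploring the site):

import requests
from bs4 import BeautifulSoup

# Placeholder URLs -- replace them with the pages you noted down manually
president_urls = [
    "https://en.wikipedia.org/wiki/President_1",
    "https://en.wikipedia.org/wiki/President_2",
]

for url in president_urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # The page title is a quick sanity check that we fetched the right page
    print(soup.find("h1").text)

    # The birthday usually sits somewhere in the page's infobox;
    # inspect the HTML in your browser to find the exact element.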