===== Objectives =====

  * Install Python packages using pip + virtual environments
  * Understand the basics of web scraping using Python
  * Learn how to extract data from websites using libraries like requests and BeautifulSoup
  * Handle common web scraping challenges such as pagination and data cleaning
  * Implement a practical web scraping project to consolidate learning

===== Overview =====
  
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. In simpler terms, web scraping is a technique to automatically extract large amounts of data from websites. It involves using a script or tool to load a web page and retrieve specific pieces of data for analysis.
  
Legal and Ethical Considerations:

  * Always check the website's ''robots.txt'' file (see the sketch below).
  * Respect the website's terms of service.
  * Avoid overloading the website with too many requests.
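
For reference, the ''robots.txt'' check can itself be automated. Below is a minimal sketch using Python's standard ''urllib.robotparser'' module; the URLs are placeholders, so adapt them to the site you actually want to scrape.

<code python>
import urllib.robotparser

# Point the parser at the site's robots.txt (placeholder URL)
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Ask whether a generic client ("*") is allowed to fetch a given path
if parser.can_fetch("*", "https://example.com/some/page"):
    print("Allowed to scrape this page")
else:
    print("Disallowed by robots.txt - skip it")
</code>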
  
We will use Python and learn about the ''requests'' and ''beautifulsoup4'' libraries, installing them with virtual environments and the **pip** tool (similar to Ubuntu's ''apt'' package manager).
  
Let's get started! But first, some quick theory...
  
=== The HTTP protocol ===
  
The HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. It's a protocol, or set of rules, that allows web browsers and servers to communicate with each other, enabling the transfer of data over the internet.
  
When you visit a website, your browser (like Chrome or Firefox) acts as a client, and it sends an HTTP request to a web server where the website is hosted. The server processes this request, retrieves the requested data (like a web page, image, or video), and sends it back to the browser in the form of an HTTP response.
  
If we want to automate this process, Python has a popular library: ''requests'', allowing us to do what the browser (HTTP client) does using simple lines of code!
  
However, for web scraping, we're mainly interested in the information contained in the page. This information is presented using the HyperText Markup Language (HTML). This raw HTML data sent by the server is often messy and challenging to parse, so we need to clean and structure it for easier processing (using the BeautifulSoup library for Python).
  
=== Parsing HTML using BeautifulSoup ===
  
BeautifulSoup parses and organizes the HTML code, making it easier for us as developers to navigate and locate specific elements.
  
We still need to inspect this improved HTML "soup" to find the data we're looking for, so we need to understand the basic HTML tags:
  
    * ''<body>'' – Contains all the visible content on the page.
    * ''<div>'' – Groups sections together, often with ''class'' or ''id'' attributes to help identify different parts.
    * ''<p>'' – Paragraphs of text; useful for finding blocks of text content.
    * ''<h1>'', ''<h2>'', etc. – Headings that structure content by importance, from largest to smallest.
    * ''<a>'' – Links. Check the ''href'' attribute to find URLs.
    * ''<img>'' – Images. The ''src'' attribute contains the image's URL.
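
To get a feel for how these tags look once parsed, here is a small illustrative sketch: it feeds a hand-written HTML snippet (no website involved) to BeautifulSoup and prints it nicely indented.

<code python>
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (purely illustrative)
html = """
<body>
  <div class="article">
    <h1>Example Title</h1>
    <p>First paragraph with a <a href="https://example.com">link</a>.</p>
    <img src="https://example.com/cat.png">
  </div>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())   # the same HTML, re-indented so it is easier to read
print(soup.h1.text)      # tags can be accessed as attributes -> "Example Title"
print(soup.a["href"])    # attribute values are read like dictionary entries
</code>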
  
  
==== Setting Up the Environment ====

Let's install the required libraries using pip:

<code bash>
pip install requests beautifulsoup4
</code>

<note tip>
**WSL and VS Code**: for those using WSL, any library you install inside WSL will not be visible to VS Code running on Windows. You must install the WSL extension for VS Code and open the project folder directly in WSL:
    * install the WSL extension plugin
    * Ctrl + Shift + P -> WSL: Open Folder in WSL
</note>
  
==== Making HTTP Requests ====

We'll use Python's requests library to easily send requests to any web server. This prompts the server to respond with the information we need. When the request is successful, the server will reply with status code 200, indicating everything went smoothly. Simply replace the URL with the desired website, and you're ready to go.

<code python>
import requests   # import the library into the script

url = "https://example.com"
response = requests.get(url)   # send a GET request to the website

print(response.status_code)    # 200 means the request was successful
html_content = response.text   # the HTML content of the page, as a string
</code>

==== Parsing the HTML content ====

Now, let's parse through the HTML we just received. We'll use BeautifulSoup, a powerful library commonly used in web scraping. BeautifulSoup helps us navigate and work with the HTML content of any page, making it easy to locate the specific data we want to extract.

    * First, let's "make the soup": this is the initial step in using BeautifulSoup. It cleans up the HTML code and prepares it for easy navigation, allowing us to either inspect it manually or work with it directly using BeautifulSoup methods.

<code python>
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
html_content = response.content  # raw HTML content of the page
soup = BeautifulSoup(html_content, "html.parser")  # "making the soup": parse the HTML
</code>

==== BeautifulSoup Tools and Tips ====

The ''soup'' object in BeautifulSoup has many built-in methods that allow us to interact with and extract data from the HTML. We use these methods with the following syntax:

<code python>
soup.method_name(arguments)
</code>

Here are some examples of those methods (functions):

=== Find ===

''find'' is used to retrieve the //first occurrence// of a specific HTML tag; this can be any tag!

<code python>
...
soup = BeautifulSoup(html_content, "html.parser")

title = soup.find("h1")   # finds the first <h1> heading (works the same for h2 ... h6)
title2 = soup.title       # the <title> tag can also be accessed directly as an attribute

paragraph = soup.find("p")  # the first paragraph

image = soup.find("img")  # finds the first <img>; its attributes can be read like a dictionary
print(image['src'])       # extract the URL of the image

link = soup.find("a")     # finds the first <a> (anchor) tag
print(link['href'])       # prints the URL of the link

list_item = soup.find("li")  # the first list item
</code>

<note tip>
**Extracting Just the Text**\\
All these functions retrieve the entire HTML tag, but if we only want the text content inside, we can use ''.text'':

<code python>
title = soup.find("h1")
print(title)       # -> <h1>Example Title</h1>
print(title.text)  # -> Example Title
</code>
</note>

=== Find_all ===

The ''find_all'' method searches for all occurrences of a specified tag and returns a list (array) of all matching tags. Each item in the list is a full HTML tag, allowing you to work with multiple instances of the same element on a page.

<code python>
import requests
from bs4 import BeautifulSoup

# Define the target URL
url = "https://example.com"
response = requests.get(url)  # Send a request to the website
html_content = response.content  # Retrieve the HTML content of the page

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Extract all paragraph (<p>) elements
paragraphs = soup.find_all("p")  # Finds all <p> tags on the page
# paragraphs is a list!

# Iterate over each paragraph and print its text content
for paragraph in paragraphs:
    print(paragraph.text)
</code>
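
''find_all'' can also filter by attribute values, which helps when a page contains many tags of the same type. Here is a small sketch (the class name ''item'' is only an example), assuming ''soup'' was created as above:

<code python>
# Only <div> tags whose class attribute is "item" (example class name)
items = soup.find_all("div", class_="item")

# Only links that actually have an href attribute
links = soup.find_all("a", href=True)
for link in links:
    print(link["href"])

# You can also limit the number of results returned
first_three = soup.find_all("p", limit=3)
</code>

The ''class_'' keyword has a trailing underscore because ''class'' is a reserved word in Python.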

=== Using Regex ===

BeautifulSoup's search methods can also accept compiled regular expressions instead of plain strings, letting you match several related tag names or attribute values with a single pattern.
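
Here is a brief sketch of how this can look (the patterns are only examples), assuming ''soup'' has already been created:

<code python>
import re

# Find every heading tag, <h1> through <h6>, with a single pattern
headings = soup.find_all(re.compile(r"^h[1-6]$"))
for heading in headings:
    print(heading.text)

# Find links whose URL contains the word "product"
product_links = soup.find_all("a", href=re.compile(r"product"))
for link in product_links:
    print(link["href"])
</code>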

==== Handling multi-page websites ====

Most websites have multiple pages, so our scraper should be capable of handling this by navigating through pagination. Pagination is typically controlled through the URL, so we'll need to make a separate request for each page. By identifying the pattern in the website's subpage URLs, we can scan each page separately to gather all the data we need.
    * Identify the pagination pattern and loop through the pages
    * Request each page
    * Combine the results

<code python>
import requests
from bs4 import BeautifulSoup

# Base URL for pagination, with a placeholder for the page number
base_url = 'https://example.com/page/'

# Loop through pages 1 to 5
for page_num in range(1, 6):
    # Construct the URL for the current page by appending the page number
    url = f'{base_url}{page_num}'

    # Send a request to the current page URL
    response = requests.get(url)

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Now, you can use BeautifulSoup methods to extract data from the current page
    items = soup.find_all("div", class_="item")  # Modify the selector based on the target data
    for item in items:
        print(item.text)

# This loop will go through pages 1 to 5, fetching and parsing each page's HTML.
# Adjust the page range as needed for your specific case.
</code>

<note tip>Each website has its own page format that must be identified and handled accordingly.\\ Some sites list their pages in a file with one of these names:
    * sitemap.xml
    * sitemap_index.xml
    * sitemap
</note>
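
If such a file exists, it can be fetched and parsed like any other page. Here is a rough sketch (the URL is a placeholder, and ''html.parser'' is lenient enough to extract the ''<loc>'' entries for our purposes):

<code python>
import requests
from bs4 import BeautifulSoup

# Placeholder URL - replace with the site's real sitemap location
response = requests.get("https://example.com/sitemap.xml")
soup = BeautifulSoup(response.text, "html.parser")

# Sitemaps list every page URL inside <loc> tags
for loc in soup.find_all("loc"):
    print(loc.text)
</code>

Keep in mind that large sites may split their sitemap into several files, listed in turn inside ''sitemap_index.xml''.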

==== 01. [40p] Python environment ====

Python libraries are collections of reusable code that provide functionality for a wide range of tasks, from data analysis and machine learning to web development and automation. Libraries are often hosted on the Python Package Index (PyPI) and can be easily installed using package managers like **pip**.

As you work on different //Python// projects, you may need different versions of the same module, perhaps even of a module that you already have installed system-wide. For this, we use [[https://docs.python.org/3/library/venv.html|virtual environments]]. These environments allow you to install specific module versions in a local directory and alter your shell's environment to prioritize using them. Switching between environments can be as easy as **source**ing another setup script.

The problem with virtual environments is that they don't mesh well with **apt**. Instead of **apt**, we will use a //Python// module manager called **pip3**. Our suggestion is to use **pip** __only__ in virtual environments. Yes, it can also install modules system-wide, but most modules can be found as **apt** packages anyway. Generally, it is not a good idea to mix package managers (and modern Linux distributions outright deny this with an explicit error when trying to install Python packages outside a virtualenv)!

=== [10p] Task 1.1 - Dependency installation ===

First things first, we need **python3**, the **venv** module, and **pip**. These we can get with **apt**:

<code bash>
$ sudo apt install python3 python3-venv python3-pip
</code>

=== [10p] Task 1.2 - Creating the environment ===

Assuming that you are already in your project's root directory, we can set up the virtual environment:

<code bash>
$ python3 -m venv .venv
</code>

The ''-m'' flag tells the //Python// interpreter to run a module. **python3** searches its known install paths for said module (in this case, **venv**) and runs it as a script. ''.venv'' is the script's argument and represents the name of the storage directory. Take a look at its internal structure:

<code bash>
$ tree -L 3 .venv
</code>

Notice that in //.venv/bin// we have both binaries and activation scripts. These scripts, when sourced, will force the __current__ shell to prioritize using these binaries. The modules you install will be placed in //.venv/lib/python3.*/site-packages//. Try to activate your environment now. Once active, you will have access to the **deactivate** command that will restore your previous environment state:

<code bash>
$ source .venv/bin/activate
$ deactivate
</code>

<note tip>
If you are still using the setup from the first lab, you may get an ugly ''(.venv)'' prompt that looks nothing like the one in the GIF above. Add this to your //.zshrc//:

<code bash>
VIRTUAL_ENV_DISABLE_PROMPT="yes"
</code>

The display function depends on your selected theme. For //agnoster//, you can fiddle with the **prompt_virtualenv()** function in the //agnoster.zsh-theme// source file.
</note>

=== [10p] Task 1.3 - Fetching modules with pip ===

Just like **apt**, **pip** used to have a search function for modules. Unfortunately, this feature was removed due to a high number of queries. Now, to search for modules, you will need to use the [[https://pypi.org/project/pip/|web interface]].

Let us install the modules needed for this laboratory. After that, let us also check the versions of all modules (some will be added implicitly). Can you also find their installation path in the //.venv// directory?

<code bash>
$ pip3 install requests beautifulsoup4
$ pip3 list
</code>

<note tip>
This list of dependencies can be exported (together with the exact versions) in a way that allows another user to install the __exact__ same modules in their own virtual environment. The file holding this information is named, by convention, //requirements.txt//:

<code bash>
$ pip3 freeze > requirements.txt
$ pip3 install -r requirements.txt
</code>
</note>

=== [10p] Task 1.4 - Testing that it works with the Python CLI (REPL) ===

We can use the **python3** interpreter in interactive mode to quickly test whether our modules installed correctly:

<code python>
$ python3
Python 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> help(requests.get)
</code>

<note important>
If you can't import the **requests** module, try to source the activation script again after installing the packages with **pip**. Some versions of **venv** / **pip** might act up.
</note>

===== Tasks =====

{{namespace>:ii:labs:03:tasks&nofooter&noeditbutton}}