===== Objectives =====
  
  * Install Python packages using pip + virtual environments.
  * Learn how to extract data from websites using libraries like requests and BeautifulSoup.
  * Handle common web scraping challenges such as pagination and data cleaning.
  * Implement a practical web scraping project to consolidate learning.
  
===== Overview =====
  
Web scraping (also called web harvesting or web data extraction) is a technique for automatically extracting large amounts of data from websites. It involves using a script or tool to load a web page and retrieve specific pieces of data for analysis.
  
Legal and Ethical Considerations:
  * Always check the website's ''robots.txt'' file.
  * Respect the website's terms of service.
  * Avoid overloading the website with too many requests.
  
We will use Python and learn about the ''requests'' and ''beautifulsoup4'' libraries, installing them into virtual environments with the **pip** tool (similar to Ubuntu's ''apt'' package manager).
  
Let's get started! But first, some quick theory...
  
=== The HTTP protocol ===
  
HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. It is a protocol, or set of rules, that allows web browsers and servers to communicate with each other, enabling the transfer of data over the internet.
  
When you visit a website, your browser (like Chrome or Firefox) acts as a client: it sends an HTTP request to the web server where the website is hosted. The server processes this request, retrieves the requested data (such as a web page, image, or video), and sends it back to the browser in the form of an HTTP response.
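Under the hood, this exchange is plain text. A simplified sketch of what the client sends and what the server returns (headers trimmed down to the essentials):

<code>
GET /index.html HTTP/1.1
Host: example.com

HTTP/1.1 200 OK
Content-Type: text/html

<html>... the page's HTML ...</html>
</code>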
  
If we want to automate this process, Python has a popular library, ''requests'', that lets us do what the browser (HTTP client) does using a few simple lines of code!
  
However, for web scraping we're mainly interested in the information contained in the page. This information is presented using the HyperText Markup Language (HTML). The raw HTML data sent by the server is often messy and challenging to parse, so we need to clean and structure it for easier processing (using the BeautifulSoup library for Python).
  
=== Parsing HTML using BeautifulSoup ===
  
BeautifulSoup parses and organizes the HTML code, making it easier for us as developers to navigate and locate specific elements.
  
We still need to inspect this improved HTML "soup" to find the data we're looking for, so we first need to understand the basic HTML tags:
  
    * ''<body>'' – Contains all the visible content on the page.
    * ''<div>'' – Groups sections together, often with ''class'' or ''id'' attributes to help identify different parts.
    * ''<p>'' – Paragraphs of text; useful for finding blocks of text content.
    * ''<h1>'', ''<h2>'', etc. – Headings that structure content by importance, from largest to smallest.
    * ''<a>'' – Links. Check the ''href'' attribute to find URLs.
    * ''<img>'' – Images. The ''src'' attribute contains the image's URL.
  
  
==== Setting Up the Environment ====

Let's install the required libraries using pip:

<code bash>
pip install requests beautifulsoup4
</code>
  
<note tip>
**WSL and VS Code**: for those who use WSL, any library you install inside WSL will not be visible to VS Code running on Windows. You must install the WSL extension for VS Code and open the folder directly in WSL:
  * install the WSL extension plugin;
  * press Ctrl + Shift + P -> "WSL: Open Folder in WSL".
</note>
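Since the objectives mention virtual environments, here is a sketch of how the install above is typically done inside one (Linux/WSL commands assumed; the directory name ''venv'' is just a convention):

<code bash>
# Create an isolated virtual environment in the "venv" directory
python3 -m venv venv

# Activate it; the shell prompt gains a "(venv)" prefix
source venv/bin/activate

# Packages installed now stay inside venv/ instead of the system Python
pip install requests beautifulsoup4
</code>

Run ''deactivate'' to leave the environment.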
  
==== Making HTTP Requests ====

We'll use Python's ''requests'' library to easily send requests to any web server, prompting the server to respond with the information we need. When the request is successful, the server replies with status code 200, indicating everything went smoothly. Simply replace the URL with the desired website, and you're ready to go.

<code python>
import requests    # import the library into the script

url = "https://example.com"
response = requests.get(url)   # send a GET request to the website
html_content = response.text   # the HTML content of the page, as text
</code>

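Since the server signals success with status code 200, a defensive variant of the snippet above (a sketch; ''fetch_page'' is a made-up helper name and the URL is a placeholder) checks the status before using the response:

<code python>
import requests

def fetch_page(url):
    """Download a page's HTML, returning None on a non-200 status."""
    response = requests.get(url)
    if response.status_code == 200:   # 200 = OK, the request succeeded
        return response.text
    print("Request failed with status", response.status_code)
    return None

# html_content = fetch_page("https://example.com")
</code>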
==== Parsing the HTML content ====

Now, let's parse the HTML we just received. We'll use BeautifulSoup, a powerful library commonly used in web scraping. BeautifulSoup helps us navigate and work with the HTML content of any page, making it easy to locate the specific data we want to extract.

First, let's "make the soup": this is the initial step in using BeautifulSoup. It cleans up the HTML code and prepares it for easy navigation, allowing us to either inspect it manually or work with it directly through BeautifulSoup methods.

<code python>
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
html_content = response.content  # raw HTML content of the page
soup = BeautifulSoup(html_content, "html.parser")  # the first step in using BeautifulSoup
</code>
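To manually inspect the cleaned-up result, BeautifulSoup's ''prettify()'' method re-indents the parsed HTML. A sketch over a tiny inline snippet, so it runs without a network connection:

<code python>
from bs4 import BeautifulSoup

# A tiny HTML snippet standing in for a downloaded page
html_content = "<html><body><h1>Example Title</h1><p>Hello!</p></body></html>"

soup = BeautifulSoup(html_content, "html.parser")
print(soup.prettify())   # indented, human-readable HTML
</code>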

=== BeautifulSoup Tools and Tips ===

The ''soup'' object in BeautifulSoup is a complex object with many built-in methods that allow us to interact with the HTML and extract data from it. We use these methods with the following syntax:

<code python>
soup.method_name(arguments)
</code>

Here are some examples of those methods (functions):
==Find==
''find'' retrieves the //first occurrence// of a specific HTML tag; this can be any tag!

-<code python> 
-... 
-soup = BeautifulSoup(html_content,​ "​html.parser"​) 
- 
-title = soup.find("​h1"​) ​  # finds all primary titles (this can be any number from 1->6) 
- 
-paragraph = soup.find("​p"​) # the content of the first paragraph 
- 
-image = soup.find("​img"​) ​ # Finds the first <img> and extracts the information about it as a dictionary 
-print(image['​src'​]) #extract the URL of the image 
- 
- 
-link = soup.find("​a"​) ​ # Finds the first <a> (anchor) tag 
-print(link['​href'​]) ​ # Prints the URL of the link 
- 
-list_item = soup.find("​li"​) #the first item in a list 
- 
-</​code>​ 

<note tip>
**Extracting Just the Text**\\
All these functions retrieve the entire HTML tag; if we only want the text content inside, we can use ''.text'':

<code python>
title = soup.find("h1")
print(title)       # -> <h1>Example Title</h1>
print(title.text)  # -> Example Title
</code>
</note>
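Since ''class'' and ''id'' attributes help identify page sections, ''find'' can also filter on them. A sketch over a made-up snippet (note the trailing underscore in ''class_'', because ''class'' is a reserved word in Python):

<code python>
from bs4 import BeautifulSoup

html = """
<body>
  <div id="header">Site header</div>
  <div class="article">First article</div>
  <div class="article">Second article</div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

header = soup.find("div", id="header")        # match by id attribute
print(header.text)                            # -> Site header

article = soup.find("div", class_="article")  # match by class attribute
print(article.text)                           # -> First article
</code>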
- 
==Find_all==

While ''find'' returns only the first matching tag, ''find_all'' returns a list of //all// matching tags, which we can loop over.
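A sketch of ''find_all'' over a made-up snippet:

<code python>
from bs4 import BeautifulSoup

html = """
<body>
  <a href="https://example.com/1">First link</a>
  <a href="https://example.com/2">Second link</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")   # a list with every <a> tag on the page
for link in links:
    print(link['href'], "->", link.text)
</code>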

===== Tasks =====

{{namespace>:ii:labs:03:tasks&nofooter&noeditbutton}}
ii/labs/03.1730920037.txt.gz · Last modified: 2024/11/06 21:07 by mircea.braguta
CC Attribution-Share Alike 3.0 Unported