This shows you the differences between two versions of the page.
ii:labs:03 [2021/11/21 20:23] radu.mantu [Proof of Work] |
ii:labs:03 [2024/11/06 23:16] (current) florin.stancu |
||
---|---|---|---|
Line 5: | Line 5: | ||
===== Objectives ===== | ===== Objectives ===== | ||
- | * using virtual environments and **pip** | + | * Install Python packages using pip + virtual environments. |
- | * debugging scripts | + | * Learn how to extract data from websites using libraries like requests and BeautifulSoup. |
- | * understanding public APIs | + | * Handle common web scraping challenges such as pagination and data cleaning. |
+ | * Implement a practical web scraping project to consolidate learning. | ||
- | ===== Contents ===== | + | ===== Overview ===== |
- | {{page>:ii:labs:03:meta:nav&nofooter&noeditbutton}} | + | Web scraping (also called web harvesting or web data extraction) is a technique for automatically extracting large amounts of data from websites. It involves using a script or tool to load a web page and retrieve specific pieces of data for analysis. |
- | ===== Proof of Work ===== | + | Legal and Ethical Considerations: |
+ | |||
+ | * Always check the website's ''robots.txt'' file. | |
+ | * Respect the website's terms of service. | ||
+ | * Avoid overloading the website with too many requests. | ||
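
The first and third points can be checked programmatically. Below is a minimal sketch using the standard library's ''urllib.robotparser''; the rules in ''ROBOTS_TXT'', the ''USER_AGENT'' name and the ''example.com'' URLs are made up for illustration. A real script would more likely point the parser at the site's actual file with ''set_url()'' and ''read()''.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "ii-lab-scraper"  # assumed name for our scraper

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether our scraper may fetch a given URL.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Seconds the site asks us to wait between requests (None if not specified).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

If a crawl delay is present, calling ''time.sleep(delay)'' between requests is one simple way to avoid overloading the server.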
+ | |||
+ | We will use Python and learn about the ''requests'' and ''BeautifulSoup'' libraries, installing them inside a virtual environment with the **pip** tool (a package manager similar to Ubuntu's ''apt''). | |
+ | |||
+ | Let's get started! But first, some quick theory... | ||
+ | |||
+ | === The HTTP protocol === | ||
+ | |||
+ | HTTP (Hypertext Transfer Protocol) is the foundation of data communication on the web. It’s a protocol, or set of rules, that allows web browsers and servers to communicate with each other, enabling the transfer of data over the internet. | |
+ | |||
+ | When you visit a website, your browser (like Chrome or Firefox) acts as a client, and it sends an HTTP request to a web server where the website is hosted. The server processes this request, retrieves the requested data (like a web page, image, or video), and sends it back to the browser in the form of an HTTP response. | ||
+ | |||
+ | If we want to automate this process, Python has a popular library: ''requests'', allowing us to do what the browser (HTTP client) does using simple lines of code! | ||
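
As a rough illustration of the request/response cycle described above, the sketch below first builds a GET request without sending it, so its parts can be inspected, and then defines a helper that sends it. The URL, the query parameter and the ''User-Agent'' value are placeholders rather than anything used in this lab, and ''build_request'' and ''fetch'' are hypothetical helper names.

```python
import requests


def build_request(url, params=None):
    """Prepare (but do not send) a GET request so it can be inspected."""
    request = requests.Request(
        "GET",
        url,
        params=params,
        headers={"User-Agent": "ii-lab-scraper/0.1"},  # assumed client name
    )
    return request.prepare()


def fetch(prepared, timeout=10):
    """Send a prepared request and return the server's response."""
    with requests.Session() as session:
        response = session.send(prepared, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    return response


# Placeholder URL and query string, for illustration only.
prepared = build_request("https://example.com/search", params={"q": "python"})
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling ''fetch(prepared)'' performs the actual network round trip; the returned object exposes ''status_code'', ''headers'' and ''text'' (the raw HTML). For one-off requests, ''requests.get(url, params=..., timeout=10)'' does the same thing in a single call.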
+ | |||
+ | However, for web scraping, we're mainly interested in the information contained by the page. This information is presented using the HyperText Markup Language (HTML). | ||
+ | This raw HTML data sent by the server is often messy and challenging to parse, so we need to clean and structure it for easier processing (using the BeautifulSoup library for Python). | |
+ | |||
+ | === Parsing HTML using BeautifulSoup === | ||
+ | |||
+ | BeautifulSoup parses and organizes the HTML code, making it easier for us as developers to navigate and locate specific elements. | ||
+ | |||
+ | We still need to inspect this improved HTML "soup" to find the data we’re looking for, so we need to understand the basic HTML tags: | |
+ | |||
+ | * ''<body>'' – Contains all the visible content on the page. | ||
+ | * ''<div>'' – Groups sections together, often with ''class'' or ''id'' attributes to help identify different parts. | |
+ | * ''<p>'' – Paragraphs of text; useful for finding blocks of text content. | ||
+ | * ''<h1>'', ''<h2>'', etc. – Headings that structure content by importance, from largest to smallest. | ||
+ | * ''<a>'' – Links. Check the ''href'' attribute to find URLs. | |
+ | * ''<img>'' – Images. The ''src'' attribute contains the image’s URL. | ||
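
The tags listed above can be located with BeautifulSoup roughly as follows. This is a minimal sketch on a small, made-up HTML snippet (the ''article'' class, the text and the URLs are invented for the example), parsed with Python's built-in ''html.parser'' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML, inlined so the example runs without a network connection.
HTML = """
<html><body>
  <div class="article" id="main">
    <h1>Lab 03</h1>
    <p>Web scraping with Python.</p>
    <a href="https://example.com/next">Next page</a>
    <img src="/img/logo.png" alt="logo">
  </div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Find the <div> by its class attribute (class_ avoids Python's reserved word).
article = soup.find("div", class_="article")

# Text content of the heading and of every paragraph inside the div.
title = article.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in article.find_all("p")]

# Attribute values: link targets and image sources.
links = [a["href"] for a in article.find_all("a", href=True)]
images = [img["src"] for img in article.find_all("img", src=True)]

print(title, paragraphs, links, images)
```

Passing ''href=True'' (or ''src=True'') keeps only the tags that actually have that attribute, which avoids a ''KeyError'' on malformed pages.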
- | Today we're picking up where we left off last time. By now you should already know the basics of working with //Python//. Developing a project in //Python// however, requires more than interacting with a shell or editing some scripts. In this lab, you will (hopefully) learn to manage isolated virtual environments and debug errors in your scripts. But probably most important, you will learn to consult API documentations. | ||
- | As a more tangible goal, you will have to write your very own [[https://discord.com/|discord]] music bot! Exciting stuff, right? As always, in addition to the script itself, remember to put together a //.pdf// explaining your approach to solving the problem. Once finished, upload both to the appropriate [[https://curs.upb.ro/2021/course/view.php?id=5793|moodle]] assignment. | ||
===== Tasks ===== | ===== Tasks ===== | ||
{{namespace>:ii:labs:03:tasks&nofooter&noeditbutton}} | {{namespace>:ii:labs:03:tasks&nofooter&noeditbutton}} | ||
+ | |||
+ |