This shows you the differences between two versions of the page.
ewis:laboratoare:04 [2022/03/29 16:52] alexandru.predescu |
ewis:laboratoare:04 [2023/03/29 15:23] (current) alexandru.predescu [Web Scraping in Python] |
||
---|---|---|---|
Line 130: | Line 130: | ||
**T2 (1p)** Override the method //say_hi// to show the grade as well. | **T2 (1p)** Override the method //say_hi// to show the grade as well. | ||
*Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class | *Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class | ||
- | **T3 (2p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program? | + | **T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program? |
</note> | </note> | ||
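As a minimal sketch of the idea behind T3, the snippet below puts //Person// and //Student// objects in the same list and calls //say_hi// on each one without checking its type. The constructor parameters (''name'', ''grade''), the sample values, and the choice to return the greeting as a string rather than print it inside the method are assumptions made for illustration; the classes defined earlier in the lab may differ.

```python
class Person:
    def __init__(self, name):
        self.name = name

    def say_hi(self):
        # Return the greeting so the caller decides how to display it (assumption).
        return f"Hi, I am {self.name}"


class Student(Person):
    def __init__(self, name, grade):
        super().__init__(name)
        self.grade = grade

    def say_hi(self):
        # Override: reuse the parent's greeting and append the grade.
        return f"{super().say_hi()} and my grade is {self.grade}"


# One list holding both types; the calling code treats them identically.
people = [Person("Ana"), Student("Dan", 9)]
greetings = [p.say_hi() for p in people]

for greeting in greetings:
    print(greeting)
```

The loop never asks which class an element belongs to: each object answers ''say_hi'' with its own implementation.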
Line 227: | Line 227: | ||
<code python> | <code python> | ||
- | from subprocess import Popen, PIPE | + | import requests |
- | from lxml import etree | + | |
- | from io import StringIO | + | # url to scrape data from |
- | user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36' | + | |
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops' | url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops' | ||
- | print("fetching: " + url) | + | |
- | get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE) | + | print("fetching page") |
- | result = get.stdout.read().decode('utf8') | + | |
- | tree = etree.parse(StringIO(result), etree.HTMLParser()) | + | # get response object |
- | str_tree = etree.tostring(tree, encoding='utf8', method='xml') | + | response = requests.get(url) |
- | str_data = str_tree.decode() | + | |
+ | # get byte string | ||
+ | byte_data = response.content | ||
+ | |||
+ | # get html source code | ||
+ | html_data = byte_data.decode("utf-8") | ||
print("writing file") | print("writing file") | ||
with open("index.html", "w", encoding="utf-8") as f: | with open("index.html", "w", encoding="utf-8") as f: | ||
- | f.write(str_data) | + | f.write(html_data) |
</code> | </code> | ||
- | <note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note> | + | <note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note> |
To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml// | To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml// | ||
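The parsing step can be sketched on a small in-memory document, which avoids any dependency on the downloaded file. The HTML fragment and the ''item'' class name below are made up for illustration and do not reflect the structure of the real page. Note that the parser is passed explicitly as the second argument to ''etree.parse'', so that lenient HTML parsing is used instead of the default strict XML parsing.

```python
from io import StringIO

from lxml import etree

# Hypothetical stand-in for the contents of a downloaded HTML file.
html = (
    "<html><body><ul>"
    '<li class="item">First</li>'
    '<li class="other">Skipped</li>'
    '<li class="item">Second</li>'
    "</ul></body></html>"
)

# Hand the HTML parser to etree.parse; without it, lxml parses the input as XML.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)

# Collect tag name, attributes and text for every element in the tree.
tags = [[elem.tag, dict(elem.attrib), elem.text] for elem in tree.iter()]

# Keep only the elements with a given tag and attribute value.
items = [
    text
    for tag, attrib, text in tags
    if tag == "li" and attrib.get("class") == "item"
]
print(items)
```

The same call works with a file name in place of the ''StringIO'' object.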
Line 252: | Line 257: | ||
filename = "index.html" | filename = "index.html" | ||
+ | parser = etree.HTMLParser() | ||
tree = etree.parse(filename) | tree = etree.parse(filename) | ||
tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()] | tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()] | ||
Line 261: | Line 267: | ||
<note> | <note> | ||
- | **T5 (1p)** Examine the downloaded HTML file. Extract the laptop names into a text file. | + | **T5 (2p)** Examine the downloaded HTML file. Extract the laptop names into a text file. |
*Hint: filter the extracted tags by tag name and attribute | *Hint: filter the extracted tags by tag name and attribute | ||
*Which combination of tag and attribute leads to the data that we want to extract (the laptop names)? | *Which combination of tag and attribute leads to the data that we want to extract (the laptop names)?