Differences

This shows you the differences between two versions of the page.

--- ewis:laboratoare:04 [2022/03/29 16:52]
alexandru.predescu
+++ ewis:laboratoare:04 [2023/03/29 15:23] (current)
alexandru.predescu [Web Scraping in Python]
@@ Line 130: / Line 130: @@
 **T2 (1p)** Override the method //say_hi// to show the grade as well.
   *Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class
-**T3 (2p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program?
+**T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program?
 </note>
@@ Line 227: / Line 227: @@
 <code python>
-from subprocess import Popen, PIPE
+import requests
-from lxml import etree
-from io import StringIO
+# url to scrape data from
-user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
 url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
-print("fetching: " + url)
-get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
+print("fetching page")
-result = get.stdout.read().decode('utf8')
-tree = etree.parse(StringIO(result), etree.HTMLParser())
+# get response object
-str_tree = etree.tostring(tree, encoding='utf8', method='xml')
+response = requests.get(url)
-str_data = str_tree.decode()
+# get byte string
+byte_data = response.content
+# get html source code
+html_data = byte_data.decode("utf-8")
 print("writing file")
 with open("index.html", "w", encoding="utf-8") as f:
-    f.write(str_data)
+    f.write(html_data)
 </code>
-<note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note>
+<note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note>
 To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml//
@@ Line 252: / Line 257: @@
 filename = "index.html"
+parser = etree.HTMLParser()
 tree = etree.parse(filename)
 tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
@@ Line 261: / Line 267: @@
 <note>
-**T5 (1p)** Examine the downloaded HTML file. Extract the laptop names into a text file.
+**T5 (2p)** Examine the downloaded HTML file. Extract the laptop names into a text file.
   *Hint: filter the extracted tags by tag and attribute
   *which combination of tag and attribute brings us to the data that we want to extract (laptop names)
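
Note on the parsing hunk above: creating `etree.HTMLParser()` has an effect only if the parser object is passed as the second argument to `etree.parse`. Without it, lxml falls back to its XML parser, which may reject real-world HTML. The following is a minimal sketch of parsing with the HTML parser and filtering by tag and attribute. The inline markup, the `span` tag and the `item-name` class are hypothetical stand-ins chosen for illustration; they are not the structure of the downloaded page.

```python
from io import StringIO

from lxml import etree

# Hypothetical stand-in for the contents of index.html (not the real page markup)
html_data = """<html><body>
<span class="item-name">Laptop A</span>
<span class="item-name">Laptop B</span>
<p class="description">14 inch display</p>
</body></html>"""

# Pass the HTML parser explicitly; otherwise the default XML parser is used
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html_data), parser)

# Same extraction as in the lab: one [tag, attributes, text] entry per element
tags = [[elem.tag, dict(elem.attrib), elem.text] for elem in tree.iter()]

# Keep only the elements whose tag and class attribute match the (assumed) pattern
names = [text for tag, attrib, text in tags
         if tag == "span" and attrib.get("class") == "item-name"]

print(names)
```

With a file on disk, the same call would presumably take the filename in place of the `StringIO` object, for example `etree.parse("index.html", parser)`.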

ewis/laboratoare/04.1648561941.txt.gz · Last modified: 2022/03/29 16:52 by alexandru.predescu