Differences

This shows you the differences between two versions of the page.

ewis:laboratoare:04 [2022/03/29 16:52]
alexandru.predescu
ewis:laboratoare:04 [2023/03/29 15:23] (current)
alexandru.predescu [Web Scraping in Python]
Line 130: Line 130:
 **T2 (1p)** Override the method //say_hi// to show the grade as well.
   *Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class
-**T3 (2p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program?
+**T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program?
 </note>
  
Line 227: Line 227:
  
 <code python>
-from subprocess import Popen, PIPE
-from lxml import etree
-from io import StringIO
-user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
+import requests
+
+# url to scrape data from
 url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
-print("fetching" + url)
-get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
-result = get.stdout.read().decode('utf8')
-tree = etree.parse(StringIO(result), etree.HTMLParser())
-str_tree = etree.tostring(tree, encoding='utf8', method='xml')
-str_data = str_tree.decode()
+
+print("fetching page")
+
+# get response object
+response = requests.get(url)
+
+# get byte string
+byte_data = response.content
+
+# get html source code
+html_data = byte_data.decode("utf-8")
+
 print("writing file")
 with open("index.html", "w", encoding="utf-8") as f:
-    f.write(str_data)
+    f.write(html_data)
 </code>
  
-<note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note>
+<note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note>
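
As a hedged illustration of the HTTP request described in the note above, the following sketch retrieves a page and decodes it to text. It swaps ''requests.get'' for the standard library's ''urllib.request.urlopen'' so that it runs without third-party packages. The helper name ''fetch_html'', the ''User-Agent'' string, and the timeout value are assumptions for illustration and are not part of the lab text.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and return its body decoded to text.

    Hypothetical helper: the lab script itself uses requests.get.
    """
    # The User-Agent value is an arbitrary placeholder (assumption).
    request = Request(url, headers={"User-Agent": "ewis-lab-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Prefer the charset declared in the response headers,
        # and fall back to UTF-8 when none is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling ''fetch_html(url)'' with the lab's ''url'' and writing the returned string to ''index.html'' would mirror the revised script. Reading the charset from the response headers is one possible design choice; the lab script decodes as UTF-8 directly.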
  
 To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml//
Line 252: Line 257:
  
 filename = "index.html"
+parser = etree.HTMLParser()
 tree = etree.parse(filename)
 tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
Line 261: Line 267:
  
 <note>
-**T5 (1p)** Examine the downloaded HTML file. Extract the laptop names into a text file.
+**T5 (2p)** Examine the downloaded HTML file. Extract the laptop names into a text file.
   *Hint: filter the extracted tags by tag and attribute
   *which combination of tag and attribute brings us to the data that we want to extract (laptop names)
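
The hint above, filtering the extracted tags by tag and attribute, can be sketched roughly as follows. This is a minimal illustration rather than the task's solution: the markup, the ''item-title'' class, and the helper ''filter_tags'' are made up. The snippet uses the standard library's ''xml.etree.ElementTree'' on a small well-formed fragment so that it runs without lxml; lxml elements expose the same ''tag'', ''attrib'', and ''text'' attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; the tag and class names are invented
# and do not describe the real page.
SNIPPET = """
<div>
  <a class="item-title" href="/p/1">Laptop A</a>
  <a class="nav-link" href="/home">Home</a>
  <a class="item-title" href="/p/2">Laptop B</a>
</div>
"""


def filter_tags(tags, tag_name, attr_name, attr_value):
    """Return the text of entries whose tag and attribute value both match."""
    return [
        text
        for tag, attrib, text in tags
        if tag == tag_name and attrib.get(attr_name) == attr_value
    ]


root = ET.fromstring(SNIPPET)

# Same [tag, attrib, text] layout as the list built in the lab script.
tags = [[elem.tag, elem.attrib, elem.text] for elem in root.iter()]

names = filter_tags(tags, "a", "class", "item-title")
print(names)
```

For the real page, the same kind of filter would be applied to the ''tags'' list built from ''index.html'', using whichever tag and attribute you identify by inspecting the downloaded file.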
ewis/laboratoare/04.1648561941.txt.gz · Last modified: 2022/03/29 16:52 by alexandru.predescu
CC Attribution-Share Alike 3.0 Unported