Differences

ewis:laboratoare:04 [2022/03/30 17:53]
alexandru.predescu
ewis:laboratoare:04 [2023/03/29 15:23] (current)
alexandru.predescu [Web Scraping in Python]
Line 227: Line 227:
  
 <code python>
-from subprocess import Popen, PIPE
-from lxml import etree
-from io import StringIO
-user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
+import requests
+
+# url to scrape data from
+
 url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
-print("fetching" + url)
-get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
-result = get.stdout.read().decode('utf8')
-tree = etree.parse(StringIO(result), etree.HTMLParser())
-str_tree = etree.tostring(tree, encoding='utf8', method='xml')
-str_data = str_tree.decode()
+
+print("fetching page")
+
+# get response object
+response = requests.get(url)
+
+# get byte string
+byte_data = response.content
+
+# get html source code
+html_data = byte_data.decode("utf-8")
 print("writing file")
 with open("index.html", "w", encoding="utf-8") as f:
-    f.write(str_data)
+    f.write(html_data)
 </code>
  
-<note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note>
+<note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note>
  
 To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml//
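
The revised script in this hunk fetches the page with ''requests'' and decodes the response bytes before writing them to a file. A minimal sketch of the same fetch-and-decode step wrapped in a function is shown here. The function name ''fetch_html'', the ''get'' parameter, the 10-second timeout, and the status check are assumptions added for illustration and are not part of the lab script.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML source as text.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without network access (an assumption of
    this sketch, not something the lab script does).
    """
    # get response object; the timeout avoids waiting forever on a dead server
    response = get(url, timeout=timeout)

    # raise an exception for 4xx/5xx status codes instead of saving an error page
    response.raise_for_status()

    # get byte string, then decode it to the html source code
    byte_data = response.content
    return byte_data.decode("utf-8")
```

Calling ''fetch_html(url)'' with the ''url'' from the script should return the same text that the script stores in ''html_data'', provided the server answers with a success status.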
Line 252: Line 257:
  
 filename = "index.html"
+parser = etree.HTMLParser()
 tree = etree.parse(filename)
 tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
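
In the lines visible in this hunk, ''parser'' is created but the unchanged ''etree.parse(filename)'' call does not receive it. The sketch below assumes the intent is to parse with the HTML parser and passes it as the second argument, which ''etree.parse'' accepts. The helper name ''parse_html'' and the inline sample markup are hypothetical; an in-memory ''StringIO'' stands in for ''index.html'' so the example runs without a downloaded file.

```python
from io import StringIO

from lxml import etree


def parse_html(source):
    """Parse HTML from a filename or file-like object and list its tags."""
    # the HTML parser tolerates markup that is not well-formed XML
    parser = etree.HTMLParser()
    tree = etree.parse(source, parser)
    # one [tag, attributes, text] entry per element, in document order
    return [[elem.tag, dict(elem.attrib), elem.text] for elem in tree.iter()]


# hypothetical sample markup used in place of the saved index.html
sample = StringIO("<html><body><p class='price'>$295.99</p></body></html>")
tags = parse_html(sample)
for tag in tags:
    print(tag)
```

Passing ''"index.html"'' instead of ''sample'' would parse the file written by the download script.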
ewis/laboratoare/04.1648651981.txt.gz · Last modified: 2022/03/30 17:53 by alexandru.predescu
CC Attribution-Share Alike 3.0 Unported