
Differences

This shows you the differences between two versions of the page.

--- ewis:laboratoare:04 [2022/03/30 17:53]
alexandru.predescu
+++ ewis:laboratoare:04 [2023/03/29 15:23] (current)
alexandru.predescu [Web Scraping in Python]
@@ Line 227: / Line 227: @@
 <code python>
-from subprocess import Popen, PIPE
+import requests
-from lxml import etree
-from io import StringIO
+# url to scrape data from
-user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
 url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
-print("fetching: " + url)
-get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
+print("fetching page")
-result = get.stdout.read().decode('utf8')
-tree = etree.parse(StringIO(result), etree.HTMLParser())
+# get response object
-str_tree = etree.tostring(tree, encoding='utf8', method='xml')
+response = requests.get(url)
-str_data = str_tree.decode()
+# get byte string
+byte_data = response.content
+# get html source code
+html_data = byte_data.decode("utf-8")
 print("writing file")
 with open("index.html", "w", encoding="utf-8") as f:
-    f.write(str_data)
+    f.write(html_data)
 </code>
-<note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note>
+<note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note>
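The new script decodes the response body as UTF-8 unconditionally, which raises ''UnicodeDecodeError'' if a server sends a different encoding. A minimal sketch of a more tolerant decoding step is shown below. The helper name ''decode_html'' and its ''declared_encoding'' parameter are hypothetical and not part of the lab code; with //requests//, the declared encoding could be taken from ''response.encoding''.

```python
def decode_html(byte_data, declared_encoding=None):
    """Decode raw response bytes into text.

    Hypothetical helper: tries the encoding declared by the server first
    (if any), then UTF-8, and finally falls back to UTF-8 with
    replacement characters so that decoding never raises.
    """
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return byte_data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate
            continue
    return byte_data.decode("utf-8", errors="replace")


# Example with in-memory bytes instead of a real HTTP response
utf8_bytes = "<p>café</p>".encode("utf-8")
latin1_bytes = "<p>café</p>".encode("latin-1")

print(decode_html(utf8_bytes))
print(decode_html(latin1_bytes, declared_encoding="latin-1"))
```

In the script above, this would replace the line ''html_data = byte_data.decode("utf-8")'', assuming the helper is defined earlier in the same file.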
 To parse the HTML file (that is, to split it into its individual tags), we use the //etree// module from //lxml//.
@@ Line 252: / Line 257: @@
 filename = "index.html"
+parser = etree.HTMLParser()
 tree = etree.parse(filename)
 tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
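
In the excerpt above, an ''HTMLParser'' is created, but the visible ''etree.parse(filename)'' call does not receive it. When no parser is given, ''etree.parse'' uses the default XML parser, which is strict and may reject real-world HTML. A minimal sketch that passes the parser explicitly is shown below; it assumes //lxml// is installed and uses a small inline string (a made-up sample) in place of ''index.html'', so that it runs without the downloaded file.

```python
from io import StringIO

from lxml import etree

# Hypothetical sample standing in for the contents of index.html
sample_html = (
    "<html><body>"
    "<h1>Laptops</h1>"
    "<a href='/item/1'>Item 1</a>"
    "</body></html>"
)

parser = etree.HTMLParser()
# Pass the HTML parser explicitly; otherwise the XML parser is used
tree = etree.parse(StringIO(sample_html), parser)

# One [tag, attributes, text] entry per element, in document order
tags = [[elem.tag, dict(elem.attrib), elem.text] for elem in tree.iter()]

for tag in tags:
    print(tag)
```

For the downloaded page, the same call would take the file name instead of the ''StringIO'' object: ''etree.parse(filename, parser)''.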

ewis/laboratoare/04.1648651981.txt.gz · Last modified: 2022/03/30 17:53 by alexandru.predescu
