Differences

This shows you the differences between two versions of the page.

--- ewis:laboratoare:04 [2022/03/29 16:38]
alexandru.predescu
+++ ewis:laboratoare:04 [2023/03/29 15:23] (current)
alexandru.predescu [Web Scraping in Python]
@@ Line 19: / Line 19: @@
 **Encapsulation, polymorphism, abstraction, inheritance** are fundamental concepts of object-oriented programming languages (in Python they are a bit more loosely defined)
-  ***Encapsulation**: data (attributes) and functionality (methods) are contained and accessible via a single unit
+  ***Encapsulation**: data and functionality are contained and accessible via a single unit
   ***Abstraction**: abstract units expose only a high-level interface and hide the implementation details
   ***Inheritance**: the procedure in which one class inherits the attributes and methods of another class
@@ Line 30: / Line 30: @@
 A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. **Encapsulation** is the most important principle of OOP where data (attributes) and functionality (methods) are contained and accessible via a single unit. **Abstraction** is another core principle, which is similar to encapsulation but exposes only a high-level interface and hides the implementation details.
-For example in a banking application different objects may be bank account, customer type, branch.
+<note tip>For example, in a banking application, the objects may be **bank account**, **customer**, **customer type**, **branch**. Each can have its own methods and attributes, and they can be related (e.g. a bank account belongs to a customer of some type, individual or business, and was created at a branch). They should be easy to use, maintain and extend as the application grows.</note>
 In Python, a class is defined using ''class''. Its methods are functions defined inside the class body using ''def'', and instance methods **always** take the instance as their first parameter, named ''self'' by convention (''self'' is not a reserved keyword). Through ''self'', a method can access the attributes and other methods of the object it was called on.
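As a rough illustration of the paragraph above, the sketch below defines a small class and calls its methods. The ''BankAccount'' class, its attributes and the sample values are hypothetical, chosen only to echo the banking example; they are not part of the lab code.

```python
class BankAccount:
    """Hypothetical class used only for illustration."""

    def __init__(self, owner, balance=0.0):
        # attributes are stored on the instance and reached through self
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # reject invalid input before changing the state of the object
        if amount <= 0:
            raise ValueError("deposit amount must be positive")
        self.balance += amount
        return self.balance

    def describe(self):
        # a method can read the attributes of its own instance via self
        return f"{self.owner}: {self.balance:.2f}"


account = BankAccount("Ana", 100.0)
account.deposit(50.0)
print(account.describe())  # prints "Ana: 150.00"

# calling the method through the class makes the role of self explicit
print(BankAccount.describe(account) == account.describe())  # prints "True"
```

The last line shows that ''account.describe()'' is shorthand for passing the object as the first argument of the function defined in the class.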
@@ Line 130: / Line 130: @@
 **T2 (1p)** Override the method //say_hi// to show the grade as well.
   *Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class
-**T3 (2p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//.
+**T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program?
 </note>
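The idea behind polymorphism can be sketched with a different pair of classes, so the task above stays open. ''Customer'', ''BusinessCustomer'' and ''summary'' are hypothetical names picked for this illustration. The loop calls the same method on every element without checking its type.

```python
class Customer:
    """Hypothetical base class used only for illustration."""

    def __init__(self, name):
        self.name = name

    def summary(self):
        return f"Customer {self.name}"


class BusinessCustomer(Customer):
    """Hypothetical subclass that overrides summary()."""

    def __init__(self, name, company):
        super().__init__(name)
        self.company = company

    def summary(self):
        # re-use the parent implementation and extend its result
        return f"{super().summary()} ({self.company})"


customers = [Customer("Ana"), BusinessCustomer("Dan", "Acme")]

# the caller uses one interface; each object supplies its own behaviour
summaries = [customer.summary() for customer in customers]
for line in summaries:
    print(line)
```

The calling code is identical for both objects; only the result of ''summary()'' differs, depending on the class of each element.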
@@ Line 227: / Line 227: @@
 <code python>
-from subprocess import Popen, PIPE
+import requests
-from lxml import etree
-from io import StringIO
+# url to scrape data from
-user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
 url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
-print("fetching: " + url)
-get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
+print("fetching page")
-result = get.stdout.read().decode('utf8')
-tree = etree.parse(StringIO(result), etree.HTMLParser())
+# get response object
-str_tree = etree.tostring(tree, encoding='utf8', method='xml')
+response = requests.get(url)
-str_data = str_tree.decode()
+# get byte string
+byte_data = response.content
+# get html source code
+html_data = byte_data.decode("utf-8")
 print("writing file")
 with open("index.html", "w", encoding="utf-8") as f:
-    f.write(str_data)
+    f.write(html_data)
 </code>
-<note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note>
+<note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note>
 To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml//
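To make "separating the different tags" concrete, here is a minimal sketch that uses the standard library's ''html.parser'' module instead of //lxml//, so that it runs without extra packages. It is a stand-in for the same idea, not the lab's method. The ''TagCollector'' class, the HTML snippet and its ''class'' values are assumptions made up for the example and do not reflect the markup of the real page.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects [tag, attributes, text] triples from an HTML string."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append([tag, dict(attrs), ""])

    def handle_data(self, data):
        # simplification: attach text to the most recently opened tag
        text = data.strip()
        if text and self.tags:
            self.tags[-1][2] += text


# hypothetical snippet, not the markup of the real page
html_data = (
    '<div class="card">'
    '<a class="name" href="/p/1">Laptop A</a>'
    '<p class="price">$10</p>'
    '</div>'
)

collector = TagCollector()
collector.feed(html_data)
collector.close()

# keep only the elements with a given tag and attribute value
names = [text for tag, attrib, text in collector.tags
         if tag == "a" and attrib.get("class") == "name"]
print(names)  # prints "['Laptop A']"
```

With //lxml//, the HTML parser plays the same role when it is passed to the parse call, e.g. ''etree.parse(filename, etree.HTMLParser())''.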
@@ Line 252: / Line 257: @@
 filename = "index.html"
+parser = etree.HTMLParser()
-tree = etree.parse(filename)
+tree = etree.parse(filename, parser)
 tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
@@ Line 261: / Line 267: @@
 <note>
-**T5 (1p)** Examine the downloaded HTML file. Extract the laptop names into a text file.
+**T5 (2p)** Examine the downloaded HTML file. Extract the laptop names into a text file.
   *Hint: filter the extracted tags by tag and attribute
   *Which combination of tag and attribute leads to the data that we want to extract (laptop names)?
@@ Line 274: / Line 280: @@
   * [[https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html|Object-Oriented Programming in Python]]
   * [[https://docs.python.org/3/library/datetime.html|datetime — Basic date and time types]]
+  * [[https://en.wikipedia.org/wiki/Composition_over_inheritance|Composition over inheritance]]
+  * [[https://www.w3schools.com/html/|HTML Tutorial]]
+  * [[https://lxml.de/api/lxml.etree._Element-class.html|lxml API]]

ewis/laboratoare/04.1648561128.txt.gz · Last modified: 2022/03/29 16:38 by alexandru.predescu
