This shows you the differences between two versions of the page.
ewis:laboratoare:04 [2022/03/29 16:38] alexandru.predescu |
ewis:laboratoare:04 [2023/03/29 15:23] (current) alexandru.predescu [Web Scraping in Python] |
||
---|---|---|---|
Line 19: | Line 19: | ||
**Encapsulation, polymorphism, abstraction, inheritance** are the fundamentals of object-oriented programming (in Python they are a bit more loosely defined) | **Encapsulation, polymorphism, abstraction, inheritance** are the fundamentals of object-oriented programming (in Python they are a bit more loosely defined) | ||
- | ***Encapsulation**: data (attributes) and functionality (methods) are contained and accessible via a single unit | + | ***Encapsulation**: data and functionality are contained and accessible via a single unit |
***Abstraction**: abstract units expose only a high-level interface and hide the implementation details | ***Abstraction**: abstract units expose only a high-level interface and hide the implementation details | ||
***Inheritance**: the procedure in which one class inherits the attributes and methods of another class | ***Inheritance**: the procedure in which one class inherits the attributes and methods of another class | ||
Line 30: | Line 30: | ||
A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. **Encapsulation** is the most important principle of OOP: data (attributes) and functionality (methods) are contained in, and accessed through, a single unit. **Abstraction** is another core principle; it is similar to encapsulation, but the unit exposes only a high-level interface and hides the implementation details. | A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. **Encapsulation** is the most important principle of OOP: data (attributes) and functionality (methods) are contained in, and accessed through, a single unit. **Abstraction** is another core principle; it is similar to encapsulation, but the unit exposes only a high-level interface and hides the implementation details. | ||
- | For example in a banking application different objects may be bank account, customer type, branch. | + | <note tip>For example, in a banking application the objects may be **bank account**, **customer**, **customer type** and **branch**. These can contain specific methods and attributes, can be related (e.g. a bank account belongs to a customer of some type, individual or business, and was created at a branch), and should be easy to use, maintain and extend as the application grows.</note> |
In Python, a class is defined using ''class'', and its methods (functions) are defined using ''def''. Instance methods **always** receive the instance as their first parameter, which is named ''self'' by convention (''self'' is not a reserved keyword). Through ''self'', a method can access the attributes and the other methods of the instance. | In Python, a class is defined using ''class'', and its methods (functions) are defined using ''def''. Instance methods **always** receive the instance as their first parameter, which is named ''self'' by convention (''self'' is not a reserved keyword). Through ''self'', a method can access the attributes and the other methods of the instance. | ||
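A minimal sketch of a class definition, building on the banking example. The ''BankAccount'' class, its attributes and its methods are hypothetical names chosen for illustration; they are not part of the lab code.

```python
class BankAccount:
    """Hypothetical class: bundles data (attributes) with functionality (methods)."""

    def __init__(self, owner, balance=0):
        # self refers to the instance that is being initialised
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # reject invalid input instead of silently corrupting the balance
        if amount <= 0:
            raise ValueError("deposit amount must be positive")
        self.balance += amount
        return self.balance

    def describe(self):
        # methods reach the instance attributes through self
        return f"{self.owner}: {self.balance}"


account = BankAccount("Ana", 100)
account.deposit(20)
print(account.describe())
```

Code outside the class only calls ''deposit'' and ''describe''; it does not need to know how the balance is stored, which is the point of encapsulation.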
Line 130: | Line 130: | ||
**T2 (1p)** Override the method //say_hi// to show the grade as well. | **T2 (1p)** Override the method //say_hi// to show the grade as well. | ||
*Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class | *Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class | ||
- | **T3 (2p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. | + | **T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program? |
</note> | </note> | ||
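Inheritance, overriding and polymorphism can be sketched on a different pair of classes, so that the tasks above remain an exercise. ''Customer'' and ''BusinessCustomer'' are hypothetical names taken from the banking example, not from the lab code.

```python
class Customer:
    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"Customer {self.name}"


class BusinessCustomer(Customer):
    def __init__(self, name, company):
        super().__init__(name)  # reuse the initialiser of the parent class
        self.company = company

    def describe(self):
        # override the method, but extend the parent behaviour instead of replacing it
        return f"{super().describe()} ({self.company})"


# The calling code treats both types the same way: it only calls describe()
customers = [Customer("Ana"), BusinessCustomer("Dan", "Acme SRL")]
descriptions = [customer.describe() for customer in customers]
for line in descriptions:
    print(line)
```

The loop never checks the type of each object; every object answers the same method call with its own implementation.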
Line 227: | Line 227: | ||
<code python> | <code python> | ||
- | from subprocess import Popen, PIPE | + | import requests |
- | from lxml import etree | + | |
- | from io import StringIO | + | # url to scrape data from |
- | user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36' | + | |
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops' | url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops' | ||
- | print("fetching: " + url) | + | |
- | get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE) | + | print("fetching page") |
- | result = get.stdout.read().decode('utf8') | + | |
- | tree = etree.parse(StringIO(result), etree.HTMLParser()) | + | # get response object |
- | str_tree = etree.tostring(tree, encoding='utf8', method='xml') | + | response = requests.get(url) |
- | str_data = str_tree.decode() | + | |
+ | # get byte string | ||
+ | byte_data = response.content | ||
+ | |||
+ | # get html source code | ||
+ | html_data = byte_data.decode("utf-8") | ||
print("writing file") | print("writing file") | ||
with open("index.html", "w", encoding="utf-8") as f: | with open("index.html", "w", encoding="utf-8") as f: | ||
- | f.write(str_data) | + | f.write(html_data) |
</code> | </code> | ||
- | <note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note> | + | <note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note> |
To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml// | To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml// | ||
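A minimal sketch of parsing and filtering with //lxml//, assuming the package is installed. The HTML snippet and its ''class'' values are made up for illustration and do not reflect the markup of the scraped page. Note that the ''HTMLParser'' object has to be passed to ''etree.parse''; otherwise //lxml// falls back to its XML parser, which typically rejects real-world HTML.

```python
from io import StringIO

from lxml import etree

# Hypothetical markup, used instead of the downloaded index.html
html = """
<html><body>
  <div class="item"><a class="name" href="/p/1">Laptop A</a></div>
  <div class="item"><a class="name" href="/p/2">Laptop B</a></div>
  <p>Not a product</p>
</body></html>
"""

# Pass the HTML parser explicitly so that lenient HTML parsing is used
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)

# Keep only the text of <a> elements whose class attribute is "name"
names = [
    elem.text
    for elem in tree.iter()
    if elem.tag == "a" and elem.attrib.get("class") == "name"
]
print(names)
```

The same filter can be applied to a file on disk by passing the file name instead of the ''StringIO'' object.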
Line 252: | Line 257: | ||
filename = "index.html" | filename = "index.html" | ||
+ | parser = etree.HTMLParser() | ||
- | tree = etree.parse(filename) | + | tree = etree.parse(filename, parser) |
tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()] | tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()] | ||
Line 261: | Line 267: | ||
<note> | <note> | ||
- | **T5 (1p)** Examine the downloaded HTML file. Extract the laptop names into a text file. | + | **T5 (2p)** Examine the downloaded HTML file. Extract the laptop names into a text file. |
*Hint: filter the extracted tags by tag and attribute | *Hint: filter the extracted tags by tag and attribute | ||
*which combination of tag and attribute brings us to the data that we want to extract (laptop names) | *which combination of tag and attribute brings us to the data that we want to extract (laptop names) | ||
Line 274: | Line 280: | ||
* [[https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html|Object-Oriented Programming in Python]] | * [[https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html|Object-Oriented Programming in Python]] | ||
* [[https://docs.python.org/3/library/datetime.html|datetime — Basic date and time types]] | * [[https://docs.python.org/3/library/datetime.html|datetime — Basic date and time types]] | ||
+ | * [[https://en.wikipedia.org/wiki/Composition_over_inheritance|Composition over inheritance]] | ||
+ | * [[https://www.w3schools.com/html/|HTML Tutorial]] | ||
+ | * [[https://lxml.de/api/lxml.etree._Element-class.html|lxml API]] | ||