This shows you the differences between two versions of the page.
ewis:laboratoare:04 [2022/03/29 16:02] alexandru.predescu [Classes and Objects] |
ewis:laboratoare:04 [2023/03/29 15:23] (current) alexandru.predescu [Web Scraping in Python] |
||
---|---|---|---|
Line 7: | Line 7: | ||
==== Introduction to OOP in Python ===== | ==== Introduction to OOP in Python ===== | ||
- | Object Oriented Programming (OOP) is a programming that allows us to organize software as a collection of objects that consist of both data and behavior. | + | Object Oriented Programming (OOP) is a paradigm that allows us to organize software as a collection of objects that consist of both data and behavior. |
The advantages of OOP: | The advantages of OOP: | ||
Line 17: | Line 17: | ||
* **Extensibility**: new features can be added by creating new objects, without affecting the existing ones. Changes inside a class do not affect any other part of a program | * **Extensibility**: new features can be added by creating new objects, without affecting the existing ones. Changes inside a class do not affect any other part of a program | ||
- | **Encapsulation, polymorphism, abstraction, inheritance** are fundamentals in object oriented programming language (in Python it's a bit more loose implementation) | + | **Encapsulation, polymorphism, abstraction, inheritance** are the fundamental principles of object-oriented programming languages (in Python they are a bit more loosely defined) |
- | <note tip>In Python we can use OOP but it's not mandatory. Other programming languages (e.g. Java, C#) are actually centered on OOP paradigm with better support for enterprise development. However Python is more often used as a scripting language, focused on simplicity, and OOP can be hard to master.</note> | + | * **Encapsulation**: data and functionality are contained in, and accessible via, a single unit
+ | * **Abstraction**: abstract units expose only a high-level interface and hide the implementation details | ||
+ | * **Inheritance**: the mechanism by which one class inherits the attributes and methods of another class | ||
+ | * **Polymorphism**: the provision of a single interface to entities of different types | ||
+ | |||
+ | <note tip>Python is often used as a scripting language, focused on simplicity and flexibility, so we can use OOP but it's not mandatory. This is because, in practice, OOP is easy to learn but hard to master. Other programming languages (e.g. Java, C#) are actually centered on the OOP paradigm to provide better support for enterprise software development (less flexible but more organized and maintainable).</note> | ||
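As a rough illustration of the four principles, the sketch below uses hypothetical classes (''Shape'', ''Circle'', ''Square'') that are not part of this lab. It is one possible arrangement, not the only way to structure such code.

```python
import math


class Shape:
    """Abstraction: callers rely only on name, area() and describe()."""

    def __init__(self, name):
        # Encapsulation: the data lives next to the methods that use it.
        self.name = name

    def area(self):
        # Subclasses are expected to provide the actual computation.
        raise NotImplementedError("subclasses must implement area()")

    def describe(self):
        return f"{self.name} with area {self.area():.2f}"


class Circle(Shape):  # Inheritance: Circle reuses describe() from Shape.
    def __init__(self, radius):
        super().__init__("circle")
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


class Square(Shape):
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side ** 2


# Polymorphism: the same call works on objects of different types.
descriptions = [shape.describe() for shape in (Circle(1.0), Square(2.0))]
for text in descriptions:
    print(text)
```

The loop never checks which concrete class it is dealing with; each object supplies its own ''area''.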
==== Classes and Objects ==== | ==== Classes and Objects ==== | ||
- | A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. Encapsulation: data + methods. | + | A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. **Encapsulation** is the most important principle of OOP where data (attributes) and functionality (methods) are contained and accessible via a single unit. **Abstraction** is another core principle, which is similar to encapsulation but exposes only a high-level interface and hides the implementation details. |
- | For example in a banking application different objects may be bank account, customer type, branch. | + | <note tip>For example, in a banking application the objects may be **bank account**, **customer**, **customer type** and **branch**. These can contain specific methods and attributes, can be related (e.g. a bank account belongs to a customer of some type, individual or business, and was created at a branch), and should be easy to use, maintain and extend as the application becomes larger.</note> |
In Python, a class is defined using ''class'' and its methods (functions) are defined using ''def''. Instance methods **always** take the instance as their first parameter, conventionally named ''self''. ''self'' is not a keyword: it is an ordinary parameter name that refers to the instance, and is used to access the attributes and methods of the class. | In Python, a class is defined using ''class'' and its methods (functions) are defined using ''def''. Instance methods **always** take the instance as their first parameter, conventionally named ''self''. ''self'' is not a keyword: it is an ordinary parameter name that refers to the instance, and is used to access the attributes and methods of the class. | ||
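A minimal sketch of a class definition, assuming a hypothetical ''BankAccount'' class loosely based on the banking example (the class, its attributes and the ''deposit'' method are illustrative names only). It shows that ''self'' is filled in by Python with the instance when a method is called on an object.

```python
class BankAccount:
    """Hypothetical example class used only for illustration."""

    def __init__(self, owner, balance=0):
        # self is the instance being initialised.
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # Reject invalid input before touching the stored balance.
        if amount <= 0:
            raise ValueError("deposit amount must be positive")
        self.balance += amount
        return self.balance


account = BankAccount("Alice")
account.deposit(50)               # Python passes account as self
BankAccount.deposit(account, 25)  # equivalent call with self written out
print(account.owner, account.balance)
```

Both calls to ''deposit'' act on the same object; the second form only makes the implicit first argument visible.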
Line 64: | Line 69: | ||
<note> | <note> | ||
- | **T1 (1p)** Create a class //Student// with the instance attributes //name// and //grade// and a method //change_grade//. Use the class to create two instances with the names //Alice// and //Bob// and the method //change_grade// to assign their grades. | + | **T1 (2p)** Create a class //Student// with the instance attributes //name// and //grade// and a method //change_grade//. Use the class to create two instances with the names //Alice// and //Bob// and the method //change_grade// to assign their grades. |
</note> | </note> | ||
Line 125: | Line 130: | ||
**T2 (1p)** Override the method //say_hi// to show the grade as well. | **T2 (1p)** Override the method //say_hi// to show the grade as well. | ||
* Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class. | * Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class. | ||
+ | **T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program? | ||
</note> | </note> | ||
Line 221: | Line 227: | ||
<code python> | <code python> | ||
- | from subprocess import Popen, PIPE | + | import requests |
- | from lxml import etree | + | |
- | from io import StringIO | + | # url to scrape data from |
- | user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36' | + | |
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops' | url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops' | ||
- | print("fetching: " + url) | + | |
- | get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE) | + | print("fetching page") |
- | result = get.stdout.read().decode('utf8') | + | |
- | tree = etree.parse(StringIO(result), etree.HTMLParser()) | + | # get response object |
- | str_tree = etree.tostring(tree, encoding='utf8', method='xml') | + | response = requests.get(url) |
- | str_data = str_tree.decode() | + | |
+ | # get byte string | ||
+ | byte_data = response.content | ||
+ | |||
+ | # get html source code | ||
+ | html_data = byte_data.decode("utf-8") | ||
print("writing file") | print("writing file") | ||
with open("index.html", "w", encoding="utf-8") as f: | with open("index.html", "w", encoding="utf-8") as f: | ||
- | f.write(str_data) | + | f.write(html_data) |
</code> | </code> | ||
- | <note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note> | + | <note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note> |
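The request can also be wrapped in a small helper that fails early on HTTP errors. The sketch below is one possible variant, not part of the lab script: ''fetch_html'' and its ''client'' parameter are hypothetical names, and the 10 second timeout is an arbitrary assumption. The ''client'' argument lets the function be exercised without network access.

```python
def fetch_html(url, client=None, timeout=10.0):
    """Return the decoded HTML found at url, raising on HTTP error statuses."""
    if client is None:
        # Imported lazily so the helper can be tested with a stand-in client.
        import requests
        client = requests
    response = client.get(url, timeout=timeout)
    # Raises an exception for 4xx and 5xx responses instead of saving an error page.
    response.raise_for_status()
    # Same decoding step as in the script: bytes to text.
    return response.content.decode("utf-8")
```

Calling ''fetch_html(url)'' with the ''url'' from the script would return the same kind of string that the script stores in ''html_data''.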
To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml// | To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml// | ||
Line 246: | Line 257: | ||
filename = "index.html" | filename = "index.html" | ||
+ | parser = etree.HTMLParser() | ||
- | tree = etree.parse(filename) | + | tree = etree.parse(filename, parser) |
tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()] | tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()] | ||
Line 268: | Line 280: | ||
* [[https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html|Object-Oriented Programming in Python]] | * [[https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html|Object-Oriented Programming in Python]] | ||
* [[https://docs.python.org/3/library/datetime.html|datetime — Basic date and time types]] | * [[https://docs.python.org/3/library/datetime.html|datetime — Basic date and time types]] | ||
+ | * [[https://en.wikipedia.org/wiki/Composition_over_inheritance|Composition over inheritance]] | ||
+ | * [[https://www.w3schools.com/html/|HTML Tutorial]] | ||
+ | * [[https://lxml.de/api/lxml.etree._Element-class.html|lxml API]] | ||