Differences

This shows you the differences between two versions of the page.

ewis:laboratoare:04 [2022/03/29 16:04]
alexandru.predescu
ewis:laboratoare:04 [2023/03/29 15:23] (current)
alexandru.predescu [Web Scraping in Python]
Line 7: Line 7:
 ===== Introduction to OOP in Python =====

-Object Oriented Programming (OOP) is a programming that allows us to organize software as a collection of objects that consist of both data and behavior.
+Object Oriented Programming (OOP) is a paradigm that allows us to organize software as a collection of objects that consist of both data and behavior.

 The advantages of OOP:
Line 17: Line 17:
   * **Extensibility**: adding new features can be solved by creating new objects, without affecting the existing ones. Changes inside a class do not affect any other part of a program

-**Encapsulation, polymorphism, abstraction, inheritance** are fundamentals in object oriented programming language (in Python it's a bit more loose implementation)
+**Encapsulation, polymorphism, abstraction, inheritance** are fundamentals in object oriented programming language (in Python they are a bit more loosely defined)

-<note tip>In Python we can use OOP but it's not mandatory. Other programming languages (e.g. Java, C#) are actually centered on OOP paradigm with better support for enterprise development. However Python is more often used as a scripting language, focused on simplicity, and OOP can be hard to master.</note>
+  * **Encapsulation**: data and functionality are contained and accessible via a single unit
+  * **Abstraction**: abstract units expose only a high-level interface and hide the implementation details
+  * **Inheritance**: the procedure in which one class inherits the attributes and methods of another class
+  * **Polymorphism**: the provision of a single interface to entities of different types
+
+<note tip>Python is often used as a scripting language, focused on simplicity and flexibility, so we can use OOP but it's not mandatory. This is because, in practice, OOP is easy to learn but hard to master. Other programming languages (e.g. Java, C#) are actually centered on the OOP paradigm to provide better support for enterprise software development (less flexible but more organized and maintainable).</note>
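To illustrate the note above (OOP is available in Python but optional), here is a minimal sketch of the same small task written first as a plain function and then as a class. The names ''rectangle_area'' and ''Rectangle'' are made up for this illustration and are not part of the lab.

```python
# Hypothetical example: the names below are illustrative only.

# Scripting style: a plain function, no class required
def rectangle_area(width, height):
    return width * height


# OOP style: the data (width, height) and the behavior (area) live in one unit
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


print(rectangle_area(3, 4))
print(Rectangle(3, 4).area())
```

Both versions compute the same value; the class version starts to pay off once more data and behavior need to travel together.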
  
 ==== Classes and Objects ====

-A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. Encapsulationdata methods.
+A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. **Encapsulation** is the most important principle of OOP where data (attributes) and functionality (methods) are contained and accessible via a single unit. **Abstraction** is another core principle, which is similar to encapsulation but exposes only a high-level interface and hides the implementation details.

-For example in a banking application different objects may be bank account, customer type, branch.
+<note tip>For example in a banking application different objects may be **bank account**, **customer**, **customer type**, **branch**. These can contain specific methods and attributes, can be related (e.g. a bank account belongs to a customer of some type - individual/business, and was created at a branch), and should be easy to use, maintain and extend as the application becomes larger.</note>
  
 In Python, a class is defined using ''class'' and its methods (functions) are defined using ''def''. Instance methods **always** receive the instance as their first parameter, which by convention is named ''self'' (a naming convention, not a reserved keyword). Through ''self'', a method can access the attributes and the other methods of the object.
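The syntax described above can be sketched with the banking example. This is only an illustration: the class ''BankAccount'', its attributes and its methods are assumptions made for this sketch, not code from the lab.

```python
# Hypothetical sketch: BankAccount and its fields are assumptions for illustration.
class BankAccount:
    def __init__(self, owner, balance=0):
        # attributes (data) are stored on the instance through self
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # methods (functionality) reach the same data through self
        if amount <= 0:
            raise ValueError("deposit amount must be positive")
        self.balance += amount
        return self.balance

    def describe(self):
        return f"{self.owner}: {self.balance}"


account = BankAccount("Ana", 100)
account.deposit(50)
print(account.describe())
```

Note that ''self'' is not passed explicitly when calling ''account.deposit(50)''; Python supplies the instance automatically.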
Line 123: Line 128:
  
 <​note>​ <​note>​
-**T2 (2p)** Override the method //say_hi// to show the grade as well.
+**T2 (1p)** Override the method //say_hi// to show the grade as well.
   * Hint: You can define (override) the method in the //Student// class and re-use the method defined in the parent class
+**T3 (1p)** **Polymorphism** represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class //Person// and //Student//. For each of the elements print the name using the method //say_hi//. Is there any difference between the two types of objects when we use them in the main program?
 </​note>​ </​note>​
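A rough sketch of the idea behind overriding and polymorphism is given below. The lab defines //Person// and //Student// elsewhere; the constructors and the message text used here are stand-in assumptions, so treat this as an illustration of the mechanism rather than the reference solution.

```python
# Assumed shapes: these stand-ins only mimic the Person/Student classes of the lab.
class Person:
    def __init__(self, name):
        self.name = name

    def say_hi(self):
        return f"Hi, my name is {self.name}"


class Student(Person):
    def __init__(self, name, grade):
        super().__init__(name)
        self.grade = grade

    def say_hi(self):
        # re-use the parent implementation, then add the grade
        return f"{super().say_hi()} and my grade is {self.grade}"


people = [Person("Ana"), Student("Dan", 9)]
for p in people:
    # the same call works for both types; each object uses its own say_hi
    print(p.say_hi())
```

The loop does not check the type of each element, which is the point of polymorphism: the caller relies on a shared interface (''say_hi'') and each class supplies its own behavior.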
  
Line 221: Line 227:
  
 <code python> <code python>
-from subprocess import Popen, PIPE
-from lxml import etree
-from io import StringIO
-user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
+import requests
+
+# url to scrape data from
 url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
-print("fetching" + url)
-get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
-result = get.stdout.read().decode('utf8')
-tree = etree.parse(StringIO(result), etree.HTMLParser())
-str_tree = etree.tostring(tree, encoding='utf8', method='xml')
-str_data = str_tree.decode()
+
+print("fetching page")
+
+# get response object
+response = requests.get(url)
+
+# get byte string
+byte_data = response.content
+
+# get html source code
+html_data = byte_data.decode("utf-8")
+
 print("writing file")
 with open("index.html", "w", encoding="utf-8") as f:
-    f.write(str_data)
+    f.write(html_data)
 </​code>​ </​code>​
  
-<note tip>The Python script uses ''curl'', the command line tool that can request the web page from the HTTP server. You can find more about ''curl'' [[https://curl.se/docs/httpscripting.html|here]].</note>
+<note tip>The Python script makes an HTTP request to retrieve the web page from the server. You can find more about HTTP requests [[https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview|here]].</note>
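To make the parts of an HTTP request more concrete, the sketch below builds (but does not send) a GET request. It uses the standard library module ''urllib.request'' instead of ''requests'', only so that the snippet runs without extra packages or network access. The ''User-Agent'' value is a made-up placeholder.

```python
from urllib.request import Request

url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'

# Build the request object only; nothing is sent over the network here.
# The User-Agent string is a placeholder assumption.
request = Request(url, headers={"User-Agent": "ewis-lab-example/0.1"})

print(request.get_method())    # HTTP method
print(request.host)            # server the request would go to
print(request.selector)        # path requested on that server
print(request.header_items())  # headers that would be sent
```

Sending the request would be done with ''urllib.request.urlopen(request)'', which plays the same role as ''requests.get(url)'' in the script above.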
  
 To parse the HTML file (separating the different tags in the HTML), we use the //etree// module from //lxml//
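As a rough illustration of what "separating the different tags" means, the sketch below collects the tag name, attributes and text of each element from a small HTML string. It deliberately uses the standard library ''html.parser'' as a stand-in for //lxml// so that it runs without extra packages; the class ''TagCollector'' and the HTML snippet are made up for this example.

```python
from html.parser import HTMLParser


# Stand-in for lxml: collect [tag, attributes, text] for each element
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # start a new record; the text is filled in by handle_data
        self.tags.append([tag, dict(attrs), None])

    def handle_data(self, data):
        # attach the first non-empty text found after the most recent start tag
        text = data.strip()
        if text and self.tags and self.tags[-1][2] is None:
            self.tags[-1][2] = text


# Made-up HTML fragment, loosely shaped like a product listing
html_data = '<div class="caption"><h4 class="price">$295.99</h4><a href="/x" title="Laptop">Laptop</a></div>'

collector = TagCollector()
collector.feed(html_data)
for tag in collector.tags:
    print(tag)
```

Each record has the same shape as the ''[elem.tag, elem.attrib, elem.text]'' entries produced with //lxml//, which makes it easy to filter elements by tag name or attribute afterwards.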
Line 246: Line 257:
  
 filename = "index.html"
+parser = etree.HTMLParser()
 tree = etree.parse(filename)
 tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
Line 268: Line 280:
   * [[https://python-textbok.readthedocs.io/en/1.0/Object_Oriented_Programming.html|Object-Oriented Programming in Python]]
   * [[https://docs.python.org/3/library/datetime.html|datetime — Basic date and time types]]
+  * [[https://en.wikipedia.org/wiki/Composition_over_inheritance|Composition over inheritance]]
+  * [[https://www.w3schools.com/html/|HTML Tutorial]]
+  * [[https://lxml.de/api/lxml.etree._Element-class.html|lxml API]]
  
  
ewis/laboratoare/04.1648559079.txt.gz · Last modified: 2022/03/29 16:04 by alexandru.predescu
CC Attribution-Share Alike 3.0 Unported