Lab 4. Object Oriented Programming. Web Scraping

Object Oriented Programming (OOP) is a computer programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data field that has unique attributes and behavior. OOP allows for writing code that is better organized and maintainable, especially in large enterprise applications.

Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites, where the data is usually embedded in HTML content.

Introduction to OOP in Python

Object Oriented Programming (OOP) is a paradigm that allows us to organize software as a collection of objects that consist of both data and behavior.

The advantages of OOP:

  • Modularity: programs are organized into smaller, more manageable chunks with a clear, easy-to-understand structure
  • Simplicity: software objects model real-life objects
  • Re-usability: objects can be reused in different programs
  • Maintainability: objects can be maintained separately, making locating and fixing problems easier
  • Extensibility: new features can be added by creating new objects, without affecting the existing ones. Changes inside a class do not affect other parts of a program as long as the class interface stays the same

Encapsulation, abstraction, inheritance and polymorphism are the fundamental principles of object-oriented programming languages (in Python they are a bit more loosely defined):

  • Encapsulation: data and functionality are contained and accessible via a single unit
  • Abstraction: abstract units expose only a high-level interface and hide the implementation details
  • Inheritance: the procedure in which one class inherits the attributes and methods of another class
  • Polymorphism: the provision of a single interface to entities of different types
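
The four principles above can be sketched in a few lines of Python. This is only an illustration: the classes Shape, Circle and Square are hypothetical names that are not used elsewhere in this lab, and the class syntax itself is explained in the following sections.

```python
from abc import ABC, abstractmethod
import math


class Shape(ABC):
    """Abstraction: callers only need to know that a shape has an area."""

    def __init__(self, name):
        # Encapsulation: the data lives together with the methods that use it
        self.name = name

    @abstractmethod
    def area(self):
        """Return the area of the shape."""

    def describe(self):
        return f"{self.name} with area {self.area():.2f}"


class Circle(Shape):
    # Inheritance: Circle reuses Shape.__init__ and Shape.describe
    def __init__(self, radius):
        super().__init__("circle")
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


class Square(Shape):
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side ** 2


# Polymorphism: the same call works on objects of different types
shapes = [Circle(1.0), Square(2.0)]
descriptions = [shape.describe() for shape in shapes]
for text in descriptions:
    print(text)
```

The loop at the end does not need to know which concrete class each element belongs to; it relies only on the common interface.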

Python is often used as a scripting language focused on simplicity and flexibility, so OOP is available but not mandatory. This matters because, in practice, OOP is easy to learn but hard to master. Other programming languages (e.g. Java, C#) are centered on the OOP paradigm to provide better support for enterprise software development (less flexible, but more organized and maintainable).

Classes and Objects

A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. Encapsulation is the most important principle of OOP where data (attributes) and functionality (methods) are contained and accessible via a single unit. Abstraction is another core principle, which is similar to encapsulation but exposes only a high-level interface and hides the implementation details.

For example, in a banking application the objects might be a bank account, a customer, a customer type and a branch. These can contain specific methods and attributes, can be related (e.g. a bank account belongs to a customer of some type, individual or business, and was created at a branch), and should be easy to use, maintain and extend as the application becomes larger.

In Python, a class is defined using the class keyword, and its methods (functions that belong to the class) are defined using def. Instance methods always receive the instance as their first parameter, which by convention is named self. Through self, a method can access the attributes and the other methods of the object.

An object represents an instance of a class.

A constructor is used to initialize the object's state: the special method __init__() is called when an instance (object) is created and can be used to define instance attributes and their initial values.

# Creating a class
class Person:  
    # class attributes
    kind = "human"
 
    # init method or constructor   
    def __init__(self, name):
        # instance attributes
        self.name = name  
 
    # sample method  
    def say_hi(self):  
        print('Hello, my name is', self.name)  
 
# creating an object instance from a class   
p = Person('John')
# calling an instance method  
p.say_hi() 
 
# in Python, class attributes can also be accessed using class name 
print(Person.kind)
print(p.kind)
 
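# instance attributes cannot be accessed using the class name:
# name is only set per instance in __init__, so this raises AttributeError
try:
    print(Person.name)
except AttributeError as error:
    print("AttributeError:", error)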
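
The difference between class attributes and instance attributes can be explored a bit further with the sketch below. The Counter class is a hypothetical example, not part of the lab code. It shows that a class attribute is shared by all instances until an instance assigns its own value under the same name.

```python
class Counter:
    # class attribute: one value shared by all instances
    created = 0

    def __init__(self, label):
        # instance attribute: one value per object
        self.label = label
        Counter.created += 1


a = Counter("a")
b = Counter("b")

# all three expressions read the same class attribute
print(Counter.created, a.created, b.created)

# assigning through an instance creates an instance attribute
# that shadows the class attribute for that object only
a.created = 100
print(Counter.created, a.created, b.created)

# label exists on the objects, not on the class itself
print(hasattr(Counter, "label"), hasattr(a, "label"))
```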

T1 (2p) Create a class Student with the instance attributes name and grade and a method change_grade. Use the class to create two instances with the names Alice and Bob and the method change_grade to assign their grades.

The collection of public methods that a class exposes is often referred to as its API (Application Programming Interface).
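
To illustrate encapsulation and the idea of a class API, here is a minimal sketch based on the banking example mentioned earlier. The class BankAccount, its methods and the validation rules are assumptions made for illustration only. The leading underscore in _balance is a naming convention that marks the attribute as internal; Python does not enforce it.

```python
class BankAccount:
    """Hypothetical account class; the names are illustrative only."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        # internal state, meant to be changed only through the methods below
        self._balance = balance

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("deposit amount must be positive")
        self._balance += amount

    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("withdrawal amount must be positive")
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    @property
    def balance(self):
        # read-only view of the internal state
        return self._balance


account = BankAccount("Alice", balance=100)
account.deposit(50)
account.withdraw(30)
print(account.owner, account.balance)
```

Code that uses the class only calls deposit, withdraw and balance, so the way the balance is stored can change later without affecting the callers.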

Extending the functionality of a class

There are two main OOP principles that define relationships between objects: inheritance and composition.

Inheritance

An existing class can be extended by “inheriting” the attributes and functions of the base class. Inheritance is a way of arranging objects in a hierarchy from the most general to the most specific. This is one way to extend a program, in the end making the code more structured. A subclass / child class inherits all base / parent attributes and methods. You may provide additional functionality to the inherited methods by overriding the implementation in the child class.

In pseudocode:

class SuperClass:
    Attributes of SuperClass
    Methods of SuperClass
class SubClass(SuperClass):
    Attributes of SubClass
    Methods of SubClass

Here is an example in Python, defining the Student class that extends the Person class defined earlier:

# Extending a class
# Student class inherits from Person class
class Student(Person):
    # init method or constructor
    def __init__(self, name):
        # reuse the initializer of the base class
        super().__init__(name)
        # equivalent alternative (use one or the other, not both):
        # Person.__init__(self, name)
        # initialize instance attribute
        self.grade = None
        self.course_grades = []
 
    # sample method
    def change_grade(self, grade):
        # set instance attribute
        self.grade = grade
 
# Creating an object instance from a class   
s = Student('John')  
s.change_grade(10)
# calling the method defined in the base class
s.say_hi() 
# class variables can also be accessed using class name 
print(Student.kind)
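
The same mechanism can be seen on an unrelated pair of classes. The sketch below uses hypothetical names (Vehicle, Motorcycle) and shows a subclass that overrides a class attribute, extends an inherited method through super(), and can be inspected with isinstance, issubclass and the method resolution order.

```python
class Vehicle:
    wheels = 4

    def __init__(self, brand):
        self.brand = brand

    def describe(self):
        return f"{self.brand} with {self.wheels} wheels"


class Motorcycle(Vehicle):
    # the subclass overrides the class attribute
    wheels = 2

    def describe(self):
        # extend the parent implementation instead of replacing it
        return super().describe() + " (motorcycle)"


m = Motorcycle("Acme")
print(m.describe())
print(isinstance(m, Vehicle), issubclass(Motorcycle, Vehicle))
# the order in which Python looks up attributes and methods
print([cls.__name__ for cls in Motorcycle.__mro__])
```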

T2 (1p) Override the method say_hi to show the grade as well.

  • Hint: You can define (override) the method in the Student class and re-use the method defined in the parent class

T3 (1p) Polymorphism represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class Person and Student. For each of the elements print the name using the method say_hi. Is there any difference between the two types of objects when we use them in the main program?

Aggregation. Composition

  • Composition is a way of attaching objects to other objects. In this case, the object declared as an attribute of the parent object belongs exclusively to the parent object.
  • If the link between two objects is weaker, and neither object has exclusive ownership of the other, the relationship is called aggregation.

Relationships like this can be one-to-one, one-to-many or many-to-many, and they can be unidirectional or bidirectional.
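
As a rough sketch of these relationships, consider the hypothetical classes below (Car, Engine, Team and Player are illustrative names only). The car creates and owns its engine, while players exist independently of the teams they join and keep a reference back to them, which makes that relationship many-to-many and bidirectional.

```python
class Engine:
    def __init__(self, power_kw):
        self.power_kw = power_kw


class Car:
    def __init__(self, model, power_kw):
        self.model = model
        # composition: the car creates its own engine and nothing else refers to it
        self.engine = Engine(power_kw)


class Player:
    def __init__(self, name):
        self.name = name
        self.teams = []  # back-reference, makes the relationship bidirectional


class Team:
    def __init__(self, name):
        self.name = name
        self.players = []  # one team, many players

    def add_player(self, player):
        # aggregation: the player was created elsewhere and can join other teams
        self.players.append(player)
        player.teams.append(self)


car = Car("hatchback", 70)
ann = Player("Ann")
red, blue = Team("Red"), Team("Blue")
red.add_player(ann)
blue.add_player(ann)

print(car.engine.power_kw)
print([team.name for team in ann.teams])
```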

# import this module
import datetime
 
# add the following methods to the Student class:
def add_course_grade(self, course_grade):
    # course_grade is an object attached by composition 
    self.course_grades.append(course_grade)
 
def compute_gpa(self):
    # average of the course grades; the grade is left unchanged when there are none
    if self.course_grades:
        self.grade = sum(cg.grade for cg in self.course_grades) / len(self.course_grades)
 
# define a new class to contain the grade for a course
class CourseGrade():
    # init method or constructor
    def __init__(self, course_id, grade, date):
        self.course_id = course_id
        self.grade = grade
        self.date = date
        self.date_changed = date
 
    # sample method
    def change_grade(self, grade, date):
        print("grade changed from: ", self.grade, " to: ", grade, " at: ", date)
        self.grade = grade
        self.date_changed = date
 
student = Student('John')
course_grade = CourseGrade('EWIS', 9, datetime.date(2022, 3, 30))
student.add_course_grade(course_grade)
course_grade.change_grade(10, datetime.date(2022, 3, 31))
student.say_hi()
student.compute_gpa()
student.say_hi()
student.change_grade(10)
student.say_hi()

Complex class hierarchies can become hard to understand. Sometimes we can replace inheritance with composition and achieve a similar result. This principle is called composition over inheritance, and it can make the code easier to understand and maintain.
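
One way to see the trade-off is to build the same small data structure twice. In the sketch below (ListStack and Stack are hypothetical names), the inherited version exposes every method of list, including ones that break the stack discipline, while the composed version exposes only the operations it defines.

```python
class ListStack(list):
    """Inheritance: a stack that is a list and exposes every list method."""

    def push(self, item):
        self.append(item)


class Stack:
    """Composition: a stack that has a list and exposes only stack operations."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


inherited = ListStack()
inherited.push(1)
inherited.push(2)
inherited.insert(0, 99)  # allowed, although it bypasses the stack discipline

composed = Stack()
composed.push(1)
composed.push(2)

print(list(inherited), hasattr(composed, "insert"))
print(composed.pop(), len(composed))
```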

T3 (1p) Add the two methods (add_course_grade, compute_gpa) to the Student class

T4 (1p) Examine the code. How are the objects student and course_grade related? (aggregation vs composition)

Web Scraping in Python

The HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page semantically and originally included cues for the appearance of the document. HTML elements are delineated by tags, written using angle brackets. Tags such as <img /> and <input /> directly introduce content into the page. Other tags such as <p> and <div> surround and provide information about document text and may include other tags as sub-elements. Browsers do not display the HTML tags, but use them to interpret the content of the page.

You can find tutorials on HTML here.

Here is a simple web page. Create an .html file, paste the content below as text and open the file in a web browser.

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
 
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
 
</body>
</html>
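
To see how a program views this page, the sketch below walks through the same markup and prints each piece of text together with the tag that encloses it. It uses html.parser from the Python standard library instead of the lxml package used later in this lab, and the TagCollector class is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p>This is a paragraph.</p>

</body>
</html>"""


class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of the tags that are currently open
        self.items = []      # (tag, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        # ignore whitespace between tags
        if text and self.open_tags:
            self.items.append((self.open_tags[-1], text))


collector = TagCollector()
collector.feed(PAGE)
collector.close()
for tag, text in collector.items:
    print(tag, "->", text)
```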

Usually, the data can be found right inside the HTML tags as it's rendered by the browser. Web scraping is the process of extracting data from web pages and involves using the HTTP protocol to fetch the page and then extracting the content from the HTML.

Some websites do not allow web scraping and may use several methods to prevent “unauthorized” access. This is why you can find captchas such as “I am not a robot” to check that the web client is a real user and not an automated script.

Note: It is not advised to do web scraping on the Google search engine!

Web scraping is not an easy job, and it does not work the same way for every website: we need to know the structure of each web page to be able to extract the required data.

In this example, we will use a sample website designed for testing web scraping programs. The main objective is to extract the items from the e-commerce website into a more useful representation for data processing (e.g. a list of objects).

We will need the lxml package, which can be installed via the Python package manager (pip):

py -3 -m pip install lxml

Here is the code for fetching the web page content into an HTML file:

import subprocess
from io import StringIO

from lxml import etree

# browser-like User-Agent string sent with the request
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
print("fetching: " + url)
# download the page with curl: -s hides the progress output, -A sets the User-Agent
get = subprocess.run(['curl', '-s', '-A', user_agent, url], stdout=subprocess.PIPE, check=True)
result = get.stdout.decode('utf8')
# parse the HTML (which may not be well-formed) and serialize it as XML
tree = etree.parse(StringIO(result), etree.HTMLParser())
str_tree = etree.tostring(tree, encoding='utf8', method='xml')
str_data = str_tree.decode()
print("writing file")
with open("index.html", "w", encoding="utf-8") as f:
    f.write(str_data)

The Python script uses curl, the command line tool that can request the web page from the HTTP server. You can find more about curl here.
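
If curl is not available, a similar request can be prepared with urllib.request from the Python standard library. This is an alternative sketch, not the method used in the lab script: the helper names build_request and fetch_html and the User-Agent value are assumptions made for illustration. The block only prepares the request; calling fetch_html(url) performs the actual download.

```python
from urllib.request import Request, urlopen

# hypothetical User-Agent value, any descriptive string can be used
USER_AGENT = "Mozilla/5.0 (compatible; lab-scraper-sketch)"


def build_request(url, user_agent=USER_AGENT):
    """Prepare a GET request that carries a User-Agent header (like curl -A)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its body decoded as UTF-8 text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
request = build_request(url)
print(request.get_method(), request.full_url)
print(request.get_header("User-agent"))
```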

To parse the HTML file (separating the different tags in the HTML), we use the etree module from lxml. Here is the code for extracting the items from the web page:

from lxml import etree
 
filename = "index.html"
tree = etree.parse(filename)
tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
for tag in tags:
    print(tag)

Here you can find the description of the Element class that we use in the example to extract details for each HTML tag (e.g. tag, attributes, text)
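
The attributes tag, attrib and text can be tried out on a small, well-formed document before working with the downloaded file. The sketch below uses xml.etree.ElementTree from the standard library, whose Element objects offer the same three attributes and the same iter() method as the lxml ones. The markup is a made-up example and does not reflect the structure of the lab website.

```python
import xml.etree.ElementTree as ET

# hypothetical, well-formed markup; not the structure of the lab website
SAMPLE = """<html><body>
<div class="book"><a class="title" href="/b/1">Dune</a><span class="year">1965</span></div>
<div class="book"><a class="title" href="/b/2">Emma</a><span class="year">1815</span></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# every element has a tag name, a dictionary of attributes and (optionally) text
for elem in root.iter("span"):
    print(elem.tag, dict(elem.attrib), elem.text)

titles = [
    elem.text
    for elem in root.iter("a")              # filter by tag
    if elem.attrib.get("class") == "title"  # filter by attribute
]
print(titles)
```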

T5 (2p) Examine the downloaded HTML file. Extract the laptop names into a text file.

  • Hint: filter the extracted tags by tag and attribute
  • Which combination of tag and attribute leads to the data that we want to extract (laptop names)?

T6 (2p) Examine the downloaded HTML file. Extract the laptop names and prices into a CSV file.

  • Hint: filter the extracted tags by tag and attribute
  • Which combination of tag and attribute leads to the data that we want to extract (laptop names, prices)?

Resources

ewis/laboratoare/04.txt · Last modified: 2022/03/30 17:53 by alexandru.predescu
CC Attribution-Share Alike 3.0 Unported