Object Oriented Programming (OOP) is a computer programming model that organizes software design around data, or objects, rather than functions and logic. An object can be defined as a data field that has unique attributes and behavior. OOP makes it possible to write code that is better organized and easier to maintain, especially in large enterprise applications.
Web scraping (also called web harvesting or web data extraction) is a form of data scraping that extracts data from websites, where it is typically available as HTML content.
Object Oriented Programming (OOP) is a paradigm that allows us to organize software as a collection of objects that consist of both data and behavior.
The advantages of OOP:
Encapsulation, polymorphism, abstraction, and inheritance are the fundamental principles of object oriented programming languages (in Python they are enforced somewhat more loosely).
A class is a user-defined data structure from which objects are created. Classes provide a means of bundling data (variables) and functionality (functions) together. Encapsulation, often regarded as the most important principle of OOP, means that data (attributes) and functionality (methods) are contained in, and accessed through, a single unit. Abstraction is a related core principle: it exposes only a high-level interface and hides the implementation details.
In Python, a class is defined using the class keyword. Its instance methods (functions) are defined using def and take self as their first parameter. The name self (a naming convention, not a reserved keyword) refers to the instance on which the method is called, and is used to access the attributes and methods of that instance.
An object represents an instance of a class.
A constructor is used to initialize the object's state: the special method __init__() is called when an instance (object) is created and can be used to define instance attributes and their initial values.
    # Creating a class
    class Person:
        # class attribute, shared by all instances
        kind = "human"

        # __init__ method (constructor)
        def __init__(self, name):
            # instance attribute
            self.name = name

        # sample method
        def say_hi(self):
            print('Hello, my name is', self.name)

    # creating an object (instance) from a class
    p = Person('John')

    # calling an instance method
    p.say_hi()

    # in Python, class attributes can be accessed using the class name or an instance
    print(Person.kind)
    print(p.kind)

    # instance attributes cannot be accessed using the class name:
    # the next line raises AttributeError
    print(Person.name)
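The encapsulation and abstraction principles described earlier can be illustrated with a minimal sketch. The Account class, its attributes, and the deposit method are hypothetical names chosen for this illustration and are not part of the lab code. The sketch assumes the common Python convention that a leading underscore marks an attribute as internal; Python does not enforce this.

```python
# Hypothetical example class, used only to illustrate encapsulation
class Account:
    def __init__(self, owner, balance=0):
        self.owner = owner
        # leading underscore: by convention, internal to the class
        self._balance = balance

    @property
    def balance(self):
        # read-only view of the internal state (no setter is defined)
        return self._balance

    def deposit(self, amount):
        # the method validates the input before changing the state
        if amount <= 0:
            raise ValueError("deposit amount must be positive")
        self._balance += amount


account = Account("Alice")
account.deposit(50)
print(account.balance)
```

Code that uses Account only needs deposit and balance (the high-level interface); how the balance is stored remains an implementation detail.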
T1 (2p) Create a class Student with the instance attributes name and grade and a method change_grade. Use the class to create two instances with the names Alice and Bob and the method change_grade to assign their grades.
There are two main OOP principles that define relationships between objects: inheritance and composition.
An existing class can be extended by “inheriting” the attributes and functions of the base class. Inheritance is a way of arranging objects in a hierarchy from the most general to the most specific. This is one way to extend a program, in the end making the code more structured. A subclass / child class inherits all base / parent attributes and methods. You may provide additional functionality to the inherited methods by overriding the implementation in the child class.
In pseudocode:
    class SuperClass:
        Attributes of SuperClass
        Methods of SuperClass

    class SubClass(SuperClass):
        Attributes of SubClass
        Methods of SubClass
Here is an example in Python, defining the Student class that extends the Person class defined earlier:
    # Extending a class
    # Student class inherits from Person class
    class Student(Person):
        # __init__ method (constructor)
        def __init__(self, name):
            # reuse the constructor of the base class
            super().__init__(name)
            # this also works (equivalent, use one or the other):
            # Person.__init__(self, name)
            # initialize instance attributes
            self.grade = None
            self.course_grades = []

        # sample method
        def change_grade(self, grade):
            # set instance attribute
            self.grade = grade

    # creating an object instance from a class
    s = Student('John')
    s.change_grade(10)

    # calling the method defined in the base class
    s.say_hi()

    # class attributes can also be accessed using the class name
    print(Student.kind)
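Overriding an inherited method, as mentioned in the description of inheritance, could look like the following sketch. Teacher and its subject attribute are hypothetical names introduced for this illustration. To keep the block self-contained, Person is redefined here in a reduced form whose say_hi returns the greeting instead of printing it; that is an assumption of this sketch, not the lab's definition.

```python
# Reduced stand-in for the Person class (assumption: say_hi returns a string)
class Person:
    def __init__(self, name):
        self.name = name

    def say_hi(self):
        return 'Hello, my name is ' + self.name


# Hypothetical subclass that overrides say_hi
class Teacher(Person):
    def __init__(self, name, subject):
        # reuse the constructor of the base class
        super().__init__(name)
        self.subject = subject

    def say_hi(self):
        # extend the inherited behavior instead of replacing it completely
        base = super().say_hi()
        return base + ' and I teach ' + self.subject


t = Teacher('Maria', 'Physics')
print(t.say_hi())  # Hello, my name is Maria and I teach Physics
```

Calling super().say_hi() keeps the base-class behavior in one place, so a later change to the greeting in Person also reaches Teacher.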
T3 (1p) Polymorphism represents a key principle of OOP. To understand this principle, create a list that contains multiple objects of class Person and Student. For each of the elements print the name using the method say_hi. Is there any difference between the two types of objects when we use them in the main program?
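Composition describes a "has-a" relationship, in which an object contains or references other objects. Relationships like this can be one-to-one, one-to-many or many-to-many, and they can be unidirectional or bidirectional.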
    # import this module
    import datetime

    # add the following methods to the Student class;
    # the class is shown in full, with the two new methods at the end
    class Student(Person):
        def __init__(self, name):
            super().__init__(name)
            self.grade = None
            self.course_grades = []

        def change_grade(self, grade):
            self.grade = grade

        # new method: course_grade is an object attached by composition
        def add_course_grade(self, course_grade):
            self.course_grades.append(course_grade)

        # new method: the overall grade is the mean of the course grades
        def compute_gpa(self):
            self.grade = sum(cg.grade for cg in self.course_grades) / len(self.course_grades)

    # define a new class to contain the grade for a course
    class CourseGrade:
        # __init__ method (constructor)
        def __init__(self, course_id, grade, date):
            self.course_id = course_id
            self.grade = grade
            self.date = date
            self.date_changed = date

        # sample method
        def change_grade(self, grade, date):
            print("grade changed from: ", self.grade, " to: ", grade, " at: ", date)
            self.grade = grade
            self.date_changed = date

    student = Student('John')
    course_grade = CourseGrade('EWIS', 9, datetime.date(2022, 3, 30))
    student.add_course_grade(course_grade)
    course_grade.change_grade(10, datetime.date(2022, 3, 31))
    student.say_hi()
    student.compute_gpa()
    student.say_hi()
    student.change_grade(10)
    student.say_hi()
T4 (1p) Examine the code. How are the objects student and course_grade related? (aggregation vs composition)
The HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page semantically and originally included cues for the appearance of the document. HTML elements are delineated by tags, written using angle brackets. Tags such as <img /> and <input /> directly introduce content into the page. Other tags such as <p> and <div> surround and provide information about document text and may include other tags as sub-elements. Browsers do not display the HTML tags, but use them to interpret the content of the page.
You can find tutorials on HTML here.
Here is a simple web page. Try creating an .html file, adding the content below as text, and opening it with a web browser.
    <!DOCTYPE html>
    <html>
    <head>
    <title>Page Title</title>
    </head>
    <body>
    <h1>This is a Heading</h1>
    <p>This is a paragraph.</p>
    </body>
    </html>
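To see how the tags of this page nest inside each other, the following sketch prints an indented outline of its elements. It uses the html.parser module from the Python standard library; OutlineParser is a hypothetical helper written for this illustration. The depth tracking assumes every opened tag is explicitly closed, which holds for this sample page but not for pages that use tags such as <img /> without a closing tag.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # remember the tag at the current depth, then go one level deeper
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # a closing tag returns to the parent level
        self.depth -= 1


parser = OutlineParser()
parser.feed(PAGE)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The outline shows html at the top level, head and body one level below it, and title, h1 and p nested inside those.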
Usually, the data can be found right inside the HTML tags as it's rendered by the browser. Web scraping is the process of extracting data from web pages and involves using the HTTP protocol to fetch the page and then extracting the content from the HTML.
Note: Scraping the Google search engine is not advised!
However, web scraping is not an easy job that works the same way for every website, because extracting the required data requires knowing the structure of each web page.
In this example, we will use a sample website designed for testing web scraping programs. The main objective is to extract the items from the e-commerce website into a more useful representation for data processing (e.g. a list of objects).
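Install the required libraries (requests for fetching the page, lxml for parsing the HTML):
    py -3 -m pip install requests lxml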
Here is the code for fetching the webpage content into an HTML file:
    import requests

    # url to scrape data from
    url = 'https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'

    print("fetching page")
    # get response object
    response = requests.get(url)
    # stop early if the server returned an error status
    response.raise_for_status()
    # get byte string
    byte_data = response.content
    # get html source code
    html_data = byte_data.decode("utf-8")

    print("writing file")
    with open("index.html", "w", encoding="utf-8") as f:
        f.write(html_data)
To parse the HTML file (separating the different tags in the HTML), we use the etree module from lxml. Here is the code for extracting the items from the webpage:
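    from lxml import etree

    filename = "index.html"
    # use the HTML parser, which tolerates markup that is not well-formed XML
    parser = etree.HTMLParser()
    tree = etree.parse(filename, parser)

    # collect the tag name, attributes and text of every element
    tags = [[elem.tag, elem.attrib, elem.text] for elem in tree.iter()]
    for tag in tags:
        print(tag)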
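Listing every element is a first step. To pick out specific fields, lxml elements also support XPath queries. The following is a minimal sketch on an inline snippet. The markup, the class names item, label and value, and the values themselves are hypothetical and were chosen for this illustration, so the structure of the real page has to be examined separately.

```python
from lxml import etree

# Hypothetical markup; the tags and class names of a real page may differ
SNIPPET = """
<div class="item">
  <span class="label">First item</span>
  <span class="value">10.50</span>
</div>
<div class="item">
  <span class="label">Second item</span>
  <span class="value">7.25</span>
</div>
"""

# etree.HTML parses an HTML string and returns the root element
root = etree.HTML(SNIPPET)

items = []
# select every <div class="item"> anywhere in the document
for div in root.xpath('//div[@class="item"]'):
    # relative queries (starting with ./) search only inside this div
    label = div.xpath('./span[@class="label"]/text()')[0]
    value = float(div.xpath('./span[@class="value"]/text()')[0])
    items.append((label, value))

print(items)
```

Each XPath expression encodes an assumption about the page structure, which is why a scraper usually has to be adapted whenever the layout of the target page changes.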
T6 (2p) Examine the downloaded HTML file. Extract the laptop names and prices into a CSV file.