Lab 3. Useful Python skills. Working with data files

Exceptions

If some code may “raise an exception” under conditions you cannot predict, you can protect your program by placing that code in a try: block. After the try: block, add an except: clause followed by a block of code that handles the problem as gracefully as possible. An optional else: block runs only when no exception was raised, and an optional finally: block runs in every case:

try:
  print("Hello")
except:
  print("Something went wrong")
else:
  print("Nothing went wrong")
finally:
  print("The 'try except' is finished")
 

You can raise an exception yourself with the raise statement:

def move(level):
   if level < 0:
      raise Exception("Invalid level: " + str(level))
      # The code below this line is not executed
      # if the exception is raised
 
   print("Command received")
 
try:
   move(1)
   print("Move complete")
   move(-1)
   print("Move complete")
except Exception as e:
   print("Exception detected")
   print(e)
else:
   print("Action complete")
 

Exceptions are useful for separating normal behavior from unexpected behavior when writing code.
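
As a minimal sketch of this idea, the example below defines a custom exception class and catches it by type instead of using a bare except:. The names InvalidLevelError and move_checked are hypothetical and are used only for illustration.

```python
# Hypothetical custom exception type (not part of the lab code)
class InvalidLevelError(Exception):
    """Raised when a negative level is requested."""


def move_checked(level):
    # Reject invalid input by raising our own exception type
    if level < 0:
        raise InvalidLevelError("Invalid level: " + str(level))
    return "Command received"


results = []
for level in (1, -1):
    try:
        results.append(move_checked(level))
    except InvalidLevelError as e:
        # Only this specific error is handled here;
        # any other exception would still propagate
        results.append("Rejected: " + str(e))

print(results)
```

Catching a specific type keeps unrelated errors (for example, a typo that causes a NameError) from being silently swallowed.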

Lambda expressions

A Python lambda is an anonymous function with the following syntax:

lambda arguments: expression
# Python code to calculate the cube of a number,
# showing the difference between def and lambda.
 
# using a lambda expression to define an "anonymous" function
cube = lambda x: x*x*x 
print(cube(7)) 
 
# defining a named function
def cube(y): 
    return y*y*y
 
print(cube(5))
 
# using lambda expressions with functions
cubex = lambda x: cube(x)
print(cubex(7))  

Lambda functions can be used along with built-in functions like filter(), map() and reduce().

Use of lambda with filter()

The filter() function in Python takes a function and an iterable (here, a list) as arguments. It offers an elegant way to keep only the elements of a sequence for which the function returns True (reference: Python filter() Function). Here is a small program that returns the odd numbers from an input list:

# Python code to illustrate 
# filter() with lambda() 
li = [5, 7, 22, 97, 54, 62, 77, 23, 73, 61] 
final_list = list(filter(lambda x: (x%2 != 0) , li)) 
print(final_list) 

You can also write this as a list comprehension: final_list = [x for x in li if x%2 != 0]

Use of lambda with map()

The map() function in Python takes a function and an iterable (here, a list) as arguments. It applies the function to every item and returns an iterator over the results, which list() converts into a new list (reference: Python map() Function). Here is a small program that doubles every element of a list:

# Python code to illustrate  
# map() with lambda()  
li = [5, 7, 22, 97, 54, 62, 77, 23, 73, 61] 
final_list = list(map(lambda x: x*2 , li)) 
print(final_list) 

You can write this using list comprehension too: final_list = [x*2 for x in li]
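
The short sketch below illustrates why the examples wrap the call in list(): in Python 3, map() (like filter()) returns a lazy iterator that can be consumed only once. The variable names are illustrative.

```python
li = [5, 7, 22]

doubled = map(lambda x: x * 2, li)   # nothing is computed yet
first_pass = list(doubled)           # consumes the iterator
second_pass = list(doubled)          # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

If the results are needed more than once, store them in a list first.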

Use of lambda with reduce()

The reduce() function in Python takes a function of two arguments and an iterable (here, a list). It applies the function cumulatively to the items, from left to right, combining the running result with the next item until a single value remains (reference: Python reduce function). reduce() is part of the functools module. Here is a small program that returns the sum of a list:

# Python code to illustrate  
# reduce() with lambda() 
from functools import reduce
li = [5, 8, 10, 20, 50, 100] 
# named total to avoid shadowing the built-in sum()
total = reduce((lambda x, y: x + y), li)
print(total)

This is shorter than:

li = [5, 8, 10, 20, 50, 100]
total = 0
for val in li:
    total = total + val
print(total)
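
One detail that the loop version makes explicit is the starting value of the accumulator. reduce() accepts it as an optional third argument, and the sketch below (with illustrative variable names) shows what happens on an empty list with and without it.

```python
from functools import reduce

li = [5, 8, 10, 20, 50, 100]

# The optional third argument is the starting value of the accumulator
total = reduce(lambda acc, val: acc + val, li, 0)

# With a starting value, an empty list is handled gracefully
empty_total = reduce(lambda acc, val: acc + val, [], 0)

# Without a starting value, an empty list raises TypeError
try:
    reduce(lambda acc, val: acc + val, [])
    raised = False
except TypeError:
    raised = True

print(total, empty_total, raised)
```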

T1 Given a list, select the even numbers and compute the sum of their squares.

  • Example list (random generator)
import numpy as np
li = []
size = 100
maxn = 10
li = [int(np.random.rand()*maxn) for i in range(size)]
  • Hint: Use filter, map, reduce and lambda expressions

Data files in Python

Files are used to store and exchange data between programs. Some files are human-readable, such as plain text files. Others are both human- and machine-readable, such as XML, HTML and JSON, which are used for data exchange on the web. Others are machine-readable only, such as image, sound and video files.

Getting started

Reading and writing files in Python is pretty straightforward. You first open the file in the appropriate mode ('r' for reading, 'w' for writing) and close it when you are done:

file = open('data.txt', 'w') 
try: 
    file.write('ana are mere') 
finally: 
    file.close() 

Another way to write a file is to use the with statement, which closes the file automatically when the block ends, even if an exception is raised inside it. This makes the code cleaner. with is commonly used with files, sockets and other resources that must be released.

Create a new Python file, write_file.py:

with open('data.txt', 'w') as f:
    data = 'ana are mere'
    f.write(data)

Create a new Python file, read_file.py:

with open('data.txt', 'r') as f:
    data = f.read()
    print(data)

Run the Python scripts: write_file.py, read_file.py

Compared with the explicit try/finally version shown earlier, the with statement makes the code cleaner and much more readable.
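
The following sketch checks this behavior: the file is closed by the with block even though an exception is raised inside it. The temporary directory and the simulated RuntimeError are assumptions made for the example, so that no real file is touched.

```python
import os
import tempfile

# Work in a temporary directory so no real file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.txt")

    try:
        with open(path, "w") as f:
            f.write("ana are mere")
            raise RuntimeError("simulated failure")  # hypothetical error
    except RuntimeError:
        pass

    # The file was closed by the with block despite the exception
    closed_after_error = f.closed

    # The text written before the error was flushed to disk on close
    with open(path, "r") as f:
        content = f.read()

print(closed_after_error, content)
```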

Working with real data

Let's try to open a large data file with tabular data. A database of TV Shows and Movies listed on Netflix can be found here (check download link in resources section). The data is stored as a CSV file, which is a common format when working with data. You can open the file with a text editor or with Excel to view the data.

In Python, the default text encoding depends on the platform and may not decode the diacritics and special characters in this file, so we specify the UTF-8 encoding explicitly when opening it:

with open('netflix_titles.csv', 'r', encoding="utf-8") as f:
    data = f.read()
    print(data)

This would take a long time to print. There has to be a better way of working with data.
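
One modest improvement, before turning to a dedicated library, is to iterate over the file line by line and stop after the first few lines. The sketch below uses a small generated file (sample.csv, a hypothetical stand-in for the real data set) and itertools.islice.

```python
import os
import tempfile
from itertools import islice

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for a large file such as netflix_titles.csv
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "w", encoding="utf-8") as f:
        for i in range(1000):
            f.write("row " + str(i) + "\n")

    # A file object can be iterated line by line, so we can stop
    # after the first three lines instead of loading everything
    with open(path, "r", encoding="utf-8") as f:
        first_lines = [line.rstrip("\n") for line in islice(f, 3)]

print(first_lines)
```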

The pandas Python library provides some straightforward methods to work with data sets. For tabular data, the DataFrame structure is used to load the data from different sources and formats. A key feature is the ability to access a group of rows and columns by label(s) or a boolean array. In the following example, we import the CSV data into a pandas DataFrame:

import pandas as pd
df = pd.read_csv('netflix_titles.csv')
print(df)
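
To make the selection "by label(s) or a boolean array" more concrete, here is a minimal sketch on a tiny in-memory DataFrame. The rows are made up for illustration; only the column names "title" and "release_year" come from this lab. It assumes pandas is installed.

```python
import pandas as pd

# Hypothetical miniature data set; the column names mirror the ones
# used in this lab ("title", "release_year")
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2019, 2008, 2021, 2015],
})

# Boolean array: one True/False value per row
mask = df["release_year"] >= 2015

# Select rows with the boolean array and a column by its label
recent_titles = df.loc[mask, "title"].tolist()

# head() returns only the first rows, which is handy for large files
preview = df.head(2)

print(recent_titles)
print(preview)
```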

To install pandas, you can use the package manager from the terminal, as shown in Lab 1:

py -3 -m pip install pandas

Let's say we want to extract only the titles from the dataset and print the output into a text file (your task). Here is how we can use the Pandas library to work with datasets in a simple way:

import pandas as pd
df = pd.read_csv('netflix_titles.csv')
# get the data by column and convert to a list
titles = list(df["title"])
print(titles)

There are many more operations that you can perform with Pandas, which might be useful for working with data (see: 10 minutes to pandas).

Let's try to sort the titles by the release year and print the results into a text file (your task).

import pandas as pd
df = pd.read_csv('netflix_titles.csv')
# sort the data frame by column name
df = df.sort_values(by="release_year")

Here is a long-form example (you can write it more concisely) that writes a text file with one entry per line:

titles = ["Avengers", "Avatar", "Star Wars"]
n_titles = len(titles)
with open("titles.txt", "w", encoding="utf-8") as f:
    for i in range(n_titles):
        f.write(titles[i] + "\n")

Notice the “\n” special character, which marks the end of a line (reference: Newline).
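
A more concise variant, sketched below, iterates over the items directly instead of over indices, then reads the file back to check the result. Writing to a temporary directory is an assumption made here so that the example does not overwrite an existing titles.txt.

```python
import os
import tempfile

titles = ["Avengers", "Avatar", "Star Wars"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "titles.txt")

    # Iterate over the items directly instead of over indices
    with open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + "\n")

    # Read the file back; splitlines() drops the line-ending markers
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```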

Directories

The built-in os module has a number of useful functions that can be used to list directory contents and filter the results. To get a list of all the files and folders in a particular directory in the filesystem, use os.scandir().

import os
with os.scandir('.') as entries:
    for entry in entries:
        print(entry.name)
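
To filter the results, each entry returned by os.scandir() offers methods such as is_file() and is_dir(). The sketch below keeps only the .txt files; the directory contents are hypothetical and are created in a temporary folder just for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical directory contents, created only for this example
    for name in ("a.txt", "b.csv", "c.txt"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("")
    os.mkdir(os.path.join(tmp_dir, "subfolder"))

    # Keep only regular files whose name ends in ".txt"
    with os.scandir(tmp_dir) as entries:
        txt_files = sorted(
            entry.name
            for entry in entries
            if entry.is_file() and entry.name.endswith(".txt")
        )

print(txt_files)
```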

Task

T1 (2p) Lab task presented above

T2 (3p) Use the functions presented in this lab to read text from a data file and count the number of occurrences of each word. Print the words sorted by their number of occurrences.

T3 (2p) Print the Netflix titles into a text file.

T4 (3p) Print the Netflix titles and release years into a text file, using the CSV format, sorted by release year.

Resources