Table of Contents

Lab 6. Relational Databases. MySQL-Python. Charts

In this lab, we are introduced to more advanced SQL (functions, joins, subqueries) that are commonly used when querying databases. Then, we will connect to the database and run queries using Python and we will experiment with a plotting library to visualize the results.

SQL. Functions

SQL functions are used to perform calculations on data. Functions can be aggregate (using multiple values to calculate a result) or scalar (applied on a single value).

SQL Aggregate Functions

SQL aggregate functions return a single value calculated from multiple values in a column.

-- AVG() --
SELECT AVG(column_name) FROM TABLE_NAME
 
-- COUNT() --
SELECT COUNT(column_name) FROM TABLE_NAME;
SELECT COUNT(*) FROM TABLE_NAME;
SELECT COUNT(DISTINCT column_name) FROM TABLE_NAME;
 
-- In MySQL, FIRST() is replaced by: --
SELECT column_name FROM TABLE_NAME
ORDER BY column_name ASC
LIMIT 1;
 
-- In MySQL, LAST() is replaced by: --
SELECT column_name FROM TABLE_NAME
ORDER BY column_name DESC
LIMIT 1;
 
-- MAX() --
SELECT MAX(column_name) FROM TABLE_NAME;
 
-- MIN() --
SELECT MIN(column_name) FROM TABLE_NAME;
 
-- SUM() --
SELECT SUM(column_name) FROM TABLE_NAME;

The (SELECT) DISTINCT statement is used to return only distinct values. For example, the following query returns the distinct positions in the employee table, effectively listing all positions (ids) that are assigned to employees in a simple way.

SELECT DISTINCT position_id FROM employee;

The GROUP BY statement is used in conjunction with the aggregate functions to group the result-set by one or more columns. For example, the following query returns the number of employees for each department. The aggregate function is COUNT(), which counts the employees within each department.

-- Get the number of employees for each department --
SELECT department_id, COUNT(id) AS employee_count FROM employee
GROUP BY department_id;

GROUP BY defines how the aggregate function is applied. What happens if you remove the GROUP BY statement? What happens if you use GROUP BY with the employee id instead?

The HAVING clause is used instead of WHERE when using aggregate functions. For example, the following query returns the number of employees for each department having more than 5 employees.

-- Get the number of employees for each department having more than 5 employees --
SELECT department_id, COUNT(id) AS employee_count FROM employee
GROUP BY department_id
HAVING COUNT(id) > 5;

SQL Scalar Functions

SQL scalar functions return a single value based on the input value.

-- UCASE() --
SELECT UCASE(column_name) FROM TABLE_NAME;
 
-- LCASE() --
SELECT LCASE(column_name) FROM TABLE_NAME;
 
-- MID() --
SELECT MID(column_name, START, LENGTH) AS some_name FROM TABLE_NAME;
 
-- LEN() --
SELECT LEN(column_name) FROM TABLE_NAME;
 
-- ROUND() --
SELECT ROUND(column_name, decimals) FROM TABLE_NAME;
 
-- NOW() --
SELECT NOW() FROM TABLE_NAME;
 
-- FORMAT() --
SELECT FORMAT(column_name, format) FROM TABLE_NAME;

SQL. Querying multiple tables

Database schema

Relational databases are usually structured to avoid duplication of data. For example, a department may be associated to a number of employees by using a one-to-many relationship defined by a FOREIGN KEY constraint. While simple queries can be used to retrieve data from each table, it is often required to combine information from multiple tables, such as selecting a list of employees with information about the department they belong to, filtering data using the WHERE clause with information from other related tables and more complex aggregate functions. For such complex queries, there are two main SQL clauses: using JOINs or subqueries.

Using JOINS can be more efficient than running multiple queries (and then combining the data in the application). The DBMS handles the optimal retrieval of data based on indexing and performs the data transformations internally. This is also useful when requesting data from a remote location as it requires less data to be transferred between the DBMS and the application server.

JOINs

JOINs are used to retrieve data from two or more tables based on the relationships between them. The ON clause is used to match records between tables based on the column names. Consider the following example to list all employees with the name of their position. A way to write this in SQL is by using the WHERE clause for matching the position_id FOREIGN KEY with the id PRIMARY KEY in the one-to-many relationship between employee and position. This method is the same as using INNER JOIN.

-- List the employees name, and the name of their position --
SELECT emp.name AS employee, pos.name AS POSITION
FROM employee AS emp, POSITION AS pos
WHERE pos.id = emp.position_id;

JOINs have better performance compared to subqueries, although might be more difficult to write for complex queries.

(INNER) JOIN

INNER JOIN is used to return ONLY the matching rows from both tables based on the matching condition:

-- List the employees name and the name of their department --
SELECT emp.name AS employee, dept.name AS department FROM employee AS emp
INNER JOIN department AS dept
ON dept.id = emp.department_id;

Multiple related tables can be joined according to the database schema. For example, consider we want to list all employees along with the name of the department they belong to and their position. In the company database (example), the query requires an INNER JOIN between 3 tables: employee (get the employee names), department (get the name of the department), position (get the name of the position).

-- List the employees name, the name of their department and position --
SELECT emp.name AS employee, dept.name AS department, pos.name AS POSITION
FROM employee AS emp
INNER JOIN department AS dept
ON dept.id = emp.department_id
INNER JOIN POSITION AS pos
ON pos.id = emp.position_id;

Click here for other types of joins

Click here for other types of joins

LEFT (OUTER) JOIN

LEFT JOIN is used to return rows from both tables based on the matching condition. The difference from INNER JOIN is that ALL the rows from the first table will be returned and if for some rows there is no matching record on the second table, the result for that column(s) will be NULL. For example, consider the following query to list all employees with the name of their manager. When using INNER JOIN, the general manager will be missing from the list. With LEFT JOIN, the general manager will be listed, with the name of his manager (does not exist) set to NULL.

-- List the employees name and the name of their manager --
SELECT emp.name AS employee_name, manager.name AS manager_name FROM employee AS emp
LEFT JOIN employee AS manager ON emp.manager_id = manager.id
ORDER BY manager_name ASC;

RIGHT (OUTER) JOIN

RIGHT JOIN is used to return rows from both tables based on the matching condition. The difference from INNER JOIN is that ALL the rows from the second table will be returned and if for some rows there is no matching record on the first table, the result for that column(s) will be NULL.

FULL (OUTER) JOIN

FULL JOIN combines LEFT JOIN and RIGHT JOIN. Where rows in the tables do not match, the result set will have NULL values for every column of the table that lacks a matching row.

CROSS JOIN

CROSS JOIN is a simplest form of JOIN which matches each row from one table to all rows of another table (similar to cartesian product in set theory). The result of a CROSS JOIN can be filtered by using a WHERE clause which may then produce the equivalent of an INNER JOIN.

-- List all the employees names, and all positions --
SELECT emp.name AS employee, pos.name AS POSITION
FROM employee AS emp
CROSS JOIN POSITION AS pos
 
-- List all the employees names with their positions using CROSS JOIN and WHERE --
SELECT emp.name AS employee, pos.name AS POSITION
FROM employee AS emp
CROSS JOIN POSITION AS pos
WHERE emp.position_id = pos.id

Subqueries

A subquery is a query nested inside another query. Subqueries can be used with SELECT, INSERT, UPDATE and DELETE statements, along with operators like =, <, >, >=, ⇐, IN, BETWEEN, etc.

SELECT column_name_1, column_name_2
FROM   table_1,
WHERE  column_name OPERATOR
   (
    SELECT column_name
    FROM table_1
    [WHERE]
   )

When compared with JOINs, subqueries are more simple to use and easier to read. JOINs are more efficient and are usually recommended to use instead of subqueries for performance reasons.

Subqueries can be classified as:

Uncorrelated subqueries

Uncorrelated subqueries are independent of the main query (the inner query doesn't depend upon the outer query). For example, the following query returns a list of the employees from any department located in Bucharest. As the employee table is related to the department table using the department_id FOREIGN KEY, the subquery has to find and return the id(s) that belong to the departments having the location Bucharest and then the main query can filter the employees by the department id. In this case, the subquery is used in the WHERE clause of the main query.

-- List the employees from any department in Bucharest (using subqueries) --
SELECT employee.name AS employee
FROM employee
WHERE employee.department_id IN (
    SELECT department.id
    FROM department
    WHERE department.location = "Bucharest"
);
Correlated subqueries

Correlated subqueries depend upon the result of the main query (the inner query depends upon the outer query). For example, the following query returns a list of the employees having a higher salary than the average of their department. In this case, the subquery is used in the WHERE clause of the main query and has to compute the average salary for the department of each employee. The subquery now depends on the department_id of the employee.

-- List the employees having a higher salary than the average of their department --
SELECT A.name, A.salary, A.department_id
FROM employee AS A
WHERE A.salary > (
    SELECT AVG(B.salary)
    FROM employee AS B
    WHERE B.department_id = A.department_id
) 
ORDER BY A.department_id;

There are queries that be implemented either as JOINs or subqueries. The equivalent of the first subquery can be written as follows:

-- List the employees from any department in Bucharest (using JOIN) --
SELECT employee.name AS employee
FROM employee
INNER JOIN department
ON employee.department_id = department.id
WHERE department.location = "Bucharest";

Python database connection. Data structures.

Databases can be accessed from other applications as well. A MySQL driver: MySQL-connector-python is required to access the MySQL Server using Python. Check out Lab 1: Introduction to Python for installing Python and Python packages using pip and the command line. With the MySQL driver installed in the Python environment, the main steps in working with SQL from Python are as follows:

import mysql.connector
 
# Connect to the database server using the provided credentials
# Note: user="root" for admin account
mydb = mysql.connector.connect(
            host="localhost",
            port=3306,
            user="ewis_student",
            passwd="ewis2020",
            database="company"
        )
# Open a cursor
mycursor = mydb.cursor(dictionary=True)
 
# Execute SQL query
mycursor.execute("SELECT * FROM employee")
 
# Get results from the query using the cursor
myresult = mycursor.fetchall()
 
# Close the cursor
mycursor.close()

The database can be located on the same machine (use localhost) but it can also be on a remote machine, in the same network or in the cloud (use IP address/URL). The port is set to 3306 by default, and this corresponds to the port that the database server is running on.

The (database) cursor is a control structure that enables traversal, retrieval, addition and removal of database records in a sequential way. The following steps describe using cursors for database access:

The data is retrieved by default in the form of tuples (similar to lists, with the difference that tuples are immutable). Take for example the result of a query, returning a list of employees in the company database:

[
 (1, 1, 1, None, 'Big Hoss', Decimal('10000.00'), datetime.datetime(2018, 10, 18, 0, 0)),
 (20, 2, 4, 3, 'Captain America', Decimal('10000.00'), datetime.datetime(2019, 04, 26, 0, 0))
]

Using the dictionary=True argument when creating the cursor, the data can be retrieved as dictionary list such as:

[
   {
      'id':1,
      'department_id':1,
      'position_id':1,
      'manager_id':None,
      'name':'Big Hoss',
      'salary':Decimal('10000.00'),
      'hire_date':datetime.datetime(2018,10,18,0,0)
   }, {
      'id':20,
      'department_id':2,
      'position_id':4,
      'manager_id':3,
      'name':'Captain America',
      'salary':Decimal('10000.00'),
      'hire_date':datetime.datetime(2019,04,26,0,0)
   }
]

While the dictionary results are useful for working with data in a Python application, the CSV format is often used for processing data and saving result files. For this, the conversion between a list of dictionaries (the results from the cursor) and a CSV list can be done in Python as follows:

# Function to extract CSV data from a dictionary list
def get_csv_data(result):
    csvdata = []
    for elem in result:
        csvdata.append([elem[k] for k in elem])
    return csvdata

The data can then be written into a CSV file from Python:

def write_csv_file(data):
    with open("result.csv", "w") as f:
        for d in data:
            f.write(",".join([str(col) for col in d]) + "\n")

The CSV format and file content will look like this:

1,1,1,None,Big Hoss,10000.00,2018-10-18 00:00:00
20,2,4,3,Captain America,10000.00,2019-04-26 00:00:00

For working with tables, the pandas library can be used which simplifies many operations such as extracting columns, CRUD operations (creating, reading, updating and deletion of data), filtering and data analysis. Pandas is often used for Machine Learning in the form of dataframes: DataFrame, allows to import data from various structures and file formats: lists, dictionaries, csv files, excel files, etc.

In the following example, data is retrieved from a cursor and added to a pandas DataFrame:

mycursor = mydb.cursor(dictionary=True)
mycursor.execute("SELECT * FROM employee")
myresult = mycursor.fetchall()
mycursor.close()
res = db.run_query("SELECT * FROM employee")
 
# create the pandas dataframe with the results from the database (in dictionary format)
df = pd.DataFrame(res)
# the dataframe can be printed in the console
print(df)
# the dataframe can be converted to list
dlist = df.values.tolist()
print(dlist)
 
# columns can be extracted from the dataframe
names = df["name"].values
print(names)
# the following way works too
salaries = list(df["salary"])
# then we may want to do some additional data processing, for example converting to float (real numbers representation in Python)
salaries = [float(sal) for sal in salaries]

Charts. Matplotlib.

For data visualization, the Matplotlib package can be used with Python (pip3 install matplotlib). matplotlib.pyplot is a collection of functions used for: creating figures, plotting area in a figure, plotting lines in a plotting area, adding labels, etc. The steps for plotting data are as follows:

*optional

Consider the previous example, showing the list of salaries from the company database on a chart:

import matplotlib.pyplot as plt
plt.figure()
plt.stem(salaries)
plt.title("salaries for each employee")
plt.xlabel("employee")
plt.ylabel("salary")
plt.show()

There are multiple ways of representing the data on a chart. Common data visualization methods include:

Plots

Plots are used to display quantitative data such as time series, mathematical functions and are sometimes referred to as line charts.

import matplotlib.pyplot as plt
numbers = [3, 2, 2, 3, 4]
plt.figure()
plt.plot(numbers)
plt.title('some numbers')
plt.xlabel("index")
plt.ylabel("values")
plt.show()

Stemplots are used to display quantitative data in moderate amounts (up to about 50 data points). They are similar to histograms and allow to compare data.

import matplotlib.pyplot as plt
numbers = [3, 2, 2, 3, 4]
plt.figure()
plt.stem(numbers)
plt.title('some numbers')
plt.xlabel("index")
plt.ylabel("values")
plt.show()

Bar charts

A bar chart or bar graph presents categorical data with rectangular bars with heights or lengths proportional to the values. The bars can be plotted vertically or horizontally.

import matplotlib.pyplot as plt
names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]
plt.figure(figsize=(9, 3))
plt.bar(names, values)
plt.show()

Histograms

A histogram is an accurate graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable (quantitative variable). To construct a histogram, the first step is to “bin” the range of values (i.e. divide the entire range of values into a series of intervals) and then count how many values fall into each interval.

import matplotlib.pyplot as plt
# get data from database (salaries)
salaries = [10000.0, 8000.0, 5000.0, 6000.0, 8000.0, 4500.0, 3500.0, 3000.0, 5000.0, 0.0, 5500.0, 2500.0, 4000.0, 3000.0, 2000.0, 2500.0, 3000.0, 3000.0, 10000.0, 10000.0, 5000.0]
plt.figure()
plt.hist(salaries)
plt.title("number of employees having the same salary")
plt.xlabel("salaries")
plt.ylabel("number of employees")
plt.show()

Exercises

1. Databases (5p)

Task 1. Run the example queries presented in the lab (based on the company database example).

Task 2. Solve 5 of the following queries. Present the SQL and results.

  1. Using aggregate functions, write an SQL query to calculate the sum of all salaries in the company database.
  2. Using aggregate functions, write an SQL query to calculate the average salary in the company database.
  3. Write an SQL query to return a list of employees with the following information about them:
    • the name
    • the name of their department
    • the size of their team (number of employees within the same department)
  4. Write an SQL query to return a list of employees with the following information about them:
    • the name
    • the name of their position
    • the min/average/max salary of the employees on the same position
  5. Write an SQL query to return a list of employees with the following information about them:
    • the name in uppercase
    • the location of their department
  6. Write an SQL query to return a list of employees (name, salary) having a higher salary than the average salary within the company.
  7. Write an SQL query to return a list of employees (name, salary) having a higher salary than their manager.
  8. Write an SQL query to return the employees (name, salary) having the highest salary within each department.
  9. Write an SQL query to return the managers and the number of employees in their teams.
  10. Write an SQL query to list all employees and the year they were hired (hint: use the scalar function DATE_FORMAT(hire_date,'%Y'))
2. MySQL-Python. Charts (5p)

For this part you also need Python installed on your computer.

  • Check out the requirements section (at the end of the page) to install the required tools and setup working with Python on your computer.

Run the example code (lab.py) and check out the generated CSV file that contains the data about the employees in the example database.

Run the example code (plots.py) to display the results of the functions as charts.

Task 3. Based on the examples, create a function to display the results for one of the selected queries in CSV format.

Hint:

Task 4. Based on the examples, create a function to display the results for one of the selected queries as a plot. You may plot a single column, using either plot, stem, barchart or histogram.

Hints:

3. Online evaluation

Resources

You will start from a Python project that is set up for connecting to the database, processing results and showing charts.

  • Install Python.
  • Download the Python project from the archive below.
  • Unzip the archive.
  • Check out readme.txt for installing required packages.
  • Run the examples as described in the exercises.
  • Modify the code to solve the exercises.

Project Archive