The Basics of Relational Databases

Learning the Essentials of Relational Databases

Types of databases

  1. Relational databases (SQL) - data is stored in structured tables with columns and rows defined by schemas.

    • Examples: PostgreSQL, MySQL, SQLite

    • Why pick a relational database: If you have structured data with strong relationships and consistency is a huge priority

    • Use cases:

      • E-commerce - entities have strong relationships to each other (users, orders etc.)

      • Finance systems - complex relationships between entities where consistency is important

      • Others: Healthcare, inventory management, education systems, airlines, government systems etc.

    • Drawbacks:

      • Horizontal scaling can be complex

      • Rigid schema design that can make changes complex

  2. Non-relational databases (NoSQL) - data is stored in flexible structures such as key-value pairs, JSON documents, graphs, wide columns, XML, etc.

    • Examples: MongoDB (Document), Redis (Key-Value), Cassandra (Column), Neo4j (Graph)

    • Why pick a non-relational database: If you have unstructured or semi-structured data, where scalability is a huge priority

    • Use cases:

      • Caching data using key-value pairs (Redis)

      • Social relationships can be represented using graphs (Neo4j)

      • Collecting telemetry data, logs, analytics

    • Drawbacks:

      • Most DBs sacrifice strong consistency (for scalability)

      • Most DBs have limited support for complex queries, aggregations and joins

      • Most favor denormalization (duplicated data is harder to keep consistent and wastes storage)

      • Most don’t have schema enforcement, so data integrity can slip through the cracks if not enforced on the application level

💡
This article focuses on relational databases 🙂

Key concepts

Structure - tables, columns, rows, primary keys, foreign keys

A relational database is a collection of related data stored in tables.

  • A table is like a spreadsheet with columns and rows - If we want to store a user in the database, the Users entity would be the table.

  • A column is a single attribute of the data and defines the type of data stored - a user has a number of properties identifying them, such as name and age; those are columns.

  • A row is a record of one instance of the data - One data entry of a user named Jane Doe aged 31 would be an instance of the data which is a row.

  • A primary key (PK) is a column used to uniquely identify each entry (row) in the table - it can be auto-generated or provided at data insertion.

  • A foreign key (FK) is a column that references the primary key of another, related table - if a user has orders, we would have an Orders table with an FK column pointing back to the Users table's PK, so that when we find an order, we can easily follow the foreign key to the associated user (see the sketch below).
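
Here's a minimal sketch of these concepts in SQL (PostgreSQL-style syntax; the exact table and column definitions are illustrative assumptions):

CREATE TABLE users (
    id   SERIAL PRIMARY KEY,          -- primary key, auto-generated
    name VARCHAR(100) NOT NULL,       -- columns are typed attributes
    age  INT
);

CREATE TABLE orders (
    id      SERIAL PRIMARY KEY,
    amount  NUMERIC(10, 2) NOT NULL,
    user_id INT REFERENCES users(id)  -- foreign key back to users.id
);

-- Each INSERT creates one row: one record (instance) of the data
INSERT INTO users (name, age) VALUES ('Jane Doe', 31);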

Relationships

Relationships define how 2 tables are related. Take an example of a financial system where we have users who have a wallet and can make transactions such as sending money to other users.

  • One to one (1:1) - Each row on one table is linked to only 1 row in the other table. E.g. A user may only have one wallet and a wallet belongs to only one user.

  • One to many (1:N) - One row on one table can be linked with multiple rows in another table. E.g. One user can make multiple transactions and multiple transactions can belong to one user. The Transactions table would have a FK column pointing to the PK of the Users table.

  • Many to many (N:N) - Rows in one table can be linked to multiple rows in another table and vice versa. E.g. In an education system we can have Course and Student tables where a student can enroll in many courses and a course can have many students. To represent the relationship between these 2 tables, we might have a 3rd table called Enrollments that has 2 FK columns, as sketched below.
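
As a sketch, that many-to-many setup could look like this in SQL (students, courses and enrollments are hypothetical names):

CREATE TABLE students (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE courses (
    id    SERIAL PRIMARY KEY,
    title VARCHAR(100) NOT NULL
);

-- The junction table: one row per (student, course) pair
CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id  INT REFERENCES courses(id),
    PRIMARY KEY (student_id, course_id)  -- the composite PK also prevents duplicate enrollments
);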

Joins

Joins are used to combine data from different tables. Types of Joins:

  1. INNER JOIN

  2. OUTER JOIN

    • LEFT (OUTER) JOIN

    • RIGHT (OUTER) JOIN

    • FULL OUTER JOIN

  3. CROSS JOIN

  4. SELF JOIN

To understand joins, let's take a simple example of a Users table and an Orders table where the relationship is one to many (1:N) (one user can have many orders and one order can belong to only one user).

Users table

id | name
1  | Jane Doe
2  | Lucy Dear
3  | David Karl

Orders table

id | amount | user_id
1  | 20     | 1
2  | 40     | NULL
3  | 60     | 1

We want to write a query that combines data from both tables. Here’s how the different joins work.

  1. INNER JOIN

Returns only the rows that have a match in both tables. Here we would return only users with orders and orders with valid users. If a user doesn't have an order, or an order doesn't have a valid user, the row won't be returned.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
INNER JOIN orders ON users.id = orders.user_id;

Result

name     | order_id | amount
Jane Doe | 1        | 20
Jane Doe | 3        | 60

  2. OUTER JOIN

a) LEFT (OUTER) JOIN

Returns all rows in the left table (Users) and the matching rows in the right table (Orders). Here we would return all users, even those without orders.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
LEFT JOIN orders ON users.id = orders.user_id;

Result

name       | order_id | amount
Jane Doe   | 1        | 20
Jane Doe   | 3        | 60
Lucy Dear  | NULL     | NULL
David Karl | NULL     | NULL

b) RIGHT (OUTER) JOIN

Returns all rows in the right table (Orders) and the matching rows in the left table (Users). Here we would return all orders, whether they have a valid user or not.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
RIGHT JOIN orders ON users.id = orders.user_id;

Result

name     | order_id | amount
Jane Doe | 1        | 20
Jane Doe | 3        | 60
NULL     | 2        | 40

c) FULL OUTER JOIN

Returns all rows from both tables regardless of the relationship, matched where possible and padded with NULLs where not. (Note: not every database supports FULL OUTER JOIN; MySQL, for example, doesn't, though it can be emulated with a UNION of a LEFT and a RIGHT join.)

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id;

Result

name       | order_id | amount
Jane Doe   | 1        | 20
Jane Doe   | 3        | 60
Lucy Dear  | NULL     | NULL
David Karl | NULL     | NULL
NULL       | 2        | 40

  3. CROSS JOIN

A cross join returns the Cartesian product: every combination of rows from both tables, regardless of any relationship, so the result has (rows in A) × (rows in B) rows and can get very large very quickly. It's rarely useful in practice. Here, every user would be paired with every order.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
CROSS JOIN orders;

Result

name       | order_id | amount
Jane Doe   | 1        | 20
Jane Doe   | 2        | 40
Jane Doe   | 3        | 60
Lucy Dear  | 1        | 20
Lucy Dear  | 2        | 40
Lucy Dear  | 3        | 60
David Karl | 1        | 20
David Karl | 2        | 40
David Karl | 3        | 60

  4. SELF JOIN

A self join is used when we want to join a table with itself, i.e. compare rows within the same table. For example, say we have a table of employees where employees have managers (who are also employees), and we want to find all employees and their managers.

Employees table

id | name         | manager_id
1  | Jane Doe     | 3
2  | Janet Monday | 3
3  | Judy Scalar  | 4
4  | Bruno Mars   | NULL

SELECT e1.name AS employee, e2.name AS manager
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.id;

Result

employee     | manager
Jane Doe     | Judy Scalar
Janet Monday | Judy Scalar
Judy Scalar  | Bruno Mars
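
Note that the INNER JOIN drops Bruno Mars, who has no manager. If we also wanted employees without managers (shown with a NULL manager), a LEFT JOIN would keep them:

SELECT e1.name AS employee, e2.name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.id;
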
💡
INNER JOIN is often the fastest join because it returns only the rows with matches in both tables. Outer joins also have to produce the unmatched rows (padded with NULLs), which usually means more rows to process and fewer shortcuts for the query planner.

Querying

Is there a right way to form a query?

Let’s start by taking a look at the order of execution of statements. When we make a query, the database combines data from various tables and returns a single dataset that we can think of as another new table.

Order of execution: Load tables → Filter → Group → Filter groups → Select columns → Order → Limit

Take a look at this query that returns the top 5 users who have spent more than $500 on orders, considering only orders above $100. Can you tell the order of operation?

SELECT users.id, users.name, SUM(orders.amount) AS total_spent
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.amount > 100
GROUP BY users.id, users.name
HAVING SUM(orders.amount) > 500
ORDER BY total_spent DESC
LIMIT 5;
  1. Identify the source table (orders - the left table, since it appears in FROM) and load it [FROM]

  2. Identify the secondary table (users - the right table) and join the tables using an inner join on user_id [JOIN]

  3. Filter the new dataset so that we only have rows where amount > 100 [WHERE]

  4. Group and aggregate the rows by user id and name. Each row will have a unique user with their name, id and sum of the amount [GROUP BY]

  5. Filter the groups. Now we only have users whose total spent amount > 500. (The SELECT alias total_spent generally can't be referenced here, because HAVING runs before SELECT; that's why the query repeats the aggregate.) [HAVING]

  6. Select the columns. Now we only have id, name and amount as total_spent columns [SELECT]

  7. Order by total amount spent, with the highest spenders at the top [ORDER BY]

  8. Take the first 5 rows and return that as a single dataset [LIMIT]

Can we optimize this query now that we understand the order of operations? Remember the goal is to reduce the number of rows as early as possible.

  • We can move the WHERE filter into the join's ON clause so that only the essential rows come out of the join. For an inner join the result is equivalent, and most modern planners push the filter down automatically, so treat this as making the intent explicit (full query below).

    JOIN users ON orders.user_id = users.id AND orders.amount > 100

  • We could switch to a different join type if it reduced rows earlier (not applicable in this instance, though)
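
Putting it together, the rewritten query might look like this. The result is identical for an inner join; whether it's actually faster depends on the planner:

SELECT users.id, users.name, SUM(orders.amount) AS total_spent
FROM orders
JOIN users ON orders.user_id = users.id AND orders.amount > 100
GROUP BY users.id, users.name
HAVING SUM(orders.amount) > 500
ORDER BY total_spent DESC
LIMIT 5;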

Transactions

A transaction is a group of database operations executed as one unit: all of them have to succeed for the changes to be committed and saved. If a single operation fails, the database rolls back to the previous state. All or nothing!

Example: Take a financial system where a user needs to send money from their wallet to another user’s wallet. We would need to perform at least 3 write operations.

  1. Deduct the amount from their wallet balance

  2. Add the amount to the recipient wallet balance

  3. Create a log of the transaction with details like time of the transaction and amount sent

Without transactions we could just perform the 3 write operations one after the other and hope nothing fails. However, if a failure happens, e.g. the recipient doesn't exist, the sender's wallet balance would be reduced with no recipient credited and no record of the transaction. It wouldn't be an accurate record of events and we would end up with missing funds.

With a transaction, the three writes (updating the sender's balance, updating the recipient's balance, and adding a transaction log) are treated as a single unit. If the update on the recipient fails, the update on the sender's balance is rolled back and we return to the state before any of the operations started. No lost funds! Yay! And if all 3 write operations succeed, we commit and save all three writes.
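
As a sketch, the transfer could look like this in SQL (the wallets and transactions tables and their columns are assumptions for illustration):

BEGIN;

-- 1. Deduct the amount from the sender's wallet
UPDATE wallets SET balance = balance - 100 WHERE user_id = 1;

-- 2. Add the amount to the recipient's wallet
UPDATE wallets SET balance = balance + 100 WHERE user_id = 2;

-- 3. Log the transaction
INSERT INTO transactions (sender_id, recipient_id, amount, created_at)
VALUES (1, 2, 100, NOW());

COMMIT;  -- saves all three writes; on any failure, ROLLBACK undoes them all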

Indexing

Indexing is an optimization technique that improves the efficiency of reading data from tables.

Analogy 🤓

Think of a table as a book and the index as the table of contents.

If we want to find a chapter in a book, we could go through each page, until we find the exact chapter. On a good day, it could be the first chapter, worst case it could be the last chapter.

Or, we could use the table of contents to find the exact page and jump to it. That’s how database indexing works.

Indexes can either be clustered or non-clustered:

  • A clustered index determines the physical order of data on the table. Primary keys are usually clustered indexes by default.

  • A non-clustered index creates a separate data structure from the table that holds the indexed columns along with pointers (references) to the actual data rows. When records are inserted, updated or deleted, the database engine automatically updates the index.

💡
The implementation of indexes varies between databases: most use B-tree structures, some offer hash indexes, and there are specialized types such as full-text indexes.

Here is a sample SQL query creating an index on a column

CREATE INDEX idx_users_email ON users(email);
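
To confirm a query actually uses the index, most databases can show the query plan. In PostgreSQL, for example (the email value is just an illustration):

EXPLAIN ANALYZE
SELECT id, name FROM users WHERE email = 'jane@example.com';
-- With the index in place, the plan should show an index scan
-- using idx_users_email instead of a sequential scan.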

So, what columns on a table should we index? All these are great candidates for indexing

  • Primary keys (some databases index primary keys automatically)

  • Foreign keys (any column used to join tables)

  • Columns frequently used to filter when querying

  • Columns frequently used to sort or order

  • Columns that have a lot of unique values (if we index a column with low uniqueness, a single index will map to many rows all of which would have to be scanned, beating the purpose of indexing)

When should we avoid indexing?

  • Small tables. The benefits of the optimization only show up on large datasets.

  • Columns with few unique values (as mentioned above, a low-uniqueness index maps each value to many rows, all of which have to be scanned, defeating the purpose of indexing)

  • Columns that are updated frequently. Whenever rows are inserted, updated or deleted, every index on the affected columns has to be updated too, which creates extra overhead.

  • Don’t index every column on the table.

💡
Note: Indexing can slow down write operations, so be careful. Indexes come at a cost of storage and maintenance overhead: with write-heavy operations on an indexed column, the index has to be updated on every write, which is expensive and can be slow.

Data Normalization

Data normalization is the process of organizing data to reduce redundancy. It involves breaking down large tables into smaller related tables. We normalize data to optimize storage and maintain data consistency.

I bet you’ve heard about a set of rules used for normalization called Normal Forms. They are cumulative and build on top of each other.

If you have a dataset and have no idea about where to start in breaking it down into related tables, normalization rules are a great place to start.

Let's take a look at this dataset and follow the rules of normalization to transform it step by step: unnormalized → 1NF → 2NF → 3NF.

OrderID | CustomerName | CustomerPhone | ProductID | ProductName
1       | Alice        | 123-456-7890  | 101, 102  | Laptop, Mouse
2       | Bob          | 987-654-3210  | 103       | Keyboard

  1. First Normal Form (1NF)

    • Each table cell should have only one value

    • Each row should be unique

1NF: Some cells have multiple values; we need to break those up and create a row for each value. Each cell now has one single value, and each row is unique by the composite key of OrderID and ProductID.

OrderID | CustomerName | CustomerPhone | ProductID | ProductName
1       | Alice        | 123-456-7890  | 101       | Laptop
1       | Alice        | 123-456-7890  | 102       | Mouse
2       | Bob          | 987-654-3210  | 103       | Keyboard

  2. Second Normal Form (2NF)

    • The table should already satisfy 1NF

    • The table shouldn't have partial dependencies, i.e. every non-key column must depend on the whole primary key, not just part of a composite key

2NF: We want to remove any partial dependencies on the composite primary key (OrderID, ProductID). If a column doesn't depend on both parts of the key, we move it to a different table. CustomerName and CustomerPhone have nothing to do with ProductID, so we move those into another table. Now we have 2 tables.

OrderID | CustomerName | CustomerPhone
1       | Alice        | 123-456-7890
2       | Bob          | 987-654-3210

ID | OrderID | ProductID | ProductName
1  | 1       | 101       | Laptop
2  | 1       | 102       | Mouse
3  | 2       | 103       | Keyboard

  3. Third Normal Form (3NF)

    • The table should already satisfy 2NF

    • The table shouldn't have transitive dependencies, i.e. a non-key column should depend directly on the primary key, not on another non-key column

3NF: In this case ProductName depends on ProductID, which in turn relates to the order, so ProductName depends on the key only transitively; we should remove that. We move it to a different table and end up with 3 tables.

OrderID | CustomerName | CustomerPhone
1       | Alice        | 123-456-7890
2       | Bob          | 987-654-3210

OrderID | ProductID
1       | 101
1       | 102
2       | 103

ProductID | ProductName
101       | Laptop
102       | Mouse
103       | Keyboard
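
A sketch of the final 3NF schema in SQL (the table names here are hypothetical):

CREATE TABLE orders (
    order_id       INT PRIMARY KEY,
    customer_name  VARCHAR(100),
    customer_phone VARCHAR(20)
);

CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100)
);

-- Junction table linking orders to products
CREATE TABLE order_products (
    order_id   INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    PRIMARY KEY (order_id, product_id)
);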

Data denormalization

In some instances, we would want to add some redundancy to data - we call this denormalization. Why would we denormalize data?

  • If performance is a bottleneck, some redundancy is acceptable

  • To improve read query performance - many small tables require many joins, which can sometimes be slow and expensive

  • Analytics and reporting - storing aggregated data to reduce expensive recomputing

Use case for denormalization: in a reporting system where performance is critical, data might be denormalized to avoid expensive joins, such as storing aggregated sales data in a separate table for fast access (sketched below).
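
For instance, a summary table like this trades some redundancy for cheap reads (PostgreSQL-style; the daily_sales_summary table and the orders.created_at column are assumptions):

-- Precomputed daily totals so reports don't have to re-join and
-- re-aggregate the raw orders on every request
CREATE TABLE daily_sales_summary (
    sales_date   DATE PRIMARY KEY,
    total_orders INT NOT NULL,
    total_amount NUMERIC(12, 2) NOT NULL
);

-- Refreshed periodically, e.g. by a nightly job
INSERT INTO daily_sales_summary (sales_date, total_orders, total_amount)
SELECT created_at::date, COUNT(*), SUM(amount)
FROM orders
GROUP BY created_at::date;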


Best practices

  1. Database design

    • Use proper normalization

    • Use appropriate data types

    • Every table should have a primary key, and relationships should be enforced with foreign keys

    • Use indexing appropriately - if a column is read frequently, or used to order, filter or join, consider indexing it (if the column is updated frequently, maybe don't). Avoid over-indexing.

  2. Naming conventions

    • Use plural table names (because tables store many records)

    • Be consistent with the casing (stick to either CamelCase or snake_case, just don’t mix them)

    • Avoid reserved words

    • Use meaningful names, e.g. first_name instead of an ambiguous name

  3. Query optimization

    • Filter early (to reduce the number of rows to be processed), e.g. filter in WHERE or the join's ON clause rather than after aggregating

    • Paginate your data (Use LIMIT and OFFSET)

    • Avoid selecting all the columns, specify which ones you need

    • Monitor performance of queries to find opportunities for improvement (Use EXPLAIN ANALYZE)

    • Avoid wrapping columns in functions inside WHERE clauses, since that can prevent index use (a sketch combining several of these tips follows below)
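
A sketch combining several of these tips, reusing the users and orders tables from earlier (PostgreSQL-style):

-- Specific columns, an early filter, pagination, and a plan check
EXPLAIN ANALYZE
SELECT users.name, orders.amount
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.amount > 100    -- filter early
ORDER BY orders.amount DESC
LIMIT 20 OFFSET 0;           -- paginate instead of fetching everything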

  4. Transactions

    • Use transactions for critical operations to guarantee atomicity

    • Use short transactions

    • Beware of situations that could lead to deadlocks (see the sketch below)

      💡
      Some databases like PostgreSQL detect deadlocks automatically, but we can still design to reduce occurrences
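
One common pattern is to acquire row locks in a consistent order, so two concurrent transactions can't grab the same rows in opposite orders. A sketch reusing the hypothetical wallets table from the transactions section:

BEGIN;

-- Lock both wallets lowest-id-first, regardless of transfer direction,
-- so concurrent transfers between the same wallets queue up instead of deadlocking
SELECT balance FROM wallets WHERE user_id IN (1, 2) ORDER BY user_id FOR UPDATE;

UPDATE wallets SET balance = balance - 100 WHERE user_id = 1;
UPDATE wallets SET balance = balance + 100 WHERE user_id = 2;

COMMIT;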

Any questions? Leave a comment 🙂