The Basics of Relational Databases

Learning the Essentials of Relational Databases

Types of databases

  1. Relational databases (SQL) - data is stored in structured tables with columns and rows defined by schemas.

    • Examples: PostgreSQL, MySQL, SQLite

    • Why pick a relational database: If you have structured data with strong relationships and consistency is a huge priority

    • Use cases:

      • E-commerce - entities have strong relationships to each other (users, orders etc.)

      • Finance systems - complex relationships between entities where consistency is important

      • Others: Healthcare, inventory management, education systems, airlines, government systems etc.

    • Drawbacks:

      • Horizontal scaling can be complex

      • Rigid schema design that can make changes complex

  2. Non-relational databases (NoSQL) - data is stored in flexible structures such as key-value pairs, JSON documents, graphs, wide columns, XML, etc.

    • Examples: MongoDB (Document), Redis (Key-Value), Cassandra (Column), Neo4j (Graph)

    • Why pick a non-relational database: If you have unstructured or semi-structured data, where scalability is a huge priority

    • Use cases:

      • Caching data using key-value pairs (Redis)

      • Social relationships can be represented using graphs (Neo4j)

      • Collecting telemetry data, logs, analytics

    • Drawbacks:

      • Most DBs sacrifice strong consistency (for scalability)

      • Most DBs have limited support for complex queries, aggregations and joins

      • Most favor denormalization (duplicated data is harder to keep consistent and wastes storage)

      • Most don’t have schema enforcement, so data integrity can slip through the cracks if not enforced on the application level

💡
This article focuses on relational databases 🙂

Key concepts

Structure - tables, columns, rows, primary keys, foreign keys

A relational database is a collection of related data stored in tables.

  • A table is like a spreadsheet with columns and rows - If we want to store a user in the database, the Users entity would be the table.

  • A column is a single attribute of the data and defines the type of data stored - a user has a number of properties identifying them, such as name and age; those are columns.

  • A row is a record of one instance of the data - One data entry of a user named Jane Doe aged 31 would be an instance of the data which is a row.

  • A primary key (PK) is a column used to uniquely identify each entry (row) in the table - it can be auto-generated or provided at data insertion.

  • A foreign key (FK) is a column that references the primary key of another, related table - if a user has orders, we would have an Orders table with an FK column pointing back to the Users table's PK, so that when we find an order, we can easily follow the foreign key to the associated user (see the sketch below).
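
Here's a minimal sketch of these concepts in SQL (PostgreSQL-style syntax; the exact table and column definitions are illustrative assumptions):

CREATE TABLE users (
    id   SERIAL PRIMARY KEY,          -- primary key, auto-generated
    name VARCHAR(100) NOT NULL,       -- columns are typed attributes
    age  INT
);

CREATE TABLE orders (
    id      SERIAL PRIMARY KEY,
    amount  NUMERIC(10, 2) NOT NULL,
    user_id INT REFERENCES users(id)  -- foreign key back to users.id
);

-- Each INSERT creates one row: one record (instance) of the data
INSERT INTO users (name, age) VALUES ('Jane Doe', 31);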

Relationships

Relationships define how 2 tables are related. Take an example of a financial system where we have users who have a wallet and can make transactions such as sending money to other users.

  • One to one (1:1) - Each row on one table is linked to only 1 row in the other table. E.g. A user may only have one wallet and a wallet belongs to only one user.

  • One to many (1:N) - One row on one table can be linked with multiple rows in another table. E.g. One user can make multiple transactions and multiple transactions can belong to one user. The Transactions table would have a FK column pointing to the PK of the Users table.

  • Many to many (N:N) - Rows in one table can be linked to multiple rows in another table and vice versa. E.g. In an education system we can have Course and Student tables where a student can enroll in many courses and a course can have many students. To represent the relationship between these 2 tables, we might have a 3rd table called Enrollments that has 2 FK columns, as sketched below.
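
As a sketch, that many-to-many setup could look like this in SQL (students, courses and enrollments are hypothetical names):

CREATE TABLE students (
    id   SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE courses (
    id    SERIAL PRIMARY KEY,
    title VARCHAR(100) NOT NULL
);

-- The junction table: one row per (student, course) pair
CREATE TABLE enrollments (
    student_id INT REFERENCES students(id),
    course_id  INT REFERENCES courses(id),
    PRIMARY KEY (student_id, course_id)  -- the composite PK also prevents duplicate enrollments
);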

Joins

Joins are used to combine data from different tables. Types of Joins:

  1. INNER JOIN

  2. OUTER JOIN

    • LEFT (OUTER) JOIN

    • RIGHT (OUTER) JOIN

    • FULL OUTER JOIN

  3. CROSS JOIN

  4. SELF JOIN

To understand joins, let's take a simple example of a Users table and an Orders table where the relationship is one to many (1:N) (one user can have many orders and one order can belong to only one user).

Users table

id | name
1  | Jane Doe
2  | Lucy Dear
3  | David Karl

Orders table

id | amount | user_id
1  | 20     | 1
2  | 40     | NULL
3  | 60     | 1

We want to write a query that combines data from both tables. Here’s how the different joins work.

  1. INNER JOIN

Returns only the rows that have a match in both tables. Here we would return only users with orders and orders with valid users. If a user doesn't have an order, or an order doesn't have a valid user, the row won't be returned.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
INNER JOIN orders ON users.id = orders.user_id;

Result

name     | order_id | amount
Jane Doe | 1        | 20
Jane Doe | 3        | 60

  2. OUTER JOIN

a) LEFT (OUTER) JOIN

Returns all rows in the left table (Users) and the matching rows in the right table (Orders). Here we would return all users, even those without orders.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
LEFT JOIN orders ON users.id = orders.user_id;

Result

name       | order_id | amount
Jane Doe   | 1        | 20
Jane Doe   | 3        | 60
Lucy Dear  | NULL     | NULL
David Karl | NULL     | NULL

b) RIGHT (OUTER) JOIN

Returns all rows in the right table (Orders) and the matching rows in the left table (Users). Here we would return all orders, whether they have a valid user or not.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
RIGHT JOIN orders ON users.id = orders.user_id;

Result

name     | order_id | amount
Jane Doe | 1        | 20
Jane Doe | 3        | 60
NULL     | 2        | 40

c) FULL OUTER JOIN

Returns all rows from both tables regardless of the relationship, matched where possible and padded with NULLs where not. (Note: not every database supports FULL OUTER JOIN; MySQL, for example, doesn't, though it can be emulated with a UNION of a LEFT and a RIGHT join.)

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id;

Result

name       | order_id | amount
Jane Doe   | 1        | 20
Jane Doe   | 3        | 60
Lucy Dear  | NULL     | NULL
David Karl | NULL     | NULL
NULL       | 2        | 40

  3. CROSS JOIN

A cross join returns the Cartesian product: every combination of rows from both tables, regardless of any relationship, so the result has (rows in A) × (rows in B) rows and can get very large very quickly. It's rarely useful in practice. Here, every user would be paired with every order.

SELECT users.name, orders.id AS order_id, orders.amount
FROM users
CROSS JOIN orders;

Result

name       | order_id | amount
Jane Doe   | 1        | 20
Jane Doe   | 2        | 40
Jane Doe   | 3        | 60
Lucy Dear  | 1        | 20
Lucy Dear  | 2        | 40
Lucy Dear  | 3        | 60
David Karl | 1        | 20
David Karl | 2        | 40
David Karl | 3        | 60

  4. SELF JOIN

A self join is used when we want to join a table with itself, i.e. compare rows within the same table. For example, say we have a table of employees where employees have managers (who are also employees), and we want to find all employees and their managers.

Employees table

id | name         | manager_id
1  | Jane Doe     | 3
2  | Janet Monday | 3
3  | Judy Scalar  | 4
4  | Bruno Mars   | NULL

SELECT e1.name AS employee, e2.name AS manager
FROM employees e1
INNER JOIN employees e2 ON e1.manager_id = e2.id;

Result

employee     | manager
Jane Doe     | Judy Scalar
Janet Monday | Judy Scalar
Judy Scalar  | Bruno Mars
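
Note that the INNER JOIN drops Bruno Mars, who has no manager. If we also wanted employees without managers (shown with a NULL manager), a LEFT JOIN would keep them:

SELECT e1.name AS employee, e2.name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.id;
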
💡
INNER JOIN is often the fastest join because it returns only the rows with matches in both tables. Outer joins also have to produce the unmatched rows (padded with NULLs), which usually means more rows to process and fewer shortcuts for the query planner.

Querying

Is there a right way to form a query?

Let’s start by taking a look at the order of execution of statements. When we make a query, the database combines data from various tables and returns a single dataset that we can think of as another new table.

Order of execution: Load tables → Filter → Group → Filter groups → Select columns → Order → Limit

Take a look at this query that returns the top 5 users who have spent more than $500 on orders, considering only orders above $100. Can you tell the order of operation?

SELECT users.id, users.name, SUM(orders.amount) AS total_spent
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.amount > 100
GROUP BY users.id, users.name
HAVING SUM(orders.amount) > 500
ORDER BY total_spent DESC
LIMIT 5;
  1. Identify the source table (orders - the left table, since it appears in FROM) and load it [FROM]

  2. Identify the secondary table (users - the right table) and join the tables using an inner join on user_id [JOIN]

  3. Filter the new dataset so that we only have rows where amount > 100 [WHERE]

  4. Group and aggregate the rows by user id and name. Each row will have a unique user with their name, id and sum of the amount [GROUP BY]

  5. Filter the groups. Now we only have users whose total spent amount > 500. (The SELECT alias total_spent generally can't be referenced here, because HAVING runs before SELECT; that's why the query repeats the aggregate.) [HAVING]

  6. Select the columns. Now we only have id, name and amount as total_spent columns [SELECT]

  7. Order by total amount spent, with the highest spenders at the top [ORDER BY]

  8. Take the first 5 rows and return that as a single dataset [LIMIT]

Can we optimize this query now that we understand the order of operations? Remember the goal is to reduce the number of rows as early as possible.

  • We can move the WHERE filter into the join's ON clause so that only the essential rows come out of the join. For an inner join the result is equivalent, and most modern planners push the filter down automatically, so treat this as making the intent explicit (full query below).

    JOIN users ON orders.user_id = users.id AND orders.amount > 100

  • We could switch to a different join type if it reduced rows earlier (not applicable in this instance, though)
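
Putting it together, the rewritten query might look like this. The result is identical for an inner join; whether it's actually faster depends on the planner:

SELECT users.id, users.name, SUM(orders.amount) AS total_spent
FROM orders
JOIN users ON orders.user_id = users.id AND orders.amount > 100
GROUP BY users.id, users.name
HAVING SUM(orders.amount) > 500
ORDER BY total_spent DESC
LIMIT 5;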

Transactions

A transaction is a group of database operations executed as one unit: all of them have to succeed for the changes to be committed and saved. If a single operation fails, the database rolls back to the previous state. All or nothing!

Example: Take a financial system where a user needs to send money from their wallet to another user’s wallet. We would need to perform at least 3 write operations.

  1. Deduct the amount from their wallet balance

  2. Add the amount to the recipient wallet balance

  3. Create a log of the transaction with details like time of the transaction and amount sent

Without transactions we could just perform the 3 write operations one after the other and hope nothing fails. However, if a failure happens, e.g. the recipient doesn't exist, the sender's wallet balance would be reduced with no recipient credited and no record of the transaction. It wouldn't be an accurate record of events and we would end up with missing funds.

With a transaction, the three writes (updating the sender's balance, updating the recipient's balance, and adding a transaction log) are treated as a single unit. If the update on the recipient fails, the update on the sender's balance is rolled back and we return to the state before any of the operations started. No lost funds! Yay! And if all 3 write operations succeed, we commit and save all three writes.
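
As a sketch, the transfer could look like this in SQL (the wallets and transactions tables and their columns are assumptions for illustration):

BEGIN;

-- 1. Deduct the amount from the sender's wallet
UPDATE wallets SET balance = balance - 100 WHERE user_id = 1;

-- 2. Add the amount to the recipient's wallet
UPDATE wallets SET balance = balance + 100 WHERE user_id = 2;

-- 3. Log the transaction
INSERT INTO transactions (sender_id, recipient_id, amount, created_at)
VALUES (1, 2, 100, NOW());

COMMIT;  -- saves all three writes; on any failure, ROLLBACK undoes them all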

Indexing

Indexing is an optimization technique that improves the efficiency of reading data from tables.

Analogy 🤓

Think of a table as a book and the index as the table of contents.

If we want to find a chapter in a book, we could go through each page, until we find the exact chapter. On a good day, it could be the first chapter, worst case it could be the last chapter.

Or, we could use the table of contents to find the exact page and jump to it. That’s how database indexing works.

Indexes can either be clustered or non-clustered:

  • A clustered index determines the physical order of data on the table. Primary keys are usually clustered indexes by default.

  • A non-clustered index creates a separate data structure from the table that holds the indexed columns along with pointers (references) to the actual data rows. When records are inserted, updated or deleted, the database engine automatically updates the index.

💡
The implementation of indexes varies between databases: most use B-tree structures, some offer hash indexes, and there are specialized types such as full-text indexes.

Here is a sample SQL query creating an index on a column

CREATE INDEX idx_users_email ON users(email);
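
To confirm a query actually uses the index, most databases can show the query plan. In PostgreSQL, for example (the email value is just an illustration):

EXPLAIN ANALYZE
SELECT id, name FROM users WHERE email = 'jane@example.com';
-- With the index in place, the plan should show an index scan
-- using idx_users_email instead of a sequential scan.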

So, what columns on a table should we index? All these are great candidates for indexing

  • Primary keys (some databases index primary keys automatically)

  • Foreign keys (any column used to join tables)

  • Columns frequently used to filter when querying

  • Columns frequently used to sort or order

  • Columns that have a lot of unique values (if we index a column with low uniqueness, a single index will map to many rows all of which would have to be scanned, beating the purpose of indexing)

When should we avoid indexing?

  • Small tables. The benefits of the optimization only show up on large datasets.

  • Columns with few unique values (as mentioned above, a low-uniqueness index maps each value to many rows, all of which have to be scanned, defeating the purpose of indexing)

  • Columns that are updated frequently. Whenever rows are inserted, updated or deleted, every index on the affected columns has to be updated too, which creates extra overhead.

  • Don’t index every column on the table.

💡
Note: Indexing can slow down write operations, so be careful. Indexes come at a cost of storage and maintenance overhead: with write-heavy operations on an indexed column, the index has to be updated on every write, which is expensive and can be slow.

Data Normalization

Data normalization is the process of organizing data to reduce redundancy. It involves breaking down large tables into smaller related tables. We normalize data to optimize storage and maintain data consistency.

I bet you’ve heard about a set of rules used for normalization called Normal Forms. They are cumulative and build on top of each other.

If you have a dataset and have no idea about where to start in breaking it down into related tables, normalization rules are a great place to start.

Let's take a look at this dataset and follow the rules of normalization to transform it step by step: unnormalized → 1NF → 2NF → 3NF.

OrderID | CustomerName | CustomerPhone | ProductID | ProductName
1       | Alice        | 123-456-7890  | 101, 102  | Laptop, Mouse
2       | Bob          | 987-654-3210  | 103       | Keyboard

  1. First Normal Form (1NF)

    • Each table cell should have only one value

    • Each row should be unique

1NF: Some cells have multiple values; we need to break those up and create a row for each value. Each cell now has one single value, and each row is unique by the composite key of OrderID and ProductID.

OrderID | CustomerName | CustomerPhone | ProductID | ProductName
1       | Alice        | 123-456-7890  | 101       | Laptop
1       | Alice        | 123-456-7890  | 102       | Mouse
2       | Bob          | 987-654-3210  | 103       | Keyboard

  2. Second Normal Form (2NF)

    • The table should already satisfy 1NF

    • The table shouldn't have partial dependencies, i.e. every non-key column must depend on the whole primary key, not just part of a composite key

2NF: We want to remove any partial dependencies on the composite primary key (OrderID, ProductID). If a column doesn't depend on both parts of the key, we move it to a different table. CustomerName and CustomerPhone have nothing to do with ProductID, so we move those into another table. Now we have 2 tables.

OrderID | CustomerName | CustomerPhone
1       | Alice        | 123-456-7890
2       | Bob          | 987-654-3210

ID | OrderID | ProductID | ProductName
1  | 1       | 101       | Laptop
2  | 1       | 102       | Mouse
3  | 2       | 103       | Keyboard

  3. Third Normal Form (3NF)

    • The table should already satisfy 2NF

    • The table shouldn't have transitive dependencies, i.e. a non-key column should depend directly on the primary key, not on another non-key column

3NF: In this case ProductName depends on ProductID, which in turn relates to the order, so ProductName depends on the key only transitively; we should remove that. We move it to a different table and end up with 3 tables.

OrderID | CustomerName | CustomerPhone
1       | Alice        | 123-456-7890
2       | Bob          | 987-654-3210

OrderID | ProductID
1       | 101
1       | 102
2       | 103

ProductID | ProductName
101       | Laptop
102       | Mouse
103       | Keyboard
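
A sketch of the final 3NF schema in SQL (the table names here are hypothetical):

CREATE TABLE orders (
    order_id       INT PRIMARY KEY,
    customer_name  VARCHAR(100),
    customer_phone VARCHAR(20)
);

CREATE TABLE products (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100)
);

-- Junction table linking orders to products
CREATE TABLE order_products (
    order_id   INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    PRIMARY KEY (order_id, product_id)
);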

Data denormalization

In some instances, we would want to add some redundancy to data - we call this denormalization. Why would we denormalize data?

  • If performance is a bottleneck, some redundancy is acceptable

  • To improve read query performance - many small tables require many joins, which can sometimes be slow and expensive

  • Analytics and reporting - storing aggregated data to reduce expensive recomputing

Use case for denormalization: in a reporting system where performance is critical, data might be denormalized to avoid expensive joins, such as storing aggregated sales data in a separate table for fast access (sketched below).
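
For instance, a summary table like this trades some redundancy for cheap reads (PostgreSQL-style; the daily_sales_summary table and the orders.created_at column are assumptions):

-- Precomputed daily totals so reports don't have to re-join and
-- re-aggregate the raw orders on every request
CREATE TABLE daily_sales_summary (
    sales_date   DATE PRIMARY KEY,
    total_orders INT NOT NULL,
    total_amount NUMERIC(12, 2) NOT NULL
);

-- Refreshed periodically, e.g. by a nightly job
INSERT INTO daily_sales_summary (sales_date, total_orders, total_amount)
SELECT created_at::date, COUNT(*), SUM(amount)
FROM orders
GROUP BY created_at::date;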


Best practices

  1. Database design

    • Use proper normalization

    • Use appropriate data types

    • Every table should have a primary key, and relationships should be enforced with foreign keys

    • Use indexing appropriately - if a column is read frequently, or used to order, filter or join, consider indexing it (if the column is updated frequently, maybe don't). Avoid over-indexing.

  2. Naming conventions

    • Use plural table names (because tables store many records)

    • Be consistent with the casing (stick to either CamelCase or snake_case, just don’t mix them)

    • Avoid reserved words

    • Use meaningful names, e.g. first_name instead of an ambiguous name

  3. Query optimization

    • Filter early (to reduce the number of rows to be processed), e.g. filter in WHERE or the join's ON clause rather than after aggregating

    • Paginate your data (Use LIMIT and OFFSET)

    • Avoid selecting all the columns, specify which ones you need

    • Monitor performance of queries to find opportunities for improvement (Use EXPLAIN ANALYZE)

    • Avoid wrapping columns in functions inside WHERE clauses, since that can prevent index use (a sketch combining several of these tips follows below)
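
A sketch combining several of these tips, reusing the users and orders tables from earlier (PostgreSQL-style):

-- Specific columns, an early filter, pagination, and a plan check
EXPLAIN ANALYZE
SELECT users.name, orders.amount
FROM orders
JOIN users ON orders.user_id = users.id
WHERE orders.amount > 100    -- filter early
ORDER BY orders.amount DESC
LIMIT 20 OFFSET 0;           -- paginate instead of fetching everything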

  4. Transactions

    • Use transactions for critical operations to guarantee atomicity

    • Use short transactions

    • Beware of situations that could lead to deadlocks (see the sketch below)

      💡
      Some databases like PostgreSQL detect deadlocks automatically, but we can still design to reduce occurrences
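
One common pattern is to acquire row locks in a consistent order, so two concurrent transactions can't grab the same rows in opposite orders. A sketch reusing the hypothetical wallets table from the transactions section:

BEGIN;

-- Lock both wallets lowest-id-first, regardless of transfer direction,
-- so concurrent transfers between the same wallets queue up instead of deadlocking
SELECT balance FROM wallets WHERE user_id IN (1, 2) ORDER BY user_id FOR UPDATE;

UPDATE wallets SET balance = balance - 100 WHERE user_id = 1;
UPDATE wallets SET balance = balance + 100 WHERE user_id = 2;

COMMIT;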

Any questions? Leave a comment 🙂