Performance and Databases: Part 1
Let’s talk about cardinality!
Good day!
Intro
There is always a war going on to get the best performance out of a database! There is plenty of thorough documentation and strategy out there already. Today, I don’t want to bore you with a lot of technical jargon; instead, I will take a simple, concise approach and share some ideas I rely on when working with a database!
For simplicity, we will keep the discussion limited to document stores like MongoDB, AWS DocumentDB, et al. But theoretically, it should apply to almost all databases.
Who is it for?
If you are an established team that has maintained a database for a while, with slowly and steadily growing data, I am sure you have that story/defect lying around about fixing database query issues!
If you are a new team, this should help you design a well-performing DB from the start.
How can I extract best performance out of a DB?
Indexes (or Indices)! Duh!
Have the right indices for your query and you are done! No, it’s not enough. It’s an easy escape, a lazy approach to a performance problem!
Here are some key points I would advise everyone to follow as you start your own beautiful journey of running a database 🙂
Data distribution + indices
A well-designed database with appropriate indices can handle billions of documents! Yes, it’s not a pipe dream.
But it totally depends on the data distribution, i.e. cardinality. Now let’s rephrase it:
A well-designed database with appropriate indices and cardinality, i.e. data distribution, can handle billions of documents! Yes, it’s not a pipe dream. And yes, it can happen in a few milliseconds too!
Before we get into the details of cardinality or data distribution, let’s take a simple human example. How do we find a topic in a huge text/reference book? There are a few ways…
Bookmarks…. Table of contents…. Index
The index of a book is generally at the back of our text/reference books, where the book is practically arranged in alphabetical order of topics/keys.
Let’s say you are studying a book on algorithms and you want to quickly read up on linked lists: just go to the index page, jump to the letter L, find the page number for the topic Linked List, and voilà! It only takes a few seconds.
I would like to use this as an analogy for our database index as well. This is why an index is important: it finds a key in an index that is already sorted and easy to search, and it is VERY, VERY FAST at it!
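The book-index analogy can be sketched in a few lines of code. This is a simplification of what a real B-tree index does, using a small made-up topic list, but it shows the core trick: because the keys are sorted, a binary search finds the entry in O(log n) steps instead of scanning every page.

```python
import bisect

# Illustrative "book index": sorted (topic, page) pairs, not a real DB structure.
index = sorted([
    ("Arrays", 12), ("Graphs", 210), ("Hash Tables", 98),
    ("Linked Lists", 45), ("Queues", 60), ("Trees", 150),
])
keys = [key for key, _ in index]

def lookup(topic):
    """Binary-search the sorted keys: O(log n) instead of an O(n) scan."""
    pos = bisect.bisect_left(keys, topic)
    if pos < len(keys) and keys[pos] == topic:
        return index[pos][1]  # the page number for that topic
    return None  # topic not in the index

print(lookup("Linked Lists"))  # → 45
```

With a million sorted keys, this lookup needs about 20 comparisons; a scan would need up to a million.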
There is a small catch!
Now that we are clear on how an index helps us find a key/topic very quickly, let’s look at a bad index page.
Suppose you search for the key/topic “linked list” again. With a badly written index page, it’s so difficult to find the right entry! This index page just shows you every page where the words “linked list” occur. You now have to go through each page, read the topic, and decide whether it’s the page you are looking for. This is a waste of time!
To map this analogy back to a database: if the field or index on which we store data has just the word “linked list” over and over, we are making it difficult for the database to query on the key “linked list”. This is “LOW CARDINALITY”; in other words, there are too many duplicates of that key in the index. This is what most choking databases in the world are struggling with!
Solve this, and you have (generally) solved your database query performance for good!
I bet you did not look at the first index page in detail; let’s look at it again 🙂
That L section is a well-written index! You can clearly tell which topic is on which page. It’s fast and efficient for us to find the relevant key/topic.
High Cardinality
An index for a field should only be finalized if its cardinality is high, i.e. the data is mostly unique. As a rough guideline, duplicates should make up only a small share of the keys in an index (say 1–2%); in practice that may not always be achievable, and the rule can be relaxed by another 2–3%.
Here is an example of the MOST unique index:
EmployeeID in a company is always unique. If we have an index on the EmployeeID field, it’s the best possible index: a query on it will perform in a few milliseconds!
| EmployeeID | FirstName | LastName | Department | Salary |
|------------|-----------|----------|--------------|--------|
| E1001 | John | Doe | Engineering | 75000 |
| E1002 | Jane | Smith | Marketing | 60000 |
| E1003 | Robert | Johnson | HR | 65000 |
| E1004 | Emily | Davis | Finance | 80000 |
| E1005 | Michael | White | Engineering | 70000 |
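One quick way to sanity-check an index candidate is its cardinality ratio: distinct values divided by total rows. Here is a minimal sketch over the table above (the field names and a Department column are taken from it; real databases expose similar statistics through their own tooling, so treat this as an illustration, not a production technique):

```python
# Rows from the employee table above, trimmed to the fields we compare.
rows = [
    {"EmployeeID": "E1001", "Department": "Engineering"},
    {"EmployeeID": "E1002", "Department": "Marketing"},
    {"EmployeeID": "E1003", "Department": "HR"},
    {"EmployeeID": "E1004", "Department": "Finance"},
    {"EmployeeID": "E1005", "Department": "Engineering"},
]

def cardinality_ratio(rows, field):
    """Distinct values / total rows: 1.0 means fully unique."""
    distinct = {row[field] for row in rows}
    return len(distinct) / len(rows)

print(cardinality_ratio(rows, "EmployeeID"))  # → 1.0 (perfect index candidate)
print(cardinality_ratio(rows, "Department"))  # → 0.8 here, but it shrinks fast as rows grow
```

EmployeeID stays at 1.0 no matter how many employees you add; Department’s ratio collapses toward zero as the table grows, which is exactly the low-cardinality trap described next.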
Here is an example of the LEAST unique index:
In this dataset, the “First Name” column has only five distinct names, resulting in low cardinality. This could be representative of scenarios where a limited set of first names is encountered in a specific context or dataset.
| ID | First Name | Last Name | Age | Country |
|-----|------------|-------------|-----|----------------|
| 1 | John | Smith | 30 | United States |
| 2 | Sarah | Johnson | 25 | Canada |
| 3 | Michael | Doe | 28 | United States |
| 4 | Jessica | Williams | 32 | Australia |
| 5 | Emily | Smith | 35 | United Kingdom |
| 6 | John | Davis | 27 | Canada |
| 7 | Sarah | Anderson | 31 | United States |
| 8 | Michael | Brown | 29 | Australia |
| 9 | Jessica | Wilson | 33 | United Kingdom |
| 10 | Emily | Johnson | 26 | Canada |
| 11 | John | Taylor | 29 | United States |
| 12 | Sarah | White | 34 | Canada |
| 13 | Michael | Harris | 31 | United States |
| 14 | Jessica | Thompson | 27 | Australia |
| 15 | Emily | Clark | 30 | United Kingdom |
| 16 | John | Turner | 28 | Canada |
| 17 | Sarah | Martinez | 33 | United States |
| 18 | Michael | Cooper | 26 | Australia |
| 19 | Jessica | Bailey | 32 | United Kingdom |
| 20 | Emily | Mitchell | 25 | Canada |
| 21 | John | Phillips | 30 | United States |
| 22 | Sarah | Reed | 29 | Canada |
| 23 | Michael | Morgan | 31 | United States |
| 24 | Jessica | Reed | 34 | Australia |
| 25 | Emily | Turner | 27 | United Kingdom |
| 26 | John | Lewis | 32 | Canada |
| 27 | Sarah | Foster | 28 | United States |
| 28 | Michael | Walker | 30 | Australia |
| 29 | Jessica | Collins | 26 | United Kingdom |
| 30 | Emily | Adams | 33 | Canada |
Now, imagine the above collection of data at LARGE SCALE, with thousands or millions of documents sharing those few first names! Even with an index on First Name, it’s not going to help the DB perform the query any faster.
This is when everyone makes the rookie mistake of scaling up the database hardware to temporarily paper over the CPU/memory issues that such queries/indices cause!
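To make the scale problem concrete, here is a small simulation (synthetic data, using the five first names from the table above): even with an index, an equality query on a five-value field still matches roughly a fifth of all documents, so the database ends up fetching a huge slice of the collection anyway.

```python
import random

# Synthetic collection: one million documents, only five distinct first names.
random.seed(42)  # fixed seed so the simulation is repeatable
names = ["John", "Sarah", "Michael", "Jessica", "Emily"]
collection = [random.choice(names) for _ in range(1_000_000)]

# An "indexed" equality query on first name still selects ~20% of everything.
matches = sum(1 for name in collection if name == "John")
print(f"rows matched: {matches} ({matches / len(collection):.0%})")
```

An index that narrows a million documents down to two hundred thousand has done almost nothing for you; compare that with EmployeeID, which narrows a million down to one.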
Never throw hardware at a performance problem
With the above examples in mind, we can avoid indices (or indexes) on some dangerous data types:
- bools
A bool is either true or false! It’s never unique.
- Enumerations/Statuses
If you have a field where you store the status of a task, like “not started”, “in progress” or “done”, it should not have an index!
Why? Simple: for the most part, this field will be “done”, which covers almost the whole table or collection. If you query for the “done” status, you are practically asking the database to scan the entire collection!
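The status-field trap above can be shown with a quick tally (the proportions are made up, but they reflect the typical shape of a mature task collection):

```python
from collections import Counter

# Hypothetical task collection: almost everything has finished by now.
statuses = (["done"] * 9_500) + (["in progress"] * 400) + (["not started"] * 100)

counts = Counter(statuses)
for status, count in counts.items():
    print(f"{status:>12}: {count / len(statuses):.0%} of collection")
# "done" alone covers 95% of the rows, so an index lookup on it is
# barely better than a full collection scan.
```

Note the flip side: querying for the *rare* value (“not started”) is actually selective. The danger is that the common value dominates, and that is usually the one people query.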
- Ranges
Sometimes there is a need for querying on ranges.
Example:
Get me data where a field is: less than a number value/date value
Get me data where a field is: greater than a number value/date value
These are a few examples where a single-field index is a BAD option. Being extremely low in cardinality, the database has to SCAN a huge portion of the index and sacrifice CPU/memory to serve you the query.
Rule of thumb: if you and your teammates are discussing an index on bools, enums, or ranges, you have already dug a grave for your database.
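A range predicate often behaves the same way as a low-cardinality key: it can match most of the collection. A small sketch with synthetic dates (one document per day over two years, with a made-up cutoff) shows how wide a “less than” query can be:

```python
from datetime import date, timedelta

# Synthetic data: one document created per day over the last two years.
today = date(2024, 1, 1)
created = [today - timedelta(days=d) for d in range(730)]

# "Get me everything older than 30 days" — a typical range query.
cutoff = today - timedelta(days=30)
old_docs = [d for d in created if d < cutoff]
print(f"{len(old_docs)} of {len(created)} docs match "
      f"({len(old_docs) / len(created):.0%})")  # ~96% of the collection
```

Even with an index on the date field, the database must walk almost the entire index and fetch nearly every document, which is why a single-field index on a wide range predicate buys so little.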
Now, you might be wondering that I have eliminated practically all the important data types for an index. This is where compound indices (or indexes), range-query strategies, etc. come into play, but that’s a topic for another day!
This story is meant to help you identify whether your queries/data are choking because of cardinality.
Based on the response to this topic, I will talk about a few more stories around performance.
Follow-up part here.