Performances, Databases — Part 2
Data types and APIs
Good day!
Before you read this story, please go through my Part 1, then come back. Previously, I talked about how an index qualifies as a good index based on high cardinality. I also listed a few BAD single-field indexes (or indices) that we should avoid. In this story, we will talk about the bool data type from that list.
The infamous bools
Yes, bools are bad, especially if you create a single-field index on one.
Below is an example of NoSQL documents that store tasks of some kind in a task collection/table.
task_id is a Universally Unique Identifier (UUID) string; virtually every language has a way to generate one (a quick sketch follows this field list). As the name suggests, it is always unique.
task_name is a string type.
description is a string type.
due_date is a string type, or can be a date type depending on the database. Here it is in ISO 8601 format, which lets us parse it into a generic date type in our code and extract different time resolutions.
completed is a bool type.
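As promised, a minimal sketch of generating a UUID, here in Node.js (the randomUUID helper is part of Node's standard crypto module; any language's UUID library works just as well):

// Node.js 14.17+ ships a UUID v4 generator in its standard library
const { randomUUID } = require("crypto");
const task_id = randomUUID(); // e.g. "8a96a0d8-9e36-47d1-88cf-89c07a42d1e1"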
[
  {
    "task_id": "8a96a0d8-9e36-47d1-88cf-89c07a42d1e1",
    "task_name": "Write Blog Post",
    "description": "Create an informative blog post on a relevant topic",
    "due_date": "2023-07-01T08:00:00Z",
    "completed": false
  },
  {
    "task_id": "ba3cfe05-2b7c-4d98-98f7-2283c3e4b0c2",
    "task_name": "Prepare Presentation",
    "description": "Create a compelling presentation for the upcoming meeting",
    "due_date": "2023-07-05T10:30:00Z",
    "completed": true
  },
  {
    "task_id": "7f5e65f2-2e80-42a2-bf2e-8a4e3f4b8c89",
    "task_name": "Code Refactoring",
    "description": "Refactor codebase to improve efficiency and maintainability",
    "due_date": "2023-07-10T15:45:00Z",
    "completed": false
  },
  {
    "task_id": "b13c9742-2981-4f66-9ef2-07a0db60cc1e",
    "task_name": "Review Project Plan",
    "description": "Review and update the project plan based on recent changes",
    "due_date": "2023-07-15T12:00:00Z",
    "completed": true
  },
  {
    "task_id": "e0b0a417-48a7-4f84-9c37-b4e547a7db1c",
    "task_name": "Client Meeting",
    "description": "Prepare for and conduct a meeting with a key client",
    "due_date": "2023-07-20T09:15:00Z",
    "completed": false
  },
  {
    "task_id": "c590a6c2-89d1-4f8a-8d16-2533f2462b4a",
    "task_name": "Update Documentation",
    "description": "Ensure project documentation is up-to-date and comprehensive",
    "due_date": "2023-07-25T14:30:00Z",
    "completed": true
  },
  {
    "task_id": "d14a2f45-8372-4e61-b5f0-9e73d2f88154",
    "task_name": "Test New Feature",
    "description": "Conduct thorough testing on the newly implemented feature",
    "due_date": "2023-07-30T11:45:00Z",
    "completed": false
  },
  {
    "task_id": "8f32e5a4-38d9-46b2-b7c8-33a89d3b86f0",
    "task_name": "Attend Workshop",
    "description": "Participate in a workshop to enhance skills and knowledge",
    "due_date": "2023-08-05T13:20:00Z",
    "completed": true
  },
  {
    "task_id": "4cfeb7e8-8bea-4bcf-bd7e-289545032d20",
    "task_name": "Project Deployment",
    "description": "Coordinate and oversee the deployment of the project",
    "due_date": "2023-08-10T16:00:00Z",
    "completed": false
  },
  {
    "task_id": "63a849d1-93b4-44d2-8d6a-108c4c53c6b3",
    "task_name": "Write Test Cases",
    "description": "Create comprehensive test cases for project testing",
    "due_date": "2023-08-15T09:30:00Z",
    "completed": true
  }
]
In these sample documents, the completed bool field has a 50/50 data distribution. At some point, all of these bools will be updated to true as tasks are completed.
Let's talk about why querying on a bool is a bad idea, using a simple query on "completed": true. We are basically asking the DB to give us all tasks whose completed status is true! Most databases offer an option to profile or explain our queries, and the explain data tells us how well (or badly) a query and an index will perform. I will use MongoDB/DocumentDB explain for this story.
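In MongoDB/AWS DocumentDB shell syntax, the single-field index and the explain command look roughly like this (a sketch; the collection name task matches the example above):

// Create the (bad) single-field index on the bool field;
// MongoDB names it "completed_1" by default
db.task.createIndex({ completed: 1 })

// Ask the DB to explain how it executes the query
db.task.find({ completed: true }).explain("executionStats")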
The example above combines a query and explain into a single command for MongoDB/AWS DocumentDB. Below is a sample extract from an explain output (generated using the ChatGPT AI tool). (Full details on explain are out of the scope of this story; you can find them here.)
{
  ...
  "executionStats": {
    "executionSuccess": true,
    "nReturned": 1000,
    "executionTimeMillis": 5,
    "totalKeysExamined": 1000,
    "totalDocsExamined": 1000,
    "executionStages": {
      "stage": "FETCH",
      "filter": {
        "completed": {
          "$eq": true
        }
      },
      "nReturned": 1000,
      "executionTimeMillisEstimate": 4,
      "works": 1001,
      "docsExamined": 1000,
      "inputStage": {
        "stage": "IXSCAN",
        "nReturned": 1000,
        "executionTimeMillisEstimate": 2,
        "works": 1001,
        "indexName": "completed_1",
        "isMultiKey": false,
        "multiKeyPaths": {
          "completed": []
        },
        "direction": "forward",
        "keysExamined": 1000
      }
    }
  },
  ...
}
Let's look only at the inputStage to keep things simple.
The input stage tells us that the DB picked a single-field index called completed_1 (the one on the bool field), performed 1001 works, and examined 1000 keys on the index to get all completed = true documents in this sample collection/table.
Imagine a work as a small unit of effort the DB spends to fetch data for you. This is the cost the DB has to pay: simply put, the more work units and keys examined, the more effort the DB puts into the query, and hence the higher the cost in resources/performance. Imagine if this table held 1 million documents and all completed values were true! The DB would perform 1 million works to fetch all the trues. As I mentioned in Part 1, a database can handle data in the billions, but with this index we can see where this query is heading: it will fail on us!
Swiss Army knives are bad!
Have you ever used a Swiss Army knife? It's never quite useful; it does not do any one thing right. It can't be good scissors, it can't be a good knife, it can't be a good corkscrew!
Same is the case with badly designed APIs.
Why am I talking about APIs?
An API generally interfaces with a server (and hence a DB). For the task collection/table above, the API's design will make or break the DB.
A bad API design
completed: see the explain section above.
Example bad API query: Get me all tasks with completed status true!
Matches everything. Lots of work units! (1 million units.)
task_name can also be a bad single-field index, for the obvious reason that this application likely supports only a handful of distinct task names. That means each task_name value will be duplicated tens of thousands of times across the 1 million task documents stored by various users!
Example bad API query: Get me all tasks with the name "Update Documentation"
May match tens of thousands of documents. Unpredictable data size. Lots of work units! (Both bad queries are sketched below.)
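In shell syntax, a sketch of these two bad queries:

// Bad: can match up to the entire collection
db.task.find({ completed: true })

// Bad: unpredictable match count, possibly tens of thousands
db.task.find({ task_name: "Update Documentation" })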
A good API design
task_id is the best bet. When an API caller stores task data through this API, a task_id is returned. The caller can always check the status of THE task_id, and since it is a UUID with the highest (and best) cardinality, it will always do wonders. You can also imagine how few work units the DB consumes for this query!
Example good API query: Get me THE task with task_id "d14a2f45-8372-4e61-b5f0-9e73d2f88154"
There is only 1 match. Perfect index. 1 work unit!
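In shell syntax (a sketch; the unique option is my assumption about how a UUID key would typically be enforced):

// A unique single-field index on the UUID: the highest possible cardinality
db.task.createIndex({ task_id: 1 }, { unique: true })

// At most one document can match: 1 key examined, 1 work unit
db.task.find({ task_id: "d14a2f45-8372-4e61-b5f0-9e73d2f88154" })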
I know, I know, you need more from your API! Use cases that list several tasks/records from such APIs are very common.
The only way out is to ensure your query always hits "high cardinality" by combining multiple query fields into one index.
Compound index
A combination of multiple fields in an index is a compound index (deep dive here).
Note: This story does not talk about “order of the index fields” or “compound index strategies”.
In the task collection/table we can perform different combinations of queries:
task_name + completed
A query example: (pseudo syntax)
"task_name": "Update documentation" AND "completed" : true
We MAY match thousands of documents on task_name first, and on top of those matches we filter for completed = true. This seems like a good compound index, but only if we can estimate the count of each task_name in these 1 million documents; otherwise this too can be expensive.
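In shell syntax, this compound index and query could look like the following sketch (index field order matters in practice, but as noted above that is out of scope here):

// Compound index over both fields
db.task.createIndex({ task_name: 1, completed: 1 })

// Both predicates can now be answered by the compound index
db.task.find({ task_name: "Update Documentation", completed: true })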
This is where we make the API less generic (my Swiss Army knife analogy).
Let's add due_date to the index and the query.
due_date + task_name + completed
A query example: (pseudo syntax)
("due_date" > "2023-08-05T13:20:00Z" AND "due_date" < "2023-08-06T13:20:00Z")
AND
"task_name": "Update documentation" AND "completed" : true
By adding a date to the query clause, we automatically reduce the AMOUNT of data that can possibly match on task_name and completed. The due_date clause here is a "range query", i.e. we are asking the DB to fetch data between two boundaries, narrowing down the dataset it has to work with.
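A sketch of the corresponding index and query in shell syntax (since due_date is stored as an ISO 8601 string, lexicographic comparison with $gt/$lt behaves like a date range):

// Compound index with the range field included
db.task.createIndex({ due_date: 1, task_name: 1, completed: 1 })

// The range on due_date narrows the candidate set for the other predicates
db.task.find({
  due_date: { $gt: "2023-08-05T13:20:00Z", $lt: "2023-08-06T13:20:00Z" },
  task_name: "Update Documentation",
  completed: true
})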
I will leave it to the reader's imagination what a year's worth of task data looks like when queried with and without dates. Crazy, right?
Strategies
Even with this approach, we cannot predict the amount of data stored in a particular date range; some range queries/compound indexes may still result in too many matches! This is why I brought the API into the discussion.
No application or API in the world can, or should, support queries over unbounded data efficiently and instantly.
And if an application has to support such queries, there are certain strategies we can work with.
Break it down! Always work with small chunks of data.
There are a lot of strategies out there: archiving, sharding, and APIs with cursors and throttles, to name a few. One of them is sketched below.
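As one illustrative sketch (the page size and bookmark scheme are my assumptions, not a prescription), an API can page through a date range in small chunks instead of answering one unbounded query:

// Keyset-style pagination: fetch tasks in small chunks, using the last
// seen due_date as the bookmark for the next page (this assumes due_date
// is unique enough to act as a bookmark; real APIs often add a tiebreaker)
const PAGE_SIZE = 100;

function fetchPage(lastSeenDueDate) {
  return db.task.find({
    due_date: { $gt: lastSeenDueDate, $lt: "2023-09-01T00:00:00Z" },
    completed: true
  })
  .sort({ due_date: 1 }) // stable order so pages do not overlap
  .limit(PAGE_SIZE)      // hard boundary per request (the "throttle")
  .toArray();
}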
The main goal here is to draw a boundary around our APIs and DBs.
I hope this story gives you a peek into why knowing your “data” and “API” requirements is important.
There will be follow-ups in this journey.