Learn how to efficiently store and query sparse data using ClickHouse. Discover the differences between OLAP and OLTP, optimal schema designs, and best practices for avoiding nullable columns. Explore real-world examples and performance metrics to optimize your data management and analytics workflows.
Reviewed and edited by Chinmay Naik, Saurabh Hirani, Spandan Ghosh
As software architects, our team recently faced a challenging problem for a SaaS fleet management company. They were struggling to store and query massive amounts of data from thousands of vehicles, each sending nearly 5,000 different data points every minute.
This scenario is a classic example of handling sparse data, where not all data points are present for every vehicle at any given time. In this blog post, we will explore how we solved this puzzle, turning a data headache into a smooth-running solution.
The Problem
To provide more context, each vehicle can transmit nearly 5,000 keys every minute, referred to as PIDs, along with their corresponding values. Here are the key characteristics of these keys:
PID Data is Time-Series Data: Each PID represents a specific characteristic of a vehicle at a particular timestamp.
Sparse Data: Not all PIDs need to be present for the same vehicle at any given time, making the data sparse.
Consistent Datatypes: The datatype of a PID's value is consistent; for example, if `pid = 81` has an integer value, it will always be an integer.
Append-Only Data: The data is append-only, with no updates made after the initial entry.
Query Requirements
The primary query requirements involve:
Retrieving PID Values: Retrieving a set of PID values for a group of vehicles within a specified time range.
Aggregation Queries: Performing aggregation queries (e.g., average, count, sum) on float and integer PID values.
Our challenge was to figure out how to efficiently store and retrieve the PID data.
OLTP or OLAP?
In our use case, we need to process large volumes of data for analytical purposes. Our queries always focus on specific columns and require processing millions of data points with each query. Performing this on an OLTP system can be slower and more costly in terms of storage compared to OLAP. This is because OLAP databases can compress columnar data more efficiently, depending on factors like proper use of sort keys, data repetitiveness, etc.
There are many OLAP databases available in the market, e.g., AWS Redshift, Apache Druid, ClickHouse, etc.
We chose ClickHouse for its cost-effectiveness, low maintenance requirements, strong community support, easy setup, and scalability.
Choosing the Right Schema
Our problems didn’t disappear just by changing the DB. They just became tractable and within the solvable bounds of ClickHouse.
The heart of the problem is choosing the correct schema for this business use case.
Here are the primary schema designs we considered:
Horizontal (Wide) Table Structure:
This approach involves storing all PIDs as separate columns.
Vertical (Entity-Attribute-Value or EAV) Table Structure:
This structure involves storing each PID as a separate row. We evaluated two main vertical table structures: separate tables for each datatype and datatype-specific columns within the same table.
Schema Options
Option 1: Sparse Matrix
Approach: Store data in a sparse matrix format.
Pros:
Different compression algorithms can be used for different values based on their characteristics.
The primary key can consist of `vehicle_id` and `timestamp`
Cons:
High RAM requirements during data insertion: This is because ClickHouse creates a separate `.bin` file for each column, and during insertion it must keep all of these files in memory to create parts.
Slower querying on PIDs, as the PID is not part of the primary key. In our case, most queries require PIDs in the WHERE clause.
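A minimal sketch of what this wide layout could look like; the table and column names here are illustrative, not the production schema, and a real table would have thousands of `pid_*` columns:

```sql
-- Option 1 (sketch): one column per PID, mostly empty per row.
-- In practice there would be ~5,000 pid_* columns.
CREATE TABLE pid_wide
(
    vehicle_id UInt32,
    timestamp  DateTime,
    pid_81     Nullable(Int64),    -- integer-valued PID
    pid_82     Nullable(Float64),  -- float-valued PID
    pid_83     Nullable(String)    -- string-valued PID
    -- ... one column per remaining PID ...
)
ENGINE = MergeTree
ORDER BY (vehicle_id, timestamp);  -- PID cannot appear in the sort key here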
Option 2: Separate Tables for Each Datatype
Approach: Create separate tables for each datatype (integer, float, string).
Pros:
Faster queries: PID can be part of the primary key index
Significantly lower RAM usage during data insertion
No sparseness in the storage: This can lead to higher compression and faster queries.
Cons:
Caller code must distinguish between different tables
Joins may be required to get PIDs of different datatypes.
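A sketch of the per-datatype layout, assuming illustrative column names; `pid_float` and `pid_string` would follow the same shape with `Float64` and `String` value columns:

```sql
-- Option 2 (sketch): one narrow table per datatype.
CREATE TABLE pid_integer
(
    vehicle_id UInt32,
    pid        UInt16,
    timestamp  DateTime,
    value      Int64
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);  -- pid is part of the sort key
```

Because `pid` sits in the sort key, WHERE clauses on PIDs can prune granules instead of scanning the whole table.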
Below are some stats for data insertion
Option 3: Datatype-Specific Columns in the Same Table
Approach: Use separate columns for different datatypes within the same table.
Pros:
Simplified caller code: Caller code does not have to maintain different tables. No joins are required.
Faster queries: PID can be part of the primary key index
Cons:
Potential sparseness in the data
Higher RAM requirements during data insertion than Option 2 (still lesser than Option 1)
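A sketch of the single-table layout with one value column per datatype; the table name `pid_without_defaults` matches the one used in the benchmarks later, while the column names are assumptions:

```sql
-- Option 3 (sketch): all datatypes in one table, nullable value columns.
CREATE TABLE pid_without_defaults
(
    vehicle_id   UInt32,
    pid          UInt16,
    timestamp    DateTime,
    int_value    Nullable(Int64),
    float_value  Nullable(Float64),
    string_value Nullable(String)
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);
```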
Stats for data insertion
Optimizations and Best Practices
Avoid Nullable Columns:
Using default values instead of nulls can improve compression and query performance. This is because Nullable columns force ClickHouse to store and consult a separate null mask alongside each column, which adds storage and lookup overhead.
Compression:
Efficient compression algorithms like ZSTD can significantly reduce storage requirements. Ensure that the chosen compression method aligns with the characteristics of your data.
Query Performance:
Optimize queries by ensuring the primary key includes the PID, which can significantly speed up query performance.
Option 3.1
We know that when adding a row to the table for a specific PID, only that PID's value needs to be populated, while the other columns can be filled with default values. For instance, if we're adding PID `p1`, which only holds an integer value, we can assign default values for the float and string columns, as these aren't relevant for `p1`.
Based on the above assumption, we can represent the above table as below. Notice that we have added default values in place of the irrelevant value columns.
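In DDL terms, Option 3.1 could be sketched as below; the table name `pid_with_defaults` matches the one used in the benchmarks, and the column names and default values are illustrative assumptions:

```sql
-- Option 3.1 (sketch): non-nullable value columns with defaults.
CREATE TABLE pid_with_defaults
(
    vehicle_id   UInt32,
    pid          UInt16,
    timestamp    DateTime,
    int_value    Int64   DEFAULT 0,
    float_value  Float64 DEFAULT 0,
    string_value String  DEFAULT ''
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);
```

Long runs of identical defaults compress extremely well, so the sparseness costs far less than it would with nulls.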
Example Implementation and Metrics
We evaluated the performance of each schema design using a test setup:
Data Generation: 2000 PIDs per vehicle, with 70% float values, 29% integer values, and 1% string values.
Data Volume: 3 billion PID values for 1000 vehicles over a 24-hour period.
Machine Setup: Mac OS Sonoma, Apple M3 MacBook Air, 16GB RAM, 8 cores.
ClickHouse Version: v24.8.1.
Compression
We used ZSTD compression. You can experiment with other codecs as well.
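As an illustration (not our exact production DDL), ClickHouse lets you attach codecs per column, and codecs can be chained, e.g., a `Delta` transform before ZSTD for monotonic timestamps:

```sql
-- Hypothetical example of per-column ZSTD compression.
CREATE TABLE pid_integer_zstd
(
    vehicle_id UInt32   CODEC(ZSTD(3)),
    pid        UInt16   CODEC(ZSTD(3)),
    timestamp  DateTime CODEC(Delta, ZSTD(3)),  -- delta-encode, then compress
    value      Int64    CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);
```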
Query Performance
We ran the queries below on the datasets.
Q1. Get the average of int and float PID values for a set of vehicles over the entire 24-hour duration
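A sketch of Q1 against the single-table layout; the column names are assumptions, and note that with defaults in place, the average for the column that is irrelevant to a given PID will simply come out as the default:

```sql
-- Q1 (sketch): per-PID averages for a set of vehicles over 24 hours.
SELECT pid,
       avg(int_value)   AS avg_int,
       avg(float_value) AS avg_float
FROM pid_with_defaults
WHERE vehicle_id IN (1, 2, 3)
  AND timestamp >= '2024-08-01 00:00:00'
  AND timestamp <  '2024-08-02 00:00:00'
GROUP BY pid;
```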
NOTE: The `pid_without_defaults` and `pid_with_defaults` tables give the result in a single query.
Q2. Get all PID values for a vehicle over the entire 2-hour duration
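A sketch of Q2 on the single-table layout; the vehicle ID, time window, and column names are illustrative:

```sql
-- Q2 (sketch): every PID value for one vehicle over a 2-hour window.
SELECT pid, timestamp, int_value, float_value, string_value
FROM pid_with_defaults
WHERE vehicle_id = 42
  AND timestamp >= '2024-08-01 10:00:00'
  AND timestamp <  '2024-08-01 12:00:00'
ORDER BY pid, timestamp;
```

With the per-datatype tables of Option 2, the same question needs three such queries, one per table.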
Here, the time taken by `pid_integer`, `pid_float`, and `pid_string` is significantly lower, but `pid_with_defaults` and `pid_without_defaults` give the data in a single query. This can be a significant point, as querying different tables over the network in separate calls can be slower. It can be mitigated by querying the tables in parallel where possible.
Q3. Get the average of all int and float PIDs
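A sketch of Q3 on the per-datatype tables; the `value` column name is an assumption:

```sql
-- Q3 (sketch): per-PID averages across the integer and float tables.
SELECT pid, avg(value) AS avg_value
FROM pid_integer
GROUP BY pid

UNION ALL

SELECT pid, avg(value) AS avg_value
FROM pid_float
GROUP BY pid;
```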
As we can see, `pid_integer` and `pid_float` are a lot faster and use much less memory.
Conclusion and Recommendations
Based on our observations, storing data in a vertical structure can reduce RAM usage during batch inserts, speed up queries, and improve indexing efficiency.
In our case, we opted for the approach of using datatype-specific columns with default values (Option 3.1).
Reasons for this decision:
Simplified Caller Code: Using datatype-specific columns within the same table simplifies the caller code and reduces the need for joins.
Query Performance: Separate tables for each datatype offer higher query performance but require more complex caller code.
Optimizations: Avoid nullable columns, use efficient compression algorithms, and optimize queries to include PIDs in the primary key.
Fewer network calls: Avoiding queries across multiple tables reduced the number of network calls.
Additional Considerations
When dealing with sparse data, it's crucial to consider the underlying storage representation. Traditional positional formats used in many RDBMSs can be inefficient for sparse data due to the space occupied by null values.
NOTE FOR READERS
This option worked well for us, but it might not be the best fit for every scenario. After this analysis, we realised that each case requires individual consideration. The key factor in making decisions should be the query patterns of the project and business requirements, as they should guide both the table structure and the choice of database.
References
[ClickHouse - Lightning Fast Analytics for Everyone] https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
[Extending RDBMSs To Support Sparse Datasets] https://pages.cs.wisc.edu/~alanh/sparse.pdf
[Too Wide or Not Too Wide | That is the ClickHouse Question] https://altinity.com/blog/too-wide-or-not-too-wide-that-is-the-clickhouse-question
[Avoid Nullable Columns] https://clickhouse.com/docs/en/cloud/bestpractices/avoid-nullable-columns
Next Steps
If you're facing similar challenges or need expert guidance on setting up and managing your data infrastructure, we are here to help. Our team specializes in providing comprehensive solutions for data storage, querying, and optimization.