Storing Sparse Data Efficiently: A Deep Dive into ClickHouse and Schema Design

Lead Software Engineer

ClickHouse, OLAP, Optimisation

Sep 19, 2024 | 10 min read

Learn how to efficiently store and query sparse data using ClickHouse. Discover the differences between OLAP and OLTP, optimal schema designs, and best practices for avoiding nullable columns. Explore real-world examples and performance metrics to optimize your data management and analytics workflows.

Reviewed and edited by Chinmay Naik, Saurabh Hirani, Spandan Ghosh

As software architects, our team recently faced a challenging problem for a SaaS fleet management company. They were struggling to store and query massive amounts of data from thousands of vehicles, each sending nearly 5,000 different data points every minute.

This scenario is a classic example of handling sparse data, where not all data points are present for every vehicle at any given time. In this blog post, we will explore how we solved this puzzle, turning a data headache into a smooth-running solution.

The Problem

To provide more context, each vehicle can transmit nearly 5,000 keys every minute, referred to as PIDs, along with their corresponding values. Here are the key characteristics of these keys:

  1. PID Data is Time-Series Data: Each PID represents a specific characteristic of a vehicle at a particular timestamp.

  2. Sparse Data: Not all PIDs need to be present for the same vehicle at any given time, making the data sparse.

  3. Consistent Datatypes: The datatype of a PID’s value is consistent; for example, if pid = 81 has an integer value, it will always be an integer.

  4. Append-Only Data: The data is append-only, with no updates made after initial entry.
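
To make this concrete, a single minute of data from one vehicle might look like the rows below (the PID names and values are hypothetical; note that p205 is absent in the second minute, which is what makes the data sparse):

vehicle_id   timestamp            pid_name  value
2633597931   2024-08-11 12:00:00  p81       42      (integer)
2633597931   2024-08-11 12:00:00  p205      13.7    (float)
2633597931   2024-08-11 12:01:00  p81       43      (integer)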

Query Requirements

The primary query requirements involve:

  1. Retrieving PID Values: Retrieving a set of PID values for a group of vehicles within a specified time range.

  2. Aggregation Queries: Performing aggregation queries (e.g., average, count, sum) on float and integer PID values.

Our challenge was to figure out how to store the PID data efficiently and retrieve it quickly.

OLTP or OLAP?

In our use case, we need to process large volumes of data for analytical purposes. Our queries always focus on specific columns and require processing millions of data points with each query. Performing this on an OLTP system can be slower and more costly in terms of storage compared to OLAP. This is because OLAP databases can compress columnar data more efficiently, depending on factors like proper use of sort keys, data repetitiveness, etc.

Many OLAP databases are available in the market, e.g. AWS Redshift, Apache Druid, and ClickHouse.

We chose ClickHouse for its cost-effectiveness, low maintenance requirements, strong community support, easy setup, and scalability.

Choosing the Right Schema

Our problems didn’t disappear just by changing the DB. They just became tractable and within the solvable bounds of ClickHouse.

The heart of the problem is choosing the correct schema for this business use case.

Here are the primary schema designs we considered:

  1. Horizontal (Wide) Table Structure:

    • This approach involves storing all PIDs as separate columns.

  2. Vertical (Entity-Attribute-Value or EAV) Table Structure:

    • This structure involves storing each PID as a separate row. We evaluated two main vertical table structures: separate tables for each datatype and datatype-specific columns within the same table.

Schema Options

Option 1: Sparse Matrix

  • Approach: Store data in a sparse matrix format, i.e. a wide table with one column per PID (see the sketch after this list).

    • Pros:

      • Different compression algorithms can be used for different values based on their characteristics.

      • The primary key can consist of vehicle_id and timestamp.

    • Cons:

      • High RAM requirements during data insertion: ClickHouse creates a separate .bin file for each column, and during insertion it must keep all of these column files in memory while creating parts.

      • Slower querying on PIDs, as the PID is not part of the primary key. In our case, most queries require PIDs in the WHERE clause.
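
A minimal sketch of what this wide table could look like (the PID column names and types here are illustrative; the real table would carry up to ~5,000 such columns):

CREATE TABLE pid_wide (
    vehicle_id String,
    timestamp DateTime,
    -- one Nullable column per PID
    pid_81 Nullable(Int32),
    pid_82 Nullable(Float32),
    pid_83 Nullable(String)
    -- ...and so on for the remaining PIDs
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (vehicle_id, timestamp);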

Option 2: Separate Tables for Each Datatype

  • Approach: Create separate tables for each datatype (integer, float, string).


    • Pros:

      • Faster queries: PID can be part of the primary key index

      • Significantly lower RAM usage during data insertion

      • No sparseness in the storage: This can lead to higher compression and faster queries.

    • Cons:

      • Caller code must distinguish between different tables

      • Joins or unions may be required to get PIDs of different datatypes (see the sketch after the table definitions below).

Below are the table definitions and load commands we used to measure the data-insertion stats.

CREATE TABLE pid_integer (
    vehicle_id String,
    timestamp DateTime,
    pid_name String,
    pid_value Int32
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (vehicle_id, timestamp, pid_name);

CREATE TABLE pid_float (
    vehicle_id String,
    timestamp DateTime,
    pid_name String,
    pid_value Float32
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (vehicle_id, timestamp, pid_name);

CREATE TABLE pid_string (
    vehicle_id String,
    timestamp DateTime,
    pid_name String,
    pid_value String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (vehicle_id, timestamp, pid_name);

-- Load data using the queries below
INSERT INTO pid_integer FROM INFILE '/path/to/pid_int_data.csv' FORMAT CSV;
INSERT INTO pid_float FROM INFILE '/path/to/pid_float_data.csv' FORMAT CSV;
INSERT INTO pid_string FROM INFILE '/path/to/pid_string_data.csv' FORMAT CSV;
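
As an example of the extra caller-side work, fetching PIDs of mixed datatypes for one vehicle means combining results from the three tables, e.g. with a UNION ALL along these lines (a sketch; pid_value is cast to String only to make the result sets union-compatible):

SELECT pid_name, toString(pid_value) AS pid_value FROM pid_integer WHERE vehicle_id = '2633597931'
UNION ALL
SELECT pid_name, toString(pid_value) FROM pid_float WHERE vehicle_id = '2633597931'
UNION ALL
SELECT pid_name, pid_value FROM pid_string WHERE vehicle_id = '2633597931';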

Option 3: Datatype-Specific Columns in the Same Table

  • Approach: Use separate columns for different datatypes within the same table.

    • Pros:

      • Simplified caller code: The caller does not have to track different tables, and no joins are required.

      • Faster queries: PID can be part of the primary key index

    • Cons:

      • Potential sparseness in the data

      • Higher RAM requirements during data insertion than Option 2 (but still lower than Option 1)

Table definition and load commands used to measure the data-insertion stats:

CREATE TABLE pid_without_defaults (
    vehicle_id String,
    timestamp DateTime,
    pid_name String,
    int_value Nullable(Int32),
    float_value Nullable(Float32),
    string_value Nullable(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (vehicle_id, timestamp, pid_name);

-- Load data using the queries below
INSERT INTO pid_without_defaults FROM INFILE '/path/to/separate_columns/pid_int_data.csv' FORMAT CSV;
INSERT INTO pid_without_defaults FROM INFILE '/path/to/separate_columns/pid_float_data.csv' FORMAT CSV;
INSERT INTO pid_without_defaults FROM INFILE '/path/to/separate_columns/pid_string_data.csv' FORMAT CSV;

Optimizations and Best Practices

  1. Avoid Nullable Columns:

    • Using default values instead of NULLs can improve compression and query performance. This is because a Nullable column forces ClickHouse to store a separate null mask alongside the column data, which adds storage and an extra lookup on reads.

  2. Compression:

    • Efficient compression algorithms like ZSTD can significantly reduce storage requirements. Ensure that the chosen compression method aligns with the characteristics of your data.

  3. Query Performance:

    • Optimize queries by ensuring the primary key includes the PID, which can significantly speed up query performance.
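
To verify that a query actually benefits from the primary key, ClickHouse's EXPLAIN with indexes = 1 reports which parts and granules the primary index selects (a sketch against the pid_integer table defined earlier; the PID name is hypothetical):

-- Fewer selected granules means less data scanned from disk.
EXPLAIN indexes = 1
SELECT avg(pid_value)
FROM pid_integer
WHERE vehicle_id = '2633597931'
  AND timestamp BETWEEN '2024-08-11' AND '2024-08-12'
  AND pid_name = 'p81';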

Option 3.1

We know that when adding a row to the table for a specific PID, only that PID's value needs to be populated, while the other columns can be filled with default values. For instance, if we're adding PID p1, which only holds an integer value, we can assign default values for the float and string columns, as these aren't relevant for p1.

Based on the above, we can redefine the table as below. Notice that the value columns are no longer Nullable; the irrelevant value columns simply hold their type's default value (0, 0.0, or an empty string).

CREATE TABLE pid_with_defaults (
    vehicle_id String,
    timestamp DateTime,
    pid_name String,
    int_value Int32,
    float_value Float32,
    string_value String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (vehicle_id, timestamp, pid_name);


-- Load data using the queries below
INSERT INTO pid_with_defaults FROM INFILE '/path/to/separate_columns/pid_int_data.csv' FORMAT CSV;
INSERT INTO pid_with_defaults FROM INFILE '/path/to/separate_columns/pid_float_data.csv' FORMAT CSV;
INSERT INTO pid_with_defaults FROM INFILE '/path/to/separate_columns/pid_string_data.csv' FORMAT CSV;
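
To see the defaults in action, an integer-PID row can be inserted with only int_value supplied; ClickHouse fills the omitted columns with their type defaults (the row values here are hypothetical):

-- float_value and string_value are omitted and fall back to 0 and ''.
INSERT INTO pid_with_defaults (vehicle_id, timestamp, pid_name, int_value)
VALUES ('2633597931', '2024-08-11 12:00:00', 'p81', 42);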

Example Implementation and Metrics

We evaluated the performance of each schema design using a test setup:

  • Data Generation: 2000 PIDs per vehicle, with 70% float values, 29% integer values, and 1% string values.

  • Data Volume: 3 billion PID values for 1000 vehicles over a 24-hour period.

  • Machine Setup: Mac OS Sonoma, Apple M3 MacBook Air, 16GB RAM, 8 cores.

  • ClickHouse Version: v24.8.1.

Compression

We used ZSTD compression; you can try other codecs as well. The query below reports per-table compression stats:

SELECT
    database,
    `table`,
    formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
    round(usize / size, 2) AS compr_rate,
    sum(rows) AS rows,
    count() AS part_count
FROM system.parts
WHERE (active = 1) AND (database LIKE '%') AND (`table` LIKE '%')
GROUP BY
    database,
    `table`
ORDER BY size DESC
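
For reference, ClickHouse compresses with LZ4 by default; ZSTD can be applied per column with a CODEC clause (a sketch; the level 3 shown is an assumption to tune for your data):

-- Higher ZSTD levels trade CPU for a better compression ratio.
ALTER TABLE pid_float MODIFY COLUMN pid_value Float32 CODEC(ZSTD(3));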

Query Performance

We ran the queries below against the datasets.

Q1. Get the average of int and float PID values for a set of vehicles over the entire 24-hour duration

select pid_name, avg(int_value), avg(float_value) from pid_without_defaults where vehicle_id IN ('2633597931', '1304513181', '5069583100', '4742595504', '9892749918') and timestamp BETWEEN '2024-08-11' and '2024-08-12' group by pid_name;
select pid_name, avg(int_value), avg(float_value) from pid_with_defaults where vehicle_id IN ('2633597931', '1304513181', '5069583100', '4742595504', '9892749918') and timestamp BETWEEN '2024-08-11' and '2024-08-12' group by pid_name;
select pid_name, avg(pid_value) from pid_integer where vehicle_id IN ('2633597931', '1304513181', '5069583100', '4742595504', '9892749918') and timestamp BETWEEN '2024-08-11' and '2024-08-12' group by pid_name;
select pid_name, avg(pid_value) from pid_float where vehicle_id IN ('2633597931', '1304513181', '5069583100', '4742595504', '9892749918') and timestamp BETWEEN '2024-08-11' and '2024-08-12' group by pid_name;

NOTE: One thing to note here is that the pid_without_defaults and pid_with_defaults tables return the result in a single query, while the per-datatype tables need one query each.

Q2. Get all PID values for a vehicle over a 2-hour duration

select pid_name, int_value, float_value, string_value from pid_without_defaults where vehicle_id = '2633597931' and timestamp BETWEEN '2024-08-11 12:00:00' and '2024-08-11 13:59:59';
select pid_name, int_value, float_value, string_value from pid_with_defaults where vehicle_id = '2633597931' and timestamp BETWEEN '2024-08-11 12:00:00' and '2024-08-11 13:59:59';
select pid_name, pid_value from pid_integer where vehicle_id = '2633597931' and timestamp BETWEEN '2024-08-11 12:00:00' and '2024-08-11 13:59:59';
select pid_name, pid_value from pid_string where vehicle_id = '2633597931' and timestamp BETWEEN '2024-08-11 12:00:00' and '2024-08-11 13:59:59';
select pid_name, pid_value from pid_float where vehicle_id = '2633597931' and timestamp BETWEEN '2024-08-11 12:00:00' and '2024-08-11 13:59:59';

Here the time taken by pid_integer, pid_float, and pid_string is significantly lower, but pid_with_defaults and pid_without_defaults return the data in a single query. This can be a significant point, as querying different tables over the network in separate calls can be slower; where possible, this can be tackled by issuing the queries in parallel.

Q3. Get the average of all int and float PID values

select avg(int_value) from pid_without_defaults;
select avg(float_value) from pid_without_defaults;
select avg(int_value) from pid_with_defaults;
select avg(float_value) from pid_with_defaults;
select avg(pid_value) from pid_integer;
select avg(pid_value) from pid_float;

As we can see, pid_integer and pid_float are a lot faster and use far less memory.

Conclusion and Recommendations

Based on our observations, storing data in a vertical structure can reduce RAM usage during batch inserts, speed up queries, and improve indexing efficiency.

In our case, we opted for the approach of using datatype-specific columns with default values (Option 3.1).

Reasons for this decision:

  1. Simplified Caller Code: Using datatype-specific columns within the same table simplifies the caller code and reduces the need for joins.

  2. Query Performance: Separate tables for each datatype offer higher query performance but require more complex caller code, a trade-off we accepted.

  3. Optimizations: Avoid nullable columns, use efficient compression algorithms, and optimize queries to include PIDs in the primary key.

  4. Fewer Network Calls: Reduced the number of network calls by avoiding queries across multiple tables.

Additional Considerations

When dealing with sparse data, it's crucial to consider the underlying storage representation. Traditional positional formats used in many RDBMSs can be inefficient for sparse data due to the space occupied by null values.
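
ClickHouse mitigates this with sparse serialization: when most values in a column part equal the default, it can store only the non-default values plus their offsets. The behaviour is controlled by a MergeTree setting (the 0.9 threshold below is illustrative, not a recommendation):

-- Column parts whose ratio of default values exceeds the threshold
-- are stored sparsely (non-default values + offsets).
ALTER TABLE pid_with_defaults
    MODIFY SETTING ratio_of_defaults_for_sparse_serialization = 0.9;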

NOTE FOR READERS

This option worked well for us, but it might not be the best fit for every scenario. After this analysis, we realised that each case requires individual consideration. The key factor in making decisions should be the query patterns of the project and business requirements, as they should guide both the table structure and the choice of database.

Next Steps

If you're facing similar challenges or need expert guidance on setting up and managing your data infrastructure, we are here to help. Our team specializes in providing comprehensive solutions for data storage, querying, and optimization.
