Learn how to efficiently store and query sparse data using ClickHouse. Discover the differences between OLAP and OLTP, optimal schema designs, and best practices for avoiding nullable columns. Explore real-world examples and performance metrics to optimize your data management and analytics workflows.
Reviewed and edited by Chinmay Naik, Saurabh Hirani, Spandan Ghosh
As software architects, our team recently faced a challenging problem for a SaaS fleet management company. They were struggling to store and query massive amounts of data from thousands of vehicles, each sending nearly 5,000 different data points every minute.
This scenario is a classic example of handling sparse data, where not all data points are present for every vehicle at any given time. In this blog post, we will explore how we solved this puzzle, turning a data headache into a smooth-running solution.
The Problem
To provide more context, each vehicle can transmit nearly 5,000 keys every minute, referred to as PIDs, along with their corresponding values. Here are the key characteristics of these keys:
PID Data is Time-Series Data: Each PID represents a specific characteristic of a vehicle at a particular timestamp.
Sparse Data: Not all PIDs need to be present for the same vehicle at any given time, making the data sparse.
Consistent Datatypes: The datatype of a PID's value is consistent; for example, if `pid = 81` has an integer value, it will always be an integer.
Append-Only Data: The data is append-only, with no updates made after the initial entry.
Query Requirements
The primary query requirements involve:
Retrieving PID Values: Retrieving a set of PID values for a group of vehicles within a specified time range.
Aggregation Queries: Performing aggregation queries (e.g., average, count, sum) on float and integer PID values.
Our challenge was to figure out how to efficiently store and retrieve the PID data.
OLTP or OLAP?
In our use case, we need to process large volumes of data for analytical purposes. Our queries always focus on specific columns and require processing millions of data points with each query. Performing this on an OLTP system can be slower and more costly in terms of storage compared to OLAP. This is because OLAP databases can compress columnar data more efficiently, depending on factors like proper use of sort keys, data repetitiveness, etc.
There are many OLAP databases available in the market, e.g., AWS Redshift, Apache Druid, ClickHouse, etc.
We chose ClickHouse for its cost-effectiveness, low maintenance requirements, strong community support, easy setup, and scalability.
Choosing the Right Schema
Our problems didn’t disappear just by changing the DB. They just became tractable and within the solvable bounds of ClickHouse.
The heart of the problem is choosing the correct schema for this business use case.
Here are the primary schema designs we considered:
Horizontal (Wide) Table Structure:
This approach involves storing all PIDs as separate columns.
Vertical (Entity-Attribute-Value or EAV) Table Structure:
This structure involves storing each PID as a separate row. We evaluated two main vertical table structures: separate tables for each datatype and datatype-specific columns within the same table.
Schema Options
Option 1: Sparse Matrix
Approach: Store data in a sparse matrix format.
Pros:
Different compression algorithms can be used for different values based on their characteristics.
The primary key can consist of `vehicle_id` and `timestamp`
Cons:
High RAM requirements during data insertion: This is because ClickHouse creates a separate `.bin` file for each column, and during insertion it must keep all of these files in memory to create parts.
Slower querying on PIDs, as the PID is not part of the primary key. In our case, most queries require PIDs in the WHERE clause.
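A minimal sketch of what this wide layout could look like; the table and column names here are illustrative, not the production schema, and a real table would have thousands of `pid_*` columns:

```sql
-- Option 1 (sketch): one column per PID, mostly empty per row.
-- In practice there would be ~5,000 pid_* columns.
CREATE TABLE pid_wide
(
    vehicle_id UInt32,
    timestamp  DateTime,
    pid_81     Nullable(Int64),    -- integer-valued PID
    pid_82     Nullable(Float64),  -- float-valued PID
    pid_83     Nullable(String)    -- string-valued PID
    -- ... one column per remaining PID ...
)
ENGINE = MergeTree
ORDER BY (vehicle_id, timestamp);  -- PID cannot appear in the sort key here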
Option 2: Separate Tables for Each Datatype
Approach: Create separate tables for each datatype (integer, float, string).
Pros:
Faster queries: PID can be part of the primary key index
Significantly lower RAM usage during data insertion
No sparseness in the storage: This can lead to higher compression and faster queries.
Cons:
Caller code must distinguish between different tables
Joins may be required to get PIDs of different datatypes.
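A sketch of the per-datatype layout, assuming illustrative column names; `pid_float` and `pid_string` would follow the same shape with `Float64` and `String` value columns:

```sql
-- Option 2 (sketch): one narrow table per datatype.
CREATE TABLE pid_integer
(
    vehicle_id UInt32,
    pid        UInt16,
    timestamp  DateTime,
    value      Int64
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);  -- pid is part of the sort key
```

Because `pid` sits in the sort key, WHERE clauses on PIDs can prune granules instead of scanning the whole table.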
Below are some stats for data insertion
Option 3: Datatype-Specific Columns in the Same Table
Approach: Use separate columns for different datatypes within the same table.
Pros:
Simplified caller code: Caller code does not have to maintain different tables. No joins are required.
Faster queries: PID can be part of the primary key index
Cons:
Potential sparseness in the data
Higher RAM requirements during data insertion than Option 2 (still lesser than Option 1)
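A sketch of the single-table layout with one value column per datatype; the table name `pid_without_defaults` matches the one used in the benchmarks later, while the column names are assumptions:

```sql
-- Option 3 (sketch): all datatypes in one table, nullable value columns.
CREATE TABLE pid_without_defaults
(
    vehicle_id   UInt32,
    pid          UInt16,
    timestamp    DateTime,
    int_value    Nullable(Int64),
    float_value  Nullable(Float64),
    string_value Nullable(String)
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);
```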
Stats for data insertion
Optimizations and Best Practices
Avoid Nullable Columns:
Using default values instead of nulls can improve compression and query performance. This is because Nullable columns force ClickHouse to store and consult a separate null mask alongside each column, which adds storage and lookup overhead.
Compression:
Efficient compression algorithms like ZSTD can significantly reduce storage requirements. Ensure that the chosen compression method aligns with the characteristics of your data.
Query Performance:
Optimize queries by ensuring the primary key includes the PID, which can significantly speed up query performance.
Option 3.1
We know that when adding a row to the table for a specific PID, only that PID's value needs to be populated, while the other columns can be filled with default values. For instance, if we're adding PID `p1`, which only holds an integer value, we can assign default values for the float and string columns, as these aren't relevant for `p1`.
Based on the above assumption, we can represent the above table as below. Notice that we have added default values in place of the irrelevant value columns.
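In DDL terms, Option 3.1 could be sketched as below; the table name `pid_with_defaults` matches the one used in the benchmarks, and the column names and default values are illustrative assumptions:

```sql
-- Option 3.1 (sketch): non-nullable value columns with defaults.
CREATE TABLE pid_with_defaults
(
    vehicle_id   UInt32,
    pid          UInt16,
    timestamp    DateTime,
    int_value    Int64   DEFAULT 0,
    float_value  Float64 DEFAULT 0,
    string_value String  DEFAULT ''
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);
```

Long runs of identical defaults compress extremely well, so the sparseness costs far less than it would with nulls.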
Example Implementation and Metrics
We evaluated the performance of each schema design using a test setup:
Data Generation: 2000 PIDs per vehicle, with 70% float values, 29% integer values, and 1% string values.
Data Volume: 3 billion PID values for 1000 vehicles over a 24-hour period.
Machine Setup: Mac OS Sonoma, Apple M3 MacBook Air, 16GB RAM, 8 cores.
ClickHouse Version: v24.8.1.
Compression
We used ZSTD compression. You can experiment with other codecs as well.
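As an illustration (not our exact production DDL), ClickHouse lets you attach codecs per column, and codecs can be chained, e.g., a `Delta` transform before ZSTD for monotonic timestamps:

```sql
-- Hypothetical example of per-column ZSTD compression.
CREATE TABLE pid_integer_zstd
(
    vehicle_id UInt32   CODEC(ZSTD(3)),
    pid        UInt16   CODEC(ZSTD(3)),
    timestamp  DateTime CODEC(Delta, ZSTD(3)),  -- delta-encode, then compress
    value      Int64    CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (vehicle_id, pid, timestamp);
```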
Query Performance
We ran the queries below on the datasets.
Q1. Get the average of int and float PID values for a set of vehicles over the entire 24-hour duration
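A sketch of Q1 against the single-table layout; the column names are assumptions, and note that with defaults in place, the average for the column that is irrelevant to a given PID will simply come out as the default:

```sql
-- Q1 (sketch): per-PID averages for a set of vehicles over 24 hours.
SELECT pid,
       avg(int_value)   AS avg_int,
       avg(float_value) AS avg_float
FROM pid_with_defaults
WHERE vehicle_id IN (1, 2, 3)
  AND timestamp >= '2024-08-01 00:00:00'
  AND timestamp <  '2024-08-02 00:00:00'
GROUP BY pid;
```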
NOTE: The `pid_without_defaults` and `pid_with_defaults` tables give the result in a single query.
Q2. Get all PID values for a vehicle over the entire 2-hour duration
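A sketch of Q2 on the single-table layout; the vehicle ID, time window, and column names are illustrative:

```sql
-- Q2 (sketch): every PID value for one vehicle over a 2-hour window.
SELECT pid, timestamp, int_value, float_value, string_value
FROM pid_with_defaults
WHERE vehicle_id = 42
  AND timestamp >= '2024-08-01 10:00:00'
  AND timestamp <  '2024-08-01 12:00:00'
ORDER BY pid, timestamp;
```

With the per-datatype tables of Option 2, the same question needs three such queries, one per table.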
Here, the time taken by `pid_integer`, `pid_float`, and `pid_string` is significantly lower, but `pid_with_defaults` and `pid_without_defaults` give the data in a single query. This can be a significant point, as querying different tables over the network in separate calls can be slower. It can be mitigated by querying the tables in parallel where possible.
Q3. Get the average of all int and float PIDs
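A sketch of Q3 on the per-datatype tables; the `value` column name is an assumption:

```sql
-- Q3 (sketch): per-PID averages across the integer and float tables.
SELECT pid, avg(value) AS avg_value
FROM pid_integer
GROUP BY pid

UNION ALL

SELECT pid, avg(value) AS avg_value
FROM pid_float
GROUP BY pid;
```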
As we can see, `pid_integer` and `pid_float` are a lot faster and use much less memory.
Conclusion and Recommendations
Based on our observations, storing data in a vertical structure can reduce RAM usage during batch inserts, speed up queries, and improve indexing efficiency.
In our case, we opted for the approach of using datatype-specific columns with default values (Option 3.1).
Reasons for this decision:
Simplified Caller Code: Using datatype-specific columns within the same table simplifies the caller code and reduces the need for joins.
Query Performance: Separate tables for each datatype offer higher query performance but require more complex caller code.
Optimizations: Avoid nullable columns, use efficient compression algorithms, and optimize queries to include PIDs in the primary key.
Fewer network calls: Avoiding queries across multiple tables reduced the number of network calls.
Additional Considerations
When dealing with sparse data, it's crucial to consider the underlying storage representation. Traditional positional formats used in many RDBMSs can be inefficient for sparse data due to the space occupied by null values.
NOTE FOR READERS
This option worked well for us, but it might not be the best fit for every scenario. After this analysis, we realised that each case requires individual consideration. The key factor in making decisions should be the query patterns of the project and business requirements, as they should guide both the table structure and the choice of database.
References
[ClickHouse - Lightning Fast Analytics for Everyone] https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf
[Extending RDBMSs To Support Sparse Datasets] https://pages.cs.wisc.edu/~alanh/sparse.pdf
[Too Wide or Not Too Wide | That is the ClickHouse Question] https://altinity.com/blog/too-wide-or-not-too-wide-that-is-the-clickhouse-question
[Avoid Nullable Columns] https://clickhouse.com/docs/en/cloud/bestpractices/avoid-nullable-columns
Next Steps
If you're facing similar challenges or need expert guidance on setting up and managing your data infrastructure, we are here to help. Our team specializes in providing comprehensive solutions for data storage, querying, and optimization.