Database Design for Tick and Bar Data Storage
Storing financial data requires a robust database design. Tick and bar data have distinct characteristics. A well-designed schema optimizes query performance and storage efficiency. This lesson outlines effective database structures for both data types.
Tick Data Storage
Tick data represents every price change. It demands high-volume, high-velocity storage. Each tick records a timestamp, price, and volume. A relational database might struggle with the sheer volume. NoSQL databases, specifically time-series databases, offer better solutions.
Consider a time-series database like InfluxDB or TimescaleDB (a PostgreSQL extension). These databases optimize for time-ordered data. They compress data efficiently. They also provide specialized functions for time-based queries.
For a relational database approach, design a schema for minimal overhead. Create a table for each asset. Alternatively, use a single table with an asset identifier. The single table approach simplifies data ingestion but complicates indexing.
Example Schema (Single Table, Relational):
CREATE TABLE tick_data (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
symbol VARCHAR(10) NOT NULL,
timestamp DATETIME(6) NOT NULL, -- Microsecond precision
price DECIMAL(18, 6) NOT NULL,
volume BIGINT NOT NULL,
exchange VARCHAR(10),
INDEX (symbol, timestamp) -- Compound index for fast queries
);
CREATE TABLE tick_data (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
symbol VARCHAR(10) NOT NULL,
timestamp DATETIME(6) NOT NULL, -- Microsecond precision
price DECIMAL(18, 6) NOT NULL,
volume BIGINT NOT NULL,
exchange VARCHAR(10),
INDEX (symbol, timestamp) -- Compound index for fast queries
);
This tick_data table stores every trade for all symbols. The id column is a unique identifier. symbol identifies the asset, e.g., 'SPY' or 'AAPL'. timestamp records the exact time of the tick. Use microsecond precision for high-frequency data. price stores the trade price. volume stores the trade size. exchange notes the venue.
A compound index on (symbol, timestamp) enables rapid retrieval of all ticks for a specific symbol within a time range. Without this index, queries would scan the entire table. For example, retrieving all 'MSFT' ticks between 2023-01-01 09:30:00 and 2023-01-01 09:35:00 would be instantaneous.
Data partitioning improves performance for very large tables. Partition the tick_data table by symbol or timestamp. For instance, create separate partitions for each month of data. This limits the data scanned during a query.
Consider data archiving. Tick data accumulates rapidly. Store recent data in a high-performance database. Move older data to cheaper, slower storage, like Amazon S3 or Google Cloud Storage. Access archived data less frequently.
Bar Data Storage
Bar data aggregates ticks into fixed time intervals. Common intervals include 1-minute, 5-minute, 1-hour, or 1-day bars. Each bar contains open, high, low, close prices, and volume (OHLCV). Bar data volume is significantly lower than tick data. Relational databases manage bar data effectively.
Example Schema (Relational):
CREATE TABLE bar_data (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
symbol VARCHAR(10) NOT NULL,
interval_seconds INT NOT NULL, -- e.g., 60 for 1-minute, 3600 for 1-hour
timestamp DATETIME NOT NULL, -- Bar start time
open_price DECIMAL(18, 6) NOT NULL,
high_price DECIMAL(18, 6) NOT NULL,
low_price DECIMAL(18, 6) NOT NULL,
close_price DECIMAL(18, 6) NOT NULL,
volume BIGINT NOT NULL,
INDEX (symbol, interval_seconds, timestamp) -- Compound index
);
CREATE TABLE bar_data (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
symbol VARCHAR(10) NOT NULL,
interval_seconds INT NOT NULL, -- e.g., 60 for 1-minute, 3600 for 1-hour
timestamp DATETIME NOT NULL, -- Bar start time
open_price DECIMAL(18, 6) NOT NULL,
high_price DECIMAL(18, 6) NOT NULL,
low_price DECIMAL(18, 6) NOT NULL,
close_price DECIMAL(18, 6) NOT NULL,
volume BIGINT NOT NULL,
INDEX (symbol, interval_seconds, timestamp) -- Compound index
);
The bar_data table stores OHLCV information. symbol identifies the asset. interval_seconds defines the bar duration. For example, '60' means 1-minute bars. timestamp marks the start of the bar. open_price, high_price, low_price, close_price record the respective prices. volume aggregates trade volume during the bar interval.
The compound index (symbol, interval_seconds, timestamp) supports efficient queries for specific symbols, timeframes, and date ranges. For example, retrieving all 5-minute bars for 'GOOG' for the last week executes quickly.
Store different bar intervals in the same table. The interval_seconds column differentiates them. This simplifies schema management. Alternatively, create separate tables for each interval (e.g., bar_data_1min, bar_data_5min). Separate tables can offer marginal performance gains for very large datasets but increase schema complexity.
Populate bar data from tick data or direct vendor feeds. If using tick data, aggregate ticks into bars. This process requires careful handling of edge cases, like market holidays or early closes.
Data Integrity and Maintenance
Data integrity holds top importance. Implement robust data validation during ingestion. Check for duplicate records. Ensure timestamps are sequential. Validate price and volume values. Prices should be positive. Volumes should be non-negative.
Regularly back up your databases. Implement a disaster recovery plan. Data loss means losing your historical edge.
Monitor database performance. Optimize queries using EXPLAIN plans. Add or modify indexes as needed. Database maintenance, like re-indexing or vacuuming, keeps performance optimal.
Consider data normalization. For example, store symbol in a separate symbols table and use a foreign key in tick_data and bar_data. This reduces redundancy but adds join overhead for queries. For financial time-series, denormalization often improves read performance. The examples above prioritize read speed by including symbol directly.
The chosen database technology impacts scalability. PostgreSQL with TimescaleDB scales well for both tick and bar data. Dedicated time-series databases like InfluxDB excel with tick data. Cloud-based solutions like Amazon RDS or Google Cloud SQL simplify infrastructure management.
A well-structured database underpins any robust mean reversion strategy. It ensures fast, reliable access to the data required for backtesting and live trading.
