Bulk Data Loading
Load massive CSV files (10GB+) at 178K rows/second
Note: For datasets larger than 10GB, use the Bulk Loader. It bypasses the standard ingestion pipeline for a ~100x speedup.
🚀 Performance Capabilities
- Throughput: ~178,000 rows/second (validated on 156M rows).
- Scale: Successfully tested on 130GB files.
- Technology: Uses `LOAD DATA INFILE` directly into MariaDB ColumnStore, bypassing row-by-row Java processing.
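Under the hood, the bulk path boils down to a single server-side load statement. The sketch below shows the general shape of such a statement; the table name, file path, and CSV options are hypothetical, since the loader generates the real statement for you.

```sql
-- Minimal sketch of a server-side bulk load (hypothetical table and file names;
-- the loader generates the real statement and options automatically).
LOAD DATA INFILE '/data/exports/transactions_2024.csv'
INTO TABLE sales_metrics
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the CSV header row
```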
📄 Control File Configuration
To load messy enterprise data without ETL, use a JSON Control File:
{
  "timestampColumn": "METRIC_DATE",
  "columnMappings": {
    "SOLD_AMOUNT": "Sold Amount",
    "LOAN_VALUE": "Loan Value"
  },
  "excludeColumns": ["internal_id"],
  "columnSemantics": {
    "SOLD_AMOUNT": "Only for merchandise sales. Returns 0 for loans."
  }
}

- Why? It renames columns during the load and injects semantic meaning for the AI immediately.
🎲 The SCOOP_RAND System Column
For huge datasets, sorting by `RAND()` is too slow.
- Scoop adds a `SCOOP_RAND` column (float 0-1) during load.
- Benefit: To sample 1%, we query `WHERE SCOOP_RAND < 0.01`. This scans only 1% of disk blocks (Extent Elimination), making ML analysis instant on 100M+ rows (see the sketch below).
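For illustration, a 1% sample looks like the query below; `sales_metrics` is a hypothetical table name and the column names are taken from the control file example above. Because `SCOOP_RAND` is a stored column, the filter prunes blocks instead of shuffling the whole table the way `ORDER BY RAND()` would.

```sql
-- 1% random sample via the SCOOP_RAND system column ("sales_metrics" is a
-- hypothetical table name). The predicate lets ColumnStore skip ~99% of
-- disk blocks, per the extent-elimination behaviour described above.
SELECT METRIC_DATE, SOLD_AMOUNT, LOAN_VALUE
FROM   sales_metrics
WHERE  SCOOP_RAND < 0.01;

-- The slow alternative this replaces: a full scan plus a global shuffle.
-- SELECT * FROM sales_metrics ORDER BY RAND() LIMIT 1000000;
```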
🧹 Sparse Data Handling
- Problem: Transaction logs often have 99% NULLs in specific metric columns.
- Solution: The loader uses a "Sparse Inference" engine. It needs only 10 non-empty values to correctly type a column as Integer/Decimal, ignoring the millions of NULLs (a conceptual sketch follows below).
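Conceptually, the sampling step behaves like the query below. The real engine runs over the raw CSV stream before any table exists; `staging_raw` is a hypothetical name used only for illustration.

```sql
-- Conceptual illustration of Sparse Inference sampling (not the loader's actual
-- code): collect the first 10 non-empty values of a mostly-NULL column and infer
-- the Integer/Decimal type from those, ignoring the NULLs.
SELECT SOLD_AMOUNT
FROM   staging_raw
WHERE  SOLD_AMOUNT IS NOT NULL
  AND  SOLD_AMOUNT <> ''
LIMIT  10;
```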