Bulk Data Loading

Load massive CSV files (10GB+) at 178K rows/second

Note: For datasets > 10GB, use the Bulk Loader. It bypasses the standard ingestion pipeline and is roughly 100x faster.

🚀 Performance Capabilities

  • Throughput: ~178,000 rows/second (validated on 156M rows).
  • Scale: Successfully tested on 130GB files.
  • Technology: Uses LOAD DATA INFILE directly into MariaDB ColumnStore, bypassing row-by-row Java processing (see the sketch that follows).
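
A minimal sketch of the pattern described above, not the actual loader code: a LOAD DATA LOCAL INFILE statement issued through JDBC so that MariaDB ColumnStore ingests the CSV file directly. The connection URL, credentials, file path, and table name are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        // allowLocalInfile must be enabled for LOAD DATA LOCAL INFILE over JDBC.
        String url = "jdbc:mariadb://localhost:3306/analytics?allowLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // One statement hands the whole file to the server; no per-row Java work.
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/data/transactions.csv' " +
                "INTO TABLE transactions " +
                "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' " +
                "IGNORE 1 LINES");
        }
    }
}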

📄 Control File Configuration

To load messy enterprise data without ETL, use a JSON Control File:

{
  "timestampColumn": "METRIC_DATE",
  "columnMappings": {
    "SOLD_AMOUNT": "Sold Amount",
    "LOAN_VALUE": "Loan Value"
  },
  "excludeColumns": ["internal_id"],
  "columnSemantics": {
    "SOLD_AMOUNT": "Only for merchandise sales. Returns 0 for loans."
  }
}

  • Why? The Control File renames columns during the load and attaches semantic meaning that the AI can use immediately (see the sketch below).
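
A rough illustration of how such a control file can drive the load (the loader's internals are not shown here, so treat this as an assumption): the sketch reads the JSON with Jackson and derives a LOAD DATA column list in which mapped columns keep their target names and excluded columns are routed to throwaway user variables. The file name control.json is hypothetical.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ControlFileSketch {
    public static void main(String[] args) throws Exception {
        JsonNode control = new ObjectMapper().readTree(new File("control.json"));

        // Target column names, e.g. SOLD_AMOUNT and LOAN_VALUE.
        List<String> columns = new ArrayList<>();
        Iterator<String> targets = control.get("columnMappings").fieldNames();
        while (targets.hasNext()) {
            columns.add(targets.next());
        }

        // Excluded CSV columns are read into user variables and discarded.
        for (JsonNode skipped : control.get("excludeColumns")) {
            columns.add("@" + skipped.asText());
        }

        // Column list to append to the LOAD DATA statement.
        System.out.println("(" + String.join(", ", columns) + ")");
        // -> (SOLD_AMOUNT, LOAN_VALUE, @internal_id)
    }
}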

🎲 The SCOOP_RAND System Column

For huge datasets, sampling with ORDER BY RAND() is too slow: it forces a full scan and sort of the table.

  • Scoop adds a SCOOP_RAND column (float 0-1) during load.
  • Benefit: To sample 1%, we query WHERE SCOOP_RAND < 0.01. This scans only 1% of disk blocks (Extent Elimination), making ML analysis instant on 100M+ rows (see the query sketch below).
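
A minimal sketch of that sampling query over JDBC; the table name and the selected metric columns are placeholders, and only SCOOP_RAND comes from the description above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RandomSampleSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mariadb://localhost:3306/analytics";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // Predicate on SCOOP_RAND: a 1% sample without ORDER BY RAND().
             ResultSet rs = stmt.executeQuery(
                 "SELECT METRIC_DATE, SOLD_AMOUNT FROM transactions " +
                 "WHERE SCOOP_RAND < 0.01")) {
            while (rs.next()) {
                // Feed the sampled rows to the downstream ML analysis.
            }
        }
    }
}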

🧹 Sparse Data Handling

  • Problem: Transaction logs often have 99% NULLs in specific metric columns.
  • Solution: The loader uses a "Sparse Inference" engine. It needs only 10 non-empty values to correctly type a column as Integer/Decimal, ignoring the millions of NULLs (a sketch of the idea follows).
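
A minimal sketch of the idea behind Sparse Inference, not the engine's actual code: skip empty values, inspect the first 10 non-empty ones, and classify the column as INTEGER, DECIMAL, or TEXT.

import java.util.Arrays;
import java.util.List;

public class SparseInferenceSketch {
    enum ColumnType { INTEGER, DECIMAL, TEXT }

    static ColumnType inferType(Iterable<String> values, int required) {
        int seen = 0;
        boolean allInteger = true;
        boolean allNumeric = true;
        for (String v : values) {
            if (v == null || v.isBlank()) continue; // ignore the (possibly millions of) NULLs
            seen++;
            if (!v.matches("-?\\d+")) allInteger = false;
            if (!v.matches("-?\\d+(\\.\\d+)?")) allNumeric = false;
            if (seen >= required) break;            // 10 non-empty values are enough to decide
        }
        if (seen == 0 || !allNumeric) return ColumnType.TEXT;
        return allInteger ? ColumnType.INTEGER : ColumnType.DECIMAL;
    }

    public static void main(String[] args) {
        List<String> sparseColumn = Arrays.asList("", null, "12.5", "", "7", null, "3.0");
        System.out.println(inferType(sparseColumn, 10)); // DECIMAL
    }
}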