Bulk Data Loading Guide

Load massive CSV files (10GB+) into Scoop quickly and efficiently

Audience: Data Engineers, Admins


Overview

Scoop's Bulk Loader is designed for speed. Use it when you need to ingest large historical datasets (10 GB to 100 GB+) without waiting on standard processing. It bypasses the standard analysis pipeline so your data is ready for querying in minutes, not hours.

Quick Start

The fastest way to load a file is using the command line interface (CLI).

./scripts/runscoop --bulkload /path/to/your/large_file.csv --workspaceID W123

This will:

  1. Scan the first 100,000 rows to detect column types.
  2. Create a new Inbox and Table automatically.
  3. Load the data directly into the database.
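Before kicking off a multi-hour load, it can be worth sanity-checking the file yourself. The sketch below is a stand-alone helper (not part of Scoop; `precheck_csv` is a hypothetical name) that verifies the path is absolute and readable, and that early rows match the header's column count:

```python
import csv
import os

def precheck_csv(path, sample_rows=1000):
    """Basic pre-load checks: absolute path, readable file, header present,
    and a consistent field count in the first sample_rows data rows."""
    if not os.path.isabs(path):
        raise ValueError(f"Use an absolute path, got: {path}")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # raises StopIteration on an empty file
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            if len(row) != len(header):
                raise ValueError(
                    f"Row {i + 2} has {len(row)} fields, expected {len(header)}"
                )
    return header
```

Running this first surfaces path and formatting problems in seconds rather than partway through a bulk load.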

Advanced Configuration: Control Files

For messy enterprise data, you often need more control than the default "auto-detect" offers. Use a Control File (JSON) to rename columns, filter out junk, and set descriptions during the load.

1. Create a Control File (load_config.json)

{
  "datasetName": "Sales History 2024",
  "description": "Complete transaction history for Q1-Q4",
  "timestampColumn": "TXN_DATE",
  "timestampFormat": "yyyy-MM-dd HH:mm:ss",
  "columnMappings": {
    "AMT_SOLD": "Sales Amount",
    "CUST_ID": "Customer ID",
    "STORE_NUM": "Store Number"
  },
  "excludeColumns": ["INTERNAL_LOG_ID", "SYS_CREATED_BY"],
  "columnSemantics": {
    "AMT_SOLD": "Use this for revenue calculations. Returns 0 for returns."
  }
}

Key Fields:

  • datasetName: The friendly name shown in Scoop UI.
  • timestampColumn: The CSV column representing the transaction time (crucial for time-series analysis).
  • columnMappings: Rename cryptic CSV headers (e.g., AMT_SOLD) to business-friendly names (e.g., Sales Amount).
  • excludeColumns: List of columns to completely skip loading.
  • columnSemantics: Pre-load the AI with knowledge about what a column means, preventing hallucination later.
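A typo in a control file is cheap to catch up front. The following stand-alone sketch (hypothetical helper, not a Scoop tool; the field list is taken from the example above and may not be exhaustive) validates a control file before you pass it to the loader:

```python
import json

# Field names as used in the control-file example above; treat this as a
# sketch of a schema, not an authoritative list.
KNOWN_FIELDS = {
    "datasetName", "description", "timestampColumn", "timestampFormat",
    "columnMappings", "excludeColumns", "columnSemantics",
}

def validate_control_file(path):
    """Catch typos and contradictions in a control file before a long load."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    unknown = set(cfg) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields (typo?): {sorted(unknown)}")
    # A column should not be both renamed and excluded.
    overlap = set(cfg.get("columnMappings", {})) & set(cfg.get("excludeColumns", []))
    if overlap:
        raise ValueError(f"Columns both mapped and excluded: {sorted(overlap)}")
    return cfg
```

Failing fast here is much cheaper than discovering a misspelled key after the load has already started.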

2. Run with Configuration

./scripts/runscoop --bulkload /path/to/your/large_file.csv \
  --workspaceID W123 \
  --bulkloadconfig ./load_config.json

Optimizing Performance

Schema Scanning

By default, Scoop scans the first 100,000 rows to guess data types (Integer vs String).

  • Faster Load: Use --scanrows 10000 (Less accurate type detection, but faster start).
  • Better Accuracy: Use --scanrows 1000000 (use when a column looks numeric early in the file but contains text further down).
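To see why the scan depth matters, here is a minimal illustration of the kind of inference a schema scanner performs (a sketch of the general technique, not Scoop's actual algorithm): a column is typed Integer only if every non-empty sampled value parses as one, so a stray text value past the scan window goes unnoticed.

```python
def infer_type(values):
    """Guess a column type from a sample of string values: Integer only
    if every non-empty sample parses as an integer, else String."""
    for v in values:
        if v == "":
            continue  # empty cells don't vote
        try:
            int(v)
        except ValueError:
            return "String"  # one non-numeric value demotes the column
    return "Integer"

# With a shallow sample the column looks numeric...
print(infer_type(["1", "2", "3"]))         # Integer
# ...but scanning deeper reveals text, and the guessed type flips.
print(infer_type(["1", "2", "3", "N/A"]))  # String
```

If the "N/A" only appears on row 500,000, a 100,000-row scan would type the column Integer and the load would then hit values that don't fit.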

Skipping Statistics

Scoop automatically calculates column statistics (Min, Max, Null %) after loading. For ultra-fast ingestion, you can skip this and run it later.

  • Flag: --skipStatistics
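If you skip statistics at load time, the numbers Scoop would have computed (Min, Max, Null %) are cheap to reproduce offline. A stand-alone sketch (not Scoop's implementation) for one column:

```python
def column_stats(values):
    """Min, Max, and Null % for one column, mirroring the post-load
    statistics described above. None represents a null cell."""
    non_null = [v for v in values if v is not None]
    null_pct = 100.0 * (len(values) - len(non_null)) / len(values)
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_pct": round(null_pct, 2),
    }

print(column_stats([3, 1, None, 7]))  # {'min': 1, 'max': 7, 'null_pct': 25.0}
```

Deferring this work is a reasonable trade: the load finishes sooner, and the statistics pass can run during off-peak hours.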

Troubleshooting Common Issues

"CSV file not found"

  • Ensure the path is absolute (e.g., /home/user/data/file.csv).
  • Ensure the file is readable by the user running the script.

"Failed to create table"

  • Check if your CSV headers contain special characters. While Scoop tries to sanitize them, simple alphanumeric headers are safest.
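If you prefer not to rely on Scoop's sanitization, you can normalize headers yourself before loading. The helper below is a hypothetical example of one reasonable policy (Scoop's own sanitization rules are not documented here): collapse anything that isn't alphanumeric into underscores.

```python
import re

def sanitize_header(name):
    """Reduce a CSV header to alphanumerics and underscores: trim
    whitespace, replace runs of other characters with a single '_'."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_")
    return cleaned or "column"  # fall back if nothing survives

print(sanitize_header("Sales Amount ($)"))  # Sales_Amount
print(sanitize_header("CUST-ID"))           # CUST_ID
```

Rewriting the header row with names like these before loading sidesteps table-creation failures from special characters.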

"Columns detected as wrong type"

  • If a column is 99% empty, Scoop might guess it's a String. Increase --scanrows to find non-empty values further down the file.

Metadata Save Errors

  • Never run a build (./gradlew build) while a bulk load is running. It can cause the system to lose track of the table metadata. If this happens, you may need to reload the file.