Bulk Data Loading Guide

Load massive CSV files (10GB+) into Scoop quickly and efficiently

Audience: Data Engineers, Admins


Overview

Scoop's Bulk Loader is designed for speed. Use it when you need to ingest large historical datasets (10 GB to 100 GB+) without waiting on standard processing. It bypasses the standard analysis pipeline so your data is ready for querying in minutes, not hours.

Quick Start

The fastest way to load a file is using the command line interface (CLI).

./scripts/runscoop --bulkload /path/to/your/large_file.csv --workspaceID W123

This will:

  1. Scan the first 100,000 rows to detect column types.
  2. Create a new Inbox and Table automatically.
  3. Load the data directly into the database.
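Before kicking off a multi-hour load, it can be worth sanity-checking the file yourself. The sketch below is a stand-alone helper (not part of Scoop; `precheck_csv` is a hypothetical name) that verifies the path is absolute and readable, and that early rows match the header's column count:

```python
import csv
import os

def precheck_csv(path, sample_rows=1000):
    """Basic pre-load checks: absolute path, readable file, header present,
    and a consistent field count in the first sample_rows data rows."""
    if not os.path.isabs(path):
        raise ValueError(f"Use an absolute path, got: {path}")
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # raises StopIteration on an empty file
        for i, row in enumerate(reader):
            if i >= sample_rows:
                break
            if len(row) != len(header):
                raise ValueError(
                    f"Row {i + 2} has {len(row)} fields, expected {len(header)}"
                )
    return header
```

Running this first surfaces path and formatting problems in seconds rather than partway through a bulk load.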

Advanced Configuration: Control Files

For messy enterprise data, you often need more control than the default "auto-detect" offers. Use a Control File (JSON) to rename columns, filter out junk, and set descriptions during the load.

1. Create a Control File (load_config.json)

{
  "datasetName": "Sales History 2024",
  "description": "Complete transaction history for Q1-Q4",
  "timestampColumn": "TXN_DATE",
  "timestampFormat": "yyyy-MM-dd HH:mm:ss",
  "columnMappings": {
    "AMT_SOLD": "Sales Amount",
    "CUST_ID": "Customer ID",
    "STORE_NUM": "Store Number"
  },
  "excludeColumns": ["INTERNAL_LOG_ID", "SYS_CREATED_BY"],
  "columnSemantics": {
    "AMT_SOLD": "Use this for revenue calculations. Returns 0 for returns."
  }
}

Key Fields:

  • datasetName: The friendly name shown in Scoop UI.
  • timestampColumn: The CSV column representing the transaction time (crucial for time-series analysis).
  • columnMappings: Rename cryptic CSV headers (e.g., AMT_SOLD) to business-friendly names (e.g., Sales Amount).
  • excludeColumns: List of columns to completely skip loading.
  • columnSemantics: Pre-load the AI with knowledge about what a column means, preventing hallucination later.
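A typo in a control file is cheap to catch up front. The following stand-alone sketch (hypothetical helper, not a Scoop tool; the field list is taken from the example above and may not be exhaustive) validates a control file before you pass it to the loader:

```python
import json

# Field names as used in the control-file example above; treat this as a
# sketch of a schema, not an authoritative list.
KNOWN_FIELDS = {
    "datasetName", "description", "timestampColumn", "timestampFormat",
    "columnMappings", "excludeColumns", "columnSemantics",
}

def validate_control_file(path):
    """Catch typos and contradictions in a control file before a long load."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    unknown = set(cfg) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"Unknown fields (typo?): {sorted(unknown)}")
    # A column should not be both renamed and excluded.
    overlap = set(cfg.get("columnMappings", {})) & set(cfg.get("excludeColumns", []))
    if overlap:
        raise ValueError(f"Columns both mapped and excluded: {sorted(overlap)}")
    return cfg
```

Failing fast here is much cheaper than discovering a misspelled key after the load has already started.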

2. Run with Configuration

./scripts/runscoop --bulkload /path/to/your/large_file.csv \
  --workspaceID W123 \
  --bulkloadconfig ./load_config.json

Optimizing Performance

Schema Scanning

By default, Scoop scans the first 100,000 rows to guess data types (Integer vs String).

  • Faster Load: Use --scanrows 10000 (Less accurate type detection, but faster start).
  • Better Accuracy: Use --scanrows 1000000 (use when a column looks numeric early in the file but contains text further down).
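To see why the scan depth matters, here is a minimal illustration of the kind of inference a schema scanner performs (a sketch of the general technique, not Scoop's actual algorithm): a column is typed Integer only if every non-empty sampled value parses as one, so a stray text value past the scan window goes unnoticed.

```python
def infer_type(values):
    """Guess a column type from a sample of string values: Integer only
    if every non-empty sample parses as an integer, else String."""
    for v in values:
        if v == "":
            continue  # empty cells don't vote
        try:
            int(v)
        except ValueError:
            return "String"  # one non-numeric value demotes the column
    return "Integer"

# With a shallow sample the column looks numeric...
print(infer_type(["1", "2", "3"]))         # Integer
# ...but scanning deeper reveals text, and the guessed type flips.
print(infer_type(["1", "2", "3", "N/A"]))  # String
```

If the "N/A" only appears on row 500,000, a 100,000-row scan would type the column Integer and the load would then hit values that don't fit.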

Skipping Statistics

Scoop automatically calculates column statistics (Min, Max, Null %) after loading. For ultra-fast ingestion, you can skip this and run it later.

  • Flag: --skipStatistics
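If you skip statistics at load time, the numbers Scoop would have computed (Min, Max, Null %) are cheap to reproduce offline. A stand-alone sketch (not Scoop's implementation) for one column:

```python
def column_stats(values):
    """Min, Max, and Null % for one column, mirroring the post-load
    statistics described above. None represents a null cell."""
    non_null = [v for v in values if v is not None]
    null_pct = 100.0 * (len(values) - len(non_null)) / len(values)
    return {
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "null_pct": round(null_pct, 2),
    }

print(column_stats([3, 1, None, 7]))  # {'min': 1, 'max': 7, 'null_pct': 25.0}
```

Deferring this work is a reasonable trade: the load finishes sooner, and the statistics pass can run during off-peak hours.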

Troubleshooting Common Issues

"CSV file not found"

  • Ensure the path is absolute (e.g., /home/user/data/file.csv).
  • Ensure the file is readable by the user running the script.

"Failed to create table"

  • Check if your CSV headers contain special characters. While Scoop tries to sanitize them, simple alphanumeric headers are safest.
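If you prefer not to rely on Scoop's sanitization, you can normalize headers yourself before loading. The helper below is a hypothetical example of one reasonable policy (Scoop's own sanitization rules are not documented here): collapse anything that isn't alphanumeric into underscores.

```python
import re

def sanitize_header(name):
    """Reduce a CSV header to alphanumerics and underscores: trim
    whitespace, replace runs of other characters with a single '_'."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_")
    return cleaned or "column"  # fall back if nothing survives

print(sanitize_header("Sales Amount ($)"))  # Sales_Amount
print(sanitize_header("CUST-ID"))           # CUST_ID
```

Rewriting the header row with names like these before loading sidesteps table-creation failures from special characters.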

"Columns detected as wrong type"

  • If a column is 99% empty, Scoop might guess it's a String. Increase --scanrows to find non-empty values further down the file.

Metadata Save Errors

  • Never run a build (./gradlew build) while a bulk load is running. It can cause the system to lose track of the table metadata. If this happens, you may need to reload the file.