Bulk Data Loading Guide
Purpose: Learn how to load massive CSV files (10GB+) into Scoop quickly and efficiently.
Audience: Data Engineers, Admins
Overview
Scoop's Bulk Loader is designed for speed. Use it when you need to ingest large historical datasets (10GB - 100GB+) without waiting days for standard processing. It bypasses the standard analysis pipeline to get your data ready for querying in minutes, not hours.
Quick Start
The fastest way to load a file is using the command line interface (CLI).
./scripts/runscoop --bulkload /path/to/your/large_file.csv --workspaceID W123

This will:
- Scan the first 100,000 rows to detect column types.
- Create a new Inbox and Table automatically.
- Load the data directly into the database.
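Before kicking off a multi-gigabyte load, it can be worth confirming what you are about to ingest. A minimal pre-flight sketch using standard Unix tools (this is not a Scoop feature, and the path is the same placeholder used above):

```bash
# Pre-flight checks before a long-running load (standard Unix tools, not Scoop).
du -h /path/to/your/large_file.csv      # total file size
wc -l < /path/to/your/large_file.csv    # approximate row count, including the header
head -1 /path/to/your/large_file.csv    # the header row Scoop will use for column detection
```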
Advanced Configuration: Control Files
For messy enterprise data, you often need more control than the default "auto-detect" offers. Use a Control File (JSON) to rename columns, filter out junk, and set descriptions during the load.
1. Create a Control File (load_config.json)
{
  "datasetName": "Sales History 2024",
  "description": "Complete transaction history for Q1-Q4",
  "timestampColumn": "TXN_DATE",
  "timestampFormat": "yyyy-MM-dd HH:mm:ss",
  "columnMappings": {
    "AMT_SOLD": "Sales Amount",
    "CUST_ID": "Customer ID",
    "STORE_NUM": "Store Number"
  },
  "excludeColumns": ["INTERNAL_LOG_ID", "SYS_CREATED_BY"],
  "columnSemantics": {
    "AMT_SOLD": "Use this for revenue calculations. Returns 0 for returns."
  }
}

Key Fields:
- datasetName: The friendly name shown in Scoop UI.
- timestampColumn: The CSV column representing the transaction time (crucial for time-series analysis).
- columnMappings: Rename cryptic CSV headers (e.g., AMT_SOLD) to business-friendly names (e.g., Sales Amount).
- excludeColumns: List of columns to completely skip loading.
- columnSemantics: Pre-load the AI with knowledge about what a column means, preventing hallucination later.
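Before starting a long load with a hand-written control file, a quick sanity check can save a wasted run. The sketch below is not a Scoop feature; it assumes jq is installed, a comma-delimited header with no quoted commas, and placeholder paths:

```bash
# Confirm the control file parses as valid JSON.
jq empty load_config.json && echo "load_config.json is valid JSON"

# Warn if any column named in columnMappings is missing from the CSV header.
HEADER=$(head -1 /path/to/your/large_file.csv)
for col in $(jq -r '.columnMappings | keys[]' load_config.json); do
  echo "$HEADER" | grep -q "$col" || echo "WARNING: $col not found in CSV header"
done
```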
2. Run with Configuration
./scripts/runscoop --bulkload /path/to/your/large_file.csv \
--workspaceID W123 \
--bulkloadconfig ./load_config.json

Optimizing Performance
Schema Scanning
By default, Scoop scans the first 100,000 rows to guess data types (Integer vs String).
- Faster load: Use --scanrows 10000 (less accurate type detection, but a faster start).
- Better accuracy: Use --scanrows 1000000 (if your file has values that look like numbers early on but contain text further down).
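As an illustration, a load tuned for accuracy on a file where numeric-looking columns pick up text values later might look like this (the path and workspace ID are placeholders):

```bash
# Scan 1,000,000 rows before choosing column types: slower to start, fewer type surprises.
./scripts/runscoop --bulkload /path/to/your/large_file.csv \
  --workspaceID W123 \
  --scanrows 1000000
```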
Skipping Statistics
Scoop automatically calculates column statistics (Min, Max, Null %) after loading. For ultra-fast ingestion, you can skip this and run it later.
- Flag: --skipStatistics
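For the fastest possible ingestion, the two performance options can be combined: a small type-detection scan plus deferred statistics. A sketch, assuming both flags can be passed in a single invocation (path and workspace ID are placeholders):

```bash
# Minimal scan plus deferred statistics for maximum ingestion speed.
./scripts/runscoop --bulkload /path/to/your/large_file.csv \
  --workspaceID W123 \
  --scanrows 10000 \
  --skipStatistics
```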
Troubleshooting Common Issues
"CSV file not found"
- Ensure the path is absolute (e.g., /home/user/data/file.csv).
- Ensure the file is readable by the user running the script.
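A quick way to rule out path and permission problems before re-running the load (the path is a placeholder):

```bash
# Check that the path is absolute, exists, and is readable by the current user.
CSV=/home/user/data/file.csv               # placeholder: the file you are trying to load
[[ "$CSV" = /* ]] || echo "Path is not absolute"
[[ -f "$CSV" ]]   || echo "File not found at $CSV"
[[ -r "$CSV" ]]   || echo "File is not readable by $(whoami)"
```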
"Failed to create table"
- Check if your CSV headers contain special characters. While Scoop tries to sanitize them, simple alphanumeric headers are safest.
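To see exactly what Scoop will receive as column names, print the header row and flag anything beyond simple alphanumerics and underscores. A sketch assuming a comma-delimited file with no quoted commas in the header (placeholder path):

```bash
# List header fields and flag characters that may cause table-creation problems.
head -1 /path/to/your/large_file.csv | tr ',' '\n' \
  | grep -nE '[^A-Za-z0-9_ ]' || echo "Headers look clean"
```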
"Columns detected as wrong type"
- If a column is 99% empty, Scoop might guess it is a String. Increase --scanrows to find non-empty values further down the file (see the sketch below).
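To decide how far to raise --scanrows, you can check how sparse the problem column actually is. A sketch assuming a comma-delimited file with no quoted commas, with column 5 standing in for the sparse column (placeholder path):

```bash
# Count non-empty values for column 5 in the first 100,000 data rows.
head -100001 /path/to/your/large_file.csv | tail -n +2 \
  | awk -F',' '$5 != "" { n++ } END { print n+0, "non-empty values in the first 100,000 rows" }'
```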
Metadata Save Errors
- Never run a build (./gradlew build) while a bulk load is running. It can cause the system to lose track of the table metadata. If this happens, you may need to reload the file.