This document is the official technical specification for the KDD Cup 2026 Data Agents competition, detailing the packaging method for participant submissions, Docker container runtime environment, I/O format constraints, hardware resource limits, model invocation methods, and scoring mechanism.
All participating teams must carefully read and strictly comply with all requirements in this document before submitting their solutions. The evaluation system will run containers exactly as described in this document. Any submission that does not comply with the specifications will result in evaluation failure or a score of 0.
Note: If there are any discrepancies between this document and other descriptions on the official website, this document's latest version shall prevail. If you have any questions, please contact the organizing committee on Discord or WeChat community.
During evaluation, the complete directory structure inside the container is as follows. The participant's Agent must traverse all task subdirectories under /input, process them sequentially, and write results to the corresponding /output/task_<id>/ directory.
/input/                      # Read-only mount, no writes allowed
└── task_<id>/               # One subdirectory per task, id is a unique identifier
    ├── task.json            # Task metadata (task_id, difficulty, question)
    └── context/             # Context data required for the task
        ├── csv/             # Optional: Structured CSV files
        ├── db/              # Optional: SQLite database files
        ├── json/            # Optional: Structured JSON files
        ├── doc/             # Optional: Data documentation (Markdown, etc.)
        └── knowledge.md     # Optional: Background knowledge document

/output/                     # Read-write, participants write prediction results
└── task_<id>/               # Corresponds one-to-one with the input task directory
    └── prediction.csv       # [REQUIRED OUTPUT] Participant's prediction answer

Note:
The task metadata for each task is stored in task.json, with the following field descriptions:
| Field Name | Type | Description |
|---|---|---|
| task_id | string | Unique task identifier, consistent with /input/task_<id>/ directory name |
| difficulty | string | Difficulty level: easy / medium / hard / extreme |
| question | string | Natural language analysis question describing the data analysis task to be answered |
task.json example:
{
"task_id": "task_11",
"difficulty": "medium",
"question": "Which product category generated the highest total revenue in Q3? List each category and its total revenue, sorted by revenue descending."
}

The data source types under the context/ directory vary by task. The participant's Agent must detect and read the subdirectories and files that actually exist:
| Subdirectory | Content Type | Description |
|---|---|---|
| csv/ | Structured Tables | One or more CSV files, directly readable with pandas and other tools |
| db/ | SQLite Database | One or more .sqlite / .db files containing multiple relational tables |
| json/ | Structured Data | Semi-structured data files in JSON format |
| doc/ | Data Documentation | Data reports and analysis documents in Markdown or other formats |
| knowledge.md | Background Knowledge | Background knowledge document related to the task, including business definitions and terminology explanations |
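Because the available subdirectories vary per task, an Agent should probe for what is actually present rather than assume a layout. A minimal sketch using only the Python standard library (`detect_context_sources` is an illustrative helper, not part of any provided toolkit):

```python
from pathlib import Path

def detect_context_sources(context_dir: Path) -> dict:
    """Map each data source that actually exists under context/ to its file list."""
    sources = {}
    for sub in ("csv", "db", "json", "doc"):
        p = context_dir / sub
        if p.is_dir():
            sources[sub] = sorted(p.iterdir())
    knowledge = context_dir / "knowledge.md"
    if knowledge.is_file():
        sources["knowledge"] = [knowledge]
    return sources
```

The Agent can then branch on the returned keys, e.g., open SQLite files only when "db" is present.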
Correspondence between difficulty and data sources:
| Difficulty | Data Modality | Document Scale | Core Challenge |
|---|---|---|---|
| Easy | CSV/JSON + knowledge.md | Short text | Python code generation and data analysis |
| Medium | CSV/JSON + db/ + knowledge.md | Medium | Text-to-SQL + multi-source data analysis |
| Hard | Same as above + doc/ data documents | ~10K–128K tokens | Unstructured document reasoning |
| Extreme | Same as Hard | >128K tokens | Long context engineering and memory management |
For each task, the participant's Agent must write its prediction to /output/task_<id>/prediction.csv as a standard CSV file in UTF-8 encoding.
prediction.csv example (corresponding question: list total sales by category):
category,total_revenue
Electronics,4200000.00
Clothing,1850000.00
Food,930000.00
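A standard-library sketch of writing such a file, pre-formatting numerics to 2 decimal places so they already match the evaluation's numeric normalization (the rows are the hypothetical figures from the example above; the real output path is /output/task_<id>/prediction.csv):

```python
import csv
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical result rows for the example question above.
rows = [
    ("Electronics", Decimal("4200000")),
    ("Clothing", Decimal("1850000")),
    ("Food", Decimal("930000")),
]

with open("prediction.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["category", "total_revenue"])
    for name, revenue in rows:
        # Emit exactly 2 decimal places (ROUND_HALF_UP), as the scorer expects.
        writer.writerow([name, revenue.quantize(Decimal("0.01"),
                                                rounding=ROUND_HALF_UP)])
```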
This competition uses Docker images as the only form of solution submission. Participants need to package the complete Agent program and its runtime dependencies into a Docker image, export it as a compressed archive, and submit it via email to the organizers. After receiving the image, the evaluation system will uniformly start the container in a controlled environment, inject runtime configurations, execute the Agent, and collect output results—all fully automated without participant intervention.
To ensure unique traceability of submitted images, participants must strictly follow the format below to name Docker images and compressed archives:
| Item | Format & Description |
|---|---|
| Image Name | <team_id>:v<N> |
| Archive Filename | <team_id>_v<N>.tar.gz (colon replaced with underscore in filename) |
| <team_id> | Unique team identifier assigned by the system after registration (e.g., team0042) |
| <N> | Submission sequence number, starting from 1 and incrementing (e.g., v1 for 1st submission, v3 for 3rd) |
Example (Team ID is team0042, 3rd submission): image name team0042:v3, archive filename team0042_v3.tar.gz.
Note: The image tag and archive filename must completely match the above specifications, and team_id must be exactly consistent with the team ID assigned by the organizers. Submissions with non-compliant naming will be rejected by the evaluation system. Each submission must increment the version number <N>; previously submitted version numbers cannot be reused.
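For reference, the packaging commands might look like the following sketch (team0042 and 3 are the hypothetical values from the example; substitute your assigned team ID and actual submission number):

```shell
TEAM_ID=team0042   # hypothetical; use your assigned team ID
N=3                # hypothetical; use your actual submission number

docker build -t "${TEAM_ID}:v${N}" .
docker save "${TEAM_ID}:v${N}" | gzip > "${TEAM_ID}_v${N}.tar.gz"
```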
📅 Team ID Assignment Schedule
The organizers will assign team IDs in order of team registration from April 21-23, 2026 (AoE) and send emails to team leaders.
| Requirement | Specification |
|---|---|
| Base Image | No restriction |
| Startup Entry | Must set ENTRYPOINT or CMD, directly executable via docker run without additional parameters |
| Root Privileges | Allowed to run as root user, but modifying /input mount directory is prohibited |
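A minimal Dockerfile sketch that satisfies these requirements. The base image, requirements.txt, and main.py are assumptions about the solution's own layout, not competition mandates; the ENTRYPOINT pipes output into /logs as recommended in the logging section below:

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Pipe stdout/stderr into /logs so crash output survives container exit.
ENTRYPOINT ["/bin/sh", "-c", "python main.py 2>&1 | tee /logs/runtime.log"]
```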
📧 Submission via Email
This competition uses Google Drive for image submission. Participants upload the Docker image archive to their own Google Drive, set sharing permissions, and then send the sharing link to the organizers via email.
Step 1: Upload image archive to Google Drive
Step 2: Send submission email
Send the following information to the organizers' email (email address: [email protected]):
Email example:
Subject: [KDDCup2026 Data Agents] Submission - team0000 - v1

Team ID: team0000
Version: v1
Sharing link: https://drive.google.com/file/d/1QTBRom51ejitPLe9Ke_HKi1PZkAyWKOg/view?usp=share_link
| Submission Requirement | Description |
|---|---|
| Sharing Permission | Must be set to "Anyone with the link can view"; otherwise, organizers cannot download and the submission is invalid |
| File Naming | Archive filename must comply with <team_id>_v<N>.tar.gz format and match the version number in the email |
| File Retention | Do not delete or modify the file on Google Drive until receiving the organizers' evaluation completion notification |
| Link Validity | Ensure the sharing link remains valid throughout the evaluation period; link expiration will void the submission |
After the image is uploaded, the evaluation system will execute the following steps in sequence:
Step 1: Decompress and load the image
docker load -i team0042_v3.tar.gz
Step 2: Start container for evaluation (conceptual representation, actual parameters generated by evaluation platform)
docker run --rm \
  --network=eval_net \                          # Internal network (only access to evaluation system services)
  --cpus=16 \                                   # CPU core limit
  --memory=64g \                                # Memory hard limit
  --memory-swap=64g \                           # Disable swap space
  -v /eval/data/input:/input:ro \               # All task inputs (read-only)
  -v /eval/<submission_id>/output:/output:rw \  # All task outputs (read-write)
  -v /eval/<submission_id>/logs:/logs:rw \      # Solution runtime log files (read-write)
  -e MODEL_API_URL=<model_url> \                # Internal LLM model service URL
  -e MODEL_API_KEY=<api_key> \                  # Internal LLM model API Key
  -e MODEL_NAME=<model_name> \                  # Internal LLM model access name
  team0042:v3                                   # Participant solution image name
Note: Both /input and /output mount the entire dataset directory (containing all tasks). The container must traverse all task_<id> subdirectories under /input and process them one by one.
| Environment Variable Name | Description |
|---|---|
| MODEL_API_URL | Internal network deployed Qwen3.5-35B-A3B model service address (compatible with OpenAI Chat Completions API); all model calls must go through this address |
| MODEL_API_KEY | Model service authentication Key (injected by evaluation system; participants do not need to know the specific value in advance) |
| MODEL_NAME | Model name used during evaluation, value is "qwen3.5-35b-a3b" |
Core logic that main.py should implement (pseudocode):
import os, json
from pathlib import Path

input_root = Path("/input")
output_root = Path("/output")

for task_dir in sorted(input_root.iterdir()):
    if not task_dir.is_dir():
        continue  # skip any stray files under /input
    task_meta = json.loads((task_dir / "task.json").read_text())
    task_id = task_meta["task_id"]
    question = task_meta["question"]
    context = task_dir / "context"
    # ... Run Agent, generate DataFrame result ...
    out_dir = output_root / task_id
    out_dir.mkdir(parents=True, exist_ok=True)
    result.to_csv(out_dir / "prediction.csv", index=False)

The evaluation system allocates an independent /logs directory (read-write) to each submission via a Docker mount. When a participant's solution exits abnormally, the organizers can extract the contents of /logs for the participant's debugging reference.
Participants must ensure that /logs contains complete logs needed for troubleshooting, especially output when the program crashes. Since stdout/stderr are not automatically persisted after container exit, it is recommended to redirect both standard output and standard error to a log file in the startup command:
# Synchronously persist stdout/stderr in the ENTRYPOINT script or startup command
python main.py 2>&1 | tee /logs/runtime.log
Important: The organizers will review /logs contents. Any logs containing leaked test set data, gold answers, or other private information will result in the loss of debug support eligibility for that submission; serious cases will result in disqualification.
| Resource Type | Limit Specification |
|---|---|
| CPU | 16 cores (vCPU), x86-64 architecture |
| Memory | 64 GB RAM; container will be OOM killed if exceeded, all results for that submission voided |
| GPU | Participant containers do not have GPU; model inference is uniformly completed by the evaluation system (see Section 5) |
| Runtime Limit | Total runtime limit for all tasks is 12 hours; container forcibly terminated after timeout |
Note: 12 hours is the total time limit for all tasks, not per task. Please allocate processing time for each task reasonably and implement timeout protection in the code to ensure that results for completed tasks can be written out in time. Additionally, we recommend participants use multi-threading/multi-processing for batch parallel acceleration of different input samples (refer to the max_workers setting in the Starter Kit code).
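One possible shape for that advice is a thread pool with a per-task result timeout. This is a sketch under assumptions: PER_TASK_TIMEOUT, max_workers, and process_task are illustrative names, the stub does no real work, and note that Future.result(timeout=...) only stops waiting — it cannot kill a runaway thread, so long-running steps inside process_task still need their own guards:

```python
import concurrent.futures as cf
from pathlib import Path

PER_TASK_TIMEOUT = 600  # seconds per task; illustrative, tune to your budget

def process_task(task_dir: Path) -> str:
    # Placeholder: run the Agent here and write
    # /output/<task_id>/prediction.csv before returning.
    return task_dir.name

def run_all(input_root: Path, max_workers: int = 4) -> list:
    completed = []
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_task, d): d
                   for d in sorted(input_root.iterdir()) if d.is_dir()}
        for fut, task_dir in futures.items():
            try:
                completed.append(fut.result(timeout=PER_TASK_TIMEOUT))
            except cf.TimeoutError:
                print(f"{task_dir.name} timed out; moving on")
            except Exception as exc:
                # One failing task must not block the remaining ones.
                print(f"{task_dir.name} failed: {exc}")
    return completed
```

Writing each prediction.csv inside process_task, immediately after the task finishes, ensures completed results survive even if the overall 12-hour budget runs out.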
This competition adopts a "develop freely, evaluate uniformly" strategy for model usage:
| Stage | Model Source | Description |
|---|---|---|
| Development / Local Debugging | Participant's choice (any LLM) | Organizers do not provide development API; participants apply for and use various model services on their own |
| Official Evaluation (After Submission) | Qwen3.5-35B-A3B (Uniformly Deployed by Organizers) | Evaluation system injects model service address via environment variables when starting container; all participants use the same model to ensure fairness |
Key Design Principle: Participant code must read model service address and API Key from environment variables and cannot hardcode them in the image. This way, the same code points to the participant's own model during local development and automatically switches to the organizers' uniformly deployed Qwen3.5-35B-A3B after submission.
When starting participant containers, the evaluation system injects runtime environment variables in the form of docker run -e KEY=VALUE. Participant programs must read these configurations from environment variables at runtime and must not hardcode API URLs, API Keys, and other information in code or images.
| Environment Variable Name | Description |
|---|---|
| MODEL_API_URL | Internal network deployed Qwen3.5-35B-A3B model service address (compatible with OpenAI Chat Completions API) |
| MODEL_API_KEY | Model service authentication Key (injected by evaluation system; participants do not need to know the specific value in advance) |
| MODEL_NAME | Model name used during evaluation, value is "qwen3.5-35b-a3b" |
⚠️ Warning: It is strictly prohibited to hardcode API Keys, API URLs, and other sensitive configurations in code, configuration files, or Docker images. Environment variables injected by the evaluation system at startup are the only legitimate configuration source.
Method 1: Using OpenAI SDK (Recommended)
import os
from openai import OpenAI

# Read from environment variables — set your own values during local development;
# the evaluation system injects them at runtime.
client = OpenAI(
    base_url=os.environ["MODEL_API_URL"],
    api_key=os.environ.get("MODEL_API_KEY", "EMPTY"),
)
model_name = os.environ.get("MODEL_NAME", "qwen3.5-35b-a3b")

response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "system", "content": "You are a data analysis agent."},
        {"role": "user", "content": question},
    ],
    ...,
)
answer = response.choices[0].message.content

Method 2: Using requests to directly call the HTTP interface
import os, requests

resp = requests.post(
    os.environ["MODEL_API_URL"] + "/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ.get('MODEL_API_KEY', 'EMPTY')}"},
    json={
        "model": os.environ.get("MODEL_NAME", "qwen3.5-35b-a3b"),
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.0,
    },
    ...,
)
answer = resp.json()["choices"][0]["message"]["content"]

| Access Type | Policy |
|---|---|
| External Internet | Completely blocked during evaluation; cannot access any public IPs or domain names |
| Internal Model Service | Allowed, accessed through the address specified by MODEL_API_URL environment variable |
| Other Internal Services | Blocked; container can only access model service endpoint |
| Inter-Container Communication | Prohibited; each submission has only one container, running independently throughout |
| Host Ports | All blocked, no ports exposed externally |
⚠️ Warning: Container network has blocked all external internet access. Any attempt to call external LLM services will fail due to network blocking. Please ensure that model calls in the code completely rely on the MODEL_API_URL environment variable.
The evaluation system uses a unified automated evaluation program to perform offline batch scoring of participant submitted prediction results. Participants need to submit result files in the specified format (such as prediction.csv), and the system will compare them against corresponding standard answers for evaluation and generate final scores.
All submissions run in the same evaluation environment, and the evaluation process is deterministic execution to ensure consistency and fairness of evaluation results.
Evaluation uses column-level content-consistency matching: each predicted column's normalized value vector (its column signature) is compared against the gold answer's columns, independent of column names. The core process is as follows:
On top of coverage (Recall), the evaluation applies a light penalty for redundant predicted columns, to measure prediction quality more comprehensively.
Definitions:
- Gold Columns: the number of columns in the gold answer
- Matched Columns: the number of gold columns matched by some predicted column
- Predicted Columns: the total number of columns in prediction.csv
- Extra Columns: predicted columns that match no gold column

First calculate:

Recall = Matched Columns / Gold Columns

The final score is:

Score = Recall - λ · (Extra Columns / Predicted Columns)

Where λ is the redundancy penalty coefficient set by the evaluation system.
This scoring method keeps a Recall orientation while moderately constraining the redundancy of prediction results: evaluation rewards not only finding all gold columns (Recall) but also keeping the prediction as concise and accurate as possible.
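A worked instance of the scoring formula, using λ = 0.5 purely as an illustrative value (the actual coefficient is set by the evaluation system, and `column_score` is a sketch, not the official scorer):

```python
def column_score(matched: int, gold: int, extra: int, predicted: int,
                 lam: float = 0.5) -> float:
    """Score = Recall - lam * (Extra Columns / Predicted Columns)."""
    recall = matched / gold
    penalty = lam * (extra / predicted) if predicted else 0.0
    return recall - penalty

# Gold answer has 2 columns; prediction matches both but adds 1 extra column:
print(round(column_score(matched=2, gold=2, extra=1, predicted=3), 3))  # → 0.833
```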
Before constructing column signatures, the evaluation system normalizes cell content to reduce the impact of format differences (such as standardization of numeric and time types).
| Type | Normalization Rule |
|---|---|
| Null Values | Empty string, "null", "none", "nan", "nat", "<na>" (case-insensitive) → unified to empty string "" |
| Numeric | Parsed with Decimal, rounded to 2 decimal places (ROUND_HALF_UP). E.g., 4200000 → "4200000.00", 0.005 → "0.01" |
| Date | ISO 8601 format (YYYY-MM-DD). E.g., "2024-3-1" must be "2024-03-01" |
| DateTime | With timezone: converted to UTC (ending with Z); without timezone: keep original ISO format |
| String | Remove leading/trailing whitespace and \r\n; rest preserved as-is (case-sensitive) |
| Name Fields | First name + last name as two columns OR combined as full name in one column both accepted. E.g., "John" + "Smith" or "John Smith" |
⚠️ Warning: String comparison is case-sensitive. For example, "East Asia" and "east asia" are considered different values. Please ensure prediction output matches the original string format in the data source.
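The null, numeric, and string rules above can be sketched as follows (date/datetime handling is omitted, and `normalize_cell` is an illustrative helper, not the evaluation system's actual code):

```python
from decimal import Decimal, ROUND_HALF_UP, InvalidOperation

NULL_TOKENS = {"", "null", "none", "nan", "nat", "<na>"}

def normalize_cell(raw: str) -> str:
    # Strip surrounding whitespace and \r\n.
    s = raw.strip().replace("\r", "").replace("\n", "")
    # Null-like tokens collapse to the empty string (case-insensitive).
    if s.lower() in NULL_TOKENS:
        return ""
    # Numerics: round to 2 decimal places with ROUND_HALF_UP.
    try:
        return str(Decimal(s).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
    except InvalidOperation:
        return s  # everything else kept as-is (case-sensitive)
```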
| prediction.csv | gold.csv | Result & Reason |
|---|---|---|
| Columns A, B, C all exist and values match completely | Columns B, C | ✅ High score: covers all gold columns; but slightly below perfect due to extra column A |
| Column B exists, C missing | Columns B, C | ⚠️ Low score: incomplete coverage of gold columns |
| Columns B, C exist but B values mismatch | Columns B, C | ⚠️ Low score: partial column match failure |
| Contains many irrelevant columns | Columns B, C | ⚠️ Significantly lower score: many redundant columns incur penalty |
| prediction.csv does not exist | Any | ❌ Score 0: missing file |
To ensure reasonable allocation of evaluation resources, submission frequency is subject to the following rules:
| Limit Item | Rule |
|---|---|
| Submission Prerequisite | Must wait for the previous submission's evaluation results before making the next submission |
| Evaluation Timeliness | Organizers will start evaluation within 24 hours of receiving submission; specific result time depends on team code's actual runtime |
| Daily Submission Limit | Each team can submit at most 1 time per day |
| Phase 1 Total Submissions | Each team can submit at most 30 times during Phase 1 |
| Leaderboard Score | Displays highest score |
| Image Version | Each submission requires uploading complete image package; incremental updates not supported |
The following behaviors will result in evaluation failure; serious cases will result in disqualification:
Q: Does a single container need to process all tasks in the dataset?
A: Yes. One container processes all tasks in the entire dataset. The Agent needs to traverse all task_<id> directories under /input, run inference independently for each task, and write results to the corresponding /output/task_<id>/prediction.csv.
Q: Can we use other LLMs during development?
A: Yes. During development and local debugging, participants can freely use any LLM (OpenAI, Anthropic, local models, etc.); organizers neither intervene nor provide development APIs. However, the code must read the model address and Key from environment variables rather than hardcoding them. After the Docker image is submitted, the evaluation system injects environment variables pointing to Qwen3.5-35B-A3B, and the code automatically switches to the organizers' unified model.
Q: Must column names in prediction.csv match the gold answer?
A: No. Column names do not participate in scoring; the evaluation script only compares each column's value vector (the column_values signature). Columns can be named arbitrarily, though semantic names are recommended for easier debugging.
Q: How precisely do numeric values need to match?
A: The evaluation script normalizes all numeric values to 2 decimal places (rounding). For example, a predicted value of 4200000 and a gold value of 4200000.00 are considered equal. Use sufficient precision (at least 2 decimal places) in predictions.
Q: Is the 12-hour limit per task or in total?
A: 12 hours is the total time limit for the entire container runtime, including processing time for all tasks. It is recommended to set separate timeout control for each task in the code and write out <task_id>/prediction.csv immediately after each task finishes, so a timeout cannot wipe out already-completed results.
Q: Is the context/ directory structure fixed across tasks?
A: No. Different tasks may contain different combinations of data sources (csv/, db/, json/, doc/, knowledge.md). The Agent should dynamically detect the subdirectories and files that actually exist under context/ rather than assuming a fixed structure.