Commencing Cortex Code Desktop: Building a LEGO Data Pipeline from Scratch

A few weeks ago I was lucky enough to sit in on a Snowflake AI session run by The Information Lab's one and only Will Sutton. The session was my first real introduction to Cortex Code Desktop (Coco). However, perhaps because we got too distracted making our agent respond in different ways ("like a caveman" and "be sarcastic" were some of my favourites), we were unable to complete it.

Thankfully, I had time earlier this week to complete the final task of that session: using Coco to create a data pipeline centred around data pulled from the Rebrickable API.

What's the Goal?

The Rebrickable API exposes structured data about every LEGO set, theme, colour, and minifigure ever produced. My task was as follows:

Extract data from four API endpoints (themes, colours, sets, minifigs)
Land the raw JSON into Snowflake
Create dimension views on top, ready for downstream analysis

Nothing groundbreaking, but it is the kind of end-to-end pipeline that can still take days to complete, and exactly the kind of task I needed to see where Coco might fit into my standard workday.

First Impressions: Conversational Architecture

When I first worked with Coco, one aspect that stood out was how conversational the process felt. Instead of writing prompts and getting back a wall of code to copy and paste, I found that the agent would first ask any follow-up questions it had, propose an approach, ask for my thoughts or preferences, and then execute directly by running SQL, reading and writing files, and testing outputs.

I started the session by describing what I wanted: "Build an ingestion pipeline from the Rebrickable API into Snowflake." Coco's first move wasn't to write code; it was to research the API. It figured out the authentication pattern, pagination structure, rate limits, and response shapes, then came back with an architecture plan for me to approve before writing a line of Python.
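To give a feel for the pagination pattern involved, here is a minimal sketch rather than the code Coco actually wrote. It assumes each response page looks like {"results": [...], "next": <url or null>} and that the key is sent in an Authorization header of the form "key <value>". The names get_json and fetch_all are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional


def get_json(url: str, api_key: str) -> dict:
    """Fetch a single page of results (header format is an assumption)."""
    request = urllib.request.Request(url, headers={"Authorization": f"key {api_key}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_all(
    url: str,
    fetch: Callable[[str], dict],
    pause_seconds: float = 1.0,
) -> Iterator[dict]:
    """Yield every record, following 'next' links until none remain."""
    next_url: Optional[str] = url
    while next_url:
        page = fetch(next_url)
        yield from page.get("results", [])
        next_url = page.get("next")
        if next_url:
            # Crude fixed pause between pages to stay under rate limits.
            time.sleep(pause_seconds)
```

Passing the page fetcher in as a function keeps the loop easy to test without touching the network. In real use it could be called as fetch_all(start_url, lambda u: get_json(u, api_key)).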

It really felt like Coco was trying to follow data engineering principles as best as possible. Plan first, build second.

Setting Guardrails with AGENTS.md

Before any data flowed, I set up some boundaries for my agent, as this project would involve a personal API key and, frankly, I didn't feel like receiving a hefty bill later on. The project has an AGENTS.md file, a simple markdown rules file that tells Coco what it can and cannot do:

The API key comes from the REBRICKABLE_API_KEY environment variable only
All objects must live in my selected schema
No hardcoded credentials anywhere
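In code, rules like these might translate into something like the sketch below. It is an illustration under assumptions, not the project's actual code: load_api_key and qualify are hypothetical helper names, and the schema name is whatever you pass in.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "REBRICKABLE_API_KEY"


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or fail loudly if it is absent."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        # No fallback value: a missing key should stop the run, not be papered over.
        raise RuntimeError(f"{ENV_VAR} is not set")
    return key


def qualify(object_name: str, schema: str) -> str:
    """Prefix an unqualified object name with the one permitted schema."""
    if "." in object_name:
        raise ValueError(f"{object_name!r} is already qualified; only {schema} is allowed")
    return f"{schema}.{object_name}"
```

Failing fast on a missing key, rather than falling back to a default, is what keeps a hardcoded credential from creeping in later.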

I also added a security hook — a shell script in .cortex/hooks/ that intercepts every tool call and blocks any attempt to read the .env file, within which my API key sat. This created a hard gate preventing the agent from accidentally exposing secrets regardless of what it's asked to do.

This gave me the confidence to let Coco operate with more autonomy going forward. If the guardrails are solid, you can trust the agent to move fast.

From Raw JSON to Clean Tables

The API returns data as JSON, messy nested objects that aren't easy to query directly. The pipeline handles this in two layers:

Raw tables store the JSON exactly as it arrives from the API, one object per row. This way the original data is always preserved in case we need it later.
Dimension views sit on top and pull out the useful fields into proper columns (set name, release year, number of parts, etc.). These are what I'd actually query for analysis.
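The two layers can be mimicked in a few lines of Python. This is only a sketch of the idea: in the real pipeline the second step is a Snowflake view, not a Python function. The field names (set_num, name, year, theme_id, num_parts) are my assumption about the shape of a set record, and to_raw_row and to_set_dimension are hypothetical names.

```python
import json
from datetime import datetime, timezone


def to_raw_row(endpoint: str, record: dict) -> tuple:
    """Raw layer: keep the whole object as JSON text, plus light load metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return (endpoint, json.dumps(record, sort_keys=True), loaded_at)


def to_set_dimension(raw_json: str) -> dict:
    """Dimension layer: pull selected fields out of the raw JSON into named columns."""
    record = json.loads(raw_json)
    return {
        "set_num": record.get("set_num"),
        "set_name": record.get("name"),
        "release_year": record.get("year"),
        "theme_id": record.get("theme_id"),
        "num_parts": record.get("num_parts"),
    }
```

Because the raw row keeps every field, a column that the dimension layer ignores today can still be added later without calling the API again.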

Coco built both layers, including a helper view that resolves LEGO's theme hierarchy (e.g. mapping "Star Wars > Clone Wars" back to just "Star Wars") so that any future analysis can group sets by their top-level franchise without extra work.
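The logic behind that helper view is a walk up the parent chain. Here is a minimal Python sketch of the same idea, assuming each theme has an id and an optional parent id; the function name and the dictionary layout are hypothetical, and the actual view does this in SQL.

```python
from typing import Dict, Optional


def top_level_theme(theme_id: int, parents: Dict[int, Optional[int]]) -> int:
    """Walk parent links upward until reaching a theme with no parent."""
    seen = set()
    current = theme_id
    while parents.get(current) is not None:
        if current in seen:
            # Defensive check: bad data should not cause an endless loop.
            raise ValueError(f"cycle detected in theme hierarchy at id {current}")
        seen.add(current)
        current = parents[current]
    return current
```

With parents = {1: None, 2: 1, 3: 2}, calling top_level_theme(3, parents) climbs from 3 to 2 to 1 and returns 1, the root theme.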

What Surprised Me

The value of the planning phase. Coco spent time understanding API pagination, proposing architectures, and confirming naming conventions before generating a single function. The code it produced was cleaner in the end because of that initial time and cost spent planning.

It validates itself. After building the modelled views, Coco ran DESCRIBE VIEW against each one to confirm that the column names and types actually existed. This guards against a common AI issue: hallucinating plausible column names that don't correspond to reality.
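The same kind of check is easy to script yourself. The sketch below assumes you already have the column names reported for a view (how you fetch them is left out), and compares them with the names you expect; missing_columns is a hypothetical helper, not something Coco generated.

```python
from typing import Iterable, List, Set


def missing_columns(expected: Iterable[str], described: Iterable[str]) -> List[str]:
    """Return expected column names absent from the described view (case-insensitive)."""
    actual: Set[str] = {name.upper() for name in described}
    return [name for name in expected if name.upper() not in actual]
```

An empty list means every expected column exists; anything else is a name worth questioning before it ends up in a query.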

It remembers context across the session. Later steps referenced specific row counts (e.g., 27,069 sets, 494 themes) that it learned during ingestion. There was no re-explaining required between pipeline phases.

Guardrails save time. With the .env hook and AGENTS.md in place, I didn't need to review every operation for accidental API key exposure. I could focus instead on the outcomes and decisions Coco was making.

Final Thoughts

This was a single-session build. API research, pipeline development, and a complete modelled layer, all through conversation. I made architectural decisions and set boundaries but Coco did all the work.

It's not magic. You still need to understand what good data engineering looks like. Even so, it helped someone like me, with little practical data engineering experience, complete in a couple of hours a task that would otherwise have taken multiple days, and I left confident that the completed work is correct.

The repository is on GitHub and the data is ready for whatever comes next.

Author:

Tobin Hardy
