In this post we’ll build a small, fully local data pipeline you can run on your own machine in under an hour:
- take a public, moderately complex JSON dataset (nested fields, arrays)
- ingest it into a local Kafka cluster twice:
- once as raw JSON (as-is payloads)
- once as Avro, backed by Schema Registry
- consume from both topics using kcat
- load the consumed data into DuckDB for quick, in-memory analysis and exploration
The point isn’t “big data” scale. It’s to get a practical feel for two ingestion styles (raw JSON vs Avro+Schema Registry), what the payloads look like on the wire, and how easy it is to pivot from streaming ingestion to ad-hoc analysis.
DuckDB is a small, fast analytical database that runs in-process (no server to run). It shines for local exploration: you can load JSON/CSV/Parquet into memory and query it with SQL immediately. In this POC we’ll use it as a lightweight “analysis scratchpad” after pulling messages out of Kafka.
You can replicate everything locally with:
- Docker + Docker Compose
- python
- jq
- kcat
- DuckDB (CLI)
No cloud services, no managed Kafka, no external databases. Just a reproducible docker-compose.yml, a small dataset file, and a few commands you can copy/paste.
As we go, we’ll fill in a simple structure:
- `docker-compose.yml` – Kafka + Schema Registry
- `data/` – downloaded dataset (NDJSON)
- `schemas/` – Avro schema(s)
- `scripts/` – tiny helpers for producing/transforming data
- `blog.md` – the post itself (this file)
Next up: bringing up Kafka + Schema Registry via Docker Compose, then validating connectivity with kcat.
We’ll run a minimal, single-node Kafka-compatible setup locally using Redpanda. Redpanda provides the Kafka API and a Schema Registry–compatible endpoint in a single container, which keeps the stack small and fast to start.
This setup is intentionally development-only: no security, no persistence tuning, no high availability. The goal is a predictable local environment that works well with standard Kafka tooling like kcat.
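For reference, a minimal `docker-compose.yml` for this setup might look like the following. The image tag and startup flags are assumptions based on Redpanda's single-node development defaults; adjust them to your installed version:

```yaml
services:
  redpanda:
    image: docker.redpanda.com/redpandadata/redpanda:latest
    container_name: redpanda
    command:
      - redpanda
      - start
      - --mode=dev-container          # dev-friendly defaults, no tuning
      - --smp=1                       # single core is plenty for a POC
      - --kafka-addr=0.0.0.0:9092
      - --advertise-kafka-addr=localhost:9092
      - --schema-registry-addr=0.0.0.0:8081
    ports:
      - "9092:9092"   # Kafka API
      - "8081:8081"   # Schema Registry
```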
Intentionally ephemeral

The cluster does not persist data between runs. When we shut it down, all topics, messages, and schemas are gone. For a POC and a tutorial-style workflow, this is a feature: every run starts from a clean slate.
Once the container is running, Kafka will be reachable on localhost:9092 and Schema Registry on localhost:8081.
Start the stack
```shell
docker compose up -d
```

Sanity checks

```shell
# Kafka API
kcat -L -b localhost:9092

# Schema Registry
curl http://localhost:8081/subjects
```

If both commands succeed, we have a working local Kafka + Schema Registry environment and can move on to ingesting data.
All commands from here on are executed on the host machine, not inside the container.
For this POC we’ll use a public JSON dataset containing metadata about movies extracted from Wikipedia. Each record represents a single movie and includes a mix of simple fields and nested structures:
- scalar values (title, year)
- arrays (cast, genres)
- optional / missing fields across records
This makes it a good fit for comparing raw JSON ingestion with schema-based Avro ingestion without being overly large or artificial.
Dataset
- Source: https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json
- Size: ~3,600 records
- Structure: JSON array of objects
We’ll treat one movie = one Kafka message.
Since kcat produces one message per line of input, we first convert the dataset into NDJSON (newline-delimited JSON, one object per line).
Download and prepare
```shell
mkdir -p data

curl -o data/movies.json \
  https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json

# Convert JSON array to NDJSON (one movie per line)
jq -c '.[]' data/movies.json > data/movies.ndjson
```

At this point we have a static, local dataset that’s easy to replay, inspect, and re-ingest. In the next step, we’ll produce these records into Kafka as raw JSON, exactly as they appear in the file.
With the dataset prepared as NDJSON (one JSON object per line), we can now ingest it into Kafka as raw JSON. This is the simplest ingestion style: we don’t enforce a schema at the broker level and we don’t involve Schema Registry. Each movie becomes one Kafka message value.
First, create the topic using Redpanda’s rpk tool:
```shell
docker exec -it redpanda rpk topic create json.raw -p 1 -r 1
```

Then produce the file into Kafka. The `-l` flag tells kcat to treat each line as one message:

```shell
kcat -P -b localhost:9092 -t json.raw -l data/movies.ndjson
```

Finally, consume a few messages to confirm everything landed as expected:

```shell
kcat -C -b localhost:9092 -t json.raw -o -5 -e
```

At this point we have the dataset in Kafka as plain JSON, ready to be consumed and analysed later (and to compare against the Avro + Schema Registry ingestion path next).
Next, we ingest the same dataset again — this time as Avro, backed by Schema Registry. Compared to the raw JSON topic, this gives us explicit structure, validation, and a foundation for schema evolution.
We start by defining a minimal Avro schema that matches the shape of the Wikipedia movie data: scalar fields (title, year), arrays (cast, genres), and several optional fields.
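Such a schema (say, `schemas/movies.avsc`) might be sketched as follows. The exact field list and defaults are assumptions based on the dataset's shape; nullable unions keep records with missing fields valid:

```json
{
  "type": "record",
  "name": "Movie",
  "namespace": "example.movies",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "year", "type": "int"},
    {"name": "cast", "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "genres", "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "href", "type": ["null", "string"], "default": null},
    {"name": "extract", "type": ["null", "string"], "default": null}
  ]
}
```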
The producer itself is a small Python script that:
- reads
data/movies.ndjson - normalises records to the Avro schema
- registers the schema automatically with Schema Registry
- produces Avro-encoded messages to Kafka
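The normalisation step can be sketched in isolation. This is a hypothetical `normalise` helper, not the actual script; the field list mirrors the dataset's shape, and the choice of defaults is an assumption:

```python
# Sketch of the normalisation step only (the real
# scripts/produce_movies_avro.py may differ). Avro requires every field
# the schema declares, so missing values get type-appropriate defaults.
from typing import Any

# Field -> default value (an assumption, mirroring the dataset's shape).
SCHEMA_DEFAULTS: dict[str, Any] = {
    "title": "",
    "year": 0,
    "cast": [],
    "genres": [],
}

def normalise(record: dict[str, Any]) -> dict[str, Any]:
    """Keep exactly the schema's fields, filling in defaults for any
    that are missing from the raw JSON record; extra keys are dropped."""
    return {
        # list() copies the default so records never share a mutable value.
        field: record.get(field, list(default) if isinstance(default, list) else default)
        for field, default in SCHEMA_DEFAULTS.items()
    }

# A record missing 'cast' gets an empty list; unknown keys disappear.
print(normalise({"title": "Carmencita", "year": 1894,
                 "genres": ["Documentary"], "href": "Carmencita"}))
# -> {'title': 'Carmencita', 'year': 1894, 'cast': [], 'genres': ['Documentary']}
```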
```shell
python3 -m pip install --upgrade \
  confluent-kafka fastavro attrs cachetools authlib

docker exec -it redpanda rpk topic create avro.movies -p 1 -r 1

python3 scripts/produce_movies_avro.py \
  --bootstrap localhost:9092 \
  --schema-registry http://localhost:8081 \
  --topic avro.movies \
  --input data/movies.ndjson
```

Once the producer finishes, you can verify that the schema was registered:

```shell
curl http://localhost:8081/subjects
```

You should see a subject named `avro.movies-value`.
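On the wire, each message now carries Confluent's standard framing: a zero magic byte, a 4-byte big-endian schema ID, then the Avro binary body. A quick way to peek at that header (the raw bytes below are illustrative, not a real captured message):

```python
import struct

# Confluent wire format for Schema Registry-backed messages:
#   byte 0      -> magic byte (always 0x00)
#   bytes 1..4  -> schema ID, big-endian unsigned int
#   bytes 5..   -> Avro binary-encoded record
raw = b"\x00\x00\x00\x00\x01" + b"\x0eCarmencita"  # illustrative payload

magic, schema_id = struct.unpack(">bI", raw[:5])
assert magic == 0  # anything else is not Confluent-framed Avro
print(f"schema id: {schema_id}")  # -> schema id: 1
```

This framing is why a raw Avro message looks like binary noise to plain kcat, and why the consumer needs the registry URL to decode it.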
At this point, the same dataset exists in Kafka in two forms:
- `json.raw` → raw, schemaless JSON
- `avro.movies` → Avro with schema enforcement
In the next step, we’ll consume from these topics using kcat and load the data into DuckDB for in-memory analysis.
To explore the data interactively, we’ll pull records out of Kafka with kcat and load them into DuckDB for ad-hoc SQL analysis. DuckDB works very well with newline-delimited JSON, which makes it a natural fit for this kind of workflow.
One small detail to be aware of: `cast` is a reserved keyword in DuckDB, so we’ll quote it as `"cast"` in all queries.
This variant streams records directly from Kafka into DuckDB and runs a few queries immediately. Nothing is written to disk.
```shell
kcat -C -b localhost:9092 \
  -t json.raw \
  -o beginning -e -q \
  -f '%s\n' \
| duckdb :memory: -c "
CREATE TABLE movies_json AS
SELECT * FROM read_json_auto('/dev/stdin');

SELECT COUNT(*) AS movies FROM movies_json;

SELECT g AS genre, COUNT(*) AS cnt
FROM movies_json, UNNEST(genres) AS t(g)
GROUP BY 1
ORDER BY cnt DESC
LIMIT 10;

SELECT c AS actor, COUNT(*) AS cnt
FROM movies_json, UNNEST(\"cast\") AS t(c)
GROUP BY 1
ORDER BY cnt DESC
LIMIT 10;

SELECT title, year
FROM movies_json
WHERE list_contains(\"cast\", 'Anthony Hopkins')
ORDER BY year, title;
"
```

If your kcat build supports Avro decoding, you can do the same with the Avro topic:
```shell
kcat -C -b localhost:9092 \
  -t avro.movies \
  -o beginning -e -q \
  -s value=avro \
  -r http://localhost:8081 \
  -f '%s\n' \
| duckdb :memory: -c "
CREATE TABLE movies_avro AS
SELECT * FROM read_json_auto('/dev/stdin');

SELECT COUNT(*) AS movies FROM movies_avro;
"
```

This is ideal for quick checks and copy-paste experiments.
If you want to explore the data more freely, it’s often nicer to keep the DuckDB CLI open.
```shell
# Raw JSON
kcat -C -b localhost:9092 \
  -t json.raw \
  -o beginning -e -q \
  -f '%s\n' \
  > /tmp/movies.jsonl

# Avro (decoded)
kcat -C -b localhost:9092 \
  -t avro.movies \
  -o beginning -e -q \
  -s value=avro \
  -r http://localhost:8081 \
  -f '%s\n' \
  > /tmp/movies-avro.jsonl

duckdb :memory:
```

Inside the DuckDB shell:
```sql
CREATE TABLE movies_json AS
SELECT * FROM read_json_auto('/tmp/movies.jsonl');

CREATE TABLE movies_avro AS
SELECT * FROM read_json_auto('/tmp/movies-avro.jsonl');

SELECT COUNT(*) FROM movies_json;

SELECT c AS actor, COUNT(*) AS cnt
FROM movies_json, UNNEST("cast") AS t(c)
GROUP BY 1
ORDER BY cnt DESC
LIMIT 10;

SELECT title, year
FROM movies_json
WHERE list_contains("cast", 'Anthony Hopkins')
ORDER BY year, title;
```

This gives us a clean end-to-end loop:
- Kafka for ingestion and replay
- Avro + Schema Registry when we want structure and guarantees
- DuckDB as a fast, zero-setup SQL scratchpad for local analysis
At this point, you can re-consume the topics, tweak queries, or compare raw JSON vs Avro-decoded data with almost no friction.
For larger topics, it’s often unnecessary to consume everything. kcat makes it easy to read just a subset.
Read only the last N messages:
```shell
kcat -C -b localhost:9092 -t json.raw -o -1000 -e -q -f '%s\n'
```

Stop after a fixed number of messages:

```shell
kcat -C -b localhost:9092 -t json.raw -o beginning -c 500 -f '%s\n'
```

Both options work seamlessly with the DuckDB ingestion patterns shown earlier and keep local experiments fast and predictable.
So far we’ve used a non-compacted topic, which is the simplest model: every produced message is retained (subject to retention settings) and consuming from the beginning replays the full history.
For some use cases, especially when Kafka represents current state rather than events, log-compacted topics are a better fit.
Non-compacted topics
- Append-only event history
- Replaying the topic shows everything that ever happened
- Well suited for analytics, auditing, and batch-style exploration (like this POC)
Compacted topics
- Kafka retains only the latest value per key
- Replaying the topic reconstructs the current state
- Well suited for reference data, lookup tables, and materialised views
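In this setup, a compacted topic could be created like so. The `cleanup.policy` key is standard Kafka; the topic name is hypothetical and the `-c` flag syntax is an assumption — check `rpk topic create --help` on your version:

```shell
# Create a log-compacted topic: Kafka keeps only the latest value per key,
# so producing must use stable keys (e.g. the movie title) for compaction
# to have any effect.
docker exec -it redpanda rpk topic create movies.current \
  -p 1 -r 1 \
  -c cleanup.policy=compact
```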
When consuming compacted topics with kcat, keep in mind:
- you may see fewer records than were originally produced
- keys matter — without stable keys, compaction has no effect
- loading into DuckDB typically represents a snapshot, not a full history
In practice, both models work well with DuckDB. The key is knowing whether you want to analyse event history (non-compacted) or current state (compacted) when pulling data out of Kafka.
In under an hour, we built a fully local, reproducible pipeline: public JSON data → Kafka → (optionally) Avro + Schema Registry → DuckDB for analysis. The setup is intentionally small, but the patterns scale.
A few key takeaways:
- Raw JSON is hard to beat for speed and flexibility when exploring data or bootstrapping pipelines.
- Avro + Schema Registry adds structure and safety with very little additional effort once the schema is in place.
- kcat remains an incredibly powerful glue tool, especially when combined with NDJSON.
- DuckDB is an excellent companion for Kafka when you want fast, local, SQL-based analysis without standing up another service.
This kind of workflow is ideal for POCs, debugging, data exploration, and teaching — and it maps cleanly to more production-grade setups when needed.