thriving-dev/kafka-kcat-duckdb-analysis

kcat + DuckDB: The Fastest Way to Explore Kafka Topics Locally

In this post we’ll build a small, fully local data pipeline you can run on your own machine in under an hour:

  • take a public, moderately complex JSON dataset (nested fields, arrays)
  • ingest it into a local Kafka cluster twice:
    • once as raw JSON (as-is payloads)
    • once as Avro, backed by Schema Registry
  • consume from both topics using kcat
  • load the consumed data into DuckDB for quick, in-memory analysis and exploration

The point isn’t “big data” scale. It’s to get a practical feel for two ingestion styles (raw JSON vs Avro+Schema Registry), what the payloads look like on the wire, and how easy it is to pivot from streaming ingestion to ad-hoc analysis.

Why DuckDB here?

DuckDB is a small, fast analytical database that runs in-process (no server to run). It shines for local exploration: you can load JSON/CSV/Parquet into memory and query it with SQL immediately. In this POC we’ll use it as a lightweight “analysis scratchpad” after pulling messages out of Kafka.

What you’ll need

You can replicate everything locally with:

  • Docker + Docker Compose
  • Python 3
  • jq
  • kcat
  • DuckDB (CLI)

No cloud services, no managed Kafka, no external databases. Just a reproducible docker-compose.yml, a small dataset file, and a few commands you can copy/paste.

Repo layout

As we go, we’ll fill in a simple structure:

  • docker-compose.yml – Kafka + Schema Registry
  • data/ – downloaded dataset (NDJSON)
  • schemas/ – Avro schema(s)
  • scripts/ – tiny helpers for producing/transforming data
  • blog.md – the post itself (this file)

Next up: bringing up Kafka + Schema Registry via Docker Compose, then validating connectivity with kcat.

Docker Compose: Kafka + Schema Registry (via Redpanda)

We’ll run a minimal, single-node Kafka-compatible setup locally using Redpanda. Redpanda provides the Kafka API and a Schema Registry–compatible endpoint in a single container, which keeps the stack small and fast to start.

This setup is intentionally development-only: no security, no persistence tuning, no high availability. The goal is a predictable local environment that works well with standard Kafka tooling like kcat.

Intentionally ephemeral: the cluster does not persist data between runs. When we shut it down, all topics, messages, and schemas are gone. For a POC and a tutorial-style workflow, this is a feature: every run starts from a clean slate.

Once the container is running, Kafka will be reachable on localhost:9092 and Schema Registry on localhost:8081.
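
The post references a docker-compose.yml without inlining it. A minimal single-node sketch along these lines, based on Redpanda's documented start flags (the image tag and exact flags may vary by Redpanda version), could look like:

```yaml
services:
  redpanda:
    image: docker.redpanda.com/redpandadata/redpanda:latest
    container_name: redpanda   # later commands rely on this name for `docker exec`
    command:
      - redpanda
      - start
      - --overprovisioned
      - --smp=1
      - --node-id=0
      - --kafka-addr=PLAINTEXT://0.0.0.0:9092
      - --advertise-kafka-addr=PLAINTEXT://localhost:9092
      - --schema-registry-addr=0.0.0.0:8081
    ports:
      - "9092:9092"   # Kafka API
      - "8081:8081"   # Schema Registry
```

The advertised address must be localhost:9092 so that host-side tools like kcat can reach the broker after the initial metadata handshake.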

Start the stack

docker compose up -d

Sanity checks

# Kafka API
kcat -L -b localhost:9092

# Schema Registry
curl http://localhost:8081/subjects

If both commands succeed, we have a working local Kafka + Schema Registry environment and can move on to ingesting data.

All commands from here on are executed on the host machine; the only exceptions are the rpk topic-creation commands, which we invoke inside the container via docker exec.

Dataset: Wikipedia Movie Data (JSON)

For this POC we’ll use a public JSON dataset containing metadata about movies extracted from Wikipedia. Each record represents a single movie and includes a mix of simple fields and nested structures:

  • scalar values (title, year)
  • arrays (cast, genres)
  • optional / missing fields across records

This makes it a good fit for comparing raw JSON ingestion with schema-based Avro ingestion without being overly large or artificial.

We’ll treat one movie = one Kafka message.

Since kcat produces one message per input line, we first convert the dataset (a single large JSON array) into newline-delimited JSON (NDJSON, one object per line).

Download and prepare

mkdir -p data

curl -o data/movies.json \
  https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json

# Convert JSON array to NDJSON (one movie per line)
jq -c '.[]' data/movies.json > data/movies.ndjson
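
If jq isn’t available, the same conversion can be done with a few lines of stdlib Python (a hypothetical helper, not part of the repo’s scripts/):

```python
import json
from pathlib import Path

# Stdlib equivalent of `jq -c '.[]'`: read a JSON array,
# write one compact JSON object per line (NDJSON).
def json_array_to_ndjson(src: str, dst: str) -> int:
    records = json.loads(Path(src).read_text())
    with open(dst, "w") as f:
        for record in records:
            f.write(json.dumps(record, separators=(",", ":")) + "\n")
    return len(records)
```

The compact separators mirror jq’s -c output, keeping each record on a single line so kcat can later produce it as one message.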

At this point we have a static, local dataset that’s easy to replay, inspect, and re-ingest. In the next step, we’ll produce these records into Kafka as raw JSON, exactly as they appear in the file.

Produce to Kafka topic json.raw

With the dataset prepared as NDJSON (one JSON object per line), we can now ingest it into Kafka as raw JSON. This is the simplest ingestion style: we don’t enforce a schema at the broker level and we don’t involve Schema Registry. Each movie becomes one Kafka message value.

First, create the topic using Redpanda’s rpk tool:

docker exec -it redpanda rpk topic create json.raw -p 1 -r 1

Then produce the file into Kafka. The -l flag tells kcat to treat each line as one message:

kcat -P -b localhost:9092 -t json.raw -l data/movies.ndjson

Finally, consume a few messages to confirm everything landed as expected:

kcat -C -b localhost:9092 -t json.raw -o -5 -e

At this point we have the dataset in Kafka as plain JSON, ready to be consumed and analysed later (and to compare against the Avro + Schema Registry ingestion path next).

Produce Avro records to avro.movies

Next, we ingest the same dataset again — this time as Avro, backed by Schema Registry. Compared to the raw JSON topic, this gives us explicit structure, validation, and a foundation for schema evolution.

We start by defining a minimal Avro schema that matches the shape of the Wikipedia movie data: scalar fields (title, year), arrays (cast, genres), and several optional fields.
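
A minimal schema along those lines might look like this (a sketch: the field names beyond title, year, cast, and genres are assumptions about the dataset, and the record/namespace names are illustrative):

```json
{
  "type": "record",
  "name": "Movie",
  "namespace": "wikipedia",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "year", "type": "int"},
    {"name": "cast", "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "genres", "type": {"type": "array", "items": "string"}, "default": []},
    {"name": "href", "type": ["null", "string"], "default": null},
    {"name": "extract", "type": ["null", "string"], "default": null}
  ]
}
```

Optional fields are modelled as nullable unions with a null default, so records missing them still validate.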

The producer itself is a small Python script that:

  • reads data/movies.ndjson
  • normalises records to the Avro schema
  • registers the schema automatically with Schema Registry
  • produces Avro-encoded messages to Kafka
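
The actual scripts/produce_movies_avro.py isn’t reproduced here, but its normalisation step could be sketched roughly like this (a hypothetical helper, stdlib only; serialisation and producing via confluent-kafka are omitted):

```python
import json

# Coerce a raw movie record into the shape the (assumed) Avro schema expects:
# required scalars get defaults, arrays are always present, extra keys are dropped.
def normalize(record: dict) -> dict:
    return {
        "title": str(record.get("title", "")),
        "year": int(record.get("year", 0)),
        "cast": [str(c) for c in record.get("cast") or []],
        "genres": [str(g) for g in record.get("genres") or []],
    }

# One NDJSON line in, one schema-conforming dict out.
line = '{"title": "The Silence of the Lambs", "year": 1991, "cast": ["Anthony Hopkins"]}'
print(normalize(json.loads(line)))
```

Normalising before serialisation matters because Avro is strict: a record missing a required field, or carrying a field of the wrong type, fails at produce time rather than at query time.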

Install dependencies

python3 -m pip install --upgrade \
  confluent-kafka fastavro attrs cachetools authlib

Create the topic

docker exec -it redpanda rpk topic create avro.movies -p 1 -r 1

Produce Avro data

python3 scripts/produce_movies_avro.py \
  --bootstrap localhost:9092 \
  --schema-registry http://localhost:8081 \
  --topic avro.movies \
  --input data/movies.ndjson

Once the producer finishes, you can verify that the schema was registered:

curl http://localhost:8081/subjects

You should see a subject named avro.movies-value.

At this point, the same dataset exists in Kafka in two forms:

  • json.raw → raw, schemaless JSON
  • avro.movies → Avro with schema enforcement

In the next step, we’ll consume from these topics using kcat and load the data into DuckDB for in-memory analysis.

From Kafka to DuckDB (in-memory)

To explore the data interactively, we’ll pull records out of Kafka with kcat and load them into DuckDB for ad-hoc SQL analysis. DuckDB works very well with newline-delimited JSON, which makes it a natural fit for this kind of workflow.

One small detail to be aware of: cast is a reserved keyword in DuckDB, so we’ll quote it as "cast" in all queries.


Variant 1: Direct stream → DuckDB (one-liner)

This variant streams records directly from Kafka into DuckDB and runs a few queries immediately. Nothing is written to disk.

Raw JSON topic

kcat -C -b localhost:9092 \
  -t json.raw \
  -o beginning -e -q \
  -f '%s\n' \
| duckdb :memory: -c "
  CREATE TABLE movies_json AS
  SELECT * FROM read_json_auto('/dev/stdin');

  SELECT COUNT(*) AS movies FROM movies_json;

  SELECT g AS genre, COUNT(*) AS cnt
  FROM movies_json, UNNEST(genres) AS t(g)
  GROUP BY 1
  ORDER BY cnt DESC
  LIMIT 10;

  SELECT c AS actor, COUNT(*) AS cnt
  FROM movies_json, UNNEST(\"cast\") AS t(c)
  GROUP BY 1
  ORDER BY cnt DESC
  LIMIT 10;

  SELECT title, year
  FROM movies_json
  WHERE list_contains(\"cast\", 'Anthony Hopkins')
  ORDER BY year, title;
"

Avro topic (decoded via Schema Registry)

If your kcat build supports Avro decoding, you can do the same with the Avro topic:

kcat -C -b localhost:9092 \
  -t avro.movies \
  -o beginning -e -q \
  -s value=avro \
  -r http://localhost:8081 \
  -f '%s\n' \
| duckdb :memory: -c "
  CREATE TABLE movies_avro AS
  SELECT * FROM read_json_auto('/dev/stdin');

  SELECT COUNT(*) AS movies FROM movies_avro;
"

This is ideal for quick checks and copy-paste experiments.


Variant 2: Kafka → temp file → interactive DuckDB shell

If you want to explore the data more freely, it’s often nicer to keep the DuckDB CLI open.

Consume from Kafka into a temporary file

# Raw JSON
kcat -C -b localhost:9092 \
  -t json.raw \
  -o beginning -e -q \
  -f '%s\n' \
  > /tmp/movies.jsonl

# Avro (decoded)
kcat -C -b localhost:9092 \
  -t avro.movies \
  -o beginning -e -q \
  -s value=avro \
  -r http://localhost:8081 \
  -f '%s\n' \
  > /tmp/movies-avro.jsonl

Open DuckDB and query interactively

duckdb :memory:

Inside the DuckDB shell:

CREATE TABLE movies_json AS
SELECT * FROM read_json_auto('/tmp/movies.jsonl');

CREATE TABLE movies_avro AS
SELECT * FROM read_json_auto('/tmp/movies-avro.jsonl');

SELECT COUNT(*) FROM movies_json;

SELECT c AS actor, COUNT(*) AS cnt
FROM movies_json, UNNEST("cast") AS t(c)
GROUP BY 1
ORDER BY cnt DESC
LIMIT 10;

SELECT title, year
FROM movies_json
WHERE list_contains("cast", 'Anthony Hopkins')
ORDER BY year, title;

This gives us a clean end-to-end loop:

  • Kafka for ingestion and replay
  • Avro + Schema Registry when we want structure and guarantees
  • DuckDB as a fast, zero-setup SQL scratchpad for local analysis

At this point, you can re-consume the topics, tweak queries, or compare raw JSON vs Avro-decoded data with almost no friction.


Limiting how much data you read from Kafka

For larger topics, it’s often unnecessary to consume everything. kcat makes it easy to read just a subset.

Read only the last N messages:

kcat -C -b localhost:9092 -t json.raw -o -1000 -e -q -f '%s\n'

Stop after a fixed number of messages:

kcat -C -b localhost:9092 -t json.raw -o beginning -c 500 -f '%s\n'

Both options work seamlessly with the DuckDB ingestion patterns shown earlier and keep local experiments fast and predictable.

Compacted vs. non-compacted topics

So far we’ve used a non-compacted topic, which is the simplest model: every produced message is retained (subject to retention settings) and consuming from the beginning replays the full history.

For some use cases, especially when Kafka represents current state rather than events, log-compacted topics are a better fit.

Non-compacted topics

  • Append-only event history
  • Replaying the topic shows everything that ever happened
  • Well suited for analytics, auditing, and batch-style exploration (like this POC)

Compacted topics

  • Kafka retains only the latest value per key
  • Replaying the topic reconstructs the current state
  • Well suited for reference data, lookup tables, and materialised views

When consuming compacted topics with kcat, keep in mind:

  • you may see fewer records than were originally produced
  • keys matter — without stable keys, compaction has no effect
  • loading into DuckDB typically represents a snapshot, not a full history

In practice, both models work well with DuckDB. The key is knowing whether you want to analyse event history (non-compacted) or current state (compacted) when pulling data out of Kafka.
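
The “latest value per key” behaviour is easy to picture with a tiny Python sketch (purely illustrative; real compaction also runs asynchronously and retains tombstones for a while):

```python
# Model a topic partition as an ordered list of (key, value) messages,
# and compaction as keeping only the latest value per key.
def compact(messages):
    latest = {}
    for key, value in messages:  # replay in offset order
        latest[key] = value      # later offsets overwrite earlier ones
    return latest

messages = [
    ("movie-1", {"title": "Alien", "year": 1979}),
    ("movie-2", {"title": "Heat", "year": 1995}),
    ("movie-1", {"title": "Alien", "year": 1979, "genres": ["Horror"]}),  # update
]
print(compact(messages))  # two keys survive; movie-1 holds its latest value
```

The same mental model explains why loading a compacted topic into DuckDB yields a snapshot of current state rather than a full event history.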


Conclusion & takeaways

In under an hour, we built a fully local, reproducible pipeline: public JSON data → Kafka → (optionally) Avro + Schema Registry → DuckDB for analysis. The setup is intentionally small, but the patterns scale.

A few key takeaways:

  • Raw JSON is hard to beat for speed and flexibility when exploring data or bootstrapping pipelines.
  • Avro + Schema Registry adds structure and safety with very little additional effort once the schema is in place.
  • kcat remains an incredibly powerful glue tool, especially when combined with NDJSON.
  • DuckDB is an excellent companion for Kafka when you want fast, local, SQL-based analysis without standing up another service.

This kind of workflow is ideal for POCs, debugging, data exploration, and teaching — and it maps cleanly to more production-grade setups when needed.

About

A fully local, tutorial-style data pipeline you can run in under an hour: spin up Kafka + Schema Registry via Redpanda (Docker Compose), ingest a public JSON dataset as both raw JSON and Avro, inspect messages with `kcat`, and load the results into DuckDB for fast, in-memory SQL analysis—no cloud services required.
