A hands-on example showing how LanceDB and DuckDB work together on multimodal data. The repo ingests the Amazon Berkeley Objects (ABO) dataset into a LanceDB catalog with CLIP image embeddings and text embeddings, then uses DuckDB with the Lance extension to query, join, and materialize results via SQL.
The LanceDB products table stores product metadata, image paths, a CLIP multimodal vector (multimodal_vec), and a text-semantic vector (text_vec). DuckDB attaches the Lance directory as a namespace and runs SQL directly on top of it -- vector search, joins to local DuckDB tables, and aggregations to answer questions about the data.
Requires Python 3.12+.
# Sync dependencies from pyproject.toml
uv sync
# Add dependencies as needed
uv add ...Download the listings metadata and small images from the ABO dataset page and extract them into data/:
After extracting, the layout should look like:
data/
abo-listings/listings/metadata/ # listings_*.json.gz files
abo-images-small/images/
metadata/images.csv.gz # image metadata
small/ # downscaled image filesEmbeddings are computed and written to LanceDB in batches, so progress is saved incrementally.
# Full ingestion (~145K products)
uv run python ingest.py
# Smoke test with a subset
uv run python ingest.py --limit 200
# Larger batches (default is 32)
uv run python ingest.py --batch-size 128The output LanceDB directory is ./abo-products-lance by default (override with --output-root).
Search the catalog by text using the text_vec embedding column:
uv run python text_search.pyThis encodes the text query with the intfloat/multilingual-e5-base multilingual embedding model and runs a top-k vector search over the Lance table via DuckDB.
Before running the image search, generate a local sales.duckdb file with synthetic sales rows linked to shoe products in the Lance catalog:
uv run python create_duckdb.pyThis reads shoe item_ids from the Lance table and creates a sales table with 100 randomly assigned purchases. The image search script joins against this table.
Search by visual concept using the CLIP multimodal_vec column and join results to the sales table:
uv run python image_search.pyThis encodes the query "beige shoes" with CLIP, retrieves the nearest products, and joins them to sales.duckdb.
You can also query the Lance table directly from the DuckDB CLI:
duckdbINSTALL lance;
LOAD lance;
ATTACH './abo-products-lance' AS abo (TYPE LANCE);
SELECT item_id, title, brand, product_type
FROM abo.main.products
LIMIT 10;