Commit 743e74d

feat: Added multimodal support (#83)
* Added from_dataset class method
* Added from_dataset class method
* Optimized from_dataset class method
* Optimized from_dataset class method
* Simplified from_dataset class method
* Simplified from_dataset class method
* Added DatasetLike protocol
* Updated tests
* Updated tests
* Improved code
* Improved code
* Added optional datasets dependency
* Updated tests
* Renamed testfile
* Improved code, refactored utils
* Updated tests
* Simplified tests
* Simplified tests
* Improved coverage
* Consolidated tests
* Consolidated tests
* Simplified tests
* Simplified tests
* Generalized hashing functions to support complex types
* Removed complex method
* Updated docstrings
* Moved functions to records
* Removed from_dataset integration
* Updated docs and tagline
* Updated docs
* Updated docs and citation info
* Updated docs
* Updated variable names
* Updated variable names
* Added image benchmarks
* Updated docs
* Updated docs
* Added informative errors when passing non-text data without a custom encoder
* Added informative errors when passing non-text data without a custom encoder
* Bumped version
* Updated docs
1 parent 8b61df4 · commit 743e74d

18 files changed (+977, -223 lines)

CITATION.cff (2 additions, 2 deletions)

```diff
@@ -1,6 +1,6 @@
 cff-version: 1.2.0
 message: "If you use SemHash in your research, please cite it as below."
-title: "SemHash: Fast Semantic Text Deduplication & Filtering"
+title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
 authors:
 - family-names: "van Dongen"
   given-names: "Thomas"
@@ -14,7 +14,7 @@ date-released: "2025-01-05"
 
 preferred-citation:
   type: software
-  title: "SemHash: Fast Semantic Text Deduplication & Filtering"
+  title: "SemHash: Fast Multimodal Semantic Deduplication & Filtering"
   authors:
   - family-names: "van Dongen"
     given-names: "Thomas"
```

Makefile (9 additions, 1 deletion)

```diff
@@ -9,10 +9,18 @@ install: venv
 	uv run pre-commit install
 
 install-no-pre-commit:
-	uv pip install ".[dev]"
+	uv pip install ".[dev,all]"
 
 fix:
 	uv run pre-commit run --all-files
 
 test:
 	uv run pytest --cov=semhash --cov-report=term-missing
+
+benchmark-text:
+	uv run python -m benchmarks.run_text_benchmarks
+
+benchmark-image:
+	uv run python -m benchmarks.run_image_benchmarks
+
+benchmark: benchmark-text benchmark-image
```

README.md (158 additions, 23 deletions)

````diff
@@ -2,7 +2,7 @@
 
 <h2 align="center">
   <img width="30%" alt="SemHash logo" src="assets/images/semhash_logo_v2.png"><br/>
-  Fast Semantic Text Deduplication & Filtering
+  Fast Multimodal Semantic Deduplication & Filtering
 </h2>
 
 
````
````diff
@@ -38,9 +38,9 @@
 </div>
 
 
-SemHash is a lightweight and flexible tool for deduplicating datasets, filtering outliers, and finding representative samples using semantic similarity. It combines fast embedding generation from [Model2Vec](https://github.com/MinishLab/model2vec) with efficient ANN-based similarity search through [Vicinity](https://github.com/MinishLab/vicinity).
+SemHash is a lightweight, multimodal library for semantic deduplication, outlier filtering, and representative sample selection. Text works out of the box with fast [Model2Vec](https://github.com/MinishLab/model2vec) embeddings, and images, audio, and other modalities are supported with custom encoders.
 
-SemHash supports both single-dataset deduplication & filtering (e.g., cleaning up a train set by removing duplicates and outliers) and multi-dataset deduplication & filtering (e.g., ensuring no overlap between a test set and a train set). It works with simple datasets, such as text lists, and more complex ones, like multi-column QA datasets. Additionally, it includes functions to inspect deduplication results, making it easier to understand and refine your data cleaning process.
+SemHash supports both single-dataset operations (clean a training set) and cross-dataset operations (deduplicate test against train). It works with simple lists and complex multi-column datasets, and includes inspection tools to help you understand and refine results. All operations use [Vicinity](https://github.com/MinishLab/vicinity) for efficient similarity search.
 
 ## Quickstart
 
````
````diff
@@ -49,6 +49,8 @@ Install the package with:
 pip install semhash
 ```
 
+### Text Deduplication, Filtering & Representative Sampling
+
 Deduplicate a single dataset, filter outliers, and find representative samples with the following code (note: the examples assume you have `datasets` installed, which you can install with `pip install datasets`):
 
 ```python
````
````diff
@@ -71,7 +73,35 @@ filtered_texts = semhash.self_filter_outliers().selected
 representative_texts = semhash.self_find_representative().selected
 ```
 
-Or, deduplicate across two datasets, filter outliers, and find representative samples with the following code (e.g., eliminating train/test leakage):
+### Image Deduplication, Filtering & Representative Sampling
+
+Deduplicate an image dataset, filter outliers, and find representative samples using a vision model (requires `pip install sentence-transformers`):
+
+```python
+from datasets import load_dataset
+from sentence_transformers import SentenceTransformer
+from semhash import SemHash
+
+# Load an image dataset and vision model
+model = SentenceTransformer('clip-ViT-B-32')
+dataset = load_dataset("uoft-cs/cifar10", split="test")
+
+# Initialize a SemHash instance with the 'img' column
+semhash = SemHash.from_records(list(dataset), columns=["img"], model=model)
+
+# Deduplicate the images
+deduplicated_images = semhash.self_deduplicate().selected
+
+# Filter outliers
+filtered_images = semhash.self_filter_outliers().selected
+
+# Find representative images
+representative_images = semhash.self_find_representative().selected
+```
+
+### Cross-Dataset Deduplication, Filtering & Representative Sampling
+
+Deduplicate across two datasets, filter outliers, and find representative samples (e.g., eliminating train/test leakage):
 
 ```python
 from datasets import load_dataset
````
````diff
@@ -93,13 +123,12 @@ filtered_test_texts = semhash.filter_outliers(records=test_texts, outlier_percen
 
 # Find representative texts in the test data against the training data,
 # optionally with a specific selection size
-representative_test_texts = semhash.find_representative(
-    records=test_texts, selection_size=10).selected
-
-
+representative_test_texts = semhash.find_representative(records=test_texts, selection_size=10).selected
 ```
 
-Or, deduplicate multi-column dataset, filter outliers, and find representative samples with the following code (e.g., deduplicating a QA dataset):
+### Multi-Column Deduplication
+
+Deduplicate multi-column datasets (e.g., deduplicating a QA dataset):
 
 ```python
 from datasets import load_dataset
````
````diff
@@ -116,15 +145,9 @@ semhash = SemHash.from_records(records=records, columns=["question", "context"])
 
 # Deduplicate the records
 deduplicated_records = semhash.self_deduplicate().selected
-
-# Filter outliers from the records
-filtered_texts = semhash.self_filter_outliers().selected
-
-# Find representative texts in the records
-representative_texts = semhash.self_find_representative().selected
 ```
 
-The `deduplicate` and `self_deduplicate` functions return a [DeduplicationResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L58). This object stores the deduplicated corpus, a set of duplicate object (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result.
+The `deduplicate` and `self_deduplicate` functions return a [DeduplicationResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L58). This object stores the deduplicated corpus, a set of duplicate objects (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result.
 
 The `filter_outliers`, `self_filter_outliers`, `find_representative`, and `self_find_representative` functions return a [FilterResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#179). This object stores the found outliers/representative samples.
 
````
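The `DeduplicationResult` shape described above can be pictured with a small stand-in class. This is a hypothetical sketch, not the actual class from `semhash/datamodels.py`: it only illustrates the idea of keeping the surviving corpus plus, for each removed record, the record that caused its removal.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for the DeduplicationResult described
# above -- NOT the real semhash class. It illustrates the shape only.
@dataclass
class ToyDeduplicationResult:
    selected: list                                   # records that survived deduplication
    duplicates: list = field(default_factory=list)   # (removed, cause) pairs

    @property
    def duplicate_ratio(self) -> float:
        """Fraction of the original corpus removed as duplicates."""
        total = len(self.selected) + len(self.duplicates)
        return len(self.duplicates) / total if total else 0.0

result = ToyDeduplicationResult(
    selected=["What is AI?", "How do planes fly?"],
    duplicates=[("What's AI?", "What is AI?")],  # removed, and why
)
print(result.duplicate_ratio)  # 1 removed out of 3 records
```

The real result object additionally exposes inspection helpers; see the linked `datamodels.py` for the actual fields.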
````diff
@@ -212,14 +235,11 @@ The following code snippet shows how to deduplicate across two datasets, filter
 from datasets import load_dataset
 from semhash import SemHash
 
-# Initialize a SemHash instance
-semhash = SemHash()
-
 # Load two datasets to deduplicate
 train_texts = load_dataset("ag_news", split="train")["text"]
 test_texts = load_dataset("ag_news", split="test")["text"]
 
-# Initialize a SemHash instance
+# Initialize a SemHash instance with the training data
 semhash = SemHash.from_records(records=train_texts)
 
 # Deduplicate the test data against the training data
````
````diff
@@ -265,6 +285,70 @@ representative_records = semhash.self_find_representative().selected
 
 </details>
 
+<details>
+<summary> Deduplicate, filter outliers, and find representative samples on image datasets </summary>
+<br>
+
+You can bring your own encoder for any modality by implementing the Encoder protocol. Here's an example using a vision model from timm for image deduplication:
+
+```python
+from datasets import load_dataset
+import timm
+import torch
+from semhash import SemHash
+
+# Requires: pip install timm torch datasets
+
+# Create a custom image encoder
+class VisionEncoder:
+    """Custom encoder using timm models. Implements the Encoder protocol."""
+
+    def __init__(self, model_name: str = "mobilenetv3_small_100.lamb_in1k"):
+        self.model = timm.create_model(model_name, pretrained=True, num_classes=0).eval()
+        data_config = timm.data.resolve_model_data_config(self.model)
+        self.transform = timm.data.create_transform(**data_config, is_training=False)
+
+    def encode(self, inputs, batch_size: int = 128):
+        """Encode a batch of PIL images into embeddings."""
+        import numpy as np
+
+        # Convert grayscale to RGB if needed
+        rgb_inputs = [img.convert("RGB") if img.mode != "RGB" else img for img in inputs]
+
+        # Process in batches to avoid memory issues
+        all_embeddings = []
+        with torch.no_grad():
+            for i in range(0, len(rgb_inputs), batch_size):
+                batch_inputs = rgb_inputs[i : i + batch_size]
+                batch = torch.stack([self.transform(img) for img in batch_inputs])
+                embeddings = self.model(batch).numpy()
+                all_embeddings.append(embeddings)
+
+        return np.vstack(all_embeddings)
+
+# Load image dataset
+dataset = load_dataset("uoft-cs/cifar10", split="test")
+train_data = [{"img": img, "id": i} for i, img in enumerate(dataset["img"][:100])]
+test_data = [{"img": img, "id": i} for i, img in enumerate(dataset["img"][100:150])]
+
+# Initialize SemHash with the custom vision encoder
+semhash = SemHash.from_records(train_data, columns=["img"], model=VisionEncoder())
+
+# Single-dataset operations
+deduplicated = semhash.self_deduplicate().selected
+outliers = semhash.self_filter_outliers().selected
+representatives = semhash.self_find_representative().selected
+
+# Cross-dataset operations
+test_deduplicated = semhash.deduplicate(test_data).selected
+test_outliers = semhash.filter_outliers(test_data).selected
+test_representatives = semhash.find_representative(test_data, selection_size=10).selected
+```
+
+The Encoder protocol requires only an `encode(inputs, **kwargs)` method that returns a numpy array. This makes it easy to integrate any embedding model for any modality.
+
+</details>
+
 <details>
 <summary> Using custom encoders </summary>
 <br>
````
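Since the Encoder protocol is just an `encode(inputs, **kwargs)` method returning a numpy array, any object with that shape qualifies. A minimal sketch with a toy character-frequency "encoder" (purely illustrative, no model weights involved):

```python
import numpy as np

class CharFrequencyEncoder:
    """Toy encoder satisfying the protocol shape: encode() -> np.ndarray.

    Embeds each string as a 26-dim letter-frequency vector, L2-normalized
    so cosine similarity is meaningful. Illustrative only -- a real setup
    would wrap a trained embedding model instead.
    """

    def encode(self, inputs, **kwargs) -> np.ndarray:
        vectors = np.zeros((len(inputs), 26), dtype=np.float32)
        for row, text in enumerate(inputs):
            for ch in text.lower():
                if "a" <= ch <= "z":
                    vectors[row, ord(ch) - ord("a")] += 1.0
        # Normalize each row; guard against all-zero rows
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.maximum(norms, 1e-12)

encoder = CharFrequencyEncoder()
embeddings = encoder.encode(["hello", "world"])
print(embeddings.shape)  # (2, 26)
```

An object like this could in principle be passed as `model=` to `SemHash.from_records`, the same way the vision encoder above is, though in practice you would use a real embedding model.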
````diff
@@ -400,14 +484,65 @@ representative_texts = semhash.self_find_representative().selected
 ```
 </details>
 
+<details>
+<summary> Initializing from a HuggingFace Dataset </summary>
+<br>
+You can easily use SemHash with HuggingFace Datasets by converting them to a list:
+
+```python
+from datasets import load_dataset
+from semhash import SemHash
+
+# Load a HuggingFace dataset
+dataset = load_dataset("ag_news", split="train")
+
+# Convert to list and initialize SemHash
+semhash = SemHash.from_records(records=list(dataset), columns=["text"])
+
+# Deduplicate, filter outliers, and find representative samples
+deduplicated_texts = semhash.self_deduplicate().selected
+filtered_texts = semhash.self_filter_outliers().selected
+representative_texts = semhash.self_find_representative().selected
+```
+
+This also works with multi-column datasets:
+
+```python
+from datasets import load_dataset
+from semhash import SemHash
+
+# Load a multi-column dataset
+dataset = load_dataset("squad_v2", split="train")
+
+# Convert to list and initialize with multiple columns
+semhash = SemHash.from_records(records=list(dataset), columns=["question", "context"])
+
+# Deduplicate the records
+deduplicated_records = semhash.self_deduplicate().selected
+```
+</details>
+
 
 
 ## Benchmarks
 
-SemHash is extremely fast and scales to large datasets with millions of records. We've benchmarked both single-dataset deduplication and train/test deduplication across a variety of datasets. For example, deduplicating 1.8M records takes only ~83 seconds on CPU.
+SemHash is extremely fast and scales to large datasets with millions of records. We've benchmarked both text and image deduplication across a variety of datasets. For example, deduplicating 1.8M text records takes only ~83 seconds on CPU.
+
+For detailed benchmark results and analysis, see the [benchmarks directory](benchmarks/README.md).
 
-For detailed benchmark results including performance metrics across 17 datasets, as well as code to reproduce the benchmarks, see the [benchmarks directory](benchmarks/README.md).
+### Running Benchmarks
+
+```bash
+# Run text benchmarks
+make benchmark-text
+
+# Run image benchmarks
+make benchmark-image
+
+# Run all benchmarks
+make benchmark
+```
 
 ## License
 
````
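The operation being benchmarked — finding near-duplicates by embedding similarity — reduces to a nearest-neighbor search over vectors. A brute-force plain-numpy sketch of threshold-based dedup (SemHash itself delegates this to Vicinity's ANN backends; the 0.9 threshold here is illustrative):

```python
import numpy as np

def dedup_by_similarity(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices of records kept after greedy duplicate removal.

    O(n^2) sketch of threshold-based semantic dedup; real libraries
    replace the pairwise scan with an approximate nearest-neighbor index.
    """
    # Normalize rows so dot products are cosine similarities
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        # Drop record i if it is too similar to any already-kept record
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 8))
# Append a near-copy of row 0; its cosine similarity to row 0 is ~1,
# so index 5 is absent from the result
data = np.vstack([base, base[0] + 0.001 * rng.normal(size=8)])
print(dedup_by_similarity(data))
```

This also shows why speed depends on the similarity-search backend: the pairwise scan is quadratic, while an ANN index makes each lookup roughly logarithmic.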
````diff
@@ -419,7 +554,7 @@ If you use SemHash in your research, please cite the following:
 ```bibtex
 @software{minishlab2025semhash,
   author = {{van Dongen}, Thomas and Stephan Tulkens},
-  title = {SemHash: Fast Semantic Text Deduplication \& Filtering},
+  title = {SemHash: Fast Multimodal Semantic Deduplication \& Filtering},
   year = {2025},
   publisher = {Zenodo},
   doi = {10.5281/zenodo.17265942},
````
