## Fast Multimodal Semantic Deduplication & Filtering
SemHash is a lightweight, multimodal library for semantic deduplication, outlier filtering, and representative sample selection. Text works out of the box with fast [Model2Vec](https://github.com/MinishLab/model2vec) embeddings, and images, audio, and other modalities are supported with custom encoders.
SemHash supports both single-dataset operations (clean a training set) and cross-dataset operations (deduplicate test against train). It works with simple lists and complex multi-column datasets, and includes inspection tools to help you understand and refine results. All operations use [Vicinity](https://github.com/MinishLab/vicinity) for efficient similarity search.
## Quickstart
Install the package with:

```bash
pip install semhash
```
### Text Deduplication, Filtering & Representative Sampling
Deduplicate a single dataset, filter outliers, and find representative samples with the following code (note: the examples assume you have `datasets` installed, which you can install with `pip install datasets`):
The `deduplicate` and `self_deduplicate` functions return a [DeduplicationResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L58). This object stores the deduplicated corpus, a set of duplicate objects (along with the objects that caused duplication), and several useful functions to further inspect the deduplication result.
The `filter_outliers`, `self_filter_outliers`, `find_representative`, and `self_find_representative` functions return a [FilterResult](https://github.com/MinishLab/semhash/blob/main/semhash/datamodels.py#L179). This object stores the found outliers/representative samples.
<summary> Deduplicate, filter outliers, and find representative samples on image datasets </summary>
<br>
You can bring your own encoder for any modality by implementing the Encoder protocol. Here's an example using a vision model from timm for image deduplication:
```python
from datasets import load_dataset
import timm
import torch
from semhash import SemHash

# Requires: pip install timm torch datasets

# Create a custom image encoder
class VisionEncoder:
    """Custom encoder using timm models. Implements the Encoder protocol."""

    def __init__(self, model_name: str = "resnet18"):
        # num_classes=0 makes the model output pooled feature embeddings
        self.model = timm.create_model(model_name, pretrained=True, num_classes=0)
        self.model.eval()
        config = timm.data.resolve_model_data_config(self.model)
        self.transform = timm.data.create_transform(**config, is_training=False)

    def encode(self, images, **kwargs):
        # Preprocess PIL images and embed them as a single batch
        batch = torch.stack([self.transform(img.convert("RGB")) for img in images])
        with torch.inference_mode():
            return self.model(batch).numpy()
```
The Encoder protocol requires only an `encode(inputs, **kwargs)` method that returns a numpy array. This makes it easy to integrate any embedding model for any modality.
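As an illustration, even a dependency-free toy encoder satisfies the protocol (a hashed bag-of-words sketch for demonstration only, not a substitute for a real embedding model):

```python
import numpy as np

class HashedBowEncoder:
    """Toy text encoder: encode(inputs, **kwargs) -> np.ndarray of shape (n, dim)."""

    def __init__(self, dim: int = 256):
        self.dim = dim

    def encode(self, inputs, **kwargs):
        vectors = np.zeros((len(inputs), self.dim), dtype=np.float32)
        for i, text in enumerate(inputs):
            for token in text.lower().split():
                # Hash each token into a bucket of a fixed-size vector
                vectors[i, hash(token) % self.dim] += 1.0
        return vectors

embeddings = HashedBowEncoder().encode(["hello world", "hello there"])
```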
SemHash is extremely fast and scales to large datasets with millions of records. We've benchmarked both text and image deduplication across a variety of datasets. For example, deduplicating 1.8M text records takes only ~83 seconds on CPU.
For detailed benchmark results and analysis, see the [benchmarks directory](benchmarks/README.md).
### Running Benchmarks
```bash
# Run text benchmarks
make benchmark-text

# Run image benchmarks
make benchmark-image

# Run all benchmarks
make benchmark
```
## License
## Citation

If you use SemHash in your research, please cite the following:
```bibtex
@software{minishlab2025semhash,
author = {{van Dongen}, Thomas and Stephan Tulkens},
title = {SemHash: Fast Multimodal Semantic Deduplication \& Filtering},
year = {2025},
url = {https://github.com/MinishLab/semhash}
}
```