Index
Source Code: https://github.com/VictorAut/liken
phrasal verb
/ˈlaɪ.kən/
to say that something is similar to or has the same qualities as something else
Why...
Liken provides enhanced deduplication tooling for DataFrames.
The key features are:
- Near deduplication tooling
- Fuzzy string matching deduper
- TF-IDF tokenization deduper
- LSH tokenization deduper
- Jaccard set deduper
- Cosine set deduper
- Pandas API extension
- Composable, rules-based, deduplication pipelines
- Predicate dedupers for rules
- Record linkage and canonicalization
- Built-in Preprocessors
- Pandas, Polars, Modin, Ray, Dask and PySpark support
- Customizable in pure Python
- Synthetic record creation
- Easy to understand syntax
- Dummy datasets for practice
Liken aims to make near deduplication as easy to use as possible, with syntax that is as natural and easy to understand as possible.
With Liken, boilerplate code reduces to simple deduplication pipelines.
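To make "near deduplication" concrete, here is a minimal sketch of the idea behind a Jaccard set comparison, written with the standard library only. It is not Liken's implementation: the `jaccard` helper, the whitespace tokenization, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# An exact comparison treats these two values as different records.
left = "Acme Trading Ltd"
right = "acme trading limited"

score = jaccard(left, right)
is_near_duplicate = score >= 0.5  # hypothetical threshold, tune per dataset
```

An exact-match deduplication would keep both records, while a similarity score with a threshold can flag them as likely duplicates.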
Supported DataFrame Libraries
Pandas, Polars, Modin, Ray, Dask and PySpark.
Installation
Install with pip:
pip install liken
Install with uv:
uv pip install liken
Extras
Liken supports pandas and polars in the default installation. Other DataFrame libraries are supported through optional extras:
uv pip install 'liken[dask]' # deduplicate dask dataframes
uv pip install 'liken[modin]' # deduplicate modin dataframes
uv pip install 'liken[ray]' # deduplicate ray datasets
uv pip install 'liken[pyspark]' # deduplicate pyspark dataframes
uv pip install 'liken[all]' # deduplicate with any of the above
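After installing extras, it can be handy to confirm which optional backends are importable in the current environment. The sketch below uses only the standard library; the `available_backends` helper is hypothetical, and the import names are assumed to match the extras names listed above.

```python
from importlib.util import find_spec

# Import names of the optional backends (assumed to match the extras above).
OPTIONAL_BACKENDS = ("dask", "modin", "ray", "pyspark")


def available_backends(names=OPTIONAL_BACKENDS):
    """Return the subset of `names` that can be imported in this environment."""
    return [name for name in names if find_spec(name) is not None]


print(available_backends())
```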
Use liken In Your Code
import liken as lk

df = ...  # e.g. read data

df = (
    lk.dedupe(df)
    .apply(lk.fuzzy())
    .drop_duplicates("name")
)
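For intuition about what a fuzzy "drop duplicates" on a single column involves, here is a rough standard-library sketch that keeps the first record of each group of similar names. It does not use Liken and makes no claim about how `lk.fuzzy()` works internally: the `drop_near_duplicates` helper, the use of `difflib.SequenceMatcher`, and the 0.85 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(records, column, threshold=0.85):
    """Keep the first record of each group of similar values in `column`."""
    kept = []
    for record in records:
        value = record[column].lower()
        # Compare against every record kept so far (quadratic, fine for a demo).
        is_duplicate = any(
            SequenceMatcher(None, value, other[column].lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


records = [
    {"name": "Jonathan Smith", "city": "Leeds"},
    {"name": "Jonathon Smith", "city": "Leeds"},  # misspelling of the first
    {"name": "Maria Garcia", "city": "Madrid"},
]
unique = drop_near_duplicates(records, "name")
```

Hand-written loops like this are the kind of boilerplate that a pipeline call is meant to replace, and the pairwise comparison here would not scale to large or distributed datasets.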
Pandas Affordances
Liken's focus is on composable, complex deduplication pipelines that scale to distributed datasets, but it also provides extra-easy integration for Pandas DataFrames.
If you are a Pandas user looking for an intuitive near-deduplication extension to the Pandas API and little more, head to the Coming from Pandas? section!
License
Liken is licensed under the Apache-2.0 License. See the LICENSE file for more details.