Index


Source Code: https://github.com/VictorAut/liken

Liken:
phrasal verb
/ˈlaɪ.kən/
to say that something is similar to or has the same qualities as something else

Why...

Liken provides enhanced deduplication tooling for DataFrames.

The key features are:

Near deduplication tooling
Exploratory duplicate-rate profiling
Fuzzy string matching deduper
TF-IDF tokenization deduper
LSH tokenization deduper
Jaccard set deduper
Cosine set deduper
Pandas API extension
Composable, rules-based, deduplication pipelines
Predicate dedupers for rules
Record linkage and canonicalization
Built-in Preprocessors
Pandas, Polars, Modin, Ray, Dask and PySpark support
Customizable in pure Python
Synthetic record creation
Easy to understand syntax
Dummy datasets for practice
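
To make the similarity measures named in the feature list concrete, here is a minimal standard-library sketch of Jaccard similarity, cosine similarity over token sets, and a fuzzy string ratio. It does not use Liken and is not Liken's implementation; the helper names (`tokens`, `jaccard`, `cosine`, `fuzzy_ratio`) and the sample strings are hypothetical and exist only for illustration.

```python
import math
from difflib import SequenceMatcher


def tokens(text: str) -> set[str]:
    """Lower-case a string and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(a: set[str], b: set[str]) -> float:
    """Cosine similarity of two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def fuzzy_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


left = "Acme Trading Ltd"
right = "ACME Trading Limited"

# The two names share 2 of 4 distinct tokens.
print(jaccard(tokens(left), tokens(right)))  # 0.5
print(cosine(tokens(left), tokens(right)))   # 2 / 3
print(fuzzy_ratio(left, right))
```

A near deduper built on any of these scores treats two records as duplicates when the score passes a chosen threshold, which is the part exact matching cannot do.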

Liken aims to make near deduplication as easy as possible, with syntax that is natural and easy to understand.

With Liken, boilerplate deduplication code shrinks to a simple pipeline.

Supported DataFrame Libraries

Installation

Install with pip:

pip install liken

Install with uv:

uv pip install liken

Extras

Liken supports pandas and polars in the default installation. Support for other DataFrame libraries is available through optional extras:

With pip:

pip install 'liken[dask]'     # deduplicate dask dataframes
pip install 'liken[modin]'    # deduplicate modin dataframes
pip install 'liken[ray]'      # deduplicate ray datasets
pip install 'liken[pyspark]'  # deduplicate pyspark dataframes
pip install 'liken[all]'      # deduplicate with any of the above

With uv:

uv pip install 'liken[dask]'    # deduplicate dask dataframes
uv pip install 'liken[modin]'   # deduplicate modin dataframes
uv pip install 'liken[ray]'     # deduplicate ray datasets
uv pip install 'liken[pyspark]' # deduplicate pyspark dataframes
uv pip install 'liken[all]'     # deduplicate with any of the above
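
After installing extras, it can be handy to confirm which optional backends are importable in the current environment. The sketch below uses only the standard library. It assumes each extra's import name matches the extra's name (`dask`, `modin`, `ray`, `pyspark`), and `available_backends` is a hypothetical helper, not part of Liken.

```python
from importlib.util import find_spec

# Assumption: the import name of each backend matches the extra's name.
OPTIONAL_BACKENDS = ("dask", "modin", "ray", "pyspark")


def available_backends(names=OPTIONAL_BACKENDS):
    """Return the names that can be imported, without importing them."""
    return [name for name in names if find_spec(name) is not None]


print(available_backends())
```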

Use `liken` In Your Code

import liken as lk

df = ... # e.g. read data

df = (
    lk.dedupe(df)
    .apply(lk.fuzzy())
    .drop_duplicates("name")
)
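
For intuition about what fuzzy deduplication on a `name` column achieves, here is a conceptual sketch in plain Python. It is not Liken's algorithm or API: `similar`, `drop_near_duplicates`, the 0.85 threshold, and the sample records are all assumptions made for illustration, and the pairwise comparison it uses would not scale to large datasets.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when two strings are close enough to count as duplicates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def drop_near_duplicates(records, key, threshold=0.85):
    """Keep the first record of every group of near-identical values."""
    kept = []
    for record in records:
        # Compare against the records already kept; skip close matches.
        if not any(similar(record[key], seen[key], threshold) for seen in kept):
            kept.append(record)
    return kept


records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},   # one-letter spelling variant
    {"id": 3, "name": "Maria Garcia"},
    {"id": 4, "name": "jonathan smith"},   # case variant
]

unique = drop_near_duplicates(records, key="name")
print([record["id"] for record in unique])  # [1, 3]
```

An exact-match deduplication would keep all four records here, because no two names are byte-for-byte identical.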

Jump to the tutorial to dive deeper into how to build incrementally complex pipelines.

Pandas Affordances

Liken's focus is on composable, complex deduplication pipelines that scale to distributed datasets. It also provides extra-easy integration for pandas DataFrames.

If you are a pandas user looking for an intuitive near-deduplication extension to the pandas API and little more, head to the Coming from Pandas? section!

Agent Skills

Liken provides agent skills for use in agentic workflows.

Install the bundle from the tessl registry:

tessl install victoraut/liken-skills

The bundle contains one skill per API tier:

Skill	Teaches
`liken`	Overview, and which API to reach for
`liken-dedupers`	Applying built-in dedupers
`liken-pipelines`	Pipelines with AND/OR/NOT rules and built-in preprocessors
`liken-custom-dedupers`	Writing your own dedupers in pure Python
`liken-record-linkage`	Canonicalization and synthetic records
`liken-backends-performance`	Backend selection, scaling and performance

Using the skills

Once installed, agent-skill-aware tools (Claude Code, Cursor, and others) discover the skills automatically and load the relevant one on demand. Pin a version for reproducibility, e.g. `tessl install victoraut/liken-skills@0.1.0`. See the tessl documentation for managing installed skills.

License

Liken is licensed under the Apache-2.0 License. See the LICENSE file for more details.