Liken

Source Code: https://github.com/VictorAut/liken


Liken:
phrasal verb
/ˈlaɪ.kən/
to say that something is similar to or has the same qualities as something else

Why...

Liken provides enhanced deduplication tooling for DataFrames.

The key features are:

  • Near deduplication tooling
  • Fuzzy string matching deduper
  • TF-IDF tokenization deduper
  • LSH tokenization deduper
  • Jaccard set deduper
  • Cosine set deduper
  • Pandas API extension
  • Composable, rules-based, deduplication pipelines
  • Predicate dedupers for rules
  • Record linkage and canonicalization
  • Built-in Preprocessors
  • Pandas, Polars, Modin, Ray, Dask and PySpark support
  • Customizable in pure Python
  • Synthetic record creation
  • Easy-to-understand syntax
  • Dummy datasets for practice
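To make two of the similarity notions named in the list above concrete, here is a minimal sketch using only the Python standard library. It is not Liken's implementation and does not use Liken's API: the helper names `fuzzy_ratio` and `jaccard` are hypothetical, and `difflib.SequenceMatcher` stands in for whatever fuzzy matcher a real deduper would use.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1], based on difflib's matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-separated token sets of a and b."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Near-identical spellings score high on character-level similarity.
print(fuzzy_ratio("Jon Smith", "John Smith"))

# Token sets share 2 of 4 distinct tokens.
print(jaccard("acme trading ltd", "acme trading limited"))
```

Character-level measures such as `fuzzy_ratio` tend to suit typos and spelling variants, while set-based measures such as `jaccard` ignore word order and suit reordered or partially overlapping fields.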

Liken aims to make near deduplication as easy to use as possible, with syntax that is as natural and easy to understand as possible.

Liken cuts boilerplate code down to simple deduplication pipelines.

Supported DataFrame Libraries

Pandas, Polars, Modin, Ray, Dask and PySpark.

Installation

Install with pip:

pip install liken

Install with uv:

uv pip install liken

Extras

Liken supports pandas and polars in the default installation. Support for other DataFrame libraries is available through optional extras.

With pip:

pip install 'liken[dask]'     # deduplicate dask dataframes
pip install 'liken[modin]'    # deduplicate modin dataframes
pip install 'liken[ray]'      # deduplicate ray datasets
pip install 'liken[pyspark]'  # deduplicate pyspark dataframes
pip install 'liken[all]'      # deduplicate with any of the above

With uv:

uv pip install 'liken[dask]'    # deduplicate dask dataframes
uv pip install 'liken[modin]'   # deduplicate modin dataframes
uv pip install 'liken[ray]'     # deduplicate ray datasets
uv pip install 'liken[pyspark]' # deduplicate pyspark dataframes
uv pip install 'liken[all]'     # deduplicate with any of the above

Use liken In Your Code

import liken as lk

df = ... # e.g. read data

df = (
    lk.dedupe(df)
    .apply(lk.fuzzy())
    .drop_duplicates("name")
)

Jump to the tutorial to dive deeper into how to build incrementally complex pipelines.
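For intuition about what dropping near duplicates on a single column involves, the following standard-library sketch keeps the first record of each group of similar names. It is a conceptual illustration only, not how Liken works internally: `is_similar`, `drop_near_duplicates`, the sample records and the 0.85 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def is_similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True when the case-insensitive difflib ratio reaches the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def drop_near_duplicates(records, key, threshold=0.85):
    """Keep a record only if its `key` value is not similar to any record already kept."""
    kept = []
    for record in records:
        if not any(is_similar(record[key], k[key], threshold) for k in kept):
            kept.append(record)
    return kept


# Hypothetical sample data: records 1 and 2 differ by a single letter.
records = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Mary Jones"},
]

deduped = drop_near_duplicates(records, key="name")
print([r["id"] for r in deduped])  # [1, 3]
```

This naive loop compares every record against every kept record, so its cost grows quadratically with the number of rows. That is one reason near-deduplication tooling typically relies on techniques such as tokenization or LSH rather than exhaustive pairwise comparison.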

Pandas Affordances

Liken's focus is on composable, complex deduplication pipelines that scale to distributed datasets. However, it also provides especially easy integration for pandas DataFrames.

If you are a pandas user looking for an intuitive near-deduplication extension to the pandas API and little more, head to the Coming from Pandas? section!

License

Liken is licensed under the Apache-2.0 License. See the LICENSE file for more details.