Applying Dedupers

In the First Steps you found out how to replicate exact deduplication with Liken; in fact, that was the exact deduper in use. It comes bundled with dedupe when you call the drop_duplicates function with no other deduper.
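For reference, a minimal sketch of that default form:

import liken as lk

df = (
    lk.dedupe(df)
    .drop_duplicates("address") # no apply(): the bundled exact deduper is used
)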

To use a built-in deduper, apply it with the apply function:

import liken as lk

df = (
    lk.dedupe(df)
    .apply(lk.fuzzy())
    .drop_duplicates("address")
)

Single Dedupers

If you only need a single deduper, use it straight in an apply call as seen above, i.e. apply(lk.fuzzy()) or any other deduper. The column or columns to dedupe on are passed to drop_duplicates.

Usage of single dedupers is limited: you can only ever use a single deduper on a single set of columns.
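That set may still contain more than one column. A minimal sketch, assuming drop_duplicates accepts a list of column labels (the city column is hypothetical):

import liken as lk

df = (
    lk.dedupe(df)
    .apply(lk.fuzzy())
    .drop_duplicates(["address", "city"]) # assumed: a list selects multiple columns
)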

Coming from Pandas?

When it comes to single dedupers, as above, Liken is easy to use, especially so if you are coming from Pandas. Special affordances have been made to let you use Pandas's drop_duplicates in a "fuzzy" manner. To do this, simply import liken and use the deduper as an accessor on your pandas dataframe. Any keyword arguments that would usually be passed to the deduper are instead passed to drop_duplicates:

import liken as lk # Only works if you import liken!
import pandas as pd

df = pd.read_csv("...")

df = df.fuzzy.drop_duplicates("address", threshold=0.6) # (1)!
  1. kwargs for fuzzy are delegated to drop_duplicates. This is also true for other dedupers; for example, when using the tfidf deduper, the ngram kwarg would be passed in drop_duplicates too, i.e. df.tfidf.drop_duplicates("address", threshold=0.6, ngram=3)

The equivalent, using the standard Liken API:
import liken as lk
import pandas as pd

df = pd.read_csv("...")

df = (
    lk.dedupe(df)
    .apply(lk.fuzzy(threshold=0.6)) # (1)!
    .drop_duplicates("address")
)
  1. kwargs are passed to the deduper call itself, as defined by the API

Pandas affordances are limited to fuzzy, tfidf, lsh, jaccard, and cosine. This special use is also limited to single dedupers; it does not support collections of dedupers, which are shown next.
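The other accessors follow the same pattern. A minimal sketch using the jaccard accessor, assuming any jaccard kwargs are delegated to drop_duplicates in the same way:

import liken as lk # Only works if you import liken!
import pandas as pd

df = pd.read_csv("...")

df = df.jaccard.drop_duplicates("address") # any jaccard kwargs would be passed here too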

Pandas affordances

Liken's Pandas extension is only usable if you actually import liken!

Collections of Dedupers

Liken supports deduplicating with a collection of dedupers. This allows:

  • Deduplicating multiple sets of columns with different dedupers
  • Defining several dedupers to be run sequentially on a set of columns

Collections are supported in two formats. Dictionaries provide quick and easy composability; pipelines provide fully-featured composability with support for logical rules and built-in preprocessors.

Dictionaries

When defining a collection as a dictionary, drop_duplicates no longer accepts a column label argument; the columns are instead defined by the keys of the dictionary.

import liken as lk

collection = {
    "email": lk.exact(),
    "address": (
        lk.fuzzy(threshold=0.98),
        lk.tfidf(threshold=0.9, ngram=(1, 2), topn=1),
    ),
}

df = (
    lk.dedupe(df)
    .apply(collection)
    .drop_duplicates(keep="first")
)

In the above example, the defined collection reads as: "Deduplicate exact email matches. Then, deduplicate similar addresses using fuzzy, then TF-IDF."

keep arg

The keep argument accepts the literals "first" or "last", which define which record will be kept from each set of duplicate records, based on their position in the dataframe.
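For example, to keep the last record of each duplicate set instead:

df = (
    lk.dedupe(df)
    .apply(collection)
    .drop_duplicates(keep="last") # keeps the final occurrence of each duplicate set
)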

Pipelines of Dedupers

Liken exposes a pipeline builder function for you to build complex, composable pipelines.

At a minimum, pipelines can replicate a dictionary collection. For example, the dictionary collection we saw above can be instead represented as:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(lk.col("email").exact())
    .step(lk.col("address").fuzzy(threshold=0.98))
    .step(lk.col("address").tfidf(threshold=0.9, ngram=(1, 2), topn=1))
)

df = (
    lk.dedupe(df)
    .apply(pipeline)
    .drop_duplicates()
)

A pipeline has the following features:

  • Each step in a pipeline represents one deduplication pass.
  • Column access is provided by the lk.col expression.
  • Dedupers are provided as method calls on the lk.col expression.

AND semantics

Pipelines support combining the effects of multiple dedupers using implicit AND statements.

AND semantics are supported in Liken when lists of dedupers are passed to a step in a pipeline. All conditions in a step must match for records to be linked by the pipeline:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(
        [
            lk.col("address").fuzzy(),
            lk.col("address").str_len(min=10),
        ] # AND: both conditions must hold
    )
    .step(lk.col("email").fuzzy(threshold=0.98))
)

df = (
    lk.dedupe(df)
    .apply(pipeline)
    .drop_duplicates()
)

In the above case, for the first step, both conditions must hold: similar addresses will only be deduplicated if the address is at least 10 characters long.

Effective combinations of dedupers

AND semantics are supported between any Liken dedupers but are especially effective when combining a similarity deduper with a predicate deduper. Additionally, Liken features an optimisation, "Rule Predication", which enforces the execution of the predicate deduper first in an AND-semantics step; the subsequent dedupers then only operate on the subset of records collected by the predicate deduper. This optimisation works because predicate dedupers by their nature operate in close to O(n) time, whilst similarity dedupers generally operate at O(n²).
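For instance, a sketch of such a combination; the postcode column is hypothetical here, and we assume exact behaves as a predicate deduper within a step:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(
        [
            lk.col("address").fuzzy(),   # similarity deduper: pairwise comparisons, roughly O(n²)
            lk.col("postcode").exact(),  # predicate deduper: Rule Predication runs this first
        ]
    )
)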

OR semantics

OR semantics behaviour is captured by distinct steps in a pipeline.

OR semantics are actually implicitly supported when using dictionaries, and are best understood in comparison with AND semantics:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(lk.col("address").fuzzy())
    .step(lk.col("address").str_len(min=10))
) # OR: either condition suffices

Compare this with the equivalent AND form:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(
        [
            lk.col("address").fuzzy(),
            lk.col("address").str_len(min=10),
        ]
    )
) # AND: both conditions must hold
OR in dictionaries

OR semantics are also achieved with dictionaries. If you are only using OR semantics in a pipeline, consider sticking to defining your collection of dedupers as a dictionary, which is simpler to use.
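As a sketch, the dictionary equivalent of sequential OR steps reuses the tuple form shown earlier, with each deduper running in turn over the same column:

import liken as lk

collection = {
    "address": (
        lk.fuzzy(),
        lk.tfidf(),  # OR: runs in turn, independently of fuzzy's matches
    ),
}

df = (
    lk.dedupe(df)
    .apply(collection)
    .drop_duplicates(keep="first")
)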

NOT semantics

Predicate dedupers can be inverted to form NOT semantics by using the ~ operator on the lk.col column accessor expression:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(
        [
            lk.col("address").fuzzy(),
            ~lk.col("address").isna(), # NOT null
        ]
    )
)

Preprocessors

Pipelines support the addition of a powerful feature: preprocessors. Liken's preprocessors transform data solely within the internals of the library for the purposes of deduplication whilst still returning data to you in the original format.

Preprocessors can be used to refine deduplication pipelines, reduce boilerplate preprocessing code, reduce the number of "dummy" columns that you have to maintain, and lower the risk of unacceptable false positive rates.

Preprocessors are available in the liken.preprocessors module and can be applied at the overall pipeline scope, at a single step in the pipeline, or on a single column only. At the pipeline scope:

import liken as lk

pipeline = (
    lk.pipeline(preprocessors=lk.preprocessors.lower())
    .step(
        [
            lk.col("email").fuzzy(),
            ~lk.col("email").isna(),
        ],
    )
    .step(lk.col("address").tfidf())
)
At the step scope:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(
        [
            lk.col("email").fuzzy(),
            ~lk.col("email").isna(),
        ],
        preprocessors=lk.preprocessors.lower()
    )
    .step(lk.col("address").tfidf())
)
On a single column:

import liken as lk

pipeline = (
    lk.pipeline()
    .step(
        [
            lk.col("email").fuzzy(),
            ~lk.col("email").isna(),
        ],
    )
    .step(lk.col("address", preprocessors=lk.preprocessors.lower()).tfidf())
)

A single preprocessor can be passed, or several at once if passed as a list:

import liken as lk

pipeline = (
    lk.pipeline(
        preprocessors=[
            lk.preprocessors.lower(),
            lk.preprocessors.ascii_fold(),
            lk.preprocessors.remove_punctuation(),
        ]
    )
    .step(lk.col("address").tfidf())
)

Preprocessors are propagated in a top-down manner, but overridden bottom-up. So, a pipeline-level preprocessor will propagate to each step and column accessor, but will be overridden wherever preprocessors are defined at those levels:

pipeline = (
    lk.pipeline(preprocessors=[lk.preprocessors.ascii_fold()])
    .step(
        [
            lk.col("email").fuzzy(),  # preprocessed by step's preprocessor, `alnum`.
            ~lk.col(
                "address",
                preprocessors=[lk.preprocessors.lower()],
            ).isna(),  # uses its own preprocessor, `lower`.
        ],
        preprocessors=[lk.preprocessors.alnum()],  # defines the step's preprocessor
    )
    .step(
        lk.col("address").tfidf()
    )  # defaults to the pipeline's preprocessor, `ascii_fold`.
)

Summary

Different collections of dedupers, whether a single deduper, a dictionary or a pipeline, are best suited to different use cases:

| Collection | Pandas extension | Quick tasks | Multiple columns | Logical rule semantics | Preprocessors |
|------------|------------------|-------------|------------------|------------------------|---------------|
| Single     | ✅               | ✅          |                  |                        |               |
| Dict       |                  | ✅          | ✅               |                        |               |
| Pipeline   |                  |             | ✅               | ✅                     | ✅            |