First Steps

Installation

Introduction

Code blocks in this tutorial assume that a DataFrame, labelled df, is available at runtime. No attempt is made to specify the nature of the data in df; the emphasis is on how to set up near deduplication correctly with Liken. Dummy datasets for experimentation are available in the liken.datasets module.

Instantiating

A DataFrame must be passed to the top-level dedupe function.

Pandas | Polars | Modin | Dask | Ray | PySpark (one example per backend, in this order)

import liken as lk
import pandas as pd

df = pd.read_csv("...")

df = (
    lk.dedupe(df)
    # ...
)

import liken as lk
import polars as pl

df = pl.read_csv("...")

df = (
    lk.dedupe(df)
    # ...
)

import liken as lk
import modin.pandas as pd

df = pd.read_csv("...")

df = (
    lk.dedupe(df)
    # ...
)

import liken as lk
import dask.dataframe as dd

df = dd.read_csv("...")

df = (
    lk.dedupe(df)
    # ...
)

import liken as lk
import ray

df = ray.data.read_csv("...")

df = (
    lk.dedupe(df)
    # ...
)

import liken as lk
from pyspark.sql import SparkSession

# Build (or reuse) a session; add .config(...) calls as needed
spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("...")

df = (
    lk.dedupe(df, spark_session=spark)
    # ...
)

The Simplest Example

For the simplest use cases, Liken aims to provide familiar-feeling exact deduplication, without too much ceremony:

Single Column | Multiple Columns

import liken as lk

df = lk.dedupe(df).drop_duplicates("address")

import liken as lk

df = lk.dedupe(df).drop_duplicates(columns=["address", "email"])

However, dataframe records may not be exactly repeated:

id	address	email
1	london	fizzpop@yahoo.com
2	tokyo	FizzPop@yahoo.com
3	paris	a@msn.fr

"fizzpop" and "FizzPop" aren't exactly the same string, but they likely refer to the same email account.

This dummy dataset contains 3 unique email strings. Using drop_duplicates straight from pandas won't remove anything here, because "fizzpop@yahoo.com" and "FizzPop@yahoo.com" are not the same string; neither will the exact deduplication shown in "The Simplest Example" above.
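To make the point concrete, here is a minimal pure-Python sketch that does not use Liken or pandas. It compares exact deduplication of the three emails above with deduplication on a case-normalised key. The variable names are illustrative only, and case folding is just one possible notion of "near" equality, not necessarily what Liken does internally.

```python
# Emails from the dummy table above.
emails = ["fizzpop@yahoo.com", "FizzPop@yahoo.com", "a@msn.fr"]

# Exact deduplication: keep the first occurrence of each distinct string.
exact_unique = list(dict.fromkeys(emails))

# Case-insensitive deduplication: compare on a normalised key,
# but keep the first original spelling seen for each key.
first_seen = {}
for email in emails:
    first_seen.setdefault(email.casefold(), email)
normalised_unique = list(first_seen.values())

print(len(exact_unique))   # 3
print(normalised_unique)   # ['fizzpop@yahoo.com', 'a@msn.fr']
```

Exact matching keeps all three records, whereas comparing on a normalised key collapses the two spellings of the same address.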

Near Deduplication

When records aren't exactly the same, you can still deduplicate them. Liken is built so that you can focus on defining what you want out of a near-deduplication process, with the goal of expressing clear-cut deduplication rules in as little code as possible. Before looking at how to use dedupers, let's look at which dedupers are available.

Built-in Dedupers

Liken comes with many deduplication methods built-in:

Category	Scope	Deduper	Description
Similarity	single-column	`exact`	You've already seen this used implicitly in The Simplest Example
Similarity	single-column	`fuzzy`	Fuzzy string matching
Similarity	single-column	`tfidf`	String token matching with TF-IDF
Similarity	single-column	`lsh`	String token matching with Locality Sensitive Hashing (LSH)
Similarity	compound-column	`jaccard`	Multi column similarity based on intersection of categorical data
Similarity	compound-column	`cosine`	Multi column similarity based on dot product of numerical data
Predicate	single-column	`isna`	Records where the column value is null/`None`
Predicate	single-column	`isin`	Records where the column value is in a list of members
Predicate	single-column	`str_startswith`	Records where the string starts with a pattern
Predicate	single-column	`str_endswith`	Records where the string ends with a pattern
Predicate	single-column	`str_contains`	Records where the string contains a pattern. Accepts Regex.
Predicate	single-column	`str_len`	Records where the string length is bounded by a minimum and maximum length
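As a rough illustration of what fuzzy string matching means, the sketch below scores two strings with Python's standard-library difflib and compares the score against a cut-off. This is a stand-in for the idea only: the helper names similarity and is_near_duplicate and the 0.8 cut-off are assumptions made for this example, and Liken's own `fuzzy` deduper may use a different algorithm and scale.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two strings as near duplicates when their ratio reaches the cut-off."""
    return similarity(a, b) >= threshold


# The two spellings differ only in letter case, so they score highly.
print(is_near_duplicate("fizzpop@yahoo.com", "FizzPop@yahoo.com"))  # True
# Unrelated addresses share few characters and fall below the cut-off.
print(is_near_duplicate("fizzpop@yahoo.com", "a@msn.fr"))           # False
```

Raising the cut-off makes matching stricter; lowering it merges more records at the risk of false positives.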

Single-column dedupers apply to one column at a time and are implementations of near string matching. Compound-column dedupers are set operations, where the members of the set are the values of the chosen columns in a given record. Similarity dedupers take a threshold argument. Predicate dedupers select records based on a discrete test (e.g. is null / is not null).
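The distinction between compound-column similarity and predicates can be sketched in plain Python. The functions below are textbook definitions of Jaccard and cosine similarity plus a trivial null check. They are hypothetical helpers written for this example, not Liken's API, and the record values are made up.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(u: list, v: list) -> float:
    """Dot product divided by the product of the vector norms (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def isna(value) -> bool:
    """Predicate: a discrete yes/no test, with no threshold involved."""
    return value is None


# Categorical values taken across several columns of two made-up records.
record_a = {"london", "retail", "gold"}
record_b = {"london", "retail", "silver"}
print(jaccard(record_a, record_b))  # 0.5 (2 shared values out of 4 distinct)

# Numerical values across several columns; parallel vectors score close to 1.
print(cosine([1, 2, 3], [2, 4, 6]))

print(isna(None), isna("tokyo"))    # True False
```

A similarity score such as these is compared against a threshold to decide whether two records match, while a predicate gives a yes/no answer for each record on its own.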

To use dedupers, you have to apply them, which is covered in the next tutorial.