First Steps
Installation
See Installation.
Introduction
Code blocks shown in this tutorial assume that a DataFrame, labelled df, is available at runtime. No effort is made to specify the nature of the data in df; the emphasis is on how to set up near deduplication correctly with Liken. Dummy datasets for experimentation are available in the liken.datasets module.
Instantiating
A DataFrame must be passed to the top-level dedupe function.
The Simplest Example
For the simplest use cases, Liken aims to provide familiar-feeling exact deduplication without too much ceremony.
However, dataframe records may not be exactly repeated:
| id | address | email |
|---|---|---|
| 1 | london | fizzpop@yahoo.com |
| 2 | tokyo | FizzPop@yahoo.com |
| 3 | paris | a@msn.fr |
"fizzpop" and "FizzPop" aren't exactly the same, but they likely refer to the same email address.
Strictly speaking, this dummy dataset contains 3 unique emails. Using drop_duplicates straight from pandas won't remove anything here, as "fizzpop@yahoo.com" and "FizzPop@yahoo.com" are not the same string. Neither will the exact deduplication from The Simplest Example above.
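The limitation of exact matching can be checked with pandas alone. The sketch below rebuilds the dummy table (the third column is assumed to be named email) and does not use Liken at all; the lowercasing step is only one hand-rolled normalisation, shown to make the near duplicate visible.

```python
import pandas as pd

# The dummy dataset from the table above; the "email" column name is an assumption.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "address": ["london", "tokyo", "paris"],
        "email": ["fizzpop@yahoo.com", "FizzPop@yahoo.com", "a@msn.fr"],
    }
)

# Exact deduplication compares strings character for character,
# so both differently-cased emails survive.
exact = df.drop_duplicates(subset="email")
print(len(exact))  # 3

# Lowercasing first exposes the near duplicate: records 1 and 2 collapse.
normalised = df.assign(email=df["email"].str.lower()).drop_duplicates(subset="email")
print(len(normalised))  # 2
```

Manual normalisation like this only covers the cases you anticipate, which is the gap near deduplication is meant to fill.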
Near Deduplication
When records aren't exactly the same, you can still deduplicate them. Liken is built so that you can focus on defining what you want out of a near-deduplication process. The goal is to let you define neat, clear-cut deduplication rules with as little code as possible. Before looking at how to use dedupers, let's look at which dedupers are available.
Built-in Dedupers
Liken comes with many deduplication methods built in:
| Type | Scope | Deduper | Description |
|---|---|---|---|
| Similarity | single-column | exact | You've already seen this in use implicitly in The Simplest Example |
| Similarity | single-column | fuzzy | Fuzzy string matching |
| Similarity | single-column | tfidf | String token matching with Tf-Idf |
| Similarity | single-column | lsh | String token matching with Locality Sensitive Hashing (LSH) |
| Similarity | compound-column | jaccard | Multi-column similarity based on intersection of categorical data |
| Similarity | compound-column | cosine | Multi-column similarity based on dot product of numerical data |
| Predicate | single-column | isna | Records where the column value is null/None |
| Predicate | single-column | isin | Records where the column value is in a list of members |
| Predicate | single-column | str_startswith | Records where the string starts with a pattern |
| Predicate | single-column | str_endswith | Records where the string ends with a pattern |
| Predicate | single-column | str_contains | Records where the string contains a pattern. Accepts regex. |
| Predicate | single-column | str_len | Records where the string length is bounded by a minimum and maximum length |
Single-column dedupers apply to a single column and are implementations of near string matching. Compound-column dedupers are set operations in which the elements of each set are the column values of a given record. Similarity dedupers take a threshold argument. Predicate dedupers select records based on a discrete condition (e.g. is null / not null).
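To make these categories concrete, here is a minimal standard-library sketch of the ideas behind them. It is not Liken's API or implementation: every function name below is hypothetical, difflib stands in for whatever fuzzy matcher Liken uses, and the 0.8 threshold is an arbitrary value chosen for illustration.

```python
import math
from difflib import SequenceMatcher


def fuzzy_similarity(a: str, b: str) -> float:
    """Single-column idea: character-level similarity of two strings, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def jaccard_similarity(record_a: tuple, record_b: tuple) -> float:
    """Compound-column idea: overlap between the value sets of two records."""
    set_a, set_b = set(record_a), set(record_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def cosine_similarity(record_a: tuple, record_b: tuple) -> float:
    """Compound-column idea: normalised dot product of two numeric records."""
    dot = sum(x * y for x, y in zip(record_a, record_b))
    norm = math.sqrt(sum(x * x for x in record_a)) * math.sqrt(
        sum(y * y for y in record_b)
    )
    return dot / norm if norm else 0.0


def is_near_duplicate(score: float, threshold: float) -> bool:
    """Similarity idea: a continuous score becomes a decision via a threshold."""
    return score >= threshold


def is_null(value) -> bool:
    """Predicate idea: a discrete yes/no condition, with no threshold involved."""
    return value is None or (isinstance(value, float) and math.isnan(value))


THRESHOLD = 0.8  # assumed value, for illustration only

same_person = fuzzy_similarity("fizzpop@yahoo.com", "FizzPop@yahoo.com")
different = fuzzy_similarity("fizzpop@yahoo.com", "a@msn.fr")
print(is_near_duplicate(same_person, THRESHOLD))  # True
print(is_near_duplicate(different, THRESHOLD))  # False

# Two of three categorical values are shared: 2 shared / 4 distinct = 0.5.
print(jaccard_similarity(("london", "red", "small"), ("london", "red", "large")))

# Parallel numeric records score (approximately) 1.0.
print(cosine_similarity((1.0, 2.0, 3.0), (2.0, 4.0, 6.0)))

print(is_null(None), is_null("london"))  # True False
```

The split matters when choosing a deduper: a similarity score needs a threshold that you tune to your data, whereas a predicate either holds for a record or it does not.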
To use dedupers, you have to apply them, which is covered in the next tutorial.