Record Linkage
Up to now you've learnt how to use dedupers with apply, specifically within the context of dropping duplicates.
But, what if you want to retain your duplicate instances? And instead simply label them as such?
Liken supports Record Linkage, where the deduplication process you are doing is not to drop data from your DataFrame, but rather to link it together. So, a deduper that defines a fuzzy string deduplication of an address column will label the duplicates as duplicates rather than dropping them. Records are instead canonicalized.
Retaining records as known duplicates instead of dropping duplicates is known as Record Linkage in Liken. This is also known as Entity Resolution, in other literature. The link is provided by a canonical record, which in Liken is identified by the auto-generated canonical_id column.
Canonicalization
Let's look at a dummy dataset, df:
| uid | address | |
|---|---|---|
| a001 | london | fizzpop@yahoo.com |
| a002 | tokyo | fizzpop@yahoo.co.uk |
| a003 | paris | a@msn.fr |
Two very clearly similar emails exist.
We're going to aim to link the above email addresses. To do so, swap .drop_duplicates with .canonicalize, and collect the results:
import liken as lk
df = (
lk.dedupe(df)
.apply(lk.fuzzy(threshold=0.85))
.canonicalize(
"email",
keep="first",
)
.collect() # (1)!
)
canonicalizedoes not return the dataframe, unlikedrop_duplicates. It needs to be collected first!
Now, df looks the same, with an extra canonical_id column:
| uid | address | canonical_id | |
|---|---|---|---|
| a001 | london | fizzpop@yahoo.com | 0 |
| a002 | tokyo | fizzpop@yahoo.co.uk | 0 |
| a003 | paris | a@msn.fr | 2 |
The two email addresses are linked to the canonical record "0".
.canonicalize creates a new canonical_id field. Any repeated canonical_id is a duplicate. In this instance that was an auto-incrementing numeric field. As such, the repeated canonical_id represents the index position in the DataFrame of the canonical record.
You can control this behaviour by passing an explicit label to the id argument of .canonicalize. In that case, the canonical_id will become a copy of the defined id, or simply a reference to itself if it already exists. For example:
import liken as lk
df = (
lk.dedupe(df)
.apply(fuzzy(threshold=0.85))
.canonicalize(
"email",
keep="first",
id="uid", # `id` arg included
)
.collect()
)
Now, checkout the variation in the output of df:
| uid | address | canonical_id | |
|---|---|---|---|
| a001 | london | fizzpop@yahoo.com | a001 |
| a002 | tokyo | fizzpop@yahoo.co.uk | a001 |
| a003 | paris | a@msn.fr | a003 |
Canonical records are no longer identified by index position in the DataFrame, but instead based on a pre-existing (unique) identifier.
Synthetic Records
A canonical record can be linked to several child records. Use .synthesize to create a new canonical record that coalesces the values of the fields of the various child records: