Iterative Workloads
Reminder
In the Record Linkage tutorial you found out that Liken creates a canonical_id. By default this canonical_id is an auto-incrementing numeric identifier starting from zero.
In this chapter we explore the configuration needed to canonicalize a dataset iteratively. By iteratively we mean repeatedly on the same dataset, for example a dataset of customers to which new customers are appended at a given time interval.
!!! note
    Here we explore the implications for batch workloads, especially for datasets that regularly have new data appended.
Canonical IDs
Generating a new canonical ID every time we instantiate a Dedupe class isn't going to be practical for our use case. In fact, given our use case, we're likely to already have a canonical ID (literally a Liken canonical_id from an earlier run, or another identifier). So we should use that instead and pass its name as a string to the id argument of the canonicalize function. See the tutorial for a recap.
The Problem
Liken does not currently possess preprocessing capabilities. For iterative, batch workloads, you will have to carry out the preprocessing steps yourself. The suggested steps are:
- Add a column, canonical_id, to the append dataset. It should be an auto-incrementing numeric identifier that starts from the length of the dataset you will be appending to (N) plus one, i.e. N+1 -> N+n, where n is the length of the append dataset.
- Append ("stack") your datasets.
- Instantiate Dedupe and pass id="canonical_id" to the canonicalizer.
!!! warning
    This process is going to be a lot easier with numeric ids. It's possible to use string identifiers, but that makes incrementing on append datasets much harder to manage and reason about.
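The first two steps above can be sketched as follows. This is a minimal illustration that assumes the data lives in pandas DataFrames; the example column `name` and the variable names are hypothetical, and the final call into Liken is left as a comment rather than guessed at.

```python
import pandas as pd

# Existing dataset, already canonicalized on a previous run (N = 3 rows).
# The column "name" and its values are made up for illustration.
existing = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Alice B."],
        "canonical_id": [0, 1, 0],
    }
)

# New batch to append (n = 2 rows); it has no canonical_id yet.
new_batch = pd.DataFrame({"name": ["Carol", "Bob"]})

# Step 1: number the new rows N+1 ... N+n so they cannot collide
# with any identifier already present in the existing dataset.
N = len(existing)
n = len(new_batch)
new_batch = new_batch.assign(canonical_id=range(N + 1, N + n + 1))

# Step 2: append ("stack") the two datasets.
stacked = pd.concat([existing, new_batch], ignore_index=True)

# Step 3 (not shown): instantiate Dedupe and pass id="canonical_id"
# to the canonicalizer, as described in the list above.
print(stacked)
```

Because the existing identifiers are all below N, starting the new range at N+1 keeps every identifier unique before deduplication runs.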
Decision Tree
flowchart TD
df{{"`DataFrame already has a **canonical_id**?`"}}
id1{{"`**id** defined in canonicalize()?`"}}
id2{{"`**id** defined in canonicalize()?`"}}
idiscanonical{{"`**id** is the same as **canonical_id**?`"}}
autoincrement("`Create a new autoincrementing **canonical_id**`")
copy("`Copy **id** to create **canonical_id**`")
overwrite("`Copy **id** to overwrite **canonical_id**`")
existing("`Use existing **canonical_id**`")
df-- yes -->id1
df-- no -->id2
id1-- no -->existing
id1-- yes -->idiscanonical
idiscanonical-- yes -->existing
idiscanonical-- no -->overwrite
id2-- no -->autoincrement
id2-- yes -->copy
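The flowchart can also be read as a small function. The sketch below is not Liken's implementation; it is a hypothetical pandas helper, `resolve_canonical_id`, whose `id_col` parameter stands in for the id argument of canonicalize. It interprets "id is the same as canonical_id" as the id argument naming the canonical_id column itself, which is an assumption.

```python
from typing import Optional

import pandas as pd


def resolve_canonical_id(df: pd.DataFrame, id_col: Optional[str] = None) -> pd.DataFrame:
    """Hypothetical helper mirroring the decision tree; not part of Liken."""
    out = df.copy()
    has_canonical = "canonical_id" in out.columns

    if id_col is None:
        if has_canonical:
            # canonical_id exists, no id given: use the existing canonical_id.
            return out
        # No canonical_id, no id given: create a new auto-incrementing one from zero.
        out["canonical_id"] = range(len(out))
        return out

    if has_canonical and id_col == "canonical_id":
        # id names the canonical_id column itself: use the existing canonical_id.
        return out

    # Otherwise copy id into canonical_id, creating it or overwriting it.
    out["canonical_id"] = out[id_col]
    return out


# Usage with made-up data: "customer_ref" is a hypothetical identifier column.
customers = pd.DataFrame({"name": ["Alice", "Bob"], "customer_ref": [10, 20]})
print(resolve_canonical_id(customers))
print(resolve_canonical_id(customers, id_col="customer_ref"))
```

The copy and overwrite branches of the flowchart collapse into a single assignment here, since assigning to a column either creates it or replaces it.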