liken.custom
Define custom dedupers
liken.custom.register(f)
Register a custom function as a deduper.
Custom functions can be registered for use as dedupers recognised by the
Dedupe class. Use register as a decorator around the custom callable.
The custom callable must accept a generic array-like object representing the contents of one or more DataFrame columns. The concrete column backing this array is resolved only when the deduper is applied.
The expected function signature is:
function(array, **kwargs)
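As an illustration of this signature, here is a minimal sketch of a pair generator that takes one keyword argument. The name `same_prefix` and its parameter `n` are hypothetical and not part of liken. The function is left undecorated so it can be called directly on a plain list; once registered, the array would instead be supplied when the deduper is applied.

```python
def same_prefix(array, n=1):
    """Yield index pairs (i, j), with i < j, whose strings share their first n characters."""
    size = len(array)
    for i in range(size):
        for j in range(i + 1, size):
            if array[i][:n] == array[j][:n]:
                yield i, j


# Called directly on a plain list, outside of liken.
cities = ["london", "lisbon", "paris"]
pairs_n1 = list(same_prefix(cities, n=1))  # "london" and "lisbon" share "l"
pairs_n2 = list(same_prefix(cities, n=2))  # "lo" != "li", so no pairs
```

Writing the callable as a generator means pairs are produced lazily, which matches the stated preference for generators over functions.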
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| f | liken.custom.PairGenerator | A custom callable that returns integer pairs of indices identifying similar pairs in an array. Accepted callables are functions or generators, where generators are preferred. | required |
Returns:
| Type | Description |
|---|---|
| typing.Callable | Callable |
Raises:
| Type | Description |
|---|---|
| TypeError | If any positional arguments are used when calling the registered deduper. |
Example
Registering a custom deduper
import liken as lk
@lk.custom.register
def custom_deduper(array, **kwargs):
    # your code here
    yield ...
df = (
    lk.dedupe(df)
    .apply(custom_deduper(**kwargs))
    .drop_duplicates("address")
)
For example, the following custom deduper treats strings of equal length as duplicates:
@lk.custom.register
def eq_str_len(array):
    n = len(array)
    for i in range(n):
        for j in range(i + 1, n):
            if len(array[i]) == len(array[j]):
                yield i, j
Applying the deduper:
df = (
    lk.dedupe(df)
    .apply(eq_str_len())  # array arg implicitly passed
    .drop_duplicates("address")
)
Before:
+------+-----------+
| id | address |
+------+-----------+
| 1 | london |
| 2 | paris |
| 3 | tokyo |
+------+-----------+
After: "tokyo" and "paris" have the same length, so they are treated as duplicates and only the first of the two is kept:
+------+-----------+
| id | address |
+------+-----------+
| 1 | london |
| 2 | paris |
+------+-----------+
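The pair logic behind this result can be checked in isolation, without liken. The sketch below repeats the `eq_str_len` body undecorated and runs it on a plain list. The final reduction step is an assumption made only to mirror the table above (the later row of each matched pair is dropped); it is not taken from liken's implementation.

```python
def eq_str_len(array):
    """Yield index pairs (i, j), with i < j, whose strings have equal length."""
    n = len(array)
    for i in range(n):
        for j in range(i + 1, n):
            if len(array[i]) == len(array[j]):
                yield i, j


addresses = ["london", "paris", "tokyo"]
pairs = list(eq_str_len(addresses))  # only "paris" (1) and "tokyo" (2) match

# Assumption: keep the earlier row of each pair and drop the later one.
dropped = {j for _, j in pairs}
kept = [a for idx, a in enumerate(addresses) if idx not in dropped]
```

Here `pairs` holds the single pair of positions for "paris" and "tokyo", and `kept` contains "london" and "paris", in line with the reduced table.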
Keyword-only enforcement:
Deduper(df).apply(my_func(is_upper_caps=True)) # OK
Deduper(df).apply(my_func(True)) # Raises TypeError
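To make the keyword-only rule concrete, here is a rough sketch of how a decorator could reject positional arguments while deferring the array until later. `register_sketch` is a hypothetical stand-in and should not be read as liken's actual `register`; the body of `my_func` is likewise invented for illustration, reusing only the names `my_func` and `is_upper_caps` from the lines above.

```python
import functools


def register_sketch(f):
    """Hypothetical stand-in for a registering decorator (not liken's code)."""

    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        if args:
            raise TypeError(
                f"{f.__name__} accepts keyword arguments only; "
                "the array is supplied when the deduper is applied"
            )
        # Bind the keyword arguments now; the array is passed in later.
        return functools.partial(f, **kwargs)

    return wrapper


@register_sketch
def my_func(array, is_upper_caps=False):
    """Yield pairs of equal strings, optionally ignoring case (illustrative only)."""
    for i in range(len(array)):
        for j in range(i + 1, len(array)):
            a, b = array[i], array[j]
            if is_upper_caps:
                a, b = a.upper(), b.upper()
            if a == b:
                yield i, j


bound = my_func(is_upper_caps=True)              # OK: keyword argument
pairs = list(bound(["Paris", "PARIS", "rome"]))  # array supplied afterwards

error = None
try:
    my_func(True)                                # positional argument
except TypeError as exc:
    error = exc
```

Under this sketch, the keyword call returns a callable that still waits for its array, while the positional call fails immediately with `TypeError`, which is the behaviour the two lines above describe.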
Source code in src/liken/custom.py