liken.preprocessors
liken.preprocessors.strip()
liken.preprocessors.lower()
liken.preprocessors.alnum()
liken.preprocessors.remove_punctuation()
liken.preprocessors.normalize_unicode(form='NFKD')
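The `form` default of `'NFKD'` looks like one of the standard Unicode normalization forms (NFC, NFD, NFKC, NFKD). Assuming that is what it refers to, the effect of each form can be explored with Python's standard `unicodedata` module. The helper name `normalize_sketch` below is hypothetical and is not part of `liken`.

```python
import unicodedata


def normalize_sketch(text: str, form: str = "NFKD") -> str:
    """Illustrative stand-in: apply a standard Unicode normalization form."""
    return unicodedata.normalize(form, text)


# Canonical decomposition: a precomposed "é" becomes "e" + combining acute accent.
decomposed = normalize_sketch("\u00e9")
print(len(decomposed))

# Compatibility decomposition: the "ﬁ" ligature becomes the two letters "f" and "i".
print(normalize_sketch("\ufb01"))

# NFD performs only canonical decomposition, so the ligature is left unchanged.
print(normalize_sketch("\ufb01", form="NFD") == "\ufb01")
```

The compatibility forms (NFKC, NFKD) are lossy by design, which is usually acceptable when the goal is matching rather than display.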
liken.preprocessors.ascii_fold()
Converts alphabetic, numeric, and symbolic characters that are not in the Basic Latin Unicode block (the first 128 code points, which correspond to ASCII) to their ASCII equivalents, where one exists. For example, this preprocessor changes à to a.
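As a rough illustration of the idea, and not the library's implementation, accented letters can be folded with the standard library by decomposing them and dropping the non-ASCII remainder. The function name `ascii_fold_sketch` is hypothetical. This shortcut is weaker than a real folding table: characters with no Unicode decomposition, such as "ø", are simply dropped instead of being mapped to "o".

```python
import unicodedata


def ascii_fold_sketch(text: str) -> str:
    """Approximate ASCII folding via NFKD decomposition (illustrative only)."""
    # Split accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Discard every non-ASCII code point, which removes the combining marks
    # and any character that had no ASCII-based decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold_sketch("à la carte"))  # a la carte
print(ascii_fold_sketch("café"))        # cafe
```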
Source code in src/liken/preprocessors.py
liken.preprocessors.remove_stopwords(words=None, language='english')
Remove stopwords.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `words` | `list[str] \| None` | A list of words to ignore. If defined, | `None` |
| `language` | `str` | The language to use for the stop words dictionary | `'english'` |
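To make the behaviour concrete, here is a minimal sketch of stopword removal with an explicit word list. It is a plain-Python stand-in, not the `liken` implementation: the name `drop_stopwords` is hypothetical, whitespace tokenization and case-insensitive matching are assumptions, and the `language` dictionary lookup is not modelled.

```python
def drop_stopwords(text: str, words: list[str]) -> str:
    """Remove every whitespace-separated token that appears in `words`."""
    # Assumption: matching ignores case.
    stop = {w.lower() for w in words}
    return " ".join(token for token in text.split() if token.lower() not in stop)


print(drop_stopwords("The cat and the hat", ["the", "and"]))  # cat hat
```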
liken.preprocessors.normalize_names()
Normalize personal names.
Preserves only first name, middle name and last name. Titles and nicknames are stripped. Commas are cleaned.
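The description above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the function name, the tiny title list, the convention that nicknames appear in double quotes or parentheses, and the choice to treat commas as plain separators without reordering "Last, First" names. The library may behave differently on each point.

```python
import re

# Hypothetical and deliberately small; a real title list would be much longer.
TITLES = {"mr", "mrs", "ms", "dr", "prof", "sir"}


def normalize_name_sketch(name: str) -> str:
    """Strip titles and nicknames and tidy commas (illustrative only)."""
    # Assumption: nicknames are written in double quotes or parentheses.
    name = re.sub(r'"[^"]*"|\([^)]*\)', " ", name)
    # Assumption: commas are treated as separators and replaced by spaces.
    name = name.replace(",", " ")
    # Drop tokens that match a known title, ignoring case and a trailing period.
    tokens = [t for t in name.split() if t.rstrip(".").lower() not in TITLES]
    return " ".join(tokens)


print(normalize_name_sketch('Dr. John "Johnny" A. Smith'))  # John A. Smith
```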
liken.preprocessors.normalize_company()
Normalize company names.
Strips common company-name designators, e.g. "Ltd." or "LLC".
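A minimal sketch of this kind of stripping, assuming the designators appear at the end of the name. The function name and the short suffix list are hypothetical and are not taken from `liken`.

```python
# Hypothetical sample of legal-form designators; not the library's actual list.
SUFFIXES = {"ltd", "llc", "inc", "corp", "plc", "gmbh"}


def strip_company_suffix(name: str) -> str:
    """Remove trailing legal-form designators from a company name."""
    tokens = name.replace(",", " ").split()
    # Pop trailing tokens while they match a known designator,
    # ignoring case and a trailing period.
    while tokens and tokens[-1].rstrip(".").lower() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


print(strip_company_suffix("Acme Ltd."))     # Acme
print(strip_company_suffix("Widgets, LLC"))  # Widgets
```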