Text Normalization & Metrics

class smith_utils.text.Relation(*values)[source]

Bases: Enum

CASE_INSENSITIVE_MATCH = 2
EXACT_MATCH = 1
NORMALIZED_SPACE_MATCH = 4
NO_STRUCTURAL_MATCH = 5
WHITESPACE_TRIMMED_MATCH = 3
class smith_utils.text.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)[source]

Bases: object

Represents the result of a comparison, encapsulating classification and similarity metrics.

Alongside the relationship classification, the result carries the Damerau-Levenshtein distance, the Jaro-Winkler score, and a similarity percentage. get_relation_string() returns the classification as a string.

classification: Relation
damerau_levenshtein_distance: int
get_relation_string() str[source]
jaro_winkler_score: float
similarity_percentage: float
class smith_utils.text.StringDistance[source]

Bases: object

Provides functionality for calculating string distances and relationships between two strings based on various algorithms.

This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.

Fields of the returned Result:

  • classification – Relationship classification between the two strings (a Relation member).

  • damerau_levenshtein_distance – Integer distance calculated using the Damerau-Levenshtein algorithm.

  • jaro_winkler_score – A float score indicating similarity using the Jaro-Winkler algorithm.

  • similarity_percentage – A percentage similarity score between the two strings.

static analyze(a: str, b: str, ignore_case: bool = False) Result[source]
static calculate_damerau_levenshtein(s1: str, s2: str) int[source]

Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
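The restricted variant (also known as optimal string alignment) can be sketched as a standard dynamic-programming table with one extra case for adjacent transpositions. The function name osa_distance below is hypothetical and this is an illustration of the technique, not the library's actual implementation.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] holds the number of edits needed to turn s1[:i] into s2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: the last two characters are swapped.
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # one transposition
print(osa_distance("ca", "abc"))  # restricted variant needs three edits here
```

"Restricted" means no substring is edited more than once, which is why "ca" to "abc" costs 3 here, whereas the unrestricted Damerau-Levenshtein distance for that pair is 2.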

static calculate_jaro_winkler(s1: str, s2: str) float[source]
static classify(a: str, b: str) Relation[source]
static equals_ignore_case(a: str, b: str) bool[source]
static strip_all(s: str) str[source]

Remove all whitespace characters from a string.

Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
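The split/join idiom described above might look like the following minimal sketch. The function name strip_all_sketch is hypothetical; str.split() with no arguments splits on any run of whitespace, so joining the pieces with an empty string drops all of it.

```python
def strip_all_sketch(s: str) -> str:
    # split() without arguments treats spaces, tabs, and newlines alike
    # and discards leading/trailing whitespace as well.
    return "".join(s.split())


print(strip_all_sketch("  a\tb\nc  d "))
```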

static trim(s: str) str[source]
class smith_utils.text.UnicodeCharNameRecord(index, codepoint, char, name)[source]

Bases: NamedTuple

char: str

Alias for field number 2

codepoint: str

Alias for field number 1

index: int

Alias for field number 0

name: str

Alias for field number 3
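As an illustration of how records with these four fields could be built, here is a minimal sketch using the standard unicodedata module. The names CharNameRecord and char_name_records are hypothetical, and both the "U+XXXX" formatting of codepoint and the "<unnamed>" placeholder are assumptions; the library's actual format is not documented here.

```python
import unicodedata
from typing import NamedTuple


class CharNameRecord(NamedTuple):
    index: int      # position of the character in the input text
    codepoint: str  # assumed format: "U+" followed by at least 4 hex digits
    char: str       # the character itself
    name: str       # Unicode character name


def char_name_records(text: str) -> list[CharNameRecord]:
    return [
        # unicodedata.name() raises ValueError for unnamed characters
        # (e.g. control characters) unless a default is supplied.
        CharNameRecord(i, f"U+{ord(ch):04X}", ch, unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
    ]


for record in char_name_records("A\u00e9"):
    print(record)
```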

smith_utils.text.analyze_pair(a: str, b: str, ignore_case: bool = False) Result[source]
smith_utils.text.make_unicode_char_name_records(text: str) list[UnicodeCharNameRecord][source]
smith_utils.text.normalize_file_to_lf(input_path: Path, output_path: Path, *, chunk_size: int = 1048576) dict[source]

Normalize line endings to LF in a streaming fashion.

Returns:

A dict with the keys:

  • newline_type: str (LF, CRLF, CR, MIXED, NONE)

  • bytes_in: int

  • bytes_out: int

Return type:

dict

smith_utils.text.normalize_newlines_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1048576) Literal['LF', 'CRLF', 'CR', 'MIXED', 'NONE'][source]

Normalize newlines in a stream, replacing CR and CRLF with LF, and writing the result to the destination stream.

This function processes a binary stream to unify newline characters into a single format (LF). It reads the source stream in chunks, processes each chunk to replace CR and CRLF with LF, and writes the normalized output to the destination stream. Additionally, it returns the newline type present in the original data: “LF”, “CRLF”, “CR”, or “MIXED”. If no newlines are present, it returns “NONE”.

Parameters:
  • src (BinaryIO) – The source stream to read binary data from.

  • dst (BinaryIO) – The destination stream to write the normalized binary data to.

  • chunk_size (int) – The size of chunks to read from the source stream. The default is 1024 * 1024 bytes (1 MiB).

Returns:

A string representing the type of newline characters found in the source stream, which can be “LF”, “CRLF”, “CR”, “MIXED”, or “NONE”.

Return type:

NewlineType
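The chunked approach described above has one subtle point: a CRLF pair may be split across two chunks. A minimal sketch of one way to handle that follows, holding back a trailing CR until the next chunk arrives. The function name normalize_newlines_sketch is hypothetical, and this is not necessarily how the library does it.

```python
import io
from typing import BinaryIO


def normalize_newlines_sketch(
    src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024
) -> str:
    seen = set()        # newline styles observed so far
    pending_cr = False  # True if the previous chunk ended with an unresolved CR

    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if pending_cr:
            chunk = b"\r" + chunk
            pending_cr = False
        if chunk.endswith(b"\r"):
            # Could be the first half of a CRLF; decide once more data arrives.
            chunk = chunk[:-1]
            pending_cr = True

        crlf = chunk.count(b"\r\n")
        if crlf:
            seen.add("CRLF")
        if chunk.count(b"\r") - crlf:
            seen.add("CR")
        if chunk.count(b"\n") - crlf:
            seen.add("LF")

        # Replace CRLF first so its CR is not converted a second time.
        dst.write(chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))

    if pending_cr:
        # The stream ended on a lone CR.
        seen.add("CR")
        dst.write(b"\n")

    if not seen:
        return "NONE"
    return "MIXED" if len(seen) > 1 else next(iter(seen))


src = io.BytesIO(b"a\r\nb\rc\nd")
dst = io.BytesIO()
print(normalize_newlines_sketch(src, dst, chunk_size=2), dst.getvalue())
```

The small chunk_size in the usage line is deliberate: it forces the CRLF pair to straddle a chunk boundary, which is the case the pending_cr flag exists for.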

smith_utils.text.normalize_text(text, ignore_case=True, remove_all_whitespace=False, nfkc=True)[source]

Normalizes the input text by applying transformations such as Unicode normalization, case folding, and whitespace handling.

Parameters:
  • text (str or None) – The input text to normalize. If None, an empty string is returned.

  • ignore_case (bool, optional) – Whether to convert the text to lowercase. Defaults to True.

  • remove_all_whitespace (bool, optional) – Whether to remove all internal whitespace and trim outer whitespace. Defaults to False.

  • nfkc (bool, optional) – Whether to apply Unicode normalization using NFKC form. Defaults to True.

Returns:

The normalized text.

Return type:

str
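The transformations listed above could be combined roughly as in the following sketch. The function name normalize_text_sketch is hypothetical. The order of the steps, and leaving whitespace untouched when remove_all_whitespace is False, are assumptions rather than documented behavior.

```python
import unicodedata


def normalize_text_sketch(
    text, ignore_case=True, remove_all_whitespace=False, nfkc=True
):
    if text is None:
        return ""
    if nfkc:
        # NFKC folds compatibility characters (ligatures, full-width
        # forms, and so on) into their canonical equivalents.
        text = unicodedata.normalize("NFKC", text)
    if ignore_case:
        text = text.lower()
    if remove_all_whitespace:
        # split/join drops internal, leading, and trailing whitespace.
        text = "".join(text.split())
    return text


print(normalize_text_sketch("\ufb01le Name"))  # input starts with the "fi" ligature
print(normalize_text_sketch(" A b\tC ", remove_all_whitespace=True))
```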