Text Normalization & Metrics
- class smith_utils.text.Relation(*values)
Bases:
Enum
- CASE_INSENSITIVE_MATCH = 2
- EXACT_MATCH = 1
- NORMALIZED_SPACE_MATCH = 4
- NO_STRUCTURAL_MATCH = 5
- WHITESPACE_TRIMMED_MATCH = 3
- class smith_utils.text.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)
Bases:
object
Represents the result of a comparison, encapsulating classification and similarity metrics.
An instance holds the outcome of comparing two entities: their relationship classification and several similarity metrics. It also provides a method to retrieve a string representation of the relationship classification.
- damerau_levenshtein_distance: int
- jaro_winkler_score: float
- similarity_percentage: float
- class smith_utils.text.StringDistance
Bases:
object
Provides functionality for calculating string distances and relationships between two strings based on various algorithms.
This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.
- Return classification:
Possible relationship classification between two strings.
- Return damerau_levenshtein_distance:
Integer distance calculated using the Damerau-Levenshtein algorithm.
- Return jaro_winkler_score:
A float score indicating similarity using the Jaro-Winkler algorithm.
- Return similarity_percentage:
A percentage similarity score between two strings.
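The relationship classification described above can be pictured as a cascade of progressively looser comparisons. The following is a minimal sketch, not the library's implementation: `RelationSketch` and `classify_sketch` are hypothetical names, and the order of the checks and the exact meaning of each match level are assumptions inferred from the enum member names.

```python
from enum import Enum


class RelationSketch(Enum):
    """Hypothetical stand-in mirroring the documented Relation members."""
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_sketch(a: str, b: str) -> RelationSketch:
    """Return the strictest relation that holds (assumed ordering)."""
    if a == b:
        return RelationSketch.EXACT_MATCH
    if a.casefold() == b.casefold():
        return RelationSketch.CASE_INSENSITIVE_MATCH
    if a.strip() == b.strip():
        return RelationSketch.WHITESPACE_TRIMMED_MATCH
    # Collapse every run of whitespace to a single space before comparing.
    if " ".join(a.split()) == " ".join(b.split()):
        return RelationSketch.NORMALIZED_SPACE_MATCH
    return RelationSketch.NO_STRUCTURAL_MATCH
```

Checking the strictest relation first means a pair is always reported at the tightest level it satisfies.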
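- static calculate_damerau_levenshtein(s1: str, s2: str) → int
Computes the restricted Damerau-Levenshtein distance (also known as optimal string alignment distance): the minimum number of insertions, deletions, substitutions, and adjacent transpositions needed to turn s1 into s2, where no substring is edited more than once.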
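For illustration, the restricted variant can be computed with the standard dynamic-programming table. This is a sketch of the textbook algorithm under the hypothetical name `osa_distance`; it is not taken from the library's source, which may differ in details such as memory layout.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] holds the distance between s1[:i] and s2[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" <-> "ba".
            if (
                i > 1
                and j > 1
                and s1[i - 1] == s2[j - 2]
                and s1[i - 2] == s2[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

The restriction shows up on inputs such as `"ca"` and `"abc"`: this variant yields 3, whereas the unrestricted Damerau-Levenshtein distance is 2, because the restricted form cannot transpose two characters and then insert between them.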
- class smith_utils.text.UnicodeCharNameRecord(index, codepoint, char, name)
Bases:
NamedTuple
- char: str
Alias for field number 2
- codepoint: str
Alias for field number 1
- index: int
Alias for field number 0
- name: str
Alias for field number 3
- smith_utils.text.make_unicode_char_name_records(text: str) → list[UnicodeCharNameRecord]
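The function carries no description here, so the following is only a guess at its behavior based on the record fields: one record per character, built with the standard `unicodedata` module. The names `CharNameRecordSketch` and `char_name_records_sketch`, the `U+XXXX` codepoint format, and the `<unnamed>` placeholder for characters without a Unicode name are all assumptions.

```python
import unicodedata
from typing import NamedTuple


class CharNameRecordSketch(NamedTuple):
    """Hypothetical stand-in with the same fields as UnicodeCharNameRecord."""
    index: int
    codepoint: str
    char: str
    name: str


def char_name_records_sketch(text: str) -> list[CharNameRecordSketch]:
    """Build one record per character in text (assumed behavior)."""
    return [
        CharNameRecordSketch(
            index=i,
            codepoint=f"U+{ord(ch):04X}",  # assumed formatting
            char=ch,
            # Control characters have no name; fall back to a placeholder.
            name=unicodedata.name(ch, "<unnamed>"),
        )
        for i, ch in enumerate(text)
    ]
```

A listing like this is useful for spotting invisible or look-alike characters, for example a non-breaking space where a plain space was expected.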
- smith_utils.text.normalize_file_to_lf(input_path: Path, output_path: Path, *, chunk_size: int = 1048576) → dict
Normalize line endings to LF in a streaming fashion.
- Returns:
A dict with the following keys:
newline_type: str (LF, CRLF, CR, MIXED, NONE)
bytes_in: int
bytes_out: int
- Return type:
dict
- smith_utils.text.normalize_newlines_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1048576) → Literal['LF', 'CRLF', 'CR', 'MIXED', 'NONE']
Normalize newlines in a stream, replacing CR and CRLF with LF, and writing the result to the destination stream.
This function processes a binary stream to unify newline characters into a single format (LF). It reads the source stream in chunks, processes each chunk to replace CR and CRLF with LF, and writes the normalized output to the destination stream. Additionally, it returns the newline type present in the original data: “LF”, “CRLF”, “CR”, or “MIXED”. If no newlines are present, it returns “NONE”.
- Parameters:
src (BinaryIO) – The source stream to read binary data from.
dst (BinaryIO) – The destination stream to write the normalized binary data to.
chunk_size (int) – The number of bytes to read from the source stream at a time. The default is 1024 * 1024 bytes (1 MiB).
- Returns:
A string representing the type of newline characters found in the source stream, which can be “LF”, “CRLF”, “CR”, “MIXED”, or “NONE”.
- Return type:
NewlineType
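The main subtlety in chunked newline normalization is a CRLF pair split across two reads. The sketch below shows one way to handle it by holding back a trailing CR until the next chunk arrives. It is written from the description above rather than from the library's source; the name `normalize_newlines_sketch` and the detection logic are assumptions.

```python
import io
from typing import BinaryIO


def normalize_newlines_sketch(
    src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024
) -> str:
    """Copy src to dst with CR and CRLF replaced by LF; report what was seen."""
    seen = set()
    pending_cr = False  # True when the previous chunk ended with a bare CR
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if pending_cr:
            chunk = b"\r" + chunk
            pending_cr = False
        if chunk.endswith(b"\r"):
            # Could be the first half of CRLF; decide on the next read.
            chunk = chunk[:-1]
            pending_cr = True
        crlf = chunk.count(b"\r\n")
        if crlf:
            seen.add("CRLF")
        if chunk.count(b"\r") - crlf:
            seen.add("CR")
        if chunk.count(b"\n") - crlf:
            seen.add("LF")
        dst.write(chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))
    if pending_cr:
        # The stream ended on a lone CR.
        seen.add("CR")
        dst.write(b"\n")
    if not seen:
        return "NONE"
    return "MIXED" if len(seen) > 1 else next(iter(seen))
```

A small chunk size makes the boundary case easy to exercise:

```python
out = io.BytesIO()
kind = normalize_newlines_sketch(io.BytesIO(b"a\r\nb\r\nc"), out, chunk_size=2)
# kind == "CRLF" and out.getvalue() == b"a\nb\nc"
```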
- smith_utils.text.normalize_text(text, ignore_case=True, remove_all_whitespace=False, nfkc=True)
Normalizes the input text by applying transformations such as Unicode normalization, case folding, and whitespace handling.
- Parameters:
text (str or None) – The input text to normalize. If None, an empty string is returned.
ignore_case (bool, optional) – Whether to convert the text to lowercase. Defaults to True.
remove_all_whitespace (bool, optional) – Whether to remove all internal whitespace and trim outer whitespace. Defaults to False.
nfkc (bool, optional) – Whether to apply Unicode normalization using NFKC form. Defaults to True.
- Returns:
The normalized text.
- Return type:
str
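The documented parameters map onto standard-library calls fairly directly. Here is a minimal sketch under the hypothetical name `normalize_text_sketch`; the order of the steps, the use of `str.casefold()` rather than `str.lower()`, and leaving whitespace untouched when `remove_all_whitespace` is false are assumptions, not confirmed library behavior.

```python
import unicodedata


def normalize_text_sketch(
    text,
    ignore_case: bool = True,
    remove_all_whitespace: bool = False,
    nfkc: bool = True,
) -> str:
    """Approximate the documented normalization steps (assumed order)."""
    if text is None:
        return ""
    if nfkc:
        # NFKC folds compatibility forms, e.g. fullwidth letters to ASCII.
        text = unicodedata.normalize("NFKC", text)
    if ignore_case:
        # casefold() is a more aggressive lower() intended for comparisons.
        text = text.casefold()
    if remove_all_whitespace:
        # split() with no arguments drops leading, trailing, and inner runs.
        text = "".join(text.split())
    return text
```

Applying NFKC before case folding means compatibility characters such as fullwidth "Ａ" are first mapped to "A" and then folded to "a".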