Quickstart
This page gives a short introduction to the most common use cases.
Datetime Utilities
- smith_utils.format_ordinal(n: int) → str
Converts an integer to its ordinal representation as a string.
The function takes an integer input and returns the ordinal form of the number as a string with its appropriate suffix. For example, 1 becomes “1st”, 2 becomes “2nd”, and so on. The suffix rules account for the exceptional cases such as numbers ending in 11, 12, or 13.
- Parameters:
n (int) – The integer to be converted to its ordinal equivalent.
- Returns:
The ordinal representation of the input number.
- Return type:
str
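The suffix rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation; the name ordinal_sketch is hypothetical, and the handling of negative numbers (via abs) is an assumption.

```python
def ordinal_sketch(n: int) -> str:
    """Hypothetical re-implementation of the suffix rule (not smith_utils code)."""
    # Numbers ending in 11, 12, or 13 always take "th" (11th, 112th, 213th).
    if abs(n) % 100 in (11, 12, 13):
        suffix = "th"
    else:
        # Otherwise the last digit decides: 1 -> st, 2 -> nd, 3 -> rd, else th.
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(abs(n) % 10, "th")
    return f"{n}{suffix}"


print(ordinal_sketch(1))    # 1st
print(ordinal_sketch(22))   # 22nd
print(ordinal_sketch(112))  # 112th
```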
- smith_utils.parse_strict_date(date_str: str) → date
Parses a given date string into a datetime.date object.
This function parses the input strictly. It supports both the basic (YYYYMMDD) and extended (YYYY-MM-DD) formats, rejects ambiguous or malformed inputs, and raises an error if the string does not conform to either format.
Note
The returned datetime.date object is naive and does not retain any time zone information. If the input string contains a time component or time zone identifier (e.g., after a ‘T’ or space), it is simply discarded without any time zone conversion. This means a UTC timestamp might represent a different calendar date in a local time zone.
- Parameters:
date_str (str) – The date string to be parsed. It is expected to be in either basic (YYYYMMDD) or extended (YYYY-MM-DD) format.
- Returns:
A date object representing the parsed date.
- Return type:
datetime.date
- Raises:
ValueError – If the input string contains an ambiguous or invalid date.
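A rough sketch of this kind of strict parsing, using only the standard library, is shown below. The function name parse_strict_date_sketch and the regular expressions are assumptions made for illustration; the actual validation rules in smith_utils may differ.

```python
import re
from datetime import date

# Exactly eight digits (basic) or YYYY-MM-DD with zero-padded fields (extended).
_BASIC = re.compile(r"(\d{4})(\d{2})(\d{2})")
_EXTENDED = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_strict_date_sketch(date_str: str) -> date:
    """Hypothetical stand-in for parse_strict_date (not the library's code)."""
    # Keep only the date part: anything after a 'T' or a space is discarded
    # without any time zone conversion.
    date_part = re.split(r"[T ]", date_str.strip(), maxsplit=1)[0]
    match = _BASIC.fullmatch(date_part) or _EXTENDED.fullmatch(date_part)
    if match is None:
        raise ValueError(f"Unsupported date format: {date_str!r}")
    year, month, day = (int(group) for group in match.groups())
    # date() raises ValueError for impossible dates such as February 30.
    return date(year, month, day)


# The time and zone are dropped, so this stays on December 25 even though the
# same instant falls on December 26 in zones far enough ahead of UTC.
print(parse_strict_date_sketch("2023-12-25T23:30:00Z"))  # 2023-12-25
```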
- smith_utils.ensure_date(date_input: str | date | None) → date
Converts various date input formats into a datetime.date object.
This function accepts a string, a datetime.date object, or a None value to produce a datetime.date object. If the input is None, the current date is returned. String inputs are first stripped of leading and trailing whitespace, and the function attempts to parse the string as an ISO 8601 formatted date. If the ISO 8601 parsing fails, the input is passed to a fallback parsing function for stricter date parsing.
- Raises:
ValueError – If the input string cannot be parsed into a valid date by either ISO 8601 parsing or the fallback parsing mechanism.
- Parameters:
date_input (str | datetime.date | None) – The input to be converted to a datetime.date object. This can be a string representing a date, a datetime.date object itself, or None.
- Returns:
A valid datetime.date object corresponding to the provided input, or the current date if the input is None.
- Return type:
datetime.date
from smith_utils import ensure_date
# Convert strings or datetime.date objects into a date object.
date = ensure_date("20231225") # datetime.date(2023, 12, 25)
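The dispatch logic described for ensure_date (None, existing date, or string with an ISO 8601 attempt followed by a fallback) might look roughly like the sketch below. The names ensure_date_sketch and _fallback_parse are hypothetical, and the strptime-based fallback is only a stand-in for whatever stricter parser the library actually uses.

```python
from datetime import date, datetime


def _fallback_parse(text: str) -> date:
    # Stand-in fallback (assumption): accept the basic YYYYMMDD form.
    return datetime.strptime(text, "%Y%m%d").date()


def ensure_date_sketch(date_input) -> date:
    """Hypothetical illustration of the ensure_date flow (not library code)."""
    if date_input is None:
        return date.today()          # None means "today"
    if isinstance(date_input, date):
        return date_input            # already a date: return unchanged
    text = date_input.strip()        # ignore surrounding whitespace
    try:
        return date.fromisoformat(text)
    except ValueError:
        # ISO 8601 parsing failed; the fallback raises ValueError itself
        # if the string is still not a valid date.
        return _fallback_parse(text)


print(ensure_date_sketch(" 2023-12-25 "))  # 2023-12-25
print(ensure_date_sketch("20231225"))      # 2023-12-25
```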
Numeric Refinement
- smith_utils.parse_numeric_value(val: str | float | None, sep: str = ',', decimal: str = '.', relaxed: bool = False) → float | str
- smith_utils.parse_currency_value(val: str | float | None, sep: str = ',', decimal: str = '.', relaxed: bool = False) → float | str
from smith_utils import parse_numeric_value
# Parse messy numeric data.
value = parse_numeric_value("(1,250.50)") # -1250.5
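The example above suggests that parentheses are read as an accounting-style negative and that sep is treated as a thousands separator. A minimal sketch of that idea follows; it is an assumption based only on the example, the name parse_accounting_number is hypothetical, and it does not model the relaxed flag, None inputs, or the str return case.

```python
def parse_accounting_number(text: str, sep: str = ",", decimal: str = ".") -> float:
    """Hypothetical sketch: parenthesized values are negative (not library code)."""
    cleaned = text.strip()
    # Accounting notation wraps negative amounts in parentheses.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    # Drop thousands separators, then map the decimal mark to ".".
    cleaned = cleaned.replace(sep, "").replace(decimal, ".")
    value = float(cleaned)
    return -value if negative else value


print(parse_accounting_number("(1,250.50)"))                      # -1250.5
print(parse_accounting_number("1.250,50", sep=".", decimal=","))  # 1250.5
```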
Text Normalization & Metrics
- smith_utils.normalize_text(text, ignore_case=True, remove_all_whitespace=False, nfkc=True)
Normalizes the input text by applying transformations such as Unicode normalization, case folding, and whitespace handling.
- Parameters:
text (str or None) – The input text to normalize. If None, an empty string is returned.
ignore_case (bool, optional) – Whether to convert the text to lowercase. Defaults to True.
remove_all_whitespace (bool, optional) – Whether to remove all internal whitespace and trim outer whitespace. Defaults to False.
nfkc (bool, optional) – Whether to apply Unicode normalization using NFKC form. Defaults to True.
- Returns:
The normalized text.
- Return type:
str
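The three switches can be pictured with the standard library alone. The sketch below is an approximation under stated assumptions: the name normalize_text_sketch is hypothetical, casefold() stands in for the lowercase conversion, and trimming outer whitespace when remove_all_whitespace is False is inferred from the quickstart example rather than documented.

```python
import unicodedata


def normalize_text_sketch(text, ignore_case=True, remove_all_whitespace=False, nfkc=True):
    """Hypothetical approximation of normalize_text (not the library's code)."""
    if text is None:
        return ""
    if nfkc:
        # NFKC folds compatibility forms, e.g. full-width letters to ASCII.
        text = unicodedata.normalize("NFKC", text)
    if ignore_case:
        text = text.casefold()
    if remove_all_whitespace:
        # split() with no arguments splits on any run of whitespace.
        return "".join(text.split())
    return text.strip()  # assumption: outer whitespace is always trimmed


print(normalize_text_sketch(" Smith Utils "))                              # smith utils
print(normalize_text_sketch(" Smith Utils ", remove_all_whitespace=True))  # smithutils
```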
- smith_utils.make_unicode_char_name_records(text: str) → list[UnicodeCharNameRecord]
- smith_utils.normalize_file_to_lf(input_path: Path, output_path: Path, *, chunk_size: int = 1048576) → dict
Normalize line endings to LF in a streaming fashion.
- Returns:
A dict with the keys newline_type (str; one of LF, CRLF, CR, MIXED, NONE), bytes_in (int), and bytes_out (int).
- Return type:
dict
- smith_utils.normalize_newlines_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1048576) → Literal['LF', 'CRLF', 'CR', 'MIXED', 'NONE']
Normalize newlines in a stream, replacing CR and CRLF with LF, and writing the result to the destination stream.
This function processes a binary stream to unify newline characters into a single format (LF). It reads the source stream in chunks, processes each chunk to replace CR and CRLF with LF, and writes the normalized output to the destination stream. Additionally, it returns the newline type present in the original data: “LF”, “CRLF”, “CR”, or “MIXED”. If no newlines are present, it returns “NONE”.
- Parameters:
src (BinaryIO) – The source stream to read binary data from.
dst (BinaryIO) – The destination stream to write the normalized binary data to.
chunk_size (int) – The size of chunks to read from the source stream. The default is 1024 * 1024 bytes (1 MiB).
- Returns:
A string representing the type of newline characters found in the source stream, which can be “LF”, “CRLF”, “CR”, “MIXED”, or “NONE”.
- Return type:
NewlineType
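The subtle part of chunked newline normalization is a CRLF pair that straddles a chunk boundary. The sketch below shows one way to handle it, by holding back a trailing CR until the next chunk is read. It is an illustration only; the name normalize_newlines_sketch is hypothetical and the library's actual implementation may differ.

```python
import io
from typing import BinaryIO


def normalize_newlines_sketch(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical chunked CR/CRLF -> LF conversion (not the library's code)."""
    seen = set()
    pending_cr = False  # True when the previous chunk ended with a CR
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if pending_cr:
            chunk = b"\r" + chunk  # re-attach the CR held back earlier
            pending_cr = False
        if chunk.endswith(b"\r"):
            # Hold the CR back: the next chunk may start with the matching LF.
            chunk = chunk[:-1]
            pending_cr = True
        crlf = chunk.count(b"\r\n")
        if crlf:
            seen.add("CRLF")
        if chunk.count(b"\r") - crlf:
            seen.add("CR")
        if chunk.count(b"\n") - crlf:
            seen.add("LF")
        dst.write(chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))
    if pending_cr:
        # The stream ended on a lone CR.
        seen.add("CR")
        dst.write(b"\n")
    if not seen:
        return "NONE"
    return "MIXED" if len(seen) > 1 else next(iter(seen))


# A tiny chunk size forces the CRLF pairs to straddle chunk boundaries.
out = io.BytesIO()
print(normalize_newlines_sketch(io.BytesIO(b"x\r\ny\r\n"), out, chunk_size=2))  # CRLF
print(out.getvalue())  # b'x\ny\n'
```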
- class smith_utils.StringDistance
Bases: object
Provides functionality for calculating string distances and relationships between two strings based on various algorithms.
This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.
- Return classification:
Possible relationship classification between two strings.
- Return damerau_levenshtein_distance:
Integer distance calculated using the Damerau-Levenshtein algorithm.
- Return jaro_winkler_score:
A float score indicating similarity using the Jaro-Winkler algorithm.
- Return similarity_percentage:
A percentage similarity score between two strings.
- static calculate_damerau_levenshtein(s1: str, s2: str) → int
Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
- static calculate_jaro_winkler(s1: str, s2: str) → float
- static equals_ignore_case(a: str, b: str) → bool
- static strip_all(s: str) → str
Remove all whitespace characters from a string.
Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
- static trim(s: str) → str
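For reference, the restricted Damerau-Levenshtein distance (also known as optimal string alignment) can be computed with the textbook dynamic program below. This is a generic sketch of the algorithm named above, not the code behind calculate_damerau_levenshtein; the function name osa_distance is hypothetical.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Textbook optimal string alignment distance (illustrative sketch)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] is the distance between s1[:i] and s2[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" <-> "ba" counts as one edit.
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("abcd", "acbd"))  # 1 (one adjacent transposition)
print(osa_distance("ca", "abc"))     # 3 (the restricted variant cannot edit a transposed pair again)
```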
- class smith_utils.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)
Bases: object
Represents the result of a comparison, encapsulating classification and similarity metrics.
This class encapsulates the result of a comparison operation between two entities, including their relationship classification and various similarity metrics. It provides a method to retrieve a string representation of the relationship classification.
- classification: Relation
- damerau_levenshtein_distance: int
- get_relation_string() → str
- jaro_winkler_score: float
- similarity_percentage: float
- class smith_utils.Relation(*values)
Bases: Enum
- CASE_INSENSITIVE_MATCH = 2
- EXACT_MATCH = 1
- NORMALIZED_SPACE_MATCH = 4
- NO_STRUCTURAL_MATCH = 5
- WHITESPACE_TRIMMED_MATCH = 3
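One plausible reading of these members is a cascade from the strictest comparison to the loosest, in the order of the enum values. The sketch below illustrates that reading only; RelationSketch mirrors the member names for self-containment, classify_sketch is a hypothetical name, and the order of checks and the case handling in the whitespace tiers are assumptions that the source does not confirm.

```python
from enum import Enum


class RelationSketch(Enum):
    # Local mirror of the documented members, so the sketch runs on its own.
    EXACT_MATCH = 1
    CASE_INSENSITIVE_MATCH = 2
    WHITESPACE_TRIMMED_MATCH = 3
    NORMALIZED_SPACE_MATCH = 4
    NO_STRUCTURAL_MATCH = 5


def classify_sketch(a: str, b: str) -> RelationSketch:
    """Hypothetical cascade from strictest to loosest comparison."""
    if a == b:
        return RelationSketch.EXACT_MATCH
    if a.casefold() == b.casefold():
        return RelationSketch.CASE_INSENSITIVE_MATCH
    # Assumption: the whitespace tiers also ignore case.
    if a.strip().casefold() == b.strip().casefold():
        return RelationSketch.WHITESPACE_TRIMMED_MATCH
    if "".join(a.split()).casefold() == "".join(b.split()).casefold():
        return RelationSketch.NORMALIZED_SPACE_MATCH
    return RelationSketch.NO_STRUCTURAL_MATCH


print(classify_sketch("Smith", "smith").name)   # CASE_INSENSITIVE_MATCH
print(classify_sketch(" Smith ", "Smith").name)  # WHITESPACE_TRIMMED_MATCH
```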
from smith_utils import normalize_text
# Normalize text.
clean_text = normalize_text(" Smith Utils ") # "smith utils"
from smith_utils import make_unicode_char_name_records
# Extract codepoint and Unicode name metadata.
records = make_unicode_char_name_records("Aあ")
from pathlib import Path
from smith_utils import normalize_file_to_lf
# Normalize newlines to LF and report original newline style.
summary = normalize_file_to_lf(Path("input.txt"), Path("output.txt"))
Crypto Hash Utilities
- smith_utils.get_text_digest(text, encoding='utf-8')
Return the SHA-256 hexadecimal digest for text encoded as encoding.
- smith_utils.get_file_digest(file_path)
Return the SHA-256 hexadecimal digest for the file at file_path.
from smith_utils import get_file_digest, get_text_digest
# Calculate SHA-256 digests for strings and files.
text_digest = get_text_digest("smith-utils")
file_digest = get_file_digest("input.txt")
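For comparison, equivalent digests can be produced directly with the standard hashlib module. The sketch below is not the library's code; the function names and the chunked file read (chunk_size) are assumptions chosen for illustration.

```python
import hashlib


def text_digest_sketch(text: str, encoding: str = "utf-8") -> str:
    # Encode first: SHA-256 operates on bytes, not on str.
    return hashlib.sha256(text.encode(encoding)).hexdigest()


def file_digest_sketch(file_path, chunk_size: int = 65536) -> str:
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # Read in chunks so large files are never fully loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(text_digest_sketch("abc"))
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

A file containing exactly the bytes of the encoded text yields the same digest from both functions.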
File Classification Utilities
- class smith_utils.FileClassification(path: Path, extension: str, file_description: str | None, file_mime_type: str | None, extension_mime_type: str | None, magic_type: str | None, magic_mime_type: str | None, file_class: str | None, categories: tuple[str, ...])
Bases: object
Classification evidence for a file path.
- categories: tuple[str, ...]
- extension: str
- extension_mime_type: str | None
- file_class: str | None
- file_description: str | None
- file_mime_type: str | None
- magic_mime_type: str | None
- magic_type: str | None
- path: Path
- smith_utils.classify_file(path: str | Path, *, sample_size: int = 4096) → FileClassification
Classify a file using extension, magic bytes, MIME, and file(1) signals.
from smith_utils import classify_file
# Classify a file from extension, MIME, magic bytes, and file(1) evidence.
classification = classify_file("input.pdf")
file_class = classification.file_class # "document"
categories = classification.categories # ("document", "pdf")
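Two of the signals mentioned above, the extension-based MIME type and the magic bytes at the start of the file, can be gathered with the standard library alone. The sketch below only illustrates the idea of collecting independent evidence; file_evidence_sketch and the small signature table are hypothetical, and it does not call file(1) or reproduce how classify_file derives file_class and categories.

```python
import mimetypes
from pathlib import Path

# Tiny illustrative signature table (assumption); real detectors know far more.
_MAGIC_SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def file_evidence_sketch(path, sample_size: int = 4096) -> dict:
    """Hypothetical evidence gathering from extension and magic bytes."""
    path = Path(path)
    # Signal 1: MIME type guessed from the file name alone.
    extension_mime, _encoding = mimetypes.guess_type(path.name)
    # Signal 2: leading bytes compared against known signatures.
    with path.open("rb") as handle:
        sample = handle.read(sample_size)
    magic_mime = next(
        (mime for signature, mime in _MAGIC_SIGNATURES.items() if sample.startswith(signature)),
        None,
    )
    return {
        "extension": path.suffix.lower(),
        "extension_mime_type": extension_mime,
        "magic_mime_type": magic_mime,
    }
```

Keeping the signals separate, as FileClassification does with its extension_mime_type and magic_mime_type fields, makes it possible to notice when a file's name and its content disagree.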