Quickstart

This page gives a short introduction to the most common use cases.

Datetime Utilities

smith_utils.format_ordinal(n: int) -> str

Converts an integer to its ordinal representation as a string.

The function takes an integer input and returns the ordinal form of the number as a string with its appropriate suffix. For example, 1 becomes “1st”, 2 becomes “2nd”, and so on. The suffix rules account for the exceptional cases such as numbers ending in 11, 12, or 13.

Parameters:

n (int) – The integer to be converted to its ordinal equivalent.

Returns:

The ordinal representation of the input number.

Return type:

str
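The suffix rule described above can be sketched in a few lines. This is only an illustration, not the smith_utils implementation: the name ordinal_sketch is hypothetical, and picking the suffix from the absolute value for negative numbers is an assumption.

```python
def ordinal_sketch(n: int) -> str:
    """Return n with an English ordinal suffix (hypothetical helper, not smith_utils code)."""
    magnitude = abs(n)  # assumption: negatives take the suffix of their absolute value
    # Numbers ending in 11, 12, or 13 always take "th".
    if 11 <= magnitude % 100 <= 13:
        suffix = "th"
    else:
        # Otherwise the last digit decides: 1 -> st, 2 -> nd, 3 -> rd, anything else -> th.
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(magnitude % 10, "th")
    return f"{n}{suffix}"


print([ordinal_sketch(n) for n in (1, 2, 3, 4, 11, 12, 13, 21, 112)])
# ['1st', '2nd', '3rd', '4th', '11th', '12th', '13th', '21st', '112th']
```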

smith_utils.parse_strict_date(date_str: str) -> date

Parses a given date string into a datetime.date object.

This function attempts to parse the input date string in various formats while adhering to strict expectations for valid date representations. It supports both basic (YYYYMMDD) and extended (YYYY-MM-DD) formats and rejects ambiguous or malformed inputs. The function raises an error if the provided string does not conform to the acceptable formats.

Note

The returned datetime.date object is naive and does not retain any time zone information. If the input string contains a time component or time zone identifier (e.g., after a ‘T’ or space), it is simply discarded without any time zone conversion. This means a UTC timestamp might represent a different calendar date in a local time zone.

Parameters:

date_str (str) – The date string to be parsed. It is expected to be in either basic (YYYYMMDD) or extended (YYYY-MM-DD) format.

Returns:

A date object representing the parsed date.

Return type:

datetime.date

Raises:

ValueError – If the input string contains an ambiguous or invalid date.
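To make the strictness concrete, here is a rough sketch of a parser that accepts only YYYYMMDD and YYYY-MM-DD and drops any trailing time part. It is not the library's code: parse_strict_date_sketch and the two regular expressions are hypothetical, and the exact way the real function detects and discards a time component is an assumption based on the note above.

```python
import re
from datetime import date

# Exactly eight digits (basic) or zero-padded digits with hyphens (extended).
_BASIC = re.compile(r"(\d{4})(\d{2})(\d{2})")
_EXTENDED = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_strict_date_sketch(date_str: str) -> date:
    """Hypothetical strict parser for YYYYMMDD / YYYY-MM-DD strings."""
    # Assumption: everything after the first "T" or space is a time part and is dropped
    # without any time zone conversion.
    date_part = re.split(r"[T ]", date_str.strip(), maxsplit=1)[0]
    match = _BASIC.fullmatch(date_part) or _EXTENDED.fullmatch(date_part)
    if match is None:
        raise ValueError(f"not a YYYYMMDD or YYYY-MM-DD date: {date_str!r}")
    year, month, day = (int(group) for group in match.groups())
    # date() itself raises ValueError for impossible dates such as February 30.
    return date(year, month, day)


print(parse_strict_date_sketch("2023-12-25T23:30:00Z"))  # 2023-12-25
```

Because the time part is dropped rather than converted, the 23:30 UTC timestamp above stays on 25 December even though it is already 26 December in time zones east of UTC.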

smith_utils.ensure_date(date_input: str | date | None) -> date

Converts various date input formats into a datetime.date object.

This function accepts a string, a datetime.date object, or a None value to produce a datetime.date object. If the input is None, the current date is returned. String inputs are first stripped of leading and trailing whitespace, and the function attempts to parse the string as an ISO 8601 formatted date. If the ISO 8601 parsing fails, the input is passed to a fallback parsing function for stricter date parsing.

Raises:

ValueError – If the input string cannot be parsed into a valid date via both ISO 8601 and the fallback parsing mechanism.

Parameters:

date_input (str | datetime.date | None) – The input to be converted to a datetime.date object. This can be a string representing a date, a datetime.date object itself, or None.

Returns:

A valid datetime.date object corresponding to the provided input, or the current date if the input is None.

Return type:

datetime.date
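The dispatch described above (None, an existing date, or a string tried as ISO 8601 first and then handed to a fallback) might look roughly like this. The names ensure_date_sketch and fallback_parse are hypothetical, and the fallback here only understands YYYYMMDD; the real fallback is presumably the stricter parser documented on this page.

```python
from datetime import date, datetime
from typing import Optional, Union


def fallback_parse(text: str) -> date:
    """Hypothetical stand-in for the fallback parser; handles YYYYMMDD only."""
    return datetime.strptime(text, "%Y%m%d").date()


def ensure_date_sketch(date_input: Optional[Union[str, date]]) -> date:
    """Illustrative sketch, not the smith_utils implementation."""
    if date_input is None:
        return date.today()  # None means "today"
    if isinstance(date_input, date):
        return date_input  # already a date: pass it through
    text = date_input.strip()  # strings are trimmed before parsing
    try:
        return date.fromisoformat(text)  # first attempt: ISO 8601
    except ValueError:
        return fallback_parse(text)  # second attempt; ValueError propagates if this fails too


print(ensure_date_sketch(" 2023-12-25 "))  # 2023-12-25
```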

from smith_utils import ensure_date

# Convert strings or datetime.date objects into a date object.
date = ensure_date("20231225")  # datetime.date(2023, 12, 25)

Numeric Refinement

smith_utils.parse_numeric_value(val: str | float | None, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | str
smith_utils.parse_currency_value(val: str | float | None, sep: str = ',', decimal: str = '.', relaxed: bool = False) -> float | str

from smith_utils import parse_numeric_value

# Parse messy numeric data.
value = parse_numeric_value("(1,250.50)")  # -1250.5
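The example above suggests that parentheses are read as an accounting-style negative and that sep and decimal name the thousands and decimal separators. A minimal sketch under those assumptions follows; parse_numeric_sketch is a hypothetical name, and the sketch does not cover None or float inputs, the relaxed flag, or the case where the real function returns a str.

```python
def parse_numeric_sketch(val: str, sep: str = ",", decimal: str = ".") -> float:
    """Illustrative sketch only; not the smith_utils implementation."""
    text = val.strip()
    # Assumption: "(1,250.50)" is accounting notation for a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop thousands separators first, then map the decimal mark to ".".
    text = text.replace(sep, "").replace(decimal, ".")
    number = float(text)  # raises ValueError on anything that is still not numeric
    return -number if negative else number


print(parse_numeric_sketch("(1,250.50)"))                     # -1250.5
print(parse_numeric_sketch("1.250,50", sep=".", decimal=","))  # 1250.5
```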

Text Normalization & Metrics

smith_utils.normalize_text(text, ignore_case=True, remove_all_whitespace=False, nfkc=True)

Normalizes the input text by applying transformations such as Unicode normalization, case folding, and whitespace handling.

Parameters:
  • text (str or None) – The input text to normalize. If None, an empty string is returned.

  • ignore_case (bool, optional) – Whether to convert the text to lowercase. Defaults to True.

  • remove_all_whitespace (bool, optional) – Whether to remove all internal whitespace and trim outer whitespace. Defaults to False.

  • nfkc (bool, optional) – Whether to apply Unicode normalization using NFKC form. Defaults to True.

Returns:

The normalized text.

Return type:

str
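The three switches can be pictured with the standard library alone. The sketch below is an approximation, not the library's code: normalize_text_sketch is a hypothetical name, and collapsing whitespace runs to a single space when remove_all_whitespace is False is an assumption drawn from the usage example on this page.

```python
import unicodedata
from typing import Optional


def normalize_text_sketch(
    text: Optional[str],
    ignore_case: bool = True,
    remove_all_whitespace: bool = False,
    nfkc: bool = True,
) -> str:
    """Illustrative sketch only; not the smith_utils implementation."""
    if text is None:
        return ""
    if nfkc:
        # NFKC folds compatibility forms, e.g. full-width letters become ASCII letters.
        text = unicodedata.normalize("NFKC", text)
    if ignore_case:
        text = text.lower()  # the parameter description says "lowercase"
    if remove_all_whitespace:
        return "".join(text.split())  # drop every whitespace character
    # Assumption: otherwise trim the ends and collapse inner runs to one space.
    return " ".join(text.split())


print(repr(normalize_text_sketch("  Smith  Utils  ")))  # 'smith utils'
```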

smith_utils.make_unicode_char_name_records(text: str) -> list[UnicodeCharNameRecord]
smith_utils.normalize_file_to_lf(input_path: Path, output_path: Path, *, chunk_size: int = 1048576) -> dict

Normalize line endings to LF in a streaming fashion.

Returns:

A dict with the following keys:

  • newline_type: str (LF, CRLF, CR, MIXED, NONE)

  • bytes_in: int

  • bytes_out: int

Return type:

dict

smith_utils.normalize_newlines_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1048576) -> Literal['LF', 'CRLF', 'CR', 'MIXED', 'NONE']

Normalize newlines in a stream, replacing CR and CRLF with LF, and writing the result to the destination stream.

This function processes a binary stream to unify newline characters into a single format (LF). It reads the source stream in chunks, processes each chunk to replace CR and CRLF with LF, and writes the normalized output to the destination stream. Additionally, it returns the newline type present in the original data: “LF”, “CRLF”, “CR”, or “MIXED”. If no newlines are present, it returns “NONE”.

Parameters:
  • src (BinaryIO) – The source stream to read binary data from.

  • dst (BinaryIO) – The destination stream to write the normalized binary data to.

  • chunk_size (int) – The size of chunks to read from the source stream. The default is 1024 * 1024 bytes (1 MiB).

Returns:

A string representing the type of newline characters found in the source stream, which can be “LF”, “CRLF”, “CR”, “MIXED”, or “NONE”.

Return type:

NewlineType
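The subtle part of chunked newline normalization is a CRLF pair split across two chunks. The sketch below shows one way to handle it by holding back a trailing CR until the next chunk arrives. It is not the library's code: normalize_newlines_sketch is a hypothetical name, and reporting "MIXED" whenever more than one style occurs is an assumption about how the real function classifies its input.

```python
import io
from typing import BinaryIO


def normalize_newlines_sketch(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Illustrative sketch only; not the smith_utils implementation."""
    seen = set()        # newline styles observed so far
    pending_cr = False  # True when the previous chunk ended with an unclassified CR
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if pending_cr:
            chunk = b"\r" + chunk  # re-attach the CR held back from the previous chunk
            pending_cr = False
        if chunk.endswith(b"\r"):
            chunk = chunk[:-1]     # hold it back: the next chunk may start with LF
            pending_cr = True
        crlf = chunk.count(b"\r\n")
        if crlf:
            seen.add("CRLF")
        if chunk.count(b"\r") - crlf:
            seen.add("CR")
        if chunk.count(b"\n") - crlf:
            seen.add("LF")
        # Replace CRLF first so its CR is not converted a second time.
        dst.write(chunk.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))
    if pending_cr:  # the stream ended on a lone CR
        seen.add("CR")
        dst.write(b"\n")
    if not seen:
        return "NONE"
    return "MIXED" if len(seen) > 1 else next(iter(seen))


out = io.BytesIO()
print(normalize_newlines_sketch(io.BytesIO(b"a\r\nb\rc\nd"), out, chunk_size=2))  # MIXED
print(out.getvalue())  # b'a\nb\nc\nd'
```

The tiny chunk_size in the usage lines is deliberate: it forces the CRLF pair to straddle a chunk boundary.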

class smith_utils.StringDistance

Bases: object

Provides functionality for calculating string distances and relationships between two strings based on various algorithms.

This class includes methods for analyzing string similarities and relationships, including exact matches, case-insensitive comparisons, and whitespace normalization. It also implements Damerau-Levenshtein and Jaro-Winkler distance calculations.

Return classification:

Possible relationship classification between two strings.

Return damerau_levenshtein_distance:

Integer distance calculated using the Damerau-Levenshtein algorithm.

Return jaro_winkler_score:

A float score indicating similarity using the Jaro-Winkler algorithm.

Return similarity_percentage:

A percentage similarity score between two strings.

static analyze(a: str, b: str, ignore_case: bool = False) -> Result
static calculate_damerau_levenshtein(s1: str, s2: str) -> int

Restricted Damerau-Levenshtein distance: insertion, deletion, substitution, adjacent transposition.
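The restricted variant is also known as optimal string alignment: each substring may be edited at most once, so a transposed pair cannot be modified again afterwards. A textbook dynamic-programming sketch is shown below; osa_distance is a hypothetical name, and this is the standard algorithm rather than the library's own code.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance; illustrative sketch."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # d[i][j] is the distance between the first i characters of s1 and the first j of s2.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: "ab" <-> "ba".
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # 1
print(osa_distance("ca", "abc"))  # 3 (the unrestricted distance would be 2)
```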

static calculate_jaro_winkler(s1: str, s2: str) -> float
static classify(a: str, b: str) -> Relation
static equals_ignore_case(a: str, b: str) -> bool
static strip_all(s: str) -> str

Remove all whitespace characters from a string.

Uses split/join logic so any whitespace character acts as a separator, including spaces, tabs, and newlines.
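The split/join idiom mentioned above can be shown in one line. The function name strip_all_sketch is hypothetical; the behavior of str.split() with no arguments is standard Python.

```python
def strip_all_sketch(s: str) -> str:
    # str.split() with no arguments splits on runs of any whitespace and drops empty pieces,
    # so joining the pieces with "" removes spaces, tabs, and newlines alike.
    return "".join(s.split())


print(repr(strip_all_sketch(" a\tb \n c ")))  # 'abc'
```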

static trim(s: str) -> str
smith_utils.analyze_pair(a: str, b: str, ignore_case: bool = False) -> Result
class smith_utils.Result(classification: Relation, damerau_levenshtein_distance: int, jaro_winkler_score: float, similarity_percentage: float)

Bases: object

Represents the result of a comparison, encapsulating classification and similarity metrics.

This class encapsulates the result of a comparison operation between two entities, including their relationship classification and various similarity metrics. It provides a method to retrieve a string representation of the relationship classification.

classification: Relation
damerau_levenshtein_distance: int
get_relation_string() -> str
jaro_winkler_score: float
similarity_percentage: float
class smith_utils.Relation(*values)

Bases: Enum

CASE_INSENSITIVE_MATCH = 2
EXACT_MATCH = 1
NORMALIZED_SPACE_MATCH = 4
NO_STRUCTURAL_MATCH = 5
WHITESPACE_TRIMMED_MATCH = 3

from smith_utils import normalize_text

# Normalize text.
clean_text = normalize_text("  Smith  Utils  ")  # "smith utils"

from smith_utils import make_unicode_char_name_records

# Extract codepoint and Unicode name metadata.
records = make_unicode_char_name_records("Aあ")

from pathlib import Path

from smith_utils import normalize_file_to_lf

# Normalize newlines to LF and report the original newline style.
summary = normalize_file_to_lf(Path("input.txt"), Path("output.txt"))

Crypto Hash Utilities

smith_utils.get_text_digest(text, encoding='utf-8')

Return the SHA-256 hexadecimal digest for text encoded as encoding.

smith_utils.get_file_digest(file_path)

Return the SHA-256 hexadecimal digest for the file at file_path.
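For reference, equivalent digests can be computed with hashlib directly. The helper names below are hypothetical, and reading the file in fixed-size chunks is an assumption about how a file digest would typically be computed, not a statement about the library's internals.

```python
import hashlib


def text_digest_sketch(text: str, encoding: str = "utf-8") -> str:
    """SHA-256 hex digest of text after encoding it (illustrative sketch)."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


def file_digest_sketch(file_path, chunk_size: int = 65536) -> str:
    """SHA-256 hex digest of a file, read in chunks so large files stay out of memory."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(text_digest_sketch("abc"))
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

Note that the encoding matters: the same string encoded as UTF-16 produces a different digest.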

from smith_utils import get_file_digest, get_text_digest

# Calculate SHA-256 digests for strings and files.
text_digest = get_text_digest("smith-utils")
file_digest = get_file_digest("input.txt")

File Classification Utilities

class smith_utils.FileClassification(path: Path, extension: str, file_description: str | None, file_mime_type: str | None, extension_mime_type: str | None, magic_type: str | None, magic_mime_type: str | None, file_class: str | None, categories: tuple[str, ...])

Bases: object

Classification evidence for a file path.

categories: tuple[str, ...]
extension: str
extension_mime_type: str | None
file_class: str | None
file_description: str | None
file_mime_type: str | None
magic_mime_type: str | None
magic_type: str | None
path: Path
smith_utils.classify_file(path: str | Path, *, sample_size: int = 4096) -> FileClassification

Classify a file using extension, magic bytes, MIME, and file(1) signals.
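Two of the signals named above, the extension and the magic bytes, can be gathered with the standard library. The sketch below is only meant to show what "evidence" means here: collect_evidence_sketch, the _SIGNATURES table, and the returned dict layout are all hypothetical, and the real function also consults file(1) and combines the signals into file_class and categories in ways this page does not describe.

```python
import mimetypes
from pathlib import Path

# Hypothetical, deliberately tiny table of leading byte signatures.
_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}


def collect_evidence_sketch(path, sample_size: int = 4096) -> dict:
    """Gather extension and magic-byte evidence for a file (illustrative sketch only)."""
    path = Path(path)
    with path.open("rb") as handle:
        sample = handle.read(sample_size)  # only the first bytes are inspected
    magic_type = next(
        (name for signature, name in _SIGNATURES.items() if sample.startswith(signature)),
        None,
    )
    return {
        "extension": path.suffix.lower(),  # assumption: keeps the leading dot
        "extension_mime_type": mimetypes.guess_type(path.name)[0],
        "magic_type": magic_type,
    }
```

Keeping the two signals separate makes disagreements visible, for example a file named report.pdf whose first bytes are a ZIP signature.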

from smith_utils import classify_file

# Classify a file from extension, MIME, magic bytes, and file(1) evidence.
classification = classify_file("input.pdf")
file_class = classification.file_class  # "document"
categories = classification.categories  # ("document", "pdf")