dstk.modules package#

Subpackages#

Submodules#

dstk.modules.count_models module#

This module offers functionality to transform and reduce high-dimensional text data represented as matrices, enabling more effective downstream analysis and modeling.

Key features include:

  • Scaling input matrices to zero mean and unit variance using standardization.

  • Generating low-dimensional word embeddings from co-occurrence matrices using dimensionality reduction techniques:

    • Truncated Singular Value Decomposition (SVD)

    • Principal Component Analysis (PCA)

These techniques help distill semantic information from sparse and high-dimensional co-occurrence data, facilitating tasks such as clustering, visualization, and feature extraction in natural language processing pipelines.

All functions return results as Pandas DataFrames for seamless integration with data workflows.

dstk.modules.count_models.pca_embeddings(matrix: DataFrame, n_components: int | float = 100, **kwargs) DataFrame[source]#

Generates word embeddings using Principal Component Analysis (PCA).

Parameters:
  • matrix (DataFrame) – A Co-occurrence matrix from which embeddings will be generated.

  • n_components (int or float) – If an integer, the number of dimensions to reduce the word embeddings to. If a float between 0 and 1, specifies the proportion of variance to preserve. Defaults to 100.

  • kwargs

    Additional keyword arguments to pass to sklearn’s PCA. Common options include:

    • n_components: If an integer, specifies the number of dimensions to reduce the Co-occurrence matrix to. If a float, the proportion of variance to preserve during PCA.

For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Returns:

A DataFrame of word embeddings generated by PCA.

Return type:

DataFrame
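
Example (a minimal sketch; the toy co-occurrence matrix is illustrative):

    import pandas as pd
    from dstk.modules import count_models

    # Toy co-occurrence matrix indexed by words (illustrative data).
    co_matrix = pd.DataFrame(
        [[2, 1, 0], [1, 3, 1], [0, 1, 2]],
        index=["cat", "dog", "fish"],
        columns=["cat", "dog", "fish"],
    )

    # Keep as many components as needed to preserve 95% of the variance.
    embeddings = count_models.pca_embeddings(co_matrix, n_components=0.95)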

dstk.modules.count_models.scale_matrix(matrix: DataFrame, **kwargs) DataFrame[source]#

Scales the input matrix to have zero mean and unit variance for each feature.

This method applies standardization using scikit-learn’s StandardScaler, which transforms the data such that each column (feature) has a mean of 0 and a standard deviation of 1.

Parameters:
  • matrix (DataFrame) – The matrix to scale.

  • kwargs – Additional keyword arguments to pass to sklearn’s StandardScaler. For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Returns:

A scaled matrix.

Return type:

DataFrame
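
Example (a minimal sketch with illustrative data):

    import pandas as pd
    from dstk.modules import count_models

    matrix = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 30.0]})

    # Each column of the result has mean 0 and standard deviation 1.
    scaled = count_models.scale_matrix(matrix)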

dstk.modules.count_models.svd_embeddings(matrix: DataFrame, n_components: int = 100, **kwargs) DataFrame[source]#

Generates word embeddings using truncated Singular Value Decomposition (SVD).

Parameters:
  • matrix (DataFrame) – A Co-occurrence matrix from which embeddings will be generated.

  • n_components (int) – The number of dimensions to reduce the word embeddings to. Defaults to 100.

  • kwargs

    Additional keyword arguments to pass to sklearn’s TruncatedSVD. Common options include:

    • n_components: Specifies the number of dimensions to reduce the Co-occurrence matrix to.

For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

Returns:

A DataFrame of word embeddings generated by SVD.

Return type:

DataFrame
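
Example (a minimal sketch; the toy co-occurrence matrix is illustrative):

    import pandas as pd
    from dstk.modules import count_models

    co_matrix = pd.DataFrame(
        [[2, 1, 0], [1, 3, 1], [0, 1, 2]],
        index=["cat", "dog", "fish"],
        columns=["cat", "dog", "fish"],
    )

    # Reduce each word vector to 2 dimensions (must be < number of columns).
    embeddings = count_models.svd_embeddings(co_matrix, n_components=2)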

dstk.modules.geometric_distance module#

This module provides functions to compute geometric distance and similarity measures between word embeddings, enabling semantic comparison of words in vector space.

Available metrics include:

  • Euclidean distance

  • Manhattan distance

  • Cosine similarity

Additionally, it offers a method to find the nearest semantic neighbors of a given word based on specified distance or similarity metrics using scikit-learn’s NearestNeighbors.

All functions operate on word embeddings represented as Pandas DataFrames indexed by words, facilitating easy integration with common NLP and Computational Linguistic workflows.

dstk.modules.geometric_distance.cos_similarity(embeddings: DataFrame, first_word: str, second_word: str) float[source]#

Computes the cosine similarity between the embeddings of two words.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • first_word (str) – The first word in the pair.

  • second_word (str) – The second word in the pair.

Returns:

The cosine similarity between the first and second word.

Return type:

float

dstk.modules.geometric_distance.euclidean_distance(embeddings: DataFrame, first_word: str, second_word: str) float[source]#

Computes the Euclidean distance between the embeddings of two words.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • first_word (str) – The first word in the pair.

  • second_word (str) – The second word in the pair.

Returns:

The Euclidean distance between the first and second word.

Return type:

float

dstk.modules.geometric_distance.manhattan_distance(embeddings: DataFrame, first_word: str, second_word: str) float[source]#

Computes the Manhattan distance between the embeddings of two words.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • first_word (str) – The first word in the pair.

  • second_word (str) – The second word in the pair.

Returns:

The Manhattan distance between the first and second word.

Return type:

float
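
Example comparing the three metrics on the same pair of words (the embedding values are illustrative):

    import pandas as pd
    from dstk.modules import geometric_distance

    embeddings = pd.DataFrame(
        [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]],
        index=["cat", "dog", "fish"],
    )

    sim = geometric_distance.cos_similarity(embeddings, "cat", "dog")
    euc = geometric_distance.euclidean_distance(embeddings, "cat", "dog")
    man = geometric_distance.manhattan_distance(embeddings, "cat", "dog")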

dstk.modules.geometric_distance.nearest_neighbors(embeddings: DataFrame, word: str, metric: str = 'cosine', n_words: int = 5, **kwargs) list[Neighbor][source]#

Returns the top N most semantically similar words to a given target word, based on the specified distance or similarity metric.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • word (str) – The target word to find neighbors for.

  • metric (str) – The distance or similarity metric to use (e.g., ‘cosine’, ‘euclidean’). Defaults to ‘cosine’.

  • n_words (int) – The number of nearest neighbors to return. Defaults to 5.

  • kwargs – Additional keyword arguments to pass to sklearn’s NearestNeighbors. For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

Returns:

A list of Neighbor namedtuples, one for each word close to the target word.

Return type:

Neighbors
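
Example (a minimal sketch; the embedding values are illustrative):

    import pandas as pd
    from dstk.modules import geometric_distance

    embeddings = pd.DataFrame(
        [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]],
        index=["cat", "dog", "fish"],
    )

    # The two words closest to "cat" under cosine distance.
    neighbors = geometric_distance.nearest_neighbors(
        embeddings, "cat", metric="cosine", n_words=2
    )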

dstk.modules.ngrams module#

This module provides utilities for extracting context-based collocates, bigrams, and n-grams from a list of words or tokens. It is designed to support both raw string tokens and spaCy Token objects, allowing for flexibility in preprocessing pipelines.

The functions in this module focus on identifying co-occurrence patterns around a specific target word, as well as extracting fixed-length n-grams from sequences of tokens. This is useful for tasks such as collocation analysis, feature engineering for machine learning models, and exploratory corpus analysis.

Core functionalities include:

  • Extracting left and right context windows around a target word

  • Creating directed and undirected bigrams centered on a target

  • Generating fixed-length n-grams from a sequence of words

  • Counting the frequency of collocated words in context windows

The module is compatible with both plain string tokens and spaCy Tokens.

dstk.modules.ngrams.count_collocates(collocates: list[tuple[Word, ...]]) DataFrame[source]#

Counts the frequency of words in a list of collocations and returns the result as a DataFrame.

Parameters:

collocates (CollocatesList) – A list of collocations, where each collocation is a tuple of words.

Returns:

A DataFrame with two columns: “word” and “count”, sorted by frequency.

Return type:

DataFrame

dstk.modules.ngrams.extract_collocates(words: list[Word], target_word: str, window_size: tuple[int, int]) list[tuple[Word, ...]][source]#

Extracts the context words of the target word, returned as tuples whose length corresponds to the specified window_size.

Parameters:
  • words (Words) – A list of spaCy tokens or words represented as strings.

  • target_word (str) – The word to find within the list.

  • window_size (tuple[int, int]) – A tuple indicating how many words to capture to the left and right of the target.

Returns:

A list of collocates (left and right context words) of the target word.

Return type:

CollocatesList
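
Example combining extract_collocates with count_collocates, documented above (the token list is illustrative):

    from dstk.modules import ngrams

    words = ["the", "cat", "sat", "on", "the", "mat"]

    # Two context words on each side of every occurrence of "sat".
    collocates = ngrams.extract_collocates(words, "sat", window_size=(2, 2))

    # Frequency table ("word", "count") of the collocated words.
    counts = ngrams.count_collocates(collocates)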

dstk.modules.ngrams.extract_directed_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int]) list[tuple[Word, tuple[str, str]]][source]#

Extracts directed bigrams (left and right context words) around a target word.

For each occurrence of target_word in the input words, this function collects two types of bigrams:

  • Left bigrams: (context_word, (“L”, target_word))

  • Right bigrams: (context_word, (“R”, target_word))

Parameters:
  • words (Words) – A list of spaCy tokens or words represented as strings.

  • target_word (str) – The word to search for in the list.

  • window_size (tuple[int, int]) – A tuple indicating how many words to capture to the left and right of the target.

Returns:

A list of directed bigrams in the form (word, (“L” | “R”, target_word)).

Return type:

DirectedCollocateList
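
Example (the token list is illustrative):

    from dstk.modules import ngrams

    words = ["the", "cat", "sat", "on", "the", "mat"]

    # One context word on each side, tagged "L" or "R" by position,
    # e.g. ("cat", ("L", "sat")) and ("on", ("R", "sat")).
    directed = ngrams.extract_directed_bigrams(words, "sat", window_size=(1, 1))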

dstk.modules.ngrams.extract_ngrams(words: list[Word], window_size: int, **kwargs) list[tuple[Word, ...]][source]#

Splits the tokens into groups of window_size consecutive words, returning each group as a tuple.

Parameters:
  • words (Words) – A list of spaCy tokens or words represented as strings.

  • window_size (int) – The number of consecutive words in each n-gram.

  • kwargs

    Additional keyword arguments to pass to nltk.util.ngrams. Common options include:

    • pad_left (bool): whether the ngrams should be left-padded

    • pad_right (bool): whether the ngrams should be right-padded

    • left_pad_symbol (any): the symbol to use for left padding (default is None)

    • right_pad_symbol (any): the symbol to use for right padding (default is None)

For more information check: https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams

Returns:

A list of tuples, where each tuple contains window_size consecutive words from the input.

Return type:

CollocatesList
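
Example (the token list is illustrative):

    from dstk.modules import ngrams

    words = ["the", "cat", "sat", "on", "the", "mat"]

    # Trigrams: ("the", "cat", "sat"), ("cat", "sat", "on"), ...
    trigrams = ngrams.extract_ngrams(words, window_size=3)

    # Padding options are forwarded to nltk.util.ngrams.
    padded = ngrams.extract_ngrams(
        words, window_size=3, pad_left=True, left_pad_symbol="<s>"
    )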

dstk.modules.ngrams.extract_undirected_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int]) list[Bigram][source]#

Extracts undirected bigrams surrounding a target word.

For each occurrence of target_word, this function collects all context words within the specified window (both left and right), and forms a Bigram with:

  • collocate: the context word

  • target_word: the target word

Parameters:
  • words (Words) – A list of spaCy tokens or words represented as strings.

  • target_word (str) – The word to search for in the list.

  • window_size (tuple[int, int]) – A tuple indicating how many words to capture to the left and right of the target.

Returns:

A list of Bigram namedtuples, one for each context word around each target occurrence.

Return type:

BigramList
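
Example (the token list is illustrative):

    from dstk.modules import ngrams

    words = ["the", "cat", "sat", "on", "the", "mat"]

    # One Bigram(collocate, target_word) per context word around "sat".
    bigrams = ngrams.extract_undirected_bigrams(words, "sat", window_size=(2, 2))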

dstk.modules.predict_models module#

This module provides utilities to train, save, and load word embedding models using neural network models such as Word2Vec (gensim) and FastText (fasttext library).

Functions include:

  • word2vec: Train Word2Vec embeddings from a corpus file.

  • fastText: Train FastText embeddings from a corpus file.

  • load_model: Load a saved model from disk (supports Word2Vec .model and FastText .bin formats).

  • save_model: Save a trained model to disk in the appropriate format.

Each function supports passing additional keyword arguments to fine-tune training and loading.

dstk.modules.predict_models.fastText(path: str, **kwargs) _FastText[source]#

Creates word embeddings using the FastText algorithm.

Parameters:
  • path (str) – The path to a file containing a list of sentences or collocations from which to build word embeddings.

  • kwargs

    Additional keyword arguments to pass to fasttext.train_unsupervised. Common options include:

    • dim: Size of the word embedding vectors.

    • model: Training algorithm, either skipgram or cbow (Continuous Bag of Words).

    • thread: Number of CPU cores to be used during the training process.

For more information check: https://fasttext.cc/docs/en/options.html

Returns:

An instance of fasttext’s FastText.

Return type:

FastText
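
Example (a minimal sketch; "corpus.txt" is a placeholder path to a file with one sentence per line):

    from dstk.modules import predict_models

    # Train skip-gram FastText embeddings with 100 dimensions on 4 threads.
    model = predict_models.fastText(
        "corpus.txt", dim=100, model="skipgram", thread=4
    )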

dstk.modules.predict_models.load_model(path: str) Word2Vec | _FastText[source]#

Loads the trained embeddings in .model (Word2Vec) or .bin (FastText) format, depending on the algorithm used.

Parameters:

path (str) – Path to the saved model file.

Returns:

An instance of gensim’s Word2Vec or fasttext’s FastText.

Return type:

NeuralModels

dstk.modules.predict_models.save_model(model: Word2Vec | _FastText, path: str) str[source]#

Saves the trained embeddings in .model (Word2Vec) or .bin (FastText) format, depending on the algorithm used.

Parameters:
  • model (NeuralModels) – A trained Word2Vec or FastText model.

  • path (str) – The path (without extension) where to save the model.

Returns:

The path where the model was saved.

Return type:

str

dstk.modules.predict_models.word2vec(path: str, **kwargs) Word2Vec[source]#

Creates word embeddings using the Word2Vec algorithm.

Parameters:
  • path (str) – The path to a file containing a list of sentences or collocations from which to build word embeddings.

  • kwargs

    Additional keyword arguments to pass to gensim.models.Word2Vec. Common options include:

    • vector_size: Size of the word embedding vectors.

    • workers: Number of CPU cores to be used during the training process.

    • sg: Training algorithm. 1 for skip-gram; 0 for CBOW (Continuous Bag of Words).

    • window (int): Maximum distance between the current and predicted word.

    • min_count (int): Ignores all words with total frequency lower than this.

For more information check: https://radimrehurek.com/gensim/models/word2vec.html

Returns:

An instance of gensim’s Word2Vec.

Return type:

Word2Vec
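
Example combining word2vec with save_model and load_model, documented above ("corpus.txt" and "my_embeddings" are placeholder paths):

    from dstk.modules import predict_models

    # Train skip-gram Word2Vec embeddings with 100 dimensions.
    model = predict_models.word2vec(
        "corpus.txt", vector_size=100, sg=1, min_count=2
    )

    # Persist the model; the extension is chosen from the model type.
    saved_path = predict_models.save_model(model, "my_embeddings")

    # Reload it later (.model for Word2Vec, .bin for FastText).
    model = predict_models.load_model(saved_path)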

dstk.modules.text_matrix_builder module#

This module provides functions to construct common matrix representations used in text analysis and natural language processing.

Key features include:

  • Creating a Document-Term Matrix (DTM) from a corpus of text, leveraging sklearn’s CountVectorizer with customizable parameters such as stop word removal and n-gram range.

  • Generating a Co-occurrence Matrix from a given Document-Term Matrix, capturing how frequently terms co-occur across documents.

These matrices are foundational for many NLP and Computational Linguistics tasks, including topic modeling, word embedding training, and network analysis. The output is provided as Pandas DataFrames for ease of analysis and integration with data science workflows.

dstk.modules.text_matrix_builder.create_co_occurrence_matrix(dtm: DataFrame) DataFrame[source]#

Creates a Co-occurrence matrix from a Document Term Matrix (DTM).

Parameters:

dtm (DataFrame) – A Document Term Matrix (DTM) from which to build a Co-occurrence matrix.

Returns:

A Co-occurrence matrix.

Return type:

DataFrame

dstk.modules.text_matrix_builder.create_dtm(corpus: list[str], **kwargs) DataFrame[source]#

Creates a Document Term Matrix (DTM) from a corpus of texts.

Parameters:
  • corpus (list[str]) – A list of sentences or collocations from which to build a matrix.

  • kwargs

    Additional keyword arguments to pass to sklearn’s CountVectorizer. Common options include:

    • stop_words: If provided, a list of stopwords to remove from the corpus.

    • ngram_range: A tuple (min_n, max_n) specifying the range of n-grams to consider.

For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Returns:

A Document Term Matrix (DTM).

Return type:

DataFrame
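
Example combining create_dtm with create_co_occurrence_matrix, documented above (the corpus is illustrative):

    from dstk.modules import text_matrix_builder

    corpus = ["the cat sat on the mat", "the dog sat on the log"]

    # Document-Term Matrix: one row per document, one column per term.
    dtm = text_matrix_builder.create_dtm(corpus, stop_words=["the"])

    # Term-term co-occurrence counts derived from the DTM.
    co_matrix = text_matrix_builder.create_co_occurrence_matrix(dtm)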

dstk.modules.text_processor module#

This module provides utility functions for processing tokenized or lemmatized text represented as lists of strings or POS-tagged tuples. It supports common text normalization and transformation tasks, such as lowercasing, vocabulary extraction, and joining tokens into a single string. Additionally, it includes functionality for saving processed text or tagged data to a file in plain text or CSV format.

Core functionalities include:

  • Converting spaCy tokens to strings (with optional lemmatization)

  • Lowercasing and vocabulary extraction

  • Joining word lists into full text strings

  • Saving word lists or (token, POS) pairs to disk in a consistent format

This module is useful for preparing text data for further analysis, modeling, or storage.

dstk.modules.text_processor.get_vocabulary(words: list[str]) list[str][source]#

Returns the vocabulary of a text.

Parameters:

words (Words[str]) – A list of words represented as strings.

Returns:

A list of the unique words in the text, represented as strings.

Return type:

Words[str]

dstk.modules.text_processor.join(words: list[str]) str[source]#

Joins a list of strings into a single text string.

Parameters:

words (Words[str]) – A list of words represented as strings.

Returns:

A single string formed by concatenating the input words separated by spaces.

Return type:

str

dstk.modules.text_processor.save_to_file(words: list[str] | list[POSTaggedWord], path: str) str[source]#

Saves a list of strings or (Token, POS) tuples to the specified path. If words is a list of strings, each string is saved on a new line. If it is a list of tuples, each tuple is saved on a new line as a pair of values separated by a comma, in CSV format.

Parameters:
  • words (Words[str] or POSTaggedWordList) – A list of words represented as strings or a list of POSTaggedWord tuples.

  • path (str) – The path where to save the list of words.

Returns:

The path where the file was saved.

Return type:

str

dstk.modules.text_processor.to_lower(words: list[str]) list[str][source]#

Returns a list of lowercased words.

Parameters:

words (Words[str]) – A list of words represented as strings.

Returns:

A list of lowercased words represented as strings.

Return type:

Words[str]

dstk.modules.text_processor.tokens_to_text(tokens: list[Token], lemmatize: bool = False) list[str][source]#

Converts a list of spaCy Token objects to a list of words represented as strings.

Parameters:
  • tokens (Words[Token]) – A list of spaCy tokens.

  • lemmatize (bool) – Whether to return the lemmatized form of each token. Defaults to False.

Returns:

A list of words represented as strings.

Return type:

Words[str]
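
Example chaining several of the functions above ("vocabulary.txt" is a placeholder path):

    from dstk.modules import text_processor

    words = ["The", "Cat", "Sat", "on", "the", "Mat"]

    lowered = text_processor.to_lower(words)
    vocabulary = text_processor.get_vocabulary(lowered)
    text = text_processor.join(lowered)  # "the cat sat on the mat"

    # Writes one word per line; (Token, POS) lists are written as CSV.
    path = text_processor.save_to_file(vocabulary, "vocabulary.txt")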

dstk.modules.tokenizer module#

This module provides utility functions for tokenizing texts using spaCy. It offers tools to process raw text into structured linguistic data, extract tokens and sentences, filter words by specific criteria (e.g., stop words, alphanumeric characters, part-of-speech), and generate POS-tagged outputs.

Core functionalities include:

  • Segmenting a text by applying a spaCy language model to raw text

  • Extracting tokens and sentences from processed documents

  • Removing stop words and non-alphanumeric tokens

  • Filtering tokens by part-of-speech (POS) tags

  • Generating (token, POS) tuples for downstream NLP tasks

The module is intended to provide tools for text segmentation and tagging.

dstk.modules.tokenizer.alphanumeric_raw_tokenizer(tokens: list[Token]) list[Token][source]#

Filters a list of tokens, keeping only alphanumeric tokens, including stop words.

Parameters:

tokens (Words[Token]) – A list of spaCy tokens.

Returns:

A list of spaCy tokens.

Return type:

Words[Token]

dstk.modules.tokenizer.apply_model(text: str, model: str | Language) Doc[source]#

Takes a text and analyzes it using a language model, returning a processed version of the text that includes linguistic annotations such as tokens, part-of-speech tags, and syntactic dependencies.

Parameters:
  • text (str) – The text to be processed.

  • model (str or Language) – The name of the model to be used or its instance.

Returns:

A spaCy Doc object with linguistic annotations.

Return type:

Doc

dstk.modules.tokenizer.filter_by_pos(tokens: list[Token], pos: str) list[Token][source]#

Returns a list of spaCy tokens filtered by a specific part-of-speech tag.

Parameters:
  • tokens (Words[Token]) – A list of spaCy tokens.

  • pos (str) – The POS tag to filter by (e.g., ‘NOUN’, ‘VERB’, etc.). Case-sensitive.

Returns:

A list of spaCy tokens.

Return type:

Words[Token]

dstk.modules.tokenizer.get_sentences(document: Doc) list[list[Word]][source]#

Returns a list of sentences from a spaCy Doc, where each sentence is represented as a list of spaCy Token objects.

Parameters:

document (Doc) – A spaCy Doc object.

Returns:

A list of sentences, where each sentence is a list of spaCy Tokens.

Return type:

WordSentences

dstk.modules.tokenizer.get_tokens(document: Doc) list[Token][source]#

Returns a list of spaCy tokens from a Doc object.

Parameters:

document (Doc) – A spaCy Doc object.

Returns:

A list of spaCy tokens.

Return type:

Words[Token]

dstk.modules.tokenizer.pos_tagger(tokens: list[Token]) list[POSTaggedWord][source]#

Returns a list of (Token, POS) tuples, pairing each token with its part-of-speech tag.

Parameters:

tokens (Words[Token]) – A list of spaCy tokens.

Returns:

A list of POSTaggedWord tuples.

Return type:

POSTaggedWordList

dstk.modules.tokenizer.remove_stop_words(tokens: list[Token], custom_stop_words: list[str] | None = None) list[Token][source]#

Filters tokens, returning only alphanumeric tokens that are not stop words.

Parameters:
  • tokens (Words[Token]) – A list of spaCy tokens.

  • custom_stop_words (list[str] or None) – If provided, a list of custom stop words. Defaults to None.

Returns:

A list of spaCy tokens.

Return type:

Words[Token]
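
Example chaining several of the functions above (assumes the spaCy model "en_core_web_sm" is installed):

    from dstk.modules import tokenizer

    doc = tokenizer.apply_model(
        "The cat sat on the mat. The dog barked.", "en_core_web_sm"
    )

    tokens = tokenizer.get_tokens(doc)
    sentences = tokenizer.get_sentences(doc)

    content_words = tokenizer.remove_stop_words(tokens)  # drop stop words
    nouns = tokenizer.filter_by_pos(tokens, "NOUN")      # nouns only
    tagged = tokenizer.pos_tagger(tokens)                # (Token, POS) pairs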

dstk.modules.weight_matrix module#

This module provides functions to apply weighting schemes to co-occurrence matrices commonly used in natural language processing and text mining.

Available weighting methods include:

  • Pointwise Mutual Information (PMI) and Positive PMI (PPMI), which measure the association strength between co-occurring terms by comparing observed co-occurrence frequencies to expected frequencies under independence.

  • Term Frequency-Inverse Document Frequency (Tf-idf), which reweights term importance based on frequency patterns, leveraging sklearn’s TfidfTransformer.

These weighting techniques help enhance the semantic relevance of co-occurrence matrices, improving downstream tasks such as word embedding, clustering, and semantic similarity analysis.

All functions return weighted co-occurrence matrices as Pandas DataFrames for convenient further analysis.

dstk.modules.weight_matrix.pmi(co_matrix: DataFrame, positive: bool = False) DataFrame[source]#

Weights a Co-occurrence matrix by PMI or PPMI.

Parameters:
  • co_matrix (DataFrame) – A Co-occurrence matrix to be weighted.

  • positive (bool) – If True, weights the Co-occurrence matrix by PPMI. If False, weights it by PMI. Defaults to False.

Returns:

A Co-occurrence matrix weighted by PMI or PPMI.

Return type:

DataFrame
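
Example (the toy co-occurrence matrix is illustrative):

    import pandas as pd
    from dstk.modules import weight_matrix

    co_matrix = pd.DataFrame(
        [[2, 1, 0], [1, 3, 1], [0, 1, 2]],
        index=["cat", "dog", "fish"],
        columns=["cat", "dog", "fish"],
    )

    # PPMI clips negative associations to zero; plain PMI keeps them.
    ppmi = weight_matrix.pmi(co_matrix, positive=True)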

dstk.modules.weight_matrix.tf_idf(co_matrix: DataFrame, **kwargs) DataFrame[source]#

Weights a Co-occurrence matrix by Tf-idf.

Parameters:
  • co_matrix (DataFrame) – A Co-occurrence matrix to be weighted.

  • kwargs – Additional keyword arguments to pass to sklearn’s TfidfTransformer. For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

Returns:

A Co-occurrence matrix weighted by Tf-idf.

Return type:

DataFrame
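
Example (the toy co-occurrence matrix is illustrative; sublinear_tf is one of TfidfTransformer's options):

    import pandas as pd
    from dstk.modules import weight_matrix

    co_matrix = pd.DataFrame(
        [[2, 1, 0], [1, 3, 1], [0, 1, 2]],
        index=["cat", "dog", "fish"],
        columns=["cat", "dog", "fish"],
    )

    # Reweight counts by Tf-idf with log-scaled term frequencies.
    weighted = weight_matrix.tf_idf(co_matrix, sublinear_tf=True)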

Module contents#