dstk.modules package#
Submodules#
dstk.modules.count_models module#
This module offers functionality to transform and reduce high-dimensional text data represented as matrices, enabling more effective downstream analysis and modeling.
Key features include:
Scaling input matrices to zero mean and unit variance using standardization.
Generating low-dimensional word embeddings from co-occurrence matrices using dimensionality reduction techniques:
Truncated Singular Value Decomposition (SVD)
Principal Component Analysis (PCA)
These techniques help distill semantic information from sparse and high-dimensional co-occurrence data, facilitating tasks such as clustering, visualization, and feature extraction in natural language processing pipelines.
All functions return results as Pandas DataFrames for seamless integration with data workflows.
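Example (a minimal sketch of the intended flow; the toy co-occurrence matrix is illustrative only, and the import path is assumed to match the module name shown here):

import pandas as pd
from dstk.modules import count_models

# Toy co-occurrence matrix: rows and columns are vocabulary words.
co_matrix = pd.DataFrame(
    [[0, 2, 1], [2, 0, 3], [1, 3, 0]],
    index=["cat", "dog", "fish"],
    columns=["cat", "dog", "fish"],
)

scaled = count_models.scale_matrix(co_matrix)                  # zero mean, unit variance
svd_emb = count_models.svd_embeddings(scaled, n_components=2)  # truncated SVD
pca_emb = count_models.pca_embeddings(scaled, n_components=2)  # PCA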
- dstk.modules.count_models.pca_embeddings(matrix: DataFrame, n_components: int | float = 100, **kwargs) DataFrame [source]#
Generates word embeddings using Principal Component Analysis (PCA).
- Parameters:
matrix (DataFrame) – A Co-occurrence matrix from which embeddings will be generated.
n_components (int or float) – If an integer, the number of dimensions to reduce the word embeddings to. If a float between 0 and 1, specifies the proportion of variance to preserve. Defaults to 100.
kwargs –
Additional keyword arguments to pass to sklearn’s PCA. Note that n_components is already exposed as an explicit parameter above; other common options include whiten and svd_solver.
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- Returns:
A DataFrame of word embeddings generated by PCA.
- Return type:
DataFrame
- dstk.modules.count_models.scale_matrix(matrix: DataFrame, **kwargs) DataFrame [source]#
Scales the input matrix to have zero mean and unit variance for each feature.
This method applies standardization using scikit-learn’s StandardScaler, which transforms the data such that each column (feature) has a mean of 0 and a standard deviation of 1.
- Parameters:
matrix (DataFrame) – The input data to scale.
kwargs – Additional keyword arguments to pass to sklearn’s StandardScaler. For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- Returns:
A scaled matrix.
- Return type:
DataFrame
- dstk.modules.count_models.svd_embeddings(matrix: DataFrame, n_components: int = 100, **kwargs) DataFrame [source]#
Generates word embeddings using truncated Singular Value Decomposition (SVD).
- Parameters:
matrix (DataFrame) – A Co-occurrence matrix from which embeddings will be generated.
n_components (int) – The number of dimensions to reduce the word embeddings to. Defaults to 100.
kwargs –
Additional keyword arguments to pass to sklearn’s TruncatedSVD. Note that n_components is already exposed as an explicit parameter above; other common options include algorithm and n_iter.
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
- Returns:
A DataFrame of word embeddings generated by SVD.
- Return type:
DataFrame
dstk.modules.geometric_distance module#
This module provides functions to compute geometric distance and similarity measures between word embeddings, enabling semantic comparison of words in vector space.
Available metrics include:
Euclidean distance
Manhattan distance
Cosine similarity
Additionally, it offers a method to find the nearest semantic neighbors of a given word based on specified distance or similarity metrics using scikit-learn’s NearestNeighbors.
All functions operate on word embeddings represented as Pandas DataFrames indexed by words, facilitating easy integration with common NLP and Computational Linguistic workflows.
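Example (a minimal sketch; the toy embeddings DataFrame is illustrative, and the import path is assumed to match the module name shown here):

import pandas as pd
from dstk.modules import geometric_distance

# Toy embeddings: one row per word, indexed by word.
embeddings = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    index=["cat", "dog", "fish"],
)

geometric_distance.cos_similarity(embeddings, "cat", "dog")       # close to 1.0
geometric_distance.euclidean_distance(embeddings, "cat", "fish")
geometric_distance.manhattan_distance(embeddings, "cat", "fish")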
- dstk.modules.geometric_distance.cos_similarity(embeddings: DataFrame, first_word: str, second_word: str) float [source]#
Computes the cosine similarity between the embeddings of two words.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
first_word (str) – The first word in the pair.
second_word (str) – The second word in the pair.
- Returns:
The cosine similarity between the first and second word.
- Return type:
float
- dstk.modules.geometric_distance.euclidean_distance(embeddings: DataFrame, first_word: str, second_word: str) float [source]#
Computes the Euclidean distance between the embeddings of two words.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
first_word (str) – The first word in the pair.
second_word (str) – The second word in the pair.
- Returns:
The Euclidean distance between the first and second word.
- Return type:
float
- dstk.modules.geometric_distance.manhattan_distance(embeddings: DataFrame, first_word: str, second_word: str) float [source]#
Computes the Manhattan distance between the embeddings of two words.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
first_word (str) – The first word in the pair.
second_word (str) – The second word in the pair.
- Returns:
The Manhattan distance between the first and second word.
- Return type:
float
- dstk.modules.geometric_distance.nearest_neighbors(embeddings: DataFrame, word: str, metric: str = 'cosine', n_words: int = 5, **kwargs) list[Neighbor] [source]#
Returns the top N most semantically similar words to a given target word, based on the specified distance or similarity metric.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
word (str) – The target word to find neighbors for.
metric (str) – The distance or similarity metric to use (e.g., ‘cosine’, ‘euclidean’). Defaults to ‘cosine’.
n_words (int) – Number of nearest neighbors to return. Defaults to 5.
kwargs – Additional keyword arguments to pass to sklearn’s NearestNeighbors. For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
- Returns:
A list of Neighbor namedtuples, one for each word close to the target word.
- Return type:
Neighbors
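Example (a sketch, reusing the toy embeddings DataFrame from the module example above; the exact fields of the Neighbor namedtuple are defined by dstk, so the tuples are printed as returned):

# Find the 2 words nearest to "cat" under Euclidean distance.
for neighbor in geometric_distance.nearest_neighbors(
    embeddings, "cat", metric="euclidean", n_words=2
):
    print(neighbor)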
dstk.modules.ngrams module#
This module provides utilities for extracting context-based collocates, bigrams, and n-grams from a list of words or tokens. It is designed to support both raw string tokens and spaCy Token objects, allowing for flexibility in preprocessing pipelines.
The functions in this module focus on identifying co-occurrence patterns around a specific target word, as well as extracting fixed-length n-grams from sequences of tokens. This is useful for tasks such as collocation analysis, feature engineering for machine learning models, and exploratory corpus analysis.
Core functionalities include:
Extracting left and right context windows around a target word
Creating directed and undirected bigrams centered on a target
Generating fixed-length n-grams from a sequence of words
Counting the frequency of collocated words in context windows
The module is compatible with both plain string tokens and spaCy Tokens.
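Example (a minimal sketch; the word list is illustrative):

from dstk.modules import ngrams

words = ["the", "cat", "sat", "on", "the", "mat"]

# Two words of left context and one word of right context around "sat".
collocates = ngrams.extract_collocates(words, "sat", window_size=(2, 1))
counts = ngrams.count_collocates(collocates)  # DataFrame with "word" and "count" columns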
- dstk.modules.ngrams.count_collocates(collocates: list[tuple[Word, ...]]) DataFrame [source]#
Counts the frequency of words in a list of collocations and returns the result as a DataFrame.
- Parameters:
collocates (CollocatesList) – A list of collocations, where each collocation is a tuple of words.
- Returns:
A DataFrame with two columns: “word” and “count”, sorted by frequency.
- Return type:
DataFrame
- dstk.modules.ngrams.extract_collocates(words: list[Word], target_word: str, window_size: tuple[int, int]) list[tuple[Word, ...]] [source]#
Extracts the context words of the target word, returned as tuples whose length corresponds to the specified window_size.
- Parameters:
words (Words) – A list of spaCy tokens or words represented as strings.
target_word (str) – The word to find within the list.
window_size (tuple[int, int]) – A tuple indicating how many words to capture to the left and right of the target.
- Returns:
A list of collocates (left and right context words) of the target word.
- Return type:
CollocatesList
- dstk.modules.ngrams.extract_directed_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int]) list[tuple[Word, tuple[str, str]]] [source]#
Extracts directed bigrams (left and right context words) around a target word.
For each occurrence of target_word in the input words, this function collects two types of bigrams:
Left bigrams: (context_word, (“L”, target_word))
Right bigrams: (context_word, (“R”, target_word))
- Parameters:
words (Words) – A list of spaCy tokens or words represented as strings.
target_word (str) – The word to search for in the list.
window_size (tuple[int, int]) – A tuple indicating how many words to capture to the left and right of the target.
- Returns:
A list of directed bigrams in the form (word, (“L” | “R”, target_word)).
- Return type:
DirectedCollocateList
- dstk.modules.ngrams.extract_ngrams(words: list[Word], window_size: int, **kwargs) list[tuple[Word, ...]] [source]#
Splits the tokens into groups of window_size consecutive words, each returned as a tuple.
- Parameters:
words (Words) – A list of spaCy tokens or words represented as strings.
window_size (int) – The number of consecutive words in each n-gram.
kwargs –
Additional keyword arguments to pass to nltk.util.ngrams. Common options include:
pad_left (bool): whether the ngrams should be left-padded
pad_right (bool): whether the ngrams should be right-padded
left_pad_symbol (any): the symbol to use for left padding (default is None)
right_pad_symbol (any): the symbol to use for right padding (default is None)
For more information check: https://www.nltk.org/api/nltk.util.html#nltk.util.ngrams
- Returns:
A list of tuples, where each tuple contains window_size consecutive words from the input.
- Return type:
CollocatesList
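Example of the padding options (a sketch; the pad symbol is illustrative):

from dstk.modules import ngrams

# Trigrams, left-padded so that sentence-initial words also start an n-gram.
trigrams = ngrams.extract_ngrams(
    ["the", "cat", "sat"],
    window_size=3,
    pad_left=True,
    left_pad_symbol="<s>",
)
# e.g. [("<s>", "<s>", "the"), ("<s>", "the", "cat"), ("the", "cat", "sat")]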
- dstk.modules.ngrams.extract_undirected_bigrams(words: list[Word], target_word: str, window_size: tuple[int, int]) list[Bigram] [source]#
Extracts undirected bigrams surrounding a target word.
For each occurrence of target_word, this function collects all context words within the specified window (both left and right), and forms a Bigram with:
collocate: the context word
target_word: the target word
- Parameters:
words (Words) – A list of spaCy tokens or words represented as strings.
target_word (str) – The word to search for in the list.
window_size (tuple[int, int]) – A tuple indicating how many words to capture to the left and right of the target.
- Returns:
A list of Bigram namedtuples, one for each context word around each target occurrence.
- Return type:
BigramList
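Example contrasting the two bigram extractors (a sketch; output shapes follow the return types documented above):

from dstk.modules import ngrams

sentence = ["the", "cat", "sat", "on", "the", "mat"]

directed = ngrams.extract_directed_bigrams(sentence, "sat", window_size=(1, 1))
# e.g. [("cat", ("L", "sat")), ("on", ("R", "sat"))]

undirected = ngrams.extract_undirected_bigrams(sentence, "sat", window_size=(1, 1))
# One Bigram(collocate=..., target_word=...) per context word.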
dstk.modules.predict_models module#
This module provides utilities to train, save, and load word embedding models using neural network models such as Word2Vec (gensim) and FastText (fasttext library).
Functions include:
word2vec: Train Word2Vec embeddings from a corpus file.
fastText: Train FastText embeddings from a corpus file.
load_model: Load a saved model from disk (supports Word2Vec .model and FastText .bin formats).
save_model: Save a trained model to disk in the appropriate format.
Each function supports passing additional keyword arguments to fine-tune training and loading.
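Example (a minimal sketch; "corpus.txt" is a hypothetical plain-text corpus file, and save_model is assumed to return the full path, with extension, that it wrote to):

from dstk.modules import predict_models

# Train skip-gram Word2Vec embeddings from a corpus file.
w2v = predict_models.word2vec("corpus.txt", vector_size=100, sg=1, workers=4)

saved_path = predict_models.save_model(w2v, "my_embeddings")  # extension chosen by model type
model = predict_models.load_model(saved_path)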
- dstk.modules.predict_models.fastText(path: str, **kwargs) _FastText [source]#
Creates word embeddings using the FastText algorithm.
- Parameters:
path (str) – The path to a file containing a list of sentences or collocations from which to build word embeddings.
kwargs –
Additional keyword arguments to pass to fasttext.train_unsupervised. Common options include:
dim: Size of the word embedding vectors.
model: Training algorithm: skipgram or cbow (Continuous Bag of Words)
thread: Number of CPU cores to be used during the training process.
For more information check: https://fasttext.cc/docs/en/options.html
- Returns:
An instance of fasttext’s FastText.
- Return type:
FastText
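Example (a sketch, continuing from the module example above; "corpus.txt" is a hypothetical corpus file):

# Skip-gram FastText embeddings with 100-dimensional vectors on 4 threads.
ft = predict_models.fastText("corpus.txt", model="skipgram", dim=100, thread=4)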
- dstk.modules.predict_models.load_model(path: str) Word2Vec | _FastText [source]#
Loads the trained embeddings in .model (Word2Vec) or .bin (FastText) format, depending on the algorithm used.
- Parameters:
path (str) – Path to the saved model file.
- Returns:
An instance of gensim’s Word2Vec or fasttext’s FastText.
- Return type:
NeuralModels
- dstk.modules.predict_models.save_model(model: Word2Vec | _FastText, path: str) str [source]#
Saves the trained embeddings in .model (Word2Vec) or .bin (FastText) format, depending on the algorithm used.
- Parameters:
model (NeuralModels) – A trained Word2Vec or FastText model.
path (str) – The path (without extension) where to save the model.
- Returns:
The path where the model was saved.
- Return type:
str
- dstk.modules.predict_models.word2vec(path: str, **kwargs) Word2Vec [source]#
Creates word embeddings using the Word2Vec algorithm.
- Parameters:
path (str) – The path to a file containing a list of sentences or collocations from which to build word embeddings.
kwargs –
Additional keyword arguments to pass to gensim.models.Word2Vec. Common options include:
vector_size: Size of the word embedding vectors.
workers: Number of CPU cores to be used during the training process.
sg: Training algorithm. 1 for skip-gram; 0 for CBOW (Continuous Bag of Words).
window (int): Maximum distance between the current and predicted word.
min_count (int): Ignores all words with total frequency lower than this.
For more information check: https://radimrehurek.com/gensim/models/word2vec.html
- Returns:
An instance of gensim’s Word2Vec.
- Return type:
Word2Vec
dstk.modules.text_matrix_builder module#
This module provides functions to construct common matrix representations used in text analysis and natural language processing.
Key features include:
Creating a Document-Term Matrix (DTM) from a corpus of text, leveraging sklearn’s CountVectorizer with customizable parameters such as stop word removal and n-gram range.
Generating a Co-occurrence Matrix from a given Document-Term Matrix, capturing how frequently terms co-occur across documents.
These matrices are foundational for many NLP and Computational Linguistics tasks, including topic modeling, word embedding training, and network analysis. The output is provided as Pandas DataFrames for ease of analysis and integration with data science workflows.
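Example (a minimal sketch; the corpus is illustrative, and the import path is assumed to match the module name shown here):

from dstk.modules import text_matrix_builder

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

dtm = text_matrix_builder.create_dtm(corpus, stop_words=["the"], ngram_range=(1, 1))
co_matrix = text_matrix_builder.create_co_occurrence_matrix(dtm)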
- dstk.modules.text_matrix_builder.create_co_occurrence_matrix(dtm: DataFrame) DataFrame [source]#
Creates a Co-occurrence matrix from a Document Term Matrix (DTM).
- Parameters:
dtm (DataFrame) – A Document Term Matrix (DTM) from which to build a Co-occurrence matrix.
- Returns:
A Co-occurrence matrix.
- Return type:
DataFrame
- dstk.modules.text_matrix_builder.create_dtm(corpus: list[str], **kwargs) DataFrame [source]#
Creates a Document Term Matrix (DTM).
- Parameters:
corpus (list[str]) – A list of sentences or collocations from which to build a matrix.
kwargs –
Additional keyword arguments to pass to sklearn’s CountVectorizer. Common options include:
stop_words: If provided, a list of stopwords to remove from the corpus.
ngram_range: A tuple (min_n, max_n) specifying the range of n-grams to consider.
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- Returns:
A Document Term Matrix (DTM).
- Return type:
DataFrame
dstk.modules.text_processor module#
This module provides utility functions for processing tokenized or lemmatized text represented as lists of strings or POS-tagged tuples. It supports common text normalization and transformation tasks, such as lowercasing, vocabulary extraction, and joining tokens into a single string. Additionally, it includes functionality for saving processed text or tagged data to a file in plain text or CSV format.
Core functionalities include:
Converting spaCy tokens to strings (with optional lemmatization)
Lowercasing and vocabulary extraction
Joining word lists into full text strings
Saving word lists or (token, POS) pairs to disk in a consistent format
This module is useful for preparing text data for further analysis, modeling, or storage.
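Example (a minimal sketch; "words.txt" is a hypothetical output path):

from dstk.modules import text_processor

words = ["The", "Cat", "sat", "on", "the", "mat"]

lowered = text_processor.to_lower(words)        # ["the", "cat", "sat", "on", "the", "mat"]
vocab = text_processor.get_vocabulary(lowered)  # unique words
text = text_processor.join(lowered)             # "the cat sat on the mat"
path = text_processor.save_to_file(lowered, "words.txt")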
- dstk.modules.text_processor.get_vocabulary(words: list[str]) list[str] [source]#
Returns the vocabulary of a text.
- Parameters:
words (Words[str]) – A list of words represented as strings.
- Returns:
A list of unique words represented as strings.
- Return type:
Words[str]
- dstk.modules.text_processor.join(words: list[str]) str [source]#
Joins a list of strings into a single string text.
- Parameters:
words (Words[str]) – A list of words represented as strings.
- Returns:
A single string formed by concatenating the input words separated by spaces.
- Return type:
str
- dstk.modules.text_processor.save_to_file(words: list[str] | list[POSTaggedWord], path: str) str [source]#
Saves a list of strings or (Token, POS) tuples to the specified path. If words is a list of strings, each string is saved on a new line. If it is a list of tuples, each tuple is saved on a new line as a pair of values separated by a comma, in CSV format.
- Parameters:
words (Words[str] or POSTaggedWordList) – A list of words represented as strings or a list of POSTaggedWord tuples.
path (str) – The path where to save the list of words.
- Returns:
The path where the file was saved.
- Return type:
str
- dstk.modules.text_processor.to_lower(words: list[str]) list[str] [source]#
Returns a list of lowercased words.
- Parameters:
words (Words[str]) – A list of words represented as strings.
- Returns:
A list of words represented as strings.
- Return type:
Words[str]
- dstk.modules.text_processor.tokens_to_text(tokens: list[Token], lemmatize: bool = False) list[str] [source]#
Converts a list of spaCy Token objects to a list of words represented as strings.
- Parameters:
tokens (Words[Token]) – A list of spaCy tokens.
lemmatize (bool) – Whether to return the lemmatized form of each token. Defaults to False.
- Returns:
A list of words represented as strings.
- Return type:
Words[str]
dstk.modules.tokenizer module#
This module provides utility functions for tokenizing texts using spaCy. It offers tools to process raw text into structured linguistic data, extract tokens and sentences, filter words by specific criteria (e.g., stop words, alphanumeric characters, part-of-speech), and generate POS-tagged outputs.
Core functionalities include:
Segmenting a text by applying a spaCy language model to raw text
Extracting tokens and sentences from processed documents
Removing stop words and non-alphanumeric tokens
Filtering tokens by part-of-speech (POS) tags
Generating (token, POS) tuples for downstream NLP tasks
The module is intended to provide tools for text segmentation and tagging.
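Example (a minimal sketch; assumes a spaCy model such as en_core_web_sm is installed, e.g. via python -m spacy download en_core_web_sm):

from dstk.modules import tokenizer

doc = tokenizer.apply_model("The cat sat on the mat.", "en_core_web_sm")

tokens = tokenizer.get_tokens(doc)
content = tokenizer.remove_stop_words(tokens)  # alphanumeric, non-stop tokens
nouns = tokenizer.filter_by_pos(tokens, "NOUN")
tagged = tokenizer.pos_tagger(tokens)          # [(token, POS), ...]
sentences = tokenizer.get_sentences(doc)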
- dstk.modules.tokenizer.alphanumeric_raw_tokenizer(tokens: list[Token]) list[Token] [source]#
Filters a list of tokens, keeping only alphanumeric tokens; unlike remove_stop_words, stop words are retained.
- Parameters:
tokens (Words[Token]) – A list of spaCy tokens.
- Returns:
A list of spaCy tokens.
- Return type:
Words[Token]
- dstk.modules.tokenizer.apply_model(text: str, model: str | Language) Doc [source]#
Analyzes a text with a spaCy language model and returns a processed document enriched with linguistic annotations, such as tokens, lemmas, and syntactic relations.
- Parameters:
text (str) – The text to be processed.
model (str or Language) – The name of the model to be used or its instance.
- Returns:
A spaCy Doc object with linguistic annotations.
- Return type:
Doc
- dstk.modules.tokenizer.filter_by_pos(tokens: list[Token], pos: str) list[Token] [source]#
Returns a list of spaCy tokens filtered by a specific part-of-speech tag.
- Parameters:
tokens (Words[Token]) – A list of spaCy tokens.
pos (str) – The POS tag to filter by (e.g., ‘NOUN’, ‘VERB’, etc.). Case-sensitive.
- Returns:
A list of spaCy tokens.
- Return type:
Words[Token]
- dstk.modules.tokenizer.get_sentences(document: Doc) list[list[Word]] [source]#
Returns a list of sentences from a spaCy Doc, where each sentence is represented as a list of spaCy Token objects.
- Parameters:
document (Doc) – A spaCy Doc object.
- Returns:
A list of sentences, each sentence is a list of spaCy Tokens.
- Return type:
WordSentences
- dstk.modules.tokenizer.get_tokens(document: Doc) list[Token] [source]#
Returns a list of spaCy tokens from a Doc object.
- Parameters:
document (Doc) – A spaCy Doc object.
- Returns:
A list of spaCy tokens.
- Return type:
Words[Token]
- dstk.modules.tokenizer.pos_tagger(tokens: list[Token]) list[POSTaggedWord] [source]#
Returns a list of (Token, POS) tuples, pairing each token with its part-of-speech tag.
- Parameters:
tokens (Words[Token]) – A list of spaCy tokens.
- Returns:
A list of POSTaggedWord tuples.
- Return type:
POSTaggedWordList
- dstk.modules.tokenizer.remove_stop_words(tokens: list[Token], custom_stop_words: list[str] | None = None) list[Token] [source]#
Filters tokens, returning only alphanumeric tokens that are not stop words.
- Parameters:
tokens (Words[Token]) – A list of spaCy tokens.
custom_stop_words (list[str] or None) – If provided, a list of custom stop words. Defaults to None.
- Returns:
A list of spaCy tokens.
- Return type:
Words[Token]
dstk.modules.weight_matrix module#
This module provides functions to apply weighting schemes to co-occurrence matrices commonly used in natural language processing and text mining.
Available weighting methods include:
Pointwise Mutual Information (PMI) and Positive PMI (PPMI), which measure the association strength between co-occurring terms by comparing observed co-occurrence frequencies to expected frequencies under independence.
Term Frequency-Inverse Document Frequency (Tf-idf), which reweights term importance based on frequency patterns, leveraging sklearn’s TfidfTransformer.
These weighting techniques help enhance the semantic relevance of co-occurrence matrices, improving downstream tasks such as word embedding, clustering, and semantic similarity analysis.
All functions return weighted co-occurrence matrices as Pandas DataFrames for convenient further analysis.
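Example (a minimal sketch; the toy co-occurrence matrix is illustrative, and the import path is assumed to match the module name shown here):

import pandas as pd
from dstk.modules import weight_matrix

co_matrix = pd.DataFrame(
    [[0, 2, 1], [2, 0, 3], [1, 3, 0]],
    index=["cat", "dog", "fish"],
    columns=["cat", "dog", "fish"],
)

ppmi_matrix = weight_matrix.pmi(co_matrix, positive=True)  # PPMI weighting
tfidf_matrix = weight_matrix.tf_idf(co_matrix)             # Tf-idf weighting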
- dstk.modules.weight_matrix.pmi(co_matrix: DataFrame, positive: bool = False) DataFrame [source]#
Weights a Co-occurrence matrix by PMI or PPMI.
- Parameters:
co_matrix (DataFrame) – A Co-occurrence matrix to be weighted.
positive (bool) – If True, weights the Co-occurrence matrix by PPMI. If False, weights it by PMI. Defaults to False.
- Returns:
A Co-occurrence matrix weighted by PMI or PPMI.
- Return type:
DataFrame
- dstk.modules.weight_matrix.tf_idf(co_matrix: DataFrame, **kwargs) DataFrame [source]#
Weights a Co-occurrence matrix by Tf-idf.
- Parameters:
co_matrix (DataFrame) – A Co-occurrence matrix to be weighted.
kwargs – Additional keyword arguments to pass to sklearn’s TfidfTransformer. For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
- Returns:
A Co-occurrence matrix weighted by Tf-idf.
- Return type:
DataFrame