dstk.models package#
Submodules#
dstk.models.model_tools module#
Module for orchestrating and automating the execution of multiple workflows and hooks.
Provides the ModelBuilder class which manages a sequence of WorkflowBuilder, StageWorkflowBuilder, or Hook instances, allowing flexible, stepwise processing of input data through these workflows.
Features:
Sequential execution of workflows with intermediate results.
Options to retrieve results from specific workflows, all workflows, or only the final output.
Supports integration with various workflow types for modular model construction.
This module facilitates building complex processing models by combining and controlling multiple modular workflows in a unified manner.
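The chaining idea can be illustrated with a minimal, self-contained sketch. This is not dstk's actual implementation: `MiniModelBuilder` and the stand-in workflows below are hypothetical, and real workflows would be WorkflowBuilder, StageWorkflowBuilder, or Hook instances.

```python
# Hypothetical sketch of the chaining pattern ModelBuilder provides:
# each "workflow" is a callable, and each receives the previous result.

class MiniModelBuilder:
    def __init__(self, workflows):
        self.workflows = workflows

    def __call__(self, data):
        result = data
        for workflow in self.workflows:
            result = workflow(result)  # feed the intermediate result forward
        return result

# Stand-in workflows: lowercase the text, then split it into tokens.
lowercase = str.lower
tokenize = str.split

model = MiniModelBuilder(workflows=[lowercase, tokenize])
print(model("The Quick Fox"))  # ['the', 'quick', 'fox']
```

The same calling pattern applies to the real class: construct once with a list of workflows, then call the resulting object on input data.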
- class dstk.models.model_tools.ModelBuilder(workflows: list[WorkflowBuilder | StageWorkflowBuilder | Hook])[source]#
Bases:
object
Automates the execution of a sequence of workflows on a WorkflowBuilder or Hook subclass.
- Parameters:
workflows (list[WorkflowBuilder | StageWorkflowBuilder | Hook]) – A list of WorkflowBuilder, StageWorkflowBuilder, or Hook instances to execute.
Usage:
CustomModel = ModelBuilder(workflows=[workflow1, workflow2, hook1])
final_result = CustomModel(input_data)
dstk.models.models module#
This module contains predefined and commonly used distributional semantic models. Each model is implemented as a high-level pipeline that integrates multiple stages of text processing, embedding generation, and similarity computation.
Currently supported models:
StandardModel: A count-based model using a context window, PPMI weighting, and dimensionality reduction via SVD. Based on the description found in the book ‘Distributional semantics’ by Lenci & Sahlgren (2023).
SGNSModel: A prediction-based model using Word2Vec’s Skip-Gram with Negative Sampling (SGNS), as described by Lenci & Sahlgren (2023).
These pipelines are modular and composable, built from reusable workflows to support both experimentation and production use.
Future versions of this module may include additional models and hybrid approaches.
- class dstk.models.models.DistanceMeasurements(*args, **kwargs)[source]#
Bases:
Protocol
Interface for semantic similarity methods based on word embeddings.
This protocol represents any object that implements methods for computing cosine similarity and retrieving nearest neighbors. It is used as the return type of the StandardModel(..., return_workflows=["GeometricDistance"]) or SGNSModel(..., return_workflows=["GeometricDistance"]) pipelines.
- Methods:
- cos_similarity(first_word, second_word):
Computes the cosine similarity between two words. Equivalent to dstk.modules.geometric_distance.cos_similarity.
- nearest_neighbors(word, metric, n_words, **kwargs):
Returns the nearest neighbors to a word using a specified metric. Equivalent to dstk.modules.geometric_distance.nearest_neighbors.
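A toy sketch of what an object satisfying this protocol computes. The 2-D vectors below are made up for illustration; real embeddings come from the pipelines, and this is not dstk's implementation.

```python
import math

# Hypothetical toy embeddings; real vectors come from the model pipelines.
embeddings = {
    "cat": [1.0, 0.2],
    "dog": [0.9, 0.3],
    "car": [0.1, 1.0],
}

def cos_similarity(first_word, second_word):
    a, b = embeddings[first_word], embeddings[second_word]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_neighbors(word, n_words=2):
    # Rank all other words by cosine similarity to the query word.
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: cos_similarity(word, w), reverse=True)[:n_words]

print(nearest_neighbors("cat", n_words=1))  # ['dog']
```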
- dstk.models.models.SGNSModel(text: str, model: str | Language, path: str, return_workflows: None = None, return_all: Literal[False] = False, **kwargs) DistanceMeasurements [source]#
- dstk.models.models.SGNSModel(text: str, model: str | Language, path: str, return_workflows: list[str] = None, return_all: Literal[False] = False, **kwargs) Generator[Any, None, None]
- dstk.models.models.SGNSModel(text: str, model: str | Language, path: str, return_workflows: None = None, return_all: Literal[True] = True, **kwargs) Generator[StepResult, None, None]
This pipeline generates word embeddings using Skip-Gram with Negative Sampling (SGNS) as defined by (Lenci & Sahlgren 162). It preprocesses the text by extracting the sentences, removing stop words, and lowercasing them. The embeddings are obtained by training word2vec with SGNS. Then, cosine similarity is applied as the distance metric.
- Parameters:
text (str) – The text to extract the embeddings from.
model (str or Language) – The spaCy NLP model to tokenize the text.
path (str) – The path to save the processed sentences.
kwargs –
Additional keyword arguments to pass to gensim.models.Word2Vec. Common options include:
vector_size: Size of the word embedding vectors.
workers: Number of CPU cores to be used during the training process.
negative: Specifies how many “noise words” to sample for each positive example during training. Typical values range from 5 to 20. Higher values make training slower but can improve embedding quality.
window (int): Maximum distance between the current and predicted word.
min_count (int): Ignores all words with total frequency lower than this.
For more information check: https://radimrehurek.com/gensim/models/word2vec.html
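The keyword options above can be collected in a dictionary and forwarded through **kwargs. The values below are illustrative choices, not gensim's defaults:

```python
# Illustrative Word2Vec hyperparameters (values are assumptions, not defaults).
w2v_kwargs = {
    "vector_size": 100,  # dimensionality of the embedding vectors
    "workers": 4,        # CPU cores used during training
    "negative": 10,      # noise words per positive example (typical: 5-20)
    "window": 5,         # max distance between current and predicted word
    "min_count": 2,      # ignore words rarer than this
}

# These would be forwarded as SGNSModel(text, model, path, **w2v_kwargs),
# which passes them on to gensim.models.Word2Vec.
```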
return_workflows (list[str] or None) –
If provided, yields results only for these workflows. Defaults to None. The names of the workflows that can be returned are the following:
ProcessedText: Returns the pre-processed text.
SGNS: Returns a Word2Vec instance of the Skip-Gram with Negative Sampling model, trained on the input text.
Embeddings: Returns the generated word embeddings.
GeometricDistance: Returns a wrapper with the methods cos_similarity and nearest_neighbors for semantic distance analysis.
return_all (bool) – If True, yields results for all workflows. Defaults to False.
- Returns:
Wrapper exposing cos_similarity and nearest_neighbors, or a generator of step/workflow results.
- Return type:
ResultGenerator | StepGenerator | Wrapper
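When return_workflows or return_all is used, the pipeline yields per-workflow results rather than a single wrapper. A hypothetical sketch of consuming such a generator, with a stand-in StepResult record and made-up values (not dstk's actual types or output):

```python
from collections import namedtuple

# Hypothetical stand-in for the step records a pipeline yields.
StepResult = namedtuple("StepResult", ["workflow", "value"])

def fake_pipeline():
    # Stand-in for SGNSModel(..., return_all=True); the names mirror the
    # workflow names documented above, the values are invented.
    yield StepResult("ProcessedText", ["the", "cat", "sat"])
    yield StepResult("SGNS", "<trained model>")
    yield StepResult("Embeddings", {"cat": [0.1, 0.9]})

# Keep only the workflows of interest, as return_workflows would:
wanted = {"Embeddings"}
results = {s.workflow: s.value for s in fake_pipeline() if s.workflow in wanted}
print(results["Embeddings"]["cat"])  # [0.1, 0.9]
```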
- dstk.models.models.StandardModel(text: str, model: str | Language, custom_stop_words: list[str] | None = None, window_size: int = 2, n_components: int = 100, return_workflows: None = None, return_all: Literal[False] = False) DistanceMeasurements [source]#
- dstk.models.models.StandardModel(text: str, model: str | Language, custom_stop_words: list[str] | None = None, window_size: int = 2, n_components: int = 100, return_workflows: list[str] = None, return_all: Literal[False] = False) Generator[Any, None, None]
- dstk.models.models.StandardModel(text: str, model: str | Language, custom_stop_words: list[str] | None = None, window_size: int = 2, n_components: int = 100, return_workflows: None = None, return_all: Literal[True] = True) Generator[StepResult, None, None]
This pipeline generates word embeddings using the standard model as defined by (Lenci & Sahlgren 97). It preprocesses the text by removing stop words, lowercasing the words, and segmenting the text using a context window. The co-occurrence matrix is weighted with PPMI and reduced with truncated SVD. Then, cosine similarity is applied as the distance metric.
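The PPMI weighting step can be sketched on a toy co-occurrence matrix. The counts below are invented, and this is not dstk's implementation; the SVD reduction that follows in the real pipeline is omitted here.

```python
import math

# Hypothetical toy co-occurrence counts between three target/context words.
words = ["cat", "dog", "car"]
counts = [
    [0, 4, 1],
    [4, 0, 1],
    [1, 1, 0],
]

total = sum(sum(row) for row in counts)
row_sums = [sum(row) for row in counts]
col_sums = [sum(counts[i][j] for i in range(3)) for j in range(3)]

def ppmi(i, j):
    # PPMI(w, c) = max(0, log2( p(w, c) / (p(w) * p(c)) ))
    if counts[i][j] == 0:
        return 0.0
    pmi = math.log2((counts[i][j] / total) /
                    ((row_sums[i] / total) * (col_sums[j] / total)))
    return max(0.0, pmi)

weighted = [[ppmi(i, j) for j in range(3)] for i in range(3)]
```

In the real pipeline the weighted matrix would then be reduced to n_components dimensions with truncated SVD before computing distances.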
- Parameters:
text (str) – The text to extract the embeddings from.
model (str or Language) – The spaCy NLP model to tokenize the text.
window_size (int) – The size of the context window to segment the text. Defaults to 2.
n_components (int) – The number of dimensions of the embeddings. Defaults to 100.
return_workflows (list[str] or None) –
If provided, yields results only for these workflows. Defaults to None. The names of the workflows that can be returned are the following:
ProcessedText: Returns the pre-processed text.
Matrix: Returns the co-occurrence matrix.
WeightedMatrix: Returns the weighted co-occurrence matrix.
Embeddings: Returns the generated word embeddings.
GeometricDistance: Returns a wrapper with the methods cos_similarity and nearest_neighbors for semantic distance analysis.
return_all (bool) – If True, yields results for all workflows. Defaults to False.
- Returns:
Wrapper exposing cos_similarity and nearest_neighbors, or a generator of step/workflow results.
- Return type:
ResultGenerator | StepGenerator | Wrapper