dstk.lib_types package#

Submodules#

dstk.lib_types.dstk_types module#

class dstk.lib_types.dstk_types.Bigram(collocate: str | Token, target_word: str)[source]#

Bases: NamedTuple

Represents a bigram collocation between two words.

Parameters:
  • collocate (str or Token) – The collocate word.

  • target_word (str) – The target word in the bigram.

collocate: str | Token#

Alias for field number 0

target_word: str#

Alias for field number 1
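
Because Bigram is a NamedTuple, its fields can be read by name or by position. The sketch below illustrates this with a local stand-in class that mirrors the documented fields, so it runs without dstk or spaCy installed; the stand-in uses plain str where the real type is str | Token, and the example words are made up.

```python
from typing import NamedTuple


class Bigram(NamedTuple):
    """Local stand-in mirroring the documented fields (not the dstk class itself)."""

    collocate: str  # documented as str | Token; plain str keeps the sketch dependency-free
    target_word: str


bigram = Bigram(collocate="strong", target_word="tea")

# Fields are reachable by name or by position.
by_name = (bigram.collocate, bigram.target_word)
by_index = (bigram[0], bigram[1])

# A list of such tuples has the shape documented for BigramList (list[Bigram]).
bigram_list = [bigram, Bigram("hot", "tea"), Bigram("hot", "soup")]
collocates_of_tea = [b.collocate for b in bigram_list if b.target_word == "tea"]
```

In real code the class would be imported from dstk.lib_types.dstk_types rather than redefined.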

dstk.lib_types.dstk_types.BigramList#

A list of bigram tuples.

alias of list[Bigram]

dstk.lib_types.dstk_types.Collocates#

A tuple representing a group of collocates (words).

alias of tuple[Word, …]

dstk.lib_types.dstk_types.CollocatesList#

A list of collocate tuples.

alias of list[tuple[Word, …]]

dstk.lib_types.dstk_types.DirectedCollocateList#

A list of directed collocates.

alias of list[tuple[Word, tuple[str, str]]]

dstk.lib_types.dstk_types.DirectedCollocates#

Directed collocates represented as a tuple of a word and a pair of directional tags.

alias of tuple[Word, tuple[str, str]]

class dstk.lib_types.dstk_types.ExcludedMethods#

Bases: dict

Specifies methods to exclude by name.

Parameters:
  • exclude (list[str] or str) – A list of method names or a single method name to exclude.

exclude: list[str] | str#

dstk.lib_types.dstk_types.Labels: TypeAlias = numpy.ndarray[typing.Any, numpy.dtype[numpy.str_]] | pandas.core.indexes.base.Index | list[str] | None#

Labels used in pandas DataFrames, representing index or column labels.

This can be a NumPy ndarray of strings, a pandas Index, a list of strings, or None.

dstk.lib_types.dstk_types.Matrix: TypeAlias = scipy.sparse._csr.csr_matrix | scipy.sparse._csc.csc_matrix | numpy.ndarray#

A union of matrix types from SciPy or NumPy.

class dstk.lib_types.dstk_types.Neighbor(word, score)[source]#

Bases: NamedTuple

Represents a neighboring word paired with its score.

score: float#

Alias for field number 1

word: str#

Alias for field number 0
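
A typical use of a (word, score) tuple is ranking. The following is a minimal sketch using a local stand-in for Neighbor with the documented field order; the words and scores are invented, and the assumption that a higher score means a closer neighbor is ours, not something this page states.

```python
from typing import NamedTuple


class Neighbor(NamedTuple):
    """Local stand-in mirroring the documented fields (not the dstk class itself)."""

    word: str
    score: float


# Made-up values, for illustration only.
neighbors = [Neighbor("coffee", 0.81), Neighbor("kettle", 0.64), Neighbor("cup", 0.77)]

# Assumption: higher score = closer neighbor. Sort descending and keep the top two.
top_two = sorted(neighbors, key=lambda n: n.score, reverse=True)[:2]

# NamedTuples unpack positionally, in the order (word, score).
top_words = [word for word, _score in top_two]
```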

dstk.lib_types.dstk_types.Neighbors#

A list of neighboring words with scores.

alias of list[Neighbor]

dstk.lib_types.dstk_types.NeuralModels: TypeAlias = gensim.models.word2vec.Word2Vec | fasttext.FastText._FastText#

A union of neural language model types.

dstk.lib_types.dstk_types.Number: TypeAlias = int | float#

Numeric types accepted (integer or float).

class dstk.lib_types.dstk_types.POSTaggedWord(word: str | Token, pos: str)[source]#

Bases: NamedTuple

Represents a word paired with its Part-Of-Speech (POS) tag.

Parameters:
  • word (str or Token) – The word, either as a string or spaCy Token.

  • pos (str) – The POS tag of the word.

pos: str#

Alias for field number 1

word: str | Token#

Alias for field number 0

dstk.lib_types.dstk_types.POSTaggedWordList#

A list of POS-tagged words.

alias of list[POSTaggedWord]

dstk.lib_types.dstk_types.ResultGenerator#

Generator that yields results of workflow steps without step metadata.

alias of Generator[Any, None, None]

dstk.lib_types.dstk_types.RulesTemplate#

Template defining rules for excluding methods once a specific type is triggered.

The outer dictionary keys are module names (e.g., ‘tokenizer’, ‘text_processor’), and the values specify which methods should be excluded in that module.

For example, when the data type changes to ‘POSTaggedWordList’, these rules prevent further use of specific methods, such as ‘pos_tagger’ in the tokenizer module.

alias of dict[str, ExcludedMethods]
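
As a rough illustration of the dict[str, ExcludedMethods] shape, the sketch below builds one rules template as a plain dictionary. The module names and ‘pos_tagger’ come from the description above; ‘to_lower’, ‘remove_stopwords’ and the helper excluded_methods are hypothetical names introduced only for this example.

```python
# Outer keys are module names; each value has the documented
# ExcludedMethods shape: {"exclude": <single name or list of names>}.
rules_template = {
    "tokenizer": {"exclude": "pos_tagger"},
    # Hypothetical method names, for illustration only.
    "text_processor": {"exclude": ["to_lower", "remove_stopwords"]},
}


def excluded_methods(template, module):
    """Hypothetical helper: return the excluded method names for a module as a set."""
    exclude = template.get(module, {}).get("exclude", [])
    # "exclude" may be a single string, so normalise it to a list first.
    if isinstance(exclude, str):
        exclude = [exclude]
    return set(exclude)


tokenizer_excluded = excluded_methods(rules_template, "tokenizer")
ngrams_excluded = excluded_methods(rules_template, "ngrams")  # module with no rules
```

Normalising the single-string form up front avoids accidentally iterating over the characters of a method name.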

dstk.lib_types.dstk_types.Sentences: TypeAlias = list[list[~Word]] | list[list[tuple[~Word, ...]] | list[tuple[~Word, tuple[str, str]]] | list[dstk.lib_types.dstk_types.POSTaggedWord] | list[dstk.lib_types.dstk_types.Bigram]]#

Union type representing either plain or tagged sentences.

dstk.lib_types.dstk_types.StageModules#

Mapping from stage indices (integers) to sets of module names allowed in that stage.

Each key is a stage number, and the value is a set of module names (strings) that are enabled or active during that stage of the stage workflow.

alias of dict[int, set[str]]
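
A value of this type might look like the sketch below. The module names are the ones used as examples elsewhere on this page, but their assignment to stages 0 and 1 is invented, and is_allowed is a hypothetical helper rather than part of dstk.

```python
# Stage number -> set of module names enabled in that stage.
# The stage assignments here are made up for illustration.
stage_modules: dict[int, set[str]] = {
    0: {"tokenizer"},
    1: {"text_processor", "ngrams"},
}


def is_allowed(stages, stage, module):
    """Hypothetical helper: True if `module` is enabled in `stage`."""
    # Unknown stages are treated as having no enabled modules.
    return module in stages.get(stage, set())


ngrams_in_stage_0 = is_allowed(stage_modules, 0, "ngrams")
ngrams_in_stage_1 = is_allowed(stage_modules, 1, "ngrams")
```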

dstk.lib_types.dstk_types.StageTemplate#

Mapping from stage names to their corresponding workflow templates.

Each key is a stage name (a string identifying a module), and the value is a WorkflowTemplate describing the processing steps and triggers allowed in that stage.

alias of dict[str, WorkflowTemplate]

dstk.lib_types.dstk_types.StageWorkflow#

A stage workflow contains multiple workflows organized by module names. Each key is a module name (e.g., ‘tokenizer’, ‘ngrams’, ‘text_processor’), and the value is the workflow steps for that module.

alias of dict[str, list[dict[str, dict[str, Any]]]]
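
The nesting is easier to see in a concrete value. In the sketch below the module names come from the description above, while every method name and keyword argument is a hypothetical placeholder, not a documented dstk method.

```python
from typing import Any

# module name -> ordered list of steps; each step maps a method name to its kwargs.
stage_workflow: dict[str, list[dict[str, dict[str, Any]]]] = {
    # Method names and arguments below are placeholders for illustration.
    "tokenizer": [{"tokenize": {"language": "en"}}],
    "text_processor": [
        {"to_lower": {}},
        {"remove_stopwords": {"custom": ["tea"]}},
    ],
}

# Number of steps configured per module.
step_counts = {module: len(steps) for module, steps in stage_workflow.items()}

# Flatten to the method names in execution order (dicts preserve insertion order).
method_names = [
    name
    for steps in stage_workflow.values()
    for step in steps
    for name in step
]
```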

class dstk.lib_types.dstk_types.StepConfig#

Bases: dict

Configuration for a processing step in a workflow.

Parameters:
  • include (list[str] or str, optional) – Methods to include, either a list of strings or a single string.

  • exclude (dict[str, int], optional) – Methods to exclude, as a dictionary mapping strings to integers.

  • repeat (bool) – Whether a method can be used more than once.

  • chaining (bool) – Whether method chaining is enabled.

  • step_name (str) – The name of the step.

chaining: bool#
exclude: NotRequired[dict[str, int]]#
include: NotRequired[list[str] | str]#
repeat: bool#
step_name: str#

dstk.lib_types.dstk_types.StepGenerator#

Generator that yields StepResult objects, each representing the name and result of a workflow step.

alias of Generator[StepResult, None, None]

class dstk.lib_types.dstk_types.StepResult(name: str, result: Any)[source]#

Bases: NamedTuple

Represents the result of executing a single workflow or model step.

Parameters:
  • name (str) – The name of the step.

  • result (Any) – The output produced by the step.

name: str#

Alias for field number 0

result: Any#

Alias for field number 1
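
To show how StepResult relates to the two generator aliases on this page, here is a small sketch with a local stand-in class. The functions run_steps and results_only and the step names are hypothetical; they only demonstrate a generator that yields StepResult objects (the StepGenerator shape) and one that yields bare results (the ResultGenerator shape).

```python
from typing import Any, Generator, NamedTuple


class StepResult(NamedTuple):
    """Local stand-in mirroring the documented fields (not the dstk class itself)."""

    name: str
    result: Any


def run_steps(text: str) -> Generator[StepResult, None, None]:
    """Hypothetical two-step pipeline that yields a StepResult after each step."""
    words = text.split()
    yield StepResult("split", words)
    yield StepResult("lowercase", [w.lower() for w in words])


def results_only(steps) -> Generator[Any, None, None]:
    """Drop the step name and yield only each step's output."""
    for _name, result in steps:
        yield result


names = [step.name for step in run_steps("Hot Tea")]
final = list(results_only(run_steps("Hot Tea")))[-1]
```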

dstk.lib_types.dstk_types.TaggedSentences#

A list of tagged sentences, each containing tagged words.

alias of list[list[tuple[Word, …]] | list[tuple[Word, tuple[str, str]]] | list[POSTaggedWord] | list[Bigram]]

dstk.lib_types.dstk_types.TaggedWordsList: TypeAlias = list[tuple[~Word, ...]] | list[tuple[~Word, tuple[str, str]]] | list[dstk.lib_types.dstk_types.POSTaggedWord] | list[dstk.lib_types.dstk_types.Bigram]#

Union type of all tagged word lists.

class dstk.lib_types.dstk_types.Word#

A generic type variable for words, bound to str or spaCy Token.

alias of TypeVar(‘Word’, bound=str | Token)

dstk.lib_types.dstk_types.WordCounts#

A counter mapping words (strings) to their frequency counts.

alias of Counter[str]
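
Since the alias is the standard-library collections.Counter, the usual Counter operations apply. A brief sketch with made-up words:

```python
from collections import Counter

# Counting a flat list of words gives a value of the documented Counter[str] shape.
word_counts: Counter[str] = Counter(["tea", "hot", "tea", "cup", "tea", "hot"])

# The two most frequent words, as (word, count) pairs.
most_common = word_counts.most_common(2)

# Looking up a word that was never counted returns 0 instead of raising KeyError.
missing = word_counts["kettle"]
```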

dstk.lib_types.dstk_types.WordSenteces#

A list of sentences, where each sentence is a list of words.

alias of list[list[Word]]

dstk.lib_types.dstk_types.Words#

A list of words (strings or spaCy Tokens).

alias of list[Word]

dstk.lib_types.dstk_types.Workflow#

A workflow is a list of ordered steps, where each step is a dictionary mapping method names to their keyword arguments.

alias of list[dict[str, dict[str, Any]]]
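
One way to read this structure is as a recipe that a runner walks through in order. The sketch below is only an illustration of that idea and is not how dstk itself executes workflows: the functions lowercase and drop_short, the METHODS registry and the run helper are all hypothetical.

```python
from typing import Any


def lowercase(words):
    """Hypothetical method: lower-case every word."""
    return [w.lower() for w in words]


def drop_short(words, min_length=1):
    """Hypothetical method: keep words with at least `min_length` characters."""
    return [w for w in words if len(w) >= min_length]


# Hypothetical registry mapping method names to callables.
METHODS = {"lowercase": lowercase, "drop_short": drop_short}

# Ordered steps; each step maps a method name to its keyword arguments.
workflow: list[dict[str, dict[str, Any]]] = [
    {"lowercase": {}},
    {"drop_short": {"min_length": 3}},
]


def run(steps, data):
    """Apply each method in order, feeding every output into the next method."""
    for step in steps:
        for method_name, kwargs in step.items():
            data = METHODS[method_name](data, **kwargs)
    return data


result = run(workflow, ["The", "Tea", "is", "Hot"])
```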

class dstk.lib_types.dstk_types.WorkflowTemplate#

Bases: dict

Template for an entire workflow, consisting of steps, a base type and triggers.

Parameters:
  • steps (dict[int, StepConfig]) – Mapping from step numbers to step configurations.

  • base_type (str) – The base type of the workflow.

  • triggers (dict[str, str]) – Mapping from method names to the data types they produce. When a method changes the current data type (the default return type), the corresponding trigger activates rules that enable or disable subsequent methods.

base_type: str#
steps: dict[int, StepConfig]#
triggers: dict[str, str]#
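
Putting the three keys together, a template could look like the plain dictionary below, with each step following the documented StepConfig fields. All concrete values are assumptions made for illustration: the step names, the ‘tokenize’ method, the use of ‘Words’ as base type, and the integer in exclude (whose meaning this page does not specify). The helper triggered_type is likewise hypothetical.

```python
workflow_template = {
    "steps": {
        # StepConfig: step_name, repeat and chaining are required;
        # include and exclude are optional.
        0: {
            "step_name": "tokenize",
            "include": "tokenize",  # hypothetical method name
            "repeat": False,
            "chaining": False,
        },
        1: {
            "step_name": "process",
            "exclude": {"pos_tagger": 1},  # placeholder integer; meaning not documented here
            "repeat": True,
            "chaining": True,
        },
    },
    "base_type": "Words",  # assumed value
    "triggers": {"pos_tagger": "POSTaggedWordList"},
}


def triggered_type(template, method):
    """Hypothetical helper: the data type a method produces, or None if it has no trigger."""
    return template["triggers"].get(method)


after_pos_tagger = triggered_type(workflow_template, "pos_tagger")
after_tokenize = triggered_type(workflow_template, "tokenize")

# Every step carries the three required StepConfig keys.
required = {"step_name", "repeat", "chaining"}
steps_complete = all(required <= set(cfg) for cfg in workflow_template["steps"].values())
```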

dstk.lib_types.fasttext_types module#

dstk.lib_types.gensim_types module#

dstk.lib_types.numpy_types module#

dstk.lib_types.pandas_types module#

dstk.lib_types.plotly_types module#

dstk.lib_types.sklearn_types module#

dstk.lib_types.spacy_types module#

Module contents#