dstk.parameters.co_matrix.creation.linguistic.word package
Submodules
dstk.parameters.co_matrix.creation.linguistic.word.window module
This module provides tools for constructing word-by-word co-occurrence matrices based on “Lexemes” within the framework of distributional semantics. Specifically, it focuses on “Window-based collocates,” where the context of a word is determined by its proximity to other words in a sequence.
The module facilitates the transition from raw linguistic sequences to structured mathematical representations, allowing researchers to analyze how words appear together within specific windows.
Core functionalities include:
Generating co-occurrence matrices (word $\times$ word) from lists of tokenized contexts.
Leveraging standard vectorization tools to handle preprocessing such as stop_words and n-grams.
Converting sparse mathematical matrices into labeled DataFrames for easier analysis by linguists and researchers.
Providing a framework for calculating how words relate to one another based on spatial proximity within the text.
This module is intended for analyzing lexical relationships in which the distance between words in a sequence (the “window”) determines which words count as context for one another.
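To make the idea of window-based collocates concrete, the following is a minimal sketch that uses only the Python standard library. It is not this module's implementation: the function name `window_cooccurrences` and the `window` parameter are hypothetical, and tokens are assumed to be plain strings rather than the package's `Word` objects.

```python
from collections import Counter


def window_cooccurrences(tokens, window=2):
    """Count (target, collocate) pairs within a symmetric window.

    Hypothetical helper for illustration only; not part of dstk.
    """
    counts = Counter()
    for i, target in enumerate(tokens):
        # The window covers up to `window` tokens on each side of the target.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own collocate
                counts[(target, tokens[j])] += 1
    return counts


tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = window_cooccurrences(tokens, window=1)
print(pairs[("cat", "sat")])  # adjacent words fall inside a window of 1
print(pairs[("cat", "on")])   # words two positions apart do not
```

Widening `window` lets more distant words count as collocates, which changes the resulting counts and therefore the word-by-word matrix built from them.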
- dstk.parameters.co_matrix.creation.linguistic.word.window.create_word_by_word_matrix(contexts: Sequence[Sequence[Word]], **kwargs) → DataFrame
Build a Word By Word Matrix from tokenized contexts.
- Parameters:
contexts (list[Sequence[Word]]) – A list of word or token object sequences.
kwargs – Arguments for sklearn CountVectorizer (e.g. stop_words, ngram_range).
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- Returns:
Co-occurrence matrix (feature × feature), returned as a labeled DataFrame.
- Return type:
DataFrame