dstk.parameters.co_matrix.weighting package#

Submodules#

dstk.parameters.co_matrix.weighting.associative_measures module#

This module provides functions to transform raw co-occurrence matrices into association measures to better reflect the distributional information of lexemes.

While simple co-occurrence counts indicate how often two words appear together, they do not always account for the “informativeness” of those occurrences. For example, a common verb like “get” may co-occur with many different subjects, making it less informative about its context than a specific verb like “bark.”

Core functionalities include:

  • Weighting co-occurrence matrices using Pointwise Mutual Information (PMI) to identify stronger distributional associations.

  • Weighting co-occurrence matrices using Positive Pointwise Mutual Information (PPMI) to eliminate negative values and focus on positive associations.

  • Handling sparse matrix data structures to ensure efficiency when processing large linguistic corpora.

The module is intended to help researchers move beyond simple frequency counts toward meaningful distributional analysis in corpus linguistics.

dstk.parameters.co_matrix.weighting.associative_measures.pmi(word_by_word_matrix: DataFrame, positive: bool = False) DataFrame[source]#

Weights a co-occurrence matrix by PMI or PPMI.

Parameters:
  • word_by_word_matrix (DataFrame) – The co-occurrence matrix to be weighted.

  • positive (bool) – If True, weights by PPMI; if False, weights by PMI. Defaults to False.

Returns:

Sparse co-occurrence matrix weighted by PMI or PPMI.

Return type:

DataFrame
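The weighting described above can be sketched as follows. This is a minimal illustration of the standard PMI definition, log2(P(w, c) / (P(w) P(c))), and not dstk's implementation: the function name pmi_sketch and the toy counts are hypothetical, the result is a dense DataFrame rather than a sparse one, and treating zero counts as a score of 0 is an assumption made here to avoid infinite values.

```python
import numpy as np
import pandas as pd


def pmi_sketch(counts: pd.DataFrame, positive: bool = False) -> pd.DataFrame:
    """Hypothetical stand-in for illustration; not the dstk function."""
    values = counts.to_numpy(dtype=float)
    total = values.sum()
    row_totals = values.sum(axis=1, keepdims=True)
    col_totals = values.sum(axis=0, keepdims=True)
    # Count expected for each cell if row word and column word were independent.
    expected = row_totals * col_totals / total
    with np.errstate(divide="ignore", invalid="ignore"):
        # Equivalent to log2(P(w, c) / (P(w) * P(c))).
        scores = np.log2(values / expected)
    # Assumption: cells with no co-occurrences (log of zero) are scored 0.
    scores[~np.isfinite(scores)] = 0.0
    if positive:
        # PPMI clips negative associations to zero.
        scores = np.maximum(scores, 0.0)
    return pd.DataFrame(scores, index=counts.index, columns=counts.columns)


# Toy symmetric word-by-word counts (invented for this example).
words = ["dog", "bark", "get"]
counts = pd.DataFrame(
    [[0, 6, 2],
     [6, 0, 2],
     [2, 2, 12]],
    index=words,
    columns=words,
)

pmi_scores = pmi_sketch(counts)
ppmi_scores = pmi_sketch(counts, positive=True)
print(pmi_scores.round(3))
print(ppmi_scores.round(3))
```

In this toy matrix, "dog" and "bark" co-occur more often than their overall frequencies predict, so their PMI is positive, while "dog" and "get" co-occur less often than predicted, so their PMI is negative and is clipped to zero under PPMI.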

dstk.parameters.co_matrix.weighting.relevance_measures module#

This module provides functions for calculating relevance measures to weight word-document matrices. These measures are used to assess the informativeness of lexical items within a corpus, helping to distinguish specific content from common distributional noise.

Based on information retrieval principles, these weights combine local components (such as term frequency) with global components (reflecting the overall informativeness of a term). By applying such filters, the module reduces the influence of terms that appear frequently across many documents and enhances the weight of terms associated with specific contexts.

Core functionalities include:

  • Applying Term Frequency-Inverse Document Frequency (TF-IDF) to word-by-document matrices.

  • Weighting lexical items based on their global informativeness across a training corpus.

  • Transforming raw word counts into weighted representations suitable for linguistic analysis and distributional modeling.

The module is intended to provide tools for quantifying term relevance in the context of co-occurrence matrices and distributional representation.

dstk.parameters.co_matrix.weighting.relevance_measures.tf_idf(word_by_document_matrix: DataFrame, **kwargs) DataFrame[source]#

Weights a Word By Document Matrix using TF-IDF.

Parameters:
  • word_by_document_matrix (DataFrame) – A DataFrame representing word-document counts.

  • kwargs – Additional arguments for scikit-learn’s TfidfTransformer.

Returns:

Sparse weight-adjusted Word By Document Matrix.

For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

Return type:

DataFrame
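A rough sketch of this kind of weighting, built directly on scikit-learn's TfidfTransformer, is shown below. It is not dstk's implementation: the function name tf_idf_sketch and the toy counts are hypothetical, and the transposition step assumes the input has words as rows and documents as columns, whereas TfidfTransformer expects documents as rows. The result is returned as a dense DataFrame for readability.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer


def tf_idf_sketch(word_by_document: pd.DataFrame, **kwargs) -> pd.DataFrame:
    """Hypothetical stand-in for illustration; not the dstk function."""
    transformer = TfidfTransformer(**kwargs)
    # Assumption: rows are words, columns are documents, so transpose to the
    # (documents x terms) layout that TfidfTransformer expects.
    doc_by_word = sparse.csr_matrix(word_by_document.T.to_numpy(dtype=float))
    weighted = transformer.fit_transform(doc_by_word)
    # Transpose back so the output keeps the word-by-document orientation.
    return pd.DataFrame(
        weighted.T.toarray(),
        index=word_by_document.index,
        columns=word_by_document.columns,
    )


# Toy counts (invented): "the" occurs in every document, "bark" in only one.
counts = pd.DataFrame(
    [[2, 2, 2],
     [2, 0, 0]],
    index=["the", "bark"],
    columns=["doc1", "doc2", "doc3"],
)

# norm=None is a TfidfTransformer option that skips length normalisation,
# which keeps the raw tf * idf products visible.
weighted = tf_idf_sketch(counts, norm=None)
print(weighted.round(3))
```

Although both words occur twice in doc1, "bark" receives the larger weight there because it appears in fewer documents, which is the global component of the weighting at work.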

Module contents#