dstk.parameters package
Subpackages
Submodules
dstk.parameters.dimensionality_reduction module
This module offers functionality to transform and reduce high-dimensional text data represented as matrices, enabling more effective downstream analysis and modeling.
Key features include:
- Scaling input matrices to zero mean and unit variance using standardization.
- Generating low-dimensional word embeddings from co-occurrence matrices using dimensionality reduction techniques:
  - Truncated Singular Value Decomposition (SVD)
  - Principal Component Analysis (PCA)
These techniques help distill semantic information from sparse and high-dimensional co-occurrence data, facilitating tasks such as clustering, visualization, and feature extraction in natural language processing pipelines.
All functions return their results as pandas DataFrames, so the output can be passed directly to other DataFrame-based steps.
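The standardization step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `StandardScaler` directly rather than any dstk function, and the toy co-occurrence matrix and the variable names (`cooc`, `scaled`) are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy symmetric co-occurrence matrix (illustrative data, not from dstk).
words = ["cat", "dog", "fish"]
cooc = pd.DataFrame(
    [[0, 4, 1],
     [4, 0, 2],
     [1, 2, 0]],
    index=words,
    columns=words,
    dtype=float,
)

# Standardize each column to zero mean and unit variance, then wrap the
# resulting NumPy array back into a DataFrame with the original labels.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(cooc),
    index=cooc.index,
    columns=cooc.columns,
)

print(scaled.round(3))
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), so the scaled columns have unit variance under that convention rather than under pandas' default `ddof=1`.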
- dstk.parameters.dimensionality_reduction.pca(matrix: DataFrame, n_dimensions: int | float = 300, **kwargs) -> DataFrame
Generates word embeddings using Principal Component Analysis (PCA).
- Parameters:
matrix (DataFrame) – A co-occurrence matrix from which embeddings will be generated.
n_dimensions (int or float) – If an integer, the number of dimensions to reduce the word embeddings to. If a float between 0 and 1, specifies the proportion of variance to preserve. Defaults to 300.
kwargs – Additional keyword arguments to pass to sklearn’s PCA.
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- Returns:
A DataFrame of word embeddings generated by PCA.
- Return type:
DataFrame
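The two meanings of `n_dimensions` can be illustrated with scikit-learn's `PCA` directly. The helper `pca_embeddings` below is a hypothetical stand-in written for this example, not dstk's implementation; it assumes the value is forwarded unchanged as `n_components` and that rows of the input are the words being embedded.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_embeddings(matrix: pd.DataFrame, n_dimensions=2, **kwargs) -> pd.DataFrame:
    """Hypothetical sketch: reduce the rows of `matrix` with sklearn's PCA."""
    model = PCA(n_components=n_dimensions, **kwargs)
    reduced = model.fit_transform(matrix.to_numpy())
    # Keep the word labels as the index of the resulting embeddings.
    return pd.DataFrame(reduced, index=matrix.index)


# Toy symmetric co-occurrence matrix (random illustrative counts).
words = ["cat", "dog", "fish", "car", "bus"]
rng = np.random.default_rng(0)
counts = rng.integers(0, 10, size=(5, 5))
cooc = pd.DataFrame(counts + counts.T, index=words, columns=words)

# Integer: keep exactly two components.
fixed = pca_embeddings(cooc, n_dimensions=2)

# Float in (0, 1): keep as many components as needed to explain
# at least 90% of the variance.
by_variance = pca_embeddings(cooc, n_dimensions=0.9)

print(fixed.shape, by_variance.shape)
```

Note that scikit-learn's `PCA` rejects an integer `n_components` larger than `min(n_samples, n_features)`, so if the default of 300 is passed through as-is, the input matrix would need at least that many rows and columns.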
- dstk.parameters.dimensionality_reduction.svd(matrix: DataFrame, n_dimensions: int = 300, **kwargs) -> DataFrame
Generates word embeddings using truncated Singular Value Decomposition (SVD).
- Parameters:
matrix (DataFrame) – A co-occurrence matrix from which embeddings will be generated.
n_dimensions (int) – The number of dimensions to reduce the word embeddings to. Defaults to 300.
kwargs – Additional keyword arguments to pass to sklearn’s TruncatedSVD.
For more information check: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
- Returns:
A DataFrame of word embeddings generated by SVD.
- Return type:
DataFrame
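A comparable sketch for the SVD variant, again using scikit-learn directly. The helper `svd_embeddings` is hypothetical and only assumes that `n_dimensions` maps to `n_components` and that extra keyword arguments (here `random_state`) are forwarded to `TruncatedSVD`.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD


def svd_embeddings(matrix: pd.DataFrame, n_dimensions: int = 2, **kwargs) -> pd.DataFrame:
    """Hypothetical sketch: reduce the rows of `matrix` with TruncatedSVD."""
    model = TruncatedSVD(n_components=n_dimensions, **kwargs)
    reduced = model.fit_transform(matrix.to_numpy())
    # Keep the word labels as the index of the resulting embeddings.
    return pd.DataFrame(reduced, index=matrix.index)


# Toy symmetric co-occurrence matrix (illustrative counts).
words = ["cat", "dog", "fish", "car", "bus"]
cooc = pd.DataFrame(
    [[0, 5, 2, 0, 0],
     [5, 0, 1, 1, 0],
     [2, 1, 0, 0, 0],
     [0, 1, 0, 0, 6],
     [0, 0, 0, 6, 0]],
    index=words,
    columns=words,
)

# random_state is passed through **kwargs to make the randomized solver repeatable.
embeddings = svd_embeddings(cooc, n_dimensions=2, random_state=0)

print(embeddings.shape)
```

Unlike `PCA`, scikit-learn's `TruncatedSVD` does not center the data before decomposing it, which is why it is commonly preferred for large sparse co-occurrence matrices.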