dstk.modules.data_visualization package
Submodules
dstk.modules.data_visualization.clustering module
Clustering utilities for word embeddings analysis and visualization.
This module provides functions to determine the optimal number of clusters for word embeddings using popular methods such as the Elbow method and Silhouette score. It also assigns cluster labels to the embeddings accordingly.
Key features:
elbow_method: Applies the Elbow method to the embeddings, choosing the cluster count beyond which adding more clusters no longer yields a substantial drop in inertia.
extract_silhouette_score: Uses the Silhouette score to evaluate clustering quality and determine the optimal cluster number.
Both functions support visualization of their respective metrics and can save plots to file.
Cluster labels are appended to the embeddings DataFrame for easy downstream use, such as visualization or further analysis.
These utilities are designed to work seamlessly with word embedding DataFrames, enabling efficient and interpretable clustering analysis.
- dstk.modules.data_visualization.clustering.elbow_method(embeddings: DataFrame, max_clusters: int, show: bool = False, path: str | None = None) → DataFrame
Applies the Elbow method to determine the optimal number of clusters for word embeddings, and assigns cluster labels based on the identified value.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
max_clusters (int) – The maximum number of clusters to evaluate when applying the Elbow method.
show (bool) – If True, shows the plot. Defaults to False.
path (str | None) – If provided, saves the plot to the specified path. Defaults to None.
- Returns:
A copy of the input DataFrame with an additional ‘cluster’ column containing the cluster labels.
- Return type:
DataFrame
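The documentation above does not say how the elbow is located on the inertia curve. As a minimal sketch of one common heuristic (an assumption, not necessarily what elbow_method does internally), the following picks the cluster count whose point lies farthest from the straight line joining the first and last points of the curve. The helper name find_elbow and the inertia values are hypothetical and only serve the illustration.

```python
def find_elbow(inertias):
    """Return the cluster count at the elbow of an inertia curve.

    inertias[i] is the inertia obtained with i + 1 clusters.
    Heuristic (an assumption, not necessarily what dstk does): pick the
    point farthest from the straight line joining the first and last points.
    """
    if len(inertias) < 3:
        raise ValueError("need at least three inertia values to locate an elbow")

    x1, y1 = 1, inertias[0]
    x2, y2 = len(inertias), inertias[-1]

    def distance_to_chord(k, inertia):
        # Numerator of the point-to-line distance. The denominator is the
        # same for every point, so it can be dropped when comparing.
        return abs((y2 - y1) * k - (x2 - x1) * inertia + x2 * y1 - y2 * x1)

    return max(
        range(1, len(inertias) + 1),
        key=lambda k: distance_to_chord(k, inertias[k - 1]),
    )


# Made-up inertia values for k = 1..6: a steep drop, then a plateau.
inertias = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]
best_k = find_elbow(inertias)
```

With these values the curve flattens after three clusters, so the heuristic selects 3. In practice the inertia values would come from fitting a clustering model once per candidate cluster count up to max_clusters.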
- dstk.modules.data_visualization.clustering.extract_silhouette_score(embeddings: DataFrame, max_clusters: int, show: bool = False, path: str | None = None, **kwargs) → DataFrame
Uses the Silhouette score to determine the optimal number of clusters for word embeddings, and assigns cluster labels based on the identified value.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
max_clusters (int) – The maximum number of clusters to evaluate when computing the Silhouette score.
show (bool) – If True, shows the plot. Defaults to False.
path (str | None) – If provided, saves the plot to the specified path. Defaults to None.
kwargs – Additional keyword arguments passed to sklearn.metrics.silhouette_score. For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
- Returns:
A copy of the input DataFrame with an additional ‘cluster’ column containing the cluster labels.
- Return type:
DataFrame
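To make the metric concrete, here is a minimal pure-Python sketch of the mean Silhouette coefficient with Euclidean distance. It is an illustration only: the function documented above delegates to sklearn.metrics.silhouette_score, and the helper name silhouette and the sample points are hypothetical.

```python
from math import dist
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, hence the division by n - 1).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator else 0.0)
    return mean(scores)


# Two tight, well-separated groups of made-up 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_score = silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad_score = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores suggest points assigned to the wrong cluster. Here the labelling that follows the two groups scores far higher than the one that cuts across them, which is the comparison used to pick a cluster count.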
dstk.modules.data_visualization.embeddings module
Visualization utilities for word embeddings using UMAP dimensionality reduction.
This module provides a function to project high-dimensional word embeddings into 2D or 3D space for visualization. It uses UMAP to reduce dimensionality while aiming to preserve both local and global structure, enabling intuitive exploration of semantic relationships between words.
Key features:
Supports 2D and 3D scatter plots of word embeddings.
Optionally displays word labels and cluster assignments.
Allows customization of UMAP parameters such as number of neighbors, distance metric, and minimum distance.
Supports saving interactive Plotly visualizations as HTML files.
This utility helps linguists, NLP practitioners, and data scientists gain insights from embedding spaces through visual inspection.
- dstk.modules.data_visualization.embeddings.plot_embeddings(embeddings: DataFrame, n_dimensions: int = 2, labels: bool = False, show: bool = True, path: str | None = None, umap_neighbors: int = 15, umap_metric: str = 'cosine', umap_dist: float = 0.1) → Figure
Generates a plot of the word embeddings using UMAP for dimensionality reduction.
- Parameters:
embeddings (DataFrame) – A dataframe containing the word embeddings.
labels (bool) – Whether to show word labels on each point. Defaults to False.
show (bool) – If True, shows the plot. Defaults to True.
path (str | None) – If provided, saves the plot to the specified path. Defaults to None.
umap_neighbors (int) – Controls how UMAP balances local versus global structure. Higher values consider a broader context when reducing dimensions. Defaults to 15.
umap_metric (str) – The distance metric UMAP uses to assess similarity between words (e.g., “cosine”, “euclidean”). Defaults to “cosine”, which is common for word embeddings.
umap_dist (float) – Controls how tightly UMAP packs points together. Lower values keep similar words closer in the reduced 2D or 3D space. Defaults to 0.1.
n_dimensions (int) – The number of dimensions for the plot. Must be 2 or 3, corresponding to a 2D or 3D scatter plot respectively. This also determines the dimensionality UMAP will reduce the embeddings to. Defaults to 2.
- Returns:
A Plotly Figure object containing the 2D or 3D scatter plot.
- Return type:
Figure
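The default umap_metric of "cosine" compares the direction of two vectors rather than their magnitude. The following minimal sketch shows that property in plain Python. It is illustrative only: the helper name cosine_distance and the sample vectors are hypothetical, and UMAP computes the metric internally.

```python
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Same direction, different length: distance is approximately 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because vectors of different length count as identical when they point the same way, differences in vector norm do not affect the projection. Switching umap_metric to "euclidean" would make those length differences count.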