dstk.parameters.vector_similarity.geometric_measures package

Submodules

dstk.parameters.vector_similarity.geometric_measures.dissimilarity module

This module provides geometric measures to calculate the dissimilarity between word embeddings. By representing words as vectors in a high-dimensional space, these methods measure the spatial “distance” between them; a smaller distance indicates higher semantic similarity, while a larger distance indicates greater dissimilarity.

Core functionalities include:

  • Calculating Euclidean distance (L2 norm) to determine the straight-line distance between two word vectors.

  • Calculating Manhattan distance (L1 norm) to measure the distance between two word vectors as the sum of the absolute differences along each axis.

  • Providing foundational geometric metrics used to evaluate distributional similarity and identify nearest neighbors for given lexemes.

The module is intended to provide researchers with standard spatial metrics to quantify the relationships between words in a vector space.

dstk.parameters.vector_similarity.geometric_measures.dissimilarity.euclidean_distance(embeddings: DataFrame, first_word: str, second_word: str) → float

Computes the Euclidean distance between the embeddings of two words.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • first_word (str) – The first word in the pair.

  • second_word (str) – The second word in the pair.

Returns:

The Euclidean distance between the first and second word.

Return type:

float
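
The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the helper name `euclidean_sketch` is hypothetical, and it assumes the dataframe holds one row per word with the word as the row label.

```python
import numpy as np
import pandas as pd

# Assumption: one row per word, indexed by the word itself.
embeddings = pd.DataFrame(
    [[1.0, 2.0, 2.0], [4.0, 6.0, 2.0], [0.0, 0.0, 0.0]],
    index=["cat", "dog", "car"],
)


def euclidean_sketch(embeddings: pd.DataFrame, first_word: str, second_word: str) -> float:
    """Hypothetical helper: L2 norm of the difference between two row vectors."""
    first = embeddings.loc[first_word].to_numpy(dtype=float)
    second = embeddings.loc[second_word].to_numpy(dtype=float)
    return float(np.linalg.norm(first - second))


# The difference vector is (-3, -4, 0), so the distance is sqrt(9 + 16 + 0) = 5.
distance = euclidean_sketch(embeddings, "cat", "dog")
```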

dstk.parameters.vector_similarity.geometric_measures.dissimilarity.manhattan_distance(embeddings: DataFrame, first_word: str, second_word: str) → float

Computes the Manhattan distance between the embeddings of two words.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • first_word (str) – The first word in the pair.

  • second_word (str) – The second word in the pair.

Returns:

The Manhattan distance between the first and second word.

Return type:

float
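
As a rough sketch of the same idea for the L1 norm, the distance is the sum of absolute coordinate differences. The helper name `manhattan_sketch` is hypothetical, and the word-indexed dataframe layout is an assumption, not something the documentation above states.

```python
import numpy as np
import pandas as pd

# Assumption: one row per word, indexed by the word itself.
embeddings = pd.DataFrame(
    [[1.0, 2.0, 2.0], [4.0, 6.0, 2.0], [0.0, 0.0, 0.0]],
    index=["cat", "dog", "car"],
)


def manhattan_sketch(embeddings: pd.DataFrame, first_word: str, second_word: str) -> float:
    """Hypothetical helper: sum of absolute differences between two row vectors."""
    first = embeddings.loc[first_word].to_numpy(dtype=float)
    second = embeddings.loc[second_word].to_numpy(dtype=float)
    return float(np.abs(first - second).sum())


# |1 - 4| + |2 - 6| + |2 - 2| = 3 + 4 + 0 = 7
distance = manhattan_sketch(embeddings, "cat", "dog")
```

On the same pair of vectors the Manhattan distance (7) is larger than the Euclidean distance (5), since the L1 norm is never smaller than the L2 norm.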

dstk.parameters.vector_similarity.geometric_measures.similarity module

This module implements geometric measures to quantify the semantic similarity between lexemes based on their representation in a vector space. Following the geometric approach, it treats word embeddings as points in an n-dimensional space where the proximity of two vectors indicates the degree of similarity between their meanings.

Core functionalities include:

  • Calculating cosine similarity, the cosine of the angle between two distributional vectors (embeddings), to measure how closely they point in the same direction.

  • Identifying the $k$ nearest neighbors for a target word using exact search algorithms, returning both the labels and their corresponding similarity scores.

  • Performing approximate nearest neighbor searches to facilitate fast, memory-efficient queries in large datasets using HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) methods.

The module is designed to provide researchers with tools for both precise pairwise comparisons and scalable retrieval of semantically related terms within large multidimensional embedding spaces.

dstk.parameters.vector_similarity.geometric_measures.similarity.approximate_nearest_neighbors(embeddings: DataFrame, word: str, metric: Literal['hnsw', 'ivf'] = 'ivf', n_words: int = 5, n_centroids: int = 100, clusters_to_search: int = 10, n_connections: int = 16, search_depth: int = 8, construction_depth: int = 64) → list[Neighbor]

Find words with similar embeddings using a fast, memory-efficient approximate search.

This function returns the closest words to a target word without comparing the target against every word in the vocabulary. Instead, it builds an index structure (IVF or HNSW) that returns results very close to those of an exact search in much less time, especially on large embedding sets.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • word (str) – The target word to find neighbors for.

  • metric (Literal[‘hnsw’, ‘ivf’]) – The search method to use (‘ivf’ or ‘hnsw’). IVF uses less memory; HNSW is more accurate but uses more memory. Defaults to ‘ivf’.

  • n_words (int) – Number of nearest neighbors to return. Defaults to 5.

  • n_centroids (int) – The number of centroids IVF uses to group the embeddings. Equivalent to the number of clusters. Defaults to 100.

  • clusters_to_search (int) – Controls how many clusters (centroids) IVF visits during the search. The higher, the slower, but more accurate. Defaults to 10.

  • n_connections (int) – Number of connections of each node in HNSW. Defaults to 16.

  • search_depth (int) – How much of the network HNSW will search. Defaults to 8.

  • construction_depth (int) – How much of the network HNSW explores while the index is being built, which determines how accurately the network is constructed. It does not affect search time, so higher values are generally preferable, but they do increase construction time. Defaults to 64.

Returns:

A list of Neighbor namedtuples, one for each word close to the target word.

Return type:

list[Neighbor]
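
To make the roles of n_centroids and clusters_to_search concrete, here is a toy IVF-style search written directly in NumPy. It is only a sketch under stated assumptions and not how the library builds its index: the function name `ivf_search_sketch`, the `n_iter` and `seed` arguments, the Euclidean ranking, and the word-indexed dataframe layout are all hypothetical. It returns plain labels rather than Neighbor namedtuples.

```python
import numpy as np
import pandas as pd


def ivf_search_sketch(embeddings, word, n_words=5, n_centroids=4,
                      clusters_to_search=2, n_iter=10, seed=0):
    """Toy inverted-file search: cluster the vectors, then scan only a few clusters."""
    rng = np.random.default_rng(seed)
    matrix = embeddings.to_numpy(dtype=float)

    # Build step: a few rounds of k-means to place the centroids.
    start = rng.choice(len(matrix), size=n_centroids, replace=False)
    centroids = matrix[start].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(matrix[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        for c in range(n_centroids):
            members = matrix[assignment == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment of every vector to its nearest centroid (the "inverted lists").
    dists = np.linalg.norm(matrix[:, None, :] - centroids[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)

    # Search step: visit only the clusters whose centroids are closest to the target.
    target = embeddings.loc[word].to_numpy(dtype=float)
    nearest = np.linalg.norm(centroids - target, axis=1).argsort()[:clusters_to_search]
    candidates = np.flatnonzero(np.isin(assignment, nearest))
    cand_dists = np.linalg.norm(matrix[candidates] - target, axis=1)
    ranked = candidates[cand_dists.argsort()]
    labels = [embeddings.index[i] for i in ranked if embeddings.index[i] != word]
    return labels[:n_words]


# Synthetic data for illustration only: 200 random 8-dimensional vectors.
data_rng = np.random.default_rng(42)
embeddings = pd.DataFrame(
    data_rng.normal(size=(200, 8)),
    index=[f"w{i}" for i in range(200)],
)

fast = ivf_search_sketch(embeddings, "w0", clusters_to_search=1)
thorough = ivf_search_sketch(embeddings, "w0", clusters_to_search=4)
```

With clusters_to_search equal to n_centroids every vector is scanned, so the result matches an exact search. Lower values scan fewer vectors and may miss neighbors that fall in unvisited clusters, which is the speed and accuracy trade-off the parameter description refers to.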

dstk.parameters.vector_similarity.geometric_measures.similarity.cos_similarity(embeddings: DataFrame, first_word: str, second_word: str) → float

Computes the cosine similarity between the embeddings of two words.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • first_word (str) – The first word in the pair.

  • second_word (str) – The second word in the pair.

Returns:

The cosine similarity between the first and second word.

Return type:

float
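
Cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below illustrates that formula only. The helper name `cosine_sketch` is hypothetical, the word-indexed dataframe layout is an assumption, and zero vectors (for which the measure is undefined) are not handled.

```python
import numpy as np
import pandas as pd

# Assumption: one row per word, indexed by the word itself.
embeddings = pd.DataFrame(
    [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
    index=["cat", "dog", "car"],
)


def cosine_sketch(embeddings: pd.DataFrame, first_word: str, second_word: str) -> float:
    """Hypothetical helper: cosine of the angle between two row vectors."""
    first = embeddings.loc[first_word].to_numpy(dtype=float)
    second = embeddings.loc[second_word].to_numpy(dtype=float)
    return float(first @ second / (np.linalg.norm(first) * np.linalg.norm(second)))


# "cat" and "dog" are 45 degrees apart, so the similarity is 1 / sqrt(2), about 0.707.
similarity = cosine_sketch(embeddings, "cat", "dog")
# "cat" and "car" are orthogonal, so their similarity is 0.
orthogonal = cosine_sketch(embeddings, "cat", "car")
```

Unlike the distance measures in the dissimilarity module, higher values here mean more similar vectors, and the result depends only on direction, not on vector length.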

dstk.parameters.vector_similarity.geometric_measures.similarity.nearest_neighbors(embeddings: DataFrame, word: str, metric: str = 'cosine', n_words: int = 5, **kwargs) → list[Neighbor]

Returns the top N most semantically similar words to a given target word, based on the specified distance or similarity metric.

Parameters:
  • embeddings (DataFrame) – A dataframe containing the word embeddings.

  • word (str) – The target word to find neighbors for.

  • metric (str) – The distance or similarity metric to use (e.g., ‘cosine’, ‘euclidean’). Defaults to ‘cosine’.

  • n_words (int) – Number of nearest neighbors to return. Defaults to 5.

  • kwargs – Additional keyword arguments to pass to sklearn’s NearestNeighbors.

For more information, see: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

Returns:

A list of Neighbor namedtuples, one for each word close to the target word.

Return type:

list[Neighbor]
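
An exact search of this kind can be sketched as a brute-force ranking by cosine similarity. Note that this swaps in plain NumPy for the scikit-learn NearestNeighbors estimator mentioned above, so it is an illustration of the idea rather than the function's implementation. The `Neighbor` field names (`word`, `score`), the helper name, and the word-indexed dataframe layout are assumptions.

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Hypothetical field names; the real Neighbor namedtuple may differ.
Neighbor = namedtuple("Neighbor", ["word", "score"])

# Assumption: one row per word, indexed by the (unique) word itself.
embeddings = pd.DataFrame(
    [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0]],
    index=["cat", "kitten", "dog", "car"],
)


def nearest_neighbors_sketch(embeddings: pd.DataFrame, word: str, n_words: int = 5):
    """Rank every other word by cosine similarity to the target word."""
    matrix = embeddings.to_numpy(dtype=float)
    target = embeddings.loc[word].to_numpy(dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target)
    scores = matrix @ target / norms

    neighbors = []
    for i in np.argsort(-scores):  # highest similarity first
        label = embeddings.index[i]
        if label == word:  # skip the target itself
            continue
        neighbors.append(Neighbor(label, float(scores[i])))
        if len(neighbors) == n_words:
            break
    return neighbors


neighbors = nearest_neighbors_sketch(embeddings, "cat", n_words=2)
closest_words = [n.word for n in neighbors]  # "kitten" first, then "dog"
```

Because every row is compared with the target, the cost grows linearly with vocabulary size, which is the situation the approximate search is meant to address.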

Module contents