dstk.corpus package

Submodules

dstk.corpus.analysis module

This module provides standard tools for lexical and corpus linguistic analysis. It simplifies common tasks such as calculating word frequencies, identifying collocations (frequently occurring neighbor words), generating concordances to examine context, and extracting unique vocabularies from annotated text data.

Core functionalities include:

Frequency Analysis: Counting occurrences of words and returning them in a structured pandas DataFrame for easy analysis or visualization.
Concordance Generation: Extracting specific search terms along with their surrounding context to study usage patterns.
Collocation Extraction: Identifying the most frequent pairs of neighboring words within a specified window.
Vocabulary Filtering: Isolating the unique items in a word sequence and sorting them alphabetically.

This module turns annotated text into frequency tables, concordances, collocation lists, and vocabularies for research in digital humanities and computational linguistics.

dstk.corpus.analysis.get_collocations(words: list[Word], number: int = 20, window_size: int = 2) → list[tuple[str, str]]

Return the most frequent collocations in a sequence of words.

Parameters:

words (list[Word]) – A sequence of word objects.
number (int) – The maximum number of collocations to return.
window_size (int) – The number of neighboring words to consider.

Returns:

A list of collocations.

Return type:

list[tuple[str, str]]
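
To illustrate what window-based pair counting involves, here is a minimal standard-library sketch. It is not dstk's implementation: the helper name top_cooccurrences is hypothetical, tokens are plain strings rather than Word objects, and both the interpretation of window_size and the ranking by raw count are assumptions.

```python
from collections import Counter


def top_cooccurrences(tokens, number=20, window_size=2):
    """Hypothetical helper: rank ordered token pairs by co-occurrence count.

    Assumption: a pair (a, b) is counted when b follows a within a span of
    `window_size` tokens, so window_size=2 counts adjacent pairs only.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Tokens at most window_size - 1 positions to the right of `left`.
        for right in tokens[i + 1 : i + window_size]:
            counts[(left, right)] += 1
    # most_common sorts pairs by descending count.
    return [pair for pair, _ in counts.most_common(number)]


tokens = "the red car passed the red bus near the red car".split()
pairs = top_cooccurrences(tokens, number=2, window_size=2)
```

Here pairs holds the two most frequent adjacent pairs. Widening window_size to 3 would also count pairs separated by one intervening token.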

dstk.corpus.analysis.get_concordances(document: Document, text: str, **kwargs) → list[Concordance]

Find all occurrences of a specific string within a document and return their context.

Parameters:

document (Document) – The Stanza Document to search.
text (str) – The substring to search for.
kwargs – Additional arguments passed to the underlying nltk.Text.concordance_list method.

Returns:

A list of Concordance objects containing left and right context.

Return type:

list[Concordance]
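
The idea of a concordance (each match together with its left and right context) can be sketched with the standard library alone. The names KwicLine and keyword_in_context are hypothetical, the input is a plain token list rather than a Stanza Document, and matching whole tokens case-insensitively is an assumption; dstk delegates the actual work to NLTK.

```python
from typing import NamedTuple


class KwicLine(NamedTuple):
    """Hypothetical stand-in for a concordance entry."""
    left: list[str]
    query: str
    right: list[str]
    offset: int


def keyword_in_context(tokens, query, context=3):
    """Collect every case-insensitive token match with its neighbors."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == query.lower():
            left = tokens[max(0, i - context) : i]
            right = tokens[i + 1 : i + 1 + context]
            lines.append(KwicLine(left, token, right, i))
    return lines


tokens = "The cat sat on the mat while the dog slept".split()
hits = keyword_in_context(tokens, "the", context=2)
```

Each entry in hits records the matched token, its position, and up to two tokens of context on either side.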

dstk.corpus.analysis.get_vocabulary(words: Sequence[Word]) → list[Word]

Return the unique words in a sequence, sorted alphabetically.

Parameters:

words (Sequence[Word]) – A sequence of word objects.

Returns:

The vocabulary of the sequence.

Return type:

list[Word]
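
As a rough sketch of deduplicating and sorting word objects, the following uses a hypothetical Token class in place of Word. Comparing words by their surface text, case-sensitively, is an assumption; dstk may compare on a different attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a Word object with a `text` attribute."""
    text: str


def unique_sorted(tokens):
    """Keep the first token seen for each distinct text, then sort by text."""
    first_seen = {}
    for token in tokens:
        first_seen.setdefault(token.text, token)
    return sorted(first_seen.values(), key=lambda t: t.text)


tokens = [Token(t) for t in "to be or not to be".split()]
vocabulary = [t.text for t in unique_sorted(tokens)]
```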

dstk.corpus.analysis.word_frequency(words: list[Word]) → DataFrame

Calculate and return the frequency of each word as a pandas DataFrame.

Parameters:

words (list[Word]) – A sequence of word objects.

Returns:

A DataFrame with “Word” and “Frequency” columns.

Return type:

DataFrame
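
The counting step behind such a table can be sketched with collections.Counter. The helper frequency_rows is hypothetical and works on plain strings; ordering rows by descending frequency is an assumption, since the ordering of the returned DataFrame is not documented here.

```python
from collections import Counter


def frequency_rows(tokens):
    """Return rows with "Word" and "Frequency" keys, most frequent first."""
    counts = Counter(tokens)
    return [{"Word": w, "Frequency": n} for w, n in counts.most_common()]


rows = frequency_rows("to be or not to be".split())
# With pandas installed, pandas.DataFrame(rows) would build the
# two-column table from these rows.
```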

dstk.corpus.annotation module

This module provides tools for annotating and processing text corpora to extract linguistic information. It acts as a unified interface for multi-language NLP tasks, supporting both Stanza and spaCy back-ends to process raw text into structured data including part-of-speech (POS) tags, lemmas, and word stems.

Core functionalities include:

Processing raw text strings into structured Document objects containing linguistic metadata.
Automatic stemming for a wide range of languages using the Snowball algorithm.
Integration with Stanza and spaCy pipelines to perform tasks like NER, dependency parsing, and morphology.
Exporting processed results into the standard CoNLL-U format for further research.
Importing existing CoNLL-U files from local directories for analysis.

The module is designed to streamline the transition from raw text to structured linguistic data for use in digital humanities and computational linguistics projects.

class dstk.corpus.annotation.StemmerProcessor(device, config, pipeline)

Bases: Processor

process(doc: Document) → Document

Execute the stemming step for a given document.

Parameters:

doc (Document) – The Document to be processed.

Returns:

The processed Document.

Return type:

Document

dstk.corpus.annotation.annotate_corpus(corpus: list[str], language_model: str, output_dir: str | None = None, document_names: list[str] | None = None, max_length: int = 1000000, processors: str = 'tokenize,mwt,pos,lemma,stem,depparse,ner,sentiment,constituency', **kwargs) → dict[str, Document]

Annotate a collection of texts using the specified language model.

Parameters:

corpus (list[str]) – A list of raw strings to be processed.
language_model (str) – The identifier or instance of the language model.
output_dir (str | None) – Optional path where results should be saved.
document_names (list[str] | None) – Optional list of names for each item in the corpus.
max_length (int) – Maximum allowed character length per text.
processors (str) – Comma-separated string of processing steps to include.

Returns:

A mapping of document names to annotated Document objects.

Return type:

DocumentIndex (dict[str, Document])

dstk.corpus.annotation.load_annotations(input_dir: str, filenames: list[str] | None = None) → dict[str, Document]

Load CoNLL-U files from a directory and map them to their filenames.

Parameters:

input_dir (str) – Path to the directory containing .conllu files.
filenames (list[str] | None) – Optional list of files to include, matched by filename without the extension.

Returns:

A dictionary mapping filenames to annotated Document objects.

Return type:

DocumentIndex (dict[str, Document])
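
The directory scan and filename filter described above might look roughly like the sketch below. It uses only the standard library and returns raw field lists instead of Document objects. The helpers find_conllu_files and read_conllu_rows are hypothetical, and the sample file names and contents exist only for this demonstration.

```python
import tempfile
from pathlib import Path

# A one-sentence CoNLL-U sample: 10 tab-separated fields per token line.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def find_conllu_files(input_dir, filenames=None):
    """Map file stems to paths for the .conllu files in `input_dir`.

    Assumption: `filenames` holds names without the .conllu extension.
    """
    wanted = None if filenames is None else set(filenames)
    found = {}
    for path in sorted(Path(input_dir).glob("*.conllu")):
        if wanted is None or path.stem in wanted:
            found[path.stem] = path
    return found


def read_conllu_rows(path):
    """Split each token line into its tab-separated CoNLL-U fields."""
    rows = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Skip blank sentence separators and "#" comment lines.
        if line and not line.startswith("#"):
            rows.append(line.split("\t"))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    for name in ("essay.conllu", "letter.conllu", "notes.txt"):
        Path(tmp, name).write_text(SAMPLE, encoding="utf-8")
    everything = find_conllu_files(tmp)
    index = find_conllu_files(tmp, filenames=["essay"])
    rows = read_conllu_rows(index["essay"])
```

Without a filter, both .conllu files are found and notes.txt is ignored; with filenames=["essay"], only essay.conllu is kept.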
