pub struct Ngrams {
    pub anchor: Option<String>,
    pub window_size: usize,
    pub n: usize,
    pub counts: HashMap<String, usize>,
}
Ngrams are dictionaries with
Fields
anchor: Option<String>
anchor word that must be present in all ngram contexts (in their window)
window_size: usize
if an anchor word is given, word window size, applied to the left and to the right of the anchor word
n: usize
n-grams for a sequence of n words
counts: HashMap<String, usize>
statistics hashmap for the occurrence counts
Implementations
impl Ngrams
pub fn sorted(&self) -> Vec<(&String, usize)>
obtain the ngram report, sorted by descending frequency
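A minimal sketch of how such a report could be produced from a `counts` map. The free function `sorted_by_count` is hypothetical, and the alphabetical tie-break is an assumption added only to make the order deterministic; the crate's own `sorted` may order ties differently.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a "sorted by descending frequency" report.
/// Returns (ngram, count) pairs, highest count first.
fn sorted_by_count(counts: &HashMap<String, usize>) -> Vec<(&String, usize)> {
    let mut report: Vec<(&String, usize)> =
        counts.iter().map(|(ngram, count)| (ngram, *count)).collect();
    // Descending by count; ties broken alphabetically (an assumption,
    // not confirmed behavior of the documented method).
    report.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    report
}
```

The returned pairs borrow the keys from the map, which matches the `Vec<(&String, usize)>` return type shown in the signature.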
pub fn distinct_count(&self) -> usize
get the number of distinct ngrams recorded
pub fn add_content(&mut self, content: &str)
add content for ngram analysis, typically a paragraph or a line of text
pub fn add_anchored_content(&mut self, content: &str)
For a given window size W, the word at index i is eligible to participate in the ngrams if an anchor word occurs within the word range [i-W, i+W]. The eligible regions can be highly irregular, e.g. “word word anchor word anchor word word”, so the content is scanned for cutoffs where eligibility ends, and each continuous run of eligible words is recorded for ngram counts.
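The window rule can be sketched as a two-pass scan: first mark every index within W words of an anchor, then cut the word sequence wherever the marking stops. The helper `anchored_runs` is hypothetical, and exact, case-sensitive anchor matching is an assumption; the crate may normalize words differently.

```rust
/// Hypothetical helper: split `words` into maximal runs in which every
/// word lies within `window` words of some occurrence of `anchor`.
fn anchored_runs<'a>(words: &[&'a str], anchor: &str, window: usize) -> Vec<Vec<&'a str>> {
    // Pass 1: mark each index i that has an anchor in [i - window, i + window].
    let mut eligible = vec![false; words.len()];
    for (i, w) in words.iter().enumerate() {
        if *w == anchor {
            let start = i.saturating_sub(window);
            let end = (i + window).min(words.len() - 1);
            for flag in &mut eligible[start..=end] {
                *flag = true;
            }
        }
    }
    // Pass 2: collect continuous eligible runs, cutting where eligibility ends.
    let mut runs = Vec::new();
    let mut current = Vec::new();
    for (w, ok) in words.iter().zip(&eligible) {
        if *ok {
            current.push(*w);
        } else if !current.is_empty() {
            runs.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        runs.push(current);
    }
    runs
}
```

With W = 1, the example “word word anchor word anchor word word” yields a single run covering indices 1 through 5, because the windows around the two anchors overlap. Each run would then be counted for ngrams on its own, so no ngram spans a cutoff.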
pub fn record_words(&mut self, words: Vec<&str>)
Take an arbitrarily long vector of words, and record all (overlapping) ngrams obtainable from it
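Overlapping ngrams can be illustrated with the standard library's `slice::windows`. The function `record_ngrams` below is a hypothetical sketch, and joining the words of an ngram with a single space to form the map key is an assumption about the key format.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: count every overlapping n-word sequence in `words`.
fn record_ngrams(counts: &mut HashMap<String, usize>, words: &[&str], n: usize) {
    // `windows` panics on a size of 0, so guard against it explicitly.
    if n == 0 {
        return;
    }
    // Yields nothing when `words` holds fewer than `n` words.
    for window in words.windows(n) {
        // Key format (space-joined words) is an assumption.
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
}
```

For the words `a b a b c` and n = 2, this records `a b` twice and `b a` and `b c` once each, giving three distinct ngrams.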