pub fn extract(
    corpus_path: String,
    token_model_filepath: String,
    discard_math: bool
) -> Result<HashMap<String, u64>, Box<dyn Error>>

Traverses a latexml-style HTML5 document corpus in parallel. The directory walk is based on jwalk, and each document is processed with DNMParameter::llamapun_normalization, with additional subformula lexemes obtained via dnm::node::lexematize_math.
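The return type, `HashMap<String, u64>`, suggests that the parallel workers' per-token counts are merged into a single frequency table. The sketch below illustrates only that count-and-merge pattern, under stated assumptions: it uses `std::thread::scope` from the standard library instead of jwalk, and plain whitespace splitting with lowercasing instead of llamapun normalization and math lexematization. The names `count_tokens` and `count_corpus_parallel` are hypothetical and not part of this crate.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical helper: counts whitespace-separated, lowercased tokens in one document.
/// This stands in for the real normalization, which is not reproduced here.
fn count_tokens(text: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper: splits `docs` into chunks, counts each chunk on its own
/// scoped thread, then merges the partial tables into one frequency map.
fn count_corpus_parallel(docs: &[String], workers: usize) -> HashMap<String, u64> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so that `chunks` never receives 0.
    let chunk_size = ((docs.len() + workers - 1) / workers).max(1);

    let partials: Vec<HashMap<String, u64>> = thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    let mut local: HashMap<String, u64> = HashMap::new();
                    for doc in chunk {
                        for (token, n) in count_tokens(doc) {
                            *local.entry(token).or_insert(0) += n;
                        }
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });

    // Sequential merge of the per-thread tables.
    let mut total = HashMap::new();
    for partial in partials {
        for (token, n) in partial {
            *total.entry(token).or_insert(0) += n;
        }
    }
    total
}
```

Keeping a local table per worker and merging once at the end avoids locking a shared map on every token. Whether `extract` is implemented this way is not stated in this page.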