pub fn extract(
    corpus_path: String,
    token_model_filepath: String,
    discard_math: bool,
) -> Result<HashMap<String, u64>, Box<dyn Error>>
Parallel traversal of latexml-style HTML5 document corpora, based on jwalk and DNMParameter::llamapun_normalization, with additional subformula lexemes via dnm::node::lexematize_math.
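
The following is a minimal sketch of how a caller might consume the returned map, not the crate's own example. It assumes the `HashMap<String, u64>` pairs each lexeme with a count and that `discard_math` drops math lexemes; neither is stated above. `extract_stub` is a hypothetical stand-in with the same signature that returns fixed, made-up data, and `sorted_by_count`, the paths, and the `NUM` lexeme are likewise illustrative.

```rust
use std::collections::HashMap;
use std::error::Error;

// Hypothetical stand-in with the same signature as `extract`. It reads no
// files and returns fixed, made-up data purely for illustration.
fn extract_stub(
    _corpus_path: String,
    _token_model_filepath: String,
    discard_math: bool,
) -> Result<HashMap<String, u64>, Box<dyn Error>> {
    let mut counts = HashMap::new();
    counts.insert("the".to_string(), 3);
    counts.insert("proof".to_string(), 2);
    // Assumption: math lexemes are only present when `discard_math` is false.
    if !discard_math {
        counts.insert("NUM".to_string(), 1);
    }
    Ok(counts)
}

/// Sort entries by descending count, breaking ties alphabetically.
fn sorted_by_count(counts: &HashMap<String, u64>) -> Vec<(&str, u64)> {
    let mut entries: Vec<(&str, u64)> =
        counts.iter().map(|(k, v)| (k.as_str(), *v)).collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    entries
}

fn main() -> Result<(), Box<dyn Error>> {
    // Placeholder paths; the stub ignores them.
    let counts = extract_stub(
        "corpus/".to_string(),
        "token_model.txt".to_string(),
        false,
    )?;
    for (lexeme, count) in sorted_by_count(&counts) {
        println!("{lexeme}\t{count}");
    }
    Ok(())
}
```

Because the error type is `Box<dyn Error>`, the `?` operator in `main` propagates any failure without further conversion. Sorting into a `Vec` is needed for stable output, since `HashMap` iteration order is unspecified.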