Resiliparse CLI
The resiliparse command-line utility provides tools for maintaining and benchmarking Resiliparse. At the moment, these tools are aimed primarily at developers of Resiliparse. General-purpose tools geared towards users of the library may be added later.
To install the Resiliparse CLI tool, specify the cli flag (and optionally the cli-benchmark flag for any third-party benchmarking dependencies) in your pip install command:
$ pip install 'resiliparse[cli]'
Once installed, run resiliparse [COMMAND] --help for detailed help listings.
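For example, to install the CLI together with the optional benchmarking dependencies and display help listings (the combined-extras syntax is standard pip; the extras names are taken from the flags described above):

```shell
# Install the CLI with the optional third-party benchmark dependencies
pip install 'resiliparse[cli,cli-benchmark]'

# Show the top-level help listing
resiliparse --help

# Show detailed help for a subcommand, e.g., the HTML module tools
resiliparse html --help
```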
Top-Level Commands
The following is a short listing of the top-level commands:
resiliparse
Resiliparse Command Line Interface.
resiliparse [OPTIONS] COMMAND [ARGS]...
Commands
- encoding
Encoding module tools.
- html
HTML module tools.
- lang
Language module tools.
Full Command Listing
Below is a full description of all available commands:
resiliparse
Resiliparse Command Line Interface.
resiliparse [OPTIONS] COMMAND [ARGS]...
encoding
Encoding module tools.
resiliparse encoding [OPTIONS] COMMAND [ARGS]...
download-whatwg-mapping
Download WHATWG encoding mapping.
Download the current WHATWG encoding mapping, parse and transform it, and then print it as a copyable Python dict.
resiliparse encoding download-whatwg-mapping [OPTIONS]
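Since the command prints the mapping as a copyable Python dict, its output can simply be redirected into a file (the target filename here is a made-up example):

```shell
# Print the parsed WHATWG encoding mapping and save it to a file
resiliparse encoding download-whatwg-mapping > whatwg_mapping.py
```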
html
HTML module tools.
resiliparse html [OPTIONS] COMMAND [ARGS]...
benchmark
Benchmark Resiliparse HTML parser.
Benchmark Resiliparse HTML parsing by extracting the titles from all HTML pages in a WARC file.
You can compare the performance to Selectolax (both the old MyHTML and the new Lexbor engine) and BeautifulSoup4 by installing the PyPI packages selectolax and beautifulsoup4. Install Resiliparse with the cli-benchmark flag to install all optional third-party dependencies automatically.
See Resiliparse HTML Parser Benchmarks for more details and example benchmarking results.
resiliparse html benchmark [OPTIONS] WARC_FILE
Arguments
- WARC_FILE
Required argument
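A sketch of a typical invocation (the WARC filename is a placeholder):

```shell
# Benchmark HTML title extraction on all pages in the given WARC file
resiliparse html benchmark crawl-data.warc.gz
```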
lang
Language module tools.
resiliparse lang [OPTIONS] COMMAND [ARGS]...
benchmark
Benchmark Resiliparse against FastText and Langid.
Either package must be installed for this comparison. Install Resiliparse with the cli-benchmark flag to install all optional third-party dependencies automatically.
resiliparse lang benchmark [OPTIONS] INFILE
Options
- -r, --rounds <rounds>
Number of rounds to benchmark
- Default:
10000
- -f, --fasttext-model <fasttext_model>
FastText model to benchmark
Arguments
- INFILE
Required argument
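A sketch of a benchmark run using the options listed above (the input file and FastText model path are placeholders):

```shell
# Benchmark language detection on the given input file with 5000 rounds,
# additionally comparing against a FastText model
resiliparse lang benchmark -r 5000 -f lid.176.bin examples.txt
```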
create-dataset
Create a language detection dataset.
Create a language detection dataset from a set of extracted Wikipedia article dumps.
The expected input is a directory containing one subdirectory per language (named after the language, e.g. “en” or “enwiki”), each with any number of subdirectories and wiki_* plaintext files. Use Wikiextractor for creating the plaintext directories for each language.
Empty lines and <doc> tags will be stripped from the plaintext; otherwise, the texts are expected to be clean already.
The created dataset will consist of one directory for each language, each containing three files for train, validation, and test with one example per line. The order of the lines is randomized.
resiliparse lang create-dataset [OPTIONS] INDIR OUTDIR
Options
- --val-size <val_size>
Portion of the data to use for validation
- --test-size <test_size>
Portion of the data to use for testing
- --min-examples <min_examples>
Minimum number of examples per language
- Default:
10000
- -j, --jobs <jobs>
Parallel jobs
Arguments
- INDIR
Required argument
- OUTDIR
Required argument
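A sketch of a dataset build using the options above. The input and output directory names are placeholders, and the assumption that --val-size and --test-size take fractional portions (e.g. 0.05 for 5%) is inferred from the option descriptions, not confirmed:

```shell
# Build a dataset with 5% validation and 5% test splits, keeping only
# languages with at least 10000 examples, using 8 parallel jobs
resiliparse lang create-dataset --val-size 0.05 --test-size 0.05 \
    --min-examples 10000 -j 8 wikiextractor-output/ dataset/
```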
download-wiki-dumps
Download Wikipedia dumps for language detection.
Download the first Wikipedia article multistream part for each of the specified languages.
The downloaded dumps can then be extracted with Wikiextractor.
resiliparse lang download-wiki-dumps [OPTIONS] DUMPDATE
Options
- -l, --langs <langs>
Comma-separated list of languages to download
- -o, --outdir <outdir>
Output directory
- -j, --jobs <jobs>
Parallel download jobs (3 is the Wikimedia rate limit)
Arguments
- DUMPDATE
Required argument
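A sketch of a download run (the dump date and output directory are placeholders; Wikipedia dump dates follow the YYYYMMDD convention):

```shell
# Download the first multistream part of the English and German Wikipedia
# dumps for the given dump date, staying within the Wikimedia rate limit
resiliparse lang download-wiki-dumps -l en,de -o dumps/ -j 3 20230101
```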
evaluate
Evaluate language prediction performance.
resiliparse lang evaluate [OPTIONS] INDIR
Options
- -s, --split <split>
Which input split to use
- Default:
val
- Options:
val | test
- -l, --langs <langs>
Restrict languages to this comma-separated list
- -c, --cutoff <cutoff>
Prediction cutoff
- Default:
1200
- -t, --truncate <truncate>
Truncate examples to this length
- -f, --fasttext-model <fasttext_model>
Use the specified FastText model for samples above cutoff
- --sort-lang
Sort by language instead of F1
- --print-cm
Print confusion matrix (may be very big)
Arguments
- INDIR
Required argument
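A sketch of an evaluation run using the options above (the dataset directory is a placeholder):

```shell
# Evaluate on the test split, restricted to three languages,
# truncating each example to 700 characters
resiliparse lang evaluate -s test -l en,de,fr -t 700 dataset/
```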
train-vectors
Train and print vectors for fast language detection.
Expects the following directory structure:
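The structure listing itself is missing here. Judging from the dataset layout produced by create-dataset above (one directory per language containing train, validation, and test files), it is presumably along these lines; the exact file names are assumptions:

```
INDIR/
├── en/
│   ├── train
│   ├── val
│   └── test
├── de/
│   └── ...
└── ...
```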
resiliparse lang train-vectors [OPTIONS] INDIR
Options
- -s, --split <split>
Which input split to use
- Default:
train
- Options:
train | test | val
- -f, --out-format <out_format>
Output format (raw vectors or C code)
- Default:
raw
- Options:
raw | c
- -l, --vector-size <vector_size>
Output vector size
- Default:
256
Arguments
- INDIR
Required argument
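A sketch of a training run using the options above (the dataset directory is a placeholder):

```shell
# Train 256-dimensional language vectors on the train split
# and print them as C code
resiliparse lang train-vectors -f c -l 256 dataset/
```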