Show HN: WP2TXT – Wikipedia dump text extractor with category/section filtering

3 points

4 months ago

WP2TXT is a command-line tool that extracts plain text from Wikipedia dump files. I originally built it in 2006 for corpus linguistics research and have maintained it ever since. The latest version (2.1) is largely a rewrite and adds features for selective extraction:

- Auto-download dumps by language code (350+ languages)
- Extract specific articles by title without downloading the full dump
- Extract articles from a Wikipedia category with subcategory recursion
- Extract specific sections by name with alias matching (e.g., "Plot" also matches "Synopsis")
- Template expansion (dates, coordinates, unit conversions → readable text)
- Content type markers ([MATH], [TABLE], etc.) instead of silent removal
- Category metadata preserved in output
- JSON/JSONL output
- Parallel processing (English Wikipedia 24 GB dump: ~2 hours on Apple M4)
- Written in Ruby
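
To give a rough idea of what section extraction with alias matching might look like, here is a minimal Ruby sketch. It is not WP2TXT's actual code: the `SECTION_ALIASES` table, its entries, and the `split_sections` / `extract_section` helpers are all hypothetical names made up for illustration, and it only handles level-2 headings (`== Heading ==`) in raw wikitext.

```ruby
# Hypothetical alias table: canonical section name => alternative headings.
# Illustrative entries only; this is not WP2TXT's real alias list.
SECTION_ALIASES = {
  "plot" => ["synopsis", "plot summary"]
}.freeze

# Split raw wikitext into [heading, body] pairs on level-2 headings.
# Deeper headings (=== Sub ===) stay inside the body of their parent section.
# Text before the first heading (the lead) is ignored.
def split_sections(wikitext)
  sections = []
  heading = nil
  body = []
  wikitext.each_line do |line|
    if (m = line.match(/\A==\s*([^=].*?)\s*==\s*\z/))
      sections << [heading, body.join.strip] if heading
      heading = m[1]
      body = []
    elsif heading
      body << line
    end
  end
  sections << [heading, body.join.strip] if heading
  sections
end

# Return every section whose heading equals the requested name or one of
# its aliases, compared case-insensitively.
def extract_section(wikitext, name)
  key = name.downcase
  wanted = [key] + SECTION_ALIASES.fetch(key, [])
  split_sections(wikitext).select { |heading, _| wanted.include?(heading.downcase) }
end

sample = <<~WIKI
  Intro text.
  == Synopsis ==
  A detective investigates.
  == Cast ==
  Some actors.
WIKI

p extract_section(sample, "Plot")
# => [["Synopsis", "A detective investigates."]]
```

Asking for "Plot" returns the "Synopsis" section because the lookup goes through the alias table first. A real implementation would presumably also need to handle nested heading levels and per-language aliases.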

No comments
