This is a problem that I run into regularly when working on toy projects. I want to crawl some small site, download the pages, and extract the textual content from each page (i.e. cut out headers, footers, ads, etc.). Something that works 95% of the time would be good enough for my toy projects.
I feel like there should be an open-source library that I can just `pip install` to solve the problem, especially in 2023, in the age of AI.
My guess is that a simple regex/rule-based solution with community-maintained rules (like how ad blockers work) would be good enough to hit 95%, and a deep NN could be nearly perfect.
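As a rough illustration of the rule-based idea, a minimal sketch using only Python's standard library might look like the following. The tag list, the class/id regex, and the names `TextExtractor` and `extract_text` are assumptions made up for this example; they are not an existing library's API or a maintained rule set.

```python
import re
from html.parser import HTMLParser

# Hypothetical rules, for illustration only: tags that rarely hold main content.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside", "form"}
# Hypothetical rule: class/id tokens that usually mark boilerplate.
SKIP_ATTR = re.compile(
    r"(^|[\s_-])(ad|ads|advert|banner|cookie|footer|header|menu|nav|promo|sidebar|social)([\s_-]|$)",
    re.I,
)
# Void elements never get a closing tag, so they must not start a skipped region.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Tags that should produce a line break in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6",
              "section", "article", "blockquote", "pre"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_tag = None   # tag name that opened the region being skipped
        self._skip_depth = 0    # nesting depth of that same tag name
        self._chunks = []

    def _is_boilerplate(self, tag, attrs):
        if tag in SKIP_TAGS:
            return True
        return any(
            name in ("class", "id") and value and SKIP_ATTR.search(value)
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if self._skip_tag is not None:
            # Inside a skipped region: only count nested tags of the same name.
            if tag == self._skip_tag:
                self._skip_depth += 1
            return
        if tag not in VOID_TAGS and self._is_boilerplate(tag, attrs):
            self._skip_tag, self._skip_depth = tag, 1
            return
        if tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if self._skip_tag is not None:
            if tag == self._skip_tag:
                self._skip_depth -= 1
                if self._skip_depth == 0:
                    self._skip_tag = None
            return
        if tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_tag is None:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text(page_html)` on a downloaded page returns the remaining text, one block per line. Rules this crude will misfire, for example by dropping a `<header>` that holds the article title, which is exactly where a community-maintained rule set would have to do the real work.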
Also, there are obviously lots of services, some of them free like archive.is, that must have such a module in their backend.
But as far as I can tell there is no such library. What am I missing?