Ask HN: Library for extracting the contents of a webpage?

1 point

3 years ago

This is a problem that I run into regularly when working on toy projects. I want to crawl some small site, download the pages, and extract the textual content from each page (i.e., cut out headers, footers, ads, etc.). Something that works 95% of the time would be good enough for my purposes.

I feel like there should be an open-source library that I can just `pip install` to solve the problem, especially in 2023, in the age of AI.

My guess is that a simple regex/rule-based solution with community-maintained rules (like how ad blockers work) would be good enough to hit 95%, and a deep NN could be nearly perfect.
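
To make the rule-based idea concrete, here is a rough sketch of what I have in mind, using only Python's standard-library `html.parser`. The rule sets (`SKIP_TAGS`, `SKIP_HINTS`) and the names `TextExtractor` and `extract_text` are made up for illustration; a real version would load community-maintained rules instead of hard-coding them, and I have no idea whether rules this simple would actually reach 95%.

```python
from html.parser import HTMLParser

# Hypothetical rules: tags whose entire contents are dropped.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside", "form"}
# Hypothetical rules: substrings of class/id values that mark boilerplate.
SKIP_HINTS = ("sidebar", "advert", "cookie", "banner", "promo")
# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextExtractor(HTMLParser):
    """Collects text that is not inside an element matched by the rules."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def _is_boilerplate(self, tag, attrs):
        if tag in SKIP_TAGS:
            return True
        label = " ".join(v for k, v in attrs if k in ("class", "id") and v).lower()
        return any(hint in label for hint in SKIP_HINTS)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a skipped element, count every nested open tag.
        if self.skip_depth or self._is_boilerplate(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Assumes balanced markup; unclosed tags such as a bare <li>
        # inside a skipped region would throw the depth count off.
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            text = " ".join(data.split())  # collapse whitespace
            if text:
                self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `extract_text(page_html)` on a downloaded page returns the remaining text, one chunk per line. The depth counter is the weak point: it only works on reasonably well-formed HTML, which is part of why I would rather use an existing library than maintain this myself.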

Also, obviously there are lots of services, some of them free like archive.is, that must have such a module in the backend.

But as far as I can tell there is no such library. What am I missing?

No comments
