Text-based RAG does badly on visually rich documents like pitch decks or company reports.
We struggled with countless hacks to support all kinds of graphs, tables, and other random elements in our RAG pipeline until we realized that this is fundamentally the wrong way to approach the problem.
There were two key bottlenecks: visual elements make the extracted text a mess, which leads to (1) poor retrieval and (2) poor understanding by the LLM. Instead of handling each corner case, we've developed a RAG pipeline that treats each document as both image and text. This leads to a dramatic reduction in model size (an 8B model outperforms a 70B one) and a moderate improvement in quality compared to the current SOTA.
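To make the "both image and text" idea concrete, here is a minimal sketch of one way retrieval over the two views could work. It is an illustration, not the pipeline described above: the `Page` structure, the `retrieve` function, the `alpha` weight, and the assumption that each page already has a text embedding and a page-image embedding are all hypothetical, and the weighted-sum scoring is just one simple way to fuse the two signals.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Page:
    doc_id: str
    page_no: int
    text_vec: list[float]   # hypothetical: embedding of the extracted text
    image_vec: list[float]  # hypothetical: embedding of the rendered page image


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_text_vec, query_image_vec, pages, k=3, alpha=0.5):
    """Rank pages by a weighted sum of text and image similarity.

    query_text_vec / query_image_vec are the same query embedded in the
    text space and in the image space (assumed to be done elsewhere).
    alpha=1.0 reduces to text-only retrieval, alpha=0.0 to image-only.
    """
    scored = [
        (
            alpha * cosine(query_text_vec, p.text_vec)
            + (1 - alpha) * cosine(query_image_vec, p.image_vec),
            p,
        )
        for p in pages
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]
```

In a setup like this, the top-ranked pages would then be handed to a vision-capable LLM as rendered images together with their extracted text, so a chart or table that garbles the text layer can still be matched and read through the image view.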