A few times now I've needed to remove duplicate values from my data. Usually the data are higher-level objects like images or text blobs, and I end up writing a custom deduplication pipeline every time.
I got sick of doing this over and over, so I wrote a wrapper around RocksDB that deduplicates values after a Put() operation. Currently only exact deduplication is performed, but I want to extend it in a number of ways, including semantic (fuzzy) deduplication for things like images and text.
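To illustrate what exact deduplication means here, below is a minimal Python sketch. It is not the wrapper's actual code and does not use RocksDB: plain dicts stand in for the store, and the `DedupStore` class, its method names, and the choice of SHA-256 with reference counting are all assumptions made for illustration. The idea is that each value is stored once under its content hash, and keys only point at that hash.

```python
import hashlib


class DedupStore:
    """Toy in-memory stand-in for a deduplicating key-value store (hypothetical)."""

    def __init__(self):
        self._key_to_digest = {}  # user key -> content digest
        self._blobs = {}          # content digest -> value bytes (stored once)
        self._refcounts = {}      # content digest -> number of keys pointing at it

    def put(self, key: bytes, value: bytes) -> None:
        digest = hashlib.sha256(value).digest()
        old = self._key_to_digest.get(key)
        if old == digest:
            return  # same key, same content: nothing to do
        if old is not None:
            self._release(old)  # key is being overwritten with new content
        if digest not in self._blobs:
            self._blobs[digest] = value  # first time this content is seen
        self._refcounts[digest] = self._refcounts.get(digest, 0) + 1
        self._key_to_digest[key] = digest

    def get(self, key: bytes):
        digest = self._key_to_digest.get(key)
        return None if digest is None else self._blobs[digest]

    def _release(self, digest: bytes) -> None:
        # Drop the stored value once no key references it any more.
        self._refcounts[digest] -= 1
        if self._refcounts[digest] == 0:
            del self._refcounts[digest]
            del self._blobs[digest]

    def unique_values(self) -> int:
        return len(self._blobs)


store = DedupStore()
store.put(b"img-1", b"same bytes")
store.put(b"img-2", b"same bytes")  # duplicate content, stored only once
```

Semantic (fuzzy) deduplication would need a different lookup than an exact hash match, so this sketch only covers the exact case.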
Any feedback on the project would be appreciated: