I need to scale a TB-scale hash table in a distributed environment. The typical use case is looking records up by hash; the hash is not cryptographically strong and the data contains a fair amount of redundancy, so collisions will be the norm. Data will be write-once, read-many, and queries will be random.
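To make the access pattern concrete, here is a minimal single-process sketch of what I mean by a lookup where collisions are the norm: one hash maps to every record that produced it. The names (`weak_hash`, `CollidingTable`, `HASH_BITS`) and the choice of a truncated CRC32 are assumptions for illustration only, not my actual schema or hash function.

```python
import zlib
from collections import defaultdict

# Assumption: a deliberately small, weak hash so that collisions are common.
HASH_BITS = 16


def weak_hash(record: bytes) -> int:
    """Non-cryptographic hash (CRC32) truncated to HASH_BITS bits."""
    return zlib.crc32(record) & ((1 << HASH_BITS) - 1)


class CollidingTable:
    """Write-once, read-many table: each hash maps to all records sharing it."""

    def __init__(self) -> None:
        self._buckets = defaultdict(list)

    def insert(self, record: bytes) -> None:
        # Records are only ever appended, never updated or deleted.
        self._buckets[weak_hash(record)].append(record)

    def lookup(self, h: int) -> list:
        # Returns every colliding record; an empty list if the hash is unknown.
        return list(self._buckets.get(h, []))


table = CollidingTable()
for i in range(70_000):
    table.insert(str(i).encode())

# A single hash value typically returns several candidate records.
candidates = table.lookup(weak_hash(b"42"))
```

The distributed version would partition the buckets across nodes by hash, but each query would still touch an effectively random bucket.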
Currently I'm using an RDBMS on a single machine with RAID, and I/O is clearly the bottleneck.
I keep reading that Cassandra is optimized for writes more than for reads, and http://wiki.apache.org/cassandra/CassandraHardware says "Obviously there is no benefit to having more RAM than your hot data set", which suggests that reads benefit mainly from caching.
My question is, what would be the right "bigdata" choice for applications where caching is not really possible?
Is Cassandra's read performance limited by the read rate or by the amount of RAM (i.e. the ability to keep the keys in memory)?
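For context, this is the back-of-envelope reasoning behind "caching is not really possible". It is a rough sketch that assumes every record is equally likely to be queried (so there is no hot data set) and that each cache miss costs one disk read. The function names and all figures below are illustrative assumptions, not measurements from my system.

```python
def expected_hit_ratio(cache_bytes: float, dataset_bytes: float) -> float:
    """Steady-state cache hit ratio when all records are equally likely to be read."""
    if dataset_bytes <= 0:
        raise ValueError("dataset_bytes must be positive")
    return min(1.0, cache_bytes / dataset_bytes)


def disk_reads_per_second(queries_per_second: float, hit_ratio: float) -> float:
    """Queries that miss the cache; assumes one disk read per miss."""
    return queries_per_second * (1.0 - hit_ratio)


GIB = 1024 ** 3
TIB = 1024 ** 4

# Illustrative figures only: 64 GiB of cache in front of a 2 TiB data set.
ratio = expected_hit_ratio(64 * GIB, 2 * TIB)   # 0.03125
misses = disk_reads_per_second(10_000, ratio)   # 9687.5
```

Under these assumptions almost every query ends up on disk, which is why I am asking whether the disks or the RAM would be the limiting factor.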