I think you mean readline?
Information quantities are more meaningfully expressed in number of bits.
Having ~a quadrillion redundant bitstrings all mapping to NaN sounds pretty bad, but logarithmic/information utilization-wise, this is actually not too bad.
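To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes IEEE 754 binary64 (a NaN is any pattern with an all-ones exponent and a non-zero mantissa); the function names are made up for illustration.

```cpp
#include <cmath>
#include <cstdint>

// Number of binary64 bit patterns that decode to NaN:
// 11 exponent bits all set, any non-zero 52-bit mantissa, either sign.
constexpr std::uint64_t nan_patterns() {
    const std::uint64_t nonzero_mantissas = (std::uint64_t{1} << 52) - 1;
    return 2 * nonzero_mantissas;  // two sign values -> 2^53 - 2
}

// Bits of information carried by the remaining non-NaN patterns.
// 2^64 - nan = 2^64 * (1 - nan / 2^64), so log2 is 64 + log2(1 - nan / 2^64).
inline double usable_bits() {
    const double nan_fraction =
        std::ldexp(static_cast<double>(nan_patterns()), -64);  // about 2^-11
    return 64.0 + std::log2(1.0 - nan_fraction);
}
```

Under these assumptions the NaN patterns are about 1/2048 of the encoding space, so usable_bits() comes out at roughly 63.999 of the 64 bits: the loss is well under a thousandth of a bit.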
See "8087 Numeric Data Processor" page S-74: https://ethw.org/w/images/2/2f/Intel_8086_family_users_numer...
Unfortunately IEEE didn't bother specifying NaN propagation semantics so it ended up pretty useless.
A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand that can investigate subtle bugs and pathological perf behaviors.
but as usual there is an xkcd for that. https://xkcd.com/1205/
On one project I spent a bunch of time optimizing the write path of the I/O. It was just using standard fwrite, but staging items correctly was an easy 10x speed win. Those optimizations sometimes stack up and count big. It also had a few sharp edges, though, so use with care.
If the allocator returns a page to the kernel and then immediately asks for one back, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches predate decay-based purging and the background purging thread, both of which significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing-out patches optimize for the pathological behavior.
Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping is avoided. That saves not only the zeroing cost but also the TLB shootdown and other costs, without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".
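For anyone who hasn't used it, this is roughly what the lazy-reclaim hint looks like at the syscall level. It's a minimal sketch assuming Linux (MADV_FREE needs 4.5+) and anonymous private memory; map_anon and lazy_release are hypothetical helper names, and this is not how jemalloc itself is structured.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstring>

// Map `len` bytes of anonymous private memory; returns nullptr on failure.
inline void* map_anon(std::size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Tell the kernel the range's contents are no longer needed.
// With MADV_FREE the pages stay mapped and are only reclaimed under memory
// pressure; writing to a page again before that cancels the reclaim.
// Falls back to MADV_DONTNEED (eager reclaim) where MADV_FREE is unavailable.
inline bool lazy_release(void* p, std::size_t len) {
#ifdef MADV_FREE
    if (madvise(p, len, MADV_FREE) == 0) return true;
#endif
    return madvise(p, len, MADV_DONTNEED) == 0;
}
```

After lazy_release the old contents must be treated as garbage (a read may see the old data or zeros), but the range can simply be written to and reused with no new mmap call.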
However, the drawback is that system-level memory accounting becomes even more fuzzy.
(hi Alex!)
Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same, but the dataset being operated on at FB is most likely much larger than what L1+L2+L3 can fit. So I am wondering how much effect the vectorization might have had, since bulk-zeroing large datasets is going to be bottlenecked by single-core memory bandwidth anyway, which at the time was ~20GB/s.
Perhaps the operation became cheaper simply because of moving to another CPU uarch with a higher clock and larger memory bandwidth, rather than because of the vectorization.
In the original paper they do not give it any name: https://people.csail.mit.edu/rivest/Rsapaper.pdf
RE2 resets the cache when it reaches a (configurable) size limit. Which I found out the hard way when I had to debug almost-periodic latency spikes in a service I managed, where a very inefficient regex caused linear growth in the Lazy DFA, until it hit the limit, then all threads had to wait for its reset for a few hundred milliseconds, and then it all started again.
I'm not sure if dropping the whole cache is the only feasible mitigation, or whether some gradual pruning would also be possible.
Either way, if you cannot assume that your cache grows monotonically, synchronization becomes more complicated: the trick mentioned in the other comment about only locking the slow path may not be applicable anymore. RE2 uses RW-locking for this.
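To illustrate the shape of the problem, here is a toy sketch of a cache that is dropped wholesale at a size limit and guarded by a reader-writer lock. ResettingCache and its members are hypothetical, and this is not RE2's actual code, just the general pattern using std::shared_mutex.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Toy bounded cache: lookups take a shared lock, inserts take an exclusive
// lock and drop the whole cache once the size limit is reached.
class ResettingCache {
public:
    explicit ResettingCache(std::size_t max_entries) : max_entries_(max_entries) {}

    // Fast path: many readers may hold the shared lock at once.
    // Returns a copy, so no reference can outlive a later reset.
    std::optional<int> find(const std::string& key) const {
        std::shared_lock<std::shared_mutex> lock(mu_);
        auto it = map_.find(key);
        if (it == map_.end()) return std::nullopt;
        return it->second;
    }

    // Slow path: the exclusive lock blocks all readers while it is held,
    // which is where a latency spike comes from if the reset is expensive.
    void insert(const std::string& key, int value) {
        std::unique_lock<std::shared_mutex> lock(mu_);
        if (map_.size() >= max_entries_ && map_.find(key) == map_.end()) {
            map_.clear();  // full reset instead of gradual pruning
            ++resets_;
        }
        map_[key] = value;
    }

    std::size_t resets() const {
        std::shared_lock<std::shared_mutex> lock(mu_);
        return resets_;
    }

private:
    mutable std::shared_mutex mu_;
    std::unordered_map<std::string, int> map_;
    std::size_t max_entries_;
    std::size_t resets_ = 0;
};
```

Because entries can disappear, readers cannot keep pointers into the cache without holding the lock, which is why the lock-only-the-slow-path trick stops working once the cache is no longer append-only.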
The Rust version of the engine (https://github.com/ieviev/resharp) just returns an Error instead of falling back to an NFA. I think that should be a reasonable approach, but the library is still new, so I'm still waiting to see how it turns out and whether I had any oversights on this.
I'm not sure whether this applies to both RE2 and Rust, but some of Rust's internal engines appear to allocate a fixed buffer that they constantly re-create states into.
I'm not really familiar with the eviction technique of RE2, but I've done a lot of benchmark comparisons. A good way to really stress test RE2 is large Unicode classes: \w and \d in RE2 are ASCII-only, and I've noticed that Unicode (\p{class}) classes change the throughput of the engine very drastically.
One example is folly::SharedMutex, which is very battle-tested: https://uvdn7.github.io/shared-mutex/
There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require designing the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.
This is not well documented unfortunately, and I'm not aware of open-source implementations of this.
EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However, this definitely works for overall thread CPU time.
Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?
Tbh I thought clock_gettime was a vDSO-based virtual syscall anyway.
Yes, that's exactly what a seqlock (reader) is.
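For reference, a minimal seqlock sketch with C++ atomics. The TimePage layout and field names are made up for illustration (they are not the kernel's vDSO data page), a single writer is assumed, and the fence placement follows the usual pattern for seqlocks under the C++ memory model.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared page: a sequence counter plus the data it protects.
struct TimePage {
    std::atomic<std::uint32_t> seq{0};  // odd while a write is in progress
    std::atomic<std::uint64_t> base_ns{0};
    std::atomic<std::uint64_t> base_ticks{0};
};

struct Snapshot {
    std::uint64_t base_ns;
    std::uint64_t base_ticks;
};

// Writer side (single writer assumed): bump seq to odd, update, bump to even.
inline void publish(TimePage& p, std::uint64_t ns, std::uint64_t ticks) {
    const std::uint32_t s = p.seq.load(std::memory_order_relaxed);
    p.seq.store(s + 1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    p.base_ns.store(ns, std::memory_order_relaxed);
    p.base_ticks.store(ticks, std::memory_order_relaxed);
    p.seq.store(s + 2, std::memory_order_release);
}

// Reader side: read seq, read the data, re-check seq, retry on any change.
// A real clock reader would also sample the hardware counter (e.g. rdtsc)
// between the two sequence reads.
inline Snapshot read_consistent(const TimePage& p) {
    for (;;) {
        const std::uint32_t before = p.seq.load(std::memory_order_acquire);
        if (before & 1) continue;  // writer is mid-update, try again
        const Snapshot snap{p.base_ns.load(std::memory_order_relaxed),
                            p.base_ticks.load(std::memory_order_relaxed)};
        std::atomic_thread_fence(std::memory_order_acquire);
        if (p.seq.load(std::memory_order_relaxed) == before) return snap;
    }
}
```

The reader never writes to shared memory, so it scales to any number of threads; the cost is that it may have to retry if it races with the writer.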