ot

Born on September 28, 2010•18576 Karma

ot•

3 days ago

•on: Emacs 31 is around the corner: The changes I'm dai...

> GNU deadline

I think you mean readline?

mesrik•

3 days ago

Sure. Browser autocorrect there just tried to be helpful :/

mindcrime•

3 days ago

That said, I'm kinda hoping somebody does create a "GNU deadline" project now. I'm curious to see what kind of project it would be.

sph•

3 days ago

A weird, inscrutable project management tool for the shell, written in Perl 4 and Guile Scheme, that the ten people in the world who learned to operate it swear is the greatest piece of productivity software ever invented.

bch•

3 days ago

Notable users: GNU HURD Project (Shipping any day now).

layer8•

3 days ago

I thought you were using emacs?

mesrik•

3 days ago

Well, I've got to admit I haven't read HN using emacs. Is there such a .el thing available somewhere? It would be great to read HN as it was with usenet news. Not joking, that would be an excellent tool I'd like to have!

layer8•

3 days ago

There’s https://github.com/thanhvg/emacs-hnreader, but it doesn’t appear to support commenting.

polyamid23•

3 days ago

Why not use eww?

ot•

20 days ago

•on: Only 17% of all 64-bit Integers are products of tw...

Yeah the number sounds a lot less impressive if you say that you only get 2^61.44 integers out of 2^64. In other words, a 4% entropy loss.

Information quantities are more meaningfully expressed in number of bits.
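A quick check of that arithmetic, as a sketch: the 17% figure comes from the headline, and the variable names are only illustrative.

```python
import math

fraction = 0.17      # share of 64-bit integers, taken from the headline
total_bits = 64

bits_lost = -math.log2(fraction)        # about 2.56 bits
bits_left = total_bits - bits_lost      # about 61.44 bits
relative_loss = bits_lost / total_bits  # about 0.04

print(f"{bits_left:.2f} bits remain, {relative_loss:.1%} entropy loss")
# prints "61.44 bits remain, 4.0% entropy loss"
```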

jetsamflotsam•

20 days ago

Exactly. Being precise about logarithmic vs linear utilizations is key here. I tried making a similar point about the inefficiency of IEEE-754 redundant NaN encodings here: https://arxiv.org/pdf/2508.05621

Having ~a quadrillion redundant bitstrings all mapping to NaN sounds pretty bad, but logarithmic/information utilization-wise, this is actually not too bad.
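To put a number on that, here is a rough count assuming the IEEE-754 binary64 layout (1 sign bit, 11 exponent bits, 52 fraction bits), where a NaN is any pattern with an all-ones exponent and a nonzero fraction. Treating every NaN except one as redundant is an assumption made for illustration.

```python
import math

FRACTION_BITS = 52

# Either sign, all-ones exponent, any nonzero fraction.
nan_patterns = 2 * (2**FRACTION_BITS - 1)

# Assumption: a single NaN encoding would be enough, the rest are redundant.
redundant = nan_patterns - 1
distinct_values = 2**64 - redundant

bits_used = math.log2(distinct_values)
bits_wasted = 64 - bits_used

print(f"{nan_patterns:,} NaN bit patterns")
print(f"{bits_wasted:.4f} of 64 bits lost to redundancy")
```

Counted linearly the redundancy looks enormous; counted in bits it is a small fraction of a single bit.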

kens•

20 days ago

I've been looking at the 8087 NaN circuitry lately. Having 2^53 (or whatever) values for NaN was supposedly a feature: "the large number of NAN values that are available, provide the sophisticated programmer with a tool that can be applied to a variety of special situations." For example, the different NaN values could hold debugging information to track down errors.

See "8087 Numeric Data Processor" page S-74: https://ethw.org/w/images/2/2f/Intel_8086_family_users_numer...

TomatoCo•

20 days ago

You'll see some scripting languages (ab)use this: the native "number" type is a 64-bit float, and only one NaN bit pattern is a real NaN. The others smuggle a pointer to an object in the lower bits. This way you don't spend any memory overhead indicating whether a given variable contains a primitive or an object.
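A minimal sketch of the idea (often called NaN boxing), not taken from any particular language runtime. The tag bit, the 48-bit payload width, and the helper names are all hypothetical, and the payload is a plain integer handle standing in for a pointer. It also assumes the platform preserves quiet-NaN payload bits when a double is copied, which is the usual case.

```python
import math
import struct

QNAN = 0x7FF8_0000_0000_0000        # all-ones exponent plus the quiet bit
TAG_BOXED = 0x0004_0000_0000_0000   # hypothetical tag: "payload is an object handle"
PAYLOAD_MASK = (1 << 48) - 1        # hypothetical 48-bit payload

def bits_of(value: float) -> int:
    """Reinterpret a double as its 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def box(handle: int) -> float:
    """Hide an integer handle inside the payload of a quiet NaN."""
    bits = QNAN | TAG_BOXED | (handle & PAYLOAD_MASK)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def is_boxed(value: float) -> bool:
    """True only for NaNs carrying our tag, not for ordinary numbers or NaN."""
    return (bits_of(value) & (QNAN | TAG_BOXED)) == (QNAN | TAG_BOXED)

def unbox(value: float) -> int:
    return bits_of(value) & PAYLOAD_MASK

boxed = box(0xDEADBEEF)
print(math.isnan(boxed), is_boxed(boxed), hex(unbox(boxed)))
print(is_boxed(1.5), is_boxed(float("nan")))
```

Every value stays exactly 8 bytes wide: ordinary doubles are used as-is, and anything that is a tagged NaN is decoded as a handle.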

adgjlsfhk1•

20 days ago

> For example, the different NaN values could hold debugging information to track down errors.

Unfortunately IEEE didn't bother specifying NaN propagation semantics so it ended up pretty useless.

ot•

3 months ago

•on: Meta’s renewed commitment to jemalloc

That's a false dichotomy: you optimize both the application and the allocator.

A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand that can investigate subtle bugs and pathological perf behaviors.

sumtechguy•

3 months ago

exactly. I can think of at least 5 different projects I have been on where a better allocator would have made a world of difference. I can also think of another 5 where it probably would have been a waste of time to even fiddle with.

but as usual there is an xkcd for that. https://xkcd.com/1205/

On one project I spent a bunch of time optimizing the write path of I/O. It was just using standard fwrite, but by staging items correctly it was an easy 10x speed win. Those optimizations sometimes stack up and count big. But it also had a few edges on it, so use with care.

ot•

3 months ago

•on: Meta’s renewed commitment to jemalloc

It's not just that zeroing got cheaper, but also we're doing a lot less of it, because jemalloc got much better.

If the allocator returns a page to the kernel and then immediately asks back for one, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches are pre-decay, pre-background purging thread; these changes significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing out patches optimize for the pathological behavior.

Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping is avoided, which saves not only the zeroing cost, but also the TLB shootdown and other costs. And without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".

However, the drawback is that system-level memory accounting becomes even more fuzzy.

(hi Alex!)

menaerus•

3 months ago

I am trying to understand the reason behind why "zeroing got cheaper" circa 2012-2014. Do you have some plausible explanations that you can share?

Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same. But the dataset being operated on at FB is most likely much larger than what L1+L2+L3 can fit, so I am wondering how much effect the vectorization engine might have had: a bulk-zeroing operation on large datasets is going to be bottlenecked by the single-core memory bandwidth anyway, which at the time was ~20GB/s.

Perhaps the operation became cheaper simply because of moving to another CPU uarch with higher clock and larger memory bandwidth rather than the vectorization.

jcalvinowens•

3 months ago

My memory is that Ivy Bridge was when it started being different.

ahoka•

3 months ago

AVX maybe?

ot•

3 months ago

•on: The “JVG algorithm” only wins on tiny numbers

RSA was also not given that name by its authors, the name came later, which is usually the case.

In the original paper they do not give it any name: https://people.csail.mit.edu/rivest/Rsapaper.pdf

ot•

3 months ago

•on: RE#: how we built the fastest regex engine in F#

> are there eviction techniques to guard against this?

RE2 resets the cache when it reaches a (configurable) size limit. Which I found out the hard way when I had to debug almost-periodic latency spikes in a service I managed, where a very inefficient regex caused linear growth in the Lazy DFA, until it hit the limit, then all threads had to wait for its reset for a few hundred milliseconds, and then it all started again.

I'm not sure if dropping the whole cache is the only feasible mitigation, or some gradual pruning would also be possible.

Either way, if you cannot assume that your cache grows monotonically, synchronization becomes more complicated: the trick mentioned in the other comment about only locking the slow path may not be applicable anymore. RE2 uses RW-locking for this.
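The reset-on-limit behavior described above can be sketched roughly as follows. This is not RE2's code: the class name, the entry-count limit, and the `compute` callback are assumptions for illustration, and synchronization is left out entirely.

```python
from typing import Callable, Dict, Hashable

class ResettingCache:
    """Memoizing cache that drops everything once it reaches a size limit."""

    def __init__(self, compute: Callable[[Hashable], object], max_entries: int):
        self._compute = compute
        self._max_entries = max_entries
        self._entries: Dict[Hashable, object] = {}
        self.resets = 0  # how many times the whole cache was thrown away

    def get(self, key: Hashable) -> object:
        if key in self._entries:
            return self._entries[key]      # fast path: already cached
        if len(self._entries) >= self._max_entries:
            self._entries = {}             # no gradual pruning: full reset
            self.resets += 1
        value = self._compute(key)         # slow path: build the entry
        self._entries[key] = value
        return value

cache = ResettingCache(lambda k: k * k, max_entries=3)
for key in range(7):
    cache.get(key)
print(cache.resets)  # prints 2
```

A workload that keeps producing new keys fills the cache, pays for a full reset, and starts over, which is the almost-periodic pattern described in the comment.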

ieviev•

3 months ago

I have experienced this as well, the performance degradation of DFA to NFA is enormous and while not as bad as exponential backtracking, it's close to ReDoS territory.

The Rust version of the engine (https://github.com/ieviev/resharp) just returns an Error instead of falling back to NFA. I think that should be a reasonable approach, but the library is still new so I'm still waiting to see how it turns out and whether I had any oversights on this.

ot•

3 months ago

Here RE2 does not fall back to the NFA, it just resets the Lazy DFA cache and starts growing it again. The latency spikes I was mentioning are due to the cost of destroying the cache (involving deallocations, pointer chasing, ...)

ieviev•

3 months ago

Ah, sorry then i misunderstood the comment

I'm not sure if it applies to both RE2 and Rust, but some internal engines of Rust appear to allocate a fixed buffer that they constantly re-create states into.

I'm not really familiar with the eviction technique of RE2 but I've done a lot of benchmark comparisons. A good way to really stress test RE2 is large Unicode classes, \w and \d in RE2 are ascii-only, i've noticed Unicode (\p{class}) classes very drastically change the throughput of the engine.

ot•

3 months ago

•on: Read Locks Are Not Your Friends

This is drawing broad conclusions from a specific RW mutex implementation. Other implementations adopt techniques to make the readers scale linearly in the read-mostly case by using per-core state (the drawback is that write locks need to scan it).

One example is folly::SharedMutex, which is very battle-tested: https://uvdn7.github.io/shared-mutex/

There are more sophisticated techniques such as RCU or hazard pointers that make synchronization overhead almost negligible for readers, but they generally require designing the algorithms around them and are not drop-in replacements for a simple mutex, so a good RW mutex implementation is a reasonable default.
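To illustrate the per-core state tradeoff, here is a much simplified sharded lock: a reader touches only one shard, while a writer has to take all of them. This is not how folly::SharedMutex is implemented. The class name and shard count are made up, threads are mapped to shards by thread id rather than by core, and readers that land on the same shard serialize with each other. In CPython it shows the structure only, not the scalability.

```python
import threading
from contextlib import contextmanager

class ShardedRWLock:
    """Readers lock one shard; writers lock every shard (hypothetical sketch)."""

    def __init__(self, shards: int = 8):
        self._locks = [threading.Lock() for _ in range(shards)]

    @contextmanager
    def read(self):
        # Cheap for readers: a single lock, chosen by the calling thread's id.
        lock = self._locks[threading.get_ident() % len(self._locks)]
        with lock:
            yield

    @contextmanager
    def write(self):
        # Expensive for writers: scan and take all shards, always in the
        # same order so that two writers cannot deadlock each other.
        for lock in self._locks:
            lock.acquire()
        try:
            yield
        finally:
            for lock in reversed(self._locks):
                lock.release()

rw = ShardedRWLock(shards=4)
shared = {"value": 0}

with rw.write():
    shared["value"] = 42
with rw.read():
    print(shared["value"])  # prints 42
```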

PaulHoule•

3 months ago

I think it’s not unusual that reader-writer locks, even if well implemented, get in places where there are so many readers stacked up that writers never get to get a turn or 1 writer winds up holding up N readers which is not so scalable as you increase N.

Jyaif•

3 months ago

And a Rust equivalent of folly::SharedMutex: https://docs.rs/crossbeam-utils/latest/crossbeam_utils/sync/...

amluto•

3 months ago

Wow, folly::SharedMutex is quite an example of design tradeoffs. I wonder what application the authors wanted it for where using a global array was better than a per-mutex array.

mike_hearn•

3 months ago

Right, and if you're on the JVM you have access to things like ConcurrentHashMap, where reads are lock free.

ot•

4 months ago

•on: Every book recommended on the Odd Lots Discord

Glad that Moby Dick is in there.

ot•

5 months ago

•on: A 40-line fix eliminated a 400x performance gap

You can go even faster, about 8ns (almost an additional 10x improvement) by using software perf events: PERF_COUNT_SW_TASK_CLOCK is thread CPU time, it can be read through a shared page (so no syscall, see perf_event_mmap_page), and then you add the delta since the last context switch with a single rdtsc call within a seqlock.

This is not well documented unfortunately, and I'm not aware of open-source implementations of this.

EDIT: Or maybe not, I'm not sure if PERF_COUNT_SW_TASK_CLOCK allows selecting only user time. The kernel can definitely do it, but I don't know if the wiring is there. However this definitely works for overall thread CPU time.

jerrinot•

5 months ago

That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!

ot•

5 months ago

Yes you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)

catlifeonmars•

5 months ago

I guess if you need the concurrency/throughput you should use a userspace green thread implementation. I’m guessing most implementations of green threads multiplex onto long running os threads anyway

jerrinot•

5 months ago

In a system with green threads, you typically want the CPU time of the fiber or tasklet rather than the carrier thread. In that case, you have to ask the scheduler, not the kernel.

nly•

5 months ago

Why do you need a seqlock? To make sure you're not context switched out between the read of the page value and the rdtsc?

Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

Tbh I thought clock_gettime was a vdso based virtual syscall anyway

ot•

5 months ago

> Presumably you mean you just double check the page value after the rdtsc to make sure it hasn't changed and retry if it has?

Yes, that's exactly what a seqlock (reader) is.
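For reference, the reader-side retry loop looks roughly like this. It is a single-threaded simulation with made-up names: a real seqlock reader over a shared page needs atomic loads and memory barriers, which plain Python attributes do not provide.

```python
class SeqlockCell:
    """Single-writer value guarded by a sequence counter (simulation only)."""

    def __init__(self, value):
        self.seq = 0        # even: stable, odd: a write is in progress
        self.value = value

    def write(self, value):
        self.seq += 1       # becomes odd: readers must not trust the data
        self.value = value
        self.seq += 1       # becomes even again: the new value is published

    def read(self):
        while True:
            before = self.seq
            if before % 2:              # writer is mid-update, try again
                continue
            snapshot = self.value
            if self.seq == before:      # counter unchanged: snapshot is consistent
                return snapshot
            # Otherwise a write raced with the read; loop and retry.

cell = SeqlockCell(100)
cell.write(250)
print(cell.read(), cell.seq)  # prints "250 2"
```

The reader never blocks the writer. It only re-checks the counter after reading and retries if it changed.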

mgaunard•

5 months ago

clock_gettime is not doing a syscall, it's using vdso.

jerrinot•

5 months ago

clock_gettime() goes through the vDSO shim, but whether it avoids a syscall depends on the clock ID and (in some cases) the clock source. For thread-specific CPU user time, the vDSO shim cannot resolve the request in user space and must transit into the kernel. In this specific case, there is absolutely a syscall.

ot•

5 months ago

•on: A 40-line fix eliminated a 400x performance gap

If you look below the vDSO frame, there is still a syscall. I think that the vDSO implementation is missing a fast path for this particular clock id (it could be implemented though).

jerrinot•

5 months ago

Exactly this.