Pannoniae

Born on June 19, 2023•2565 Karma

13 days ago

•on: Orthodox C++ (2016)

They're just clearly inferior in pretty much any situation.

The map stuff the other posters summed up well but even std::vector is dogshit with pretty much all implementations having inlined grow code in push_back, a not too great API and missed optimisations e.g. no trivial relocation when growing the vector / moving it and no useful APIs such as "grow but don't initialise"...

jstimpfle•

13 days ago

To be fair grow-but-don't-initialize is a pretty fundamental part of the API, the reserve() method.

But already the basic premise that you should push back without thinking is wrong. You will suffer reallocations and invalisations when you least expected them, and frankly you have to architect around that fact which is a terrible restriction. You can work around by pre reserving but at that point it's just a basic fixed heap allocated array but worse because the type gives you a weird look all the time, "I'll realloc as soon as you don't pay attention, harhar"!

tialaramex•

13 days ago

std::vector lacks what I call the "bifurcated reservation API". It has Rust's Vec::reserve_exact but not Vec::reserve -- these APIs serve subtly different purposes, the former (which C++ calls just "reserve") says "Here's a hint for exactly how big this container will ever grow" while the latter says "Here's a hint for how big this container will get in my immediate future, but it might grow further later".

The implementation always tries to grow (if necessary) to the exact size chosen for Vec::reserve_exact, but for plain Vec::reserve if growth is needed it always grows exponentially, not to the exact size, preserving the O(1) push cost.

For a typical "doubling" growable array type, if we're pushing groups of ten items, reserve_exact or C++ grows like 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 ... which is much worse than O(1) whereas the correct reserve grows 10, 20, 40, 80, 160 preserving O(1)

For trivial types you can work around this in C++ with a little work, and for the non-trivial types you can work around it with a bunch more extra code, but you probably won't.

Bjarne among other people teaching C++ recommend just not using the reservation API as a reservation API because of this problem†, and the resulting teaching definitely leaks into CS graduates and even into languages which have the correct API and so you have to un-teach the bad lesson.

In applications where you actually can't afford to pay for growth (or at least in some cases can't afford this) I also like Vec::push_within_capacity which I believe comes from Rust for Linux where the kernel legitimately needs this "If there's room push, otherwise I have a plan B" approach.

† To Bjarne this API is instead conceived of as a way to preserve reference validity. Since it won't grow, our references will still work.

jstimpfle•

13 days ago

Vec::reserve() is the behaviour you get from C++ std::vector push_back() (implicit just-in-time reserve), so right now I can't see a situation where I'd want the explicit Rust version even if I didn't think the whole push realloc thing is mostly a bad idea.

Yes, Rust version allows you to maybe skip a reallocation step or two by doing explicit up front reallocation. But remember most allocation work is always from the last grow anyway. The Rust version seems like a microoptimization, giving a little bit more explicit control in a situation where you've already pretty much given up control and gone like, throw hands in the air, we're doing push_back()!

tialaramex•

12 days ago

As with the likely/ unlikely branch hint the problem is that programmers are wrong much more often than they expect on both sides. They're too often wrong to think they know the final size - hence Bjarne's caution - but they're also too often wrong that they've got no idea how much capacity they need at all. So hence this API.

You're correct that this isn't a huge optimization. But it more than pulls its weight directly because it's a small boon when you're right and it doesn't have the terrible penalty that Vec::reserve_exact has when inevitably the programmer is sometimes wrong. It's very much about saving pennies, but the growable array type is so widely used that counting pennies makes sense.

I have a lot more thoughts about reservation, but these suffice for specifically the growable array type.

jstimpfle•

12 days ago

There are many many situations where I know exactly what is the common path and what is the uncommon path. And when I don't know, it's much harder to optimize!

If you're counting pennies, Vec::reserve() (inexact) is hardly what you want, because in the worst case you're wasting a factor of 1.5x or 2x of elements due up-front overallocation. Maybe chunk lists could be better, overhead is bounded by chunk size and all operations are constant-time. No pointer invalidation either. And you can pool those chunks, preventing memory fragmentation and improving memory utilization, since there aren't a million different sized allocations in your process.

tialaramex•

11 days ago

> If you're counting pennies, Vec::reserve() (inexact) is hardly what you want, because in the worst case you're wasting a factor of 1.5x or 2x of elements due up-front overallocation.

I think this is "Trump math" but assuming we're actually looking at the same over-allocation this isn't new, what you're calling "wasting a factor of 2x" was already what you were paying by using the amortized O(1) growable array type, that's its central trade off, so yeah, the type with that property does still have that property.

> Maybe chunk lists could be better, overhead is bounded by chunk size and all operations are constant-time.

Basically std::deque. Surely no more need be said ?

jstimpfle•

11 days ago

> was already what you were paying by using the amortized O(1) growable array type

As I said, I don't think growable array data structure is a good idea.

> Basically std::deque. Surely no more need be said ?

Ok now you're really out of your depth here. Pretty sure you haven't actually used it, it's a somewhat obscure data structure even though the more serious C++ programmers have heard of it, most have probably never used it.

From a web search, std::deque seems to be pretty universally thought of as a weird and very bad data structure. I've had to learn myself because I did actually try to use it myself recently. Beyond being unintuitive to use due to being "abstract" (had 2-3 serious bugs, e.g. unexpected iterator invalidations that happened only under load), apparently std::deque is not even specified as a chunk list. O(1) random access must be guaranteed. So it must be much more complicated than that, I don't know the details though.

And while actual implementations do use chunks in some way (just not in a simple chunk list), apparently "chunks" is not a part of the std::deque spec, and hence there isn't anything standardized about the size of these chunks either.

- On MSVC, chunk size is 16 elements so chunk size of std::deque char is 16 BYTES !!!!! I was thinking they can't be serious, but apprently it is the case

- On GCC, chunk size is 512 byte fixed.

- On Clang, chunk size is 4096 byte fixed.

It _is_ a shithole, surely no more need be said?

tialaramex•

11 days ago

> As I said, I don't think growable array data structure is a good idea.

It is still quite hard for me to take this seriously as a belief.

> Ok now you're really out of your depth here.

I don't think so. As you found, the consensus is that this sort of thing is a terrible idea. The O(1) complexity looks good superficially but your actual performance is miserable.

> I don't know the details though.

Raymond Chen has a pretty good explanation of the three implementations of std::deque, like he did for their std::string but Raymond's goal is to help you do forensics, he's not here to tell us which data structures are a good idea. https://devblogs.microsoft.com/oldnewthing/20230810-00/?p=10... might work, or, given Microsoft, it might already be dead by the time you click it.

jstimpfle•

11 days ago

> The O(1) complexity looks good superficially

Without knowing more about the details of the spec and of real world implementations [0], I'll boldly claim that the O(1) is the precise problem why this is a weird data structure (I never needed or expected that sort of constraint). It is not a chunk list, like many (including me) seem to have assumed.

I don't even have performance problems measured with this, also given that my std::deque is not currently in production, it is used as a simple streaming queue that only has a couple megabytes/second of bandwidth requirements, and it is not being built with MSVC (so chunks are going to be 512 or 4096 bytes). But it is a latent problem.

[0] Btw. I don't want to know the messy details of this right now, I need simple not complex. I've learned not to add more complexity in a futile attempt to fix things that are fundamentally broken. What use case is this going to be solved by std::deque, name me one such use case and tell me with a straight face that it's not a completely made up case or a super niche case that would likely also be better served by a different approach.

Or take it from someone else if you would, is Chandler Carruth high-profile enough and free enough of conflict of interest to be credible for saying that std::deque is dumb? https://news.ycombinator.com/item?id=22962980

tialaramex•

11 days ago

> It is not a chunk list, like many (including me) seem to have assumed.

The thing C++ programmers tend to expect (though perhaps not you) is Rust's std::collections::VecDeque - the growable array type again but used as backing for a ring buffer.

This type has amortized O(1) push and pop at both ends, I assume since you didn't like the amortized growth of Vec / std::vector you'll feel the same way but that's what most programmers actually want from such a type - in use as a queue it won't repeatedly cycle allocations so long as the size is constrained, because it's a ring buffer. If you actually know the needed size up front the overhead from this structure, which can grow, versus a structure which can't grow, is a single integer for the capacity.

But as you saw, std::deque is... not that. STL wasn't able to explain what it's for when I asked him, so, if anybody knows it apparently doesn't include the people maintaining the C++ standard library implementations.

> Or take it from someone else if you would

I think the trouble here is that you've misconstrued this entire part of the conversation. I was gesturing at std::deque because it's terrible, and your response, that it's terrible, isn't a rebuttal as I think you're expecting. We agree on that, std::deque is terrible, our disagreement seems to be that you think a slightly different terrible data structure would be a good idea somehow and I do not.

jstimpfle•

11 days ago

I don't know the size up front because it goes up and down depending on load. I want to be able to buffer jitters/spikes but not hog memory when there is no data in the queue. I was assuming it's natural to want a simple linked list of chunks to represent this, maybe chunks can be partially full here. So chunk -> chunk -> chunk, and each chunk is actually a node with a buffer pointer + size + fill fields, or maybe the data comes directly after the size + fill fields so no buffer pointer needed.

That easily gives O(1) access to enqueue/dequeue, and the queue naturally grows and shrinks (even within configurable bounds) as part of reading and writing. Blocks can be taken and returned from/to some fixed size block pool. This design would be completely satisfactory. It's also very easy to code from scratch, so why should I settle for anything less, for example a growable vector where a reallocation would mean blocking other threads for extended time? But I was told to "not reinvent the wheel" and to "not overengineer it" so I tried to change my mind and make use of an STL container data structure.

> our disagreement seems to be that you think a slightly different terrible data structure would be a good idea

I don't know why you think that I would be thinking that, and I'm confused that you say you were gesturing at how std::deque is terrible, and not sure why I should be the one misconstruing something, when my premise from the beginning was "STL containers are terrible in many ways". And I don't get at all why you responded with "basically std::deque" to my "maybe chunk lists would better". It doesn't make the least bit of sense to me.

But let's settle this here, this is a pointless fight...

tialaramex•

10 days ago

> I was assuming it's natural to want a simple linked list of chunks

In the 1970s or 1980s that does feel entirely natural, but in the 1990s the 68040 and i486 both introduce L1 data cache and so now all list chasing hurts very badly, your structure is a list chase any time we index into the collection.

I think I can see a way to have what you're describing hit amortized O(1) push/ pop with the spikes which are amortized being more frequent (linear with capacity) but fixed size and smaller (allocate or free one block then do some housekeeping), it costs more RAM because of the block overhead and it is no longer contiguous, but I see that for your intended application you probably don't care about either problem.

Now that I think I understand your data structure better it's much less similar to std::deque than I had originally thought, it does seem very niche to me, but more power to you if you write such a type.

jstimpfle•

10 days ago

Yes, I am talking about a buffer for an outbound queue that should buffer on the order of MB/sec. That's decidedly not a niche thing. It's very basic systems engineering (and also how socket buffers are represented in a kernel for example).

You would chunk this at a reasonable size, at least multiple kilobytes per chunk. That is not slow, it is basically the only way to do it. Yes, that is amortized, bounded overhead. Sure it costs a bit of extra RAM, like one extra pointer and maybe 1-2 integers per chunk? For chunks of 4 KB, this overhead is predictably less than 1 % plus less than 1 chunk worth of fragmentation in the last chunk. That is NOT unpredictable spikes of > 100 percent during reallocation or 100000 percent due to low utilisation...

I believe you can even instruct the CPU to preload the next buffer while still streaming the last, but personally have never had the need to dive _that_ deep. It would probably be very hard to measure any performance benefit from doing so. I like to write very basic straightforward C code, just solve the data structure problem first, not hand-wave it.

jstimpfle•

13 days ago

Feels like typical improvement/perfection tunnel vision syndrome to me, though. That syndrome is engrained in C++ community and I think also Rust community, although it looks like the latter took the chance to do many things better from the start learning from C++'s mistakes.

Realloc is pretty much never the right way in whatever form, and I've never seen any need to include realloc in any of my own allocators (mostly blockallocators and linear allocators and pools/free lists, sometimes using malloc/free).

Pannoniae•

37 days ago

•on: Everything in C is undefined behavior

Exactly, you write for your target, not some imaginary spec. The spec is only as useful as to predict what your target roughly does, it's not normative.

Compilers might have bugs where the spec is supposed to work but it doesn't, and many extensions without standard equivalents, or implementation-specific behaviour where undefined things in the standard do get assigned a meaningful outcome.

Pannoniae•

40 days ago

•on: C++26 Shipped a SIMD Library Nobody Asked For

sadly inline assembly is still at the ergonomics of "one compiler doesn't support it in x64 mode" and "you can choose between the readable syntax (which is a black box to the compiler) and the unreadable syntax (which can specify I/O/clobber regs)"

msla•

40 days ago

OK, is there a horrible speed penalty for writing your SIMD in pure assembly functions and then calling those functions? If you're writing assembly anyway, just drop the "inline" part.

Pannoniae•

39 days ago

Sure, if you're willing to write a large enough chunk that you can eat the cost of not inlining it. If you just write a small leaf function or two, it will probably be a wash or perform worse.

Pannoniae•

45 days ago

•on: Screenshots of Old Desktop OSes

Hi! I'm slightly under that so I feel like I'm qualified :)

I wholeheartedly agree, they're quite a regression.... although I don't think this is a popular opinion around here.

When people say "something used to be better" they usually don't mean literally, they mean that for the circumstances, it was better. Of course, more modern systems support more hardware, more features, etc., but if you made those same modern technical improvements on top of an older designs, you'd get much better results.

To me it looks like software design has been massively overtaken by "form over function", everyone just wants a unique "brand" but the actual UX is complete dogshit. Borderless buttons, zero indication what's clickable, no visual delimiters for different areas of programs, no good shortcut / altkey menu support, etc....

This has somehow infected even Linux to such a crazy extent...

Pannoniae•

47 days ago

•on: Show HN: Building a web server in assembly to give...

Look at the bright side, it's much more feasible now :)

Yes, it's not deterministic, and if you were using it commercially, the ROI would be terrible, and it's certainly not reliable but for a hobby project.... why not?

Encouraging people to understand the layer of abstractions they're building on is helpful, doesn't matter if they do it by hand or with clankers.

LLMs lower the barrier for execution - they make you faster. The unstated question is: faster at what - they can make you faster at something clever, or faster at the entirely wrong thing...

Your point is correct if we're looking at it through a scarcity lens - the effort to make it certainly decreased a lot - but that doesn't mean that anything is now worthless. We can just move onto doing bigger, better things now, until we hit the next limits...

moron4hire•

47 days ago

> Encouraging people to understand the layer of abstractions they're building on is helpful, doesn't matter if they do it by hand or with clankers.

But we know that long term use of LLMs does not lead to better understanding, it leads to reliance on the LLM for the person to be able to function at their job.

Pannoniae•

47 days ago

You also depend on {computers, the internet, electricity, groceries, etc.}, that's not the discriminating factor.... the question is, are you actually better at your job using it, and are you stunting your own growth?

I'm not claiming to have a definite answer to either, but I think the right question to ask is - are you going to benefit from using it in the long run? If yes, carry on, if no, re-evaluate what you're doing :)

moron4hire•

47 days ago

Early in my career, I had a consulting job where the client wanted me to work on their system using their tools. They had nothing and would not provide anything. No compilers, no runtimes, nothing but a regular, run of the mill, desktop office workstation. Instead of performing the task in Visual Studio with .NET like we had planned, I did it in Excel with VBA macros.

If you come to rely on specific tools such that you can't do your job without them, you're no longer doing the job you think you're doing, you're now just a tool operator. If I don't have a computer of some kind, I'm not writing software, I'm just a manager. If my team can't work because GitHub is down, then we've done a bad job of being software developers.

The experience taught me that there are two kinds of tools: those that are necessary and those that are nice to have. Yes, that job sucked, and yes, I quit that job, but more because they refused to pay me the overtime I needed to get it done with such crappy tools.

I had another job where we were one client of many on site with a vendor for a training event and the vendor couldn't get a system configured correctly for our cohort to continue on. Part of the problem was their configuration system was garbage, very easy to get wrong, very time consuming to manually edit. While they were dicking around with throwing edits against the wall, I wrote an HTML page with JavaScript to make a UI to edit the data in an much more natural way and then generate the config file. It took 10 minutes and saved us hours of waiting so that we could continue on with the training. Perhaps the takeaway from that experience was that Valve's engineers write shitty config systems. All I know is we got the training event done and I walked away with a piece of paper that said I was now a Certified Hardware Vendor, which we then used to sell more contracts on our own.

Pannoniae•

47 days ago

•on: Show HN: Building a web server in assembly to give...

That's just not true... the flags to get preprocessed output and assembly are quite useful and used a fair bit, in fact. Multiple reasons - sanitychecking your code, finding bugs, or even finding compiler errors.

Pannoniae•

47 days ago

•on: Show HN: Building a web server in assembly to give...

You'd be surprised, again.... most compilers don't generate very good code, mostly because

1. the time for optimisation is limited

2. the constraints are overlapping and just completely intractable beyond a single function (do you want to inline this, saving on the call and increasing binary size, or not do it because it's cold?)

3. they don't have domain-specific knowledge about your code, and even with PGO, they might incorrectly decide what's hot and what's not - typical example are program settings. You didn't enable a setting during PGO instrumentation, compiler sees you didn't call that path, shoves it out of line. Now your PGO-optimised code is worse than -O2. And compilers have different levels of adherence to manual branch hinting - on MSVC you get a reorder at best, Clang and GCC try much harder at [[likely]] and [[unlikely]].

4. There's still quite a bit of low-hanging fruit left, mostly because progress is jagged ;) For example our calling conventions generally suck - this is actually why inlining is so helpful - and the inertia makes everyone emit the default calling convention and that's it.

For example, did you know that compilers have very inconsistent support for struct unpacking? It can be much faster to write

  int32 meow(int64 a, int64 b);

than

  struct mytype {
    int64 x;
    int64 y;
  };

  int32 meow(mytype a);

because the first one goes through registers on the MSVC ABI, the second one gets lowered to the caller passing a pointer to the stack. Before someone says "oh this just means MS sucks" - fair, but for std::unique_ptr the situation is the other way around... on the MSVC ABI the callee cleans it up so it's truly zero-cost, but on the Itanium ABI using it is worse than using T* as a raw pointer... see the GCC codegen :)

These examples might seem a bit cherrypicked but this is only scratching the surface, not to talk about the codegen in higher-level languages, which is even more dreadful. Manually optimising your code can usually get a magnitude worth of free performance, which is just tragic.

I wouldn't even rule out LLM codegen in the future - although they're quite unreliable today so you'd get miscompiles like crazy - but there's just so much low-hanging fruit left on the table that it wouldn't be too out of step...

pocksuppet•

47 days ago

Expanding the struct to two arguments does not take longer than rewriting your whole project in assembly.

Pannoniae•

46 days ago

I've never said that, but using assembly in certain places can certainly be justified, especially for the performance-intensive parts.

Pannoniae•

50 days ago

•on: Nobody Reviews Compiler Output

Just like LLMs, compilers are just another layer of abstraction and no they're not deterministic.

Just yesterday I've reported a codegen bug in MSVC. (Luckily they've fixed it very fast.) Can you realise that it's an optimiser bug without inspecting the assembly? Hardly.

All the arguments people claim against LLMs are similarly applicable to compilers, but compilers are old technology and LLMs are new.

If you're an expert, just about every compiled function contains obvious inefficiencies, and a skilled assembly programmer can speed it up by in the ballpark of 3x. If we're talking about your average webapp, you can usually get 1000x better resource usage in most ways, including CPU, RAM, storage and so on.

And the output isn't deterministic either - the bugs no withstanding, code generation is highly chaotic, optimisations have non-local impacts and you can't easily predict optimised codegen output from source.

LLMs aren't much worse. They have non-deterministic output, but you can steer it - similarly to a compiler. An expert can use it to gain great speed and efficiency, but in the hands of someone not as capable, you can make something awful just as fast. Both tools are force multipliers.

Pannoniae•

51 days ago

•on: How do I inform Windows that I'm writing a binary ...

Calling win32 from other languages is supported, calling it from assembly is supported (as long as you use the calling convention properly, obviously), using ntdll to bypass the win32 API is not supported.

Basically on Linux the syscalls are the equivalent of Win32 except much narrower in scope.

account42•

50 days ago

> using ntdll to bypass the win32 API is not supported

But it is sometimes required to do things properly.

Pannoniae•

51 days ago

•on: The bottleneck was never the code

Yeah, you're definitely right about the shifting goalposts ("it's a stochastic parrot" -> "it hallucinates all the time, it can't even get APIs right" -> "it can generate functions but can't reason about the codebase" -> "the bottleneck was never shipping code")

At the same time, humans can move up the abstraction ladder faster than the LLMs can. At least, some humans. Agents can produce lots of code. They can also do the entirely wrong thing. The impact of wrong decisions have been massively write-amplified with more and more intelligent LLMs. With earlier ones, it got a sentence or a function wrong, you reprompted, the cost of a mistake was 10 seconds. Now, you can burn hours or even days of work on the entirely wrong thing without a competent human operator stepping in and course-correcting.

The trajectory of agents have been bigger and bigger context windows, bigger autonomy, but at the same time, a bigger blast radius. In this context, I don't think the human experts will be out of their jobs any time soon.

skiing_crawling•

51 days ago

> At the same time, humans can move up the abstraction ladder faster than the LLMs can

This was kind of the point, its only true for now. I agree with you that this kind of stuff will take longer. I don't think there's probably good training data for it right now. Handling abstractions and course correcting is probably the job now, and it also happens to be exactly the data that we will be typing in our prompts. They'll train on it or something like it.

Pannoniae•

50 days ago

Unless something radical changes (and that isn't unprecedented! I'm just writing this as of today), the trend is still "just" a bigger hammer. It's bigger, you can get way more done, but the blast radius is also larger.

Take the strawman: even if AI can one-shot basically any application below let's say, 1MLoC, if your prompt is 4 lines, it will generate something. It can't read your mind. If you make proper specs, then you'll get what you want - but many people don't know what they want. And even if they do, they might have contradictions in their requirements, might be asking for something impossible, etc.