The pretraining stage is the first stage which consists of "next token prediction" on the entire internet, PB of tokens, etc. This is what most people think of when they think of training LLMs, however it produces a "base model" which is not really "intelligent", but rather much like a blurry JPEG of all human language and knowledge. You cannot really talk to such a model; it will simply complete your prompt by producing both sides of the conversation. Note however at some level the training has encoded enough structure through compression that it is able to simulate all sorts of phenomena, from human conversations to code. The great R&D difficulty here is to scale pretraining so that it can proceed smoothly in vast distributed datacenters in a fault-tolerant manner.
The next few stages are collectively called post-training, and typically consist of supervised fine-tuning, then reinforcement learning.
In supervised fine-tuning, the model is further trained to predict the next token, but on a much more focused data set of natural language conversations where the "assistant" and "user" turns are explicitly delineated with special tokens. The output of this stage is a model which is capable of carrying on proper conversations, but typically with no ability to creatively problem-solve, and less of a personality. The data and compute are many orders of magnitude smaller than in pretraining.
The reinforcement learning stage used to be a small part of model training, but ever since AI-assisted coding took off, it has become larger and larger chunk of training. In recent models, the compute spend on RL has allegedly come to rival or even exceed that of pretraining [1], which is a bit scary because RL is classically what lead to sci-fi like AIs which are extremely good at accomplishing goals to the detriment of everything else.
The way that RL works is that you put an instance of your model in some environment (such as a VM containing a git repository) and give it a task (such as fix the linked github issue). The model will then generate a bunch of attempts to solve the task which we call "trajectories", in most cases there is either an objective measure of the task success (such as passing the tests), or a fuzzy measure (such as having another LLM look at the results and provide a score). This is called the reward, and the model will learn slowly by producing trajectories that receive reward. It can actually be quite hard to prevent "reward hacking" from the model here and the rewards must be shaped very carefully, much R&D labor goes into here, as well as similar challenges to distributed pretraining.
A significant challenge is that coding/knowledge work tasks these days are getting extremely difficult, we are far beyond 2024 days where models could barely solve the easiest problems in SWE-bench. Tasks at the frontier now look more like mini projects that would take humans multiple hours or even days to finish (or in some cases, research-style tasks that would be beyond reach for even top human experts, such as the Erdős unit distance problem which was posed in 1946 but wasn't solved until recently, by GPT-5.5). Huge amounts of trajectories must be produced, and huge amounts of them produce zero reward and therefore are useless for learning. Getting a cold start requires running tens of thousands of instances of your model in VMs in parallel for multiple days to produce trajectories, to say nothing of the GPU costs.
So what do you do when you only have a model which is capable of basic conversations but cannot even begin to tackle basic coding tasks, use tools, etc? The approach that companies behind the frontier have decided on is to bootstrap their learning process by having an already extremely intelligent model such as Claude produce hundreds of thousands of seed trajectories for them. Then they can use this data to get a warm start and begin learning immediately. And if you use Claude for your reward model too, you get to skip the nastiness of reward shaping.
Therefore, even if in number of raw tokens the data are much smaller than internet-scale pretraining data, the value that each token provides is far far greater.
[1] For example, Grok 4 compute spend on RL was ~100% of that of pretraining: https://www.interconnects.ai/p/grok-4-an-o3-look-alike-in-se...
I worry about the use of humans as sacrificial accountability sinks. The "self-driving car" model already has this: a car which drives itself most of the time, but where a human user is required to be constantly alert so that the AI can transfer responsibility a few hundred miliseconds before the crash.
Essentially you are mischaracterizing what Feynman did or say, although this is also Feynman fault :-), by doing the famous public demonstration, with the ice water in a glass [2], although even there he only said it has "significance to the problem...". In other words, we should not simplify, even for the general public, what are complex subtle engineering issues. This is also the reason why current AI, will fail spectacularly, but I digress...
Feynman documented the joint rotation problem in his written Appendix F, but his televised demonstration became the explanation...[3]
Camarda is correct here. There was a fundamentally flawed field joint design, meaning the tang-and-clevis joint opened under combustion pressure instead of closing. This meant the O-rings were being asked to chase a widening gap something the O-ring manufacturer explicitly told Thiokol O-rings were never designed to do. Joint rotation was known as early as 1977, a full nine years before the disaster.
The cold temperature made things worse by stiffening the rubber so it could not chase the gap as quickly, but O-ring erosion and blow-by were occurring on flights in warm weather too and nearly every flight in 1985 showed damage.
The proof is how they fixed. NASA redesigned the joint metal structure with a capture feature to prevent rotation, added a third O-ring for redundancy, and installed heaters but kept the exact same Viton rubber. If the O-rings were the real problem, you would change the O-rings. They did not need to.
The report [1] is public for everybody to read...but not from the NASA page... who funnily enough has a block on the link from their own page, so I had to find an alternative link...
[1] - https://www.govinfo.gov/content/pkg/GPO-CRPT-99hrpt1016/pdf/...
[2] - https://youtu.be/6TInWPDJhjU
[3] - https://calteches.library.caltech.edu/3570/1/Feynman.pdf
Claude Opus came up with this script:
It produces a somewhat-readable PDF (first page at least) with this text output:
(I used the cleaned output at https://pastebin.com/UXRAJdKJ mentioned in a comment by Joe on the blog page)
To be clear, ADHD, despite having "disorder" in the name, is actually a syndrome: a complex of symptoms that, when recognized together, indicate that a certain set of interventional treatments will likely be applicable.
Diagnosing someone with a syndrome does not indicate any knowledge is available on the cause (etiology) of the symptoms. Many different things can cause the same set of symptoms. But if a certain treatment ameliorates anything qualifying as that syndrome, regardless of the upstream cause, then the diagnosis of the syndrome (and so the existence of the syndrome as a concept) is useful, even if it's not informative.
The DSM actually covers two very different categories of what we might call "mental" illnesses: neurocognitive illnesses, and neuroendocrine (or neurohormonal) illnesses.
Neurocognitive illnesses — structural problems with the brain or its cells (think Parkinson's Disease, or ALS, or Lewy Body dementia) — are usually traceable to specific etiologies, as each one usually has either a very unique presentation of signs and symptoms, or has unique markers that can be assayed/biopsied for.
Neuroendocrine illnesses, on the other hand, are almost always syndromes. Many different upstream problems (genetic, toxic, nutritive, auto-immune, etc) can potentially cause the same small menagerie of messenger-chemicals to get out of whack, and due to this, many different upstream problems end up looking like the same few "templates" of symptoms. If you can put the particular out-of-whack messenger-chemicals back into whack with drugs that do that, then you've fixed the symptoms — which doesn't fix the upstream problem (if it even can be fixed), but does fully compensate downstream for the upstream problem.
Essentially, yes, but I would go further in saying that embodiment is harder than intelligence in and of itself.
I would argue that intelligence is a very simple and primitive mechanism compared to the evolved animal body, and the effectiveness of our own intelligence is circumstantial. We manage to dominate the world mainly by using brute force to simplify our environment and then maintaining and building systems on top of that simplified environment. If we didn't have the proper tools to selectively ablate our environment's complexity, the combinatorial explosion of factors would be too much to model and our intelligence would be of limited usefulness.
And that's what we see with LLMs: I think they model relatively faithfully what, say, separates humans from chimps, but it lacks the animal library of innate world understanding which is supposed to ground intellect and stop it from hallucinating nonsense. It's trained on human language, which is basically the shadows in Plato's cave. It's very good at tasks that operate in that shadow world, like writing emails, or programming, or writing trite stories, but most of our understanding of the world isn't encoded in language, except very very implicitly, which is not enough.
What trips us up here is that we find language-related tasks difficult, but that's likely because the ability evolved recently, not because they are intrinsically difficult (likewise, we find mental arithmetic difficult, but it not intrinsically so). As it turns out, language is simple. Programming is simple. I expect that logic and reasoning are also simple. The evolved animal primitives that actually interface with the real world, on the other hand, appear to be much more complicated (but time will tell).
Since China has the most advanced network censorship, the Chinese have also invented the most advanced anti-censorship tools.
The first generation is shadowsocks. It basically encrypts the traffic from the beginning without any handshakes, so DPI cannot find out its nature. This is very simple and fast and should suffice in most places.
The second generation is the Trojan protocol. The lack of a handshake in shadowsocks is also a distinguishing feature that may alert the censor and the censor can decide to block shadowsocks traffic based on suspicions alone. Trojan instead tries to blend in the vast amount of HTTPS traffic over the Internet by pretending to be a normal Web server protected by HTTPS.
After Trojan, a plethora of protocol based on TLS camouflaging have been invented.
1. Add padding to avoid the TLS-in-TLS traffic characteristics in the original Trojan protocol. Protocols: XTLS-VLESS-VISION.
2. Use QUIC instead of TCP+TLS for better performance (very visible if your latency to your tunnel server is high). Protocols: Hysteria2 and TUIC.
3. Multiplex multiple proxy sessions in one TCP connection. Protocols: h2mux, smux, yamux.
4. Steal other websites' certificates. Protocols: ShadowTLS, ShadowQUIC, XTLS-REALITY.
Oh, and there is masking UDP traffic as ICMP traffic or TCP traffic to bypass ISP's QoS if you are proxying traffic through QUIC. Example: phantun.
This is a common misunderstanding of LLMs. The major, qualitative difference is that LLMs represent their knowledge in a latent space that is composable and can be interpolated. For a significant class of programming problems this is industry changing.
E.g. "solve problem X for which there is copious training data, subject to constraints Y for which there is also copious training data" can actually solve a lot of engineering problems for combinations of X and Y that never previously existed, and instead would take many hours of assembling code from a patchwork of tutorials and StackOverflow posts.
This leaves the unknown issues that require deeper reasoning to established software engineers, but so much of the technology industry is using well known stacks to implement CRUD and moving bytes from A to B for different business needs. This is what LLMs basically turbocharge.
>How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is let some other process fix the error, but what does this mean for their code? The answer is let it crash. By this I mean that in the event of an error, then the program should just crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.
>Errors occur when the programmer does not know what to do. Programmers are supposed to follow specifications, but often the specification does not say what to do and therefore the programmer does not know what to do.
>[...]
>The defensive code detracts from the pure case and confuses the reader—the diagnostic is often no better than the diagnostic which the compiler supplies automatically.
Note that this "program" is a process. For a process doing work, encountering something it can't handle is an error per the above definitions, and the process should just die, since there's nothing better for it to do; for a supervisor process supervising such processes-doing-work, "my child process exited" is an exception at worst, and usually not even an exception since the standard library supervisor code already handles that.
- ~612 BC Ashurbanipal di Nineveh tablets, sort of structured tag-based library with more than 30k tags found, mostly used to note transactions and other daily life activities
- ~245 BC Callimacus pínakes, another sort of tag-based index for the Alexandria giant library
- ~1545 Conrad Gessner libraries of Babel, personal notes closely similar to "modern" ZettelKasten
- 1673-94 Leibniz's Scrinium Literatum another far similar to Gessner's one and ZK
- 1934 Paul Otlet & Henry La Fontaine Mundaneum, so-called the modern web ancestor
- 1960 Niklas Luhmann's ZettelKasten
Those are just few I remember but there are many others and surely many more not lost in the history. All claim to be universal and all have an ultimate goal: store&retrieve information as easily as possible to produce new one, to evolve. All are closely similar in principles (usage of meta-information, cataloguing techniques of various kind, keep individual "entries" small for easy isolation and composing etc). The web (1.0 so called) is the first general and global example of those systems. All fails though at a certain point.
Long story short: there is no universal method to be followed slavishly expecting magic results, there are common needs, normally solved in closely similar ways with the tools of the time for millennia, the best option is understand the problem and the principle behind all those solutions tailoring one on our needs.
Personally I use Emacs/org-mode/org-roam and various other related package to manage my personal information, suffering a bit by the lack of a more flexible storage than files and filesystems, but still enough to manage almost anything so effectively that I can't use modern desktops/sw anymore, it's not PARA, ZK etc but just another systems, without strict rules, tailored on my needs following the similar principles of all others. Popular modern one are LYT https://youtu.be/RgwnpEBFNUg or Jonny Decimal.
And of course Yannic Kilcher[4], and also listening in on the paper discussions they do on discord.
Practicing a lot with just doing backpropagation by hand and making toy models by hand to get intuition for the signal flow, and building all kinds of smallish systems, e.g. how far can you push whisper, small qwen3, and kokoro to control your computer with voice?
People think that deepseek/mistral/meta etc are democratizing AI, but its actually Karpathy who teaches us :) so we can understand them and make our own.
[1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...
[2] https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5...
Krennic's meeting on kalkite:
* structured like the https://en.wikipedia.org/wiki/Wannsee_Conference
* takes place at the https://en.wikipedia.org/wiki/Kehlsteinhaus
the Aldhani heist: https://en.wikipedia.org/wiki/1907_Tiflis_bank_robbery
Ferrix: https://en.wikipedia.org/wiki/The_Troubles
Vel Sartha:
* https://en.wikipedia.org/wiki/Rose_Dugdale
* https://en.wikipedia.org/wiki/Dolours_Price
* https://en.wikipedia.org/wiki/Red_Army_Faction
Kleya Marki: https://en.wikipedia.org/wiki/Noor_Inayat_Khan
the Dhanis: https://en.wikipedia.org/wiki/S%C3%A1mi_people#Discriminatio...
escape from Narkina 5:
* https://en.wikipedia.org/wiki/Maze_Prison_escape
* https://en.wikipedia.org/wiki/Mauthausen_concentration_camp
* https://en.wikipedia.org/wiki/Sobibor_uprising
* https://en.wikipedia.org/wiki/Vrba%E2%80%93Wetzler_report
Rix Road:
* https://en.wikipedia.org/wiki/Haymarket_affair
* https://en.wikipedia.org/wiki/Corporals_killings
* https://en.wikipedia.org/wiki/Shireen_Abu_Akleh#Funeral
Mon Mothma's speech: https://en.wikipedia.org/wiki/Otto_Wels#Speech_in_opposition
Ghormans: https://en.wikipedia.org/wiki/French_Resistance
Ghorman massacre:
* https://en.wikipedia.org/wiki/Bloody_Sunday_(1972)
* https://en.wikipedia.org/wiki/Tlatelolco_massacre
* https://en.wikipedia.org/wiki/Jallianwala_Bagh_massacre
* https://en.wikipedia.org/wiki/Rabaa_massacre
* and perhaps https://en.wikipedia.org/wiki/Silesian_weavers%27_uprising
Obviously the comparisons aren't exact, but it's clear the show had a great many sources of inspiration (or maybe history rhymes as it always has).
An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).
Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.
[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)
[2] https://arxiv.org/abs/2402.08115
[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)
I think about coding assistants like this as well. When I'm "ahead of the code," I know what I intend to write, why I'm writing it that way, etc. I have an intimate knowledge of both the problem space and the solution space I'm working in. But when I use a coding assistant, I feel like I'm "behind the code" - the same feeling I get when I'm reviewing a PR. I may understand the problem space pretty well, but I have to basically pick up the pieced of the solution presented to me, turn them over a bunch, try to identify why the solution is shaped this way, if it actually solves the problem, if it has any issues large or small, etc.
It's an entirely different way of thinking, and one where I'm a lot less confident of the actual output. It's definitely less engaging, and so I feel like I'm way less "in tune" with the solution, and so less certain that the problem is solved, completely, and without issues. And because it's less engaging, it takes more effort to work like this, and I get tired quicker, and get tempted to just give up and accept the suggestions without proper review.
I feel like these tools were built without any sort of analysis if they _were_ actually an improvement on the software development process as a whole. It was just assumed they must be, since they seemed to make the coding part much quicker.
One place I think the analogy breaks down, though, is that I think you're pretty severely underestimating the time and effort it takes to be productive at math research. I think my path is pretty typical, so I'll describe it. I went to college for four years and took math classes the whole time, after which I was nowhere near prepared to do independent research. Then I went to graduate school, where I received a small stipend to teach calculus to undergrads while I learned even more math, and at the end of four and a half years of that --- including lots of one-on-one mentorship from my advisor --- I just barely able to kinda sorta produce some publishable-but-not-earthshattering research. If I wanted to produce research I was actually proud of, it probably would have taken several more years of putting in reps on less impressive stuff, but I left the field before reaching that point.
Imagine a world where any research I could have produced at the end of those eight and a half years would be inferior to something an LLM could spit out in an afternoon, and where a different LLM is a better calculus instructor than a 22-year-old nicf. (Not a high bar!) How many people are going to spend all those years learning all those skills? More importantly, why would they expect to be paid to do that while producing nothing the whole time?
That is if you don't want to get into unsafe code.
[1] https://caitlinrivers.substack.com/p/understanding-mysteriou...
Unfortunately I don't know if there's an English equivalent, and considering how awful of a language Dutch is to learn it may be easier to learn Japanese, read the originals, and look up all the references yourself.
(You might want to pick a value that runs reasonably well on old phones, or have it adjust based on frame rate. Alternatively just put a some links at the top of the article.)
See https://ciechanow.ski/ (very popular on this website) for a world-class example of just how cool it is to embed simulations right in the article.
(Obligatory: back in my day, every website used to embed cool interactive stuff!)
--
Also, I think you can run a particle sim on GPU without WebGPU.
To this end, good coverage of decent (destructive) unification algorithms can be found in any simple resource on type inference (https://okmij.org/ftp/ML/generalization/sound_eager.ml) or the Warren Abstract Machine (https://github.com/a-yiorgos/wambook/blob/master/wambook.pdf).
Of course, there are times where you would want to reify substitutions as a data structure to compose and apply, but most of the time you just want to immediately apply them in a pervasive way.
Despite what another comment says, unification is a valid - and rather convenient - way to implement pattern matching (as in the style of ML) in an interpreter: much like how you rewrite type variables with types you are unifying them with, you can rewrite the pattern variables to refer to the parts of the (evaluated) value you are matching against (which you then use to extend the environment with when evaluating the right hand side).
Failed and not fun games do happen, but in general game developers put far more weight to this sort of process. There are no other goals that you can claim were accomplished if the users think the game sucks. So you get these ruthless production cycles and there's an appetite to cut things that don't work.
Hurlburt has great research on this using Descriptive Experience Sampling.
Some people mainly use images, others mainly speech, others mainly emotion etc. And many more use a varied mix.
Also the way each modality of thought is used is hugely variable - exactly what people see and with what quality or how precisely they feel emotional in their body etc.
To me it explains a huge amount of how different people are good at different skills.
I've a podcast on this topic ("Imagine an apple") if you're interested in more.
I think it's interesting that I consider myself to have a good memory but it is by far best at verbal associations (e.g. facts about a specific topic, object, word, or concept). The autobiographical part feels quite weak for me: I often find it hard to remember experiences in my life "by theme" (e.g. "think of a time when you felt X", "think of a time when you did Y very well or poorly"), and I very often don't remember what year a particular thing happened, or who was present with me on a particular occasion, or what I have or haven't done before with a particular friend or family member. I certainly have many vivid memories from my life, but they don't seem to be indexed that well by date, topic, or person.
I've been lucky enough to travel frequently, but I feel like I would be unable to answer questions like "how many times have you been to country X?" or "in what year did you first/last visit country Y?". (But I would probably be able to draw a decent map of specific places I've been, on various scales, and remember specific restaurants, train stations, landmarks, foreign language vocabulary, impressions of history and culture of various countries, etc. -- just not necessarily things like "when did you go there?". For example, my nephew recently asked me my impressions of Singapore, and I wrote him a six-page letter in reply with tons of particulars about all sorts of aspects of life/culture/politics/geography of Singapore, but I was unable to remember what years my trips there took place.)
Anyway, all this reminds me that "having a good memory" is definitely not just one thing!
If this is dangerous with normal English, how much more so with legal text.
At least if a lawyer drafts the text, there is at least one human with some sort of intentionality and some idea of what they're trying to say when they draft the text. With LLMs there isn't.
(And as I say in the linked post, I don't think that is fundamental to AI. It is only fundamental to LLMs, which despite the frenzy, are not the sum totality of AI. I expect "LLMs can generate legal documents on their own!" to be one of those things the future looks back on our era and finds simply laughable.)
When I read the Coming Technological Singularity back in the mid-90s it resonated with me and for a while I was a singularitarian- basically, dedicated to learning enough technology, and doing enough projects that I could help contribute to that singularity. Nowadays I think that's not the best way to spend my time, but it was interesting to meet Larry Page and see that he had concluded something familiar (for those not aware, Larry founded Google to provide a consistent revenue stream to carry out ML research to enable the singularity, and would be quite happy if robots replaced humans).
[ edit: I reread "The Coming Technogical Singularity". There's an entire section at the bottom that pretty much covers the past 5 years of generative models as a form of intelligence augmentation, he was very prescient. ]
> "Some of them are visibly fused. Some idiot must have welded them.”
> “Welded, yes. But not by some idiot. By the sun.”
> “Leo, it doesn’t get that hot—”
> “Not directly. What you’re seeing is spontaneous vacuum diffusion welding. Metal molecules are evaporating off the surfaces of the pieces in the vacuum. Slowly, to be sure, but it’s a measurable phenomenon. On the clamped areas they migrate into their neighboring surfaces and eventually achieve quite a nice bond. A little faster for the hot pieces on the sun side, a little slower for the cold pieces in the shade—but I’ll bet some of those clamps have been in place for twenty years.”
-- Falling Free (1988) by Lois McMaster Bujold