pietro / post-its
gpus 2026-04-10
how long is a GPU worth running for? (wip)
birds 2026-03-01
italian bird names:
- merlo (blackbird)
- civetta (little owl)
- passero (sparrow)
- gru (crane)
- quaglia (quail)
- gazza (magpie)
- rondine (swallow)
- airone (heron)
- picchio (woodpecker)
- colomba (dove)
gold 2026-02-19
all the gold that has ever been mined fits into a cube with side length 22m (73ft). mining adds 2 to 3% to the global above-ground stock each year.
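Back-of-the-envelope check of the cube claim (the gold density figure is my assumption, not from the note):

```python
side_m = 22.0                # cube side length from the note
density_t_per_m3 = 19.3      # density of gold, ~19,300 kg/m^3 (assumed)
mass_tonnes = side_m ** 3 * density_t_per_m3
print(f"{mass_tonnes:,.0f} tonnes")  # ~205,000 t, in line with common estimates
```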
quines 2026-01-20
I like quines!
- 8½
- Self-hosting compilers (with dangerous consequences, see Thompson's "trusting trust" attack).
- RepRap
Inherently unstable, since every change in the object's features must be matched by a corresponding change in how it's computed. Unlike most things in CS, we don't have the nice one-way causal dependency between program and output, so it should be approached more as a fixed-point problem.
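The fixed-point construction is tiny in Python (my illustration, not from the note): `%r` inserts the repr of `s`, quotes and escapes included, and `%%` is a literal `%`, so the program prints its own two-line source.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```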
moon-bird-plane 2025-12-27
Tigros’ Parking Lot.

civetta 2025-11-20
A list of things I’ve learned while writing civetta:
- Models are a composition of differential operators, so when building an autograd engine, you only have to write the explicit backward for a small set of “axiomatic” operators, and then, as long as you can express any other operation as a composition of these, you’re good.
- The computation graph represents (almost) everything: even the input tensors are part of the CG and not really different from the weight tensors; the only difference is that you don't take their grad (you could even take it in theory, as long as the optimizer doesn't see it). The loss is part of the CG as well. You could even theoretically represent the backward pass as an extension of the computation graph (like a mirror image of the forward pass), and the same for the optimizers (operators with a side effect on the weight tensors). At that point there is no longer even a concept of a backward pass: you are only doing forwards on this augmented CG.
- All ops that do memory wrangling during the backward pass are just redirections on the gradients (e.g. `concat`, `reshape`, …).
- You need very few "axiomatic" ops (i.e. ops for which you have to define the explicit backward); even `matmul` is not a fundamental op, since it's just an elementwise multiply (with broadcasting) and a sum reduce.
- One thing that the computation graph cannot capture properly is pipelining.
- A lot of parallelism is just defining a dimension across which you slice the tensor (e.g. DP is slicing across the batch dim, TP is slicing across the hidden dim, …).
- `ReLU` is the simplest nonlinearity, since in backprop it's literally just a 0-1 mask over whatever tensor we applied it to.
- Plain `softmax` is truly too numerically unstable, even for toy models. So things that interact with softmax (e.g. cross entropy) all need to be phrased using the log-probs rather than the probs, which is why cross entropy in torch takes logits directly (`CrossEntropyLoss`). So you always see `logsumexp` rather than softmax.
- Scalars are just 0-dim tensors; indexing is done with the empty slice `[]`.
- An easy way to store the activations for the backward pass is to write the backward function as a nested function inside the forward function, capturing the computed values via the closure.
- There are 2 interfaces:
- non-parametric functions: these do not hold params, so you can just thread them through.
- parametric functions (modules): since these hold params, they need to expose their params so that the optimizer can access them. PyTorch offers a functional interface also for the non-parametric ones (e.g. `relu` for `ReLU`). So all learnable modules end up being functionals, since they have to hold information across calls, which they do in the closure: the parameters are captured in the closure, which then exposes a forward function that uses them.
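The first few bullets can be sketched as a toy scalar autograd engine (my illustration, not civetta's actual code): only `mul` and `relu` get an explicit backward, and each backward is a closure over the forward values, i.e. the activations are stored in the closure.

```python
class Tensor:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents        # pairs of (parent, backward_closure)

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, back in self.parents:
            parent.backward(back(seed))

def mul(a, b):
    # the backward closures capture a.value / b.value from the forward pass
    return Tensor(a.value * b.value,
                  parents=((a, lambda g: g * b.value),
                           (b, lambda g: g * a.value)))

def relu(a):
    mask = 1.0 if a.value > 0 else 0.0   # backward is literally a 0-1 mask
    return Tensor(max(a.value, 0.0), parents=((a, lambda g: g * mask),))

x = Tensor(3.0)
w = Tensor(2.0)
y = relu(mul(x, w))      # forward: relu(3 * 2) = 6
y.backward()
print(x.grad, w.grad)    # 2.0 3.0 — d(xw)/dx = w, d(xw)/dw = x
```

Anything expressible as a composition of these axiomatic ops then gets its gradient for free via the chain rule in `backward`.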
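The closure-based module interface from the last bullet might look like this (hypothetical names, pure-Python sketch):

```python
def Linear(in_dim, out_dim):
    # parameters live in the closure; expose them so an optimizer can reach them
    W = [[0.25] * out_dim for _ in range(in_dim)]
    b = [0.0] * out_dim

    def forward(x):
        return [sum(x[i] * W[i][j] for i in range(in_dim)) + b[j]
                for j in range(out_dim)]

    forward.params = (W, b)   # the "expose your params" part of the interface
    return forward

layer = Linear(4, 2)
print(layer([1.0, 1.0, 1.0, 1.0]))  # [1.0, 1.0] — called like a plain function
```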
decomposition 2025-11-05
Given a complex high dimensional object, decompose it into a reduction of simpler, one-dimensional objects. There is generally an order of importance between the components.
- Taylor Expansion
- DFT (and friends)
- SVD
Remarkably, a very good approximation can be obtained with a very small number of components.
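A quick illustration with the first item on the list (my example, not from the note): a handful of leading Taylor terms already approximate e^x well near 0, and the terms come in decreasing order of importance.

```python
import math

def exp_taylor(x, n_terms):
    # partial sum of the series e^x = sum_k x^k / k!
    return sum(x ** k / math.factorial(k) for k in range(n_terms))

x = 1.0
for n in (2, 4, 8):
    approx = exp_taylor(x, n)
    print(n, approx, abs(approx - math.exp(x)))  # error shrinks rapidly with n
```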
bet 2025-09-04
2 opposing tendencies with respect to raw materials (e.g. ores and oil):
- (#1) The more we mine, the more we exhaust the deposits that are easy to access, and have to go deeper and deeper. Pulling energy out also takes energy, and you can calculate the EROI (energy return on investment), i.e. how many joules you get out per joule put in. In the US in the 1900s, the EROI for finding oil was ~1200:1; now it's ~5:1!
- (#2) The more we mine, the more materials we have to build infrastructure that is able to either extract or use those resources more efficiently (or that enables us to pivot to other things, e.g. using the energy from oil to produce solar panels).
Analogously:
- (#1) The more people, the more quickly we consume resources
- (#2) The more people, the more smart people can work on making extraction and use of resources more efficient
We are implicitly betting that (#2) can outpace (#1).
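The EROI collapse is easy to make concrete (my arithmetic, not from the note): 1/EROI is the fraction of gross energy output that must be reinvested just to keep extracting.

```python
for eroi in (1200, 5):
    print(f"EROI {eroi}:1 -> {1 / eroi:.2%} of output reinvested in extraction")
```

At 1200:1 that reinvested fraction is under 0.1%; at 5:1 it is a full 20%.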