Love how this chronicles the instruction count, starting at 301 million, and then shows each optimization and compromise cutting xx million instructions from the runtime.
I think the final 6502 version would need to be run in an emulator to get the retired instruction count. On 586+ CPUs such a counter is baked into the hardware.
My preference was to work in cycles. Many systems have a timer one can use to get cycle counts, but a stock Apple II doesn't. Many cards have the 6522 VIA chip, which does have two timers, though they are only 16-bit.
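To show why the 16-bit width matters, here is a rough sketch of turning two timer reads into an elapsed cycle count. It assumes an idealized 16-bit down-counter that wraps modulo 65536; real timer chips can have reload quirks that shift the count by a cycle or two, so treat this as arithmetic only. The function names are made up for illustration.

```python
COUNTER_MOD = 1 << 16  # assumed idealized 16-bit counter: 65536 states


def elapsed_short(start, end):
    """Cycles between two reads of a down-counter.

    Only valid if fewer than 65536 cycles passed between the reads,
    since a single pair of reads cannot tell how often the counter wrapped.
    """
    return (start - end) % COUNTER_MOD


def elapsed_long(start, end, underflows):
    """Cycles between two reads when underflows were counted separately.

    `underflows` is how many times the counter passed through zero
    between the reads (hypothetically tallied by an interrupt handler).
    """
    return underflows * COUNTER_MOD + start - end


print(elapsed_short(1000, 500))    # counter went 1000 -> 500
print(elapsed_long(1000, 500, 1))  # same reads, but with one wrap in between
```

At a clock near 1 MHz the counter wraps roughly every 64 ms, so anything longer needs the underflow tally or a shorter code section under test.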
Or, a quick hand timing gets fairly close. On that, the only real difficulty is finding a task that scales well with our perceptual slowness.
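For the hand-timing route, the arithmetic is just stopwatch time times clock rate, divided by the repeat count. A small sketch, with the clock rate left as a parameter because it depends on the machine, and with hypothetical function names:

```python
import math


def cycles_per_iteration(stopwatch_seconds, iterations, clock_hz):
    """Estimate cycles per iteration from a hand-timed run of many iterations."""
    return stopwatch_seconds * clock_hz / iterations


def iterations_for_target(estimated_cycles_per_iter, clock_hz, target_seconds=10.0):
    """How many repeats make the run last about target_seconds.

    A longer run shrinks the share of the total that human reaction
    time contributes.
    """
    return math.ceil(target_seconds * clock_hz / estimated_cycles_per_iter)


# Illustrative numbers only: a 1 MHz clock and a 10 second stopwatch reading.
print(cycles_per_iteration(10.0, 1000, 1_000_000))
print(iterations_for_target(5000, 1_000_000))
```

Picking the repeat count so the run lasts ten seconds or so is one way to deal with the scaling problem: a couple of tenths of a second of thumb delay then only moves the result by a few percent.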