joshka 6 hours ago

If you're going to the effort of writing a proc macro, you may as well have the macro output the final string instead of code.

If you're going for idiomatic Rust, then you might instead output a type that has a `Display` impl rather than generating code that writes to stdout.
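A minimal sketch of that idea (the `FizzBuzz` type name is mine, not from the article): the value carries the parameters, and the `Display` impl does the rendering, so the caller decides where the output goes.

```rust
use std::fmt;

// Wrapper whose Display impl renders the fizzbuzz output up to `self.0`.
struct FizzBuzz(u32);

impl fmt::Display for FizzBuzz {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for i in 1..=self.0 {
            match (i % 3, i % 5) {
                (0, 0) => writeln!(f, "FizzBuzz")?,
                (0, _) => writeln!(f, "Fizz")?,
                (_, 0) => writeln!(f, "Buzz")?,
                _ => writeln!(f, "{i}")?,
            }
        }
        Ok(())
    }
}

fn main() {
    // The caller chooses the sink: print it, write it to a file, to_string() it.
    print!("{}", FizzBuzz(15));
}
```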

Arnavion 3 hours ago

If OP is looking for ideas, there are two intermediate steps between the extremes of "write every line to stdout" and "build up a buffer of the whole output and then write it to stdout".

1. `stdout().lock()` and `writeln!()` to that. By default, the `print*!()` macros write via `stdout()`, which takes a process-wide lock on each call. (Funnily enough, they use `.lock()` in the "build up a buffer of the whole output" section just to do one `.write_all()` call, which is the one time they don't need `.lock()`, because `Stdout`'s impl of `write_all()` only takes the lock once anyway.)

2. Wrap the locked stdout in a `BufWriter` and `writeln!()` to that. It won't flush on every line, but it also won't buffer the entire output, so it's a middle ground between speed and memory usage.
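Both intermediate steps fit in one sketch (this is my own illustration, not the article's code): step 1 is taking the lock once, step 2 adds the `BufWriter` on top.

```rust
use std::io::{stdout, BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Step 1: take the stdout lock once, instead of once per print call.
    let locked = stdout().lock();
    // Step 2: buffer writes so lines reach the kernel in chunks
    // (8 KiB by default) rather than one write per line.
    let mut out = BufWriter::new(locked);
    for i in 1..=1_000_000u32 {
        match (i % 3, i % 5) {
            (0, 0) => writeln!(out, "FizzBuzz")?,
            (0, _) => writeln!(out, "Fizz")?,
            (_, 0) => writeln!(out, "Buzz")?,
            _ => writeln!(out, "{i}")?,
        }
    }
    out.flush() // write out whatever is still buffered
}
```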

---

For the final proc macro approach, there are two further options: unroll the loop in the generated code, or generate a `&'static str` literal of the entire output.

ainiriand 7 hours ago

In my opinion, a more accurate measure when you go down to the microsecond level is reading the TSC directly from the CPU. I've built a benchmark tool for that: https://github.com/sh4ka/hft-benchmarks

Also, I think that CPU pinning could help in this context, but perhaps I need to check the code on my machine first.

  • vlovich123 3 hours ago

    How does this compare with divan?

jasonjmcghee 8 hours ago

Reminds me of the famous thread on Code Golf Stack Exchange. I'll link the Rust answer directly, but one C++ answer claims 283 GB/s, and others are in the ballpark of 50 GB/s.

The Rust one claims around 3 GB/s:

https://codegolf.stackexchange.com/a/217455

You can take this much further! I think throughput is a great way to measure it.

Things like pre-allocation, branchless code, constants, SIMD, etc.

hyperhello 8 hours ago

Maybe I’m missing something, but can’t you unroll it very easily, 15 prints at a time? That would skip the modulo checks entirely, and you could actually cache everything but the last two or three digits.
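For the classic 3/5 case, the unroll works because the Fizz/Buzz pattern repeats with period 15. A minimal sketch (my own, building a `String` rather than printing, so the shape is easy to see):

```rust
// Emit `blocks` periods of 15 lines each, with no modulo checks:
// within a block, only the eight numeric lines change.
fn fizzbuzz_unrolled(blocks: u32) -> String {
    let mut out = String::new();
    for b in 0..blocks {
        let i = b * 15; // block covers i+1 ..= i+15
        out.push_str(&format!(
            "{}\n{}\nFizz\n{}\nBuzz\nFizz\n{}\n{}\nFizz\nBuzz\n{}\nFizz\n{}\n{}\nFizzBuzz\n",
            i + 1, i + 2, i + 4, i + 7, i + 8, i + 11, i + 13, i + 14
        ));
    }
    out
}

fn main() {
    print!("{}", fizzbuzz_unrolled(2)); // prints lines for 1..=30
}
```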

  • Terretta 7 hours ago

    > Maybe I’m missing something but can’t you unroll it very easily by 15...

    Sure, 3 x 5 = 15. But, FTA:

    > But then, by coincidence, I watched an old Prime video and decided to put the question to him: how would you extend this to 7 = "Baz"?

    > He expanded the if-else chain: I asked him to find a way to do it without explosively increasing the number of necessary checks with each new term added. After some hints and more discussion...

    Which is why I respectfully submit that almost all examples of FizzBuzz, including the article's first, are "wrong", while the refactor is "right".

    As for the optimizations, they don't focus on only 3 and 5, they include 7 throughout.

Etherlord87 4 hours ago

> At this point, I'm out of ideas. The most impactful move would probably be to switch to a faster terminal... but I'm already running Ghostty! I thought it was a pretty performant terminal to begin with!

But what is the point? Why do you want to optimize the display? If you want to be able to fizz-buzz millions of numbers, then realistically you only want to compute them just before they are displayed.

  • Arnavion 3 hours ago

    Because the display is the bottleneck.