2025-02-13 Ordinary Meeting Notes
Feb 13, 2025 | RV-LFX Performance Event Sampling TG
Attendees: Beeman Strong
Notes
Attendees: Beeman, Sun, Hasan, Robert, Snehasish, Qin, Daniel, Chun
Agenda:
Continue reviewing sample data collected
Late sample filtering
New perf events
Sample collection
Intros
Sun & Qin - from Alibaba
Resume discussion about sample data, with latencies
All these fields, save for those that are optional/opt-in, would be recorded with every sample
Retire stall is a single field that holds one of two values:
If the selected inst completes before it is oldest, count the cycles that it sits waiting for older insts to retire
If the selected inst becomes oldest before it is complete, count the cycles that it is oldest, and hence delaying younger insts
Not including things like GPRs in sample record because very hard for HW to read them
Would need an extra RF port? And latency to read them all. Might end up stuck with a ucode-based implementation, for the output-to-memory configuration
The proposed sample data could all reside in (non-renamed) CSRs, easy for HW to store out
Assume that users who want to collect GPRs or call-stack will use the interrupt-based mechanism, then handler can collect whatever it wants
Shadow stack could be better than walking call-stack, but not aware of production uses
Nobody aware of production use of SS
Retire stall needs as many bits as total/mem latency counters, could spend a lot of time there
Rather than just counting stalls, should we record full latency at each high-level stage of execution?
Decoupled from the actual pipeline, but can imagine ~5 phases of execution, such that the combined time in each adds up to total latency
AI Beeman to start an email thread about this
Want to make sure we keep retire pushout
Google models execution unit latency info
Adding it to the sample record could be helpful to tune these models, but think they are pretty accurate already
So unclear if it would be worth extra HW
What about instructions that do multiple loads?
Implementations have option to select one to sample, or combine latencies across all
But will only have 1 data address, 1 data source, etc
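Illustrative sketch (not presented in the meeting) of one way to read the retire stall definition above. The timestamp names `complete_cycle` and `oldest_cycle` are hypothetical, as is the `kind` tag; the notes do not say how the two cases would be distinguished in the record. It also assumes the inst retires in the first cycle in which it is both complete and oldest.

```python
from dataclasses import dataclass


@dataclass
class SampledInst:
    complete_cycle: int  # hypothetical: cycle in which execution completed
    oldest_cycle: int    # hypothetical: cycle in which the inst became oldest


def retire_stall(inst: SampledInst) -> tuple[str, int]:
    """Return (kind, cycles) for the single retire stall field.

    Assumes retirement happens as soon as the inst is both complete
    and oldest, so each case reduces to a timestamp difference.
    """
    if inst.complete_cycle < inst.oldest_cycle:
        # Completed early: cycles spent waiting for older insts to retire.
        return ("waiting_on_older", inst.oldest_cycle - inst.complete_cycle)
    # Became oldest first: cycles spent as oldest, delaying younger insts.
    return ("delaying_younger", inst.complete_cycle - inst.oldest_cycle)


early = retire_stall(SampledInst(complete_cycle=10, oldest_cycle=25))
late = retire_stall(SampledInst(complete_cycle=40, oldest_cycle=25))
```

Only one of the two counts is ever nonzero for a given inst, which is why a single field can hold either value.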
Discussing late filtering options
Propose no filtering by address, but can filter by most General fields and all Events. And filter by one latency, which is configurable.
Keeps CSR state manageable while supporting key functionality
All filtering would be mask & match, save for threshold compare for latency
Could latency compare be mask & match too, so just powers of two for the threshold?
Might mean the threshold can't be tuned to match the access latency of individual caches
Could SW just do the filtering?
Yes, but HW filtering reduces the overhead by avoiding interrupts or stored records for uninteresting samples
SW may have to do filtering for esoteric cases, but for general perf analysis the HW filters should be effective
What’s address source?
Where did the translation come from: L1 TLB, LL TLB, page walk, …
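Illustrative sketch (not presented in the meeting) of the late filtering proposal above: mask & match on the General fields and Events, plus a threshold compare on one selected latency. All field names, the bit layout, and the `>=` compare direction are assumptions made for the example, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    general: int               # hypothetical packed General fields
    events: int                # hypothetical event bit vector
    latencies: dict[str, int]  # hypothetical: latency name -> cycles


@dataclass
class FilterConfig:
    general_mask: int
    general_match: int
    events_mask: int
    events_match: int
    latency_select: str        # which single latency is compared
    latency_threshold: int


def keep_sample(s: Sample, f: FilterConfig) -> bool:
    """Return True if the sample passes every configured filter."""
    # Mask & match: only the bits selected by the mask are compared.
    if (s.general & f.general_mask) != f.general_match:
        return False
    if (s.events & f.events_mask) != f.events_match:
        return False
    # Threshold compare on the one configurable latency (assumed >=).
    return s.latencies[f.latency_select] >= f.latency_threshold


# Hypothetical config: require event bit 0 set, mem latency >= 100 cycles.
cfg = FilterConfig(general_mask=0, general_match=0,
                   events_mask=0b1, events_match=0b1,
                   latency_select="mem", latency_threshold=100)
slow_miss = Sample(general=0x3, events=0b101, latencies={"total": 180, "mem": 150})
fast_hit = Sample(general=0x3, events=0b101, latencies={"total": 12, "mem": 4})
```

With this configuration `slow_miss` is kept and `fast_hit` is dropped, so no interrupt or stored record is spent on the uninteresting sample.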
Out of time
Action items
Feb 13, 2025 - Beeman Strong - Start thread on latency options
RISC-V International