2025-02-13 Ordinary Meeting Notes

Feb 13, 2025 | RV-LFX Performance Event Sampling TG

Attendees: Beeman Strong

Notes

Attendees: Beeman, Sun, Hasan, Robert, Snehasish, Qin, Daniel, Chun
Slides, video
Agenda:
- Continue reviewing sample data collected
- Late sample filtering
- New perf events
- Sample collection
Intros
- Sun & Qin - from Alibaba
Resume discussion about sample data, with latencies
- All these fields, save for those that are optional/opt-in, would be recorded with every sample
- Retire stall is a single field that holds one of two values:
  - If the selected inst completes before it is oldest, count the cycles that it sits waiting for older insts to retire
  - If the selected inst becomes oldest before it is complete, count the cycles that it is oldest, and hence delaying younger insts
- Not including things like GPRs in sample record because very hard for HW to read them
  - Would need an extra RF port? And latency to read them all. Might end up stuck with a ucode-based implementation, for the output-to-memory configuration
  - The proposed sample data could all reside in (non-renamed) CSRs, easy for HW to store out
  - Assume that users who want to collect GPRs or call-stack will use the interrupt-based mechanism, then handler can collect whatever it wants
  - Shadow stack could be better than walking call-stack, but not aware of production uses
    - Nobody aware of production use of SS
- Retire stall needs as many bits as total/mem latency counters, could spend a lot of time there
- Rather than just counting stalls, should we record the full latency at each high-level stage of execution?
  - Decoupled from the actual pipeline, but can imagine ~5 phases of execution, such that the combined time in each adds up to total latency
  - AI: Beeman to start an email thread about this
  - Want to make sure we keep retire pushout
  - Google models execution unit latency info
    - Adding it to the sample record could be helpful to tune these models, but think they are pretty accurate already
    - So unclear if it would be worth extra HW
- What about instructions that do multiple loads?
  - Implementations have option to select one to sample, or combine latencies across all
  - But will only have 1 data address, 1 data source, etc
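A minimal sketch of the two-case retire stall definition discussed above. The function name, the two cycle timestamps, and the returned labels are hypothetical, chosen only for illustration. The sketch also assumes an instruction retires as soon as it is both complete and oldest, and the notes do not say how (or whether) the record distinguishes the two cases.

```python
def retire_stall(complete_cycle: int, oldest_cycle: int) -> tuple[str, int]:
    """Return (case, cycles) for the single retire-stall field of one sample.

    complete_cycle: cycle at which the sampled inst finished executing (assumed name)
    oldest_cycle:   cycle at which the sampled inst became the oldest in flight (assumed name)
    """
    if complete_cycle < oldest_cycle:
        # Completed first: count the cycles spent waiting for older insts to retire.
        return ("waiting_on_older", oldest_cycle - complete_cycle)
    # Became oldest first: count the cycles it is oldest, delaying younger insts.
    return ("delaying_younger", complete_cycle - oldest_cycle)


print(retire_stall(10, 25))
print(retire_stall(40, 25))
```

Under these assumptions both cases reduce to the gap between the two timestamps, which is consistent with one field holding either value.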
Discussing late filtering options
- Propose no filtering by address, but allow filtering by most General fields and all Events, plus filtering by one configurable latency.
- Keeps CSR state manageable while supporting key functionality
- All filtering would be mask & match, save for threshold compare for latency
  - Could latency compare be mask & match too, so just powers of two for the threshold?
  - Might mean can’t tune the threshold to match the access latency of individual caches
- Could SW just do the filtering?
  - Yes, but HW filtering reduces the overhead by avoiding interrupts or stored records for uninteresting samples
  - SW may have to do filtering for esoteric cases, but for general perf analysis the HW filters should be effective
- What’s address source?
  - Where did the translation come from: L1 TLB, LL TLB, page walk, …
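The proposed late filter can be sketched as a software model: mask & match on the General and Events fields, plus a threshold compare on one selected latency. All class names, field names, and bit encodings here are hypothetical, and the use of >= for the threshold compare is an assumption; the notes do not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    general: int    # packed General fields (hypothetical encoding)
    events: int     # one bit per Event (hypothetical encoding)
    latencies: dict # latency name -> cycles, e.g. {"total": 150, "mem": 120}


@dataclass
class LateFilter:
    general_mask: int
    general_match: int
    events_mask: int
    events_match: int
    latency_select: str     # which single latency is compared
    latency_threshold: int  # minimum cycles for the sample to be kept

    def keeps(self, s: Sample) -> bool:
        # Mask & match: the selected bits must equal the match value exactly.
        if (s.general & self.general_mask) != self.general_match:
            return False
        if (s.events & self.events_mask) != self.events_match:
            return False
        # Threshold compare on the one configured latency (>= is assumed).
        return s.latencies[self.latency_select] >= self.latency_threshold


f = LateFilter(0b11, 0b01, 0b100, 0b100, "mem", 100)
print(f.keeps(Sample(0b0101, 0b110, {"total": 150, "mem": 120})))
print(f.keeps(Sample(0b0101, 0b110, {"total": 150, "mem": 40})))
```

A sample that fails any check would be dropped before raising an interrupt or being stored, which is where the overhead saving relative to SW-only filtering comes from.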
Out of time

Action items

Feb 13, 2025 - Beeman Strong - Start thread on latency options