2025-02-13 Ordinary Meeting Notes

Feb 13, 2025 | RV-LFX Performance Event Sampling TG

Attendees: Beeman Strong

 

Notes

  • Attendees: Beeman, Sun, Hasan, Robert, Snehasish, Qin, Daniel, Chun

  • Slides, video

  • Agenda:

    • Continue reviewing sample data collected

    • Late sample filtering

    • New perf events

    • Sample collection

  • Intros

    • Sun & Qin - from Alibaba

  • Resume discussion about sample data, with latencies

    • All these fields, save for those that are optional/opt-in, would be recorded with every sample

    • Retire stall is a single field that holds one of two values:

      • If the selected inst completes before it is oldest, count the cycles that it sits waiting for older insts to retire

      • If the selected inst becomes oldest before it is complete, count the cycles that it is oldest, and hence delaying younger insts

    • Not including things like GPRs in sample record because very hard for HW to read them

      • Would need an extra RF port?  And latency to read them all.  Might end up stuck with a ucode-based implementation, for the output-to-memory configuration

      • The proposed sample data could all reside in (non-renamed) CSRs, easy for HW to store out

      • Assume that users who want to collect GPRs or call-stack will use the interrupt-based mechanism, then handler can collect whatever it wants

      • Shadow stack could be better than walking call-stack, but not aware of production uses

        • Nobody aware of production use of SS

    • Retire stall needs as many bits as total/mem latency counters, since an inst could spend a lot of time there

    • Rather than just counting stalls, should we record full latency at each high-level stage of execution?

      • Decoupled from the actual pipeline, but can imagine ~5 phases of execution, such that the combined time in each adds up to total latency

      • AI: Beeman to start an email thread about this

      • Want to make sure we keep retire pushout

      • Google models execution unit latency info

        • Adding it to the sample record could be helpful to tune these models, but think they are pretty accurate already

        • So unclear if it would be worth extra HW

    • What about instructions that do multiple loads?

      • Implementations have option to select one to sample, or combine latencies across all

      • But will only have 1 data address, 1 data source, etc
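The two retire stall cases above can be illustrated with a minimal sketch. It assumes hypothetical cycle timestamps `complete_cycle` (when the sampled inst finishes execution) and `oldest_cycle` (when it becomes the oldest inst in flight); these names, and the `kind` tag, are illustrative only and were not part of the discussion. The notes only say that a single field holds one of the two values.

```python
from dataclasses import dataclass


@dataclass
class RetireStall:
    cycles: int  # value that would go in the single retire stall field
    kind: str    # hypothetical tag, only to show which case applied


def retire_stall(complete_cycle: int, oldest_cycle: int) -> RetireStall:
    """Model the retire stall value for one sampled instruction.

    Both arguments are hypothetical cycle timestamps (assumptions of this sketch).
    """
    if complete_cycle <= oldest_cycle:
        # Completed first: cycles spent waiting for older insts to retire.
        return RetireStall(oldest_cycle - complete_cycle, "waiting_on_older")
    # Became oldest first: cycles spent as oldest, delaying younger insts.
    return RetireStall(complete_cycle - oldest_cycle, "blocking_younger")


if __name__ == "__main__":
    print(retire_stall(complete_cycle=10, oldest_cycle=14))
    print(retire_stall(complete_cycle=20, oldest_cycle=12))
```

In this model the field is simply the gap between the two timestamps, so a sample where both happen in the same cycle records zero.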

  • Discussing late filtering options

    • Propose no filtering by address, but can filter by most General fields and all Events.  And filter by one latency, which is configurable.

    • Keeps CSR state manageable while supporting key functionality

    • All filtering would be mask & match, save for threshold compare for latency

      • Could latency compare be mask & match too, so just powers of two for the threshold?

      • Might mean the threshold can’t be tuned to match the access latency of individual caches

    • Could SW just do the filtering?

      • Yes, but HW filtering reduces the overhead by avoiding interrupts or stored records for uninteresting samples

      • SW may have to do filtering for esoteric cases, but for general perf analysis the HW filters should be effective

    • What’s address source?

      • Where did the translation come from: L1 TLB, LL TLB, page walk, …
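As a rough software model of the late filtering proposal above: mask & match on the General fields and Events, plus a threshold compare on one selectable latency. The field packing, the names `Sample` and `LateFilter`, the latency keys, and the use of `>=` for the threshold compare are all assumptions made for illustration; none of them were specified in the meeting. The helper `pow2_threshold_pass` shows one possible reading of the power-of-two alternative, where the compare reduces to checking the high bits of the latency.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    general: int  # packed General fields (hypothetical encoding)
    events: int   # event bit vector (hypothetical encoding)
    latencies: dict = field(default_factory=dict)  # e.g. {"total": 150, "mem": 120}


@dataclass
class LateFilter:
    general_mask: int = 0
    general_match: int = 0
    events_mask: int = 0
    events_match: int = 0
    latency_select: str = "total"  # which single latency is compared
    latency_threshold: int = 0

    def accepts(self, s: Sample) -> bool:
        # Mask & match: only the bits selected by the mask must equal the match value.
        if (s.general & self.general_mask) != (self.general_match & self.general_mask):
            return False
        if (s.events & self.events_mask) != (self.events_match & self.events_mask):
            return False
        # Threshold compare on the one configured latency (>= is an assumption).
        return s.latencies[self.latency_select] >= self.latency_threshold


def pow2_threshold_pass(latency: int, k: int) -> bool:
    """True when latency >= 2**k, using only a mask over the high bits."""
    return (latency & ~((1 << k) - 1)) != 0


if __name__ == "__main__":
    f = LateFilter(events_mask=0b11, events_match=0b01,
                   latency_select="mem", latency_threshold=100)
    print(f.accepts(Sample(general=0, events=0b101,
                           latencies={"total": 150, "mem": 120})))
    print(pow2_threshold_pass(127, 7), pow2_threshold_pass(128, 7))
```

The mask-only variant can express thresholds of 64 or 128 cycles but nothing in between, which is the tuning limitation raised in the discussion.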

  • Out of time

 

Action items



RISC-V International