Jose to present a performance comparison across several alternatives. See the slides and spreadsheet below. In the spreadsheet, tab C+ does not represent a new option: it is Option C with the matrix data pre-packed before the kernel runs, as is normally done in high-performance BLAS implementations. See tab C for performance with plain (unpacked) data.
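
For readers unfamiliar with packing, here is a minimal sketch of the general idea, not the code measured in the spreadsheet. It assumes a row-major `float` matrix and a hypothetical helper `pack_rows` with a panel height `mr`; the names, element type, and layout are illustrative assumptions, and the layout actually used for tab C+ may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not taken from the slides
// or the spreadsheet.
// Packs a row-major m x k matrix into panels of `mr` rows each.
// Inside a panel, elements are stored column by column, so a kernel that
// consumes mr values per step along k reads memory contiguously.
// A trailing partial panel is zero-padded to a full mr rows.
std::vector<float> pack_rows(const std::vector<float>& a,
                             std::size_t m, std::size_t k, std::size_t mr) {
    assert(mr > 0 && a.size() == m * k);
    const std::size_t panels = (m + mr - 1) / mr;  // ceil(m / mr)
    std::vector<float> packed(panels * mr * k, 0.0f);
    std::size_t out = 0;
    for (std::size_t p = 0; p < panels; ++p) {
        for (std::size_t col = 0; col < k; ++col) {
            for (std::size_t r = 0; r < mr; ++r, ++out) {
                const std::size_t row = p * mr + r;
                if (row < m) {
                    packed[out] = a[row * k + col];
                }
                // Rows past the end keep the zero padding.
            }
        }
    }
    return packed;
}
```

For example, packing the 3 x 2 matrix with rows (1, 2), (3, 4), (5, 6) using `mr = 2` yields the buffer 1, 3, 2, 4, 5, 0, 6, 0. The copy costs O(m·k) once, which is why timings with and without it (tabs C+ and C) are worth reporting separately.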