Thanks for the feedback, Gregory. It is nice to hear from someone who does this for a living.
You hit on a number of points which I have wanted to improve on, but I don't know where to get better info. For instance, I am still struggling to get some kind of ballpark sense of how complex macro-operation fusion is to implement.
1. Can the complexity be compared to that of any other well-known micro-architecture challenge?
2. How does the complexity grow? Does it grow linearly? Would supporting 4 macro-op fusions cost 4x as many transistors as supporting 1? I am assuming you put in some kind of generic hardware to identify candidates for fusion?
I have similar questions around doing dispatch on compressed RISC-V instructions. Is there some ballpark figure on how much it adds to the complexity of a superscalar processor?
If they decompress a single instruction with 400 gates, will decompressing two instructions in parallel require 800 gates, or does the complexity grow in a non-linear fashion?
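To make the question concrete, here is a rough sketch of the dependency I have in mind. It assumes only the standard 16-bit and 32-bit encodings, where a 16-bit parcel whose two lowest bits are 11 starts a 32-bit instruction and anything else is a compressed one. The function name and the example parcel values are made up for illustration, and this is software, not a model of any real decoder.

```python
def instruction_starts(parcels):
    """Return the indices of the 16-bit parcels that begin an instruction.

    Assumption: only 16-bit and 32-bit instructions exist in the stream.
    """
    starts = []
    i = 0
    while i < len(parcels):
        starts.append(i)
        # Low two bits == 0b11: a 32-bit instruction spanning two parcels.
        # Anything else: a 16-bit compressed instruction in one parcel.
        i += 2 if (parcels[i] & 0b11) == 0b11 else 1
    return starts


# Hypothetical fetch window of four 16-bit parcels:
# a compressed instruction, a 32-bit instruction (two parcels), a compressed one.
window = [0x4501, 0x0513, 0x0015, 0x8082]
print(instruction_starts(window))
```

The loop shows what I am unsure about: where the second instruction starts depends on the length of the first, so I don't know whether hardware just duplicates the decompressor at every possible position or has to do something cleverer.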
I get that RISC-V was not designed primarily for performance, yet judging by the chips SiFive has made, they are doing great in terms of performance. From what I understand, they are able to match Arm performance with equal or fewer transistors for the lower-end chips.
Maybe you have some insights on this, but to me as a non-expert, that suggests there is nothing wrong with RISC-V in terms of its ability to perform.
Of course the claims they make about performance might be bogus.