A bit outside my field, Matthew. It has been a long time since I coded any Fourier transform algorithms, but I will try to answer to the best of my knowledge.
The highest-performing chips today are of course not RISC-V but ARM and x86 designs. However, RISC-V is succeeding at something none of those chips are: becoming the common instruction set for a large variety of custom accelerators.
For instance, your particular problem requires evaluating many sines and cosines, which is computationally expensive, so custom hardware is useful in this case. This is where RISC-V can help. These researchers are developing a RISC-V-based trigonometric function accelerator:
https://www.jstage.jst.go.jp/article/elex/advpub/0/advpub_18.20210266/_pdf/-char/ja
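To see why trigonometric throughput dominates this kind of workload, here is a rough sketch in plain Python (the function name is mine, purely for illustration): a naive DFT of N samples evaluates on the order of N² complex exponentials, and each one is a sine/cosine pair underneath.

```python
import cmath

def naive_dft(x):
    """Naive discrete Fourier transform.

    Evaluates N*N complex exponentials, each of which is a
    cosine/sine pair underneath -- this is the trig workload
    that dedicated hardware tries to accelerate.
    """
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]
```

An FFT brings the operation count down to roughly N log N, but the inner operation is still dominated by these sine/cosine evaluations, which is why a trig accelerator pays off.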
But until these kinds of solutions become broadly available, I think accelerators for scientific computing/machine learning from GPU makers like Nvidia and AMD are the way to go. Boards like the Nvidia H100 and A100 have specialized hardware (special function units) for fast trigonometric evaluation.
And you can run a lot of these calculations in parallel. You would use something like CUDA to program this. Of course you may already be well aware of this given your background.
Frequent access to the same memory elements probably applies in your case, so you could move frequently used data into shared (local) memory in a CUDA kernel. I cover this topic a bit in this article: https://itnext.io/graphics-processors-gpus-under-the-hood-4522dbec777d
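The same "stage frequently reused values in fast local storage" idea can be sketched in plain Python, without any GPU: in a DFT, the twiddle factors repeat with period N, so you can compute the N of them once and reuse them instead of recomputing a trig function on every inner-loop iteration. This is only an analogy for CUDA shared memory, and the function name is mine:

```python
import cmath

def dft_with_twiddle_table(x):
    """DFT that precomputes the N twiddle factors once and reuses them.

    Analogous to staging frequently reused data in fast local/shared
    memory on a GPU: the expensive values are computed once, then
    read many times from the small table.
    """
    N = len(x)
    # Computed once: the N distinct twiddle factors.
    w = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
    # Reused N*N times: exponents repeat modulo N.
    return [
        sum(x[n] * w[(k * n) % N] for n in range(N))
        for k in range(N)
    ]
```

In a real CUDA kernel the table would live in `__shared__` memory so every thread in the block can read it without touching slow global memory.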
If you need specialized hardware at the edge, e.g. in a microcontroller rather than a big machine in a regular office, then maybe one of the Nvidia embedded solutions would fit you: