Thanks for pointing this out. That was indeed a misunderstanding on my part. However, I didn't write it was easy to decode because I had this misconception but because numerous people with a much better grasp on this than me have come out and said that.
I still think you are grossly exaggurating the complexity. You can feed all the data you want to decode as instructions in parallel into some lookup tables to identify boundaries and use multiplexers to feed the right halfs of 32-bit instructions into each decoder.
Every 16-bit chunk in the instruction stream is going to be either the start of a 32/16-bit instruction or the end of a 32-bit instruction.
Every decoder can be given something which is either 32-bit or 16-bit start and figure out whether it needs the next 16-bit chunk or not in parallel.
I suspect a lookup table could provide the outputs to chose which decoders should be turned on or off.
For every 16-bit chunk you can send it as either the beginning of a decoder or as the end of a 32-bit instruction to the "previous" decoder. I tried to sketchup this idea in the article if this explanation is unclear.
But the fact that you are dealing with only two sizes, rather than 15 should make a big difference.