Hello,
I got a significant performance gain by reducing the number of cuboids. I started from the aorta3d example, which creates 8 cuboids per CPU core when run in parallel. That was too many in my case, so I cut it down to 1 cuboid per core (32 cuboids on 32 cores).
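To make the arithmetic concrete, here is a minimal sketch of how the total cuboid count follows from the per-core factor. The helper name `numberOfCuboids` and its parameters are made up for illustration and are not taken from the OpenLB source. In the real example you would change the cuboid count passed to the cuboid geometry instead.

```cpp
#include <cassert>

// Hypothetical helper (not part of OpenLB): total number of cuboids
// for a given number of MPI ranks and a chosen cuboids-per-rank factor.
constexpr int numberOfCuboids(int mpiRanks, int cuboidsPerRank) {
  return mpiRanks * cuboidsPerRank;
}

// Compile-time check of the two settings discussed above, for 32 cores.
static_assert(numberOfCuboids(32, 8) == 256, "8 cuboids per core");
static_assert(numberOfCuboids(32, 1) == 32, "1 cuboid per core");
```

With 32 cores, the original factor of 8 gives 256 cuboids, while a factor of 1 gives 32. Whether fewer cuboids is faster for you will depend on your geometry and machine, so it is worth timing both.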
Hope this helps. Good luck!
Sylvain