Hi Thomas,
Thank you for sharing this test. I just ran it quickly and yes, this is really interesting!
Without the scale option, the load is close to 100% on each CPU (so the P-states ramp up to the turbo frequency), but with scale=320:208 the load oscillates (close to 50% on average), so the requested frequencies are lower (and the power is reduced as well).
I have a patch (not yet submitted) that reduces the gap for such a use case.
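
For anyone who wants to look at per-CPU load figures like the ones above, here is a rough sketch of how they could be sampled, assuming a Linux system that exposes /proc/stat. The function names are made up for illustration, and this is not necessarily how the numbers above were measured.

```python
import time


def parse_cpu_times(stat_text):
    """Return {cpu_name: (busy_jiffies, total_jiffies)} for each per-CPU line."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-CPU lines look like "cpu0 user nice system idle iowait irq ...".
        # Skip the aggregate "cpu" line and every non-CPU line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        values = [int(v) for v in fields[1:]]
        # Count idle + iowait as idle time.
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        # Sum user..steal only; guest time is already included in user/nice.
        total = sum(values[:8])
        times[fields[0]] = (total - idle, total)
    return times


def cpu_load_percent(before, after):
    """Return {cpu_name: load in percent} between two parse_cpu_times() results."""
    loads = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        delta_total = total_after - total_before
        if delta_total <= 0:
            continue  # no time elapsed for this CPU, nothing to report
        loads[cpu] = 100.0 * (busy_after - busy_before) / delta_total
    return loads


def sample_load(interval=1.0, path="/proc/stat"):
    """Sample the stat file twice, `interval` seconds apart, and return per-CPU load."""
    with open(path) as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_times(f.read())
    return cpu_load_percent(before, after)
```

Calling sample_load() in a loop while the test is running, with and without the scale option, should make the difference between a steady high load and an oscillating one visible. A short interval is needed to see the oscillation itself rather than only its average.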