DWT usage and parallelized instructions

Hi there!

I have a question about how to verify that DWT cycle-counter measurements on the Cortex-M55 (CM55) are working correctly.
The question arose because, for some functions, the measured values look strange and are not always realistic.
For example, I'm attaching the code that tests the cycle measurement, along with the resulting disassembly and console output:

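#include <stdint.h>
#include <stdio.h>

uint32_t volatile *const DEMCR = (uint32_t volatile *)0xE000EDFC;      /* Debug Exception and Monitor Control Register */
uint32_t volatile *const DWT_CTRL = (uint32_t volatile *)0xE0001000;   /* DWT Control Register */
uint32_t volatile *const DWT_CYCCNT = (uint32_t volatile *)0xE0001004; /* DWT Cycle Count Register */
uint32_t volatile *const LAR = (uint32_t volatile *)0xE0001FB0;        /* DWT Lock Access Register */

void init_counter(void)
{
    *DEMCR |= (1UL << 24);   /* TRCENA: enable the DWT unit */
    *LAR = 0xC5ACCE55;       /* unlock key, written before the other DWT registers */
    *DWT_CYCCNT = 0;         /* reset the cycle counter */
    *DWT_CTRL |= 1UL;        /* CYCCNTENA: start counting */
}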


int main(void)
{
    init_counter();

    uint32_t cycles_1;
    uint32_t a, b;

    a = *DWT_CYCCNT;                /* counter value before the test sequence */
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    __asm__ __volatile__("nop");
    b = *DWT_CYCCNT;                /* counter value after the test sequence */
    cycles_1 = b - a;

    /* uint32_t is not necessarily unsigned int, so cast to match the format */
    printf("cycles: %lu\n", (unsigned long)cycles_1);

    return 0;
}
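
One thing I am unsure about is how much of the measured value is the cost of the two counter reads themselves. A minimal sketch of how I could correct for that, assuming the overhead is first measured as the difference between two back-to-back reads with nothing in between. The helper names `elapsed_cycles` and `net_cycles` are my own and not part of any library:

```c
#include <stdint.h>

/* Hypothetical helper: elapsed cycles between two CYCCNT samples.
   Unsigned 32-bit subtraction still gives the right result if the
   counter wrapped around once between the two reads. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: subtract the fixed cost of the measurement itself
   (the delta observed with nothing between the two reads), clamping the
   result at zero. */
static uint32_t net_cycles(uint32_t start, uint32_t end, uint32_t overhead)
{
    uint32_t raw = elapsed_cycles(start, end);
    return (raw > overhead) ? (raw - overhead) : 0u;
}
```

With this, the test above would report `net_cycles(a, b, overhead)` instead of `b - a`, which should make it easier to compare the result with the number of instructions in the measured sequence.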



Also, the optimization level is set to -O1.

Are there any additional settings or files that need to be added for the processor cycle measurement to work predictably and correctly?
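
In case it matters for the answer: I could also repeat each measurement several times and keep the smallest value, on the assumption that one-off effects such as cache misses or interrupts only ever add cycles. A minimal sketch of that idea, where `min_cycles` is a hypothetical helper of my own and the samples are cycle counts collected as in the code above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the smallest of n measured cycle counts.
   Returns UINT32_MAX when n is 0, i.e. when there are no samples. */
static uint32_t min_cycles(const uint32_t *samples, size_t n)
{
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < n; ++i) {
        if (samples[i] < best) {
            best = samples[i];
        }
    }
    return best;
}
```

If the minimum is stable from run to run while individual samples vary, I would take that as a sign that the counter itself works and the variation comes from the memory system or interrupts.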

I also have a question about the disassembly not showing any parallel execution of instructions. For example, load and MAC operations
should be handled by separate units and so can execute in parallel, but at least in the disassembly they are not shown that way.
Are there any settings that can be added or changed so that operations that can be performed in parallel are displayed as such in the disassembly?