Having trouble with interrupt latency and cycle time in M0+

Hello. This is my first time using an ARM MCU, and I have spent the past 2 weeks trying to figure out why I am seeing higher-than-normal interrupt latency and a cycle time of 125+ ns for a NOP.

https://www.infineon.com/dgdl/Infineon-S6E1A_Series_32_bit_Arm_Cortex_M0+_FM0+_Microcontroller-DataSheet-v05_00-EN.pdf?fileId=8ac78c8c7d0d8da4017d0edad5f45d7a&utm_source=cypress&utm_medium=referral&utm_campaign=202110_globe_en_all_integration-datasheet

I am using a Cypress ARM Cortex-M0+ part, for which the datasheet is linked above. The CPU runs at 40 MHz from an external oscillator, which should give a cycle time of 25 ns. I verified the prescaler values and also verified the clock by generating a 5 kHz PWM using the formula they provide. Everything checks out. But when I run the following code, the delay for one NOP measures about 125 ns. I know there is latency in the I/O itself, but comparing the timing of toggling the I/O with and without a NOP in between gives me the time for the NOP alone.

    FGpio1pin_Put(GPIO1PIN_P21, 0u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 1u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 0u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 1u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 0u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 1u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 0u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 1u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 0u);
    __ASM volatile ("NOP");
    FGpio1pin_Put(GPIO1PIN_P21, 1u);

I have also tried using a loop to generate busy waits using NOPs over a longer duration and got the same result. 
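A busy-wait loop of this kind can be sketched as follows. This is a minimal illustration, not the exact code used: `busy_wait_nops` is a made-up name, and it uses the GCC/Clang `__asm__` keyword instead of the CMSIS `__ASM` macro so that it compiles without the vendor headers. Note that each iteration also pays for the counter update, the compare, and the branch, so the time per iteration is expected to be several cycles, not one.

```c
#include <stdint.h>

/* Hypothetical helper: issues `count` NOPs in a loop and returns how many
 * were issued. The loop overhead (increment, compare, branch) is paid on
 * every iteration in addition to the NOP itself. */
uint32_t busy_wait_nops(uint32_t count)
{
    uint32_t issued = 0u;

    while (issued < count) {
        __asm__ volatile ("nop");  /* volatile keeps the compiler from removing it */
        issued++;
    }
    return issued;
}
```

Toggling the pin before and after a call such as `busy_wait_nops(1000u)` and dividing the measured interval by 1000 gives the time per iteration, which includes the loop overhead as well as the NOP.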

Is this something to do with ARM? I checked the clock prescalers and generated a PWM, and everything checks out. But when I try to do busy waits, the cycle time is nowhere near 25 ns. Moreover, the interrupt latency is 3 ms at best for the NMI and 4-6 us for GPIO interrupts. According to the datasheet, the interrupt latency should be within 500 ns. Also according to the datasheet, this particular MCU has zero wait states.

The code is a bare-bones main loop using the PDL library that Cypress provides.

I know these may be two different problems, or maybe not, but I am trying to understand why a NOP, which should take one cycle to complete, behaves as though the core is running at 8 MHz instead of 40 MHz. Reducing the clock via the prescalers does increase the cycle time of the NOPs.
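The arithmetic behind these figures can be written out as a small sketch. The function names are hypothetical and only illustrate the conversions. The inputs are the numbers quoted in this post.

```c
/* Period of one clock cycle, in nanoseconds, for a clock given in Hz. */
double cycle_time_ns(double clock_hz)
{
    return 1e9 / clock_hz;
}

/* Clock frequency, in Hz, at which one cycle would last `measured_ns`. */
double implied_clock_hz(double measured_ns)
{
    return 1e9 / measured_ns;
}

/* Number of clock cycles that fit into `measured_ns` at `clock_hz`. */
double cycles_in(double measured_ns, double clock_hz)
{
    return measured_ns * clock_hz / 1e9;
}
```

With a 40 MHz clock, `cycle_time_ns(40e6)` is 25 ns, while the measured 125 ns per NOP corresponds to `implied_clock_hz(125.0)`, which is 8 MHz, or equivalently `cycles_in(125.0, 40e6)`, which is 5 cycles per NOP. By the same conversion, the 500 ns latency from the datasheet is 20 cycles, and the measured 4-6 us for GPIO interrupts is 160 to 240 cycles.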

How do I verify the clock on an ARM Cortex-M0+ other than what I have already tried with PWM and NOPs? Any pointers will be appreciated.