How to use the DWT mechanism of the Cortex-M33 to obtain the triggering instruction and the accessed memory address at run time?

I am trying to use the Data Watchpoint and Trace (DWT) unit to obtain, at run time, the instruction that accesses memory and the address it accesses.
Specifically, I am interested in STR (store) instructions that operate on a contiguous memory region (more than 1000 bytes).
I have successfully configured DEMCR and the relevant DWT registers: a matching access raises the debug monitor exception and DebugMon_Handler is executed.

However, I do not know how to obtain the corresponding instruction and the memory address that triggered the interrupt.
I can obtain the return address using the following code:
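#include <stdint.h>
#include <stdio.h>

__attribute__((naked)) void DebugMon_Handler(void) {
    __asm volatile(
        "tst lr, #4 \n"        /* EXC_RETURN bit 2: 0 = frame on MSP, 1 = frame on PSP */
        "ite eq \n"
        "mrseq r0, msp_ns \n"  /* pass the stacked frame pointer as the first argument */
        "mrsne r0, psp_ns \n"
        "b my_debug_handler \n"
    );
}

/* sp points at the stacked frame: r0, r1, r2, r3, r12, lr, return address, xPSR */
void my_debug_handler(uint32_t *sp) {
    printf("ret address:%p \n\r", (void *)(uintptr_t)sp[6]);
}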
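For context, the index 6 used above comes from the basic exception stack frame, which holds eight words in the order r0, r1, r2, r3, r12, lr, return address, xPSR. Below is a minimal host-side sketch of that layout. The enum names and `frame_return_address` are hypothetical and only for illustration, and the sketch assumes `sp` points at the start of the basic frame (that is, no additional state context was pushed below it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word offsets inside the eight-word basic exception frame (hypothetical names). */
enum {
    FRAME_R0 = 0,
    FRAME_R1,
    FRAME_R2,
    FRAME_R3,
    FRAME_R12,
    FRAME_LR,
    FRAME_RETURN_ADDRESS, /* word 6: where execution resumes after the handler */
    FRAME_XPSR,
    FRAME_WORDS
};

/* Return the stacked return address from a pointer to the basic frame. */
static uint32_t frame_return_address(const uint32_t *sp)
{
    assert(sp != NULL);
    return sp[FRAME_RETURN_ADDRESS];
}
```

With a frame whose seventh word is 0x0804116C, `frame_return_address` returns 0x0804116C, which is the same value as `*(sp+6)` in my handler.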
However, the printed "ret address" is not always the address of the instruction immediately after the store instruction.
Sometimes it is the next instruction, and sometimes it is two or three instructions later.
For example,
.text:0804116A loc_804116A
.text:0804116A CMP R1, R2
.text:0804116C BNE loc_8041170
.text:0804116E POP {R4,PC}
.text:08041170 loc_8041170
.text:08041170 LDRB.W R4, [R1],#1
.text:08041174 STRB.W R4, [R3,#1]!
.text:08041178 B loc_804116A
Here, the `STRB.W` instruction at 0x08041174 triggered the exception, but the ret address was printed as 0x0804116C.
What I really need is 0x08041178, the address of the instruction immediately after the `STRB.W` at 0x08041174,
because that address is what my analysis relies on.
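To make the address arithmetic in this example explicit: the instruction after a Thumb instruction lies 2 or 4 bytes later, depending on whether the encoding is 16-bit or 32-bit (a first halfword whose top five bits are 0b11101, 0b11110 or 0b11111 marks a 32-bit encoding). The sketch below covers only that arithmetic. It assumes the address of the triggering instruction is already known, which is exactly what the stacked return address does not give me. The helper names `thumb_insn_size` and `next_insn_address` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is hw1. */
static uint32_t thumb_insn_size(uint16_t hw1)
{
    uint32_t top5 = (uint32_t)hw1 >> 11; /* bits [15:11] of the first halfword */
    return (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu) ? 4u : 2u;
}

/* Address of the instruction that follows the one at insn_addr. */
static uint32_t next_insn_address(uint32_t insn_addr, uint16_t hw1)
{
    assert((insn_addr & 1u) == 0u); /* Thumb instructions are halfword aligned */
    return insn_addr + thumb_insn_size(hw1);
}
```

For the listing above, a 32-bit instruction at 0x08041174 gives 0x08041178, while a 16-bit instruction such as the CMP at 0x0804116A gives 0x0804116C.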

Why is the delay between the store and the reported return address not deterministic?
Also, how can I obtain the exact return address in DebugMon_Handler?
Any suggestions on how to obtain the exact memory address that triggered the exception?
