Will read allocate mode impact bzero performance?

I'm trying to understand read allocate mode in the Cortex-A7 core. Based on the description of read allocate mode in the Cortex-A7 TRM (quoted below), my understanding is that this mode will degrade bzero performance, since bzero is essentially a memset over consecutive memory. For example, the lmbench bzero test allocates 100 MB of memory, clears it with memset, and then measures the performance. Please correct me if my understanding is wrong.
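To make the kind of measurement concrete, here is a minimal C sketch that times one memset() clear of a caller-supplied buffer. It is not lmbench's code; the function names `time_clear` and `clear_bandwidth_mib_s` and the 4 KiB page step are assumptions made for illustration. The buffer is pre-touched with volatile stores so that page-fault cost is not included in the timed clear.

```c
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h> /* malloc()/free() for callers */
#include <string.h>
#include <time.h>

#define PAGE_STEP 4096u /* assumption: pages are at least 4 KiB */

/* Clear len bytes of buf with memset() and return the elapsed time in
 * seconds, measured with CLOCK_MONOTONIC. */
double time_clear(unsigned char *buf, size_t len)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    memset(buf, 0, len); /* long run of consecutive full-line writes */
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) +
           (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Touch one byte per page so page faults happen before timing, clear the
 * buffer once, and return the clear bandwidth in MiB/s.
 * The buffer stays owned (and later freed) by the caller. */
double clear_bandwidth_mib_s(unsigned char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    /* volatile stores cannot be optimised away */
    volatile unsigned char *touch = buf;
    for (size_t i = 0; i < len; i += PAGE_STEP)
        touch[i] = 0xA5;

    double secs = time_clear(buf, len);
    return ((double)len / (1024.0 * 1024.0)) / secs;
}
```

A caller would `malloc()` a large buffer (for example `100u << 20` bytes), pass it to `clear_bandwidth_mib_s`, print the result, and `free()` it. As a rough sanity check, assuming 64-byte cache lines (the Cortex-A7 L1 data cache line length), a 100 MiB clear covers 100 × 2^20 / 64 = 1,638,400 lines. If every one of those write misses triggered a linefill that memset then fully overwrote, the core would fetch roughly 100 MiB from L2 only to discard it, which is the waste the TRM passage describes.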


The L1 data cache only supports a Write-Back policy. It normally allocates a cache line on either a read miss or a write miss. However, there are some situations where allocating on writes is undesirable, such as executing the C standard library memset() function to clear a large block of memory to a known value. Writing large blocks of data like this can pollute the cache with unnecessary data. It can also waste power and performance if a linefill must be performed only to discard the linefill data because the entire line was subsequently written by the memset().

To prevent this, the Bus Interface Unit (BIU) includes logic to detect when a full cache line has been written by the processor before the linefill has completed. If this situation is detected on three consecutive linefills, it switches into read allocate mode. When in read allocate mode, loads behave as normal and can still cause linefills, and writes still lookup in the cache but, if they miss, they write out to L2 rather than starting a linefill. The BIU continues in read allocate mode until it detects either a cacheable write burst to L2 that is not a full cache line, or there is a load to the same line as is currently being written to L2.