ARM1176JZ-S, cache config: effective cache size calculation

Note: This was originally posted on 22nd February 2009 at http://forums.arm.com

Hello,

1) I am using ARM1176JZ-S core with WinCE Platform. The cache memory is configured as follows

    DCache: 128 sets, 4 ways, 32-byte line size, 16384 bytes total
    ICache: 128 sets, 4 ways, 32-byte line size, 16384 bytes total

    Now I want to know the effective data cache size, i.e. the total amount of data from main memory
    that could be cached and accessed without cache thrashing within a function.

2) Are the cache set size (128 sets) and the cache block/segment size (of other processors) the same thing?

Kindly reply to this mail; thanks in advance.

Regards,
Deven
  • Note: This was originally posted on 25th February 2009 at http://forums.arm.com

    Hello Sim,


    Thanks for the reply. I have put my understanding in the attached document, kindly review it and send your comments.

    -Regards
    Deven



  • Note: This was originally posted on 25th February 2009 at http://forums.arm.com

    Hello isogen74,

    Thanks for the information.

    From your explanation, does it mean there is only one segment and the effective cache size is 4KB?

    Main memory                                 Data cache memory
    =================                           =====================
       0 - 1023 bytes    ---------------->         0 -  4095 bytes   (4 ways)
    1024 - 2047 bytes    ---------------->      4096 -  8191 bytes   (4 ways)
    2048 - 3071 bytes    ---------------->      8192 - 12287 bytes   (4 ways)
    3072 - 4095 bytes    ---------------->     12288 - 16383 bytes   (4 ways)

    effective size = 4KB



    Pls clarify.

    thanks,
    Deven





  • Note: This was originally posted on 11th March 2009 at http://forums.arm.com

    Hello isogen74 & tum,

    Thanks for the information.

    I have one more question.

    Does the first memory access immediately after a cache flush consume fewer cycles than a memory access when the cache already holds some entries? (Assume both are cache-hit conditions.)

    I am doing code optimization. In my case, a cache flush before the algorithm execution benefits the algorithm's performance compared to running without a cache flush.


    Put another way: does the pseudo-random/round-robin cache line selection and fill behave differently when the cache is flushed before the operation?


    Kindly clarify.

    Thanks,
    Deven
  • Note: This was originally posted on 25th February 2009 at http://forums.arm.com

    Deven,

    From the data you have provided, the data-cache is 4-way set associative with 128 sets, and each line holds 32 bytes; multiplying these numbers together (128 * 4 * 32) produces the 16384-byte total size. A 32kB cache on this implementation would have twice the number of sets (256) and a 64kB variant would have twice the number again (512).

    The set index and byte offset within the line form a fixed mapping for any particular byte in memory; however, the byte may live in any of the 4 ways (hence the cache is 4-way set associative). The choice of way is made when the data is first fetched into the cache, based on a victim way pointer, which in turn is driven by some replacement algorithm (pseudo-random, round-robin, etc.).

    Given this information, it is theoretically possible for this data-cache to hold 16kB of sequential data starting from any cache-line-aligned memory address, though achieving this will depend on the interaction between the code and the cache replacement algorithm.

    The 4kB number you appear to be referring to is the size of a single way of the cache (128 lines * 32 bytes per line). Assuming you don't have any literal loads in your code, this is the size of a contiguous, cache-line-aligned block of data you could repeatedly read (in a loop) such that no evictions should occur after the first pass through the loop (each group of 32 bytes will sit in a separate line, though not necessarily in the same way).
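
    A minimal sketch of that repeated-read pattern, assuming a GCC-style compiler (the buffer name is illustrative; with the Microsoft WinCE toolchain you would use __declspec(align(32)) instead of the attribute):

        #include <stdint.h>

        /* One cache way: 128 lines * 32 bytes = 4096 bytes. */
        #define LINE_SIZE 32
        #define WAY_SIZE  (128 * LINE_SIZE)

        /* Hypothetical buffer, aligned to a cache line boundary. */
        static uint8_t buf[WAY_SIZE] __attribute__((aligned(LINE_SIZE)));

        uint32_t sum_one_way(void)
        {
            uint32_t sum = 0;
            int pass, i;

            /* Pass 0 takes 128 compulsory misses, one per set.
             * Later passes should hit in the L1 D-cache on every
             * access, since the data needs only one way per set. */
            for (pass = 0; pass < 16; pass++)
                for (i = 0; i < WAY_SIZE; i++)
                    sum += buf[i];
            return sum;
        }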

    hth
    s.
  • Note: This was originally posted on 5th March 2009 at http://forums.arm.com

    Hello,

    I have profiled the code below, but I could not see any cache advantage from repeated access.

    This code is placed in a WinCE application thread and profiled.

    [snipped]

    Could you explain where the error is?


    Thanks,
    Deven


    Hi Deven,

    Please don't assume that I'm an expert in this area, but my inclination is that under WinCE
    you'll hardly get trustworthy results for your tests, because apart from your 'profiled thread'
    there will be lots and lots of other threads also actively using the caches, and thus interfering
    with your results.

    Correct me if I'm wrong. (I know close to nothing about WinCE and your methods of profiling).
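
    One common way to reduce that interference, sketched here under the assumption that QueryPerformanceCounter is available on the target (the helper name is illustrative), is to time the loop many times and keep the fastest run:

        #include <windows.h>

        /* Time fn() several times and keep the minimum, i.e. the
         * run least disturbed by other threads and interrupts. */
        LONGLONG time_best_of(void (*fn)(void), int runs)
        {
            LARGE_INTEGER t0, t1;
            LONGLONG best = (LONGLONG)0x7FFFFFFFFFFFFFFF;
            int i;

            for (i = 0; i < runs; i++) {
                QueryPerformanceCounter(&t0);
                fn();                       /* loop under test */
                QueryPerformanceCounter(&t1);
                if (t1.QuadPart - t0.QuadPart < best)
                    best = t1.QuadPart - t0.QuadPart;
            }
            return best;   /* in QueryPerformanceFrequency() ticks */
        }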
  • Note: This was originally posted on 12th March 2009 at http://forums.arm.com

    [skipped]
    > Does the first memory access immediately after a cache flush consume fewer cycles than a memory access when the cache already holds some entries? (Assume both are cache-hit conditions.)
    >
    > I am doing code optimization. In my case, a cache flush before the algorithm execution benefits the algorithm's performance compared to running without a cache flush.
    >
    > Put another way: does the pseudo-random/round-robin cache line selection and fill behave differently when the cache is flushed before the operation?


    Interesting. How big is the performance difference? Do you observe this result every time? (assuming you've made several measurements, not just one or two)...
    To be honest, I still think your profiling methods are not giving you the results you can trust.
    The "algorithm execution" you're talking about is (assuming again) a rather long loop, therefore I would not expect you to notice the effect of 'the very first memory access'...
    Let's see what isogen74 can tell about this :)
  • Note: This was originally posted on 22nd February 2009 at http://forums.arm.com

    > the total amount of data from main memory that could be cached

    16KB for data, and 16KB for instructions are the critical numbers you will want. The rest is just noise unless you design a pathological algorithm which really abuses the cache.

    Some systems also include an L2 cache, which can cache more data between the L1 caches and main memory.

    > Are the cache set size (128 sets) and the cache block/segment size (of other processors) the same thing?

    It varies. Basically, the scheme you outline means that for any one address there are 4 possible places (ways) where the data may reside: 128 sets (of cache lines) * 4 ways * 32 bytes per cache line = 16 KB. 4-way caches are pretty common as they give a good trade-off between speed and cache utilization for typical code.

    For most caches on ARM systems the number of ways is fixed, but the number of sets depends on the size of the cache. In this case 64 sets = 8KB cache, 128 sets = 16KB cache, etc.
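
    In case it helps to see the arithmetic, here is a rough sketch of how a 32-bit address splits up for this 128-set, 4-way, 32-byte-line configuration (the function names are just for illustration):

        #include <stdint.h>

        /* Address split for a 16KB, 4-way, 32-byte-line cache:
         * bits [4:0]   = byte offset within the line (32 bytes),
         * bits [11:5]  = set index (128 sets),
         * bits [31:12] = tag.
         * The way is not derived from the address; it is picked by
         * the replacement policy when the line is filled. */
        uint32_t line_offset(uint32_t addr) { return addr & 0x1f; }
        uint32_t set_index(uint32_t addr)   { return (addr >> 5) & 0x7f; }
        uint32_t tag_bits(uint32_t addr)    { return addr >> 12; }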
  • Note: This was originally posted on 12th March 2009 at http://forums.arm.com

    If the cache is in write-back mode and the line is dirty, then the cache line has to be written back to memory before the line can be reloaded with new data. If the line is not dirty (either because it is empty, the data is read-only, or it has been flushed) then you can skip this step and just load the new data.

    How long the write-back stage takes depends on the microarchitecture of the processor: some designs block the cache line, whilst others have dedicated victim buffers for storing the lines to be evicted, which frees the line up for new data much earlier.
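
    A rough way to observe this (a sketch only; the buffer size and names are illustrative): stream through a region much larger than the 16KB D-cache, once after a read-only pass and once after a write pass, and compare the timings. Evicting clean lines is cheap; evicting dirty ones pays a 32-byte write-back first:

        #include <stdint.h>
        #include <string.h>

        #define BIG (16 * 16384)   /* 16x the 16KB L1 D-cache */
        static uint8_t big[BIG];

        /* Leaves any cached lines clean. */
        uint32_t read_pass(void)
        {
            uint32_t sum = 0;
            int i;
            for (i = 0; i < BIG; i++)
                sum += big[i];
            return sum;
        }

        /* Leaves cached lines dirty: evicting them later forces a
         * write-back of each 32-byte line before the refill. */
        void write_pass(void)
        {
            memset(big, 0xA5, sizeof(big));
        }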

    > The "algorithm execution" you're talking about is (assuming again) a rather long loop, therefore I would not expect you to notice the effect of 'the very first memory access'...

    It depends on the data structures. If you have data that is nicely packed (and accessed) in cache-line-sized chunks then you probably won't see much effect. If you have data structures which access one byte per cache line, pollute a lot of the cache, and only then access the next byte in each cache line, then you are likely to see a huge performance loss.
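
    The contrast is easy to reproduce. In this sketch (the array shape is chosen only for illustration) both functions read the same 256KB, but the column-order walk touches one byte per 32-byte line it fetches and will thrash a 16KB cache:

        #include <stdint.h>

        #define ROWS 512
        #define COLS 512   /* 256KB: much bigger than the 16KB L1 */
        static uint8_t grid[ROWS][COLS];

        uint32_t row_order(void)      /* cache friendly: sequential */
        {
            uint32_t sum = 0;
            int r, c;
            for (r = 0; r < ROWS; r++)
                for (c = 0; c < COLS; c++)
                    sum += grid[r][c];
            return sum;
        }

        uint32_t column_order(void)   /* cache hostile: 512-byte stride */
        {
            uint32_t sum = 0;
            int r, c;
            for (c = 0; c < COLS; c++)
                for (r = 0; r < ROWS; r++)
                    sum += grid[r][c];
            return sum;
        }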

    Designing data structures to "be nice" to the cache is one of the most beneficial aspects of optimization, but worrying about individual cache line evictions is unlikely to be useful because of the pseudo-random nature of the typical replacement policy.
  • Note: This was originally posted on 25th February 2009 at http://forums.arm.com

    Hi Deven,

    Yes, that looks correct to me. The second loop should run faster.

    Iso
  • Note: This was originally posted on 9th March 2009 at http://forums.arm.com

    Assuming that your QUERY_START and QUERY_END macros are calling a system function to get the timestamp, I would think that you are spending a significant amount of time in the kernel actually processing the time request.

    Your test loop is quite short (64K ops from the cache may only be 100K-200K cycles). It is quite possible that the system call is much slower than this (in most OSes system calls are expensive), and because your loop has just saturated the D-cache it will run even slower than it would normally.

    Functions like printf are also quite data-heavy, so by printf'ing before calling the test loop you may be corrupting (in fact evicting) a significant chunk of the cache that you think you have preloaded.

    (If you are using the CP15 performance counters, you are in with half a chance. You can also measure the cache misses for both the I and D caches to sanity-check your results.)
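
    For reference, a sketch of driving the ARM1176 performance monitor with GCC-style inline assembly; this must execute in a privileged mode (so under WinCE it would need kernel help), and the helper names are illustrative:

        #include <stdint.h>

        /* ARM1176 performance monitor via CP15: PMCR is c15,c12,0
         * and the cycle counter CCNT is c15,c12,1 on this core.
         * PMN0/PMN1 (c15,c12,2 and c15,c12,3) can count events such
         * as cache misses, selected via the EvtCount fields in PMCR. */
        static inline void pmu_start(void)
        {
            uint32_t pmcr = (1u << 0)   /* E: enable counters */
                          | (1u << 1)   /* P: reset PMN0/PMN1 */
                          | (1u << 2);  /* C: reset CCNT      */
            asm volatile("mcr p15, 0, %0, c15, c12, 0" :: "r"(pmcr));
        }

        static inline uint32_t pmu_cycles(void)
        {
            uint32_t ccnt;
            asm volatile("mrc p15, 0, %0, c15, c12, 1" : "=r"(ccnt));
            return ccnt;
        }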

    Constructing benchmarks to measure cache effects can be quite difficult, especially on top of an operating system...