Slow performance on Samsung S3C6410

Note: This was originally posted on 18th January 2011 at http://forums.arm.com

Hi,

I am a software developer trying to port our product to a new device. It is a Windows CE 6 device with an S3C6410 (ARM1176JZF-S) CPU. The problem is that Q-Bench benchmarks show this to be a very fast system, yet our application actually runs very slowly on it.

I have spent a lot of time profiling various parts of our product, but the profiles show nothing. What I finally found is that the problem is the sheer amount of code. Our .exe is ~10MB in size. I ran tests in which I auto-generated a huge amount of code (~200,000 lines of C++, compiled with VS2005), and executing the resulting exe (~1.5MB) on this device shows a significant slowdown, 8 - 10 times compared to other devices (with slower CPUs). The auto-generated code does nothing with data; it just executes lots of functions that increment some variables.

My question is: what is the source of the problem? From what I know, this CPU has a 16 KiB instruction cache. Could it somehow be badly configured? I have no contact with the device manufacturer. I can only give some hints to its reseller, who may pass the information on.

some more info:
Q-Bench Pro - shows that Cache Line == 8, while on other devices it is 32
CeGetCacheInfo - gives below results:
dwL1Flags=0
dwL1ICacheSize=16384
dwL1ICacheLineSize=32
dwL1ICacheNumWays=4
dwL1DCacheSize=16384
dwL1DCacheLineSize=32
dwL1DCacheNumWays=4
dwL2Flags=0
dwL2ICacheSize=0
dwL2ICacheLineSize=0
dwL2ICacheNumWays=0
dwL2DCacheSize=0
dwL2DCacheLineSize=0
dwL2DCacheNumWays=0
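
For scale, the reported L1 instruction cache figures can be turned into a cache geometry and compared with the size of the test executable. This is only a rough sketch: the constants are copied from the CeGetCacheInfo output above, while the 4-byte instruction size (ARM rather than Thumb code), the 1024 * 1024 bytes per MB, and treating the whole .exe as code are assumptions made here for illustration.

```python
# Values copied from the CeGetCacheInfo output in the post above.
L1_ICACHE_SIZE = 16384   # dwL1ICacheSize, in bytes
L1_ICACHE_LINE = 32      # dwL1ICacheLineSize, in bytes
L1_ICACHE_WAYS = 4       # dwL1ICacheNumWays

# Assumption: fixed-width 4-byte ARM instructions (not Thumb).
ARM_INSTRUCTION_BYTES = 4


def cache_geometry(size_bytes, line_bytes, ways):
    """Return (total lines, sets) for a set-associative cache."""
    total_lines = size_bytes // line_bytes
    sets = total_lines // ways
    return total_lines, sets


def footprint_ratio(code_bytes, cache_bytes):
    """How many times larger the code is than the cache."""
    return code_bytes / cache_bytes


total_lines, sets = cache_geometry(L1_ICACHE_SIZE, L1_ICACHE_LINE, L1_ICACHE_WAYS)
instructions_per_line = L1_ICACHE_LINE // ARM_INSTRUCTION_BYTES
cached_instructions = L1_ICACHE_SIZE // ARM_INSTRUCTION_BYTES

# Assumption: 1 MB = 1024 * 1024 bytes and the whole .exe counts as code.
test_exe_bytes = int(1.5 * 1024 * 1024)
ratio = footprint_ratio(test_exe_bytes, L1_ICACHE_SIZE)

print("lines:", total_lines, "sets:", sets)
print("instructions per line:", instructions_per_line)
print("instructions that fit in the cache:", cached_instructions)
print("test exe is", ratio, "times the I-cache size")
```

Under these assumptions the cache holds 512 lines in 128 sets, about 4096 instructions in total, and the ~1.5MB test executable is roughly 96 times larger than the instruction cache. A 32-byte line also holds 8 four-byte words, which may or may not be what the Q-Bench "Cache Line == 8" value refers to.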

Thank You for any help
Martin
  • Note: This was originally posted on 20th January 2011 at http://forums.arm.com

    This is what I was afraid of. Reorganizing code to minimize I-cache reloads is not an easy task from a C++ programmer's point of view. One thing that puzzles me is that our application works OK on (among many other CPUs) the following two:

    MTK, ARM1176JZ-S-MT3351
    sirf arm1136jf-s-a-at550

    I made a benchmark comparison of the above two CPUs with the problematic one:

                MT3351 (wce5)   at550 (wce6)   S3C6410 (wce6)
    CPU:        241.113Q        508.559Q       505.484Q
    Memory:     229.122Q        605.436Q       343.842Q
    File I/O:   21.497KQ        37.940KQ       2.586KQ
    GDI:        202.291Q        415.037Q       288.697Q

    (these are standardized tests made with QBench Pro; their details can be found here [Removed])
    (File I/O measured on SD card)
    (GDI: MT3351 and S3C6410 is 800x480 screen, at550 is 480x272)

    For each of the above CPUs, CeGetCacheInfo returns the same information.

    I really don't see that the S3C6410 is more powerful than the other two. When it comes to CPU, it is even slightly worse than the at550.
  • Note: This was originally posted on 25th January 2011 at http://forums.arm.com

    Thank you for your time. I don't really have any benchmarks of our application that would be of use here. Maybe the frame rendering time: it takes nearly 2s on this device, while on others it is 120ms-450ms. But rendering does a lot of memory reading and loading of data from the SD card. I have tested moving the data from the SD card to main memory (a limited resource) on Windows CE, but nothing changed.

    I started investigating this problem a little deeper and wrote a driver to read/write coprocessor registers. So far I have not found anything different from other platforms. With a bit of spare time I plan to use the instruction cache miss event as described in the "ARM11 performance monitor unit" article in the knowledge base. It hints that high cache miss counts might be caused by disabled branch prediction.

    I suspect, as you said, that slow memory + small cache + fast CPU is the cause of the problem here. I know that our application works properly on the Samsung Omnia 2, which uses this CPU, but it most probably uses a different memory / SD card hardware configuration.

    Martin
  • Note: This was originally posted on 1st December 2011 at http://forums.arm.com

    From the TCM Status Register I have read that there are two Data and two Instruction TCMs. They are 8KB in size. From the TCM Region Register I have read that both Data TCMs are enabled and both Instruction TCMs are disabled.

    I will try enabling the ITCM, but I am not quite sure how it will work. I am a Windows CE application developer and cannot modify the system on the device in any way. All I can do is set the Region Register for the Instruction TCM to some base address (using a self-made device driver). From what I have read, it is not a general purpose cache; it is supposed to be used explicitly by the system developer to speed up code such as interrupt handlers. Is that true? Or will enabling it make the CPU use more cache for instructions from our application? I am not sure whether it is our application that is being slowed down or whether Windows CE is just slow on this device.

    Martin



    [Edit] - not 16KB but 8KB
  • Note: This was originally posted on 1st December 2011 at http://forums.arm.com

    TCM is much faster than main memory such as SDRAM; it is a kind of L2 cache. The instruction TCM definitely improves the CPU's performance. So please enable it and run your program again.
    BTW, what is the size of the instruction TCM on this chip?
    Anyway, a lot of code optimization can be done to increase the instruction cache hit rate.

    HTH.
    B.R
    Jerry
  • Note: This was originally posted on 12th December 2011 at http://forums.arm.com

    TCM is much faster than external memory, but this only affects code that is placed in the TCMs.
    If you suspect that too many cache misses are causing the slow performance, you can try reading the cache statistics registers, if your chip has them.
    However, your program would be just as slow on another chip with bigger caches, because your program architecture would still cause many cache misses. At least, a big cache will not do much better than a small one.
    Maybe your external memory is the real bottleneck. If so, try to optimize the timing of the external memory or choose better RAM. :)
    Please forgive my bad English.
  • Note: This was originally posted on 18th January 2011 at http://forums.arm.com

    >> My question is what is the source of problem?

    You pretty much have the answer.

    This is quite a high frequency chip, but it only has a 16KB L1 instruction cache and no L2. The code runs fast as long as instructions are inside the I-cache, but as soon as you run outside of the cache you slow down really fast. You typically have single cycle latency to L1 and typically 60-120 cycles to hit main memory, although I've never used this specific device. If you miss in L1 you start introducing huge bubbles in your execution time while you wait for instructions to load from main memory.

    Your problem is essentially that your "active code" at any point in time is bigger than the cache, so when you run an instruction it has a high probability of not being in the cache. Unfortunately, that's simply the workings of cached processors: they statistically improve performance, but they can't work magic. You need to reduce the volume of "active" code at any point in your application so that it is smaller than the cache, so that you introduce fewer of these cache miss bubbles ...
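
The cost of those bubbles can be put into numbers with a very simple average-latency model. This is a sketch under stated assumptions, not a measurement of the S3C6410: the 1-cycle hit and 60-120 cycle miss latencies are the typical figures quoted above, the miss rates are hypothetical, and the model treats each instruction fetch as independent.

```python
def avg_fetch_cycles(miss_rate, hit_cycles=1, miss_cycles=60):
    """Average cycles per instruction fetch for a given I-cache miss rate.

    Simplified model: every fetch either hits (hit_cycles) or misses
    (miss_cycles); prefetching and line fills are ignored.
    """
    return (1.0 - miss_rate) * hit_cycles + miss_rate * miss_cycles


# Hypothetical miss rates, chosen only to illustrate the trend.
for miss_rate in (0.0, 0.01, 0.05, 0.10):
    best = avg_fetch_cycles(miss_rate, miss_cycles=60)
    worst = avg_fetch_cycles(miss_rate, miss_cycles=120)
    print(f"miss rate {miss_rate:.2f}: "
          f"{best:.2f} to {worst:.2f} cycles per fetch")
```

In this model a miss rate of only 10% already raises the average fetch cost from 1 cycle to roughly 6.9 to 12.9 cycles, which shows how quickly a fast core ends up waiting on main memory.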

    If you get really stuck and have a choice of device, then something with a larger L1 and an L2 may be an alternative ...

    Iso
  • Note: This was originally posted on 21st January 2011 at http://forums.arm.com

    Do you have any quantitative benchmarks for your application which rate it on these three platforms?

    There are two numbers in your QBench results which are of interest - although I don't know the guts of the benchmark or your application, so these are educated guesses.

    The first two devices have similar CPU to memory performance ratios, which means that hopefully performance scales with frequency across the two platforms (if you have a faster CPU you need the memory system to speed up too). For the S3C6410, the CPU number is almost as high as the at550's, but the memory rating is only just over half of the at550's score. If your application is essentially a memory bound problem because it isn't caching well, you should be seeing just over half the performance: you simply do not get the advantage of the faster CPU speed because the memory is your bottleneck.

    Last point would be that the file I/O performance of the S3C6410 integration is dire compared to the first two platforms. If your benchmark has to spend time loading code or data from file then that is obviously really not helping. Again, the CPU is going to just sit idle if it cannot get data fast enough. If you do use file I/O then you may want to try to remove it from your application (or load from a RAM drive if it fits) to rule it out as a possible cause.
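
The ratios referred to here can be recomputed from the QBench Pro scores posted earlier in the thread. The sketch below only divides those published numbers; the dictionary layout and function names are made up for illustration, and what the scores actually measure is not known here.

```python
# QBench Pro scores as posted in the thread (unit: Q).
scores = {
    "MT3351":  {"cpu": 241.113, "memory": 229.122},
    "at550":   {"cpu": 508.559, "memory": 605.436},
    "S3C6410": {"cpu": 505.484, "memory": 343.842},
}


def memory_per_cpu(device):
    """Memory score divided by CPU score for one device."""
    return scores[device]["memory"] / scores[device]["cpu"]


def relative_score(device, reference, metric):
    """Score of `device` as a fraction of `reference` for one metric."""
    return scores[device][metric] / scores[reference][metric]


for device in scores:
    print(f"{device}: memory/CPU = {memory_per_cpu(device):.2f}")

print("S3C6410 vs at550, CPU:   ",
      round(relative_score("S3C6410", "at550", "cpu"), 2))
print("S3C6410 vs at550, memory:",
      round(relative_score("S3C6410", "at550", "memory"), 2))
```

With the posted numbers the memory/CPU ratio is roughly 0.95 for the MT3351, 1.19 for the at550 and 0.68 for the S3C6410, and the S3C6410 reaches about 99% of the at550's CPU score but only about 57% of its memory score.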

    Iso