Slow performance on samsung S3C6410
Marcin Jędrzejewski
over 9 years ago
Note: This was originally posted on 18th January 2011 at
http://forums.arm.com
Hi,
I am a software developer trying to port our product to a new device: a Windows CE 6 device with an S3C6410 (ARM1176JZF-S) CPU. The problem is that Q-Bench benchmarks show this to be a very fast system, but our application actually runs very slowly on it.
I have spent a lot of time profiling various parts of our product, but the profiling shows nothing. What I finally found is that the problem is the sheer amount of code: our .exe is ~10MB in size. I ran tests in which I auto-generated a large amount of code (~200,000 lines of C++, compiled with VS2005). Executing the resulting exe (~1.5MB) on this device shows a significant slowdown, 8 - 10 times compared with other devices (with slower CPUs). The auto-generated code does nothing with data; it just executes lots of functions that increment some variables.
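For readers who want to reproduce this kind of test, here is a minimal sketch of a generator for such code. It is not the script used in the post: the function names (`func_N`, `run_all`), the `counters` array, and the choice to call every function once in sequence are all assumptions made for illustration.

```python
# Hypothetical generator (not the original poster's script): emits C++ source
# made of many tiny functions that only increment counters, plus a driver
# that calls each of them once in order.

def generate_source(num_functions):
    lines = ["volatile int counters[16];", ""]
    for i in range(num_functions):
        slot = i % 16
        step = i % 7 + 1
        lines.append(
            f"void func_{i}() {{ counters[{slot}] = counters[{slot}] + {step}; }}"
        )
    lines.append("")
    lines.append("void run_all() {")
    for i in range(num_functions):
        lines.append(f"    func_{i}();")
    lines.append("}")
    return "\n".join(lines) + "\n"


source = generate_source(1000)
print(len(source.splitlines()), "lines of C++ generated")
```

Writing the returned string to a .cpp file and calling `run_all()` in a loop from `main` would give a program whose code footprint grows with `num_functions` while its data footprint stays tiny.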
My question is: what is the source of the problem? From what I know, this CPU has a 16 KiB instruction cache. Could it somehow be badly configured? I have no contact with the device manufacturer; I can only give some hints to its reseller, who may be able to pass the information on.
some more info:
Q-Bench Pro - shows that Cache Line == 8, while on other devices it is 32
CeGetCacheInfo - gives below results:
dwL1Flags=0
dwL1ICacheSize=16384
dwL1ICacheLineSize=32
dwL1ICacheNumWays=4
dwL1DCacheSize=16384
dwL1DCacheLineSize=32
dwL1DCacheNumWays=4
dwL2Flags=0
dwL2ICacheSize=0
dwL2ICacheLineSize=0
dwL2ICacheNumWays=0
dwL2DCacheSize=0
dwL2DCacheLineSize=0
dwL2DCacheNumWays=0
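As a rough aid to reading these numbers, the sketch below derives the instruction cache geometry from the reported values. The 4-byte instruction size assumes ARM (not Thumb) code, and the 1.5 MB figure is the approximate test executable size quoted above. On the ARM1176 a cache line is eight 32-bit words, i.e. 32 bytes, so the Q-Bench value of 8 may simply be counted in words; that is a guess, not something confirmed here.

```python
# Values reported by CeGetCacheInfo for the L1 instruction cache.
ICACHE_SIZE = 16384   # dwL1ICacheSize, bytes
LINE_SIZE = 32        # dwL1ICacheLineSize, bytes
NUM_WAYS = 4          # dwL1ICacheNumWays

# Assumptions for illustration only.
INSTRUCTION_BYTES = 4               # ARM-mode instruction size
EXE_BYTES = int(1.5 * 1024 * 1024)  # approximate size of the test .exe

num_lines = ICACHE_SIZE // LINE_SIZE           # cache lines in total
num_sets = num_lines // NUM_WAYS               # sets in a 4-way cache
instructions_per_line = LINE_SIZE // INSTRUCTION_BYTES
cached_instructions = num_lines * instructions_per_line
exe_to_cache_ratio = EXE_BYTES / ICACHE_SIZE

print("lines:", num_lines)                              # 512
print("sets:", num_sets)                                # 128
print("instructions per line:", instructions_per_line)  # 8
print("instructions held at once:", cached_instructions)  # 4096
print("exe size / cache size:", exe_to_cache_ratio)     # 96.0
```

In other words, the cache can hold about 4096 instructions at once, while the test executable is on the order of a hundred times larger than the cache.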
Thank you for any help
Martin
Peter Harris
over 9 years ago
Note: This was originally posted on 18th January 2011 at
http://forums.arm.com
>> My question is what is the source of problem?
You pretty much have the answer.
This is quite a high-frequency chip, but it has only a 16KB L1 instruction cache and no L2. Code runs fast as long as its instructions are in the I-cache, but as soon as execution runs outside the cache it slows down sharply. Latency is typically a single cycle to L1 and typically 60-120 cycles to main memory, although I've never used this specific device. Each L1 miss introduces a huge bubble in your execution time while the core waits for instructions to load from main memory.
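To put rough numbers on this, here is a back-of-the-envelope model, not a measurement. It assumes one instruction per cycle on a hit, 4-byte instructions in 32-byte lines (8 instructions per line), and the worst case where every line fetched misses, as happens with straight-line code that is never reused. The function name and parameters are made up for this sketch.

```python
def cycles_per_instruction(miss_penalty_cycles,
                           instructions_per_line=8,
                           base_cpi=1.0):
    """Average cycles per instruction when every cache line fetched misses.

    One miss is paid per line, so its cost is spread over the
    instructions that the line contains.
    """
    return base_cpi + miss_penalty_cycles / instructions_per_line


for penalty in (60, 120):
    print(penalty, "cycle penalty ->", cycles_per_instruction(penalty), "CPI")
```

Under these assumptions a 60-cycle penalty gives 8.5 cycles per instruction and a 120-cycle penalty gives 16, which is the same order of magnitude as the 8 - 10 times slowdown reported in the question.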
Your problem is essentially that your "active code" at any point in time is bigger than the cache, so any instruction you run has a high probability of not being in the cache. Unfortunately, that's simply how cached processors work: they statistically improve performance, but they can't work magic. You need to reduce the volume of "active" code at any point in your application so that it is smaller than the cache, and so introduce fewer of these cache miss bubbles ...
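The effect of the active code size can be illustrated with a toy cache simulation. This is a sketch only: it uses the 16KB, 4-way, 32-byte-line geometry from the question, models a loop that repeatedly walks a contiguous block of code, and assumes LRU replacement for simplicity, which may not match the replacement policy of the real hardware.

```python
from collections import OrderedDict

LINE_BYTES = 32
NUM_WAYS = 4
NUM_SETS = 128  # 16384 / (32 * 4)


def miss_rate(footprint_bytes, passes=10):
    """Fraction of line fetches that miss while looping over contiguous code.

    The first pass only warms the cache and is not counted.
    Replacement is LRU (an assumption made for this sketch).
    """
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    num_lines = footprint_bytes // LINE_BYTES
    fetches = 0
    misses = 0
    for p in range(passes):
        counting = p > 0
        for line in range(num_lines):
            cache_set = sets[line % NUM_SETS]
            if counting:
                fetches += 1
            if line in cache_set:
                cache_set.move_to_end(line)        # mark as most recently used
            else:
                if counting:
                    misses += 1
                if len(cache_set) >= NUM_WAYS:
                    cache_set.popitem(last=False)  # evict least recently used
                cache_set[line] = True
    return misses / fetches


for kib in (8, 16, 32):
    print(kib, "KiB loop -> miss rate", miss_rate(kib * 1024))
```

In this model a loop of up to 16 KiB never misses after the warm-up pass, while a 32 KiB loop misses on every line fetch, because each line is evicted before the loop comes back to it.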
If you get really stuck and have a choice of device, then something with a larger L1 and an L2 may be an alternative ...