  • Deep Learning Episode 4: Supercomputer vs Pong II
    In the previous post we parallelized Andrej Karpathy's policy gradient code to see whether a very simple implementation coupled with supercomputer speeds could learn to play Atari Pong faster than the...
  • Deep Learning Episode 1: Optimizing DeepMind's A3C on Torch
    In February, a new paper from Google's DeepMind team appeared on arXiv. This one was interesting – they showed dramatically improved performance and training time of their Atari-playing Deep Q-Learning...
  • Deep Learning Episode 2: Scaling TensorFlow over multiple EC2 GPU nodes
    In episode one we optimized Torch A3C performance on the new Intel Xeon Phi (Knights Landing) CPU. Arm MAP and Performance Reports identified bottlenecks in our framework and sped up model training by...
  • Four simple tips for optimizing your code
    Arm DDT and Arm MAP are excellent tools for finding program flaws and performance issues – they are also very helpful for studying codes and coding techniques. In this article I present a handful of optimization...
  • Writing a MAP Custom Metric: PAPI IPC
    Arm MAP isn't just a lightweight profiler to help you optimize your code. It also lets you add your own metrics with just a couple of lines of code. To show how this works, I'm going to add PAPI's instructions...