Performance Tuning: Profiling and Timing

Materials Simulations

James Sethna, Spring 1999

The efficiency of a simulation depends, in decreasing order of importance, on

  1. the algorithm used to solve the problem (emphasized in this course)
  2. the use of optimized subroutine libraries for standard operations (as in the Intel FFT packages we used for correlation functions)
  3. the efficient use of cache and memory (not discussed yet), and
  4. the optimization of the inner loops which dominate the cost of computation.

Optimization strategies depend in detail on the kind of computer you’re using, but the process of optimization is largely independent of platform. Broadly speaking, one first writes a functioning program, designed with intelligent algorithms and written with some concern for speed and memory usage. One then profiles the code. On Unix platforms, there are excellent programs (gprof and various cousins) that report how much time is spent in each subroutine and function call. If you’ve broken your task into natural pieces, you can quickly find out where most of your CPU time is spent. Unfortunately, I haven’t figured out how to get profiling working in Visual Studio.

The final step is to turn your elegant inner loop into a CPU-efficient machine, probably at the expense of making it somewhat uglier and more verbose. You try various changes in the code, testing each one to see if it speeds up the routine noticeably without changing the answer. As a general rule, modifications which make the code harder to modify and/or which are platform dependent should be avoided unless they speed things up a lot. We’ll try out some optimization on the MD code we wrote earlier.

I’ll warn you in advance that your code is probably pretty sound: over the weekend I wasn’t able to speed mine up noticeably. This is mostly because we’re running on Intel boxes running Microsoft operating systems. Because these chips have needed to keep running old DOS programs, they have not been able to add extra floating-point units, pipelines, and registers to nearly the same degree as the competing RISC machines. When you port your code to a Unix supercomputer, remember to tune your inner loops again!

  1. Copy Timer.h and Timer.cpp from the course directory MDTiming\MDWin into your MDWin directory. You may want to make a backup copy of your MD solution before you start.
  2. Add the two files to your project. Compile and run.
  3. Add a Timer as a member variable of the Atoms class.
  4. Add the following lines to CMDView::OnDraw, just above pDC->BitBlt:

     // Draw average time per force calculation in upper LHS
     char timeForceCalc[100];
     sprintf(timeForceCalc, "1000000*Time per Force = %6.3lf",
             1000000*pDoc->sim.atoms->timer.AverageTime());
     bufferDC.TextOut(0, WINDOW_BUFFER_RESOLUTION, timeForceCalc);

  5. Add timer.StartClock() as you enter Atoms::Forces(), and timer.StopClock() just before you return from it.
  6. Compile and run. If you run with the default settings (10x10 box, 10 atoms, regular Lennard-Jones potential with rCutoff=2.5 and sigma=0.5), we should be able to compare the efficiency of the force calculation in different people’s programs. The time spent should show up on the screen.
  7. Once the program is running, try changing to Release mode (Build, Set Active Configuration). How much does your code speed up? By increasing the number of time steps between observations, you can get better averages.
  8. Try omitting the sqrt(r) correction due to the cutoff. How much does that help? Add it back.
  9. Try changing the method by which we implement periodic boundary conditions: my method of checking whether X is outside the range is not very standard. Try using fmod, as we do in Atoms::SetX when the "if" statements don’t suffice. You’ll need to add 1.5*boxSize[coord] first, and subtract 0.5*boxSize[coord] afterward. Jakob Schiotz uses something akin to

     dx[coord] = dx[coord] - boxSize[coord] *
         ((int)(dx[coord]*inverseBoxSize[coord]+1.5)-1);

     It may work better on RISC platforms. It may also gain from loop unrolling.
  10. Try some loop unrolling, if you have time and energy left. You can get some examples from Jakob Schiotz’s directory. I found vecnorm4.c illuminating. Vecpbc.c has a neat trick in it for doing periodic boundary conditions in boxes of width 1.0, which doesn’t work on Intel machines because their floating-point hardware computes in extended precision rather than strict IEEE doubles.
