wiki:ProfilingOProfile

Profiling with OProfile

The oprofile tool is an unobtrusive profiler ‒ you don't need to rebuild the programs with any special options or run them in a special way, and you don't modify its code, which means the results are more accurate. Also, as it uses some newer CPU extensions, the results can measure which lines of code are the expensive ones with addition to functions and why it is so. As a bonus, you could even look at what is happening in the kernel (if the application spends too much there).

However, it is available only on linux and with reasonably new CPU, it needs a kernel support and you need to run the opcontrol tool as root (the other „examination“ tools can be run as normal user).

It lives on http://oprofile.sourceforge.net/docs/.

Installation

To run OProfile, you need several things:

  • New enough linux kernel (anything not really ancient should be new enough)
  • The OProfile kernel module (eg CONFIG_OPROFILE=m)
  • The oprofile package (should be installable by distro package)

Hardware & precision

OProfile uses some hardware features of the processor. Depending on what is available, the precision and capabilities differ.

Modern AMD processors have a feature called IBS (Instruction Based Sampling). The processor picks an instruction once in a time (the interval is configurable) and records down everything about the instruction (that it happened, if it needed data from cache/main memory, if the memory was on the same or different northbridge, if it was a misprediction, if the instruction itself was already decoded in the instruction cache…) and writes it to some buffer. When the buffer is full, an interrupt is fired and the oprofile kernel module takes it out and evaluates/stores to some other place. This one is the most precise thing currently available (event is assigned to the assembler instruction which really caused it). You can also watch as many IBS events as you want at once.

An older feature is called Intel Performance Counters, but is generally available on AMD processors as well. There's a small number of counters, which can be activated and configured to watch an event. Every time the event happens, the counter is increased. When it reaches a configured value, it fires off an interrupt and the oprofile module guesses which instruction caused it. You can have only few events active and the guessing is slightly inaccurate (as there are multiple instructions in the pipeline in the given time and any of them can be the cause), so the measured peaks are usually moved slightly down in the code and stretched across multiple instructions.

There's a fallback ‒ it can use a timer interrupt instead of the counter based interrupts, but it can only measure which instructions/places of code are executed most.

You can choose any of these methods at initialization time.

Initialization

First, you need to set parameters. Once you do, they are remembered until you change them to something else (even between restarts). You can do so for example this way (as root):

opcontrol --no-vmlinux -e IBS_OP_ALL:50000:0x00 --separate library

The --no-vmlinux tells oprofile you don't want to examine what is going on in the kernel itself. The data for it is still gathered, but there are no symbols assigned to them. If you want to have a look at these, you can use --vmlinux option, but you need to have an uncompress image of your kernel with debuging symbols around.

The -e options tells oprofile to look at the given event and the configuration for the event. This one would mean to count all instructions that are marked by the IBS and that a marked instruction should happen every 50000 processor cycles (the mask after the 50000 means to count cycles, if you leave it with the default 0x01, it will count instructions, which is slightly less accurate). You can have multiple -e options to list all the events you want to watch, but if you use IBS, it probably makes no sense to give each of them a different configuration. The counter based ones can have a different counts and mask each.

The --separate options tells oprofile to assign samples to running programs in addition to library names (if they happen inside library) ‒ so you see b10-auth as well as libdns++.so.0.0 or like.

Note that each parameter is configured separately, so if you set the events only, the --no-vmlinux and --separate options stay the same. You need to restart the daemon for it to make effect (eg. opcontrol --deinit before starting next profiling). You can reset the events to default by -e default.

If you have an Intel CPU, you can't use IBS_*, you can use the CPU_CLK_UNHALTED as the less accurate equivalent of IBS_OP_ALL.

Data gathering

First, prepare whatever you are going to profile, so the preparation itself doesn't clutter the data too much (it could be filtered, but why care). Be sure to have an optimized build (I usually pass -O0 to my build to speed it up, so not for profiling ;-), but with debugging symbols (they are not necessary to profile, but the output is not very useful then). You can even start the bind10 before starting to gather profiling data.

So, now remove data from previous sessions (if any) and start the profiling by (as root):

opcontrol --reset
opcontrol --start

and load the bind10 for few seconds.

After you're done, stop the profiling and make sure everything is written out to the disk.

opcontrol --stop
opcontrol --dump

If you want, you can also shut down oprofile (the data can be examined even when it is turned completely off) by:

opcontrol --deinit

Examinig the results

There are two programs for examining the results

opreport

This reports more global statistics. You probably want to run it with the -lg and possibly -D smart flags. It'll show a table, where each event you gathered has two columns ‒ number of samples gathered for the event and percent of the global occurrences for the symbol on that event. It is usually sorted by the first event.

You can tweak the output a little.

You can also request data for a single program only:

opreport -lg ~/work/bind10/src/bin/auth/.libs/b10-auth

Note: The exact path may depend on your OS and version of libtool. Some report to be in the form of .libs/lt-b10-auth

The output would look something like this:

PU: AMD64 family10, speed 3.3e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        linenr info                 image name               symbol name
175801   35.6836  (no location information)   libsqlite3.so.0.8.6      /usr/lib64/libsqlite3.so.0.8.6
116622   23.6716  (no location information)   libc-2.14.1.so           /lib64/libc-2.14.1.so
68819    13.9687  (no location information)   libstdc++.so.6.0.14      /usr/lib64/gcc/x86_64-pc-linux-gnu/4.5.3/libstdc++.so.6.0.14
25100     5.0947  (no location information)   libpthread-2.14.1.so     /lib64/libpthread-2.14.1.so
6677      1.3553  name.cc:411                 libdns++.so.0.1.0        isc::dns::Name::toText(bool) const
3499      0.7102  udp_server.cc:200           libasiodns.so.0.0.0      isc::asiodns::UDPServer::operator()(asio::error_code, unsigned long)
3306      0.6710  name.cc:666                 libdns++.so.0.1.0        isc::dns::Name::split(unsigned int, unsigned int) const

opannotate

This one can be used to produce annotated source code. If you call it like this:

opannotate --source bind10/src/bin/auth/.libs/b10-auth

It'll output source of each file that has at last one sample. The files are sorted by the count of samples on the first event. The lines of code are prepended with the columns for sample counts and percentages, so you can see the exact lines of code that are performance expensive, and if you enabled some other events (like cache misses, branch mispredictions and others), you can see why.

Note that you need a full path to the name.

The output would look like this:

               :void
    92  0.0187 :Message::makeResponse() { /* isc::dns::Message::makeResponse() total:    554  0.1124 */
    53  0.0108 :    if (impl_->mode_ != Message::PARSE) {
               :        isc_throw(InvalidMessageOperation,
               :                  "makeResponse() is performed in non-parse mode");
               :    }
               :
    45  0.0091 :    impl_->mode_ = Message::RENDER;
               :
               :    impl_->edns_ = EDNSPtr();
     3 6.1e-04 :    impl_->flags_ &= MESSAGE_REPLYPRESERVE;
    55  0.0112 :    setHeaderFlag(HEADERFLAG_QR, true);
               :
               :    impl_->rrsets_[SECTION_ANSWER].clear();
     6  0.0012 :    impl_->counts_[SECTION_ANSWER] = 0;
               :    impl_->rrsets_[SECTION_AUTHORITY].clear();
               :    impl_->counts_[SECTION_AUTHORITY] = 0;
               :    impl_->rrsets_[SECTION_ADDITIONAL].clear();
     2 4.1e-04 :    impl_->counts_[SECTION_ADDITIONAL] = 0;
     8  0.0016 :}
Last modified 6 years ago Last modified on Jan 18, 2012, 2:05:53 PM