Linux Profiling with gprof

Working with gprof

What can be derived from a profile
What is the overhead of profiling
How does profiling work
How to tell if profiling is enabled
How to generate a profile
How to read the profile
Appendices and supporting data

gprof cheat sheet

Compiling with profiling enabled

Add the switch -pg to have gcc insert the appropriate code to enable profiling at the compile/link stage.

Checking for a profile enabled build

Run nm <binary> | grep 'mcount' which will tell you if the profiling functions are present in the build.

Compiling with optimizations

gprof relies on function calls being made so that _mcount can keep track of them. Some kinds of optimization will eliminate function calls to speed up execution. When this occurs the overhead of profiling decreases - but the profile is mainly useless. Some optimizations I experimented with

-O does not eliminate function calls (but does speed up execution a bit)
-Ofast does eliminate function calls. Execution is sped up, but the profile is pretty useless.

Running with profiling enabled

When a binary is executed that has been built with the -pg option to gcc - it will generate a gmon.out file as long as the process exits cleanly. Depending on signal handlers - a process may not write out a gmon.out file if it does not exit cleanly.

Running gprof

gprof is the thing that looks at gmon.out and the compiled binary and produces a text output containing the flat-profile and the call-graph.

Run gprof <binary-compiled-with-gprof ; this will implicitly use gmon.out
Run gprof <binary-compiled-with-gprof> <profile-datafile-gmon.out-by-default>
- There is nothing stopping you from running gprof against a gmon.out and a that are totally unrelated. If the functions/symbols make no sense, then this is probably what’s happening.

What can be derived from a profile

A profile can be used to tell a performance analyst which functions are called (who calls whom) within a particular program and how long was spent in each function. The call-graph is generated by calling _mcount every time a function is called. The time spent in each function is generated by __monstartup which programs an interrupt to sample the PC.

What is the overhead of profiling

The overhead of profiling depends on the ratio of function calls to the amount of work performed in each function. Since we call __mcount on every function call - there is cost per function call. For instance if each function does a small amount of work - the overhead will be high. However if each function does a LOT of work - and relatively few calls/returns relative to the overall work - then the overhead will be low.

High Overhead (Overall runtime is 2X or more)
Low Overhead (Overall runtime is within 1%)

How does profiling work

We get a profile by building a binary via gcc with “profiling enabled” - we do that with the -pg option to gcc.

When we do this, gcc adds a few things to the compiled binary

A call to the function __mcount each time a function is called (this is what generates the call graph and counts of the calls)
A call to the function __monstartup at the beginning of the program (this is what generates the sampling of the PC)

Then, when we run the program, a file is generated called gmon.out which contains data about the running program such as which functions call other functions and an estimate of how much time was spent in each function. It is an estimate because the PC is sampled - but the time spent in each function should be correct relatively. We can then read the gmon file by using the tool gprof.

How to tell if profiling is enabled

For any given binary we can tell if profiling is enabled by searching the binary for the symbols __mcount and __monstartup using nm.

How to generate a profile

A profile is generated in three steps

Compile the binary with the flag -pg
Run the binary - after execution finished the file gmon.out is written to the PWD
Run the gprof tool - you will need to point it at the binary file that was running AND gmon.out

How to read the profile

Flat profile Example Flat Profile
Call Graph Example Call Graph

Appendices

How to check if a binary has profiling enabled?
How are mcount and `monstartup used?
Some examples of profiling overhead with different function “sizes”
Lowering the impact of profiling with compiler optimization
Test program (simple_caller.c)
Test makefile

How to check if a binary has profiling enabled

e.g. for my binary called simple_caller. If these symbols are part of the binary - then the binary was built using the -pg option and it will generate a gmon.out file.

$ nm simple_caller_pg | egrep "monstart|mcount"
                 U mcount@GLIBC_2.2.5
                 U __monstartup@GLIBC_2.2.5

The filetype of gmon.out is as follows

$ file gmon.out 
gmon.out: GNU prof performance data - version 1

How are mcount and monstartup used

We can use objdump -d to see the assembly language for our file - when profiling is enabled we will see the calls to the special profiling functions (mcount and monstartup) from our simple_caller file including main

Counting stack frames and callers with mcount

$ objdump -d simple_caller_pg
...
000000000000126b <calcme1>:
    126b:       f3 0f 1e fa             endbr64
    126f:       55                      push   %rbp
    1270:       48 89 e5                mov    %rsp,%rbp
    1273:       48 83 ec 20             sub    $0x20,%rsp
    1277:       ff 15 6b 2d 00 00       call   *0x2d6b(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
...
00000000000012e5 <calcme2>:
    12e5:       f3 0f 1e fa             endbr64
    12e9:       55                      push   %rbp
    12ea:       48 89 e5                mov    %rsp,%rbp
    12ed:       48 83 ec 40             sub    $0x40,%rsp
    12f1:       ff 15 f1 2c 00 00       call   *0x2cf1(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
...
 0000000000001391 <spinloop>:
    1391:       f3 0f 1e fa             endbr64
    1395:       55                      push   %rbp
    1396:       48 89 e5                mov    %rsp,%rbp
    1399:       48 83 ec 20             sub    $0x20,%rsp
    139d:       ff 15 45 2c 00 00       call   *0x2c45(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
...
000000000000141b <main>:
    141b:       f3 0f 1e fa             endbr64
    141f:       55                      push   %rbp
    1420:       48 89 e5                mov    %rsp,%rbp
    1423:       48 83 ec 40             sub    $0x40,%rsp
    1427:       ff 15 bb 2b 00 00       call   *0x2bbb(%rip)        # 3fe8 <mcount@GLIBC_2.2.5>
    142d:       48 c7 45 e0 00 00 00    movq   $0x0,-0x20(%rbp)

Setting up PC sampling with monstartup

Disassembly of section .init:

0000000000001000 <_init>:
    1000:       f3 0f 1e fa             endbr64
    1004:       48 83 ec 08             sub    $0x8,%rsp
    1008:       48 8d 05 21 01 00 00    lea    0x121(%rip),%rax        # 1130 <__gmon_start__>
    100f:       48 85 c0                test   %rax,%rax
    1012:       74 02                   je     1016 <_init+0x16>
    1014:       ff d0                   call   *%rax
    1016:       48 83 c4 08             add    $0x8,%rsp
    101a:       c3                      ret
...
0000000000001130 <__gmon_start__>:
    1130:       f3 0f 1e fa             endbr64
    1134:       8b 05 d6 2e 00 00       mov    0x2ed6(%rip),%eax        # 4010 <__TMC_END__>
    113a:       85 c0                   test   %eax,%eax
    113c:       74 02                   je     1140 <__gmon_start__+0x10>
    113e:       c3                      ret
    113f:       90                      nop
    1140:       48 83 ec 08             sub    $0x8,%rsp
    1144:       48 8d 35 f2 04 00 00    lea    0x4f2(%rip),%rsi        # 163d <etext>
    114b:       48 8d 3d ae ee ff ff    lea    -0x1152(%rip),%rdi        # 0 <__executable_start>
    1152:       c7 05 b4 2e 00 00 01    movl   $0x1,0x2eb4(%rip)        # 4010 <__TMC_END__>
    1159:       00 00 00
    115c:       e8 5f ff ff ff          call   10c0 <__monstartup@plt>
    1161:       48 8b 3d 78 2e 00 00    mov    0x2e78(%rip),%rdi        # 3fe0 <_mcleanup@GLIBC_2.2.5>
...
00000000000010c0 <__monstartup@plt>:
    10c0:       f3 0f 1e fa             endbr64
    10c4:       f2 ff 25 e5 2e 00 00    bnd jmp *0x2ee5(%rip)        # 3fb0 <__monstartup@GLIBC_2.2.5>
    10cb:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

Examples of overhead

Overhead is relative to the “size” of the function.

We can create a test program that loops around and calls a few functions. We can make each of these functions do a bit of work - that is also in a loop. By playing with the values of “outerloop” and “inner loop” we can alter how much time is spent within each function. We can set these valus such that they have about the same runtime - but have a very different execution profile.

With a high outerloop:innerloop (500M:1) ratio we do a lot of function calls and not much work inside each function
With a low outerloop:innerlop (10:1) reatio we do only a few function calls and a lot of work in each function

Using our test program (‘simple_caller’) we can compile an example binary with two variables

The number of times to run the outer loop which calls a set of 3 functions directly
The number of times to spin inside the functions.

We can use these two variables to make the binary execute for about the same amount of time (around 25 seconds).

Run1 : Outer loop is 500,000,000 Iterations Inner loop is 1 Iteration - Runtime 0m23.521s
Run2 : Outer loop is Outer loop is 100,000 Iterations Inner loop is 10,000 Iterations - Runtime 0m25.425s

The behavior is as follows

Run1 makes a lof of function calls - and very little work is done per call
Run2 makes fewer calls - but each function does more work (more inner loops)

If the primary overhead of profiling is counting the calls with __mcount then we would expect the runtime delta between the non-profiled and profiled executions to be much higher in Run1 than Run2

Overhead runtimes

Run1 lots of calls, each function does very little

Run1: Execution time with no profiling 23.5 Seconds
Run1 : Execution time with profiling 42 Seconds
Run1 : Overhead 42/23 == 1.8X

Run2 fewer calls but each function does a lot of work

Run2 : Execution time with no profiling 25.425 Seconds
Run2 : Execution time with profiling 25.470 Seconds
Run2 : Overhead 25.470/25.425 = 1.002 (0.2%)

These results support the hypothesis that the profiling overhead is mostly based on the to the number of functions called relative to the work done per call.

Profiles of Run1 and Run2

We can see how the different variables affect the execution of the program by examinig the profiles - and that is the whole point.

Run1 Profile

Lots of small function calls - one giveaway here is that of the 7.6 seconds sampled, almost 2 seconds was spent in main which does nothing much at all apart from calling other functions. Also each call (self ns/call) is very low.. 4NS per call. Be careful of the metrics, because in the following profile the time is measured in uSec - so 1000X different order of magnitude.

Flat profile : Run1

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 25.59      1.96     1.96 500000001     3.92     3.92  spinloop
 25.20      3.89     1.93                             main
 20.23      5.44     1.55 333333334     4.65     6.76  calcme2
 13.77      6.50     1.05 500000001     2.11     2.11  return3
 10.25      7.28     0.79 166666667     4.71     6.82  calcme1
  4.96      7.66     0.38                             _init

Call Graph : Run1

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.13% of 7.66 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     95.0    1.93    5.35                 main [1]
                1.55    0.70 333333334/333333334     calcme2 [2]
                1.96    0.00 500000001/500000001     spinloop [3]
                0.79    0.35 166666667/166666667     calcme1 [4]
-----------------------------------------------
                1.55    0.70 333333334/333333334     main [1]
[2]     29.4    1.55    0.70 333333334         calcme2 [2]
                0.70    0.00 333333334/500000001     return3 [5]
-----------------------------------------------
                1.96    0.00 500000001/500000001     main [1]
[3]     25.6    1.96    0.00 500000001         spinloop [3]
-----------------------------------------------
                0.79    0.35 166666667/166666667     main [1]
[4]     14.8    0.79    0.35 166666667         calcme1 [4]
                0.35    0.00 166666667/500000001     return3 [5]
-----------------------------------------------
                0.35    0.00 166666667/500000001     calcme1 [4]
                0.70    0.00 333333334/500000001     calcme2 [2]
[5]     13.8    1.05    0.00 500000001         return3 [5]
-----------------------------------------------
                                                 <spontaneous>
[6]      5.0    0.38    0.00                 _init [6]
-----------------------------------------------

Run2 Profile

Run2 is the run where the calls do a lot of work - we don’t see main in the list of samples at all. Also each function call is in the order of 10s of uSec - much more than the profile of Run1

Flat Profile : Run2

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 49.84      1.56     1.56    66668    23.40    23.40  calcme2
 26.52      2.39     0.83   100001     8.30     8.30  spinloop
 13.10      2.80     0.41                             _init
 10.54      3.13     0.33    33333     9.90     9.90  calcme1
  0.00      3.13     0.00   100001     0.00     0.00  return3

Call Graph : Run2

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.32% of 3.13 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     86.9    0.00    2.72                 main [1]
                1.56    0.00   66668/66668       calcme2 [2]
                0.83    0.00  100001/100001      spinloop [3]
                0.33    0.00   33333/33333       calcme1 [5]
-----------------------------------------------
                1.56    0.00   66668/66668       main [1]
[2]     49.8    1.56    0.00   66668         calcme2 [2]
                0.00    0.00   66668/100001      return3 [6]
-----------------------------------------------
                0.83    0.00  100001/100001      main [1]
[3]     26.5    0.83    0.00  100001         spinloop [3]
-----------------------------------------------
                                                 <spontaneous>
[4]     13.1    0.41    0.00                 _init [4]
-----------------------------------------------
                0.33    0.00   33333/33333       main [1]
[5]     10.5    0.33    0.00   33333         calcme1 [5]
                0.00    0.00   33333/100001      return3 [6]
-----------------------------------------------
                0.00    0.00   33333/100001      calcme1 [5]
                0.00    0.00   66668/100001      calcme2 [2]
[6]      0.0    0.00    0.00  100001         return3 [6]
-----------------------------------------------

Lowering the impact of profiling with compiler optimization

If part of the cost of profiling is the per-function call cost, then if we optimize the code using gcc optimizations that inline functions - we should see a reduction in both the runtime of the un-profiled binary and the delta in runtime between the un-profiled and profiled binaries. For now we will do optimization using gcc -Ofast

Summary

No Optimization : real 0m23.299s
No Optimization with profiling : real 0m42.057s
With -Ofast Optimization : real 0m18.205s
With -Ofast Optimization + profiling (-pg) : real 0m18.035s

So using -Ofast definitely reduces the overhead of profiling by 2X or more. Unfortunately, the reason is that gcc has inlined the function (which also speeds up the un-profiled exectution (but not by as much). Unfortunately inlining the function means that our profile is sort-of useless.

flat profile without optimization

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
 49.84      1.56     1.56    66668    23.40    23.40  calcme2
 26.52      2.39     0.83   100001     8.30     8.30  spinloop
 13.10      2.80     0.41                             _init
 10.54      3.13     0.33    33333     9.90     9.90  calcme1
  0.00      3.13     0.00   100001     0.00     0.00  return3

flat profile with optimization

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 78.70      1.33     1.33                             main
 21.30      1.69     0.36                             _init

Timings for Un-Optimized and Optimized executions

Un-Optimized, no profile

gary@:~/git/unixfun/simple-call$ time ./simple_caller
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for    500000026 iterations
counter incremented  500000001 times
Last result was      1500000144

real    0m23.299s
user    0m23.280s
sys     0m0.005s

Un-Optimized with profile

gary@:~/git/unixfun/simple-call$ time ./simple_caller_pg
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for    500000026 iterations
counter incremented  500000001 times
Last result was      1500000144

real    0m42.057s
user    0m42.021s
sys     0m0.006s
gary@:~/git/unixfun/simple-call$ time ./simple_caller_opt
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for    500000026 iterations
counter incremented  500000001 times
Last result was      1500000144

real    0m18.205s
user    0m18.188s
sys     0m0.007s
gary@:~/git/unixfun/simple-call$ time ./simple_caller_opt_pg
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for    500000026 iterations
counter incremented  500000001 times
Last result was      1500000144

real    0m18.035s
user    0m18.024s
sys     0m0.004s

Test program simpler_caller

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The variables OUTERLOOP and SPINLOOP
   are epxpected to beset in the makefile */

#ifndef OUTERLOOP
#define OUTERLOOP 500000000
#endif

#ifndef SPINLOOP
#define SPINLOOP 1
#endif

/* Global counter */
long int counter = 0;

long int return3(int f) {
    return(f*3);
}   

long int calcme1(int f,long int numspins) {
    long int i;
    int random=0;
    for (i=0; i < numspins; i++) {
        i=i+1;
        random = rand() % 100;
    }   
    return(return3(f+random));
}

long int calcme2(int f,long int numspins,int *ptr) {
    long int i,res=0;  
    int random=0;
    /* For each time around the spin
       loop read an int from the memory 
       we allocated in main()
    */
    for (i=0; i  < numspins; i++) {
        res=i+ptr[i];
        random = rand() % 100;
    }   
    return(return3(f+res+random));
}

long int spinloop(long int z) {
    counter +=1;
    long i;
    int random=0;
    for (i=0; i < z; i++) {
        i=i+1;
        random = rand() % 100;
    }
    return(counter+random);
}

int main() {
    long int i,res,lastres;
    long int loopcounter=0;
    long int iterations = OUTERLOOP; 
    long int numspins = SPINLOOP;
    int *ptr;

    /* Allocate some memory and write to it
       give our functions some  memory reads
       to work with
    */
    ptr=(int *)malloc(1000000*sizeof(int));
    for (i=0; i<1000000; i++) {
        ptr[i]=rand() % 100;
    }
    printf("Outer loop is %ld\n", iterations);
    printf("Inner loop is %ld\n", numspins);

    for (i = 0; i <= iterations; i++) {
        // Call calcme1 one in three iterations
        if ((i + 1) % 3 == 0) {
            res=calcme1(i,numspins);
        } else {
        // Call calcme2 two in three iterations
            res=calcme2(i,numspins,ptr);
        }
        // Call spinloop every iteration
        loopcounter=spinloop(numspins);
        lastres=res;
    }
    printf("Outer iteration counter is %ld\n", iterations);
    printf("spinloop spun for    %ld iterations\n", loopcounter);
    printf("counter incremented  %ld times\n", counter);
    printf("Last result was      %ld\n", lastres);
    return 0;
}

Test makefile

# Define the compiler and the flags
CC = gcc
CFLAGS = -Wall -g

# Low Overhead
#SPINLOOP   = 10000
#OUTERLOOP  = 100000

# High Overhead
SPINLOOP   = 1
OUTERLOOP  = 500000000

# Create multiple outputs
# 1 - Unoptimized executable - will make the function call using standard calling convention
# 2 - Optimized executable - will make the function call using optimized calling convention (inlined)

# As above but in this case we will use the -pg flag to enable profiling

# Define the target executables
TARGET_BASE = simple_caller
TARGET_PG   = simple_caller_pg
TARGET_OPT  = simple_caller_opt
TARGET_OPT_PG   = simple_caller_opt_pg

# Define the source files
SRCS = simple_caller.c

# Define the object files
OBJS = $(SRCS:.c=.o)
OBJS_PG = $(SRCS:.c=_pg.o)
OBJS_OPT = $(SRCS:.c=_opt.o)
OBJS_OPT_PG = $(SRCS:.c=_pg_opt.o)

# Default target
all: $(TARGET_BASE) $(TARGET_PG) $(TARGET_OPT) $(TARGET_OPT_PG)

# Link the object files to create the executable without profiling
$(TARGET_BASE): $(OBJS)
	$(CC) $(CFLAGS) -o $(TARGET_BASE) $(OBJS)

# Link the object files to create the executable with profiling
$(TARGET_PG): $(OBJS_PG)
	$(CC) $(CFLAGS) -pg -o $(TARGET_PG) $(OBJS_PG)

# Link the object files to create the optimized executable without profiling
$(TARGET_OPT): $(OBJS_OPT)
	$(CC) $(CFLAGS) -Ofast -o $(TARGET_OPT) $(OBJS_OPT)

# Link the object files to create the optimized executable with profiling
$(TARGET_OPT_PG): $(OBJS_OPT_PG)
	$(CC) $(CFLAGS) -Ofast -pg -o $(TARGET_OPT_PG) $(OBJS_OPT_PG)

# Compile the source files into object files without profiling
%.o: %.c
	$(CC) $(CFLAGS) -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -c $< -o $@

# Compile the source files into object files with profiling
%_pg.o: %.c
	$(CC) $(CFLAGS) -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -pg -c $< -o $@

# Compile the source files into optimized object files without profiling
%_opt.o: %.c
	$(CC) $(CFLAGS) -Ofast -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -c $< -o $@

# Compile the source files into optimized object files with profiling
%_pg_opt.o: %.c
	$(CC) $(CFLAGS) -Ofast -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -pg -c $< -o $@

# Clean up the build files
clean:
	rm -f $(OBJS) $(OBJS_PG) $(OBJS_OPT) $(OBJS_OPT_PG) $(TARGET_BASE) $(TARGET_PG) $(TARGET_OPT) $(TARGET_OPT_PG)

.PHONY: all clean

When we want to get a profile from gprof we ask gcc to insert additional code (a function call) into the compiled binary, hence there is no need for any special hardware support - no issues running the tool in a VM/container etc whereas some tools need to access the CPU directly via passthrough, or you need to be root etc.

The general workflow is as follows

Build a binary using the flag -pg this will create a “special” build of your program which implements profiling - the program should operate as normal, albeit a bit slower. the implementation of profiling is detailed here in the gprof man/help/doc page

There will be an additional symbol/function inserted into the binary that gcc produces. In the output of nm you will see this function - it is called mcount. This is the function that does the work of figuring out which function calls what - so an easy way to see if a binary has profiling enabled is to look for that symbol.

During compilation/linking gcc inserts calls to mcount into each function call in the assembly/binary that it produces for you.

In the output below the symbol counter is just a variable in my program.

gary:~/git/unixfun/simple-call$ cc -o 1b-pg -pg simple_call_1B.c
gary:~/git/unixfun/simple-call$ nm ./1b-pg |grep count

0000000000004018 B counter
                 U mcount@GLIBC_2.2.5    <----- This binary has profiling enabled

gary:~/git/unixfun/simple-call$ cc -o 1b simple_call_1B.c
gary@:~/git/unixfun/simple-call$ nm ./1b |grep count

0000000000004014 B counter
                                         <------- No mcount, means no profiling

Also in the assembly you will see calls to _mcount. Output from objdump -d <binary>

    1242:       53                      push   %rbx
    1243:       ff 15 9f 2d 00 00       call   *0x2d9f(%rip) # 3fe8 <mcount@GLIBC_2.2.5>  <----- this is only seen in pg builds
    1249:       bb 00 ca 9a 3b          mov    $0x3b9aca00,%ebx

When you run the binary compiled with -pg it will create a binary output file called gmon.out in the currenty working directory.
Then use the gprof tool to read the gmon.out - it needs to reference the binary (compiled with pg) to show the right symbols/functions etc. In our example below for the binary compiled with profiling. After the run completes - issue gprof 1b-pg and the gprof command will read the file gmon.out and use the symbols in the binary to display the correct function names.