Linux Profiling with gprof
Working with gprof
- What can be derived from a profile
- What is the overhead of profiling
- How does profiling work
- How to tell if profiling is enabled
- How to generate a profile
- How to read the profile
- Appendices and supporting data
gprof cheat sheet
Compiling with profiling enabled
Add the switch -pg to have gcc insert the appropriate code to enable profiling at the compile/link stage.
Checking for a profile enabled build
Run nm <binary> | grep 'mcount' which will tell you if the profiling functions are present in the build.
Compiling with optimizations
gprof relies on function calls being made so that _mcount can keep track of them. Some kinds of optimization will eliminate function calls to speed up execution. When this occurs the overhead of profiling decreases - but the profile is mainly useless. Some optimizations I experimented with
-Odoes not eliminate function calls (but does speed up execution a bit)-Ofastdoes eliminate function calls. Execution is sped up, but the profile is pretty useless.
Running with profiling enabled
When a binary is executed that has been built with the -pg option to gcc - it will generate a gmon.out file as long as the process exits cleanly. Depending on signal handlers - a process may not write out a gmon.out file if it does not exit cleanly.
Running gprof
gprof is the thing that looks at gmon.out and the compiled binary and produces a text output containing the flat-profile and the call-graph.
- Run
gprof <binary-compiled-with-gprof; this will implicitly usegmon.out - Run
gprof <binary-compiled-with-gprof> <profile-datafile-gmon.out-by-default>- There is nothing stopping you from running gprof against a
gmon.outand a that are totally unrelated. If the functions/symbols make no sense, then this is probably what’s happening.
- There is nothing stopping you from running gprof against a
What can be derived from a profile
A profile can be used to tell a performance analyst which functions are called (who calls whom) within a particular program and how long was spent in each function. The call-graph is generated by calling _mcount every time a function is called. The time spent in each function is generated by __monstartup which programs an interrupt to sample the PC.
What is the overhead of profiling
The overhead of profiling depends on the ratio of function calls to the amount of work performed in each function. Since we call __mcount on every function call - there is cost per function call. For instance if each function does a small amount of work - the overhead will be high. However if each function does a LOT of work - and relatively few calls/returns relative to the overall work - then the overhead will be low.
- High Overhead (Overall runtime is 2X or more)
- Low Overhead (Overall runtime is within 1%)
How does profiling work
We get a profile by building a binary via gcc with “profiling enabled” - we do that with the -pg option to gcc.
When we do this, gcc adds a few things to the compiled binary
- A call to the function
__mcounteach time a function is called (this is what generates the call graph and counts of the calls) - A call to the function
__monstartupat the beginning of the program (this is what generates the sampling of the PC)
Then, when we run the program, a file is generated called gmon.out which contains data about the running program such as which functions call other functions and an estimate of how much time was spent in each function. It is an estimate because the PC is sampled - but the time spent in each function should be correct relatively. We can then read the gmon file by using the tool gprof.
How to tell if profiling is enabled
For any given binary we can tell if profiling is enabled by searching the binary for the symbols __mcount and __monstartup using nm.
How to generate a profile
A profile is generated in three steps
- Compile the binary with the flag
-pg - Run the binary - after execution finished the file
gmon.outis written to the PWD - Run the
gproftool - you will need to point it at the binary file that was running ANDgmon.out
How to read the profile
- Flat profile Example Flat Profile
- Call Graph Example Call Graph
Appendices
- How to check if a binary has profiling enabled?
- How are
mcountand `monstartup used? - Some examples of profiling overhead with different function “sizes”
- Lowering the impact of profiling with compiler optimization
- Test program (simple_caller.c)
- Test makefile
How to check if a binary has profiling enabled
e.g. for my binary called simple_caller. If these symbols are part of the binary - then the binary was built using the -pg option and it will generate a gmon.out file.
$ nm simple_caller_pg | egrep "monstart|mcount"
U mcount@GLIBC_2.2.5
U __monstartup@GLIBC_2.2.5
The filetype of gmon.out is as follows
$ file gmon.out
gmon.out: GNU prof performance data - version 1
How are mcount and monstartup used
We can use objdump -d to see the assembly language for our file - when profiling is enabled we will see the calls to the special profiling functions (mcount and monstartup) from our simple_caller file including main
Counting stack frames and callers with mcount
$ objdump -d simple_caller_pg
...
000000000000126b <calcme1>:
126b: f3 0f 1e fa endbr64
126f: 55 push %rbp
1270: 48 89 e5 mov %rsp,%rbp
1273: 48 83 ec 20 sub $0x20,%rsp
1277: ff 15 6b 2d 00 00 call *0x2d6b(%rip) # 3fe8 <mcount@GLIBC_2.2.5>
...
00000000000012e5 <calcme2>:
12e5: f3 0f 1e fa endbr64
12e9: 55 push %rbp
12ea: 48 89 e5 mov %rsp,%rbp
12ed: 48 83 ec 40 sub $0x40,%rsp
12f1: ff 15 f1 2c 00 00 call *0x2cf1(%rip) # 3fe8 <mcount@GLIBC_2.2.5>
...
0000000000001391 <spinloop>:
1391: f3 0f 1e fa endbr64
1395: 55 push %rbp
1396: 48 89 e5 mov %rsp,%rbp
1399: 48 83 ec 20 sub $0x20,%rsp
139d: ff 15 45 2c 00 00 call *0x2c45(%rip) # 3fe8 <mcount@GLIBC_2.2.5>
...
000000000000141b <main>:
141b: f3 0f 1e fa endbr64
141f: 55 push %rbp
1420: 48 89 e5 mov %rsp,%rbp
1423: 48 83 ec 40 sub $0x40,%rsp
1427: ff 15 bb 2b 00 00 call *0x2bbb(%rip) # 3fe8 <mcount@GLIBC_2.2.5>
142d: 48 c7 45 e0 00 00 00 movq $0x0,-0x20(%rbp)
Setting up PC sampling with monstartup
Disassembly of section .init:
0000000000001000 <_init>:
1000: f3 0f 1e fa endbr64
1004: 48 83 ec 08 sub $0x8,%rsp
1008: 48 8d 05 21 01 00 00 lea 0x121(%rip),%rax # 1130 <__gmon_start__>
100f: 48 85 c0 test %rax,%rax
1012: 74 02 je 1016 <_init+0x16>
1014: ff d0 call *%rax
1016: 48 83 c4 08 add $0x8,%rsp
101a: c3 ret
...
0000000000001130 <__gmon_start__>:
1130: f3 0f 1e fa endbr64
1134: 8b 05 d6 2e 00 00 mov 0x2ed6(%rip),%eax # 4010 <__TMC_END__>
113a: 85 c0 test %eax,%eax
113c: 74 02 je 1140 <__gmon_start__+0x10>
113e: c3 ret
113f: 90 nop
1140: 48 83 ec 08 sub $0x8,%rsp
1144: 48 8d 35 f2 04 00 00 lea 0x4f2(%rip),%rsi # 163d <etext>
114b: 48 8d 3d ae ee ff ff lea -0x1152(%rip),%rdi # 0 <__executable_start>
1152: c7 05 b4 2e 00 00 01 movl $0x1,0x2eb4(%rip) # 4010 <__TMC_END__>
1159: 00 00 00
115c: e8 5f ff ff ff call 10c0 <__monstartup@plt>
1161: 48 8b 3d 78 2e 00 00 mov 0x2e78(%rip),%rdi # 3fe0 <_mcleanup@GLIBC_2.2.5>
...
00000000000010c0 <__monstartup@plt>:
10c0: f3 0f 1e fa endbr64
10c4: f2 ff 25 e5 2e 00 00 bnd jmp *0x2ee5(%rip) # 3fb0 <__monstartup@GLIBC_2.2.5>
10cb: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
Examples of overhead
Overhead is relative to the “size” of the function.
We can create a test program that loops around and calls a few functions. We can make each of these functions do a bit of work - that is also in a loop. By playing with the values of “outerloop” and “inner loop” we can alter how much time is spent within each function. We can set these valus such that they have about the same runtime - but have a very different execution profile.
- With a high outerloop:innerloop (500M:1) ratio we do a lot of function calls and not much work inside each function
- With a low outerloop:innerlop (10:1) reatio we do only a few function calls and a lot of work in each function
Using our test program (‘simple_caller’) we can compile an example binary with two variables
- The number of times to run the outer loop which calls a set of 3 functions directly
- The number of times to spin inside the functions.
We can use these two variables to make the binary execute for about the same amount of time (around 25 seconds).
- Run1 : Outer loop is 500,000,000 Iterations Inner loop is 1 Iteration - Runtime 0m23.521s
- Run2 : Outer loop is Outer loop is 100,000 Iterations Inner loop is 10,000 Iterations - Runtime 0m25.425s
The behavior is as follows
- Run1 makes a lof of function calls - and very little work is done per call
- Run2 makes fewer calls - but each function does more work (more inner loops)
If the primary overhead of profiling is counting the calls with __mcount then we would expect the runtime delta between the non-profiled and profiled executions to be much higher in Run1 than Run2
Overhead runtimes
Run1 lots of calls, each function does very little
- Run1: Execution time with no profiling 23.5 Seconds
- Run1 : Execution time with profiling 42 Seconds
- Run1 : Overhead 42/23 == 1.8X
Run2 fewer calls but each function does a lot of work
- Run2 : Execution time with no profiling 25.425 Seconds
- Run2 : Execution time with profiling 25.470 Seconds
- Run2 : Overhead 25.470/25.425 = 1.002 (0.2%)
These results support the hypothesis that the profiling overhead is mostly based on the to the number of functions called relative to the work done per call.
Profiles of Run1 and Run2
We can see how the different variables affect the execution of the program by examinig the profiles - and that is the whole point.
Run1 Profile
Lots of small function calls - one giveaway here is that of the 7.6 seconds sampled, almost 2 seconds was spent in main which does nothing much at all apart from calling other functions. Also each call (self ns/call) is very low.. 4NS per call. Be careful of the metrics, because in the following profile the time is measured in uSec - so 1000X different order of magnitude.
Flat profile : Run1
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
25.59 1.96 1.96 500000001 3.92 3.92 spinloop
25.20 3.89 1.93 main
20.23 5.44 1.55 333333334 4.65 6.76 calcme2
13.77 6.50 1.05 500000001 2.11 2.11 return3
10.25 7.28 0.79 166666667 4.71 6.82 calcme1
4.96 7.66 0.38 _init
Call Graph : Run1
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.13% of 7.66 seconds
index % time self children called name
<spontaneous>
[1] 95.0 1.93 5.35 main [1]
1.55 0.70 333333334/333333334 calcme2 [2]
1.96 0.00 500000001/500000001 spinloop [3]
0.79 0.35 166666667/166666667 calcme1 [4]
-----------------------------------------------
1.55 0.70 333333334/333333334 main [1]
[2] 29.4 1.55 0.70 333333334 calcme2 [2]
0.70 0.00 333333334/500000001 return3 [5]
-----------------------------------------------
1.96 0.00 500000001/500000001 main [1]
[3] 25.6 1.96 0.00 500000001 spinloop [3]
-----------------------------------------------
0.79 0.35 166666667/166666667 main [1]
[4] 14.8 0.79 0.35 166666667 calcme1 [4]
0.35 0.00 166666667/500000001 return3 [5]
-----------------------------------------------
0.35 0.00 166666667/500000001 calcme1 [4]
0.70 0.00 333333334/500000001 calcme2 [2]
[5] 13.8 1.05 0.00 500000001 return3 [5]
-----------------------------------------------
<spontaneous>
[6] 5.0 0.38 0.00 _init [6]
-----------------------------------------------
Run2 Profile
Run2 is the run where the calls do a lot of work - we don’t see main in the list of samples at all. Also each function call is in the order of 10s of uSec - much more than the profile of Run1
Flat Profile : Run2
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
49.84 1.56 1.56 66668 23.40 23.40 calcme2
26.52 2.39 0.83 100001 8.30 8.30 spinloop
13.10 2.80 0.41 _init
10.54 3.13 0.33 33333 9.90 9.90 calcme1
0.00 3.13 0.00 100001 0.00 0.00 return3
Call Graph : Run2
Call graph (explanation follows)
granularity: each sample hit covers 4 byte(s) for 0.32% of 3.13 seconds
index % time self children called name
<spontaneous>
[1] 86.9 0.00 2.72 main [1]
1.56 0.00 66668/66668 calcme2 [2]
0.83 0.00 100001/100001 spinloop [3]
0.33 0.00 33333/33333 calcme1 [5]
-----------------------------------------------
1.56 0.00 66668/66668 main [1]
[2] 49.8 1.56 0.00 66668 calcme2 [2]
0.00 0.00 66668/100001 return3 [6]
-----------------------------------------------
0.83 0.00 100001/100001 main [1]
[3] 26.5 0.83 0.00 100001 spinloop [3]
-----------------------------------------------
<spontaneous>
[4] 13.1 0.41 0.00 _init [4]
-----------------------------------------------
0.33 0.00 33333/33333 main [1]
[5] 10.5 0.33 0.00 33333 calcme1 [5]
0.00 0.00 33333/100001 return3 [6]
-----------------------------------------------
0.00 0.00 33333/100001 calcme1 [5]
0.00 0.00 66668/100001 calcme2 [2]
[6] 0.0 0.00 0.00 100001 return3 [6]
-----------------------------------------------
Lowering the impact of profiling with compiler optimization
If part of the cost of profiling is the per-function call cost, then if we optimize the code using gcc optimizations that inline functions - we should see a reduction in both the runtime of the un-profiled binary and the delta in runtime between the un-profiled and profiled binaries. For now we will do optimization using gcc -Ofast
Summary
- No Optimization : real 0m23.299s
- No Optimization with profiling : real 0m42.057s
- With -Ofast Optimization : real 0m18.205s
- With -Ofast Optimization + profiling (-pg) : real 0m18.035s
So using -Ofast definitely reduces the overhead of profiling by 2X or more. Unfortunately, the reason is that gcc has inlined the function (which also speeds up the un-profiled exectution (but not by as much). Unfortunately inlining the function means that our profile is sort-of useless.
flat profile without optimization
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
49.84 1.56 1.56 66668 23.40 23.40 calcme2
26.52 2.39 0.83 100001 8.30 8.30 spinloop
13.10 2.80 0.41 _init
10.54 3.13 0.33 33333 9.90 9.90 calcme1
0.00 3.13 0.00 100001 0.00 0.00 return3
flat profile with optimization
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
78.70 1.33 1.33 main
21.30 1.69 0.36 _init
Timings for Un-Optimized and Optimized executions
Un-Optimized, no profile
gary@:~/git/unixfun/simple-call$ time ./simple_caller
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for 500000026 iterations
counter incremented 500000001 times
Last result was 1500000144
real 0m23.299s
user 0m23.280s
sys 0m0.005s
Un-Optimized with profile
gary@:~/git/unixfun/simple-call$ time ./simple_caller_pg
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for 500000026 iterations
counter incremented 500000001 times
Last result was 1500000144
real 0m42.057s
user 0m42.021s
sys 0m0.006s
gary@:~/git/unixfun/simple-call$ time ./simple_caller_opt
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for 500000026 iterations
counter incremented 500000001 times
Last result was 1500000144
real 0m18.205s
user 0m18.188s
sys 0m0.007s
gary@:~/git/unixfun/simple-call$ time ./simple_caller_opt_pg
Outer loop is 500000000
Inner loop is 1
Outer iteration counter is 500000000
spinloop spun for 500000026 iterations
counter incremented 500000001 times
Last result was 1500000144
real 0m18.035s
user 0m18.024s
sys 0m0.004s
Test program simpler_caller
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
/* The variables OUTERLOOP and SPINLOOP
are epxpected to beset in the makefile */
#ifndef OUTERLOOP
#define OUTERLOOP 500000000
#endif
#ifndef SPINLOOP
#define SPINLOOP 1
#endif
/* Global counter */
long int counter = 0;
long int return3(int f) {
return(f*3);
}
long int calcme1(int f,long int numspins) {
long int i;
int random=0;
for (i=0; i < numspins; i++) {
i=i+1;
random = rand() % 100;
}
return(return3(f+random));
}
long int calcme2(int f,long int numspins,int *ptr) {
long int i,res=0;
int random=0;
/* For each time around the spin
loop read an int from the memory
we allocated in main()
*/
for (i=0; i < numspins; i++) {
res=i+ptr[i];
random = rand() % 100;
}
return(return3(f+res+random));
}
long int spinloop(long int z) {
counter +=1;
long i;
int random=0;
for (i=0; i < z; i++) {
i=i+1;
random = rand() % 100;
}
return(counter+random);
}
int main() {
long int i,res,lastres;
long int loopcounter=0;
long int iterations = OUTERLOOP;
long int numspins = SPINLOOP;
int *ptr;
/* Allocate some memory and write to it
give our functions some memory reads
to work with
*/
ptr=(int *)malloc(1000000*sizeof(int));
for (i=0; i<1000000; i++) {
ptr[i]=rand() % 100;
}
printf("Outer loop is %ld\n", iterations);
printf("Inner loop is %ld\n", numspins);
for (i = 0; i <= iterations; i++) {
// Call calcme1 one in three iterations
if ((i + 1) % 3 == 0) {
res=calcme1(i,numspins);
} else {
// Call calcme2 two in three iterations
res=calcme2(i,numspins,ptr);
}
// Call spinloop every iteration
loopcounter=spinloop(numspins);
lastres=res;
}
printf("Outer iteration counter is %ld\n", iterations);
printf("spinloop spun for %ld iterations\n", loopcounter);
printf("counter incremented %ld times\n", counter);
printf("Last result was %ld\n", lastres);
return 0;
}
Test makefile
# Define the compiler and the flags
CC = gcc
CFLAGS = -Wall -g
# Low Overhead
#SPINLOOP = 10000
#OUTERLOOP = 100000
# High Overhead
SPINLOOP = 1
OUTERLOOP = 500000000
# Create multiple outputs
# 1 - Unoptimized executable - will make the function call using standard calling convention
# 2 - Optimized executable - will make the function call using optimized calling convention (inlined)
# As above but in this case we will use the -pg flag to enable profiling
# Define the target executables
TARGET_BASE = simple_caller
TARGET_PG = simple_caller_pg
TARGET_OPT = simple_caller_opt
TARGET_OPT_PG = simple_caller_opt_pg
# Define the source files
SRCS = simple_caller.c
# Define the object files
OBJS = $(SRCS:.c=.o)
OBJS_PG = $(SRCS:.c=_pg.o)
OBJS_OPT = $(SRCS:.c=_opt.o)
OBJS_OPT_PG = $(SRCS:.c=_pg_opt.o)
# Default target
all: $(TARGET_BASE) $(TARGET_PG) $(TARGET_OPT) $(TARGET_OPT_PG)
# Link the object files to create the executable without profiling
$(TARGET_BASE): $(OBJS)
$(CC) $(CFLAGS) -o $(TARGET_BASE) $(OBJS)
# Link the object files to create the executable with profiling
$(TARGET_PG): $(OBJS_PG)
$(CC) $(CFLAGS) -pg -o $(TARGET_PG) $(OBJS_PG)
# Link the object files to create the optimized executable without profiling
$(TARGET_OPT): $(OBJS_OPT)
$(CC) $(CFLAGS) -Ofast -o $(TARGET_OPT) $(OBJS_OPT)
# Link the object files to create the optimized executable with profiling
$(TARGET_OPT_PG): $(OBJS_OPT_PG)
$(CC) $(CFLAGS) -Ofast -pg -o $(TARGET_OPT_PG) $(OBJS_OPT_PG)
# Compile the source files into object files without profiling
%.o: %.c
$(CC) $(CFLAGS) -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -c $< -o $@
# Compile the source files into object files with profiling
%_pg.o: %.c
$(CC) $(CFLAGS) -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -pg -c $< -o $@
# Compile the source files into optimized object files without profiling
%_opt.o: %.c
$(CC) $(CFLAGS) -Ofast -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -c $< -o $@
# Compile the source files into optimized object files with profiling
%_pg_opt.o: %.c
$(CC) $(CFLAGS) -Ofast -DOUTERLOOP=$(OUTERLOOP) -DINNERLOOP=$(INNERLOOP) -pg -c $< -o $@
# Clean up the build files
clean:
rm -f $(OBJS) $(OBJS_PG) $(OBJS_OPT) $(OBJS_OPT_PG) $(TARGET_BASE) $(TARGET_PG) $(TARGET_OPT) $(TARGET_OPT_PG)
.PHONY: all clean
When we want to get a profile from gprof we ask gcc to insert additional code (a function call) into the compiled binary, hence there is no need for any special hardware support - no issues running the tool in a VM/container etc whereas some tools need to access the CPU directly via passthrough, or you need to be root etc.
The general workflow is as follows
Build a binary using the flag -pg this will create a “special” build of your program which implements profiling - the program should operate as normal, albeit a bit slower. the implementation of profiling is detailed here in the gprof man/help/doc page
There will be an additional symbol/function inserted into the binary that gcc produces. In the output of nm you will see this function - it is called mcount. This is the function that does the work of figuring out which function calls what - so an easy way to see if a binary has profiling enabled is to look for that symbol.
During compilation/linking gcc inserts calls to mcount into each function call in the assembly/binary that it produces for you.
In the output below the symbol counter is just a variable in my program.
gary:~/git/unixfun/simple-call$ cc -o 1b-pg -pg simple_call_1B.c
gary:~/git/unixfun/simple-call$ nm ./1b-pg |grep count
0000000000004018 B counter
U mcount@GLIBC_2.2.5 <----- This binary has profiling enabled
gary:~/git/unixfun/simple-call$ cc -o 1b simple_call_1B.c
gary@:~/git/unixfun/simple-call$ nm ./1b |grep count
0000000000004014 B counter
<------- No mcount, means no profiling
Also in the assembly you will see calls to _mcount. Output from objdump -d <binary>
1242: 53 push %rbx
1243: ff 15 9f 2d 00 00 call *0x2d9f(%rip) # 3fe8 <mcount@GLIBC_2.2.5> <----- this is only seen in pg builds
1249: bb 00 ca 9a 3b mov $0x3b9aca00,%ebx
- When you run the binary compiled with
-pgit will create a binary output file calledgmon.outin the currenty working directory. - Then use the
gproftool to read the gmon.out - it needs to reference the binary (compiled with pg) to show the right symbols/functions etc. In our example below for the binary compiled with profiling. After the run completes - issuegprof 1b-pgand the gprof command will read the file gmon.out and use the symbols in the binary to display the correct function names.