Profiling applications on Linux

How do you profile a multi-threaded C++ application on Linux?

When optimizing a multi-threaded C++ application, profiling is an essential step for identifying performance bottlenecks. Profiling shows how your application uses CPU and memory resources, which helps you pinpoint areas for improvement. In this article, we will look at the various methods available for profiling multi-threaded C++ applications on Linux.

Method 1: Using gprof

To profile a multi-threaded C++ application on Linux using gprof, follow these steps:

  1. Compile your application with the -pg flag so that the compiler inserts profiling instrumentation:
g++ -pg -o myapp myapp.cpp -pthread
  2. Run the application as usual. After it finishes, a file called gmon.out will be generated in the current directory. This file contains the profiling information.
  3. Use gprof to generate a report from the profiling information:
gprof myapp gmon.out > report.txt

Here is an example of a multi-threaded C++ application that calculates the sum of an array using multiple threads:

#include <iostream>
#include <thread>
#include <vector>

const int ARRAY_SIZE = 1000000;
const int NUM_THREADS = 4;

int sum = 0;

void calculateSum(int* array, int start, int end) {
    for (int i = start; i < end; i++) {
        sum += array[i];  // note: unsynchronized access to the global sum
    }
}

int main() {
    int array[ARRAY_SIZE];  // 4 MB array on the stack
    for (int i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i;
    }

    std::vector<std::thread> threads;
    int chunkSize = ARRAY_SIZE / NUM_THREADS;
    for (int i = 0; i < NUM_THREADS; i++) {
        int start = i * chunkSize;
        int end = (i + 1) * chunkSize;
        threads.push_back(std::thread(calculateSum, array, start, end));
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        threads[i].join();
    }

    std::cout << "Sum: " << sum << std::endl;
    return 0;
}

When you run this instrumented application and generate a report, you will see how much time was spent in each function, which can help you identify performance bottlenecks. Be aware, however, that classic gprof has limited support for threads: on many systems its timing only accounts for the main thread, so results for worker threads may be incomplete.

Method 2: Using Valgrind

Valgrind is a powerful tool for profiling C++ applications on Linux. It provides a suite of tools that can help you detect memory leaks, race conditions, and other common issues that can arise in multi-threaded applications.

Here is an example of how to profile a multi-threaded C++ application using Valgrind:

Step 1: Install Valgrind

First, you need to install Valgrind on your Linux machine. You can do this by running the following command:

sudo apt-get install valgrind

Step 2: Compile your C++ application with debugging symbols

Next, you need to compile your C++ application with debugging symbols so that Valgrind can map addresses back to source lines. You can do this by adding the -g flag to your compiler command:

g++ -g -o myapp myapp.cpp -pthread

Step 3: Run your application with Valgrind

Now you can run your application with Valgrind. You can do this by running the following command:

valgrind --tool=helgrind ./myapp

The --tool=helgrind option tells Valgrind to use the Helgrind tool, which is designed for detecting race conditions and other synchronization errors in multi-threaded applications.

Step 4: Analyze the Valgrind output

After running your application with Valgrind, you can analyze the output to identify any issues that Valgrind has detected. Valgrind will provide detailed information about any race conditions that it has detected, including the location in your code where the race condition occurred.

Here is an example of the output that Valgrind might produce:

==12345== Possible data race during read of size 4 at 0x12345678 by thread #1
==12345==    at myapp.cpp:42 (thread #1)
==12345== This conflicts with a previous write of size 4 by thread #2
==12345==    at myapp.cpp:23 (thread #2)
==12345== Locks held:
==12345==    mutex at myapp.cpp:10 by thread #1
==12345==    mutex at myapp.cpp:10 by thread #2

In this example, Valgrind has detected a race condition where two threads are accessing the same memory location without proper synchronization. The output provides information about the location in the code where the race condition occurred, as well as the locks that were held by each thread at the time of the access.

Method 3: Using Perf

Perf is a performance analysis tool for Linux that provides a wide range of profiling features. In this tutorial, we will explore how to use Perf to profile multi-threaded C++ applications on Linux.

Step 1: Install Perf

Perf is usually installed by default on most Linux distributions. If it is not installed, you can install it using the following command:

sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)

Step 2: Compile the Application with Debug Symbols

To enable Perf to provide meaningful profiling data, we need to compile our application with debug symbols. We can do this by passing the -g flag to the compiler.

g++ -g -o myapp myapp.cpp -pthread

Step 3: Start Profiling

We can start profiling a running application by attaching Perf to it (replace <PID> with the process ID of your application):

sudo perf record -g -p <PID>

This command starts Perf and records profiling data for the specified process ID (PID). The -g flag enables call-graph profiling, which provides more detailed profiling information. You can also profile a program from start to finish with sudo perf record -g ./myapp.

Step 4: Stop Profiling

To stop profiling, press Ctrl+C in the terminal where perf record is running, or simply let the application run to completion. Perf writes the collected samples to a file named perf.data in the current directory. You can then view them with:

sudo perf report

This command generates a report that shows the profiling data collected by Perf.

Step 5: Analyze the Report

The report generated by Perf provides detailed information about the performance of our application. We can use various options to customize the report and extract specific information.

For example, we can use the --sort option to group the report by a specific key, such as comm, dso, or symbol. The following command groups the report by symbol, which effectively shows the time spent in each function:

sudo perf report --sort=symbol

We can also use the --call-graph option to generate a call-graph report that shows the performance of each function and its callers. The following command generates a call-graph report for our application:

sudo perf report --call-graph

Method 4: Using GCC profiling and optimization flags

To profile a multi-threaded C++ application on Linux using GCC's built-in profiling support, you can follow these steps:

  1. Compile your application with the -pg flag (an example compiler command is shown further down).
  2. Run your application as usual.
  3. Once your application has finished running, a file named gmon.out will be generated. This file contains the profiling information.
  4. Use the gprof tool to analyze the profiling information. For example:
gprof myapp gmon.out > analysis.txt

This will generate a text file named analysis.txt containing the profiling analysis.

Here are some additional GCC flags that you can use to optimize the performance of your application:

  • -O1 , -O2 , -O3 : optimization levels that can improve the performance of your application. Use -O3 for maximum optimization.
  • -march=native : generate code that is optimized for the host machine’s architecture.
  • -funroll-loops : unroll loops in your code to improve performance.
  • -fprofile-use : use profiling information, collected from a previous run of a binary built with -fprofile-generate , to optimize your code.

Here is an example of how to compile your application with these flags:

g++ -pg -O3 -march=native -funroll-loops -fprofile-use -o myapp myapp.cpp

Note that these flags may not always improve the performance of your application, and may even have negative effects in some cases. It’s important to measure the performance of your application before and after applying these flags to determine if they are beneficial.

Method 5: Using Intel VTune Amplifier

  1. Install Intel VTune Amplifier on your Linux machine.
  2. Compile your C++ application with debug symbols (-g flag).
  3. Run your application with VTune Amplifier:
amplxe-cl -collect threading -r <result-dir> -- <your-app> <app-args>

Here are some examples of using VTune Amplifier to profile a multi-threaded C++ application:

Example 1: Analyzing Thread Concurrency

#include <iostream>
#include <thread>
#include <vector>

void worker(int id) {
    std::cout << "Thread " << id << " started" << std::endl;
    // do some work
    std::cout << "Thread " << id << " finished" << std::endl;
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back(worker, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}

To analyze thread concurrency, run:

amplxe-cl -collect threading -r concurrency -- ./my-app

Example 2: Analyzing Thread Synchronization

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_mutex;

void worker(int id) {
    std::cout << "Thread " << id << " started" << std::endl;
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        // do some work
    }
    std::cout << "Thread " << id << " finished" << std::endl;
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back(worker, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}

To analyze thread synchronization, run:

amplxe-cl -collect threading -r synchronization -- ./my-app

Example 3: Analyzing Thread Contentions

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex g_mutex;

void worker(int id) {
    std::cout << "Thread " << id << " started" << std::endl;
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        // do some work
    }
    std::cout << "Thread " << id << " finished" << std::endl;
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back(worker, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}

To analyze thread contentions, run:

amplxe-cl -collect threading -r contentions -- ./my-app

Example 4: Analyzing Thread Wait Time

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

void worker(int id) {
    std::cout << "Thread " << id << " waiting" << std::endl;
    // simulate time spent waiting
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        threads.emplace_back(worker, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return 0;
}

To analyze thread wait time, run:

amplxe-cl -collect threading -r wait-time -- ./my-app

These are just a few examples of how to use Intel VTune Amplifier to profile a multi-threaded C++ application on Linux. The tool provides many more features and options for advanced profiling and analysis.


How to profile your applications using the Linux perf tools

When an application encounters performance issues, we have to find the code that causes the problem, so that we optimize only what really matters.

Profilers are very useful for finding that code. In this post, we'll use the Linux perf tools to profile a simple C++ application.

The perf tools have been integrated into the Linux kernel since version 2.6. They are built on the perf events subsystem and use hardware performance counters to profile the application. The results are very precise and, because perf does not instrument the code, profiling is very fast.

Installation

First of all, if you have not already done so, you need to install the perf tools on your computer. On Ubuntu, you can use apt-get to install them:

sudo apt-get install linux-tools

On other systems, use your favorite package manager to install the perf tools.

Usage

To profile an application, you first have to record information about an execution of the program. For that, use perf record :

perf record program [program_options]
perf record eddic assembly.eddi

Once the execution is over, perf gives you some information about the recording, for example:

[ perf record: Woken up 44 times to write data ] [ perf record: Captured and wrote 11.483 MB perf.data (~501721 samples) ]

If you see that the perf.data file is very small, generally for short execution times, you can configure the event period with the --count=period option, using a small period to collect more samples. A period of 1000 usually works well, and an even smaller value can be useful when the application has a very short execution time.

Then, to see the list of the most costly functions, you just have to use perf report :

perf report

It will display a list of the most costly functions, ordered by cost. Here is an example taken from one of my C++ applications (function names are shortened to fit the screen):

# Events: 374K cycles
#
# Overhead  Command   Shared Object             Symbol
# ........  .......   .............             ......
#
    87.58%  inlining  [vesafb]                  [k] 0xffffffff81100700
    83.27%  readelf   [vesafb]                  [k] 0xffffffff815c3930
    41.40%  sh        [vesafb]                  [k] 0xffffffff815c3930
    37.74%  inlining  libstdc++.so.6.0.14       [.] 0x653e0
    12.49%  readelf   libc-2.13.so              [.] vfprintf
     5.37%  inlining  inlining                  [.] parseFunction(std::string, std::string, std::map
     5.20%  readelf   libc-2.13.so              [.] _IO_new_file_xsputn
     4.50%  readelf   readelf                   [.] 0x150e
     4.10%  inlining  libc-2.13.so              [.] _int_malloc
     4.01%  inlining  libc-2.13.so              [.] memcpy
     3.80%  readelf   libc-2.13.so              [.] ___printf_chk
     2.58%  inlining  libc-2.13.so              [.] __malloc
     1.86%  inlining  libc-2.13.so              [.] _IO_fgets
     1.84%  inlining  inlining                  [.] parseExecutable(std::string, std::set
     1.83%  readelf   libc-2.13.so              [.] __strchrnul
     1.83%  inlining  libc-2.13.so              [.] _int_free
     1.77%  inlining  libc-2.13.so              [.] __strlen_sse42
     1.50%  inlining  libc-2.13.so              [.] cfree
     1.48%  inlining  libc-2.13.so              [.] __memchr
     1.23%  inlining  inlining                  [.] parseLibrary(std::string, std::se
     1.19%  inlining  libboost_graph.so.1.46.1  [.] char* std::string::_S_construct(char const*)
     1.17%  readelf   libc-2.13.so              [.] __dcigettext
     1.15%  inlining  libc-2.13.so              [.] _IO_getline_info_internal

For every function, you can see its cost, the command used to launch it, and the shared object in which it is located. You can navigate through the list as in the more utility.

This tool is really useful for seeing which functions are worth optimizing in order to increase the overall performance of the application.

For more information about the perf tools, you can read the perf wiki.

In a future article, I will talk about another profiler, Callgrind.

