# Simpleperf

Android Studio includes a graphical front end to Simpleperf, documented in
[Inspect CPU activity with CPU Profiler](https://developer.android.com/studio/profile/cpu-profiler).
Most users will prefer to use that instead of using Simpleperf directly.

Simpleperf is a native CPU profiling tool for Android. It can be used to profile
both Android applications and native processes running on Android. It can
profile both Java and C++ code on Android. The simpleperf executable can run on Android >= L,
and the Python scripts can be used on Android >= N.

Simpleperf is part of the Android Open Source Project.
The source code is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/).
The latest documentation is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/README.md).

[TOC]
## Introduction

An introduction slide deck is [here](./introduction.pdf).

Simpleperf contains two parts: the simpleperf executable and the Python scripts.

The simpleperf executable works similarly to linux-tools-perf, but has some features specific to
the Android profiling environment:

1. It collects more info in profiling data. Since the common workflow is "record on the device, and
   report on the host", simpleperf not only collects samples in profiling data, but also collects
   needed symbols, device info and recording time.

2. It delivers new features for recording.
   1) When recording a dwarf based call graph, simpleperf unwinds the stack before writing a sample
      to file. This saves storage space on the device.
   2) It supports tracing both on CPU time and off CPU time with the --trace-offcpu option.
   3) It supports recording call graphs of JITed and interpreted Java code on Android >= P.

3. It relates closely to the Android platform.
   1) It is aware of the Android environment, e.g. using system properties to enable profiling, and
      using run-as to profile in an application's context.
   2) It supports reading symbols and debug information from the .gnu_debugdata section, because
      system libraries are built with the .gnu_debugdata section starting from Android O.
   3) It supports profiling shared libraries embedded in apk files.
   4) It uses the standard Android stack unwinder, so its results are consistent with all other
      Android tools.

4. It builds executables and shared libraries for different usages.
   1) It builds static executables on the device. Since static executables don't rely on any
      library, simpleperf executables can be pushed to any Android device and used to record
      profiling data.
   2) It builds executables on different hosts: Linux, Mac and Windows. These executables can be
      used to report on hosts.
   3) It builds report shared libraries on different hosts. The report library is used by different
      Python scripts to parse profiling data.

Detailed documentation for the simpleperf executable is [here](#executable-commands-reference).

The Python scripts are split into three parts according to their functions:

1. Scripts used for recording, like app_profiler.py and run_simpleperf_without_usb_connection.py.

2. Scripts used for reporting, like report.py, report_html.py and inferno.

3. Scripts used for parsing profiling data, like simpleperf_report_lib.py.

The Python scripts are tested on Python >= 3.9. Older versions may not be supported.
Detailed documentation for the Python scripts is [here](#scripts-reference).
## Tools in simpleperf

The simpleperf executables and Python scripts are located in simpleperf/ in ndk releases, and in
system/extras/simpleperf/scripts/ in AOSP. Their functions are listed below.

bin/: contains executables and shared libraries.

bin/android/${arch}/simpleperf: static simpleperf executables used on the device.

bin/${host}/${arch}/simpleperf: simpleperf executables used on the host, which only support
reporting.

bin/${host}/${arch}/libsimpleperf_report.${so/dylib/dll}: report shared libraries used on the host.

*.py, inferno, purgatorio: Python scripts used for recording and reporting. Details are in
[scripts_reference.md](scripts_reference.md).
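Because the device executables are static, they can be used directly. Below is a minimal sketch of
pushing and running one (the arm64 variant and `<pid>` are placeholders; adjust for your device):

```sh
# Push the static arm64 executable to a writable directory on the device.
$ adb push bin/android/arm64/simpleperf /data/local/tmp
$ adb shell chmod a+x /data/local/tmp/simpleperf

# Record a process for 10 seconds, writing profiling data to perf.data on the device.
$ adb shell /data/local/tmp/simpleperf record -p <pid> --duration 10 -o /data/local/tmp/perf.data
```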
## Android application profiling

See [android_application_profiling.md](./android_application_profiling.md).

## Android platform profiling

See [android_platform_profiling.md](./android_platform_profiling.md).

## Executable commands reference

See [executable_commands_reference.md](./executable_commands_reference.md).

## Scripts reference

See [scripts_reference.md](./scripts_reference.md).

## View the profile

See [view_the_profile.md](./view_the_profile.md).

## Answers to common issues
### Support on different Android versions

On Android < N, the kernel may be too old (< 3.18) to support features like recording DWARF
based call graphs.
On Android M - O, we can only profile C++ code and fully compiled Java code.
On Android >= P, the ART interpreter supports DWARF based unwinding, so we can profile Java code.
On Android >= Q, we can use the simpleperf shipped on device to profile released Android apps, with
`<profileable android:shell="true" />`.
### Comparing DWARF based and stack frame based call graphs

Simpleperf supports two ways of recording call stacks with samples. One is the DWARF based call
graph, the other is the stack frame based call graph. Below is their comparison:

Recording DWARF based call graph:
1. Needs support of debug information in binaries.
2. Generally works well on both ARM and ARM64, for both Java code and C++ code.
3. Can only unwind 64K of stack for each sample. So it isn't always possible to unwind to the
   bottom. However, this is alleviated in simpleperf, as explained in the next section.
4. Takes more CPU time than stack frame based call graphs. So it has higher overhead, and can't
   sample at very high frequency (usually <= 4000 Hz).

Recording stack frame based call graph:
1. Needs support of stack frame registers.
2. Doesn't work well on ARM, because ARM is short of registers, and ARM and THUMB code have
   different stack frame registers, so the kernel can't unwind a user stack containing both ARM
   and THUMB code.
3. Also doesn't work well on Java code, because the ART compiler doesn't reserve stack frame
   registers, and it can't get frames for interpreted Java code.
4. Works well when profiling native programs on ARM64. One example is profiling surfaceflinger.
   It usually shows a complete flamegraph when it works well.
5. Takes much less CPU time than DWARF based call graphs. So the sample frequency can be 10000 Hz
   or higher.

So if you need to profile code on ARM or profile Java code, the DWARF based call graph is better.
If you need to profile C++ code on ARM64, the stack frame based call graph may be better. In any
case, you can first try the DWARF based call graph, which is also the default option when `-g` is
used, since it generally produces reasonable results. If it doesn't work well enough, then try the
stack frame based call graph instead.
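To make the two options concrete, below is a minimal sketch of recording each kind of call graph
(`<pid>` is a placeholder; both options are documented in the executable commands reference):

```sh
# Record a DWARF based call graph (also the default when -g is used).
$ simpleperf record -g -p <pid> --duration 10

# Record a stack frame based call graph.
$ simpleperf record --call-graph fp -p <pid> --duration 10
```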
### Fix broken DWARF based call graph

A DWARF based call graph is generated by unwinding thread stacks. When a sample is recorded, the
kernel dumps up to 64 kilobytes of stack data. By unwinding the stack based on DWARF information,
we can get a call stack.

Two reasons may cause a broken call stack:
1. The kernel can only dump up to 64 kilobytes of stack data for each sample, but a thread can have
   a much larger stack. In this case, we can't unwind to the thread start point.

2. We need binaries containing DWARF call frame information to unwind stack frames. The binary
   should have one of the following sections: .eh_frame, .debug_frame, .ARM.exidx or .gnu_debugdata
   (a quick way to check is shown below).
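To check whether a binary carries any of these sections, you can list its sections with readelf or
llvm-readelf (a sketch; `libfoo.so` is a placeholder for your binary):

```sh
# List the sections of the binary and look for unwind/debug info sections.
$ readelf -S libfoo.so | grep -E 'eh_frame|debug_frame|ARM.exidx|gnu_debugdata'
```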
To mitigate these problems:

For the missing stack data problem:
1. To alleviate it, simpleperf joins callchains (call stacks) after recording. If two callchains of
   a thread have an entry containing the same ip and sp address, then simpleperf tries to join them
   to make the callchains longer. So we can get more complete callchains by recording longer and
   joining more samples. This doesn't guarantee complete call graphs, but it usually works well.

2. Simpleperf stores samples in a buffer before unwinding them. If the buffer is low on free space,
   simpleperf may decide to truncate the stack data for a sample to 1K. Hopefully, this can be
   recovered by the callchain joiner. But when a high percentage of samples are truncated, many
   callchains can be broken. We can tell if many samples are truncated from the record command
   output, like:

```sh
$ simpleperf record ...
simpleperf I cmd_record.cpp:809] Samples recorded: 105584 (cut 86291). Samples lost: 6501.

$ simpleperf record ...
simpleperf I cmd_record.cpp:894] Samples recorded: 7,365 (1,857 with truncated stacks).
```

There are two ways to avoid truncating samples. One is increasing the buffer size, like
`--user-buffer-size 1G`. But `--user-buffer-size` is only available in the latest simpleperf. If
that option isn't available, we can use `--no-cut-samples` to disable truncating samples.
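A minimal sketch of the two options (`<pid>` is a placeholder):

```sh
# Enlarge the userspace buffer, so samples don't need to be truncated.
$ simpleperf record -g --user-buffer-size 1G -p <pid> --duration 10

# On older simpleperf versions, disable truncating samples instead.
$ simpleperf record -g --no-cut-samples -p <pid> --duration 10
```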
For the missing DWARF call frame info problem:
1. Most C++ code generates binaries containing call frame info, in .eh_frame or .ARM.exidx
   sections. These sections are not stripped, and are usually enough for stack unwinding.

2. For C code and the small percentage of C++ code that the compiler is sure will not generate
   exceptions, the call frame info is generated in the .debug_frame section. The .debug_frame
   section is usually stripped with the other debug sections. One way to fix this is to download
   unstripped binaries to the device, as described
   [here](#fix-broken-callchain-stopped-at-c-functions).

3. The compiler doesn't generate unwind instructions for function prologues and epilogues, because
   they operate on stack frames and will not generate exceptions. But profiling may hit these
   instructions and fail to unwind them. This usually doesn't matter in a flamegraph. But in a
   time based Stack Chart (like in Android Studio and Firefox Profiler), this causes stack gaps
   once in a while. We can remove stack gaps via `--remove-gaps`, which is already enabled by
   default.
### Fix broken callchain stopped at C functions

When using dwarf based call graphs, simpleperf generates callchains during recording to save space.
The debug information needed to unwind C functions is in the .debug_frame section, which is usually
stripped from native libraries in apks. To fix this, we can download unstripped versions of the
native libraries to the device, and ask simpleperf to use them when recording.

To use simpleperf directly:

```sh
# Create a native_libs dir on device, and push unstripped libs into it (nested dirs are not
# supported).
$ adb shell mkdir /data/local/tmp/native_libs
$ adb push <unstripped_dir>/*.so /data/local/tmp/native_libs
# Run simpleperf record with the --symfs option.
$ adb shell simpleperf record xxx --symfs /data/local/tmp/native_libs
```

To use app_profiler.py:

```sh
$ ./app_profiler.py -lib <unstripped_dir>
```
### How to solve missing symbols in report?

The simpleperf record command collects symbols on device into perf.data. But if the native
libraries you use on device are stripped, this will result in a lot of unknown symbols in the
report. A solution is to build a binary_cache on the host.

```sh
# Collect binaries needed by perf.data in binary_cache/.
$ ./binary_cache_builder.py -lib NATIVE_LIB_DIR,...
```

The NATIVE_LIB_DIRs passed in the -lib option are the directories containing unstripped native
libraries on the host. After running it, the native libraries containing symbol tables are
collected in binary_cache/ for use when reporting.

```sh
$ ./report.py --symfs binary_cache

# report_html.py searches binary_cache/ automatically, so you don't need to
# pass it any argument.
$ ./report_html.py
```
### Show annotated source code and disassembly

To show hot places at the source code and instruction level, we need to show source code and
disassembly with event count annotation. Simpleperf supports showing annotated source code and
disassembly for C++ code and fully compiled Java code, in two ways:

1. Through report_html.py:
   1) Generate perf.data and pull it to the host.
   2) Generate binary_cache, containing elf files with debug information. Use the -lib option to
      add libs with debug info. Do it with
      `binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>`.
   3) Use report_html.py to generate report.html with annotated source code and disassembly,
      as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#report_html_py).

2. Through pprof:
   1) Generate perf.data and binary_cache as above.
   2) Use pprof_proto_generator.py to generate a pprof proto file (see the sketch after this list).
   3) Use pprof to report a function with annotated source code, as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#pprof_proto_generator_py).
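Roughly, the pprof route looks like this (a sketch; `pprof.profile` assumes the script's default
output name, and google/pprof is assumed to be installed on the host):

```sh
# Generate binary_cache and a pprof proto file from perf.data.
$ ./binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>
$ ./pprof_proto_generator.py -i perf.data

# Browse the annotated profile with pprof's web UI.
$ pprof -http=:8080 pprof.profile
```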
### Reduce lost samples and samples with truncated stack

When using `simpleperf record`, we may see lost samples or samples with truncated stack data.
Before saving samples to a file, simpleperf uses two buffers to cache samples in memory. One is a
kernel buffer, the other is a userspace buffer. The kernel puts samples into the kernel buffer.
Simpleperf moves samples from the kernel buffer to the userspace buffer before processing them. If
a buffer overflows, we lose samples or get samples with truncated stack data. Below is an example.

```sh
$ simpleperf record -a --duration 1 -g --user-buffer-size 100k
simpleperf I cmd_record.cpp:799] Recorded for 1.00814 seconds. Start post processing.
simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks).
                                 Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).
simpleperf W cmd_record.cpp:911] Lost 18.5567% of samples in kernel space, consider increasing
                                 kernel buffer size(-m), or decreasing sample frequency(-f), or
                                 increasing sample period(-c).
simpleperf W cmd_record.cpp:928] Lost/Truncated 97.1233% of samples in user space, consider
                                 increasing userspace buffer size(--user-buffer-size), or
                                 decreasing sample frequency(-f), or increasing sample period(-c).
```

In the above example, we get 79 samples, 16 of which have truncated stack data. We lose 18 samples
in the kernel buffer, and 2,111 samples in the userspace buffer.

To reduce lost samples in the kernel buffer, we can increase the kernel buffer size via `-m`. To
reduce lost samples in the userspace buffer, or reduce samples with truncated stack data, we can
increase the userspace buffer size via `--user-buffer-size`.

We can also reduce the number of samples generated in a fixed time period, e.g. by reducing the
sample frequency using `-f`, monitoring fewer threads, or not monitoring multiple perf events at
the same time.
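A minimal sketch of these knobs (`<pid>` and the concrete values are placeholders):

```sh
# Increase both buffer sizes: -m sets the kernel buffer size in pages,
# --user-buffer-size sets the userspace buffer size.
$ simpleperf record -g -m 1024 --user-buffer-size 1G -p <pid> --duration 10

# Or generate fewer samples by lowering the sample frequency.
$ simpleperf record -g -f 500 -p <pid> --duration 10
```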
## Bugs and contribution

Bugs and feature requests can be submitted at https://github.com/android/ndk/issues.
Patches can be uploaded to android-review.googlesource.com as described [here](https://source.android.com/setup/contribute/),
or sent to the email addresses listed [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/OWNERS).

If you want to compile the simpleperf C++ source code, follow the steps below:
1. Download the AOSP main branch as described [here](https://source.android.com/setup/build/requirements).
2. Build simpleperf.

```sh
$ . build/envsetup.sh
$ lunch aosp_arm64-trunk_staging-userdebug
$ mmma system/extras/simpleperf -j30
```

If built successfully, out/target/product/generic_arm64/system/bin/simpleperf is for ARM64, and
out/target/product/generic_arm64/system/bin/simpleperf32 is for ARM.

The source code of the simpleperf Python scripts is in [system/extras/simpleperf/scripts](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/scripts/).
Most scripts rely on the simpleperf binaries to work. To update the binaries for the scripts (using
a linux x86_64 host and an android arm64 target as an example):

```sh
$ cp out/host/linux-x86/lib64/libsimpleperf_report.so system/extras/simpleperf/scripts/bin/linux/x86_64/libsimpleperf_report.so
$ cp out/target/product/generic_arm64/system/bin/simpleperf_ndk64 system/extras/simpleperf/scripts/bin/android/arm64/simpleperf
```

Then you can try the latest simpleperf scripts and binaries in system/extras/simpleperf/scripts.
# Android application profiling

This section shows how to profile an Android application.
Some examples are [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/demo/README.md).

Profiling an Android application involves three steps:
1. Prepare the Android application.
2. Record profiling data.
3. Report profiling data.

[TOC]
## Prepare an Android application

Based on the profiling situation, we may need to customize the build script to generate an apk file
specifically for profiling. Below are some suggestions.

1. If you want to profile a debug build of an application:

For the debug build type, Android Studio sets android:debuggable="true" in AndroidManifest.xml,
enables JNI checks and may not optimize C/C++ code. It can be profiled by simpleperf without any
change.

2. If you want to profile a release build of an application:

For the release build type, Android Studio sets android:debuggable="false" in AndroidManifest.xml,
disables JNI checks and optimizes C/C++ code. However, security restrictions mean that only apps
with android:debuggable set to true can be profiled. So simpleperf can only profile a release
build under one of these three circumstances:

If you are on a rooted device, you can profile any app.

If you are on Android >= Q, you can add the profileableFromShell flag in AndroidManifest.xml; this
makes a release app profileable by preinstalled profiling tools. In this case, simpleperf
downloaded by adb will invoke the simpleperf preinstalled in the system image to profile the app.

```
<manifest ...>
    <application ...>
        <profileable android:shell="true" />
    </application>
</manifest>
```

If you are on Android >= O, you can use [wrap.sh](https://developer.android.com/ndk/guides/wrap-script.html)
to profile a release build:

Step 1: Add android:debuggable="true" in AndroidManifest.xml to enable profiling.
```
<manifest ...>
    <application android:debuggable="true" ...>
```

Step 2: Add wrap.sh in the lib/`arch` directories. wrap.sh runs the app without passing any debug
flags to ART, so the app runs as a release app. This can be done by adding the script below to
app/build.gradle.
```
android {
    buildTypes {
        release {
            sourceSets {
                release {
                    resources {
                        srcDir {
                            "wrap_sh_lib_dir"
                        }
                    }
                }
            }
        }
    }
}

task createWrapShLibDir {
    for (String abi : ["armeabi-v7a", "arm64-v8a", "x86", "x86_64"]) {
        def dir = new File("app/wrap_sh_lib_dir/lib/" + abi)
        dir.mkdirs()
        def wrapFile = new File(dir, "wrap.sh")
        wrapFile.withWriter { writer ->
            writer.write('#!/system/bin/sh\n\$@\n')
        }
    }
}
```
3. If you want to profile C/C++ code:

Android Studio strips the symbol table and debug info of native libraries in the apk. So the
profiling results may contain unknown symbols or broken callgraphs. To fix this, we can pass
app_profiler.py a directory containing unstripped native libraries via the -lib option. Usually
the directory can be the path of your Android Studio project.

4. If you want to profile Java code:

On Android >= P, simpleperf supports profiling Java code, no matter whether it is executed by
the interpreter, JITed, or compiled into native instructions. So you don't need to do anything.

On Android O, simpleperf supports profiling Java code which is compiled into native instructions,
and it also needs wrap.sh to use the compiled Java code. To compile Java code, we can pass
app_profiler.py the --compile_java_code option.

On Android N, simpleperf supports profiling Java code that is compiled into native instructions.
To compile Java code, we can pass app_profiler.py the --compile_java_code option.

On Android <= M, simpleperf doesn't support profiling Java code.
Below we use the application [SimpleperfExampleCpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/demo/SimpleperfExampleCpp).
It builds an app-debug.apk for profiling.

```sh
$ git clone https://android.googlesource.com/platform/system/extras
$ cd extras/simpleperf/demo
# Open the SimpleperfExampleCpp project with Android Studio, and build it
# successfully, otherwise the `./gradlew` command below will fail.
$ cd SimpleperfExampleCpp

# On Windows, use "gradlew" instead.
$ ./gradlew clean assemble
$ adb install -r app/build/outputs/apk/debug/app-debug.apk
```
## Record and report profiling data

We can use [app_profiler.py](scripts_reference.md#app_profilerpy) to profile Android applications.

```sh
# Cd to the directory of simpleperf scripts. Record perf.data.
# -p option selects the profiled app using its package name.
# --compile_java_code option compiles Java code into native instructions, which isn't needed on
# Android >= P.
# -a option selects the Activity to profile.
# -lib option gives the directory in which to find debug native libraries.
$ ./app_profiler.py -p simpleperf.example.cpp -a .MixActivity -lib path_of_SimpleperfExampleCpp
```

This will collect profiling data in perf.data in the current directory, and related native
binaries in binary_cache/.

Normally we need to use the app while profiling, otherwise we may record no samples. But in this
case, the MixActivity starts a busy thread. So we don't need to use the app while profiling.
```sh
# Report perf.data in the stdio interface.
$ ./report.py
Cmdline: /data/data/simpleperf.example.cpp/simpleperf record ...
Arch: arm64
Event: task-clock:u (type 1, config 1)
Samples: 10023
Event count: 10023000000

Overhead  Command     Pid   Tid   Shared Object            Symbol
27.04%    BusyThread  5703  5729  /system/lib64/libart.so  art::JniMethodStart(art::Thread*)
25.87%    BusyThread  5703  5729  /system/lib64/libc.so    long StrToI<long, ...
...
```

[report.py](scripts_reference.md#reportpy) reports profiling data in the stdio interface. If there
are a lot of unknown symbols in the report, check [here](README.md#how-to-solve-missing-symbols-in-report).

```sh
# Report perf.data in the html interface.
$ ./report_html.py

# Add source code and disassembly. Change the path of source_dirs if it is not correct.
$ ./report_html.py --add_source_code --source_dirs path_of_SimpleperfExampleCpp \
  --add_disassembly
```

[report_html.py](scripts_reference.md#report_htmlpy) generates the report in report.html, and pops
up a browser tab to show it.
## Record and report call graph

We can record and report [call graphs](executable_commands_reference.md#record-call-graphs) as
below.

```sh
# Record dwarf based call graphs: add "-g" in the -r option.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e task-clock:u -f 1000 --duration 10 -g" -lib path_of_SimpleperfExampleCpp

# Record stack frame based call graphs: add "--call-graph fp" in the -r option.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e task-clock:u -f 1000 --duration 10 --call-graph fp" \
    -lib path_of_SimpleperfExampleCpp

# Report call graphs in the stdio interface.
$ ./report.py -g

# Report call graphs in the python Tk interface.
$ ./report.py -g --gui

# Report call graphs in the html interface.
$ ./report_html.py

# Report call graphs as flamegraphs.
# On Windows, use inferno.bat instead of ./inferno.sh.
$ ./inferno.sh -sc
```
## Report in html interface

We can use [report_html.py](scripts_reference.md#report_htmlpy) to show profiling results in a web
browser. report_html.py integrates chart statistics, sample table, flamegraphs, source code
annotation and disassembly annotation. It is the recommended way to show reports.

```sh
$ ./report_html.py
```

## Show flamegraph

To show flamegraphs, we need to first record call graphs. Flamegraphs are shown by
report_html.py in the "Flamegraph" tab.
We can also use [inferno](scripts_reference.md#inferno) to show flamegraphs directly.

```sh
# On Windows, use inferno.bat instead of ./inferno.sh.
$ ./inferno.sh -sc
```

We can also build flamegraphs using https://github.com/brendangregg/FlameGraph.
Make sure you have perl installed.

```sh
$ git clone https://github.com/brendangregg/FlameGraph.git
$ ./report_sample.py --symfs binary_cache >out.perf
$ FlameGraph/stackcollapse-perf.pl out.perf >out.folded
$ FlameGraph/flamegraph.pl out.folded >a.svg
```
## Report in Android Studio

The simpleperf report-sample command can convert perf.data into the protobuf format accepted by
the Android Studio CPU profiler. The conversion can be done either on device or on host. If you
have more symbol info on the host, prefer doing it on the host with the --symdir option.

```sh
$ simpleperf report-sample --protobuf --show-callchain -i perf.data -o perf.trace
# Then open perf.trace in Android Studio to show it.
```

## Deobfuscate Java symbols

Java symbols may be obfuscated by ProGuard. To restore the original symbols in a report, we can
pass a ProGuard mapping file to the report scripts or the report-sample command via
`--proguard-mapping-file`.

```sh
$ ./report_html.py --proguard-mapping-file proguard_mapping_file.txt
```
## Record both on CPU time and off CPU time

We can [record both on CPU time and off CPU time](executable_commands_reference.md#record-both-on-cpu-time-and-off-cpu-time).

First, check whether the trace-offcpu feature is supported on the device.

```sh
$ ./run_simpleperf_on_device.py list --show-features
dwarf-based-call-graph
trace-offcpu
```

If trace-offcpu is supported, it will be shown in the feature list. Then we can try it.

```sh
$ ./app_profiler.py -p simpleperf.example.cpp -a .SleepActivity \
    -r "-g -e task-clock:u -f 1000 --duration 10 --trace-offcpu" \
    -lib path_of_SimpleperfExampleCpp
$ ./report_html.py --add_disassembly --add_source_code \
    --source_dirs path_of_SimpleperfExampleCpp
```
## Profile from launch

We can [profile from the launch of an application](scripts_reference.md#profile-from-launch-of-an-application).

```sh
# Start simpleperf recording, then start the Activity to profile.
$ ./app_profiler.py -p simpleperf.example.cpp -a .MainActivity

# We can also start the Activity on the device manually.
# 1. Make sure the application isn't running or one of the recent apps.
# 2. Start simpleperf recording.
$ ./app_profiler.py -p simpleperf.example.cpp
# 3. Start the app manually on the device.
```
## Control recording in application code

Simpleperf supports controlling recording from application code. Below is the workflow:

1. Run `api_profiler.py prepare -p <package_name>` to allow the app to record itself using
simpleperf. By default, the permission is reset after device reboot, so we need to run the
script every time the device reboots. But on Android >= 13, we can use the `--days` option to
set how long we want the permission to last.

2. Link the simpleperf app_api code into the application. The app needs to be debuggable or
profileableFromShell as described [here](#prepare-an-android-application). Then the app can
use the api to start/pause/resume/stop recording. To start recording, the app_api forks a child
process running simpleperf, and uses pipe files to send commands to the child process. After
recording, a profiling data file is generated.

3. Run `api_profiler.py collect -p <package_name>` to collect profiling data files to the host.

Examples are CppApi and JavaApi in the [demo](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/demo).
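Concretely, the host-side half of this workflow looks like the sketch below (`<package_name>` is a
placeholder, and the 30-day duration is an arbitrary example):

```sh
# Allow the app to record itself; on Android >= 13, --days extends how long the permission lasts.
$ ./api_profiler.py prepare -p <package_name> --days 30

# ... run the app, which starts/pauses/resumes/stops recording through the app_api ...

# Afterwards, collect the generated profiling data files to the host.
$ ./api_profiler.py collect -p <package_name>
```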
## Parse profiling data manually

We can also write Python scripts to parse profiling data manually, by using
[simpleperf_report_lib.py](scripts_reference.md#simpleperf_report_libpy). Examples are
report_sample.py and report_html.py.
# Android platform profiling

[TOC]

## General Tips

Here are some tips for Android platform developers, who build and flash system images on rooted
devices:
1. After running `adb root`, simpleperf can be used to profile any process or system wide.
2. It is recommended to use the latest simpleperf available in AOSP main, if you are not working
   on the current main branch. Scripts are in `system/extras/simpleperf/scripts`, binaries are in
   `system/extras/simpleperf/scripts/bin/android`.
3. It is recommended to use `app_profiler.py` for recording, and `report_html.py` for reporting.
   Below is an example.
```sh
# Record the surfaceflinger process for 10 seconds with a dwarf based call graph. More examples
# are in the scripts reference in the doc.
$ ./app_profiler.py -np surfaceflinger -r "-g --duration 10"

# Generate an html report.
$ ./report_html.py
```

4. Since Android >= O has symbols for system libraries on device, we don't need to use unstripped
   binaries in `$ANDROID_PRODUCT_OUT/symbols` to report call graphs. However, they are needed to
   add source code and disassembly (with line numbers) to the report. Below is an example.

```sh
# Record with app_profiler.py or simpleperf on device, generating perf.data on the host.
$ ./app_profiler.py -np surfaceflinger -r "--call-graph fp --duration 10"

# Collect unstripped binaries from $ANDROID_PRODUCT_OUT/symbols into binary_cache/.
$ ./binary_cache_builder.py -lib $ANDROID_PRODUCT_OUT/symbols

# Report source code and disassembly. Disassembling all binaries is slow, so it's better to add
# the --binary_filter option to only disassemble selected binaries.
$ ./report_html.py --add_source_code --source_dirs $ANDROID_BUILD_TOP --add_disassembly \
  --binary_filter surfaceflinger.so
```
## Start simpleperf from system_server process

Sometimes we want to profile a process or system-wide when a special situation happens. In this
case, we can add code that starts simpleperf at the point where the situation is detected.

1. Disable SELinux by `adb shell setenforce 0`, because SELinux only allows simpleperf to run in
   shell or in debuggable/profileable apps.

2. Add the code below at the point where the special situation is detected.

```java
try {
    // For the capability check.
    Os.prctl(OsConstants.PR_CAP_AMBIENT, OsConstants.PR_CAP_AMBIENT_RAISE,
             OsConstants.CAP_SYS_PTRACE, 0, 0);
    // Write to /data instead of /data/local/tmp, because /data can be written by the system user.
    Runtime.getRuntime().exec("/system/bin/simpleperf record -g -p " + String.valueOf(Process.myPid())
        + " -o /data/perf.data --duration 30 --log-to-android-buffer --log verbose");
} catch (Exception e) {
    Slog.e(TAG, "error while running simpleperf");
    e.printStackTrace();
}
```
## Hardware PMU counter limit

When monitoring instruction and cache related perf events (in the hw/cache/raw/pmu categories of
the list cmd), these events are mapped to PMU counters on each cpu core. But each core only has a
limited number of PMU counters. If the number of events is larger than the number of PMU counters,
then the counters are multiplexed among the events, which probably isn't what we want. We can use
`simpleperf stat --print-hw-counter` to show the hardware counters (per core) available on the
device.

On Pixel devices, the number of PMU counters on each core is usually 7, 4 of which are used by the
kernel to monitor memory latency. So only 3 counters are available. It's fine to monitor up to 3
PMU events at the same time. To monitor more than 3 events, the `--use-devfreq-counters` option
can be used to borrow from the counters used by the kernel.
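For example (a sketch; the event set is arbitrary and `<pid>` is a placeholder):

```sh
# Show how many hardware counters are available per core.
$ simpleperf stat --print-hw-counter

# Monitor no more PMU events than the available counters, to avoid multiplexing.
$ simpleperf stat -e cpu-cycles,instructions,branch-misses -p <pid> --duration 10
```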
## Get boot-time profile

On userdebug/eng devices, we can get a boot-time profile via simpleperf.

Step 1. Customize the configuration if needed. By default, simpleperf tracks all processes
except for itself, starts at `early-init`, and stops when `sys.boot_completed` is set.
You can customize this by changing the trigger or the command line flags in
`system/extras/simpleperf/simpleperf.rc`.

Step 2. Add `androidboot.simpleperf.boot_record=1` to the kernel command line.
For example, on Pixel devices, you can do
```
$ fastboot oem cmdline add androidboot.simpleperf.boot_record=1
```

Step 3. Reboot the device. When booting, init finds that the kernel command line flag is set,
so it forks a background process to run simpleperf to record a boot-time profile.
init starts simpleperf at the `early-init` stage, which is very soon after second-stage init
starts.

Step 4. After boot, the boot-time profile is stored in /tmp/boot_perf.data. Then we can pull
the profile to the host to report it.

```
$ adb shell ls /tmp/boot_perf.data
/tmp/boot_perf.data
```

Following is a boot-time profile example. From the timestamps, the first sample is generated at
about 4.5 seconds after booting.


# Collect ETM data for AutoFDO

[TOC]

## Introduction

ETM is a hardware feature available on arm64 devices. It collects the instruction stream running on
each cpu. ARM uses ETM as an alternative to LBR (last branch record) on x86.
Simpleperf supports collecting ETM data, and converting it to input files for AutoFDO, which can
then be used for PGO (profile-guided optimization) during compilation.

On ARMv8, ETM is considered an external debug interface (unless the ARMv8.4 Self-hosted Trace
extension is implemented). So it needs to be enabled explicitly in the bootloader, and isn't
available on user devices. For Pixel devices, it's available on EVT and DVT devices of Pixel 4,
Pixel 4a (5G) and Pixel 5. To test whether it's available on other devices, you can follow the
commands in this doc and see if you can record any ETM data.
## Examples

Below are examples collecting ETM data for AutoFDO. There are two steps: first, record ETM data;
second, convert the ETM data to AutoFDO input files.

Record ETM data:

```sh
# Preparation: we need to be root to record ETM data.
$ adb root
$ adb shell
redfin:/ # cd data/local/tmp
redfin:/data/local/tmp #

# Do a system wide collection; it writes output to perf.data.
# If you only want ETM data for the kernel, use `-e cs-etm:k`.
# If you only want ETM data for userspace, use `-e cs-etm:u`.
redfin:/data/local/tmp # simpleperf record -e cs-etm --duration 3 -a

# To reduce file size and the time converting to AutoFDO input files, we recommend converting ETM
# data into an intermediate branch-list format.
redfin:/data/local/tmp # simpleperf inject --output branch-list -o branch_list.data
```

Converting ETM data to AutoFDO input files needs to read the binaries. So ETM data for userspace
libraries can be converted on device, while ETM data for the kernel needs to be converted on the
host, with vmlinux and kernel modules available.

Convert ETM data for userspace libraries:

```sh
# Inject ETM data on device. It writes output to perf_inject.data.
# perf_inject.data is a text file, containing branch counts for each library.
redfin:/data/local/tmp # simpleperf inject -i branch_list.data
```

Convert ETM data for the kernel:

```sh
# Pull ETM data to the host.
host $ adb pull /data/local/tmp/branch_list.data
# Download vmlinux and kernel modules to <binary_dir>.
# Host simpleperf is in <aosp-top>/system/extras/simpleperf/scripts/bin/linux/x86_64/simpleperf,
# or you can build simpleperf by `mmma system/extras/simpleperf`.
host $ simpleperf inject --symdir <binary_dir> -i branch_list.data
```
The generated perf_inject.data may contain branch info for multiple binaries. But AutoFDO only
accepts one binary at a time. So we need to split perf_inject.data.
The format of perf_inject.data is below:

```
executed range with count info for binary1
branch with count info for binary1
// name for binary1

executed range with count info for binary2
branch with count info for binary2
// name for binary2

...
```

We need to split perf_inject.data, and make sure each file only contains info for one binary.

Then we can use [AutoFDO](https://github.com/google/autofdo) to create a profile. AutoFDO only
works for binaries having an executable segment as their first loadable segment. But binaries
built in Android may not follow this rule. The simpleperf inject command knows how to work around
this problem. But there is a check in AutoFDO forcing binaries to start with an executable
segment. We need to disable the check in AutoFDO, by commenting out L127-L136 in
https://github.com/google/autofdo/commit/188db2834ce74762ed17108ca344916994640708#diff-2d132ecbb5e4f13e0da65419f6d1759dd27d6b696786dd7096c0c34d499b1710R127-R136.
Then we can use `create_llvm_prof` in AutoFDO to create profiles used by clang.

```sh
# perf_inject_binary1.data is split from perf_inject.data, and only contains branch info for
# binary1.
host $ autofdo/create_llvm_prof -profile perf_inject_binary1.data -profiler text -binary path_of_binary1 -out a.prof -format binary

# perf_inject_kernel.data is split from perf_inject.data, and only contains branch info for
# [kernel.kallsyms].
host $ autofdo/create_llvm_prof -profile perf_inject_kernel.data -profiler text -binary vmlinux -out a.prof -format binary
```

Then we can use a.prof for PGO during compilation, via `-fprofile-sample-use=a.prof`.
[Here](https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers) are more details.
### A complete example: etm_test_loop.cpp

`etm_test_loop.cpp` is an example showing the complete process.
The source code is in [etm_test_loop.cpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/etm_test_loop.cpp).
The build script is in [Android.bp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/Android.bp).
It builds an executable called `etm_test_loop`, which runs on device.

Step 1: Build the `etm_test_loop` binary.

```sh
(host) <AOSP>$ . build/envsetup.sh
(host) <AOSP>$ lunch aosp_arm64-trunk_staging-userdebug
(host) <AOSP>$ make etm_test_loop
```
Step 2: Run `etm_test_loop` on device, and collect ETM data for its run.

```sh
(host) <AOSP>$ adb push out/target/product/generic_arm64/system/bin/etm_test_loop /data/local/tmp
(host) <AOSP>$ adb root
(host) <AOSP>$ adb shell
(device) / # cd /data/local/tmp
(device) /data/local/tmp # chmod a+x etm_test_loop
(device) /data/local/tmp # simpleperf record -e cs-etm:u ./etm_test_loop
simpleperf I cmd_record.cpp:729] Recorded for 0.0370068 seconds. Start post processing.
simpleperf I cmd_record.cpp:799] Aux data traced: 1689136
(device) /data/local/tmp # simpleperf inject -i perf.data --output branch-list -o branch_list.data
simpleperf W dso.cpp:557] failed to read min virtual address of [vdso]: File not found
(device) /data/local/tmp # exit
(host) <AOSP>$ adb pull /data/local/tmp/branch_list.data
```
Step 3: Convert ETM data to AutoFDO data.

```sh
# Build the simpleperf tool on the host.
(host) <AOSP>$ make simpleperf_ndk
(host) <AOSP>$ simpleperf_ndk64 inject -i branch_list.data -o perf_inject_etm_test_loop.data --symdir out/target/product/generic_arm64/symbols/system/bin
simpleperf W cmd_inject.cpp:505] failed to build instr ranges for binary [vdso]: File not found
(host) <AOSP>$ cat perf_inject_etm_test_loop.data
13
1000-1010:1
1014-1050:1
...
112c->0:1
// /data/local/tmp/etm_test_loop

(host) <AOSP>$ create_llvm_prof -profile perf_inject_etm_test_loop.data -profiler text -binary out/target/product/generic_arm64/symbols/system/bin/etm_test_loop -out etm_test_loop.afdo -format binary
(host) <AOSP>$ ls -lh etm_test_loop.afdo
-rw-r--r-- 1 user group 241 Aug 29 16:04 etm_test_loop.afdo
```
Step 4: Use the AutoFDO data to build an optimized binary.

```sh
(host) <AOSP>$ mkdir toolchain/pgo-profiles/sampling/
(host) <AOSP>$ cp etm_test_loop.afdo toolchain/pgo-profiles/sampling/
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/Android.bp
# Edit Android.bp to add a fdo_profile module:
#
# soong_namespace {}
#
# fdo_profile {
#     name: "etm_test_loop_afdo",
#     profile: ["etm_test_loop.afdo"],
# }
```

`soong_namespace` is added to support fdo_profile modules with the same name.

In a product config mk file, update `PRODUCT_AFDO_PROFILES` with

```make
PRODUCT_AFDO_PROFILES += etm_test_loop://toolchain/pgo-profiles/sampling:etm_test_loop_afdo
```

```sh
(host) <AOSP>$ vi system/extras/simpleperf/runtest/Android.bp
# Edit Android.bp to enable afdo for etm_test_loop:
#
# cc_binary {
#     name: "etm_test_loop",
#     srcs: ["etm_test_loop.cpp"],
#     afdo: true,
# }
(host) <AOSP>$ make etm_test_loop
```

If we compare the disassembly of `out/target/product/generic_arm64/symbols/system/bin/etm_test_loop`
before and after optimizing with AutoFDO data, we can see different preferences when branching.
## Collect ETM data with a daemon

Android also has a daemon collecting ETM data periodically. It only runs on userdebug and eng
devices. The source code is in https://android.googlesource.com/platform/system/extras/+/main/profcollectd/.
## Support ETM in the kernel

To let simpleperf use the ETM function, we need to enable the Coresight driver in the kernel,
which lives in `<linux_kernel>/drivers/hwtracing/coresight`.

The Coresight driver can be enabled by the kernel configs below:

```config
CONFIG_CORESIGHT=y
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
CONFIG_CORESIGHT_SOURCE_ETM4X=y
```

On kernel 5.10+, we recommend building the Coresight driver as kernel modules, because that works
with the GKI kernel.

```config
CONFIG_CORESIGHT=m
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=m
CONFIG_CORESIGHT_SOURCE_ETM4X=m
```

Android common kernel 5.10+ should have all the Coresight patches needed to collect ETM data.
Android common kernel 5.4 misses two patches. But by adding the patches in
https://android-review.googlesource.com/q/topic:test_etm_on_hikey960_5.4, we can collect ETM data
on hikey960 with the 5.4 kernel.
For Android common kernel 4.14 and 4.19, we have backported all necessary Coresight patches.

Besides the Coresight driver, we also need to add Coresight devices in the device tree. An example
is in https://github.com/torvalds/linux/blob/master/arch/arm64/boot/dts/arm/juno-base.dtsi. There
should be a path flowing ETM data from the ETM device through funnels, ETF and replicators, all
the way to ETR, which writes ETM data to system memory.

One optional flag in the ETM device tree is "arm,coresight-loses-context-with-cpu". It saves ETM
registers when a CPU enters a low power state. It may be needed to avoid a
"coresight_disclaim_device_unlocked" warning when doing system wide collection.

One optional flag in the ETR device tree is "arm,scatter-gather". Simpleperf requests 4M of system
memory for ETR to store ETM data. Without an IOMMU, the memory needs to be contiguous. If the
kernel can't fulfill the request, simpleperf will report an out of memory error. Fortunately, we
can use the "arm,scatter-gather" flag to let ETR run in scatter gather mode, which uses
non-contiguous memory.
### A possible problem: trace_id mismatch

Each CPU has an ETM device, which has a unique trace_id assigned by the kernel.
The formula is: `trace_id = 0x10 + cpu * 2` (so, for example, cpu 2 gets trace_id 0x14), as in
https://github.com/torvalds/linux/blob/master/include/linux/coresight-pmu.h#L37.
If the formula is modified by local patches, then the simpleperf inject command can't parse ETM
data properly and is likely to give empty output.
## Enable ETM in the bootloader

Unless the ARMv8.4 Self-hosted Trace extension is implemented, ETM is considered an external debug
interface. It may be disabled by fuse (like JTAG). So we need to check whether ETM is disabled,
and whether the bootloader provides a way to reenable it.

We can tell if ETM is disabled by checking its TRCAUTHSTATUS register, which is exposed in sysfs,
like /sys/bus/coresight/devices/coresight-etm0/mgmt/trcauthstatus. To reenable ETM, we need to
enable non-Secure non-invasive debug on the ARM CPU. The method depends on the chip vendor (SoC).
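For example, reading the register for cpu0 (a sketch; the sysfs path may vary by device):

```sh
# Check the ETM authentication status; consult the ETM architecture specification
# for the meaning of the individual bits.
$ adb shell cat /sys/bus/coresight/devices/coresight-etm0/mgmt/trcauthstatus
```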
## Related docs

* [Arm Architecture Reference Manual Armv8, D3 AArch64 Self-hosted Trace](https://developer.arm.com/documentation/ddi0487/latest)
* [ARM ETM Architecture Specification](https://developer.arm.com/documentation/ihi0064/latest/)
* [ARM CoreSight Architecture Specification](https://developer.arm.com/documentation/ihi0029/latest)
* [CoreSight Components Technical Reference Manual](https://developer.arm.com/documentation/ddi0314/h/)
* [CoreSight Trace Memory Controller Technical Reference Manual](https://developer.arm.com/documentation/ddi0461/b/)
* [OpenCSD library for decoding ETM data](https://github.com/Linaro/OpenCSD)
* [AutoFDO tool for converting profile data](https://github.com/google/autofdo)
# Debug dwarf unwinding

Dwarf unwinding is the default way of getting call graphs in simpleperf. In this process,
simpleperf asks the kernel to add stack and register data to each sample. Then it uses
[libunwindstack](https://cs.android.com/android/platform/superproject/+/main:system/unwinding/libunwindstack/)
to unwind the call stack. libunwindstack uses dwarf sections (like .debug_frame or .eh_frame) in
elf files to know how to unwind the stack.

By default, `simpleperf record` unwinds a sample before saving it to disk, to reduce the space
consumed by stack data. But this behavior makes it harder to reproduce unwinding problems. So we
added the debug-unwind command, to help debug and profile dwarf unwinding. Below are two use
cases.

[TOC]
## Debug failed unwinding cases

Unwinding a sample can fail for different reasons: not enough stack or register data, unknown
thread maps, no dwarf info, bugs in code, etc. To fix them, we need to get error details and be
able to reproduce them. The simpleperf record cmd has two options for this:
`--keep-failed-unwinding-result` keeps the error code for failed unwinding samples. It's
lightweight and gives us a brief idea of why unwinding stops.
`--keep-failed-unwinding-debug-info` keeps stack and register data for failed unwinding samples.
It can be used to reproduce the unwinding process given the proper elf files. Below is an example.

```sh
# Run the record cmd and keep failed unwinding debug info.
$ simpleperf64 record --app com.example.android.displayingbitmaps -g --duration 10 \
    --keep-failed-unwinding-debug-info
...
simpleperf I cmd_record.cpp:762] Samples recorded: 22026. Samples lost: 0.

# Generate a text report containing failed unwinding cases.
$ simpleperf debug-unwind --generate-report -o report.txt

# Pull report.txt to the host and show it using debug_unwind_reporter.py.
# Show a summary.
$ debug_unwind_reporter.py -i report.txt --summary
# Show a summary of samples failed at a symbol.
$ debug_unwind_reporter.py -i report.txt --summary --include-end-symbol SocketInputStream_socketRead0
# Show details of samples failed at a symbol.
$ debug_unwind_reporter.py -i report.txt --include-end-symbol SocketInputStream_socketRead0

# Reproduce unwinding a failed case.
$ simpleperf debug-unwind --unwind-sample --sample-time 256666343213301

# Generate a test file containing a failed case and the elf files for debugging it.
$ simpleperf debug-unwind --generate-test-file --sample-time 256666343213301 --keep-binaries-in-test-file \
    /apex/com.android.runtime/lib64/bionic/libc.so,/apex/com.android.art/lib64/libopenjdk.so -o test.data
```
## Profile unwinding process

We can also record samples without unwinding them. Then we can use the debug-unwind cmd to unwind
the samples after recording. Below is an example.

```sh
# Record samples without unwinding them.
$ simpleperf record --app com.example.android.displayingbitmaps -g --duration 10 \
    --no-unwind
...
simpleperf I cmd_record.cpp:762] Samples recorded: 9923. Samples lost: 0.

# Use the debug-unwind cmd to unwind samples.
$ simpleperf debug-unwind --unwind-sample
```

We can profile the unwinding process itself, to find hot functions to improve.

```sh
# Profile the debug-unwind cmd.
$ simpleperf record -g -o perf_unwind.data simpleperf debug-unwind --unwind-sample --skip-sample-print

# Then pull perf_unwind.data and report it.
$ report_html.py -i perf_unwind.data

# We can also add source code annotation in report.html.
$ binary_cache_builder.py -i perf_unwind.data -lib <path to aosp-main>/out/target/product/<device-name>/symbols/system
$ report_html.py -i perf_unwind.data --add_source_code --source_dirs <path to aosp-main>/system/
```

# Executable commands reference

[TOC]

## How simpleperf works

Modern CPUs have a hardware component called the performance monitoring unit (PMU). The PMU has
several hardware counters, counting events like how many cpu cycles have happened, how many
instructions have executed, or how many cache misses have happened.

The Linux kernel wraps these hardware counters into hardware perf events. In addition, the Linux
kernel also provides hardware independent software events and tracepoint events. The Linux kernel
exposes all events to userspace via the perf_event_open system call, which is used by simpleperf.

Simpleperf has three main commands: stat, record and report.

The stat command gives a summary of how many events have happened in the profiled processes in a
time period. Here’s how it works:
1. Given user options, simpleperf enables profiling by making a system call to the kernel.
2. The kernel enables counters while the profiled processes are running.
3. After profiling, simpleperf reads counters from the kernel, and reports a counter summary.

The record command records samples of the profiled processes in a time period. Here’s how it works:
1. Given user options, simpleperf enables profiling by making a system call to the kernel.
2. Simpleperf creates mapped buffers between simpleperf and the kernel.
3. The kernel enables counters while the profiled processes are running.
4. Each time a given number of events happen, the kernel dumps a sample to the mapped buffers.
5. Simpleperf reads samples from the mapped buffers and stores profiling data in a file called
   perf.data.

The report command reads perf.data and any shared libraries used by the profiled processes,
and outputs a report showing where the time was spent.
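
As a minimal end-to-end sketch of how the three commands fit together (the pid and durations here
are placeholders):

```sh
# Count default events in process 7394 for 10 seconds.
$ simpleperf stat -p 7394 --duration 10

# Sample process 7394 for 10 seconds, writing samples to perf.data.
$ simpleperf record -p 7394 --duration 10

# Report where the sampled time was spent.
$ simpleperf report
```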

## Commands

Simpleperf supports several commands, listed below:

```
The debug-unwind command: debug/test dwarf based offline unwinding, used for debugging simpleperf.
The dump command: dumps content in perf.data, used for debugging simpleperf.
The help command: prints help information for other commands.
The kmem command: collects kernel memory allocation information (will be replaced by Python scripts).
The list command: lists all event types supported on the Android device.
The record command: profiles processes and stores profiling data in perf.data.
The report command: reports profiling data in perf.data.
The report-sample command: reports each sample in perf.data, used for supporting integration of
                           simpleperf in Android Studio.
The stat command: profiles processes and prints a counter summary.
```

Each command supports different options, which can be seen in its help message.

```sh
# List all commands.
$ simpleperf --help

# Print the help message for the record command.
$ simpleperf record --help
```

Below are the most frequently used commands: list, stat, record and report.

## The list command

The list command lists all events available on the device. Different devices may support different
events because they have different hardware and kernels.

```sh
$ simpleperf list
List of hw-cache events:
  branch-loads
  ...
List of hardware events:
  cpu-cycles
  instructions
  ...
List of software events:
  cpu-clock
  task-clock
  ...
```

On ARM/ARM64, the list command also shows a list of raw events; they are the events supported by
the ARM PMU on the device. The kernel has wrapped part of them into hardware events and hw-cache
events. For example, raw-cpu-cycles is wrapped into cpu-cycles, and raw-instruction-retired is
wrapped into instructions. The raw events are provided in case we want to use an event supported
on the device that is unfortunately not wrapped by the kernel.
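
For example (a sketch; raw event names differ between devices, so check the output of
`simpleperf list` on your own device first), a raw event can be passed to -e just like a wrapped
event:

```sh
# Count the raw event behind cpu-cycles directly.
$ simpleperf stat -e raw-cpu-cycles -p 11904 --duration 10
```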

## The stat command

The stat command is used to get event counter values of the profiled processes. By passing options,
we can select which events to use, which processes/threads to monitor, how long to monitor and the
print interval.

```sh
# Stat using default events (cpu-cycles,instructions,...), and monitor process 7394 for 10 seconds.
$ simpleperf stat -p 7394 --duration 10
Performance counter statistics:

#           count  event_name                # count / runtime
       16,513,564  cpu-cycles                # 1.612904 GHz
        4,564,133  stalled-cycles-frontend   # 341.490 M/sec
        6,520,383  stalled-cycles-backend    # 591.666 M/sec
        4,900,403  instructions              # 612.859 M/sec
           47,821  branch-misses             # 6.085 M/sec
     25.274251(ms) task-clock                # 0.002520 cpus used
                4  context-switches          # 158.264 /sec
              466  page-faults               # 18.438 K/sec

Total test time: 10.027923 seconds.
```

### Select events to stat

We can select which events to use via -e.

```sh
# Stat the event cpu-cycles.
$ simpleperf stat -e cpu-cycles -p 11904 --duration 10

# Stat the events cache-references and cache-misses.
$ simpleperf stat -e cache-references,cache-misses -p 11904 --duration 10
```

When running the stat command, if the number of hardware events is larger than the number of
hardware counters available in the PMU, the kernel shares hardware counters between events, so each
event is only monitored for part of the total time. As a result, the number of events shown is
smaller than the number of events that actually happened. The following is an example.

```sh
# Stat using events cache-references, cache-references:u,....
$ simpleperf stat -p 7394 -e cache-references,cache-references:u,cache-references:k \
      -e cache-misses,cache-misses:u,cache-misses:k,instructions --duration 1
Performance counter statistics:

#         count  event_name           # count / runtime
        490,713  cache-references     # 151.682 M/sec
        899,652  cache-references:u   # 130.152 M/sec
        855,218  cache-references:k   # 111.356 M/sec
         61,602  cache-misses         # 7.710 M/sec
         33,282  cache-misses:u       # 5.050 M/sec
         11,662  cache-misses:k       # 4.478 M/sec
              0  instructions         #

Total test time: 1.000867 seconds.
simpleperf W cmd_stat.cpp:946] It seems the number of hardware events are more than the number of
  available CPU PMU hardware counters. That will trigger hardware counter
  multiplexing. As a result, events are not counted all the time processes
  running, and event counts are smaller than what really happens.
  Use --print-hw-counter to show available hardware counters.
```

In the example above, we monitor 7 events, and each event is only monitored for part of the total
time. We can tell because the count of cache-references is smaller than the counts of
cache-references:u (cache-references in userspace only) and cache-references:k (cache-references
in kernel only), and the count of instructions is zero. After printing the result, simpleperf
checks whether the CPUs have enough hardware counters to count the hardware events at the same
time. If not, it prints a warning.

To avoid hardware counter multiplexing, we can use `simpleperf stat --print-hw-counter` to show
the available counters on each CPU, and then avoid monitoring more hardware events than the
counters available.

```sh
$ simpleperf stat --print-hw-counter
There are 2 CPU PMU hardware counters available on cpu 0.
There are 2 CPU PMU hardware counters available on cpu 1.
There are 2 CPU PMU hardware counters available on cpu 2.
There are 2 CPU PMU hardware counters available on cpu 3.
There are 2 CPU PMU hardware counters available on cpu 4.
There are 2 CPU PMU hardware counters available on cpu 5.
There are 2 CPU PMU hardware counters available on cpu 6.
There are 2 CPU PMU hardware counters available on cpu 7.
```

When counter multiplexing happens, there is no guarantee of which events will be monitored at
which time. If we want to ensure that some events are always monitored at the same time, we can
use `--group`.

```sh
# Stat using events cache-references, cache-references:u,... in groups.
$ simpleperf stat -p 7964 --group cache-references,cache-misses \
      --group cache-references:u,cache-misses:u --group cache-references:k,cache-misses:k \
      --duration 1
Performance counter statistics:

#         count  event_name           # count / runtime
      2,088,463  cache-references     # 181.360 M/sec
         47,871  cache-misses         # 2.292164% miss rate
      1,277,600  cache-references:u   # 136.419 M/sec
         25,977  cache-misses:u       # 2.033265% miss rate
        326,305  cache-references:k   # 74.724 M/sec
         13,596  cache-misses:k       # 4.166654% miss rate

Total test time: 1.029729 seconds.
simpleperf W cmd_stat.cpp:946] It seems the number of hardware events are more than the number of
...
```

### Select target to stat

We can select which processes or threads to monitor via -p or -t. Monitoring a
process is the same as monitoring all threads in the process. Simpleperf can also fork a child
process to run a new command and then monitor the child process.

```sh
# Stat processes 11904 and 11905.
$ simpleperf stat -p 11904,11905 --duration 10

# Stat processes with names containing "chrome".
$ simpleperf stat -p chrome --duration 10
# Stat processes with names partially matching the regex "chrome:(privileged|sandboxed)".
$ simpleperf stat -p "chrome:(privileged|sandboxed)" --duration 10

# Stat threads 11904 and 11905.
$ simpleperf stat -t 11904,11905 --duration 10

# Start a child process running `ls`, and stat it.
$ simpleperf stat ls

# Stat the process of an Android application. On non-root devices, this only works for apps that
# are debuggable or profileable from shell.
$ simpleperf stat --app simpleperf.example.cpp --duration 10

# Stat only the selected thread 11904 in an app.
$ simpleperf stat --app simpleperf.example.cpp -t 11904 --duration 10

# Stat system wide using -a.
$ simpleperf stat -a --duration 10
```

### Decide how long to stat

When monitoring existing threads, we can use --duration to decide how long to monitor. When
monitoring a child process running a new command, simpleperf monitors until the child process ends.
In this case, we can use Ctrl-C to stop monitoring at any time.

```sh
# Stat process 11904 for 10 seconds.
$ simpleperf stat -p 11904 --duration 10

# Stat until the child process running `ls` finishes.
$ simpleperf stat ls

# Stop monitoring using Ctrl-C.
$ simpleperf stat -p 11904 --duration 10
^C
```

If you want to write a script to control how long to monitor, you can send one of the SIGINT,
SIGTERM and SIGHUP signals to simpleperf to stop monitoring.
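
A minimal sketch of that approach in a shell script (the pid is a placeholder):

```sh
# Start stat in the background, then stop it with SIGINT after 5 seconds.
$ simpleperf stat -p 11904 --duration 100 &
$ sleep 5
$ kill -INT $!
```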

### Decide the print interval

When monitoring perf counters, we can also use --interval to decide the print interval.

```sh
# Print stat for process 11904 every 300ms.
$ simpleperf stat -p 11904 --duration 10 --interval 300

# Print system wide stat at an interval of 300ms for 10 seconds. Note that system wide profiling
# needs root privilege.
$ su 0 simpleperf stat -a --duration 10 --interval 300
```

### Display counters in systrace

Simpleperf can also work with systrace to dump counters in the collected trace. Below is an example
of doing a system wide stat.

```sh
# Capture instructions (kernel only) and cache misses with an interval of 300 milliseconds for 15
# seconds.
$ su 0 simpleperf stat -e instructions:k,cache-misses -a --interval 300 --duration 15
# On the host, launch systrace to collect a trace for 10 seconds.
(HOST)$ external/chromium-trace/systrace.py --time=10 -o new.html sched gfx view
# Open the collected new.html in a browser, and the perf counters will be shown.
```

### Show event count per thread

By default, the stat command outputs an event count sum for all monitored targets. But when the
`--per-thread` option is used, the stat command outputs an event count for each thread in the
monitored targets. It can be used to find busy threads in a process or system wide. With the
`--per-thread` option, the stat command opens a perf_event_file for each existing thread. If a
monitored thread creates new threads, event counts for the new threads are added to the monitored
thread by default, or omitted if the `--no-inherit` option is also used.

```sh
# Print event counts for each thread in process 11904. Event counts for threads created after
# the stat command starts will be added to the threads creating them.
$ simpleperf stat --per-thread -p 11904 --duration 1

# Print event counts every 1s for all threads running in the system. Threads not running will not
# be reported.
$ su 0 simpleperf stat --per-thread -a --interval 1000 --interval-only-values

# Print event counts every 1s for all threads running in the system. Event counts for threads
# created after the stat command starts will be omitted.
$ su 0 simpleperf stat --per-thread -a --interval 1000 --interval-only-values --no-inherit
```

### Show event count per core

By default, the stat command outputs an event count sum for all monitored cpu cores. But when the
`--per-core` option is used, the stat command outputs an event count for each core. It can be used
to see how events are distributed on different cores.

When stating non-system wide with the `--per-core` option, simpleperf creates a perf event for
each monitored thread on each core. When a thread is in running state, perf events on all cores
are enabled, but only the perf event on the core running the thread is in running state. So the
percentage comment shows runtime_on_a_core / runtime_on_all_cores. Note that the percentage is
still affected by hardware counter multiplexing. Check the simpleperf log output for ways to
distinguish it.

```sh
# Print event counts for each cpu running threads in process 1057.
# A percentage shows runtime_on_a_cpu / runtime_on_all_cpus.
$ simpleperf stat -e cpu-cycles --per-core -p 1057 --duration 3
Performance counter statistics:

# cpu          count  event_name   # count / runtime
  0        1,667,660  cpu-cycles   # 1.571565 GHz
  1        3,850,440  cpu-cycles   # 1.736958 GHz
  2        2,463,792  cpu-cycles   # 1.701367 GHz
  3        2,350,528  cpu-cycles   # 1.700841 GHz
  5        7,919,520  cpu-cycles   # 2.377081 GHz
  6      105,622,673  cpu-cycles   # 2.381331 GHz

Total test time: 3.002703 seconds.

# Print event counts for each cpu system wide.
$ su 0 simpleperf stat --per-core -a --duration 1

# Print cpu-cycles event counts for each cpu for each thread running in the system.
$ su 0 simpleperf stat -e cpu-cycles -a --per-thread --per-core --duration 1
```

### Monitor different events on different cores

Android devices usually have big and little cores. Different cores may support different events.
Therefore, we may want to monitor different events on different cores. We can do this using
the `--cpu` option. The `--cpu` option selects the cores on which to monitor events. A `--cpu`
option affects all following events until another `--cpu` option appears. The first `--cpu`
option also affects all events before it. Following are some examples:

```sh
# By default, cpu-cycles and instructions are monitored on all cpus.
$ su 0 simpleperf stat -e cpu-cycles,instructions -a --duration 1 --per-core

# Use one `--cpu` option to monitor cpu-cycles and instructions only on cpu 0-3,8.
$ su 0 simpleperf stat -e cpu-cycles --cpu 0-3,8 -e instructions -a --duration 1 --per-core

# Use two `--cpu` options to monitor raw-l3d-cache-refill-rd on cpu 0-3, and raw-l3d-cache-refill
# on cpu 4-8.
$ su 0 simpleperf stat --cpu 0-3 -e raw-l3d-cache-refill-rd --cpu 4-8 -e raw-l3d-cache-refill \
      -a --duration 1 --per-core
```

## The record command

The record command is used to dump samples of the profiled processes. Each sample can contain
information like the time at which the sample was generated, the number of events since the last
sample, the program counter of a thread, and the call chain of a thread.

By passing options, we can select which events to use, which processes/threads to monitor,
what frequency to dump samples at, how long to monitor, and where to store samples.

```sh
# Record process 7394 for 10 seconds, using the default event (cpu-cycles) and the default sample
# frequency (4000 samples per second), writing records to perf.data.
$ simpleperf record -p 7394 --duration 10
simpleperf I cmd_record.cpp:316] Samples recorded: 21430. Samples lost: 0.
```

### Select events to record

By default, the cpu-cycles event is used to evaluate consumed cpu cycles. But we can also use other
events via -e.

```sh
# Record using the instructions event.
$ simpleperf record -e instructions -p 11904 --duration 10

# Record using task-clock, which shows the elapsed CPU time in nanoseconds.
$ simpleperf record -e task-clock -p 11904 --duration 10
```

### Select target to record

The way to select the target in the record command is similar to that in the stat command.

```sh
# Record processes 11904 and 11905.
$ simpleperf record -p 11904,11905 --duration 10

# Record processes with names containing "chrome".
$ simpleperf record -p chrome --duration 10
# Record processes with names partially matching the regex "chrome:(privileged|sandboxed)".
$ simpleperf record -p "chrome:(privileged|sandboxed)" --duration 10

# Record threads 11904 and 11905.
$ simpleperf record -t 11904,11905 --duration 10

# Record a child process running `ls`.
$ simpleperf record ls

# Record the process of an Android application. On non-root devices, this only works for apps that
# are debuggable or profileable from shell.
$ simpleperf record --app simpleperf.example.cpp --duration 10

# Record only the selected thread 11904 in an app.
$ simpleperf record --app simpleperf.example.cpp -t 11904 --duration 10

# Record system wide.
$ simpleperf record -a --duration 10
```

### Set the frequency to record

We can set the frequency at which to dump records via -f or -c. For example, -f 4000 means
dumping approximately 4000 records every second while the monitored thread runs. If a monitored
thread runs 0.2s in one second (it can be preempted or blocked at other times), simpleperf dumps
about 4000 * 0.2 / 1.0 = 800 records every second. Another way is using -c. For example, -c 10000
means dumping one record whenever 10000 events happen.

```sh
# Record with sample frequency 1000: sample 1000 times every second of running time.
$ simpleperf record -f 1000 -p 11904,11905 --duration 10

# Record with sample period 100000: dump one sample every 100000 events.
$ simpleperf record -c 100000 -t 11904,11905 --duration 10
```

To avoid spending too much time generating samples, kernels >= 3.10 set a max percent of cpu time
used for generating samples (25% by default), and decrease the max allowed sample frequency when
hitting that limit. Simpleperf uses the --cpu-percent option to adjust it, but that needs either
root privilege or Android >= Q.

```sh
# Record with sample frequency 1000, with the max allowed cpu percent set to 50%.
$ simpleperf record -f 1000 -p 11904,11905 --duration 10 --cpu-percent 50
```

### Decide how long to record

The way to decide how long to monitor in the record command is similar to that in the stat command.

```sh
# Record process 11904 for 10 seconds.
$ simpleperf record -p 11904 --duration 10

# Record until the child process running `ls` finishes.
$ simpleperf record ls

# Stop monitoring using Ctrl-C.
$ simpleperf record -p 11904 --duration 10
^C
```

If you want to write a script to control how long to monitor, you can send one of the SIGINT,
SIGTERM and SIGHUP signals to simpleperf to stop monitoring, as shown for the stat command above.

### Set the path to store profiling data

By default, simpleperf stores profiling data in perf.data in the current directory. The path
can be changed using -o.

```sh
# Write records to data/perf2.data.
$ simpleperf record -p 11904 -o data/perf2.data --duration 10
```

#### Record call graphs

A call graph is a tree showing function call relations. Below is an example.

```
main() {
    FunctionOne();
    FunctionTwo();
}
FunctionOne() {
    FunctionTwo();
    FunctionThree();
}
a call graph:
    main-> FunctionOne
       |    |
       |    |-> FunctionTwo
       |    |-> FunctionThree
       |
       |-> FunctionTwo
```

A call graph shows how a function calls other functions, and a reversed call graph shows how
a function is called by other functions. To show a call graph, we need to first record it, then
report it.

There are two ways to record a call graph: recording a dwarf based call graph, or recording a
stack frame based call graph. Recording dwarf based call graphs needs debug information in native
binaries, while recording stack frame based call graphs needs stack frame registers.

```sh
# Record a dwarf based call graph.
$ simpleperf record -p 11904 -g --duration 10

# Record a stack frame based call graph.
$ simpleperf record -p 11904 --call-graph fp --duration 10
```

[Here](README.md#suggestions-about-recording-call-graphs) are some suggestions about recording call graphs.

### Record both on CPU time and off CPU time

Simpleperf is a CPU profiler: it generates samples for a thread only when the thread is running on
a CPU. But sometimes we want to know where the thread time is spent off-cpu (like being preempted
by other threads, blocked in IO, or waiting for some events). To support this, simpleperf added a
--trace-offcpu option to the record command. When --trace-offcpu is used, simpleperf does the
following things:

1) Only the cpu-clock/task-clock event is allowed to be used with --trace-offcpu. This lets
simpleperf generate on-cpu samples for the cpu-clock event.
2) Simpleperf also monitors the sched:sched_switch event, which generates a sched_switch sample
each time the monitored thread is scheduled off a cpu.
3) Simpleperf also records context switch records, so it knows when the thread is scheduled back
on a cpu.

The samples and context switch records collected by simpleperf for a thread are shown below:



Here we have two types of samples:
1) on-cpu samples generated for the cpu-clock event. The period value in each sample means how many
nanoseconds are spent on cpu (for the callchain of this sample).
2) off-cpu (sched_switch) samples generated for the sched:sched_switch event. The period value is
calculated as **Timestamp of the next switch on record** minus **Timestamp of the current sample**
by simpleperf. So the period value in each sample means how many nanoseconds are spent off cpu
(for the callchain of this sample).

**note**: In reality, switch on records and samples may be lost. To mitigate the loss of accuracy,
we calculate the period of an off-cpu sample as **Timestamp of the next switch on record or
sample** minus **Timestamp of the current sample**.

When reporting via Python scripts, simpleperf_report_lib.py provides the SetTraceOffCpuMode()
method to control how to report the samples:
1) on-cpu mode: only report on-cpu samples.
2) off-cpu mode: only report off-cpu samples.
3) on-off-cpu mode: report both on-cpu and off-cpu samples, which can be split by event name.
4) mixed-on-off-cpu mode: report on-cpu and off-cpu samples under the same event name.

If not set, the mixed-on-off-cpu mode is used for reporting.

When using report_html.py, inferno and report_sample.py, the report mode can be set by the
--trace-offcpu option.

Below are some examples of recording and reporting trace offcpu profiles.

```sh
# Check if --trace-offcpu is supported by the kernel (should be available on kernel >= 4.2).
$ simpleperf list --show-features
trace-offcpu
...

# Record with --trace-offcpu.
$ simpleperf record -g -p 11904 --duration 10 --trace-offcpu -e cpu-clock

# Record system wide with --trace-offcpu.
$ simpleperf record -a -g --duration 3 --trace-offcpu -e cpu-clock

# Record with --trace-offcpu using app_profiler.py.
$ ./app_profiler.py -p com.google.samples.apps.sunflower \
    -r "-g -e cpu-clock:u --duration 10 --trace-offcpu"

# Report on-cpu samples.
$ ./report_html.py --trace-offcpu on-cpu
# Report off-cpu samples.
$ ./report_html.py --trace-offcpu off-cpu
# Report on-cpu and off-cpu samples under different event names.
$ ./report_html.py --trace-offcpu on-off-cpu
# Report on-cpu and off-cpu samples under the same event name.
$ ./report_html.py --trace-offcpu mixed-on-off-cpu
```

## The report command

The report command is used to report profiling data generated by the record command. The report
contains a table of sample entries. Each sample entry is a row in the report. The report command
groups samples belonging to the same process, thread, library and function into the same sample
entry, then sorts the sample entries based on the event count each sample entry has.

By passing options, we can decide how to filter out uninteresting samples, how to group samples
into sample entries, and where to find profiling data and binaries.

Below is an example. Records are grouped into 4 sample entries; each entry is a row. There are
several columns; each column shows a piece of information belonging to a sample entry. The first
column is Overhead, which shows the percentage of events inside the current sample entry out of
total events. As the perf event is cpu-cycles, the overhead is the percentage of CPU cycles used
in each function.

```sh
# Report perf.data, using only records sampled in libsudo-game-jni.so, grouping records using
# thread name(comm), process id(pid), thread id(tid) and function name(symbol), and showing the
# sample count for each row.
$ simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so \
      --sort comm,pid,tid,symbol -n
Cmdline: /data/data/com.example.sudogame/simpleperf record -p 7394 --duration 10
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 28235
Event count: 546356211

Overhead  Sample  Command   Pid   Tid   Symbol
59.25%    16680   sudogame  7394  7394  checkValid(Board const&, int, int)
20.42%    5620    sudogame  7394  7394  canFindSolution_r(Board&, int, int)
13.82%    4088    sudogame  7394  7394  randomBlock_r(Board&, int, int, int, int, int)
6.24%     1756    sudogame  7394  7394  @plt
```

### Set the path to read profiling data

By default, the report command reads profiling data from perf.data in the current directory.
The path can be changed using -i.

```sh
$ simpleperf report -i data/perf2.data
```

### Set the path to find binaries

To report function symbols, simpleperf needs to read the executable binaries used by the monitored
processes to get symbol tables and debug information. By default, the paths are those of the
executable binaries used by the monitored processes while recording. However, these binaries may
not exist when reporting, or may not contain symbol tables and debug information. So we can use
--symfs to redirect the paths.

```sh
# In this case, when simpleperf wants to read the executable binary /A/b, it reads the file /A/b.
$ simpleperf report

# In this case, when simpleperf wants to read the executable binary /A/b, it prefers the file
# /debug_dir/A/b to the file /A/b.
$ simpleperf report --symfs /debug_dir

# Read symbols for system libraries built locally. Note that this is not needed since Android O,
# which ships symbols for system libraries on device.
$ simpleperf report --symfs $ANDROID_PRODUCT_OUT/symbols
```

### Filter samples

When reporting, not all records are of interest. The report command supports four filters to
select samples of interest.

```sh
# Report records in threads having the name sudogame.
$ simpleperf report --comms sudogame

# Report records in process 7394 or 7395.
$ simpleperf report --pids 7394,7395

# Report records in thread 7394 or 7395.
$ simpleperf report --tids 7394,7395

# Report records in libsudo-game-jni.so.
$ simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so
```

### Group samples into sample entries

The report command uses --sort to decide how to group sample entries.

```sh
# Group records based on their process id: records having the same process id are in the same
# sample entry.
$ simpleperf report --sort pid

# Group records based on their thread id and thread comm: records having the same thread id and
# thread name are in the same sample entry.
$ simpleperf report --sort tid,comm

# Group records based on their binary and function: records in the same binary and function are in
# the same sample entry.
$ simpleperf report --sort dso,symbol

# Default option: --sort comm,pid,tid,dso,symbol. Group records in the same thread, and belonging
# to the same function in the same binary.
$ simpleperf report
```

#### Report call graphs

To report a call graph, please make sure the profiling data is recorded with call graphs,
as [here](#record-call-graphs).

```sh
$ simpleperf report -g
```
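
A reversed call graph (showing how a function is called by others) can typically be requested by
passing an order argument to -g; the accepted orders vary by simpleperf version, so check
`simpleperf report --help` first:

```sh
# Report a caller-based (reversed) call graph.
$ simpleperf report -g caller
```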

# Inferno



[TOC]

## Description

Inferno is a flamegraph generator for native (C/C++) Android apps. It was
originally written to profile and improve surfaceflinger performance
(the Android compositor), but it can be used for any native Android application.
You can see a sample report generated with Inferno
[here](./report.html). Reports are self-contained in HTML, so they can be
exchanged easily.

Notice that there is no concept of time in a flame graph, since all callstacks are
merged together. As a result, the width of a flamegraph represents 100% of
the number of samples, and the height is related to the number of functions on
the stack when sampling occurred.



In the flamegraph featured above, you can see the main thread of SurfaceFlinger.
It is immediately apparent that most of the CPU time is spent processing messages
in `android::SurfaceFlinger::onMessageReceived`. The most expensive task is asking
the screen to be refreshed, as `android::DisplayDevice::prepare` shows in orange.
This graphic division helps to see what part of the program is costly and
where a developer's effort to improve performance should go.

## Example of bottleneck

A flamegraph gives you an instant view of the CPU cycle cost centers, but
it can also be used to find specific offenders. To find them, look for
plateaus. It is easier to see in an example:



In the previous flamegraph, two
plateaus (due to `android::BufferQueueCore::validateConsistencyLocked`)
are immediately apparent.

## How it works

Inferno relies on simpleperf to record the callstack of a native application
thousands of times per second. Simpleperf takes care of unwinding the stack,
either using frame pointers (recommended) or dwarf. At the end of the recording,
`simpleperf` also symbolizes all IPs automatically. The records are aggregated
and dumped to a file, `perf.data`. This file is pulled from the Android device
and processed on the host by Inferno. The callstacks are merged together to
visualize in which parts of an app the CPU cycles are spent.

## How to use it

Open a terminal and, from the `simpleperf/scripts` directory, type:
```
./inferno.sh    (on Linux/Mac)
inferno.bat     (on Windows)
```

Inferno will collect data, process it and automatically open your web browser
to display the HTML report.

## Parameters

You can select how long to sample for, the color of the nodes and many other
things. Use `-h` to get a list of all supported parameters.

```
./inferno.sh -h
```

## Troubleshooting

### Messy flame graph

A healthy flame graph features a single call site at its base (see [here](./report.html)).
If you don't see a unique call site like `_start` or `_start_thread` at the base
from which all flames originate, something went wrong: stack unwinding may have
failed to reach the root callsite. These incomplete
callstacks are impossible to merge properly. By default, Inferno asks
`simpleperf` to unwind the stack via the kernel and frame pointers. Try to
perform unwinding with dwarf via `-du`; you can further tune this setting.


### No flames

If you see no flames at all, or a mess of 1-level flames without a common base,
this may be because you compiled without frame pointers. Make sure there is no
`-fomit-frame-pointer` in your build config. Alternatively, ask simpleperf to
collect data with dwarf unwinding via `-du`.



### High percentage of lost samples

If simpleperf reports a lot of lost samples, it is probably because you are
unwinding with `dwarf`. Dwarf unwinding involves copying the stack before it is
processed. Try to use frame pointer unwinding, which can be done by the kernel
and is much faster.

The cost of frame pointers is negligible on arm64 but considerable
on the 32-bit arm arch (due to register pressure). Use a 64-bit build for better
profiling.

### run-as: package not debuggable

If you cannot run as root, make sure the app is debuggable; otherwise simpleperf
will not be able to profile it.

# JIT symbols

[TOC]

## Java JIT symbols

On Android >= P, simpleperf supports profiling Java code, no matter whether it is executed by
the interpreter, JITed, or compiled into native instructions. So you don't need to do anything.

For details on Android O and N, see
[android_application_profiling.md](./android_application_profiling.md#prepare-an-android-application).

## Generic JIT symbols

Simpleperf supports picking up symbols from per-pid symbol map files, somewhat similar to what
the Linux kernel perf tool does. Applications should create those files at specific locations.

### Symbol map file location for application

An application should create symbol map files in its data directory.

For example, process `123` of application `foo.bar.baz` should create
`/data/data/foo.bar.baz/perf-123.map`.

### Symbol map file location for standalone program

Standalone programs should create symbol map files in `/data/local/tmp`.

For example, standalone program process `123` should create `/data/local/tmp/perf-123.map`.

### Symbol map file format

The symbol map file is a text file.

Every line describes a new symbol. The line format is:
```
<symbol-absolute-address> <symbol-size> <symbol-name>
```

For example:
```
0x10000000 0x16  jit_symbol_one
0x20000000 0x332 jit_symbol_two
0x20002004 0x8   jit_symbol_three
```

All characters after the symbol size and until the end of the line are parsed as the symbol name,
with leading and trailing spaces removed. This means spaces are allowed in symbol names themselves.
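
Tying the location and format together: for a hypothetical standalone JIT process with pid `123`
that emitted one function at address 0x10000000 with size 0x16, the map file would look like this
(values taken from the example above):

```
$ cat /data/local/tmp/perf-123.map
0x10000000 0x16 jit_symbol_one
```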

### Known issues

The current implementation gets confused if memory pages where JIT symbols reside are reused by
mapping a file either before or after.

For example, if memory pages were first used by `dlopen("libfoo.so")`, then freed by `dlclose`,
then allocated for JIT symbols - simpleperf will report symbols from `libfoo.so` instead.

# Sample Filter

Sometimes we want to report samples only for selected processes, threads, libraries, or time
ranges. To filter samples, we can pass filter options to the report commands or scripts.


## Filter file format

To filter samples based on time ranges, simpleperf accepts a filter file when reporting. The filter
file is in text format, containing a list of lines. Each line is a filter command. The filter file
can be generated by `sample_filter.py`, and passed to report scripts via `--filter-file`.

```
filter_command1 command_args
filter_command2 command_args
...
```

### clock command

```
CLOCK <clock_name>
```

Set the clock used to generate timestamps in the filter file. Supported clocks are: `monotonic`,
`realtime`. By default it is monotonic. The clock here should be the same as the clock used in
the profile data, which is set by `--clockid` in the simpleperf record command.
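
For instance, a small filter file (a sketch; the timestamps are placeholders) that switches to the
realtime clock before giving a global time range, using the commands described below:

```
CLOCK realtime
GLOBAL_BEGIN 1000
GLOBAL_END 2000
```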

### global time filter commands

```
GLOBAL_BEGIN <begin_timestamp>
GLOBAL_END <end_timestamp>
```

The nearest pair of GLOBAL_BEGIN and GLOBAL_END commands makes a time range. When these commands
are used, only samples in the time ranges are reported. Timestamps are 64-bit integers in
nanoseconds.

```
GLOBAL_BEGIN 1000
GLOBAL_END 2000
GLOBAL_BEGIN 3000
GLOBAL_END 4000
```

For the example above, samples in the time ranges [1000, 2000) and [3000, 4000) are reported.

### process time filter commands

```
PROCESS_BEGIN <pid> <begin_timestamp>
PROCESS_END <pid> <end_timestamp>
```

The nearest pair of PROCESS_BEGIN and PROCESS_END commands for the same process makes a time
range. When these commands are used, each process has a list of time ranges, and only samples
in the time ranges are reported.

```
PROCESS_BEGIN 1 1000
PROCESS_BEGIN 2 2000
PROCESS_END 1 3000
PROCESS_END 2 4000
```

For the example above, process 1 samples in the time range [1000, 3000) and process 2 samples in
the time range [2000, 4000) are reported.

### thread time filter commands

```
THREAD_BEGIN <tid> <begin_timestamp>
THREAD_END <tid> <end_timestamp>
```

The nearest pair of THREAD_BEGIN and THREAD_END commands for the same thread makes a time
range. When these commands are used, each thread has a list of time ranges, and only samples in
the time ranges are reported.

```
THREAD_BEGIN 1 1000
THREAD_BEGIN 2 2000
THREAD_END 1 3000
THREAD_END 2 4000
```

For the example above, thread 1 samples in the time range [1000, 3000) and thread 2 samples in the
time range [2000, 4000) are reported.

# Scripts reference

[TOC]

## Record a profile

### app_profiler.py

`app_profiler.py` is used to record profiling data for Android applications and native executables.

```sh
# Record an Android application.
$ ./app_profiler.py -p simpleperf.example.cpp

# Record an Android application with Java code compiled into native instructions.
$ ./app_profiler.py -p simpleperf.example.cpp --compile_java_code

# Record the launch of an Activity of an Android application.
$ ./app_profiler.py -p simpleperf.example.cpp -a .SleepActivity

# Record a native process.
$ ./app_profiler.py -np surfaceflinger

# Record a native process given its pid.
$ ./app_profiler.py --pid 11324

# Record a command.
$ ./app_profiler.py -cmd \
    "dex2oat --dex-file=/data/local/tmp/app-debug.apk --oat-file=/data/local/tmp/a.oat"

# Record an Android application, and use -r to send custom options to the record command.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e cpu-clock -g --duration 30"

# Record both on CPU time and off CPU time.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e task-clock -g -f 1000 --duration 10 --trace-offcpu"

# Save profiling data in a custom file (like perf_custom.data) instead of perf.data.
$ ./app_profiler.py -p simpleperf.example.cpp -o perf_custom.data
```

### Profile from launch of an application

Sometimes we want to profile the launch-time of an application. To support this, we added `--app`
in the record command. The `--app` option sets the package name of the Android application to
profile. If the app is not already running, the record command will poll for the app process in a
loop with an interval of 1ms. So to profile from the launch of an application, we can first start
the record command with `--app`, then start the app. Below is an example.

```sh
$ ./run_simpleperf_on_device.py record --app simpleperf.example.cpp \
    -g --duration 1 -o /data/local/tmp/perf.data
# Start the app manually or using the `am` command.
```

To make this convenient, `app_profiler.py` supports using the `-a` option to start an Activity
after recording has started.

```sh
$ ./app_profiler.py -p simpleperf.example.cpp -a .MainActivity
```

### api_profiler.py

`api_profiler.py` is used to control recording from application code. It does preparation work
before recording, and collects profiling data files after recording.

[Here](./android_application_profiling.md#control-recording-in-application-code) are the details.

### run_simpleperf_without_usb_connection.py

`run_simpleperf_without_usb_connection.py` records profiling data while the USB cable isn't
connected. `api_profiler.py` may be more suitable, as it also doesn't need a USB cable when
recording. Below is an example.

```sh
$ ./run_simpleperf_without_usb_connection.py start -p simpleperf.example.cpp
# After the command finishes successfully, unplug the USB cable and run the
# SimpleperfExampleCpp app. After a few seconds, plug in the USB cable.
$ ./run_simpleperf_without_usb_connection.py stop
# It may take a while to stop recording. After that, the profiling data is collected in perf.data
# on the host.
```

### binary_cache_builder.py

The `binary_cache` directory is a directory holding binaries needed by a profiling data file. The
binaries are expected to be unstripped, having debug information and symbol tables. The
`binary_cache` directory is used by report scripts to read symbols of binaries. It is also used by
`report_html.py` to generate annotated source code and disassembly.

By default, `app_profiler.py` builds the binary_cache directory after recording. But we can also
build `binary_cache` for existing profiling data files using `binary_cache_builder.py`. It is
useful when you record profiling data using `simpleperf record` directly, to do system wide
profiling or to record without the USB cable connected.

`binary_cache_builder.py` can either pull binaries from an Android device, or find binaries in
directories on the host (via `-lib`).

```sh
# Generate binary_cache for perf.data, by pulling binaries from the device.
$ ./binary_cache_builder.py

# Generate binary_cache, by pulling binaries from the device and finding binaries in
# SimpleperfExampleCpp.
$ ./binary_cache_builder.py -lib path_of_SimpleperfExampleCpp
```

### run_simpleperf_on_device.py

This script pushes the `simpleperf` executable to the device, and runs a simpleperf command on the
device. It is more convenient than running adb commands manually.
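
For example (mirroring the record example earlier in this document; the arguments after the
subcommand are passed through to simpleperf):

```sh
# Run `simpleperf record` on the device, saving the profile to device storage.
$ ./run_simpleperf_on_device.py record -p 11904 --duration 10 -o /data/local/tmp/perf.data
```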

## Viewing the profile

Scripts in this section are for viewing the profile or converting profile data into formats used
by external UIs. For recommended UIs, see [view_the_profile.md](view_the_profile.md).

### report.py

`report.py` is a wrapper of the `report` command on the host. It accepts all options of the
`report` command.

```sh
# Report the call graph.
$ ./report.py -g

# Report the call graph in a GUI window implemented with Python Tk.
$ ./report.py -g --gui
```

### report_html.py

`report_html.py` generates `report.html` based on the profiling data. The `report.html` shows the
profiling result without depending on other files, so it can be viewed in local browsers or passed
to other machines. Depending on which command-line options are used, the content of the
`report.html` can include: chart statistics, sample table, flamegraphs, annotated source code for
each function, and annotated disassembly for each function.

```sh
# Generate chart statistics, sample table and flamegraphs, based on perf.data.
$ ./report_html.py

# Add source code.
$ ./report_html.py --add_source_code --source_dirs path_of_SimpleperfExampleCpp

# Add disassembly.
$ ./report_html.py --add_disassembly

# Adding disassembly for all binaries can cost a lot of time. So we can choose to only add
# disassembly for selected binaries.
$ ./report_html.py --add_disassembly --binary_filter libgame.so

# report_html.py accepts more than one recording data file.
$ ./report_html.py -i perf1.data perf2.data
```

Below is an example of generating html profiling results for SimpleperfExampleCpp.

```sh
$ ./app_profiler.py -p simpleperf.example.cpp
$ ./report_html.py --add_source_code --source_dirs path_of_SimpleperfExampleCpp \
    --add_disassembly
```

After opening the generated [`report.html`](./report_html.html) in a browser, there are several
tabs:

The first tab is "Chart Statistics". You can click the pie chart to show the time consumed by each
process, thread, library and function.

The second tab is "Sample Table". It shows the time taken by each function. By clicking one row in
the table, we can jump to a new tab called "Function".

The third tab is "Flamegraph". It shows the graphs generated by [`inferno`](./inferno.md).

The fourth tab is "Function". It only appears when users click a row in the "Sample Table" tab.
It shows information about a function, including:

1. A flamegraph showing functions called by that function.
2. A flamegraph showing functions calling that function.
3. Annotated source code of that function. It only appears when there are source code files for
   that function.
4. Annotated disassembly of that function. It only appears when there are binaries containing that
   function.

### inferno

[`inferno`](./inferno.md) is a tool used to generate a flamegraph in an html file.

```sh
# Generate a flamegraph based on perf.data.
# On Windows, use inferno.bat instead of ./inferno.sh.
$ ./inferno.sh -sc --record_file perf.data

# Record a native program and generate a flamegraph.
$ ./inferno.sh -np surfaceflinger
```

### purgatorio

[`purgatorio`](../scripts/purgatorio/README.md) is a visualization tool to show samples in time
order.

### pprof_proto_generator.py

It converts a profiling data file into `pprof.proto`, a format used by
[pprof](https://github.com/google/pprof).

```sh
# Convert perf.data in the current directory to pprof.proto format.
$ ./pprof_proto_generator.py
# Show the report in pdf format.
$ pprof -pdf pprof.profile

# Show the report in html format. To show disassembly, add the --tools option like:
#   --tools=objdump:<ndk_path>/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin
# To show annotated source or disassembly, select `top` in the view menu, click a function and
# select `source` or `disassemble` in the view menu.
$ pprof -http=:8080 pprof.profile
```

### gecko_profile_generator.py

Converts `perf.data` to [Gecko Profile
Format](https://github.com/firefox-devtools/profiler/blob/main/docs-developer/gecko-profile-format.md),
the format read by https://profiler.firefox.com/.

Firefox Profiler is a powerful general-purpose profiler UI which runs locally in
any browser (not just Firefox), with:

- Per-thread tracks
- Flamegraphs
- Search and focus for specific stacks
- A time series view for seeing your samples in timestamp order
- Filtering by thread and duration

Usage:

```
# Record a profile of your application
$ ./app_profiler.py -p simpleperf.example.cpp

# Convert and gzip.
$ ./gecko_profile_generator.py -i perf.data | gzip > gecko-profile.json.gz
```

Then open `gecko-profile.json.gz` in https://profiler.firefox.com/.

### report_sample.py

`report_sample.py` converts a profiling data file into the `perf script` text format output by
`linux-perf-tool`.

This format can be imported into:

- [FlameGraph](https://github.com/brendangregg/FlameGraph)
- [Flamescope](https://github.com/Netflix/flamescope)
- [Firefox
  Profiler](https://github.com/firefox-devtools/profiler/blob/main/docs-user/guide-perf-profiling.md),
  but prefer using `gecko_profile_generator.py`.
- [Speedscope](https://github.com/jlfwong/speedscope/wiki/Importing-from-perf-(linux))

```sh
# Record a profile to perf.data
$ ./app_profiler.py <args>

# Convert perf.data in the current directory to a format used by FlameGraph.
$ ./report_sample.py --symfs binary_cache >out.perf

$ git clone https://github.com/brendangregg/FlameGraph.git
$ FlameGraph/stackcollapse-perf.pl out.perf >out.folded
$ FlameGraph/flamegraph.pl out.folded >a.svg
```

### stackcollapse.py

`stackcollapse.py` converts a profiling data file (`perf.data`) to [Brendan
Gregg's "Folded Stacks"
format](https://queue.acm.org/detail.cfm?id=2927301#:~:text=The%20folded%20stack%2Dtrace%20format,trace%2C%20followed%20by%20a%20semicolon).

Folded Stacks are lines of semicolon-delimited stack frames, root to leaf,
followed by a count of events sampled in that stack, e.g.:

```
BusyThread;__start_thread;__pthread_start(void*);java.lang.Thread.run 17889729
```

All similar stacks are aggregated and sample timestamps are unused.

The Folded Stacks format is readable by:

- The [FlameGraph](https://github.com/brendangregg/FlameGraph) toolkit
- [Inferno](https://github.com/jonhoo/inferno) (Rust port of FlameGraph)
- [Speedscope](https://speedscope.app/)

Example:

```sh
# Record a profile to perf.data
$ ./app_profiler.py <args>

# Convert to Folded Stacks format
$ ./stackcollapse.py --kernel --jit | gzip > profile.folded.gz

# Visualise with FlameGraph with Java stacks and nanosecond times
$ git clone https://github.com/brendangregg/FlameGraph.git
$ gunzip -c profile.folded.gz \
    | FlameGraph/flamegraph.pl --color=java --countname=ns \
    > profile.svg
```

## simpleperf_report_lib.py

`simpleperf_report_lib.py` is a Python library used to parse profiling data files generated by the
record command. Internally, it uses libsimpleperf_report.so to do the work. Generally, for each
profiling data file, we create an instance of ReportLib and pass it the file path (via
SetRecordFile). Then we can read all samples through GetNextSample(). For each sample, we can read
its event info (via GetEventOfCurrentSample), symbol info (via GetSymbolOfCurrentSample) and call
chain info (via GetCallChainOfCurrentSample). We can also get some global information, like record
options (via GetRecordCmd), the arch of the device (via GetArch) and meta strings (via MetaInfo).

Examples of using `simpleperf_report_lib.py` are in `report_sample.py`, `report_html.py`,
`pprof_proto_generator.py` and `inferno/inferno.py`.

## ipc.py

`ipc.py` captures the instructions per cycle (IPC) of the system during a specified duration.

Example:
```sh
./ipc.py
./ipc.py 2 20          # Set the interval to 2 secs and the total duration to 20 secs
./ipc.py -p 284 -C 4   # Only profile PID 284 while running on core 4
./ipc.py -c 'sleep 5'  # Only profile the command to run
```

The results look like:
```
K_CYCLES   K_INSTR    IPC
36840      14138      0.38
70701      27743      0.39
104562     41350      0.40
138264     54916      0.40
```

## sample_filter.py

`sample_filter.py` generates sample filter files as documented in [sample_filter.md](https://android.googlesource.com/platform/system/extras/+/refs/heads/main/simpleperf/doc/sample_filter.md).
A filter file can be passed via `--filter-file` when running report scripts.

For example, it can be used to split a large recording file into several report files.

```sh
$ sample_filter.py -i perf.data --split-time-range 2 -o sample_filter
$ gecko_profile_generator.py -i perf.data --filter-file sample_filter_part1 \
    | gzip >profile-part1.json.gz
$ gecko_profile_generator.py -i perf.data --filter-file sample_filter_part2 \
    | gzip >profile-part2.json.gz
```
# View the profile

[TOC]

## Introduction

After using `simpleperf record` or `app_profiler.py`, we get a profile data file. The file contains
a list of samples. Each sample has a timestamp, a thread id, a callstack, events (like cpu-cycles
or cpu-clock) used in this sample, etc. We have many choices for viewing the profile. We can show
samples in chronological order, or show aggregated flamegraphs. We can show reports in text format,
or in some interactive UIs.

Below are some recommended UIs for viewing the profile. Google developers can find more examples in
[go/gmm-profiling](go/gmm-profiling?polyglot=linux-workstation#viewing-the-profile).


## Continuous PProf UI (great flamegraph UI, but only available internally)

[PProf](https://github.com/google/pprof) is a mature profiling technology used extensively on
Google servers, with a powerful flamegraph UI offering strong drilldown, search, pivot, profile
diff, and graph visualisation.

We can use `pprof_proto_generator.py` to convert profiles into pprof.profile protobufs for use in
pprof.

```
# Output all threads, broken down by threadpool.
./pprof_proto_generator.py

# Use proguard mapping.
./pprof_proto_generator.py --proguard-mapping-file proguard.map

# Just the main (UI) thread (query by thread name):
./pprof_proto_generator.py --comm com.example.android.displayingbitmaps
```

This will print some debug logs like `Failed to read symbols`; this is usually OK, unless those
symbols are hotspots.

The continuous pprof server has a file upload size limit of 50MB. To get around this limit, compress
the profile before uploading:

```
gzip pprof.profile
```

After compressing, you can upload the `pprof.profile.gz` file to either http://pprof/ or
http://pprofng/. Both websites have an 'Upload' tab for this purpose. Alternatively, you can use
the following `pprof` command to upload the compressed profile:

```
# Upload all threads in profile, grouped by threadpool.
# This is usually a good default, combining threads with similar names.
pprof --flame --tagroot threadpool pprof.profile.gz

# Upload all threads in profile, grouped by individual thread name.
pprof --flame --tagroot thread pprof.profile.gz

# Upload all threads in profile, without grouping by thread.
pprof --flame pprof.profile.gz
```

This will output a URL, for example: https://pprof.corp.google.com/?id=589a60852306144c880e36429e10b166

## Firefox Profiler (great chronological UI)

We can view Android profiles using Firefox Profiler: https://profiler.firefox.com/. This does not
require installing Firefox: Firefox Profiler is just a website, and you can open it in any browser.
There is also an internal Google-hosted Firefox Profiler, at go/profiler or go/firefox-profiler.

Firefox Profiler has a great chronological view, as it doesn't pre-aggregate similar stack traces
like pprof does.

We can use `gecko_profile_generator.py` to convert raw perf.data files into a Firefox Profile, with
Proguard deobfuscation.

```
# Create Gecko Profile
./gecko_profile_generator.py | gzip > gecko_profile.json.gz

# Create Gecko Profile using Proguard map
./gecko_profile_generator.py --proguard-mapping-file proguard.map | gzip > gecko_profile.json.gz
```

Then drag-and-drop gecko_profile.json.gz into https://profiler.firefox.com/.

Firefox Profiler supports:

1. Aggregated Flamegraphs
2. Chronological Stackcharts

And allows filtering by:

1. Individual threads
2. Multiple threads (Ctrl+Click thread names to select many)
3. Timeline period
4. Stack frame text search

## FlameScope (great jank-finding UI)

[Netflix's FlameScope](https://github.com/Netflix/flamescope) is a rough, proof-of-concept UI that
lets you spot repeating patterns of work by laying out the profile as a subsecond heatmap.

Below, each vertical stripe is one second, and each cell is 10ms. Redder cells have more samples.
See https://www.brendangregg.com/blog/2018-11-08/flamescope-pattern-recognition.html for how to
spot patterns.

This is an example of a 60s DisplayBitmaps app Startup Profile.

You can see:

- The thick red vertical line on the left is startup.
- The long white vertical sections on the left show the app is mostly idle, waiting for commands
  from instrumented tests.
- The periodic red blocks that follow show the app repeatedly busy handling commands from
  instrumented tests.

Click the start and end cells of a duration to see a flamegraph for that duration.

Install and run FlameScope:

```
git clone https://github.com/Netflix/flamescope ~/flamescope
cd ~/flamescope
pip install -r requirements.txt
npm install
npm run webpack
python3 run.py
```

Then open FlameScope in-browser: http://localhost:5000/.

FlameScope can read gzipped perf script format profiles. Convert simpleperf perf.data to this
format with `report_sample.py`, and place it in FlameScope's examples directory:

```
# Create `Linux perf script` format profile.
report_sample.py | gzip > ~/flamescope/examples/my_simpleperf_profile.gz

# Create `Linux perf script` format profile using Proguard map.
report_sample.py \
    --proguard-mapping-file proguard.map \
    | gzip > ~/flamescope/examples/my_simpleperf_profile.gz
```

Open the profile "as Linux Perf", and click start and end sections to get a flamegraph of that
timespan.

To investigate UI thread jank, filter to UI thread samples only:

```
# Filter to the UI thread, which is named after the app package.
report_sample.py \
    --comm com.example.android.displayingbitmaps \
    | gzip > ~/flamescope/examples/uithread.gz
```

Once you've identified the timespan of interest, consider also zooming into that section with
Firefox Profiler, which has a more powerful flamegraph viewer.


## Differential FlameGraph

See Brendan Gregg's [Differential Flame Graphs](https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html) blog.

Use Simpleperf's `stackcollapse.py` to convert perf.data to Folded Stacks format for the FlameGraph
toolkit.

Consider diffing both directions: After minus Before, and Before minus After.

If you've recorded before and after your optimisation as perf_before.data and perf_after.data, and
you're only interested in the UI thread:

```
# Generate before and after folded stacks from perf.data files
./stackcollapse.py --kernel --jit -i perf_before.data \
    --proguard-mapping-file proguard_before.map \
    --comm com.example.android.displayingbitmaps \
    > perf_before.folded
./stackcollapse.py --kernel --jit -i perf_after.data \
    --proguard-mapping-file proguard_after.map \
    --comm com.example.android.displayingbitmaps \
    > perf_after.folded

# Generate diff reports
FlameGraph/difffolded.pl -n perf_before.folded perf_after.folded \
    | FlameGraph/flamegraph.pl > diff1.svg
FlameGraph/difffolded.pl -n --negate perf_after.folded perf_before.folded \
    | FlameGraph/flamegraph.pl > diff2.svg
```

## Android Studio Profiler

Android Studio Profiler supports recording and reporting profiles of app processes. It supports
several recording methods, including one that uses simpleperf as the backend. You can use Android
Studio Profiler for both recording and reporting.

In Android Studio:

1. Open View -> Tool Windows -> Profiler
2. Click + -> Your Device -> Profileable Processes -> Your App

Click into the "CPU" chart.

Choose Callstack Sample Recording. Even if you're using Java, this provides better observability
into ART, malloc, and the kernel.

Click Record, run your test on the device, then Stop when you're done.

Click on a thread track, then "Flame Chart" to see a chronological chart on the left, and an
aggregated flamechart on the right.

If you want more flexibility in recording options, or want to use a proguard mapping file, you can
record using simpleperf, and report using Android Studio Profiler.

We can use `simpleperf report-sample` to convert perf.data to trace files for Android Studio
Profiler.

```
# Convert perf.data to perf.trace for Android Studio Profiler.
# If on Mac/Windows, use the simpleperf host executable for those platforms instead.
bin/linux/x86_64/simpleperf report-sample --show-callchain --protobuf -i perf.data -o perf.trace

# Convert perf.data to perf.trace using a proguard mapping file.
bin/linux/x86_64/simpleperf report-sample --show-callchain --protobuf -i perf.data -o perf.trace \
    --proguard-mapping-file proguard.map
```

In Android Studio: Open File -> Open -> Select perf.trace


## Simpleperf HTML Report

Simpleperf can generate its own HTML profile, which can show Android-specific information and
separate flamegraphs for all threads, albeit with a much rougher flamegraph UI.

This UI is fairly rough; we recommend using the Continuous PProf UI or Firefox Profiler instead.
But it's useful for a quick look at your data.

Each of the following commands takes ./perf.data as input and outputs ./report.html.

```
# Make an HTML report.
./report_html.py

# Make an HTML report with Proguard mapping.
./report_html.py --proguard-mapping-file proguard.map
```

This will print some debug logs like `Failed to read symbols`; this is usually OK, unless those
symbols are hotspots.

See also [report_html.py's README](scripts_reference.md#report_htmlpy) and `report_html.py -h`.


## PProf Interactive Command Line

Unlike the Continuous PProf UI, the [PProf](https://github.com/google/pprof) command line is
publicly available, and allows drilldown, pivoting and filtering.

The session below demonstrates filtering to stack frames containing `processBitmap`.

```
$ pprof pprof.profile
(pprof) show=processBitmap
(pprof) top
Active filters:
   show=processBitmap
Showing nodes accounting for 2.45s, 11.44% of 21.46s total
      flat  flat%   sum%        cum   cum%
     2.45s 11.44% 11.44%      2.45s 11.44%  com.example.android.displayingbitmaps.util.ImageFetcher.processBitmap
```

And then showing the tags of those frames, to tell which threads they are running on:

```
(pprof) tags
 pid: Total 2.5s
      2.5s (  100%): 31112

 thread: Total 2.5s
         1.4s (57.21%): AsyncTask #3
         1.1s (42.79%): AsyncTask #4

 threadpool: Total 2.5s
             2.5s (  100%): AsyncTask #%d

 tid: Total 2.5s
      1.4s (57.21%): 31174
      1.1s (42.79%): 31175
```

Contrast with another method:

```
(pprof) show=addBitmapToCache
(pprof) top
Active filters:
   show=addBitmapToCache
Showing nodes accounting for 1.05s, 4.88% of 21.46s total
      flat  flat%   sum%        cum   cum%
     1.05s  4.88%  4.88%      1.05s  4.88%  com.example.android.displayingbitmaps.util.ImageCache.addBitmapToCache
```

For more information, see the [pprof README](https://github.com/google/pprof/blob/main/doc/README.md#interactive-terminal-use).


## Simpleperf Report Command Line

The simpleperf report command reports profiles in text format.

You can call `simpleperf report` directly or call it via `report.py`.

```
# Report symbols in table format.
$ ./report.py --children

# Report call graph.
$ bin/linux/x86_64/simpleperf report -g -i perf.data
```

See also [report command's README](executable_commands_reference.md#The-report-command) and
`report.py -h`.


## Custom Report Interface

If the above view UIs can't fulfill your needs, you can use `simpleperf_report_lib.py` to parse
perf.data, extract sample information, and feed it to any view you like, as in the sketch below.

See [simpleperf_report_lib.py's README](scripts_reference.md#simpleperf_report_libpy) for more
details.
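
As an illustration, here is a minimal sketch of such a custom view, under the same assumptions as
the `simpleperf_report_lib.py` example earlier (run from the simpleperf `scripts` directory, with
`perf.data` in the current directory). It aggregates event counts per symbol and prints a top-ten
table:

```python
#!/usr/bin/env python3
# Minimal sketch of a custom "view": total event count per symbol, top ten.
import collections

from simpleperf_report_lib import ReportLib

lib = ReportLib()
lib.SetRecordFile('perf.data')

counts = collections.Counter()
while True:
    sample = lib.GetNextSample()
    if sample is None:
        break
    # sample.period is the number of events attributed to this sample.
    counts[lib.GetSymbolOfCurrentSample().symbol_name] += sample.period
lib.Close()

for name, period in counts.most_common(10):
    print('%16d  %s' % (period, name))
```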