# Simpleperf

Android Studio includes a graphical front end to Simpleperf, documented in
[Inspect CPU activity with CPU Profiler](https://developer.android.com/studio/profile/cpu-profiler).
Most users will prefer to use that instead of using Simpleperf directly.

Simpleperf is a native CPU profiling tool for Android. It can be used to profile
both Android applications and native processes running on Android. It can
profile both Java and C++ code on Android. The simpleperf executable can run on Android >= L,
and the Python scripts can be used on Android >= N.

Simpleperf is part of the Android Open Source Project.
The source code is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/).
The latest documentation is [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/README.md).

[TOC]
## Introduction

An introduction slide deck is [here](./introduction.pdf).

Simpleperf contains two parts: the simpleperf executable and the Python scripts.

The simpleperf executable works similarly to linux-tools-perf, but has some features specific to
the Android profiling environment:

1. It collects more info in profiling data. Since the common workflow is "record on the device, and
   report on the host", simpleperf not only collects samples in profiling data, but also collects
   needed symbols, device info and recording time.

2. It delivers new features for recording.
   1) When recording a dwarf based call graph, simpleperf unwinds the stack before writing a sample
      to file. This saves storage space on the device.
   2) It supports tracing both on CPU time and off CPU time with the --trace-offcpu option.
   3) It supports recording call graphs of JITed and interpreted Java code on Android >= P.

3. It relates closely to the Android platform.
   1) It is aware of the Android environment, e.g. using system properties to enable profiling, and
      using run-as to profile in an application's context.
   2) It supports reading symbols and debug information from the .gnu_debugdata section, because
      system libraries are built with the .gnu_debugdata section starting from Android O.
   3) It supports profiling shared libraries embedded in apk files.
   4) It uses the standard Android stack unwinder, so its results are consistent with all other
      Android tools.

4. It builds executables and shared libraries for different usages.
   1) It builds static executables on the device. Since static executables don't rely on any
      library, simpleperf executables can be pushed to any Android device and used to record
      profiling data.
   2) It builds executables on different hosts: Linux, Mac and Windows. These executables can be
      used to report on hosts.
   3) It builds report shared libraries on different hosts. The report library is used by different
      Python scripts to parse profiling data.

Detailed documentation for the simpleperf executable is [here](#executable-commands-reference).

The Python scripts are split into three parts according to their functions:

1. Scripts used for recording, like app_profiler.py and run_simpleperf_without_usb_connection.py.

2. Scripts used for reporting, like report.py, report_html.py and inferno.

3. Scripts used for parsing profiling data, like simpleperf_report_lib.py.

The Python scripts are tested on Python >= 3.9. Older versions may not be supported.
Detailed documentation for the Python scripts is [here](#scripts-reference).
## Tools in simpleperf

The simpleperf executables and Python scripts are located in simpleperf/ in ndk releases, and in
system/extras/simpleperf/scripts/ in AOSP. Their functions are listed below.

bin/: contains executables and shared libraries.

bin/android/${arch}/simpleperf: static simpleperf executables used on the device.

bin/${host}/${arch}/simpleperf: simpleperf executables used on the host, which only support
reporting.

bin/${host}/${arch}/libsimpleperf_report.${so/dylib/dll}: report shared libraries used on the host.

*.py, inferno, purgatorio: Python scripts used for recording and reporting. Details are in
[scripts_reference.md](scripts_reference.md).
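Because the device executables are static, they can be used directly. Below is a minimal sketch of
pushing and running one (the arm64 variant and `<pid>` are placeholders; adjust for your device):

```sh
# Push the static arm64 executable to a writable directory on the device.
$ adb push bin/android/arm64/simpleperf /data/local/tmp
$ adb shell chmod a+x /data/local/tmp/simpleperf

# Record a process for 10 seconds, writing profiling data to perf.data on the device.
$ adb shell /data/local/tmp/simpleperf record -p <pid> --duration 10 -o /data/local/tmp/perf.data
```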
## Android application profiling

See [android_application_profiling.md](./android_application_profiling.md).

## Android platform profiling

See [android_platform_profiling.md](./android_platform_profiling.md).

## Executable commands reference

See [executable_commands_reference.md](./executable_commands_reference.md).

## Scripts reference

See [scripts_reference.md](./scripts_reference.md).

## View the profile

See [view_the_profile.md](./view_the_profile.md).

## Answers to common issues
### Support on different Android versions

On Android < N, the kernel may be too old (< 3.18) to support features like recording DWARF
based call graphs.
On Android M - O, we can only profile C++ code and fully compiled Java code.
On Android >= P, the ART interpreter supports DWARF based unwinding, so we can profile Java code.
On Android >= Q, we can use the simpleperf shipped on device to profile released Android apps, with
`<profileable android:shell="true" />`.
### Comparing DWARF based and stack frame based call graphs

Simpleperf supports two ways of recording call stacks with samples. One is the DWARF based call
graph, the other is the stack frame based call graph. Below is their comparison:

Recording DWARF based call graph:
1. Needs support of debug information in binaries.
2. Generally works well on both ARM and ARM64, for both Java code and C++ code.
3. Can only unwind 64K of stack for each sample. So it isn't always possible to unwind to the
   bottom. However, this is alleviated in simpleperf, as explained in the next section.
4. Takes more CPU time than stack frame based call graphs. So it has higher overhead, and can't
   sample at very high frequency (usually <= 4000 Hz).

Recording stack frame based call graph:
1. Needs support of stack frame registers.
2. Doesn't work well on ARM, because ARM is short of registers, and ARM and THUMB code have
   different stack frame registers, so the kernel can't unwind a user stack containing both ARM
   and THUMB code.
3. Also doesn't work well on Java code, because the ART compiler doesn't reserve stack frame
   registers, and it can't get frames for interpreted Java code.
4. Works well when profiling native programs on ARM64. One example is profiling surfaceflinger.
   It usually shows a complete flamegraph when it works well.
5. Takes much less CPU time than DWARF based call graphs. So the sample frequency can be 10000 Hz
   or higher.

So if you need to profile code on ARM or profile Java code, the DWARF based call graph is better.
If you need to profile C++ code on ARM64, the stack frame based call graph may be better. In any
case, you can first try the DWARF based call graph, which is also the default option when `-g` is
used, since it generally produces reasonable results. If it doesn't work well enough, then try the
stack frame based call graph instead.
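To make the two options concrete, below is a minimal sketch of recording each kind of call graph
(`<pid>` is a placeholder; both options are documented in the executable commands reference):

```sh
# Record a DWARF based call graph (also the default when -g is used).
$ simpleperf record -g -p <pid> --duration 10

# Record a stack frame based call graph.
$ simpleperf record --call-graph fp -p <pid> --duration 10
```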
### Fix broken DWARF based call graph

A DWARF based call graph is generated by unwinding thread stacks. When a sample is recorded, the
kernel dumps up to 64 kilobytes of stack data. By unwinding the stack based on DWARF information,
we can get a call stack.

Two reasons may cause a broken call stack:
1. The kernel can only dump up to 64 kilobytes of stack data for each sample, but a thread can have
   a much larger stack. In this case, we can't unwind to the thread start point.

2. We need binaries containing DWARF call frame information to unwind stack frames. The binary
   should have one of the following sections: .eh_frame, .debug_frame, .ARM.exidx or .gnu_debugdata
   (a quick way to check is shown below).
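To check whether a binary carries any of these sections, you can list its sections with readelf or
llvm-readelf (a sketch; `libfoo.so` is a placeholder for your binary):

```sh
# List the sections of the binary and look for unwind/debug info sections.
$ readelf -S libfoo.so | grep -E 'eh_frame|debug_frame|ARM.exidx|gnu_debugdata'
```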
To mitigate these problems:

For the missing stack data problem:
1. To alleviate it, simpleperf joins callchains (call stacks) after recording. If two callchains of
   a thread have an entry containing the same ip and sp address, then simpleperf tries to join them
   to make the callchains longer. So we can get more complete callchains by recording longer and
   joining more samples. This doesn't guarantee complete call graphs, but it usually works well.

2. Simpleperf stores samples in a buffer before unwinding them. If the buffer is low on free space,
   simpleperf may decide to truncate the stack data for a sample to 1K. Hopefully, this can be
   recovered by the callchain joiner. But when a high percentage of samples are truncated, many
   callchains can be broken. We can tell if many samples are truncated from the record command
   output, like:

```sh
$ simpleperf record ...
simpleperf I cmd_record.cpp:809] Samples recorded: 105584 (cut 86291). Samples lost: 6501.

$ simpleperf record ...
simpleperf I cmd_record.cpp:894] Samples recorded: 7,365 (1,857 with truncated stacks).
```

There are two ways to avoid truncating samples. One is increasing the buffer size, like
`--user-buffer-size 1G`. But `--user-buffer-size` is only available in the latest simpleperf. If
that option isn't available, we can use `--no-cut-samples` to disable truncating samples.
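A minimal sketch of the two options (`<pid>` is a placeholder):

```sh
# Enlarge the userspace buffer, so samples don't need to be truncated.
$ simpleperf record -g --user-buffer-size 1G -p <pid> --duration 10

# On older simpleperf versions, disable truncating samples instead.
$ simpleperf record -g --no-cut-samples -p <pid> --duration 10
```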
For the missing DWARF call frame info problem:
1. Most C++ code generates binaries containing call frame info, in .eh_frame or .ARM.exidx
   sections. These sections are not stripped, and are usually enough for stack unwinding.

2. For C code and the small percentage of C++ code that the compiler is sure will not generate
   exceptions, the call frame info is generated in the .debug_frame section. The .debug_frame
   section is usually stripped with the other debug sections. One way to fix this is to download
   unstripped binaries to the device, as described
   [here](#fix-broken-callchain-stopped-at-c-functions).

3. The compiler doesn't generate unwind instructions for function prologues and epilogues, because
   they operate on stack frames and will not generate exceptions. But profiling may hit these
   instructions and fail to unwind them. This usually doesn't matter in a flamegraph. But in a
   time based Stack Chart (like in Android Studio and Firefox Profiler), this causes stack gaps
   once in a while. We can remove stack gaps via `--remove-gaps`, which is already enabled by
   default.
### Fix broken callchain stopped at C functions

When using dwarf based call graphs, simpleperf generates callchains during recording to save space.
The debug information needed to unwind C functions is in the .debug_frame section, which is usually
stripped from native libraries in apks. To fix this, we can download unstripped versions of the
native libraries to the device, and ask simpleperf to use them when recording.

To use simpleperf directly:

```sh
# Create a native_libs dir on device, and push unstripped libs into it (nested dirs are not
# supported).
$ adb shell mkdir /data/local/tmp/native_libs
$ adb push <unstripped_dir>/*.so /data/local/tmp/native_libs
# Run simpleperf record with the --symfs option.
$ adb shell simpleperf record xxx --symfs /data/local/tmp/native_libs
```

To use app_profiler.py:

```sh
$ ./app_profiler.py -lib <unstripped_dir>
```
### How to solve missing symbols in report?

The simpleperf record command collects symbols on device into perf.data. But if the native
libraries you use on device are stripped, this will result in a lot of unknown symbols in the
report. A solution is to build a binary_cache on the host.

```sh
# Collect binaries needed by perf.data in binary_cache/.
$ ./binary_cache_builder.py -lib NATIVE_LIB_DIR,...
```

The NATIVE_LIB_DIRs passed in the -lib option are the directories containing unstripped native
libraries on the host. After running it, the native libraries containing symbol tables are
collected in binary_cache/ for use when reporting.

```sh
$ ./report.py --symfs binary_cache

# report_html.py searches binary_cache/ automatically, so you don't need to
# pass it any argument.
$ ./report_html.py
```
### Show annotated source code and disassembly

To show hot places at the source code and instruction level, we need to show source code and
disassembly with event count annotation. Simpleperf supports showing annotated source code and
disassembly for C++ code and fully compiled Java code, in two ways:

1. Through report_html.py:
   1) Generate perf.data and pull it to the host.
   2) Generate binary_cache, containing elf files with debug information. Use the -lib option to
      add libs with debug info. Do it with
      `binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>`.
   3) Use report_html.py to generate report.html with annotated source code and disassembly,
      as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#report_html_py).

2. Through pprof:
   1) Generate perf.data and binary_cache as above.
   2) Use pprof_proto_generator.py to generate a pprof proto file (see the sketch after this list).
   3) Use pprof to report a function with annotated source code, as described [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/doc/scripts_reference.md#pprof_proto_generator_py).
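Roughly, the pprof route looks like this (a sketch; `pprof.profile` assumes the script's default
output name, and google/pprof is assumed to be installed on the host):

```sh
# Generate binary_cache and a pprof proto file from perf.data.
$ ./binary_cache_builder.py -i perf.data -lib <dir_of_lib_with_debug_info>
$ ./pprof_proto_generator.py -i perf.data

# Browse the annotated profile with pprof's web UI.
$ pprof -http=:8080 pprof.profile
```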
### Reduce lost samples and samples with truncated stack

When using `simpleperf record`, we may see lost samples or samples with truncated stack data.
Before saving samples to a file, simpleperf uses two buffers to cache samples in memory. One is a
kernel buffer, the other is a userspace buffer. The kernel puts samples into the kernel buffer.
Simpleperf moves samples from the kernel buffer to the userspace buffer before processing them. If
a buffer overflows, we lose samples or get samples with truncated stack data. Below is an example.

```sh
$ simpleperf record -a --duration 1 -g --user-buffer-size 100k
simpleperf I cmd_record.cpp:799] Recorded for 1.00814 seconds. Start post processing.
simpleperf I cmd_record.cpp:894] Samples recorded: 79 (16 with truncated stacks).
                                 Samples lost: 2,129 (kernelspace: 18, userspace: 2,111).
simpleperf W cmd_record.cpp:911] Lost 18.5567% of samples in kernel space, consider increasing
                                 kernel buffer size(-m), or decreasing sample frequency(-f), or
                                 increasing sample period(-c).
simpleperf W cmd_record.cpp:928] Lost/Truncated 97.1233% of samples in user space, consider
                                 increasing userspace buffer size(--user-buffer-size), or
                                 decreasing sample frequency(-f), or increasing sample period(-c).
```

In the above example, we get 79 samples, 16 of which have truncated stack data. We lose 18 samples
in the kernel buffer, and 2,111 samples in the userspace buffer.

To reduce lost samples in the kernel buffer, we can increase the kernel buffer size via `-m`. To
reduce lost samples in the userspace buffer, or reduce samples with truncated stack data, we can
increase the userspace buffer size via `--user-buffer-size`.

We can also reduce the number of samples generated in a fixed time period, e.g. by reducing the
sample frequency using `-f`, monitoring fewer threads, or not monitoring multiple perf events at
the same time.
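A minimal sketch of these knobs (`<pid>` and the concrete values are placeholders):

```sh
# Increase both buffer sizes: -m sets the kernel buffer size in pages,
# --user-buffer-size sets the userspace buffer size.
$ simpleperf record -g -m 1024 --user-buffer-size 1G -p <pid> --duration 10

# Or generate fewer samples by lowering the sample frequency.
$ simpleperf record -g -f 500 -p <pid> --duration 10
```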
## Bugs and contribution

Bugs and feature requests can be submitted at https://github.com/android/ndk/issues.
Patches can be uploaded to android-review.googlesource.com as described [here](https://source.android.com/setup/contribute/),
or sent to the email addresses listed [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/OWNERS).

If you want to compile the simpleperf C++ source code, follow the steps below:
1. Download the AOSP main branch as described [here](https://source.android.com/setup/build/requirements).
2. Build simpleperf.

```sh
$ . build/envsetup.sh
$ lunch aosp_arm64-trunk_staging-userdebug
$ mmma system/extras/simpleperf -j30
```

If built successfully, out/target/product/generic_arm64/system/bin/simpleperf is for ARM64, and
out/target/product/generic_arm64/system/bin/simpleperf32 is for ARM.

The source code of the simpleperf Python scripts is in [system/extras/simpleperf/scripts](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/scripts/).
Most scripts rely on the simpleperf binaries to work. To update the binaries for the scripts (using
a linux x86_64 host and an android arm64 target as an example):

```sh
$ cp out/host/linux-x86/lib64/libsimpleperf_report.so system/extras/simpleperf/scripts/bin/linux/x86_64/libsimpleperf_report.so
$ cp out/target/product/generic_arm64/system/bin/simpleperf_ndk64 system/extras/simpleperf/scripts/bin/android/arm64/simpleperf
```

Then you can try the latest simpleperf scripts and binaries in system/extras/simpleperf/scripts.
# Android application profiling

This section shows how to profile an Android application.
Some examples are [here](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/demo/README.md).

Profiling an Android application involves three steps:
1. Prepare the Android application.
2. Record profiling data.
3. Report profiling data.

[TOC]
## Prepare an Android application

Based on the profiling situation, we may need to customize the build script to generate an apk file
specifically for profiling. Below are some suggestions.

1. If you want to profile a debug build of an application:

For the debug build type, Android Studio sets android:debuggable="true" in AndroidManifest.xml,
enables JNI checks and may not optimize C/C++ code. It can be profiled by simpleperf without any
change.

2. If you want to profile a release build of an application:

For the release build type, Android Studio sets android:debuggable="false" in AndroidManifest.xml,
disables JNI checks and optimizes C/C++ code. However, security restrictions mean that only apps
with android:debuggable set to true can be profiled. So simpleperf can only profile a release
build under one of these three circumstances:

If you are on a rooted device, you can profile any app.

If you are on Android >= Q, you can add the profileableFromShell flag in AndroidManifest.xml; this
makes a release app profileable by preinstalled profiling tools. In this case, simpleperf
downloaded by adb will invoke the simpleperf preinstalled in the system image to profile the app.

```
<manifest ...>
    <application ...>
        <profileable android:shell="true" />
    </application>
</manifest>
```

If you are on Android >= O, you can use [wrap.sh](https://developer.android.com/ndk/guides/wrap-script.html)
to profile a release build:

Step 1: Add android:debuggable="true" in AndroidManifest.xml to enable profiling.
```
<manifest ...>
    <application android:debuggable="true" ...>
```

Step 2: Add wrap.sh in the lib/`arch` directories. wrap.sh runs the app without passing any debug
flags to ART, so the app runs as a release app. This can be done by adding the script below to
app/build.gradle.
```
android {
    buildTypes {
        release {
            sourceSets {
                release {
                    resources {
                        srcDir {
                            "wrap_sh_lib_dir"
                        }
                    }
                }
            }
        }
    }
}

task createWrapShLibDir {
    for (String abi : ["armeabi-v7a", "arm64-v8a", "x86", "x86_64"]) {
        def dir = new File("app/wrap_sh_lib_dir/lib/" + abi)
        dir.mkdirs()
        def wrapFile = new File(dir, "wrap.sh")
        wrapFile.withWriter { writer ->
            writer.write('#!/system/bin/sh\n\$@\n')
        }
    }
}
```
3. If you want to profile C/C++ code:

Android Studio strips the symbol table and debug info of native libraries in the apk. So the
profiling results may contain unknown symbols or broken callgraphs. To fix this, we can pass
app_profiler.py a directory containing unstripped native libraries via the -lib option. Usually
the directory can be the path of your Android Studio project.

4. If you want to profile Java code:

On Android >= P, simpleperf supports profiling Java code, no matter whether it is executed by
the interpreter, JITed, or compiled into native instructions. So you don't need to do anything.

On Android O, simpleperf supports profiling Java code which is compiled into native instructions,
and it also needs wrap.sh to use the compiled Java code. To compile Java code, we can pass
app_profiler.py the --compile_java_code option.

On Android N, simpleperf supports profiling Java code that is compiled into native instructions.
To compile Java code, we can pass app_profiler.py the --compile_java_code option.

On Android <= M, simpleperf doesn't support profiling Java code.
Below we use the application [SimpleperfExampleCpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/demo/SimpleperfExampleCpp).
It builds an app-debug.apk for profiling.

```sh
$ git clone https://android.googlesource.com/platform/system/extras
$ cd extras/simpleperf/demo
# Open the SimpleperfExampleCpp project with Android Studio, and build it
# successfully, otherwise the `./gradlew` command below will fail.
$ cd SimpleperfExampleCpp

# On Windows, use "gradlew" instead.
$ ./gradlew clean assemble
$ adb install -r app/build/outputs/apk/debug/app-debug.apk
```
## Record and report profiling data

We can use [app_profiler.py](scripts_reference.md#app_profilerpy) to profile Android applications.

```sh
# Cd to the directory of simpleperf scripts. Record perf.data.
# -p option selects the profiled app using its package name.
# --compile_java_code option compiles Java code into native instructions, which isn't needed on
# Android >= P.
# -a option selects the Activity to profile.
# -lib option gives the directory in which to find debug native libraries.
$ ./app_profiler.py -p simpleperf.example.cpp -a .MixActivity -lib path_of_SimpleperfExampleCpp
```

This will collect profiling data in perf.data in the current directory, and related native
binaries in binary_cache/.

Normally we need to use the app while profiling, otherwise we may record no samples. But in this
case, the MixActivity starts a busy thread. So we don't need to use the app while profiling.
```sh
# Report perf.data in the stdio interface.
$ ./report.py
Cmdline: /data/data/simpleperf.example.cpp/simpleperf record ...
Arch: arm64
Event: task-clock:u (type 1, config 1)
Samples: 10023
Event count: 10023000000

Overhead  Command     Pid   Tid   Shared Object            Symbol
27.04%    BusyThread  5703  5729  /system/lib64/libart.so  art::JniMethodStart(art::Thread*)
25.87%    BusyThread  5703  5729  /system/lib64/libc.so    long StrToI<long, ...
...
```

[report.py](scripts_reference.md#reportpy) reports profiling data in the stdio interface. If there
are a lot of unknown symbols in the report, check [here](README.md#how-to-solve-missing-symbols-in-report).

```sh
# Report perf.data in the html interface.
$ ./report_html.py

# Add source code and disassembly. Change the path of source_dirs if it is not correct.
$ ./report_html.py --add_source_code --source_dirs path_of_SimpleperfExampleCpp \
  --add_disassembly
```

[report_html.py](scripts_reference.md#report_htmlpy) generates the report in report.html, and pops
up a browser tab to show it.
## Record and report call graph

We can record and report [call graphs](executable_commands_reference.md#record-call-graphs) as
below.

```sh
# Record dwarf based call graphs: add "-g" in the -r option.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e task-clock:u -f 1000 --duration 10 -g" -lib path_of_SimpleperfExampleCpp

# Record stack frame based call graphs: add "--call-graph fp" in the -r option.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e task-clock:u -f 1000 --duration 10 --call-graph fp" \
    -lib path_of_SimpleperfExampleCpp

# Report call graphs in the stdio interface.
$ ./report.py -g

# Report call graphs in the python Tk interface.
$ ./report.py -g --gui

# Report call graphs in the html interface.
$ ./report_html.py

# Report call graphs as flamegraphs.
# On Windows, use inferno.bat instead of ./inferno.sh.
$ ./inferno.sh -sc
```
## Report in html interface

We can use [report_html.py](scripts_reference.md#report_htmlpy) to show profiling results in a web
browser. report_html.py integrates chart statistics, sample table, flamegraphs, source code
annotation and disassembly annotation. It is the recommended way to show reports.

```sh
$ ./report_html.py
```

## Show flamegraph

To show flamegraphs, we need to first record call graphs. Flamegraphs are shown by
report_html.py in the "Flamegraph" tab.
We can also use [inferno](scripts_reference.md#inferno) to show flamegraphs directly.

```sh
# On Windows, use inferno.bat instead of ./inferno.sh.
$ ./inferno.sh -sc
```

We can also build flamegraphs using https://github.com/brendangregg/FlameGraph.
Make sure you have perl installed.

```sh
$ git clone https://github.com/brendangregg/FlameGraph.git
$ ./report_sample.py --symfs binary_cache >out.perf
$ FlameGraph/stackcollapse-perf.pl out.perf >out.folded
$ FlameGraph/flamegraph.pl out.folded >a.svg
```
## Report in Android Studio

The simpleperf report-sample command can convert perf.data into the protobuf format accepted by
the Android Studio CPU profiler. The conversion can be done either on device or on host. If you
have more symbol info on the host, prefer doing it on the host with the --symdir option.

```sh
$ simpleperf report-sample --protobuf --show-callchain -i perf.data -o perf.trace
# Then open perf.trace in Android Studio to show it.
```

## Deobfuscate Java symbols

Java symbols may be obfuscated by ProGuard. To restore the original symbols in a report, we can
pass a ProGuard mapping file to the report scripts or the report-sample command via
`--proguard-mapping-file`.

```sh
$ ./report_html.py --proguard-mapping-file proguard_mapping_file.txt
```
## Record both on CPU time and off CPU time

We can [record both on CPU time and off CPU time](executable_commands_reference.md#record-both-on-cpu-time-and-off-cpu-time).

First, check whether the trace-offcpu feature is supported on the device.

```sh
$ ./run_simpleperf_on_device.py list --show-features
dwarf-based-call-graph
trace-offcpu
```

If trace-offcpu is supported, it will be shown in the feature list. Then we can try it.

```sh
$ ./app_profiler.py -p simpleperf.example.cpp -a .SleepActivity \
    -r "-g -e task-clock:u -f 1000 --duration 10 --trace-offcpu" \
    -lib path_of_SimpleperfExampleCpp
$ ./report_html.py --add_disassembly --add_source_code \
    --source_dirs path_of_SimpleperfExampleCpp
```
## Profile from launch

We can [profile from the launch of an application](scripts_reference.md#profile-from-launch-of-an-application).

```sh
# Start simpleperf recording, then start the Activity to profile.
$ ./app_profiler.py -p simpleperf.example.cpp -a .MainActivity

# We can also start the Activity on the device manually.
# 1. Make sure the application isn't running or one of the recent apps.
# 2. Start simpleperf recording.
$ ./app_profiler.py -p simpleperf.example.cpp
# 3. Start the app manually on the device.
```
## Control recording in application code

Simpleperf supports controlling recording from application code. Below is the workflow:

1. Run `api_profiler.py prepare -p <package_name>` to allow the app to record itself using
simpleperf. By default, the permission is reset after device reboot, so we need to run the
script every time the device reboots. But on Android >= 13, we can use the `--days` option to
set how long we want the permission to last.

2. Link the simpleperf app_api code into the application. The app needs to be debuggable or
profileableFromShell as described [here](#prepare-an-android-application). Then the app can
use the api to start/pause/resume/stop recording. To start recording, the app_api forks a child
process running simpleperf, and uses pipe files to send commands to the child process. After
recording, a profiling data file is generated.

3. Run `api_profiler.py collect -p <package_name>` to collect profiling data files to the host.

Examples are CppApi and JavaApi in the [demo](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/demo).
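Concretely, the host-side half of this workflow looks like the sketch below (`<package_name>` is a
placeholder, and the 30-day duration is an arbitrary example):

```sh
# Allow the app to record itself; on Android >= 13, --days extends how long the permission lasts.
$ ./api_profiler.py prepare -p <package_name> --days 30

# ... run the app, which starts/pauses/resumes/stops recording through the app_api ...

# Afterwards, collect the generated profiling data files to the host.
$ ./api_profiler.py collect -p <package_name>
```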
## Parse profiling data manually

We can also write Python scripts to parse profiling data manually, by using
[simpleperf_report_lib.py](scripts_reference.md#simpleperf_report_libpy). Examples are
report_sample.py and report_html.py.
# Android platform profiling

[TOC]

## General Tips

Here are some tips for Android platform developers, who build and flash system images on rooted
devices:
1. After running `adb root`, simpleperf can be used to profile any process or system wide.
2. It is recommended to use the latest simpleperf available in AOSP main, if you are not working
   on the current main branch. Scripts are in `system/extras/simpleperf/scripts`, binaries are in
   `system/extras/simpleperf/scripts/bin/android`.
3. It is recommended to use `app_profiler.py` for recording, and `report_html.py` for reporting.
   Below is an example.
```sh
# Record the surfaceflinger process for 10 seconds with a dwarf based call graph. More examples
# are in the scripts reference in the doc.
$ ./app_profiler.py -np surfaceflinger -r "-g --duration 10"

# Generate an html report.
$ ./report_html.py
```

4. Since Android >= O has symbols for system libraries on device, we don't need to use unstripped
   binaries in `$ANDROID_PRODUCT_OUT/symbols` to report call graphs. However, they are needed to
   add source code and disassembly (with line numbers) to the report. Below is an example.

```sh
# Record with app_profiler.py or simpleperf on device, generating perf.data on the host.
$ ./app_profiler.py -np surfaceflinger -r "--call-graph fp --duration 10"

# Collect unstripped binaries from $ANDROID_PRODUCT_OUT/symbols into binary_cache/.
$ ./binary_cache_builder.py -lib $ANDROID_PRODUCT_OUT/symbols

# Report source code and disassembly. Disassembling all binaries is slow, so it's better to add
# the --binary_filter option to only disassemble selected binaries.
$ ./report_html.py --add_source_code --source_dirs $ANDROID_BUILD_TOP --add_disassembly \
  --binary_filter surfaceflinger.so
```
## Start simpleperf from system_server process

Sometimes we want to profile a process or system-wide when a special situation happens. In this
case, we can add code that starts simpleperf at the point where the situation is detected.

1. Disable SELinux by `adb shell setenforce 0`, because SELinux only allows simpleperf to run in
   shell or in debuggable/profileable apps.

2. Add the code below at the point where the special situation is detected.

```java
try {
    // For the capability check.
    Os.prctl(OsConstants.PR_CAP_AMBIENT, OsConstants.PR_CAP_AMBIENT_RAISE,
             OsConstants.CAP_SYS_PTRACE, 0, 0);
    // Write to /data instead of /data/local/tmp, because /data can be written by the system user.
    Runtime.getRuntime().exec("/system/bin/simpleperf record -g -p " + String.valueOf(Process.myPid())
        + " -o /data/perf.data --duration 30 --log-to-android-buffer --log verbose");
} catch (Exception e) {
    Slog.e(TAG, "error while running simpleperf");
    e.printStackTrace();
}
```
## Hardware PMU counter limit

When monitoring instruction and cache related perf events (in the hw/cache/raw/pmu categories of
the list cmd), these events are mapped to PMU counters on each cpu core. But each core only has a
limited number of PMU counters. If the number of events is larger than the number of PMU counters,
then the counters are multiplexed among the events, which probably isn't what we want. We can use
`simpleperf stat --print-hw-counter` to show the hardware counters (per core) available on the
device.

On Pixel devices, the number of PMU counters on each core is usually 7, 4 of which are used by the
kernel to monitor memory latency. So only 3 counters are available. It's fine to monitor up to 3
PMU events at the same time. To monitor more than 3 events, the `--use-devfreq-counters` option
can be used to borrow from the counters used by the kernel.
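For example (a sketch; the event set is arbitrary and `<pid>` is a placeholder):

```sh
# Show how many hardware counters are available per core.
$ simpleperf stat --print-hw-counter

# Monitor no more PMU events than the available counters, to avoid multiplexing.
$ simpleperf stat -e cpu-cycles,instructions,branch-misses -p <pid> --duration 10
```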
## Get boot-time profile

On userdebug/eng devices, we can get a boot-time profile via simpleperf.

Step 1. Customize the configuration if needed. By default, simpleperf tracks all processes
except for itself, starts at `early-init`, and stops when `sys.boot_completed` is set.
You can customize this by changing the trigger or the command line flags in
`system/extras/simpleperf/simpleperf.rc`.

Step 2. Add `androidboot.simpleperf.boot_record=1` to the kernel command line.
For example, on Pixel devices, you can do
```
$ fastboot oem cmdline add androidboot.simpleperf.boot_record=1
```

Step 3. Reboot the device. When booting, init finds that the kernel command line flag is set,
so it forks a background process to run simpleperf to record a boot-time profile.
init starts simpleperf at the `early-init` stage, which is very soon after second-stage init
starts.

Step 4. After boot, the boot-time profile is stored in /tmp/boot_perf.data. Then we can pull
the profile to the host to report it.

```
$ adb shell ls /tmp/boot_perf.data
/tmp/boot_perf.data
```

Following is a boot-time profile example. From the timestamps, the first sample is generated at
about 4.5 seconds after booting.


# Collect ETM data for AutoFDO

[TOC]

## Introduction

ETM is a hardware feature available on arm64 devices. It collects the instruction stream running on
each cpu. ARM uses ETM as an alternative to LBR (last branch record) on x86.
Simpleperf supports collecting ETM data, and converting it to input files for AutoFDO, which can
then be used for PGO (profile-guided optimization) during compilation.

On ARMv8, ETM is considered an external debug interface (unless the ARMv8.4 Self-hosted Trace
extension is implemented). So it needs to be enabled explicitly in the bootloader, and isn't
available on user devices. For Pixel devices, it's available on EVT and DVT devices of Pixel 4,
Pixel 4a (5G) and Pixel 5. To test whether it's available on other devices, you can follow the
commands in this doc and see if you can record any ETM data.
## Examples

Below are examples collecting ETM data for AutoFDO. There are two steps: first, record ETM data;
second, convert the ETM data to AutoFDO input files.

Record ETM data:

```sh
# Preparation: we need to be root to record ETM data.
$ adb root
$ adb shell
redfin:/ # cd data/local/tmp
redfin:/data/local/tmp #

# Do a system wide collection; it writes output to perf.data.
# If you only want ETM data for the kernel, use `-e cs-etm:k`.
# If you only want ETM data for userspace, use `-e cs-etm:u`.
redfin:/data/local/tmp # simpleperf record -e cs-etm --duration 3 -a

# To reduce file size and the time converting to AutoFDO input files, we recommend converting ETM
# data into an intermediate branch-list format.
redfin:/data/local/tmp # simpleperf inject --output branch-list -o branch_list.data
```

Converting ETM data to AutoFDO input files needs to read the binaries. So ETM data for userspace
libraries can be converted on device, while ETM data for the kernel needs to be converted on the
host, with vmlinux and kernel modules available.

Convert ETM data for userspace libraries:

```sh
# Inject ETM data on device. It writes output to perf_inject.data.
# perf_inject.data is a text file, containing branch counts for each library.
redfin:/data/local/tmp # simpleperf inject -i branch_list.data
```

Convert ETM data for the kernel:

```sh
# Pull ETM data to the host.
host $ adb pull /data/local/tmp/branch_list.data
# Download vmlinux and kernel modules to <binary_dir>.
# Host simpleperf is in <aosp-top>/system/extras/simpleperf/scripts/bin/linux/x86_64/simpleperf,
# or you can build simpleperf by `mmma system/extras/simpleperf`.
host $ simpleperf inject --symdir <binary_dir> -i branch_list.data
```
The generated perf_inject.data may contain branch info for multiple binaries. But AutoFDO only
accepts one binary at a time. So we need to split perf_inject.data.
The format of perf_inject.data is below:

```
executed range with count info for binary1
branch with count info for binary1
// name for binary1

executed range with count info for binary2
branch with count info for binary2
// name for binary2

...
```

We need to split perf_inject.data, and make sure each file only contains info for one binary.

Then we can use [AutoFDO](https://github.com/google/autofdo) to create a profile. AutoFDO only
works for binaries having an executable segment as their first loadable segment. But binaries
built in Android may not follow this rule. The simpleperf inject command knows how to work around
this problem. But there is a check in AutoFDO forcing binaries to start with an executable
segment. We need to disable the check in AutoFDO, by commenting out L127-L136 in
https://github.com/google/autofdo/commit/188db2834ce74762ed17108ca344916994640708#diff-2d132ecbb5e4f13e0da65419f6d1759dd27d6b696786dd7096c0c34d499b1710R127-R136.
Then we can use `create_llvm_prof` in AutoFDO to create profiles used by clang.

```sh
# perf_inject_binary1.data is split from perf_inject.data, and only contains branch info for
# binary1.
host $ autofdo/create_llvm_prof -profile perf_inject_binary1.data -profiler text -binary path_of_binary1 -out a.prof -format binary

# perf_inject_kernel.data is split from perf_inject.data, and only contains branch info for
# [kernel.kallsyms].
host $ autofdo/create_llvm_prof -profile perf_inject_kernel.data -profiler text -binary vmlinux -out a.prof -format binary
```

Then we can use a.prof for PGO during compilation, via `-fprofile-sample-use=a.prof`.
[Here](https://clang.llvm.org/docs/UsersManual.html#using-sampling-profilers) are more details.
### A complete example: etm_test_loop.cpp

`etm_test_loop.cpp` is an example showing the complete process.
The source code is in [etm_test_loop.cpp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/etm_test_loop.cpp).
The build script is in [Android.bp](https://android.googlesource.com/platform/system/extras/+/main/simpleperf/runtest/Android.bp).
It builds an executable called `etm_test_loop`, which runs on device.

Step 1: Build the `etm_test_loop` binary.

```sh
(host) <AOSP>$ . build/envsetup.sh
(host) <AOSP>$ lunch aosp_arm64-trunk_staging-userdebug
(host) <AOSP>$ make etm_test_loop
```
Step 2: Run `etm_test_loop` on device, and collect ETM data for its run.

```sh
(host) <AOSP>$ adb push out/target/product/generic_arm64/system/bin/etm_test_loop /data/local/tmp
(host) <AOSP>$ adb root
(host) <AOSP>$ adb shell
(device) / # cd /data/local/tmp
(device) /data/local/tmp # chmod a+x etm_test_loop
(device) /data/local/tmp # simpleperf record -e cs-etm:u ./etm_test_loop
simpleperf I cmd_record.cpp:729] Recorded for 0.0370068 seconds. Start post processing.
simpleperf I cmd_record.cpp:799] Aux data traced: 1689136
(device) /data/local/tmp # simpleperf inject -i perf.data --output branch-list -o branch_list.data
simpleperf W dso.cpp:557] failed to read min virtual address of [vdso]: File not found
(device) /data/local/tmp # exit
(host) <AOSP>$ adb pull /data/local/tmp/branch_list.data
```
Step 3: Convert ETM data to AutoFDO data.

```sh
# Build the simpleperf tool on the host.
(host) <AOSP>$ make simpleperf_ndk
(host) <AOSP>$ simpleperf_ndk64 inject -i branch_list.data -o perf_inject_etm_test_loop.data --symdir out/target/product/generic_arm64/symbols/system/bin
simpleperf W cmd_inject.cpp:505] failed to build instr ranges for binary [vdso]: File not found
(host) <AOSP>$ cat perf_inject_etm_test_loop.data
13
1000-1010:1
1014-1050:1
...
112c->0:1
// /data/local/tmp/etm_test_loop

(host) <AOSP>$ create_llvm_prof -profile perf_inject_etm_test_loop.data -profiler text -binary out/target/product/generic_arm64/symbols/system/bin/etm_test_loop -out etm_test_loop.afdo -format binary
(host) <AOSP>$ ls -lh etm_test_loop.afdo
-rw-r--r-- 1 user group 241 Aug 29 16:04 etm_test_loop.afdo
```
Step 4: Use the AutoFDO data to build an optimized binary.

```sh
(host) <AOSP>$ mkdir toolchain/pgo-profiles/sampling/
(host) <AOSP>$ cp etm_test_loop.afdo toolchain/pgo-profiles/sampling/
(host) <AOSP>$ vi toolchain/pgo-profiles/sampling/Android.bp
# Edit Android.bp to add a fdo_profile module:
#
# soong_namespace {}
#
# fdo_profile {
#     name: "etm_test_loop_afdo",
#     profile: ["etm_test_loop.afdo"],
# }
```

`soong_namespace` is added to support fdo_profile modules with the same name.

In a product config mk file, update `PRODUCT_AFDO_PROFILES` with

```make
PRODUCT_AFDO_PROFILES += etm_test_loop://toolchain/pgo-profiles/sampling:etm_test_loop_afdo
```

```sh
(host) <AOSP>$ vi system/extras/simpleperf/runtest/Android.bp
# Edit Android.bp to enable afdo for etm_test_loop:
#
# cc_binary {
#     name: "etm_test_loop",
#     srcs: ["etm_test_loop.cpp"],
#     afdo: true,
# }
(host) <AOSP>$ make etm_test_loop
```

If we compare the disassembly of `out/target/product/generic_arm64/symbols/system/bin/etm_test_loop`
before and after optimizing with AutoFDO data, we can see different preferences when branching.
## Collect ETM data with a daemon

Android also has a daemon collecting ETM data periodically. It only runs on userdebug and eng
devices. The source code is in https://android.googlesource.com/platform/system/extras/+/main/profcollectd/.
## Support ETM in the kernel

To let simpleperf use the ETM function, we need to enable the Coresight driver in the kernel,
which lives in `<linux_kernel>/drivers/hwtracing/coresight`.

The Coresight driver can be enabled by the kernel configs below:

```config
CONFIG_CORESIGHT=y
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
CONFIG_CORESIGHT_SOURCE_ETM4X=y
```

On kernel 5.10+, we recommend building the Coresight driver as kernel modules, because that works
with the GKI kernel.

```config
CONFIG_CORESIGHT=m
CONFIG_CORESIGHT_LINK_AND_SINK_TMC=m
CONFIG_CORESIGHT_SOURCE_ETM4X=m
```

Android common kernel 5.10+ should have all the Coresight patches needed to collect ETM data.
Android common kernel 5.4 misses two patches. But by adding the patches in
https://android-review.googlesource.com/q/topic:test_etm_on_hikey960_5.4, we can collect ETM data
on hikey960 with the 5.4 kernel.
For Android common kernel 4.14 and 4.19, we have backported all necessary Coresight patches.

Besides the Coresight driver, we also need to add Coresight devices in the device tree. An example
is in https://github.com/torvalds/linux/blob/master/arch/arm64/boot/dts/arm/juno-base.dtsi. There
should be a path flowing ETM data from the ETM device through funnels, ETF and replicators, all
the way to ETR, which writes ETM data to system memory.

One optional flag in the ETM device tree is "arm,coresight-loses-context-with-cpu". It saves ETM
registers when a CPU enters a low power state. It may be needed to avoid a
"coresight_disclaim_device_unlocked" warning when doing system wide collection.

One optional flag in the ETR device tree is "arm,scatter-gather". Simpleperf requests 4M of system
memory for ETR to store ETM data. Without an IOMMU, the memory needs to be contiguous. If the
kernel can't fulfill the request, simpleperf will report an out of memory error. Fortunately, we
can use the "arm,scatter-gather" flag to let ETR run in scatter gather mode, which uses
non-contiguous memory.
### A possible problem: trace_id mismatch

Each CPU has an ETM device, which has a unique trace_id assigned by the kernel.
The formula is: `trace_id = 0x10 + cpu * 2` (so, for example, cpu 2 gets trace_id 0x14), as in
https://github.com/torvalds/linux/blob/master/include/linux/coresight-pmu.h#L37.
If the formula is modified by local patches, then the simpleperf inject command can't parse ETM
data properly and is likely to give empty output.
## Enable ETM in the bootloader

Unless the ARMv8.4 Self-hosted Trace extension is implemented, ETM is considered an external debug
interface. It may be disabled by fuse (like JTAG). So we need to check whether ETM is disabled,
and whether the bootloader provides a way to reenable it.

We can tell if ETM is disabled by checking its TRCAUTHSTATUS register, which is exposed in sysfs,
like /sys/bus/coresight/devices/coresight-etm0/mgmt/trcauthstatus. To reenable ETM, we need to
enable non-Secure non-invasive debug on the ARM CPU. The method depends on the chip vendor (SoC).
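For example, reading the register for cpu0 (a sketch; the sysfs path may vary by device):

```sh
# Check the ETM authentication status; consult the ETM architecture specification
# for the meaning of the individual bits.
$ adb shell cat /sys/bus/coresight/devices/coresight-etm0/mgmt/trcauthstatus
```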
## Related docs

* [Arm Architecture Reference Manual Armv8, D3 AArch64 Self-hosted Trace](https://developer.arm.com/documentation/ddi0487/latest)
* [ARM ETM Architecture Specification](https://developer.arm.com/documentation/ihi0064/latest/)
* [ARM CoreSight Architecture Specification](https://developer.arm.com/documentation/ihi0029/latest)
* [CoreSight Components Technical Reference Manual](https://developer.arm.com/documentation/ddi0314/h/)
* [CoreSight Trace Memory Controller Technical Reference Manual](https://developer.arm.com/documentation/ddi0461/b/)
* [OpenCSD library for decoding ETM data](https://github.com/Linaro/OpenCSD)
* [AutoFDO tool for converting profile data](https://github.com/google/autofdo)
# Debug dwarf unwinding

Dwarf unwinding is the default way of getting call graphs in simpleperf. In this process,
simpleperf asks the kernel to add stack and register data to each sample. Then it uses
[libunwindstack](https://cs.android.com/android/platform/superproject/+/main:system/unwinding/libunwindstack/)
to unwind the call stack. libunwindstack uses dwarf sections (like .debug_frame or .eh_frame) in
elf files to know how to unwind the stack.

By default, `simpleperf record` unwinds a sample before saving it to disk, to reduce the space
consumed by stack data. But this behavior makes it harder to reproduce unwinding problems. So we
added the debug-unwind command, to help debug and profile dwarf unwinding. Below are two use
cases.

[TOC]
## Debug failed unwinding cases

Unwinding a sample can fail for different reasons: not enough stack or register data, unknown
thread maps, no dwarf info, bugs in code, etc. To fix them, we need to get error details and be
able to reproduce them. The simpleperf record cmd has two options for this:
`--keep-failed-unwinding-result` keeps the error code for failed unwinding samples. It's
lightweight and gives us a brief idea of why unwinding stops.
`--keep-failed-unwinding-debug-info` keeps stack and register data for failed unwinding samples.
It can be used to reproduce the unwinding process given the proper elf files. Below is an example.

```sh
# Run the record cmd and keep failed unwinding debug info.
$ simpleperf64 record --app com.example.android.displayingbitmaps -g --duration 10 \
    --keep-failed-unwinding-debug-info
...
simpleperf I cmd_record.cpp:762] Samples recorded: 22026. Samples lost: 0.

# Generate a text report containing failed unwinding cases.
$ simpleperf debug-unwind --generate-report -o report.txt

# Pull report.txt to the host and show it using debug_unwind_reporter.py.
# Show a summary.
$ debug_unwind_reporter.py -i report.txt --summary
# Show a summary of samples failed at a symbol.
$ debug_unwind_reporter.py -i report.txt --summary --include-end-symbol SocketInputStream_socketRead0
# Show details of samples failed at a symbol.
$ debug_unwind_reporter.py -i report.txt --include-end-symbol SocketInputStream_socketRead0

# Reproduce unwinding a failed case.
$ simpleperf debug-unwind --unwind-sample --sample-time 256666343213301

# Generate a test file containing a failed case and the elf files for debugging it.
$ simpleperf debug-unwind --generate-test-file --sample-time 256666343213301 --keep-binaries-in-test-file \
    /apex/com.android.runtime/lib64/bionic/libc.so,/apex/com.android.art/lib64/libopenjdk.so -o test.data
```
## Profile unwinding process

We can also record samples without unwinding them. Then we can use the debug-unwind cmd to unwind
the samples after recording. Below is an example.

```sh
# Record samples without unwinding them.
$ simpleperf record --app com.example.android.displayingbitmaps -g --duration 10 \
    --no-unwind
...
simpleperf I cmd_record.cpp:762] Samples recorded: 9923. Samples lost: 0.

# Use the debug-unwind cmd to unwind samples.
$ simpleperf debug-unwind --unwind-sample
```

We can profile the unwinding process itself, to find hot functions to improve.

```sh
# Profile the debug-unwind cmd.
$ simpleperf record -g -o perf_unwind.data simpleperf debug-unwind --unwind-sample --skip-sample-print

# Then pull perf_unwind.data and report it.
$ report_html.py -i perf_unwind.data

# We can also add source code annotation in report.html.
$ binary_cache_builder.py -i perf_unwind.data -lib <path to aosp-main>/out/target/product/<device-name>/symbols/system
$ report_html.py -i perf_unwind.data --add_source_code --source_dirs <path to aosp-main>/system/
```

# Executable commands reference

[TOC]

## How simpleperf works

Modern CPUs have a hardware component called the performance monitoring unit (PMU). The PMU has
several hardware counters, counting events like how many cpu cycles have happened, how many
instructions have executed, or how many cache misses have happened.

The Linux kernel wraps these hardware counters into hardware perf events. In addition, the Linux
kernel also provides hardware independent software events and tracepoint events. The Linux kernel
exposes all events to userspace via the perf_event_open system call, which is used by simpleperf.

Simpleperf has three main commands: stat, record and report.

The stat command gives a summary of how many events have happened in the profiled processes in a
time period. Here’s how it works:
1. Given user options, simpleperf enables profiling by making a system call to the kernel.
2. The kernel enables counters while the profiled processes are running.
3. After profiling, simpleperf reads counters from the kernel, and reports a counter summary.

The record command records samples of the profiled processes in a time period. Here’s how it works:
1. Given user options, simpleperf enables profiling by making a system call to the kernel.
2. Simpleperf creates mapped buffers between simpleperf and the kernel.
3. The kernel enables counters while the profiled processes are running.
4. Each time a given number of events happen, the kernel dumps a sample to the mapped buffers.
5. Simpleperf reads samples from the mapped buffers and stores profiling data in a file called
   perf.data.

The report command reads perf.data and any shared libraries used by the profiled processes,
and outputs a report showing where the time was spent.
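
As a minimal end-to-end sketch of how the three commands fit together (the pid and durations here
are placeholders):

```sh
# Count default events in process 7394 for 10 seconds.
$ simpleperf stat -p 7394 --duration 10

# Sample process 7394 for 10 seconds, writing samples to perf.data.
$ simpleperf record -p 7394 --duration 10

# Report where the sampled time was spent.
$ simpleperf report
```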

## Commands

Simpleperf supports several commands, listed below:

```
The debug-unwind command: debug/test dwarf based offline unwinding, used for debugging simpleperf.
The dump command: dumps content in perf.data, used for debugging simpleperf.
The help command: prints help information for other commands.
The kmem command: collects kernel memory allocation information (will be replaced by Python scripts).
The list command: lists all event types supported on the Android device.
The record command: profiles processes and stores profiling data in perf.data.
The report command: reports profiling data in perf.data.
The report-sample command: reports each sample in perf.data, used for supporting integration of
                           simpleperf in Android Studio.
The stat command: profiles processes and prints a counter summary.
```

Each command supports different options, which can be seen in its help message.

```sh
# List all commands.
$ simpleperf --help

# Print the help message for the record command.
$ simpleperf record --help
```

Below are the most frequently used commands: list, stat, record and report.

## The list command

The list command lists all events available on the device. Different devices may support different
events because they have different hardware and kernels.

```sh
$ simpleperf list
List of hw-cache events:
  branch-loads
  ...
List of hardware events:
  cpu-cycles
  instructions
  ...
List of software events:
  cpu-clock
  task-clock
  ...
```

On ARM/ARM64, the list command also shows a list of raw events; they are the events supported by
the ARM PMU on the device. The kernel has wrapped part of them into hardware events and hw-cache
events. For example, raw-cpu-cycles is wrapped into cpu-cycles, and raw-instruction-retired is
wrapped into instructions. The raw events are provided in case we want to use an event supported
on the device that is unfortunately not wrapped by the kernel.
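
For example (a sketch; raw event names differ between devices, so check the output of
`simpleperf list` on your own device first), a raw event can be passed to -e just like a wrapped
event:

```sh
# Count the raw event behind cpu-cycles directly.
$ simpleperf stat -e raw-cpu-cycles -p 11904 --duration 10
```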

## The stat command

The stat command is used to get event counter values of the profiled processes. By passing options,
we can select which events to use, which processes/threads to monitor, how long to monitor and the
print interval.

```sh
# Stat using default events (cpu-cycles,instructions,...), and monitor process 7394 for 10 seconds.
$ simpleperf stat -p 7394 --duration 10
Performance counter statistics:

#           count  event_name                # count / runtime
       16,513,564  cpu-cycles                # 1.612904 GHz
        4,564,133  stalled-cycles-frontend   # 341.490 M/sec
        6,520,383  stalled-cycles-backend    # 591.666 M/sec
        4,900,403  instructions              # 612.859 M/sec
           47,821  branch-misses             # 6.085 M/sec
     25.274251(ms) task-clock                # 0.002520 cpus used
                4  context-switches          # 158.264 /sec
              466  page-faults               # 18.438 K/sec

Total test time: 10.027923 seconds.
```

### Select events to stat

We can select which events to use via -e.

```sh
# Stat the event cpu-cycles.
$ simpleperf stat -e cpu-cycles -p 11904 --duration 10

# Stat the events cache-references and cache-misses.
$ simpleperf stat -e cache-references,cache-misses -p 11904 --duration 10
```

When running the stat command, if the number of hardware events is larger than the number of
hardware counters available in the PMU, the kernel shares hardware counters between events, so each
event is only monitored for part of the total time. As a result, the number of events shown is
smaller than the number of events that actually happened. The following is an example.

```sh
# Stat using events cache-references, cache-references:u,....
$ simpleperf stat -p 7394 -e cache-references,cache-references:u,cache-references:k \
      -e cache-misses,cache-misses:u,cache-misses:k,instructions --duration 1
Performance counter statistics:

#         count  event_name           # count / runtime
        490,713  cache-references     # 151.682 M/sec
        899,652  cache-references:u   # 130.152 M/sec
        855,218  cache-references:k   # 111.356 M/sec
         61,602  cache-misses         # 7.710 M/sec
         33,282  cache-misses:u       # 5.050 M/sec
         11,662  cache-misses:k       # 4.478 M/sec
              0  instructions         #

Total test time: 1.000867 seconds.
simpleperf W cmd_stat.cpp:946] It seems the number of hardware events are more than the number of
  available CPU PMU hardware counters. That will trigger hardware counter
  multiplexing. As a result, events are not counted all the time processes
  running, and event counts are smaller than what really happens.
  Use --print-hw-counter to show available hardware counters.
```

In the example above, we monitor 7 events, and each event is only monitored for part of the total
time. We can tell because the count of cache-references is smaller than the counts of
cache-references:u (cache-references in userspace only) and cache-references:k (cache-references
in kernel only), and the count of instructions is zero. After printing the result, simpleperf
checks whether the CPUs have enough hardware counters to count the hardware events at the same
time. If not, it prints a warning.

To avoid hardware counter multiplexing, we can use `simpleperf stat --print-hw-counter` to show
the available counters on each CPU, and then avoid monitoring more hardware events than the
counters available.

```sh
$ simpleperf stat --print-hw-counter
There are 2 CPU PMU hardware counters available on cpu 0.
There are 2 CPU PMU hardware counters available on cpu 1.
There are 2 CPU PMU hardware counters available on cpu 2.
There are 2 CPU PMU hardware counters available on cpu 3.
There are 2 CPU PMU hardware counters available on cpu 4.
There are 2 CPU PMU hardware counters available on cpu 5.
There are 2 CPU PMU hardware counters available on cpu 6.
There are 2 CPU PMU hardware counters available on cpu 7.
```

When counter multiplexing happens, there is no guarantee of which events will be monitored at
which time. If we want to ensure that some events are always monitored at the same time, we can
use `--group`.

```sh
# Stat using events cache-references, cache-references:u,... in groups.
$ simpleperf stat -p 7964 --group cache-references,cache-misses \
      --group cache-references:u,cache-misses:u --group cache-references:k,cache-misses:k \
      --duration 1
Performance counter statistics:

#         count  event_name           # count / runtime
      2,088,463  cache-references     # 181.360 M/sec
         47,871  cache-misses         # 2.292164% miss rate
      1,277,600  cache-references:u   # 136.419 M/sec
         25,977  cache-misses:u       # 2.033265% miss rate
        326,305  cache-references:k   # 74.724 M/sec
         13,596  cache-misses:k       # 4.166654% miss rate

Total test time: 1.029729 seconds.
simpleperf W cmd_stat.cpp:946] It seems the number of hardware events are more than the number of
...
```

### Select target to stat

We can select which processes or threads to monitor via -p or -t. Monitoring a
process is the same as monitoring all threads in the process. Simpleperf can also fork a child
process to run a new command and then monitor the child process.

```sh
# Stat processes 11904 and 11905.
$ simpleperf stat -p 11904,11905 --duration 10

# Stat processes with names containing "chrome".
$ simpleperf stat -p chrome --duration 10
# Stat processes with names partially matching the regex "chrome:(privileged|sandboxed)".
$ simpleperf stat -p "chrome:(privileged|sandboxed)" --duration 10

# Stat threads 11904 and 11905.
$ simpleperf stat -t 11904,11905 --duration 10

# Start a child process running `ls`, and stat it.
$ simpleperf stat ls

# Stat the process of an Android application. On non-root devices, this only works for apps that
# are debuggable or profileable from shell.
$ simpleperf stat --app simpleperf.example.cpp --duration 10

# Stat only the selected thread 11904 in an app.
$ simpleperf stat --app simpleperf.example.cpp -t 11904 --duration 10

# Stat system wide using -a.
$ simpleperf stat -a --duration 10
```

### Decide how long to stat

When monitoring existing threads, we can use --duration to decide how long to monitor. When
monitoring a child process running a new command, simpleperf monitors until the child process ends.
In this case, we can use Ctrl-C to stop monitoring at any time.

```sh
# Stat process 11904 for 10 seconds.
$ simpleperf stat -p 11904 --duration 10

# Stat until the child process running `ls` finishes.
$ simpleperf stat ls

# Stop monitoring using Ctrl-C.
$ simpleperf stat -p 11904 --duration 10
^C
```

If you want to write a script to control how long to monitor, you can send one of the SIGINT,
SIGTERM and SIGHUP signals to simpleperf to stop monitoring.
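
A minimal sketch of that approach in a shell script (the pid is a placeholder):

```sh
# Start stat in the background, then stop it with SIGINT after 5 seconds.
$ simpleperf stat -p 11904 --duration 100 &
$ sleep 5
$ kill -INT $!
```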

### Decide the print interval

When monitoring perf counters, we can also use --interval to decide the print interval.

```sh
# Print stat for process 11904 every 300ms.
$ simpleperf stat -p 11904 --duration 10 --interval 300

# Print system wide stat at an interval of 300ms for 10 seconds. Note that system wide profiling
# needs root privilege.
$ su 0 simpleperf stat -a --duration 10 --interval 300
```

### Display counters in systrace

Simpleperf can also work with systrace to dump counters in the collected trace. Below is an example
of doing a system wide stat.

```sh
# Capture instructions (kernel only) and cache misses with an interval of 300 milliseconds for 15
# seconds.
$ su 0 simpleperf stat -e instructions:k,cache-misses -a --interval 300 --duration 15
# On the host, launch systrace to collect a trace for 10 seconds.
(HOST)$ external/chromium-trace/systrace.py --time=10 -o new.html sched gfx view
# Open the collected new.html in a browser, and the perf counters will be shown.
```

### Show event count per thread

By default, the stat command outputs an event count sum for all monitored targets. But when the
`--per-thread` option is used, the stat command outputs an event count for each thread in the
monitored targets. It can be used to find busy threads in a process or system wide. With the
`--per-thread` option, the stat command opens a perf_event_file for each existing thread. If a
monitored thread creates new threads, event counts for the new threads are added to the monitored
thread by default, or omitted if the `--no-inherit` option is also used.

```sh
# Print event counts for each thread in process 11904. Event counts for threads created after
# the stat command starts will be added to the threads creating them.
$ simpleperf stat --per-thread -p 11904 --duration 1

# Print event counts every 1s for all threads running in the system. Threads not running will not
# be reported.
$ su 0 simpleperf stat --per-thread -a --interval 1000 --interval-only-values

# Print event counts every 1s for all threads running in the system. Event counts for threads
# created after the stat command starts will be omitted.
$ su 0 simpleperf stat --per-thread -a --interval 1000 --interval-only-values --no-inherit
```

### Show event count per core

By default, the stat command outputs an event count sum for all monitored cpu cores. But when the
`--per-core` option is used, the stat command outputs an event count for each core. It can be used
to see how events are distributed on different cores.

When stating non-system wide with the `--per-core` option, simpleperf creates a perf event for
each monitored thread on each core. When a thread is in running state, perf events on all cores
are enabled, but only the perf event on the core running the thread is in running state. So the
percentage comment shows runtime_on_a_core / runtime_on_all_cores. Note that the percentage is
still affected by hardware counter multiplexing. Check the simpleperf log output for ways to
distinguish it.

```sh
# Print event counts for each cpu running threads in process 1057.
# A percentage shows runtime_on_a_cpu / runtime_on_all_cpus.
$ simpleperf stat -e cpu-cycles --per-core -p 1057 --duration 3
Performance counter statistics:

# cpu          count  event_name   # count / runtime
  0        1,667,660  cpu-cycles   # 1.571565 GHz
  1        3,850,440  cpu-cycles   # 1.736958 GHz
  2        2,463,792  cpu-cycles   # 1.701367 GHz
  3        2,350,528  cpu-cycles   # 1.700841 GHz
  5        7,919,520  cpu-cycles   # 2.377081 GHz
  6      105,622,673  cpu-cycles   # 2.381331 GHz

Total test time: 3.002703 seconds.

# Print event counts for each cpu system wide.
$ su 0 simpleperf stat --per-core -a --duration 1

# Print cpu-cycles event counts for each cpu for each thread running in the system.
$ su 0 simpleperf stat -e cpu-cycles -a --per-thread --per-core --duration 1
```

### Monitor different events on different cores

Android devices usually have big and little cores. Different cores may support different events.
Therefore, we may want to monitor different events on different cores. We can do this using
the `--cpu` option. The `--cpu` option selects the cores on which to monitor events. A `--cpu`
option affects all following events until another `--cpu` option appears. The first `--cpu`
option also affects all events before it. Following are some examples:

```sh
# By default, cpu-cycles and instructions are monitored on all cpus.
$ su 0 simpleperf stat -e cpu-cycles,instructions -a --duration 1 --per-core

# Use one `--cpu` option to monitor cpu-cycles and instructions only on cpu 0-3,8.
$ su 0 simpleperf stat -e cpu-cycles --cpu 0-3,8 -e instructions -a --duration 1 --per-core

# Use two `--cpu` options to monitor raw-l3d-cache-refill-rd on cpu 0-3, and raw-l3d-cache-refill
# on cpu 4-8.
$ su 0 simpleperf stat --cpu 0-3 -e raw-l3d-cache-refill-rd --cpu 4-8 -e raw-l3d-cache-refill \
      -a --duration 1 --per-core
```

## The record command

The record command is used to dump samples of the profiled processes. Each sample can contain
information like the time at which the sample was generated, the number of events since the last
sample, the program counter of a thread, and the call chain of a thread.

By passing options, we can select which events to use, which processes/threads to monitor,
what frequency to dump samples at, how long to monitor, and where to store samples.

```sh
# Record process 7394 for 10 seconds, using the default event (cpu-cycles) and the default sample
# frequency (4000 samples per second), writing records to perf.data.
$ simpleperf record -p 7394 --duration 10
simpleperf I cmd_record.cpp:316] Samples recorded: 21430. Samples lost: 0.
```

### Select events to record

By default, the cpu-cycles event is used to evaluate consumed cpu cycles. But we can also use other
events via -e.

```sh
# Record using the instructions event.
$ simpleperf record -e instructions -p 11904 --duration 10

# Record using task-clock, which shows the elapsed CPU time in nanoseconds.
$ simpleperf record -e task-clock -p 11904 --duration 10
```

### Select target to record

The way to select the target in the record command is similar to that in the stat command.

```sh
# Record processes 11904 and 11905.
$ simpleperf record -p 11904,11905 --duration 10

# Record processes with names containing "chrome".
$ simpleperf record -p chrome --duration 10
# Record processes with names partially matching the regex "chrome:(privileged|sandboxed)".
$ simpleperf record -p "chrome:(privileged|sandboxed)" --duration 10

# Record threads 11904 and 11905.
$ simpleperf record -t 11904,11905 --duration 10

# Record a child process running `ls`.
$ simpleperf record ls

# Record the process of an Android application. On non-root devices, this only works for apps that
# are debuggable or profileable from shell.
$ simpleperf record --app simpleperf.example.cpp --duration 10

# Record only the selected thread 11904 in an app.
$ simpleperf record --app simpleperf.example.cpp -t 11904 --duration 10

# Record system wide.
$ simpleperf record -a --duration 10
```

### Set the frequency to record

We can set the frequency at which to dump records via -f or -c. For example, -f 4000 means
dumping approximately 4000 records every second while the monitored thread runs. If a monitored
thread runs 0.2s in one second (it can be preempted or blocked at other times), simpleperf dumps
about 4000 * 0.2 / 1.0 = 800 records every second. Another way is using -c. For example, -c 10000
means dumping one record whenever 10000 events happen.

```sh
# Record with sample frequency 1000: sample 1000 times every second of running time.
$ simpleperf record -f 1000 -p 11904,11905 --duration 10

# Record with sample period 100000: dump one sample every 100000 events.
$ simpleperf record -c 100000 -t 11904,11905 --duration 10
```

To avoid spending too much time generating samples, kernels >= 3.10 set a max percent of cpu time
used for generating samples (25% by default), and decrease the max allowed sample frequency when
hitting that limit. Simpleperf uses the --cpu-percent option to adjust it, but that needs either
root privilege or Android >= Q.

```sh
# Record with sample frequency 1000, with the max allowed cpu percent set to 50%.
$ simpleperf record -f 1000 -p 11904,11905 --duration 10 --cpu-percent 50
```

### Decide how long to record

The way to decide how long to monitor in the record command is similar to that in the stat command.

```sh
# Record process 11904 for 10 seconds.
$ simpleperf record -p 11904 --duration 10

# Record until the child process running `ls` finishes.
$ simpleperf record ls

# Stop monitoring using Ctrl-C.
$ simpleperf record -p 11904 --duration 10
^C
```

If you want to write a script to control how long to monitor, you can send one of the SIGINT,
SIGTERM and SIGHUP signals to simpleperf to stop monitoring, as shown for the stat command above.

### Set the path to store profiling data

By default, simpleperf stores profiling data in perf.data in the current directory. The path
can be changed using -o.

```sh
# Write records to data/perf2.data.
$ simpleperf record -p 11904 -o data/perf2.data --duration 10
```

#### Record call graphs

A call graph is a tree showing function call relations. Below is an example.

```
main() {
    FunctionOne();
    FunctionTwo();
}
FunctionOne() {
    FunctionTwo();
    FunctionThree();
}
a call graph:
    main-> FunctionOne
       |    |
       |    |-> FunctionTwo
       |    |-> FunctionThree
       |
       |-> FunctionTwo
```

A call graph shows how a function calls other functions, and a reversed call graph shows how
a function is called by other functions. To show a call graph, we need to first record it, then
report it.

There are two ways to record a call graph: recording a dwarf based call graph, or recording a
stack frame based call graph. Recording dwarf based call graphs needs debug information in native
binaries, while recording stack frame based call graphs needs stack frame registers.

```sh
# Record a dwarf based call graph.
$ simpleperf record -p 11904 -g --duration 10

# Record a stack frame based call graph.
$ simpleperf record -p 11904 --call-graph fp --duration 10
```

[Here](README.md#suggestions-about-recording-call-graphs) are some suggestions about recording call graphs.

### Record both on CPU time and off CPU time

Simpleperf is a CPU profiler: it generates samples for a thread only when the thread is running on
a CPU. But sometimes we want to know where the thread time is spent off-cpu (like being preempted
by other threads, blocked in IO, or waiting for some events). To support this, simpleperf added a
--trace-offcpu option to the record command. When --trace-offcpu is used, simpleperf does the
following things:

1) Only the cpu-clock/task-clock event is allowed to be used with --trace-offcpu. This lets
simpleperf generate on-cpu samples for the cpu-clock event.
2) Simpleperf also monitors the sched:sched_switch event, which generates a sched_switch sample
each time the monitored thread is scheduled off a cpu.
3) Simpleperf also records context switch records, so it knows when the thread is scheduled back
on a cpu.

The samples and context switch records collected by simpleperf for a thread are shown below:



Here we have two types of samples:
1) on-cpu samples generated for the cpu-clock event. The period value in each sample means how many
nanoseconds are spent on cpu (for the callchain of this sample).
2) off-cpu (sched_switch) samples generated for the sched:sched_switch event. The period value is
calculated as **Timestamp of the next switch on record** minus **Timestamp of the current sample**
by simpleperf. So the period value in each sample means how many nanoseconds are spent off cpu
(for the callchain of this sample).

**note**: In reality, switch on records and samples may be lost. To mitigate the loss of accuracy,
we calculate the period of an off-cpu sample as **Timestamp of the next switch on record or
sample** minus **Timestamp of the current sample**.

When reporting via Python scripts, simpleperf_report_lib.py provides the SetTraceOffCpuMode()
method to control how to report the samples:
1) on-cpu mode: only report on-cpu samples.
2) off-cpu mode: only report off-cpu samples.
3) on-off-cpu mode: report both on-cpu and off-cpu samples, which can be split by event name.
4) mixed-on-off-cpu mode: report on-cpu and off-cpu samples under the same event name.

If not set, the mixed-on-off-cpu mode is used for reporting.

When using report_html.py, inferno and report_sample.py, the report mode can be set by the
--trace-offcpu option.

Below are some examples of recording and reporting trace offcpu profiles.

```sh
# Check if --trace-offcpu is supported by the kernel (should be available on kernel >= 4.2).
$ simpleperf list --show-features
trace-offcpu
...

# Record with --trace-offcpu.
$ simpleperf record -g -p 11904 --duration 10 --trace-offcpu -e cpu-clock

# Record system wide with --trace-offcpu.
$ simpleperf record -a -g --duration 3 --trace-offcpu -e cpu-clock

# Record with --trace-offcpu using app_profiler.py.
$ ./app_profiler.py -p com.google.samples.apps.sunflower \
    -r "-g -e cpu-clock:u --duration 10 --trace-offcpu"

# Report on-cpu samples.
$ ./report_html.py --trace-offcpu on-cpu
# Report off-cpu samples.
$ ./report_html.py --trace-offcpu off-cpu
# Report on-cpu and off-cpu samples under different event names.
$ ./report_html.py --trace-offcpu on-off-cpu
# Report on-cpu and off-cpu samples under the same event name.
$ ./report_html.py --trace-offcpu mixed-on-off-cpu
```

## The report command

The report command is used to report profiling data generated by the record command. The report
contains a table of sample entries. Each sample entry is a row in the report. The report command
groups samples belonging to the same process, thread, library and function into the same sample
entry, then sorts the sample entries based on the event count each sample entry has.

By passing options, we can decide how to filter out uninteresting samples, how to group samples
into sample entries, and where to find profiling data and binaries.

Below is an example. Records are grouped into 4 sample entries; each entry is a row. There are
several columns; each column shows a piece of information belonging to a sample entry. The first
column is Overhead, which shows the percentage of events inside the current sample entry out of
total events. As the perf event is cpu-cycles, the overhead is the percentage of CPU cycles used
in each function.

```sh
# Report perf.data, using only records sampled in libsudo-game-jni.so, grouping records using
# thread name(comm), process id(pid), thread id(tid) and function name(symbol), and showing the
# sample count for each row.
$ simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so \
      --sort comm,pid,tid,symbol -n
Cmdline: /data/data/com.example.sudogame/simpleperf record -p 7394 --duration 10
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 28235
Event count: 546356211

Overhead  Sample  Command   Pid   Tid   Symbol
59.25%    16680   sudogame  7394  7394  checkValid(Board const&, int, int)
20.42%    5620    sudogame  7394  7394  canFindSolution_r(Board&, int, int)
13.82%    4088    sudogame  7394  7394  randomBlock_r(Board&, int, int, int, int, int)
6.24%     1756    sudogame  7394  7394  @plt
```

### Set the path to read profiling data

By default, the report command reads profiling data from perf.data in the current directory.
The path can be changed using -i.

```sh
$ simpleperf report -i data/perf2.data
```

### Set the path to find binaries

To report function symbols, simpleperf needs to read the executable binaries used by the monitored
processes to get symbol tables and debug information. By default, the paths are those of the
executable binaries used by the monitored processes while recording. However, these binaries may
not exist when reporting, or may not contain symbol tables and debug information. So we can use
--symfs to redirect the paths.

```sh
# In this case, when simpleperf wants to read the executable binary /A/b, it reads the file /A/b.
$ simpleperf report

# In this case, when simpleperf wants to read the executable binary /A/b, it prefers the file
# /debug_dir/A/b to the file /A/b.
$ simpleperf report --symfs /debug_dir

# Read symbols for system libraries built locally. Note that this is not needed since Android O,
# which ships symbols for system libraries on device.
$ simpleperf report --symfs $ANDROID_PRODUCT_OUT/symbols
```

### Filter samples

When reporting, not all records are of interest. The report command supports four filters to
select samples of interest.

```sh
# Report records in threads having the name sudogame.
$ simpleperf report --comms sudogame

# Report records in process 7394 or 7395.
$ simpleperf report --pids 7394,7395

# Report records in thread 7394 or 7395.
$ simpleperf report --tids 7394,7395

# Report records in libsudo-game-jni.so.
$ simpleperf report --dsos /data/app/com.example.sudogame-2/lib/arm64/libsudo-game-jni.so
```

### Group samples into sample entries

The report command uses --sort to decide how to group sample entries.

```sh
# Group records based on their process id: records having the same process id are in the same
# sample entry.
$ simpleperf report --sort pid

# Group records based on their thread id and thread comm: records having the same thread id and
# thread name are in the same sample entry.
$ simpleperf report --sort tid,comm

# Group records based on their binary and function: records in the same binary and function are in
# the same sample entry.
$ simpleperf report --sort dso,symbol

# Default option: --sort comm,pid,tid,dso,symbol. Group records in the same thread, and belonging
# to the same function in the same binary.
$ simpleperf report
```

#### Report call graphs

To report a call graph, please make sure the profiling data is recorded with call graphs,
as [here](#record-call-graphs).

```sh
$ simpleperf report -g
```
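
A reversed call graph (showing how a function is called by others) can typically be requested by
passing an order argument to -g; the accepted orders vary by simpleperf version, so check
`simpleperf report --help` first:

```sh
# Report a caller-based (reversed) call graph.
$ simpleperf report -g caller
```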

# Inferno



[TOC]

## Description

Inferno is a flamegraph generator for native (C/C++) Android apps. It was
originally written to profile and improve surfaceflinger performance
(the Android compositor), but it can be used for any native Android application.
You can see a sample report generated with Inferno
[here](./report.html). Reports are self-contained in HTML, so they can be
exchanged easily.

Notice that there is no concept of time in a flame graph, since all callstacks are
merged together. As a result, the width of a flamegraph represents 100% of
the number of samples, and the height is related to the number of functions on
the stack when sampling occurred.



In the flamegraph featured above, you can see the main thread of SurfaceFlinger.
It is immediately apparent that most of the CPU time is spent processing messages
in `android::SurfaceFlinger::onMessageReceived`. The most expensive task is asking
the screen to be refreshed, as `android::DisplayDevice::prepare` shows in orange.
This graphic division helps to see what part of the program is costly and
where a developer's effort to improve performance should go.

## Example of bottleneck

A flamegraph gives you an instant view of the CPU cycle cost centers, but
it can also be used to find specific offenders. To find them, look for
plateaus. It is easier to see in an example:



In the previous flamegraph, two
plateaus (due to `android::BufferQueueCore::validateConsistencyLocked`)
are immediately apparent.

## How it works

Inferno relies on simpleperf to record the callstack of a native application
thousands of times per second. Simpleperf takes care of unwinding the stack,
either using frame pointers (recommended) or dwarf. At the end of the recording,
`simpleperf` also symbolizes all IPs automatically. The records are aggregated
and dumped to a file, `perf.data`. This file is pulled from the Android device
and processed on the host by Inferno. The callstacks are merged together to
visualize in which parts of an app the CPU cycles are spent.

## How to use it

Open a terminal and, from the `simpleperf/scripts` directory, type:
```
./inferno.sh    (on Linux/Mac)
inferno.bat     (on Windows)
```

Inferno will collect data, process it and automatically open your web browser
to display the HTML report.

## Parameters

You can select how long to sample for, the color of the nodes and many other
things. Use `-h` to get a list of all supported parameters.

```
./inferno.sh -h
```

## Troubleshooting

### Messy flame graph

A healthy flame graph features a single call site at its base (see [here](./report.html)).
If you don't see a unique call site like `_start` or `_start_thread` at the base
from which all flames originate, something went wrong: stack unwinding may have
failed to reach the root callsite. These incomplete
callstacks are impossible to merge properly. By default, Inferno asks
`simpleperf` to unwind the stack via the kernel and frame pointers. Try to
perform unwinding with dwarf via `-du`; you can further tune this setting.


### No flames

If you see no flames at all, or a mess of 1-level flames without a common base,
this may be because you compiled without frame pointers. Make sure there is no
`-fomit-frame-pointer` in your build config. Alternatively, ask simpleperf to
collect data with dwarf unwinding via `-du`.



### High percentage of lost samples

If simpleperf reports a lot of lost samples, it is probably because you are
unwinding with `dwarf`. Dwarf unwinding involves copying the stack before it is
processed. Try to use frame pointer unwinding, which can be done by the kernel
and is much faster.

The cost of frame pointers is negligible on arm64 but considerable
on the 32-bit arm arch (due to register pressure). Use a 64-bit build for better
profiling.

### run-as: package not debuggable

If you cannot run as root, make sure the app is debuggable; otherwise simpleperf
will not be able to profile it.

# JIT symbols

[TOC]

## Java JIT symbols

On Android >= P, simpleperf supports profiling Java code, no matter whether it is executed by
the interpreter, JITed, or compiled into native instructions. So you don't need to do anything.

For details on Android O and N, see
[android_application_profiling.md](./android_application_profiling.md#prepare-an-android-application).

## Generic JIT symbols

Simpleperf supports picking up symbols from per-pid symbol map files, somewhat similar to what
the Linux kernel perf tool does. Applications should create those files at specific locations.

### Symbol map file location for application

An application should create symbol map files in its data directory.

For example, process `123` of application `foo.bar.baz` should create
`/data/data/foo.bar.baz/perf-123.map`.

### Symbol map file location for standalone program

Standalone programs should create symbol map files in `/data/local/tmp`.

For example, standalone program process `123` should create `/data/local/tmp/perf-123.map`.

### Symbol map file format

The symbol map file is a text file.

Every line describes a new symbol. The line format is:
```
<symbol-absolute-address> <symbol-size> <symbol-name>
```

For example:
```
0x10000000 0x16  jit_symbol_one
0x20000000 0x332 jit_symbol_two
0x20002004 0x8   jit_symbol_three
```

All characters after the symbol size and until the end of the line are parsed as the symbol name,
with leading and trailing spaces removed. This means spaces are allowed in symbol names themselves.
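
Tying the location and format together: for a hypothetical standalone JIT process with pid `123`
that emitted one function at address 0x10000000 with size 0x16, the map file would look like this
(values taken from the example above):

```
$ cat /data/local/tmp/perf-123.map
0x10000000 0x16 jit_symbol_one
```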

### Known issues

The current implementation gets confused if memory pages where JIT symbols reside are reused by
mapping a file either before or after.

For example, if memory pages were first used by `dlopen("libfoo.so")`, then freed by `dlclose`,
then allocated for JIT symbols - simpleperf will report symbols from `libfoo.so` instead.

# Sample Filter

Sometimes we want to report samples only for selected processes, threads, libraries, or time
ranges. To filter samples, we can pass filter options to the report commands or scripts.


## Filter file format

To filter samples based on time ranges, simpleperf accepts a filter file when reporting. The filter
file is in text format, containing a list of lines. Each line is a filter command. The filter file
can be generated by `sample_filter.py`, and passed to report scripts via `--filter-file`.

```
filter_command1 command_args
filter_command2 command_args
...
```

### clock command

```
CLOCK <clock_name>
```

Set the clock used to generate timestamps in the filter file. Supported clocks are: `monotonic`,
`realtime`. By default it is monotonic. The clock here should be the same as the clock used in
the profile data, which is set by `--clockid` in the simpleperf record command.
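
For instance, a small filter file (a sketch; the timestamps are placeholders) that switches to the
realtime clock before giving a global time range, using the commands described below:

```
CLOCK realtime
GLOBAL_BEGIN 1000
GLOBAL_END 2000
```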

### global time filter commands

```
GLOBAL_BEGIN <begin_timestamp>
GLOBAL_END <end_timestamp>
```

The nearest pair of GLOBAL_BEGIN and GLOBAL_END commands makes a time range. When these commands
are used, only samples in the time ranges are reported. Timestamps are 64-bit integers in
nanoseconds.

```
GLOBAL_BEGIN 1000
GLOBAL_END 2000
GLOBAL_BEGIN 3000
GLOBAL_END 4000
```

For the example above, samples in the time ranges [1000, 2000) and [3000, 4000) are reported.

### process time filter commands

```
PROCESS_BEGIN <pid> <begin_timestamp>
PROCESS_END <pid> <end_timestamp>
```

The nearest pair of PROCESS_BEGIN and PROCESS_END commands for the same process makes a time
range. When these commands are used, each process has a list of time ranges, and only samples
in the time ranges are reported.

```
PROCESS_BEGIN 1 1000
PROCESS_BEGIN 2 2000
PROCESS_END 1 3000
PROCESS_END 2 4000
```

For the example above, process 1 samples in the time range [1000, 3000) and process 2 samples in
the time range [2000, 4000) are reported.

### thread time filter commands

```
THREAD_BEGIN <tid> <begin_timestamp>
THREAD_END <tid> <end_timestamp>
```

The nearest pair of THREAD_BEGIN and THREAD_END commands for the same thread makes a time
range. When these commands are used, each thread has a list of time ranges, and only samples in
the time ranges are reported.

```
THREAD_BEGIN 1 1000
THREAD_BEGIN 2 2000
THREAD_END 1 3000
THREAD_END 2 4000
```

For the example above, thread 1 samples in the time range [1000, 3000) and thread 2 samples in the
time range [2000, 4000) are reported.

# Scripts reference

[TOC]

## Record a profile

### app_profiler.py

`app_profiler.py` is used to record profiling data for Android applications and native executables.

```sh
# Record an Android application.
$ ./app_profiler.py -p simpleperf.example.cpp

# Record an Android application with Java code compiled into native instructions.
$ ./app_profiler.py -p simpleperf.example.cpp --compile_java_code

# Record the launch of an Activity of an Android application.
$ ./app_profiler.py -p simpleperf.example.cpp -a .SleepActivity

# Record a native process.
$ ./app_profiler.py -np surfaceflinger

# Record a native process given its pid.
$ ./app_profiler.py --pid 11324

# Record a command.
$ ./app_profiler.py -cmd \
    "dex2oat --dex-file=/data/local/tmp/app-debug.apk --oat-file=/data/local/tmp/a.oat"

# Record an Android application, and use -r to send custom options to the record command.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e cpu-clock -g --duration 30"

# Record both on CPU time and off CPU time.
$ ./app_profiler.py -p simpleperf.example.cpp \
    -r "-e task-clock -g -f 1000 --duration 10 --trace-offcpu"

# Save profiling data in a custom file (like perf_custom.data) instead of perf.data.
$ ./app_profiler.py -p simpleperf.example.cpp -o perf_custom.data
```

### Profile from launch of an application

Sometimes we want to profile the launch-time of an application. To support this, we added `--app`
in the record command. The `--app` option sets the package name of the Android application to
profile. If the app is not already running, the record command will poll for the app process in a
loop with an interval of 1ms. So to profile from the launch of an application, we can first start
the record command with `--app`, then start the app. Below is an example.

```sh
$ ./run_simpleperf_on_device.py record --app simpleperf.example.cpp \
    -g --duration 1 -o /data/local/tmp/perf.data
# Start the app manually or using the `am` command.
```

To make this convenient, `app_profiler.py` supports using the `-a` option to start an Activity
after recording has started.

```sh
$ ./app_profiler.py -p simpleperf.example.cpp -a .MainActivity
```

### api_profiler.py

`api_profiler.py` is used to control recording from application code. It does preparation work
before recording, and collects profiling data files after recording.

[Here](./android_application_profiling.md#control-recording-in-application-code) are the details.

### run_simpleperf_without_usb_connection.py

`run_simpleperf_without_usb_connection.py` records profiling data while the USB cable isn't
connected. `api_profiler.py` may be more suitable, as it also doesn't need a USB cable when
recording. Below is an example.

```sh
$ ./run_simpleperf_without_usb_connection.py start -p simpleperf.example.cpp
# After the command finishes successfully, unplug the USB cable and run the
# SimpleperfExampleCpp app. After a few seconds, plug in the USB cable.
$ ./run_simpleperf_without_usb_connection.py stop
# It may take a while to stop recording. After that, the profiling data is collected in perf.data
# on the host.
```

### binary_cache_builder.py

The `binary_cache` directory is a directory holding binaries needed by a profiling data file. The
binaries are expected to be unstripped, having debug information and symbol tables. The
`binary_cache` directory is used by report scripts to read symbols of binaries. It is also used by
`report_html.py` to generate annotated source code and disassembly.

By default, `app_profiler.py` builds the binary_cache directory after recording. But we can also
build `binary_cache` for existing profiling data files using `binary_cache_builder.py`. It is
useful when you record profiling data using `simpleperf record` directly, to do system wide
profiling or to record without the USB cable connected.

`binary_cache_builder.py` can either pull binaries from an Android device, or find binaries in
directories on the host (via `-lib`).

```sh
# Generate binary_cache for perf.data, by pulling binaries from the device.
$ ./binary_cache_builder.py

# Generate binary_cache, by pulling binaries from the device and finding binaries in
# SimpleperfExampleCpp.
$ ./binary_cache_builder.py -lib path_of_SimpleperfExampleCpp
```

### run_simpleperf_on_device.py

This script pushes the `simpleperf` executable to the device, and runs a simpleperf command on the
device. It is more convenient than running adb commands manually.
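
For example (mirroring the record example earlier in this document; the arguments after the
subcommand are passed through to simpleperf):

```sh
# Run `simpleperf record` on the device, saving the profile to device storage.
$ ./run_simpleperf_on_device.py record -p 11904 --duration 10 -o /data/local/tmp/perf.data
```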

## Viewing the profile

Scripts in this section are for viewing the profile or converting profile data into formats used
by external UIs. For recommended UIs, see [view_the_profile.md](view_the_profile.md).

### report.py

`report.py` is a wrapper of the `report` command on the host. It accepts all options of the
`report` command.

```sh
# Report the call graph.
$ ./report.py -g

# Report the call graph in a GUI window implemented with Python Tk.
$ ./report.py -g --gui
```

### report_html.py

`report_html.py` generates `report.html` based on the profiling data. The `report.html` shows the
profiling result without depending on other files, so it can be viewed in local browsers or passed
to other machines. Depending on which command-line options are used, the content of the
`report.html` can include: chart statistics, sample table, flamegraphs, annotated source code for
each function, and annotated disassembly for each function.

```sh
# Generate chart statistics, sample table and flamegraphs, based on perf.data.
$ ./report_html.py

# Add source code.
$ ./report_html.py --add_source_code --source_dirs path_of_SimpleperfExampleCpp

# Add disassembly.
$ ./report_html.py --add_disassembly

# Adding disassembly for all binaries can cost a lot of time. So we can choose to only add
# disassembly for selected binaries.
$ ./report_html.py --add_disassembly --binary_filter libgame.so

# report_html.py accepts more than one recording data file.
$ ./report_html.py -i perf1.data perf2.data
```

Below is an example of generating html profiling results for SimpleperfExampleCpp.

```sh
$ ./app_profiler.py -p simpleperf.example.cpp
$ ./report_html.py --add_source_code --source_dirs path_of_SimpleperfExampleCpp \
    --add_disassembly
```

After opening the generated [`report.html`](./report_html.html) in a browser, there are several
tabs:

The first tab is "Chart Statistics". You can click the pie chart to show the time consumed by each
process, thread, library and function.

The second tab is "Sample Table". It shows the time taken by each function. By clicking one row in
the table, we can jump to a new tab called "Function".

The third tab is "Flamegraph". It shows the graphs generated by [`inferno`](./inferno.md).

The fourth tab is "Function". It only appears when users click a row in the "Sample Table" tab.
It shows information about a function, including:

1. A flamegraph showing functions called by that function.
2. A flamegraph showing functions calling that function.
3. Annotated source code of that function. It only appears when there are source code files for
   that function.
4. Annotated disassembly of that function. It only appears when there are binaries containing that
   function.

### inferno

[`inferno`](./inferno.md) is a tool used to generate a flamegraph in an html file.

```sh
# Generate a flamegraph based on perf.data.
# On Windows, use inferno.bat instead of ./inferno.sh.
$ ./inferno.sh -sc --record_file perf.data

# Record a native program and generate a flamegraph.
$ ./inferno.sh -np surfaceflinger
```

### purgatorio

[`purgatorio`](../scripts/purgatorio/README.md) is a visualization tool to show samples in time
order.

### pprof_proto_generator.py

It converts a profiling data file into `pprof.proto`, a format used by
[pprof](https://github.com/google/pprof).

```sh
# Convert perf.data in the current directory to pprof.proto format.
$ ./pprof_proto_generator.py
# Show the report in pdf format.
$ pprof -pdf pprof.profile

# Show the report in html format. To show disassembly, add the --tools option like:
#   --tools=objdump:<ndk_path>/toolchains/llvm/prebuilt/linux-x86_64/aarch64-linux-android/bin
# To show annotated source or disassembly, select `top` in the view menu, click a function and
# select `source` or `disassemble` in the view menu.
$ pprof -http=:8080 pprof.profile
```

### gecko_profile_generator.py

Converts `perf.data` to [Gecko Profile
Format](https://github.com/firefox-devtools/profiler/blob/main/docs-developer/gecko-profile-format.md),
the format read by https://profiler.firefox.com/.

Firefox Profiler is a powerful general-purpose profiler UI which runs locally in
any browser (not just Firefox), with:

- Per-thread tracks
- Flamegraphs
- Search and focus for specific stacks
- A time series view for seeing your samples in timestamp order
- Filtering by thread and duration

Usage:

```
# Record a profile of your application
$ ./app_profiler.py -p simpleperf.example.cpp

# Convert and gzip.
$ ./gecko_profile_generator.py -i perf.data | gzip > gecko-profile.json.gz
```

Then open `gecko-profile.json.gz` in https://profiler.firefox.com/.

### report_sample.py

`report_sample.py` converts a profiling data file into the `perf script` text format output by
`linux-perf-tool`.

This format can be imported into:

- [FlameGraph](https://github.com/brendangregg/FlameGraph)
- [Flamescope](https://github.com/Netflix/flamescope)
- [Firefox
  Profiler](https://github.com/firefox-devtools/profiler/blob/main/docs-user/guide-perf-profiling.md),
  but prefer using `gecko_profile_generator.py`.
- [Speedscope](https://github.com/jlfwong/speedscope/wiki/Importing-from-perf-(linux))

```sh
# Record a profile to perf.data
$ ./app_profiler.py <args>

# Convert perf.data in the current directory to a format used by FlameGraph.
$ ./report_sample.py --symfs binary_cache >out.perf

$ git clone https://github.com/brendangregg/FlameGraph.git
$ FlameGraph/stackcollapse-perf.pl out.perf >out.folded
$ FlameGraph/flamegraph.pl out.folded >a.svg
```

### stackcollapse.py

`stackcollapse.py` converts a profiling data file (`perf.data`) to [Brendan
Gregg's "Folded Stacks"
format](https://queue.acm.org/detail.cfm?id=2927301#:~:text=The%20folded%20stack%2Dtrace%20format,trace%2C%20followed%20by%20a%20semicolon).

Folded Stacks are lines of semicolon-delimited stack frames, root to leaf,
followed by a count of events sampled in that stack, e.g.:

```
BusyThread;__start_thread;__pthread_start(void*);java.lang.Thread.run 17889729
```

All similar stacks are aggregated and sample timestamps are unused.

The Folded Stacks format is readable by:

- The [FlameGraph](https://github.com/brendangregg/FlameGraph) toolkit
- [Inferno](https://github.com/jonhoo/inferno) (Rust port of FlameGraph)
- [Speedscope](https://speedscope.app/)

Example:

```sh
# Record a profile to perf.data
$ ./app_profiler.py <args>

# Convert to Folded Stacks format
$ ./stackcollapse.py --kernel --jit | gzip > profile.folded.gz

# Visualise with FlameGraph with Java stacks and nanosecond times
$ git clone https://github.com/brendangregg/FlameGraph.git
$ gunzip -c profile.folded.gz \
    | FlameGraph/flamegraph.pl --color=java --countname=ns \
    > profile.svg
```

## simpleperf_report_lib.py

`simpleperf_report_lib.py` is a Python library used to parse profiling data files generated by the
record command. Internally, it uses libsimpleperf_report.so to do the work. Generally, for each
profiling data file, we create an instance of ReportLib and pass it the file path (via
SetRecordFile). Then we can read all samples through GetNextSample(). For each sample, we can read
its event info (via GetEventOfCurrentSample), symbol info (via GetSymbolOfCurrentSample) and call
chain info (via GetCallChainOfCurrentSample). We can also get some global information, like record
options (via GetRecordCmd), the arch of the device (via GetArch) and meta strings (via MetaInfo).

Examples of using `simpleperf_report_lib.py` are in `report_sample.py`, `report_html.py`,
`pprof_proto_generator.py` and `inferno/inferno.py`.

## ipc.py

`ipc.py` captures the instructions per cycle (IPC) of the system during a specified duration.

Example:
```sh
./ipc.py
./ipc.py 2 20          # Set the interval to 2 secs and the total duration to 20 secs
./ipc.py -p 284 -C 4   # Only profile PID 284 while running on core 4
./ipc.py -c 'sleep 5'  # Only profile the command to run
```

The results look like:
```
K_CYCLES   K_INSTR    IPC
36840      14138      0.38
70701      27743      0.39
104562     41350      0.40
138264     54916      0.40
```

## sample_filter.py

`sample_filter.py` generates sample filter files as documented in [sample_filter.md](https://android.googlesource.com/platform/system/extras/+/refs/heads/main/simpleperf/doc/sample_filter.md).
A filter file can be passed via `--filter-file` when running report scripts.

For example, it can be used to split a large recording file into several report files.

```sh
$ sample_filter.py -i perf.data --split-time-range 2 -o sample_filter
$ gecko_profile_generator.py -i perf.data --filter-file sample_filter_part1 \
    | gzip >profile-part1.json.gz
$ gecko_profile_generator.py -i perf.data --filter-file sample_filter_part2 \
    | gzip >profile-part2.json.gz
```
# View the profile

[TOC]

## Introduction

After using `simpleperf record` or `app_profiler.py`, we get a profile data file. The file contains
a list of samples. Each sample has a timestamp, a thread id, a callstack, events (like cpu-cycles
or cpu-clock) used in this sample, etc. We have many choices for viewing the profile. We can show
samples in chronological order, or show aggregated flamegraphs. We can show reports in text format,
or in some interactive UIs.

Below are some recommended UIs for viewing the profile. Google developers can find more examples in
[go/gmm-profiling](go/gmm-profiling?polyglot=linux-workstation#viewing-the-profile).


## Continuous PProf UI (great flamegraph UI, but only available internally)

[PProf](https://github.com/google/pprof) is a mature profiling technology used extensively on
Google servers, with a powerful flamegraph UI offering strong drilldown, search, pivot, profile
diff, and graph visualisation.

We can use `pprof_proto_generator.py` to convert profiles into pprof.profile protobufs for use in
pprof.

```
# Output all threads, broken down by threadpool.
./pprof_proto_generator.py

# Use proguard mapping.
./pprof_proto_generator.py --proguard-mapping-file proguard.map

# Just the main (UI) thread (query by thread name):
./pprof_proto_generator.py --comm com.example.android.displayingbitmaps
```

This will print some debug logs like `Failed to read symbols`; this is usually OK, unless those
symbols are hotspots.

The continuous pprof server has a file upload size limit of 50MB. To get around this limit, compress
the profile before uploading:

```
gzip pprof.profile
```

After compressing, you can upload the `pprof.profile.gz` file to either http://pprof/ or
http://pprofng/. Both websites have an 'Upload' tab for this purpose. Alternatively, you can use
the following `pprof` command to upload the compressed profile:

```
# Upload all threads in profile, grouped by threadpool.
# This is usually a good default, combining threads with similar names.
pprof --flame --tagroot threadpool pprof.profile.gz

# Upload all threads in profile, grouped by individual thread name.
pprof --flame --tagroot thread pprof.profile.gz

# Upload all threads in profile, without grouping by thread.
pprof --flame pprof.profile.gz
```

This will output a URL, for example: https://pprof.corp.google.com/?id=589a60852306144c880e36429e10b166

## Firefox Profiler (great chronological UI)

We can view Android profiles using Firefox Profiler: https://profiler.firefox.com/. This does not
require installing Firefox: Firefox Profiler is just a website, and you can open it in any browser.
There is also an internal Google-hosted Firefox Profiler, at go/profiler or go/firefox-profiler.

Firefox Profiler has a great chronological view, as it doesn't pre-aggregate similar stack traces
like pprof does.

We can use `gecko_profile_generator.py` to convert raw perf.data files into a Firefox Profile, with
Proguard deobfuscation.

```
# Create Gecko Profile
./gecko_profile_generator.py | gzip > gecko_profile.json.gz

# Create Gecko Profile using Proguard map
./gecko_profile_generator.py --proguard-mapping-file proguard.map | gzip > gecko_profile.json.gz
```

Then drag-and-drop gecko_profile.json.gz into https://profiler.firefox.com/.

Firefox Profiler supports:

1. Aggregated Flamegraphs
2. Chronological Stackcharts

And allows filtering by:

1. Individual threads
2. Multiple threads (Ctrl+Click thread names to select many)
3. Timeline period
4. Stack frame text search

## FlameScope (great jank-finding UI)

[Netflix's FlameScope](https://github.com/Netflix/flamescope) is a rough, proof-of-concept UI that
lets you spot repeating patterns of work by laying out the profile as a subsecond heatmap.

Below, each vertical stripe is one second, and each cell is 10ms. Redder cells have more samples.
See https://www.brendangregg.com/blog/2018-11-08/flamescope-pattern-recognition.html for how to
spot patterns.

This is an example of a 60s DisplayBitmaps app Startup Profile.

You can see:

- The thick red vertical line on the left is startup.
- The long white vertical sections on the left show the app is mostly idle, waiting for commands
  from instrumented tests.
- The periodic red blocks that follow show the app repeatedly busy handling commands from
  instrumented tests.

Click the start and end cells of a duration to see a flamegraph for that duration.

Install and run FlameScope:

```
git clone https://github.com/Netflix/flamescope ~/flamescope
cd ~/flamescope
pip install -r requirements.txt
npm install
npm run webpack
python3 run.py
```

Then open FlameScope in-browser: http://localhost:5000/.

FlameScope can read gzipped perf script format profiles. Convert simpleperf perf.data to this
format with `report_sample.py`, and place it in FlameScope's examples directory:

```
# Create `Linux perf script` format profile.
report_sample.py | gzip > ~/flamescope/examples/my_simpleperf_profile.gz

# Create `Linux perf script` format profile using Proguard map.
report_sample.py \
    --proguard-mapping-file proguard.map \
    | gzip > ~/flamescope/examples/my_simpleperf_profile.gz
```

Open the profile "as Linux Perf", and click start and end sections to get a flamegraph of that
timespan.

To investigate UI thread jank, filter to UI thread samples only:

```
# Filter to the UI thread, which is named after the app package.
report_sample.py \
    --comm com.example.android.displayingbitmaps \
    | gzip > ~/flamescope/examples/uithread.gz
```

Once you've identified the timespan of interest, consider also zooming into that section with
Firefox Profiler, which has a more powerful flamegraph viewer.


## Differential FlameGraph

See Brendan Gregg's [Differential Flame Graphs](https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html) blog.

Use Simpleperf's `stackcollapse.py` to convert perf.data to Folded Stacks format for the FlameGraph
toolkit.

Consider diffing both directions: After minus Before, and Before minus After.

If you've recorded before and after your optimisation as perf_before.data and perf_after.data, and
you're only interested in the UI thread:

```
# Generate before and after folded stacks from perf.data files
./stackcollapse.py --kernel --jit -i perf_before.data \
    --proguard-mapping-file proguard_before.map \
    --comm com.example.android.displayingbitmaps \
    > perf_before.folded
./stackcollapse.py --kernel --jit -i perf_after.data \
    --proguard-mapping-file proguard_after.map \
    --comm com.example.android.displayingbitmaps \
    > perf_after.folded

# Generate diff reports
FlameGraph/difffolded.pl -n perf_before.folded perf_after.folded \
    | FlameGraph/flamegraph.pl > diff1.svg
FlameGraph/difffolded.pl -n --negate perf_after.folded perf_before.folded \
    | FlameGraph/flamegraph.pl > diff2.svg
```

## Android Studio Profiler

Android Studio Profiler supports recording and reporting profiles of app processes. It supports
several recording methods, including one that uses simpleperf as the backend. You can use Android
Studio Profiler for both recording and reporting.

In Android Studio:

1. Open View -> Tool Windows -> Profiler
2. Click + -> Your Device -> Profileable Processes -> Your App

Click into the "CPU" chart.

Choose Callstack Sample Recording. Even if you're using Java, this provides better observability
into ART, malloc, and the kernel.

Click Record, run your test on the device, then Stop when you're done.

Click on a thread track, then "Flame Chart" to see a chronological chart on the left, and an
aggregated flamechart on the right.

If you want more flexibility in recording options, or want to use a proguard mapping file, you can
record using simpleperf, and report using Android Studio Profiler.

We can use `simpleperf report-sample` to convert perf.data to trace files for Android Studio
Profiler.

```
# Convert perf.data to perf.trace for Android Studio Profiler.
# If on Mac/Windows, use the simpleperf host executable for those platforms instead.
bin/linux/x86_64/simpleperf report-sample --show-callchain --protobuf -i perf.data -o perf.trace

# Convert perf.data to perf.trace using a proguard mapping file.
bin/linux/x86_64/simpleperf report-sample --show-callchain --protobuf -i perf.data -o perf.trace \
    --proguard-mapping-file proguard.map
```

In Android Studio: Open File -> Open -> Select perf.trace


## Simpleperf HTML Report

Simpleperf can generate its own HTML profile, which can show Android-specific information and
separate flamegraphs for all threads, albeit with a much rougher flamegraph UI.

This UI is fairly rough; we recommend using the Continuous PProf UI or Firefox Profiler instead.
But it's useful for a quick look at your data.

Each of the following commands takes ./perf.data as input and outputs ./report.html.

```
# Make an HTML report.
./report_html.py

# Make an HTML report with Proguard mapping.
./report_html.py --proguard-mapping-file proguard.map
```

This will print some debug logs like `Failed to read symbols`; this is usually OK, unless those
symbols are hotspots.

See also [report_html.py's README](scripts_reference.md#report_htmlpy) and `report_html.py -h`.


## PProf Interactive Command Line

Unlike the Continuous PProf UI, the [PProf](https://github.com/google/pprof) command line is
publicly available, and allows drilldown, pivoting and filtering.

The session below demonstrates filtering to stack frames containing `processBitmap`.

```
$ pprof pprof.profile
(pprof) show=processBitmap
(pprof) top
Active filters:
   show=processBitmap
Showing nodes accounting for 2.45s, 11.44% of 21.46s total
      flat  flat%   sum%        cum   cum%
     2.45s 11.44% 11.44%      2.45s 11.44%  com.example.android.displayingbitmaps.util.ImageFetcher.processBitmap
```

And then showing the tags of those frames, to tell which threads they are running on:

```
(pprof) tags
 pid: Total 2.5s
      2.5s (  100%): 31112

 thread: Total 2.5s
         1.4s (57.21%): AsyncTask #3
         1.1s (42.79%): AsyncTask #4

 threadpool: Total 2.5s
             2.5s (  100%): AsyncTask #%d

 tid: Total 2.5s
      1.4s (57.21%): 31174
      1.1s (42.79%): 31175
```

Contrast with another method:

```
(pprof) show=addBitmapToCache
(pprof) top
Active filters:
   show=addBitmapToCache
Showing nodes accounting for 1.05s, 4.88% of 21.46s total
      flat  flat%   sum%        cum   cum%
     1.05s  4.88%  4.88%      1.05s  4.88%  com.example.android.displayingbitmaps.util.ImageCache.addBitmapToCache
```

For more information, see the [pprof README](https://github.com/google/pprof/blob/main/doc/README.md#interactive-terminal-use).


## Simpleperf Report Command Line

The simpleperf report command reports profiles in text format.

You can call `simpleperf report` directly or call it via `report.py`.

```
# Report symbols in table format.
$ ./report.py --children

# Report call graph.
$ bin/linux/x86_64/simpleperf report -g -i perf.data
```

See also [report command's README](executable_commands_reference.md#The-report-command) and
`report.py -h`.


## Custom Report Interface

If the above view UIs can't fulfill your needs, you can use `simpleperf_report_lib.py` to parse
perf.data, extract sample information, and feed it to any view you like, as in the sketch below.

See [simpleperf_report_lib.py's README](scripts_reference.md#simpleperf_report_libpy) for more
details.
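
As an illustration, here is a minimal sketch of such a custom view, under the same assumptions as
the `simpleperf_report_lib.py` example earlier (run from the simpleperf `scripts` directory, with
`perf.data` in the current directory). It aggregates event counts per symbol and prints a top-ten
table:

```python
#!/usr/bin/env python3
# Minimal sketch of a custom "view": total event count per symbol, top ten.
import collections

from simpleperf_report_lib import ReportLib

lib = ReportLib()
lib.SetRecordFile('perf.data')

counts = collections.Counter()
while True:
    sample = lib.GetNextSample()
    if sample is None:
        break
    # sample.period is the number of events attributed to this sample.
    counts[lib.GetSymbolOfCurrentSample().symbol_name] += sample.period
lib.Close()

for name, period in counts.most_common(10):
    print('%16d  %s' % (period, name))
```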