Conversation
fd3ea6c to
7c64cc2
Compare
7c64cc2 to
1c19495
Compare
savannahostrowski
left a comment
There was a problem hiding this comment.
Given the numbers we've seen, we should update 0.5-2%...if we're rounding up?
|
|
||
| Because the flags are in ``CFLAGS``, they propagate automatically to consumers | ||
| that build against CPython's reported compiler flags, such as C extensions | ||
| built via ``pip``, ``setuptools``, or direct ``sysconfig`` queries. Those |
There was a problem hiding this comment.
nit: I don't think pip directly builds any C extension, build backends do that
There was a problem hiding this comment.
| built via ``pip``, ``setuptools``, or direct ``sysconfig`` queries. Those | |
| built via pip, Setuptools, or direct ``sysconfig`` queries. Those |
|
|
||
| For all graphs below, the green dots are geometric means of the | ||
| individual benchmark's median, while orange lines are the median of our data points. | ||
| Hollow circles reperesent outliers. |
There was a problem hiding this comment.
| Hollow circles reperesent outliers. | |
| Hollow circles represent outliers. |
| https://github.com/Fidget-Spinner/python-framepointer-bench | ||
|
|
||
| .. [#missing_benchmarks] Some benchmarks are missing due to incompatibilities with | ||
| Python 3.15alpha. |
There was a problem hiding this comment.
| Python 3.15alpha. | |
| Python 3.15 alpha. |
| CPython's own documentation already states the recommended fix:: | ||
|
|
||
| For best results, Python should be compiled with | ||
| CFLAGS='-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer' | ||
| as this allows profilers to unwind using only the frame pointer | ||
| and not on DWARF debug information. | ||
|
|
||
| -- docs.python.org/3/howto/perf_profiling.html |
There was a problem hiding this comment.
Fix link, no code formatting for quoting text:
| CPython's own documentation already states the recommended fix:: | |
| For best results, Python should be compiled with | |
| CFLAGS='-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer' | |
| as this allows profilers to unwind using only the frame pointer | |
| and not on DWARF debug information. | |
| -- docs.python.org/3/howto/perf_profiling.html | |
| :ref:`CPython's own documentation <python:perf_profiling>`__ already states the | |
| recommended fix: | |
| For best results, Python should be compiled with | |
| ``CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"`` | |
| as this allows profilers to unwind using only the frame pointer | |
| and not on DWARF debug information. |
| -mno-omit-leaf-frame-pointer``. However, Ubuntu 24.04 LTS explicitly exempted | ||
| CPython:: | ||
|
|
||
| In cases where the impact is high (such as the Python | ||
| interpreter), we'll continue to omit frame pointers until | ||
| this is addressed. | ||
|
|
||
| -- ubuntu.com/blog/ubuntu-performance-engineering-with- | ||
| frame-pointers-by-default |
There was a problem hiding this comment.
| -mno-omit-leaf-frame-pointer``. However, Ubuntu 24.04 LTS explicitly exempted | |
| CPython:: | |
| In cases where the impact is high (such as the Python | |
| interpreter), we'll continue to omit frame pointers until | |
| this is addressed. | |
| -- ubuntu.com/blog/ubuntu-performance-engineering-with- | |
| frame-pointers-by-default | |
| -mno-omit-leaf-frame-pointer``. However, Ubuntu 24.04 LTS | |
| `explicitly exempted CPython | |
| <https://ubuntu.com/blog/ubuntu-performance-engineering-with-frame-pointers-by-default>`__: | |
| In cases where the impact is high (such as the Python | |
| interpreter), we'll continue to omit frame pointers until | |
| this is addressed. |
peps/pep-0831.rst
Outdated
| ===================================== ======================= | ||
| Apple M2 Mac Mini (arm64) 1.006x slower | ||
| macOS M3 Pro (arm64) 1.001x slower | ||
| Raspberry Pi (aarch64). 1.002x slower |
There was a problem hiding this comment.
| Raspberry Pi (aarch64). 1.002x slower | |
| Raspberry Pi (aarch64) 1.002x slower |
| extension builds. If only the interpreter has frame pointers but extensions do | ||
| not, the chain is still broken at every C extension boundary. By adding the | ||
| flags to ``CFLAGS`` as reported by ``sysconfig``, extension builds that consume | ||
| CPython's compiler flags (for example via ``pip install``, ``setuptools``, or |
There was a problem hiding this comment.
| CPython's compiler flags (for example via ``pip install``, ``setuptools``, or | |
| CPython's compiler flags (for example via ``pip install``, Setuptools, or |
| the ``python`` binary, ``libpython``, and built-in extension modules under | ||
| ``Modules/``. | ||
| 2. The flags **are** written into the ``sysconfig`` data, so that third-party C | ||
| extensions built against this Python (via ``pip``, ``setuptools``, or direct |
There was a problem hiding this comment.
| extensions built against this Python (via ``pip``, ``setuptools``, or direct | |
| extensions built against this Python (via pip, Setuptools, or direct |
|
|
||
| Because the flags are in ``CFLAGS``, they propagate automatically to consumers | ||
| that build against CPython's reported compiler flags, such as C extensions | ||
| built via ``pip``, ``setuptools``, or direct ``sysconfig`` queries. Those |
There was a problem hiding this comment.
| built via ``pip``, ``setuptools``, or direct ``sysconfig`` queries. Those | |
| built via pip, Setuptools, or direct ``sysconfig`` queries. Those |
markshannon
left a comment
There was a problem hiding this comment.
Very compelling.
I'd added a few quick comments on the first part of the PEP.
This is long, and might be a bit much to expect anyone to read through in one go.
Maybe some of the supporting analyses could be moved into appendices and linked to from the body of the PEP?
For example, the detailed, but long, analyses of perf and ebpf tools could be moved to appendices.
| with frame pointers by default.** This PEP recommends that *every* compiled | ||
| component that participates in the Python call stack (C extensions, Rust | ||
| extensions, embedding applications, and native libraries) should enable | ||
| frame pointers. A frame-pointer chain is only as complete as its weakest |
There was a problem hiding this comment.
| frame pointers. A frame-pointer chain is only as complete as its weakest | |
| frame pointers. A frame-pointer chain is only as strong as its weakest |
| above) prevents these tools from producing useful call stacks for Python | ||
| processes, and undermines the perf trampoline support CPython shipped in 3.12. | ||
|
|
||
| The measured overhead is under 2% geometric mean for typical workloads. |
There was a problem hiding this comment.
For what platform, build configuration, etc, etc?
Can you link to the data, otherwise the readers (at least this reader's) first thought is "where's the evidence?"
| ========== | ||
|
|
||
| Python's observability story (profiling, debugging, and system-level tracing) | ||
| is fundamentally limited by the absence of frame pointers. The core motivation |
There was a problem hiding this comment.
This isn't quite true. It isn't impossible without frame pointers, just much, much harder.
| The performance wins that profiling enables far outweigh the modest overhead of | ||
| frame pointers. As Brendan Gregg notes: "I've seen frame pointers help find | ||
| performance wins ranging from 5% to 500%" [#gregg2024]_. A 0.5-2% overhead that | ||
| unlocks the ability to find 5-500% improvements is a favourable trade. |
There was a problem hiding this comment.
Make it clear, that large speedups are not going to happen for CPython, but it might happen for larger systems using Python as a glue language.
| frame also stores the *previous* frame pointer, creating a linked list through | ||
| the entire call stack:: | ||
|
|
||
| ┌──────────────────┐ |
There was a problem hiding this comment.
Have the arrows actually point to the previous frame. I think it is OK to use an image,instead of ASCII art.
| Stack unwinding is the process of walking this chain to reconstruct the call | ||
| stack. Profilers do it to find out where the program is spending time; | ||
| debuggers do it to show backtraces; crash handlers do it to produce useful | ||
| error reports. With frame pointers, unwinding is a simple pointer chase: read |
There was a problem hiding this comment.
| error reports. With frame pointers, unwinding is a simple pointer chase: read | |
| error reports. With frame pointers, unwinding is a simply following pointers: read |
I don't know if you can expect the readership to known the term "pointer chasing"
| stack. Profilers do it to find out where the program is spending time; | ||
| debuggers do it to show backtraces; crash handlers do it to produce useful | ||
| error reports. With frame pointers, unwinding is a simple pointer chase: read | ||
| ``%rbp``, follow the link, repeat. It takes microseconds and requires no |
There was a problem hiding this comment.
| ``%rbp``, follow the link, repeat. It takes microseconds and requires no | |
| ``%rbp``, follow the link, repeat. It is very fast and requires no |
How long it actually takes depends on many factors.
| default [#gcc_fomit]_. This frees the ``%rbp`` register for general use, | ||
| giving the optimiser one more register to work with. On x86-64 this is a gain | ||
| of one register out of 16 (about 7%). The performance benefit is small | ||
| (typically 0.5-2%) but it was considered worthwhile when the convention was |
There was a problem hiding this comment.
Again, you need to be more nuanced about how much things slow down, and for which platforms.
I'd use terms like "a few" rather than 0.5-2 and add a link to a section with a more detailed breakdown
| established for 32-bit x86, where the gain was one register out of 6 (~20%). | ||
|
|
||
| Without frame pointers, the linked list does not exist. Tools that need to | ||
| walk the call stack must instead parse DWARF debug information (a complex, |
There was a problem hiding this comment.
Don't forget Windows. It would make your case more compelling, as tools would need to parse DWARF and understand what Windows does.
| follows the chain, and reconstructs the full call stack in microseconds. This | ||
| is fast enough to sample at 10,000 Hz in production with negligible overhead. | ||
|
|
||
| Without frame pointers, the profiler must fall back to DWARF unwinding, a |
There was a problem hiding this comment.
Didn't you just say this a few paragraphs earlier?
https://github.com/python/peps/pull/4896/changes#diff-cce3082fa07cd352af453f46a4e6f23b15645c79d58cb902da789f5c463cc92aR115
Basic requirements (all PEP Types)
pep-NNNN.rst), PR title (PEP 123: <Title of PEP>) andPEPheaderAuthororSponsor, and formally confirmed their approvalAuthor,Status(Draft),TypeandCreatedheaders filled out correctlyPEP-Delegate,Topic,RequiresandReplacesheaders completed if appropriate.github/CODEOWNERSfor the PEPStandards Track requirements
Python-Versionset to valid (pre-beta) future Python version, if relevantDiscussions-ToandPost-History📚 Documentation preview 📚: https://pep-previews--4896.org.readthedocs.build/pep-0831/