Bug report
In #138664 we added a critical section on the code object in PyCode_Addr2Line to fix crashes involving profiling.
There are two problems with using a lock here:
-
Code objects are shared across threads and PyCode_Addr2Line is a frequently function, leading to contention during profiling.
-
This introduces a stop-the-world safe point in profiling code, which can cause bugs if the profiling code isn't expecting it. For example, PyTorch's profiler isn't robust to a stop-the-world event (or GIL release) happening during profiling, and this leads to crashes. The problem is essentially:
- PyTorch assumes that after
PyEval_SetProfileAllThreads(NULL, NULL); no threads are still in the PyTorch profile code. This is incorrect due to the PyCode_Addr2Line() critical section suspending the thread state in the middle of a profile
- PyTorch frees a bunch of profiler state, but thread suspended on
PyCode_Addr2Line() resumes and tries reading that state leading to a crash.
This is a bug in PyTorch, so I'm going to fix it there as well, but the critical section in PyCode_Addr2Line() makes it much more likely to encounter the bug.
Proposed fix to CPython
- No critical section in PyCode_Addr2Line. Use atomic reads on
co->_co_monitoring and co->_co_monitoring->lines
- Use atomic stores to initialize
co->_co_monitoring and co->_co_monitoring->lines and reorder the stores so that the contents are initialized before the pointer is stored.
cc @kumaraditya303
Linked PRs
Bug report
In #138664 we added a critical section on the code object in
PyCode_Addr2Lineto fix crashes involving profiling.There are two problems with using a lock here:
Code objects are shared across threads and
PyCode_Addr2Lineis a frequently function, leading to contention during profiling.This introduces a stop-the-world safe point in profiling code, which can cause bugs if the profiling code isn't expecting it. For example, PyTorch's profiler isn't robust to a stop-the-world event (or GIL release) happening during profiling, and this leads to crashes. The problem is essentially:
PyEval_SetProfileAllThreads(NULL, NULL);no threads are still in the PyTorch profile code. This is incorrect due to thePyCode_Addr2Line()critical section suspending the thread state in the middle of a profilePyCode_Addr2Line()resumes and tries reading that state leading to a crash.This is a bug in PyTorch, so I'm going to fix it there as well, but the critical section in
PyCode_Addr2Line()makes it much more likely to encounter the bug.Proposed fix to CPython
co->_co_monitoringandco->_co_monitoring->linesco->_co_monitoringandco->_co_monitoring->linesand reorder the stores so that the contents are initialized before the pointer is stored.cc @kumaraditya303
Linked PRs
PyCode_Addr2Line#148103PyCode_Addr2Line(GH… #148353