This is part 1 of a series on how Python profilers work. We start by building one from scratch using the hooks Python gives you, and then show why the naive approach can’t be used in production.
All code examples run with uv run <script> on Python 3.15.0a7. All benchmark results were obtained on an Apple MacBook Pro with an M3 Max chip and 64 GB RAM.
What Does It Mean to Profile?
Profiling answers the question: where does time go?
A profiler observes a running program and records how much time is spent in each function. The result is a breakdown you can use to find hotspots: functions that dominate execution time and are worth optimizing.
Python’s Profile Hook
sys.setprofile(func) installs a profile function that fires on function calls and returns, including C function calls via c_call, c_return, and c_exception.
It is thread-specific. To profile all threads, also call threading.setprofile(), which installs the hook for any threads created after that point.
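A quick way to see the thread-specific behavior is a hook that records which thread each event fires in. This is a minimal sketch, not part of the profiler we build below:

```python
import sys
import threading

seen_threads = set()

def hook(frame, event, arg):
    # record the name of the thread each profile event fires in
    seen_threads.add(threading.current_thread().name)

threading.setprofile(hook)  # threads created after this point get the hook
sys.setprofile(hook)        # the current thread needs its own call

t = threading.Thread(target=lambda: sum(range(10)), name="worker")
t.start()
t.join()

sys.setprofile(None)
threading.setprofile(None)

print(sorted(seen_threads))  # both "MainThread" and "worker" show up
```

Note the order matters: a thread that is already running when `threading.setprofile()` is called never gets the hook.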
The hook function receives three arguments:
```python
def hook(frame, event, arg): ...
```

- frame: the current frame object, containing the code object, locals, and a back-pointer to the caller
- event: a string: 'call', 'return', 'c_call', 'c_return', or 'c_exception'
- arg: event-dependent. For 'return' it's the return value; for 'c_call' it's the C function object; otherwise None
The return value of the hook is ignored.
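To see the event stream concretely, here is a throwaway hook that just records what it receives (an illustrative sketch):

```python
import sys

events = []

def hook(frame, event, arg):
    # record (event name, name of the function the frame belongs to)
    events.append((event, frame.f_code.co_name))

def greet():
    return len("hello")  # len() is a C function: fires c_call/c_return

sys.setprofile(hook)
greet()
sys.setprofile(None)

for ev in events:
    print(ev)
```

One subtlety: for `c_call` events, `frame` is still the Python caller's frame (the C function itself is passed in `arg`), so the `c_call` for `len` is recorded against `greet`.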
Building a Simple Call Profiler
Here’s a minimal profiler built on sys.setprofile. It records the total time spent in each function. (simple_profiler.py, run with uv run simple_profiler.py)
```python
# /// script
# requires-python = "==3.15.0a7"
# dependencies = []
# ///
import sys
import time
import threading
from collections import defaultdict


class SimpleProfiler:
    def __init__(self):
        self.totals = defaultdict(float)  # function name -> total seconds
        self._call_times = {}  # frame id -> entry time

    def _hook(self, frame, event, arg):
        if event == "call":
            self._call_times[id(frame)] = time.perf_counter()
        elif event == "return":
            entry = self._call_times.pop(id(frame), None)
            if entry is not None:
                elapsed = time.perf_counter() - entry
                name = frame.f_code.co_qualname
                self.totals[name] += elapsed

    def start(self):
        sys.setprofile(self._hook)
        threading.setprofile(self._hook)

    def stop(self):
        sys.setprofile(None)
        threading.setprofile(None)

    def report(self):
        for name, total in sorted(self.totals.items(), key=lambda x: -x[1]):
            if total >= 0.0001:  # skip sub-0.1ms noise from profiler internals
                print(f"{total * 1000:8.2f}ms {name}")


def add(a, b):
    return a + b


def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total


profiler = SimpleProfiler()
profiler.start()
slow_function()
profiler.stop()
profiler.report()
```

Output:

  345.60ms slow_function
  165.17ms add
frame.f_code.co_qualname gives us the qualified name of the function (e.g. MyClass.method instead of just method). frame.f_back lets you walk the call stack if you want to attribute time to callers.
The core logic (record call entry time, compute elapsed on return) is the same as what profiling.tracing does. More on that next.
The Overhead Problem
The hook fires on every single function call and return. In a real application, that’s a lot of events.
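You can count the events directly with a hook that does nothing but increment a counter (a sketch, with the loop scaled down to 10,000 iterations):

```python
import sys

count = 0

def counting_hook(frame, event, arg):
    global count
    count += 1  # one increment per call/return/c_call/c_return event

def add(a, b):
    return a + b

def slow_function(n):
    total = 0
    for i in range(n):
        total = add(total, i)
    return total

sys.setprofile(counting_hook)
slow_function(10_000)
sys.setprofile(None)

print(count)  # roughly 2 * n: one 'call' plus one 'return' per add()
```

At 1,000,000 iterations that is about two million hook invocations for a loop that does nothing but integer addition.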
Here’s a benchmark measuring the overhead of running slow_function with and without the profiler. (overhead_benchmark.py, run with uv run overhead_benchmark.py)
```python
# /// script
# requires-python = "==3.15.0a7"
# dependencies = []
# ///
import sys
import time
import threading
from collections import defaultdict


class SimpleProfiler:
    ...  # same as above


def add(a, b):
    return a + b


def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total


# Baseline: no profiler
start = time.perf_counter()
slow_function()
baseline = time.perf_counter() - start

# With profiler
profiler = SimpleProfiler()
profiler.start()
start = time.perf_counter()
slow_function()
with_profile = time.perf_counter() - start
profiler.stop()

print(f"baseline: {baseline * 1000:.1f}ms")
print(f"with hook: {with_profile * 1000:.1f}ms")
print(f"overhead: {with_profile / baseline:.1f}x")
print()
profiler.report()
```

Output:
baseline: 26.1ms
with hook: 358.9ms
overhead: 13.8x
358.92ms slow_function
168.87ms add
The overhead compounds in real applications. Frameworks like Django or FastAPI make thousands of internal function calls per request. Running a tracing profiler in production means slowing your application by 13x or more, which effectively breaks it.
The Standard Library: profiling.tracing
profiling.tracing is CPython’s built-in tracing profiler. In Python 3.15 it was split out from cProfile, which is now a compatibility shim that imports from it. It implements the same idea (intercept every call and return, record timestamps) but in C via _lsprof.c. The core accounting lives in ptrace_enter_call and ptrace_leave_call, which map directly onto the call/return events we handle in _hook.
_lsprof.c uses sys.monitoring (introduced in Python 3.12, PEP 669) rather than setprofile. Instead of installing a Python-level callback, it registers directly with CPython’s instrumentation system using sys.monitoring.register_callback(). The events it subscribes to (PY_START, PY_RETURN, CALL, C_RETURN) map onto the same call/return boundaries, but the dispatch path is faster.
Here is profiling.tracing on the same example. (cprofile_example.py, run with uv run cprofile_example.py)
```python
# /// script
# requires-python = "==3.15.0a7"
# dependencies = []
# ///
import profiling.tracing
import time


def add(a, b):
    return a + b


def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total


# Baseline: no profiler
start = time.perf_counter()
slow_function()
baseline = time.perf_counter() - start

# With profiling.tracing
profiler = profiling.tracing.Profile()
profiler.enable()
start = time.perf_counter()
slow_function()
with_profile = time.perf_counter() - start
profiler.disable()

print(f"baseline: {baseline * 1000:.1f}ms")
print(f"with hook: {with_profile * 1000:.1f}ms")
print(f"overhead: {with_profile / baseline:.1f}x")
print()

# getstats() returns _lsprof.profiler_entry objects:
#   entry.callcount  - number of calls
#   entry.inlinetime - time in the function itself, excluding callees (tottime)
#   entry.totaltime  - time including callees (cumtime)
entries = sorted(profiler.getstats(), key=lambda e: -e.totaltime)
print(f"{'ncalls':>10} {'tottime':>12} {'cumtime':>12} function")
for entry in entries:
    if not hasattr(entry.code, "co_qualname"):
        continue  # skip C functions
    name = f"{entry.code.co_filename.split('/')[-1]}:{entry.code.co_firstlineno}({entry.code.co_qualname})"
    print(f"{entry.callcount:>10} {entry.inlinetime * 1000:>10.2f}ms {entry.totaltime * 1000:>10.2f}ms {name}")
```

Output:

baseline: 27.0ms
with hook: 112.2ms
overhead: 4.1x
ncalls tottime cumtime function
1 72.06ms 112.15ms cprofile_example.py:14(slow_function)
1000000 40.09ms 40.09ms cprofile_example.py:10(add)
That's 4.1x overhead, significantly better than our Python implementation's 13.8x, because _lsprof.c does all the timing in C. But it is still 4.1x. Note also that the timings reported by profiling.tracing include its own overhead: slow_function's cumtime of 112ms is the time under the profiler, not the true 27ms baseline.
tottime is the time spent inside the function itself, excluding callees. cumtime is the total time including callees. For slow_function, cumtime (112ms) includes the time spent in add; tottime (72ms) is the time slow_function spent on its own loop logic.
Why This Can’t Be Used in Production
The fundamental problem is that the hook is synchronous and on the critical path. Every function call the application makes has to invoke your hook before it can proceed. There’s no way to make it fast enough to be invisible.
This is fine for development profiling: running profiling.tracing locally to find which function dominates your benchmark. But it's not viable for always-on production profiling, where overhead needs to stay below 1-2%.
Part 2 covers a different approach: instead of observing every event, periodically interrupt the program and record what it’s doing. That’s a sampling profiler, and it’s how production profilers are built.