This is part 1 of a series on how Python profilers work. We start by building one from scratch using the hooks Python gives you, and then show why the naive approach can’t be used in production.

All code examples run with uv run <script> on Python 3.15.0a7. All benchmark results were obtained on an Apple MacBook Pro with an M3 Max chip and 64 GB RAM.

What Does It Mean to Profile?

Profiling answers the question: where does time go?

A profiler observes a running program and records how much time is spent in each function. The result is a breakdown you can use to find hotspots: functions that dominate execution time and are worth optimizing.


Python’s Profile Hook

sys.setprofile(func) installs a profile function that fires on function calls and returns, including C function calls via c_call, c_return, and c_exception.

It is thread-specific. To profile all threads, also call threading.setprofile(), which installs the hook for any threads created after that point.

The hook function receives three arguments:

def hook(frame, event, arg):
    ...
  • frame: the current frame object, containing the code object, locals, and back-pointer to the caller
  • event: one of 'call', 'return', 'c_call', 'c_return', or 'c_exception'
  • arg: event-dependent. For 'return' it’s the return value (or None if an exception is propagating); for the C events ('c_call', 'c_return', 'c_exception') it’s the C function object being called; for 'call' it’s None

The return value of the hook is ignored.
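As a quick illustration (a toy sketch of our own, not from this article’s script files; the name greet is made up), here is a hook that simply logs each event it sees:

```python
import sys

log = []  # (event, name) pairs observed by the hook

def hook(frame, event, arg):
    if event in ("call", "return"):
        # Python-level events: the frame identifies the function
        log.append((event, frame.f_code.co_name))
    elif event in ("c_call", "c_return", "c_exception"):
        # C-level events: arg is the C function object itself
        log.append((event, arg.__name__))

def greet():
    return len("hello")  # len is a C function: fires c_call / c_return

sys.setprofile(hook)
greet()
sys.setprofile(None)

for event, name in log:
    print(f"{event:12}{name}")
```

The log shows the paired events: a 'call'/'return' for greet with a 'c_call'/'c_return' for len nested between them. You may also see a trailing 'c_call' for sys.setprofile itself, since the hook is still installed at the moment that call starts.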


Building a Simple Call Profiler

Here’s a minimal profiler built on sys.setprofile. It records the total time spent in each function. (simple_profiler.py, run with uv run simple_profiler.py)

# /// script
# requires-python = "==3.15.0a7"
# dependencies = []
# ///
 
import sys
import time
import threading
from collections import defaultdict
 
 
class SimpleProfiler:
    def __init__(self):
        self.totals = defaultdict(float)  # function name -> total seconds
        self._call_times = {}             # frame id -> entry time
 
    def _hook(self, frame, event, arg):
        if event == "call":
            self._call_times[id(frame)] = time.perf_counter()
 
        elif event == "return":
            entry = self._call_times.pop(id(frame), None)
            if entry is not None:
                elapsed = time.perf_counter() - entry
                name = frame.f_code.co_qualname
                self.totals[name] += elapsed
 
    def start(self):
        sys.setprofile(self._hook)
        threading.setprofile(self._hook)
 
    def stop(self):
        sys.setprofile(None)
        threading.setprofile(None)
 
    def report(self):
        for name, total in sorted(self.totals.items(), key=lambda x: -x[1]):
            if total >= 0.0001:  # skip sub-0.1ms noise from profiler internals
                print(f"{total * 1000:8.2f}ms  {name}")
 
 
def add(a, b):
    return a + b
 
 
def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total
 
 
profiler = SimpleProfiler()
profiler.start()
slow_function()
profiler.stop()
profiler.report()

Output:

  345.60ms  slow_function
  165.17ms  add

frame.f_code.co_qualname gives us the qualified name of the function (e.g. MyClass.method instead of just method). frame.f_back lets you walk the call stack if you want to attribute time to callers.

The core logic (record call entry time, compute elapsed on return) is the same as what profiling.tracing does. More on that next.


The Overhead Problem

The hook fires on every single function call and return. In a real application, that’s a lot of events.
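You can count them. A small sketch reusing the add/slow_function pair from above, with a hook that does nothing but increment a counter:

```python
import sys

count = 0

def counting_hook(frame, event, arg):
    # fires for every call/return event, Python- and C-level alike
    global count
    count += 1

def add(a, b):
    return a + b

def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total

sys.setprofile(counting_hook)
slow_function()
sys.setprofile(None)

print(f"{count:,} events")  # a bit over 2,000,000: one call + one return per add()
```

Even this empty hook has to run two million times for a single slow_function() invocation.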

Here’s a benchmark measuring the overhead of running slow_function with and without the profiler. (overhead_benchmark.py, run with uv run overhead_benchmark.py)

# /// script
# requires-python = "==3.15.0a7"
# dependencies = []
# ///
 
import sys
import time
import threading
from collections import defaultdict
 
 
class SimpleProfiler:
    ...  # same as above
 
 
def add(a, b):
    return a + b
 
 
def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total
 
 
# Baseline: no profiler
start = time.perf_counter()
slow_function()
baseline = time.perf_counter() - start
 
# With profiler
profiler = SimpleProfiler()
profiler.start()
start = time.perf_counter()
slow_function()
with_profile = time.perf_counter() - start
profiler.stop()
 
print(f"baseline:    {baseline * 1000:.1f}ms")
print(f"with hook:   {with_profile * 1000:.1f}ms")
print(f"overhead:    {with_profile / baseline:.1f}x")
print()
profiler.report()

Output:

baseline:    26.1ms
with hook:   358.9ms
overhead:    13.8x

  358.92ms  slow_function
  168.87ms  add

The overhead compounds in real applications. Frameworks like Django or FastAPI make thousands of internal function calls per request. Running a tracing profiler in production means slowing your application by 13x or more, which effectively breaks it.


The Standard Library: profiling.tracing

profiling.tracing is CPython’s built-in tracing profiler. In Python 3.15 it was split out from cProfile, which is now a compatibility shim that imports from it. It implements the same idea (intercept every call and return, record timestamps) but in C via _lsprof.c. The core accounting lives in ptrace_enter_call and ptrace_leave_call, which map directly onto the call/return events we handle in _hook.

_lsprof.c uses sys.monitoring (introduced in Python 3.12, PEP 669) rather than setprofile. Instead of installing a Python-level callback, it registers directly with CPython’s instrumentation system using sys.monitoring.register_callback(). The events it subscribes to (PY_START, PY_RETURN, CALL, C_RETURN) map onto the same call/return boundaries, but the dispatch path is faster.

Here is profiling.tracing on the same example. (cprofile_example.py, run with uv run cprofile_example.py)

# /// script
# requires-python = "==3.15.0a7"
# dependencies = []
# ///
 
import profiling.tracing
import time
 
 
def add(a, b):
    return a + b
 
 
def slow_function():
    total = 0
    for i in range(1_000_000):
        total = add(total, i)
    return total
 
 
# Baseline: no profiler
start = time.perf_counter()
slow_function()
baseline = time.perf_counter() - start
 
# With profiling.tracing
profiler = profiling.tracing.Profile()
profiler.enable()
start = time.perf_counter()
slow_function()
with_profile = time.perf_counter() - start
profiler.disable()
 
print(f"baseline:    {baseline * 1000:.1f}ms")
print(f"with hook:   {with_profile * 1000:.1f}ms")
print(f"overhead:    {with_profile / baseline:.1f}x")
print()
 
# getstats() returns _lsprof.profiler_entry objects:
#   entry.callcount   - number of calls
#   entry.inlinetime  - time in the function itself, excluding callees (tottime)
#   entry.totaltime   - time including callees (cumtime)
entries = sorted(profiler.getstats(), key=lambda e: -e.totaltime)
print(f"{'ncalls':>10}  {'tottime':>12}  {'cumtime':>12}  function")
for entry in entries:
    if not hasattr(entry.code, "co_qualname"):
        continue  # skip C functions
    name = f"{entry.code.co_filename.split('/')[-1]}:{entry.code.co_firstlineno}({entry.code.co_qualname})"
    print(f"{entry.callcount:>10}  {entry.inlinetime * 1000:>10.2f}ms  {entry.totaltime * 1000:>10.2f}ms  {name}")

Output:

baseline:    27.0ms
with hook:   112.2ms
overhead:    4.1x

    ncalls       tottime       cumtime  function
         1       72.06ms      112.15ms  cprofile_example.py:14(slow_function)
   1000000       40.09ms       40.09ms  cprofile_example.py:10(add)

4.1x overhead, significantly better than our Python implementation (13.8x), because _lsprof.c does all timing in C. But it is still 4.1x. The timings reported by profiling.tracing also include its own overhead: slow_function’s cumtime of 112ms is the time under the profiler, not the true 27ms baseline.

tottime is the time spent inside the function itself, excluding callees; cumtime is the total time including callees. For slow_function, cumtime (112.15ms) includes the time spent in add, while tottime (72.06ms) covers only its own loop logic. The columns are consistent: slow_function’s tottime plus add’s cumtime (72.06ms + 40.09ms) equals slow_function’s cumtime (112.15ms).


Why This Can’t Be Used in Production

The fundamental problem is that the hook is synchronous and on the critical path. Every function call the application makes has to invoke your hook before it can proceed. There’s no way to make it fast enough to be invisible.

This is fine for development profiling: run profiling.tracing locally to find which function dominates your benchmark. But it’s not viable for always-on production profiling, where overhead needs to stay below 1-2%.

Part 2 covers a different approach: instead of observing every event, periodically interrupt the program and record what it’s doing. That’s a sampling profiler, and it’s how production profilers are built.