Performance Benchmarks¶
dimtensor adds unit tracking to your arrays. This page documents the performance cost of that tracking compared to raw NumPy, PyTorch, and JAX operations.
Summary¶
| Framework | Overhead vs Raw | Notes |
|---|---|---|
| NumPy | 2-5x | Acceptable for most scientific work |
| PyTorch | 2-4x | GPU operations amortize overhead |
| JAX | 2-3x | JIT compilation reduces overhead |
NumPy Benchmarks¶
Array Creation¶
import numpy as np
from dimtensor import DimArray, units
# Raw NumPy
%timeit np.array([1.0] * 1000)
# ~2.5 μs
# dimtensor
%timeit DimArray([1.0] * 1000, units.m)
# ~5 μs (2x overhead)
Arithmetic Operations¶
a_np = np.random.randn(10000)
b_np = np.random.randn(10000)
a_dim = DimArray(a_np, units.m)
b_dim = DimArray(b_np, units.m)
# Raw NumPy
%timeit a_np + b_np
# ~5 μs
# dimtensor
%timeit a_dim + b_dim
# ~15 μs (3x overhead)
Large Array Operations¶
For larger arrays, the relative overhead decreases:
# 1 million elements
a_np = np.random.randn(1_000_000)
b_np = np.random.randn(1_000_000)
a_dim = DimArray(a_np, units.m)
b_dim = DimArray(b_np, units.m)
# Raw NumPy
%timeit a_np * b_np
# ~1.2 ms
# dimtensor
%timeit a_dim * b_dim
# ~1.5 ms (1.25x overhead - much better!)
Key insight: the unit-tracking overhead per operation is roughly constant, so its relative cost shrinks as array size grows.
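To check this scaling on your own machine, here is a minimal sketch using the timeit module instead of the %timeit magic; it assumes only the DimArray constructor shown above:
import timeit
import numpy as np
from dimtensor import DimArray, units
for n in (1_000, 100_000, 1_000_000):
    a_np, b_np = np.random.randn(n), np.random.randn(n)
    a_dim, b_dim = DimArray(a_np, units.m), DimArray(b_np, units.m)
    t_raw = timeit.timeit(lambda: a_np * b_np, number=200)
    t_dim = timeit.timeit(lambda: a_dim * b_dim, number=200)
    print(f"n={n:>9,}: overhead = {t_dim / t_raw:.2f}x")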
PyTorch Benchmarks¶
CPU Operations¶
import torch
from dimtensor.torch import DimTensor
from dimtensor import units
a_torch = torch.randn(10000)
b_torch = torch.randn(10000)
a_dim = DimTensor(a_torch, units.m)
b_dim = DimTensor(b_torch, units.m)
# Raw PyTorch
%timeit a_torch + b_torch
# ~8 μs
# dimtensor
%timeit a_dim + b_dim
# ~25 μs (3x overhead)
GPU Operations¶
GPU operations benefit significantly because the overhead is CPU-bound:
# On GPU
a_gpu = torch.randn(1_000_000, device='cuda')
b_gpu = torch.randn(1_000_000, device='cuda')
a_dim_gpu = DimTensor(a_gpu, units.m)
b_dim_gpu = DimTensor(b_gpu, units.m)
# Raw PyTorch GPU
%timeit torch.cuda.synchronize(); _ = a_gpu * b_gpu; torch.cuda.synchronize()
# ~50 μs
# dimtensor GPU
%timeit torch.cuda.synchronize(); _ = a_dim_gpu * b_dim_gpu; torch.cuda.synchronize()
# ~55 μs (1.1x overhead - negligible!)
Key insight: GPU operations amortize the CPU-side unit tracking overhead.
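If you need tighter GPU numbers than %timeit with explicit synchronization gives, a sketch using CUDA events (reusing the tensors defined above) isolates the per-launch time:
def gpu_time_ms(fn, iters=100):
    # Warm up once, then time `iters` launches between CUDA events.
    fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per launch
print("raw:", gpu_time_ms(lambda: a_gpu * b_gpu))
print("dim:", gpu_time_ms(lambda: a_dim_gpu * b_dim_gpu))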
Autograd¶
a = DimTensor(torch.randn(1000, requires_grad=True), units.m)
b = DimTensor(torch.randn(1000), units.s)
def forward():
    c = a / b
    return c.sum()
# Backward pass
%timeit loss = forward(); loss.backward()
# Overhead: ~2x vs raw PyTorch autograd
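For reference, the raw PyTorch baseline for the same forward/backward pass looks like this (a sketch with the same shapes as above):
a_raw = torch.randn(1000, requires_grad=True)
b_raw = torch.randn(1000)
def forward_raw():
    return (a_raw / b_raw).sum()
%timeit loss = forward_raw(); loss.backward()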
JAX Benchmarks¶
JIT Compilation¶
JIT compilation significantly reduces dimtensor overhead:
import jax
import jax.numpy as jnp
from dimtensor.jax import DimArray
from dimtensor import units
@jax.jit
def compute_raw(a, b):
    return a * b + a
# Same body; the only difference is that it is traced with DimArray inputs.
@jax.jit
def compute_dim(a, b):
    return a * b + a
a_jax = jnp.ones(10000)
b_jax = jnp.ones(10000)
a_dim = DimArray(jnp.ones(10000), units.m)
b_dim = DimArray(jnp.ones(10000), units.m)
# After warmup (the first call triggers compilation and is excluded)
%timeit compute_raw(a_jax, b_jax).block_until_ready()
# ~15 μs
%timeit compute_dim(a_dim, b_dim).data.block_until_ready()
# ~25 μs (1.7x overhead)
vmap¶
@jax.jit
@jax.vmap
def batched_compute(x):
    return x ** 2
x_raw = jnp.ones((100, 1000))
x_dim = DimArray(jnp.ones((100, 1000)), units.m)
%timeit batched_compute(x_raw).block_until_ready()
# ~20 μs
%timeit batched_compute(x_dim).data.block_until_ready()
# ~35 μs (1.75x overhead)
Optimization Tips¶
1. Use Larger Arrays¶
Overhead is relatively constant, so batch your operations:
# Slower: many small operations
results = []
for i in range(1000):
    results.append(DimArray([values[i]], units.m) * scalar)
# Faster: one large operation
result = DimArray(values, units.m) * scalar
2. Use GPU for Large Computations¶
GPU operations have negligible overhead:
# Move to GPU for large arrays
data = DimTensor(torch.randn(1_000_000), units.m)
data_gpu = data.cuda() # Overhead becomes negligible
3. Use JAX JIT¶
JIT compilation reduces overhead significantly:
@jax.jit
def physics_simulation(state):
    # All unit checking happens at trace time;
    # the compiled code runs at near-native speed.
    return new_state
4. Extract Data for Tight Loops¶
For performance-critical inner loops:
# Extract raw data for the tight loop
raw_data = arr.data
raw_unit = arr.unit
for i in range(1_000_000):
    # Pure NumPy operations; compute() stands in for your per-element work
    raw_data[i] = compute(raw_data[i])
# Reconstruct with units
result = DimArray(raw_data, raw_unit)
5. Disable Uncertainty When Not Needed¶
Uncertainty propagation adds overhead:
# With uncertainty (slower)
arr = DimArray([1, 2, 3], units.m, uncertainty=[0.1, 0.1, 0.1])
# Without uncertainty (faster)
arr = DimArray([1, 2, 3], units.m)
Comparison with Other Libraries¶
| Library | Relative Overhead | GPU Support |
|---|---|---|
| dimtensor | 2-5x | Yes (PyTorch) |
| Pint | 2-5x | No |
| Astropy | 2-3x | No |
| unyt | 1.5-3x | No |
dimtensor's overhead is comparable to other unit libraries, with the added benefit of GPU acceleration.
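To reproduce the comparison locally, here is a rough sketch against Pint (install pint separately; the other libraries follow the same pattern):
import timeit
import numpy as np
import pint
from dimtensor import DimArray, units
ureg = pint.UnitRegistry()
x = np.random.randn(10_000)
a_pint, b_pint = x * ureg.meter, x * ureg.meter
a_dim, b_dim = DimArray(x, units.m), DimArray(x, units.m)
print("pint:     ", timeit.timeit(lambda: a_pint + b_pint, number=1_000))
print("dimtensor:", timeit.timeit(lambda: a_dim + b_dim, number=1_000))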
When Performance Matters¶
For most scientific applications, the 2-5x overhead is acceptable because:
- I/O is usually the bottleneck - File reading, network, etc.
- Bugs are expensive - Dimensional errors caught early save hours of debugging
- GPU amortizes overhead - Large computations on GPU have negligible overhead
- Correctness > Speed - A fast wrong answer is worse than a slow correct one
If you find dimtensor is your bottleneck, consider:
- Profiling to confirm it's actually the issue (see the sketch after this list)
- Using GPU acceleration
- Extracting raw data for the critical section
- Opening an issue for optimization suggestions
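A minimal profiling sketch with cProfile; workload() here is a hypothetical stand-in for your own hot path:
import cProfile
import pstats
import numpy as np
from dimtensor import DimArray, units
def workload():
    # Hypothetical hot path mixing array math and unit tracking.
    a = DimArray(np.random.randn(100_000), units.m)
    b = DimArray(np.random.randn(100_000), units.s)
    for _ in range(100):
        v = a / b
    return v
cProfile.run("workload()", "dimtensor.prof")
pstats.Stats("dimtensor.prof").sort_stats("cumulative").print_stats(10)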
Running Your Own Benchmarks¶
dimtensor includes benchmarking utilities: