# Parallelization and profiling

If you’re one of those people whose scripts always run in a second or less, you can probably skip this tutorial. But if you have time to make yourself a cup of tea while your code is running, you might want to read on. This tutorial covers how to run code in parallel, and how to check its performance to look for improvements.

## Parallelization

### Parallelization in Python

Scary stories of Python’s “global interpreter lock” aside, parallelization is actually fairly simple in Python. However, it’s not particularly intuitive or flexible. We can do vanilla parallelization in Python via something like this:

[1]:

import multiprocessing as mp

# Define a function
def my_func(x):
    return x**2

# Run it in parallel
with mp.Pool() as pool:
    results = pool.map(my_func, [1,2,3])

print(results)

[1, 4, 9]


So far so good. But what if we have something more complicated? What if we want to run our function with a different keyword argument, for example? It starts getting kind of crazy:

[2]:

from functools import partial

# Define a (slightly) more complex function
def complex_func(x, arg1=2, arg2=4):
    return x**2 + (arg1 * arg2)

# Make a new function with a different default argument 😱
new_func = partial(complex_func, arg2=10)

# Run it in parallel
with mp.Pool() as pool:
    results = pool.map(new_func, [1,2,3])

print(results)

[21, 24, 29]


This works, but that sure was a lot of work just to set a single keyword argument!

### Parallelization in Sciris

With Sciris, you can do it all with one line:

[3]:

import sciris as sc

results = sc.parallelize(complex_func, [1,2,3], arg2=10)

print(results)

[21, 24, 29]


What’s happening here? sc.parallelize() lets you pass keyword arguments directly to the function you’re calling. You can also iterate over multiple arguments rather than just one:

[4]:

args = dict(x=[1,2,3], arg2=[10,20,30])

results = sc.parallelize(complex_func, iterkwargs=args)

print(results)

[21, 44, 69]


(Of course you can do this with vanilla Python too, but you’ll need to define a list of tuples, and you can only assign by position, not by keyword.)
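To make the comparison concrete, here is a minimal sketch of the vanilla approach using pool.starmap(), which accepts a list of argument tuples. Note the limitation: because arguments can only be passed by position, each tuple must also spell out arg1 just to reach arg2:

```python
import multiprocessing as mp

def complex_func(x, arg1=2, arg2=4):
    return x**2 + (arg1 * arg2)

# Positional-only: each tuple must supply arg1 as well, just to set arg2
jobs = [(1, 2, 10), (2, 2, 20), (3, 2, 30)]

with mp.Pool() as pool:
    results = pool.starmap(complex_func, jobs)

print(results)  # [21, 44, 69]
```

This produces the same answers as the sc.parallelize() call above, but with noticeably more ceremony.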

Depending on what you want to run, your inputs might be in one of several different forms: you can supply a list of values, a list of dicts, or a dict of lists. An example will probably help:

[5]:

def mult(x,y):
    return x*y

r1 = sc.parallelize(mult, iterarg=[(1,2),(2,3),(3,4)])
r2 = sc.parallelize(mult, iterkwargs={'x':[1,2,3], 'y':[2,3,4]})
r3 = sc.parallelize(mult, iterkwargs=[{'x':1, 'y':2}, {'x':2, 'y':3}, {'x':3, 'y':4}])
print(f'{r1 = }')
print(f'{r2 = }')
print(f'{r3 = }')

r1 = [2, 6, 12]
r2 = [2, 6, 12]
r3 = [2, 6, 12]


All of these are equivalent: choose whichever makes you happy.

There are lots and lots of options with parallelization, but we’ll only cover a couple here. For example, if you want to start 200 jobs on your laptop with 8 cores, you probably don’t want them to eat up all your CPU or memory and make your computer unusable. You can set maxcpu and maxmem limits to handle that:

[6]:

import numpy as np
import pylab as pl

# Define the function
def rand2d(i, x, y):
    np.random.seed()
    xy = [x+i*np.random.randn(100), y+i*np.random.randn(100)]
    return (i,xy)

# Run in parallel
xy = sc.parallelize(
    func     = rand2d,   # The function to parallelize
    iterarg  = range(5), # Values for first argument
    maxcpu   = 0.8,      # CPU limit (1 = no limit)
    maxmem   = 0.9,      # Memory limit (1 = no limit)
    interval = 0.2,      # How often to re-check the limits (in seconds)
    x = 3, y = 8,        # Keyword arguments for the function
)

# Plot
pl.figure()
colors = sc.gridcolors(len(xy))
for i,(x,y) in reversed(xy): # Reverse order to plot the most widely spaced dots first
    pl.scatter(x, y, c=[colors[i]], alpha=0.7, label=f'Scale={i}')
pl.legend();

CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 0 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 1 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 2 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 3 after 1 tries
CPU ✓ (0.00<0.80), memory ✓ (0.19<0.90): starting process 4 after 1 tries


So far, we’ve used sc.parallelize() as a function. But you can also use it as a class, which gives you more flexibility and control over which jobs are run, and provides more information if any of them fail:

[7]:

def slow_func(i=1):
    sc.randsleep(seed=i)
    if i == 4:
        raise Exception("I don't like seed 4")
    return i**2

# Create the parallelizer object
P = sc.Parallel(
    func = slow_func,
    iterarg = range(10),
    parallelizer = 'multiprocess-async', # Run asynchronously
    die = False, # Keep going if a job crashes
)

# Actually run
P.run_async()

# Monitor progress
P.monitor()

# Get results
P.finalize()

# See how long things took
print(P.times)

Job 4/10 (2.3 s) ••••••••••••—————————————————— 40%

/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/multiprocess/pool.py:48: RuntimeWarning: sc.parallelize(): Task 4 failed, but die=False so continuing.
Traceback (most recent call last):
result = func(*args, **kwargs) # Call the function!
^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ipykernel_1915/2785706684.py", line 4, in slow_func
raise Exception("I don't like seed 4")
Exception: I don't like seed 4

return list(map(*args))

#0. 'started':  datetime.datetime(2024, 4, 1, 23, 20, 29, 452539)
#1. 'finished': datetime.datetime(2024, 4, 1, 23, 20, 36, 65270)
#2. 'elapsed':  6.612731
#3. 'jobs':     [1.278202772140503, 1.0245444774627686, 0.5267577171325684,
0.17212557792663574, 1.8923094272613525, 1.611030101776123, 1.077981948852539,
1.251112699508667, 0.654982328414917, 1.7413651943206787]

/home/docs/checkouts/readthedocs.org/user_builds/sciris/envs/latest/lib/python3.11/site-packages/sciris/sc_parallel.py:543: RuntimeWarning: Only 9 of 10 jobs succeeded; see exceptions attribute for details
self.process_results()


You can see it raised some warnings. These are stored in the Parallel object so we can check back and see what happened:

[8]:

print(f'{P.success = }')
print(f'{P.exceptions = }')
print(f'{P.results = }')

P.success = [True, True, True, True, False, True, True, True, True, True]
P.exceptions = [None, None, None, None, Exception("I don't like seed 4"), None, None, None, None, None]
P.results = [0, 1, 4, 9, None, 25, 36, 49, 64, 81]


Hopefully, you will never need to run a function as poorly written as slow_func()!

## Profiling

Even parallelization can’t save you if your code is just really slow. Sciris provides a variety of tools to help with this.

### Benchmarking

First off, we can check if our computer is performing as we expect, or if we want to compare across computers:

[9]:

bm = sc.benchmark() # Check CPU performance, in units of MOPS (million operations per second)
ram = sc.checkram() # Check RAM used by this Python instance

print('CPU performance: ', dict(bm))
print('Python RAM usage', ram)

CPU performance:  {'python': 4.420045548484547, 'numpy': 138.59488716438034}
Python RAM usage 153.78 MB


We can see that NumPy performance is much higher than Python – hundreds of MOPS† rather than single digits. This makes sense; it’s why we use NumPy for array operations!

† The definition of a single “operation” is a little loose, so these “MOPS” are useful for relative comparisons, but aren’t directly comparable to, say, published processor speeds.
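You can see the same Python-vs-NumPy gap directly with a small timing sketch (the variable names here are illustrative, not part of the Sciris API):

```python
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)

# Pure Python: sum the elements one at a time in a loop
t0 = time.perf_counter()
py_total = 0.0
for v in a:
    py_total += v
py_time = time.perf_counter() - t0

# NumPy: the same sum, vectorized
t0 = time.perf_counter()
np_total = a.sum()
np_time = time.perf_counter() - t0

print(f'Python: {py_time:.4f} s, NumPy: {np_time:.4f} s, ratio: {py_time/np_time:.0f}×')
```

The exact ratio depends on your machine, but NumPy typically comes out one to two orders of magnitude ahead, consistent with the benchmark above.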

### Line profiling

If you want to do serious profiling of your code, take a look at Austin. But if you just want a quick sense of where things might be slow, you can use sc.profile(). Let’s apply it to our lousy slow_func() from before:

[10]:

sc.profile(slow_func)

Profiling...
Timer unit: 1e-09 s

Total time: 1.02409 s
File: /tmp/ipykernel_1915/2785706684.py
Function: slow_func at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
1                                           def slow_func(i=1):
2         1 1024088429.0    1e+09    100.0      sc.randsleep(seed=i)
3         1       1747.0   1747.0      0.0      if i == 4:
4                                                   raise Exception("I don't like seed 4")
5         1       1411.0   1411.0      0.0      return i**2

Done.

[10]:

<line_profiler.line_profiler.LineProfiler at 0x7f0c9c5f5630>


We can see that 100% (well, 99.9997%) of the time was taken by the sleep function. This is not surprising, but it’s good to see the profiler confirm it!

For a slightly more realistic example:

[11]:

def func():
    n = 1000

    # Do some NumPy
    v1 = np.random.rand(n,n)
    v2 = np.random.rand(n,n)
    v3 = v1*v2

    # Do some Python
    means = []
    for i in range(n):
        means.append(sum(v3[i])/n)

sc.profile(func)

Profiling...
Timer unit: 1e-09 s

Total time: 0.110839 s
File: /tmp/ipykernel_1915/701805461.py
Function: func at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
1                                           def func():
2         1        803.0    803.0      0.0      n = 1000
3
4                                               # Do some NumPy
5         1   10601157.0    1e+07      9.6      v1 = np.random.rand(n,n)
6         1    9473091.0    9e+06      8.5      v2 = np.random.rand(n,n)
7         1    3943569.0    4e+06      3.6      v3 = v1*v2
8
9                                               # Do some Python
10         1       1484.0   1484.0      0.0      means = []
11      1001     278894.0    278.6      0.3      for i in range(n):
12      1000   86539806.0  86539.8     78.1          means.append(sum(v3[i])/n)

Done.

[11]:

<line_profiler.line_profiler.LineProfiler at 0x7f0c9c5f43d0>


We can see (from the “% Time” column) that, again not surprisingly, the Python math operation is much slower than the NumPy operations.
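The natural fix is to move that loop into NumPy as well. A minimal sketch (standalone, recreating the arrays rather than reusing the tutorial’s) shows that a single vectorized call gives the same answer as the Python loop:

```python
import numpy as np

n = 1000
v3 = np.random.rand(n, n)

# Loop version, as in the profiled function: one Python iteration per row
means_loop = []
for i in range(n):
    means_loop.append(sum(v3[i])/n)

# Vectorized equivalent: one NumPy call computes all the row means
means_np = v3.mean(axis=1)

print(np.allclose(means_loop, means_np))  # True
```

Re-profiling after a change like this would show the former hotspot shrinking to a small fraction of the total time.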