How can I profile the TLB hits and TLB misses in ubuntu - c++

I have written a simple C++ program that uses a for loop to print the numbers from 1 to 100. I want to find the number of TLB hits and misses that occur while this particular program runs. Is there any way to get this data?
I am using Ubuntu and have tried the perf tool, but it produces different results on different runs. I am confused about which part of my code leads to such a large number of TLB hits, TLB misses and cache misses.
Of course, other processes such as the Ubuntu GUI may be running at the same time. Do these results include those processes too?
command I used: perf stat -e dTLB-loads -e dTLB-misses ./hellocc
result: first time-- Performance counter stats for './hellocc':
909,822 dTLB-loads
2,023 dTLB-misses # 0.22% of all dTLB cache hits
4,512 cache-misses
0.006821182 seconds time elapsed
result: Second time-- Performance counter stats for './hellocc':
907,810 dTLB-loads
2,045 dTLB-misses # 0.23% of all dTLB cache hits
4,533 cache-misses
0.006780635 seconds time elapsed
My simple code:
#include <iostream>
using namespace std;
int main()
{
    cout << "hello" << "\n";
    for (int i = 1; i <= 100; i = i + 1)
        cout << i << "\t";
    return 0;
}

One way you could simulate this is using cachegrind, a part of valgrind.
Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.
While it's not your hardware, which I don't think you can get to, it's a good stand-in.

The cache behaviour of your program depends on what else is happening on your system at the time.
On a Linux system there are many processes running, such as the X server and window manager, the terminal, your editor, various daemon processes, and whatever else you have running (such as a web browser).
Depending on the vagaries of the scheduler, and the demands these other programs place on your system, your program's data may or may not stay in cache (the scheduler may even page your process entirely to the swap file), so the number of cache misses will vary depending on the other applications running.
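To put the two perf runs from the question side by side, here is a minimal Python 3 sketch that recomputes the miss rate and the run-to-run spread from the quoted counts. The helper names are hypothetical, and treating "hits" as loads minus misses is an assumption about how the two counters relate on this CPU, not something perf states.

```python
# Counts copied from the two perf stat runs quoted in the question.
runs = [
    {"dTLB-loads": 909822, "dTLB-misses": 2023},
    {"dTLB-loads": 907810, "dTLB-misses": 2045},
]


def miss_rate(run):
    """Fraction of dTLB loads that missed."""
    return run["dTLB-misses"] / run["dTLB-loads"]


def hits(run):
    """Assumption: hits are simply loads minus misses."""
    return run["dTLB-loads"] - run["dTLB-misses"]


def relative_spread(values):
    """(max - min) / max, a crude measure of run-to-run variation."""
    return (max(values) - min(values)) / max(values)


for number, run in enumerate(runs, start=1):
    print("run %d: hits=%d, miss rate=%.2f%%"
          % (number, hits(run), 100 * miss_rate(run)))

loads = [run["dTLB-loads"] for run in runs]
print("spread in dTLB-loads between runs: %.2f%%"
      % (100 * relative_spread(loads)))
```

The two runs differ by a fraction of a percent, which suggests the variation is small next to the totals.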

Related

Measuring FLOPS and memory traffic in a C++ program

I am trying to profile a C++ program. As a first step, I want to determine whether the program is compute-bound or memory-bound using the Roofline Model. So I need to measure the following 4 things.
W: # of computations performed in the program (FLOPs)
Q: # of bytes of memory accesses incurred in the program (Bytes)
π: peak performance (FLOP/s)
β: peak bandwidth (Byte/s)
I have tried to use Linux perf to measure W. I followed the instructions here, using libpfm4 to determine the available events (by ./showevinfo). I found my CPU supports the INST_RETIRED event with umask X87, then I used ./check_events INST_RETIRED:X87 to find the code, which is 0x5302c0. Then I tried perf stat -e r5302c0 ./test_exe and I got
Performance counter stats for './test_exe':
83,381,997 r5302c0
20.134717382 seconds time elapsed
74.691675000 seconds user
0.357003000 seconds sys
Questions:
Is this the right way to measure the W of my program? If yes, then it should be 83,381,997 FLOPs, right?
Why is this FLOP count not stable between repeated executions?
How can I measure the others: Q, π and β?
Thanks for your time and any suggestions.
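For reference, once the four quantities are known, the Roofline classification itself is a short calculation. The Python 3 sketch below uses hypothetical function names, and every number except W is made up for illustration. W is the counter value from the question, and whether that counter really equals the FLOP count is exactly what the question asks, so it is used here only as a placeholder.

```python
def arithmetic_intensity(w_flops, q_bytes):
    """I = W / Q, in FLOPs per byte."""
    return w_flops / q_bytes


def attainable_flops_per_s(intensity, peak_flops_per_s, peak_bytes_per_s):
    """Roofline bound: min(pi, beta * I)."""
    return min(peak_flops_per_s, peak_bytes_per_s * intensity)


def classify(intensity, peak_flops_per_s, peak_bytes_per_s):
    """Compare I with the ridge point pi / beta."""
    ridge = peak_flops_per_s / peak_bytes_per_s
    return "memory-bound" if intensity < ridge else "compute-bound"


if __name__ == "__main__":
    w = 83381997   # counter value from the question (placeholder for W)
    q = 1.0e9      # hypothetical: bytes moved to and from memory
    pi = 100.0e9   # hypothetical peak performance, FLOP/s
    beta = 20.0e9  # hypothetical peak bandwidth, Byte/s

    i = arithmetic_intensity(w, q)
    print("I = %.4f FLOP/Byte" % i)
    print("attainable = %.3e FLOP/s" % attainable_flops_per_s(i, pi, beta))
    print(classify(i, pi, beta))
```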

Limit total CPU usage in python multiprocessing

I am using multiprocessing.Pool.imap to run many independent jobs in parallel using Python 2.7 on Windows 7. With the default settings, my total CPU usage is pegged at 100%, as measured by Windows Task Manager. This makes it impossible to do any other work while my code runs in the background.
I've tried limiting the number of processes to be the number of CPUs minus 1, as described in How to limit the number of processors that Python uses:
pool = Pool(processes=max(multiprocessing.cpu_count()-1, 1))
for p in pool.imap(func, iterable):
    ...
This does reduce the total number of running processes. However, each process just takes up more cycles to make up for it. So my total CPU usage is still pegged at 100%.
Is there a way to directly limit the total CPU usage - NOT just the number of processes - or failing that, is there any workaround?
The solution depends on what you want to do. Here are a few options:
Lower priorities of processes
You can nice the subprocesses. This way, though they will still eat 100% of the CPU, when you start other applications the OS gives preference to those other applications. If you want to leave a work-intensive computation running in the background on your laptop and don't care about the CPU fan running all the time, then setting the nice value with psutil is your solution. This script is a test script which runs on all cores for long enough that you can see how it behaves.
from multiprocessing import Pool, cpu_count
import math
import psutil
import os

def f(i):
    return math.sqrt(i)

def limit_cpu():
    "is called at every process start"
    p = psutil.Process(os.getpid())
    # set a low priority; this constant is Windows only, on Unix use p.nice(19)
    p.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)

if __name__ == '__main__':
    # start "number of cores" processes
    pool = Pool(None, limit_cpu)
    for p in pool.imap(f, range(10**8)):
        pass
The trick is that limit_cpu is run at the beginning of every process (see the initializer argument in the docs). Whereas Unix has levels from -20 (highest priority) to 19 (lowest priority), Windows has a few distinct priority classes. BELOW_NORMAL_PRIORITY_CLASS probably fits your requirements best; there is also IDLE_PRIORITY_CLASS, which tells Windows to run your process only when the system is idle.
You can view the priority if you switch to detail mode in Task Manager and right click on the process:
Lower number of processes
Although you have rejected this option, it still might be a good one: say you limit the number of subprocesses to half the CPU cores using pool = Pool(max(cpu_count()//2, 1)). The OS then initially runs those processes on half the CPU cores, while the others stay idle or just run the other applications currently running. After a short time, the OS reschedules the processes and might move them to other CPU cores. Both Windows and Unix-based systems behave this way.
[Screenshot: Windows, running 2 processes on 4 cores]
[Screenshot: OSX, running 4 processes on 8 cores]
You see that both operating systems balance the processes between the cores, although not evenly, so you still see a few cores with higher percentages than others.
Sleep
If you want to be absolutely sure that your processes never eat 100% of a given core (e.g. if you want to keep the CPU fan from spinning up), then you can call sleep in your processing function:
import math
from time import sleep

def f(i):
    sleep(0.01)
    return math.sqrt(i)
This makes the OS "schedule out" your process for 0.01 seconds for each computation and makes room for other applications. If there are no other applications, then the CPU core is idle during that time, so it will never reach 100%. You'll need to play around with different sleep durations, and the right value will vary from computer to computer. If you want to make it very sophisticated, you could adapt the sleep depending on what psutil's cpu_times() reports.
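One way to pick the sleep duration without trial and error is to measure how long each computation took and sleep in proportion to it. This is a minimal Python 3 sketch of that idea, not part of the original answer: sleep_for_target and the target_fraction parameter are hypothetical names, and it assumes the work is the only busy time in the process.

```python
import math
import time


def sleep_for_target(busy_seconds, target_fraction):
    """Sleep time so that busy / (busy + sleep) equals target_fraction."""
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    return busy_seconds * (1.0 - target_fraction) / target_fraction


def f(i, target_fraction=0.5):
    """Same work as above, throttled to about target_fraction of one core."""
    start = time.perf_counter()
    result = math.sqrt(i)
    busy = time.perf_counter() - start
    time.sleep(sleep_for_target(busy, target_fraction))
    return result


if __name__ == "__main__":
    # 30 ms of work at a 75% target calls for about 10 ms of sleep.
    print(sleep_for_target(0.03, 0.75))
    print(f(16))
```

For very short computations like a single sqrt, the sleep call's own overhead dominates, so in practice you would throttle per batch of items rather than per item.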
On the OS level
You can use nice to set a priority to a single command. You could also start a python script with nice. (Below from: http://blog.scoutapp.com/articles/2014/11/04/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups)
nice
The nice command tweaks the priority level of a process so that it runs less frequently. This is useful when you need to run a CPU intensive task as a background or batch job. The niceness level ranges from -20 (most favorable scheduling) to 19 (least favorable). Processes on Linux are started with a niceness of 0 by default. The nice command (without any additional parameters) will start a process with a niceness of 10. At that level the scheduler will see it as a lower priority task and give it less CPU resources.
Start two matho-primes tasks, one with nice and one without:
nice matho-primes 0 9999999999 > /dev/null &
matho-primes 0 9999999999 > /dev/null &
Now run top.
As a function in Python
Another approach is to use psutil and os.getloadavg() to check your CPU load average for the past minute. A balancer process spools up another worker process if you are below the specified CPU load target, and each worker stops itself if you are above the target. This will get out of your way when you are using your computer, but will otherwise maintain a roughly constant CPU load. Note that os.getloadavg() is only available on Unix.
# Import Python modules
import time
import os
import multiprocessing
import psutil
import math
from random import randint

# Main task function
def main_process(item_queue, args_array):
    # Go through each link in the array passed in.
    while not item_queue.empty():
        # Get the next item in the queue
        item = item_queue.get()
        # Create a random number to simulate threads that
        # are not all going to be the same
        randomizer = randint(100, 100000)
        for i in range(randomizer):
            algo_seed = math.sqrt(math.sqrt(i * randomizer) % randomizer)
        # Check if the thread should continue based on current load balance
        if spool_down_load_balance(args_array):
            print("Process " + str(os.getpid()) + " saying goodnight...")
            break

# Build the work queue and start the load balancer process
def start_thread_process(queue_pile, args_array):
    # Create a Queue to hold link pile and share between threads
    item_queue = multiprocessing.Queue()
    # Put all the initial items into the queue
    for item in queue_pile:
        item_queue.put(item)
    # Append the load balancer thread to the loop
    load_balance_process = multiprocessing.Process(target=spool_up_load_balance, args=(item_queue, args_array))
    # Loop through and start all processes
    load_balance_process.start()
    # This .join() function prevents the script from progressing further.
    load_balance_process.join()

# Spool down the thread balance when load is too high
def spool_down_load_balance(args_array):
    # Get the count of CPU cores
    core_count = psutil.cpu_count()
    # Calculate the short term load average of past minute
    one_minute_load_average = os.getloadavg()[0] / core_count
    # If load balance above the max return True to kill the process
    if one_minute_load_average > args_array['cpu_target']:
        print("-Unacceptable load balance detected. Killing process " + str(os.getpid()) + "...")
        return True
    return False

# Load balancer thread function
def spool_up_load_balance(item_queue, args_array):
    print("[Starting load balancer...]")
    # Get the count of CPU cores
    core_count = psutil.cpu_count()
    # While there is still links in queue
    while not item_queue.empty():
        print("[Calculating load balance...]")
        # Check the 1 minute average CPU load balance
        # returns 1,5,15 minute load averages
        one_minute_load_average = os.getloadavg()[0] / core_count
        # If the load average much less than target, start a group of new threads
        if one_minute_load_average < args_array['cpu_target'] / 2:
            # Print message and log that load balancer is starting another thread
            print("Starting another thread group due to low CPU load balance of: " + str(one_minute_load_average * 100) + "%")
            time.sleep(5)
            # Start another group of threads
            for i in range(args_array['thread_group_size']):
                start_new_thread = multiprocessing.Process(target=main_process, args=(item_queue, args_array))
                start_new_thread.start()
            # Allow the added threads to have an impact on the CPU balance
            # before checking the one minute average again
            time.sleep(20)
        # If load average less than target start single thread
        elif one_minute_load_average < args_array['cpu_target']:
            # Print message and log that load balancer is starting another thread
            print("Starting another single thread due to low CPU load balance of: " + str(one_minute_load_average * 100) + "%")
            # Start another thread
            start_new_thread = multiprocessing.Process(target=main_process, args=(item_queue, args_array))
            start_new_thread.start()
            # Allow the added threads to have an impact on the CPU balance
            # before checking the one minute average again
            time.sleep(20)
        else:
            # Print CPU load balance
            print("Reporting stable CPU load balance: " + str(one_minute_load_average * 100) + "%")
            # Wait before checking the load average again
            time.sleep(20)

if __name__ == "__main__":
    # Set the queue size
    queue_size = 10000
    # Define an arguments array to pass around all the values
    args_array = {
        # Set some initial CPU load values as a CPU usage goal
        "cpu_target": 0.60,
        # When CPU load is significantly low, start this number
        # of threads
        "thread_group_size": 3
    }
    # Create an array of fixed length to act as queue
    queue_pile = list(range(queue_size))
    # Set main process start time
    start_time = time.time()
    # Start the main process
    start_thread_process(queue_pile, args_array)
    print('[Finished processing the entire queue! Time consuming:{0} Time Finished: {1}]'.format(time.time() - start_time, time.strftime("%c")))
In Linux:
Use nice() with a numerical value:
# on Unix use p.nice(10) for low priority
p.nice(10)
https://en.wikipedia.org/wiki/Nice_(Unix)

Single thread programme apparently using multiple core

Question summary: all four cores used when running a single threaded programme. Why?
Details: I have written a non-parallelised programme in Xcode (C++). I was in the process of parallelising it, and wanted to see whether what I was doing was actually resulting in more cores being used. To that end I used Instruments to look at the core usage. To my surprise, while my application is single threaded, all four cores were being utilised.
To test whether it changed the performance, I dialled down the number of cores available to 1 (you can do it in Instruments, preferences) and the speed wasn't reduced at all. So (as I knew) the programme isn't parallelised in any way.
I can't find any information on what it means to use multiple cores to perform single threaded tasks. Am I reading the Instruments output wrong? Or is the single-threaded process being shunted between different cores for some reason (like changing lanes on a road instead of driving in two lanes at once - i.e. actual parallelisation)?
Thanks for any insight anyone can give on this.
EDIT with MWE (apologies for not doing this initially).
The following is C++ code that finds primes under 500,000, compiled in Xcode.
#include <ctime>     // clock(), clock_t, CLOCKS_PER_SEC
#include <iostream>

int main(int argc, const char * argv[]) {
    clock_t start, end;
    double runTime;
    start = clock();
    int i, num = 1, primes = 0;
    int num_max = 500000;
    while (num <= num_max) {
        i = 2;
        while (i <= num) {
            if (num % i == 0)
                break;
            i++;
        }
        if (i == num) {
            primes++;
            std::cout << "Prime: " << num << std::endl;
        }
        num++;
    }
    end = clock();
    runTime = (end - start) / (double) CLOCKS_PER_SEC;
    std::cout << "This machine calculated all " << primes << " under " << num_max << " in " << runTime << " seconds." << std::endl;
    return 0;
}
This runs in 36s or thereabouts on my machine, as shown by the final output and my phone's stopwatch. When I profile it (using Instruments launched from within Xcode) it gives a run-time of around 28s. The following image shows the core usage.
[Image: Instruments showing core usage with all 4 cores (with hyper-threading)]
Now I reduce the number of available cores to 1. Re-running from within the profiler (pressing the record button), it reports a run-time of 29s; a picture is shown below.
[Image: Instruments output with only 1 core available]
That would accord with my theory that more cores don't improve performance for a single-threaded programme! Unfortunately, when I actually time the programme with my phone, the above took about 1 minute 30s, so there is a meaningful performance gain from having all cores switched on.
One thing that is really puzzling me, is that, if you leave the number of cores at 1, go back to Xcode and run the program, it again says it takes about 33s, but my phone says it takes 1 minute 50s. So changing the cores is doing something to the internal clock (perhaps).
Hopefully that describes the problem fully. I'm running on a 2015 15 inch MBP, with 2.2GHz i7 quad core processor. Xcode 7.3.1
I should say up front that your question lacks much of the information needed for an accurate diagnosis. Anyway, I'll try to explain what is IMHO the most common reason, supposing your application doesn't use a third-party component that runs multi-threaded.
I think this could be a scheduler effect. Let me explain what I mean.
Each core of the processor picks a process in the system and executes it for a "short" amount of time. This is the most common approach in desktop operating systems.
Your process is executed on a single core for this amount of time and is then stopped in order to allow other processes to continue. When your process is resumed, it may be executed on another core (always one core at a time, but possibly a different one). So a task manager with coarse time resolution can register utilization on all cores, even though the process never uses more than one core at once.
To verify whether this is the cause, I suggest you look at the total CPU % used while your application is running. For a single-threaded application the CPU usage should be about 1/(number of cores), in your case 25%.
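The effect described above can be illustrated with a toy simulation in Python 3. This is only a sketch under a loud assumption: the thread is moved to the next core on every time slice in round-robin order, which real schedulers do not do. The function name and parameters are hypothetical.

```python
def simulate(num_cores, num_slices, window):
    """Simulate one always-busy thread migrating between cores.

    Assumption: the thread moves to the next core each time slice
    (round-robin). Returns one list per sampling window holding each
    core's utilisation averaged over that window.
    """
    samples = []
    busy = [0] * num_cores
    for t in range(num_slices):
        busy[t % num_cores] += 1          # only one core is busy per slice
        if (t + 1) % window == 0:         # the monitor takes a sample
            samples.append([b / window for b in busy])
            busy = [0] * num_cores
    return samples


if __name__ == "__main__":
    for sample in simulate(num_cores=4, num_slices=400, window=100):
        print(sample, "total:", sum(sample))
```

Every sample shows all four cores at 25%, yet the total is exactly one core's worth of work, which is what a monitor with a coarse sampling window would display for a single-threaded program.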
If it's a release build, your compiler may be vectorising or parallelising your code. Also, libraries you link against, the standard library for example, may be threaded or vectorised.

Why does perf -e cpu-cycles report different answers on multiple runs?

On this sample code:
#include <iostream>
using namespace std;
int main()
{
cout<<"Hello World!"<<endl;
return 0;
}
I ran the following command 3 times:
perf stat -e cpu-cycles ./sample
Following are the 3 outputs on consecutive executions:
1)
Hello World!
Performance counter stats for './try':
22,71,970 cpu-cycles
0.003634105 seconds time elapsed
2)
Hello World!
Performance counter stats for './try':
18,51,044 cpu-cycles
0.001045616 seconds time elapsed
3)
Hello World!
Performance counter stats for './try':
18,21,834 cpu-cycles
0.001153489 seconds time elapsed
Why would the same program take different number of cpu-cycles on multiple runs?
I am using "Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz", "Ubuntu 14.04.3 LTS" and "g++ 4.8.4".
During start-up of the program the binary code is memory-mapped but loaded lazily (as needed). Hence, during the first invocation of the program some CPU cycles are spent by the kernel on (transparently) loading the binary code as execution hits instruction pages that are not yet in the RAM. Subsequent invocations take less time since they reuse the cached code, unless the underlying file has changed or the program hasn't been executed for a long time and its pages were recycled.
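To see the warm-up effect for yourself, you can run the program several times and compare the first run with the rest. perf stat has a -r (--repeat) option that does this and reports the variation. The Python 3 sketch below does something similar for wall-clock time only. The function names are hypothetical, and the command it times is a stand-in for ./sample.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, repeats=5):
    """Run cmd repeatedly and return the wall-clock duration of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """First run, fastest run, mean and standard deviation in seconds."""
    return {
        "first": durations[0],
        "min": min(durations),
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in command; replace with ["./sample"] to time the program above.
    cmd = [sys.executable, "-c", "print('Hello World!')"]
    for name, value in summarize(time_runs(cmd)).items():
        print("%-5s %.6f s" % (name, value))
```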

Poor man's Python data parallelism doesn't scale with the # of processors? Why?

I need to run many computational iterations, say 100 or so, of a very complicated function that accepts a number of input parameters. Though the parameters will be varied, each iteration will take nearly the same amount of time to compute. I found this post by Muhammad Alkarouri. I modified his code so as to fork the script N times, where I plan to set N equal to at least the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status shows a Cpus_allowed of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool() alternative as my function was rejected as not being pickleable.
Here is the modified Alkarouri script, multifork.py:
#!/usr/bin/env python
import os, cPickle, time
import numpy as np
np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def run_in_separate_process(func, *args, **kwds):
    numruns = 4
    result = [ None ] * numruns
    pread = [ None ] * numruns
    pwrite = [ None ] * numruns
    pid = [ None ] * numruns
    for i in range(numruns):
        pread[i], pwrite[i] = os.pipe()
        pid[i] = os.fork()
        if pid[i] > 0:
            pass
        else:
            os.close(pread[i])
            result[i] = func(*args, **kwds)
            with os.fdopen(pwrite[i], 'wb') as f:
                cPickle.dump((0,result[i]), f, cPickle.HIGHEST_PROTOCOL)
            os._exit(0)
    #time.sleep(17)
    while True:
        for i in range(numruns):
            os.close(pwrite[i])
            with os.fdopen(pread[i], 'rb') as f:
                stat, res = cPickle.load(f)
                result[i] = res
            #os.waitpid(pid[i], 0)
        if not None in result: break
    return result

def main():
    print 'Running multifork.py.'
    print run_in_separate_process( function, 3 )

if __name__ == "__main__":
    main()
With multifork.py, uncommenting os.waitpid(pid[i], 0) has no effect. And neither does uncommenting time.sleep() if 4 iterations are being run at once and if the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that to be an indication that the While True loop is not itself a cause of any appreciable inefficiency (due to the processes all taking the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you for now my other scheme, with which I employed subprocess.Popen() in lieu of os.fork(). With that one I had to send the function to an auxiliary script, the script that defines the command that is the first argument of Popen(), via a file. I did however use the same While True loop. And the results? They were the same as with the simpler scheme that I am presenting here--- almost exactly.
Why don't you use joblib's Parallel feature?
#!/usr/bin/env python
from joblib import Parallel, delayed
import numpy as np
np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def main():
    print 'Running multifork.py.'
    print Parallel(n_jobs=2)(delayed(function)(3) for _ in xrange(4))

if __name__ == "__main__":
    main()
It seems that you have some bottleneck in your computations.
In your example you pass your data over a pipe, which is not a very fast method. To avoid this performance problem you should use shared memory. This is how multiprocessing and joblib.Parallel work.
Also, remember that in the single-process case you don't have to serialize and deserialize the data, but in the multi-process case you do.
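To get a feel for how much data each child has to serialize, here is a rough Python 3 sketch using only the standard library. The test function ends up with a 5000 x 601 array, and assuming 8-byte floats that is about 24 MB per child. The function names are hypothetical, and array.array stands in for the numpy array, so the timings are only indicative.

```python
import pickle
import time
from array import array


def payload_bytes(rows, cols, itemsize=8):
    """Raw size of a rows x cols array of itemsize-byte numbers."""
    return rows * cols * itemsize


def pickle_round_trip(values):
    """Serialize and deserialize values; return (blob size, seconds, copy)."""
    start = time.perf_counter()
    blob = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(blob)
    elapsed = time.perf_counter() - start
    return len(blob), elapsed, restored


if __name__ == "__main__":
    print("result array: %d bytes" % payload_bytes(5000, 601))
    data = array("d", range(100000))   # 100000 doubles, 800000 raw bytes
    size, seconds, _ = pickle_round_trip(data)
    print("pickled %d bytes in %.6f s" % (size, seconds))
```

Comparing this round-trip time with the compute time of the function gives a first estimate of whether data transfer or computation dominates.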
Next, even if you see 8 cores, hyper-threading may be enabled, which divides each core's performance between 2 threads; so if you have 4 hardware cores, the OS thinks there are 8. There are advantages and disadvantages to using HT, but the main point is that if you are going to load all your cores with computations for a long time, you should disable it.
For example, I have an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 2 hardware cores and HT enabled, so top shows 4 cores. My computation times are:
n_jobs=1 - 0:07.13
n_jobs=2 - 0:06.54
n_jobs=4 - 0:07.45
This is how my lscpu looks like:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6186.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Note the Thread(s) per core line.
So in your example there is not so much computation as data transfer. Your application doesn't run long enough to benefit from parallelism. If you have a long computation job (about 10 minutes), I think you will see the benefit.
ADDITION:
I've taken a more detailed look at your function. I replaced the multiple executions with a single call to function(3) and ran it under the profiler:
$ /usr/bin/time -v python -m cProfile so.py
The output is quite long; you can view the full version here (http://pastebin.com/qLBBH5zU). But the main thing is that the program spends most of its time in the numpy.concatenate function. You can see it:
ncalls tottime percall cumtime percall filename:lineno(function)
.........
600 1.375 0.002 1.375 0.002 {numpy.core.multiarray.concatenate}
.........
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
.........
If you run multiple instances of this program, you will see that the execution time of each individual instance increases considerably. I started 2 copies simultaneously:
$ /usr/bin/time -v python -m cProfile prog.py & /usr/bin/time -v python -m cProfile prog.py &
On the other hand I wrote a small fibo function:
def fibo(x):
    arr = (0, 1)
    for _ in xrange(x):
        arr = (arr[-1], sum(arr))
    return arr[0]
And I replaced the concatenate line with fibo(10000). In this case the execution time of a single instance is 0:22.82, while running two instances takes almost the same time per instance (0:24.62).
Based on this, I think numpy perhaps uses some shared resource which leads to the parallelization problem. Or it could be a numpy- or scipy-specific issue.
And one last thing about the code: you can replace the block below:
for r in range(3):
    for i in range(200):
        arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
with this single line:
arr = np.concatenate( (arr, np.log(np.arange(2, 5002).repeat(3*200).reshape(-1,3*200))), axis=1 )
I am providing an answer here so as to not leave such a muddled record. First, I can report that most-serene frist's joblib code works, and note that it's shorter and that it also doesn't suffer from the limitations of the While True loop of my example, which only works efficiently with jobs that each take about the same amount of time. I see that the joblib project has current support and if you don't mind dependency upon a third-party library it may be an excellent solution. I may adopt it.
But, with my test function the time(1) real, the running time using the time wrapper, is about the same with either joblib or my poor man's code.
To answer the question of why the scaling of the running time in inverse proportion to the number of physical cores doesn't go as hoped, I had diligently prepared my test code that I presented here so as to produce similar output and to take about as long to run as the code of my actual project, per single-run without parallelism (and my actual project is likewise CPU-bound, not I/O-bound). I did so in order to test before undertaking the not-really-easy fitting of the simple code into my rather complicated project. But I have to report that in spite of that similarity, the results were much better with my actual project. I was surprised to see that I did get the sought inverse scaling of running time with the number of physical cores, more or less.
So I'm left supposing--- here's my tentative answer to my own question--- that perhaps the OS scheduler is fickle and very sensitive to the type of job. And there could be effects due to other processes that may be running even if, as in my case, the other processes are hardly using any CPU time (I checked; they weren't).
Tip #1: Never name your joblib test code joblib.py (you'll get an import error). Tip #2: Never rename your joblib.py test code file and run the renamed file without deleting the joblib.py file (you'll get an import error).