time-critical application with Python on Windows - python-2.7

I run a time-critical application on Windows 10 using Python 2.7.x, and it seems Windows sometimes interrupts my program for a fraction of a second. This happens roughly every 5-10 seconds.
How can I "tell" Windows that my program is the only thing which is important as long as it is running?

Sorry, I wasn't clear about my link; I meant the psutil usage as proposed in the second comment on http://code.activestate.com/recipes/496767-set-process-priority-in-windows/:
import psutil, os
p = psutil.Process(os.getpid())
print(p.nice())
p.nice(psutil.HIGH_PRIORITY_CLASS)
print(p.nice())
... yields:
32
128
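If HIGH_PRIORITY_CLASS still isn't enough, psutil can also set REALTIME_PRIORITY_CLASS and pin the process to a core. A minimal sketch (assuming psutil on Windows; real-time priority can starve the rest of the system, so use it with care):
# Sketch only: push priority further and pin to one core (assumes psutil on Windows).
import os
import psutil

p = psutil.Process(os.getpid())
p.nice(psutil.REALTIME_PRIORITY_CLASS)  # stronger than HIGH_PRIORITY_CLASS; use with care
p.cpu_affinity([0])                     # keep the process on a single core
print(p.nice())
print(p.cpu_affinity())
Even then, Windows is not a real-time OS, so occasional preemption by kernel or driver work can still occur.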

Related

Python 2.7 : How to track declining RAM?

Data is updated every 5 min, and every 5 min a Python script I wrote is run. This data is related to signals, and when the data says a signal is True, the signal name is shown in a PyQt GUI that I have.
In other words, the GUI is always on my screen, and every 5 min its "main" function is triggered to check the database of signals against the newly downloaded data. I leave this GUI open for hours or days at a time and the computer always crashes. Random Python modules get corrupted (pandas can't import this, or numpy can't import that) and I have to reinstall Python and all the packages.
I have a hypothesis that this is related to the program being open for a long time and using up more and more memory which eventually crashes the computer when the memory runs out.
How would I test this hypothesis? If I can just show that with every 5-min run the available memory decreases, then it would suggest that my hypothesis might be correct.
Here is the code that reruns the "main" function every 5 min:
class Editor(QtGui.QMainWindow):
    # my app

def main():
    app = QtGui.QApplication(sys.argv)
    ex = Editor()
    milliseconds_autocheck_frequency = 300000  # number of milliseconds in 5 min
    timer = QtCore.QTimer()
    timer.timeout.connect(ex.run)
    timer.start(milliseconds_autocheck_frequency)
    sys.exit(app.exec_())

if __name__ == '__main__':
    main()
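One way to test the leak hypothesis is to log the process's memory use on every timer tick and watch whether it grows run after run. A minimal sketch (assuming a reasonably recent psutil is installed; log_memory is a hypothetical helper connected to the same QTimer as ex.run):
# Sketch only (assumes psutil): log RSS each run; a steady climb supports the leak hypothesis.
import os
import time
import psutil

_proc = psutil.Process(os.getpid())

def log_memory():
    rss_mb = _proc.memory_info().rss / (1024.0 * 1024.0)
    with open('memory_log.csv', 'a') as f:
        f.write('%s,%.1f\n' % (time.strftime('%Y-%m-%d %H:%M:%S'), rss_mb))

# e.g. in main():  timer.timeout.connect(log_memory)  # alongside ex.run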

Poor man's Python data parallelism doesn't scale with the # of processors? Why?

I need to run many computational iterations, say 100 or so, of a very complicated function that accepts a number of input parameters. Though the parameters vary, each iteration takes nearly the same amount of time to compute. I found this post by Muhammad Alkarouri and modified his code so as to fork the script N times, where I plan to set N at least equal to the number of cores on my desktop computer.
It's an old MacPro 1,0 running 32-bit Ubuntu 16.04 with 18GB of RAM, and in none of the runs of the test file below does the RAM usage exceed about 15% (swap is never used). That's according to the System Monitor's Resources tab, on which it is also shown that when I try running 4 iterations in parallel all four CPUs are running at 100%, whereas if I only do one iteration only one CPU is 100% utilized and the other 3 are idling. (Here are the actual specs for the computer. And cat /proc/self/status shows a Cpus_allowed of 8, twice the total number of cpu cores, which may indicate hyper-threading.)
So I expected to find that the 4 simultaneous runs would consume only a little more time than one (and going forward I generally expected run times to pretty much scale in inverse proportion to the number of cores on any computer). However, I found to the contrary that instead of the running time for 4 being not much more than the running time for one, it is instead a bit more than twice the running time for one. For example, with the scheme presented below, a sample "time(1) real" value for a single iteration is 0m7.695s whereas for 4 "simultaneous" iterations it is 0m17.733s. And when I go from running 4 iterations at once to 8, the running time scales in proportion.
So my question is, why doesn't it scale as I had supposed (and can anything be done to fix that)? This is, by the way, for deployment on a few desktops; it doesn't have to scale or run on Windows.
Also, I have for now neglected the multiprocessing.Pool() alternative as my function was rejected as not being pickleable.
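(For reference, multiprocessing.Pool only requires the callable to be picklable, which module-level functions are; a minimal sketch of that variant with a trivial stand-in function, not my actual code:)
# Sketch only: Pool works when the target is a module-level (picklable) function.
import multiprocessing

def work(x):
    return x * x  # stand-in for the real, complicated function

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    print(pool.map(work, [3, 3, 3, 3]))
    pool.close()
    pool.join()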
Here is the modified Alkarouri script, multifork.py:
#!/usr/bin/env python

import os, cPickle, time
import numpy as np

np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def run_in_separate_process(func, *args, **kwds):
    numruns = 4
    result = [ None ] * numruns
    pread = [ None ] * numruns
    pwrite = [ None ] * numruns
    pid = [ None ] * numruns
    for i in range(numruns):
        pread[i], pwrite[i] = os.pipe()
        pid[i] = os.fork()
        if pid[i] > 0:
            pass
        else:
            os.close(pread[i])
            result[i] = func(*args, **kwds)
            with os.fdopen(pwrite[i], 'wb') as f:
                cPickle.dump((0,result[i]), f, cPickle.HIGHEST_PROTOCOL)
            os._exit(0)
    #time.sleep(17)
    while True:
        for i in range(numruns):
            os.close(pwrite[i])
            with os.fdopen(pread[i], 'rb') as f:
                stat, res = cPickle.load(f)
            result[i] = res
            #os.waitpid(pid[i], 0)
        if not None in result: break
    return result

def main():
    print 'Running multifork.py.'
    print run_in_separate_process( function, 3 )

if __name__ == "__main__":
    main()
With multifork.py, uncommenting os.waitpid(pid[i], 0) has no effect, and neither does uncommenting time.sleep() if 4 iterations are being run at once and the delay is not set to more than about 17 seconds. Given that time(1) real is something like 0m17.733s for 4 iterations done at once, I take that as an indication that the while True loop is not itself a cause of any appreciable inefficiency (since the processes all take about the same amount of time) and that the 17 seconds are indeed being consumed solely by the child processes.
Out of a profound sense of mercy I have spared you, for now, my other scheme, in which I employed subprocess.Popen() in lieu of os.fork(). With that one I had to send the function to an auxiliary script (the script that defines the command that is the first argument of Popen()) via a file. I did, however, use the same while True loop. And the results? They were almost exactly the same as with the simpler scheme I am presenting here.
Why don't you use the joblib.Parallel feature?
#!/usr/bin/env python

from joblib import Parallel, delayed
import numpy as np

np.seterr(all='raise')

def function(x):
    print '... the parameter is ', x
    arr = np.zeros(5000).reshape(-1,1)
    for r in range(3):
        for i in range(200):
            arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
    return { 'junk': arr, 'shape': arr.shape }

def main():
    print 'Running multifork.py.'
    print Parallel(n_jobs=2)(delayed(function)(3) for _ in xrange(4))

if __name__ == "__main__":
    main()
It seems that you have some bottleneck in your computations.
In your example you pass your data over a pipe, which is not a very fast method. To avoid this performance problem you should use shared memory. This is how multiprocessing and joblib.Parallel work.
Also remember that in the single-process case you don't have to serialize and deserialize the data, but with multiple processes you do.
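For illustration, here is a minimal sketch of the shared-memory idea with multiprocessing.Array, where each worker writes its result into a preallocated buffer instead of pickling it over a pipe (it relies on fork semantics, as on your Ubuntu machine, and the shapes only roughly mirror your test function):
# Sketch only: results land in shared memory, so nothing is pickled over a pipe.
import multiprocessing as mp
import numpy as np

ROWS, COLS, NJOBS = 5000, 600, 4

# Preallocate one flat double buffer per job in shared memory.
buffers = [mp.Array('d', ROWS * COLS, lock=False) for _ in range(NJOBS)]

def worker(i):
    out = np.frombuffer(buffers[i], dtype=np.float64).reshape(ROWS, COLS)
    # Fill the shared buffer in place (stand-in for the real computation).
    out[:] = np.log(np.arange(2, ROWS + 2, dtype=np.float64)).reshape(-1, 1)

if __name__ == '__main__':
    procs = [mp.Process(target=worker, args=(i,)) for i in range(NJOBS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    results = [np.frombuffer(b, dtype=np.float64).reshape(ROWS, COLS) for b in buffers]
    print(results[0].shape)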
Next, even if you have 8 cores, hyper-threading may be enabled, which splits each physical core into 2 hardware threads; so if you have 4 HW cores the OS thinks there are 8 of them. There are advantages and disadvantages to using HT, but the main point is that if you are going to load all your cores with computations for a long time, you should disable it.
For example, I have an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with 2 HW cores and HT enabled, so in top I see 4 cores. My computation times are:
n_jobs=1 - 0:07.13
n_jobs=2 - 0:06.54
n_jobs=4 - 0:07.45
This is what my lscpu output looks like:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Stepping: 7
CPU MHz: 1600.000
BogoMIPS: 6186.10
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3
Note the Thread(s) per core line.
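From Python you can check the same physical-versus-logical distinction without lscpu; a small sketch (assuming psutil >= 2.0, whose cpu_count(logical=False) reports physical cores):
# Sketch only: compare logical vs physical core counts to spot hyper-threading.
import psutil

logical = psutil.cpu_count()                 # includes HT siblings
physical = psutil.cpu_count(logical=False)   # physical HW cores only
print('logical=%d physical=%d HT enabled=%s' % (logical, physical, logical > physical))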
So in your example there is not that much computation relative to data transfer; your application doesn't have time to reap the advantages of parallelism. If you have a long computation job (about 10 minutes) I think you will see them.
ADDITION:
I've taken a more detailed look at your function. I replaced the multiple executions with just one execution of function(3) and ran it under the profiler:
$ /usr/bin/time -v python -m cProfile so.py
The output is quite long; you can view the full version here (http://pastebin.com/qLBBH5zU). But the main thing is that the program spends most of its time in the numpy.concatenate function. You can see it:
ncalls tottime percall cumtime percall filename:lineno(function)
.........
600 1.375 0.002 1.375 0.002 {numpy.core.multiarray.concatenate}
.........
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.64
.........
If you run multiple instances of this program you will see that the execution time of each individual instance increases significantly. I started 2 copies simultaneously:
$ /usr/bin/time -v python -m cProfile prog.py & /usr/bin/time -v python -m cProfile prog.py &
On the other hand I wrote a small fibo function:
def fibo(x):
    arr = (0, 1)
    for _ in xrange(x):
        arr = (arr[-1], sum(arr))
    return arr[0]
And I replaced the concatenate line with fibo(10000). In this case the execution time of a single instance is 0:22.82, while two instances take almost the same time per instance (0:24.62).
Based on this I think numpy may use some shared resource which leads to the parallelization problem, or it may be a numpy- or scipy-specific issue.
And the last thing about the code: you can replace the block below:
for r in range(3):
    for i in range(200):
        arr = np.concatenate( (arr, np.log( np.arange(2, 5002) ).reshape(-1,1) ), axis=1 )
with the single line:
arr = np.concatenate( (arr, np.log(np.arange(2, 5002).repeat(3*200).reshape(-1,3*200))), axis=1 )
I am providing an answer here so as not to leave such a muddled record. First, I can report that most-serene frist's joblib code works, and note that it's shorter and that it doesn't suffer from the limitation of the while True loop in my example, which only works efficiently when the jobs each take about the same amount of time. I see that the joblib project is actively maintained, and if you don't mind a dependency on a third-party library it may be an excellent solution; I may adopt it.
But with my test function the time(1) real value, the running time under the time wrapper, is about the same with either joblib or my poor man's code.
To answer the question of why the scaling of the running time in inverse proportion to the number of physical cores doesn't go as hoped: I had diligently prepared the test code presented here to produce similar output, and to take about as long to run per single run without parallelism, as the code of my actual project (which is likewise CPU-bound, not I/O-bound). I did so in order to test before undertaking the not-really-easy fitting of the simple code into my rather complicated project. But I have to report that in spite of that similarity, the results were much better with my actual project: I was surprised to see that I did get the sought inverse scaling of running time with the number of physical cores, more or less.
So I'm left supposing (here's my tentative answer to my own question) that perhaps the OS scheduler is fickle and very sensitive to the type of job. And there could be effects due to other processes that may be running, even if, as in my case, those other processes are hardly using any CPU time (I checked; they weren't).
Tip #1: Never name your joblib test code joblib.py (you'll get an import error). Tip #2: Never rename your joblib.py test code file and run the renamed file without deleting the joblib.py file (you'll get an import error).

pthread multithreading in Mac OS vs Windows multithreading

I've developed a multi-platform program (using the FLTK toolkit) and implemented multithreading to do intensive background tasks.
I have followed the FLTK tutorials/examples on multithreading, which involve using pthread on Mac (i.e. the function pthread_create) and Windows threading on Windows (i.e. _beginthread).
What I have noticed is that the performance is much higher on Windows, i.e. these background threads execute 3 to 4 times faster.
Why might this be? Is it the threading libraries I'm using? Surely there shouldn't be such a difference? Or could it be the runtime libraries underneath it all?
Here are my machine details
Mac:
Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz
16 GB DDR3 1600 MHz
Model MacBookPro9,1
OS: Mac OSX 10.8.5
Windows:
Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
16 GB DDR3 1600 MHz
Model: Dell Latitude E5530
OS: Windows 7 Service Pack 1
EDIT
To do a basic speed comparison I compiled this on both machines and ran it from the command line:
#include <ctime>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <sstream>

int main(int argc, char **argv)
{
    time_t t = time(NULL);
    tm* tt = localtime(&t);
    std::stringstream s;
    s << std::setfill('0') << std::setw(2) << tt->tm_mday << "/" << std::setw(2) << tt->tm_mon+1 << "/" << std::setw(4) << tt->tm_year+1900 << " " << std::setw(2) << tt->tm_hour << ":" << std::setw(2) << tt->tm_min << ":" << std::setw(2) << tt->tm_sec;
    std::cout << "1: " << s.str() << std::endl;

    double sum = 0;
    for (int i = 0; i < 100000000; i++) {
        double ii = i * 0.123456789;
        sum = sum + sin(ii) * cos(ii);
    }

    t = time(NULL);
    tt = localtime(&t);
    s.str("");
    s << std::setfill('0') << std::setw(2) << tt->tm_mday << "/" << std::setw(2) << tt->tm_mon+1 << "/" << std::setw(4) << tt->tm_year+1900 << " " << std::setw(2) << tt->tm_hour << ":" << std::setw(2) << tt->tm_min << ":" << std::setw(2) << tt->tm_sec;
    std::cout << "2: " << s.str() << std::endl;
}
Windows takes less than a second; the Mac takes 4-5 seconds. Any ideas?
On the Mac I'm compiling with g++, and on Windows with Visual Studio 2013.
SECOND EDIT
If I change the line
std::cout<<"2: "<<s.str()<<std::endl;
to
std::cout<<"2: "<<s.str()<<" "<<sum<<std::endl;
Then all of a sudden Windows takes a little bit longer...
This makes me think that the whole thing might come down to compiler optimisation (presumably, once sum is actually printed, the compiler can no longer discard the loop as dead code). So the question would be: is g++ (4.2 is the version I have) worse at optimisation, or do I need to provide additional flags?
THIRD(!) AND FINAL EDIT
I can report that I achieve comparable performance by ensuring the g++ optimisation flag -O was provided at compile time. One of those annoying things that happens so often:
A: I'm tearing my hair out on problem x
B: Are you sure you're not doing y?
A: That works, why is this information not plastered all over the place and in every tutorial on problem x on the web?
B: Did you read the manual?
A: No, if I completely read the manual for every single bit of code/program I used I would never actually get round to doing anything...
Meh.

Programmatically getting system boot up time in c++ (windows)

So quite simply, the question is how to get the system boot-up time in Windows with C/C++.
Searching for this hasn't got me an answer; I have only found a really hacky approach of reading a file timestamp (needless to say, I abandoned reading that halfway).
Another approach I found was reading the Windows diagnostics event log? Supposedly that has the last boot-up time.
Does anyone know how to do this (with hopefully not too many ugly hacks)?
GetTickCount64 "retrieves the number of milliseconds that have elapsed since the system was started."
Once you know how long the system has been running, it is simply a matter of subtracting this duration from the current time to determine when it was booted. For example, using the C++11 chrono library (supported by Visual C++ 2012):
auto uptime = std::chrono::milliseconds(GetTickCount64());
auto boot_time = std::chrono::system_clock::now() - uptime;
You can also use WMI to get the precise time of boot. WMI is not for the faint of heart, but it will get you what you are looking for.
The information in question is on the Win32_OperatingSystem object under the LastBootUpTime property. You can examine other properties using WMI Tools.
Edit:
You can also get this information from the command line if you prefer.
wmic OS Get LastBootUpTime
As an example in C# it would look like the following (Using C++ it is rather verbose):
static void Main(string[] args)
{
    // Create a query for OS objects
    SelectQuery query = new SelectQuery("Win32_OperatingSystem", "Status=\"OK\"");

    // Initialize an object searcher with this query
    ManagementObjectSearcher searcher = new ManagementObjectSearcher(query);

    string dtString;

    // Get the resulting collection and loop through it
    foreach (ManagementObject envVar in searcher.Get())
        dtString = envVar["LastBootUpTime"].ToString();
}
The "System Up Time" performance counter on the "System" object is another source. It's available programmatically using the PDH Helper methods. It is, however, not robust to sleep/hibernate cycles so is probably not much better than GetTickCount()/GetTickCount64().
Reading the counter returns a 64-bit FILETIME value, the number of 100-NS ticks since the Windows Epoch (1601-01-01 00:00:00 UTC). You can also see the value the counter returns by reading the WMI table exposing the raw values used to compute this. (Read programmatically using COM, or grab the command line from wmic:)
wmic path Win32_PerfRawData_PerfOS_System get systemuptime
That query produces 132558992761256000 for me, corresponding to Saturday, January 23, 2021 6:14:36 PM UTC.
You can use the PerfFormattedData equivalent to get a floating point number of seconds, or read that from the command line in wmic or query the counter in PowerShell:
Get-Counter -Counter '\system\system up time'
This returns an uptime of 427.0152 seconds.
I also implemented each of the other 3 answers and have some observations that may help those trying to choose a method.
Using GetTickCount64 and subtracting from current time
The fastest method, clocking in at 0.112 ms.
Does not produce a unique/consistent value at the 100-ns resolution of its arguments, as it is dependent on clock ticks. Returned values are all within 1/64 of a second of each other.
Requires Vista or newer; XP's 32-bit counter rolls over at ~49 days and can't be used for this approach if your application/library must support older Windows versions.
Using WMI query of the LastBootUpTime field of Win32_OperatingSystem
Took 84 ms using COM, 202ms using wmic command line.
Produces a consistent value as a CIM_DATETIME string
WMI class requires Vista or newer.
Reading Event Log
The slowest method, taking 229 ms
Produces a consistent value in units of seconds (Unix time)
Works on Windows 2000 or newer.
As pointed out by Jonathan Gilbert in the comments, is not guaranteed to produce a result.
The methods also produced different timestamps:
UpTime: 1558758098843 = 2019-05-25 04:21:38 UTC (sometimes :37)
WMI: 20190524222528.665400-420 = 2019-05-25 05:25:28 UTC
Event Log: 1558693023 = 2019-05-24 10:17:03 UTC
Conclusion:
The Event Log method is compatible with older Windows versions and produces a consistent timestamp in Unix time that's unaffected by sleep/hibernate cycles, but it is also the slowest. Given that this is unlikely to be run in a loop, this may be an acceptable performance impact. However, using this approach still requires handling the situation where the Event Log reaches capacity and deletes older messages, potentially using one of the other options as a backup.
C++ Boost used to use WMI LastBootUpTime but switched, in version 1.54, to checking the system event log, and apparently for a good reason:
ABI breaking: Changed bootstamp function in Windows to use EventLog service start time as system bootup time. Previously used LastBootupTime from WMI was unstable with time synchronization and hibernation and unusable in practice. If you really need to obtain pre Boost 1.54 behaviour define BOOST_INTERPROCESS_BOOTSTAMP_IS_LASTBOOTUPTIME from command line or detail/workaround.hpp.
Check out boost/interprocess/detail/win32_api.hpp, around line 2201, the implementation of the function inline bool get_last_bootup_time(std::string &stamp) for an example. (I'm looking at version 1.60, if you want to match line numbers.)
Just in case Boost ever dies somehow and my pointing you to Boost doesn't help (yeah right), the function you'll want is mainly ReadEventLogA and the event ID to look for ("Event Log Started" according to Boost comments) is apparently 6005.
I haven't played with this much, but I personally think the best way is probably going to be to query the start time of the "System" process. On Windows, the kernel allocates a process on startup for its own purposes (surprisingly, a quick Google search doesn't easily uncover what its actual purposes are, though I'm sure the information is out there). This process is called simply "System" in the Task Manager, and always has PID 4 on current Windows versions (apparently NT 4 and Windows 2000 may have used PID 8 for it). This process never exits as long as the system is running, and in my testing behaves like a full-fledged process as far as its metadata is concerned. From my testing, it looks like even non-elevated users can open a handle to PID 4, requesting PROCESS_QUERY_LIMITED_INFORMATION, and the resulting handle can be used with GetProcessTimes, which will fill in the lpCreationTime with the UTC timestamp of the time the process started. As far as I can tell, there isn't any meaningful way in which Windows is running before the System process is running, so this timestamp is pretty much exactly when Windows started up.
#include <iostream>
#include <iomanip>
#include <memory>
#include <type_traits>
#include <windows.h>

using namespace std;

int main()
{
    unique_ptr<remove_pointer<HANDLE>::type, decltype(&::CloseHandle)> hProcess(
        ::OpenProcess(
            PROCESS_QUERY_LIMITED_INFORMATION,
            FALSE, // bInheritHandle
            4),    // dwProcessId
        ::CloseHandle);

    FILETIME creationTimeStamp, exitTimeStamp, kernelTimeUsed, userTimeUsed;
    FILETIME creationTimeStampLocal;
    SYSTEMTIME creationTimeStampSystem;

    if (::GetProcessTimes(hProcess.get(), &creationTimeStamp, &exitTimeStamp, &kernelTimeUsed, &userTimeUsed)
        && ::FileTimeToLocalFileTime(&creationTimeStamp, &creationTimeStampLocal)
        && ::FileTimeToSystemTime(&creationTimeStampLocal, &creationTimeStampSystem))
    {
        __int64 ticks =
            ((__int64)creationTimeStampLocal.dwHighDateTime) << 32 |
            creationTimeStampLocal.dwLowDateTime;

        wios saved(NULL);
        saved.copyfmt(wcout);

        wcout << setfill(L'0')
              << setw(4)
              << creationTimeStampSystem.wYear << L'-'
              << setw(2)
              << creationTimeStampSystem.wMonth << L'-'
              << creationTimeStampSystem.wDay
              << L' '
              << creationTimeStampSystem.wHour << L':'
              << creationTimeStampSystem.wMinute << L':'
              << creationTimeStampSystem.wSecond << L'.'
              << setw(7)
              << (ticks % 10000000)
              << endl;

        wcout.copyfmt(saved);
    }
}
Comparison for my current boot:
system_clock::now() - milliseconds(GetTickCount64()):
2020-07-18 17:36:41.3284297
2020-07-18 17:36:41.3209437
2020-07-18 17:36:41.3134106
2020-07-18 17:36:41.3225148
2020-07-18 17:36:41.3145312
(result varies from call to call because system_clock::now() and ::GetTickCount64() don't run at exactly the same time and don't have the same precision)
wmic OS Get LastBootUpTime
2020-07-18 17:36:41.512344
Event Log
No result because the event log entry doesn't exist at this time on my system (earliest event is from July 23)
GetProcessTimes on PID 4:
2020-07-18 17:36:48.0424863
It's a few seconds different from the other methods, but I can't think of any way that it is wrong per se, because, if the System process wasn't running yet, was the system actually booted?

Increasing SQLite SELECT performance

I have a program that does some math in an SQL query. There are hundreds of thousands of rows (some device measurements) in an SQLite table, and using this query the application breaks these measurements into groups of, for example, 10000 records, and calculates the average for each group. Then it returns the average value for each of these groups.
The query looks like this:
SELECT strftime('%s',Min(Stamp)) AS DateTimeStamp,
AVG(P) AS MeasuredValue,
((100 * (strftime('%s', [Stamp]) - 1334580095)) /
(1336504574 - 1334580095)) AS SubIntervalNumber
FROM LogValues
WHERE ((DeviceID=1) AND (Stamp >= datetime(1334580095, 'unixepoch')) AND
(Stamp <= datetime(1336504574, 'unixepoch')))
GROUP BY ((100 * (strftime('%s', [Stamp]) - 1334580095)) /
(1336504574 - 1334580095)) ORDER BY MIN(Stamp)
The numbers in this request are substituted by my application with some values.
I don't know if I can optimize this query any further (if anyone could help me do so, I'd really appreciate it).
This SQL query can be executed using an SQLite command line shell (sqlite3.exe). On my Intel Core i5 machine it takes 4 seconds to complete (there are 100000 records in the database that are being processed).
Now, if I write a C program using the sqlite.h C interface, I wait 14 seconds for exactly the same query to complete. This C program "waits" during these 14 seconds on the first sqlite3_step() function call (any following sqlite3_step() calls execute immediately).
From the SQLite download page I downloaded the command line shell's source code and built it using Visual Studio 2008. I ran it and executed the query. Again, 14 seconds.
So why does the prebuilt command line tool, downloaded from the SQLite website, take only 4 seconds, while the same tool built by me takes 4 times longer to execute?
I am running 64-bit Windows. The prebuilt tool is an x86 process. It also does not seem to be multicore-optimized: in Task Manager, during query execution, I can see only one core busy for both the SQLite shell I built and the prebuilt one.
Any way I could make my C program execute this query as fast as the prebuilt command line tool does it?
Thanks!