Profiling Python on large dataset - python-2.7

I have a dataset with 3 million lines to process. The processing functions are cythonized. When I run the entire processing on a small subsample of 10,000 lines, it takes about 1.5 minutes; a subsample of 30,000 lines takes about 3 minutes. Extrapolating linearly from that, I expect the full dataset to take at most about 5 hours, yet after 10 hours only a quarter of it has been processed. I'm running Ubuntu 14.04 64-bit and Anaconda 64-bit. RAM usage sits at 50%. I disabled the automatic switch to the login screen after a period of inactivity, but performance stayed the same; turning off the screen after inactivity didn't influence execution time either. What else could be the reason for this unexpectedly slow execution?

Related

Latency jitters when using shared memory for IPC

I am using shared memory for transferring data between two processes, using boost::interprocess::managed_shared_memory to allocate a vector as a buffer, and atomic variables to enforce memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for a setup with two processes:
Sender process: writes to the buffer in shared memory, then sleeps, so it pushes data at intervals of around 55 microseconds.
Receiver process: runs a busy loop to see whether something can be consumed from the buffer.
I am using a RingBuffer of size 4K (generous, for safety), although ideally at most one element will ever be in the buffer with the current setup. I push data around 3 million times to get a good estimate of the end-to-end latency.
To measure the latency, I take the current time in nanoseconds and store it in a vector (resized to 3 million at the start). I have a 6-core setup with isolated CPUs, and I taskset the sender and receiver processes to different cores. I also make sure no other program of mine is running on the machine while testing. Output of /proc/cmdline:
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
I have already verified that all data is transferred accurately and nothing is lost, so a simple row-wise subtraction of the timestamps is sufficient to get the latency.
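For reference, a minimal sketch of the timestamping scheme described above (the shared-memory plumbing is omitted and all names here are illustrative, not the actual code):

#include <chrono>
#include <cstdint>
#include <vector>

// Monotonic nanosecond timestamp; steady_clock so the values never jump backwards.
static inline std::uint64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
        std::chrono::steady_clock::now().time_since_epoch()).count();
}

// Resized/reserved up front so no allocation happens in the hot path.
// In practice each process keeps only its own vector; they are compared offline after the run.
std::vector<std::uint64_t> send_ts(3000000), recv_ts(3000000);

// Sender:   send_ts[i] = now_ns();  just before pushing element i into the ring buffer.
// Receiver: recv_ts[i] = now_ns();  right after a successful pop of element i.
// After the run the per-message latency is the row-wise subtraction:
//   latency[i] = recv_ts[i] - send_ts[i];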
I am getting a latency of around 300-400 nanoseconds as the mean and median of the distribution, but the standard deviation was far too high (a few thousand nanoseconds). Looking at the numbers, I found 2-3 instances where the latency shoots up to 600,000 nanos and then gradually comes back down (in steps of around 56,000 nanos; probably queueing is happening and consecutive pops from the buffer are succeeding). Attaching a sample "jitter" here:
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery data points, the standard deviation becomes much smaller. So I went digging into what the reason could be. Initially I looked for a pattern, or whether it occurs periodically, but it does not seem so in my opinion.
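To make the effect of those few spikes concrete, the post-processing is roughly of this form (a sketch; latency is assumed to hold the per-message nanosecond values computed above):

#include <algorithm>
#include <cmath>
#include <vector>

struct Stats { double mean, median, stddev; };

// Plain mean / median / population standard deviation over the latency samples.
Stats summarize(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();
    double var = 0.0;
    for (double x : v) var += (x - mean) * (x - mean);
    var /= v.size();
    return { mean, v[v.size() / 2], std::sqrt(var) };
}

// Comparing summarize(latency) against the same call on a copy with the samples above a few
// thousand nanoseconds removed shows how strongly the 2-3 spikes dominate the standard
// deviation while barely moving the mean and median.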
I ran the receiver process with perf stat -d; it clearly shows the number of context switches to be 0.
Interestingly, when looking at the receiver process's /proc/${pid}/status, I monitor
voluntary_ctxt_switches and nonvoluntary_ctxt_switches and see that nonvoluntary_ctxt_switches increases at a rate of around 1 per second, while voluntary_ctxt_switches stays constant once the data sharing starts. But for the roughly 200 seconds my setup runs, the number of latency spikes is only 2 or 3, which does not match the frequency of these context switches. (What does this count represent then?)
I also followed a thread that seemed relevant, but couldn't get anything useful out of it.
For the core running the receiver process, the trace on core 1 with context switches is (though the number of spikes this time was 5):
$ grep " 1)" trace | grep "=>"
1) jemallo-22010 => <idle>-0
2) <idle>-0 => kworker-138
3) kworker-138 => <idle>-0
I also checked the difference between /proc/interrupts before and after the run. The differences are:
name                                 receiver_core   sender_core
enp1s0f0np1-0                        2               0
eno1                                 0               3280
Non-maskable interrupts              25              25
Local timer interrupts               2K              ~3M
Performance monitoring interrupts    25              25
Rescheduling interrupts              9               12
Function call interrupts             120             110
Machine-check polls                  1               1
I am not exactly sure what most of these numbers represent, but I am curious why there are rescheduling interrupts, and what enp1s0f0np1-0 is.
It might be that the spikes are not caused by context switches in the first place, but a number in the 600-microsecond range does hint in that direction. Leads in any other direction would be very helpful. I have also tried restarting the server.
It turns out the problem was indeed not related to context switches.
I was also dumping the received data to a file. Stopping that recording removed the spikes, so the high latency was caused by some kind of write flush.
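For anyone hitting the same thing, one way to keep the disk write out of the measurement path is to buffer everything in a preallocated vector and write the file only after the run (a sketch, not the original code; the file name is illustrative):

#include <cstdint>
#include <fstream>
#include <vector>

int main() {
    std::vector<std::uint64_t> received;
    received.reserve(3000000);                      // no allocation, no I/O in the hot path

    // ... busy-poll loop: received.push_back(value); for every element popped ...

    // Single write after the measurement loop, so no flush can land inside it.
    std::ofstream out("received.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(received.data()),
              received.size() * sizeof(std::uint64_t));
}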

Processing tasks in parallel in specific time frame without waiting for them to finish

This is a question about concurrency/parallelism and processes. I am not sure how to express it, so please forgive my ignorance.
It is not related to any specific language, although I'm using Rust lately.
The question is whether it is possible to launch processes concurrently/in parallel, without waiting for them to finish, within a specific time frame, even when the total running time of the processes exceeds that time frame.
For example: let's say I have 100 HTTP requests that I want to launch within one second, spaced 10 ms apart. Each request takes roughly 50 ms, and I have a computer with 2 cores to run them on.
In parallel that would be 100 tasks / 2 cores, i.e. 50 tasks each. The problem is that 50 tasks * 50 ms each is 2500 ms in total, so two and a half seconds to run the 100 tasks in parallel.
Would it be possible to launch all these tasks in 1s?
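For what it's worth, if the tasks are I/O-bound (as HTTP requests usually are), a task that is waiting on the network does not occupy a core, so 100 overlapping 50 ms waits can fit inside roughly one second even on 2 cores. A quick sketch in C++ of that reasoning (the sleep stands in for the request; the same idea applies to async tasks in Rust):

#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    using namespace std::chrono;
    const auto start = steady_clock::now();

    std::vector<std::future<void>> tasks;
    for (int i = 0; i < 100; ++i) {
        // Launch without waiting for completion; the 50 ms sleep stands in for the request.
        tasks.push_back(std::async(std::launch::async, [] {
            std::this_thread::sleep_for(milliseconds(50));
        }));
        std::this_thread::sleep_for(milliseconds(10));   // 10 ms spacing between launches
    }
    for (auto& t : tasks) t.wait();                      // only now wait for everything

    const auto total = duration_cast<milliseconds>(steady_clock::now() - start);
    std::cout << "100 tasks finished in " << total.count() << " ms\n";   // roughly 1000-1100 ms
}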

CLIPS EnvAssertString API-function slow performance

I'm trying to develop an expert system that is capable of handling a real-time data flow. While coding I found a delay in operation that varied from 3 to almost 20 milliseconds, which is totally inappropriate for the project. Profiling the application showed that the problem lies in the EnvAssertString function, whereas EnvRun does not produce any delay. I tried temporarily disabling garbage collection before EnvAssertString, but it didn't help. The function in question is called between 10 and 50 times in a row while processing a single block of data, and the blocks arrive at a rate of approximately 15 per second.
How can I fix this? Is there any chance of speeding the process up? Is CLIPS suitable at all for a real-time response like this (several calls in a row to EnvAssertString shouldn't take longer than 1 ms)?
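A minimal harness for isolating the per-call cost might look like this (a sketch; the header name, fact string, and exact signatures are assumed from the CLIPS 6.x environment C API and should be adjusted to the actual project):

#include <chrono>
#include <cstdio>
extern "C" {
#include "clips.h"                    // assumed header name for the CLIPS C API
}

int main() {
    void *env = CreateEnvironment();  // environment-based API, as in the question
    using Clock = std::chrono::steady_clock;

    for (int i = 0; i < 50; ++i) {    // mimic the 10-50 asserts per data block
        char fact[64];
        std::snprintf(fact, sizeof fact, "(sample-fact %d 42)", i);   // hypothetical ordered fact

        const auto t0 = Clock::now();
        EnvAssertString(env, fact);
        const auto t1 = Clock::now();

        std::printf("assert %2d: %lld us\n", i, (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    }

    EnvRun(env, -1);                  // the run itself reportedly shows no delay
    DestroyEnvironment(env);
}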

How is Flink checkpointing time related to buffer alignment size or alignment time?

My streaming Flink job has a checkpointing time of 2-3 s about 15-20% of the time and 3-4 min about 8-12% of the time, with an average of about 2 min. We have two stateful operators: the first is the Kafka consumer used as the source (FlinkKafkaConsumer010), the other is the HDFS sink (CustomBucketingSink). Together they produce a state of around 1-1.5 GB for savepoints and 800 MB-6 GB (3 GB on average) for checkpoints. We use a 30 s tumbling processing window. The checkpointing duration and the minimum pause between two checkpoints are both 3 min. My job consumes around 3 million records per minute on average and around 20 million records per minute at peak. There is more than enough CPU and memory for Flink.
Now here are my doubts :
1) Even when some checkpoints have far less state (70-80% less) than others, they sometimes take minutes (15-20% of the time) while the others take 5-10 seconds.
2) The buffer alignment size sometimes increases to 7-8 GB compared to the 800 MB-1 GB average, yet the checkpointing time is not affected by it. I would expect it to take longer, since the checkpoint has to wait for the barriers.
3) Will checkpointing time be affected if we increase the tumbling window size? My assumption is that it should affect neither savepoint time nor checkpoint time.
4) A few of the sub-tasks that sink into HDFS take 2-3 minutes (5-10% of the time). So while 98% of the sub-tasks complete in 30-50 seconds, 1-2 sub-tasks (95% of the time it's only one) take 2-3 minutes, which delays the whole checkpoint. The problem is not with the node these sub-tasks run on, because it happens sometimes on one node and sometimes on another.
5) We get an exception once every 6-8 hours, which restarts the job: TimerException{java.nio.channels.ClosedByInterruptException} at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:288)
6) How can the alignment buffer time be minimized?
7) Savepoint time increases or decreases with the input rate or state size, but checkpointing time does not behave the same way. Checkpointing time sometimes even shows an inverse relation with state size, or you could say it is not affected by state size at all.
8) Whenever we restart the job, all sub-tasks take a uniform time on all nodes for 2-3 days, but afterwards 1-2 sub-tasks take 2-3 minutes compared to the others, which take 15-30 seconds. I might be wrong about this behaviour, but as far as I have observed, this is also the case.
Note that windows are stateful, and unless you are doing incremental aggregation, longer windows have more state, which will in turn affect checkpoint sizes and durations.
It would be helpful to know which state backend you are using, and whether or not you are using incremental checkpointing.
I would start by trying to find the cause of the slow sink subtask(s) causing the backpressure, which is in turn causing the painful checkpointing. Could be data skew, or resource starvation, for example. Some common causes include insufficient CPU, network, or disk bandwidth, or AWS (or other API) rate limits. It may seem that you have plenty of CPU, for example, but one hot key can put way too much load on one thread, and thereby hold back the entire cluster.
If you find a way to correct the imbalance at the sink, then the checkpoint alignment problems should calm down. (Note that if you can tolerate duplicate results, you could disable checkpoint barrier alignment by choosing CheckpointingMode.AT_LEAST_ONCE.)

OpenCL time measurement issues with AMD GPU

I recently compared two ways of measuring kernel runtime and I am seeing some confusing results.
I use an AMD Bobcat CPU (E-350) with integrated GPU and Ubuntu Linux (CL_PLATFORM_VERSION is OpenCL 1.2 AMD-APP (923.1)).
The basic gettimeofday idea looks like this:
clFinish(...);                // make sure all previously queued work on the command queue is done
gettimeofday(&starttime, 0);
clEnqueueNDRangeKernel(...);  // enqueue the kernel and get back an event
clFlush(...);                 // submit the enqueued command to the device
clWaitForEvents(...);         // block until the kernel event has completed
gettimeofday(&endtime, 0);
This says the kernel needs around 5466 ms.
For the second measurement I used clGetEventProfilingInfo with QUEUED / SUBMIT / START / END (a sketch of how these values are read follows below).
With the four time values I can calculate the time spent in the different states:
time spent queued: 0.06 ms,
time spent submitted: 2733 ms,
time spent in execution: 2731 ms (the actual execution time).
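The four timestamps come from clGetEventProfilingInfo roughly like this (a sketch; it assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE and that kernel_event is the event returned by clEnqueueNDRangeKernel):

cl_ulong queued = 0, submitted = 0, started = 0, ended = 0;
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitted, NULL);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &started, NULL);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ended, NULL);

/* All values are device timestamps in nanoseconds. */
double ms_queued    = (submitted - queued)    * 1e-6;  /* time spent queued     */
double ms_submitted = (started   - submitted) * 1e-6;  /* time spent submitted  */
double ms_executing = (ended     - started)   * 1e-6;  /* actual execution time */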
I see that it adds up to the 5466 ms, but why does it stay in the submitted state for half the time?
And the funny things are:
the submitted state is always half of the actual execution time, even for different kernels or different workloads (so it can't be a constant setup time),
for the CPU the time spent in the submitted state is 0 and the execution time equals the gettimeofday result,
I tested my kernels on an Intel Ivy Bridge under Windows using both CPU and GPU and I didn't see the effect there.
Does anyone have a clue?
I suspect that either the GPU runs the kernel twice (making the gettimeofday result double the actual execution time) or that clGetEventProfilingInfo is not working correctly for the AMD GPU.
I posted the problem in an AMD forum. They say it's a bug in the AMD profiler.
http://devgurus.amd.com/thread/159809