Extra evaluation time when using GPU compared to CPU - python-2.7

Since I am still very new to Tensorflow, I don't know what caused this problem. I am currently using Tensorflow to do text classification. I needed the timing to evaluate 1 data since I need precision. When I time the evaluation time in CPU, the time required is constant while there seems to be additional time when using GPU.
Here is the statistics:
Using GPU:
1 data --> 300 ms.
10 data --> 300 ms. 30 ms each.
100 data --> 400 ms. 4 ms each.
1000 data --> 4000 ms. 4 ms each.
Using CPU :
1 data --> 10 ms.
10 data --> 100 ms. 10 ms each.
100 data --> 1000 ms. 10 ms each.
1000 data --> 10000 ms. 10 ms each.

Related

429 at very low qps despite adequate headroom

Xoogler in the cloud here. I have a very low qps service that serves HTML plus the follow-up resources. So it typically sits idle and then receives something in the order of 20 requests over 5s with concurrency well below 10, where concurrency limit is 80. I observe that clients regularly receive 429s from Cloud Run, typically after periods of service inactivity, even though an instance is still up (so it's not a cold-start problem). This can either be on the first request but often somewhere in the middle of the sequence (i.e. icons, css don't load).
The instance is concurrent, responsive and could easily handle the load, but Cloud Run doesn't let it. No other instances are spun up either, although we're not even at the max of 2. This suggests that Cloud Run for some reason estimates >2 instances needed?
Here's a typical request sequence, redacted from the logs:
... 20 min idle ...
I 2020-03-27T18:21:27.619317Z GET 307 288 B 5 ms
I 2020-03-27T18:21:27.706580Z GET 302 0 B 0 ms
I 2020-03-27T18:21:27.760271Z GET 200 5.83 KiB 5 ms
I 2020-03-27T18:21:27.838066Z GET 200 1.89 KiB 4 ms
I 2020-03-27T18:21:27.882751Z GET 200 1.05 KiB 4 ms
I 2020-03-27T18:21:27.886743Z GET 200 582 B 3 ms
I 2020-03-27T18:21:27.893060Z GET 200 533 B 4 ms
I 2020-03-27T18:21:27.897352Z GET 200 5.35 KiB 4 ms
I 2020-03-27T18:21:27.899086Z GET 200 11.38 KiB 6 ms
I 2020-03-27T18:21:27.905967Z GET 200 22.48 KiB 13 ms
I 2020-03-27T18:21:27.906113Z GET 200 592 B 13 ms
I 2020-03-27T18:21:27.907967Z GET 200 35.08 KiB 14 ms
...500ms...
I 2020-03-27T18:21:28.434846Z GET 200 2.76 MiB 50 ms
I 2020-03-27T18:21:28.465552Z GET 200 2.29 MiB 67 ms <= up to here all resources served from image
...2500ms...
I 2020-03-27T18:21:31.086943Z GET 200 2.95 KiB 706 ms <= IO-bound, talking to backend api
...1600ms...
W 2020-03-27T18:21:32.674973Z GET 429 14 B 0 ms <= !!!
W 2020-03-27T18:21:32.675864Z GET 429 14 B 0 ms <= !!!
W 2020-03-27T18:21:32.676292Z GET 429 14 B 0 ms <= !!!
I 2020-03-27T18:21:32.684265Z GET 200 547 B 6 ms
I 2020-03-27T18:21:32.686695Z GET 200 504 B 9 ms
I 2020-03-27T18:21:32.690580Z GET 200 486 B 12 ms
Conceivably that last group of requests are 6 parallel requests. Why would three be denied and three served? The service is way under capacity. A couple of reloads typically solve the issue.
It really appears to me as if the algorithm vastly overestimates the required resources after a period of inactivity. I'm happy to try a larger max-instances (redeployed to 10 now) but something really seems off with the estimates on the low end of the spectrum. If "2" as a concurrency setting is below what the platform supports, gcloud probably should probably enforce a higher minimum in the first place.
This is somewhat sad as it impacts people just "trying out" Cloud Run and they observe intermittent errors (partially rendered pages, ...) - which are even pinned on the client (4xx) who is certainly not at fault.
Happy to provide more data.
Configuration:
template:
metadata:
...
annotations:
...
autoscaling.knative.dev/maxScale: '2'
spec:
timeoutSeconds: 900
...
containerConcurrency: 80
containers:
...
resources:
limits:
cpu: 1000m
memory: 244Mi
This looks like a known issue with Cloud Run, I would recommend starring it to receive notifications and expedite resolution.

Intel MPI Benchmarks result exceeds network bandwidth

I was running the Intel MPI Benchmarks on a mini MPI cluster of two nodes on AWS EC2.
The executables of the benchmark were compiled with make CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx.
The cluster was set up based on this tutorial.
(What I did differently is that I installed OpenMPI instead of MPICH.)
The type of the AWS instances is t3a.xlarge.
According to this spec the network performance is up to 5 Gbps (i.e. 625 MB/s).
I chose Ubuntu 18.04 as the OS for these instances.
The PingPong benchmark gave me surprising results:
mpiuser#mpin0:~$ mpirun -rf rankfile -np 2 ./IMB-MPI1 PingPong
#------------------------------------------------------------
# Intel(R) MPI Benchmarks 2019 Update 3, MPI-1 part
#------------------------------------------------------------
# Date : Thu Nov 28 15:17:38 2019
# Machine : x86_64
# System : Linux
# Release : 4.15.0-1054-aws
# Version : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
# MPI Version : 3.1
# MPI_Datatype : MPI_BYTE
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 1000 32.05 0.00
1 1000 31.74 0.03
2 1000 31.98 0.06
4 1000 32.08 0.12
8 1000 31.86 0.25
16 1000 31.30 0.51
32 1000 31.40 1.02
64 1000 32.86 1.95
128 1000 32.66 3.92
256 1000 33.56 7.63
512 1000 35.00 14.63
1024 1000 34.74 29.48
2048 1000 36.56 56.02
4096 1000 39.80 102.91
8192 1000 49.92 164.10
16384 1000 57.73 283.79
32768 1000 78.86 415.54
65536 640 170.26 384.91
131072 320 239.80 546.60
262144 160 357.34 733.61
524288 80 609.82 859.74
1048576 40 1106.36 947.77
2097152 20 2514.62 833.98
4194304 10 5830.39 719.39
Some of the speeds exceed 625 MB/s.
The rankfile was used to force distributing the two processes onto two nodes.
It is not the case that the two MPI processes are running on the same node.
The content of the rankfile is:
rank 0=mpin0 slot=0
rank 1=mpin1 slot=0
What are the possible reasons that caused the benchmark result to exceed the theoretical limit?

Why do I get such huge jitter in time measurement?

I'm trying to measure a function's performance by measuring the time for each iteration.
During the process, I found even if I do nothing, the results still vary quite a bit.
e.g.
volatile long count = 0;
for (int i = 0; i < N; ++i) {
measure.begin();
++count;
measure.end();
}
In measure.end(), I measure the time difference and keep an unordered_map to keep track of the time-count.
I've used clock_gettime as well as rdtsc, but there's always about 1% of the data points lie far away from mean, in a 1000 factor.
Here's what the above loop generates:
T: count percentile
18 117563 11.7563%
19 111821 22.9384%
21 201605 43.0989%
22 541095 97.2084%
23 2136 97.422%
24 2783 97.7003%
...
406 1 99.9994%
3678 1 99.9995%
6662 1 99.9996%
17945 1 99.9997%
18148 1 99.9998%
18181 1 99.9999%
22800 1 100%
mean:21
So whether it's ticks or ns, the worst case 22800 is about 1000 times bigger than mean.
I did isolcpus in grub and was running this with taskset. The simple loop almost does nothing, the hash table to do time-count statistics is outside of the time measurements.
What am I missing?
I'm running this on a laptop with ubuntu installed, CPU is Intel(R) Core(TM) i5-2520M CPU # 2.50GHz
Thank you for all the answers.
The main interrupt that I couldn't stop is the local timer interrupt. And it seems new 3.10 kernel would support tickless. I'll try that one.

SoapUI load test, calculate cnt in variance strategy

I work with SoapUI project and I have one question. In following example I've got 505 requests in 5 seconds with thread count =5. I would like to understand how count has been calculated in this example.
For example, if I want 1000 request in 1 minute what setting should I set in variance strategy?
Regards, Evgeniy
variance strategy as the name implies, it varies the number of threads overtime.Within the specified interval the threads will increase and decrease as per the variance value, thus simulating a realistic real time load on target web-service.
How variance is calculated : its not calculated using the mathematical variance formula. its just a multiplication. (if threads = 10 and variance = 0.5 then 10 * 0.5 = 5. The threads will be incremented and decremented by 5)
For example:
Threads = 20
variance = 0.8
Strategy = variance
interval = 60
limit = 60 seconds
the above will vary the thread by 16 (because 20 * 0.8 = 16), that is the thread count will increase to 36 and decrease to 4 and end with the original 20 within the 60 seconds.
if your requirement is to start with 500 threads and hit 1000 set your variance to 2 and so on.
refrence link:
chek the third bullet - simulating different type of load - soapUI site
Book for reference:
Web Service Testing with SoapUi by Charitha kankanamge

Jython + Django not ready for production?

So recently I was playing around with Django on the Jython platform and wanted to see its performance in "production". The site I tested with was just a simple return HttpResponse("Time %.2f" % time.time()) view, so no database involved.
I tried the following two combinations (measurements done with ab -c15 -n500 -k <url>, everything in Ubuntu Server 10.10 on VirtualBox):
J2EE application server (Tomcat/Glassfish), deployed WAR file
I get results like
Requests per second: 143.50 [#/sec] (mean)
[...]
Percentage of the requests served within a certain time (ms)
50% 16
66% 16
75% 16
80% 16
90% 31
95% 31
98% 641
99% 3219
100% 3219 (longest request)
Obviously, the server hangs for a few seconds once in a while, which is not acceptable. I assume it has something to do with reloading Jython because starting the jython shell takes about 3 seconds, too.
AJP serving using patched flup package (+ Apache as frontend)
Note: flup is the package used by manage.py runfcgi, I had to patch it because flup's threading/forking support doesn't seem to work on Jython (-> AJP was the only working method).
Almost the same results here, but sometimes the last 100 requests don't even get answered at all (but server process still alive).
I'm asking this on SO (instead of serverfault) because it's very Django/Jython-specific. Does anyone have experience with deploying Django sites on Jython? Is there maybe another (faster) way to serve the site? Or is it just too early to use Django on the Java platform?
So as nobody replied, I investigated a bit more and it seems like my problem might have to do with VirtualBox. Using different server OSes (Debian Squeeze, Ubuntu Server), I had similar problems. For example, with simple static file serving, I got this result from the Apache web server (on Debian):
> ab -c50 -n1000 http://ip.of.my.vm/some/static/file.css
Requests per second: 91.95 [#/sec] (mean) <--- quite impossible for static serving
[...]
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 2 22.1 0 688
Processing: 0 206 991.4 31 9188
Waiting: 0 96 401.2 16 3031
Total: 0 208 991.7 31 9203
Percentage of the requests served within a certain time (ms)
50% 31
66% 47
75% 63
80% 78
90% 156
95% 781
98% 844
99% 9141 <--- !!!!!!
100% 9203 (longest request)
This led to the conclusion that (I don't have a conclusion, but) I think the Java reloading might not be the problem here, rather the virtualization. I will try it on a real host and leave this question unanswered till then.
FOLLOWUP
Now I successfully tested a bare-bones Django site (really just the welcome page) using Jython + AJP over TCP/mod_proxy_ajp on Apache (again with patched flup package). This time on a real host (i7 920, 6 GB RAM). The result proved that my above assumption was correct and that I really should never benchmark on a virtual host again. Here's the result for the welcome page:
Document Path: /jython-test/
Document Length: 2059 bytes
Concurrency Level: 40
Time taken for tests: 24.688 seconds
Complete requests: 20000
Failed requests: 0
Write errors: 0
Keep-Alive requests: 0
Total transferred: 43640000 bytes
HTML transferred: 41180000 bytes
Requests per second: 810.11 [#/sec] (mean)
Time per request: 49.376 [ms] (mean)
Time per request: 1.234 [ms] (mean, across all concurrent requests)
Transfer rate: 1726.23 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 1.5 0 20
Processing: 2 49 16.5 44 255
Waiting: 0 48 16.5 44 255
Total: 2 49 16.5 45 256
Percentage of the requests served within a certain time (ms)
50% 45
66% 48
75% 51
80% 53
90% 69
95% 80
98% 90
99% 97
100% 256 (longest request) # <-- no multiple seconds of waiting anymore
Very promising, I would say. The only downside is that the average request time is > 40 ms whereas the development server has a mean of < 3 ms.