What are typical runtimes for Miller-Rabin primality testing?

I'm well aware that a single Miller-Rabin test runs in cubic log time. I know about Montgomery modular exponentiation and GNFS, and I'm not asking about any of that fancy theory. What I am wondering is what some representative runtimes are for MR (note that this is not the same as an RSA operation) on typical hardware (e.g., a 2.2 GHz Opteron, such-and-such graphics card, or an FPGA).

One random-base Miller-Rabin test implemented in GMP, averaged over multiple primes and multiple runs. i7-4770K @ 4.3 GHz, GMP 6.0.0a. For numbers under 64 bits, times can be faster using a non-GMP implementation (with an x86_64 asm mulmod). This implementation performs very close to most other C+GMP implementations (for the same numbers, mpz_aprcl's mpz_sprp runs within a couple of percent of the times below). Using non-standard API calls to do Montgomery math may be faster (and maybe not).
20 digits: 1.1 µs
40 digits: 3.4 µs
80 digits: 14 µs
200 digits: 0.11 ms
400 digits: 0.65 ms
800 digits: 4.7 ms
1200 digits: 15 ms
1600 digits: 31 ms
2000 digits: 53 ms
4000 digits: 310 ms
With a good implementation, BPSW (base 2 M-R + [extra] strong Lucas test) takes ~3x the cost of one M-R test. Lucas test implementations differ a lot more in performance. A Frobenius test is on the order of 2.5x the cost of a single M-R test.
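For concreteness, here is a minimal C++/GMP sketch of how one such timing could be reproduced: it implements a single strong-pseudoprime (M-R) round directly with mpz_powm and times it with std::chrono. The figures above were averaged over actual primes and many runs; this sketch (file name, candidate size and iteration count are my own choices) just times one random odd ~200-digit candidate, so treat it as a ballpark harness rather than the exact benchmark code.

// A minimal sketch (not the benchmark code used above): one strong-pseudoprime
// (Miller-Rabin) round implemented directly with GMP and timed with std::chrono.
// Build with something like: g++ -O2 mr_time.cpp -lgmp
#include <gmp.h>
#include <chrono>
#include <cstdio>

// One M-R round of odd n > 3 to base a in [2, n-2].
static bool miller_rabin_round(const mpz_t n, const mpz_t a) {
    mpz_t d, x, nm1;
    mpz_inits(d, x, nm1, NULL);
    mpz_sub_ui(nm1, n, 1);                 // n - 1 = d * 2^s, d odd
    unsigned long s = mpz_scan1(nm1, 0);
    mpz_tdiv_q_2exp(d, nm1, s);
    mpz_powm(x, a, d, n);                  // x = a^d mod n (the dominant cost)
    bool probable_prime = (mpz_cmp_ui(x, 1) == 0 || mpz_cmp(x, nm1) == 0);
    for (unsigned long r = 1; r < s && !probable_prime; ++r) {
        mpz_powm_ui(x, x, 2, n);           // repeated squaring
        if (mpz_cmp(x, nm1) == 0) probable_prime = true;
    }
    mpz_clears(d, x, nm1, NULL);
    return probable_prime;
}

int main() {
    gmp_randstate_t rng;
    gmp_randinit_default(rng);

    mpz_t n, a, bound;
    mpz_inits(n, a, bound, NULL);
    mpz_urandomb(n, rng, 664);             // a random ~200-digit candidate, just to have something to time
    mpz_setbit(n, 0);                      // make it odd
    mpz_sub_ui(bound, n, 3);
    mpz_urandomm(a, rng, bound);
    mpz_add_ui(a, a, 2);                   // random base in [2, n-2]

    const int iters = 100;
    volatile bool result = false;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        result = miller_rabin_round(n, a);
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    std::printf("one M-R round on a ~200-digit number: %.2f us (last result: %d)\n", us, (int)result);

    mpz_clears(n, a, bound, NULL);
    gmp_randclear(rng);
}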

One round of the Miller-Rabin test might take about 1 millisecond; here is an interactive implementation in JavaScript which you can run in your browser and check the timing yourself:
http://www.javascripter.net/math/primes/millerrabinprimalitytest.htm


How accurate is std::chrono?

std::chrono advertises that it can report results down to the nanosecond level. On a typical x86_64 Linux or Windows machine, how accurate would one expect this to be? What would be the error bars for a measurement of 10 ns, 10 µs, 10 ms, and 10 s, for example?
It's most likely hardware and OS dependent. For example, when I ask Windows for the clock frequency using QueryPerformanceFrequency(), I get 3903987; taking the inverse of that gives a clock period, or resolution, of about 256 nanoseconds. This is the value my operating system reports.
With std::chrono, according to the docs, the minimum representable duration is high_resolution_clock::period::num / high_resolution_clock::period::den.
num and den are the numerator and denominator of the tick period in seconds. std::chrono::high_resolution_clock tells me the numerator is 1 and the denominator is 1 billion, supposedly corresponding to 1 nanosecond:
std::cout << (double)std::chrono::high_resolution_clock::period::num /
             std::chrono::high_resolution_clock::period::den; // prints 1e-09, i.e. one nanosecond
So according to std::chrono I have one-nanosecond resolution, but I don't believe it, because the native OS call is more likely to be reporting the true frequency/period.
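As a rough way to check this on your own machine, here is a small sketch (my own, not from the answer above) that prints the advertised tick period and then measures the smallest non-zero step the clock actually reports; note that the measured step mixes true clock granularity with the overhead of calling now() itself.

// A small sketch: print the advertised tick period of high_resolution_clock and
// the smallest non-zero step it actually reports (which also includes the
// overhead of calling now() itself).
#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::high_resolution_clock;

    std::cout << "advertised period: "
              << 1e9 * clock::period::num / clock::period::den << " ns\n";

    auto best = std::chrono::nanoseconds::max();
    for (int i = 0; i < 100000; ++i) {
        auto t0 = clock::now();
        auto t1 = clock::now();
        while (t1 == t0) t1 = clock::now();   // spin until the clock visibly advances
        auto d = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
        if (d < best) best = d;
    }
    std::cout << "smallest observed step: " << best.count() << " ns\n";
}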
The accuracy will depend upon the application and how this application interacts with the operating system. I am not familiar with chrono specifically, but there are limitations at a lower level you must account for.
For example, if you timestamp network packets using the CPU, the measurement accuracy is very noisy. Even though the precision of the time measurement may be 1 nanosecond, the context switch time for the interrupt corresponding to the packet arrival may be ~1 microsecond. You can accurately measure when your application processes the packet, but not what time the packet arrived.
Short answer: not accurate at the microsecond level and below.
Long answer:
I was interested in how much time my two DP programs take to execute, so I used the chrono library, but when I ran them it reported 0 microseconds, so technically I was unable to compare them. I can't increase the array size, because it will not be possible to extend it to 1e8.
So I wrote a sort program to test it and ran it several times for an input of 100; the results are below:
[screenshot of the measured timings omitted]
The timings are clearly not consistent for the same input, so I would recommend not relying on this for high precision.

Poor speedup with libstdc++ parallel mode quick sort

I cannot get a speedup higher than 2 with the in-place sorting algorithms (quick sort and balanced quick sort; QS/BQS) from the parallel implementation of libstdc++ (parallel mode). I have tried to run the code on many different systems with 16 to 24 cores. I have also tried GNU and Intel C++ compilers, even in different versions, always with the same results. The speedup of around 2 is the same for any number of threads between 2 and the maximum.
On the contrary, multi-way merge sort (MWMS) scales well (speedup around 10 using 16 threads on a 16-core machine). According to J. Singler's presentation "The GNU libstdc++ parallel mode: Benefit from Multi-Core using the STL", their measured speedups for BQS are almost the same as for MWMS (see page 18, http://ls11-www.cs.uni-dortmund.de/people/gutweng/AD08/VO11_parallel_mode_overview.pdf); they observed speedups of over 20 with BQS using 32 threads.
Any idea why this happens or what I am doing wrong?
I have seemingly solved the problem simply by calling:
omp_set_nested(1);
The documentation is a little unclear about this requirement. Moreover, I would expect the library to be able to make this call by itself. Hopefully this will help someone else as well.
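For reference, here is a minimal sketch of how that might look, assuming GCC with libstdc++ parallel mode and the tag-based __gnu_parallel::sort overload from its documentation (the file name and input size are made up for illustration):

// A minimal sketch, assuming GCC with libstdc++ parallel mode.
// Build with something like: g++ -O2 -fopenmp bqs.cpp
#include <parallel/algorithm>   // __gnu_parallel::sort and the algorithm tags
#include <omp.h>
#include <random>
#include <vector>
#include <iostream>

int main() {
    // Without this, the balanced quicksort reportedly stalls at ~2x speedup.
    // (Newer OpenMP deprecates omp_set_nested in favour of omp_set_max_active_levels.)
    omp_set_nested(1);

    std::vector<int> v(100000000);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist;
    for (int &x : v) x = dist(gen);

    double t0 = omp_get_wtime();
    __gnu_parallel::sort(v.begin(), v.end(),
                         __gnu_parallel::balanced_quicksort_tag());
    double t1 = omp_get_wtime();

    std::cout << "sorted " << v.size() << " ints in " << (t1 - t0) << " s\n";
}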

Generating random numbers: CPU vs GPU, which currently wins?

I've been working on a physics simulation requiring the generation of a large number of random numbers (at least 10^13 if you want an idea). I've been using the C++11 implementation of the Mersenne twister. I've also read that GPU implementations of this same algorithm are now part of the CUDA libraries and that GPUs can be extremely efficient at this task; but I couldn't find explicit numbers or a benchmark comparison. For example, compared to an 8-core i7, are Nvidia cards of the latest generations more performant at generating random numbers? If so, by how much, and in which price range?
I'm thinking that my simulation could gain from having a GPU generating a huge pile of random numbers and the CPU doing the rest.
Some comparisons can be found here:
https://developer.nvidia.com/cuRAND
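To give an idea of what the GPU side could look like, here is a rough sketch using the cuRAND host API (error checking omitted; buffer size, seed and generator type are arbitrary choices): the GPU fills a large buffer of uniform floats which is then copied back for the CPU part of the simulation, matching the "GPU generates, CPU does the rest" idea in the question.

// A rough sketch of the cuRAND host API (error checking omitted).
// Build with something like: nvcc -O2 rng.cu -lcurand
#include <curand.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const size_t n = 1 << 26;                 // ~67M samples per batch
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32);  // a Mersenne Twister variant for GPUs
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_buf, n);     // fill the device buffer with uniforms in (0,1]

    std::vector<float> h_buf(n);
    cudaMemcpy(h_buf.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("first sample: %f\n", h_buf[0]);

    curandDestroyGenerator(gen);
    cudaFree(d_buf);
}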
If you have a new enough Intel CPU (Ivy Bridge or newer), you can use the RDRAND instruction.
This can be used via the _rdrand16_step(), _rdrand32_step() and _rdrand64_step() intrinsic functions.
These are available via VS2012/13, the Intel compiler and GCC.
The generated random numbers are seeded from a true hardware entropy source. Designed for NIST SP 800-90A compliance, its randomness is very high.
Some numbers for reference:
On an Ivy Bridge dual-core laptop with HT (2.3 GHz), generating 2^32 (about 4 billion) random 32-bit numbers took 5.7 seconds single-threaded and 1.7 seconds with OpenMP.
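A minimal sketch of using the intrinsic (retry count and build flags are my own choices; GCC needs -mrdrnd):

// Build with something like: g++ -O2 -mrdrnd rdrand.cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// RDRAND can transiently report failure, so retry a few times.
static bool rdrand64(uint64_t &out) {
    unsigned long long v = 0;
    for (int tries = 0; tries < 10; ++tries) {
        if (_rdrand64_step(&v)) { out = v; return true; }
    }
    return false;
}

int main() {
    uint64_t x;
    if (rdrand64(x))
        std::printf("RDRAND value: %llu\n", (unsigned long long)x);
    else
        std::printf("RDRAND unavailable or persistently failing\n");
}

Since RDRAND is a hardware DRBG rather than a bulk generator, a common pattern is to use it to seed a faster software PRNG (such as the Mersenne twister you are already using) rather than to draw every sample from it.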

What is P99 latency?

What does P99 latency represent? I keep hearing about this in discussions about an application's performance but couldn't find a resource online that would talk about this.
It's the 99th percentile. It means that 99% of the requests should be faster than the given latency. In other words, only 1% of the requests are allowed to be slower.
We can explain it through an analogy: if 100 students are running a race, then 99 of them should complete the race within the "latency" time.
Imagine that you are collecting performance data for your service, and the table below is the collection of results (the latency values are fictional, to illustrate the idea).
Latency   Number of requests
1s        5
2s        5
3s        10
4s        40
5s        20
6s        15
7s        4
8s        1
The P99 latency of your service is 7s. Only 1% of the requests take longer than that. So, if you can decrease the P99 latency of your service, you increase its performance.
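If it helps to see it in code, here is a small sketch (not from the answer above) that computes percentiles for the 100 requests in that table using the nearest-rank method; monitoring systems may interpolate slightly differently, but for this data it reproduces p99 = 7s and p95 = 6s.

// Nearest-rank percentile over the 100 requests from the table above.
#include <algorithm>
#include <cmath>
#include <vector>
#include <iostream>

double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    // Nearest-rank: the smallest value such that at least p% of samples are <= it.
    size_t rank = static_cast<size_t>(std::ceil(p / 100.0 * samples.size()));
    return samples[std::max<size_t>(rank, 1) - 1];
}

int main() {
    // The table above: 5x1s, 5x2s, 10x3s, 40x4s, 20x5s, 15x6s, 4x7s, 1x8s.
    std::vector<double> latencies;
    auto add = [&](double seconds, int count) {
        latencies.insert(latencies.end(), count, seconds);
    };
    add(1, 5); add(2, 5); add(3, 10); add(4, 40);
    add(5, 20); add(6, 15); add(7, 4); add(8, 1);

    std::cout << "p50: " << percentile(latencies, 50) << " s\n";  // 4 s
    std::cout << "p95: " << percentile(latencies, 95) << " s\n";  // 6 s
    std::cout << "p99: " << percentile(latencies, 99) << " s\n";  // 7 s
}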
Let's take an example from here:
Request latency:
min: 0.1
max: 7.2
median: 0.2
p95: 0.5
p99: 1.3
So we can say that for 99 percent of web requests the latency was 1.3 ms or less (whether that is milliseconds or microseconds depends on how your system's latency measurement is configured).
As tranmq said, if we decrease the P99 latency of the service, we increase its performance.
It is also worth noting the p95, since a few requests may make the p99 considerably more costly than the p95, e.g. initial requests that build caches, warm up class objects, initialize threads, etc.
So p95 may cut out those 5% worst-case scenarios. Still, out of that 5%, we don't know how much is real noise versus genuinely worst-case inputs.
Finally, we can have roughly 1% noise in our measurements (network congestion, outages, service degradation, and so on), so the p99 latency is a good representative of practically the worst case. And, almost always, our goal is to reduce the p99 latency.
Explaining P99 through an analogy:
If 100 horses are running a race, 99 of them should complete the race in less than or equal to the "latency" time. Only 1 horse is allowed to finish in more than the "latency" time.
That means that if P99 is 10 ms, 99 percent of requests should have latency less than or equal to 10 ms.
If the p99 value is 1 ms, it means that 99 out of 100 requests take less than 1 ms, and 1 request takes about 1 ms or more.

How to calculate Gflops of a kernel

I want a measure of how much of the peak performance my kernel achieves.
Say I have an NVIDIA Tesla C1060, which has a peak of 622.08 GFLOP/s (~= 240 cores * 1300 MHz * 2).
Now in my kernel I counted 16000 FLOPs per thread (4000 x (2 subtractions, 1 multiplication and 1 sqrt)). So with 1,000,000 threads I come up with 16 GFLOP, and as the kernel takes 0.1 seconds I would achieve 160 GFLOP/s, which would be about a quarter of the peak performance. Now my questions:
Is this approach correct?
What about comparisons (if(a>b) then....)? Do I have to consider them as well?
Can I use the CUDA profiler for easier and more accurate results? I tried the instruction counter, but I could not figure out what the figure means.
sister question: How to calculate the achieved bandwidth of a CUDA kernel
First some general remarks:
In general, what you are doing is mostly an exercise in futility and is the reverse of how most people would probably go about performance analysis.
The first point to make is that the peak value you are quoting is strictly for floating point multiply-add instructions (FMAD), which count as two FLOPS, and can be retired at a maximum rate of one per cycle. Other floating point operations which retire at a maximum rate of one per cycle would formally only be classified as a single FLOP, while others might require many cycles to be retired. So if you decided to quote kernel performance against that peak, you are really comparing your code's performance against a stream of pure FMAD instructions, and nothing more than that.
The second point is that when researchers quote FLOP/s values for a piece of code, they are usually using a model FLOP count for the operation, not trying to count instructions. Matrix multiplication and the Linpack LU factorization benchmarks are classic examples of this approach to performance benchmarking. The lower bound of the operation count of those calculations is exactly known, so the calculated throughput is simply that lower bound divided by the time. The actual instruction count is irrelevant. Programmers often use all sorts of techniques, including redundant calculations, speculative or predictive calculations, and a host of other ideas to make code run faster. The actual FLOP count of such code is irrelevant; the reference is always the model FLOP count.
Finally, when looking at quantifying performance, there are usually only two points of comparison of any real interest:
Does version A of the code run faster than version B on the same hardware?
Does hardware A perform better than hardware B doing the task of interest?
In the first case you really only need to measure execution time. In the second, a suitable measure usually isn't FLOP/s; it is useful operations per unit time (records per second in sorting, cells per second in a fluid mechanics simulation, etc.). Sometimes, as mentioned above, the useful operations can be the model FLOP count of an operation of known theoretical complexity. But the actual floating point instruction count rarely, if ever, enters into the analysis.
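To make the model-FLOP-count idea concrete, here is a tiny sketch (mine, not from the answer) that applies it to the matrixMul numbers quoted later in this thread: 2*M*N*K model FLOPs divided by the measured time, regardless of the instructions the kernel actually executed.

// Model-FLOP-count throughput for the matrixMul example quoted below:
// MatrixA(320,320), MatrixB(640,320), Time = 21.196 ms.
#include <cstdio>

int main() {
    const double M = 320, K = 320, N = 640;     // A is M x K, B is K x N
    const double seconds = 21.196e-3;

    const double model_flops = 2.0 * M * N * K; // one multiply + one add per term
    std::printf("model FLOPs: %.0f\n", model_flops);                      // 131072000
    std::printf("achieved: %.2f GFLOP/s\n", model_flops / seconds / 1e9); // ~6.18
}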
If your interest is really about optimization and understanding the performance of your code, then this presentation by Paulius Micikevicius from NVIDIA might be of interest.
Addressing the bullet point questions:
Is this approach correct?
Strictly speaking, no. If you are counting floating point operations, you would need to know the exact FLOP count from the code the GPU is running. The sqrt operation can consume a lot more than a single FLOP, depending on its implementation and the characteristics of the number it is operating on, for example. The compiler can also perform a lot of optimizations which might change the actual operation/instruction count. The only way to get a truly accurate count would be to disassemble the compiled code and count the individual floating point operations, perhaps even requiring assumptions about the characteristics of the values the code will compute.
What about comparisons (if(a>b) then....)? Do I have to consider them as well?
They are not floating point multiply-add operations, so no.
Can I use the CUDA profiler for easier and more accurate results? I tried the instructions counter, but I could not figure out, what the figure means.
Not really. The profiler can't differentiate between a floating point instruction and any other type of instruction, so (as of 2011) getting a FLOP count for a piece of code via the profiler is not possible. [EDIT: see Greg's excellent answer below for a discussion of the FLOP counting facilities available in versions of the profiling tools released since this answer was written]
Nsight VSE (>3.2) and the Visual Profiler (>=5.5) support Achieved FLOPs calculation. In order to collect the metric, the profilers run the kernel twice (using kernel replay). In the first replay the number of floating point instructions executed is collected (with understanding of predication and active mask). In the second replay the duration is collected.
nvprof and the Visual Profiler have a hardcoded definition: FMA counts as 2 operations, all other operations as 1. The flops_sp_* counters are thread instruction execution counts, whereas flops_sp is the weighted sum, so some weighting can be applied using the individual metrics. However, flops_sp_special covers a number of different instructions.
The Nsight VSE experiment configuration allows the user to define the operations per instruction type.
Nsight Visual Studio Edition
Configuring to collect Achieved FLOPS
Execute the menu command Nsight > Start Performance Analysis... to open the Activity Editor
Set Activity Type to Profile CUDA Application
In Experiment Settings set Experiments to Run to Custom
In the Experiment List add Achieved FLOPS
In the middle pane select Achieved FLOPS
In the right pane you can customize the FLOPs per instruction executed. The default weighting is for FMA and RSQ to count as 2. In some cases I have seen RSQ as high as 5.
Run the Analysis Session.
Viewing Achieved FLOPS
In the nvreport open the CUDA Launches report page.
In the CUDA Launches page select a kernel.
In the report correlation pane (bottom left) select Achieved FLOPS
nvprof
Metrics Available (on a K20)
nvprof --query-metrics | grep flop
flops_sp: Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flops_sp_add: Number of single-precision floating-point add operations executed by non-predicated threads
flops_sp_mul: Number of single-precision floating-point multiply operations executed by non-predicated threads
flops_sp_fma: Number of single-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
flops_dp_add: Number of double-precision floating-point add operations executed by non-predicated threads
flops_dp_mul: Number of double-precision floating-point multiply operations executed by non-predicated threads
flops_dp_fma: Number of double-precision floating-point multiply-accumulate operations executed by non-predicated threads
flops_sp_special: Number of single-precision floating-point special operations executed by non-predicated threads
flop_sp_efficiency: Ratio of achieved to peak single-precision floating-point operations
flop_dp_efficiency: Ratio of achieved to peak double-precision floating-point operations
Collection and Results
nvprof --devices 0 --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
==2452== NVPROF is profiling process 2452, command: matrixMul.exe
GPU Device 0: "Tesla K20c" with compute capability 3.5
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 6.18 GFlop/s, Time= 21.196 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==2452== Profiling application: matrixMul.exe
==2452== Profiling result:
==2452== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K20c (0)"
Kernel: void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
301 flops_sp FLOPS(Single) 131072000 131072000 131072000
301 flops_sp_add FLOPS(Single Add) 0 0 0
301 flops_sp_mul FLOPS(Single Mul) 0 0 0
301 flops_sp_fma FLOPS(Single FMA) 65536000 65536000 65536000
NOTE: flops_sp = flops_sp_add + flops_sp_mul + flops_sp_special + (2 * flops_sp_fma) (approximately)
Visual Profiler
The Visual Profiler supports the metrics shown in the nvprof section above.