why tanh has different results in OpenCL and C++ function

why tanh has different results in OpenCL and C++ function - c++

here is my OpenCL code.
#include <iostream>
#include <cmath>
#include <CL/cl.hpp>
int main(){
std::vector<cl::Platform> all_platforms;
cl::Platform::get(&all_platforms);
cl::Platform default_platform=all_platforms[0];
std::vector<cl::Device> all_devices;
default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
cl::Device default_device=all_devices[0];
std::cout<< "Using device: "<<default_device.getInfo<CL_DEVICE_NAME>()<<"\n";
cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(default_platform)(), 0};
cl::Context context = cl::Context(CL_DEVICE_TYPE_ALL, properties);
cl::Program::Sources sources;
std::string kernel_code=
" void __kernel simple_tanh(__global const float *A, __global float *B){ "
" B[get_global_id(0)]=tanh(A[get_global_id(0)]); "
" } ";
sources.push_back({kernel_code.c_str(),kernel_code.length()});
cl::Program program(context,sources);
if(program.build({default_device})!=CL_SUCCESS){
std::cout<<" Error building: "<<program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device)<<"\n";
exit(1);
}
cl::Buffer buffer_A(context,CL_MEM_READ_WRITE,sizeof(float));
cl::Buffer buffer_B(context,CL_MEM_READ_WRITE,sizeof(float));
float A[1]; A[0] = 0.0595172755420207977294921875000000000000f;
cl::CommandQueue queue(context,default_device);
queue.enqueueWriteBuffer(buffer_A,CL_TRUE,0,sizeof(float),A);
queue.finish();
cl::Kernel kernel=cl::Kernel(program,"simple_tanh");
kernel.setArg(0,buffer_A);
kernel.setArg(1,buffer_B);
queue.enqueueNDRangeKernel(kernel,cl::NullRange,cl::NDRange(1),cl::NullRange);
queue.finish();
float B[1];
queue.enqueueReadBuffer(buffer_B,CL_TRUE,0,sizeof(float),B);
printf("result: %.40f %.40f\n", tanh(A[0]), B[0]);
return 0;
}
after I compile with this cmd: g++ -std=c++0x hello.cc -lOpenCL -o hello, and run it. I got different results of tanh function.
Using device: Tahiti
result: 0.0594470988394579374913817559900053311139 0.0594470985233783721923828125000000000000
the first is the cpu result, and the second is the OpenCL function. which one should I trust?

When a kernel is unable to be vectorized by compiler(opencl), generated instructions could be scalar types. Then, x87 FPU computes 80 bit. SSE has precision more comparable to GPU, you need float4 or float8 in your kernel such that compiler can produce SSE/AVX which has closer precision to GPU.
Generally Intel's opencl compiler vectorizes better(for some old CPUs at least). Which implementation are you using? There can be differences even between GPUs but they all obey the rule of not crossing ULP limit. If you need more precision with GPU(and SSE/AVX), why not write you own series expansion function then? But it would make learning very slow but faster than a single FPU at least.
What is your CPU? What opencl platform are you using? Did you check generated codes for kernel with some profiler software or some kernel analyzer?
Above all, you shouldn't do this:
cl::NDRange(1)
unless it's learning purpose. This will have %99 kernel launch overhead, %1 data copy overhead and close to zero compute latency. Maybe thats why it's using 80-bit FPU instead of SSE(on CPU). Try computing for multiple-of-8 ndrange values or use float8 types in kernel to let compiler use vectorized instructions.
When global ndrange value is millions, it will have significant effect on learning time, not leraning iterations needed. If CPU can finish learning in 1 day with 1M iterations, maybe GPU can finish it in 1-hour even if it needs 10M iterations. Transcandental functions have high compute to data ratio so speed up ratio versus CPU is higher if you use more of them.
If you derive your own series expansion function to achieve more precision, it would still be much faster than single CPU core in this embarrassingly parallel kernel code.
If the neural network has only a few neurons, then maybe you can do N networks training at the same time and pick the best learner(if learning has any randomization)? So it picks even better results than CPU?

Related

How can I obtain consistently high throughput in this loop?

In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#define ARRAY_LEN 16
// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
uint64_t c0 = 0;
for (int i = 0; i < ARRAY_LEN; ++i) {
uint64_t t0;
__asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
c0 += t0;
}
return c0;
}
// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
clock_t t = clock();
for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
counts[i] = clz_array(arrays);
t = clock() - t;
// Convert clock time to milliseconds
return t * 1e3 / (double)CLOCKS_PER_SEC;
}
void print_stats(double t_ms, long n, double total_MiB) {
double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}
int main(int argc, char *argv[]) {
long n = 1 << 20;
if (argc > 1)
n = atol(argv[1]);
long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
uint64_t *buf = malloc(total_bytes);
uint32_t *counts = malloc(sizeof(uint32_t) * n);
double t_ms, total_MiB = total_bytes / (double)(1 << 20);
printf("Total size: %.1f MiB\n", total_MiB);
// Warm up
t_ms = clz_arrays(counts, buf, n);
//print_stats(t_ms, n, total_MiB); // (1)
// Run it
t_ms = clz_arrays(counts, buf, n); // (2)
print_stats(t_ms, n, total_MiB);
// Write something into buf
for (long i = 0; i < n*ARRAY_LEN; ++i)
buf[i] = i;
// And again...
(void) clz_arrays(counts, buf, n); // (3)
t_ms = clz_arrays(counts, buf, n); // (4)
print_stats(t_ms, n, total_MiB);
free(counts);
free(buf);
return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU # 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 operation per second (see Agner Fog's Skylake instruction tables) so, with 8 byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GiB/s, which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GiB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 # 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!

The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this in; it's much worse on many-core Xeon chips than on quad-core desktop/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you were adding extra print_stats to show the warm-up run performance, that would have made the run after slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.

CPU hashes faster than GPU?

I want to generate a random number, and hash that with SHA256 on my GPU using OpenCL with this base code (instead of hashing those pre-given plain-texts, it hashes the random numbers).
I got all the hashing to work on my GPU, but there is one problem:
the amount of hashes done per second lowers when using OpenCL?
Yes, you heard that correctly, at the moment it's faster to use only the CPU over only using the GPU.
My GPU runs at only ~10% while my CPU runs at ~100%
My question is: how can this be possible and more importantly, how do I fix it?
This is the code I use for generating a Pseudo-Random Number (which doesn't change at all between the 2 runs):
long Miner::Rand() {
std::mt19937 rng;
// initialize the random number generator with time-dependent seed
uint64_t timeSeed = std::chrono::high_resolution_clock::now().time_since_epoch().count();
std::seed_seq ss{ uint32_t(timeSeed & 0xffffffff), uint32_t(timeSeed >> 32) };
rng.seed(ss);
// initialize a uniform distribution between 0 and 1
std::uniform_real_distribution<double> unif(0, 1);
double rnd = unif(rng);
return floor(99999999 * rnd);
}
Here is the code that calculates the hashrate for me:
void Miner::ticker() {
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
while (true) {
Sleep(1000);
HashesPerSecond = hashes;
hashes = 0;
PrintInfo();
}
}
which gets called from here:
void Miner::Start() {
std::chrono::system_clock::time_point today = std::chrono::system_clock::now();
startTime = std::chrono::system_clock::to_time_t(today);
std::thread tickT(&Miner::ticker, this);
PostHit();
GetAPIBalance();
while (true) {
std::thread t[32]; //max 32
hashFound = false;
if (RequestNewBlock()) {
for (int i = 0; i < numThreads; ++i) {
t[i] = std::thread(&Miner::JSEMine, this);
}
for (auto& th : t)
if (th.joinable())
th.join();
}
}
}
which in turn get's called like this:
Miner m(threads);
m.Start();

CPUs have far better latency characteristics than GPUs. That is to say, CPUs can do one operation way, way WAAAAYYYY faster than a GPU can. That's not even taking into account the CPU -> Main RAM -> PCIe bus -> GDDR5 "Global" GPU -> GPU Registers -> "Global GPU" -> PCIe bus back -> Main RAM -> CPU round trip time (and I'm skipping a few steps here, like pinning and L1 Cache)
GPUs have better bandwidth characteristics than CPUs (provided that the dataset can fit inside of the GPU's limited local memory). GPUs can perform Billions of SHA256 hashes faster than a CPU can perform billions of SHA256 hashes.
Bitcoin requires millions, billions, or even trillions of hashes to achieve a competitive hash rate. Furthermore, computations can take place on the GPU without much collaboration with the CPU (removing the need for the slow round-trip through PCIe).
Its an issue of fundamental design. CPUs are designed to minimize latency, but GPUs are designed to maximize bandwidth. It seems like your problem is latency-bound (you're calculating too few SHA256 hashes for the GPU to be effective). 32 is... really really small in the scale we're talking about.
The AMD GCN architecture doesn't even perform full speed until you have at LEAST 64-work items, and arguably you really need 256 work items to maximize just one of the 44-compute units of say... a R9 290x.
I guess what I'm trying to say is: try it again with 11264 work items (or more), that's the number of work items that GPUs are designed to work with. Not 32. I got this number from 44-compute units on R9 290x * 4-Vector units per compute unit * 64-work items per vector unit.

Understanding the usage of OpenCL in OpenCV (Mat/ Umat Objects)

I ran the code below to check for the performance difference between GPU and CPU usage. I am calculating the Average time for cv::cvtColor() function. I make four function calls:
Just_mat()(Without using OpenCL for Mat object)
Just_UMat()(Without using OpenCL for Umat object)
OpenCL_Mat()(using OpenCL for Mat object)
OpenCL_UMat() (using OpenCL for UMat object)
for both CPU and GPU.
I did not find a huge performance difference between GPU and CPU usage.
int main(int argc, char* argv[])
{
loc = argv[1];
just_mat(loc);// Calling function Without OpenCL
just_umat(loc);//Calling function Without OpenCL
cv::ocl::Context context;
std::vector<cv::ocl::PlatformInfo> platforms;
cv::ocl::getPlatfomsInfo(platforms);
for (size_t i = 0; i < platforms.size(); i++)
{
//Access to Platform
const cv::ocl::PlatformInfo* platform = &platforms[i];
//Platform Name
std::cout << "Platform Name: " << platform->name().c_str() << "\n" << endl;
//Access Device within Platform
cv::ocl::Device current_device;
for (int j = 0; j < platform->deviceNumber(); j++)
{
//Access Device
platform->getDevice(current_device, j);
int deviceType = current_device.type();
cout << "Device name: " << current_device.name() << endl;
if (deviceType == 2)
cout << context.ndevices() << " CPU devices are detected." << std::endl;
if (deviceType == 4)
cout << context.ndevices() << " GPU devices are detected." << std::endl;
cout << "===============================================" << endl << endl;
switch (deviceType)
{
case (1 << 1):
cout << "CPU device\n";
if (context.create(deviceType))
opencl_mat(loc);//With OpenCL Mat
break;
case (1 << 2):
cout << "GPU device\n";
if (context.create(deviceType))
opencl_mat(loc);//With OpenCL UMat
break;
}
cin.ignore(1);
}
}
return 0;
}
int just_mat(string loc);// I check for the average time taken for cvtColor() without using OpenCl
int just_umat(string loc);// I check for the average time taken for cvtColor() without using OpenCl
int opencl_mat(string loc);//ocl::setUseOpenCL(true); and check for time difference for cvtColor function
int opencl_umat(string loc);//ocl::setUseOpenCL(true); and check for time difference for cvtColor function
The output(in miliseconds) for the above code is
__________________________________________
|GPU Name|With OpenCL Mat | With OpenCl UMat|
|_________________________________________|
|--Carrizo---|------7.69052 ------ |------0.247069-------|
|_________________________________________|
|---Island--- |-------7.12455------ |------0.233345-------|
|_________________________________________|
__________________________________________
|----CPU---|With OpenCL Mat | With OpenCl UMat |
|_________________________________________|
|---AMD---|------6.76169 ------ |--------0.231103--------|
|_________________________________________|
________________________________________________
|----CPU---| WithOut OpenCL Mat | WithOut OpenCl UMat |
|_______________________________________________|
|----AMD---|------7.15959------ |------------0.246138------------ |
|_______________________________________________|
In code, using Mat Object always runs on CPU & using UMat Object always runs on GPU, irrespective of the code ocl::setUseOpenCL(true/false);
Can anybody explain the reason for all output time variation?
One more question, i didn't use any OpenCL specific .dll with .exe file and yet GPU was used without any error, while building OpenCV with Cmake i checked With_OpenCL did this built all OpenCL required function within opencv_World310.dll ?

In code, using Mat Object always runs on CPU & using UMat Object always runs on GPU, irrespective of the code ocl::setUseOpenCL(true/false);
I'm sorry, because I'm not sure if this is a question or a statement... in either case it's partially true. In 3.0, for the UMat, if you don't have a dedicated GPU then OpenCV just runs everything on the CPU. If you specifically ask for Mat you get it on the CPU. And in your case you have directed both to run on each of your GPUs/CPU by selecting each specifically (more on "choosing a CPU below)... read this:
Few design choices support the new architecture:
A unified abstraction cv::UMat that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call
OpenCL accelerated version explicitly. These functions use an OpenCL
-enabled GPU if exists in the system, and automatically switch to CPU
operation otherwise.
The UMat abstraction enables functions to be called asynchronously.
Unlike the cv::Mat of the OpenCV version 2.x, access to the underlyi
ng data for the cv::UMat is performed through a method of class, and not though its data member. Such an approach enables the
implementation to explicitly wait for GPU completion only when CPU code absolutely needs the result.
The UMat implementation makes use of CPU-GPU shared physical memory available on Intel SoCs, including allocations that come from pointers passed into OpenCV.
I think there also might be a misunderstanding about "using OpenCL". When you use an UMat, you are specifically trying to use the GPU. And, I'll plead some ignorance here, as a result I believe that CV is using some of the CL library to make that happen automatically... as a side in 2.X we had cv::ocl to specifically/manually do this, so be careful if you are using that 2.X legacy code in 3.X. There are reasons to do it, but they are not always straightforward. But, back on topic, when you say,
with OpenCL UMat
you are potentially being redundant. The CL code you have in your snippet is basically finding out what equipment is installed, how many there are, what their names are, and choosing which to use... I'd have to dig through the way it is instantiated, but perhaps when you make it UMat it automatically sets OpenCL to True? (link) That would definitely support the data you presented. You could probably test that idea by checking what the state of ocl::setUseOpenCL after you set it to false and then use an UMat.
Finally, I'm guessing your CPU has a built in GPU. So it is running parallel processing with OpenCL and not paying a time penalty to travel to the seperate/dedicated GPU and back, hence your perceived performance increase over the GPUs (since it is not technically the CPU running it)... only when you are specifically using the Mat is the CPU only being used.
Your last question, I'm not sure... this is my speculation: OpenCL architexture exists on the GPU, when you install CV with CL you are installing the link between the two libraries and associated header files. I'm not sure which dll files you need to make that magic happen.

System call cost

I'm currently working on operating system operations overheads.
I'm actually studying the cost to make a system call and I've developed a simple C++ program to observe it.
#include <iostream>
#include <unistd.h>
#include <sys/time.h>
uint64_t
rdtscp(void) {
uint32_t eax, edx;
__asm__ __volatile__("rdtscp" //! rdtscp instruction
: "+a" (eax), "=d" (edx) //! output
: //! input
: "%ecx"); //! registers
return (((uint64_t)edx << 32) | eax);
}
int main(void) {
uint64_t before;
uint64_t after;
struct timeval t;
for (unsigned int i = 0; i < 10; ++i) {
before = rdtscp();
gettimeofday(&t, NULL);
after = rdtscp();
std::cout << after - before << std::endl;
std::cout << t.tv_usec << std::endl;
}
return 0;
}
This program is quite straightforward.
The rdtscp function is just a wrapper to call the RTDSCP instruction (a processor instruction which loads the 64-bits cycle count into two 32-bits registers). This function is used to take the timing.
I iterate 10 times. At each iteration I call gettimeofday and determine the take it took to execute it (as a number of CPU cycles).
The results are quite unexpected:
8984
64008
376
64049
164
64053
160
64056
160
64060
160
64063
164
64067
160
64070
160
64073
160
64077
Odd lines in the output are the number of cycles needed to execute the system call. Even lines are the value contains in t.tv_usec (which is set by gettimeofday, the system call that I'm studying).
I don't really understand how that it is possible: the number of cycles drastically decreases, from nearly 10,000 to around 150! But the timeval struct is still updated at each call!
I've tried on different operating system (debian and macos) and the result is similar.
Even if the cache is used, I don't see how it is possible. Making a system call should result in a context switch to switch from user to kernel mode and we still need to read the clock in order to update the time.
Does someone has an idea?

The answer ? try another system call. There's vsyscalls on linux, and they accelerate things for certain syscalls:
What are vdso and vsyscall?
The short version: the syscall is not performed, but instead the kernel maps a region of memory where the process can access the time information. Cost ? Not much (no context switch).

How to Detect the Number of Physical Processors / Cores on Windows, Mac and Linux

I have a multi threaded c++ application that runs on Windows, Mac and a few Linux flavors.
To make a long story short: In order for it to run at maximum efficiency, I have to be able to instantiate a single thread per physical processor/core. Creating more threads than there are physical processors/cores degrades the performance of my program considerably. I can already correctly detect the number of logical processors/cores correctly on all three of these platforms. To be able to detect the number of physical processors/cores correctly I'll have to detect if hyper-treading is supported AND active.
My question therefore is if there is a way to detect whether Hyper Threading is supported and enabled? If so, how exactly.

EDIT: This is no longer 100% correct due to Intel's ongoing befuddlement.
The way I understand the question is that you are asking how to detect the number of CPU cores vs. CPU threads which is different from detecting the number of logical and physical cores in a system. CPU cores are often not considered physical cores by the OS unless they have their own package or die. So an OS will report that a Core 2 Duo, for example, has 1 physical and 2 logical CPUs and an Intel P4 with hyper-threads will be reported exactly the same way even though 2 hyper-threads vs. 2 CPU cores is a very different thing performance wise.
I struggled with this until I pieced together the solution below, which I believe works for both AMD and Intel processors. As far as I know, and I could be wrong, AMD does not yet have CPU threads but they have provided a way to detect them that I assume will work on future AMD processors which may have CPU threads.
In short here are the steps using the CPUID instruction:
Detect CPU vendor using CPUID function 0
Check for HTT bit 28 in CPU features EDX from CPUID function 1
Get the logical core count from EBX[23:16] from CPUID function 1
Get actual non-threaded CPU core count
If vendor == 'GenuineIntel' this is 1 plus EAX[31:26] from CPUID function 4
If vendor == 'AuthenticAMD' this is 1 plus ECX[7:0] from CPUID function 0x80000008
Sounds difficult but here is a, hopefully, platform independent C++ program that does the trick:
#include <iostream>
#include <string>
using namespace std;
void cpuID(unsigned i, unsigned regs[4]) {
#ifdef _WIN32
__cpuid((int *)regs, (int)i);
#else
asm volatile
("cpuid" : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
: "a" (i), "c" (0));
// ECX is set to zero for CPUID function 4
#endif
}
int main(int argc, char *argv[]) {
unsigned regs[4];
// Get vendor
char vendor[12];
cpuID(0, regs);
((unsigned *)vendor)[0] = regs[1]; // EBX
((unsigned *)vendor)[1] = regs[3]; // EDX
((unsigned *)vendor)[2] = regs[2]; // ECX
string cpuVendor = string(vendor, 12);
// Get CPU features
cpuID(1, regs);
unsigned cpuFeatures = regs[3]; // EDX
// Logical core count per CPU
cpuID(1, regs);
unsigned logical = (regs[1] >> 16) & 0xff; // EBX[23:16]
cout << " logical cpus: " << logical << endl;
unsigned cores = logical;
if (cpuVendor == "GenuineIntel") {
// Get DCP cache info
cpuID(4, regs);
cores = ((regs[0] >> 26) & 0x3f) + 1; // EAX[31:26] + 1
} else if (cpuVendor == "AuthenticAMD") {
// Get NC: Number of CPU cores - 1
cpuID(0x80000008, regs);
cores = ((unsigned)(regs[2] & 0xff)) + 1; // ECX[7:0] + 1
}
cout << " cpu cores: " << cores << endl;
// Detect hyper-threads
bool hyperThreads = cpuFeatures & (1 << 28) && cores < logical;
cout << "hyper-threads: " << (hyperThreads ? "true" : "false") << endl;
return 0;
}
I haven't actually tested this on Windows or OSX yet but it should work as the CPUID instruction is valid on i686 machines. Obviously, this wont work for PowerPC but then they don't have hyper-threads either.
Here is the output on a few different Intel machines:
Intel(R) Core(TM)2 Duo CPU T7500 # 2.20GHz:
logical cpus: 2
cpu cores: 2
hyper-threads: false
Intel(R) Core(TM)2 Quad CPU Q8400 # 2.66GHz:
logical cpus: 4
cpu cores: 4
hyper-threads: false
Intel(R) Xeon(R) CPU E5520 # 2.27GHz (w/ x2 physical CPU packages):
logical cpus: 16
cpu cores: 8
hyper-threads: true
Intel(R) Pentium(R) 4 CPU 3.00GHz:
logical cpus: 2
cpu cores: 1
hyper-threads: true

Note this, does not give the number of physically cores as intended, but logical cores.
If you can use C++11 (thanks to alfC's comment beneath):
#include <iostream>
#include <thread>
int main() {
std::cout << std::thread::hardware_concurrency() << std::endl;
return 0;
}
Otherwise maybe the Boost library is an option for you. Same code but different include as above. Include <boost/thread.hpp> instead of <thread>.

Windows only solution desribed here:
GetLogicalProcessorInformation
for linux, /proc/cpuinfo file. I am not running linux
now so can't give you more detail. You can count
physical/logical processor instances. If logical count
is twice as physical, then you have HT enabled
(true only for x86).

The current highest voted answer using CPUID appears to be obsolete. It reports both the wrong number of logical and physical processors. This appears to be confirmed from this answer cpuid-on-intel-i7-processors.
Specifically, using CPUID.1.EBX[23:16] to get the logical processors or CPUID.4.EAX[31:26]+1 to get the physical ones with Intel processors does not give the correct result on any Intel processor I have.
For Intel CPUID.Bh should be used Intel_thread/Fcore and cache topology. The solution does not appear to be trivial. For AMD a different solution is necessary.
Here is source code by by Intel which reports the correct number of physical and logical cores as well as the correct number of sockets https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration/. I tested this on a 80 logical core, 40 physical core, 4 socket Intel system.
Here is source code for AMD http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid/. It gave the correct result on my single socket Intel system but not on my four socket system. I don't have a AMD system to test.
I have not dissected the source code yet to find a simple answer (if one exists) with CPUID. It seems that if the solution can change (as it seems to have) that the best solution is to use a library or OS call.
Edit:
Here is a solution for Intel processors with CPUID leaf 11 (Bh). The way to do this is loop over the logical processors and get the x2APIC ID for each logical processor from CPUID and count the number of x2APIC IDs were the least significant bit is zero. For systems without hyper-threading the x2APIC ID will always be even. For systems with hyper-threading each x2APIC ID will have an even and odd version.
// input: eax = functionnumber, ecx = 0
// output: eax = output[0], ebx = output[1], ecx = output[2], edx = output[3]
//static inline void cpuid (int output[4], int functionnumber)
int getNumCores(void) {
//Assuming an Intel processor with CPUID leaf 11
int cores = 0;
#pragma omp parallel reduction(+:cores)
{
int regs[4];
cpuid(regs,11);
if(!(regs[3]&1)) cores++;
}
return cores;
}
The threads must be bound for this to work. OpenMP by default does not bind threads. Setting export OMP_PROC_BIND=true will bind them or they can be bound in code as shown at thread-affinity-with-windows-msvc-and-openmp.
I tested this on my 4 core/8 HT system and it returned 4 with and without hyper-threading disabled in the BIOS. I also tested in on a 4 socket system with each socket having 10 cores / 20 HT and it returned 40 cores.
AMD processors or older Intel processors without CPUID leaf 11 have to do something different.

From gathering ideas and concepts from some of the above ideas, I have come up with this solution. Please critique.
//EDIT INCLUDES
#ifdef _WIN32
#include <windows.h>
#elif MACOS
#include <sys/param.h>
#include <sys/sysctl.h>
#else
#include <unistd.h>
#endif
For almost every OS, the standard "Get core count" feature returns the logical core count. But in order to get the physical core count, we must first detect if the CPU has hyper threading or not.
uint32_t registers[4];
unsigned logicalcpucount;
unsigned physicalcpucount;
#ifdef _WIN32
SYSTEM_INFO systeminfo;
GetSystemInfo( &systeminfo );
logicalcpucount = systeminfo.dwNumberOfProcessors;
#else
logicalcpucount = sysconf( _SC_NPROCESSORS_ONLN );
#endif
We now have the logical core count, now in order to get the intended results, we first must check if hyper threading is being used or if it's even available.
__asm__ __volatile__ ("cpuid " :
"=a" (registers[0]),
"=b" (registers[1]),
"=c" (registers[2]),
"=d" (registers[3])
: "a" (1), "c" (0));
unsigned CPUFeatureSet = registers[3];
bool hyperthreading = CPUFeatureSet & (1 << 28);
Because there is not an Intel CPU with hyper threading that will only hyper thread one core (at least not from what I have read). This allows us to find this is a really painless way. If hyper threading is available,the logical processors will be exactly double the physical processors. Otherwise, the operating system will detect a logical processor for every single core. Meaning the logical and the physical core count will be identical.
if (hyperthreading){
physicalcpucount = logicalcpucount / 2;
} else {
physicalcpucount = logicalcpucount;
}
fprintf (stdout, "LOGICAL: %i\n", logicalcpucount);
fprintf (stdout, "PHYSICAL: %i\n", physicalcpucount);

To follow on from math's answer, as of boost 1.56 there exists the physical_concurrency attribute which does exactly what you want.
From the documentation - http://www.boost.org/doc/libs/1_56_0/doc/html/thread/thread_management.html#thread.thread_management.thread.physical_concurrency
The number of physical cores available on the current system. In contrast to hardware_concurrency() it does not return the number of virtual cores, but it counts only physical cores.
So an example would be
#include <iostream>
#include <boost/thread.hpp>
int main()
{
std::cout << boost::thread::physical_concurrency();
return 0;
}

I know this is an old thread, but no one mentioned hwloc. The hwloc library is available on most Linux distributions and can also be compiled on Windows. The following code will return the number of physical processors. 4 in the case of a i7 CPU.
#include <hwloc.h>
int nPhysicalProcessorCount = 0;
hwloc_topology_t sTopology;
if (hwloc_topology_init(&sTopology) == 0 &&
hwloc_topology_load(sTopology) == 0)
{
nPhysicalProcessorCount =
hwloc_get_nbobjs_by_type(sTopology, HWLOC_OBJ_CORE);
hwloc_topology_destroy(sTopology);
}
if (nPhysicalProcessorCount < 1)
{
#ifdef _OPENMP
nPhysicalProcessorCount = omp_get_num_procs();
#else
nPhysicalProcessorCount = 1;
#endif
}

It is not sufficient to test if an Intel CPU has hyperthreading, you also need to test if hyperthreading is enabled or disabled. There is no documented way to check this. An Intel guy came up with this trick to check if hyperthreading is enabled: Check the number of programmable performance counters using CPUID[0xa].eax[15:8] and assume that if the value is 8, HT is disabled, and if the value is 4, HT is enabled (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/831551).
There is no problem on AMD chips: The CPUID reports 1 or 2 threads per core depending on whether simultaneous multithreading is disabled or enabled.
You also have to compare the thread count from the CPUID with the thread count reported by the operating system to see if there are multiple CPU chips.
I have made a function that implements all of this. It reports both the number of physical processors and the number of logical processors. I have tested it on Intel and AMD processors in Windows and Linux. It should work on Mac as well. I have published this code at
https://github.com/vectorclass/add-on/tree/master/physical_processors

On OS X, you can read these values from sysctl(3) (the C API, or the command line utility of the same name). The man page should give you usage information. The following keys may be of interest:
$ sysctl hw
hw.ncpu: 24
hw.activecpu: 24
hw.physicalcpu: 12 <-- number of cores
hw.physicalcpu_max: 12
hw.logicalcpu: 24 <-- number of cores including hyper-threaded cores
hw.logicalcpu_max: 24
hw.packages: 2 <-- number of CPU packages
hw.ncpu = 24
hw.availcpu = 24

On Windows, there are GetLogicalProcessorInformation and GetLogicalProcessorInformationEx available for Windows XP SP3 or older and Windows 7+ respectively. The difference is that GetLogicalProcessorInformation doesn't support setups with more than 64 logical cores, which might be important for server setups, but you can always fall back to GetLogicalProcessorInformation if you're on XP. Example usage for GetLogicalProcessorInformationEx (source):
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer = NULL;
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX ptr = NULL;
BOOL rc;
DWORD length = 0;
DWORD offset = 0;
DWORD ncpus = 0;
DWORD prev_processor_info_size = 0;
for (;;) {
rc = psutil_GetLogicalProcessorInformationEx(
RelationAll, buffer, &length);
if (rc == FALSE) {
if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
if (buffer) {
free(buffer);
}
buffer = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)malloc(length);
if (NULL == buffer) {
return NULL;
}
}
else {
goto return_none;
}
}
else {
break;
}
}
ptr = buffer;
while (offset < length) {
// Advance ptr by the size of the previous
// SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX struct.
ptr = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*)\
(((char*)ptr) + prev_processor_info_size);
if (ptr->Relationship == RelationProcessorCore) {
ncpus += 1;
}
// When offset == length, we've reached the last processor
// info struct in the buffer.
offset += ptr->Size;
prev_processor_info_size = ptr->Size;
}
free(buffer);
if (ncpus != 0) {
return ncpus;
}
else {
return NULL;
}
return_none:
if (buffer != NULL)
free(buffer);
return NULL;
On Linux, parsing /proc/cpuinfo might help.

I don't know that all three expose the information in the same way, but if you can safely assume that the NT kernel will report device information according to the POSIX standard (which NT supposedly has support for), then you could work off that standard.
However, differing of device management is often cited as one of the stumbling blocks to cross platform development. I would at best implement this as three strands of logic, I wouldn't try to write one piece of code to handle all platforms evenly.
Ok, all that's assuming C++. For ASM, I presume you'll only be running on x86 or amd64 CPUs? You'll still need two branch paths, one for each architecture, and you'll need to test Intel separate from AMD (IIRC) but by and large you just check for the CPUID. Is that what you're trying to find? The CPUID from ASM on Intel/AMD family CPUs?

OpenMP should do the trick:
// test.cpp
#include <omp.h>
#include <iostream>
using namespace std;
int main(int argc, char** argv) {
int nThreads = omp_get_max_threads();
cout << "Can run as many as: " << nThreads << " threads." << endl;
}
most compilers support OpenMP. If you are using a gcc-based compiler (*nix, MacOS), you need to compile using:
$ g++ -fopenmp -o test.o test.cpp
(you might also need to tell your compiler to use the stdc++ library):
$ g++ -fopenmp -o test.o -lstdc++ test.cpp
As far as I know OpenMP was designed to solve this kind of problems.

This is very easy to do in Python:
$ python -c "import psutil; psutil.cpu_count(logical=False)"
4
Maybe you could look at the psutil source code to see what is going on?

You may use the library libcpuid (Also on GitHub - libcpuid).
As can be seen in its documentation page:
#include <stdio.h>
#include <libcpuid.h>
int main(void)
{
if (!cpuid_present()) { // check for CPUID presence
printf("Sorry, your CPU doesn't support CPUID!\n");
return -1;
}
if (cpuid_get_raw_data(&raw) < 0) { // obtain the raw CPUID data
printf("Sorry, cannot get the CPUID raw data.\n");
printf("Error: %s\n", cpuid_error()); // cpuid_error() gives the last error description
return -2;
}
if (cpu_identify(&raw, &data) < 0) { // identify the CPU, using the given raw data.
printf("Sorrry, CPU identification failed.\n");
printf("Error: %s\n", cpuid_error());
return -3;
}
printf("Found: %s CPU\n", data.vendor_str); // print out the vendor string (e.g. `GenuineIntel')
printf("Processor model is `%s'\n", data.cpu_codename); // print out the CPU code name (e.g. `Pentium 4 (Northwood)')
printf("The full brand string is `%s'\n", data.brand_str); // print out the CPU brand string
printf("The processor has %dK L1 cache and %dK L2 cache\n",
data.l1_data_cache, data.l2_cache); // print out cache size information
printf("The processor has %d cores and %d logical processors\n",
data.num_cores, data.num_logical_cpus); // print out CPU cores information
}
As can be seen, data.num_cores, holds the number of Physical cores of the CPU.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js