How do I call "cpuid" in Linux? - c++

While writing new code for Windows, I stumbled upon _cpuinfo() from the Windows API. As I am mainly dealing with a Linux environment (GCC) I want to have access to the CPUInfo.
I have tried the following:
#include <iostream>
int main()
{
int a, b;
for (a = 0; a < 5; a++)
{
__asm ( "mov %1, %%eax; " // a into eax
"cpuid;"
"mov %%eax, %0;" // eax into b
:"=r"(b) // output
:"r"(a) // input
:"%eax","%ebx","%ecx","%edx" // clobbered register
);
std::cout << "The CPUID level " << a << " gives EAX= " << b << '\n';
}
return 0;
}
This use assembly but I don't want to re-invent the wheel. Is there any other way to implement CPUInfo without assembly?

Since you are compiling with GCC then you can include cpuid.h which declares these functions:
/* Return highest supported input value for cpuid instruction. ext can
be either 0x0 or 0x8000000 to return highest supported value for
basic or extended cpuid information. Function returns 0 if cpuid
is not supported or whatever cpuid returns in eax register. If sig
pointer is non-null, then first four bytes of the signature
(as found in ebx register) are returned in location pointed by sig. */
unsigned int __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
/* Return cpuid data for requested cpuid level, as found in returned
eax, ebx, ecx and edx registers. The function checks if cpuid is
supported and returns 1 for valid cpuid information or 0 for
unsupported cpuid level. All pointers are required to be non-null. */
int __get_cpuid (unsigned int __level,
unsigned int *__eax, unsigned int *__ebx,
unsigned int *__ecx, unsigned int *__edx)
You don't need to, and should not, re-implement this functionality.

for (a =0; a < 5; ++a;)
There should only be two semicolons there. You've got three.
This is basic C/C++ syntax; the CPUID is a red herring.

Related

Using std::memmove to work around strict aliasing?

Can std::memmove() be used to "move" the memory to the same location to be able to alias it using different types?
For example:
#include <cstring>
#include <cstdint>
#include <iomanip>
#include <iostream>
struct Parts { std::uint16_t v[2u]; };
static_assert(sizeof(Parts) == sizeof(std::uint32_t), "");
static_assert(alignof(Parts) <= alignof(std::uint32_t), "");
int main() {
std::uint32_t u = 0xdeadbeef;
Parts * p = reinterpret_cast<Parts *>(std::memmove(&u, &u, sizeof(u)));
std::cout << std::hex << u << " ~> "
<< p->v[0] << ", " << p->v[1] << std::endl;
}
$ g++-10.2.0 -Wall -Wextra test.cpp -o test -O2 -ggdb -fsanitize=address,undefined -std=c++20 && ./test
deadbeef ~> beef, dead
Is this a safe approach? What are the caveats? Can static_cast be used instead of reinterpret_cast here?
If one is interested in knowing what constructs will be reliably processed by the clang and gcc optimizers in the absence of the -fno-strict-aliasing, rather than assuming that everything defined by the Standard will be processed meaningfully, both clang and gcc will sometimes ignore changes to active/effective types made by operations or sequences of operations that would not affect the bit pattern in a region of storage.
As an example:
#include <limits.h>
#include <string.h>
#if LONG_MAX == LLONG_MAX
typedef long long longish;
#elif LONG_MAX == INT_MAX
typedef int longish;
#endif
__attribute((noinline))
long test(long *p, int index, int index2, int index3)
{
if (sizeof (long) != sizeof (longish))
return -1;
p[index] = 1;
((longish*)p)[index2] = 2;
longish temp2 = ((longish*)p)[index3];
p[index3] = 5; // This should modify p[index3] and set its active/effective type
p[index3] = temp2; // Shouldn't (but seems to) reset effective type to longish
long temp3;
memmove(&temp3, p+index3, sizeof (long));
memmove(p+index3, &temp3, sizeof (long));
return p[index];
}
#include <stdio.h>
int main(void)
{
long arr[1] = {0};
long temp = test(arr, 0, 0, 0);
printf("%ld should equal %ld\n", temp, arr[0]);
}
While gcc happens to process this code correctly on 32-bit ARM (even if one uses flag -mcpu=cortex-m3 to avoid the call to memmove), clang processes it incorrectly on both platforms. Interestingly, while clang makes no attempt to reload p[index], gcc does reload it on both platforms, but the code for test on x64 is:
test(long*, int, int, int):
movsx rsi, esi
movsx rdx, edx
lea rax, [rdi+rsi*8]
mov QWORD PTR [rax], 1
mov rax, QWORD PTR [rax]
mov QWORD PTR [rdi+rdx*8], 2
ret
This code writes the value 1 to p[index1], then reads p[index1], stores 2 to p[index2], and returns the value just read from p[index1].
It's possible that memmove will scrub active/effective type on all implementations that correctly handle all of the corner cases mandated by the Standards, but it's not necessary on the -fno-strict-aliasing dialects of clang and gcc, and it's insufficient on the -fstrict-aliasing dialects processed by those compilers.

convert to string with and without reinterpret_cast

I have been looking for code to get CPUID in Linux and came across several good examples. These are specific questions between difference in implementation. Below are unsigned integers and it uses reinterpret_cast with size_t as 12
struct CPUVendorID{
unsigned int ebx;
unsigned int edx;
unsigned int ecx;
string toString() const {
return string(reinterpret_cast<const char *>(this), 12);
}
};
...
CPUVendorID vendorID { .ebx = ebx, .edx = edx, .ecx = ecx };
string vendor = vendorID.toString();
another form of to get the same output with size_t as 4 is given below:
string vendor;
vendor += string((const char *)&cpuID.EBX(), 4);
vendor += string((const char *)&cpuID.EDX(), 4);
vendor += string((const char *)&cpuID.ECX(), 4);
cout << "CPU vendor = " << vendor << endl;
Both output 12 character string. Can someone explain me what is happening in the reinterpret_cast statement above ? I find this way of implementation very elegant, but I don't know why its working obviously 4*3=12. But, how does it manage to concatenate data from the 3 ebx, edx and ecx?
CPU's manufacturer ID string – a twelve-character ASCII string stored in EBX, EDX, ECX (in that order) CPUID wiki
Can someone explain me what is happening in the reinterpret_cast statement above?
The structure address this coincides with the address of its first member, which is int ebx. Since the following members are of the same type, there is no padding between them, so that this code treats the three int members as an int[3] array.

mmap system call returning -14(-EFAULT??)

I am implementing mmap function using system call.(I am implementing mmap manually because of some reasons.)
But I am getting return value -14 (-EFAULT, I checked with GDB) whith this message:
WARN Nar::Mmap: Memory allocation failed.
Here is function:
void *Mmap(void *Address, size_t Length, int Prot, int Flags, int Fd, off_t Offset) {
MmapArgument ma;
ma.Address = (unsigned long)Address;
ma.Length = (unsigned long)Length;
ma.Prot = (unsigned long)Prot;
ma.Flags = (unsigned long)Flags;
ma.Fd = (unsigned long)Fd;
ma.Offset = (unsigned long)Offset;
void *ptr = (void *)CallSystem(SysMmap, (uint64_t)&ma, Unused, Unused, Unused, Unused);
int errCode = (int)ptr;
if(errCode < 0) {
Print("WARN Nar::Mmap: Memory allocation failed.\n");
return NULL;
}
return ptr;
}
I wrote a macro(To use like malloc() function):
#define Malloc(x) Mmap(0, x, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
and I used like this:
Malloc(45);
I looked at man page. I couldn't find about EFAULT on mmap man page, but I found something about EFAULT on mmap2 man page.
EFAULT Problem with getting the data from user space.
I think this means something is wrong with passing struct to system call.
But I believe nothing is wrong with my struct:
struct MmapArgument {
unsigned long Address;
unsigned long Length;
unsigned long Prot;
unsigned long Flags;
unsigned long Fd;
unsigned long Offset;
};
Maybe something is wrong with handing result value?
Openning a file (which doesn't exist) with CallSystem gave me -2(-ENOENT), which is correct.
EDIT: Full source of CallSystem. open, write, close works, but mmap(or old_mmap) not works.
All of the arguments were passed well.
section .text
global CallSystem
CallSystem:
mov rax, rdi ;RAX
mov rbx, rsi ;RBX
mov r10, rdx
mov r11, rcx
mov rcx, r10 ;RCX
mov rdx, r11 ;RDX
mov rsi, r8 ;RSI
mov rdi, r9 ;RDI
int 0x80
mov rdx, 0 ;Upper 64bit
ret ;Return
It is unclear why you are calling mmap via your CallSystem function, I'll assume it is a requirement of your assignment.
The main problem with your code is that you are using int 0x80. This will only work if all the addresses passed to int 0x80 can be expressed in a 32-bit integer. That isn't the case in your code. This line:
MmapArgument ma;
places your structure on the stack. In 64-bit code the stack is at the top end of the addressable address space well beyond what can be represented in a 32-bit address. Usually the bottom of the stack is somewhere in the region of 0x00007FFFFFFFFFFF. int 0x80 only works on the bottom half of the 64-bit registers, so effectively stack based addresses get truncated, resulting in an incorrect address. To make proper 64-bit system calls it is preferable to use the syscall instruction
The 64-bit System V ABI has a section on the general mechanism for the syscall interface in section A.2.1 AMD64 Linux Kernel Conventions. It says:
User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi,
%rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. The kernel destroys
registers %rcx and %r11.
We can create a simplified version of your SystemCall code by placing the systemcallnum as the last parameter. As the 7th parameter it will be the first and only value passed on the stack. We can move that value from the stack into RAX to be used as the system call number. The first 6 values are passed in the registers, and with the exception of RCX we can simply keep all the registers as-is. RCX has to be moved to R10 because the 4th parameter differs between a normal function call and the Linux kernel SYSCALL convention.
Some simplified code for demonstration purposes could look like:
global CallSystem
section .text
CallSystem:
mov rax, [rsp+8] ; CallSystem 7th arg is 1st val passed on stack
mov r10, rcx ; 4th argument passed to syscall in r10
; RDI, RSI, RDX, R8, R9 are passed straight through
; to the sycall because they match the inputs to CallSystem
syscall
ret
The C++ could look like:
#include <stdlib.h>
#include <sys/mman.h>
#include <stdint.h>
#include <iostream>
using namespace std;
extern "C" uint64_t CallSystem (uint64_t arg1, uint64_t arg2,
uint64_t arg3, uint64_t arg4,
uint64_t arg5, uint64_t arg6,
uint64_t syscallnum);
int main()
{
uint64_t addr;
addr = CallSystem(static_cast<uint64_t>(NULL), 45,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS,
-1, 0, 0x9);
cout << reinterpret_cast<void *>(addr) << endl;
}
In the case of mmap the syscall is 0x09. That can be found in the file asm/unistd_64.h:
#define __NR_mmap 9
The rest of the arguments are typical of the newer form of mmap. From the manpage:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If your run strace on your executable (ie strace ./a.out) you should find a line that looks like this if it works:
mmap(NULL, 45, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed8e7cc000
The return value will differ, but it should match what the demonstration program displays.
You should be able to adapt this code to what you are doing. This should at least be a reasonable starting point.
If you want to pass the syscallnum as the first parameter to CallSystem you will have to modify the assembly code to move all the registers so that they align properly between the function call convention and syscall conventions. I leave that as a simple exercise to the reader. Doing so will yield a lot less efficient code.

Adding stringstream/cout hurts performance, even when the code is never called

I have a program in which a simple function is called a large number of times. I have added some simple logging code and find that this significantly affects performance, even when the logging code is not actually called. A complete (but simplified) test case is shown below:
#include <chrono>
#include <iostream>
#include <random>
#include <sstream>
using namespace std::chrono;
std::mt19937 rng;
uint32_t getValue()
{
// Just some pointless work, helps stop this function from getting inlined.
for (int x = 0; x < 100; x++)
{
rng();
}
// Get a value, which happens never to be zero
uint32_t value = rng();
// This (by chance) is never true
if (value == 0)
{
value++; // This if statment won't get optimized away when printing below is commented out.
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
return value;
}
int main(int argc, char* argv[])
{
// Just fror timing
high_resolution_clock::time_point start = high_resolution_clock::now();
uint32_t sum = 0;
for (uint32_t i = 0; i < 10000000; i++)
{
sum += getValue();
}
milliseconds elapsed = duration_cast<milliseconds>(high_resolution_clock::now() - start);
// Use (print) the sum to make sure it doesn't get optimized away.
std::cout << "Sum = " << sum << ", Elapsed = " << elapsed.count() << "ms" << std::endl;
return 0;
}
Note that the code contains stringstream and cout but these are never actually called. However, the presence of these three lines of code increases the run time from 2.9 to 3.3 seconds. This is in release mode on VS2013. Curiously, if I build in GCC using '-O3' flag the extra three lines of code actually decrease the runtime by half a second or so.
I understand that the extra code could impact the resulting executable in a number of ways, such as by preventing inlining or causing more cache misses. The real question is whether there is anything I can do to improve on this situation? Switching to sprintf()/printf() doesn't seem to make a difference. Do I need to simply accept that adding such logging code to small functions will affect performance even if not called?
Note: For completeness, my real/full scenario is that I use a wrapper macro to throw exceptions and I like to log when such an exception is thrown. So when I call THROW_EXCEPT(...) it inserts code similar to that shown above and then throws. This in then hurting when I throw exceptions from inside a small function. Any better alternatives here?
Edit: Here is a VS2013 solution for quick testing, and so compiler settings can be checked: https://drive.google.com/file/d/0B7b4UnjhhIiEamFyS0hjSnVzbGM/view?usp=sharing
So I initially thought that this was due to branch prediction and optimising out branches so I took a look at the annotated assembly for when the code is commented out:
if (value == 0)
00E21371 mov ecx,1
00E21376 cmove eax,ecx
{
value++;
Here we see that the compiler has helpfully optimised out our branch, so what if we put in a more complex statement to prevent it from doing so:
if (value == 0)
00AE1371 jne getValue+99h (0AE1379h)
{
value /= value;
00AE1373 xor edx,edx
00AE1375 xor ecx,ecx
00AE1377 div eax,ecx
Here the branch is left in but when running this it runs about as fast as the previous example with the following lines commented out. So lets have a look at the assembly for having those lines left in:
if (value == 0)
008F13A0 jne getValue+20Bh (08F14EBh)
{
value++;
std::stringstream ss;
008F13A6 lea ecx,[ebp-58h]
008F13A9 mov dword ptr [ss],8F32B4h
008F13B3 mov dword ptr [ebp-0B0h],8F32F4h
008F13BD call dword ptr ds:[8F30A4h]
008F13C3 push 0
008F13C5 lea eax,[ebp-0A8h]
008F13CB mov dword ptr [ebp-4],0
008F13D2 push eax
008F13D3 lea ecx,[ss]
008F13D9 mov dword ptr [ebp-10h],1
008F13E0 call dword ptr ds:[8F30A0h]
008F13E6 mov dword ptr [ebp-4],1
008F13ED mov eax,dword ptr [ss]
008F13F3 mov eax,dword ptr [eax+4]
008F13F6 mov dword ptr ss[eax],8F32B0h
008F1401 mov eax,dword ptr [ss]
008F1407 mov ecx,dword ptr [eax+4]
008F140A lea eax,[ecx-68h]
008F140D mov dword ptr [ebp+ecx-0C4h],eax
008F1414 lea ecx,[ebp-0A8h]
008F141A call dword ptr ds:[8F30B0h]
008F1420 mov dword ptr [ebp-4],0FFFFFFFFh
That's a lot of instructions if that branch is ever hit. So what if we try something else?
if (value == 0)
011F1371 jne getValue+0A6h (011F1386h)
{
value++;
printf("This never gets printed, but commenting out these three lines improves performance.");
011F1373 push 11F31D0h
011F1378 call dword ptr ds:[11F30ECh]
011F137E add esp,4
Here we have far fewer instructions and once again it runs as quickly as with all lines commented out.
So I'm not sure I can say for certain exactly what is happening here but I feel at the moment it is a combination of branch prediction and CPU instruction cache misses.
In order to solve this problem you could move the logging into a function like so:
void log()
{
std::stringstream ss;
ss << "This never gets printed, but commenting out these three lines improves performance." << std::endl;
std::cout << ss.str();
}
and
if (value == 0)
{
value++;
log();
Then it runs as fast as before with all those instructions replaced with a single call log (011C12E0h).

CPUID implementations in C++

I would like to know if somebody around here has some good examples of a C++ CPUID implementation that can be referenced from any of the managed .net languages.
Also, should this not be the case, should I be aware of certain implementation differences between X86 and X64?
I would like to use CPUID to get info on the machine my software is running on (crashreporting etc...) and I want to keep everything as widely compatible as possible.
Primary reason I ask is because I am a total noob when it comes to writing what will probably be all machine instructions though I have basic knowledge about CPU registers and so on...
Before people start telling me to Google: I found some examples online, but usually they were not meant to allow interaction from managed code and none of the examples were aimed at both X86 and X64. Most examples appeared to be X86 specific.
Accessing raw CPUID information is actually very easy, here is a C++ class for that which works in Windows, Linux and OSX:
#ifndef CPUID_H
#define CPUID_H
#ifdef _WIN32
#include <limits.h>
#include <intrin.h>
typedef unsigned __int32 uint32_t;
#else
#include <stdint.h>
#endif
class CPUID {
uint32_t regs[4];
public:
explicit CPUID(unsigned i) {
#ifdef _WIN32
__cpuid((int *)regs, (int)i);
#else
asm volatile
("cpuid" : "=a" (regs[0]), "=b" (regs[1]), "=c" (regs[2]), "=d" (regs[3])
: "a" (i), "c" (0));
// ECX is set to zero for CPUID function 4
#endif
}
const uint32_t &EAX() const {return regs[0];}
const uint32_t &EBX() const {return regs[1];}
const uint32_t &ECX() const {return regs[2];}
const uint32_t &EDX() const {return regs[3];}
};
#endif // CPUID_H
To use it just instantiate an instance of the class, load the CPUID instruction you are interested in and examine the registers. For example:
#include "CPUID.h"
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char *argv[]) {
CPUID cpuID(0); // Get CPU vendor
string vendor;
vendor += string((const char *)&cpuID.EBX(), 4);
vendor += string((const char *)&cpuID.EDX(), 4);
vendor += string((const char *)&cpuID.ECX(), 4);
cout << "CPU vendor = " << vendor << endl;
return 0;
}
This Wikipedia page tells you how to use CPUID: http://en.wikipedia.org/wiki/CPUID
EDIT: Added #include <intrin.h> for Windows, per comments.
See this MSDN article about __cpuid.
There is a comprehensive sample that compiles with Visual Studio 2005 or better. For Visual Studio 6, you can use this instead of the compiler instrinsic __cpuid:
void __cpuid(int CPUInfo[4], int InfoType)
{
__asm
{
mov esi, CPUInfo
mov eax, InfoType
xor ecx, ecx
cpuid
mov dword ptr [esi + 0], eax
mov dword ptr [esi + 4], ebx
mov dword ptr [esi + 8], ecx
mov dword ptr [esi + 12], edx
}
}
For Visual Studio 2005, you can use this instead of the compiler instrinsic __cpuidex:
void __cpuidex(int CPUInfo[4], int InfoType, int ECXValue)
{
__asm
{
mov esi, CPUInfo
mov eax, InfoType
mov ecx, ECXValue
cpuid
mov dword ptr [esi + 0], eax
mov dword ptr [esi + 4], ebx
mov dword ptr [esi + 8], ecx
mov dword ptr [esi + 12], edx
}
}
Might not be exactly what you are looking for, but Intel have a good article and sample code for enumerating Intel 64 bit platform architectures (processor, cache, etc.) which also seems to cover 32 bit x86 processors.