Can std::memmove() be used to "move" the memory to the same location to be able to alias it using different types?
For example:
#include <cstring>
#include <cstdint>
#include <iomanip>
#include <iostream>
struct Parts { std::uint16_t v[2u]; };
static_assert(sizeof(Parts) == sizeof(std::uint32_t), "");
static_assert(alignof(Parts) <= alignof(std::uint32_t), "");
int main() {
std::uint32_t u = 0xdeadbeef;
Parts * p = reinterpret_cast<Parts *>(std::memmove(&u, &u, sizeof(u)));
std::cout << std::hex << u << " ~> "
<< p->v[0] << ", " << p->v[1] << std::endl;
}
$ g++-10.2.0 -Wall -Wextra test.cpp -o test -O2 -ggdb -fsanitize=address,undefined -std=c++20 && ./test
deadbeef ~> beef, dead
Is this a safe approach? What are the caveats? Can static_cast be used instead of reinterpret_cast here?
If one is interested in knowing what constructs will be reliably processed by the clang and gcc optimizers in the absence of -fno-strict-aliasing, rather than assuming that everything defined by the Standard will be processed meaningfully, it is worth noting that both clang and gcc will sometimes ignore changes to active/effective types made by operations, or sequences of operations, that would not affect the bit pattern in a region of storage.
As an example:
#include <limits.h>
#include <string.h>
#if LONG_MAX == LLONG_MAX
typedef long long longish;
#elif LONG_MAX == INT_MAX
typedef int longish;
#endif
__attribute((noinline))
long test(long *p, int index, int index2, int index3)
{
if (sizeof (long) != sizeof (longish))
return -1;
p[index] = 1;
((longish*)p)[index2] = 2;
longish temp2 = ((longish*)p)[index3];
p[index3] = 5; // This should modify p[index3] and set its active/effective type
p[index3] = temp2; // Shouldn't (but seems to) reset effective type to longish
long temp3;
memmove(&temp3, p+index3, sizeof (long));
memmove(p+index3, &temp3, sizeof (long));
return p[index];
}
#include <stdio.h>
int main(void)
{
long arr[1] = {0};
long temp = test(arr, 0, 0, 0);
printf("%ld should equal %ld\n", temp, arr[0]);
}
While gcc happens to process this code correctly on 32-bit ARM (even if one uses the flag -mcpu=cortex-m3 to avoid the call to memmove), clang processes it incorrectly on both 32-bit ARM and x64. Interestingly, while clang makes no attempt to reload p[index], gcc does reload it on both platforms, but the code for test on x64 is:
test(long*, int, int, int):
movsx rsi, esi
movsx rdx, edx
lea rax, [rdi+rsi*8]
mov QWORD PTR [rax], 1
mov rax, QWORD PTR [rax]
mov QWORD PTR [rdi+rdx*8], 2
ret
This code writes the value 1 to p[index], then reads p[index], stores 2 to p[index2], and returns the value it just read from p[index]; the return value is therefore read before the store of 2, even though index2 may equal index.
It's possible that memmove will scrub active/effective type on all implementations that correctly handle all of the corner cases mandated by the Standards, but it's not necessary on the -fno-strict-aliasing dialects of clang and gcc, and it's insufficient on the -fstrict-aliasing dialects processed by those compilers.
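For the original question, a commonly used construct that sidesteps the effective-type question entirely (and that both compilers treat as a plain byte copy) is to memcpy the bytes into a separate Parts object rather than reinterpreting the storage in place. A minimal sketch:
#include <cstring>
#include <cstdint>
#include <cstdio>
struct Parts { std::uint16_t v[2]; };
int main() {
    std::uint32_t u = 0xdeadbeef;
    Parts p;
    // Copy the object representation into a distinct Parts object;
    // no pointer to u is ever reinterpreted.
    std::memcpy(&p, &u, sizeof p);
    // Which half holds 0xbeef depends on the target's endianness.
    std::printf("%x ~> %x, %x\n", u, p.v[0], p.v[1]);
}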
Related
I am trying to count how many bits an int takes up, e.g. count(10) = 4, count(7) = 3, count(127) = 7, etc.
I have tried brute-forcing (<<ing a 1 until it's strictly bigger than the number) and using floor(log2(v)) + 1, but both are too slow for my needs.
I know that there exists a __builtin_popcount function that quickly counts how many 1s there are in an int, but I had trouble finding a built-in that fits my application. Is there no such function, or have I overlooked something?
Edit: I'm working with g++ version 9.3.0.
Edit 2: mediocrevegetable1's answer was chosen because it was the only one usable with g++ at the time. However, future readers may also try out chris's answer for better compatibility or in hopes that the compiler will give a more efficient implementation.
C++20 added std::bit_width (live example):
#include <bit>
#include <iostream>
int main() {
std::cout
<< std::bit_width(10u) << ' ' // 4
<< std::bit_width(7u) << ' ' // 3
<< std::bit_width(127u); // 7
}
With Clang trunk and, for example, -O3 -march=skylake, a function doing nothing but calling bit_width produces the following assembly:
count(unsigned int):
lzcnt ecx, edi
mov eax, 32
sub eax, ecx
ret
There are a number of similar functions in this new <bit> header as well. You can see them all here.
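For instance, std::bit_width can itself be written in terms of std::countl_zero from the same header; a small sketch (the helper name width is just for illustration):
#include <bit>
#include <limits>
// Sketch: the same result via countl_zero, another function from <bit>.
constexpr int width(unsigned x)
{
    return std::numeric_limits<unsigned>::digits - std::countl_zero(x);
}
static_assert(width(10u) == 4);
static_assert(width(127u) == 7);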
You can use the GCC built-in function __builtin_clz, which gives the number of leading zeros:
#include <climits>
constexpr unsigned bit_space(unsigned n)
{
return (sizeof n * CHAR_BIT) - __builtin_clz(n);
}
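One caveat: __builtin_clz(0) is undefined, so if zero is a possible input a guard is needed; a small sketch (the name bit_space_safe is just for illustration):
#include <climits>
// Sketch: same idea, but also defined for n == 0.
constexpr unsigned bit_space_safe(unsigned n)
{
    return n == 0 ? 0 : (sizeof n * CHAR_BIT) - __builtin_clz(n);
}
static_assert(bit_space_safe(0) == 0);
static_assert(bit_space_safe(127) == 7);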
You can use inline asm:
int index;
int value = 0b100010000000;
asm("bsr %1, %0":"=a"(index):"a"(value)); // now index equals 11
This method returns the zero-based index of the last (highest) set bit, which means it will also return zero if no bit is set.
To get around this, one can first check whether value is non-zero:
if value is zero, no bit is set;
if value is not zero, then index + 1 is the number of bits used by value.
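Putting the zero check described above together with the asm, a possible sketch (GCC/Clang inline asm, x86 only; the wrapper name bits_used is just for illustration):
// Sketch: zero-based bsr result turned into a bit count, with the zero guard.
unsigned bits_used(unsigned value)
{
    if (value == 0)
        return 0;                 // bsr's result is undefined for a zero input
    unsigned index;
    asm("bsr %1, %0" : "=r"(index) : "r"(value));
    return index + 1;             // index of the highest set bit, plus one
}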
There is a standard library function for this.
No need to write anything yourself.
https://en.cppreference.com/w/cpp/utility/bitset/count
#include <cassert>
#include <bitset>
int main()
{
std::bitset<32> value = 0b10010110;
auto count = value.count();
assert(count == 4);
}
After the feedback below, it turned out this was the incorrect answer.
I don't know if there is a built-in function, but this compile-time constexpr will give the same result.
#include <cassert>
#include <cstddef> // for std::size_t
static constexpr auto bits_needed(unsigned int value)
{
std::size_t n = 0;
for (; value > 0; value >>= 1, ++n);
return n;
}
int main()
{
assert(3 == bits_needed(7));
assert(4 == bits_needed(10));
assert(4 == bits_needed(9));
assert(16 == bits_needed(65000));
}
I want to run a set of operations over the elements in a (custom) singly-linked list. The code to traverse the linked list and run the operations is simple, but repetitive and could be done wrong if copy/pasted everywhere. Performance and careful memory allocation are important in my program, so I want to avoid unnecessary overheads.
I want to write a wrapper to include the repetitive code and encapsulate the operations which are to take place on each element of the linked list. As the functions which take place within the operation vary, I need to capture multiple variables (in the real code) which must be provided to the operation, so I looked at using std::function. The actual calculations done in this example code are meaningless here.
#include <functional>
#include <iostream>
#include <memory>
struct Foo
{
explicit Foo(int num) : variable(num) {}
int variable;
std::unique_ptr<Foo> next;
};
void doStuff(Foo& foo, std::function<void(Foo&)> operation)
{
Foo* fooPtr = &foo;
do
{
operation(*fooPtr);
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
int main(int argc, char** argv)
{
int val = 7;
Foo first(4);
first.next = std::make_unique<Foo>(5);
first.next->next = std::make_unique<Foo>(6);
#ifdef USE_FUNC
for (long i = 0; i < 100000000; ++i)
{
doStuff(first, [&](Foo& foo){ foo.variable += val + i; /*Other, more complex functionality here */ });
}
doStuff(first, [&](Foo& foo){ std::cout << foo.variable << std::endl; /*Other, more complex and different functionality here */ });
#else
for (long i = 0; i < 100000000; ++i)
{
Foo* fooPtr = &first;
do
{
fooPtr->variable += val + i;
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
Foo* fooPtr = &first;
do
{
std::cout << fooPtr->variable << std::endl;
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
#endif
}
If run as:
g++ test.cpp -O3 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.252s
user 0m0.250s
sys 0m0.001s
Whereas if run as:
g++ test.cpp -O3 -Wall -DUSE_FUNC -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.834s
user 0m0.831s
sys 0m0.001s
These timings are fairly consistent across multiple runs, and show a 4x multiplier when using std::function. Is there a better way I can do what I want to do?
Use a template:
template<typename T>
void doStuff(Foo& foo, T const& operation)
For me this gives:
mvine#xxx:~/mikeytemp$ g++ test.cpp -O3 -DUSE_FUNC -std=c++14 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.534s
user 0m0.529s
sys 0m0.005s
mvine#xxx:~/mikeytemp$ g++ test.cpp -O3 -std=c++14 -Wall -o mytest && time ./mytest
1587459716
1587459717
1587459718
real 0m0.583s
user 0m0.583s
sys 0m0.000s
Function objects such as std::function are quite heavyweight, but they have their uses where the payload is quite large (>10000 cycles) or needs to be polymorphic, such as in a generalised job scheduler.
They need to contain a copy of your callable object and handle any exceptions it might throw.
Using a template gets you much closer to the metal as the resulting code frequently gets inlined.
template <typename Func>
void doStuff(Foo& foo, Func operation)
{
Foo* fooPtr = &foo;
do
{
operation(*fooPtr);
} while (fooPtr->next && (fooPtr = fooPtr->next.get()));
}
The compiler will be able to look inside your function and eliminate redundancy.
On Godbolt, your inner loop becomes
.LBB0_6: # =>This Loop Header: Depth=1
lea edx, [rax + 7]
mov rsi, rcx
.LBB0_7: # Parent Loop BB0_6 Depth=1
add dword ptr [rsi], edx
mov rsi, qword ptr [rsi + 8]
test rsi, rsi
jne .LBB0_7
mov esi, eax
or esi, 1
add esi, 7
mov rdx, rcx
.LBB0_9: # Parent Loop BB0_6 Depth=1
add dword ptr [rdx], esi
mov rdx, qword ptr [rdx + 8]
test rdx, rdx
jne .LBB0_9
add rax, 2
cmp rax, 100000000
jne .LBB0_6
As a bonus, if you didn't use a linked list, the loop may disappear entirely.
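For illustration, a rough sketch of what the same accumulation might look like over a contiguous container instead of the linked list (whether the optimizer actually folds the outer loop away is not guaranteed):
#include <iostream>
#include <vector>
int main()
{
    std::vector<long long> values{4, 5, 6};
    int val = 7;
    for (long i = 0; i < 100000000; ++i)
        for (long long& v : values)
            v += val + i;   // same accumulation as the question's inner loop
    for (long long v : values)
        std::cout << v << '\n';
}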
I have been wondering how the V8 JavaScript engine and other JIT compilers execute the generated code.
Here are the articles I read during my attempt to write a small demo.
http://eli.thegreenplace.net/2013/11/05/how-to-jit-an-introduction
http://nullprogram.com/blog/2015/03/19/
I know very little about assembly, so I initially used http://gcc.godbolt.org/ to write a function and get the disassembled output, but the code was not working on Windows.
I then wrote a small C++ program, compiled it with -g -Og, and got the disassembled output with gdb.
#include <stdio.h>
int square(int num) {
return num * num;
}
int main() {
printf("%d\n", square(10));
return 0;
}
Output:
Dump of assembler code for function square(int):
=> 0x00000000004015b0 <+0>: imul %ecx,%ecx
0x00000000004015b3 <+3>: mov %ecx,%eax
0x00000000004015b5 <+5>: retq
I copy-pasted the output ('%' removed) to online x86 assembler and get { 0x0F, 0xAF, 0xC9, 0x89, 0xC1, 0xC3 }.
Here is my final code. If I compile it with gcc, I always get 1. If I compile it with VC++, I get a random number. What is going on?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <windows.h>
typedef unsigned char byte;
typedef int (*int0_int)(int);
const byte square_code[] = {
0x0f, 0xaf, 0xc9,
0x89, 0xc1,
0xc3
};
int main() {
byte* buf = reinterpret_cast<byte*>(VirtualAlloc(0, 1 << 8, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
if (buf == nullptr) return 0;
memcpy(buf, square_code, sizeof(square_code));
{
DWORD old;
VirtualProtect(buf, 1 << 8, PAGE_EXECUTE_READ, &old);
}
int0_int square = reinterpret_cast<int0_int>(buf);
int ans = square(100);
printf("%d\n", ans);
VirtualFree(buf, 0, MEM_RELEASE);
return 0;
}
Note
I am trying to learn how JIT works, so please do not suggest that I use LLVM or any other library. I promise I will use a proper JIT library in a real project rather than writing one from scratch.
Note: as Ben Voigt points out in the comments, this is really only valid for x86, not x86_64. For x86_64 you just have some errors in your assembly (which are still errors in x86 as well), as Ben Voigt explains in his answer.
This is happening because your compiler could see both sides of the function call when you generated your assembly. Since the compiler was in control of generating code for both the caller and the callee, it didn't have to follow the cdecl calling convention, and it didn't.
The default calling convention for MSVC is cdecl. Basically, function parameters are pushed onto the stack in the reverse of the order they're listed, so a call to foo(10, 100) could result in the assembly:
push 100
push 10
call foo(int, int)
In your case, the compiler will generate something like the following at the call site:
push 100
call esi ; assuming the address of your code is in the register esi
That's not what your code is expecting though. Your code is expecting its argument to be passed in the register ecx, not the stack.
The compiler has used what looks like the fastcall calling convention. If I compile a similar program (I get slightly different assembly) I get the expected result:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <windows.h>
typedef unsigned char byte;
typedef int (_fastcall *int0_int)(int);
const byte square_code[] = {
0x8b, 0xc1,
0x0f, 0xaf, 0xc0,
0xc3
};
int main() {
byte* buf = reinterpret_cast<byte*>(VirtualAlloc(0, 1 << 8, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
if (buf == nullptr) return 0;
memcpy(buf, square_code, sizeof(square_code));
{
DWORD old;
VirtualProtect(buf, 1 << 8, PAGE_EXECUTE_READ, &old);
}
int0_int square = reinterpret_cast<int0_int>(buf);
int ans = square(100);
printf("%d\n", ans);
VirtualFree(buf, 0, MEM_RELEASE);
return 0;
}
Note that I've told the compiler to use the _fastcall calling convention. If you want to use cdecl, the assembly would need to look more like this:
push ebp
mov ebp, esp
mov eax, DWORD PTR _n$[ebp]
imul eax, eax
pop ebp
ret 0
(DISCLAIMER: I'm not great at assembly, and that was generated by Visual Studio)
I copy-pasted the output ('%' removed)
Well, that means your second instruction was
mov ecx, eax
which makes no sense at all (it overwrites the result of the multiplication with the uninitialized return value).
On the other hand
mov eax, foo
ret
is a very common pattern for ending a function with non-void return type.
The difference between your two assembly languages (AT&T style vs Intel style) is more than just the % marker: the operand order is reversed, and pointers and offsets are denoted very differently as well. For example, AT&T mov %ecx,%eax corresponds to Intel mov eax, ecx, not mov ecx, eax.
You'll want to issue a set disassembly-flavor intel command in gdb.
I am implementing the mmap function using a system call. (I am implementing mmap manually for certain reasons.)
But I am getting the return value -14 (-EFAULT, I checked with GDB) with this message:
WARN Nar::Mmap: Memory allocation failed.
Here is function:
void *Mmap(void *Address, size_t Length, int Prot, int Flags, int Fd, off_t Offset) {
MmapArgument ma;
ma.Address = (unsigned long)Address;
ma.Length = (unsigned long)Length;
ma.Prot = (unsigned long)Prot;
ma.Flags = (unsigned long)Flags;
ma.Fd = (unsigned long)Fd;
ma.Offset = (unsigned long)Offset;
void *ptr = (void *)CallSystem(SysMmap, (uint64_t)&ma, Unused, Unused, Unused, Unused);
int errCode = (int)ptr;
if(errCode < 0) {
Print("WARN Nar::Mmap: Memory allocation failed.\n");
return NULL;
}
return ptr;
}
I wrote a macro(To use like malloc() function):
#define Malloc(x) Mmap(0, x, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
and I used like this:
Malloc(45);
I looked at the man pages. I couldn't find anything about EFAULT on the mmap man page, but I found something about EFAULT on the mmap2 man page.
EFAULT Problem with getting the data from user space.
I think this means something is wrong with passing the struct to the system call.
But I believe nothing is wrong with my struct:
struct MmapArgument {
unsigned long Address;
unsigned long Length;
unsigned long Prot;
unsigned long Flags;
unsigned long Fd;
unsigned long Offset;
};
Maybe something is wrong with handling the result value?
Opening a file (which doesn't exist) with CallSystem gave me -2 (-ENOENT), which is correct.
EDIT: Full source of CallSystem. open, write, and close work, but mmap (or old_mmap) does not.
All of the arguments were passed correctly.
section .text
global CallSystem
CallSystem:
mov rax, rdi ;RAX
mov rbx, rsi ;RBX
mov r10, rdx
mov r11, rcx
mov rcx, r10 ;RCX
mov rdx, r11 ;RDX
mov rsi, r8 ;RSI
mov rdi, r9 ;RDI
int 0x80
mov rdx, 0 ;Upper 64bit
ret ;Return
It is unclear why you are calling mmap via your CallSystem function; I'll assume it is a requirement of your assignment.
The main problem with your code is that you are using int 0x80. This will only work if all the addresses passed to int 0x80 can be expressed in a 32-bit integer. That isn't the case in your code. This line:
MmapArgument ma;
places your structure on the stack. In 64-bit code the stack is at the top end of the addressable address space, well beyond what can be represented in a 32-bit address. Usually the bottom of the stack is somewhere in the region of 0x00007FFFFFFFFFFF. int 0x80 only works on the bottom half of the 64-bit registers, so stack-based addresses effectively get truncated, resulting in an incorrect address. To make proper 64-bit system calls it is preferable to use the syscall instruction.
The 64-bit System V ABI has a section on the general mechanism for the syscall interface in section A.2.1 AMD64 Linux Kernel Conventions. It says:
User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi,
%rsi, %rdx, %r10, %r8 and %r9.
A system-call is done via the syscall instruction. The kernel destroys
registers %rcx and %r11.
We can create a simplified version of your CallSystem code by placing the syscallnum as the last parameter. As the 7th parameter it will be the first and only value passed on the stack. We can move that value from the stack into RAX to be used as the system call number. The first 6 values are passed in the registers, and with the exception of RCX we can simply keep all the registers as-is. RCX has to be moved to R10 because the 4th parameter differs between a normal function call and the Linux kernel SYSCALL convention.
Some simplified code for demonstration purposes could look like:
global CallSystem
section .text
CallSystem:
mov rax, [rsp+8] ; CallSystem 7th arg is 1st val passed on stack
mov r10, rcx ; 4th argument passed to syscall in r10
; RDI, RSI, RDX, R8, R9 are passed straight through
; to the syscall because they match the inputs to CallSystem
syscall
ret
The C++ could look like:
#include <stdlib.h>
#include <sys/mman.h>
#include <stdint.h>
#include <iostream>
using namespace std;
extern "C" uint64_t CallSystem (uint64_t arg1, uint64_t arg2,
uint64_t arg3, uint64_t arg4,
uint64_t arg5, uint64_t arg6,
uint64_t syscallnum);
int main()
{
uint64_t addr;
addr = CallSystem(static_cast<uint64_t>(NULL), 45,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS,
-1, 0, 0x9);
cout << reinterpret_cast<void *>(addr) << endl;
}
In the case of mmap the syscall is 0x09. That can be found in the file asm/unistd_64.h:
#define __NR_mmap 9
The rest of the arguments are typical of the newer form of mmap. From the manpage:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If you run strace on your executable (i.e. strace ./a.out) you should find a line that looks like this if it works:
mmap(NULL, 45, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fed8e7cc000
The return value will differ, but it should match what the demonstration program displays.
You should be able to adapt this code to what you are doing. This should at least be a reasonable starting point.
If you want to pass the syscallnum as the first parameter to CallSystem you will have to modify the assembly code to move all the registers so that they align properly between the function call convention and syscall conventions. I leave that as a simple exercise to the reader. Doing so will yield a lot less efficient code.
While writing new code for Windows, I stumbled upon _cpuinfo() from the Windows API. As I am mainly dealing with a Linux environment (GCC), I want to have access to the CPUInfo.
I have tried the following:
#include <iostream>
int main()
{
int a, b;
for (a = 0; a < 5; a++)
{
__asm ( "mov %1, %%eax; " // a into eax
"cpuid;"
"mov %%eax, %0;" // eax into b
:"=r"(b) // output
:"r"(a) // input
:"%eax","%ebx","%ecx","%edx" // clobbered register
);
std::cout << "The CPUID level " << a << " gives EAX= " << b << '\n';
}
return 0;
}
This uses assembly, but I don't want to re-invent the wheel. Is there any other way to implement CPUInfo without assembly?
Since you are compiling with GCC then you can include cpuid.h which declares these functions:
/* Return highest supported input value for cpuid instruction. ext can
be either 0x0 or 0x8000000 to return highest supported value for
basic or extended cpuid information. Function returns 0 if cpuid
is not supported or whatever cpuid returns in eax register. If sig
pointer is non-null, then first four bytes of the signature
(as found in ebx register) are returned in location pointed by sig. */
unsigned int __get_cpuid_max (unsigned int __ext, unsigned int *__sig)
/* Return cpuid data for requested cpuid level, as found in returned
eax, ebx, ecx and edx registers. The function checks if cpuid is
supported and returns 1 for valid cpuid information or 0 for
unsupported cpuid level. All pointers are required to be non-null. */
int __get_cpuid (unsigned int __level,
unsigned int *__eax, unsigned int *__ebx,
unsigned int *__ecx, unsigned int *__edx)
You don't need to, and should not, re-implement this functionality.
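A minimal usage sketch (assuming an x86/x86-64 target compiled with GCC or Clang) could look like this:
#include <cpuid.h>
#include <iostream>
int main()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Leaf 0 reports the highest supported basic leaf in EAX.
    if (__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        std::cout << "Highest basic CPUID leaf: " << eax << '\n';
    else
        std::cout << "cpuid not supported\n";
}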
for (a =0; a < 5; ++a;)
There should only be two semicolons there. You've got three.
This is basic C/C++ syntax; the CPUID is a red herring.