I am implementing a 512-bit integer type in C++.
For the integer, I allocate memory from the heap using the new keyword, but the compiler (g++ version 8.1 on MinGW) seems to wrongly optimize that out.
The compiler commands are:
g++ -Wall -fexceptions -Og -g -fopenmp -std=c++14 -c main.cpp -o main.o
g++ -o bin\Debug\cs.exe obj\Debug\main.o -O0 -lgomp
Code:
#include <iostream>
#include <cstdint>
#include <omp.h>
constexpr unsigned char arr_size = 16;
constexpr unsigned char arr_size_half = 8;
void exit(int);
struct uint512_t{
uint32_t * bytes;
uint512_t(uint32_t num){
//The line below is either (wrongfully) ignored or (wrongfully) optimized out
bytes = new(std::nothrow) uint32_t[arr_size];
if(!bytes){
std::cerr << "Error - not enough memory available.";
exit(-1);
}
*bytes = num;
for(uint32_t * ptr = bytes+1; ptr < ptr+16; ++ptr){
//OS throws error 0xC0000005 (accessing unallocated memory) here
*ptr = 0;
}
}
uint512_t inline operator &(uint512_t &b){
uint32_t* itera = bytes;
uint32_t* iterb = b.bytes;
uint512_t ret(0);
uint32_t* iterret = ret.bytes;
for(char i = 0; i < arr_size; ++i){
*(iterret++) = *(itera++) & *(iterb++);
}
return ret;
}
uint512_t inline operator =(uint512_t &b){
uint32_t * itera=bytes, *iterb=b.bytes;
for(char i = 0; i < arr_size; ++i){
*(itera++) = *(iterb++);
}
return *this;
}
uint512_t inline operator + (uint512_t &b){
uint32_t * itera = bytes;
uint32_t * iterb = b.bytes;
uint64_t res = 0;
uint512_t ret(0);
uint32_t *p2ret = ret.bytes;
uint32_t *p2res = 1+(uint32_t*)&res;
//#pragma omp parallel for shared(p2ret, res, p2res, itera, iterb, ret) private(i, arr_size) schedule(auto)
for(char i = 0; i < arr_size;++i){
res = *p2res;
res += *(itera++);
res += *(iterb++);
*(p2ret++) = (i<15) ? res+*(p2res) : res;
}
return ret;
}
uint512_t inline operator += (uint512_t &b){
uint32_t * itera = bytes;
uint32_t * iterb = b.bytes;
uint64_t res = 0;
uint512_t ret(0);
uint32_t *p2ret = ret.bytes;
uint32_t *p2res = 1+(uint32_t*)&res;
//#pragma omp parallel for shared(p2ret, res, p2res, itera, iterb, ret) private(i, arr_size) schedule(auto)
for(char i = 0; i < arr_size;++i){
res = *p2res;
res += *(itera++);
res += *(iterb++);
*(p2ret++) = (i<15) ? res+(*p2res) : res;
}
(*this) = ret;
return *this;
}
//uint512_t inline operator * (uint512_t &b){
//}
~uint512_t(){
delete[] bytes;
}
};
int main(void){
uint512_t a(3);
}
ptr < ptr+16 is always true. The loop is infinite, and eventually overflows the buffer that it writes to.
Simple solution: Value initialise the array so that you don't need the loop:
bytes = new(std::nothrow) uint32_t[arr_size]();
// ^^
PS. If you copy an instance, the behaviour will be undefined, since the copy would point to the same allocation and both instances would attempt to delete it in their destructors.
Simple solution: Don't use bare owning pointers. Use a RAII container such as std::vector if you need to allocate an array dynamically.
PPS. Carefully consider whether you need dynamic allocation (and the associated overhead) in the first place. 512 bits is in many cases a fairly safe size to have in-place.
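To illustrate the last point, here is a minimal sketch of an in-place layout, assuming 16 limbs of 32 bits stored least significant first. The name uint512_inplace and its members are hypothetical and not part of the code in the question; only += is shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical in-place variant: 16 x 32-bit limbs, least significant first.
struct uint512_inplace {
    std::array<std::uint32_t, 16> limbs{};  // value-initialised, so all zero

    explicit uint512_inplace(std::uint32_t num = 0) { limbs[0] = num; }

    uint512_inplace& operator+=(const uint512_inplace& b) {
        std::uint64_t carry = 0;
        for (std::size_t i = 0; i < limbs.size(); ++i) {
            const std::uint64_t sum = std::uint64_t{limbs[i]} + b.limbs[i] + carry;
            limbs[i] = static_cast<std::uint32_t>(sum);  // keep the low 32 bits
            carry = sum >> 32;                           // carry into the next limb
        }
        return *this;  // a carry out of the top limb is discarded (wraps around)
    }
};
```

Because the storage is a plain member, the implicitly generated copy operations and destructor are correct, and there is no allocation that can fail.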
The error is at this line and has nothing to do with new being optimized away:
for(uint32_t * ptr = bytes+1; ptr < ptr+16; ++ptr){
*ptr = 0;
}
The loop condition is wrong. ptr < ptr+16 can never be false, because both sides advance together each time ptr is incremented. The loop goes on forever, and eventually you dereference an invalid memory location.
By the way, the compiler is allowed to perform optimizations, but it is not allowed to change the observable behavior of the program. If your code performs a new, the compiler can optimize it away only if it can ensure that the effects of the new are still there when you need them (in this case, at the moment you access the array).
You are accessing the array out of bounds. The smallest reproducible example would be:
#include <cstdint>
int main() {
uint32_t bytes[16];
for(uint32_t * ptr = bytes + 1; ptr < ptr + 16; ++ptr){
//OS throws error 0xC0000005 (accessing unallocated memory) here
*ptr = 0;
}
}
The condition ptr < ptr + 16 is always true (except possibly when the pointer arithmetic overflows, which is itself undefined behavior).
P.S. I tried your solution, and it worked fine:
bytes = new(std::nothrow) uint32_t[arr_size];
if(!bytes){
std::cerr << "Error - not enough memory available.";
exit(-1);
}
*bytes = num;
auto ptrp16 = bytes+16;
for(uint32_t * ptr = bytes+1;ptr < ptrp16 ; ++ptr){
*ptr = 0;
}
I need to read a binary file which is made of many basic types such as int, double, UTF8 strings, etc. For instance, think about one file containing n pairs of (int, double) one after the other, without any alignment with n being in the order of tens of millions. I need to get very fast access to that file. I read the file using fread calls and my own buffer which is about 16 kB long.
A profiler shows that my main bottleneck is copying from the memory buffer to its final destination. The most obvious way to write a function that copies from the buffer to a double would be:
// x: a pointer to the final destination of the data
// p: a pointer to the buffer used to read the file
//
void f0(double* x, const unsigned char* p) {
unsigned char* q = reinterpret_cast<unsigned char*>(x);
for (int i = 0; i < 8; ++i) {
q[i] = p[i];
}
}
If I use the following code, I get a huge speedup on x86-64:
void f1(double* x, const unsigned char* p) {
const double* r = reinterpret_cast<const double*>(p);
*x = *r;
}
But, as I understand, the program would crash on ARM if p is not 8-byte aligned.
Here are my questions:
Is the second program guaranteed to work on both x86 and x86-64?
How would you write such a function on ARM if you need it as fast as you can?
Here is a small benchmark to test on your machine
#include <chrono>
#include <iostream>
void copy_int_0(int* x, const unsigned char* p) {
unsigned char* q = reinterpret_cast<unsigned char*>(x);
for (std::size_t i = 0; i < 4; ++i) {
q[i] = p[i];
}
}
void copy_double_0(double* x, const unsigned char* p) {
unsigned char* q = reinterpret_cast<unsigned char*>(x);
for (std::size_t i = 0; i < 8; ++i) {
q[i] = p[i];
}
}
void copy_int_1(int* x, const unsigned char* p) {
*x = *reinterpret_cast<const int*>(p);
}
void copy_double_1(double* x, const unsigned char* p) {
*x = *reinterpret_cast<const double*>(p);
}
int main() {
const std::size_t n = 10000000;
const std::size_t nb_times = 200;
unsigned char* p = new unsigned char[12 * n];
for (std::size_t i = 0; i < 12 * n; ++i) {
p[i] = 0;
}
int* q0 = new int[n];
for (std::size_t i = 0; i < n; ++i) {
q0[i] = 0;
}
double* q1 = new double[n];
for (std::size_t i = 0; i < n; ++i) {
q1[i] = 0.0;
}
const auto begin_0 = std::chrono::high_resolution_clock::now();
for (std::size_t k = 0; k < nb_times; ++k) {
for (std::size_t i = 0; i < n; ++i) {
copy_int_0(q0 + i, p + 12 * i);
copy_double_0(q1 + i, p + 4 + 12 * i);
}
}
const auto end_0 = std::chrono::high_resolution_clock::now();
const double time_0 =
1.0e-9 *
std::chrono::duration_cast<std::chrono::nanoseconds>(end_0 - begin_0)
.count();
std::cout << "Time 0: " << time_0 << " s" << std::endl;
const auto begin_1 = std::chrono::high_resolution_clock::now();
for (std::size_t k = 0; k < nb_times; ++k) {
for (std::size_t i = 0; i < n; ++i) {
copy_int_1(q0 + i, p + 12 * i);
copy_double_1(q1 + i, p + 4 + 12 * i);
}
}
const auto end_1 = std::chrono::high_resolution_clock::now();
const double time_1 =
1.0e-9 *
std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - begin_1)
.count();
std::cout << "Time 1: " << time_1 << " s" << std::endl;
std::cout << "Prevent optimization: " << q0[0] << " " << q1[0] << std::endl;
delete[] q1;
delete[] q0;
delete[] p;
return 0;
}
The results I get are
clang++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.49403 s
Time 1: 4.01617 s
g++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.65762 s
Time 1: 3.89979 s
icpc -std=c++11 -O3 -xHost copy.cpp -o copy
./copy
Time 0: 8.46155 s
Time 1: 0.0278496 s
I have not checked the assembly yet, but I guess that the Intel compiler is optimizing away part of my benchmark here.
Is the second program guaranteed to work on both x86 and x86-64?
No.
When you dereference a double* the compiler is free to assume that the memory location actually contains a double, which means that it must be aligned to alignof(double).
A lot of x86 instructions are safe to use for unaligned data, but not all of them. Specifically, there are SIMD instructions which require proper alignment which your compiler is free to use.
This isn't just theoretical; LZ4 used to use something very similar to what you posted (it's C, not C++, so it was a C-style cast not reinterpret_cast, but that doesn't really matter), and everything worked as expected. Then GCC 5 was released, and it auto-vectorized the code in question at -O3 using vmovdqa, which requires proper alignment. The end result is that code which worked fine in GCC ≤ 4.9 started crashing at runtime when compiled with GCC ≥ 5.
In other words, even if your program happens to work today, if you depend on unaligned access (or other undefined behavior), it can easily stop working tomorrow. Don't do it.
How would you write such a function on ARM if you need it as fast as you can?
The answer isn't really ARM-specific. After the LZ4 incident, Yann Collet (the author of LZ4) did a lot of research to answer this question. There isn't one option which will generate optimal code with every compiler on every architecture.
Using memcpy() is the safest option. If the size is known at compile time the compiler will generally optimize the memcpy() call away… for larger buffers, you can take advantage of that by calling memcpy() in a loop; you'll generally get a loop of fast instructions without the additional overhead of calling memcpy().
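A sketch of the memcpy() approach applied to the copy in the question. The helper name load_unaligned is hypothetical; the only library call is std::memcpy with a size known at compile time.

```cpp
#include <cstring>

// Hypothetical helper: read a T from a byte buffer at any alignment.
// memcpy has no alignment requirement on its source, so this avoids
// dereferencing a misaligned T*.
template <typename T>
T load_unaligned(const unsigned char* p) {
    T value;
    std::memcpy(&value, p, sizeof(T));  // size is a compile-time constant
    return value;
}
```

For example, *x = load_unaligned<double>(p); would replace the body of f1. Whether this compiles down to a single load depends on the compiler and target, so it is worth checking the generated assembly.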
If you're feeling more adventurous you can use a packed union to "cast" instead of reinterpret_cast. This is compiler-specific, but when supported it should be safe, and it may be faster than memcpy().
FWIW, I have some code which attempts to find the optimal way to do this depending on various factors (compiler, compiler version, architecture, etc.). It is a bit conservative about platforms I haven't tested, but it should achieve good results on the vast majority of platforms people actually use.
For two-dimensional array allocation in C/C++, the very common code is:
const int array_size = .. ;
array = (int**) malloc(array_size);
for (int c=0;c<array_size;c++)
array[c] = (int*) malloc(other_size);
But I think we should be writing this:
const int array_size = .. ;
array = (int*) malloc(array_size);
int c;
bool free_array = false;
for (c=0;c<array_size;c++) {
array[c] = (int*) malloc(other_size);
if(array[c] == NULL){
free_array = true;
break;
}
}
if(free_array) {
for (int c1=0;c1<c;c1++)
free(array[c1]);
}
to make sure that if one allocation failed we will free the previously allocated memory.
Am I correct?
Note: in C++ there are safer alternatives using smart pointers and STL containers, but let's talk about raw pointers or C pointers here.
Generally speaking, if you detect that malloc fails, the only thing you can really do is exit(). At that point, you can't safely do anything regarding memory allocation or deallocation.
The only exception is if you're in an embedded environment where exiting is not an option. In that case, you probably shouldn't be using malloc in the first place.
Firstly, your code is malformed
array = (int*)
array[c] = (int*)
this suggests you intended
array = (int**)
array[c] = (int*)
Next you claim this is "very common", when all it is is "very lazy".
A better solution is a single allocation.
#include <stdlib.h> // calloc, free
void* alloc_2d_array(size_t xDim, size_t yDim, size_t elementSize)
{
size_t indexSize = sizeof(void*) * xDim;
size_t dataSize = elementSize * yDim * xDim;
size_t totalSize = indexSize + dataSize;
void* ptr = calloc(1, totalSize);
if (!ptr)
return ptr;
// The row-pointer index sits at the front; the element data follows it.
void** index = (void**)ptr;
void** endIndex = index + xDim;
char* data = (char*)ptr + indexSize;
// Point each index entry at the start of its row of yDim elements.
while (index < endIndex) {
*index++ = data;
data += elementSize * yDim;
}
return ptr;
}
int main()
{
int** ptr = (int**)alloc_2d_array(3, 7, sizeof(int));
for (size_t x = 0; x < 3; ++x) {
for (size_t y = 0; y < 7; ++y) {
ptr[x][y] = (10 * (x+1)) + (y + 1);
}
}
free(ptr);
return 0;
}
However, this assumes the language is C; in C++ the above code is pretty much a total failure.
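For C++, one possible counterpart to the single-allocation idea is a flat std::vector with computed indices. This is only a sketch: the class name Grid2D and its members are hypothetical, and it trades the ptr[x][y] syntax for an at(row, col) accessor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: a rows x cols grid stored in one contiguous vector.
template <typename T>
class Grid2D {
public:
    Grid2D(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}  // elements are value-initialised

    // Row-major indexing: element (row, col) lives at row * cols + col.
    T& at(std::size_t row, std::size_t col) { return data_[row * cols_ + col]; }
    const T& at(std::size_t row, std::size_t col) const {
        return data_[row * cols_ + col];
    }

private:
    std::size_t cols_;
    std::vector<T> data_;
};
```

The vector releases its memory automatically, and a failed allocation is reported by an exception from the constructor, so there is no partially built structure to clean up by hand.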
I am trying to vectorize the following function with clang according to this clang reference. It takes a vector of bytes and applies a mask as described in this RFC.
static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4]) {
#pragma clang loop vectorize(enable) interleave(enable)
for (size_t i = 0; i < payload.size(); i++) {
payload[i] = payload[i] ^ masking_key[i % 4];
}
}
The following flags are passed to clang:
-O3
-Rpass=loop-vectorize
-Rpass-analysis=loop-vectorize
However, the vectorization fails with the following error:
WebSocket.cpp:5:
WebSocket.h:14:
In file included from boost/asio/io_service.hpp:767:
In file included from boost/asio/impl/io_service.hpp:19:
In file included from boost/asio/detail/service_registry.hpp:143:
In file included from boost/asio/detail/impl/service_registry.ipp:19:
c++/v1/vector:1498:18: remark: loop not vectorized: could not determine number
of loop iterations [-Rpass-analysis]
return this->__begin_[__n];
^
c++/v1/vector:1498:18: error: loop not vectorized: failed explicitly specified
loop vectorization [-Werror,-Wpass-failed]
How do I vectorize this for loop?
Thanks to @PaulR and @PeterCordes. Unrolling the loop by a factor of 4 works.
void apply_mask(vector<uint8_t> &payload, const uint8_t (&masking_key)[4]) {
const size_t size = payload.size();
const size_t size4 = size / 4;
size_t i = 0;
uint8_t *p = &payload[0];
uint32_t *p32 = reinterpret_cast<uint32_t *>(p);
const uint32_t m = *reinterpret_cast<const uint32_t *>(&masking_key[0]);
#pragma clang loop vectorize(enable) interleave(enable)
for (i = 0; i < size4; i++) {
p32[i] = p32[i] ^ m;
}
for (i = (size4*4); i < size; i++) {
p[i] = p[i] ^ masking_key[i % 4];
}
}
gcc.godbolt code
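The reinterpret_cast to uint32_t* above relies on the payload being suitably aligned and on the compiler tolerating the type punning. A more conservative sketch moves each 4-byte group through a uint32_t with std::memcpy instead; the function name apply_mask_memcpy is hypothetical, and whether a given compiler vectorizes this loop is an assumption that should be checked with -Rpass=loop-vectorize.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical variant of apply_mask that avoids casting the byte buffer
// to uint32_t*. XOR acts on each byte independently, so copying bytes in
// memory order gives the same result on any endianness.
void apply_mask_memcpy(std::vector<std::uint8_t>& payload,
                       const std::uint8_t (&masking_key)[4]) {
    std::uint32_t m;
    std::memcpy(&m, masking_key, sizeof m);

    std::uint8_t* p = payload.data();
    const std::size_t size = payload.size();
    const std::size_t whole = size - size % 4;  // bytes covered by full groups

    for (std::size_t i = 0; i < whole; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, p + i, sizeof word);
        word ^= m;
        std::memcpy(p + i, &word, sizeof word);
    }
    // Remaining 0 to 3 bytes; whole is a multiple of 4, so i % 4 stays in step.
    for (std::size_t i = whole; i < size; ++i) {
        p[i] = static_cast<std::uint8_t>(p[i] ^ masking_key[i % 4]);
    }
}
```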
How can I copy everything from char**** first to a contiguous variable: char* second without using nested loops?
For example, I could use nested for loops that would look like this:
for (size_t a=0; a < sizeof(***first); a++) {
for (size_t b=0; b < sizeof(**first); b++) {
//etc
}
}
I just want to know if this is possible.
You seem to have a serious misconception about what sizeof does.
sizeof(x) returns the size in bytes of object x. When dealing with pointers, sizeof(*x) is in general not the same as the number of elements that x points to.
Note also that, in the cases you are using, sizeof(x) is a value decided at compile time, and sizeof(*x) doesn't even look at what x is pointing to (it only looks at the type of the object x points to: for example, with int *x = NULL; the expression sizeof(*x) is the same as sizeof(int)).
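A small sketch of this point; the two function names are hypothetical. Both results depend only on the types involved, never on how many elements the pointer happens to refer to.

```cpp
#include <cstddef>

// sizeof(*x) is an unevaluated operand: x is never dereferenced, so even a
// null pointer is fine here. The result is simply sizeof(int).
std::size_t pointee_size(const int* x) {
    return sizeof(*x);
}

// sizeof(x) is the size of the pointer itself, not of what it points to.
std::size_t pointer_size(const int* x) {
    return sizeof(x);
}
```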
Moreover you need to understand the difference between a multi-dimensional array of ints:
int x[10][10];
and a pointer to a pointer to an int:
int **y;
Even if the two can be dereferenced using the same syntax
x[1][2] = 42;
y[1][2] = 42;
the meaning is completely different. More specifically, y[0] and y[1] may be pointing to (the first elements of) arrays of different sizes, they may be NULL, or they may point to single integers (y[0] and y[1] could even be pointing to the same object).
There is no way to copy a pointer-pointer data structure into a multidimensional array without loops because the two are in general objects with a completely different kind of shape.
If and only if you know the individual sizes of the nested arrays, you can flatten the array in a single loop. Note, though, that flattening in one loop may not be any faster than using nested loops; that depends on your compiler's optimizations (such as vectorization) and on your platform and architecture. Also, if you have to determine the size of each nested array first, you will incur a performance hit on top of the large loop. YMMV, so it's best to run some small-scale benchmarks to verify that you get the result and performance you want.
The following is some example code on how to flatten an array; note that I'm using a std::string type in this example just so you can run the code and print to a file/stdout to see that the array has indeed been flattened.
#include <iostream>
#include <string>
#include <sstream>
void flatten(std::string**** first, int len_a, int len_b, int len_c, int len_d, std::string* second)
{
int i, a, b, c, d;
int max_len = len_a * len_b * len_c * len_d;
a = b = c = d = -1;
for (i = 0; i < max_len; ++i) {
d = (i % len_d);
// if all lengths were equal, you could further optimize this loop
if (d == 0) {
++c;
if ((c % len_c) == 0) {
c = 0;
++b;
if ((b % len_b) == 0) {
b = 0;
++a;
}
}
}
second[i] = first[a][b][c][d];
}
}
int main()
{
const int len_a = 11;
const int len_b = 22;
const int len_c = 33;
const int len_d = 44;
const int max_len = len_a * len_b * len_c * len_d;
// create the arrays
std::string**** first = new std::string***[len_a];
std::string* second = new std::string[max_len];
for (int i1 = 0; i1 < len_a; ++i1) {
first[i1] = new std::string**[len_b];
for (int i2 = 0; i2 < len_b; ++i2) {
first[i1][i2] = new std::string*[len_c];
for (int i3 = 0; i3 < len_c; ++i3) {
first[i1][i2][i3] = new std::string[len_d];
for (int i4 = 0; i4 < len_d; ++i4) {
std::stringstream ss;
ss <<"["<<i1<<"]["<<i2<<"]["<<i3<<"]["<<i4<<"]";
first[i1][i2][i3][i4] = ss.str(); // or what have you
}
}
}
}
// flatten the multidimensional array 'first' into the array 'second'
flatten(first, len_a, len_b, len_c, len_d, second);
// print it
for (int i1 = 0; i1 < max_len; ++i1) {
std::cout<<"second["<<i1<<"] = "<<second[i1]<<std::endl;
}
// clean up
delete[] second;
for (int i1 = 0; i1 < len_a; ++i1) {
for (int i2 = 0; i2 < len_b; ++i2) {
for (int i3 = 0; i3 < len_c; ++i3) {
delete[] first[i1][i2][i3];
}
delete[] first[i1][i2];
}
delete[] first[i1];
}
delete[] first;
return 0;
}
Again, this is obviously not a safe/clean/efficient method of doing what you're looking for, but I'm merely trying to demonstrate that you can achieve it and I'll leave further efficiencies/implementation details to you.
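As one possible variation, the four indices can be decoded directly from the flat index with division and modulo instead of being tracked with counters. This is a sketch under the same assumption as above, namely that every nested array at a given depth has the same known length; the name flatten_by_index is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: single-loop flatten that decodes (a, b, c, d) from i.
// Assumes first[a][b][c] has exactly len_d elements for every a, b, c, and
// that second has room for len_a * len_b * len_c * len_d elements.
template <typename T>
void flatten_by_index(T**** first, std::size_t len_a, std::size_t len_b,
                      std::size_t len_c, std::size_t len_d, T* second) {
    const std::size_t total = len_a * len_b * len_c * len_d;
    for (std::size_t i = 0; i < total; ++i) {
        const std::size_t d = i % len_d;                      // fastest index
        const std::size_t c = (i / len_d) % len_c;
        const std::size_t b = (i / (len_d * len_c)) % len_b;
        const std::size_t a = i / (len_d * len_c * len_b);    // slowest index
        second[i] = first[a][b][c][d];
    }
}
```

The divisions make each iteration more expensive than the counter-based loop, so this is shorter rather than necessarily faster.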
I hope that can help.
I want to compile some of my cpp-functions with the avr-g++ compiler & linker. My experience from former projects tells me that it definitely works with new and delete. But somehow this function compiles without errors:
void usart_controller::send_data(uint32_t * data32, size_t data32_size)
{
size_t data_size = 4 * data32_size;
//uint8_t * data = new uint8_t[data_size];
uint8_t data[data_size];
uint8_t *data_ptr = &data[0];
for(unsigned int i = 0; i < data32_size; i++)
{
for(int j = 0; j < 4; j++)
{
data[i*4+j] = static_cast<uint8_t>(data32[i] >> (j*8)); // byte j of word i, least significant first
}
}
/*usart_serial_write_packet(this->usart, *data_ptr, (size_t)(data_size * sizeof(uint8_t)));*/
size_t len = sizeof(uint8_t)*data_size;
while (len) {
usart_serial_putchar(this->usart, *data_ptr);
len--;
data_ptr++;
}
//delete[] data;//Highly discouraged, because of memory leak!//Works as a charme because of C, but I don't care at the moment
}
but the same function with new does not work:
void usart_controller::send_data(uint32_t * data32, size_t data32_size)
{
size_t data_size = 4 * data32_size;
uint8_t * data = new uint8_t[data_size];
//uint8_t data[data_size];
//uint8_t *data_ptr = &data[0];
for(unsigned int i = 0; i < data32_size; i++)
{
for(int j = 0; j < 4; j++)
{
data[i*4+j] = static_cast<uint8_t>(data32[i] >> (j*8)); // byte j of word i, least significant first
}
}
/*usart_serial_write_packet(this->usart, *data_ptr, (size_t)(data_size * sizeof(uint8_t)));*/
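size_t len = sizeof(uint8_t)*data_size;
uint8_t *data_ptr = data; // walk a copy so that delete[] still receives the original pointer
while (len) {
usart_serial_putchar(this->usart, *data_ptr);
len--;
data_ptr++;
}
delete[] data;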
}
Here I get the following errors:
error: undefined reference to `operator new[](unsigned int)'
error: undefined reference to `operator delete[](void*)'
The compile and link command is (shortened):
"C:\Program Files (x86)\Atmel\Atmel Toolchain\AVR8 GCC\Native\3.4.1061\avr8-gnu-toolchain\bin\avr-g++.exe" -o PreAmp.elf <...> usart_controller.o <...> -Wl,-Map="PreAmp.map" -Wl,--start-group -Wl,-lm -Wl,--end-group -Wl,--gc-sections -mmcu=atxmega16a4u
So I assume that I am using the g++ compiler and not the gcc compiler. But in C++ it should be impossible to declare a variable-length array as done above. Where is my mistake?
I did not see any information on the controller or the IDE (if any) you are using.
But if you are using Atmel Studio or the AVR toolchain from Atmel, they make it pretty clear that new and delete are not supported and have to be implemented by the user.
This makes sense, since this is not a desktop application but an implementation on a microcontroller.
http://www.atmel.com/webdoc/avrlibcreferencemanual/faq_1faq_cplusplus.html
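One common way to supply the missing operators is to forward them to malloc and free from the C library. The following is a minimal sketch under that assumption. It does no error handling, so new yields a null pointer when malloc fails instead of throwing, and it presumes that heap use is acceptable on your device at all.

```cpp
#include <stdlib.h>  // malloc, free, size_t

// Sketch only: forward the global allocation operators to the C heap.
// No exception is thrown on failure; callers must check for a null pointer.
void* operator new(size_t size)   { return malloc(size); }
void* operator new[](size_t size) { return malloc(size); }

void operator delete(void* ptr)   { free(ptr); }
void operator delete[](void* ptr) { free(ptr); }
```

With these four definitions linked into the project, the undefined references to operator new[] and operator delete[] should be resolved.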