I am trying to vectorize the following function with clang, following this clang reference. It takes a vector of bytes and applies a mask to it as described in this RFC.
static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4]) {
#pragma clang loop vectorize(enable) interleave(enable)
    for (size_t i = 0; i < payload.size(); i++) {
        payload[i] = payload[i] ^ masking_key[i % 4];
    }
}
The following flags are passed to clang:
-O3
-Rpass=loop-vectorize
-Rpass-analysis=loop-vectorize
However, the vectorization fails with the following error:
WebSocket.cpp:5:
WebSocket.h:14:
In file included from boost/asio/io_service.hpp:767:
In file included from boost/asio/impl/io_service.hpp:19:
In file included from boost/asio/detail/service_registry.hpp:143:
In file included from boost/asio/detail/impl/service_registry.ipp:19:
c++/v1/vector:1498:18: remark: loop not vectorized: could not determine number
of loop iterations [-Rpass-analysis]
return this->__begin_[__n];
^
c++/v1/vector:1498:18: error: loop not vectorized: failed explicitly specified
loop vectorization [-Werror,-Wpass-failed]
How do I vectorize this for loop?
Thanks to @PaulR and @PeterCordes. Unrolling the loop by a factor of 4 works:
void apply_mask(vector<uint8_t> &payload, const uint8_t (&masking_key)[4]) {
    const size_t size = payload.size();
    const size_t size4 = size / 4;
    size_t i = 0;
    uint8_t *p = &payload[0];
    uint32_t *p32 = reinterpret_cast<uint32_t *>(p);
    const uint32_t m = *reinterpret_cast<const uint32_t *>(&masking_key[0]);
#pragma clang loop vectorize(enable) interleave(enable)
    for (i = 0; i < size4; i++) {
        p32[i] = p32[i] ^ m;
    }
    for (i = (size4 * 4); i < size; i++) {
        p[i] = p[i] ^ masking_key[i % 4];
    }
}
gcc.godbolt code
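As a side note, the reinterpret_cast in the solution above technically violates strict aliasing and assumes suitable alignment of the vector's data. Below is a hedged alternative sketch (the name apply_mask_memcpy is illustrative) that keeps the word-at-a-time idea but moves the 32-bit words with memcpy, which compilers lower to the same single loads and stores:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Same 4-bytes-at-a-time masking, but the 32-bit words are moved with
// memcpy instead of pointer casts, so there is no aliasing/alignment UB.
static void apply_mask_memcpy(std::vector<uint8_t> &payload,
                              const uint8_t (&masking_key)[4]) {
    uint32_t m;
    std::memcpy(&m, masking_key, 4);  // build the 32-bit mask once
    const std::size_t size = payload.size();
    std::size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        uint32_t w;
        std::memcpy(&w, &payload[i], 4);
        w ^= m;
        std::memcpy(&payload[i], &w, 4);
    }
    for (; i < size; i++) {  // scalar tail for the last 0-3 bytes
        payload[i] = payload[i] ^ masking_key[i % 4];
    }
}
```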
Below is my code, which processes the payload[] array and stores its result in the myFinalShellcode[] array.
#include <windows.h>
#include <stdio.h>
unsigned char payload[] = { 0xf0,0xe8,0xc8,0x00,0x00,0x00,0x41,0x51,0x41,0x50,0x52,0x51,0x56,0x48,0x31 };
constexpr int length = 891;
constexpr int number_of_chunks = 5;
constexpr int chunk_size = length / number_of_chunks;
constexpr int remaining_bytes = length % number_of_chunks;
constexpr int size_after = length * 2;
unsigned char* restore_original(unsigned char* high_ent_payload)
{
    constexpr int payload_size = (size_after + 1) / 2;
    unsigned char low_entropy_payload_holder[size_after] = { 0 };
    memcpy_s(low_entropy_payload_holder, sizeof low_entropy_payload_holder, high_ent_payload, size_after);
    unsigned char restored_payload[payload_size] = { 0 };
    int offset_payload_after = 0;
    int offset_payload = 0;
    for (size_t i = 0; i < number_of_chunks; i++)
    {
        for (size_t j = 0; j < chunk_size; j++)
        {
            restored_payload[offset_payload] = low_entropy_payload_holder[offset_payload_after];
            offset_payload_after++;
            offset_payload++;
        }
        for (size_t k = 0; k < chunk_size; k++)
        {
            offset_payload_after++;
        }
    }
    if (remaining_bytes)
    {
        for (size_t i = 0; i < sizeof remaining_bytes; i++)
        {
            restored_payload[offset_payload++] = high_ent_payload[offset_payload_after++];
        }
    }
    return restored_payload;
}
int main() {
    unsigned char shellcode[] = restore_original(payload);
}
I get the following error on the last code line (inside main function):
Error: Initialization with '{...}' expected for aggregate object
I tried changing various things on the array itself (it seems like the arrays might be the problem). I would highly appreciate your help, as this is part of my personal research :)
In order to initialize an array defined with [], you must supply a list of values enclosed in {}, exactly as the error message says.
E.g.:
unsigned char shellcode[] = {1,2,3};
You can change shellcode to be a pointer if you want to assign it the output from restore_original:
unsigned char* shellcode = restore_original(payload);
Update:
As you can see in @heapunderrun's comment, there is another problem in your code: restore_original returns a pointer to a local variable, which is no longer valid once the function returns (a dangling pointer).
To fix this, restore_original should allocate memory on the heap using new. That allocation has to be freed eventually, once you are done with shellcode.
However, although you can make it work this way, I highly recommend that you use std::vector for dynamic arrays allocated on the heap. It saves you from managing the memory allocations/deallocations manually, among other advantages.
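A minimal sketch of that suggestion (the names and the explicit size parameter are illustrative, not taken from the question): return a std::vector by value instead of a pointer to a local array:

```cpp
#include <cstddef>
#include <vector>

// The vector owns heap storage, so nothing dangles when the function
// returns, and the memory is freed automatically when the caller's
// vector goes out of scope.
std::vector<unsigned char> restore_original(const unsigned char* src,
                                            std::size_t n) {
    return std::vector<unsigned char>(src, src + n);  // heap-allocated copy
}
```

The caller then writes `std::vector<unsigned char> shellcode = restore_original(payload, n);` and never touches new/delete.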
You can't assign a char * to a char []. You could probably do something with constexpr, but I suspect an XY problem here.
I am programming a 512-bit integer in C++.
For the integer, I allocate memory from the heap using the new keyword, but the compiler (g++ version 8.1 on MinGW) seems to wrongfully optimize that out.
The compiler commands are:
g++ -Wall -fexceptions -Og -g -fopenmp -std=c++14 -c main.cpp -o main.o
g++ -o bin\Debug\cs.exe obj\Debug\main.o -O0 -lgomp
Code:
#include <iostream>
#include <cstdint>
#include <omp.h>

constexpr unsigned char arr_size = 16;
constexpr unsigned char arr_size_half = 8;

void exit(int);

struct uint512_t{
    uint32_t * bytes;

    uint512_t(uint32_t num){
        //The line below is either (wrongfully) ignored or (wrongfully) optimized out
        bytes = new(std::nothrow) uint32_t[arr_size];
        if(!bytes){
            std::cerr << "Error - not enough memory available.";
            exit(-1);
        }
        *bytes = num;
        for(uint32_t * ptr = bytes+1; ptr < ptr+16; ++ptr){
            //OS throws error 0xC0000005 (accessing unallocated memory) here
            *ptr = 0;
        }
    }

    uint512_t inline operator &(uint512_t &b){
        uint32_t* itera = bytes;
        uint32_t* iterb = b.bytes;
        uint512_t ret(0);
        uint32_t* iterret = ret.bytes;
        for(char i = 0; i < arr_size; ++i){
            *(iterret++) = *(itera++) & *(iterb++);
        }
        return ret;
    }

    uint512_t inline operator =(uint512_t &b){
        uint32_t * itera=bytes, *iterb=b.bytes;
        for(char i = 0; i < arr_size; ++i){
            *(itera++) = *(iterb++);
        }
        return *this;
    }

    uint512_t inline operator + (uint512_t &b){
        uint32_t * itera = bytes;
        uint32_t * iterb = b.bytes;
        uint64_t res = 0;
        uint512_t ret(0);
        uint32_t *p2ret = ret.bytes;
        uint32_t *p2res = 1+(uint32_t*)&res;
        //#pragma omp parallel for shared(p2ret, res, p2res, itera, iterb, ret) private(i, arr_size) schedule(auto)
        for(char i = 0; i < arr_size; ++i){
            res = *p2res;
            res += *(itera++);
            res += *(iterb++);
            *(p2ret++) = (i<15) ? res+*(p2res) : res;
        }
        return ret;
    }

    uint512_t inline operator += (uint512_t &b){
        uint32_t * itera = bytes;
        uint32_t * iterb = b.bytes;
        uint64_t res = 0;
        uint512_t ret(0);
        uint32_t *p2ret = ret.bytes;
        uint32_t *p2res = 1+(uint32_t*)&res;
        //#pragma omp parallel for shared(p2ret, res, p2res, itera, iterb, ret) private(i, arr_size) schedule(auto)
        for(char i = 0; i < arr_size; ++i){
            res = *p2res;
            res += *(itera++);
            res += *(iterb++);
            *(p2ret++) = (i<15) ? res+(*p2res) : res;
        }
        (*this) = ret;
        return *this;
    }

    //uint512_t inline operator * (uint512_t &b){
    //}

    ~uint512_t(){
        delete[] bytes;
    }
};

int main(void){
    uint512_t a(3);
}
ptr < ptr+16 is always true. The loop is infinite, and eventually overflows the buffer that it writes to.
Simple solution: Value initialise the array so that you don't need the loop:
bytes = new(std::nothrow) uint32_t[arr_size]();
// ^^
PS. If you copy an instance, the behaviour will be undefined, since the copy would point to same allocation and both instances would attempt to delete it in the destructor.
Simple solution: Don't use bare owning pointers. Use a RAII container such as std::vector if you need to allocate an array dynamically.
PPS. Carefully consider whether you need dynamic allocation (and the associated overhead) in the first place. 512 bits is in many cases a fairly safe size to have in-place.
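To illustrate that last point, here is a hedged sketch of the in-place layout (the field name words is illustrative): with std::array there is no heap allocation, no destructor to write, and copies are safe by default:

```cpp
#include <array>
#include <cstdint>

// 512 bits stored in place as 16 x 32-bit words.
struct uint512_t {
    std::array<uint32_t, 16> words{};   // value-initialised: all zero

    explicit uint512_t(uint32_t num) { words[0] = num; }
    // No user-written destructor or copy operations needed: copying the
    // struct copies the 64 bytes in place.
};
```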
The error is at this line and has nothing to do with new being optimized away:
for(uint32_t * ptr = bytes+1; ptr < ptr+16; ++ptr){
    *ptr = 0;
}
The condition of the for loop is wrong: ptr < ptr+16 will never be false. The loop goes on forever, and eventually you dereference an invalid memory location because ptr is incremented ad infinitum.
By the way, the compiler is allowed to perform optimizations, but it is not allowed to change the observable behavior of the program. If your code performs a new, the compiler can only optimize it away if it can ensure that the side effects of new are there when you need them (in this case, at the moment you access the array).
You are accessing the array out of bounds. The smallest reproducible example would be:
#include <cstdint>

int main() {
    uint32_t bytes[16];
    for(uint32_t * ptr = bytes + 1; ptr < ptr + 16; ++ptr){
        //OS throws error 0xC0000005 (accessing unallocated memory) here
        *ptr = 0;
    }
}
The condition ptr < ptr + 16 is always true (except possibly on pointer overflow).
P.S. I tried your solution, and it worked fine:
bytes = new(std::nothrow) uint32_t[arr_size];
if(!bytes){
    std::cerr << "Error - not enough memory available.";
    exit(-1);
}
*bytes = num;
auto ptrp16 = bytes+16;
for(uint32_t * ptr = bytes+1; ptr < ptrp16; ++ptr){
    *ptr = 0;
}
I need to read a binary file which is made of many basic types such as int, double, UTF-8 strings, etc. For instance, think of a file containing n pairs of (int, double) one after the other, without any alignment, with n on the order of tens of millions. I need very fast access to that file. I read the file using fread calls into my own buffer, which is about 16 kB long.
A profiler shows that my main bottleneck is copying from the memory buffer to its final destination. The most obvious way to write a function that copies from the buffer to a double would be:
// x: a pointer to the final destination of the data
// p: a pointer to the buffer used to read the file
//
void f0(double* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (int i = 0; i < 8; ++i) {
        q[i] = p[i];
    }
}
If I use the following code, I get a huge speedup on x86-64:
void f1(double* x, const unsigned char* p) {
    const double* r = reinterpret_cast<const double*>(p);
    *x = *r;
}
But, as I understand it, the program would crash on ARM if p is not 8-byte aligned.
Here are my questions:
Is the second program guaranteed to work on both x86 and x86-64?
How would you write such a function on ARM if you need it as fast as you can?
Here is a small benchmark to test on your machine
#include <chrono>
#include <cstddef>
#include <iostream>

void copy_int_0(int* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (std::size_t i = 0; i < 4; ++i) {
        q[i] = p[i];
    }
}

void copy_double_0(double* x, const unsigned char* p) {
    unsigned char* q = reinterpret_cast<unsigned char*>(x);
    for (std::size_t i = 0; i < 8; ++i) {
        q[i] = p[i];
    }
}

void copy_int_1(int* x, const unsigned char* p) {
    *x = *reinterpret_cast<const int*>(p);
}

void copy_double_1(double* x, const unsigned char* p) {
    *x = *reinterpret_cast<const double*>(p);
}

int main() {
    const std::size_t n = 10000000;
    const std::size_t nb_times = 200;

    unsigned char* p = new unsigned char[12 * n];
    for (std::size_t i = 0; i < 12 * n; ++i) {
        p[i] = 0;
    }
    int* q0 = new int[n];
    for (std::size_t i = 0; i < n; ++i) {
        q0[i] = 0;
    }
    double* q1 = new double[n];
    for (std::size_t i = 0; i < n; ++i) {
        q1[i] = 0.0;
    }

    const auto begin_0 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 0; k < nb_times; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            copy_int_0(q0 + i, p + 12 * i);
            copy_double_0(q1 + i, p + 4 + 12 * i);
        }
    }
    const auto end_0 = std::chrono::high_resolution_clock::now();
    const double time_0 =
        1.0e-9 *
        std::chrono::duration_cast<std::chrono::nanoseconds>(end_0 - begin_0)
            .count();
    std::cout << "Time 0: " << time_0 << " s" << std::endl;

    const auto begin_1 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 0; k < nb_times; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            copy_int_1(q0 + i, p + 12 * i);
            copy_double_1(q1 + i, p + 4 + 12 * i);
        }
    }
    const auto end_1 = std::chrono::high_resolution_clock::now();
    const double time_1 =
        1.0e-9 *
        std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - begin_1)
            .count();
    std::cout << "Time 1: " << time_1 << " s" << std::endl;

    std::cout << "Prevent optimization: " << q0[0] << " " << q1[0] << std::endl;

    delete[] q1;
    delete[] q0;
    delete[] p;

    return 0;
}
The results I get are
clang++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.49403 s
Time 1: 4.01617 s
g++ -std=c++11 -O3 -march=native copy.cpp -o copy
./copy
Time 0: 8.65762 s
Time 1: 3.89979 s
icpc -std=c++11 -O3 -xHost copy.cpp -o copy
./copy
Time 0: 8.46155 s
Time 1: 0.0278496 s
I did not check the assembly yet but I guess that the Intel compiler is fooling my benchmark here.
Is the second program guaranteed to work on both x86 and x86-64?
No.
When you dereference a double* the compiler is free to assume that the memory location actually contains a double, which means that it must be aligned to alignof(double).
A lot of x86 instructions are safe to use for unaligned data, but not all of them. Specifically, there are SIMD instructions which require proper alignment which your compiler is free to use.
This isn't just theoretical; LZ4 used to use something very similar to what you posted (it's C, not C++, so it was a C-style cast not reinterpret_cast, but that doesn't really matter), and everything worked as expected. Then GCC 5 was released, and it auto-vectorized the code in question at -O3 using vmovdqa, which requires proper alignment. The end result is that code which worked fine in GCC ≤ 4.9 started crashing at runtime when compiled with GCC ≥ 5.
In other words, even if your program happens to work today, if you depend on unaligned access (or other undefined behavior), it can easily stop working tomorrow. Don't do it.
How would you write such a function on ARM if you need it as fast as you can?
The answer isn't really ARM-specific. After the LZ4 incident, Yann Collet (the author of LZ4) did a lot of research to answer this question. There isn't one option which will generate optimal code with every compiler on every architecture.
Using memcpy() is the safest option. If the size is known at compile time the compiler will generally optimize the memcpy() call away… for larger buffers, you can take advantage of that by calling memcpy() in a loop; you'll generally get a loop of fast instructions without the additional overhead of calling memcpy().
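For instance, here is a hedged sketch of the memcpy variant of the function from the question:

```cpp
#include <cstring>

// Well-defined for any alignment of p; with a constant size, GCC and
// Clang typically compile the memcpy down to a single 8-byte load/store.
void copy_double_memcpy(double* x, const unsigned char* p) {
    std::memcpy(x, p, sizeof(double));
}
```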
If you're feeling more adventurous you can use a packed union to "cast" instead of reinterpret_cast. This is compiler-specific, but when supported it should be safe, and it may be faster than memcpy().
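A sketch of that union approach (hedged: type punning through a union is well-defined in C but technically undefined in ISO C++; GCC and Clang document it as supported, so treat this as a compiler-specific alternative to memcpy()):

```cpp
// Read the bytes through one union member and the value through the
// other; since the compiler sees the whole object, it can optimise the
// byte copy into a single load on targets that allow it.
double load_double_union(const unsigned char* p) {
    union {
        unsigned char bytes[sizeof(double)];
        double value;
    } u;
    for (unsigned i = 0; i < sizeof(double); ++i)
        u.bytes[i] = p[i];
    return u.value;
}
```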
FWIW, I have some code which attempts to find the optimal way to do this depending on various factors (compiler, compiler version, architecture, etc.). It is a bit conservative about platforms I haven't tested, but it should achieve good results on the vast majority of platforms people actually use.
I am new to C++. I need help fixing this error:
Item.cpp: In member function ‘char* ict::Item::sku() const’:
Item.cpp:65:36: warning: comparison between signed and unsigned integer
expressions [-Wsign-compare]
This is the part of the code that is giving the error:
//in header file
char m_sku[MAX_SKU_LEN + 1];

//in cpp file
char* Item::sku() const
{
    int length = strlen(m_sku);
    char *arr = new char[length]();
    for (int i = 0; i <= strlen(m_sku); i++) {
        arr[i] = m_sku[i];
    }
    return arr;
}
The most straightforward way to fix this is to make i an unsigned variable instead of a signed one. You can use size_t to match the return type of strlen:
size_t length = strlen(m_sku);
char *arr = new char[length + 1];  // length + 1: the loop below also copies the '\0'
for (size_t i = 0; i <= length; i++) {
    arr[i] = m_sku[i];
}
But be careful since this same replacement doesn't work with loops that count down towards 0.
// oops! This is an infinite loop:
for (size_t i = length - 1; i >= 0; i--) {
    arr[i] = m_sku[i];
}
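If you do need to count down with an unsigned index, one common pattern that terminates correctly is to test before decrementing, sketched here as a standalone helper (the name countdown is illustrative):

```cpp
#include <cstddef>
#include <vector>

// Visits length-1, length-2, ..., 0 and stops: the test i-- > 0 runs
// before the body, so the body never sees a wrapped-around value.
std::vector<std::size_t> countdown(std::size_t length) {
    std::vector<std::size_t> visited;
    for (std::size_t i = length; i-- > 0; )
        visited.push_back(i);
    return visited;
}
```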
Alternatively, write a cast, (int)strlen(m_sku), or go the other way and declare the counter as std::size_t i = 0, so that the compared items have the same type.
I want to compile some of my C++ functions with the avr-g++ compiler & linker. My experience from former projects tells me that it definitely works with new and delete. But somehow this function compiles without errors:
void usart_controller::send_data(uint32_t * data32, size_t data32_size)
{
    size_t data_size = 4 * data32_size;
    //uint8_t * data = new uint8_t[data_size];
    uint8_t data[data_size];
    uint8_t *data_ptr = &data[0];
    for(unsigned int i = 0; i < data32_size; i++)
    {
        for(int j = 0; j < 4; j++)
        {
            data[i*j+j] = (data32[i] << (j*8));
        }
    }
    /*usart_serial_write_packet(this->usart, *data_ptr, (size_t)(data_size * sizeof(uint8_t)));*/
    size_t len = sizeof(uint8_t)*data_size;
    while (len) {
        usart_serial_putchar(this->usart, *data_ptr);
        len--;
        data_ptr++;
    }
    //delete[] data; //Highly discouraged because of the memory leak! //Works like a charm because of C, but I don't care at the moment
}
but the same function with new does not work:
void usart_controller::send_data(uint32_t * data32, size_t data32_size)
{
    size_t data_size = 4 * data32_size;
    uint8_t * data = new uint8_t[data_size];
    //uint8_t data[data_size];
    //uint8_t *data_ptr = &data[0];
    for(unsigned int i = 0; i < data32_size; i++)
    {
        for(int j = 0; j < 4; j++)
        {
            data[i*j+j] = (data32[i] << (j*8));
        }
    }
    /*usart_serial_write_packet(this->usart, *data_ptr, (size_t)(data_size * sizeof(uint8_t)));*/
    size_t len = sizeof(uint8_t)*data_size;
    while (len) {
        usart_serial_putchar(this->usart, *data);
        len--;
        data++;
    }
    delete[] data;
}
Here I get the following errors:
error: undefined reference to `operator new[](unsigned int)'
error: undefined reference to `operator delete[](void*)'
The compiling and linking command is (shorted):
"C:\Program Files (x86)\Atmel\Atmel Toolchain\AVR8 GCC\Native\3.4.1061\avr8-gnu-toolchain\bin\avr-g++.exe" -o PreAmp.elf <...> usart_controller.o <...> -Wl,-Map="PreAmp.map" -Wl,--start-group -Wl,-lm -Wl,--end-group -Wl,--gc-sections -mmcu=atxmega16a4u
so I am assuming that I am using the g++ compiler and not the gcc compiler. But in C++ it should be impossible to declare a variable-length array as done above. Where is my mistake here?
You did not give any information on the controller used, or the IDE (if any).
But if you are using Atmel Studio / the AVR toolchain from Atmel, they make it pretty clear that the new and delete functionality is not supported and has to be implemented by the user.
This makes sense, since this is not a desktop application but an implementation on a microcontroller.
http://www.atmel.com/webdoc/avrlibcreferencemanual/faq_1faq_cplusplus.html
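A minimal sketch of what that user-supplied implementation might look like, simply forwarding to malloc/free (hedged: real firmware may also want out-of-memory handling and the nothrow/placement variants):

```cpp
#include <stdlib.h>  // malloc, free (provided by avr-libc on AVR)

// Minimal global operator new/delete for targets whose runtime lacks
// them. Note: these return NULL on failure instead of throwing, which
// is the usual choice on AVR, where exceptions are disabled anyway.
void* operator new(size_t size) { return malloc(size); }
void* operator new[](size_t size) { return malloc(size); }
void operator delete(void* ptr) noexcept { free(ptr); }
void operator delete[](void* ptr) noexcept { free(ptr); }
```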