OpenCL SHA1 Throughput Optimisation - C++

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorithm.
Ultimately my question is about maximizing my throughput rates. At present I'm seeing something like 56.1 MH/sec, which compares very badly to open source programs I've looked at, such as John the Ripper and OCLHashcat, which are giving 1,000 and 1,500 MH/sec respectively (heck, I'd be well-chuffed with a 3rd of that!).
So, what I'm doing
I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application to load data to the GPU (using CL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash in a threaded fashion on the CPU and loading this data onto the global GPU memory using the CL C++ call to enqueueWriteBuffer (using uchars to represent the bytes to hash):
errorCode = dispatchQueue->enqueueWriteBuffer(
    inputBuffer,
    CL_FALSE, // was CL_TRUE
    0,
    sizeof(cl_uchar) * inputBufferSize,
    passwordBuffer,
    NULL,
    &dispatchDelegate);
I'm enqueueing work using enqueueNDRangeKernel in the following manner (where the global work size is a user-defined variable; at present I've set it to my GPU's maximum flattened global work size of 16.777 million per run):
errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalWorkgroupSize, 1),
    NullRange,
    NULL,
    NULL);
This means that (per dispatch) I load 16.777 million items into a 1D array and index into it from my kernel using get_global_id(0).
My Kernel signature:
__kernel void sha1Crack(__global uchar* out, __global uchar* in,
                        __constant int* passLen, __constant int* targetHash,
                        __global bool* collisionFound)
{
    // Kernel instance global GPU memory I/O mapping:
    __private int id = get_global_id(0);
    __private int passwordLen = *passLen;
    __private int inputIndexStart = id * passwordLen;
    __private uchar inputMem[64]; // private scratch buffer for the candidate (size assumed here)

    // Select password input from the key space:
    #pragma unroll
    for (int i = 0; i < passwordLen; i++)
    {
        inputMem[i] = in[inputIndexStart + i];
    }

    // SHA-1 code omitted for brevity...
}
So, given all this: am I doing something fundamentally wrong in the way I'm loading data, i.e. one call to enqueueNDRangeKernel for 16.7 million kernel executions over a 1D input vector? Should I be using a 2D space and subdividing it into local work-group ranges? I tried playing with this but it didn't seem any quicker.
Or, perhaps just as likely, is my algorithm itself the source of the slowness? I've spent a good while optimising it and manually unrolling all of the loop stages using pre-processor directives.
I've read about memory coalescing on the hardware. Could that be my issue? :S
Any advice at all appreciated! If I've missed anything important please let me know and I'll update.
Thanks in advance! ;)
Update: 16,777,216 is the maximum flattened global work size reported by the device; 256**3. The global array of boolean values is actually just a single boolean. It's set to false at the start of the kernel enqueue, and a branching statement sets it to true only if a collision is found - will that force a divergence? passwordLen is the length of the current input value and targetHash is an int[4]-encoded hash to check against.

Your 'maximum flattened global worksize' should be multiplied by passwordLen. It is the number of kernel instances you can run, not the maximum length of an input array. You can most likely send much more data than this to the GPU.
Other potential issues: the 'generating blocks of candidate data to hash in a threaded fashion on the CPU' - try doing this in advance of the kernel iterations to see whether the delay is in the generation of the data blocks or in the processing of the kernels. Your SHA-1 algorithm is the other obvious potential issue. I'm not sure how much you've really optimised it by 'unrolling' the loops; usually the bigger optimisation issue is 'if' statements (if a single kernel instance within a workgroup tests true, then all of the lock-stepped workgroup instances must follow that branch in parallel).
And DarkZeros is correct: you should manually play with the local work-group size, making it the highest common factor of the global size and the number of kernel instances that can be run at once on the card. The easiest way to do this is to round the global work size up to the next multiple of the card's capacity and guard the kernel body with an if statement so it only does work when get_global_id(0) is less than the actual number of kernel instances you want to run, as sketched below.
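For illustration, a minimal host-side sketch of that round-up-and-guard pattern, using the same cl.hpp wrapper calls as in the question; localSize, totalCandidates and the extra kernel argument are assumptions, not taken from the original code:

    // Hypothetical sketch: round the global size up to a multiple of the chosen
    // local size, then let the padding work-items exit early inside the kernel.
    size_t localSize = 256;                      // tune per device / kernel
    size_t totalCandidates = 16777216;           // work-items you actually need
    size_t globalSize =
        ((totalCandidates + localSize - 1) / localSize) * localSize;

    errorCode = dispatchQueue->enqueueNDRangeKernel(
        *kernel,
        cl::NullRange,
        cl::NDRange(globalSize),
        cl::NDRange(localSize));

    // Kernel side (totalCandidates passed as an extra argument):
    //   if (get_global_id(0) >= totalCandidates) return;
    //   ... SHA-1 work ...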
Dave.

Related

Is HAL_UARTEx_RxEventCallback Size parameter calculated programmatically or by hardware

I'm implementing UART DMA with the STM HAL library and I want to know whether the message size is counted by hardware (for example by counting clock ticks until the line is idle) or by some software method (something like strlen). So if Size in
HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size)
is counted by hardware, I can send data in pure HEX format, but if it is calculated by something like strlen, I may run into problems if the data contains 0x00 and would have to send the data in ASCII.
I've tried to do some research in the generated code in Keil but failed (maybe I didn't try hard enough), so maybe somebody can help me.
If you are using UART DMA, it is calculated by hardware.
If you check the call hierarchy of HAL_UARTEx_RxEventCallback using your IDE, you can see how the Size variable is calculated.
The function is executed in the following flow (depending on the version of the HAL driver, it may be slightly different):
UART idle interrupt occurs
Call HAL_UART_IRQHandler()
If DMA mode is enabled, call HAL_UARTEx_RxEventCallback(huart, (huart->RxXferSize - huart->RxXferCount))
Therefore, the Size variable is calculated as (huart->RxXferSize - huart->RxXferCount).
huart->RxXferSize is the value set when initializing the RX DMA.
huart->RxXferCount is (huart->hdmarx)->Instance->NDTR.
NDTR is a value maintained by the hardware: the size of the buffer remaining after the DMA transfers data to memory.
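As an illustration only (not from the question), a sketch of how this callback is commonly paired with HAL_UARTEx_ReceiveToIdle_DMA in recent HAL versions; RX_BUF_SIZE, rx_dma_buf and process_frame are made-up names:

    #include "main.h"   /* project header that pulls in the device's HAL */

    #define RX_BUF_SIZE 256
    static uint8_t rx_dma_buf[RX_BUF_SIZE];

    /* Hypothetical consumer of a received frame. */
    static void process_frame(const uint8_t *buf, uint16_t len);

    void HAL_UARTEx_RxEventCallback(UART_HandleTypeDef *huart, uint16_t Size)
    {
        /* Size = huart->RxXferSize - huart->RxXferCount: the number of bytes
           the DMA hardware has written so far, so 0x00 bytes in the payload
           are handled fine. */
        process_frame(rx_dma_buf, Size);

        /* Re-arm reception for the next idle-terminated frame. */
        HAL_UARTEx_ReceiveToIdle_DMA(huart, rx_dma_buf, RX_BUF_SIZE);
    }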

Monitor buffers in GNU Radio

I have a question regarding buffering between blocks in GNU Radio. I know that each block in GNU Radio (including custom blocks) has buffers to store items that are going to be sent or that have been received. In my project, there is a certain sequence I have to maintain to synchronize events between blocks. I am using GNU Radio on the Xilinx ZC706 FPGA platform with the FMCOMMS5.
In the GNU Radio Companion I created a custom block that controls a GPIO output port on the board. In addition, I have an independent source block that is feeding information into the FMCOMMS GNU Radio block. The sequence I am trying to maintain is that, in GNU Radio, I first send data to the FMCOMMS block, second I want to make sure that the data got consumed by the FMCOMMS block (essentially by checking its buffer), and then finally I want to control the GPIO output.
From my observations, the source block buffer doesn’t seem to send the items until it’s full. This will cause a major issue in my project because this means that the GPIO data will be sent before or in parallel with sending the items to the other GNU blocks. That’s because I’m setting the GPIO value through direct access to its address in the ‘work’ function of my custom block.
I tried to use pc_output_buffers_full() in the ‘work’ function of my custom source in order to monitor the buffer, but I’m always getting 0.00. I’m not sure if it’s supposed to be used in custom blocks or if the ‘buffer’ in this case is something different from where the output items are stored. Here's a small code snippet which shows the problem:
char level_count = 0, level_val = 1;
vector<float> buff(1, 0.0000);
for (int i = 0; i < noutput_items; i++)
{
    if (level_count < 20 && i < noutput_items)
    {
        out[i] = gr_complex((float)level_val, 0);
        level_count++;
    }
    else if (i < noutput_items)
    {
        level_count = 0;
        level_val ^= 1;
        out[i] = gr_complex((float)level_val, 0);
    }

    buff = pc_output_buffers_full();
    for (int n = 0; n < buff.size(); n++)
        cout << fixed << setw(5) << setprecision(2) << setfill('0') << buff[n] << " ";
    cout << "\n";
}
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent? Or is there a way to make sure that each single output item is sent like a continuous stream to the next block(s)?
GNU Radio Companion version: 3.7.8
OS: Linaro 14.04 image running on the FPGA
Or is there a way to make sure that each single output item is sent like a continuous stream to the next block(s)?
Nope, that's not how GNU Radio works (at all!):
A while back I wrote an article that explains how GNU Radio deals with buffers, and what these actually are. While the in-memory architecture of GNU Radio buffers might be of lesser interest to you, let me quickly summarize the dynamics of it:
The buffers that (general_)work functions are called with behave, for all practical purposes, like linearly addressable ring buffers. You get a random number of samples at once (restrictable to minimum numbers or multiples of a number), and everything you do not consume will be handed to you again the next time work is called.
These buffers hence keep track of how much you've consumed, and thus, how much free space is in a buffer.
The input buffer a block sees is actually the output buffer of the "upstream" block in the flow graph.
GNU Radio's computation is backpressure-controlled: Any block's work method will immediately be called in an endless loop given that:
There's enough input for the block to do work,
There's enough output buffer space to write to.
Therefore, as soon as one block finishes its work call, the upstream block is informed that there's new free output space, thus typically leading to it running again.
That leads to a high degree of parallelism, since even adjacent blocks can run simultaneously without conflicting.
This architecture favors large chunks of input items, especially for blocks that take a relatively long time to compute: while the block is still working, its input buffer is already being filled with chunks of samples; when it's finished, chances are it's immediately called again with all the available input buffer already filled with new samples.
This architecture is asynchronous: even if two blocks are "parallel" in your flow graph, there's no defined temporal relation between the numbers of items they produce.
I'm not even convinced that switching GPIOs at times based on the speed of computation in this completely non-deterministically timed data flow graph model is a good idea to start with. Maybe you'd rather want to calculate "timestamps" at which GPIOs should be switched, and send (timestamp, gpio state) command tuples to some entity in your FPGA that keeps absolute time? On the scale of radio propagation and high-rate signal processing, CPU timing is really inaccurate, and you should use the fact that you have an FPGA to actually implement deterministic timing, and use the software running on the CPU (i.e. GNU Radio) to determine when that should happen.
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent?
Other than that, a method to asynchronously tell another block that, yes, you've processed N samples, would be either to have a single block that just observes the outputs of both blocks you want to synchronize and consumes an identical number of samples from both inputs, or to implement something using message passing. Again, my suspicion is that this is not a solution to your actual problem.
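To make the first option concrete, here is a minimal sketch against the GNU Radio 3.7 C++ API (matching your 3.7.8 install) of such an observer block; the class name, member names and the GPIO hook are hypothetical:

    #include <gnuradio/sync_block.h>
    #include <gnuradio/io_signature.h>
    #include <gnuradio/gr_complex.h>
    #include <stdint.h>

    // Two-input "observer" sink: because it is a sync_block, the scheduler
    // hands it the same number of items on both inputs per work() call, so
    // consuming them keeps the two streams aligned.
    class stream_observer : public gr::sync_block
    {
    public:
        stream_observer()
            : gr::sync_block("stream_observer",
                             gr::io_signature::make(2, 2, sizeof(gr_complex)),
                             gr::io_signature::make(0, 0, 0)),
              d_total(0)
        {
        }

        int work(int noutput_items,
                 gr_vector_const_void_star &input_items,
                 gr_vector_void_star &output_items)
        {
            (void)input_items;   // samples are only counted here, not inspected
            (void)output_items;  // no outputs: this is a sink

            d_total += noutput_items;
            // d_total is the number of items that have really flowed through
            // both streams; e.g. once it passes a threshold you could trigger
            // the GPIO (hypothetical application logic).

            return noutput_items;
        }

    private:
        uint64_t d_total;
    };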

fine tuning vgg raise memory error

Hi, I'm trying to fine-tune VGG for my problem, but when I try to train the net I get this error.
OOM when allocating tensor with shape[25088,4096]
The net has this structure:
I took this TensorFlow pretrained VGG implementation code from this site.
I only added this procedure to train the net:
with tf.name_scope('joint_loss'):
    joint_loss = ya_loss + yb_loss + yc_loss + yd_loss + ye_loss + yf_loss \
                 + yg_loss + yh_loss + yi_loss + yl_loss + ym_loss + yn_loss
    # Loss with weight decay
    l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    self.joint_loss = joint_loss + self.weights_decay * l2_loss
    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(joint_loss)
I tried reducing the batch size to 2 but it doesn't work; I get the same error. The error is due to a big tensor that cannot be allocated in memory. I only get this error during training, because if I just feed a value without minimizing, the net works. How can I avoid this error? How can I save memory on the graphics card (Nvidia GeForce GTX 970)?
UPDATE: if I use GradientDescentOptimizer the training process starts, whereas if I use AdamOptimizer I get the memory error; it seems that GradientDescentOptimizer uses less memory.
Without a backward pass ("feed a value without minimizing"), TensorFlow can immediately de-allocate intermediate activations. With a backward pass, the graph has a giant U-shape, where activations from the forward pass need to be kept in memory for the backward pass. There are some tricks (such as swapping to host memory), but in general backprop means that memory usage will be higher.
Adam does keep some extra bookkeeping variables around, so it will increase memory usage proportional to the amount of memory your weight variables are already using. If your training steps take quite a while (in which case having the variable updates on the GPU isn't important), you could instead locate the optimization ops in host memory.
If you need a larger batch size and can't reduce image resolution or model size, combining gradients from multiple workers/GPUs using something like SyncReplicasOptimizer can be a good option. Looking at the paper associated with this model, it looks like they were training on 4 GPUs each with 12GB of memory.

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
I am using a process that reads data from the first file, processes the data and writes the results to the second file.
I have noticed occasional severe delays (the reason is beyond my knowledge) when reading from or writing to the mapped view with the memcpy function.
Mapped files are created this way:
m_hFile = ::CreateFileW(SensorFileName,
                        GENERIC_READ | GENERIC_WRITE,
                        0,
                        NULL,
                        CREATE_ALWAYS,
                        FILE_ATTRIBUTE_NORMAL,
                        NULL);

m_hMappedFile = CreateFileMapping(m_hFile,
                                  NULL,
                                  PAGE_READWRITE,
                                  dwFileMapSizeHigh,
                                  dwFileMapSizeLow,
                                  NULL);
And memory mapping is done this way:
m_lpMapView = MapViewOfFile(m_hMappedFile,
                            FILE_MAP_ALL_ACCESS,
                            dwOffsetHigh,
                            dwOffsetLow,
                            m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values match the allocation granularity from the system info.
The process is reading about 300KB * N times, storing that in a buffer, processing and then writing 300KB * N times the processed contents of the previous buffer to the second file.
I have two different memory views (created/moved with the MapViewOfFile function) with a default size of 10 MB.
For the memory view size, I tested 10 kB, 100 kB, 1 MB, 10 MB and 100 MB. Statistically there is no difference: 80% of the time the reading process is as described below (~200 ms) but the writing process is really slow.
Normally:
1/ Reading is done in ~200ms.
2/ Process done in 2.9 seconds.
3/ Writing is done in ~200ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example: for writing, I am using the code below
for (unsigned int i = 0; i < N; i++) // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);

    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I can see that the second memcpy call is always the one taking most of the time.
So far I am seeing N calls to memcpy (for the test I used N = 500) taking 10 seconds in the worst case when using those shared memories.
I made a temporary test program that did the same number of memcpy calls with the same amount of data and couldn't reproduce the problem.
For the tests, I used the following conditions; they all show the same delay:
1/ I can see this on various computers, 32- or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization purposes) for reading/writing.
3/ OS on SATA or SSD, memory-mapped files of the software physically on a SATA or SSD hard disk, and if on an external hard disk, tests were done through USB 1, USB 2 or USB 3.
I am kindly asking what you think my mistake might be that makes memcpy go so slow.
Best regards.
I found a solution that works for me, but it might not be the case for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces an update of the mapped file.
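For reference, roughly what that looks like after the write loop (error handling abbreviated; m_lpMapView and m_hFile are the handles from the code above):

    // Flush the dirty pages of the mapped view, then ask the OS to commit the
    // file's buffered data to disk. Doing this once per processing cycle (not
    // per memcpy) keeps the extra hard-disk traffic reasonable.
    if (!FlushViewOfFile(m_lpMapView, 0))   // 0 = flush the whole mapped view
    {
        // handle ::GetLastError()
    }
    if (!FlushFileBuffers(m_hFile))
    {
        // handle ::GetLastError()
    }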
I no longer have the "random" delay, but instead of the expected 200 ms I get an average of 400 ms, which is good enough for my application.
After doing some tests I saw that calling these too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flush should be used carefully.
Thanks.

Concatenate data in an array in C ++

I'm working on software for processing audio in real time in C++ with Qt. I need to keep the processing requirements to a minimum.
I define a temporary buffer of 40 ms; with the device running at a sampling frequency of Fs = 8000 Hz, every 320 samples received a function called DataProcessing() is invoked.
The idea is to have a global buffer that stores the last 10 s recorded, i.e. 80,000 samples.
On each iteration this buffer drops the initial 320 samples and appends the 320 new samples at the end. The buffer is thus kept up to date and the user can observe a real-time graphical representation of the recorded signal.
At first I thought of using QVector (the Qt equivalent of std::vector) for this, which reduces the process to a few lines of code:
int NUM_POINTS = 320;
DatosTemporales.erase(DatosTemporales.begin(), DatosTemporales.begin() + NUM_POINTS);
DatosTemporales += DatosNuevos; // DatosNuevos has a size of NUM_POINTS
On each iteration this effectively rebuilds an 80,000-sample vector as well as freeing some positions, so it requires some processing time. The alternative I opted for was a plain double* array and a loop over the positions:
for (int i = 0; i < 80000; i++) {
    if (i < 80000 - NUM_POINTS) {
        aux = DatosTemporales[i];
        DatosTemporales[i + NUM_POINTS] = aux;
    } else {
        DatosTemporales[i] = DatosNuevos[i - NUM_POINTS];
    }
}
This fails. I think the best way is to use dynamic memory, implementing this process with pointers. Could anyone give me some idea of how to implement it?
It sounds like what you are looking for is a circular buffer.
https://www.google.com/search?q=qcircularbuffer
https://qt.gitorious.org/qt/qtbase/merge_requests/60
And it looks like you only need the header file and you should be good to go.
A similar tool that already ships with Qt can be found here:
http://doc.qt.io/qt-5/qcontiguouscache.html#details
The advantage of using a system like the ones presented is that they don't need to reallocate dynamic memory; they just move the head and tail pointers, as the sketch below illustrates.
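To illustrate the head/tail idea in plain C++ (a hand-rolled sketch rather than the Qt classes linked above; the class and member names are made up):

    #include <vector>
    #include <cstddef>

    // Minimal circular buffer sketch: one fixed allocation of `capacity`
    // samples; new blocks of 320 samples simply overwrite the oldest ones by
    // advancing `head`, so no per-iteration reallocation or erase is needed.
    class CircularBuffer
    {
    public:
        explicit CircularBuffer(std::size_t capacity)
            : m_data(capacity, 0.0), m_head(0) {}

        // Append a block (e.g. the 320 new samples); oldest samples are overwritten.
        void append(const double *samples, std::size_t count)
        {
            for (std::size_t i = 0; i < count; ++i) {
                m_data[m_head] = samples[i];
                m_head = (m_head + 1) % m_data.size();
            }
        }

        // i = 0 is the oldest stored sample, i = capacity-1 the newest.
        double at(std::size_t i) const
        {
            return m_data[(m_head + i) % m_data.size()];
        }

        std::size_t size() const { return m_data.size(); }

    private:
        std::vector<double> m_data;
        std::size_t m_head;
    };

    // Usage: CircularBuffer datos(80000); datos.append(DatosNuevos, 320);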
Hope that helps.