2D vectors growing too big - C++

I have a 2D vector that keeps growing in size (it holds every permutation pattern), but once I form the permutations of 11 variables my program crashes because the vector grows too big and my RAM can't sustain it. How should I solve this? I tried writing the formations out as text, but that takes too long and the text file grows to several GB and keeps growing.
My laptop: i7-4700MQ, 8 GB RAM, Windows 8.1 Pro x64.
Below is the code I use to build the 2D vector.
while (next_permutation(route.begin() + 1, route.end())) {
    // For each route permutation: copy the current pattern into
    // routePattern, then store it as a new row of routeFormation.
    for (counter = 0; counter < route.size(); counter++) {
        routePattern.push_back(route[counter]);
    }
    routeFormation.push_back(routePattern);
    routePattern.clear();
}

Actually, it's better to use std::deque for large amounts of data, because a deque allocates its data in chunks rather than in one large contiguous block (a vector guarantees that all of its data can be accessed like a C array).
Compression can also be used to reduce the required memory; compressed data can be stored either in memory or on disk. There are a lot of compression libraries for C/C++, for example
http://nih.at/libzip/
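For illustration, a minimal sketch of the deque approach (a toy 4-element route; this only changes how memory is allocated, the factorial growth remains, so 11 variables will still be enormous):

#include <algorithm>
#include <cstdio>
#include <deque>
#include <vector>

int main() {
    std::vector<int> route = {0, 1, 2, 3}; // toy route; 11 elements would give millions of rows
    std::deque<std::vector<int>> routeFormation; // chunked allocation instead of one contiguous block

    while (std::next_permutation(route.begin() + 1, route.end())) {
        routeFormation.push_back(route); // copy the whole pattern as one row
    }
    std::printf("%zu patterns stored\n", routeFormation.size());
    return 0;
}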

Related

Reading directory contents in FAT32

I am trying to write a bare-metal FAT32 file system driver on an RPi 3B.
I am able to read FAT sectors and root directory sectors using the eMMC driver.
My doubt is how to follow the FAT entry chain when the next entry (the next cluster number) doesn't fall within the current FAT sector.
Should I read a new FAT sector each time I move to a new cluster number?
My current understanding is as follows:
Get the first cluster number (cluster_number) of the directory/file.
Read the FAT sector which contains the cluster_number entry.
Say I read the FAT sector as:
uint8_t fat_sector[512] = { 0 };
uint32_t this_fat_sector_num, this_fat_entry_offset;

this_fat_sector_num = unusedSectors + reservedSectorCount + ((cluster_number * 4) / bytesPerSector);
this_fat_entry_offset = (cluster_number * 4) % bytesPerSector; // byte offset within the sector (modulo, not division)
read_fat_sector(this_fat_sector_num, &fat_sector[0]);

// Calculate next cluster in chain
uint32_t next_cluster_number = (*(uint32_t *)&fat_sector[this_fat_entry_offset]) & 0x0fffffff;

// Calculate next cluster in chain one more time - is the line below correct?
uint32_t next_next_cluster_number = (*(uint32_t *)&fat_sector[next_cluster_number]) & 0x0fffffff;
What happens when the next cluster number is not within the already-read fat_sector buffer (512 bytes)?
If the cluster number is the index of the next entry in fat_sector, do I need to multiply it by 4, given that FAT32 entries span 4 bytes?
If anyone could give some clarity, that would be helpful. Thanks in advance.
Implement a cache (in RAM) of the FAT. Let's say that the cache has enough RAM for 20 sectors and starts out empty.
Next write a "getFATentry" function that checks if the sector is in the cache and finds the right entry in the cache if it is; or (if necessary) evicts something from the cache to make room, fetches the right sector from disk into the cache, then finds the right entry in the cache.
Once that's done you can do next_cluster = getFATentry(previous_cluster); without worrying about the cache or any disk IO (but you will want to do something when the FAT is modified - e.g. adopt a "write-through" or "write-back" policy).
Note: By adjusting the size of the "FAT cache" you can improve performance or reduce RAM consumption. It'd be nice to allow the cache to grow/shrink dynamically (e.g. grow to be as large as the whole FAT if nothing else needs RAM, but shrink to bare minimum if all the RAM is needed for something else).
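A rough sketch of such a cache, reusing the question's read_fat_sector, unusedSectors, reservedSectorCount and bytesPerSector, with a simple round-robin eviction (a real implementation might prefer LRU):

#include <stdint.h>

#define CACHE_SECTORS 20
#define SECTOR_BYTES  512

extern uint32_t unusedSectors, reservedSectorCount, bytesPerSector; // from the question
extern void read_fat_sector(uint32_t sector_num, uint8_t *buf);     // from the question

static uint8_t  cache_data[CACHE_SECTORS][SECTOR_BYTES];
static uint32_t cache_sector[CACHE_SECTORS]; // which FAT sector each slot holds
static int      cache_valid[CACHE_SECTORS];
static int      next_evict = 0;              // naive round-robin eviction

uint32_t getFATentry(uint32_t cluster_number)
{
    uint32_t sector = unusedSectors + reservedSectorCount
                    + (cluster_number * 4) / bytesPerSector;
    uint32_t offset = (cluster_number * 4) % bytesPerSector;

    for (int i = 0; i < CACHE_SECTORS; i++)  // cache hit?
        if (cache_valid[i] && cache_sector[i] == sector)
            return (*(uint32_t *)&cache_data[i][offset]) & 0x0fffffff;

    int slot = next_evict;                   // cache miss: reuse a slot
    next_evict = (next_evict + 1) % CACHE_SECTORS;
    read_fat_sector(sector, &cache_data[slot][0]);
    cache_sector[slot] = sector;
    cache_valid[slot] = 1;
    return (*(uint32_t *)&cache_data[slot][offset]) & 0x0fffffff;
}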
I have found the solution.
First, read the initial FAT sector for the given cluster number.
Find thisFatEntryOffset and read the next FAT entry.
The new FAT entry will be the new cluster number. Find thisFatSectorNum and thisFatEntryOffset for the new cluster number.
If the new FAT sector != the old FAT sector, read the new FAT sector and read the entry using thisFatEntryOffset.
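Put together as code, that approach might look like this (a sketch using the names from the question, not the poster's exact implementation):

#include <stdint.h>

extern uint32_t unusedSectors, reservedSectorCount, bytesPerSector; // from the question
extern void read_fat_sector(uint32_t sector_num, uint8_t *buf);     // from the question

// Follow one link in the FAT chain, re-reading the FAT sector only when
// the entry lies outside the sector currently held in fat_sector[].
uint32_t get_next_cluster(uint32_t cluster_number,
                          uint32_t *loaded_sector_num, // sector currently buffered
                          uint8_t fat_sector[512])
{
    uint32_t this_fat_sector_num = unusedSectors + reservedSectorCount
                                 + (cluster_number * 4) / bytesPerSector;
    uint32_t this_fat_entry_offset = (cluster_number * 4) % bytesPerSector;

    if (this_fat_sector_num != *loaded_sector_num) { // new sector needed?
        read_fat_sector(this_fat_sector_num, &fat_sector[0]);
        *loaded_sector_num = this_fat_sector_num;
    }
    return (*(uint32_t *)&fat_sector[this_fat_entry_offset]) & 0x0fffffff;
}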

Slowdown when reading big file randomly with C++

I've run into some trouble when reading chunks of data at random locations all over a big file (>4GB).
The task is to save a 3D datacube to a file and transpose the axes while not loading the whole dataset into RAM.
The storage format is as follows:
I've got 3 integers at the beginning of the file, storing the dimensions (nX, nY, nZ).
After that the data follows as lines of length nX.
These lines are repeated nY times, which results in a page, and the pages are repeated nZ times.
Meaning:
A line has nX bytes
A page has nX * nY bytes
The file has nX * nY * nZ + 12 bytes
To transpose the dataset I execute the following loop:
for( int i = 0; i < nY; i++ )
{
    for( int j = 0; j < nZ; j++ )
    {
        read( pBuf, i*nX + j*nY*nX ); // read nX bytes from offset i*nX + j*nX*nY
        writeNext( pBuf );
    }
}
When using fopen, _fseeki64 and fread, after approx. 30% of the overall reads every 6th read or so takes up to 7 s; since there are multiple millions of those reads, I can't accept these delays.
Thus I implemented the same algorithm with memory-mapped files (CreateFile, CreateFileMapping and MapViewOfFile), but now every 6th read takes about 2 s.
Is there a method/chance of increasing the readout speed?
EDIT1:
I've added some code at http://pastebin.com/MejiTKj0
EDIT2:
Some may notice an inconsistency regarding the offset in the read function. To simplify matters I didn't describe all the variables saved in the file header, so the offset of 15 bytes is okay.
If the files are stored on an HDD, you should know that seek times dominate heavily when trying to perform random access. You may find you're better off reading the entire file sequentially into memory (a relatively quick operation compared to seeking) and then performing your processing on the in-memory data instead. You may find this is quicker even if you need only a relatively small percentage of the overall file data.
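A minimal sketch of the sequential approach (assuming the dataset fits in RAM, which may not hold for the poster's >4 GB file on an 8 GB machine):

#include <fstream>
#include <iterator>
#include <vector>

// Read the whole file in one sequential pass; random access then
// happens in memory instead of on disk.
std::vector<char> readWholeFile(const char* path)
{
    std::ifstream f(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(f),
                             std::istreambuf_iterator<char>());
}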
In your loop, Z/nZ should be the outermost loop and Y the inner loop. That would save seek times, given that the described storage layout stores the nZ pages one after another.
The code currently displayed has nZ in the inner loop, which is no good. The current arrangement of loops is analogous to reading a book by reading the first line of every page, then the second line of every page, and so on.
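A sketch of the suggested loop order; the sequential writeNext of the original would then have to become a positioned write, shown here as a hypothetical writeAt(buffer, offset) (file-header offset omitted, as in the original loop):

for (int64_t j = 0; j < nZ; j++)      // pages, in file order: reads are now sequential
{
    for (int64_t i = 0; i < nY; i++)  // lines within a page
    {
        read(pBuf, j*nY*nX + i*nX);   // monotonically increasing offsets, no back-seeking
        writeAt(pBuf, i*nZ*nX + j*nX); // transposed position on the output side
    }
}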
Thank you all very much for your input.
Actually the first thing I should have checked was at fault: the HDD, which wasn't able to provide the needed data rate.
I'm now thinking about switching to an SSD device.

Generation and storage of all DES keys

I'm writing a Data Encryption Standard "cracker" using C++ and CUDA. It was going to be a simple brute force - trying all possible keys to decrypt the encrypted data and checking whether the result equals the initial plain-text message.
The problem is that generating 2^56 keys takes time (and memory). My first approach was to generate the keys recursively and save them to a file.
Do you have any suggestions for how to improve this?
You don't really need recursion, nor do you need to store your keys.
The whole space of DES keys (if we don't count the 12 or so weak keys, which won't change anything for your purposes) is the space of 56-bit numbers (which, BTW, fit into a standard uint64_t), and you can just iterate through the numbers from 0 to 2^56-1, feeding each number as a 56-bit key to your CUDA cores whenever a core reports that it is done with the previous key.
If not for the cores, the code could look like this:
for (uint64_t i = 0; i <= 0xFFFFFFFFFFFFFFULL; ++i) { // 14 F's = 2^56 - 1; note <= so the last key is also tried
    uint8_t key[7];
    // Endianness-agnostic conversion of the counter into 7 key bytes:
    key[0] = (uint8_t)i;
    key[1] = (uint8_t)(i >> 8);
    key[2] = (uint8_t)(i >> 16);
    key[3] = (uint8_t)(i >> 24);
    key[4] = (uint8_t)(i >> 32);
    key[5] = (uint8_t)(i >> 40);
    key[6] = (uint8_t)(i >> 48);
    bool found = try_your_des_code(key, data_to_decrypt);
    if (found) printf("Eureka!\n");
}
To allow restarting your program in case anything goes wrong, you only need to store this number i in persistent storage, such as a file. (With cores, strictly speaking, the number i should be written to persistent storage only after all the numbers before it have already been processed by the CUDA cores, but generally a difference of 2000 or so keys won't make any difference performance-wise.)
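A minimal sketch of that checkpointing (the file name and granularity are arbitrary choices, not from the original answer):

#include <cstdint>
#include <cstdio>

// Load the last saved counter, or 0 if no checkpoint exists yet.
uint64_t load_checkpoint(const char* path)
{
    uint64_t i = 0;
    if (FILE* f = std::fopen(path, "rb")) {
        std::fread(&i, sizeof i, 1, f);
        std::fclose(f);
    }
    return i;
}

// Persist the counter; everything below it is known to be processed.
void save_checkpoint(const char* path, uint64_t i)
{
    if (FILE* f = std::fopen(path, "wb")) {
        std::fwrite(&i, sizeof i, 1, f);
        std::fclose(f);
    }
}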

Concatenate data in an array in C++

I'm working on software for processing audio in real time in C++ with Qt, and I need the resource requirements to be minimal.
Using a temporary buffer of 40 ms and a device running at a sampling frequency of Fs = 8000 Hz, every 320 samples a function called DataProcessing() is invoked.
The idea is to have a global buffer that stores the last 10 s recorded: 80000 samples.
On each iteration this buffer drops the initial 320 samples and appends the 320 new samples at the end. Thus the buffer stays updated and the user can observe a real-time graphical representation of the recorded signal.
At first I thought of using QVector (Qt's equivalent of std::vector) for this, which reduces the process to a few lines of code:
int NUM_POINTS = 320;
DatosTemporales.erase(DatosTemporales.begin(), DatosTemporales.begin() + NUM_POINTS);
DatosTemporales += DatosNuevos; // DatosNuevos has a size of NUM_POINTS
On each iteration this effectively rebuilds a vector of 80000 samples, in addition to freeing some positions, so it requires some processing time. An alternative I opted to try was using a double* and shifting with a loop on each iteration:
for (int i = 0; i < 80000; i++) {
    if (i < 80000 - NUM_POINTS) {
        aux = DatosTemporales[i];
        DatosTemporales[i + NUM_POINTS] = aux;
    } else {
        DatosTemporales[i] = DatosNuevos[i - NUM_POINTS];
    }
}
This fails. I think the best way is to use dynamic memory, implementing this process with pointers. Could anyone give me some idea of how to implement it?
It sounds like what you are looking for is a circular buffer.
https://www.google.com/search?q=qcircularbuffer
https://qt.gitorious.org/qt/qtbase/merge_requests/60
And it looks like you only need the header file and you should be good to go.
A similar tool that is already part of Qt is found here:
http://doc.qt.io/qt-5/qcontiguouscache.html#details
The advantage of using a system like these is that they don't need dynamic memory reallocation; they just move the head and tail pointers.
Hope that helps.
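For reference, a bare-bones circular buffer along those lines (a sketch, not QContiguousCache; sizes follow the question: an 80000-sample history updated in 320-sample blocks):

#include <array>
#include <cstddef>

class CircularBuffer {
    std::array<double, 80000> data{}; // 10 s at Fs = 8000 Hz
    std::size_t head = 0;             // index of the oldest sample
public:
    // Append a block of n new samples, overwriting the oldest ones.
    void append(const double* block, std::size_t n) {
        for (std::size_t i = 0; i < n; i++) {
            data[head] = block[i];
            head = (head + 1) % data.size();
        }
    }
    // Read sample i, where i == 0 is the oldest sample in the history.
    double at(std::size_t i) const {
        return data[(head + i) % data.size()];
    }
};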

OpenCL SHA1 Throughput Optimisation

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorithm.
Ultimately my question is about maximizing my throughput rate. At present I'm seeing something like 56.1 MH/s, which compares very badly to open-source programs I've looked at, such as John the Ripper and OCLHashcat, which achieve about 1,000 and 1,500 MH/s respectively (heck, I'd be well chuffed with a third of that!).
So, what I'm doing:
I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application to load data onto the GPU (using the CL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash in a threaded fashion on the CPU and loading this data into global GPU memory using the C++ wrapper call enqueueWriteBuffer (using uchars to represent the bytes to hash):
errorCode = dispatchQueue->enqueueWriteBuffer(
    inputBuffer,
    CL_FALSE, // CL_TRUE,
    0,
    sizeof(cl_uchar) * inputBufferSize,
    passwordBuffer,
    NULL,
    &dispatchDelegate);
I'm enqueuing work using enqueueNDRangeKernel in the following manner (where the global work size is a user-defined variable; at present I've set it to my GPU's maximum flattened global work size of 16.777 million per run):
errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(globalWorkgroupSize, 1),
    NullRange,
    NULL,
    NULL);
This means that (per dispatch) I load 16.777 million items in a 1D array and index into it from my kernel using get_global_id(0).
My Kernel signature:
__kernel void sha1Crack(__global uchar* out, __global uchar* in,
                        __constant int* passLen, __constant int* targetHash,
                        __global bool* collisionFound)
{
    // Kernel instance global GPU memory I/O mapping:
    __private int id = get_global_id(0);
    __private int passwordLen = *passLen;
    __private int inputIndexStart = id * passwordLen;
    __private uchar inputMem[64]; // scratch buffer (declaration not shown in the original post; size assumed)

    // Select password input key space:
    #pragma unroll
    for (int i = 0; i < passwordLen; i++)
    {
        inputMem[i] = in[inputIndexStart + i];
    }

    // SHA1 code omitted for brevity...
}
So, given all this: am I doing something fundamentally wrong in the way I'm loading data? I.e. one call to enqueueNDRangeKernel for 16.7 million kernel executions over a 1D input vector? Should I be using a 2D space and subdividing into local workgroup ranges? I tried playing with this but it didn't seem any quicker.
Or is my algorithm itself perhaps the source of the slowness? I've spent a good while optimizing it and manually unrolling all of the loop stages using pre-processor directives.
I've read about memory coalescing on the hardware. Could that be my issue? :S
Any advice at all appreciated! If I've missed anything important please let me know and I'll update.
Thanks in advance! ;)
Update: 16,777,216 is the device's maximum reported workgroup size (256^3). The global array of boolean values is a single boolean. It's set to false at the start of the kernel enqueue, and a branching statement sets it to true only if a collision is found - will that force a divergence? passwordLen is the length of the current input value, and targetHash is an int[4]-encoded hash to check against.
Your 'maximum flattened global worksize' should be multiplied by passwordLen. It is the number of kernel instances you can run, not the maximal length of an input array. You can most likely send much more data than this to the GPU.
Other potential issues: you're 'generating blocks of candidate data to hash in a threaded fashion on the CPU' - try doing this in advance of the kernel iterations, to see whether the delay is in the generation of the data blocks or in the processing of the kernels. Your SHA-1 algorithm is the other obvious potential issue. I'm not sure how much you've really optimised it by 'unrolling' the loops; usually the bigger optimisation issue is 'if' statements (if a single kernel instance within a workgroup tests true, then all of the lockstepped workgroup instances must follow that branch in parallel).
And DarkZeros is correct: you should manually play with the local workgroup size, making it a size that evenly divides the global size and suits the number of kernels which can run at once on the card. The easiest way to do this is to round the global work size up to the next multiple of the card's capacity and use an if {} statement in the kernel so the work is only done for global_id less than the actual number of kernels you want to run, as in the sketch below.
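Something like this on the host, paired with a guard in the kernel (localSize and actualItems are hypothetical names, not from the original code):

// Host side: pad the global size up to the next multiple of the local size.
size_t paddedGlobal = ((actualItems + localSize - 1) / localSize) * localSize;
errorCode = dispatchQueue->enqueueNDRangeKernel(
    *kernel,
    NullRange,
    NDRange(paddedGlobal, 1),
    NDRange(localSize, 1),
    NULL,
    NULL);

// Kernel side: the padding work-items exit immediately.
//   if (get_global_id(0) >= actualItems) return;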
Dave.