Search sequence byte location in binary file - c++

I want to write a plugin-like search function, as in Binary Viewer, that searches for a specific sequence in a binary file by text, hex, or bit. https://www.proxoft.com/BinaryViewer.aspx
std::vector<int> search_bit(std::string& file_path, std::string& bit)
std::vector<int> search_hex(std::string& file_path, std::string& hex)
std::vector<int> search_text(std::string& file_path, std::string& text)
For example, I open a 6-byte binary file "path": 30 30 31 31 30 31 (hex view).
search_bit(path, "001100000001"),
search_hex(path, "3031"), and search_text(path, "01") all return {1, 4},
because "30 31" starts at bytes 1 and 4 in this file (hex "30" stands for ASCII '0', "31" stands for '1').
Is there any reference that could help? Sorry if this looks like asking for code; I know little about memory-mapping techniques.

For all three problems you could use the Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm). It is relatively simple (https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/) and will yield relatively fast execution times.
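A minimal sketch of KMP over a whole-file buffer, assuming the file fits in memory (the function names are my own; search_text is this directly, and search_hex is the same after converting the hex digits to raw bytes):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Build the KMP failure table for the pattern.
static std::vector<int> build_failure(const std::string& pat)
{
    std::vector<int> fail(pat.size(), 0);
    for (std::size_t i = 1, k = 0; i < pat.size(); ++i)
    {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) ++k;
        fail[i] = static_cast<int>(k);
    }
    return fail;
}

// Return the 0-based byte offset of every match of 'pattern' in the file.
std::vector<int> search_bytes(const std::string& file_path, const std::string& pattern)
{
    std::vector<int> hits;
    if (pattern.empty()) return hits;
    std::ifstream in(file_path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    std::vector<int> fail = build_failure(pattern);
    for (std::size_t i = 0, k = 0; i < data.size(); ++i)
    {
        while (k > 0 && data[i] != pattern[k]) k = fail[k - 1];
        if (data[i] == pattern[k]) ++k;
        if (k == pattern.size())
        {
            hits.push_back(static_cast<int>(i + 1 - pattern.size()));
            k = fail[k - 1]; // keep scanning for overlapping matches
        }
    }
    return hits;
}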
In the context of "sub-byte" comparisons:
For searching a bit pattern you could use std::vector<bool> (space efficient) or std::bitset<[...]> (the algorithm used would be the same, just the "character" would be 1 bit), but be aware that you can only load one byte at a time from a file; see the expansion sketch below.
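A hedged sketch of that expansion (it assumes most-significant-bit-first ordering, which you would adjust to match the viewer's convention); the same KMP search can then run over the bit vector, and a match position divided by 8 gives the containing byte:

#include <fstream>
#include <string>
#include <vector>

// Expand each byte of the file into 8 bools, most significant bit first.
std::vector<bool> file_to_bits(const std::string& file_path)
{
    std::ifstream in(file_path, std::ios::binary);
    std::vector<bool> bits;
    char c;
    while (in.get(c))
        for (int i = 7; i >= 0; --i)
            bits.push_back(((static_cast<unsigned char>(c) >> i) & 1) != 0);
    return bits;
}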
For searching a hex pattern you could use a custom bit-field struct (https://en.cppreference.com/w/cpp/language/bit_field):

struct hex_pair
{
    uint8_t h0 : 4; // only use 4 bits
    uint8_t h1 : 4; // only use 4 bits
}; // this will be space efficient and fast in terms of loading (this struct is trivial)

// loading
file_size = [...];
hex_pair* data_store = [...];
ifile.read(reinterpret_cast<char*>(data_store), file_size);
If you want to complicate the problem a bit, you could use https://en.cppreference.com/w/cpp/iterator/istream_iterator to load the data from the file (this will optimize memory usage a bit), plus some window buffer so the KMP algorithm has enough data to work on.
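In fact KMP needs only one input character at a time, so no window is strictly required. A hedged streaming sketch (note it uses std::istreambuf_iterator rather than std::istream_iterator, since the latter skips whitespace by default and would corrupt binary data; build_failure is reused from the sketch above):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Streaming KMP: reads one byte at a time instead of loading the whole file.
std::vector<int> search_bytes_streaming(const std::string& file_path,
                                        const std::string& pattern)
{
    std::vector<int> hits;
    if (pattern.empty()) return hits;
    std::vector<int> fail = build_failure(pattern);
    std::ifstream in(file_path, std::ios::binary);
    std::size_t k = 0;
    int pos = 0;
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it, ++pos)
    {
        while (k > 0 && *it != pattern[k]) k = fail[k - 1];
        if (*it == pattern[k]) ++k;
        if (k == pattern.size())
        {
            hits.push_back(pos + 1 - static_cast<int>(pattern.size()));
            k = fail[k - 1];
        }
    }
    return hits;
}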

Add a bit value to a string

I am trying to send a packet over a network, and so I want it to be as small as possible (in terms of size).
Each of the inputs can contain a common prefix substring, like ABCD. In such cases, I just want to send a single bit, say 1, denoting that the current string has the same prefix ABCD, and append it to the remaining string. So, if the string was ABCDEF, I will send 1EF; if it was LKMPS, I wish to send the string LKMPS as is.
Could someone please point out how I could add a bit to a string?
Edit: I get that adding a 1 to a string does not mean that this 1 is a bit; it is just a character that I added to the string. And that is exactly my question: for each string, how do I send a bit denoting that the prefix matches, and then send the remaining part of the string that is different?
In common networking hardware, you won't be able to send individual bits. And most architectures cannot address individual bits, either.
However, you can still minimize the size as you want by using one of the bits that you may not be using. For instance, if your strings contain only 7-bit ASCII characters, you could use the highest bit to encode the information you want in the first byte of the string.
For example, if the first byte is:
0b01000001 == 0x41 == 'A'
Then set the highest bit using |:
(0b01000001 | 0x80) == 0b11000001 == 0xC1
To test for the bit, use &:
(0b01000001 & 0x80) == 0
(0b11000001 & 0x80) != 0
To remove the bit (in the case where it was set) to get back the original first byte:
(0b11000001 & 0x7F) == 0b01000001 == 0x41 == 'A'
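Putting those pieces together, a hedged sketch of the whole scheme (encode/decode are hypothetical names; it assumes 7-bit ASCII payloads, so bit 7 of the first remaining byte is free to carry the flag):

#include <string>

// If 's' starts with the agreed prefix, drop the prefix and set the high
// bit of the first remaining byte as the "prefix present" flag.
std::string encode(const std::string& s, const std::string& prefix)
{
    if (s.compare(0, prefix.size(), prefix) == 0 && s.size() > prefix.size())
    {
        std::string rest = s.substr(prefix.size());
        rest[0] = static_cast<char>(rest[0] | 0x80); // set the tag bit
        return rest;
    }
    return s; // no common prefix: send unchanged
}

std::string decode(const std::string& s, const std::string& prefix)
{
    if (!s.empty() && (static_cast<unsigned char>(s[0]) & 0x80))
    {
        std::string rest = s;
        rest[0] = static_cast<char>(rest[0] & 0x7F); // clear the tag bit
        return prefix + rest;
    }
    return s;
}

With these, encode("ABCDEF", "ABCD") yields a 2-byte payload whose first byte is 'E' with the high bit set, and decode restores "ABCDEF".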
If you're working with a buffer for use in your communications protocol, it should generally not be an std::string. Standard strings are not intended for use as buffers; and they can't generally be prepended in-place with anything.
It's possible that you may be better served by an std::vector<std::byte>, by a (compile-time-fixed-size) std::array, or, again, by a class of your own making. That is especially true if you want your "in-place" prepending of bits or characters not merely to keep the same span of memory for your buffer, but to actually avoid moving any of the existing data. For that, you'd need twice the maximum length of the buffer and would start writing at the middle, so you can either append or prepend data without shifting anything, while maintaining bit-resolution "pointers" to the effective start and end of the full part of the buffer. This would be readily achievable with, yes, you guessed it, your own custom buffer class, as sketched below.
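A bare-bones sketch of that idea, at byte resolution only and with no bounds checking (bit-resolution indices would follow the same pattern; requires C++17 for std::byte):

#include <cstddef>
#include <vector>

// Fixed-capacity buffer that starts writing at the middle, so bytes can be
// prepended or appended without moving existing data.
class DoubleEndedBuffer {
public:
    explicit DoubleEndedBuffer(std::size_t max_len)
        : buf_(2 * max_len), head_(max_len), tail_(max_len) {}

    void prepend(std::byte b) { buf_[--head_] = b; }
    void append(std::byte b)  { buf_[tail_++] = b; }

    const std::byte* data() const { return buf_.data() + head_; }
    std::size_t size() const      { return tail_ - head_; }

private:
    std::vector<std::byte> buf_;
    std::size_t head_, tail_; // effective start/end of the filled region
};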
I think the smallest amount of memory you can work with is 8 bits.
If you wanted to operate on individual bits, you could define 8 single-bit flags as follows:
#include <cstdint>
#include <iostream>

enum message_header {
    prefix_on = 1 << 0,
    bitdata_1 = 1 << 1,
    bitdata_2 = 1 << 2,
    bitdata_3 = 1 << 3,
    bitdata_4 = 1 << 4,
    bitdata_5 = 1 << 5,
    bitdata_6 = 1 << 6,
    bitdata_7 = 1 << 7,
};

int main() {
    uint8_t a(0);
    a ^= prefix_on; // toggle the prefix flag
    if (a & prefix_on) {
        std::cout << "prefix_on" << std::endl;
    }
}
That being said, networks are pretty fast nowadays, so I wouldn't bother.

How to understand MNIST Binary converter in c++?

I've recently needed to convert the MNIST data set to images and labels. It is binary, and the structure is as in the previous link, so I did a little research, and as I'm a fan of C++, I read up on binary I/O in C++. After that I found this link on Stack Overflow. That link's code works well, but it has no comments and no explanation of the algorithm, so I got confused, and that raised some questions which I need a professional C++ programmer to answer.
1- What is the algorithm to convert the data set in C++ with the help of ifstream?
I've realized I can read a file as binary with file.read and move to the next record, but in C we define a struct and move it through the file; I can't see any struct in the C++ program, for example to read this:
[offset]  [type]           [value]           [description]
0000      32 bit integer   0x00000803(2051)  magic number
0004      32 bit integer   60000             number of images
0008      32 bit integer   28                number of rows
0012      32 bit integer   28                number of columns
0016      unsigned byte    ??                pixel
How can we go to a specific offset, for example 0004, read for example a 32-bit integer, and put it into an integer variable?
2- What is the function reverseInt doing? (It is obviously not simply reversing the digits of an integer.)
int ReverseInt(int i)
{
    unsigned char ch1, ch2, ch3, ch4;
    ch1 = i & 255;          // lowest byte
    ch2 = (i >> 8) & 255;
    ch3 = (i >> 16) & 255;
    ch4 = (i >> 24) & 255;  // highest byte
    // reassemble with the byte order reversed
    return ((int)ch1 << 24) + ((int)ch2 << 16) + ((int)ch3 << 8) + ch4;
}
I did a little debugging with cout, and when it reversed, for example, 270991360, it returned 10000, in which I cannot find any relation. I understand it shifts the number and ANDs it with 255, but why?
PS:
1- I already have the converted MNIST images, but I want to understand the algorithm.
2- I've already unzipped the .gz files, so the file is pure binary.
1- What is the algorithm to convert the data set in C++ with the help of ifstream?
This function reads a file (t10k-images-idx3-ubyte.gz) as follows:
Read the magic number and adjust endianness
Read the number of images and adjust endianness
Read the number of rows and adjust endianness
Read the number of columns and adjust endianness
Read all the images x rows x columns pixel characters (but discard them).
The function uses a plain int and always switches endianness; that means it targets a very specific architecture and is not portable.
How can we go to the specific offset for example 0004 and read for example 32 bit integer and put it to an integer variable.
ifstream provides a function to seek to a given position:
file.seekg( posInBytes, std::ios_base::beg);
At the given position, you could read the 32-bit integer:
int32_t val;
file.read ((char*)&val,sizeof(int32_t));
2- What is the function reverseInt doing?
This function reverses the order of the bytes of an int value:
Considering a 32-bit integer like aaaaaaaabbbbbbbbccccccccdddddddd, it returns the integer ddddddddccccccccbbbbbbbbaaaaaaaa. For example, 10000 is 0x00002710; reversing its bytes gives 0x10270000, which is the 270991360 observed while debugging.
This is useful for normalizing endianness; however, it is probably not very portable, as int might not be 32-bit (but e.g. 16-bit or 64-bit).
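Putting the pieces together, a hedged sketch of reading the IDX header with fixed-width types (swap_endian is my own helper; the unconditional swap assumes a little-endian host such as x86, since IDX files store their integers big-endian):

#include <cstdint>
#include <fstream>
#include <iostream>

// Reverse the bytes of a 32-bit value (same job as ReverseInt, but with a
// fixed-width type so it behaves the same on every platform).
uint32_t swap_endian(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0xFF00) | ((v << 8) & 0xFF0000) | (v << 24);
}

int main()
{
    std::ifstream file("t10k-images-idx3-ubyte", std::ios::binary);
    uint32_t magic, images, rows, cols;
    file.read(reinterpret_cast<char*>(&magic),  sizeof(magic));
    file.read(reinterpret_cast<char*>(&images), sizeof(images));
    file.read(reinterpret_cast<char*>(&rows),   sizeof(rows));
    file.read(reinterpret_cast<char*>(&cols),   sizeof(cols));
    // IDX integers are big-endian; swap them on a little-endian machine.
    magic  = swap_endian(magic);
    images = swap_endian(images);
    rows   = swap_endian(rows);
    cols   = swap_endian(cols);
    std::cout << images << " images of " << rows << "x" << cols << "\n";
    // pixel data (images * rows * cols unsigned bytes) follows from here
}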

C++ save and load huge vector<bool>

I have a huge vector<vector<bool>> (512 x 44,000,000 bits). It takes me 4-5 hours of computation to create it, and obviously I want to save the result to spare me repeating the process ever again. When I run the program again, all I want to do is load the same vector (no other app will use this file).
I believe text files are out of the question for such a size. Is there a simple (quick and dirty) way to do this? I don't use Boost, and this is only a minor part of my scientific app, so it must be something quick. I also thought of inverting it on the fly and storing it in a Postgres DB (44,000,000 records with 512 bits of data each), so the DB can handle it easily. I have seen answers suggesting packing 8 bits into 1 byte and then saving, but with my limited newbie C++ experience, they sound too complicated. Any ideas?
You can save 8 bits into a single byte:
unsigned char saver(bool bits[])
{
    unsigned char output = 0;
    for (int i = 0; i < 8; i++)
    {
        output = output | (bits[i] << i); // probably faster than if(){output|=(1<<i);}
        // example: starting from 00000000
        // the first iteration sets  00000001 only if bits[0] is true
        // the second sets           0000001x only if bits[1] is true
        // the third sets            000001xx only if bits[2] is true
        // ... (x is whatever value was there before)
    }
    return output;
}
You can load 8 bits from a single byte:
void loader(unsigned char var, bool* bits)
{
    for (int i = 0; i < 8; i++)
    {
        bits[i] = var & (1 << i);
        // for example, if you loaded var as "200", which is 11001000 in binary:
        // the zeroth iteration gets false
        // the first gets false
        // the second gets false
        // the third gets true
        // ...
    }
}
1<<0 is 1 -----> 00000001
1<<1 is 2 -----> 00000010
1<<2 is 4 -----> 00000100
1<<3 is 8 -----> 00001000
1<<4 is 16 ----> 00010000
1<<5 is 32 ----> 00100000
1<<6 is 64 ----> 01000000
1<<7 is 128 ---> 10000000
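A hedged round-trip sketch using the two functions above (the file name is my own choice; error checks omitted for brevity):

#include <cstdio>

int main()
{
    bool bits[8] = { true, false, false, true, false, true, false, false };
    unsigned char packed = saver(bits);  // pack 8 bools into one byte

    std::FILE* f = std::fopen("bits.bin", "wb");
    std::fwrite(&packed, 1, 1, f);
    std::fclose(f);

    f = std::fopen("bits.bin", "rb");
    unsigned char loaded = 0;
    std::fread(&loaded, 1, 1, f);
    std::fclose(f);

    bool out[8];
    loader(loaded, out);                 // unpack back into the 8 bools
}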
Edit: Using GPGPU, an embarrassingly parallel algorithm taking 4-5 hours on a CPU can be shortened to 0.04-0.05 hours on a GPU (or even less than a minute with multiple GPUs). For example, the saver/loader functions above are embarrassingly parallel.
I have seen answers such take 8bits > 1byte and then save, but with my limited newbie C++ experience, they sound too complicated. Any ideas?
If you are going to read the file often, this would be a good time to learn bitwise operations. Using one bit per bool would be 1/8th the size. That's going to save a lot of memory and I/O.
So save it as one bit per bool, then either break it into chunks and/or read it using mapped memory (e.g. mmap). You can put this behind a usable interface, so you need to implement it just once and abstract the serialized format when you need to read the values.
Proceeding as said before: here vec is the vector of vectors of bool, and we pack the bits of each sub-vector 8 by 8 into bytes, pushing those bytes into a buffer that is written out per sub-vector.

std::vector<unsigned char> buf;
int cmp = 0;
unsigned char output = 0;
FILE* of = fopen("out.bin", "wb");
for (const auto& subvec : vec)
{
    for (bool b : subvec)
    {
        output = output | ((b ? 1 : 0) << cmp);
        cmp++;
        if (cmp == 8)
        {
            buf.push_back(output);
            cmp = 0;
            output = 0;
        }
    }
    fwrite(&buf[0], 1, buf.size(), of);
    buf.clear();
}
if (cmp != 0)               // flush a trailing partial byte, if any
    fwrite(&output, 1, 1, of);
fclose(of);
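For completeness, a hedged sketch of reading it back (load_bits is a hypothetical name; it assumes you know the dimensions that were saved and that the bits were packed back-to-back, least significant bit first, as in the loop above):

#include <cstddef>
#include <cstdio>
#include <vector>

std::vector<std::vector<bool>> load_bits(const char* path,
                                         std::size_t rows, std::size_t cols)
{
    std::vector<std::vector<bool>> vec(rows, std::vector<bool>(cols));
    std::FILE* f = std::fopen(path, "rb");
    unsigned char byte = 0;
    int cmp = 8;                        // force a read on the first bit
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
        {
            if (cmp == 8) { std::fread(&byte, 1, 1, f); cmp = 0; }
            vec[r][c] = (byte >> cmp) & 1;
            ++cmp;
        }
    std::fclose(f);
    return vec;
}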

Writing unsigned int to binary file

This is my first time working with binary files, and I already have clumps of hair in my hands. Anyway, I have the following defined:
unsigned int cols, rows;
Those variables can be anywhere from 1 to about 500. When I get to writing them to a binary file, I'm doing this:
myFile.write(reinterpret_cast<const char *>(&cols), sizeof(cols));
myFile.write(reinterpret_cast<const char *>(&rows), sizeof(rows));
When I go back to read the file, with cols = 300, I get this as the result:
44
1
0
0
Can someone please explain to me why I'm getting that result? I can't say that there's something wrong, as I honestly think it's me who doesn't understand things. What I'd LIKE to do is store the value, as is, in the file so that when I read it back, I get that as well. And maybe I do, I just don't know it.
I'd like some explanation of how this works and how I can get back the data I put in.
You are simply looking at the four bytes of a 32 bit integer, interpreted on a little-endian platform.
300 base 10 = 0x12C
So little-endianness gives you 0x2C 0x01, and of course 0x2C=44.
Each byte in the file has 8 bits, so can represent values from 0 to 255. It's written in little-endian order, with the low byte first. So, starting at the other end, treat the numbers as digits in base 256. The value is 0 * 256^3 + 0 * 256^2 + 1 * 256^1 + 44 * 256^0 (where ^ means exponentiation, not xor).
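The same base-256 reconstruction in code, a small sketch with the four bytes from the question:

#include <iostream>

int main()
{
    unsigned char bytes[4] = { 44, 1, 0, 0 };  // the file's little-endian bytes
    unsigned int value = bytes[0]
                       + (bytes[1] << 8)
                       + (bytes[2] << 16)
                       + (static_cast<unsigned int>(bytes[3]) << 24);
    std::cout << value << "\n";                // prints 300
}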
You have not (yet) shown how you unmarshal the data, nor how you printed the text you've cited. 44 01 00 00 looks like the bytewise decimal representation of each of the little-endian bytes of the data you've written (decimal "300").
If you read the data back like so, it should give you the effect you want (presuming that you're okay with the limitation that the computer which writes this file is the same endianness as the one which reads it back):
unsigned int colsReadFromFile = 0;
myOtherFile.read(reinterpret_cast<char *>(&colsReadFromFile), sizeof(colsReadFromFile));
if (!myOtherFile)
{
std::cerr << "Oh noes!" << std::endl;
}
300 in binary is 100101100, which is 9 bits long.
But when you look at the data through a char*, you see only the first 1 byte (8 bits),
so you get the low 00101100 (bits) of (1 00101100), which is 44.

Best way to merge hex strings in c++? [heavily edited]

I have two hex strings, accompanied by masks, that I would like to merge into a single string value/mask pair. The strings may have bytes that overlap, but after applying the masks, no overlapping bits should contradict what the value of that bit must be. For example, value1 = 0x0A, mask1 = 0xFE and value2 = 0x0B, mask2 = 0x0F basically say that the resulting merge must have an upper nibble of all 0s and a lower nibble of 1011.
I've done this already in straight C, converting the strings to byte arrays and memcpy'ing into buffers, as a prototype. It's tested and seems to work. However, it's ugly and hard to read, and it doesn't throw exceptions when specific bit requirements contradict. I've considered using bitsets, but is there another way that might not demand the conversion overhead? Performance would be nice, but not crucial.
EDIT: More detail, although writing this makes me realize I've made a simple problem too difficult. But, here it is, anyway.
I am given a large number of inputs that are binary searches over a mixed-content document. The document is broken into pages, and pages are provided by an API that delivers a single page at a time. Each page needs to be searched with the provided search terms.
I have all the search terms prior to requesting pages. The inputs are strings representing hex digits (this is what I mean by hex strings), as well as a mask indicating which bits of the input hex string are significant. Since I'm given all the input up front, I wanted to speed up the search of each returned page by pre-processing and merging these hex strings. To make the problem more interesting, every string has an optional offset into the page where it must appear; the lack of an offset indicates that the string can appear anywhere in a requested page. So, something like this:
class Input {
public:
    int input_id;
    std::string value;
    std::string mask;
    bool offset_present;
    unsigned int offset;
};
If a given Input object has offset_present = false, then any value assigned to offset is ignored; such an input clearly can't be merged with other inputs.
To make the problem more interesting, I want to report an output that provides information about what was found (the input_id that was found, where the offset was, etc). Merging some inputs (but not others) makes this a bit more difficult.
I had considered defining a CompositeInput class and was thinking of a bitset as the underlying merger, but further reading about bitsets made me realize it wasn't what I really thought. My inexperience made me give up on the composite idea and go brute force. I've necessarily skipped some details about other input types and additional information to be collected for the output (say, page number, paragraph number) when an input is found. Here's an example output class:
class Output {
public:
    Output();
    int id_result;
    unsigned int offset_result;
};
I would want to produce N of these if I merge N hex strings, keeping any merger details hidden from the user.
I don't know what a hexstring is... but other than that it should be like this:
outcome = (value1 & mask1) | (value2 & mask2);
it sounds like |, & and ~ would work?
const size_t prefix = 2; // "0x"
const size_t bytes = 2;
const char* value1 = "0x0A";
const char* mask1  = "0xFE";
const char* value2 = "0x0B";
const char* mask2  = "0x0F";
char output[prefix + bytes + 1] = "0x";
uint8_t char2int[] = { /*zeroes until index '0'*/ 0,1,2,3,4,5,6,7,8,9, /*...*/ 10,11,12,13,14,15 };
char int2char[] = { '0', /*...*/ 'F' };
for (size_t ii = prefix; ii != prefix + bytes; ++ii)
{
    uint8_t v1 = char2int[value1[ii]];
    uint8_t m1 = char2int[mask1[ii]];
    uint8_t v2 = char2int[value2[ii]];
    uint8_t m2 = char2int[mask2[ii]];
    // a conflict is a bit that both masks mark significant but whose values differ
    if ((m1 & m2) & (v1 ^ v2))
        throw std::invalid_argument("conflicting bits");
    output[ii] = int2char[(v1 & m1) | (v2 & m2)];
}
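A self-contained, hedged rework of that loop, assuming the strings are bare hex digits of equal length (no "0x" prefix); hex_digit and merge are my own names, and unlike the snippet above it also returns the merged mask:

#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>

struct Merged { std::string value, mask; };

static uint8_t hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = std::toupper(static_cast<unsigned char>(c));
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Merge two value/mask hex strings, throwing if any bit claimed
// significant by both masks disagrees between the two values.
Merged merge(const std::string& v1, const std::string& m1,
             const std::string& v2, const std::string& m2)
{
    Merged out{ "", "" };
    for (size_t i = 0; i < v1.size(); ++i)  // assumes equal-length strings
    {
        uint8_t both = hex_digit(m1[i]) & hex_digit(m2[i]);
        if ((both & (hex_digit(v1[i]) ^ hex_digit(v2[i]))) != 0)
            throw std::invalid_argument("conflicting bits");
        uint8_t a = hex_digit(v1[i]) & hex_digit(m1[i]);
        uint8_t b = hex_digit(v2[i]) & hex_digit(m2[i]);
        out.value += "0123456789ABCDEF"[a | b];
        out.mask  += "0123456789ABCDEF"[hex_digit(m1[i]) | hex_digit(m2[i])];
    }
    return out;
}

With the question's example, merge("0A", "FE", "0B", "0F") yields value "0B" with mask "FF", matching the "upper nibble all 0s, lower nibble 1011" requirement.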