Parsing long strings of binary data with C++

I am looking for ideas on how to parse long binary data, for example: "10100011111000111001"
bits 0-4 are the id
bits 5-15 are the data
etc.
The binary data structure can change, so I need to build a kind of database that stores how to parse each string.
Illustration (it could be around 200 bits):
Any ideas on how to implement this?
Thanks
Edit
What am I missing here?
struct Bitfield {
    uint16_t a : 10, b : 6;
};

void diag() {
    uint16_t t = 61455;
    struct Bitfield test = {t};
    cout << "a: " << test.a << endl;
    cout << "b: " << test.b << endl;
    return;
}
and the output is:
a: 15
b: 0

Options available
To manage a large structured set of bits, you have the following options:
C++ bit-fields: you define a structure with bit-field members. You can have as many members as you want, provided that no single one has more bits than an unsigned long long.
It's super easy to use; the compiler manages the access to bits or groups of bits for you. The major inconvenience is that the bit layout is implementation dependent, so this is not an option for writing portable code that exchanges data in a binary format.
Container of unsigned integral type: you define an array large enough to hold all the bits, and access bits or groups of bits using a combination of logical operations.
It requires you to be at ease with bitwise operations and is not practical if groups of bits are split over consecutive elements (see the sketch after this list). For exchanging data in binary format with the outside world in a portable way, you'd need to either take care of differences between big- and little-endian architectures or use arrays of uint8_t.
std::vector<bool>: gives you total flexibility to manage your bits. The main constraint is that you need to address each bit separately. Moreover, there's no data() member that could give direct access to the binary data.
std::bitset: is very similar to vector<bool> for accessing bits. It has a fixed size at compile time, but offers useful features such as reading and writing the bits as ASCII from strings or streams, converting from binary values of integral types, and logical operations on the full bitset.
A combination of these techniques
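As an illustration of the second option (the container of unsigned integral type), here is a minimal sketch of extracting a group of bits that spans byte boundaries with shifts and masks. It assumes MSB-first bit numbering within each byte, and the function name and field layout are only illustrative:

#include <cstddef>
#include <cstdint>

// Extract `width` bits (width <= 32) starting at bit position `pos`,
// where bit 0 is the most significant bit of buf[0].
uint32_t extract_bits(const uint8_t* buf, std::size_t pos, std::size_t width) {
    uint32_t value = 0;
    for (std::size_t i = 0; i < width; ++i) {
        std::size_t bit = pos + i;
        uint32_t b = (buf[bit / 8] >> (7 - bit % 8)) & 1u;  // pick one bit
        value = (value << 1) | b;                           // append it
    }
    return value;
}

// With the layout from the question: id = bits 0-4, data = bits 5-15.
// uint32_t id   = extract_bits(msg, 0, 5);
// uint32_t data = extract_bits(msg, 5, 11);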
Make your choice
To communicate with the outside world in a portable way, the easiest approach is to use bitsets. Bitsets offer easy input/output/string conversion in a format using ASCII '0' or '1' (or any substitutes thereof):
bitset<msg_header_size> bh, bh2;
bitset<msg_body_size> bb, bb2;
cin >> bh >> bb;                          // reads a string of ASCII 0 and 1
cout << bh << "-" << bb << endl << endl;  // writes a string of ASCII 0 and 1
You can also convert from/to binary data (but only a single value, large enough for the bitset size):
bitset<8> b(static_cast<uint8_t>(c));
cout << b << endl;
cout << b.to_ulong() << endl;
For reading/writing large sets, you'd need to read small bitsets and use logical operators to aggregate them into a larger bitset. If this seems time-consuming, it's in fact very close to what you'd do with containers of integrals, but without having to care about byte boundaries.
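A sketch of that aggregation step, assuming a 16-bit header followed by a 32-bit body (the sizes and the sample bit strings are chosen only for the example):

#include <bitset>
#include <iostream>
using namespace std;

int main() {
    const size_t header_size = 16, body_size = 32;
    bitset<header_size> header("1010001111100011");
    bitset<body_size>   body("10011001100110011001100110011001");

    // Widen each piece, shift the header past the body, then OR them together.
    bitset<header_size + body_size> msg(header.to_string());
    msg <<= body_size;
    msg |= bitset<header_size + body_size>(body.to_string());

    cout << msg << endl;  // header bits followed by body bits
}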
In your case, with a fixed-size header and a maximum size, the bitset seems to be a good choice (be careful, however, because the variable part is right-justified) for exchanging binary data with the external world.
For working with the data content, it's easy to access a specific bit, but you have to use some logical operations (shift, and) to access groups of bits. Moreover, if you want readable and maintainable code, it's better to abstract the bit layout.
Conclusion:
I would therefore strongly advise using a bit-field structure internally for working with the data, keeping a memory footprint comparable to the original data, and at the same time using bitsets just to convert from/to this structure for the purpose of external data exchange.
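A minimal sketch of that combination, assuming the 5-bit id / 11-bit data layout from the question and that "bits 0-4" means the leftmost characters of the string; the struct name, field names, and conversion through to_ulong() are only illustrative:

#include <bitset>
#include <cstdint>
#include <iostream>
using namespace std;

struct Message {            // internal view: the bit layout is abstracted away
    uint16_t id   : 5;
    uint16_t data : 11;
};

int main() {
    bitset<16> raw("1010001111100011");       // external view (ASCII exchange)
    unsigned long v = raw.to_ulong();

    Message m;
    m.id   = (v >> 11) & 0x1F;                // leftmost 5 bits of the string
    m.data = v & 0x7FF;                       // the following 11 bits
    cout << "id: " << m.id << " data: " << m.data << endl;

    // Back to the external representation for output:
    bitset<16> out((static_cast<unsigned long>(m.id) << 11) | m.data);
    cout << out << endl;
}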

The "best way" depends on the details of the problem.
If the whole number fits into the largest integer type available (usually long long), convert the string into an integer first (for example with the stoi/stol/stoll functions with base 2, assuming C++11 is available). Then use bit-shifting combined with bitwise AND (&) to extract the sections of the value you are interested in.
If the whole number does not fit into the largest integer type available, chop it up as a string (using the substr function) and then convert the substrings into integers one by one.
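A short sketch of the first approach, assuming the layout from the question (a 5-bit id in the leftmost bits followed by 11 bits of data) and a string short enough to fit into unsigned long long:

#include <cstdint>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string s = "1010001111100011";                 // 16 bits of input
    unsigned long long v = stoull(s, nullptr, 2);  // parse as base 2

    // The leftmost character of the string ends up in the highest bit of v.
    uint32_t id   = (v >> (s.size() - 5)) & 0x1F;  // bits 0-4 of the string
    uint32_t data = v & 0x7FF;                     // the 11 bits that follow
    cout << "id: " << id << " data: " << data << endl;
}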

Related

Efficient 8-bits and 16-bits computations in OCaml

I am currently working on a project in OCaml where I have to manipulate 8-bit and 16-bit unsigned integers. In my context, things can get a little messy: I sometimes want to convert an 8-bit integer into a 16-bit one, or split a 16-bit integer into two 8-bit ones. I also want to use all the usual operations, like addition or the bitwise operations, on those. Since there are all these interactions between 8 and 16 bits, I really like the comfort of having separate types for them. However, I still want my program to compute reasonably efficiently, and I don't want to lose too much time casting an integer of a given size into another size. So my question is essentially: how should I go about this? I have two main options, but I don't know enough about the low-level representation of OCaml values to choose comfortably:
Option 1 : Use dedicated types
I figured that I can use the Stdint library that is available through opam and has an implementation of the types uint8 and uint16 which are exactly what I am looking for.
Pros
I get very good mileage from the typing and will definitely avoid silly bugs from this
Cons
I have to constantly use the functions Uint8.to_uint16 and Uint16.to_uint8, which might eventually add up to heavy memory usage and poor efficiency of the compiled program, depending on how the precise representation is stored in the machine
Option 2 : Encode everything within the type int
This means that all my integers will simply be of type int and I will have to program the addition of two 8-bits integers and of two 16-bits integers in this type, for instance.
Pros:
I think these operations can be programmed in a very efficient way using usual operations and the bitwise operations on the type int.
Cons:
I get essentially nothing from the typing and I have to trust myself to choose the right function at the right time.
Possible workaround
I could use two modules for defining 8-bit and 16-bit integers encoded in an int declared as private. I think that would essentially work as I presented with Option 2. The fact that I chose the type to be private would, however, mean that I cannot switch from one to the other without running into a typing mistake, thus forcing explicit casts and getting leverage from the type system. Still, I expect the casts to be very efficient, since the memory representation of the object won't change.
So I would like to know how you would go about this. Is it worth going through all the trouble? Do you think one solution is clearly better, or are they reasonably equivalent?
Bonus
Every time I want to print (in hexadecimal) the value of a variable a of type uint8, I am writing
Printf.ksprintf "a = %02x" (Uint8.to_int a)
There is again a cast that seems to me a bit silly. I could also directly use the Uint8.to_string_hex function, but it explicitly writes the 0x in front of the number, which I don't want. Ideally I would like to just write
Printf.ksprintf "a = %02x" a
Is there a way to change the scopes and do some magic with Printf to make this happen?
In the stdint library, both int8 and int16 are represented as int, so there is no real tradeoff between option 1 and option 2:
type int8 = private int
(** Signed 8-bit integer *)
type int16 = private int
(** Signed 16-bit integer *)
The stdint library already provides you the best of both worlds: you have an efficient implementation and type safety. Yes, you need to do these translations, but they are no-ops and are there only for the typechecker.
Also, if you're looking for modular arithmetic (and, in general, modeling machine words and bitvectors), then you can look at our Bitvec library, which we developed as part of the Binary Analysis Platform. It is focused on performance while still providing type safety and a lot of operations. We modeled it based on the latest SMT-LIB specification to give clear semantics to all operations. It uses the excellent Zarith library under the hood, which enables an efficient representation for small and arbitrary-length integers.
Since modularity is not a property of a bitvector itself but a property of an operation, we do not encode the number of bits in the type and use the same type (and representation) for all bitvectors, from 1 bit to thousands of bits. However, it is impossible to mix and match the types incorrectly. E.g., you can use generic functions,
(x + y) mod m8
Or predefined modules for the specified modulus, e.g.,
M8.(x + y)
The library has a minimal number of dependencies, so try it by installing
opam install bitvec
There are also additional libraries like bitvec-order and bitvec-sexp that enable further integration with the Core suite of libraries, if you need them.

Bit shifting on non contiguous internal number representation

I am writing a C++ arbitrary integer library as a homework assignment. I represent numbers internally as a vector of unsigned int, in base 10^n, where n is as big as possible while still fitting into a single unsigned int digit.
I made this choice as a trade-off between space, performance and digit access (it allows me to have much better performance than using base 10, without adding any complexity when converting to a human-readable string).
So for example:
base10(441243123294967295) 18 digits
base1000000000(441243123,294967295) 2 digits (comma separated)
Internal representation with uint32
[00011010 01001100 11010101 11110011] [00010001 10010100 11010111 11111111]
To complete the homework I have to implement bit shifting and other bitwise operators.
Does it make sense to implement shift for a number with such an internal representation?
Should I change to a base 2^n so that all bits of the internal representation are significant?
You can, but you do not have to: a left shift by one simply doubles the number, no matter what base you use to interpret it later, so you can implement shifting on top of your existing representation. Your implementation will have to decide on the trade-off there, because your shifting will become harder to implement. On the other hand, printing in base 10 will remain simpler.
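A sketch of that idea, assuming the limbs are stored least-significant first in base 10^9 (the names and layout are only illustrative): a left shift by one bit becomes a multiplication by 2 with carries between limbs.

#include <cstdint>
#include <vector>

// Shift a base-10^9 number (least-significant limb first) left by one bit,
// i.e. multiply it by 2, propagating the carry from limb to limb.
void shift_left_one(std::vector<uint32_t>& limbs) {
    const uint32_t BASE = 1000000000u;
    uint32_t carry = 0;
    for (uint32_t& limb : limbs) {
        uint64_t doubled = 2ull * limb + carry;
        limb  = static_cast<uint32_t>(doubled % BASE);
        carry = static_cast<uint32_t>(doubled / BASE);
    }
    if (carry != 0)
        limbs.push_back(carry);
}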
Another solution favoring the decimal system that you may consider is binary-coded decimal (BCD). Back in the day, hardware that accelerated these operations (e.g. the 6502, the CPU of the Apple II) included special instructions for adding bytes in the BCD interpretation. You would have to implement a special correction step if you use this representation, but it may be a worthy learning exercise.
Should I change to a base 2^n so that all bits of the internal representation are significant?
Most definitely yes!
Not only that, but modern-day computers in general are all about base 2. If this is an exercise, you'll most likely want to learn how to do it well.
All libraries of this kind use base 2. They do that for a reason: faster processing, the possibility of bitwise operations, more compact storage, and more. These advantages outweigh the difficulty of converting to decimal. Therefore it's highly recommended that you convert to binary.

What data structure should I use for BigInt class

I would like to implement a BigInt class which will be able to handle really big numbers. I only want to add and multiply numbers; however, the class should also handle negative numbers.
I wanted to represent the number as a string, but there is a big overhead in converting the string to int and back for adding. I want to implement addition the way it's done in high school: add the corresponding digits and, if the result is 10 or greater, carry into the next position.
Then I thought that it would be better to handle it as an array of unsigned long long int and keep the sign separate in a bool. With this I'm worried about the size of the int, as the C++ standard, as far as I know, guarantees only that int < float < double. Correct me if I'm wrong. So when I reach some maximum value I should move forward in the array and start adding to the next array position.
Is there any data structure that is appropriate or better for this?
So, you want a dynamic array of integers of a well known size?
Sounds like vector<uint32_t> should work for you.
As you already found out, you will need to use specific types of your platform (or of the language, if you have C++11) that have a fixed size. A common implementation of a big number would use 32-bit integers and ensure that only the lower 16 bits are used. This enables you to operate on the digits (where a digit would be in [0..2^16)) and then normalize the result by applying the carry-overs.
On a modern, 64-bit x86 platform, the best approach is probably to store your bigint as a dynamically-allocated array of unsigned 32-bit integers, so your arithmetic can fit in 64 bits. You can handle the sign separately, as a member variable of the class, or you can use 2's-complement arithmetic (which is how signed ints are typically represented).
The standard C <stdint.h> include file defines uint32_t and uint64_t, so you can avoid platform-dependent integer types. Or, if your platform doesn't provide these, you can improvise and define this kind of thing yourself -- preferably in a separate "platform_dependent.h" file...
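A minimal sketch of that layout and of the carry handling for addition, assuming unsigned 32-bit limbs stored least-significant first with the sign kept separately (all names are illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

struct BigInt {
    bool negative = false;           // sign kept apart from the magnitude
    std::vector<uint32_t> limbs;     // least-significant limb first
};

// Add two non-negative magnitudes; 64-bit intermediates absorb the carry.
std::vector<uint32_t> add_magnitudes(const std::vector<uint32_t>& a,
                                     const std::vector<uint32_t>& b) {
    std::vector<uint32_t> result;
    uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        uint64_t sum = carry;
        if (i < a.size()) sum += a[i];
        if (i < b.size()) sum += b[i];
        result.push_back(static_cast<uint32_t>(sum & 0xFFFFFFFFu));
        carry = sum >> 32;
    }
    return result;
}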

C++: a scalable number class - bitset<1>* or unsigned char*

I'm planning on creating a number class. The purpose is to hold numbers of any size without worrying about hitting a limit (like with int or long), but at the same time without using too much space. For example:
If I have data that only really needs the values 1-10, I don't need an int (4 bytes), a short (2 bytes) or even a char (1 byte). So why allocate so much?
If I want to hold data that requires an extremely large value (only integers in this scenario), like past the billions, I cannot.
My goal is to create a number class that can handle this problem the way strings do, sizing to fit the number. But before I begin, I was wondering:
bitset<1>: bitset is a template class that allows me to manipulate bits in C++, which is quite useful, but is it efficient? bitset<1> would define 1 bit, but do I want to make an array of them? C++ can allocate a byte at minimum, so does bitset<1> allocate a byte and provide 1 bit of that byte? If that's the case, I'd rather create my number class with unsigned char*s.
unsigned char, or BYTE, holds 8 bits; anything from 0 to 255 would only need one, larger values would require two, then 3; it would simply keep expanding when needed in byte intervals rather than bit intervals.
Which do you think is more efficient? The bits would be, if bitset actually allocated 1 bit at a time, but I have a feeling that isn't even possible. In fact, it may actually be more efficient to allocate in bytes up to 4 bytes (32 bits); on a 32-bit processor, 32-bit allocation is most efficient, so I would use 4 bytes at a time from then on out.
Basically my question is: what are your thoughts? How should I go about this implementation: bitset<1>, or unsigned char (or BYTE)?
Optimizing for bits is silly unless your target architecture is a DigiComp-1. Reading individual bits is always slower than reading ints - 4 bits isn't more efficient than 8.
Use unsigned char if you want to store it as a decimal number. This will be the most efficient.
Or, you could just use GMP.
The bitset template requires a compile-time constant integer for its template argument. This could be a drawback when you have to determine the maximum bit size at run time. Another thing is that most compilers / libraries use unsigned int or unsigned long long to store the bits, for faster memory access. If your application has to run in an environment with limited memory, you should create a new class like bitset or use a different library.
While it won't directly help you with arithmetic on giant numbers, if this kind of space-saving is your goal then you might find my Nstate library useful (boost license):
http://hostilefork.com/nstate/
For instance: if you have a value that can be between 0 and 2, then so long as you are going to be storing a bunch of these in an array, you can exploit the "wasted" space of the unused 4th state (3) to pack more values. In that particular case, you can get 20 tristates into a 32-bit word instead of the 16 that you would get with 2 bits per tristate.
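The packing itself is plain base-3 arithmetic; here is a generic sketch of the idea (not the actual Nstate API), storing 20 tristates in a single uint32_t, which works because 3^20 = 3,486,784,401 < 2^32:

#include <cstdint>

uint32_t pow3(unsigned n) {              // 3^n, for n <= 19
    uint32_t p = 1;
    while (n--) p *= 3;
    return p;
}

uint32_t get_tristate(uint32_t packed, unsigned index) {
    return (packed / pow3(index)) % 3;   // read the base-3 digit at `index`
}

uint32_t set_tristate(uint32_t packed, unsigned index, uint32_t value) {
    uint32_t weight = pow3(index);
    uint32_t old = (packed / weight) % 3;
    return packed - old * weight + value * weight;  // replace that digit
}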

Usage of 'short' in C++

Why is it that for any numeric input we prefer an int rather than a short, even if the input consists of very small integers?
The size of a short is 2 bytes on my x86 and 4 bytes for an int; shouldn't a short be better and faster to allocate than an int?
Or am I wrong in saying that short is not used?
CPUs are usually fastest when dealing with their "native" integer size. So even though a short may be smaller than an int, the int is probably closer to the native size of a register in your CPU, and therefore is likely to be the most efficient of the two.
In a typical 32-bit CPU architecture, loading a 32-bit value requires one bus cycle to load all the bits. Loading a 16-bit value requires one bus cycle to load the bits, plus throwing half of them away (this may still happen within one bus cycle).
A 16-bit short makes sense if you're keeping so many in memory (in a large array, for example) that the 50% reduction in size adds up to an appreciable reduction in memory overhead. They are not faster than 32-bit integers on modern processors, as Greg correctly pointed out.
In embedded systems, the short and unsigned short data types are used for accessing items that require fewer bits than the native integer.
For example, if my USB controller has 16-bit registers, and my processor has a native 32-bit integer, I would use an unsigned short to access the registers (provided that the unsigned short data type is 16 bits).
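A sketch of that kind of register access; the register name and address are entirely made up, and a real address would come from the controller's datasheet:

// Hypothetical 16-bit USB controller status register, memory-mapped.
// volatile ensures every read/write really touches the hardware register.
static volatile unsigned short* const usb_status =
    reinterpret_cast<volatile unsigned short*>(0x40005C00u);

unsigned short read_usb_status() {
    return *usb_status;                   // a 16-bit access, not a 32-bit one
}

void clear_usb_status(unsigned short mask) {
    *usb_status = static_cast<unsigned short>(*usb_status & ~mask);
}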
Most of the advice from experienced users (see news:comp.lang.c++.moderated) is to use the native integer size unless a smaller data type must be used. The problem with using short to save memory is that the values may exceed the limits of short. Also, there may be a performance hit on some 32-bit processors, as they have to fetch the 32 bits around the 16-bit variable and eliminate the unwanted 16 bits.
My advice is to work on the quality of your programs first, and only worry about optimization if it is warranted and you have extra time in your schedule.
Using type short does not guarantee that the actual values will be smaller than those of type int. It allows for them to be smaller, and ensures that they are no bigger. Note too that short must be larger than or equal in size to type char.
The original question above contains actual sizes for the processor in question, but when porting code to a new environment, one can only rely on weak relative assumptions without verifying the implementation-defined sizes.
The C header <stdint.h> -- or, from C++, <cstdint> -- defines types of specified size, such as uint8_t for an unsigned integral type exactly eight bits wide. Use these types when attempting to conform to an externally-specified format such as a network protocol or binary file format.
The short type is very useful if you have a big array full of them and int is just way too big.
If the array is big enough, the memory saving is significant (compared to just using an array of ints).
Unicode arrays are also encoded in shorts (although other encoding schemes exist).
On embedded devices, space still matters and short might be very beneficial.
Last but not least, some transmission protocols insist on using shorts, so you still need them there.
Maybe we should consider it in different situations. For example, on x86 or x64 we should consider the more suitable type, not just choose int. In some cases int is faster than short. The first answer above has already covered this.