I'm trying to implement a Huffman tree.
Content of the simple .txt file I want to use for a test:
aaaaabbbbccd
Frequencies of characters: a:5, b:4, c:2, d:1
Code Table: (Data type of 1s and 0s: string)
a:0
d:100
c:101
b:11
Result that I want to write as binary: (22 bits)
0000011111111101101100
How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)
Answer: You can't.
The minimum amount you can write to a file (or read from it) is a char or unsigned char. For all practical purposes, a char has exactly eight bits.
You are going to need a one-char buffer and a count of the number of bits it holds. When that number reaches 8, you need to write it out and reset the count to 0. You will also need a way to flush the buffer at the end. (Note that you cannot write 22 bits to a file; you can only write 16 or 24. You will need some way to mark which bits at the end are unused.)
Something like:
struct BitBuffer {
    FILE* file;               // Initialization skipped.
    unsigned char buffer = 0;
    unsigned count = 0;

    void outputBit(unsigned char bit) {
        buffer <<= 1;         // Make room for next bit.
        if (bit) buffer |= 1; // Set if necessary.
        count++;              // Remember we have added a bit.
        if (count == 8) {
            fwrite(&buffer, sizeof(buffer), 1, file); // Error handling elided.
            buffer = 0;
            count = 0;
        }
    }
};
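The answer also mentions flushing the buffer at the end. A minimal sketch of such a member function for the struct above (the name flushBits is my own addition, not part of the original answer); it pads the final partial byte with zero bits and writes it out:

void flushBits() {
    if (count == 0) return;                   // Nothing buffered.
    buffer <<= (8 - count);                   // Pad the remaining low bits with zeros.
    fwrite(&buffer, sizeof(buffer), 1, file); // Error handling elided, as above.
    buffer = 0;
    count = 0;
}

You would also need to record somewhere (for example in a header byte) how many bits of the final byte are real data, so a reader can ignore the padding.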
The OP asked:
How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)
You cannot, and here is why...
Memory model
Defines the semantics of computer memory storage for the purpose of the C++ abstract machine.
The memory available to a C++ program is one or more contiguous sequences of bytes. Each byte in memory has a unique address.
Byte
A byte is the smallest addressable unit of memory. It is defined as a contiguous sequence of bits, large enough to hold the value of any UTF-8 code unit (256 distinct values) and of (since C++14) any member of the basic execution character set (the 96 characters that are required to be single-byte). Similar to C, C++ supports bytes of sizes 8 bits and greater.
The types char, unsigned char, and signed char use one byte for both storage and value representation. The number of bits in a byte is accessible as CHAR_BIT or std::numeric_limits<unsigned char>::digits.
Compliments of cppreference.com
You can find this page here: cppreference:memory model
This comes from the 2017-03-21 working draft of the standard:
©ISO/IEC N4659
4.4 The C++ memory model [intro.memory]
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (5.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits,4 the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
[ Note: The representation of types is described in 6.9. —end note ]
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. —end note ] Two or more threads of execution (4.7) can access separate memory locations without interfering with each other.
[ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of nonzero width. —end note ]
[ Example: A structure declared as
struct {
    char a;
    int b:5,
        c:11,
        :0,
        d:8;
    struct { int ee:8; } e;
}
contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]
4) The number of bits in a byte is reported by the macro CHAR_BIT in the header <climits>.
This version of the standard can be found here:
www.open-std.org section § 4.4 on pages 8 & 9.
The smallest possible unit of memory that can be written in a program is one byte, which is 8 contiguous bits or more. Even with bit-fields, the one-byte requirement still holds. You can manipulate, toggle, and set individual bits within a byte, but you cannot write individual bits.
What can be done is to have a byte buffer with a count of bits written. When your required bits are written you will need to have the rest of the unused bits marked as padding or unused buffer bits.
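One common convention (my suggestion; the answer does not prescribe a specific scheme) is to write the number of valid bits ahead of the packed bytes, so a reader knows how many trailing bits of the last byte are padding. A minimal sketch:

#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch: store the valid-bit count (host byte order, for simplicity) followed
// by the packed bytes. The reader can then ignore the padding bits at the end.
void writePacked(std::FILE* file, const std::vector<unsigned char>& bytes, std::uint32_t bitCount)
{
    std::fwrite(&bitCount, sizeof(bitCount), 1, file); // header: number of valid bits
    std::fwrite(bytes.data(), 1, bytes.size(), file);  // packed payload
}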
Edit
Note: when using bit-fields or unions, one thing that you must take into consideration is the endianness of the specific architecture.
Answer: You can, in a way.
From my experience there is a fairly simple way to do this. For the task you need to define an array of characters (it can be as small as 1 byte, or bigger). After that you must define functions that access a specific bit of any element. For example, here is an expression that gets the value of the 3rd bit of a char in C++.
/* position is in [1, .., 8] for a single byte; bit 1 is the least significant bit */
int bit_at(int position, unsigned char byte)
{
    return (byte & (1 << (position - 1)));
}
Now you can visualize the array of bytes like this:
[b1,...,bn]
Now what we actually have in memory is 8 * n bits of memory
We can try to visualize it like so.
NOTE: the array is zeroed!
|0000 0000|0000 0000|...|0000 0000|
Now from this, you or whoever wants to can figure out how to manipulate it to get a specific bit from this array. Of course there will be some sort of conversion involved, but that is not much of a problem.
In the end, for the encoding you provide, that is:
a:0
d:100
c:101
b:11
We can encode the message "abcd" and make an array that holds the bits of the message, using the elements of the array as containers for bits, like so:
|0111 0110|0000 0000|
You can write this out and you will have an excess of at most 7 padding bits.
This is a simple example, but it can be extended into much more.
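For completeness, a small sketch of how such packing could be written (the set_bit/pack helpers are my own, not part of the answer above); it fills the byte array most-significant-bit first, so the bit string 011101100 produces the layout |0111 0110|0000 0000| shown above:

#include <cstddef>
#include <string>
#include <vector>

// Sketch: set bit i of the array, counting from the most significant bit of byte 0.
void set_bit(std::vector<unsigned char>& bytes, std::size_t i)
{
    bytes[i / 8] |= static_cast<unsigned char>(1u << (7 - i % 8));
}

std::vector<unsigned char> pack(const std::string& bits)
{
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0); // zeroed, as noted above
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i] == '1')
            set_bit(bytes, i);
    return bytes;
}

// pack("011101100") yields the two bytes 0x76 0x00, i.e. |0111 0110|0000 0000|.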
I hope this gave some answers to your question.
I am working on translating a system from Python to C++. I need to be able to perform actions in C++ that are generally performed by using Python's struct.unpack (interpreting binary strings as numerical values). For integer values, I am able to get this to (sort of) work, using the data types in stdint.h:
struct.unpack("i", str) ==> *(int32_t*) str; //str is a char* containing the data
This works properly for little-endian binary strings, but fails on big-endian binary strings. Basically, I need an equivalent to using the > tag in struct.unpack:
struct.unpack(">i", str) ==> ???
Please note, if there is a better way to do this, I am all ears. However, I cannot use C++11, nor any 3rd-party libraries other than Boost. I will also need to be able to interpret floats and doubles, as in struct.unpack(">f", str) and struct.unpack(">d", str), but I'll get to that when I solve this.
NOTE I should point out that the endianness of my machine is irrelevant in this case. I know that the bitstream I receive in my code will ALWAYS be big-endian, and that's why I need a solution that will always cover the big-endian case. The article pointed out by BoBTFish in the comments seems to offer a solution.
For 32 and 16-bit values:
This is exactly the problem you have for network data, which is big-endian. You can use ntohl to turn a 32-bit value into host order, which is little-endian in your case.
The ntohl() function converts the unsigned integer netlong from network byte order to
host byte order.
int res = ntohl(*((int32_t *) str));
This also takes care of the case where your host is big-endian, in which case it simply does nothing.
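Putting that together, a minimal sketch (the memcpy is my addition, to avoid the aliasing and alignment problems discussed in a later answer):

#include <arpa/inet.h>  // ntohl; on Windows use <winsock2.h> instead
#include <stdint.h>
#include <cstring>

int32_t read_be32(const char* str)
{
    uint32_t tmp;
    std::memcpy(&tmp, str, sizeof(tmp));      // copy the raw bytes (no unaligned read)
    return static_cast<int32_t>(ntohl(tmp));  // network (big-endian) to host order
}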
For 64-bit values
Non-standardly, on Linux/BSD you can take a look at 64 bit ntohl() in C++?, which points to htobe64
These functions convert the byte encoding of integer values from the byte order that
the current CPU (the "host") uses, to and from little-endian and big-endian byte
order.
For windows try: How do I convert between big-endian and little-endian values in C++?
which points to _byteswap_uint64, as well as 16- and 32-bit solutions and the GCC-specific __builtin_bswap32/__builtin_bswap64 intrinsics.
Other Sizes
Most systems don't have values that aren't 16/32/64 bits long. If you do hit one, I might try to store it in a 64-bit value, shift it, and then translate. I'd write some good tests. I suspect it is an uncommon situation and more details would help.
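As a sketch of that "store it in a 64-bit value and shift" idea (my own code, assuming the bytes arrive most-significant first):

#include <stdint.h>
#include <stddef.h>

// Sketch: read an n-byte (n <= 8) big-endian unsigned value, byte by byte.
uint64_t read_be(const unsigned char* p, size_t n)
{
    uint64_t value = 0;
    for (size_t i = 0; i < n; ++i)
        value = (value << 8) | p[i];  // most significant byte first
    return value;
}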
Unpack the string one byte at a time.
unsigned char *str;
unsigned int result;
result = *str++ << 24;
result |= *str++ << 16;
result |= *str++ << 8;
result |= *str++;
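The question also asks about struct.unpack(">f") and struct.unpack(">d"). The same byte-by-byte idea extends to floating point if you assume IEEE-754 (true on virtually all current hardware); here is a sketch for reading a big-endian double on a little-endian host (my addition, not part of the answer above):

#include <algorithm>
#include <cstring>

// Sketch: assumes the 8 input bytes are an IEEE-754 double in big-endian order
// and the host is little-endian, as described in the question.
double read_be_double(const unsigned char* str)
{
    unsigned char tmp[8];
    std::copy(str, str + 8, tmp);
    std::reverse(tmp, tmp + 8);       // big-endian -> little-endian
    double d;
    std::memcpy(&d, tmp, sizeof(d));  // reinterpret the swapped bytes as a double
    return d;
}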
First, the cast you're doing:
char *str = ...;
int32_t i = *(int32_t*)str;
results in undefined behavior due to the strict aliasing rule (unless str is initialized with something like int32_t x; char *str = (char*)&x;). In practical terms that cast can result in an unaligned read which causes a bus error (a crash) on some platforms and slow performance on others.
Instead you should be doing something like:
int32_t i;
std::memcpy(&i, str, sizeof(i));
There are a number of functions for swapping bytes between the host's native byte ordering and a host-independent ordering: ntoh*() and hton*(), where * is l or s depending on the integer type. Since different hosts may have different byte orderings, this may be what you want to use if the data you're reading uses a consistent serialized form on all platforms.
i = ntohl(i);
You can also manually move bytes around in str before copying it into the integer.
std::swap(str[0],str[3]);
std::swap(str[1],str[2]);
std::memcpy(&i,str,sizeof(i));
Or you can manually manipulate the integer's value using shifts and bitwise operators.
std::memcpy(&i,str,sizeof(i));
i = (i&0xFFFF0000)>>16 | (i&0x0000FFFF)<<16;
i = (i&0xFF00FF00)>>8 | (i&0x00FF00FF)<<8;
This falls in the realm of bit twiddling.
for (i=0;i<sizeof(struct foo);i++) dst[i] = src[i ^ mask];
where mask == (sizeof type -1) if the stored and native endianness differ.
With this technique one can derive the byte-swap masks for each member of a struct:
typedef unsigned char byte;

struct foo {
    byte   a, b; // mask = 0,0
    short  e;    // mask = 1,1
    int    g;    // mask = 3,3,3,3
    double i;    // mask = 7,7,7,7,7,7,7,7
} s; // note that all members must be aligned according to their native size
Again, these masks can be encoded with two bits per byte, each mask being (1<<n)-1. This means that on 64-bit machines one can encode the necessary masks of a 32-byte struct in a single constant (covering 1-, 2-, 4- and 8-byte alignments).
unsigned int mask = 0xffffaa50; // or zero if the endianness matches
for (i = 0; i < 16; i++) {
    dst[i] = src[i ^ ((1 << (mask & 3)) - 1)];
    mask >>= 2;
}
If your as-received values are truly strings (char* or std::string) and you know their format, sscanf() and atoi(), well, really the whole ato*() family, will be your friends. They take well-formatted strings and convert them per the passed-in format (a kind of reverse printf).
When storing "byte arrays" (blobs...) is it better to use char or unsigned char for the items (unsigned char a.k.a. uint8_t)? (Standard says that sizeof of both is precisely 1 Byte.)
Does it matter at all? Or is one more convenient or prevalent than the other? For example, what do libraries like Boost use?
If char is signed, then performing arithmetic on a byte value with the high bit set will result in sign extension when promoting to int; so, for example:
char c = '\xf0';
int res = (c << 24) | (c << 16) | (c << 8) | c;
will give 0xfffffff0 instead of 0xf0f0f0f0. This can be avoided by masking with 0xff.
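For example, a masked version of the expression above might look like this (my sketch; the result type is unsigned so the shifts are well-defined):

char c = '\xf0';
unsigned int res = ((c & 0xffu) << 24) | ((c & 0xffu) << 16)
                 | ((c & 0xffu) << 8)  |  (c & 0xffu);
// res is 0xf0f0f0f0 regardless of whether char is signed.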
char may still be preferable if you're interfacing with libraries that use it instead of unsigned char.
Note that a cast from char * to/from unsigned char * is always safe (3.9p2). A philosophical reason to favour unsigned char is that 3.9p4 in the standard favours it, at least for representing byte arrays that could hold memory representations of objects:
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
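As a small illustration of that wording (my own example), any object's bytes can be inspected through an unsigned char pointer:

#include <cstdio>

int main()
{
    int x = 0x01020304;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&x);
    for (unsigned i = 0; i < sizeof x; ++i)
        std::printf("%02x ", static_cast<unsigned>(p[i])); // e.g. "04 03 02 01" on a little-endian machine
    std::printf("\n");
    return 0;
}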
Theoretically, the size of a byte in C++ is dependent on the compiler settings and target platform, but it is guaranteed to be at least 8 bits, which explains why sizeof(uint8_t) is required to be 1.
Here's more precisely what the standard has to say about it
§1.7/1
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
So, if you are working on some special hardware where bytes are not 8 bits, it may make a practical difference. Otherwise, I'd say that it's a matter of taste and what information you want to communicate via the choice of type.
One of the other problems with potentially using a signed value for blobs is that the value will depend on the sign representation, which is implementation-defined rather than fixed by the standard. So it's easier to invoke undefined behavior.
For example...
signed char x = 0x80;
int y = 0xffff00ff;
y |= (x << 8); // UB
The actual arithmetic value would also strictly depend on two's complement, which may give some people surprises. Using unsigned explicitly avoids these problems.
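By contrast, a sketch of the unsigned version of the same operation (my addition), which is well-defined:

unsigned char x = 0x80;
unsigned int y = 0xffff00ffu;
y |= (x << 8);  // x promotes to int 0x80; the shift gives 0x8000, so y == 0xffff80ff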
It makes no practical difference, although from a readability point of view it is clearer if the type is unsigned char, implying values 0..255.
Say I have a binary protocol where the first 4 bits represent a numeric value which can be less than or equal to 10 (ten in decimal).
In C++, the smallest data type available to me is char, which is 8 bits long. So, within my application, I can hold the value represented by 4 bits in a char variable. My question is: if I have to pack the char value back into 4 bits for network transmission, how do I pack my char's value back into 4 bits?
You do bitwise operations on the char, like so:
unsigned char packedvalue = 0;
packedvalue |= 0xF0 & (7 <<4);
packedvalue |= 0x0F & (10);
This sets the upper 4 bits to 7 and the lower 4 bits to 10.
Unpacking these again looks like this:
int upper, lower;
upper = (packedvalue & 0xF0) >>4;
lower = packedvalue & 0x0F;
As an extra answer to the question -- you may also want to look at protocol buffers for a way of encoding and decoding data for binary transfers.
Sure, just use one char for your value:
std::ofstream outfile("thefile.bin", std::ios::binary);
unsigned int n; // at most 10!
char c = n << 4; // fits
outfile.write(&c, 1); // we wrote the value "10"
The lower 4 bits will be left at zero. If they're also used for something, you'll have to populate c fully before writing it. To read:
infile.read(&c, 1);
unsigned int n = static_cast<unsigned char>(c) >> 4; // cast avoids sign extension if char is signed
Well, there's the popular but non-portable "Bit Fields". They're standard-compliant, but may create a different packing order on different platforms. So don't use them.
Then, there are the highly portable bit shifting and bitwise AND and OR operators, which you should prefer. Essentially, you work on a larger field (usually 32 bits, for TCP/IP protocols) and extract or replace subsequences of bits. See Martin's link and Soren's answer for those.
Are you familiar with C's bitfields? You simply write
struct my_bits {
unsigned v1 : 4;
...
};
Be warned, various operations are slower on bitfields because the compiler must unpack them for things like addition. I'd imagine unpacking a bitfield will still be faster than the addition operation itself, even though it requires multiple instructions, but it's still overhead. Bitwise operations should remain quite fast. Equality too.
You must also take care with endianness and threads (see the Wikipedia article I linked for details, but the issues are kinda obvious). You should learn about endianness anyway, since you said "binary protocol" (see this previous question).
sizeof(char) always returns 1 with a 32-bit GCC compiler.
But since the basic block size with a 32-bit compiler is 4, how does a char occupy a single byte when the basic size is 4 bytes?
Consider the following:
struct st
{
int a;
char c;
};
sizeof(st) returns 8, in agreement with the default block size of 4 bytes (since 2 blocks are allotted).
I can never understand why sizeof(char) returns 1 when it is allotted a block of size 4.
Can someone please explain this?
I would be very thankful for any replies explaining it.
EDIT: The typo of 'bits' has been changed to 'bytes'. I apologize to the person who made the first edit; I rolled back the edit since I did not notice the change you made.
Thanks to all those who pointed out that it had to be changed, especially @Mike Burton for downvoting the question and @jalf, who seemed to jump to conclusions about my understanding of the concepts.
sizeof(char) is always 1. Always. The 'block size' you're talking about is just the native word size of the machine - usually the size that will result in most efficient operation. Your computer can still address each byte individually - that's what the sizeof operator is telling you about. When you do sizeof(int), it returns 4 to tell you that an int is 4 bytes on your machine. Likewise, your structure is 8 bytes long. There is no information from sizeof about how many bits there are in a byte.
The reason your structure is 8 bytes long rather than 5 (as you might expect), is that the compiler is adding padding to the structure in order to keep everything nicely aligned to that native word length, again for greater efficiency. Most compilers give you the option to pack a structure, either with a #pragma directive or some other compiler extension, in which case you can force your structure to take minimum size, regardless of your machine's word length.
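For example, with MSVC, GCC and Clang the packing pragma looks roughly like this (a sketch; the exact syntax and the consequences for unaligned access are compiler- and platform-specific):

#pragma pack(push, 1)  // pack members with no padding
struct st_packed
{
    int a;
    char c;
};
#pragma pack(pop)      // restore the previous packing
// sizeof(st_packed) is 5 here, at the cost of potentially unaligned access to a.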
char is size 1, since that's the smallest access size your computer can handle - for most machines an 8-bit value. The sizeof operator gives you the size of all other quantities in units of how many char objects would be the same size as whatever you asked about. The padding (see link below) is added by the compiler to your data structure for performance reasons, so it is larger in practice than you might think from just looking at the structure definition.
There is a wikipedia article called Data structure alignment which has a good explanation and examples.
It is structure alignment with padding: c uses 1 byte and the remaining 3 bytes are unused padding.
Sample code demonstrating structure alignment:
struct st
{
int a;
char c;
};
struct stb
{
int a;
char c;
char d;
char e;
char f;
};
struct stc
{
int a;
char c;
char d;
char e;
char f;
char g;
};
std::cout<<sizeof(st) << std::endl; //8
std::cout<<sizeof(stb) << std::endl; //8
std::cout<<sizeof(stc) << std::endl; //12
The size of the struct is bigger than the sum of its individual members, since the 32-bit compiler rounds it up to a multiple of 4 bytes. These results may be different on different compilers, especially on a 64-bit compiler.
First of all, sizeof returns a number of bytes, not bits. sizeof(char) == 1 tells you that a char is one byte long (eight bits on your platform). All of the fundamental data types in C are at least one byte long.
Your structure returns a size of 8. This is a sum of three things: the size of the int, the size of the char (which we know is 1), and the size of any extra padding that the compiler added to the structure. Since many implementations use a 4-byte int, this would imply that your compiler is adding 3 bytes of padding to your structure. Most likely this is added after the char in order to make the size of the structure a multiple of 4 (a 32-bit CPU accesses data most efficiently in 32-bit chunks, and 32 bits is four bytes).
Edit: Just because the block size is four bytes doesn't mean that a data type can't be smaller than four bytes. When the CPU loads a one-byte char into a 32-bit register, the value will be sign-extended automatically (by the hardware) to make it fill the register. The CPU is smart enough to handle data in N-byte increments (where N is a power of 2), as long as it isn't larger than the register. When storing the data on disk or in memory, there is no reason to store every char as four bytes. The char in your structure happened to look like it was four bytes long because of the padding added after it. If you changed your structure to have two char variables instead of one, you should see that the size of the structure is the same (you added an extra byte of data, and the compiler added one fewer byte of padding).
All object sizes in C and C++ are defined in terms of bytes, not bits. A byte is the smallest addressable unit of memory on the computer. A bit is a single binary digit, a 0 or a 1.
On most computers a byte is 8 bits, so a byte can store values from 0 to 255, although computers exist with other byte sizes.
A memory address identifies a byte, even on 32-bit machines. Addresses N and N+1 point to two subsequent bytes.
An int, which is typically 32 bits, covers 4 bytes, meaning that 4 different memory addresses exist that each point to part of the int.
In a 32-bit machine, all the 32 actually means is that the CPU is designed to work efficiently with 32-bit values, and that an address is 32 bits long. It doesn't mean that memory can only be addressed in blocks of 32 bits.
The CPU can still address individual bytes, which is useful when dealing with chars, for example.
As for your example:
struct st
{
int a;
char c;
};
sizeof(st) returns 8 not because all structs have a size divisible by 4, but because of alignment. For the CPU to efficiently read an integer, it must be located at an address that is divisible by the size of the integer (4 bytes). So an int can be placed at address 8, 12 or 16, but not at address 11.
A char only requires its address to be divisible by the size of a char (1), so it can be placed on any address.
So in theory, the compiler could have given your struct a size of 5 bytes... Except that this wouldn't work if you created an array of st objects.
In an array, each object is placed immediately after the previous one, with no padding. So if the first object in the array is placed at an address divisible by 4, then the next object would be placed at a 5 bytes higher address, which would not be divisible by 4, and so the second struct in the array would not be properly aligned.
To solve this, the compiler inserts padding inside the struct, so its size becomes a multiple of its alignment requirement.
Not because it is impossible to create objects that don't have a size that is a multiple of 4, but because one of the members of your st struct requires 4-byte alignment, and so every time the compiler places an int in memory, it has to make sure it is placed at an address that is divisible by 4.
If you create a struct of two chars, it won't get a size of 4. It will usually get a size of 2, because when it contains only chars, the object can be placed at any address, and so alignment is not an issue.
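A quick sketch illustrating this (my example; the exact numbers are implementation-defined but typical for mainstream compilers):

#include <iostream>

struct two_chars { char a; char b; };  // only chars: alignment requirement 1
struct int_char  { int a; char c; };   // contains an int: alignment requirement 4 (typically)

int main()
{
    std::cout << sizeof(two_chars) << "\n"; // usually 2: no padding needed
    std::cout << sizeof(int_char)  << "\n"; // usually 8: 3 bytes of tail padding
    return 0;
}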
sizeof returns the value in bytes; you were talking about bits. 32-bit architectures are word-aligned and byte-addressed. It is irrelevant how the architecture stores a char, but to the compiler you must reference chars 1 byte at a time, even if they use up less than 1 byte.
This is why sizeof(char) is 1.
ints are 32 bits, hence sizeof(int) = 4; doubles are 64 bits, hence sizeof(double) = 8; etc.
Because of optimisation, padding is added so the size of an object is 1, 2 or n*4 bytes (or something like that; this is about x86). That's why padding is added to a 5-byte object but not to a 1-byte one. A single char doesn't have to be padded; it can be allocated in 1 byte, and we can store it in space allocated with malloc(1). st cannot be stored in space allocated with malloc(5), because when the st struct is copied the whole 8 bytes are copied.
It works the same way as using half a piece of paper. You use one part for a char and the other part for something else. The compiler will hide this from you, since loading and storing a char into a 32-bit processor register depends on the processor.
Some processors have instructions to load and store only parts of the 32 bits; others have to use binary operations to extract the value of a char.
Addressing a char works because it is, AFAIR by definition, the smallest addressable unit of memory. On a 32-bit system, pointers to two different ints will be at least 4 addresses apart, while char addresses can be only 1 apart.