I'm trying to implement a Huffman tree.
Content of my simple .txt file that I want to do a simple test:
aaaaabbbbccd
Frequencies of characters: a:5, b:4, c:2, d:1
Code Table: (Data type of 1s and 0s: string)
a:0
d:100
c:101
b:11
Result that I want to write as binary: (22 bits)
0000011111111101101100
How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)
Answer: You can't.
The minimum amount you can write to a file (or read from it), is a char or unsigned char. For all practical purposes, a char has exactly eight bits.
You are going to need a one-char buffer and a count of the number of bits it holds. When that count reaches 8, you write the buffer out and reset the count to 0. You will also need a way to flush the buffer at the end. (Note that you cannot write 22 bits to a file - you can only write 16 or 24. You will need some way to mark which bits at the end are unused.)
Something like:
struct BitBuffer {
FILE* file; // Initialization skipped.
unsigned char buffer = 0;
unsigned count = 0;
void outputBit(unsigned char bit) {
buffer <<= 1; // Make room for next bit.
if (bit) buffer |= 1; // Set if necessary.
count++; // Remember we have added a bit.
if (count == 8) {
fwrite(&buffer, sizeof(buffer), 1, file); // Error handling elided.
buffer = 0;
count = 0;
}
}
};
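To round that out, here is a minimal sketch of the same idea with the flush step included. The names `BitWriter` and `pack_bit_string` are made up for illustration. It assumes 8-bit bytes, writes bits most-significant-first, and pads the last byte with zero bits:

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: collects bits MSB-first and emits whole bytes to a stream.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out_(out) {}

    // Appends one bit; writes the byte out once 8 bits have accumulated.
    void put(bool bit) {
        buffer_ = static_cast<unsigned char>((buffer_ << 1) | (bit ? 1 : 0));
        if (++count_ == 8) emit();
    }

    // Pads the last partial byte with zero bits and writes it.
    // Returns the number of padding bits added (0..7).
    unsigned flush() {
        if (count_ == 0) return 0;
        unsigned padding = 8 - count_;
        buffer_ = static_cast<unsigned char>(buffer_ << padding);
        emit();
        return padding;
    }

private:
    void emit() {
        out_.put(static_cast<char>(buffer_));
        buffer_ = 0;
        count_ = 0;
    }

    std::ostream& out_;
    unsigned char buffer_ = 0;
    unsigned count_ = 0;
};

// Packs a string of '0'/'1' characters and returns the bytes produced.
std::string pack_bit_string(const std::string& bits) {
    std::ostringstream out;
    BitWriter writer(out);
    for (char c : bits) writer.put(c == '1');
    writer.flush();
    return out.str();
}
```

For a real file you would pass a `std::ofstream` opened with `std::ios::binary` instead of the string stream, and store the padding count (or the total bit count) somewhere, for example in a small header, so the reader knows which trailing bits to ignore.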
The OP asked:
How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)
You can not and here is why...
Memory model
Defines the semantics of a computer memory storage for the purpose of C++ abstract machine.
The memory available to a C++ program is one or more contiguous sequences of bytes. Each byte in memory has a unique address.
Byte
A byte is the smallest addressable unit of memory. It is defined as a contiguous sequence of bits, large enough to hold the value of any UTF-8 code unit (256 distinct values) and of (since C++14) any member of the basic execution character set (the 96 characters that are required to be single-byte). Similar to C, C++ supports bytes of sizes 8 bits and greater.
The types char, unsigned char, and signed char use one byte for both storage and value representation. The number of bits in a byte is accessible as CHAR_BIT or std::numeric_limits<unsigned char>::digits.
Compliments of cppreference.com
You can find this page here: cppreference:memory model
This comes from the 2017-03-21 working draft of the standard:
©ISO/IEC N4659
4.4 The C++ memory model [intro.memory]
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (5.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits,4 the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
[ Note: The representation of types is described in 6.9. —end note ]
A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. —end note ] Two or more threads of execution (4.7) can access separate memory locations without interfering
with each other.
[ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of nonzero width. —end note ]
[ Example: A structure declared as
struct {
char a;
int b:5,
c:11,
:0,
d:8;
struct {int ee:8;} e;
}
contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory
locations, and can be modified concurrently without interfering with each other. The bit-fields b and c
together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but
b and a, for example, can be. —end example ]
4) The number of bits in a byte is reported by the macro CHAR_BIT in the header <climits>.
This version of the standard can be found here:
www.open-std.org section § 4.4 on pages 8 & 9.
The smallest possible memory module that can be written to in a program is 8 contiguous bits or more for a standard byte. Even with bit fields, the 1 byte requirement still holds. You can manipulate, toggle, set, individual bits within a byte but you can not write individual bits.
What can be done is to have a byte buffer with a count of bits written. When your required bits are written you will need to have the rest of the unused bits marked as padding or un-used buffer bits.
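As a rough sketch of the reading side, assuming 8-bit bytes, bits stored most-significant-first, and that the number of valid bits is known separately (for example from a header), the padding can simply be skipped. `unpack_bits` is a hypothetical name:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the first bit_count bits (MSB-first) as a
// string of '0'/'1' characters and ignores any trailing padding bits.
std::string unpack_bits(const std::vector<unsigned char>& bytes, std::size_t bit_count) {
    std::string bits;
    bits.reserve(bit_count);
    for (std::size_t i = 0; i < bit_count && i / 8 < bytes.size(); ++i) {
        unsigned char byte = bytes[i / 8];
        unsigned shift = 7 - static_cast<unsigned>(i % 8);
        bits.push_back(((byte >> shift) & 1u) ? '1' : '0');
    }
    return bits;
}
```

With the OP's 22-bit result stored in three bytes, asking for 22 bits returns the original bit string and drops the two padding bits.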
Edit
[Note:] -- When using bit fields or unions, one thing that you must take into consideration is the endianness of the specific architecture.
Answer: You can, in a way.
Hello, from my experience there is a simple way to do this. For the task, define an array of characters (it can be as small as 1 byte, or bigger). Then define functions to access a specific bit of any element. For example, here is how to get the value of the 3rd bit of a char in C++.
/* position is in [1, ..., 8]; position 1 is the least significant bit */
int bit_at(int position, unsigned char byte)
{
    /* shift the wanted bit down to bit 0, then mask it, so the result is 0 or 1 */
    return (byte >> (position - 1)) & 1;
}
Now you can vision the array of bytes as this
[b1,...,bn]
Now what we actually have in memory is 8 * n bits of memory
We can try to visualize it like so.
NOTE: the array is zeroed!
|0000 0000|0000 0000|...|0000 0000|
From this, you can work out how to get or set a specific bit anywhere in the array. Some conversion between a bit index and a (byte, bit) pair is needed, but that is not much of a problem.
In the end, for the encoding you provide, that is:
a:0
d:100
c:101
b:11
We can encode the message "abcd",
and make an array that holds the bits
of the message, using the elements
of the array as arrays for bits, like so:
|0111 0110|0000 0000|
You can write this to memory and you will have an excess of at most 7 bits.
This is a simple example, but it can be extended into much more.
I hope this gave some answers to your question.
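Here is a small sketch of that packing step, under my own assumptions: bit 0 is the most significant bit of the first byte (which matches the picture above), bytes are 8 bits wide, and `set_bit` and `encode` are illustrative names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Sets bit number `index` in a zeroed byte array
// (index 0 = most significant bit of bytes[0]).
void set_bit(std::vector<unsigned char>& bytes, std::size_t index) {
    bytes[index / 8] |= static_cast<unsigned char>(0x80u >> (index % 8));
}

// Encodes `message` with the code table and packs the bits MSB-first.
// Throws std::out_of_range if a character is missing from the table.
std::vector<unsigned char> encode(const std::string& message,
                                  const std::map<char, std::string>& table) {
    std::string bits;
    for (char c : message) bits += table.at(c);

    // Round up to whole bytes; the unused tail bits stay zero.
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') set_bit(bytes, i);
    }
    return bytes;
}
```

Encoding "abcd" with the table from the question gives the 9 bits 011101100, which occupy two bytes, with the last 7 bits unused.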
Related
I have a variable:
int64_t label : 40
I want to take the 32 lower bits and put them in a variable of type:
char nol[4]
How can I do that in c++?
Depends on what you mean by "lower" bits. The word "lower" normally implies lower memory address. But that's rarely useful. You may be thinking of least significant instead, which is more commonly useful.
You must also consider what order you want the bytes to be in the array. When copying the lower bytes, you typically want to keep the bytes in the same order as in the integer i.e. native endianness. When copying least significant bytes, you typically want a specific order which may differ from the native endianness i.e. either big or little endian. Big endian is conventionally used in network communication.
If the number of bits to copy is not a multiple of byte size, then copying the incomplete byte adds some complexity.
Copying the lower bytes in native order is very simple:
char nol[32 / CHAR_BIT];
std::memcpy(nol, &label, sizeof nol);
Here is an example of copying least significant bytes in big endian order:
for (std::size_t i = 0; i < sizeof nol; i++) {
    // byte i (counting from the least significant) goes to the end of the array
    nol[sizeof nol - 1 - i] = label >> CHAR_BIT * i & UCHAR_MAX;
}
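For completeness, a self-contained sketch of the big endian variant. `Packet` and `low32_big_endian` are hypothetical names, and the code assumes 8-bit bytes. The bit-field is converted to an unsigned 64-bit value first, so the shifts are well defined:

```cpp
#include <array>
#include <climits>
#include <cstddef>
#include <cstdint>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

struct Packet {                 // hypothetical holder for the 40-bit field
    std::int64_t label : 40;
};

// Returns the 32 least significant bits of `value`, most significant byte first.
std::array<unsigned char, 4> low32_big_endian(std::uint64_t value) {
    std::array<unsigned char, 4> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        // Byte i, counting from the least significant end, lands at the back.
        out[out.size() - 1 - i] =
            static_cast<unsigned char>((value >> (8 * i)) & 0xFFu);
    }
    return out;
}
```

Usage would look like `auto nol = low32_big_endian(static_cast<std::uint64_t>(p.label));` for some `Packet p`.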
Is bitfield a C concept or C++?
Can it be used only within a structure? What are the other places we can use them?
AFAIK, bitfields are special structure variables that occupy the memory only for specified no. of bits. It is useful in saving memory and nothing else. Am I correct?
I coded a small program to understand the usage of bitfields - But, I think it is not working as expected. I expect the size of the below structure to be 1+4+2 = 7 bytes (considering the size of unsigned int is 4 bytes on my machine), But to my surprise it turns out to be 12 bytes (4+4+4). Can anyone let me know why?
#include <stdio.h>
struct s{
unsigned int a:1;
unsigned int b;
unsigned int c:2;
};
int main()
{
printf("sizeof struct s = %zu bytes \n",sizeof(struct s)); /* sizeof yields size_t, so use %zu */
return 0;
}
OUTPUT:
sizeof struct s = 12 bytes
Because a and c are not contiguous, they each reserve a full int's worth of memory space. If you move a and c together, the size of the struct becomes 8 bytes.
Moreover, you are telling the compiler that you want a to occupy only 1 bit, not 1 byte. So even though a and c next to each other should occupy only 3 bits total (still under a single byte), the combination of a and c still become word-aligned in memory on your 32-bit machine, hence occupying a full 4 bytes in addition to the int b.
Similarly, you would find that
struct s{
unsigned int b;
short s1;
short s2;
};
occupies 8 bytes, while
struct s{
short s1;
unsigned int b;
short s2;
};
occupies 12 bytes because in the latter case, the two shorts each sit in their own 32-bit alignment.
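If you want to see this on your own compiler, a quick sketch like the following prints the sizes side by side. The struct names are mine, and the exact numbers are implementation-defined, so treat the 12 and 8 above as what is typical for a compiler with a 4-byte int, not a guarantee.

```cpp
#include <cstddef>
#include <iostream>

struct Separated {          // same member order as in the question
    unsigned int a : 1;
    unsigned int b;
    unsigned int c : 2;
};

struct Adjacent {           // bit-fields moved next to each other
    unsigned int a : 1;
    unsigned int c : 2;
    unsigned int b;
};

struct ShortsTogether { unsigned int b; short s1; short s2; };
struct ShortsSplit    { short s1; unsigned int b; short s2; };

// Prints the sizes chosen by the current compiler.
void report_sizes() {
    std::cout << "Separated:      " << sizeof(Separated) << '\n'
              << "Adjacent:       " << sizeof(Adjacent) << '\n'
              << "ShortsTogether: " << sizeof(ShortsTogether) << '\n'
              << "ShortsSplit:    " << sizeof(ShortsSplit) << '\n';
}
```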
1) They originated in C, but are part of C++ too, unfortunately.
2) Yes, or within a class in C++.
3) As well as saving memory, they can be used for some forms of bit twiddling. However, both memory saving and twiddling are inherently implementation dependent - if you want to write portable software, avoid bit fields.
It's C.
Your compiler has rounded the memory allocation up to 12 bytes for alignment purposes. Many memory subsystems fetch whole words rather than individual bytes.
Your program is working exactly as I'd expect. The compiler allocates adjacent bitfields into the same memory word, but yours are separated by a non-bitfield.
Move the bitfields next to each other and you'll probably get 8, which is the size of two ints on your machine. The bitfields would be packed into one int. This is compiler specific, however.
Bitfields are useful for saving space, but not much else.
Bitfields are widely used in firmware to map the different fields of hardware registers. This saves a lot of the manual bitwise operations that would otherwise be needed to read and write those fields.
One disadvantage is you can't take address of bitfields.
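For comparison, here is a sketch of the manual bitwise operations that a bitfield would replace. The register layout (a 4-bit mode in bits 0-3 and an 8-bit divider in bits 4-11) is invented for the example and does not describe any real device:

```cpp
#include <cstdint>

// Hypothetical 32-bit control register: bits 0-3 = mode, bits 4-11 = divider.
constexpr std::uint32_t kModeShift = 0;
constexpr std::uint32_t kModeMask = 0xFu;
constexpr std::uint32_t kDividerShift = 4;
constexpr std::uint32_t kDividerMask = 0xFFu;

// Extracts a field: shift it down to bit 0, then mask off everything else.
constexpr std::uint32_t get_field(std::uint32_t reg, std::uint32_t shift,
                                  std::uint32_t mask) {
    return (reg >> shift) & mask;
}

// Replaces a field: clear its old bits, then OR in the new (masked) value.
constexpr std::uint32_t set_field(std::uint32_t reg, std::uint32_t shift,
                                  std::uint32_t mask, std::uint32_t value) {
    return (reg & ~(mask << shift)) | ((value & mask) << shift);
}
```

This is more typing than a bitfield, but the bit positions are spelled out explicitly instead of being left to the compiler.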
I am using Linux 32 bit os,
and GCC compiler.
I tried with three different type of structure.
In the first structure I defined only one char variable. The size of this structure is 1, which is correct.
In the second structure I defined only one int variable. Here the size of the structure is 4, which is also correct.
But in the third structure I defined one char and one int, so the total size should be 5, yet the output shows 8. Can anyone please explain how a structure is laid out?
typedef struct struct_size_tag
{
char c;
//int i;
}struct_size;
int main()
{
printf("Size of structure:%zu\n",sizeof(struct_size)); /* sizeof yields size_t, so use %zu */
return 0;
}
Output: Size of structure:1
typedef struct struct_size_tag
{
//char c;
int i;
}struct_size;
int main()
{
printf("Size of structure:%zu\n",sizeof(struct_size)); /* sizeof yields size_t, so use %zu */
return 0;
}
Output: Size of structure:4
typedef struct struct_size_tag
{
char c;
int i;
}struct_size;
int main()
{
printf("Size of structure:%zu\n",sizeof(struct_size)); /* sizeof yields size_t, so use %zu */
return 0;
}
Output:
Size of structure:8
The difference in size is due to alignment. The compiler is free to choose padding bytes, which make the total size of a structure not necessarily the sum of its individual elements.
If the padding of a structure is undesired, because it might have to interface with some hardware requirement (or other reasons), compilers usually support packing structures, so the padding is disabled.
You definitely get Data Structure Alignment
"Data alignment means putting the data at a memory offset equal to some
multiple of the word size, which increases the system's performance
due to the way the CPU handles memory. To align the data, it may be
necessary to insert some meaningless bytes between the end of the last
data structure and the start of the next, which is data structure
padding."
For more, take a look at this, Data Structure Alignment
The C standard allows a compiler to add padding bytes to structs after any field to allow the following field to be aligned according to any specific requirements of the compiler (or the user of the compiler). The standard does not specify, but typically a compiler will provide a command line argument to specify the (default) alignment. Good compilers also invariably support the de facto standard of #pragma pack, including push and pop options.
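A minimal sketch of what that looks like, assuming a compiler that accepts `#pragma pack(push, 1)` and `#pragma pack(pop)` (GCC, Clang and MSVC do; it is an extension, not standard C or C++). The struct names are illustrative:

```cpp
#include <cstddef>

struct Padded {             // default layout: padding is usually inserted after c
    char c;
    int i;
};

#pragma pack(push, 1)       // compiler extension: pack members on 1-byte boundaries
struct Packed {
    char c;
    int i;                  // now starts at offset 1, so it may be misaligned
};
#pragma pack(pop)           // restore the previous packing setting
```

Keep in mind that accessing a misaligned member can be slower, and on some CPUs it can fault, so packing is usually reserved for matching an external layout such as a file format or a hardware interface.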
Padding bytes provide improved performance by reducing the amount of memory accesses required by suitably aligned data types. For example, on a 32-bit processor (more specifically a system which uses memory with 32 data lines) accessing a 32-bit integer will require two memory accesses when reading and writing the value rather than just one if it crosses a 4-byte boundary (ie, unless the bottom two bits of the address of the integer are zero).
See My Blog Post for more details (better than Wikipedia article).
The magic word is padding/memory alignment #see data structure alignment.
sizeof(char) always returns 1 with the 32-bit GCC compiler.
But since the basic block size in 32 bit compiler is 4, How does char occupy a single byte when the basic size is 4 bytes???
Considering the following :
struct st
{
int a;
char c;
};
sizeof(st) returns as 8 as agreed with the default block size of 4 bytes (since 2 blocks are allotted)
I can never understand why sizeof(char) returns as 1 when it is allotted a block of size 4.
Can someone pls explain this???
I would be very thankful for any replies explaining it!!!
EDIT : The typo of 'bits' has been changed to 'bytes'. I ask Sorry to the person who made the first edit. I rollbacked the EDIT since I did not notice the change U made.
Thanks to all those who made it a point that It must be changed especially #Mike Burton for downvoting the question and to #jalf who seemed to jump to conclusions over my understanding of concepts!!
sizeof(char) is always 1. Always. The 'block size' you're talking about is just the native word size of the machine - usually the size that will result in most efficient operation. Your computer can still address each byte individually - that's what the sizeof operator is telling you about. When you do sizeof(int), it returns 4 to tell you that an int is 4 bytes on your machine. Likewise, your structure is 8 bytes long. There is no information from sizeof about how many bits there are in a byte.
The reason your structure is 8 bytes long rather than 5 (as you might expect), is that the compiler is adding padding to the structure in order to keep everything nicely aligned to that native word length, again for greater efficiency. Most compilers give you the option to pack a structure, either with a #pragma directive or some other compiler extension, in which case you can force your structure to take minimum size, regardless of your machine's word length.
char is size 1, since that's the smallest access size your computer can handle - for most machines an 8-bit value. The sizeof operator gives you the size of all other quantities in units of how many char objects would be the same size as whatever you asked about. The padding (see link below) is added by the compiler to your data structure for performance reasons, so it is larger in practice than you might think from just looking at the structure definition.
There is a wikipedia article called Data structure alignment which has a good explanation and examples.
It is structure alignment with padding. c uses 1 byte, 3 bytes are non used. More here
Sample code demonstrating structure alignment:
struct st
{
int a;
char c;
};
struct stb
{
int a;
char c;
char d;
char e;
char f;
};
struct stc
{
int a;
char c;
char d;
char e;
char f;
char g;
};
std::cout<<sizeof(st) << std::endl; //8
std::cout<<sizeof(stb) << std::endl; //8
std::cout<<sizeof(stc) << std::endl; //12
The size of the struct is bigger than the sum of its individual components, since it was set to be divisible by 4 bytes by the 32 bit compiler. These results may be different on different compilers, especially if they are on a 64 bit compiler.
First of all, sizeof returns a number of bytes, not bits. sizeof(char) == 1 tells you that a char is one byte long (eight bits on most machines; the exact number is CHAR_BIT). All of the fundamental data types in C are at least one byte long.
Your structure returns a size of 8. This is a sum of three things: the size of the int, the size of the char (which we know is 1), and the size of any extra padding that the compiler added to the structure. Since many implementations use a 4-byte int, this would imply that your compiler is adding 3 bytes of padding to your structure. Most likely this is added after the char in order to make the size of the structure a multiple of 4 (a 32-bit CPU access data most efficiently in 32-bit chunks, and 32 bits is four bytes).
Edit: Just because the block size is four bytes doesn't mean that a data type can't be smaller than four bytes. When the CPU loads a one-byte char into a 32-bit register, the value is sign-extended or zero-extended (depending on whether the char is signed) to fill the register. The CPU is smart enough to handle data in N-byte increments (where N is a power of 2), as long as it isn't larger than the register. When storing the data on disk or in memory, there is no reason to store every char as four bytes. The char in your structure happened to look like it was four bytes long because of the padding added after it. If you changed your structure to have two char variables instead of one, you should see that the size of the structure is the same (you added an extra byte of data, and the compiler added one fewer byte of padding).
All object sizes in C and C++ are defined in terms of bytes, not bits. A byte is the smallest addressable unit of memory on the computer. A bit is a single binary digit, a 0 or a 1.
On most computers, a byte is 8 bits (so a byte can store values from 0 to 255), although computers exist with other byte sizes.
A memory address identifies a byte, even on 32-bit machines. Addresses N and N+1 point to two subsequent bytes.
An int, which is typically 32 bits covers 4 bytes, meaning that 4 different memory addresses exist that each point to part of the int.
In a 32-bit machine, all the 32 actually means is that the CPU is designed to work efficiently with 32-bit values, and that an address is 32 bits long. It doesn't mean that memory can only be addressed in blocks of 32 bits.
The CPU can still address individual bytes, which is useful when dealing with chars, for example.
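To make the "an int covers several byte addresses" point concrete, here is a small sketch that copies out the individual bytes of an int. `bytes_of` is a made-up helper, and the order of the bytes you get depends on the machine's endianness:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copies the object representation of `value` into a vector,
// one entry per addressable byte of the int.
std::vector<unsigned char> bytes_of(int value) {
    std::vector<unsigned char> bytes(sizeof value);
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}
```

On a machine with a 4-byte int the vector has four entries, one for each of the four addresses the int occupies.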
As for your example:
struct st
{
int a;
char c;
};
sizeof(st) returns 8 not because all structs have a size divisible by 4, but because of alignment. For the CPU to efficiently read an integer, it must be located at an address that is divisible by the size of the integer (4 bytes). So an int can be placed at address 8, 12 or 16, but not at address 11.
A char only requires its address to be divisible by the size of a char (1), so it can be placed on any address.
So in theory, the compiler could have given your struct a size of 5 bytes... Except that this wouldn't work if you created an array of st objects.
In an array, each object is placed immediately after the previous one, with no padding. So if the first object in the array is placed at an address divisible by 4, then the next object would be placed at a 5 bytes higher address, which would not be divisible by 4, and so the second struct in the array would not be properly aligned.
To solve this, the compiler inserts padding inside the struct, so its size becomes a multiple of its alignment requirement.
Not because it is impossible to create objects that don't have a size that is a multiple of 4, but because one of the members of your st struct requires 4-byte alignment, and so every time the compiler places an int in memory, it has to make sure it is placed at an address that is divisible by 4.
If you create a struct of two chars, it won't get a size of 4. It will usually get a size of 2, because when it contains only chars, the object can be placed at any address, and so alignment is not an issue.
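The array argument can be checked with a short sketch like this one. `St`, `TwoChars` and `is_aligned` are names I made up, and converting a pointer to an integer to test alignment relies on the usual flat address mapping, which is implementation-defined:

```cpp
#include <cstddef>
#include <cstdint>

struct St { int a; char c; };
struct TwoChars { char x; char y; };

// The size of a type is always a multiple of its alignment, which is
// what keeps every element of an array properly aligned.
static_assert(sizeof(St) % alignof(St) == 0, "size is a multiple of alignment");

// True if `p` sits at an address suitable for an object of type T.
template <typename T>
bool is_aligned(const T* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```

For an array `St arr[3]`, `is_aligned(&arr[i].a)` should hold for every element, which would not be possible if `sizeof(St)` were 5.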
sizeof returns the value in bytes. You were talking about bits. 32-bit architectures are word aligned and byte addressed. It is irrelevant how the architecture stores a char; as far as the compiler is concerned, chars are referenced 1 byte at a time.
This is why sizeof(char) is 1.
ints are 32 bit, hence sizeof(int)= 4, doubles are 64 bit, hence sizeof(double) = 8, etc.
Because of optimisation padding is added so size of an object is 1, 2 or n*4 bytes (or something like that, talking about x86). That's why there is added padding to 5-byte object and to 1-byte not. Single char doesn't have to be padded, it can be allocated on 1 byte, we can store it on space allocated with malloc(1). st cannot be stored on space allocated with malloc(5) because when st struct is being copied whole 8 bytes are being copied.
It works the same way as using half a piece of paper. You use one part for a char and the other part for something else. The compiler will hide this from you since loading and storing a char into a 32bit processor register depends on the processor.
Some processors have instructions to load and store only parts of the 32bit others have to use binary operations to extract the value of a char.
Addressing a char works as it is AFAIR by definition the smallest addressable memory. On a 32bit system pointers to two different ints will be at least 4 address points apart, char addresses will be only 1 apart.
What is the relation between word length, character size, integer size, and byte in C++?
The standard requires that certain types have minimum sizes (short is at least 16 bits, int is at least 16 bits, etc), and that some groups of type are ordered (sizeof(int) >= sizeof(short) >= sizeof(char)).
In C++ a char must be large enough to hold any character in the implementation's basic character set.
int has the "natural size suggested by the architecture of the execution environment". Note that this means that an int does not need to be at least 32 bits in size. Implementations where int is 16 bits are common (think embedded or MS-DOS).
The following are taken from various parts of the C++98 and C99 standards:
long int has to be at least as large as int
int has to be at least as large as short
short has to be at least as large as char
Note that they could all be the same size.
Also (assuming a two's complement implementation):
long int has to be at least 32-bits
int has to be at least 16-bits
short has to be at least 16-bits
char has to be at least 8 bits
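These guarantees can be written down as compile-time checks. This is only a sketch; the checks are expected to pass on any conforming implementation, since they restate the minimums listed above:

```cpp
#include <climits>
#include <limits>

static_assert(CHAR_BIT >= 8, "char has at least 8 bits");
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(short) * CHAR_BIT >= 16, "short has at least 16 bits");
static_assert(sizeof(int) * CHAR_BIT >= 16, "int has at least 16 bits");
static_assert(sizeof(long) * CHAR_BIT >= 32, "long has at least 32 bits");
static_assert(sizeof(char) <= sizeof(short) && sizeof(short) <= sizeof(int) &&
                  sizeof(int) <= sizeof(long),
              "each type is at least as large as the previous one");
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "unsigned char uses every bit of a byte for its value");
```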
The Standard doesn't know this "word" thingy used by processors. But it says the type "int" should have the natural size for a execution environment. But even for 64 bit environments, int is usually only 32 bits. So "word" in Standard terms has pretty much no common meaning (except for the common English "word" of course).
Character size is the size of a character. Depends on what character you talk about. Character types are char, unsigned char and signed char. Also wchar_t is used to store characters that can have any size (determined by the implementation - but must use one of the integer types as its underlying type. Much like enumerations), while char/signed char or unsigned char has to have one byte. That means that one byte has as much bits as one char has. If an implementation says one object of type char has 16 bits, then a byte has 16 bits too.
Now a byte is the size that one char occupies. It's a unit, not some specific type. There is not much more about it, just that it is the unit that you can access memory. I.e you do not have pointer access to bit-fields, but you have access to units starting at one byte.
"Integer size" now is pretty wide. What do you mean? All of bool, char, short, int, long and their unsigned counterparts are integers. Their range is what I would call "integer size" and it is documented in the C standard - taken over by the C++ Standard. For signed char the range is from -127 <-> 127, for short and int it is the same and is -2^15+1 <-> 2^15-1 and for long it is -2^31+1 <-> 2^31-1. Their unsigned counterparts range from 0 up to 2^8-1, 2^16-1 and 2^32-1 respectively. Those are however minimal sizes. That is, an int may not have maximal size 2^14 on any platform, because that is less than 2^15-1 of course. It follows for those values that a minimum of bits is required. For char that is 8, for short/int that is 16 and for long that is 32. Two's-complement representation for negative numbers is not required, which is why the negative value is not -128 instead of -127 for example for signed char.
Standard C++ doesn't have a datatype called word or byte. The rest are well defined as ranges. The base is a char, which has CHAR_BIT bits. The most commonly used value of CHAR_BIT is 8.
sizeof( char ) == 1 ( one byte ) (by definition, in both C and C++)
sizeof( int ) >= sizeof( char )
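word - not a C++ type; in computer architecture it usually means the machine's natural unit of data (2 bytes in x86 terminology, but it varies by architecture)
Kind of depends on what you mean by relation. The size of numeric types is generally a multiple of the machine word size. A byte is 8 bits on virtually every current platform, although the standard only guarantees at least 8 (see CHAR_BIT). A char occupies exactly one byte; whether plain char is signed or unsigned is implementation-defined (check your ARM for details).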
The general rule is, don't make any assumptions about the actual size of data types. The standard specifies relationships between the types such as a "long" integer will be either the same size or larger than an "int". Individual implementations of the language will pick specific sizes for the types that are convenient for them. For example, a compiler for a 64-bit processor will choose different sizes than a compiler for a 32-bit processor.
You can use the sizeof() operator to examine the specific sizes for the compiler you are using (on the specific target architecture).
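For example, a small sketch that collects the sizes for the compiler and target you are building with (`describe_sizes` is just an illustrative name):

```cpp
#include <climits>
#include <sstream>
#include <string>

// Builds a one-line summary of the sizes chosen by this implementation.
// All sizes are in bytes, where a byte is CHAR_BIT bits.
std::string describe_sizes() {
    std::ostringstream out;
    out << "CHAR_BIT=" << CHAR_BIT
        << " char=" << sizeof(char)
        << " short=" << sizeof(short)
        << " int=" << sizeof(int)
        << " long=" << sizeof(long)
        << " pointer=" << sizeof(void*);
    return out.str();
}
```

Printing the result with `std::cout << describe_sizes() << '\n';` shows what your implementation picked. Only `char=1` is the same everywhere.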