What is the difference between Char and Word8? - sml

The Standard ML Basis Library has a Char structure and a Word8 structure. A Char.char always represents 8 bits and a Word8.word always represents 8 bits. There is a one-to-one mapping between the values of these two types, so why are both needed? It seems rather unnecessary to have to explicitly convert between these two essentially equivalent types by using Byte.byteToChar and Byte.charToByte. A similar situation exists for String.string and Word8Vector.vector.

Related

Parsing long strings of binary data with C++

I am looking for ideas on how to parse long binary data, for example: "10100011111000111001"
bits 0-4 are the id
bits 5-15 are the data
etc., etc.
The binary data structure can change, so I need to build a kind of database that stores how to parse each string.
Illustration (it could be ~200 bits):
Ideas on how to implement it?
Thanks
Edit
What am I missing here?
#include <cstdint>
#include <iostream>
using std::cout; using std::endl;

struct Bitfield {
    uint16_t a : 10, b : 6;
};

void diag() {
    uint16_t t = 61455;               // 0xF00F
    struct Bitfield test = {t};       // aggregate init: only 'a' has an initializer
    cout << "a: " << test.a << endl;
    cout << "b: " << test.b << endl;
}
and the output is:
a: 15
b: 0
Options available
To manage a large structured set of bits, you have the following options:
C++ bit-fields: you define a structure with bit-field members. You can have as many members as you want, provided that no single one has more bits than an unsigned long long.
It's super easy to use; the compiler manages the access to bits or groups of bits for you. The major inconvenience is that the bit layout is implementation-dependent, so this is not an option for writing portable code that exchanges data in a binary format.
Container of unsigned integral type: you define an array large enough to hold all the bits, and access bits or groups of bits using a combination of bitwise operations (see the sketch after this list).
It requires being comfortable with bitwise operations and is not practical when groups of bits are split across consecutive elements. To exchange data in binary format with the outside world in a portable way, you'd need to either take care of the differences between big- and little-endian architectures or use arrays of uint8_t.
std::vector<bool>: gives you total flexibility to manage your bits. The main constraint is that you need to address each bit separately. Moreover, there's no data() member that could give direct access to the binary data.
std::bitset: is very similar to vector<bool> for accessing bits. It has a fixed size at compile time, but offers useful features such as reading and writing binary in ASCII from strings or streams, converting from binary values of integral types, and logical operations on the full bitset.
A combination of these techniques
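As a minimal sketch of the container-of-integers option above, assuming a hypothetical 16-bit layout where the 5 most significant bits are the id and the remaining 11 bits are the data:

#include <cstdint>
#include <iostream>

int main() {
    const std::uint8_t raw[2] = {0xA3, 0xE3};   // the bit string 1010001111100011
    std::uint16_t word = static_cast<std::uint16_t>((raw[0] << 8) | raw[1]); // big-endian assembly
    unsigned id   = word >> 11;                 // top 5 bits
    unsigned data = word & 0x07FF;              // remaining 11 bits
    std::cout << "id: " << id << ", data: " << data << '\n';
}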
Make your choice
To communicate with the outside world in a portable way, the easiest approach is to use bitsets. Bitsets offer easy input/output/string conversion in a format using ASCII '0' or '1' (or any substitutes thereof):
bitset<msg_header_size> bh,bh2;
bitset<msg_body_size> bb,bb2;
cin>>bh>>bb; // reads a string of ASCII 0s and 1s
cout<<bh<<"-"<<bb<<endl<<endl; // writes a string of ASCII 0s and 1s
You can also convert from/to binary data (but only from/to a single integral value, large enough for the bitset size):
bitset<8> b(static_cast<uint8_t>(c));
cout<<b<<endl;
cout<<b.to_ulong()<<endl;
For reading/writing large sets, you'd need to read small bitsets and use logical operators to aggregate them into a larger bitset. If this seems time consuming, it's in fact very close to what you'd do with containers of integrals, but without having to care about byte boundaries.
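For instance, a minimal sketch of that aggregation, assuming the first 8-bit chunk read holds the most significant bits:

#include <bitset>
#include <iostream>
using namespace std;

int main() {
    bitset<8> hi, lo;
    cin >> hi >> lo;                               // read two 8-bit chunks
    bitset<16> whole = (bitset<16>(hi.to_ulong()) << 8)
                     |  bitset<16>(lo.to_ulong()); // concatenate into 16 bits
    cout << whole << endl;
}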
In your case, with a fixed-size header and a maximum size, the bitset seems to be a good choice for exchanging binary data with the external world (be careful, however, because the variable part is right-justified).
For working with the data content, it's easy to access a specific bit, but you have to use some logical operations (shift, and) to access groups of bits. Moreover, if you want readable and maintainable code, it's better to abstract the bit layout.
Conclusion:
I would therefore strongly advise using a bit-field structure internally for working with the data, which keeps a memory footprint comparable to the original data, and at the same time using bitsets just to convert from/to this structure for the purpose of external data exchanges.
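A minimal sketch of that combination, assuming a hypothetical 16-bit header whose 5 most significant bits are the id (since the bit-field layout is implementation-dependent, the conversion goes through explicit shifts rather than memcpy):

#include <bitset>
#include <cstdint>
#include <iostream>
using namespace std;

struct Header {               // internal working representation
    uint16_t id   : 5;
    uint16_t data : 11;
};

int main() {
    bitset<16> wire("1010001111100011");            // header as received externally
    uint16_t raw = static_cast<uint16_t>(wire.to_ulong());

    Header h;
    h.id   = raw >> 11;       // top 5 bits
    h.data = raw & 0x07FF;    // low 11 bits
    cout << h.id << ' ' << h.data << endl;          // 20 995
}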
The "best way" depends on the details of the problem.
If the whole number fits into the largest integer type available (usually long long), convert the string into an integer first (for example with the stoi/stol/stoll functions, assuming C++11 is available). Then use bit shifting combined with bitwise AND (&) to extract the sections of the value you are interested in.
If the whole number does not fit into the largest integer type available, chop it up as a string (using the substr function) and then convert the substrings into integers one by one.
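A minimal sketch of the first case, assuming the 20-bit example string from the question with a 5-bit id in front and the remaining 15 bits treated as data:

#include <iostream>
#include <string>

int main() {
    std::string s = "10100011111000111001";
    unsigned long long v = std::stoull(s, nullptr, 2);  // parse as base 2
    unsigned id   = static_cast<unsigned>(v >> 15);     // leading 5 bits
    unsigned data = static_cast<unsigned>(v & 0x7FFF);  // remaining 15 bits
    std::cout << "id: " << id << ", data: " << data << '\n';
}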

C++11 data types confusion

I am trying to write a solid summary for C++ datatypes, but I have some confusion about the new datatypes.
As I understood from my readings about C++ data types, char16_t and char32_t are fundamental data types and part of the core language since C++11.
It is mentioned that they are distinct data types.
Q1: What exactly does "distinct" mean here?
Q2: Why was the intXX_t type family, such as int32_t, chosen not to be fundamental data types? And how can they be beneficial when chosen instead of int?
To answer the second part of the question:
The fixed-size integer types are inherited from C, where they are typedefs. It was decided to keep them as typedefs for compatibility. Note that the C language doesn't have overloaded functions, so the need for "distinct" types is lower there.
One reason for using int32_t is that you need one or more of its required properties:
Signed integer type with width of exactly 32 bits
with no padding bits and using 2's complement for negative values.
If you use an int it might, for example, be 36 bits and use 1's complement.
However, if you don't have very specific requirements, using a normal int will work fine. One advantage is that an int will be available on all systems, while a 36-bit machine (or a 24-bit embedded processor) might not have any int32_t at all.
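A small sketch of the difference in guarantees (what it prints depends on the platform; the static_assert simply will not compile where int32_t is absent):

#include <climits>
#include <cstdint>
#include <iostream>

int main() {
    // int32_t, where it exists, is exactly 32 bits with no padding;
    // plain int is only guaranteed to be at least 16 bits wide.
    static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "exactly 32 bits");
    std::cout << "int is " << sizeof(int) * CHAR_BIT << " bits here\n";
}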
The charXX_t types were introduced in N2249. They are created as a distinct type from uintXX_t to allow overloading:
Define char16_t to be a distinct new type, that has the same size and representation as uint_least16_t. Likewise, define char32_t to be a distinct new type, that has the same size and representation as uint_least32_t.
[N1040 defined char16_t and char32_t as typedefs to uint_least16_t and uint_least32_t, which make overloading on these characters impossible.]
To answer your Q1:
Distinct type means std::is_same<char16_t,uint_least16_t>::value is equal to false.
So overloaded functions are possible.
(There is no difference in size, signedness, and alignment, though.)
Another way to express "distinct types" is that you can create two overloads of a function, one for each type. For instance:
typedef int Int;
void f(int) { impl_1; }
void f(Int) { impl_2; }
If you try to compile a code snippet containing both functions, the compiler will complain about a redefinition (an ODR violation): you are trying to define the same function twice, since their parameter types are the same. That's because a typedef doesn't create a new type, just an alias.
However, when the types are truly distinct, both versions will be seen as two different overloads by the compiler.
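For example, a minimal sketch showing that the corresponding char16_t overload is allowed precisely because char16_t is a distinct type rather than a typedef:

#include <cstdint>
#include <iostream>

void f(char16_t)            { std::cout << "char16_t overload\n"; }
void f(std::uint_least16_t) { std::cout << "uint_least16_t overload\n"; } // OK: distinct types

int main() {
    f(u'x');                      // picks the char16_t overload
    f(std::uint_least16_t{120});  // picks the integer overload
}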

C++ How can I assign a datatype to a binary sequence?

I have a binary sequence. This sequence represents an arbitrary precision integer but as far as the computer is concerned, it's just a binary sequence. I'm working in C++, with the multiprecision library. I only know how to assign values to the arbitrary precision datatype:
mp::cpp_int A = 51684861532215151;
How can I take a binary sequence and directly assign it to the datatype mp::cpp_int? I realize I can go through each bit and add 2^bit wherever I hit a 1, but I'm trying to avoid doing this.
REPLY:
Galik: My compiler (visual studio 2013) isn't liking that for some reason.
mp::cpp_int A = 0b0010011;
It keeps putting the red squiggly after the first 0.
Also yup, boost multiprecision.
How to construct a particular type of big integer from a sequence of raw bits depends on that particular type, on the various constructors/methods that it offers for the purpose and/or what operator overloads are available.
The only generic mechanisms involve constructing a big integer with one word's worth of low-order bits (since such a constructor is almost universally available) and then using arithmetic to push the bits in, one bit at a time or one word's worth of bits at a time. This reduces the dependence on particulars of the given type to a minimum and it may work across a wide range of types completely unchanged, but it is rather cumbersome and not very efficient.
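As a minimal sketch of that generic mechanism, assuming Boost.Multiprecision's cpp_int and bytes arriving most significant first (from_bytes is just an illustrative helper name):

#include <boost/multiprecision/cpp_int.hpp>
#include <cstdint>
#include <vector>

namespace mp = boost::multiprecision;

mp::cpp_int from_bytes(const std::vector<std::uint8_t>& bytes) {
    mp::cpp_int result = 0;
    for (std::uint8_t b : bytes) {
        result <<= 8;   // make room for the next byte
        result |= b;    // push its bits in
    }
    return result;
}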
The particular type of big integer shown in your code snippet looks like boost::multiprecision::cpp_int, and Olaf Dietsche has already provided a link to its main documentation page. Conversion to and from raw binary formats for this type is documented on the page Importing and Exporting Data to and from cpp_int and cpp_bin_float, including code examples such as initialising a cpp_int from a vector<byte>.
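Based on that documentation page, the import itself is roughly as follows (a sketch; see the page for options such as chunk size and bit order):

#include <boost/multiprecision/cpp_int.hpp>
#include <cstdint>
#include <vector>

int main() {
    namespace mp = boost::multiprecision;
    std::vector<std::uint8_t> bytes = {0x01, 0x23, 0x45, 0x67};

    mp::cpp_int A;
    // Interpret the byte sequence as one unsigned number, most significant byte first.
    mp::import_bits(A, bytes.begin(), bytes.end());
}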

Difference between object and value representation by example

N3797::3.9/4 [basic.types]:
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T). The value representation of an object is the set of bits that hold the value of type T. For trivially copyable types, the value representation is a set of bits in the object representation that determines a value, which is one discrete element of an implementation-defined set of values.
N3797::3.9.1 [basic.fundamental] says:
For narrow character types, all bits of the object representation participate in the value representation.
Consider the following struct:
struct A
{
    char a;
    int b;
};
I think for A not all bits of the object representation participate in the value representation because of the padding added by the implementation. But what about other fundamental types?
The Standard says:
N3797::3.9.1 [basic.fundamental]
For narrow character types, all bits of the object representation participate in the value representation.
These requirements do not hold for other types.
I can't imagine why it doesn't hold for, say, int or long. What's the reason? Could you clarify?
An example might be the Unisys mainframes, where an int has 48 bits, but only 40 participate in the value representation (and INT_MAX is 2^39-1); the others must be 0. I imagine that any machine with a tagged architecture would have similar issues.
EDIT:
Just some further information: the Unisys mainframes are probably the only remaining architectures which are really exotic: the Unisys Libra (ex-Burroughs) machines have a 48-bit word, use signed magnitude for integers, and have a tagged architecture, where the data itself contains information concerning its type. The Unisys Dorado machines are the ex-Univac line: 36-bit one's complement (but with no bits reserved for tagging) and 9-bit chars.
From what I understand, however, Unisys is phasing them out (or has phased them out in the last year) in favor of Intel-based systems. Once they disappear, pretty much all systems will be 2's complement, 32 or 64 bits, and all but the IBM mainframes will use IEEE floating point (and IBM is moving or has moved in that direction as well). So there won't be any motivation for the standard to continue with special wording to support them; in the end, in a couple of years at least, C/C++ could probably follow the Java path and impose a representation on all of its basic data types.
This is probably meant to give the compiler headroom for optimizations on some platforms.
Consider for example a 64-bit platform where handling non-64-bit values incurs a large penalty; then it would make sense to have e.g. short use only 16 bits (value representation) but still occupy 64 bits of storage (object representation).
A similar rationale applies to the fastest minimum-width integer types mandated by <cstdint>. Sometimes larger types are not slower, but faster to use.
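A small sketch of how the two sizes can be inspected (on mainstream platforms both numbers come out as 16, but the hypothetical 64-bit platform above could report 64 storage bits for only 16 value bits):

#include <climits>
#include <iostream>
#include <limits>

int main() {
    // Object representation: the storage the object occupies.
    std::cout << "short storage bits: " << sizeof(short) * CHAR_BIT << '\n';
    // Value representation: the bits that actually carry the value
    // (digits counts the non-sign value bits; +1 adds the sign bit).
    std::cout << "short value bits:   " << std::numeric_limits<short>::digits + 1 << '\n';
}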
As far as I understand, at least one case for this is dealing with trap representations, usually on exotic architectures. This issue is covered in N2631: Resolving the difference between C and C++ with regards to object representation of integers. It is very long, but I will quote some sections (the author is James Kanze, so if we are lucky maybe he will drop by and comment further), which say (emphasis mine):
In recent discussions in comp.lang.c++, it became clear that C and C++ have different requirements concerning the object representation of integers, and that at least one real implementation of C does not meet the C++ requirements. The purpose of this paper is to suggest wording to align the C++ standard with C.
It should be noted that the issue only concerns some fairly “exotic” hardware. In this regard, it raises a somewhat larger issue
and:
If C compatibility is desired, it seems to me that the simplest and surest way of attaining this is by incorporating the exact words from the C standard, in place of the current wording. I thus propose that we adopt the wording from the C standard, as follows
and:
Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation.
and:
For signed integer types [...] Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for one's complement), is a trap representation or a normal value. In the case of sign and magnitude and one's complement, if this representation is a normal value it is called a negative zero.

How the bits are interpreted in memory

I'm studying the C++ programming language and I have a problem with my book (Programming: Principles and Practice Using C++). What my book says is:
the meaning of the bit in memory is completely dependent on the type used to access it. Think of it this way: computer memory doesn't know about our types, it's just memory. The bits of memory get meaning only when we decide how that memory is to be interpreted.
Can you explain to me what this means? Please do it in a simple way, because I'm only a beginner who has been learning C++ for 3 weeks.
The computer's memory only stores bits and bytes - how those values are interpreted is up to the programmer (and his programming language).
Consider, e.g., the value 01000001. If you interpret it as a number, it's 65 (e.g., in the short datatype). If you interpret it as an ASCII character (e.g., in the char datatype), it's the character 'A'.
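A small sketch of the same bit pattern viewed through two different types (just an illustration of the point, using casts to switch the interpretation):

#include <iostream>

int main() {
    unsigned char bits = 0x41;                      // the bit pattern 01000001
    std::cout << static_cast<int>(bits)  << '\n';   // read as a number: 65
    std::cout << static_cast<char>(bits) << '\n';   // read as an ASCII character: A
}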
A simple example: take the byte 01000001. It contains (as all bytes do) 8 bits. Two bits are set to 1: the second and the last. By convention they are interpreted as different powers of 2, in this case 2^6 and 2^0, so the second bit contributes the decimal value 64 and the last bit contributes 1. The byte therefore has the decimal value 64 + 1 = 65. There is also an interpretation convention for the byte 01000001 itself: it can be the number 65 or the letter 'A' (according to the ASCII table). Or the byte can be part of data that has a larger representation than just one byte.
As a few people have noted, bits are just a way of representing information. We need to have a way to interpret the bits to derive meaning from them. It's kind of like a dictionary. There are many different "dictionaries" out there for many different types of data: ASCII, two's complement integers, and so forth.
C++ variables must have a type, so each is assigned to a category, like int, double, float, char, string and so forth. The data type tells the compiler how much space to allocate in memory for your variable, how to assign it a value, and how to modify it.