System where 1 byte != 8 bit? [duplicate]


This question already has answers here:
What platforms have something other than 8-bit char?
(14 answers)
Closed 8 years ago.
All the time I read sentences like
don't rely on 1 byte being 8 bit in size
use CHAR_BIT instead of 8 as a constant to convert between bits and bytes
et cetera. What real life systems are there today, where this holds true?
(I'm not sure if there are differences between C and C++ regarding this, or if it's actually language agnostic. Please retag if necessary.)

On older machines, codes smaller than 8 bits were fairly common, but most of those have been dead and gone for years now.
C and C++ have mandated a minimum of 8 bits for char, at least as far back as the C89 standard. [Edit: For example, C90, §5.2.4.2.1 requires CHAR_BIT >= 8 and UCHAR_MAX >= 255. C89 uses a different section number (I believe that would be §2.2.4.2.1) but identical content]. They treat "char" and "byte" as essentially synonymous [Edit: for example, CHAR_BIT is described as: "number of bits for the smallest object that is not a bitfield (byte)".]
There are, however, current machines (mostly DSPs) where the smallest type is larger than 8 bits -- a minimum of 12, 14, or even 16 bits is fairly common. Windows CE does roughly the same: its smallest type (at least with Microsoft's compiler) is 16 bits. They do not, however, treat a char as 16 bits -- instead they take the (non-conforming) approach of simply not supporting a type named char at all.

TODAY, in the world of C++ on x86 processors, it is pretty safe to rely on one byte being 8 bits. Processors where the word size is not a power of 2 (8, 16, 32, 64) are very uncommon.
IT WAS NOT ALWAYS SO.
The Control Data 6600 (and its brothers) Central Processor used a 60-bit word, and could only address a word at a time. In one sense, a "byte" on a CDC 6600 was 60 bits.
The DEC-10 byte pointer hardware worked with arbitrary-size bytes. The byte pointer included the byte size in bits. I don't remember whether bytes could span word boundaries; I think they couldn't, which meant that you'd have a few waste bits per word if the byte size was not 3, 4, 9, or 18 bits. (The DEC-10 used a 36-bit word.)

Unless you're writing code that could be useful on a DSP, you're completely entitled to assume bytes are 8 bits. All the world may not be a VAX (or an Intel), but all the world has to communicate, share data, establish common protocols, and so on. We live in the internet age built on protocols built on octets, and any C implementation where bytes are not octets is going to have a really hard time using those protocols.
It's also worth noting that both POSIX and Windows have (and mandate) 8-bit bytes. That covers 100% of interesting non-embedded machines, and these days a large portion of non-DSP embedded systems as well.

From Wikipedia:
The size of a byte was at first
selected to be a multiple of existing
teletypewriter codes, particularly the
6-bit codes used by the U.S. Army
(Fieldata) and Navy. In 1963, to end
the use of incompatible teleprinter
codes by different branches of the
U.S. government, ASCII, a 7-bit code,
was adopted as a Federal Information
Processing Standard, making 6-bit
bytes commercially obsolete. In the
early 1960s, AT&T introduced digital
telephony first on long-distance trunk
lines. These used the 8-bit µ-law
encoding. This large investment
promised to reduce transmission costs
for 8-bit data. The use of 8-bit codes
for digital telephony also caused
8-bit data "octets" to be adopted as
the basic data unit of the early
Internet.

As an average programmer on mainstream platforms, you do not need to worry too much about one byte not being 8 bit. However, I'd still use the CHAR_BIT constant in my code and assert (or better static_assert) any locations where you rely on 8 bit bytes. That should put you on the safe side.
(I am not aware of any relevant platform where it doesn't hold true).
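A minimal sketch of that advice, assuming a C++11 compiler. The helper names `bit_width_of` and `require_octets` are made up for illustration:

```cpp
#include <cassert>   // assert
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Compile-time guard: the build fails on any implementation whose bytes
// are not octets, instead of misbehaving at run time.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Run-time variant of the same check, for code that cannot use static_assert.
inline void require_octets() {
    assert(CHAR_BIT == 8);
}

// Where the 8-bit assumption is not needed, derive bit counts from CHAR_BIT
// rather than hard-coding 8.
template <typename T>
constexpr std::size_t bit_width_of() {
    return sizeof(T) * CHAR_BIT;
}
```

With the guard in place, `bit_width_of<unsigned char>()` is 8 on every platform where the file compiles at all.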

Firstly, the number of bits in char does not formally depend on the "system" or on "machine", even though this dependency is usually implied by common sense. The number of bits in char depends only on the implementation (i.e. on the compiler). There's no problem implementing a compiler that will have more than 8 bits in char for any "ordinary" system or machine.
Secondly, there are several embedded platforms where sizeof(char) == sizeof(short) == sizeof(int) , each having 16 bits (I don't remember the exact names of these platforms). Also, the well-known Cray machines had similar properties with all these types having 32 bits in them.

I do a lot of embedded work and am currently working on DSP code with a CHAR_BIT of 16.

Historically, there have been a bunch of odd architectures whose native word sizes were not multiples of 8. If you ever come across any of these today, let me know.
The first commercial CPU by Intel was the Intel 4004 (4-bit)
PDP-8 (12-bit)
The size of the byte has historically
been hardware dependent and no
definitive standards exist that
mandate the size.
It might just be a good thing to keep in mind if you're doing lots of embedded stuff.

Adding one more as a reference, from Wikipedia entry on HP Saturn:
The Saturn architecture is nibble-based; that is, the core unit of data is 4 bits, which can hold one binary-coded decimal (BCD) digit.

Related

Why and how are C++ bitfields non-portable?

I've come across many comments on various questions regarding bitfields asserting that bitfields are non-portable, but I've never been able to find a source explaining precisely why.
At face value, I would have presumed all bitfields merely compile to variations of the same bitshifting code, but evidently there must be more to it than that or there would not be such vehement dislike for them.
So my question is what is it that makes bitfields non-portable?

Bit fields are non-portable in the same sense as integers are non-portable. You can use integers to write a portable program, but you cannot expect to send a binary representation of int as is to a remote machine and expect it to interpret the data correctly.
This is because 1. word lengths of processors differ, and because of that, the sizes of integer types differ (1.1 byte length can differ too, but that is these days rare outside embedded systems). And because 2. the byte endianness differs across processors.
These problems are easy to overcome. Native endianness can be easily converted to agreed upon endianness (big endian is de facto standard for network communication), and the size can be inspected at compile time and fixed length integer types are available these days. Therefore integers can be used to communicate across network, as long as these details are taken care of.
Bit fields build upon regular integer types, so they have the same problems with endianness and integer sizes. But they have even more implementation specified behaviour.
Everything about the actual allocation details of bit fields within the class object
For example, on some platforms, bit fields don't straddle bytes, on others they do
Also, on some platforms, bit fields are packed left-to-right, on others right-to-left
Whether char, short, int, long, and long long bit fields are signed or unsigned (when not declared so explicitly).
Unlike endianness, it is not trivial to convert "everything about the actual allocation details" to a canonical form.
Also, while endianness is cpu architecture specific, the bit field details are specific to the compiler implementer. So, bit fields are not portable for communication even between separate processes within the same computer, unless we can guarantee that they were compiled using the same (or binary compatible) compiler.
TL;DR bit fields are not a portable way to communicate between computers. Integers aren't either, but their non-portability is easy to work around.

Bit fields are non-portable in the sense that the ordering of the bit is unspecified. So the bit at index 0 with one compiler could very well be the last bit with another compiler.
This prevents the use of bit fields in applications like toggling bits in memory-mapped hardware registers.
However, you will see hardware vendors use bitfields in the code they release (Microchip, for instance). Usually that is because they also release the compiler that goes with it, or they target a single compiler. In Microchip's case, for instance, the licence for their source code requires you to use their own compiler (for low-end 8-bit devices).
The link pointed to by @Pharap contains an extract of the (C++14) standard related to this unspecified ordering: is-there-a-portable-alternative-to-c-bitfields
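The usual portable alternative is to do the shifting and masking by hand, so that the position of each field is fixed by the code rather than by the compiler. A minimal sketch, assuming `std::uint32_t` is available on the target; the 3-bit "mode" field at bits 4..6 and the helper names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout for illustration: a 3-bit "mode" field at bits 4..6.
constexpr unsigned kModeShift = 4;
constexpr std::uint32_t kModeMask = 0x7u << kModeShift;

// Extract the field from a register value.
inline std::uint32_t get_mode(std::uint32_t reg) {
    return (reg & kModeMask) >> kModeShift;
}

// Return a copy of the register value with the field replaced.
inline std::uint32_t set_mode(std::uint32_t reg, std::uint32_t mode) {
    assert(mode <= 0x7u);  // value must fit in the 3-bit field
    return (reg & ~kModeMask) | ((mode << kModeShift) & kModeMask);
}
```

Bit 0 here always means the least significant bit of the integer value, whichever compiler builds the code.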

What does performing a byteswap mean? [duplicate]

This question already has answers here:
Difference between Big Endian and little Endian Byte order
(5 answers)
Closed 8 years ago.
There are dozens of places discussing how to do different kinds of byteswapping, but I couldn't easily find a place explaining the concept and some typical examples of how the need to byteswap occurs.
So here is the question: what is byteswapping and why/when do I need to do it?
If examples are a good way to explain, I would be happy if they were in standard C++. Book references are appreciated, preferably to Lippman's or Prata's C++ primers since those are the ones I have available.

If I understand your question correctly, you're talking about big endian to little endian conversion and back.
That occurs because some microprocessors use little endian format to refer to memory, while others use big endian format.
Byte streams on the internet are, for example, big endian, while your Intel CPU works in little endian format.
Hence to translate from network to CPU or CPU to network, we need a conversion mechanism called byteswapping.
OSes provide ntohl() and htonl() functions for doing this.

As mentioned in the comments, byteswapping is the process of changing a value's endianness from one to another. Let's say you have a value in your memory (left address is lowest):
DE AD BE EF <- big endian
This value consists of 4 bytes; in hexadecimal representation, two digits are one byte.
If we assume that the above value is encoded in big endian, then the highest-order byte is the very first byte in memory, here the DE. The Intel x86 processor architecture is little endian, which means the same value would look like this in memory:
EF BE AD DE <- little endian
These two byte sequences represent the same value, but with different endianness.
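Reversing the byte order of a 32-bit value can be sketched with shifts and masks, as below. This assumes `std::uint32_t` exists on the target; the name `byteswap32` is made up for illustration and is not a standard function:

```cpp
#include <cassert>
#include <cstdint>

// Reverse the byte order of a 32-bit value. Only shifts and masks are used,
// so the function behaves the same on big and little endian hosts.
inline std::uint32_t byteswap32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Applied to 0xDEADBEEF this yields 0xEFBEADDE, and applying it twice gives back the original value.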

c++ binary data layout guaranteed by the standard

This is purely a theoretical problem, nothing I have really found myself in, but it has piqued my curiosity and wanted to see if anyone has a better solution for it:
How do you portably guarantee that a specific file format / network
protocol or whatever conforms to a specific bit pattern.
Say we have a file format that uses a 64 bit header struct immediately followed by a variable length array of 32 bit structures:
Header: magic : 32 bit
count : 32 bit
Field : id : 16 bit
data : 16 bit
My first instinct would be to write something like:
struct Field
{
uint16_t id ;
uint16_t data ;
};
Except that our compiler may decide that padding is advisable and we end up with a 64 bit structure. So our next bet is:
using Field = uint16_t[2];
and work on that.
That is, unless someone has carefully read the standard and noticed that uint16_t is optional. At this point our next best friend is uint_least16_t, which is guaranteed to be at least 16 bits long, but for all we know could be 20 bits long in a 10 bit / char processor.
At this point, the only real solution I can come up with is some sort of bit stream, capable of reading and writing specific amounts of bits, and adaptable by std::numeric_limits.
So, is there someone out there who has very carefully read the standard and found the point I'm missing? Or is this the only real way of having a portable guarantee?
Notes:
- I've just realized that endianness would probably add another layer of complexity.
- I'm using the current working draft of the ISO standard (N3797).

How do you portably guarantee that a specific file format / network
protocol or whatever conforms to a specific bit pattern.
You can't. Not in C++, which was standardized against an abstract platform where little more than the existence of a "byte" that is made up of bits can be assumed. We can't even say for certain, in looking only at the Standard, how many bits are in a char. You can use bitfields for everything, as bits are indivisible, but then you'll have padding to contend with at the least.
Sometimes it is best to give up on the idea of absolute Standards conformance for the sake of conformance, and look to other means to get the job done efficiently and effectively. In this case, platform specifics in combination with almost absolute Standards conformance (aka, good programming practices) will set you free.
Every platform I work on regularly (linux & windows) provides a means to regulate the padding the compiler will actually apply. For network communications, under Linux & Windows I use:
#pragma pack (push, 1)
as a preface to all the data structures I'm going to send over the wire. Endianness is indeed another challenge, but one more or less easily dealt with using other resources provided by every platform: ntohl and the like.
Standards conformance is a laudable goal, and indeed in a code review I would reject most code that is non-conformant. The lack of conformance is really just a moniker for the rejection however; not the reason itself. The actual reason for the rejection is in large part difficulty in maintaining and porting non-conformant code when moving to another platform, or indeed even just upgrading the compiler on the same platform. Non-conformant code might compile and even appear to work, but it will very often fail in subtle and miserable ways when you least expect it, even after thorough testing.
The moral of the story is:
You should always write Standards-conformant code, except when you
shouldn't.
This really is just a re-imagining of Einstein's articulation of Occam's Razor:
Make everything as simple as possible, but no simpler.

If you want to ensure portability to everything standard-conforming, including platforms for which CHAR_BIT isn't 8, well, you've got your work cut out for you.
If you are comfortable limiting yourself to 98% of the computers you'll ever program, I recommend writing explicit serialization for anything that has to adhere to a particular wire-format. That includes breaking integers into bytes, etc.
Write appropriate abstractions around things and the code won't be too bad. Don't put shifts and masks everywhere. Encapsulate it.
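A minimal sketch of such explicit serialization for the 32-bit Field from the question, assuming a big-endian wire format of four octets. The helper names (`put_u16_be`, `get_u16_be`, `encode`, `decode`) are hypothetical:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// In-memory form. Its padding and byte order never reach the wire,
// because fields are written out one octet at a time.
struct Field {
    std::uint_least16_t id;
    std::uint_least16_t data;
};

// Write the low 16 bits of v as two octets, most significant first.
inline void put_u16_be(unsigned char* out, std::uint_least16_t v) {
    out[0] = static_cast<unsigned char>((v >> 8) & 0xFFu);
    out[1] = static_cast<unsigned char>(v & 0xFFu);
}

// Read two octets, most significant first, into a 16-bit value.
inline std::uint_least16_t get_u16_be(const unsigned char* in) {
    return static_cast<std::uint_least16_t>(((in[0] & 0xFFu) << 8) |
                                            (in[1] & 0xFFu));
}

inline std::array<unsigned char, 4> encode(const Field& f) {
    std::array<unsigned char, 4> buf{};
    put_u16_be(buf.data(), f.id);
    put_u16_be(buf.data() + 2, f.data);
    return buf;
}

inline Field decode(const std::array<unsigned char, 4>& buf) {
    return Field{get_u16_be(buf.data()), get_u16_be(buf.data() + 2)};
}
```

The shifts and masks live in two small helpers, so the rest of the program only ever sees `encode` and `decode`.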

I would use network types and network byte orders. See this link: http://www.beej.us/guide/bgnet/output/html/multipage/htonsman.html. The example uses uint16_t. You can write the values a field at a time to prevent padding.
Or, if you want to read and write the entire structure at once, see this link: C++ struct alignment question

Make the structure easy for the program to use.
Provide input methods that extract data from the input and write to the data members. This removes the issue of padding, alignment boundaries and endianness. Similarly with output.
For example, if your input data is 16-bits wide, but your platform is 32-bits wide, declare the structure using 32-bit fields. Copy the 16 bits from the input into the 32-bit fields.
Most programs read into a structure fewer times than they access the data members. Your program is not reading the input 100% of the time.

What is faster? two ints or __int64?

What is faster: (Performance)
__int64 x,y;
x=y;
or
int x,y,a,b;
x=a;
y=b;
?
Or they are equal?

__int64 is a non-standard compiler extension, so whilst it may or may not be faster, you don't want to use it if you want cross platform code. Instead, you should consider using #include <cstdint> and using uint64_t etc. These derive from the C99 standard which provides stdint.h and inttypes.h for fixed width integer arithmetic.
In terms of performance, it depends on the system you are on - x86_64 for example should not see any performance difference in adding 32- and 64- bit integers, since the add instruction can handle 32 or 64 bit registers.
However, if you're running code on a 32-bit platform or compiling for a 32-bit architecture, the addition of a 64-bit integers will actually require adding two 32-bit registers, as opposed to one. So if you don't need the extra space, it would be wasteful to allocate it.
I have no idea if compilers can or do optimise the types down to a smaller size if necessary. I expect not, but I'm no compiler engineer.

I hate these sort of questions.
1) If you don't know how to measure for yourself then you almost certainly don't need to know the answer.
2) On modern processors it's very hard to predict how fast something will be based on single instructions, it's far more important to understand your program's cache usage and overall performance than to worry about optimising silly little snippets of code that use an assignment. Let the compiler worry about that and spend your time improving the algorithm used or other things that will have a much bigger impact.
So in short, you probably can't tell and it probably doesn't matter. It's a silly question.

The compiler's optimizer will remove all the code in your example, so in that sense there is no difference. I think you want to know whether it is faster to move data 32 bits at a time or 64 bits at a time. If your data is aligned to 8 bytes and you are on a 64-bit machine, then it should be faster to move data 8 bytes at a time. There are several caveats to that, however. You may find that your compiler is already doing this optimization for you (you would have to look at the emitted assembly code to be sure), in which case you would see no difference. Also, consider using memcpy instead of rolling your own if you are moving a lot of data. If you are considering casting an array of 32-bit ints to 64-bit in order to copy faster or do some other operation faster (i.e. in half the number of instructions), be sure to Google for the strict aliasing rule.
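As a sketch of the memcpy route, the following reads two 32-bit values as one 64-bit value without casting pointers, which sidesteps the strict aliasing problem. It assumes `std::uint32_t` and `std::uint64_t` exist on the target; `load_pair` is a made-up name. Which half ends up in the high bits depends on the host's endianness:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy a pair of 32-bit values into one 64-bit value through memcpy
// instead of reinterpreting the array through a uint64_t pointer.
inline std::uint64_t load_pair(const std::uint32_t (&pair)[2]) {
    std::uint64_t out;
    static_assert(sizeof out == sizeof pair, "sizes must match");
    std::memcpy(&out, pair, sizeof out);
    return out;
}
```

Compilers commonly turn a fixed-size memcpy like this into a single load, but checking the emitted assembly is the only way to be sure for a given build.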

__int64 should be faster on most platforms, but be careful - some of the architectures require alignment to 8 for this to take effect, and some would even crash your app if it's not aligned.

Exotic architectures the standards committees care about

I know that the C and C++ standards leave many aspects of the language implementation-defined just because if there was an architecture with other characteristics, a standard-conforming compiler for that architecture would need to emulate those parts of the language, resulting in inefficient machine code.
Surely, 40 years ago every computer had its own unique specification. However, I don't know of any architectures used today where:
CHAR_BIT != 8
signed is not two's complement (I heard Java had problems with this one).
Floating point is not IEEE 754 compliant (Edit: I meant "not in IEEE 754 binary encoding").
The reason I'm asking is that I often explain to people that it's good that C++ doesn't mandate any other low-level aspects like fixed sized types†. It's good because unlike 'other languages' it makes your code portable when used correctly (Edit: because it can be ported to more architectures without requiring emulation of low-level aspects of the machine, like e.g. two's complement arithmetic on sign+magnitude architecture). But I feel bad that I cannot point to any specific architecture myself.
So the question is: what architectures exhibit the above properties?
† uint*_ts are optional.
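Code that does rely on these three properties can at least state so at compile time. A minimal sketch; the constant names are made up for illustration, and `is_iec559` only reports whether the type follows the IEC 559 (IEEE 754) format, not whether every operation is strictly compliant:

```cpp
#include <cassert>   // assert, for callers that prefer a run-time check
#include <climits>   // CHAR_BIT
#include <limits>    // std::numeric_limits

// True when bytes are octets.
constexpr bool has_octet_bytes = (CHAR_BIT == 8);

// -1 & 3 is 3 in two's complement, 2 in ones' complement,
// and 1 in sign-and-magnitude.
constexpr bool is_twos_complement = ((-1 & 3) == 3);

// True when double uses the IEC 559 (IEEE 754) representation.
constexpr bool has_iec559_double = std::numeric_limits<double>::is_iec559;

static_assert(has_octet_bytes, "requires CHAR_BIT == 8");
static_assert(is_twos_complement, "requires two's complement integers");
static_assert(has_iec559_double, "requires IEEE 754 doubles");
```

On an architecture such as those listed in the answers below, one or more of these assertions would stop the build rather than let the assumption fail silently.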

Take a look at this one
Unisys ClearPath Dorado Servers
offering backward compatibility for people who have not yet migrated all their Univac software.
Key points:
36-bit words
CHAR_BIT == 9
one's complement
72-bit non-IEEE floating point
separate address space for code and data
word-addressed
no dedicated stack pointer
Don't know if they offer a C++ compiler though, but they could.
And now a link to a recent edition of their C manual has surfaced:
Unisys C Compiler Programming Reference Manual
Section 4.5 has a table of data types with 9, 18, 36, and 72 bits.

None of your assumptions hold for mainframes. For starters, I don't know
of a mainframe which uses IEEE 754: IBM uses base 16 floating point, and
both of the Unisys mainframes use base 8. The Unisys machines are a bit
special in many other respects: Bo has mentioned the 2200 architecture,
but the MPS architecture is even stranger: 48 bit tagged words.
(Whether the word is a pointer or not depends on a bit in the word.)
And the numeric representations are designed so that there is no real
distinction between floating point and integral arithmetic: the floating
point is base 8; it doesn't require normalization, and unlike every
other floating point I've seen, it puts the decimal to the right of the
mantissa, rather than the left, and uses signed magnitude for the
exponent (in addition to the mantissa). With the results that an
integral floating point value has (or can have) exactly the same bit
representation as a signed magnitude integer. And there are no floating
point arithmetic instructions: if the exponents of the two values are
both 0, the instruction does integral arithmetic, otherwise, it does
floating point arithmetic. (A continuation of the tagging philosophy in
the architecture.) Which means that while int may occupy 48 bits, 8
of them must be 0, or the value won't be treated as an integer.

Full IEEE 754 compliance is rare in floating-point implementations. And weakening the specification in that regard allows lots of optimizations.
For example, subnormal support differs between x87 and SSE.
Optimizations like fusing a multiplication and an addition that were separate in the source code slightly change the results too, but are a nice optimization on some architectures.
Or on x86 strict IEEE compliance might require certain flags being set or additional transfers between floating point registers and normal memory to force it to use the specified floating point type instead of its internal 80bit floats.
And some platforms have no hardware floats at all and thus need to emulate them in software. And some of the requirements of IEEE 754 might be expensive to implement in software. In particular the rounding rules might be a problem.
My conclusion is that you don't need exotic architectures in order to get into situations where you don't always want to guarantee strict IEEE compliance. For this reason, few programming languages guarantee strict IEEE compliance.

I found this link listing some systems where CHAR_BIT != 8. They include
some TI DSPs have CHAR_BIT == 16
BlueCore-5 chip (a Bluetooth
chip from Cambridge Silicon Radio) which has CHAR_BIT ==
16.
And of course there is a question on Stack Overflow: What platforms have something other than 8-bit char
As for non two's-complement systems there is an interesting read on
comp.lang.c++.moderated. Summarized: there are platforms having ones' complement or sign and magnitude representation.

I'm fairly sure that VAX systems are still in use. They don't support IEEE floating-point; they use their own formats. Alpha supports both VAX and IEEE floating-point formats.
Cray vector machines, like the T90, also have their own floating-point format, though newer Cray systems use IEEE. (The T90 I used was decommissioned some years ago; I don't know whether any are still in active use.)
The T90 also had/has some interesting representations for pointers and integers. A native address can only point to a 64-bit word. The C and C++ compilers had CHAR_BIT==8 (necessary because it ran Unicos, a flavor of Unix, and had to interoperate with other systems), so all byte-level operations were synthesized by the compiler, and a void* or char* stored a byte offset in the high-order 3 bits of the word. And I think some integer types had padding bits.
IBM mainframes are another example.
On the other hand, these particular systems needn't necessarily preclude changes to the language standard. Cray didn't show any particular interest in upgrading its C compiler to C99; presumably the same thing applied to the C++ compiler. It might be reasonable to tighten the requirements for hosted implementations, such as requiring CHAR_BIT==8, IEEE format floating-point if not the full semantics, and 2's-complement without padding bits for signed integers. Old systems could continue to support earlier language standards (C90 didn't die when C99 came out), and the requirements could be looser for freestanding implementations (embedded systems) such as DSPs.
On the other other hand, there might be good reasons for future systems to do things that would be considered exotic today.

CHAR_BIT
According to gcc source code:
CHAR_BIT is 16 bits for 1750a, dsp16xx architectures.
CHAR_BIT is 24 bits for dsp56k architecture.
CHAR_BIT is 32 bits for c4x architecture.
You can easily find more by doing:
find $GCC_SOURCE_TREE -type f | xargs grep "#define CHAR_TYPE_SIZE"
or
find $GCC_SOURCE_TREE -type f | xargs grep "#define BITS_PER_UNIT"
if CHAR_TYPE_SIZE is appropriately defined.
IEEE 754 compliance
If the target architecture doesn't support floating point instructions, gcc may generate a software fallback which is not standard compliant by default. Moreover, special options (like -funsafe-math-optimizations, which also disables sign preservation for zeros) can be used.

IEEE 754 binary representation was uncommon on GPUs until recently, see GPU Floating-Point Paranoia.
EDIT: a question has been raised in the comments whether GPU floating point is relevant to the usual computer programming, unrelated to graphics. Hell, yes! Most high-performance industrial computation today is done on GPUs; the list includes AI, data mining, neural networks, physical simulations, weather forecasting, and much much more. One of the links in the comments shows why: an order of magnitude floating point advantage of GPUs.
Another thing I'd like to add, which is more relevant to the OP question: what did people do 10-15 years ago when GPU floating point was not IEEE and when there was no API such as today's OpenCL or CUDA to program GPUs? Believe it or not, early GPU computing pioneers managed to program GPUs without an API to do that! I met one of them in my company. Here's what he did: he encoded the data he needed to compute as an image with pixels representing the values he was working on, then used OpenGL to perform the operations he needed (such as "gaussian blur" to represent a convolution with a normal distribution, etc), and decoded the resulting image back into an array of results. And this still was faster than using CPU!
Things like that are what prompted NVidia to finally make their internal data binary compatible with IEEE and to introduce an API oriented toward computation rather than image manipulation.
