C++: a scalable number class - bitset<1>* or unsigned char*

I'm planning on creating a number class. The purpose is to hold a number of any size without worrying about overflow (like with int or long), but at the same time without using more memory than needed. For example:
If I have data that only really needs 1-10, I don't need an int (4 bytes), a short (2 bytes) or even a char (1 byte). So why allocate so much?
If I want to hold data that requires an extremely large value (only integers in this scenario), say past the billions, I cannot.
My goal is to create a number class that handles this problem the way strings do, sizing to fit the number. But before I begin, I was wondering:
bitset<1>: bitset is a template class that lets me manipulate bits in C++, which is quite useful, but is it efficient? bitset<1> would define 1 bit, but do I want to make an array of them? C++ can only allocate a byte at minimum, so does bitset<1> allocate a byte and expose 1 bit OF that byte? If that's the case, I'd rather create my number class with unsigned char*'s.
unsigned char, or BYTE, holds 8 bits; anything from 0 - 255 would only need one, larger values would require two, then three. It would simply keep expanding when needed in byte intervals rather than bit intervals.
Which do you think is MORE efficient? The bits would be, if bitset actually allocated 1 bit, but I have a feeling that isn't even possible. In fact, it may actually be more efficient to allocate in bytes up to 4 bytes (32 bits); on a 32-bit processor 32-bit allocation is most efficient, so I would use 4 bytes at a time from then on out.
Basically my question is: what are your thoughts? How should I go about this implementation, bitset<1> or unsigned char (or BYTE)?

Optimizing for bits is silly unless your target architecture is a DigiComp-1. Reading individual bits is always slower than reading whole ints - 4 bits isn't more efficient than 8.
Use unsigned char if you want to store it as a decimal number. This will be the most efficient.
Or, you could just use GMP.
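To make the unsigned char idea concrete, here is a rough sketch of a byte-based number: base-256 "digits", least significant byte first, growing one byte at a time. The names (BigNum, add) are invented for illustration; a real arbitrary-precision library such as GMP does far more (multiplication, division, sign handling, performance tuning).

#include <cstddef>
#include <cstdio>
#include <vector>

struct BigNum {
    std::vector<unsigned char> bytes;   // bytes[0] is the least significant byte

    explicit BigNum(unsigned long long v = 0) {
        do { bytes.push_back(static_cast<unsigned char>(v & 0xFF)); v >>= 8; } while (v);
    }

    // this += other, carrying between bytes and growing by one byte when needed
    void add(const BigNum& other) {
        unsigned carry = 0;
        for (std::size_t i = 0; i < other.bytes.size() || carry; ++i) {
            if (i == bytes.size()) bytes.push_back(0);
            unsigned sum = bytes[i] + carry + (i < other.bytes.size() ? other.bytes[i] : 0);
            bytes[i] = static_cast<unsigned char>(sum & 0xFF);
            carry = sum >> 8;
        }
    }
};

int main() {
    BigNum a(4000000000ULL), b(4000000000ULL);
    a.add(b);   // 8,000,000,000 no longer fits in 32 bits, so a fifth byte is added
    for (std::size_t i = a.bytes.size(); i-- > 0;)
        std::printf("%02x", a.bytes[i]);        // prints 01dcd65000 (hex for 8,000,000,000)
    std::printf("\n");
}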

The bitset template requires a compile-time constant integer for its template argument. This could be a drawback when you have to determine the maximum bit size at run time. Another thing is that most compilers / libraries use unsigned int or unsigned long long to store the bits, for faster memory access. If your application has to run in an environment with limited memory, you should create a new class like bitset or use a different library.
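A rough sketch of that limitation (std::bitset needs its size at compile time, while std::vector<bool>, or boost::dynamic_bitset, can be sized at run time):

#include <bitset>
#include <cstddef>
#include <vector>

int main() {
    std::bitset<64> fixed;         // 64 must be a compile-time constant
    fixed.set(3);

    std::size_t n = 100;           // imagine this value is only known at run time
    std::vector<bool> dynamic(n);  // bit-packed by most implementations
    dynamic[3] = true;
}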

While it won't directly help you with arithmetic on giant numbers, if this kind of space-saving is your goal then you might find my Nstate library useful (boost license):
http://hostilefork.com/nstate/
For instance: if you have a value that can be between 0 and 2... then as long as you are going to be storing a bunch of these in an array, you can exploit the "wasted" space of the unused 4th state (3) to pack more values. In that particular case, you can get 20 tristates into a 32-bit word instead of the 16 that you would get with 2 bits per tristate.
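This is not the Nstate API itself, just a rough sketch of the underlying idea: 3^20 = 3,486,784,401 fits in a 32-bit word (2^32 = 4,294,967,296), so 20 base-3 digits can share one uint32_t where plain 2-bit fields would only fit 16.

#include <cassert>
#include <cstdint>

// Returns a new packed word with the tristate at `index` (0..19) set to `value` (0..2).
std::uint32_t set_tristate(std::uint32_t word, int index, unsigned value) {
    std::uint32_t pow = 1;
    for (int i = 0; i < index; ++i) pow *= 3;       // 3^index
    std::uint32_t old = (word / pow) % 3;           // current base-3 digit
    return word - old * pow + value * pow;          // swap the digit
}

unsigned get_tristate(std::uint32_t word, int index) {
    std::uint32_t pow = 1;
    for (int i = 0; i < index; ++i) pow *= 3;
    return static_cast<unsigned>((word / pow) % 3);
}

int main() {
    std::uint32_t word = 0;
    word = set_tristate(word, 19, 2);   // the highest of the 20 slots
    word = set_tristate(word, 0, 1);
    assert(get_tristate(word, 19) == 2);
    assert(get_tristate(word, 0) == 1);
}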

Related

Are there any performance differences between representing a number using a (4 byte) `int` and a 4 element unsigned char array?

Assuming an int in C++ is represented by 4 bytes, and an unsigned char is represented by 1 byte, you could represent an int with a 4-element array of unsigned char, right?
My question is: are there any performance downsides to representing a number with an array of unsigned char? For example, if you wanted to add two numbers together, would int + int be just as fast as adding each element of the arrays and dealing with the carries manually?
This is just me trying to experiment and to practice working with bytes rather than some practical application.
There will be many performance downsides on any kind of manipulation using the 4-byte array. For example, take simple addition: almost any CPU these days will have a single instruction that adds two 32-bit integers, in one (maybe two) CPU cycle(s). To emulate that with your 4-byte array, you would need at least 4 separate CPU instructions.
Further, many CPUs actually work faster with 32- or 64-bit data than they do with 8-bit data - because their internal registers are optimized for 32- and 64-bit operands.
Let's scale your question up. Is there any performance difference between a single addition of two 16-byte variables compared to four separate additions of 4-byte variables? And here comes the concept of vector registers and vector instructions (MMX, SSE, AVX). It's pretty much the same story: SIMD is always faster, because there are literally fewer instructions to execute and the whole operation is done by dedicated hardware. On top of that, in your question you also have to take into account that modern CPUs don't work with 1-byte variables; they still process 32 or 64 bits at once anyway. So effectively you would do 4 individual additions using 4-byte registers, only to use a single low byte each time and then manually handle the carry bit. Yeah, that will be very slow.
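To make the comparison concrete, here is a hedged sketch of what the emulation looks like in source form: one native 32-bit addition versus four byte additions with manual carries (the function name is invented for illustration):

#include <cstdint>
#include <cstdio>

// Add two numbers held as 4 little-endian bytes each (illustrative only).
void add_bytes(const std::uint8_t a[4], const std::uint8_t b[4], std::uint8_t out[4]) {
    unsigned carry = 0;
    for (int i = 0; i < 4; ++i) {                   // four adds plus carry handling
        unsigned sum = a[i] + b[i] + carry;
        out[i] = static_cast<std::uint8_t>(sum & 0xFF);
        carry = sum >> 8;
    }
}

int main() {
    std::uint32_t x = 0x12345678, y = 0x0FEDCBA9;
    std::uint32_t native = x + y;                   // typically a single ADD instruction

    std::uint8_t a[4], b[4], r[4];
    for (int i = 0; i < 4; ++i) { a[i] = (x >> (8 * i)) & 0xFF; b[i] = (y >> (8 * i)) & 0xFF; }
    add_bytes(a, b, r);

    std::uint32_t emulated = r[0] | (r[1] << 8) | (r[2] << 16) | (std::uint32_t(r[3]) << 24);
    std::printf("%08x %08x\n", native, emulated);   // same value, very different cost
}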

How is word size in a computer related to int or long?

I have seen the link "What does it mean by word size in computer?". It defines what word size is.
I am trying to represent a very long string in bits, where each character is represented by 4 bits, and save it in a long or integer array so that I can extract my string when required.
I can save the bits either in an integer array or a long array.
If I use a long array (8 bytes per element) I will be able to save 8*8 = 64 bits (16 characters) in one long element.
But if I use an int I will be able to save only 4*8 = 32 bits (8 characters).
Now, if I am given my word size = 32, is it the case that I should use int only and not long?
To answer your direct question: There is no guaranteed relationship between the natural word-size of the processor and the C and C++ types int or long. Yes, quite often int will be the same as the size of a register in the processor, but most 64-bit processors do not follow this rule, as it makes data unnecessarily large. On the other hand, an 8-bit processor would have a register size of 8 bits, but int according to the C and C++ standards needs to be at least 16 bits in size, so the compiler would have to use more than one register to represent one integer [in some fashion].
In general, if you want to KNOW how many bits or bytes some type is, it's best NOT to rely on int, long, size_t or void *, since they are all likely to differ between processor architectures, or even between compilers on the same architecture. An int and a long may be the same size or different sizes. The only rule the standard gives is that long is at least 32 bits.
So, to have control over the number of bits, use #include <cstdint> (or in C, stdint.h), and use types such as uint16_t or uint32_t - then you KNOW they will hold a given number of bits.
On a processor that has a 36-bit "word size", the type uint32_t, for example, will [most likely] not exist, since there is no type that holds exactly 32 bits. Alternatively, the compiler may add extra instructions to "behave as if it's a 32-bit type" (in other words, sign-extending if necessary and masking off the top bits as needed).
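A rough sketch tying this back to the packing question: with <cstdint> you can pick the exact width and pack sixteen 4-bit character codes into one uint64_t, regardless of what int or long happen to be on a given platform (the helper names here are made up for illustration):

#include <cassert>
#include <cstdint>

void set_nibble(std::uint64_t& word, int index, unsigned code) {   // index 0..15
    int shift = index * 4;
    word = (word & ~(std::uint64_t(0xF) << shift)) | (std::uint64_t(code & 0xF) << shift);
}

unsigned get_nibble(std::uint64_t word, int index) {
    return static_cast<unsigned>((word >> (index * 4)) & 0xF);
}

int main() {
    std::uint64_t word = 0;
    set_nibble(word, 0, 0xA);    // first 4-bit character
    set_nibble(word, 15, 0x7);   // sixteenth 4-bit character
    assert(get_nibble(word, 0) == 0xA);
    assert(get_nibble(word, 15) == 0x7);
}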

Store a digit larger than 9 in array

I am wondering if it is possible to store a digit larger than 9 in an array. For example:
int myarray[0] = 1234567898
From what I know about arrays, they can only hold integers, in which case the highest number would be a 9-digit number.
Is it possible to change the data type or hold a larger number? If not, what are other ways to do it?
It is quite possible. C++11 adds support for long long, which is typically 64 bits (much larger) and can store up to 19 digits, and many compilers supported it before that as an extension. You can also store arbitrarily large numbers in an array, provided the array is large enough and you have the requisite mathematics. However, it's not pleasant, and an easier bet is to download a library to do it for you. These are known as bignum or arbitrary-precision integer libraries, and they use arrays to store integers of any size your machine can handle.
It's still not possible to express a literal larger than what long long can hold, though.
A 32-bit int has a max value of 2147483647 - that is 10 digits. If you need values higher than that, you need to use a 64-bit int (long long, __int64, int64_t, etc., depending on your compiler).
I am afraid that your idea that ints can only represent a single decimal digit is incorrect. In addition, your idea that arrays can only hold ints is incorrect. An int is usually the computer's most efficient integral type. On most machines this is 16, 32 or 64 bits of binary data, so it can go up to 2^16 (about 65 thousand), 2^32 (about 4 billion), or 2^64 (very big). It is most likely 32 bits on your machine, so unsigned values up to just over 4 billion (or signed values up to just over 2 billion) can be stored. If you need more than that, types such as long and long long are available.
Finally, arrays can hold any type, you just declare them as:
type myarray[arraysize];
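A quick sketch of both options mentioned above: a 64-bit integer for values up to 19 digits, and a plain digit-per-element array once even that is not enough (the 20-digit value below is arbitrary):

#include <climits>
#include <cstdio>

int main() {
    long long big = 1234567898LL;               // 10 digits fit comfortably in 64 bits
    std::printf("%lld (max %lld)\n", big, LLONG_MAX);

    // Beyond 64 bits: one decimal digit per array element, most significant first.
    int digits[] = {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0};   // 20 digits, too big for long long
    for (int d : digits) std::printf("%d", d);
    std::printf("\n");
}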

Is there a vector that can handle integers of nonstandard bit-lengths?

I'm looking for something resembling an STL vector but that can handle integers that are, for example, 12, 16, 20, 24, 32, and 40 bits long. The 16-bit and 32-bit cases are handled wonderfully by vector<uint16_t> and vector<uint32_t>, but I haven't been able to find any way to handle the other ones. Note that the whole purpose of going this way is to save memory and bandwidth, so padding is not an option.
My data structure can infer the most significant bits of the integers (which are int64s), hence I only want to store the LSBs. The bits-per-integer and number of integers are known at creation time, but not compile time. Ideally bits per integer can be any value between 12 and 40, but tiers are ok for performance reasons or to work with a structure where bits-per-integer needs to be set at compile time.
vector<bool> and dynamic_bitset can create bitfields, but they're limited to 1-bit integers. Anyone know of something else out there?
There is none in the STL.
I understand your performance concern, and notably the memory issue. However, unaligned reads (requiring bit-shifting) can be slower than regular ones (on byte boundaries), so I would suggest sticking to whole bytes. This means:
12 bits, 16 bits: uint16_t
20 bits, 24 bits: either 3 uint8_t or 1 uint32_t (tradeoff speed/memory)
32 bits: uint32_t
40 bits: either 5 uint8_t, 3 uint16_t or 1 uint64_t
Assuming you choose the easier solution (ie, always bumping to the next available integer), you could use something like:
boost::variant<
std::vector<uint16_t>,
std::vector<uint32_t>,
std::vector<uint64_t>
>
This lets you choose the exact vector to use at runtime, based on the number of bytes you need to store.
If you want to save as much memory as possible, then you simply need to add a std::vector<uint8_t> to the list. But I would refrain from doing so, and perhaps only "pack" the 40-bit case (into uint16_t elements, three per value), as the others never waste much memory anyway... and I would probably profile to check both alternatives.
Note: you might also wish to tune the vector memory usage by checking that the capacity does not exceed the size too much.
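For completeness, here is a minimal sketch of the bit-packed alternative I'm advising against, so the trade-off is concrete: values of a run-time bit width packed back to back into 64-bit words, at the cost of shifting (and sometimes touching two words) on every access. The class name and layout are invented for illustration:

#include <cassert>
#include <cstdint>
#include <vector>

class PackedVector {
    std::vector<std::uint64_t> words_;
    unsigned bits_;                     // bits per value, assumed 1..63 here
public:
    PackedVector(unsigned bits, std::size_t count)
        : words_((bits * count + 63) / 64 + 1, 0), bits_(bits) {}

    void set(std::size_t i, std::uint64_t value) {
        std::size_t bit = i * bits_, word = bit / 64, off = bit % 64;
        std::uint64_t mask = (std::uint64_t(1) << bits_) - 1;
        value &= mask;
        words_[word] = (words_[word] & ~(mask << off)) | (value << off);
        if (off + bits_ > 64) {         // value straddles two 64-bit words
            unsigned spill = static_cast<unsigned>(off + bits_ - 64);
            std::uint64_t hi_mask = (std::uint64_t(1) << spill) - 1;
            words_[word + 1] = (words_[word + 1] & ~hi_mask) | (value >> (bits_ - spill));
        }
    }

    std::uint64_t get(std::size_t i) const {
        std::size_t bit = i * bits_, word = bit / 64, off = bit % 64;
        std::uint64_t mask = (std::uint64_t(1) << bits_) - 1;
        std::uint64_t v = words_[word] >> off;
        if (off + bits_ > 64)
            v |= words_[word + 1] << (64 - off);
        return v & mask;
    }
};

int main() {
    PackedVector v(20, 1000);           // 1000 values of 20 bits each (about 2.5 KB plus overhead)
    v.set(3, 0xABCDE);
    v.set(4, 0x12345);
    assert(v.get(3) == 0xABCDE);
    assert(v.get(4) == 0x12345);
}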

Usage of 'short' in C++

Why is it that for any numeric input we prefer an int rather than a short, even if the values are very small?
The size of short is 2 bytes on my x86, and 4 bytes for int; shouldn't a short be better and faster to allocate than an int?
Or am I wrong in saying that short is not used?
CPUs are usually fastest when dealing with their "native" integer size. So even though a short may be smaller than an int, the int is probably closer to the native size of a register in your CPU, and therefore is likely to be the most efficient of the two.
In a typical 32-bit CPU architecture, to load a 32-bit value requires one bus cycle to load all the bits. Loading a 16-bit value requires one bus cycle to load the bits, plus throwing half of them away (this operation may still happen within one bus cycle).
A 16-bit short makes sense if you're keeping so many in memory (in a large array, for example) that the 50% reduction in size adds up to an appreciable reduction in memory overhead. They are not faster than 32-bit integers on modern processors, as Greg correctly pointed out.
In embedded systems, the short and unsigned short data types are used for accessing items that require fewer bits than the native integer.
For example, if my USB controller has 16 bit registers, and my processor has a native 32 bit integer, I would use an unsigned short to access the registers (provided that the unsigned short data type is 16-bits).
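A hedged sketch of that use case: a hypothetical 16-bit USB controller register at a made-up address, read as exactly 16 bits (on real hardware the address comes from the chip's datasheet, and uint16_t states the width more explicitly than unsigned short):

#include <cstdint>

// The register address is invented for illustration; do not call this on a desktop OS.
inline std::uint16_t read_usb_status() {
    volatile std::uint16_t* reg = reinterpret_cast<volatile std::uint16_t*>(0x40005C44);
    return *reg;    // a single 16-bit bus access
}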
Most of the advice from experienced users (see news:comp.lang.c++.moderated) is to use the native integer size unless a smaller data type must be used. The problem with using short to save memory is that the values may exceed the limits of short. Also, this may be a performance hit on some 32-bit processors, as they have to fetch 32 bits near the 16-bit variable and eliminate the unwanted 16 bits.
My advice is to work on the quality of your programs first, and only worry about optimization if it is warranted and you have extra time in your schedule.
Using type short does not guarantee that the actual size will be smaller than that of type int. It allows it to be smaller, and ensures that it is no bigger. Note too that short must be at least as large as type char.
The original question above contains actual sizes for the processor in question, but when porting code to a new environment, one can only rely on weak relative assumptions without verifying the implementation-defined sizes.
The C header <stdint.h> -- or, from C++, <cstdint> -- defines types of specified size, such as uint8_t for an unsigned integral type exactly eight bits wide. Use these types when attempting to conform to an externally-specified format such as a network protocol or binary file format.
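A short compile-time check of the guarantees mentioned above, plus a sketch of how the fixed-width types might be used for an externally specified layout (the PacketHeader fields are invented for illustration):

#include <climits>
#include <cstdint>

static_assert(sizeof(char) <= sizeof(short), "char is never bigger than short");
static_assert(sizeof(short) <= sizeof(int), "short is never bigger than int");
static_assert(sizeof(short) * CHAR_BIT >= 16, "short is at least 16 bits wide");

struct PacketHeader {           // e.g. fields of a hypothetical network protocol
    std::uint8_t  version;      // exactly 8 bits
    std::uint16_t length;       // exactly 16 bits
    std::uint32_t sequence;     // exactly 32 bits
};

int main() {}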
The short type is very useful if you have a big array full of them and int is just way too big.
Given that the array is big enough, the memory saving will be significant (compared to using an array of ints).
Unicode strings are also often encoded as shorts (UTF-16), although other encoding schemes exist.
On embedded devices, space still matters and short might be very beneficial.
Last but not least, some transmission protocols insist on 16-bit fields, so you still need shorts there.
Maybe we should consider it separately for different situations. For example, on x86 or x64 you should choose the most suitable type, not just default to int. In some cases int is faster than short. The first answer above has already covered this.