base64 to reduce digits required to encode a decimal number

I have to manage IDs for some objects.
These IDs need to be unique.
I also have the constraint that the IDs can't be too long in terms of the number of digits required.
Is Base64 a good way to reduce the number of digits required to encode an ID?
EDIT:
Language: C++
Data type: integer, then converted to a std::string

Each Base64 character represents 6 bits, so divide your ID's length in bits by 6 (rounding up) to see how many characters it will need. Raw binary data holds 8 bits per byte, so it will always be shorter, but the bytes won't all be readable.
Base64 makes the ID readable, but it still isn't a good choice if the ID has to be entered by hand, like a key. For that you'll want to restrict the character set further.

Base64 is a nice way to transport binary data over ASCII. It doesn't usually decrease the size of anything. In my experience it increases it by 33% (thanks for the correction).

If you care only about the length of the output string and not the actual byte size, then converting from the decimal numeral system (base 10) to any numeral system with a base higher than 10 will make the output string shorter.
See the example here:
http://www.translatorscafe.com/cafe/units-converter/numbers/calculator/octal-to-decimal/
For example, in their case
decimal 9999999999 <- 10 chars long
in the base 36 numeral system
will be 4LDQPDR <- 7 chars long
With up to 95 printable ASCII characters you could use your own
base 95 numeral system and get an even shorter string.
I used this approach in one of my projects, and it was enough to squeeze "long" numerical IDs into short string fields.
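A minimal sketch of this base-conversion idea, written in Python for brevity (the same divide-and-remainder loop carries over to C++ with std::string). The names to_base and from_base and the 0-9A-Z alphabet are assumptions made for illustration; they are not from the answer.

```python
import string

# Digits 0-9 followed by A-Z: enough symbols for any base from 2 to 36.
ALPHABET = string.digits + string.ascii_uppercase


def to_base(number: int, base: int) -> str:
    """Encode a non-negative integer as a string in the given base."""
    if not 2 <= base <= len(ALPHABET):
        raise ValueError("base must be between 2 and 36")
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        # Peel off the least significant digit on each pass.
        number, remainder = divmod(number, base)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def from_base(text: str, base: int) -> int:
    """Decode a string produced by to_base back into an integer."""
    value = 0
    for ch in text:
        digit = ALPHABET.index(ch)
        if digit >= base:
            raise ValueError(f"{ch!r} is not a valid digit in base {base}")
        value = value * base + digit
    return value


print(to_base(9999999999, 36))  # 4LDQPDR
```

A larger alphabet shortens the string further, at the cost of characters that are harder to type or that need escaping.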

Related

Long Numeric to Character in SAS

When I try to convert a 10-digit number, which is stored as 8., into a character using put(,8.), it gives me a character value like 1.2345e10. I want the character string to be 1234567890. Can someone please help?

8. is the width (number of characters) you'd like to use. So of course it is 1.2345e9; that's what it can fit in 8 characters.
x_char = put(x,10.);
That asks for it to be put in 10 characters. Keep extending it if you want it more than 10. Just be aware you may need to use the optional alignment option (put(x,10. -l)) if you aren't happy with the default alignment of right aligned for shorter-than-maximum-length values.
Do note that when you originally describe the number as 8., I suspect you're actually referring to length; length is not the display size but the storage size (bytes in memory set aside). For character variables in a SBCS system they're identical, but for numeric variables length and format are entirely unrelated.

Unless very sure of your inputs, I find it best to use best.:
data have;
x=1234567890;output;
x=12345.67890;output;
x=.1234567890;output;
x=12345678901234;output;
run;
data want;
set have;
length ten best $32;
ten=put(x,10.);
best=put(x,best32.);
run;
Note that using 10. here would wipe out any decimals, as well as larger numbers:

SAS stores numbers as IEEE 64bit floating point numbers. So when you see that the LENGTH of the variable is 8 that just means you are storing the full 8 bytes of the floating point representation. If the length is less than 8 then you will lose some of the ability to store more digits of precision, but when it pulls the data from the dataset to be used by a data step or proc it will always convert it back to the full 64 bit IEEE format.
The LENGTH that you use to store the data has little to do with how many characters you should use to display the value for humans to read. How to display the value is controlled by the FORMAT that you use. What format to use will depend on the type of values your variable will contain.
So if you know your values are always integers and the maximum value is 9,999,999,999 then you can use the format 10. (also known as F10.).
charvar= put(numvar,F10.);

Store SHA-1 in database in less space than the 40 hex digits

I am using a hash algorithm to create a primary key for a database table. I use the SHA-1 algorithm which is more than fine for my purposes. The database even ships an implementation for SHA-1. The function computing the hash is returning a hex value as 40 characters. Therefore I am storing the hex characters in a char(40) column.
The table will have lots of rows (>= 200 million), which is why I am looking for less data-intensive ways of storing the hash. 40 characters times ~200 million rows will require some GB of storage... Since hex is base 16, I thought I could try to store it in base 256 in the hope of reducing the number of characters needed to around 20. Do you have tips or papers on implementations of compression with base 256?

Store it as a blob: storing 8 bits of data per character instead of 4 is a 2x compression (you need some way to convert it though),
Cut off some characters: you have 160 bits, but 128 bits is enough for unique keys even if the universe ends, and for most purposes 80 bits would even be enough (you don't need cryptographic protection). If you have an anti-collision algorithm, 36 or 40 bits is enough.

A SHA-1 value is 20 bytes. All the bits in these 20 bytes are significant, there's no way to compress them. By storing the bytes in their hexadecimal notation, you're wasting half the space — it takes exactly two hexadecimal digits to store a byte. So you can't compress the underlying value, but you can use a better encoding than hexadecimal.
Storing as a blob is the right answer. That's base 256. You're storing each byte as that byte with no encoding that would create some overhead. Wasted space: 0.
If for some reason you can't do that and you need to use a printable string, then you can do better than hexadecimal by using a more compact encoding. With hexadecimal, the storage requirement is twice the minimum (assuming that each character is stored as one byte). You can use Base64 to bring the storage requirements to 4 characters per 3 bytes, i.e. you would need 28 characters to store the value. In fact, given that you know that the length is 20 bytes and not 21, the base64 encoding will always end with a =, so you only need to store 27 characters and restore the trailing = before decoding.
You could improve the encoding further by using more characters. Base64 uses 64 code points out of the available 256 byte values. ASCII (the de facto portable) has 95 printable characters (including space), but there's no common “base95” encoding, you'd have to roll your own. Base85 is an intermediate choice, it does get some use in practice, and lets you store the 20-byte value in 25 printable ASCII characters.
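The sizes discussed in this answer can be checked with a short Python sketch using only the standard library. The input bytes are an arbitrary placeholder, and base64.b85encode is one particular Base85 variant (Python's), chosen here only for illustration.

```python
import base64
import hashlib

# Raw SHA-1 digest: 20 bytes, which is what a binary/blob column would hold.
digest = hashlib.sha1(b"example row key").digest()

as_hex = digest.hex()                               # 2 characters per byte
as_b64 = base64.b64encode(digest).decode("ascii")   # 4 characters per 3 bytes
as_b85 = base64.b85encode(digest).decode("ascii")   # 5 characters per 4 bytes

print(len(digest), len(as_hex), len(as_b64), len(as_b85))  # 20 40 28 25

# A 20-byte input always ends in exactly one '=' of padding, so the padding
# can be dropped for storage and restored before decoding.
stored = as_b64.rstrip("=")
restored = base64.b64decode(stored + "=")
```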

Assigning a distinct number to a string

Let's say that I have a VIN like this: SB164ABN10E082986.
Now, I want to assign an integer to each possible VIN (without the WMI, which is the first three characters -> 64ABN10E082986) in such a way that I can retrieve the VIN from this integer afterwards.
What would be the best way of doing so? Such an algorithm can take advantage of the fact that the first 10 characters can only be composed of these values:
1234567890 ABCDEFGH JKLMN P RSTUVWXYZ
and the last 4 can be composed of all one-digit numbers (0-9).
Background: I want to be able to save memory. So, in a sense I'm searching for a special way of compression. I calculated that an 8 Byte integer would suffice under those conditions. I am only missing the way of doing "the mapping".
This is how it should work:
VIN -> ALGORITHM -> INDEX
INDEX -> ALGORITHM REVERSED -> VIN

Each character becomes a digit in a variable-base integer. Then convert those digits to the integer.
Those that can be digits or one of 23 letters are base 33. Those that can only be digits are base 10. The total number of possible combinations is 33^10 times 10^4. The logarithm base two of that is 63.73, so it will just fit in a 64-bit integer.
You start with zero. Add the first digit. Multiply by the base of the next digit (33 or 10). Add that digit. Continue until all digits processed. You have the integer. Each digit is 0..32 or 0..9. Take care to properly convert the discontiguous letters to the contiguous numbers 0..32.
Your string 64ABN10E082986 is then encoded as the integer 2836568518287652986. (I gave the digits the values 0..9, and the letters 10..32.)
You can reverse the process by taking the integer and computing both its quotient and its remainder when divided by the last base. The remainder is the last digit. Continue with the quotient, using the base of the next digit to the left.
By the way, in the US anyway the last five characters of the VIN must be numeric digits. I don't know why you are only considering four.
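The variable-base scheme in this answer can be sketched in Python as follows. The names SYMBOLS, BASES, vin_to_index and index_to_vin are hypothetical, and the sketch assumes digits map to 0..9 and the 23 letters to 10..32 in alphabetical order, processed left to right; the exact integer it produces depends on those conventions, so it need not equal the value quoted above.

```python
# Values 0..9 for the digits, then 10..32 for the 23 allowed letters
# (I, O and Q are excluded).
SYMBOLS = "0123456789ABCDEFGHJKLMNPRSTUVWXYZ"

# Per-position bases: ten base-33 positions followed by four base-10 positions.
BASES = [33] * 10 + [10] * 4


def vin_to_index(vin_tail: str) -> int:
    """Map the 14 characters after the WMI to a single integer."""
    if len(vin_tail) != len(BASES):
        raise ValueError("expected 14 characters")
    index = 0
    for ch, base in zip(vin_tail, BASES):
        value = SYMBOLS.index(ch)
        if value >= base:
            raise ValueError(f"{ch!r} is not allowed at this position")
        # Multiply by the base of this position, then add its digit.
        index = index * base + value
    return index


def index_to_vin(index: int) -> str:
    """Invert vin_to_index by peeling digits off from the right."""
    chars = []
    for base in reversed(BASES):
        index, value = divmod(index, base)
        chars.append(SYMBOLS[value])
    return "".join(reversed(chars))


tail = "64ABN10E082986"
assert index_to_vin(vin_to_index(tail)) == tail
```

Because 33^10 * 10^4 is below 2^64, every index fits in an unsigned 64-bit integer.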

Assign a 6 bit number to each valid character/digit and encode all ten in less than 64 bits. This means it would fit in an 8 bytes ie uint64_t in C/C++ and would be easy to store in a database etc.
Count valid bytes
echo -n "1234567890ABCDEFGHJKLMNPRSTUVWXYZ"| wc -c
33
Minimum number of bits to allow 33 is 6. 10 * 6 = 60
If the idea is to make it as small as possible where the length may vary based on VIN then that would be a different answer and looking at the actual wikipedia page for VIN there are likely quite a few ways to do that.

SAS why invalid data length

data temp;
length a 1 b 3 x;
infile '';
input a b x;
run;
The answer said "The data set TEMP is not created because variable A has an invalid length".
Why it's invalid in this small program?

It's invalid because SAS doesn't let you create numeric variables with a length of less than 3 or greater than 8.

Length for numeric variables is not related to the display width (that is controlled solely by format); it is the storage used to hold the variable. In character variables it can be used in that manner, because characters take up 1 byte each, so $7 length is equivalent to $7. format directly. If you want to limit how a number is represented on the screen, use the format statement to control that (format a 1.;). If you want to tell SAS how many characters to input into a number, use informat (informat a 1.;).
However, for numeric variables, there is not the same relationship. Most numbers are 8 bytes, which stores the binary representation of the number as a double precision floating point number. So, a number with format 1. still typically takes up those 8 bytes, just as a number with format 16.3.
Now, you could limit the length somewhat if you wanted to, subject to some considerations. If you limit the length of a numeric variable, you risk losing some precision. In a 1. format number, the odds are that's not a concern; you can store up to 8192 (as an integer) precisely in a three byte numeric (3 digits of precision), so one digit is safe.
In general, unless dealing with very large amounts of data where storage is very costly, it is safer not to manipulate the length of numbers, as you may encounter problems with calculation accuracy (for example, division will very likely cause problems). The limit is not the integer size, but the precision; so for example, while 8192 is the maximum integer storable in a 3 byte number, 8191.5 is not storable in 3 bytes. In fact, 9/8 is, but 11/8 is not storable precisely - 8.192 is the maximum with 3 digits after the decimal, so 8.125 is storable but 8.375 is not.
You can read this article for more details on SAS numeric precision in Windows.
Numeric length can be 3 to 8. SAS uses nearly all of the first two bytes to store the sign and the exponent (the first bit is the sign and the next 11 bits are the exponent), so a 2 byte numeric would only have 5 bits of precision. While some languages have a type this small, SAS chooses not to.

What encryption scheme meets requirement of decimal plaintext & ciphertext and preserves length?

I need an encryption scheme where the plaintext and ciphertext are composed entirely of decimal digits.
In addition, the plaintext and ciphertext must be the same length.
Also the underlying encryption algorithm should be an industry-standard.
I don't mind if it's symmetric (e.g. AES) or asymmetric (e.g. RSA), but it must be a recognized algorithm for which I can get a FIPS-140 approved library. (Otherwise it won't get past the security review stage.)
Using AES OFB is fine for preserving the length of hex-based input (i.e. where each byte has 256 possible values: 0x00 --> 0xFF). However, this will not work for my means as plaintext and ciphertext must be entirely decimal.
NB: "Entirely decimal" may be interpreted two ways - both of which are acceptable for my requirements:
Input & output bytes are characters '0' --> '9' (i.e. byte values: 0x30 -> 0x39)
Input & output bytes have the 100 (decimal) values: 0x00 --> 0x99 (i.e. BCD)
Some more info:
The max plaintext & ciphertext length is likely to be 10 decimal digits.
(I.e. 10 bytes if using '0'-->'9' or 5 bytes if using BCD)
Consider following sample to see why AES fails:
Input string is 8 digit number.
Max 8-digit number is: 99999999
In hex this is: 0x5f5e0ff
This could be treated as 4 bytes: <0x05><0xf5><0xe0><0xff>
If I use AES OFB, I will get 4 byte output.
Highest possible 4-byte ciphertext output is <0xFF><0xFF><0xFF><0xFF>
Converting this back to an integer gives: 4294967295
I.e. a 10-digit number.
==> Two digits too long.
One last thing: there is no limit on the length of any keys / IVs required.

Use AES/OFB, or any other stream cipher. It will generate a keystream of pseudorandom bits. Normally, you would XOR these bits with the plaintext. Instead:
For every decimal digit in the plaintext
    Repeat
        Take 4 bits from the keystream
    Until the bits form a number less than 10
    Add this number to the plaintext digit, modulo 10
To decrypt, do the same but subtract instead in the last step.
I believe this should be as secure as using the stream cipher normally. If a sequence of numbers 0-15 is indistinguishable from random, the subsequence of only those of the numbers that are smaller than 10 should still be random. Using add/subtract instead of XOR should still produce random output if one of the inputs are random.
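A runnable sketch of the loop above, in Python. It is only an illustration: SHA-256 over a counter is a stand-in keystream so that the example needs nothing beyond the standard library, whereas the answer calls for AES/OFB or another real stream cipher. All function names and the key/nonce parameters are assumptions.

```python
import hashlib


def keystream_nibbles(key: bytes, nonce: bytes):
    """Yield 4-bit values from a stand-in keystream (NOT AES/OFB)."""
    counter = 0
    while True:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        for byte in block:
            yield byte >> 4      # high 4 bits
            yield byte & 0x0F    # low 4 bits
        counter += 1


def keystream_digits(key: bytes, nonce: bytes):
    """Keep only nibbles below 10, so each digit 0..9 is equally likely."""
    for nibble in keystream_nibbles(key, nonce):
        if nibble < 10:
            yield nibble


def crypt_digits(text: str, key: bytes, nonce: bytes, decrypt: bool = False) -> str:
    """Add (or, when decrypting, subtract) one keystream digit per digit, mod 10."""
    sign = -1 if decrypt else 1
    stream = keystream_digits(key, nonce)
    return "".join(str((int(ch) + sign * next(stream)) % 10) for ch in text)


ciphertext = crypt_digits("0123456789", b"demo key", b"demo nonce")
plaintext = crypt_digits(ciphertext, b"demo key", b"demo nonce", decrypt=True)
```

The ciphertext has the same number of decimal digits as the plaintext. As with any stream cipher, the same key and nonce pair must not be reused for two messages.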

One potential candidate is the FFX encryption mode, which has recently been submitted to NIST.

Stream ciphers require a nonce for security; the same key stream state must never be re-used for different messages. That nonce adds to the effective ciphertext length.
A block cipher used in a streaming mode has essentially the same issue: a unique initialization vector must be included with the cipher text.
Many stream ciphers are also vulnerable to ciphertext manipulation, where flipping a bit in the ciphertext undetectably flips the corresponding bit in the plaintext.
If the numbers are randomly chosen, and each number is encrypted only once, and the numbers are shorter than the block size, ECB offers good security. Under those conditions, I'd recommend AES in ECB mode as the solution that minimizes ciphertext length while providing strong privacy and integrity protection.
If there is some other information in the context of the ciphertext that could be used as an initialization vector (or nonce), then this could work. This could be something explicit, like a transaction identifier during a purchase, or something implicit like the sequence number of a message (which could be used as the counter in CTR mode). I guess that VeriShield is doing something like this.

I am not a cipher guru, but an obvious question comes to mind: would you be allowed to use One Time Pad encryption? Then you can just include a large block of truly random bits in your decoding system, and use the random data to transform your decimal digits in a reversible way.
If this would be acceptable, we just need to figure out how the decoder knows where in the block of randomness to look to get the key to decode any particular message. If you can send a plaintext timestamp with the ciphertext, then it's easy: convert the timestamp into a number, say the number of seconds since an epoch date, modulus that number by the length of the randomness block, and you have an offset within the block.
With a large enough block of randomness, this should be uncrackable. You could have the random bits be themselves encrypted with strong encryption, such that the user must type in a long password to unlock the decoder; in this way, even if the decryption software was captured, it would still not be easy to break the system.
If you have any interest in this and would like me to expand further, let me know. I don't want to spend a lot of time on an answer that doesn't meet your needs at all.
EDIT: Okay, with the tiny shred of encouragement ("you might be on to something") I'm expanding my answer.
The idea is that you get a block of randomness. One easy way to do this is to just pull data out of the Linux /dev/random device. Now, I'm going to assume that we have some way to find an index into this block of randomness for each message.
Index into the block of randomness and pull out ten bytes of data. Each byte is a number from 0 to 255. Add each of these numbers to the respective digit from the plaintext, modulo by 10, and you have the digits of the ciphertext. You can easily reverse this as long as you have the block of random data and the index: you get the random bits and subtract them from the cipher digits, modulo 10.
You can think of this as arranging the digits from 0 to 9 in a ring. Adding is counting clockwise around the ring, and subtracting is counting counter-clockwise. You can add or subtract any number and it will work. (My original version of this answer suggested using only 3 bits per digit. Not enough, as pointed out below by @Baffe Boyois. Thank you for this correction.)
If the plain text digit is 6, and the random number is 117, then: 6 + 117 == 123, modulo 10 == 3. 3 - 117 == -114, modulo 10 == 6.
As I said, the problem of finding the index is easy if you can use external plaintext information such as a timestamp. Even if your opponent knows you are using the timestamp to help decode messages, it does no good without the block of randomness.
The problem of finding the index is also easy if the message is always delivered; you can have an agreed-upon system of generating a series of indices, and say "This is the fourth message I have received, so I use the fourth index in the series." As a trivial example, if this is the fourth message received, you could agree to use an index value of 16 (4 for fourth message, times 4 bytes per one-time pad). But you could also use numbers from an approved pseudorandom number generator, initialized with an agreed constant value as a seed, and then you would get a somewhat unpredictable series of indexes within the block of randomness.
Depending on your needs, you could have a truly large chunk of random data (hundreds of megabytes or even more). If you use 10 bytes as a one-time pad, and you never use overlapping pads or reuse pads, then 1 megabyte of random data would yield over 100,000 one-time pads.
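The digit-wise add and subtract described in this answer can be sketched in Python as below. The function names, the pad size and the offset are illustrative assumptions, and secrets.token_bytes stands in for whatever source of random bytes is actually used.

```python
import secrets


def pad_encrypt(plain_digits: str, pad: bytes, offset: int) -> str:
    """Add one pad byte to each plaintext digit, modulo 10."""
    chunk = pad[offset:offset + len(plain_digits)]
    if len(chunk) < len(plain_digits):
        raise ValueError("not enough pad material at this offset")
    return "".join(str((int(d) + k) % 10) for d, k in zip(plain_digits, chunk))


def pad_decrypt(cipher_digits: str, pad: bytes, offset: int) -> str:
    """Subtract the same pad bytes, modulo 10, to recover the plaintext."""
    chunk = pad[offset:offset + len(cipher_digits)]
    if len(chunk) < len(cipher_digits):
        raise ValueError("not enough pad material at this offset")
    return "".join(str((int(d) - k) % 10) for d, k in zip(cipher_digits, chunk))


# The worked example from the text: digit 6 with pad byte 117.
assert (6 + 117) % 10 == 3
assert (3 - 117) % 10 == 6

pad = secrets.token_bytes(1000)            # shared block of random bytes
cipher = pad_encrypt("1234567890", pad, offset=40)
assert pad_decrypt(cipher, pad, offset=40) == "1234567890"
```

One caveat on the arithmetic: a byte reduced modulo 10 lands on 0..5 slightly more often than on 6..9, because 256 is not a multiple of 10. Discarding bytes of 250 and above before reducing removes that bias.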

You could use the octal format, which uses digits 0-7, and three digits make up a byte. This isn't the most space-efficient solution, but it's quick and easy.
Example:
Text: Hello world!
Hexadecimal: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21
Octal: 110 145 154 154 157 040 167 157 162 154 144 041
(spaces added for clarity to separate bytes)
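A small Python sketch that reproduces the octal line of this example. The names to_octal and from_octal are hypothetical helpers, and the spaces between groups are kept only for readability.

```python
def to_octal(data: bytes) -> str:
    """Write each byte as exactly three octal digits (000..377)."""
    return " ".join(f"{byte:03o}" for byte in data)


def from_octal(text: str) -> bytes:
    """Parse space-separated three-digit octal groups back into bytes."""
    return bytes(int(group, 8) for group in text.split())


encoded = to_octal(b"Hello world!")
print(encoded)  # 110 145 154 154 157 040 167 157 162 154 144 041
```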

I don't believe your requirement can be met (at all easily anyway), though it's possible to get pretty close.
AES (like most encryption algorithms) is written to work with octets (i.e. 8-bit bytes), and it's going to produce 8-bit bytes. Once it's done its thing, converting the result to use only decimal digits or BCD values won't be possible. Therefore, your only choice is to convert the input from decimal or BCD digits to something that fills an octet as completely as possible. You can then encrypt that, and finally re-encode the output to use only decimal or BCD digits.
When you convert the ASCII digits to fill the octets, it'll "compress" the input somewhat. The encryption will then produce the same size of output as the input. You'll then encode that to use only decimal digits, which will expand it back to roughly the original size.
The problem is that neither 10 nor 100 is a number that you're going to easily fit exactly into a byte. Numbers from 1 to 100 can be encoded in 7 bits. So, you'll basically treat those as a bit-stream, putting them in 7 bits at a time, but taking them out 8 bits at a time to get bytes to encrypt.
That uses the space somewhat better, but it's still not perfect. 7 bits can encode values from 0 to 127, not just 0 to 99, so even though you'll use all 8 bits, you won't use every possible combination of those 8 bits. Likewise, in the result, one byte will turn into three decimal digits (0 to 255), which clearly wastes some space. As a result, your output will be slightly larger than your input.
To get closer than that, you could compress your input with something like Huffman or an LZ* compression (or both) before encrypting it. Then you'd do roughly the same thing: encrypt the bytes, and encode the bytes using values from 0 to 9 or 0 to 99. This will give better usage of the bits in the bytes you encrypt, so you'd waste very little space in that transformation, but does nothing to improve the encoding on the output side.

For those doubting FFX mode AES, please feel free to contact me for further details. Our implementation is a mode of AES that effectively sits on top of existing ciphers. The specification with proof/validation is up on the NIST modes website. FFSEM mode AES is included under FFX mode.
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ffx/ffx-spec.pdf
If it's meaningful, you can also have a conversation with NIST directly about their status with respect to modes submission/AES modes acceptance to address your FIPS question. FFX has security proofs and independent cryptographic review, and is not a "new cipher". It is, however, based on methods that go back 20+ years: proven techniques. In implementation we have the ability to encrypt data whilst preserving length, structure, integrity, type and format. For example, you can specify an explicit format policy that the output will be NNN-NN-NNNN.
So, as a mode of AES, in a mainframe environment for example, we simply use the native AES processor on a z10. Similarly, on open systems with HSM devices, we can sit on top of an existing AES implementation.
Format Preserving Encryption (as it's often referred to) in this way is already being used in industry, is available in off-the-shelf products, and is rather quick to deploy. It is already used in POS devices, payment systems, enterprise deployments, etc.
Mark Bower
VP Product Management
Voltage Security
Drop a note to info@voltage.com or take a look at our website for more info.

Something like a Feistel cipher should fit your requirements. Split your input number into two parts (say 8 digits each), pass one part through a not-necessarily-reversible-or-bijective function, and subtract the other part from the result of that function (modulo e.g. 100,000,000). Then rearrange the digits somehow and repeat a bunch of times. Ideally, one should slightly vary the function which is used each time. Decryption is similar to encryption except that one starts by undoing the last rearrangement step, then subtracts the second part of the message from the result of using the last function one used on the first part of the message (again, modulo 100,000,000), then undoes the previous rearrangement step, etc.
The biggest difficulties with a Feistel cipher are finding a function which achieve good encryption with a reasonable number of rounds, and figuring out how many rounds are required to achieve good encryption. If speed is not important, one could probably use something like AES to perform the scrambling function (since it doesn't have to be bijective, you could arbitrarily pad the data before each AES step, and interpret the result as a big binary number modulo 100,000,000). As for the number of rounds, 10 is probably too few, and 1000 is probably excessive. I don't know what value between there would be best.
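To make the structure concrete, here is a toy balanced Feistel network over two 5-digit halves, sketched in Python. It is not a vetted cipher. HMAC-SHA256 stands in for the AES-based scrambling function, a plain swap of the halves stands in for the "rearrange the digits" step, and it adds the round value during encryption and subtracts it during decryption, a common variant of the subtraction described above. The names, the 16-round default and the 10-digit width are all assumptions.

```python
import hashlib
import hmac

HALF_DIGITS = 5
MODULUS = 10 ** HALF_DIGITS  # each half is a number in 0..99999


def round_value(key: bytes, round_no: int, half: int) -> int:
    """Non-invertible round function: keyed hash of one half, reduced mod 10^5."""
    message = f"{round_no}:{half}".encode()
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % MODULUS


def split(digits: str):
    if len(digits) != 2 * HALF_DIGITS or not digits.isdigit():
        raise ValueError("expected exactly 10 decimal digits")
    return int(digits[:HALF_DIGITS]), int(digits[HALF_DIGITS:])


def join(left: int, right: int) -> str:
    return f"{left:0{HALF_DIGITS}d}{right:0{HALF_DIGITS}d}"


def feistel_encrypt(digits: str, key: bytes, rounds: int = 16) -> str:
    left, right = split(digits)
    for r in range(rounds):
        # New left is the old right; new right mixes the old left with F(old right).
        left, right = right, (left + round_value(key, r, right)) % MODULUS
    return join(left, right)


def feistel_decrypt(digits: str, key: bytes, rounds: int = 16) -> str:
    left, right = split(digits)
    for r in reversed(range(rounds)):
        # Undo one round: the current left is the old right.
        left, right = (right - round_value(key, r, left)) % MODULUS, left
    return join(left, right)


secret = feistel_encrypt("0123456789", b"demo key")
assert feistel_decrypt(secret, b"demo key") == "0123456789"
```

The round function never has to be inverted, which is why a hash (or a padded AES call) can be used there; only the modular addition is undone during decryption.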

Using only 10 digits as input/output is completely insecure. It is so insecure that it is very likely to be cracked in a real application, so consider using at least 39 digits (the equivalent of 128 bits). If you are going to use only 10 digits, there is no point in using AES; in that case you might as well invent your own (insecure) algorithm.
The only way you might get out of this is by using a stream cipher. Use a 256-bit key "SecureKey" and an initialisation vector IV, which should be different at the start of each session.
Translate the cipher output into a 77-digit (decimal) number and use "addition with overflow" (addition modulo 10) on each digit.
For instance
AES(IV,KEY) = 4534670 //and lot more
secret_message = 01235
+ and mod 10
---------------------------------------------
ciphertext = 46571 // and you still have 70 for next message
When you run out of digits from the stream cipher -> AES(IV,KEY), increase the IV (IV=IV+1) and repeat.
Keep in mind that you should absolutely never use the same IV twice, so you need some scheme to prevent this.
Another concern is generating the streams. If you generate a number that is higher than 10^77, you should discard that number, increase the IV, and try again with the new IV. Otherwise there is a high probability that you will have biased numbers and a vulnerability.
It is also very likely that there is a flaw in this scheme, or that there will be one in your implementation.
