What encryption scheme meets the requirement of decimal plaintext & ciphertext and preserves length? - C++

I need an encryption scheme where the plaintext and ciphertext are composed entirely of decimal digits.
In addition, the plaintext and ciphertext must be the same length.
Also the underlying encryption algorithm should be an industry-standard.
I don't mind if it's symmetric (e.g. AES) or asymmetric (e.g. RSA) - but it must be a recognized algorithm for which I can get a FIPS-140 approved library. (Otherwise it won't get past the security review stage.)
Using AES OFB is fine for preserving the length of hex-based input (i.e. where each byte has 256 possible values: 0x00 --> 0xFF). However, this will not work for my needs, as the plaintext and ciphertext must be entirely decimal.
NB: "Entirely decimal" may be interpreted two ways - both of which are acceptable for my requirements:
Input & output bytes are characters '0' --> '9' (i.e. byte values: 0x30 -> 0x39)
Input & output bytes have the 100 (decimal) values: 0x00 --> 0x99 (i.e. BCD)
Some more info:
The max plaintext & ciphertext length is likely to be 10 decimal digits.
(I.e. 10 bytes if using '0'-->'9' or 5 bytes if using BCD)
Consider the following example to see why AES fails:
Input string is 8 digit number.
Max 8-digit number is: 99999999
In hex this is: 0x05f5e0ff
This could be treated as 4 bytes: <0x05><0xf5><0xe0><0xff>
If I use AES OFB, I will get 4 byte output.
Highest possible 4-byte ciphertext output is <0xFF><0xFF><0xFF><0xFF>
Converting this back to an integer gives: 4294967295
I.e. a 10-digit number.
==> Two digits too long.
One last thing - there is no limit on the length any keys / IVs required.

Use AES/OFB, or any other stream cipher. It will generate a keystream of pseudorandom bits. Normally, you would XOR these bits with the plaintext. Instead:
For every decimal digit in the plaintext:
    Repeat:
        Take 4 bits from the keystream
    Until the bits form a number less than 10
    Add this number to the plaintext digit, modulo 10
To decrypt, do the same but subtract instead in the last step.
I believe this should be as secure as using the stream cipher normally. If a sequence of numbers 0-15 is indistinguishable from random, then the subsequence formed by keeping only the numbers smaller than 10 should still be indistinguishable from random. And using addition/subtraction modulo 10 instead of XOR still produces uniformly random output if one of the inputs is uniformly random.
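In C++, the digit loop might look like the sketch below. The Keystream interface is a stand-in I'm assuming for a FIPS-validated AES-OFB keystream (not shown here); everything else is standard C++.

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Hypothetical keystream source; back it with AES-OFB output in practice.
    struct Keystream {
        virtual uint8_t next_nibble() = 0;   // 4 pseudorandom bits (0..15)
        virtual ~Keystream() = default;
    };

    // Rejection sampling: discard nibbles 10..15 so the digits stay unbiased.
    static int next_digit(Keystream& ks) {
        for (;;) {
            uint8_t v = ks.next_nibble();
            if (v < 10) return v;
        }
    }

    // encrypt=true adds the keystream digit mod 10; otherwise subtracts it.
    std::string crypt_digits(const std::string& in, Keystream& ks, bool encrypt) {
        std::string out;
        out.reserve(in.size());
        for (char c : in) {
            if (c < '0' || c > '9') throw std::invalid_argument("not a digit");
            int p = c - '0';
            int k = next_digit(ks);
            out.push_back(static_cast<char>('0' + (encrypt ? (p + k) % 10
                                                           : (p - k + 10) % 10)));
        }
        return out;
    }

Decryption consumes the keystream at the same positions, so both sides stay in step automatically.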

One potential candidate is the FFX encryption mode, which has recently been submitted to NIST.

Stream ciphers require a nonce for security; the same key stream state must never be re-used for different messages. That nonce adds to the effective ciphertext length.
A block cipher used in a streaming mode has essentially the same issue: a unique initialization vector must be included with the ciphertext.
Many stream ciphers are also vulnerable to ciphertext manipulation, where flipping a bit in the ciphertext undetectably flips the corresponding bit in the plaintext.
If the numbers are randomly chosen, and each number is encrypted only once, and the numbers are shorter than the block size, ECB offers good security. Under those conditions, I'd recommend AES in ECB mode as the solution that minimizes ciphertext length while providing strong privacy and integrity protection.
If there is some other information in the context of the ciphertext that could be used as an initialization vector (or nonce), then this could work. This could be something explicit, like a transaction identifier during a purchase, or something implicit like the sequence number of a message (which could be used as the counter in CTR mode). I guess that VeriShield is doing something like this.
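For example, if an implicit message sequence number is used as the nonce, a CTR counter block might be assembled as below. The byte layout is purely illustrative (not from any standard), and the AES call itself would come from the FIPS-validated library.

    #include <array>
    #include <cstdint>

    // Illustrative layout: bytes 0..7 = message sequence number (the nonce),
    // bytes 12..15 = block counter within the message, both big-endian;
    // bytes 8..11 are left zero.
    std::array<uint8_t, 16> make_counter_block(uint64_t message_seq,
                                               uint32_t block_index) {
        std::array<uint8_t, 16> ctr{};
        for (int i = 0; i < 8; ++i)
            ctr[i] = static_cast<uint8_t>(message_seq >> (56 - 8 * i));
        for (int i = 0; i < 4; ++i)
            ctr[12 + i] = static_cast<uint8_t>(block_index >> (24 - 8 * i));
        return ctr;
    }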

I am not a cipher guru, but an obvious question comes to mind: would you be allowed to use a one-time pad? Then you can just include a large block of truly random bits in your decoding system, and use the random data to transform your decimal digits in a reversible way.
If this would be acceptable, we just need to figure out how the decoder knows where in the block of randomness to look to get the key to decode any particular message. If you can send a plaintext timestamp with the ciphertext, then it's easy: convert the timestamp into a number, say the number of seconds since an epoch date, take that number modulo the length of the randomness block, and you have an offset within the block.
With a large enough block of randomness, this should be uncrackable. You could have the random bits be themselves encrypted with strong encryption, such that the user must type in a long password to unlock the decoder; in this way, even if the decryption software was captured, it would still not be easy to break the system.
If you have any interest in this and would like me to expand further, let me know. I don't want to spend a lot of time on an answer that doesn't meet your needs at all.
EDIT: Okay, with the tiny shred of encouragement ("you might be on to something") I'm expanding my answer.
The idea is that you get a block of randomness. One easy way to do this is to just pull data out of the Linux /dev/random device. Now, I'm going to assume that we have some way to find an index into this block of randomness for each message.
Index into the block of randomness and pull out ten bytes of data. Each byte is a number from 0 to 255. Add each of these numbers to the respective digit from the plaintext, modulo 10, and you have the digits of the ciphertext. You can easily reverse this as long as you have the block of random data and the index: take the random bytes and subtract them from the cipher digits, modulo 10.
You can think of this as arranging the digits from 0 to 9 in a ring. Adding is counting clockwise around the ring, and subtracting is counting counter-clockwise. You can add or subtract any number and it will work. (My original version of this answer suggested using only 3 bits per digit. Not enough, as pointed out below by @Baffe Boyois. Thank you for this correction.)
If the plaintext digit is 6, and the random number is 117, then: 6 + 117 == 123, modulo 10 == 3. And 3 - 117 == -114, modulo 10 == 6.
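In C++ the pad arithmetic might look like this sketch (names are mine; the + 10 keeps C++'s % non-negative). Note that reducing a 0-255 byte mod 10 is very slightly biased, since 256 is not a multiple of 10; that is, however, the scheme exactly as described above.

    #include <cstdint>
    #include <string>
    #include <vector>

    // pad: the block of true randomness; index: the agreed offset into it.
    std::string otp_digits(const std::string& in, const std::vector<uint8_t>& pad,
                           size_t index, bool encrypt) {
        std::string out(in);
        for (size_t i = 0; i < in.size(); ++i) {
            int d = in[i] - '0';
            int k = pad.at(index + i) % 10;      // pad byte reduced to a digit
            int r = encrypt ? (d + k) % 10 : (d - k + 10) % 10;
            out[i] = static_cast<char>('0' + r);
        }
        return out;
    }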
As I said, the problem of finding the index is easy if you can use external plaintext information such as a timestamp. Even if your opponent knows you are using the timestamp to help decode messages, it does no good without the block of randomness.
The problem of finding the index is also easy if the message is always delivered; you can have an agreed-upon system for generating a series of indices, and say "This is the fourth message I have received, so I use the fourth index in the series." As a trivial example, if this is the fourth message received, you could agree to use an index value of 40 (4 for the fourth message, times 10 bytes per one-time pad). But you could also use numbers from an approved pseudorandom number generator, initialized with an agreed constant value as a seed, and then you would get a somewhat unpredictable series of indices within the block of randomness.
Depending on your needs, you could have a truly large chunk of random data (hundreds of megabytes or even more). If you use 10 bytes as a one-time pad, and you never use overlapping pads or reuse pads, then 1 megabyte of random data would yield over 100,000 one-time pads.

You could use the octal format, which uses digits 0-7, and three digits make up a byte. This isn't the most space-efficient solution, but it's quick and easy.
Example:
Text: Hello world!
Hexadecimal: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21
Octal: 110 145 154 154 157 040 167 157 162 154 144 041
(spaces added for clarity to separate bytes)
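A quick C++ sketch of the byte-to-octal rendering (the function name is mine):

    #include <cstdio>
    #include <string>

    // Render each byte as three octal digits (000..377). The example above
    // inserts spaces between bytes for clarity; this version does not.
    std::string to_octal(const std::string& text) {
        std::string out;
        char buf[4];
        for (unsigned char c : text) {
            std::snprintf(buf, sizeof buf, "%03o", static_cast<unsigned>(c));
            out += buf;
        }
        return out;
    }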

I don't believe your requirement can be met (at all easily, anyway), though it's possible to get pretty close.
AES (like most encryption algorithms) is written to work with octets (i.e. 8-bit bytes), and it's going to produce 8-bit bytes. Once it's done its thing, converting the result to use only decimal digits or BCD values without expanding it won't be possible. Therefore, your only choice is to convert the input from decimal or BCD digits to something that fills an octet as completely as possible. You can then encrypt that, and finally re-encode the output to use only decimal or BCD digits.
When you convert the ASCII digits to fill the octets, it'll "compress" the input somewhat. The encryption will then produce the same size of output as the input. You'll then encode that to use only decimal digits, which will expand it back to roughly the original size.
The problem is that neither 10 nor 100 is a number you can fit exactly into a byte. Numbers from 0 to 99 can be encoded in 7 bits. So you'll basically treat those as a bit-stream, putting them in 7 bits at a time, but taking them out 8 bits at a time to get bytes to encrypt.
That uses the space somewhat better, but it's still not perfect. 7 bits can encode values from 0 to 127, not just 0 to 99, so even though you'll use all 8 bits, you won't use every possible combination of those 8 bits. Likewise, in the result, one byte will turn into three decimal digits (0 to 255), which clearly wastes some space. As a result, your output will be slightly larger than your input.
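Here is a sketch of that 7-bit packing in C++, assuming an even number of input digits for brevity; the unpacking side just reverses the shifts.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Pack pairs of decimal digits (values 0..99) into a bit stream,
    // 7 bits per pair, then pull bytes out 8 bits at a time to encrypt.
    std::vector<uint8_t> pack_digit_pairs(const std::string& digits) {
        std::vector<uint8_t> out;
        uint32_t acc = 0;                  // bit accumulator
        int nbits = 0;                     // valid bits currently in acc
        for (size_t i = 0; i + 1 < digits.size(); i += 2) {
            uint32_t pair = (digits[i] - '0') * 10u + (digits[i + 1] - '0');
            acc = (acc << 7) | pair;       // append the 7-bit value
            nbits += 7;
            while (nbits >= 8) {           // emit each completed byte
                out.push_back(static_cast<uint8_t>(acc >> (nbits - 8)));
                nbits -= 8;
            }
        }
        if (nbits > 0)                     // flush the remainder, zero-padded
            out.push_back(static_cast<uint8_t>(acc << (8 - nbits)));
        return out;
    }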
To get closer than that, you could compress your input with something like Huffman or an LZ* compression (or both) before encrypting it. Then you'd do roughly the same thing: encrypt the bytes, and encode the bytes using values from 0 to 9 or 0 to 99. This will give better usage of the bits in the bytes you encrypt, so you'd waste very little space in that transformation, but does nothing to improve the encoding on the output side.

For those doubting FFX mode AES, please feel free to contact me for further details. Our implementation is a mode of AES that effectively sits on top of existing ciphers. The specification with proof/validation is up on the NIST modes website. FFSEM mode AES is included under FFX mode.
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ffx/ffx-spec.pdf
If it's meaningful, you can also have a conversation with NIST directly about the status of the modes submission/AES modes acceptance to address your FIPS question. FFX has security proofs and independent cryptographic review, and is not a "new cipher"; it is based on methods that go back 20+ years - proven techniques. In implementation we have the ability to encrypt data whilst preserving length, structure, integrity, type and format. For example, one can specify an explicit format policy that the output will be NNN-NN-NNNN.
So, as a mode of AES, on a mainframe environment we can simply use the native AES processor on a z10; similarly on open systems with HSM devices, we can sit on top of an existing AES implementation.
Format Preserving Encryption (as it's often referred to) in this way is already being used in industry, is available in off-the-shelf products, and is rather quick to deploy - already used in POS devices, payments systems, enterprise deployments, etc.
Mark Bower
VP Product Management
Voltage Security
Drop a note to info@voltage.com or take a look at our website for more info.

Something like a Feistel cipher should fit your requirements. Split your input number into two parts (say 8 digits each), pass one part through a not-necessarily-reversible-or-bijective function, and subtract the other part from the result of that function (modulo e.g. 100,000,000). Then rearrange the digits somehow and repeat a number of times, ideally varying the function slightly on each round. Decryption is similar, except that one starts by undoing the last rearrangement step, then subtracts the second part of the message from the result of applying the last round's function to the first part (again modulo 100,000,000), then undoes the previous rearrangement step, and so on.
The biggest difficulties with a Feistel cipher are finding a round function which achieves good encryption in a reasonable number of rounds, and figuring out how many rounds are required. If speed is not important, one could probably use something like AES to perform the scrambling function (since it doesn't have to be bijective, you could arbitrarily pad the data before each AES step and interpret the result as a big binary number modulo 100,000,000). As for the number of rounds, 10 is probably too few and 1000 is probably excessive; I don't know what value in between would be best.
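To make the structure concrete, here is a toy C++ sketch over two 8-digit halves. The round function is a deliberately arbitrary placeholder (emphatically not secure), the "rearrangement" is a plain swap of halves, and all names are mine.

    #include <cstdint>

    const uint32_t MOD = 100000000;   // 10^8: each half is 8 decimal digits

    // Placeholder mixing; need not be reversible, only deterministic.
    uint32_t round_fn(uint32_t x, uint32_t round_key) {
        return static_cast<uint32_t>(
            (static_cast<uint64_t>(x) * 2654435761u + round_key) % MOD);
    }

    // One round maps (L, R) to (F(L) - R mod 10^8, L); repeat for `rounds`.
    void feistel_encrypt(uint32_t& left, uint32_t& right,
                         const uint32_t* keys, int rounds) {
        for (int r = 0; r < rounds; ++r) {
            uint32_t t = (round_fn(left, keys[r]) + MOD - right) % MOD;
            right = left;
            left = t;
        }
    }

    void feistel_decrypt(uint32_t& left, uint32_t& right,
                         const uint32_t* keys, int rounds) {
        for (int r = rounds - 1; r >= 0; --r) {
            uint32_t t = left;        // undo the swap...
            left = right;
            right = (round_fn(left, keys[r]) + MOD - t) % MOD;  // ...and the subtraction
        }
    }

Note that decryption recovers R as F(L) - (F(L) - R) mod 10^8, which is why the round function itself never needs to be inverted.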

Using only 10 digits as input/output is completely insecure. It is so insecure that it would very likely be cracked in a real application, so consider using at least 39 digits (the equivalent of 128 bits). If you are going to use only 10 digits, there is no point in using AES; you might as well invent your own (insecure) algorithm.
The only way you might get out of this is to use a stream cipher. Use a 256-bit key ("SecureKey") and an initialisation vector IV that is different at the start of each session.
Translate the cipher output into a 77-digit (decimal) number and use digit-wise addition modulo 10 (discarding carries).
For instance:
    AES(IV, KEY)   = 4534670...   // and lots more digits
    secret_message = 01235
    + (digit-wise, mod 10)
    --------------------------
    ciphertext     = 46571        // and you still have 72 digits left for the next message
When you run out of digits from the stream cipher, increment the IV (IV = IV + 1) and generate more: AES(IV, KEY).
Keep in mind that you should absolutely never use the same IV twice, so you need some scheme on top of this to prevent that.
Another concern is generating the stream itself: if you generate a number higher than 10^77, you should discard it, increase the IV, and try again with the new IV. Otherwise there is a high probability that your numbers will be biased, and that is a vulnerability.
It is also very likely that there is a flaw in this scheme, or that one will creep into your implementation.

Related

Is there any regex to find that a value is hashmap or not?
I want to build a PL/SQL function that tells whether a value is hashed or not.
For example:
1. TIM
2. F6099C0932D0E2B13286218F99C265975B33FD84
My regex should be able to tell me that expression 1 (TIM) is not hashed, whereas expression 2 (F6099C0932D0E2B13286218F99C265975B33FD84) is.
A hash is just a number of bits of a particular size. Cryptographic hashes generally have an output size of 256 to 512 bits, giving a security level of about 128-256 bits for collision resistance.
Other hashes used in a hash map may be smaller, as collision resistance is usually not required; instead the hash just needs to be well distributed, so that hashed values are distributed equally.
Computers generally only address bytes, not bits. So commonly the hashes are a multiple of 8 bits. Even more generally, they are commonly a power of two, or two or three powers of two added together (160 bits for 128 + 32 bits).
Now, to show those well-distributed bytes, we need some way to represent the bit values using printable characters. One way to do that is base 64. However, for these relatively short values, hexadecimal is usually preferred, and that's what you have in the question.
So can you see if it is a hash value or not? Well, yes and no. You can see with pretty good likelihood that it is a 40-character hexadecimal value, which represents a 20-byte, i.e. 20 × 8 = 160-bit, value. We can also "see" that it is pretty well distributed and that it doesn't encode printable ASCII (as there are values above 7E hex).
Testing with a regex that the contents are (upper- or lowercase) hexadecimals is easy enough, and that it is 40 characters for 160 bits is easy as well. However, testing that it is indeed a well-distributed value is not really possible with regular expressions. It is not even easy for program code in general, as "random" values may now and then look surprisingly non-random. Besides that, hashes are not the only things consisting of well-distributed byte values: ciphertext and - of course - random byte values have similar properties.
So, yeah, you can verify that the output format is compatible with a hash value, but testing if it is a hash value is not really possible.
The regexp:
[0-9A-Fa-f]{40}
would of course wipe "Tim" out. You can be 100% sure that "Tim" is not a 160 bit hash value encoded in hexadecimals after all.
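The question asks about PL/SQL, but for illustration here is the same format check in C++ (the function name is mine):

    #include <regex>
    #include <string>

    // True if the value is 40 hex characters, i.e. it *could* be a SHA-1 hash.
    bool looks_like_sha1_hex(const std::string& s) {
        static const std::regex hex40("^[0-9A-Fa-f]{40}$");
        return std::regex_match(s, hex40);
    }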

How do I represent an LZW output in bytes?

I found an implementation of the LZW algorithm and I was wondering how I can represent its output, which is an int list, as a byte array.
I tried using one byte per value, but for long inputs the dictionary grows past 256 entries, so the codes no longer fit.
Then I tried to add an extra byte to indicate how many bytes are used to store the values, but in this case I have to use 2 bytes for each value, which doesn't compress enough.
How can I optimize this?
As bits, not bytes. You just need a simple routine that writes an arbitrary number of bits to a stream of bytes. It simply keeps a one-byte buffer into which you put bits until you have eight. Then write that byte, clear the buffer, and start over. The process is reversed on the other side.
When you get to the end, just write the last byte buffer if not empty with the remainder of the bits set to zero.
You only need to figure out how many bits are required for each symbol at the current state of the compression. That same determination can be made on the other side when pulling bits from the stream.
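A minimal C++ sketch of such a bit writer (LSB-first by my choice; the reader on the other side must use the same bit order):

    #include <cstdint>
    #include <vector>

    // Accumulates bits in a one-byte buffer and flushes each completed byte.
    class BitWriter {
        std::vector<uint8_t>& out_;
        uint8_t buf_ = 0;
        int nbits_ = 0;
    public:
        explicit BitWriter(std::vector<uint8_t>& out) : out_(out) {}
        void write(uint32_t value, int bits) {   // emit the low `bits` bits of value
            for (int i = 0; i < bits; ++i) {
                buf_ |= static_cast<uint8_t>(((value >> i) & 1u) << nbits_);
                if (++nbits_ == 8) { out_.push_back(buf_); buf_ = 0; nbits_ = 0; }
            }
        }
        void flush() {                           // pad the final partial byte with zeros
            if (nbits_ > 0) { out_.push_back(buf_); buf_ = 0; nbits_ = 0; }
        }
    };

Writing a 9-bit LZW code is then just writer.write(code, 9), moving to 10 bits once the dictionary grows past 511, and so on.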
In his 1984 article on LZW, T.A. Welch did not actually state how to "encode codes", but described mapping "strings of input characters into fixed-length codes", continuing "use of 12-bit codes is common". (Allows bijective mapping between three octets and two codes.)
The BSD compress(1) command didn't follow the article literally, but introduced a header, the interesting part being a specification of the maximum number of bits to use to encode an LZW output code, allowing decompressors to size decompression tables appropriately or fail early and in a controlled way. Except for the very first, codes were encoded with just the integral number of bits necessary, starting with 9.
An alternative would be to use arithmetic coding, especially if using a model other than "every code is equally probable".

Store SHA-1 in database in less space than the 40 hex digits

I am using a hash algorithm to create a primary key for a database table. I use the SHA-1 algorithm, which is more than fine for my purposes. The database even ships an implementation for SHA-1. The function computing the hash returns a hex value of 40 characters, so I am storing those hex characters in a char(40) column.
The table will have lots of rows, >= 200 million, which is why I am looking for less data-intensive ways of storing the hash. 40 characters times ~200 million rows will require some GB of storage... Since hex is base 16, I thought I could try to store it in base 256 in the hope of reducing the number of characters needed to around 20. Do you have tips or papers on implementations of compression with base 256?
Store it as a blob: storing 8 bits of data per character instead of 4 is 2x compression (you need some way to convert it, though).
Cut off some characters: you have 160 bits, but 128 bits is enough for unique keys even if the universe ends, and for most purposes 80 bits would even be enough (you don't need cryptographic protection). If you have an anti-collision algorithm, 36 or 40 bits is enough.
A SHA-1 value is 20 bytes. All the bits in these 20 bytes are significant; there's no way to compress them. By storing the bytes in their hexadecimal notation, you're wasting half the space: it takes exactly two hexadecimal digits to store a byte. So you can't compress the underlying value, but you can use a better encoding than hexadecimal.
Storing as a blob is the right answer. That's base 256. You're storing each byte as that byte with no encoding that would create some overhead. Wasted space: 0.
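A small C++ sketch of the hex-to-blob packing (names are mine; the inverse is symmetric):

    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Pack the 40-character hex form into 20 raw bytes for blob storage.
    std::vector<uint8_t> hex_to_bytes(const std::string& hex) {
        auto nibble = [](char c) -> int {
            if (c >= '0' && c <= '9') return c - '0';
            if (c >= 'a' && c <= 'f') return c - 'a' + 10;
            if (c >= 'A' && c <= 'F') return c - 'A' + 10;
            throw std::invalid_argument("not a hex digit");
        };
        std::vector<uint8_t> out;
        out.reserve(hex.size() / 2);
        for (size_t i = 0; i + 1 < hex.size(); i += 2)
            out.push_back(static_cast<uint8_t>(nibble(hex[i]) << 4 | nibble(hex[i + 1])));
        return out;
    }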
If for some reason you can't do that and you need to use a printable string, then you can do better than hexadecimal by using a more compact encoding. With hexadecimal, the storage requirement is twice the minimum (assuming that each character is stored as one byte). You can use Base64 to bring the storage requirements to 4 characters per 3 bytes, i.e. you would need 28 characters to store the value. In fact, given that you know that the length is 20 bytes and not 21, the base64 encoding will always end with a =, so you only need to store 27 characters and restore the trailing = before decoding.
You could improve the encoding further by using more characters. Base64 uses 64 code points out of the available 256 byte values. ASCII (the de facto portable character set) has 95 printable characters (including space), but there's no common "base95" encoding; you'd have to roll your own. Base85 is an intermediate choice which does get some use in practice, and lets you store the 20-byte value in 25 printable ASCII characters.

32 bit CRC with some inputs set to zero. Is this less accurate than dummy data?

Sorry if I should be able to answer this simple question myself!
I am working on an embedded system with a 32-bit CRC done in hardware for speed. A utility exists that I cannot modify, which takes 3 input words and returns a CRC.
If a standard 32-bit CRC was implemented, would generating a CRC from one 32-bit word of actual data plus two 32-bit words consisting only of zeros produce a less reliable CRC than if I made up/set some random values for those last two words?
Depending on the CRC/polynomial, my limited understanding of CRC would say the more data you put in, the less accurate it is. But doesn't zeroed data reduce accuracy when performing the shifts?
Using zeros will be no different than some other value you might pick. The input word will be just as well spread among the CRC bits either way.
I agree with Mark Adler that zeros are mathematically no worse than other numbers. However, if the utility you can't change does something bad like set the initial CRC to zero, then choose non-zero pad words. An initial CRC=0 + Data=0 + Pads=0 produces a final CRC=0. This is technically valid, but routinely getting CRC=0 is undesirable for data integrity checking. You could compensate for a problem like this with non-zero pad characters, e.g. pad = -1.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: after converting the data to an array of XOR deltas, zlib shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0011
0000
1000
In code (with n the number of elements):
compressed[0] = uncompressed[0];
for (size_t i = 1; i < n; ++i)
    compressed[i] = uncompressed[i - 1] ^ uncompressed[i];
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0];
for (size_t i = 1; i < n; ++i)
    uncompressed[i] = uncompressed[i - 1] ^ compressed[i];
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: instead of storing the numbers themselves, store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'd otherwise encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high bit of the 8-bit integer to mean "this number requires the next 8 bits as well".
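A C++ sketch of that difference coding (the zig-zag step is my addition, since the description above doesn't say how to handle negative gaps):

    #include <cstdint>
    #include <vector>

    // Gaps that fit in 7 bits cost one byte; a set high bit means "the
    // next byte carries 7 more bits". Zig-zag keeps small negative gaps small.
    std::vector<uint8_t> encode_deltas(const std::vector<int32_t>& v) {
        std::vector<uint8_t> out;
        int32_t prev = 0;
        for (int32_t x : v) {
            int32_t d = x - prev;
            prev = x;
            uint32_t u = (static_cast<uint32_t>(d) << 1) ^ (d < 0 ? ~0u : 0u);
            while (u >= 0x80) {            // 7 payload bits per byte
                out.push_back(static_cast<uint8_t>(u | 0x80));
                u >>= 7;
            }
            out.push_back(static_cast<uint8_t>(u));
        }
        return out;
    }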
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and both run very fast and with very little memory (as opposed to, say, bzip).
You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break your ints up each into 4 bytes, so i[0], i[1], i[2], ..., i[n] becomes b[0,0], b[0,1], b[0,2], b[0,3], b[1,0], b[1,1], b[1,2], b[1,3], ..., b[n,0], b[n,1], b[n,2], b[n,3]. Then write out all the b[i,0]s, followed by the b[i,1]s, b[i,2]s, and b[i,3]s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like run-length encoding or zlib. This is my favourite of the methods I present. (A sketch of this appears after the last idea below.)
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you may still have largish differences, but if large numeric differences usually correspond to only one or two differing bits, you may be better off with a scheme where you create a byte array: use the first 4 bytes to encode the first integer, and then for each subsequent entry use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could reduce the raw number of bytes needed to represent an integer to something close to 2 on average, with most bytes coming from a limited set (0-32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
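A small C++ sketch of the first idea, the byte-plane split (names are mine):

    #include <cstdint>
    #include <vector>

    // Group all the low bytes together, then the next bytes, and so on.
    // Nearly-identical integers then yield long runs of repeated bytes,
    // which RLE or zlib handle very well.
    std::vector<uint8_t> byte_planes(const std::vector<uint32_t>& v) {
        std::vector<uint8_t> out;
        out.reserve(v.size() * 4);
        for (int plane = 0; plane < 4; ++plane)
            for (uint32_t x : v)
                out.push_back(static_cast<uint8_t>(x >> (8 * plane)));
        return out;
    }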
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences; perhaps use an 8-bit byte, with something like this as a starting point (a sketch follows the list below):
1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,
1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.
6 bits to record the number of the bit to flip relative to your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
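A hedged C++ sketch of that starting point; the exact bit assignments, and the handling of identical or wildly different integers, are my guesses at one reasonable reading:

    #include <cstdint>
    #include <vector>

    // Guessed byte layout: bit 7 = "a complete new integer follows",
    // bit 6 = "another difference byte follows", bits 0-5 = bit number.
    void encode_next(std::vector<uint8_t>& out, uint32_t prev, uint32_t cur) {
        uint32_t diff = prev ^ cur;
        int nbits = 0;
        for (uint32_t d = diff; d; d &= d - 1) ++nbits;   // count differing bits
        if (nbits == 0 || nbits > 4) {     // identical or too different:
            out.push_back(0x80);           // store the whole integer instead
            for (int i = 0; i < 4; ++i)
                out.push_back(static_cast<uint8_t>(cur >> (8 * i)));
            return;
        }
        for (int b = 0; b < 32; ++b)
            if (diff & (1u << b)) {
                --nbits;
                out.push_back(static_cast<uint8_t>(b | (nbits ? 0x40 : 0)));
            }
    }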
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.