Save a big array of numbers in a cookie

I have to save 400 numbers in a cookie on the client's side. Every number is from 1 to 10. Any suggestions on how to organize the data in the cookie to make it compact? (I'm going to work with the cookies using Python on the server side and JavaScript on the client's side.)

Another solution would be to use `localStorage`:

    for (var i = 0; i < 400; i++) {
        localStorage[i] = Math.floor((Math.random() * 10) + 1); // random 1..10; stored as a string
    }
    console.log(localStorage);
    alert(JSON.stringify(localStorage));
http://jsfiddle.net/S2uUn/

OK, if you want to do this entirely on the client side, then let's try another approach. For every three integers (each from 1 to 10) that you want to store, there are 1000 combinations (10 * 10 * 10 = 1000). 12 bits give you 1024 combinations (2^12 = 1024). So you can store 3 such integers in 12 bits, and 400 integers (each 1-10) can be stored in roughly 1600 bits (400 / 3 * 12, padding the last partial group). The logic for packing the integers bitwise this way can be implemented easily on the client side in JavaScript. 1600 bits = 200 bytes (1 byte = 8 bits), so you can store the 400 integers in about 200 bytes. This is binary data, so to store it in a cookie it has to be converted to ASCII text. Base64 encoding is one way to do that, and this can be done client-side using the functions at How can you encode a string to Base64 in JavaScript?. Base64 encoding produces 4 characters for every 3 bytes encoded, so the resulting string stored in the cookie would be roughly 267 characters long (200 * 4 / 3). The whole thing can be done this way on the client side in JavaScript, and 400 integers (each 1-10) fit in a cookie of roughly 267 characters.
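Here is a rough sketch of that packing in Python for the server side (the client-side JavaScript would mirror the same bit twiddling); the function name, the padding value, and the stand-in data are just illustrative:

    import base64

    def pack_cookie(values):
        """values: integers, each 1..10; returns a Base64 string suitable for a cookie."""
        # pad to a multiple of 3 so every group of three fits exactly into 12 bits
        padded = values + [1] * (-len(values) % 3)
        bits, nbits, out = 0, 0, bytearray()
        for i in range(0, len(padded), 3):
            a, b, c = padded[i:i + 3]
            group = (a - 1) * 100 + (b - 1) * 10 + (c - 1)   # 0..999, fits in 12 bits
            bits = (bits << 12) | group
            nbits += 12
            while nbits >= 8:                                # drain whole bytes
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)
        return base64.b64encode(bytes(out)).decode('ascii')

    print(len(pack_cookie([((i * 7) % 10) + 1 for i in range(400)])))   # 268 characters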

How about writing the list of integers to a comma-separated list, then compressing it with gzip (which produces binary output), then piping the binary output from gzip through a Base64 encoder, which produces text that can be stored as a cookie? I would guess that the result would end up being only about 100 bytes in size, which can easily be stored in a cookie.
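On the server side that pipeline could look roughly like this in Python, with zlib from the standard library standing in for the gzip command line tool (the values below are made up just to show the sizes involved):

    import base64, zlib

    values = [((i * 7) % 10) + 1 for i in range(400)]            # stand-in for the real 400 numbers
    text = ",".join(map(str, values))                            # the comma-separated list
    cookie = base64.b64encode(zlib.compress(text.encode(), 9)).decode('ascii')
    print(len(text), len(cookie))                                # the cookie string is much shorter than the raw list

    # reading it back
    restored = zlib.decompress(base64.b64decode(cookie)).decode()
    assert restored == text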


How to save a Huffman table in a file in a way that uses the least storage?

It's my first question on Stack Overflow. It's long, but I have explained it in detail and I think it's understandable.
I'm writing a Huffman coder in C++ and have saved the characters and their codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table (made by the Huffman tree): [image of the character-to-code table]
Now, I want to save this table to a file in the best way.
I can't save it like this: A1B001C010D001E000
When it is converted to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this.
If I save the table in the normal way, every character uses 8 bits to store its code,
while my characters have 1-bit or 3-bit codes (in this case).
This way uses too much storage.
My idea is to add a separator character and assign a code to it.
If we add a separator character, build the Huffman tree, and write out the codes, we get a table like this:
table2: [image of the table including the separator character]
Now, we can write the codes this way:
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found, so A = everything before sep = 0.
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found, so B = everything before sep = 110.
And so on ...
This way still uses some extra storage for the separator (number of characters * separator size).
My question: is there a way to save the first table in a file using less storage?
For example, like this: A1B001C010D001E000.
Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
You can store the lengths of the codes (as Mark said) as a 256-byte header at the start of your compressed data. Each byte stores the length of one code, and because you're working with bytes with 256 possible values, and the Huffman tree can only be of a certain depth (number of possible values - 1), 8 bits are enough to store each code length.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
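To make the canonical-code idea concrete, here is a minimal sketch (in Python for brevity, though the question is about C++) that rebuilds a valid set of codes from a length table alone; the symbol values and lengths below just mirror the question's example:

    def canonical_codes(lengths):
        """lengths[i] = code length in bits of symbol i, 0 if the symbol is unused."""
        # canonical Huffman: assign codes in order of (length, symbol value)
        symbols = sorted((l, s) for s, l in enumerate(lengths) if l > 0)
        codes, code, prev_len = {}, 0, symbols[0][0]
        for length, sym in symbols:
            code <<= (length - prev_len)          # widen the code when the length increases
            codes[sym] = format(code, '0{}b'.format(length))
            code += 1
            prev_len = length
        return codes

    # lengths for 'A' (1 bit) and 'B'..'E' (3 bits each), indexed by byte value
    print(canonical_codes([0] * 65 + [1] + [3, 3, 3, 3]))
    # -> {65: '0', 66: '100', 67: '101', 68: '110', 69: '111'}
    #    a valid prefix code with the same lengths as the question's table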
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, using 0s for internal nodes and 1s for leaves. Then, after the shape, store the values of the leaves.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
        *
       / \
      *   A
     / \
    *   *
   / \ / \
  E  D C  B
So you would store the shape of the tree as such:
000110111EDCBA
The reason storing the Huffman codes this way is better when you are compressing English text is that storing the shape of the tree costs 10n - 1 bits (where n is the number of unique characters in the data you are trying to compress), while storing the code lengths costs a flat 2048 bits. Therefore, for fewer than 205 unique characters, storing the shape of the tree is more efficient, and because the average English string of text isn't going to use all that many of the 256 possible byte values, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.
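For illustration, here is a small sketch of that shape-plus-leaves serialization (pre-order traversal; Python for brevity, and Node is just an assumed minimal tree class):

    class Node:
        def __init__(self, symbol=None, left=None, right=None):
            self.symbol, self.left, self.right = symbol, left, right

    def write_shape(node, bits, leaves):
        """Pre-order: append '0' for an internal node, '1' for a leaf; collect leaf symbols."""
        if node.symbol is not None:               # leaf
            bits.append('1')
            leaves.append(node.symbol)
        else:                                     # internal node
            bits.append('0')
            write_shape(node.left, bits, leaves)
            write_shape(node.right, bits, leaves)

    # the tree from the answer above: ((E D)(C B)) on the left, A on the right
    tree = Node(left=Node(left=Node(left=Node('E'), right=Node('D')),
                          right=Node(left=Node('C'), right=Node('B'))),
                right=Node('A'))
    bits, leaves = [], []
    write_shape(tree, bits, leaves)
    print(''.join(bits), ''.join(leaves))         # -> 000110111 EDCBA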

How to check formatting of a SHA-1 message-digest [duplicate]

This question already has answers here:
A Regex to match a SHA1
I need some basic validation (sanitation checks) to determine if some input is a valid SHA1 sum or just a (random) string. If possible with simple parsing rules or a Regex.
Are there any rules to what a SHA1 sum should adhere? I cannot find any, but from quick tests, all seem to be hexadecimal and around 40 characters long[1].
I am not interested in tests that prove whether or not the SHA-1 sum was made in a secure, properly random or other manner. Just that the format is correct.
I am also not interested in testing that the digest is an actual representation of some message; just that it has the format of a digest in the first place.
For the curious: this is for an application where I build avatars for users based on, among other things, their uuid. I don't, however, want to place those uuids in the URL, but obfuscate them a little. So instead of avatars/baa4833d-b962-4ab1-87c5-283c9820eac4.png, we request avatars/5f2a13cb1d84a2e019842cdb8d0c8b03c9e1e414.png, where 5f2a... is e.g. Digest::SHA1.hexdigest(uuid + "secret").
On the receiving side, I am adding some basic protection that sends back a 400 Bad Request whenever something is obviously invalid, such as avatars/haxor.png or avatars/traversal../../../../attempt.png. Note that this is a very much simplified example.
[1] Two tests with different outcome:
Using sha1sum on Ubuntu Linux:
$ echo "hello" | sha1sum | cut -d" " -f1 | wc -c
41
using Ruby's Digest:
Digest::SHA1.hexdigest("hello").length
=> 40
Edit: turns out this is me being stupid; wc -c includes the newline, as kennytm points out in the comments. Still: is it safe to assume it will always be 40 characters?
SHA-1 has a 160-bit digest.
160 bits is 160 / 8 = 20 bytes.
20 bytes rendered in hexadecimal format have a length of 40 characters (digits), two characters for each byte.
The digits can be [0-9a-f].
So the following regex should correctly validate a SHA-1 sum rendered as a string in hexadecimal format:
/^[0-9a-f]{40}$/
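For example, checked quickly in Python (the same pattern works in Ruby or JavaScript; add a case-insensitive flag if you also want to accept uppercase hex):

    import hashlib, re

    SHA1_HEX = re.compile(r'^[0-9a-f]{40}$')

    digest = hashlib.sha1(b'hello').hexdigest()
    print(len(digest))                       # 40, always
    print(bool(SHA1_HEX.match(digest)))      # True
    print(bool(SHA1_HEX.match('haxor')))     # False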

Why does base64-encoded data compress so poorly?

I was recently compressing some files, and I noticed that base64-encoded data seems to compress really badly. Here is one example:
Original file: 429,7 MiB
compress via xz -9:
13,2 MiB / 429,7 MiB = 0,031 4,9 MiB/s 1:28
base64 it and compress via xz -9:
26,7 MiB / 580,4 MiB = 0,046 2,6 MiB/s 3:47
base64 the original compressed xz file:
17,8 MiB in almost no time = the expected 1.33x increase in size
So what can be observed is that:
xz compresses really well ☺
base64-encoded data doesn't compress well; it is 2 times larger than the unencoded compressed file
base64-then-compress is significantly worse and slower than compress-then-base64
How can this be? Base64 is a lossless, reversible algorithm; why does it affect compression so much? (I tried with gzip as well, with similar results.)
I know it doesn't make sense to base64-then-compress a file, but most of the time one doesn't have control over the input files, and I would have thought that the actual information density (or whatever it is called) of a base64-encoded file would be nearly identical to that of the non-encoded version, and thus that it would be similarly compressible.
Most generic compression algorithms work with a one-byte granularity.
Let's consider the following string:
"XXXXYYYYXXXXYYYY"
A Run-Length-Encoding algorithm will say: "that's 4 'X', followed by 4 'Y', followed by 4 'X', followed by 4 'Y'"
A Lempel-Ziv algorithm will say: "That's the string 'XXXXYYYY', followed by the same string: so let's replace the 2nd string with a reference to the 1st one."
A Huffman coding algorithm will say: "There are only 2 symbols in that string, so I can use just one bit per symbol."
Now let's encode our string in Base64. Here's what we get:
"WFhYWFlZWVlYWFhYWVlZWQ=="
All algorithms are now saying: "What kind of mess is that?". And they're not likely to compress that string very well.
As a reminder, Base64 basically works by re-encoding groups of 3 bytes in (0...255) into groups of 4 bytes in (0...63):
Input bytes : aaaaaaaa bbbbbbbb cccccccc
6-bit repacking: 00aaaaaa 00aabbbb 00bbbbcc 00cccccc
Each output byte is then transformed into a printable ASCII character. By convention, these characters are (here with a mark every 10 characters):
0         1         2         3         4         5         6
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
For instance, our example string begins with a group of three bytes equal to 0x58 in hexadecimal (ASCII code of character "X"). Or in binary: 01011000. Let's apply Base64 encoding:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
6-bit repacking : 00010110 00000101 00100001 00011000
As decimal : 22 5 33 24
Base64 characters: 'W' 'F' 'h' 'Y'
Output bytes : 0x57 0x46 0x68 0x59
Basically, the pattern "3 times the byte 0x58" which was obvious in the original data stream is not obvious anymore in the encoded data stream because we've broken the bytes into 6-bit packets and mapped them to new bytes that now appear to be random.
Or in other words: we have broken the original byte alignment that most compression algorithms rely on.
Whatever compression method is used, Base64-encoding the data beforehand usually has a severe impact on the algorithm's performance. That's why you should always compress first and encode second.
This is even more true for encryption: compress first, encrypt second.
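A small illustration of both points in Python, with zlib standing in for xz/gzip (the stand-in data and the exact sizes are only illustrative, but the ordering effect is the same):

    import base64, zlib

    # the worked example above: three 0x58 bytes become 'WFhY'
    print(base64.b64encode(b'\x58\x58\x58'))            # b'WFhY'

    # compare compressing raw data vs. compressing its Base64 encoding
    lines = ['the quick brown fox jumps over the lazy dog %d\n' % (i % 50) for i in range(20000)]
    raw = ''.join(lines).encode()

    compressed_raw = zlib.compress(raw, 9)
    compressed_b64 = zlib.compress(base64.b64encode(raw), 9)
    print(len(raw), len(compressed_raw), len(compressed_b64))
    # compressed_b64 is typically noticeably larger than compressed_raw,
    # even though both inputs carry exactly the same information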
EDIT - A note about LZMA
As MSalters noticed, LZMA -- which xz uses -- works on bit streams rather than byte streams.
Still, this algorithm will also suffer from Base64 encoding in a way which is essentially consistent with my earlier description:
Input bytes : 0x58 0x58 0x58
As binary : 01011000 01011000 01011000
(see above for the details of Base64 encoding)
Output bytes : 0x57 0x46 0x68 0x59
As binary : 01010111 01000110 01101000 01011001
Even by working at the bit level, it's much easier to recognize a pattern in the input binary sequence than in the output binary sequence.
Compression is necessarily an operation that acts on multiple bits. There's no possible gain in trying to compress an individual "0" and "1". Even so, compression typically works on a limited set of bits at a time. The LZMA algorithm in xz isn't going to consider all of the 3.6 billion bits at once. It looks at much smaller strings (<273 bytes).
Now look at what base-64 encoding does: It replaces a 3 byte (24 bit) word with a 4 byte word, using only 64 out of 256 possible values. This gives you the x1.33 growth.
Now it is fairly clear that this growth must cause some substrings to grow past the maximum substring size of the encoder. This causes them to no longer be compressed as a single substring, but as two separate substrings.
As you have a lot of compression (97%), you apparently have the situation that very long input substrings are compressed as a whole. This means that you will also have many substrings being expanded by Base64 past the maximum length the encoder can deal with.
It's not Base64; it's the memory requirements of the libraries: "The presets 7-9 are like the preset 6 but use bigger dictionaries and have higher compressor and decompressor memory requirements." https://tukaani.org/xz/xz-javadoc/org/tukaani/xz/LZMA2Options.html

Base64 algorithm with multiple result of a factor in C++

I have the following problem: I need to convert certain content to Base64 to prevent some problems with characters, and after this conversion I need to encrypt the data with an AES algorithm with a key length of 16. The problem occurs when the Base64 algorithm returns a result whose size is not a multiple of 16, causing problems on encryption, even though the size of the original content is a multiple of 16. How can I avoid this problem?
Pad the result of the Base64 encoding to be a multiple of 16.
A Base64 encoder encodes your 8-bit data using 6 bits per character, so for every 3 bytes of input data you get 4 bytes of output data (plus padding to a 4-byte boundary). So the data passed to the AES algorithm may have a length that is not a multiple of 16.
Please check the documentation for your AES library - chances are it can handle this internally if you call a special function for the last chunk of data (e.g. EVP_EncryptFinal_ex in OpenSSL). Another solution is to pad the data to a 16-byte boundary in your code before encrypting it.
Most AES implementations support padding such as PKCS#7 (née PKCS#5). This will add the required padding bytes on encryption and remove them on decryption.
AES operates on raw bytes and as such accepts any input, so it's not necessary to Base64-encode prior to encryption. Encrypted data is just bytes, not characters, so the output may need to be encoded with a method such as Base64 or hexadecimal depending on the usage, such as the need for printable characters.
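For illustration, a minimal PKCS#7 pad/unpad sketch (shown in Python rather than C++ for brevity; most AES libraries already do this for you, so treat it as a reference rather than something to roll yourself):

    def pkcs7_pad(data: bytes, block: int = 16) -> bytes:
        # append n bytes, each of value n, so the length becomes a multiple of the block size
        n = block - (len(data) % block)
        return data + bytes([n]) * n

    def pkcs7_unpad(data: bytes) -> bytes:
        return data[:-data[-1]]

    padded = pkcs7_pad(b'YWJjZGVmZ2hpamtsbW5vcA==')     # a 24-byte Base64 string -> 32 bytes
    assert len(padded) % 16 == 0
    assert pkcs7_unpad(padded) == b'YWJjZGVmZ2hpamtsbW5vcA=='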

Where can I get sample Reed-Solomon encoded data?

I want to write a Reed-Solomon decoder and experiment with performance improvements. Where could I find sample data with appended Reed-Solomon parity bytes?
I am aware that Reed-Solomon is used in all kinds of 1D and 2D bar codes, but I would like to have the raw data (an array of bytes) with clear separation of payload and parity bytes.
Any help is appreciated.
Basically, a Reed-Solomon code will be composed of symbols with a value between 0 and 2^m - 1, where m is the exponent of the Galois field used to generate the RS code. For example, in GF(2^8) (2^8 = 256), you will get an RS code composed of symbols between 0 and 255 (compatible with ASCII, UTF-8 and usual binary encodings). In GF(2^16), you will get symbols between 0 and 65535 (compatible with UTF-16, or with UTF-8 if you encode 2 characters as one).
Other than the range of values of each symbol of an RS code, all the rest can basically be considered random from an external point of view (it's not if you have the generator polynomial and the Galois field, but for your purpose of getting a sample, you can assume a random distribution of values in the valid range).
If you want to generate real samples of an RS code with the corresponding data block, you can use the Python library pyFileFixity (disclaimer: I am the author). By default, each ecc block is separated by an md5 digest, so that you can clearly separate them. The original data is not stored, but you can easily do that by modifying the script structural_adaptive_ecc.py or header_ecc.py (the latter will be easier to modify) to also store the original data (it's just a file.write() to edit). If Python is not your thing, you can probably find a Reed-Solomon library for your language of choice, and just make a slight modification to print or save to a file the original data along with the ecc blocks.
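If you just want a quick byte-level sample to test a decoder against, a small Reed-Solomon library also works; for example, with the Python reedsolo package the parity bytes are simply appended to the message, so payload and parity are trivial to separate (package name and API as I recall them, so double-check against its documentation):

    # pip install reedsolo
    from reedsolo import RSCodec

    NSYM = 10                              # number of parity bytes per message
    rsc = RSCodec(NSYM)                    # Reed-Solomon over GF(2^8)

    payload = b'hello world, this is a Reed-Solomon test payload'
    encoded = bytes(rsc.encode(payload))   # payload followed by NSYM parity bytes

    data, parity = encoded[:-NSYM], encoded[-NSYM:]
    assert data == payload                 # clear separation of payload and parity bytes
    print(parity.hex())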