Convert a string to a sequence of binary (python-2.7)

I want to convert any char to its binary representation (not a string, like my current code does now); it needs to be a sequence of binary numbers.
After that I will take every 16 bits of the result and calculate their sum.
I can't use numpy or any other package.
This is what I have now:
def checksum(st):
    data = ' '.join(map(bin,bytearray(st)))
    binar = [data[i:i+16] for i in range(0, len(data), 16)]
    check = 0xffff
    for hex in binar:
        check += int(hex,2)
    return check
My current code gets a string (for example: '10100/01') and I want to sum every 16 bits of the string; therefore I need to convert the string to binary numbers and then sum every 16 bits together.

This answers your question, assuming I understood it properly. The first two lines of your code don't seem to do what you want to achieve, but maybe you just forgot mentioning something.
Anyhow.
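def checksum2(st):
    # Pad odd-length input so st[count+1] never runs past the end.
    if len(st) % 2:
        st += '\x00'
    dummy = 0xFFFF
    for count in xrange(0, len(st), 2):
        # low byte + high byte * 256 = one little-endian 16-bit word
        dummy += ord(st[count]) + ord(st[count+1]) * 256
    return dummy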
This code steps through every second char of your string and adds the value of one char to the value of the next char times 256, which creates a 16-bit word. Remove the *256 if you didn't actually want to create a proper 16-bit value and instead only wanted to add two 8-bit values together. And if you need big endian rather than little endian, just move the *256 to the other ord().
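For comparison, here is a minimal sketch of the same idea in Python 3 (the answer above targets Python 2). The name `checksum16`, the `big_endian` flag, the latin-1 encoding of text input and the zero-padding of odd-length input are all assumptions made for illustration; the question does not specify them.

```python
def checksum16(data, big_endian=False):
    """Sum the input as a sequence of 16-bit words, starting from 0xFFFF."""
    if isinstance(data, str):
        data = data.encode("latin-1")  # assumption: one byte per character
    if len(data) % 2:
        data += b"\x00"  # assumption: pad odd-length input with a zero byte
    total = 0xFFFF
    for i in range(0, len(data), 2):
        first, second = data[i], data[i + 1]
        if big_endian:
            total += (first << 8) | second   # first byte is the high byte
        else:
            total += first | (second << 8)   # first byte is the low byte
    return total
```

Indexing a `bytes` object in Python 3 already yields integers, so no `ord()` call is needed.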

Related

How to save a Huffman table in a file In a way that use the least storage?

It's my first question on Stack Overflow. It's long, but I have explained it in detail and I think it's understandable.
I'm writing a Huffman coder in C++ and have saved the characters and codes in a table like this:
Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE
Table (made from the Huffman tree): [image: Table]
Now, I want to save this table to a file in the best way.
I can't save it like this: A1B001C010D001E000
When it is changed to bits: 01000001101000010001010000110100100010000101000101000
Because I can't decode this.
If I save the table in the normal way, every character uses 8 bits to save its code,
while my characters have 1-bit or 3-bit codes (in this case).
This way uses a lot of storage.
My idea is to add a separator character and assign a code to it.
If we add a separator character, build the Huffman tree and write out the codes, we get a table like this:
[image: table2]
Now we can write the codes this way:
A0SepB110SepC100SepD1111sepE1110sep.
binary= 0100000101010100001011010101000011100101010001001111101010001011110101
I decode it in this way:
sep = 101.
Read 8 bit : 01000001 -> it's A.
rest = 01010100001011010101000011100101010001001111101010001011110101.
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1), Read 1 bit : 0 (like sep2), Read 1 bit : 1 (like sep3(end))
Sep was found, so A = everything before sep = 0;
rest = 0100001011010101000011100101010001001111101010001011110101.
Read 8 bit : 01000010 -> it's B.
rest = 11010101000011100101010001001111101010001011110101.
Read 1 bit : 1 (like sep1)- Read 1 bit : 1 (unlike sep2)
Read 1 bit : 0 (unlike sep1)
Read 1 bit : 1 (like sep1) - Read 1 bit : 0 (like sep2) - Read 1 bit :1 (like sep3(end))
Sep was found, so B = everything before sep = 110;
And so on ...
This way still uses a little storage for the separators (number of characters * separator size).
My question: is there a way to save the first table in a file and use less storage?
For example like this: A1B001C010D001E000.

Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
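To illustrate why the lengths are enough, here is a minimal Python sketch of canonical code assignment. The function name `canonical_codes` and the tie-breaking rule (sort by length, then by symbol) are assumptions chosen for the example; any rule works as long as the encoder and the decoder use the same one.

```python
def canonical_codes(lengths):
    """Map symbol -> bit string, given a dict of symbol -> code length."""
    # Sort by (length, symbol); codes are then handed out in increasing order.
    ordered = sorted(lengths.items(), key=lambda item: (item[1], item[0]))
    codes = {}
    code = 0
    prev_len = ordered[0][1] if ordered else 0
    for symbol, length in ordered:
        code <<= length - prev_len  # append zeros when the length grows
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Code lengths from the question's first table: A has 1 bit, the rest 3 bits.
table = canonical_codes({"A": 1, "B": 3, "C": 3, "D": 3, "E": 3})
```

The resulting bit patterns may differ from the ones the original tree produced, but the lengths (and therefore the compressed size) are the same, and the decoder can rebuild the identical table from the lengths alone.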

You can store the lengths of the codes (as Mark said) as a 256-byte header at the start of your compressed data. Each byte stores the length of one code: because you're working with bytes, there are 256 possible values, and the Huffman tree can only reach a certain depth (number of possible values - 1 = 255), so 8 bits are enough to store each code length.
The first byte would store the code length of the value 0x00, the second byte stores the code length of 0x01, and so on and so forth.
However, if compressing English text, there is a better way to store the table.
Store the shape of the tree, 0s for nodes and 1s for leaves. Then, after you store the nodes and the leaves, you store the values of the leaves.
The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:
        *
       / \
      *   A
     / \
    *   *
   / \ / \
  E  D C  B
So you would store the shape of the tree as such:
000110111EDCBA
The reason why storing the Huffman codes this way is better when you are compressing English text is that storing the shape of the tree costs 10n - 1 bits, where n is the number of unique characters in the data you are trying to compress (2n - 1 bits for the shape plus 8n bits for the leaf values), while storing the code lengths costs a flat 2048 bits. Therefore, for fewer than 205 unique characters, storing the shape of the tree is more efficient, and because the average English text isn't going to use all that many of the 256 possible byte values, you're usually better off storing the tree shape.
If you aren't just compressing text, and you're compressing more general data where there is a high likelihood that the number of unique characters could be greater than or equal to 205, you should probably use the code length storing format, or include 1 bit at the start of your header that says whether there's going to be a tree or a bunch of code lengths, and then write your decoder to decode either one depending on what that bit is set to.
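The tree-shape format can be sketched in a few lines of Python. This is only an illustration: representing the tree as nested tuples with one-character strings as leaves, and the names `serialize` and `deserialize`, are assumptions, and the bits are kept as a text string for readability rather than packed into bytes.

```python
def serialize(tree):
    """Return (shape_bits, leaf_values) for a tree walked in pre-order."""
    if isinstance(tree, str):  # leaf
        return "1", tree
    left_bits, left_vals = serialize(tree[0])
    right_bits, right_vals = serialize(tree[1])
    return "0" + left_bits + right_bits, left_vals + right_vals


def deserialize(bits, values):
    """Rebuild the nested-tuple tree from shape bits and leaf values."""
    bit_iter, val_iter = iter(bits), iter(values)

    def build():
        if next(bit_iter) == "1":
            return next(val_iter)  # leaf: take the next stored value
        left = build()
        right = build()
        return (left, right)

    return build()


# The tree drawn above: left child first, then right child.
tree = ((("E", "D"), ("C", "B")), "A")
shape, leaves = serialize(tree)
```

For this tree, `shape + leaves` gives the same `000110111EDCBA` string shown in the answer.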

JPEG Huffman Table

I have a question regarding the JPEG Huffman Table and using the Huffman Table to construct the symbol/binary string from a Tree. Suppose, that in an Huffman Table for 3-Bit code Length the number of codes is greater than 6, then how do we add all those codes in the Tree? If I am correct only 6 codes can be added at the 3-bit level/depth of the tree. So, how do we add the remaining codes if they won't fit in that level? Do we just ignore them?
Example
code length | Total Codes | Codes
3-Bit | 10 | 25 43 34 53 92 A2 B2 63 73 C2
In the above example if we go by order of constructing symbols/binary string for the code then up 'til A2 we can add codes in the tree at level 3-Bit, but what about B2,63,73,C2 etc? It's not possible to add them at 3-Bit level of the tree? So what do we do with them?

Well, clearly, the absolutely highest number of "things" that can be represented in 3 bits is 8 - (000, 001, 010, 011, 100, 101, 110, 111).
In Huffman encoding, bits represent "left" or "right" in a trie data-structure, to be able to "continue", you have to use SOME codes for "this continues another level", which is why not all 8 values can be encoded in 3 bits. If you have more values to encode, you need to use more bits (for some values - this is the whole point of Huffman coding, that SOME combinations are short, others are longer, and sometimes even longer than the original, but because it's based on what is the most common, it's fine, because they will be rare...)
How to construct and decode a Huffman tree is about four-five pages in your typical Algorithms book, and if you haven't got one of those, you probably want to find one - either a real paper one, or an e-book. There are LOTS of them - I'm not going to recommend one, since the ones I have are all about 15+ years old.
I should add that I think your question is missing something. Clearly, 3 bits cannot possibly represent 10 values. And you can't build a [meaningful] Huffman tree on 10 values that are all different - unless the idea is to split the values into pairs of {2,5}, {4,3}, {3,4}, {5,3}, {9,2}, {A,2}, {B,2}, {6,3}, {7,3}, {C,2} - which gives a fair number of repeated values. The frequencies of those are:
2 : 5
3 : 5
4 : 2
5 : 2
6 : 1
7 : 1
9 : 1
A : 1
B : 1
C : 1
But that's still too many to represent anything meaningful...
Or is it the other way around, that we are supposed to use the bit values of those to decode? In which case we'd need the tree built from the original data to decode it...

In JPEG, a Huffman code can be up to 16 bits long. The DHT marker contains an array of 16 elements giving the number of codes for each length.
The JPEG standard explains how to use the code counts to do the Huffman translation. It is one of the few things explained in detail.
This book explains how it is done from a programmer's perspective.
JPEG Book
The number of codes that exists at any code length depends upon the counts for other lengths.
I am wondering if you are really looking at the count of codes for length 4 rather than 3.

It looks like you're not following the correct procedure when creating your Huffman codes from the JPEG table. The count provided will fit in the number of bits unless the table has been corrupted. The reading out of the codes from a DHT marker is really simple. The more complicated part is how you define your lookup table from that data. A logical (but not practical) way is to create a reverse lookup table that's the maximum code length in size (16-bits = 65536 entries in the table). Then to decode your JPEG data, just pick up 16-bits of compressed data from the input stream and use it as an index in the table where you'll have the symbol and actual length of the code. I came up with a way to use a single, much smaller lookup table. I'm not going to share my specific code table method. What I will share is the basic format of the loop to create the codes from a DHT marker:
int iCurrentCode; // the current Huffman code
int iLength; // the code length in bits that you're working on
int i;
int iCount; // the number of codes defined for this length
int iSymbol; // JPEG symbol defined for each Huffman code
unsigned char *pData; // pointer to the 16 count bytes in the DHT marker
unsigned char *pSymbols; // pointer to the symbol values that follow the counts

// In a DHT segment the 16 counts come first and all the symbols follow them,
// so the symbols are read through a second pointer.
pSymbols = pData + 16;
iCurrentCode = 0; // start with a Huffman code of 0
for (iLength = 1; iLength <= 16; iLength++)
{
    iCount = *pData++; // get number of symbols for this bit length
    for (i=0; i<iCount; i++) // read each of the codes for this bit length
    {
        iSymbol = *pSymbols++; // get the JPEG symbol value (e.g. RRRR/SSSS value)
        // It's up to you to create a lookup table from the code and its value
        iCurrentCode++; // the Huffman bit pattern just increments for each code value
    } // for each code defined at this bit length
    iCurrentCode <<= 1; // shift the code left 1 bit to advance to the next bit length
} // for each bit length
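The same loop can be sketched in Python to make the resulting bit patterns visible. The function name `dht_codes` and its two-list interface are assumptions for illustration; the example input uses the 16 counts of the widely used standard luminance DC table (symbols 0 to 11). The check inside the loop shows why a table claiming 10 codes of length 3 cannot be valid: the code value would no longer fit in 3 bits.

```python
def dht_codes(counts, symbols):
    """counts: 16 ints (codes per length 1..16); symbols: values in table order."""
    codes = {}
    code = 0
    index = 0
    for length in range(1, 17):
        for _ in range(counts[length - 1]):
            if code >= (1 << length):
                # More codes than fit in this many bits: the table is corrupt
                # or is being read from the wrong offset.
                raise ValueError("too many codes of length %d" % length)
            codes[symbols[index]] = format(code, "0{}b".format(length))
            code += 1
            index += 1
        code <<= 1  # move on to the next bit length
    return codes


dc_counts = [0, 1, 5, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
dc_codes = dht_codes(dc_counts, list(range(12)))
```

Here five codes exist at length 3 because one 2-bit code was already handed out; how many codes fit at a given length always depends on the counts for the shorter lengths.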

UDF decimal to binary

I wrote a decimal-to-binary converter function in order to practice my manipulation of number systems and arrays. I took the int, converted it to binary and stored each character, or so I believe, in an array, then displayed it to the screen. However, it is displaying characters I do not know; I looked them up in the ASCII table and do not recognize them. I would like to ask for your assistance. Here is a picture of the code and the console app.
Thanks in advance.

You likely want to insert digit chars (such as '1') in your result, but you assign the numeric value instead. Try adding the value of '0' to get a readable result (remainder + '0').
If you interpret the result array as a string (that's what I suggested), you should also set the last char to the value 0 (not '0'!) to mark the end of the C string.

Your output function does not print your binary text correctly because:
1) cout outputs characters until '\0', so your function will only output correctly until it reaches the first 0 in the binary representation of the int (for example, for 5 = 101 it will output only one smiley with code 0x01).
2) the last character in your array is not '\0', so cout will output garbage until it hits a '\0' or a memory access exception.
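The "remainder + '0'" fix is easy to see in a small Python sketch. The question's own code is C++ and was only posted as an image, so this is an illustration of the technique rather than a corrected version of it; the name `to_binary_text` is an assumption.

```python
def to_binary_text(n):
    """Return the binary digits of a non-negative int as text."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        remainder = n % 2
        # Adding the character code of '0' turns the value 0 or 1 into the
        # printable character '0' or '1' instead of a control character.
        digits.append(chr(remainder + ord("0")))
        n //= 2
    # Remainders come out least significant bit first, so reverse them.
    return "".join(reversed(digits))
```

Python strings carry their own length, so the terminating 0 byte that a C string needs has no counterpart here.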

How to modify a value in textual data file using C

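/* Loop on fscanf's return value instead of feof(): feof() only becomes true
   after a read has already failed, which would count one bogus record. */
while(fscanf(fp,"%d %s %d %d",&res[i].id,res[i].title,&res[i].price,&res[i].qty) == 4)
{
    i++;
}
while(j<i)
{
    printf("\nID:|%d|\tNAME:|%s|\tPRICE:|%d|\tQTY:|%d|",res[j].id,res[j].title,res[j].price,res[j].qty);
    j++;
}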
I have this piece of code, which collects data from the file. Now, if I get an input from the user (like res[id]) and want to decrease the quantity for that particular id, how do I do that?

If the file is in binary format it is easier to do what you want.
What is the difference between the text and the binary format? If the file is written in binary format, then a 32-bit integer will be represented as 32 consecutive bits in the file. While in text format the number will be represented as sequence of digits for instance 32.
So what's the big deal in that difference? Well, if you replace 32 with 1243, in binary format the number will still take the same 32 bits, so nothing else needs to be moved; all you change is those 4 bytes. In text format you add 2 more digits, which causes all the subsequent contents of the file to shift by two bytes.
In order to shift everything as needed, you will need to read the current contents of the file change the value and then write the contents back. I mean all the contents following the change you are doing.
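The read, change, write-back approach for a text file can be sketched as follows, in Python for brevity. The function name `decrease_qty`, the one-record-per-line layout (`id title price qty`, with no spaces in the title) and the clamping of the quantity at zero are assumptions; the question does not show the file itself.

```python
def decrease_qty(path, item_id, amount):
    """Rewrite the file with the quantity of item_id reduced by amount."""
    with open(path) as f:
        records = [line.split() for line in f if line.strip()]
    found = False
    for rec in records:
        # assumed field order per line: id, title, price, qty
        if int(rec[0]) == item_id:
            rec[3] = str(max(0, int(rec[3]) - amount))  # never go below zero
            found = True
    # The new number may have a different width, so the whole file is rewritten.
    with open(path, "w") as f:
        for rec in records:
            f.write(" ".join(rec) + "\n")
    return found
```

In the C program the equivalent is to update `res[k].qty` in the array that was already loaded and then write every record back with `fprintf`.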

Compression algorithms for numbers only

I need to compress location data (latitude, longitude, date, time). All the numbers are in a fixed format. Two of them (latitude, longitude) are decimals; the other two are integers.
Right now these numbers are stored as fixed-format strings.
What are the algorithms for compressing numbers in a fixed format?
Is number-only compression (if there is any) better than string compression?
Should I compress the string directly, or convert it to numbers first and then compress?
Thanks in advance.

This is one of these places where a little theory is helpful. You need to think about several things:
what is the resolution of your measurements: 0.1° or 0.001°? 1 second or one microsecond?
are the measurements associated and in some order, or tossed together randomly?
Let's say, just for example, that the resolution is 0.01°. Then you know that your longitude values range from -180° to +180°, or about 36000 different values. Lg(36000) ≈ 15.1, so you need 16 bits; latitude, from -90° to +90°, has about 18000 values and needs 15 bits. Clearly, if you're storing this kind of value as 32-bit floating-point, you can compress the data by half immediately.
Similarly with date time, what's the range; how many bits must you have?
Now, if the data is in some order (like samples taken sequentially aboard a single ship), then all you need is a start value and a delta; that can make a big difference. With a ship traveling at 30 knots, the position can't change any more than about 0.5 degrees an hour (taking one nautical mile as one minute of arc), or about 0.00014 degrees a second. Those deltas are going to be very small values, so you can store them in very few bits.
The point is that there are a number of things you can do, but you have to know more about the data than we do to make a recommendation.
Update: Oh, wait, fixed point strings?!
Okay, this is (relatively) easy. Just to start with, yes, you want to convert your strings into some binary representation. Just making up a data item, you might have
040.00105.0020090518212100Z
which you could convert to
| 4000 | short int, 16 bits |
| 10500 | short int, 16 bits |
| 20090518212100Z | 64 bits |
So that's 96 bits, 12 bytes versus 27 bytes.
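A minimal Python sketch of that conversion, using the standard `struct` module. The record layout (6 characters of latitude, 6 of longitude, 14 timestamp digits and a trailing `Z`) is read off the made-up example above, and the names `pack_fix` and `unpack_fix` are assumptions.

```python
import struct


def pack_fix(record):
    """Pack a 'LLL.LLOOO.OOYYYYMMDDhhmmssZ' string into 12 bytes."""
    lat = int(round(float(record[0:6]) * 100))   # hundredths of a degree
    lon = int(round(float(record[6:12]) * 100))  # hundredths of a degree
    stamp = int(record[12:26])                   # YYYYMMDDhhmmss as one integer
    # "<hhq": little-endian, two signed 16-bit ints and one signed 64-bit int
    return struct.pack("<hhq", lat, lon, stamp)


def unpack_fix(blob):
    """Return (latitude, longitude, timestamp integer) from 12 packed bytes."""
    lat, lon, stamp = struct.unpack("<hhq", blob)
    return lat / 100.0, lon / 100.0, stamp


blob = pack_fix("040.00105.0020090518212100Z")
```

Storing the timestamp as seconds since some epoch instead would fit in 32 bits and shrink the record further; that is a design choice the example leaves open.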

Compression typically works on a byte stream. When a stream has a non-uniform distribution of byte values (for instance text, or numbers stored as text), the compression ratio you can achieve will be higher, since fewer bits are used to store the bytes which appear more frequently (in Huffman compression).
Typically, the data you are talking about will simply be stored as binary numbers (not text), and that's usually space and retrieval efficient.
I recommend you have a look at The Data Compression Book

What kind of data are you compressing? How is it distributed? Is it ordered in any way? All of these things can affect how well it compresses, and perhaps allow you to convert the data in to something more easily compressed, or simply smaller right out the gate.
Data compression works poorly on "random" data. If your data is within a smaller range, you may well be able to leverage that.
In truth, you should simply try running any of the common algorithms and see if the data is "compressed enough". If not, and you know more about the data than can be "intuited" by the compression algorithms, you should leverage that information.
An example is say that your data is not just Lat's and Long's, but they're assumed to be "close" to each other. Then you could probably store an "origin" Lat and Long, and the rest can be differential. Perhaps these differences are small enough to fit in to a single, signed byte.
That's just a simple example of things you can do with knowledge of the data vs what some generic algorithm may not be able to figure out.
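The origin-plus-differences idea can be sketched like this. The function names `delta_encode` and `delta_decode` and the sample values (latitudes in hundredths of a degree) are assumptions for illustration only.

```python
def delta_encode(values):
    """Return [first, d1, d2, ...], each d being the change from the previous value."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by keeping a running sum of the deltas."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


samples = [4000, 4002, 4001, 4005]
encoded = delta_encode(samples)
```

Whether every delta really fits in a signed byte depends on the data, so a real encoder needs a check (or an escape value) for larger jumps.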

It depends on what you are going to do with the data, and how much precision you need.
Lat/long is traditionally given in degrees, minutes, and seconds, with 60 seconds to the minute, 60 minutes to the degree,and 1 degree of latitude nominally equal to 60 nautical miles (nmi). 1 minute is then 1 nmi, and 1 second is just over 100 ft.
Latitude goes from -90 to +90 degrees. Representing latitude as integer seconds gives you a range of -324000..+324000, or about 20 bits. Longitude goes -180 to +180, so representing longitude the same way requires 1 more bit.
So you can represent a complete lat/long position, to +/- 50 ft, in 41 bits.
Obviously, if you don't need that much precision, you can back down your bit count.
Observe that a traditional single-precision 32-bit float uses about 24 bits of mantissa, so you are down to about +/- 6 feet if you just convert your lat/long in seconds to float. It is kind of hard to beat two single-precision floats for this kind of thing.
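The 41-bit layout can be sketched with plain integer arithmetic. The names `pack_position` and `unpack_position`, and the choice to put latitude in the upper 20 bits, are assumptions; the answer only gives the bit counts.

```python
def pack_position(lat_seconds, lon_seconds):
    """Pack integer arc-seconds into one 41-bit integer (20 bits lat, 21 bits lon)."""
    lat_field = lat_seconds + 324000  # shift -324000..324000 to 0..648000
    lon_field = lon_seconds + 648000  # shift -648000..648000 to 0..1296000
    return (lat_field << 21) | lon_field


def unpack_position(packed):
    """Return (lat_seconds, lon_seconds) from a value made by pack_position."""
    lon_field = packed & ((1 << 21) - 1)  # low 21 bits
    lat_field = packed >> 21              # remaining high bits
    return lat_field - 324000, lon_field - 648000
```

Adding the offsets keeps both fields non-negative, which avoids having to deal with sign bits inside the packed value.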

Depending on the available characters, you could make something quite easily.
For example, if the input is only digits (0..9), here's a solution that will encode and decode them, in Kotlin (similar thing on Java) :
fun encodeDigitsOnlyString(stringWithDigitsOnly: String): ByteArray {
    //we couple each 2 digits together into a single byte.
    //For the last digit, if it has no digit to pair with, it's paired with something that's not a digit
    val result = ArrayList<Byte>()
    val length = stringWithDigitsOnly.length
    var lastDigit: Byte? = null
    for (i in 0 until length) {
        val char = stringWithDigitsOnly[i]
        val digitAsByte = char.toString().toInt().toByte()
        if (lastDigit == null) {
            if (i == length - 1) {
                //last digit
                val newByte = (digitAsByte + 0xf0).toByte()
                result.add(newByte)
            } else {
                //more to go
                lastDigit = digitAsByte
            }
        } else {
            val newByte = (digitAsByte + lastDigit.toInt().shl(4)).toByte()
            result.add(newByte)
            lastDigit = null
        }
    }
    return result.toByteArray()
}

fun decodeByteArrayToDigitsOnlyString(encodedDigitsOnlyByteArray: ByteArray): String {
    val sb = StringBuilder(encodedDigitsOnlyByteArray.size * 2)
    for (byte in encodedDigitsOnlyByteArray) {
        val hex = Integer.toHexString(byte.toInt()).takeLast(2).padStart(2, '0')
        if (hex[0].isLetter())
            sb.append(hex.last())
        else
            sb.append(hex)
    }
    return sb.toString()
}
Example usage:
val inputString="12345"
val byteArray=encodeDigitsOnlyString(inputString) //produces a byte array of size 3
val outputString=decodeByteArrayToDigitsOnlyString(byteArray) //should be the same as the input
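For readers who do not use Kotlin, the same two-digits-per-byte packing can be sketched in Python. The names `encode_digits` and `decode_digits` are assumptions; the byte layout (first digit in the high nibble, a lone final digit stored as 0xF0 plus the digit) follows the Kotlin code above.

```python
def encode_digits(text):
    """Pack a digits-only string into bytes, two digits per byte."""
    out = bytearray()
    for i in range(0, len(text), 2):
        high = int(text[i])
        if i + 1 < len(text):
            out.append((high << 4) | int(text[i + 1]))
        else:
            out.append(0xF0 | high)  # unpaired last digit, marked by a high nibble of F
    return bytes(out)


def decode_digits(blob):
    """Reverse encode_digits."""
    chars = []
    for byte in blob:
        high, low = byte >> 4, byte & 0x0F
        if high == 0x0F:
            chars.append(str(low))  # lone final digit
        else:
            chars.append(str(high) + str(low))
    return "".join(chars)
```

This halves the size of digit-only text; it is a fixed 4-bits-per-digit packing, so it does not exploit any patterns in the data the way a general compressor would.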
