I implemented an LZW compressor which encodes strings into integers with the help of a hash function. I stored the coded string in a text file. Now I need to decompress the same code, but I am confused about how to differentiate between a two-digit integer and a single-digit integer while reading from the text file.
For example, my dictionary is:
0 c
1 bba
3 aa
5 ac
7 bb
8 aab
9 a
10 b
and so on.
Now, suppose I encoded the string 'aaabbbac' into "9 3 10 7 9 0", which gets stored in the text file as 9310790. How do I differentiate between 0, 1 and 10 while reading from the file?
Some options:
Store them in binary format rather than text format. That might be a bit more challenging to read and write, but it is probably worth the learning (a sketch appears after this list). The drawback is that you can no longer inspect the numbers in a text editor, although you can find tools to visualize binary files. Assuming 2 bytes per integer (type short), your example would be, in hex (not considering endianness): 00 09 00 03 00 0a 00 07 00 09 00 00
Store them with fixed length per number. Example: printf("%03d", number) will always create numbers with 3 digits. Your example would be: 009003010007009000
Use a comma or semi-colon separator: 9,3,10,7,9,0
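For the binary option, here is a minimal C++ sketch (the file name codes.bin and the 16-bit code width are assumptions) that writes the example codes and reads them back:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::vector<std::uint16_t> codes = {9, 3, 10, 7, 9, 0}; // the example output "9 3 10 7 9 0"

    // Write each code as a fixed-size 16-bit value.
    std::ofstream out("codes.bin", std::ios::binary);
    for (std::uint16_t c : codes)
        out.write(reinterpret_cast<const char*>(&c), sizeof c);
    out.close();

    // Read the codes back; every two bytes is exactly one code.
    std::ifstream in("codes.bin", std::ios::binary);
    std::uint16_t c;
    while (in.read(reinterpret_cast<char*>(&c), sizeof c))
        std::cout << c << ' ';
    std::cout << '\n'; // prints: 9 3 10 7 9 0
    return 0;
}

Because every code occupies exactly two bytes, there is no ambiguity between 0, 1 and 10; the fixed-width and separator options solve the same problem, with the field width or the delimiter marking the code boundary instead.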
Related
I just started to learn WebAssembly. I found this text:
"In binary format The first four bytes represent the Wasm binary magic
number \0asm; the next four bytes represent the Wasm binary version in
a 32-bit format"
I am not able to understand this. Can anyone explain it to me?
\0 is a character with code 0 (the first 00 in 0061736d); the remaining three are the literal characters a, s and m, with codes 97, 115 and 109 respectively, or 61, 73 and 6d in hex.
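To make the layout concrete, here is a small C++ sketch (the file name module.wasm is an assumption) that reads the first eight bytes of a module and checks the magic number and the little-endian version field:

#include <cstdint>
#include <cstring>
#include <fstream>
#include <iostream>

int main()
{
    std::ifstream f("module.wasm", std::ios::binary);
    unsigned char header[8];
    if (!f.read(reinterpret_cast<char*>(header), sizeof header)) {
        std::cerr << "file too short\n";
        return 1;
    }

    // Bytes 0-3: the magic number "\0asm" = 00 61 73 6d.
    const unsigned char magic[4] = {0x00, 0x61, 0x73, 0x6d};
    if (std::memcmp(header, magic, sizeof magic) != 0) {
        std::cerr << "not a wasm module\n";
        return 1;
    }

    // Bytes 4-7: the version as a 32-bit little-endian integer (currently 01 00 00 00, i.e. 1).
    std::uint32_t version = header[4]
                          | (static_cast<std::uint32_t>(header[5]) << 8)
                          | (static_cast<std::uint32_t>(header[6]) << 16)
                          | (static_cast<std::uint32_t>(header[7]) << 24);
    std::cout << "wasm binary version " << version << '\n';
    return 0;
}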
For one value I am getting two different values when I am casting nvarchar to binary.
For the value A12875 I am getting a result like 0x4131323837350000000000000000000000000000
I get this when I am using a select statement.
If the same statement is executed in a stored procedure, I get a result like this:
0x2000410031003200380000000000000000000000
I am using the same query
select cast('A12875' as binary(20))
What is the difference?
The difference is character encoding. A character encoding is a way to represent characters as bytes.
The characters you have, and their Unicode code points, are:
A code point 41
1 code point 31
2 code point 32
8 code point 38
7 code point 37
5 code point 35
If you use UTF-8 (or Latin-1, or even ASCII) to encode this string, you will get the following, padded on the right with zero bytes to fill the 20-byte field:
41 31 32 38 37 35 00 00 ... 00
But there are other character encodings. It looks like when you run a stored procedure, it is choosing UTF-16LE as the encoding, and that somehow a space character ends up in front. In UTF-16LE the code point 41 is represented as
41 00
because it would normally be 0041 but the bytes are reversed. So you would expect:
41 00 31 00 32 00 38 00 37 00 35 00 ... 00 00
The space character is code point 20 so it is represented as 20 00. I don't know why they put the space up front; it could be a funny way of making a byte order mark, i.e. 2000 for little endian and 0020 for big endian.
At any rate, you should look at the SQL Server documentation to see how character encodings are chosen when characters are converted to bytes. Whenever you convert characters to bytes, you must specify an encoding. There may be a default, but in general characters-to-bytes conversion makes no sense without an encoding. In your scenario, the two different environments used two different defaults.
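Purely as an illustration of the encoding difference (this is not how SQL Server performs the conversion, and it does not reproduce the mysterious leading 20 00), here is a short C++ sketch that builds the two byte patterns for the same six characters:

#include <cstdio>
#include <string>
#include <vector>

// Pad a byte sequence with zero bytes on the right, mimicking binary(20).
static std::vector<unsigned char> pad20(std::vector<unsigned char> bytes)
{
    bytes.resize(20, 0x00);
    return bytes;
}

static void dump(const std::vector<unsigned char>& bytes)
{
    for (unsigned char b : bytes)
        std::printf("%02x ", b);
    std::printf("\n");
}

int main()
{
    std::string text = "A12875";

    // Single-byte encoding: ASCII, Latin-1 and UTF-8 all agree for these characters.
    std::vector<unsigned char> narrow(text.begin(), text.end());

    // UTF-16LE: each code point becomes its low byte followed by its high byte.
    std::vector<unsigned char> utf16le;
    for (char ch : text) {
        utf16le.push_back(static_cast<unsigned char>(ch)); // low byte
        utf16le.push_back(0x00);                           // high byte (0 for these characters)
    }

    dump(pad20(narrow));  // 41 31 32 38 37 35 00 00 ...
    dump(pad20(utf16le)); // 41 00 31 00 32 00 38 00 37 00 35 00 ...
    return 0;
}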
I need to detect whether a file is an MPEG ADTS file. I've searched around, but either I'm searching badly or something else is wrong, because I can't find a signature I could use to say for sure that a certain file is in MPEG ADTS format.
E.g. we can say for sure that a file is MP4 if it begins with a signature such as 00 00 00 nn 66 74 79 70 6D 70 34.
How can it be done with MPEG ADTS?
Thanks in advance for any help!
The ADTS header is typically used for standalone AAC and in MPEG-TS files (the streaming scenario).
ADIF is used mainly in MP4 files.
An ADTS file header starts with a 12-bit "sync word", which is always 111111111111;
the next 1 bit is the ID;
the next 2 bits are always 0.
http://developer.longtailvideo.com/trac/browser/providers/adaptive/doc/adts.pdf?rev=1460 (describes the full header)
So your algorithm to detect it would be:
search for the 12-bit sync word,
then validate that the next fields contain valid values.
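A minimal C++ sketch of that detection algorithm (the file name input.aac and the helper name find_adts_header are mine; only the sync word and the two layer bits are checked):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Return the offset of the first plausible ADTS header, or -1 if none is found.
static long find_adts_header(const std::vector<unsigned char>& buf)
{
    for (std::size_t i = 0; i + 1 < buf.size(); ++i) {
        // 12-bit sync word: 0xFF followed by a byte whose top four bits are all 1.
        if (buf[i] == 0xFF && (buf[i + 1] & 0xF0) == 0xF0) {
            // Next fields: 1 ID bit, then 2 layer bits which must be 00.
            if ((buf[i + 1] & 0x06) == 0x00)
                return static_cast<long>(i);
        }
    }
    return -1;
}

int main()
{
    std::ifstream f("input.aac", std::ios::binary);
    std::vector<unsigned char> buf((std::istreambuf_iterator<char>(f)),
                                   std::istreambuf_iterator<char>());
    long off = find_adts_header(buf);
    if (off >= 0)
        std::cout << "possible ADTS header at offset " << off << '\n';
    else
        std::cout << "no ADTS sync word found\n";
    return 0;
}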
I have a text file with two non-ASCII bytes (0xFF and 0xFE):
??58832520.3,ABC
348384,DEF
The hex for this file is:
FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).
I am trying to strip these bytes out with sed, but nothing I do seems to match them.
$ sed 's/[^a-zA-Z0-9\,]//g' test.csv
??588325203,ABC
348384,DEF
$ sed 's/[a-zA-Z0-9\,]//g' test.csv
??.
Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations, so one of them logically has to filter out these bytes, right? Why do both of these regexes fail to match the 0xFF and 0xFE bytes?
Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:
$sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF
FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A
Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.
Bigger Update: This problem seems to be system-specific. The problem was observed on OSX, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.
A solution: This same task seems easy enough via Perl:
$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF
However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.
sed 's/[^ -~]//g'
or as the other answer implies
sed 's/[\x80-\xff]//g'
See section 3.9 of the sed info pages, the chapter entitled "Escapes".
Edit: on OS X, the native LANG setting is en_US.UTF-8.
try
LANG='' sed 's/[^ -~]//g' myfile
This works on an OS X machine here; I'm not entirely sure why it does not work when the locale is UTF-8.
This will strip out the specific byte sequence FF FE from all lines:
sed -e 's/\xff\xfe//g' hexquestion.txt
The reason that your negated regexes aren't working is that the [] specifies a character class. sed is assuming a particular character set, probably ASCII. The characters in your file aren't 7-bit ASCII characters, as they both begin with F in hex (i.e. their values are above 0x7F), and sed doesn't know how to deal with them. The solution above doesn't use character classes, so it should be more portable between platforms and character sets.
The FF and FE bytes at the beginning of your file are what is called a "byte order mark" (BOM). It can appear at the start of Unicode text streams to indicate the endianness of the text. FF FE indicates UTF-16 in little-endian byte order.
Here's an excerpt from the FAQ:
Q: How I should deal with BOMs?
A: Here are some guidelines to follow:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases,
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
References
unicode.org/FAQ/UTF BOM
See also
Wikipedia/Byte order mark
Wikipedia/Endianness
Related questions
Why would I use a Unicode Signature Byte-Order-Mark (BOM)?
Difference between Big Endian and little Endian Byte order
To show that this isn't an issue of the Unicode BOM, but an issue of eight-bit versus seven-bit characters and tied to the locale, try this:
Show all the bytes:
$ printf '123 abc\xff\xfe\x7f\x80' | hexdump -C
00000000 31 32 33 20 61 62 63 ff fe 7f 80 |123 abc....|
Have sed remove characters that aren't alpha-numeric in the user's locale. Notice that the space and 0x7f are removed:
$ printf '123 abc\xff\xfe\x7f\x80'|sed 's/[^[:alnum:]]//g' | hexdump -C
00000000 31 32 33 61 62 63 ff fe 80 |123abc...|
Have sed remove characters that aren't alpha-numeric in the C locale. Notice that only "123abc" remains:
$ printf '123 abc\xff\xfe\x7f\x80'|LANG=C sed 's/[^[:alnum:]]//g' | hexdump -C
00000000 31 32 33 61 62 63 |123abc|
On OS X, the Byte Order Mark is probably being read as a single word. Try either sed 's/^\xfffe//g' or sed 's/^\xfeff//g' depending on endianness.
You can match the hex codes with \xff\xfe and replace them with nothing.
As an alternative you may use ed(1):
printf '%s\n' H $'g/[\xff\xfe]/s///g' ',p' | ed -s test.csv
printf '%s\n' H $'g/[\xff\xfe]/s///g' wq | ed -s test.csv # in-place edit
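If you would rather avoid sed entirely, here is a small C++ sketch (the file name test.csv is from the question) that mirrors the Perl workaround by dropping a leading FF FE pair from each line:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream in("test.csv", std::ios::binary);
    std::string line;
    while (std::getline(in, line)) {
        // Drop a leading FF FE pair (the UTF-16LE byte order mark), like the Perl one-liner.
        if (line.size() >= 2 &&
            static_cast<unsigned char>(line[0]) == 0xFF &&
            static_cast<unsigned char>(line[1]) == 0xFE)
            line.erase(0, 2);
        std::cout << line << '\n';
    }
    return 0;
}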
Working with exclusive-OR on bits is something which is clear to me. But here, XOR is working on individual characters. So does this mean the byte which makes up the character is being XORed? What does this look like?
#include <iostream>
using namespace std;
int main()
{
char string[11]="A nice cat";
char key[11]="ABCDEFGHIJ";
for(int x=0; x<10; x++)
{
string[x]=string[x]^key[x];
cout<<string[x];
}
return 0;
}
I know bits XORed look like this:
1010
1100
0110
XOR has the nice property that if you XOR something twice using the same data, you obtain the original. The code you posted is some rudimentary encryption function, which "encrypts" a string using a key. The resulting ciphertext can be fed through the same program to decrypt it.
In C and C++ strings are usually stored in memory as 8-bit char values where the value stored is the ASCII value of the character.
Your code is therefore XORing the ASCII values. For example, the second character in your output is calculated as follows:
'B' ^ ' '
= 66 ^ 32
= 01000010 ^ 00100000
= 01100010
= 98
= 'b'
You could get a different result if you ran this code on a system which uses EBCDIC instead of ASCII.
The xor on characters performs the xor operation on each corresponding bit of the two characters (one byte each).
So does this mean the byte which makes up the character is being XORed?
Exactly.
What does this look like?
Like any other XOR :). In ASCII, "A nice cat" is (in hexadecimal)
41 20 6E 69 63 65 20 63 61 74
and ABCDEFGHIJ
41 42 43 44 45 46 47 48 49 4A
so, if you XOR each pair of bytes, you get
00 62 2D 2D 26 23 67 2B 28 3E
which is the hexadecimal representation of "\0b--&#g+(>", i.e. the string that is displayed when you run that code.
Notice that if you XOR the resulting text again, you get back the text you started with; this is the reason why XOR is often used in encoding and ciphering.
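A small C++ sketch of that round trip (the helper name xor_with_key is mine), using the same string and key as the question:

#include <cstddef>
#include <iostream>
#include <string>

// XOR every character of the text with the corresponding key character.
// Applying it twice with the same key gives back the original text.
static std::string xor_with_key(std::string text, const std::string& key)
{
    for (std::size_t i = 0; i < text.size(); ++i)
        text[i] ^= key[i % key.size()];
    return text;
}

int main()
{
    std::string plain = "A nice cat";
    std::string key   = "ABCDEFGHIJ";

    std::string cipher  = xor_with_key(plain, key);  // "\0b--&#g+(>"
    std::string decoded = xor_with_key(cipher, key); // back to "A nice cat"

    std::cout << decoded << '\n';
    return 0;
}

Running it prints the original "A nice cat", because (c ^ k) ^ k == c for any byte values.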
This is a simple demonstration of one-time pad encryption, which as you can see is quite simple and also happens to be the only provably unbreakable form of encryption. Because it is symmetric and requires a key as large as the message, it's often not practical, but it still has a number of interesting applications. :-)
One fun thing to notice, if you're not already familiar with it, is the symmetry between the key and the ciphertext. After generating them, there's no distinction of which one is which, i.e. which one was created first and which was produced by XORing the plaintext with the other. Aside from basic encryption, this also leads to applications in plausible deniability.