I have a text file with two non-ASCII bytes (0xFF and 0xFE):
??58832520.3,ABC
348384,DEF
The hex for this file is:
FF FE 35 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 33 34 38 33 38 34 2C 44 45 46
It's coincidental that FF and FE happen to be the leading bytes (they exist throughout my file, although seemingly always at the beginning of a line).
I am trying to strip these bytes out with sed, but nothing I do seems to match them.
$ sed 's/[^a-zA-Z0-9\,]//g' test.csv
??588325203,ABC
348384,DEF
$ sed 's/[a-zA-Z0-9\,]//g' test.csv
??.
Main question: How do I strip these bytes?
Bonus question: The two regexes above are direct negations of each other, so one of them logically has to filter out these bytes, right? Why do both of them leave the 0xFF and 0xFE bytes untouched?
Update: the direct approach of stripping out a range of hex bytes (suggested by two answers below) seems to strip out the first "legit" byte from each line and leave the bytes I'm trying to get rid of:
$ sed 's/[\x80-\xff]//' test.csv
??8832520.3,ABC
48384,DEF
The hex of the resulting output is:
FF FE 38 38 33 32 35 32 30 2E 33 2C 41 42 43 0A 34 38 33 38 34 2C 44 45 46 0A
Notice the missing "5" and "3" from the beginning of each line, and the new 0A added to the end of the file.
Bigger Update: This problem seems to be system-specific. The problem was observed on OS X, but the suggestions (including my original sed statement above) work as I expect them to on NetBSD.
A solution: This same task seems easy enough via Perl:
$ perl -pe 's/^\xFF\xFE//' test.csv
58832520.3,ABC
348384,DEF
However, I'll leave this question open since this is only a workaround, and doesn't explain what the problem was with sed.
sed 's/[^ -~]//g'
or as the other answer implies
sed 's/[\x80-\xff]//g'
See section 3.9 of the sed info pages, the chapter entitled "Escapes".
Edit: for OS X, the native LANG setting is en_US.UTF-8.
try
LANG='' sed 's/[^ -~]//g' myfile
This works on an OS X machine here; I'm not entirely sure why it does not work under UTF-8.
This will strip out the specific byte sequence FF FE wherever it occurs:
sed -e 's/\xff\xfe//g' hexquestion.txt
The reason that your negated regexes aren't working is that [] specifies a character class. sed is assuming a particular character set, probably ASCII. The characters in your file aren't 7-bit ASCII characters; their values are above 0x7F (the high bit is set), so sed doesn't know how to deal with them. The solution above doesn't use character classes, so it should be more portable between platforms and character sets.
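If your sed supports the \xHH escapes (GNU sed does; the OS X sed in the question apparently mishandles them), you can verify the bytes are really gone by piping the result through hexdump:
$ sed -e 's/\xff\xfe//g' test.csv | hexdump -C
00000000 35 38 38 33 32 35 32 30 2e 33 2c 41 42 43 0a 33 |58832520.3,ABC.3|
00000010 34 38 33 38 34 2c 44 45 46 |48384,DEF|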
The FF and FE bytes at the beginning of your file are what is called a "byte order mark" (BOM). It can appear at the start of Unicode text streams to indicate the endianness of the text: FF FE indicates UTF-16, little-endian.
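Since a BOM sits at the very start of the file, one locale-independent way to drop it (a sketch, assuming the stray bytes appear only there) is to skip the first two bytes outright:
$ tail -c +3 test.csv
58832520.3,ABC
348384,DEF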
Here's an excerpt from the FAQ:
Q: How should I deal with BOMs?
A: Here are some guidelines to follow:
* A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.
* Some protocols allow optional BOMs in the case of untagged text. In those cases:
  * Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.
  * Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.
* Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.
* Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
References
unicode.org/faq/utf_bom.html
See also
Wikipedia/Byte order mark
Wikipedia/Endianness
Related questions
Why would I use a Unicode Signature Byte-Order-Mark (BOM)?
Difference between Big Endian and little Endian Byte order
To show that this isn't an issue of the Unicode BOM, but an issue of eight-bit versus seven-bit characters and tied to the locale, try this:
Show all the bytes:
$ printf '123 abc\xff\xfe\x7f\x80' | hexdump -C
00000000 31 32 33 20 61 62 63 ff fe 7f 80 |123 abc....|
Have sed remove characters that aren't alpha-numeric in the user's locale. Notice that the space and 0x7f are removed:
$ printf '123 abc\xff\xfe\x7f\x80'|sed 's/[^[:alnum:]]//g' | hexdump -C
00000000 31 32 33 61 62 63 ff fe 80 |123abc...|
Have sed remove characters that aren't alpha-numeric in the C locale. Notice that only "123abc" remains:
$ printf '123 abc\xff\xfe\x7f\x80'|LANG=C sed 's/[^[:alnum:]]//g' | hexdump -C
00000000 31 32 33 61 62 63 |123abc|
On OS X, the Byte Order Mark is probably being read as a single word. Try either sed 's/^\xfffe//g' or sed 's/^\xfeff//g' depending on endianness.
You can match the hex codes with \xff\xfe and replace them with nothing.
As an alternative you may use ed(1):
printf '%s\n' H $'g/[\xff\xfe]/s///g' ',p' | ed -s test.csv
printf '%s\n' H $'g/[\xff\xfe]/s///g' wq | ed -s test.csv # in-place edit
Related
I just started to learn WebAssembly. I found this text:
"In binary format The first four bytes represent the Wasm binary magic
number \0asm; the next four bytes represent the Wasm binary version in
a 32-bit format"
I am not able to understand this. Can anyone explain this to me?
\0 is a character with code 0 (the first 00 in 0061736D); the remaining three are the literal characters a, s and m, with codes 97, 115 and 109 respectively, or 61, 73 and 6d in hex.
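To see those eight bytes for yourself, hex-dump the start of any compiled module (module.wasm here is a placeholder name; the 01 00 00 00 is version 1 as a little-endian 32-bit value):
$ head -c 8 module.wasm | hexdump -C
00000000 00 61 73 6d 01 00 00 00 |.asm....|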
I implemented an LZW compressor which encodes the strings into integers with the help of a hash function. I stored the coded string in a text file. Now I need to decompress the same code. I am confused about how to differentiate between a two-digit integer and a single-digit integer while reading from the text file.
For example, my dictionary is:
0 c
1 bba
3 aa
5 ac
7 bb
8 aab
9 a
10 b
and so on.
Now, suppose I encoded a string 'aaabbbac' into "9 3 10 7 9 0" which gets stored in the text file as 9310790. How do I differentiate between 0, 1 and 10 while reading from a file?
Some options:
Store them in binary format rather than text format. That might be a bit of a challenge to read and write, but it might be worth the learning. The downside is that you can't inspect the numbers with a text editor, though you can find tools to visualize binary files. Assuming 2 bytes per integer (type short), your example would be, in hex (ignoring endianness): 00 09 00 03 00 0a 00 07 00 09 00 00
Store them with fixed length per number. Example: printf("%03d", number) will always create numbers with 3 digits. Your example would be: 009003010007009000 (see the shell sketch after this list).
Use a comma or semi-colon separator: 9,3,10,7,9,0
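A minimal shell sketch of the fixed-width option: printf pads each number to three digits, and fold -w3 splits them back apart on the reading side:
$ printf '%03d' 9 3 10 7 9 0
009003010007009000
$ printf '%03d' 9 3 10 7 9 0 | fold -w3
009
003
010
007
009
000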
For one value I am getting two different values when I am casting nvarchar to binary.
For the value A12875 I am getting a result like 0x4131323837350000000000000000000000000000
I get this when I am using a select statement.
If the same statement is executed in a stored procedure I am getting a result like this
0x2000410031003200380000000000000000000000
I am using the same query
select cast('A12875' as binary(20))
What is the difference?
The difference is character encoding. A character encoding is a way to represent characters as bytes.
The characters you have, and their Unicode code points, are:
A code point 41
1 code point 31
2 code point 32
8 code point 38
7 code point 37
5 code point 35
If you use UTF-8 (or Latin-1 or even ASCII) to encode this string, you will get, padded on the right to a field of 20 bytes:
41 31 32 38 37 35 00 00 ... 00
But there are other character encodings. It looks like when you run a stored procedure, it is choosing UTF-16LE as the encoding, and that somehow a space character ends up in front. In UTF-16LE the code point 41 is represented as
41 00
because it would normally be 0041 but the bytes are reversed. So you would expect:
41 00 31 00 32 00 38 00 37 00 35 00 ... 00 00
The space character is code point 20 so it is represented as 20 00. I don't know why they put the space up front; it could be a funny way of making a byte order mark, i.e. 2000 for little endian and 0020 for big endian.
At any rate, you should look at the SQL Server documentation to see how character encodings are chosen when characters are converted to bytes. Whenever you convert characters to bytes, you must specify an encoding. There may be a default, but in general a characters-to-bytes conversion makes no sense without one. In your scenario, the two different environments used two different defaults.
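You can reproduce the two byte patterns outside SQL Server with iconv (a sketch; the actual default inside SQL Server depends on the data type and collation, not on a locale variable):
$ printf 'A12875' | hexdump -C
00000000 41 31 32 38 37 35 |A12875|
$ printf 'A12875' | iconv -f UTF-8 -t UTF-16LE | hexdump -C
00000000 41 00 31 00 32 00 38 00 37 00 35 00 |A.1.2.8.7.5.|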
Working with exclusive-OR on bits is something which is clear to me. But here, XOR is working on individual characters. So does this mean the byte which makes up the character is being XORed? What does this look like?
#include <iostream>

int main()
{
    char string[11] = "A nice cat";   // 10 characters plus the terminating '\0'
    char key[11]    = "ABCDEFGHIJ";

    for (int x = 0; x < 10; x++)
    {
        string[x] = string[x] ^ key[x];   // XOR the bytes of the two characters
        std::cout << string[x];
    }
    return 0;
}
I know bits XORed look like this:
1010
1100
0110
XOR has the nice property that if you XOR something twice using the same data, you obtain the original. The code you posted is some rudimentary encryption function, which "encrypts" a string using a key. The resulting ciphertext can be fed through the same program to decrypt it.
In C and C++ strings are usually stored in memory as 8-bit char values where the value stored is the ASCII value of the character.
Your code is therefore XORing the ASCII values. For example, the second character in your output is calculated as follows:
'B' ^ ' '
= 66 ^ 32
= 01000010 ^ 00100000
= 01100010
= 98
= 'b'
You could get a different result if you ran this code on a system which uses EBCDIC instead of ASCII.
The xor on characters performs the xor operation on each corresponding bit of the two characters (one byte each).
So does this mean the byte which makes up the character is being XORed?
Exactly.
What does this look like?
As any other XOR :) . In ASCII "A nice cat" is (in hexadecimal)
41 20 6E 69 63 65 20 63 61 74
and ABCDEFGHIJ
41 42 43 44 45 46 47 48 49 4A
so, if you XOR each byte with each other, you get
00 62 2D 2D 26 23 67 2B 28 3E
, which is the hexadecimal representation of "\0b--&#g+(>", i.e. the string that is displayed when you run that code.
Notice that if you XOR the resulting text again, you get back the text you started with; this is the reason why XOR is often used in encoding and ciphers.
This is a simple demonstration of one-time pad encryption, which as you can see is quite simple and also happens to be the only provably unbreakable form of encryption. Because it is symmetric and requires a key as large as the message, it's often not practical, but it still has a number of interesting applications. :-)
One fun thing to notice if you're not already familiar with it is the symmetry between the key and the ciphertext. After generating them, there's no distinction of which one is which, i.e. which one was created first and which was based on the plaintext xor'd with the other. Aside from basic encryption this also leads to applications in plausible deniability.
I am trying to work out the format of a password file which is used by a LOGIN DLL of which the source cannot be found. The admin tool was written in AFX, so I hope that it perhaps gives a clue as to the algorithm used to encode the passwords.
Using the admin tool, we have two passwords that are encoded. The first is "dinosaur123456789" and the hex of the encryption is here:
The resulting hex values for the dinosaur password are
00h: 4A 6E 3C 34 29 32 2E 59 51 6B 2B 4E 4F 20 47 75 ; Jn<4)2.YQk+NO Gu
10h: 6A 33 09 ; j3.
20h: 64 69 6E 6F 73 61 75 72 31 32 33 34 35 36 37 38 ; dinosaur12345678
30h: 39 30 ; 90
Another password "gertcha" is encoded as
e8h: 4D 35 4C 46 53 5C 7E ; GROUT M5LFS\~
I've tried looking for a common XOR, but failed to find anything. The encoded passwords in the password file are the same length as the plaintext, so I assume this is a reversible encoding (it was of another age!). I'm wondering if the AFX classes may have had a means that would be used for this sort of thing?
If anyone can work out the encoding, then that would be great!
Thanks, Matthew
[edit:]
Okay, first, I'm moving on and going to leave the past behind in the new solution. It would have been nice to use the old data still. Indeed, if someone wants to solve it as a puzzle, then I would still like to be able to use it.
For those who want to have a go, I got two passwords done.
All 'a' - a password with 19 a's:
47 7D 47 38 58 57 7C 73 59 2D 50 ; G}G8XW|sY-P
79 68 29 3E 44 52 31 6B 09 ; yh)>DR1k.
All 'b' - a password with 16 b's.
48 7D 2C 71 78 67 4B 46 49 48 5F ; H},qxgKFIH_
69 7D 39 79 5E 09 ; i}9y^.
This convinced me that there is no simple solution involved, and that there is some feedback.
Well, I did a quick cryptanalysis on it, and so far I can tell you that each password appears to start off with its ASCII value minus 26 (e.g. 'd' is 0x64, and the first cipher byte is 0x4A). The next octet seems to be the difference between the first char of the password and the second, added to its ASCII value. The third letter I haven't figured out yet. I think it's safe to say you are dealing with some kind of feedback cipher, which is why XOR turns up nothing. I think each octet's value will depend on the previous one.
I can go on, but this stuff takes a lot of time. Hopefully this may give you a start, or maybe give you a couple of ideas.
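For what it's worth, the first-byte observation checks out against the samples above; a quick shell sanity check (bash arithmetic, purely illustrative):
$ printf '%x\n' $(( 0x64 - 26 ))   # 'd' from "dinosaur..." -> first cipher byte
4a
$ printf '%x\n' $(( 0x67 - 26 ))   # 'g' from "gertcha"
4d
$ printf '%x\n' $(( 0x61 - 26 ))   # 'a' from the all-'a' password
47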
But since the output is equal in length to the input, this looks like some fixed-key cipher. It may be a trivial XOR.
I suggest testing the following passwords:
* AAAAAAAA
* aaaaaaaa
* BBBBBBBB
* ABABABAB
* BABABABA
* AAAABBBB
* BBBBAAAA
* AAAAAAAAAAAAAAAA
* AAAAAAAABBBBBBBB
* BBBBBBBBAAAAAAAA
This should maybe allow us to break the cipher without reverse engineering the DLL.
Can the DLL encode single-character passwords? Or even a zero-character password?
You're going to want to start with the most trivial test cases.
You may be looking at this problem from the wrong angle. I would think that the best way to figure out how the passwords are encoded is to reverse engineer the login DLL.
I would recommend IDA Pro for this task. It's well worth the price for the help it gives you in reversing executable code into readable assembler. There are other disassemblers that are free if you don't want to pay, but I haven't come across anything as powerful as IDA Pro. A free static disassembler/debugger that I would recommend is PEBrowse from SmidgeonSoft, as it's good for quickly poking around a live running system and has good PDB support for loading debugging symbols.