I want to parse given binary files and extract the text information from them using C++.
What methods are available?
You might use strings(1) to extract the strings of printable characters from a file into a file or pipe, then process those lines. For example:
$ strings werl.exe
!This program cannot be run in DOS mode.
Rich
.text
`.rdata
#.data
.rsrc
QRVh
Could not load module %s.
win_erlexec
Could not find entry point "win_erlexec" in %s.
Could not find key %s in section %s of file %s
Cannot find erlexec.exe
erts-*
\bin
erts-
To save this output to a file out.txt, you use redirection:
$ strings werl.exe > out.txt
You can scan for strings of printable characters yourself: most bytes in binary code are non-printable, so when there is an uninterrupted run of, say, 6 or more printable characters, there is a good chance it is a real string value. Strings are also often terminated with \0, so you can look for a run of printable characters terminated by \0.
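For illustration, here is a minimal C++ sketch of that scanning approach (the file name and the 6-character threshold are just placeholders):

#include <cctype>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::size_t minLen = 6;              // minimum run length to report
    std::ifstream in("werl.exe", std::ios::binary);
    std::string run;
    char c;
    while (in.get(c)) {
        if (std::isprint(static_cast<unsigned char>(c))) {
            run += c;                          // extend the current printable run
        } else {
            if (run.size() >= minLen)
                std::cout << run << '\n';      // report runs that are long enough
            run.clear();
        }
    }
    if (run.size() >= minLen)                  // flush a run that ends at EOF
        std::cout << run << '\n';
}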
I have code that saves a log as a text file.
It usually works well, but I found a case where it doesn't:
{Id": "testman", "ip": "192.168.1.1", "target": "?뚯뒪??exe", "desc": "?덈뀞諛⑷??뚯슂"}
My code is simple logic that saves the log string as a text file.
It works well when the log is in English, but there is a problem when the log is in Korean.
After checking through various experiments, I confirmed that Korean text is not a problem if the file is saved in UTF-8 format.
I think that if Korean text is included in the log string, C++ saves the file in ANSI format by default.
This is my c++ code:
string logFilePath = {path};
string log = "{\"Id\": \"testman\", \"ip\": \"192.168.1.1\", \"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";
ofstream output(logFilePath, ios::app);
output << log << endl;
output.close();
Is there a way to save log files as UTF-8, or is there any other good way?
Please give me some advice.
You could set UTF-8 in File->Advanced Save Options.
If you do not find it, you could add Advanced Save Options in Tools->Customize->Commands->Add Command..->File.
TL;DR: write the three bytes 0xEF 0xBB 0xBF (the UTF-8 BOM) at the beginning of the file before writing out your string.
One of the hints that text-viewer software uses to decide whether a file should be shown as Unicode is the Byte Order Mark (BOM for short). It is a short series of bytes at the beginning of a stream of text that specifies the encoding and endianness of the text. For UTF-8 it is these three bytes: 0xEF 0xBB 0xBF.
You can experiment with this by opening Notepad, typing a single character, and saving the file in the ANSI format. Then look at the size of the file in bytes: it will be 1 byte. Now open the file, save it as UTF-8, and look at the size again: it will be 4 bytes, that is, three bytes for the BOM and one byte for the single character you put in there. You can confirm this by viewing both files in a hex editor.
That being said, you may need to write these bytes to your files before writing your string to them. Why UTF-8, you may ask? Well, that depends on the encoding of the original string (your std::string log), which in this case is a string literal written in a source file whose encoding is (most likely) UTF-8. The bytes that make up the string therefore follow this encoding and are put into your executable as-is.
Note that a std::string can hold a Unicode (UTF-8) string; it just can't make sense of it. For example, it reports its length in bytes rather than in characters. But it can be used to carry a Unicode string around just fine.
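Putting that together, a minimal sketch (the file name is a placeholder; it assumes the source file itself is saved as UTF-8, so the literal holds UTF-8 bytes, and the BOM should be written only once, at the start of a new file):

#include <fstream>
#include <string>

int main() {
    // Assumes this source file is saved as UTF-8, so the literal below
    // carries UTF-8 bytes into the executable.
    const std::string log =
        "{\"Id\": \"testman\", \"ip\": \"192.168.1.1\", "
        "\"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";

    std::ofstream output("log.txt", std::ios::binary);
    output << "\xEF\xBB\xBF";   // the three-byte UTF-8 BOM
    output << log << '\n';
}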
I am keeping a large repository of strings in a character-delimited file. Currently, I am reading the strings into string variables, and then later printing them.
The problem I'm facing is how to store and print new line characters. In the file, if the string, for example, is:
"Hello this is \n\n a new line"
then the literal '\n' is printed in my program's terminal when I print the string; however, I would like actual new lines to be printed.
Is this a matter of processing the strings character by character, or is there a proper way to read the strings into the string variables that will allow this to work?
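For what it's worth, the stream reads the backslash and the 'n' as two ordinary characters, so one way is to translate the pairs yourself after reading. A minimal sketch (the function name is made up):

#include <cstddef>
#include <iostream>
#include <string>

// Turn the two-character sequence backslash + 'n', read literally from the
// file, into a real newline character.
std::string unescapeNewlines(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == 'n') {
            out += '\n';
            ++i;                // skip the 'n'
        } else {
            out += in[i];
        }
    }
    return out;
}

int main() {
    std::string s = "Hello this is \\n\\n a new line";  // as read from the file
    std::cout << unescapeNewlines(s) << '\n';
}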
I'm working with text-files (UTF-8) on Windows and want to read them using C++.
To open the file correctly, I use fopen. As described here, there are two options for opening the file:
Text mode "rt" (Carriage return + Linefeed will automatically be converted into Linefeed; Short "\r\n" becomes "\n").
Binary mode "rb" (The file will be read byte by byte).
Now it becomes tricky. I don't want to open the file in binary mode, since I would lose the correct handling of my UTF-8 characters (and there are special characters in my text files which are corrupted when interpreted as ANSI characters). But I also don't want fopen to convert all my CR+LF into LF.
Is there a way to combine the two modes, to read a text-file into a string without tampering with the linefeeds, while still being able to read UTF-8 correctly?
I am aware that the reverse conversion would happen if I wrote the file back out the same way, but the string is sent to another application that expects Windows-style line endings.
The difference between opening files in text mode and binary mode is exactly this: line-end sequences are translated in text mode and left untouched in binary mode. Nothing more, nothing less. Since the ASCII characters use the same code points in Unicode, and UTF-8 retains the encoding of ASCII characters (i.e., every ASCII file happens to be a UTF-8 encoded Unicode file), whether you use binary or text mode won't affect the other bytes.
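So reading in binary mode is safe for UTF-8 content; a minimal sketch (the function and file names are placeholders):

#include <cstdio>
#include <string>

// Read a file in binary mode ("rb"), so both the CR+LF pairs and the
// multi-byte UTF-8 sequences come through untouched.
std::string readFileRaw(const char* path) {
    std::string content;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return content;
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        content.append(buf, n);
    std::fclose(f);
    return content;             // "\r\n" stays "\r\n"
}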
It may be worth having a look at James McNellis' "Unicode in C++" presentation from C++Now 2014.
I made a program which uses Huffman coding to compress and decompress .txt files (ANSI, Unicode, UTF-8, Big Endian Unicode, ...).
In the decompression I take characters from a binary tree and write them into a .txt file opened in binary mode:
ofstream F;
F.open("example.txt", ios::binary);
I have to write to the .txt file in binary mode because I need to decompress every type of .txt file (not only ANSI), so my symbols are the single bytes.
On Windows it writes every symbol but seems not to keep the Enter (newline) character!
For example, if I have this example.txt file:
Hello
World!
=)
I compress it into example.dat file and I save the Huffman tree into another file (exampletree.dat).
Now, to decompress example.dat, I take characters from the tree saved in exampletree.dat and write them into a new .txt file through put() or fwrite(), but on Windows the result looks like this:
HelloWorld!=)
On Ubuntu it works perfectly and also keeps the Enter characters!
It isn't a code error, because if I print the decompressed .txt file to the console, it prints the Enter characters too! So there is a problem on Windows. Could someone help me?
Did you try opening the file with WordPad or another text editor (such as Notepad++) that recognizes a bare LF as a newline character? The default editor, Notepad, would show it all on a single line like you described.
This may not be the solution you are looking for, but the problem looks to be due to having LF as the line break instead of the Windows default CR/LF.
It looks like the difference in end-of-line handling on Linux vs. Windows. The EOL can be just "\n" or "\r\n"; i.e., Windows usually puts 0x0D, 0x0A at the end of lines.
On Windows there's a difference between opening in binary and in text mode (text mode is the default):
fopen( "filename", "wb" );  // binary mode
fopen( "filename", "wt" );  // text mode
To quote the fopen documentation:
In text mode, carriage return–linefeed combinations are translated into single linefeeds on input, and linefeed characters are translated to carriage return–linefeed combinations on output
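A small sketch of that difference on Windows (file names are placeholders; compare the two resulting files in a hex editor):

#include <cstdio>

int main() {
    const char* text = "Hello\nWorld!\n=)\n";

    // Text mode (the Windows default): each '\n' is written as "\r\n".
    std::FILE* t = std::fopen("text.txt", "wt");
    std::fputs(text, t);
    std::fclose(t);

    // Binary mode: bytes pass through untouched; '\n' stays a single 0x0A.
    std::FILE* b = std::fopen("binary.txt", "wb");
    std::fputs(text, b);
    std::fclose(b);
}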
I'm working on software that needs to save a UTF-8 file with special characters (like 'çäüëé').
I receive the content to save (a regular string with the special characters encoded) from a web service (with gsoap). When I try to save it using ofstream, the file contains a square and other strange characters instead of the special characters.
When I try to convert the regular string to a wide string, the special characters are lost (they are replaced by different ones). And using wofstream, the file is not saved at all when there are special characters.
I also tried utf8-cpp, but the file wasn't written correctly either.
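Tying this back to the earlier answers, a hedged sketch: if the string received from the web service already holds UTF-8 bytes, writing them unchanged in binary mode (optionally preceded by a BOM so Windows viewers detect the encoding) may be all that is needed. The function and file names here are made up:

#include <fstream>
#include <string>

// Assumes `content` already holds UTF-8 bytes as received from the service.
void saveUtf8(const std::string& content) {
    std::ofstream out("output.txt", std::ios::binary);
    out << "\xEF\xBB\xBF";      // optional UTF-8 BOM for viewer detection
    out << content;
}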