How to customize the encoding of a text file in C++?

I am working on a Huffman file compression project. So far I know it works something like this:
file.txt (original) -> file.huf (encoded, compressed) -> file.txt (decoded)
What I have to do is open the text file and generate the Huffman codes, but how do I write those codes out in place of the original file's bytes? For example, if file.txt stores abc, then its ASCII-encoded file stores 01100001 01100010 01100011, and the Huffman-coded file, i.e. file.huf, should store 10 11 0. This file should then be decoded using the encoding map that was generated.
My question is how to do this in C++ at the file level: how can I write the binary file?

I have figured it out. In simple words: take the code for one character, then the code for the next, and concatenate them. Repeat this until the accumulated sequence of 1s and 0s is at least 8 digits long, then pack the first 8 digits into a single byte (one char) and write it to the file.
This way the file can be compressed.
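The packing step described above could be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the names encodeToBits, packBits, and writePacked are hypothetical, the code table is assumed to be already built, and the final byte is assumed to be padded with zero bits.

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Concatenate the code of every character in `text` into one string of '0'/'1'.
// `codes` is assumed to contain an entry for every character that occurs.
std::string encodeToBits(const std::string& text,
                         const std::map<char, std::string>& codes) {
    std::string bits;
    for (char c : text) bits += codes.at(c);
    return bits;
}

// Pack a string of '0'/'1' digits into bytes, most significant bit first.
// If the length is not a multiple of 8, the last byte is padded with 0 bits.
std::vector<std::uint8_t> packBits(const std::string& bits) {
    std::vector<std::uint8_t> out;
    std::uint8_t current = 0;
    int filled = 0;
    for (char c : bits) {
        current = static_cast<std::uint8_t>((current << 1) | (c == '1' ? 1 : 0));
        if (++filled == 8) {
            out.push_back(current);
            current = 0;
            filled = 0;
        }
    }
    if (filled > 0) {
        out.push_back(static_cast<std::uint8_t>(current << (8 - filled)));
    }
    return out;
}

// Write the packed bytes to a file opened in binary mode.
bool writePacked(const std::string& path, const std::vector<std::uint8_t>& bytes) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bytes.data()),
              static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

With the codes from the question (a = 10, b = 11, c = 0), abc becomes the bit string 10110, which packs into the single byte 10110000. Because of the padding, a decoder would also need to know how many bits are valid, for example by storing the original character count in the file alongside the code table.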

Related

C++: Problem of Korean alphabet encoding in text file write process with std::ofstream

I have code that saves a log as a text file.
It usually works well, but I found a case where it doesn't:
{Id": "testman", "ip": "192.168.1.1", "target": "?뚯뒪??exe", "desc": "?덈뀞諛⑷??뚯슂"}
My code is simple logic that saves the log string to a text file.
It works well when the log is in English, but there is a problem when the log is in Korean.
After various experiments, I confirmed that Korean text is not a problem if the file is saved in UTF-8 format.
I think that when the log string contains Korean, C++ saves the file in ANSI format by default.
This is my C++ code:
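#include <fstream>
#include <string>
using namespace std;

string logFilePath = {path};  // {path} is a placeholder for the real log file path
string log = "{\"Id\": \"testman\", \"ip\": \"192.168.1.1\", \"target\": \"테스트.exe\", \"desc\": \"안녕방가워요\"}";
ofstream output(logFilePath, ios::app);  // append to the existing log
output << log << endl;
output.close();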
Is there a way to save log files as UTF-8, or any other good way?
Please give me some advice.
In Visual Studio, you could set the encoding to UTF-8 in File->Advanced Save Options.
If you cannot find it, you can add Advanced Save Options via Tools->Customize->Commands->Add Command..->File.
TL;DR: write 0xEF 0xBB 0xBF (the 3-byte UTF-8 BOM) at the beginning of the file before writing out your string.
One of the hints that text viewers use to decide whether a file should be shown as Unicode is the byte order mark (BOM for short). It is a short series of bytes at the beginning of a text stream that identifies the encoding and, for encodings such as UTF-16, the endianness of the text. For UTF-8 it is the three bytes 0xEF 0xBB 0xBF.
You can experiment with this by opening Notepad, writing a single ASCII character, and saving the file in ANSI format. Then look at the size of the file in bytes: it will be 1 byte. Now open the file, save it as UTF-8 (in recent versions of Notepad, choose "UTF-8 with BOM"), and look at the size again. It will be 4 bytes: three bytes for the BOM and one byte for the single character you put in there. You can confirm this by viewing both files in a hex editor.
That being said, you may need to write these bytes to your file before writing your string to it. Why UTF-8, you may ask? It depends on the encoding of the original string (your std::string log), which in this case is a string literal written in a source file whose encoding is (most likely) UTF-8. Therefore the bytes that make up the string are produced according to this encoding and put into your executable.
Note that std::string can hold a Unicode string; it just cannot make sense of it. For example, length() reports the number of bytes, not the number of characters. But it can carry a Unicode string around just fine.
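As a rough sketch of the BOM approach described above: the helper below (the name appendUtf8Line is hypothetical) writes the three BOM bytes only when the log file is new or empty, then appends the line. It assumes the string passed in is already UTF-8 encoded, which depends on how the source file is saved and compiled.

```cpp
#include <fstream>
#include <string>

// Append one line to a log file, writing the UTF-8 BOM first if the file
// does not exist yet or is empty. `utf8Line` is assumed to be UTF-8 already.
bool appendUtf8Line(const std::string& path, const std::string& utf8Line) {
    bool needBom = true;
    {
        // A file that cannot be opened or has no first byte needs a BOM.
        std::ifstream in(path, std::ios::binary);
        needBom = !in || in.peek() == std::ifstream::traits_type::eof();
    }
    std::ofstream out(path, std::ios::binary | std::ios::app);
    if (!out) return false;
    if (needBom) out.write("\xEF\xBB\xBF", 3);  // UTF-8 byte order mark
    out << utf8Line << '\n';
    return static_cast<bool>(out);
}
```

The file is opened in binary mode so that the bytes are written exactly as given. Because the BOM is written only once, repeated calls keep appending to the same log without inserting stray BOM bytes in the middle of the file.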

How can I know if I'm reading a binary or a text file

My program has different options: you can read a binary file or a text file, but you can pick the binary file option and then choose a text file... How can I detect that an incorrect file has been supplied while I'm doing this:
while(fich.read((char *)&record,sizeof(record)))
How can I detect that an incorrect file has been supplied while I'm doing this
The simple answer is: You cannot.
It's impossible to distinguish plain (let's say ASCII encoded) text files from binary files.
Any of the introductory byte sequences read from the file might be valid for both.
The silly but common solutions to this problem are:
give your file name an extension that implies a particular format
start your file with a magic byte sequence (1-2 bytes) that implies a particular format
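The magic byte idea could look something like the sketch below. The two-byte signature 'R', 'B' and the names kMagic, writeMagic, and hasMagic are made up for illustration; any short sequence unlikely to start a text file would do.

```cpp
#include <array>
#include <fstream>
#include <string>

// Hypothetical 2-byte signature that marks "our" binary record format.
const std::array<char, 2> kMagic = {{'R', 'B'}};

// Write the signature at the start of a freshly opened binary output file.
bool writeMagic(std::ofstream& out) {
    out.write(kMagic.data(), static_cast<std::streamsize>(kMagic.size()));
    return static_cast<bool>(out);
}

// Return true if the file exists and begins with the signature.
bool hasMagic(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::array<char, 2> head = {{0, 0}};
    if (!in.read(head.data(), static_cast<std::streamsize>(head.size()))) {
        return false;  // missing file or shorter than the signature
    }
    return head == kMagic;
}
```

A reader would call hasMagic before entering the read loop and reject the file if it returns false. This is only a heuristic: a text file that happens to begin with the same two characters would still pass the check.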

How to read large files in C++ with mixed text and binary

I need to read a large file of either text, binary, or combination, such as a JPEG file, encrypt it, and write it to a file. At some later time I will need to read the encrypted data, and decrypt it.
The end goal is to verify that the decrypted data matches the original data.
My problem is that with large files, greater than 1 MB, I don't want to read and write character by character. I am targeting this code at a phone, and the I/O would cause too long a delay for the user.
With a pure text file, using fread() and fwrite() seems to convert the data to binary, and the result is different from the original. With a JPEG image, it appears that there is some textual content mixed in with the binary data.
Is there a way to efficiently read in an arbitrary type of file and write it back in the original format?
Or is character by character the only option?
Or am I still out of luck?
After debugging it turned out that the decrypt function had the plain text and cipher text buffers assigned backwards. After swapping the buffer assignments, the decrypted results matched the original data. I originally thought that maybe reading the text as binary and then rewriting as binary would not appear as text, but I was wrong.
Reading the entire file as binary works just fine.
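For reference, reading a whole file as binary in one call, rather than character by character, might look like the sketch below. It uses C++ streams instead of fread() and fwrite(), and the function names readAllBytes and writeAllBytes are illustrative only.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read the entire file into memory as raw bytes. Returns an empty vector
// if the file cannot be opened or read.
std::vector<char> readAllBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);  // start at end
    if (!in) return {};
    const std::streamsize size = in.tellg();  // position at end == file size
    in.seekg(0, std::ios::beg);
    std::vector<char> data(static_cast<std::size_t>(size));
    if (size > 0 && !in.read(data.data(), size)) return {};
    return data;
}

// Write raw bytes back out unchanged, replacing any existing file.
bool writeAllBytes(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}
```

Binary mode applies no newline translation, so text, JPEG data, and anything in between come back byte for byte. For files too large to hold in memory at once, the same calls can be applied to fixed-size chunks in a loop.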

How to read output of hexdump of a file?

I wrote a program in C++ that compresses a file.
Now I want to see the contents of the compressed file.
I used hexdump, but I don't know what the hex numbers mean.
For example I have:
0000000 00f8
0000001
How can I convert that back to something that I can compare with the original file contents?
If you implemented a well-known compression algorithm you should be able to find a tool that performs the same kind of compression and compare its results with yours. Otherwise you need to implement an uncompressor for your format and check that the result of compressing and then uncompressing is identical to your original data.
That looks like a file containing the single byte 0xf8. I say that since it appears to have the same behaviour as od under UNIX-like operating systems, with the last line containing the length and the contents padded to a word boundary (you can use od -t x1 to get rid of the padding, assuming your od is advanced enough).
As to how to recreate the original, you need to run the file through a decompression process that matches the compression used.
Given that the compressed file is that short, you either started with a very small file, your compression process is broken, or it's incredibly efficient.
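To see the bytes one at a time, without the word grouping and padding mentioned above, a small dump routine along these lines could be used. The name hexBytes is hypothetical, and the output format (lowercase, space-separated) is just one reasonable choice.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Return the contents of a file as space-separated two-digit hex bytes,
// in file order, with no grouping into words.
std::string hexBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string out;
    char c;
    char buf[4];
    while (in.get(c)) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(c)));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

For a file containing the single byte 0xf8, this yields f8 rather than the padded 00f8 shown in the question.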

What is a good way to test a file to see if its a zip file?

I am looking at a new file format specification, and the specification says the file can be either XML based or a zip file containing an XML file and other files.
The file extension is the same in both cases. What ways could I test the file to decide if it needs decompressing or just reading?
The zip file format is defined by PKWARE in their APPNOTE.TXT file specification.
Near the top you will find the header specification:
A. Local file header:
    local file header signature     4 bytes  (0x04034b50)
    version needed to extract       2 bytes
    general purpose bit flag        2 bytes
    compression method              2 bytes
    last mod file time              2 bytes
    last mod file date              2 bytes
    crc-32                          4 bytes
    compressed size                 4 bytes
    uncompressed size               4 bytes
    file name length                2 bytes
    extra field length              2 bytes
    file name                       (variable size)
    extra field                     (variable size)
From this you can see that the first 4 bytes of the header should be the file signature which should be the hex value 0x04034b50. Byte order in the file is the other way round - PKWARE specify that "All values are stored in little-endian byte order unless otherwise specified.", so if you use a hex editor to view the file you will see 50 4b 03 04 as the first 4 bytes.
You can use this to check if your file is a zip file. If you open the file in notepad, you will notice that the first two bytes (50 and 4b) are the ASCII characters PK.
You could look at the magic number of the file. The ones for ZIP archives are listed on the ZIP format wikipedia page: PK\003\004 or PK\005\006.
Check the first few bytes of the file for the magic number. Zip files begin with PK (50 4B). As XML files cannot start with these characters and still be valid, you can be fairly sure as to the file type.
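A signature check along the lines of these answers might look like this in C++. It is a sketch: the name looksLikeZip is made up, and it only tests for the two signatures mentioned above (PK\003\004 and PK\005\006) rather than validating the archive.

```cpp
#include <fstream>
#include <string>

// Return true if the file starts with a ZIP signature: "PK\003\004"
// (local file header) or "PK\005\006" (an empty archive).
bool looksLikeZip(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char head[4] = {};
    if (!in.read(head, 4)) return false;  // missing or shorter than 4 bytes
    const bool pk = head[0] == 'P' && head[1] == 'K';
    const bool localHeader = head[2] == 3 && head[3] == 4;
    const bool emptyArchive = head[2] == 5 && head[3] == 6;
    return pk && (localHeader || emptyArchive);
}
```

If the function returns false, the file can be handed to the XML parser instead. Reading the bytes individually sidesteps the little-endian question, since the comparison follows the order in which the bytes appear in the file.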
You can use the file command to see whether it's a text file (XML) or a zip archive.
Not a good solution though, but just thinking out loud... how about:
try
{
    LoadXmlFile(theFile); // throws an exception if it is not an XML file
}
catch (Exception ex)
{
    LoadZipFile(theFile);
}
You could check the file to see if it contains a valid XML header. If it doesn't, try decompressing it.
See the XML specification.
File magic numbers
To clarify, it starts with 50 4b 03 04.
See http://www.pkware.com/documents/casestudies/APPNOTE.TXT (From Simon P Stevens)
You could try unzipping it - an XML file is exceedingly unlikely to be a valid zip file, or could check the magic numbers, as others have said.
It depends on what you are using, but the zip library might have a function that tests whether or not a file is a zip file,
something like is_zip, test_file_zip, or whatever...
Or create your own function using the magic number given above.