Binary file vs text file - C++

Are there situations where I should prefer a binary file to a text file? I'm using C++ as the programming language.
For example, if I have to store a large amount of text, is it better to use a text file or a binary file?
Edit
For the moment, the file has no requirement to be human-readable. Are there performance differences, security differences, and so on?
Edit
Sorry for omitting the other requirements (thanks to Carey Gregory):
The records to save are ASCII encoded.
The file must be encrypted (AES).
The machine can power off at any time, so I have to try to prevent errors.
I have to know if the file changes outside the program; I think I'll use a SHA-1 digest of the file.

As a general rule, define a text format, and use it. It's much
easier to develop and debug, and it's much easier to see what is
going wrong if it doesn't work.
If you find that the files are becoming too big, or taking too
much time to transfer over the wire, consider compressing them.
A compressed text file is often smaller than what you can
achieve with binary. Or consider a less verbose text format;
it's possible to reliably transmit a text representation of your
data with far fewer characters than XML uses.
And finally, if you do end up having to use binary, try to choose
an existing format (e.g. Google's protocol buffers), or base your
format on an existing format. Just remember that:
Binary is a lot more work than text, since you practically
have to write all of the << operators again, including those
in the standard library (see the sketch after these points).
Binary is a lot more difficult to debug, because you can't
easily see what you've actually done.
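As an illustration of how much of that work the text form gets for free, here is a minimal sketch (the Record struct and both functions are made up for the example): text output just reuses the standard operators, while binary output has to spell out the byte layout itself.

    #include <cstdint>
    #include <fstream>

    struct Record {
        std::uint32_t id;
        double value;
    };

    // Text: the standard library's operator<< already does the formatting.
    void writeText(std::ofstream& out, const Record& r) {
        out << r.id << ' ' << r.value << '\n';
    }

    // Binary: we must decide the byte layout ourselves (field widths,
    // endianness, padding) and write each field explicitly; dumping the
    // whole struct would silently include compiler-dependent padding.
    void writeBinary(std::ofstream& out, const Record& r) {
        out.write(reinterpret_cast<const char*>(&r.id), sizeof r.id);
        out.write(reinterpret_cast<const char*>(&r.value), sizeof r.value);
    }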
Concerning your last edit:
Once you've encrypted, the results will be binary. You can
use a text representation of the binary (base64 or some such),
but the results won't be any more readable than the binary, so
it's not worth the bother. If you're encrypting in process,
before writing to disk, you automatically lose all of the
advantages of text.
The issues concerning powering off mean that you cannot use
ofstream directly. You must open or create the file with the
necessary options for full transactional integrity (O_SYNC as
a flag to open under Unix). You must write each record as
a single write request to the system.
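A minimal POSIX sketch of that idea (Unix-specific; the function name and record format are placeholders, and real code would need more thorough error handling, plus probably an fsync of the containing directory after creating the file):

    #include <fcntl.h>    // open, O_* flags (POSIX)
    #include <unistd.h>   // write, close (POSIX)
    #include <string>

    // Append one record with O_SYNC, so write() does not return until the
    // data (and the metadata needed to retrieve it) has reached the disk.
    bool appendRecord(const char* path, const std::string& record) {
        int fd = ::open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
        if (fd < 0) return false;

        // One record per write() call, so a crash between calls never
        // leaves a record scattered across several partial writes.
        ssize_t written = ::write(fd, record.data(), record.size());
        bool ok = (written == static_cast<ssize_t>(record.size()));

        return (::close(fd) == 0) && ok;
    }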
It's always a good idea to have a checksum, just in case. If
you're worried about security, SHA1 is a good choice. But keep
in mind that if someone has access to the file, and wants to
intentionally change it, they can recalculate the SHA1 and
insert the new value as well.
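For the digest itself, here is a sketch using OpenSSL's one-shot SHA1() (assuming OpenSSL is available and the file fits in memory; the newer EVP interface is preferred in OpenSSL 3.0, and if you need to detect deliberate tampering you would really want a keyed MAC such as HMAC rather than a bare hash):

    #include <openssl/sha.h>   // SHA1, SHA_DIGEST_LENGTH; link with -lcrypto
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Hash the whole file in one shot. Large files would be fed to the
    // hash incrementally instead of being read into memory.
    std::vector<unsigned char> fileSha1(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

        std::vector<unsigned char> digest(SHA_DIGEST_LENGTH);
        SHA1(reinterpret_cast<const unsigned char*>(data.data()),
             data.size(), digest.data());
        return digest;
    }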

All files are binary; the data within them is a binary representation of some information. If you have to store a large amount of text then the file will contain the binary representation of that text. The difference between a "binary file" and a "text file" is that creating the latter involves converting data to a text form before saving it. This is typically done so humans can read it.
The distinction between binary and text is usually made when storing data that is for computer consumption. Typically this data would not be text - it might be a list of numerical configuration values, for example: 1, 2, 3.
If you stored this in text format, your file could contain a list of human-readable numbers, and if you opened the file in Notepad you might see one number per line. But what you're actually saving here is not the binary values 1, 2, 3 - you're saving a string "1\n2\n3\n". Note that this string is 6 characters long, and the binary values (assuming ASCII) would actually be 49, 10, 50, 10, 51, 10!
If the same data were stored in binary format, you would store the numbers in the smallest useful space, and write the file as individual bytes that can often only be read by the code that created them. Opening this file in Notepad would likely display junk characters, because the data makes no sense as text. In this case you would be saving a byte array with actual values { 1, 2, 3 } - or even a single byte with the three values embedded. This could be much smaller than the human-readable equivalent.
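A minimal sketch of exactly that difference (the file names are arbitrary): opened in Notepad, numbers.txt shows three lines, while numbers.bin shows three unprintable characters.

    #include <cstdint>
    #include <fstream>

    int main() {
        // Text: "1\n2\n3\n" -- six bytes (49 10 50 10 51 10 in ASCII).
        std::ofstream text("numbers.txt");
        text << 1 << '\n' << 2 << '\n' << 3 << '\n';

        // Binary: three bytes, 0x01 0x02 0x03.
        std::ofstream bin("numbers.bin", std::ios::binary);
        const std::uint8_t values[] = {1, 2, 3};
        bin.write(reinterpret_cast<const char*>(values), sizeof values);
    }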

Binary files store a sequence of bytes, like all other files. You can store numeric values such as integers in 4 bytes each, characters in a single byte each, or even serialized class objects; anything you want.
When you know how to read a binary file (i.e. you know what is stored in it), you can extract all the information from it. Text files, on the other hand, use text encodings like UTF-8, ANSI, etc., and are intended to encode text characters to be processed by text editors.

Binary files are meant to be interpreted only by machines, whereas a text file can also be opened by a human, who can interpret its content.
So it depends on whether you want your file to be readable by a human or not.

It depends on a lot of factors. I can think of two right now:
Do you require the file to be readable by humans?
Is compactness a factor? A 10-digit number takes at least 10 bytes as text, but can fit in as few as four bytes in binary.

All data is binary; you always need a machine to interpret it for you. Even if the data is encoded with protocol buffers, Avro, Thrift, etc., it is binary, and if it is plain text, it is still binary. If you want to read protocol buffers in Notepad, there is a two-step process: decode, then read. With text, the decoding step is not needed. The same goes for encrypted data: first decrypt, then read. Humans cannot read binary directly (as some commenters are pointing out); we still need Notepad to interpret and display binary (so-called text).

All data stored in a text file consists of human-readable characters, and each line of data ends with a newline character.
In the case of a binary file, data is stored in the same format as it is held in memory. There are no lines or newline characters; the end of the file is determined by its recorded length rather than by a special marker.
Moreover, binary files are usually more compact, since values are stored in their raw in-memory form rather than as formatted text.

Related

Most efficient file type for random data access

I am writing a password generation program. I have collected a list of around 30,000 English words and plan on picking from them at random by index.
Currently, I have all the words in a .txt file each separated by a newline character and organized by length.
My current plan is to write the program in C++ because that is the language I am most comfortable in so I could just load the entire file into memory, but that seems incredibly sloppy.
What would be a more efficient way (or file type like JSON if necessary) to do this? Thanks
30,000 words sounds like an insignificant amount of data to load. Even if it were ~50-500 MB, just load it in and forget about it.
On a modern system this will take a fraction of a second to accomplish the first time, any SSD can do ~600MB/s+, and even less once it's in the OS disk buffer.
You'd only concern yourself with not loading it if you've got a file too big to fit in memory.
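A minimal sketch of the load-it-all approach (words.txt is a placeholder path; std::mt19937_64 seeded from std::random_device is fine for illustration, but a real password generator should use a cryptographically secure source of randomness):

    #include <fstream>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    int main() {
        // Load the whole word list once; 30,000 short lines is trivial.
        std::ifstream in("words.txt");
        std::vector<std::string> words;
        for (std::string line; std::getline(in, line); )
            if (!line.empty()) words.push_back(line);
        if (words.empty()) return 1;

        // Pick words uniformly at random by index.
        std::mt19937_64 rng(std::random_device{}());
        std::uniform_int_distribution<std::size_t> pick(0, words.size() - 1);
        for (int i = 0; i < 4; ++i)
            std::cout << words[pick(rng)] << ' ';
        std::cout << '\n';
    }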

Open statement in fortran

I have a question about the OPEN statement in Fortran:
OPEN(UNIT=unit, FILE=file-name, ACCESS=access, FORM=form, RECL=recl)
ACCESS = sequential or direct
FORM = formatted or unformatted
RECL is the record length in bytes for the file
I tried searching a lot, but could not work out what sequential or direct access, formatted or unformatted files, or the record length of a file mean. Can someone explain these terms to me?
File access specifies how the file will be written to (or read from) after opening. Opening with one access mode, but reading/writing consistent with another access mode, often results in a runtime error.
Sequential access, naturally enough, implies reading and writing sequentially. Writing sequentially means that output is placed in the output file in the same order that the program produces it, so if X is output before Y, the file will contain X earlier (closer to the beginning of the file) than Y. Reading sequentially means that reading occurs from the start toward the end of the file. Append access is a special form of sequential access which starts at the end of the file (so write operations add to the end of the file).
Direct access means that contents of the file can be accessed in any order. This is also called random access. Essentially, when performing input or output, the program must specify the position in the file where the operation is to occur.
The position in the direct access file in Fortran is specified in terms of "records", which all have exactly the same length (specified by the RECL= clause when the file is opened). So, if a file contains 20 records and has record length equal to 30, the total size of data the program can access from the file is 600 bytes, and every read or write operation will access a record containing 30 bytes.
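For readers more at home in C++, the record arithmetic of a direct-access file can be sketched with an ordinary binary stream (this is only an analogue, not Fortran; the function and its parameters are made up, and, like Fortran's REC=, the record number here is 1-based):

    #include <cstddef>
    #include <fstream>
    #include <vector>

    // Read record 'rec' (1-based) from a file whose records are all
    // 'recl' bytes long: record N starts at byte (N - 1) * recl.
    std::vector<char> readRecord(std::ifstream& file,
                                 std::size_t rec, std::size_t recl) {
        std::vector<char> buffer(recl);
        file.seekg(static_cast<std::streamoff>((rec - 1) * recl));
        file.read(buffer.data(), static_cast<std::streamsize>(recl));
        return buffer;
    }
    // e.g. with RECL=30, record 7 starts at byte 180; 20 such records
    // give the 600 bytes of accessible data mentioned above.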
An unformatted file basically means the contents of the file are read and written as a stream. An unformatted sequential access file is the equivalent of a binary file in languages like C that is read from beginning to end. An unformatted direct access file is also binary, but operations can access the file in any order (under control of the program).
A formatted file essentially means that all reading and writing must involve a format specification. There are also some special treatments such as, when writing, a newline marker written to the file at the end of every write statement.
A straight text file is typically opened as a sequential access formatted file. Every Fortran read or write operation acts on a new line (so two write statements will produce two lines in the file, and two corresponding read statements will be need to read them back in).
It is possible to have a formatted direct access file. This basically means the read and write statements must specify formats to read/write the records, but records can be accessed in any order. The ends of records are typically marked with newlines.
It's easy to find on the web (including discussion here):
A "record" is data, usually in characters. Some files have records which are all the same length, some do not. In between, there are files which store the length of each record as part of the record. It is simplest to work with files having records which are all the same length, because (for many storage devices) you can compute the beginning of a particular record by knowing the record number and the length of the records. If the records are different lengths, it is more work to keep track of the record locations.
Sequential files are accessed one record at a time, like a tape (see this page for a discussion of lengths). As a rule, tapes could be rewound and read forward, but reading at a random point was harder. Doing that is direct access. This page makes it clear that there is a distinct choice between the two - you can have one or the other.
Formatted output is just that - making the output follow some report-style format (on the level of lines), while unformatted output does not follow tidy rules. See Fortran unformatted file format for examples of discussion. On a more technical slant, this page at Oracle goes into more depth.

How to efficiently decompress huffman coded file

I've found a lot of questions asking this but some of the explanations were very difficult to understand and I couldn't quite grasp the concept of how to efficiently decompress the file.
I have found these related questions:
Huffman code with lookup table
How to decode huffman code quickly?
But I fail to understand the explanations. I know how to encode and decode with a Huffman tree in the usual way. Right now, in my compression program, I can write any of the following information to the file:
symbol
Huffman code (unsigned long)
Huffman code length
What I plan to do is take a text file, split it into smaller text files, and compress each one individually. Then I want to decompress by sending all the small compressed files, each with its respective lookup table (I don't know how to do this part), to an Nvidia GPU so it can decompress them in parallel using some sort of lookup table.
I have 3 questions:
What information should I write to file in the header to construct the look up table?
How do I recreate this table from file?
How do I use it to decode the huffman encoded file quickly?
Don't bother writing it yourself, unless this is a didactic exercise. Use zlib, lz4, or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.
You are only talking about Huffman coding, indicating that you would only get a small portion of the available compression. Most of the compression in the libraries mentioned comes from matching strings. Look up "LZ77".
As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most-significant nine bits of the code. Each entry in the table has either a symbol and the number of bits for that code (less than or equal to nine), or, if the provided nine bits are a prefix of a longer code, that entry has a pointer to another table to resolve the rest of the code and the number of bits needed for that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine. In fact, there are 2^(9-n) entries for an n-bit code.
So to decode you get nine bits from the input and get the entry from the table. If it is a symbol, then you remove the number of bits indicated for the code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, get the number of bits indicated by the table, and look it up there. Now you will definitely get a symbol to emit, and the number of remaining bits to remove from the stream.
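A much-simplified sketch of such a table, assuming every code is at most nine bits long and codes are read most-significant-bit first (real zlib adds the secondary tables described above for longer codes; the input layout here is made up):

    #include <cstdint>
    #include <vector>

    // One entry per possible 9-bit chunk of input.
    struct Entry {
        std::uint16_t symbol;
        std::uint8_t  bits;   // true code length; 0 marks an unused slot
    };

    constexpr int kTableBits = 9;

    // Build the table from parallel arrays of (symbol, code, length),
    // with each code stored in the low 'length' bits of codes[i] and
    // every length between 1 and kTableBits.
    std::vector<Entry> buildTable(const std::vector<std::uint16_t>& symbols,
                                  const std::vector<std::uint32_t>& codes,
                                  const std::vector<std::uint8_t>& lengths) {
        std::vector<Entry> table(std::size_t(1) << kTableBits, Entry{0, 0});
        for (std::size_t i = 0; i < symbols.size(); ++i) {
            // An n-bit code owns every 9-bit value that starts with it:
            // 2^(9-n) consecutive slots.
            std::uint32_t first = codes[i] << (kTableBits - lengths[i]);
            std::uint32_t count = std::uint32_t(1) << (kTableBits - lengths[i]);
            for (std::uint32_t j = 0; j < count; ++j)
                table[first + j] = Entry{symbols[i], lengths[i]};
        }
        return table;
    }

    // Decoding then peeks 9 bits from the stream (zero-padded near the
    // end), emits table[peeked].symbol, and consumes table[peeked].bits
    // bits -- one table lookup per symbol.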

What's the smallest possible file size on disk?

I'm trying to find a solution to store a binary file at its smallest size on disk. I'm reading vehicles' VINs and plate numbers from a database; each record is 30 bytes. When I put one in a .txt file and save it, its size is 30 B, but its size on disk is 4 KB, which means that if I save 100,000 files or more it would kill storage space.
So my question is: how can I write these 30 bytes to an individual binary file at its smallest size on disk, and what is the smallest possible on-disk size for 30 bytes, including other information such as the file name and permissions?
Note: I do not want to save this text in a database; I just want to make separate binary files.
The smallest size of a file is always the cluster size of your disk, which is typically 4k. For data like this, having many records in a single file is really the only reasonable solution.
Although another possibility would be to store those files in an archive, a zip file for example. Under Windows you can even access the zip contents in Explorer much as you would ordinary files.
Another creative possibility: store all the data in the filename only. A zero-byte file takes only 1024 bytes in the MFT (assuming NTFS).
Edit: reading up on resident files, I found that on the newer 4k-sector drives the MFT entry is actually 4k, too. So it doesn't get smaller than this, whether the data size is 0 or not.
Another edit: huge directories, with tens or hundreds of thousands of entries, will become quite unwieldy. Don't try to open one in Explorer, or be prepared to go drink a coffee while it loads.
Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
You should consider using an indexed file library like gdbm: it associates arbitrary data with an arbitrary key. You won't spend a file on each association (only a single file for all of them).
You should also reconsider your opposition to "databases". SQLite is a library giving you SQL and database abilities, and there are NoSQL databases like MongoDB.
Of course, all this is horribly operating system and file system specific (but gdbm and SQLite should work on many systems).
AFAIU, you can configure and use both gdbm and SQLite to store millions of entries of a few dozen bytes each quite efficiently.
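If the database route gets reconsidered, a minimal SQLite sketch (error handling trimmed; the database name, table layout and sample values are all made up) shows how little code that takes:

    #include <sqlite3.h>   // link with -lsqlite3
    #include <cstdio>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("vehicles.db", &db) != SQLITE_OK) return 1;

        // One table instead of 100,000 tiny files: each row costs a few
        // dozen bytes rather than a whole 4 KB cluster.
        const char* sql =
            "CREATE TABLE IF NOT EXISTS vehicle("
            "  vin TEXT PRIMARY KEY, plate TEXT);"
            "INSERT OR REPLACE INTO vehicle VALUES"
            "  ('1HGCM82633A004352', 'AB-123-CD');";

        char* err = nullptr;
        if (sqlite3_exec(db, sql, nullptr, nullptr, &err) != SQLITE_OK) {
            std::fprintf(stderr, "sqlite error: %s\n", err);
            sqlite3_free(err);
        }
        return sqlite3_close(db) == SQLITE_OK ? 0 : 1;
    }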
On any filesystem you have the same problem: the smallest allocation is one data block plus an inode. For example, in IBM JFS2 the smallest block size is 4k, and you also have an inode to allocate. The second problem is that you will be writing many files in a short time, which creates performance problems: every write must be journaled and committed (unless you use an old, non-journaled filesystem), and doing that for many inodes in a short time is expensive.
One idea is to group many of your data records together, put a separator between them, and write 200-1000 of them into one file.
For example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them via the file name, using sequence numbers or similar.
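A minimal sketch of that batching idea (the ";;" separator and the naming scheme are just the ones suggested above):

    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    // Pack a batch of records into one file, ";;" separated, so that
    // e.g. 1000 records share one file instead of costing 1000 clusters.
    void writeBatch(const std::string& path,
                    const std::vector<std::string>& records) {
        std::ofstream out(path, std::ios::binary);
        for (std::size_t i = 0; i < records.size(); ++i) {
            if (i != 0) out << ";;";
            out << records[i];
        }
    }

    // e.g. writeBatch("batch_000042.dat", batch);  // sequence number in name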

Fast CSV parser in C++

I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.
I am using my own code to read it line by line; then I separate each line into fields and convert the fields to the corresponding data types (such as integer, double, etc.). This data is then passed to class objects via their constructors.
However, I found it is not very efficient. It took about 1 min to read these 20k+ lines and create 20k+ objects.
I've googled about fast csv parser, and found there are many options. I've tried some of them, but not very satisfied with the time performance.
Does anyone have a better method to read large .csv files? Many thanks in advance.
An efficient method for parsing (or, for that matter, any processing of files) is to read as much of the file into memory as possible before you start parsing.
File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.
I've made programs faster by allocating a large array for the data then reading the data into the array. Next I process the data in the array and repeat until the entire file is processed.
Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
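A minimal sketch of the read-everything-first idea (C++17; std::from_chars for floating point also needs a fairly recent standard library, and the CSV is assumed to be purely numeric with no quoted fields):

    #include <charconv>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <string_view>
    #include <vector>

    // Slurp the whole CSV into one string, then walk it with string_views
    // so no per-field std::string allocations are needed.
    std::vector<std::vector<double>> parseCsv(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::string data((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());

        std::vector<std::vector<double>> rows;
        std::string_view rest(data);
        while (!rest.empty()) {
            std::size_t eol = rest.find('\n');
            std::string_view line = rest.substr(0, eol);
            rest = (eol == std::string_view::npos)
                       ? std::string_view{} : rest.substr(eol + 1);

            std::vector<double> fields;
            while (!line.empty()) {
                std::size_t comma = line.find(',');
                std::string_view cell = line.substr(0, comma);
                double value = 0.0;
                std::from_chars(cell.data(), cell.data() + cell.size(), value);
                fields.push_back(value);
                line = (comma == std::string_view::npos)
                           ? std::string_view{} : line.substr(comma + 1);
            }
            rows.push_back(std::move(fields));
        }
        return rows;
    }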
Please edit your post to show the code where the bottleneck is.