rdbuf vs getline vs ">>" - c++

I want to load a map from a text file (if you can come up with some other way to load a map into an array, I'm open to anything new).
What's written in the text file is something like this, but a bit larger in scale.
6 6 10 (Never mind what the number "10" is; the other two are the map size.)
1 1 1 1 1 1
1 0 2 0 0 1
1 0 0 0 2 1
1 2 2 0 0 1
1 0 0 0 0 1
1 1 1 1 1 1
Where 1 is border, 0 is empty, 2 is wall.
Now I want to read this text file, but I'm not sure which way would be best.
What I have in mind so far is:
Reading the whole text file at once into a stringstream, converting it to a string later via rdbuf(), and then splitting the string and putting it into the array.
Reading it number by number via getline().
Reading it number by number using the >> operator.
My question is: which of the mentioned ways (or any other way, if available) is better in terms of RAM use and speed?
Note: whether or not rdbuf() is a good way, I'd really appreciate a good comparison between different ways of splitting a string, for example splitting a text into words at whitespace.

Where 1 is border, 0 is empty, 2 is wall. Now I want to read this text file, but I'm not sure which way would be best. What I have in mind so far is:
You don't have enough data for any of the means you mentioned to make a significant impact on performance. In other words, concentrate on the correctness and robustness of your program first, then come back and optimize the parts that are slow.
Reading the whole text file at once into a stringstream, converting it to a string later via rdbuf(), and then splitting the string and putting it into the array.
The best method for inputting data is to keep the input stream flowing. This usually means reading large chunks of data per transaction versus many small transactions of small quantities. Memory is a lot faster to search and process than an input stream.
I suggest using istream::read before using rdbuf. For either one, I recommend reading into a preallocated area of memory, that is, either an array or, if using string, a string in which you reserve a large space when constructing it. You don't want reallocation of std::string data to slow your program.
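A rough sketch of that "preallocate, then read in one call" idea; the file name and the helper name are made up for illustration:

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Read an entire file into a string whose storage is sized up front,
// so std::string never has to reallocate mid-read.
std::string slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    assert(in);                                 // sketch assumes the file exists
    in.seekg(0, std::ios::end);                 // find the file size...
    std::string data;
    data.resize(static_cast<std::size_t>(in.tellg()));
    in.seekg(0, std::ios::beg);                 // ...then rewind and read it all at once
    in.read(&data[0], static_cast<std::streamsize>(data.size()));
    return data;
}
```

Once the whole file is in memory, splitting it into tokens is a pure in-memory operation with no further I/O.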
Reading it number by number via getline().
Since your data is line oriented, this could be beneficial. You read one row and process that one row. It's a good technique to start with; it's a bit more complicated than the one below, but simpler than the previous method.
Reading it number by number using the >> operator.
IMO, this is the technique you should be using. It is simple and easy to get working, enabling you to work on the remainder of your project.
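A minimal sketch of the >>-based approach, assuming the header line holds the two map dimensions followed by the extra number (the function name and parameter names are invented):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <vector>

// operator>> skips all whitespace (spaces and newlines alike), so the
// header and the grid can be read with the same simple loop.
std::vector<std::vector<int>> loadMap(std::istream& in, int& extra) {
    int rows = 0, cols = 0;
    in >> rows >> cols >> extra;                // e.g. the "6 6 10" header
    assert(rows > 0 && cols > 0);
    std::vector<std::vector<int>> map(rows, std::vector<int>(cols));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            in >> map[r][c];                    // 1 = border, 0 = empty, 2 = wall
    return map;
}
```

Because it takes any std::istream, the same function works on an ifstream for the real file or a stringstream in a test.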
Changing the Data Format
If you want to make the input faster, you can change the format of the data. Binary data, data that doesn't need translation, is the fastest format to read. It bypasses the translation from textual format to internal representation; the binary data is the internal representation.
One of the caveats of binary data is that it is hard to read and modify by hand.
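To illustrate the trade-off, a small sketch of a binary round-trip for the map cells (function and file names are invented; note the format is tied to the machine's endianness, which is part of why it's hard to inspect or edit by hand):

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <vector>

// Write a cell count followed by the raw 32-bit cell values, then read
// them back with a single read() call per array.
void saveBinary(const char* path, const std::vector<std::int32_t>& cells) {
    std::ofstream out(path, std::ios::binary);
    std::int32_t n = static_cast<std::int32_t>(cells.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(cells.data()), n * sizeof(std::int32_t));
}

std::vector<std::int32_t> loadBinary(const char* path) {
    std::ifstream in(path, std::ios::binary);
    assert(in);                                 // sketch assumes the file exists
    std::int32_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    std::vector<std::int32_t> cells(n);
    in.read(reinterpret_cast<char*>(cells.data()), n * sizeof(std::int32_t));
    return cells;
}
```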
Optimizing
Don't. Focus on finishing the project: correctly and robustly.
Don't. Usually, the time you gain is wasted waiting for I/O or for the user. Development time is costly; unnecessary optimization is a waste of development time, and thus a waste of money.
Profile your executable. Optimize the parts that occupy the most execution time.
Reduce requirements / Features before changing code.
Optimize the design or architecture before changing the code.
Change compiler optimization settings before changing the code.
Change data structures & alignment for cache optimization.
Optimize I/O if your program is I/O bound.
Reduce branches / jumps / changes in execution flow.

Related

Parallel bzip2 decompression scan may fail?

Bzip2 byte-stream compression in parallel can easily be done with a FIFO queue, where every chunk is processed as a parallel task and streamed into a file.
The other way round, parallel decompression, is not so easy, because everything is bit-aligned and the exact bit length of a block is only known after it has been decompressed.
As far as I can see, parallel decompression implementations use magic numbers for the block start and stream end and perform a bit scan. Isn't there a small chance that one of the streams contains such a magic value by coincidence?
Possible block validations:
4 Bytes CRC
6 Bytes "compressed magic"
6 Bytes "end of stream magic"
some bit combinations for the huffman trees are not allowed
max. x Bytes of huffman stream (range to search for next magic)
Per file:
4 Bytes File CRC
padding at the end
I could implement such a scan by just bit-shifting through the stream until I find a magic value. But then, when I read block N and it fails, I should (or maybe should not) take into account that it was a false positive. For a parallel implementation I could then stop all tasks for blocks N, N+1, N+2, ..., then try to find the next signature and go on. That makes everything very complicated, and I don't know if it's worth the effort. I guess maybe not, but is there a chance that a parallel bzip2 implementation fails?
I'm wondering why a file format uses magic numbers as markers but doesn't include jump hints. I guess the magic numbers are important for filesystem recovery, but anyway, why can't a block contain, e.g., 16 bits telling how far to jump to the next block?
Yes, the source code you linked notes that the magic 48-bit value can show up in compressed data by chance. It also notes the probability, around 10^-14 (actually 2^-48, closer to 3.55×10^-15). That probability applies at every bit position, so on average one false positive will occur in every 32 terabytes of compressed data. That's about one month of run time on one core on my machine. Not all that long. In a production environment, you should assume that it will happen. Because it will.
Also as noted in the source you linked, due to the possibility of a false positive, you need to then validate the remainder of the block. You would not stop the subsequent possible block processing, since it is extremely likely that they are all valid blocks. Just validate all and keep the validated ones. Verify when combining that the valid blocks exactly covered the input, with no overlaps. A properly implemented parallel bzip2 decompressor will always work on valid bzip2 streams.
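A sketch of the bit-scan such an implementation performs: the 48-bit block magic 0x314159265359 is from the bzip2 format, while the function name and buffer handling are invented for illustration. Any hit may still be a false positive and must be validated by actually decoding the block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Slide a 48-bit window over the buffer one bit at a time and report every
// bit offset where the bzip2 block magic appears.
std::vector<std::size_t> findBlockMagic(const std::vector<std::uint8_t>& buf) {
    const std::uint64_t magic = 0x314159265359ULL;   // bzip2 block-start magic
    const std::uint64_t mask = (1ULL << 48) - 1;
    std::vector<std::size_t> hits;
    std::uint64_t window = 0;
    std::size_t bits = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        for (int b = 7; b >= 0; --b) {               // most significant bit first
            window = ((window << 1) | ((buf[i] >> b) & 1)) & mask;
            ++bits;
            if (bits >= 48 && window == magic)
                hits.push_back(bits - 48);           // bit offset where the magic starts
        }
    }
    return hits;
}
```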
It would need to be more than 16 bits, but yes, in principle a block could have contained the offset to the next block, since it already contains a CRC at the start of the block. Julian Seward did consider that when revising bzip2, but decided against it:
bzip2-1.0.X, 0.9.5 and 0.9.0 use exactly the same file format as the
original version, bzip2-0.1. This decision was made in the interests
of stability. Creating yet another incompatible compressed file format
would create further confusion and disruption for users.
...
The compressed file format was never designed to be handled by a
library, and I have had to jump through some hoops to produce an
efficient implementation of decompression. It's a bit hairy. Try
passing decompress.c through the C preprocessor and you'll see what I
mean. Much of this complexity could have been avoided if the
compressed size of each block of data was recorded in the data stream.

Reading the nth element from a file

I have a text file with one column of numbers. There are about 500 numbers in the column. How would I read only the nth number? For example, is there a way to read and store the 49th number in the column?
If the numbers are fixed size (you don't show a sample file), then you can seek to size * n and read. Otherwise, just do a read/parse/count loop until you reach n.
If they're stored as text, so the space occupied by each number can vary, you're pretty much stuck with either reading through them until you get to the correct point, or else using a level of indirection--that is, creating an index into the data itself.
For the former, you could (for example) store each number as a 32-bit binary number. On a typical machine that means every number occupies 4 bytes, so getting to the Nth item, you multiply N by 4, seek to that point in the file, and read 4 bytes.
If you want to store the numbers as text, but still support seeking like this, you could pad every number with spaces so they all still take up the same amount of space in the file (e.g., 10 characters for every number).
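A sketch of that fixed-width scheme, assuming every number is padded to WIDTH characters plus a newline (WIDTH, the helper name, and zero-based n are choices made for the sketch):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// With every record exactly WIDTH digits plus '\n', record n starts at
// byte n * (WIDTH + 1), so we can seek straight to it.
constexpr int WIDTH = 10;

long readNth(std::istream& in, long n) {
    in.seekg(n * (WIDTH + 1));                  // skip n fixed-size lines
    std::string field;
    std::getline(in, field);
    assert(!field.empty());                     // sketch assumes n is in range
    return std::stol(field);                    // stol skips the leading padding
}
```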
If you really want to avoid that (or have a pre-defined format so you can't do it), then you could create an index into the file. This makes sense primarily when/if you can justify the cost of reading through the entire file and storing the position of the beginning of each line. There are at least two obvious justifications. One is that the indexing can be separated from usage, such as building the index in a batch at night to optimize use of the data during the day. The other possibility is simply that you use the data file enough that the savings from being able to seek to a specific point outweighs the time to index the file once.
There are less obvious justifications as well though--for example, you might need to meet some (at least soft) real-time constraints, and if you read through the file to find each item, you might not be able to meet those constraints--in particular, the size of file you can process can be limited by the real-time constraints. In this case, an index may be absolutely necessary to meet your requirements, rather than just an optimization.
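A minimal sketch of such an index, built in one linear pass and then used for direct seeks (names are invented; it is shown on an in-memory stream here, but the same code works on an ifstream):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Record the byte offset at which each line starts. Afterwards, any line
// can be fetched with a single seek instead of re-reading the file.
std::vector<std::streampos> buildIndex(std::istream& in) {
    std::vector<std::streampos> index;
    std::string line;
    while (true) {
        std::streampos pos = in.tellg();
        if (!std::getline(in, line)) break;
        index.push_back(pos);
    }
    in.clear();                                 // clear EOF so the stream can seek again
    return index;
}

std::string lineAt(std::istream& in,
                   const std::vector<std::streampos>& index, std::size_t n) {
    in.seekg(index.at(n));                      // jump straight to line n (zero-based)
    std::string line;
    std::getline(in, line);
    return line;
}
```

Building the index costs one full pass, which is exactly the batch-indexing trade-off described above.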
To add some verbosity to pm100's answer (perhaps unnecessarily): fixed size means the same ASCII character count.
001
01
The line 001 takes up three bytes, while 01 takes up only two.
Thus if your file has numbers formatted like this:
1
2
3
100
10
Using lseek (or fseek) would only work if each column entry has the same number of ASCII chars on each line (as far as I am aware).
You also need to account for the \n character at the end of each line if you go this route.
lseek(fd, (size + 1) * n, SEEK_SET); /* size digits plus one '\n' per line */
You can do it this way:
std::ifstream fin("example.txt");
std::string line;
for (std::size_t i = 0; i < n; ++i) {
    std::getline(fin, line);
}
// line now contains the nth line (counting from 1)

Unexpected "padding" in a Fortran unformatted file

I don't understand the format of unformatted files in Fortran.
For example:
open (3,file=filename,form="unformatted",access="sequential")
write(3) matrix(i,:)
outputs a column of a matrix into a file. I've discovered that it pads the file with 4 bytes on either end, however I don't really understand why, or how to control this behavior. Is there a way to remove the padding?
For unformatted IO, Fortran compilers typically write the length of the record at the beginning and end of the record. Most, but not all, compilers use four bytes. This aids in reading records; e.g., the length at the end assists with a backspace operation. You can suppress the markers with the Stream IO mode new in Fortran 2003, which was added for compatibility with other languages: use access='stream' in your open statement.
I never use sequential access with unformatted output for this exact reason. However, it depends on the application, and sometimes it is convenient to have a record-length indicator (especially for unstructured data). As suggested by steabert in Looking at binary output from fortran on gnuplot, you can avoid the markers by using the keyword argument ACCESS='DIRECT', in which case you need to specify the record length. This method is convenient for efficient storage of large multi-dimensional structured data (constant record length). The following example writes an unformatted file whose size equals the size of the array:
REAL(KIND=4),DIMENSION(10) :: a = 3.141
INTEGER :: reclen
INQUIRE(iolength=reclen)a
OPEN(UNIT=10,FILE='direct.out',FORM='UNFORMATTED',&
ACCESS='DIRECT',RECL=reclen)
WRITE(UNIT=10,REC=1)a
CLOSE(UNIT=10)
END
Note that this is not the ideal approach in terms of portability. In an unformatted file written with direct access, there is no information about the size of each element. A readme text file that describes the data size does the job fine for me, and I prefer this method over the padding of sequential mode.
Fortran IO is record based, not stream based. Every time you write something through write() you are not only writing the data, but also beginning and end markers for that record. Both record markers are the size of that record. This is the reason why writing a bunch of reals in a single write (one record: one begin marker, the bunch of reals, one end marker) has a different size with respect to writing each real in a separate write (multiple records, each of one begin marker, one real, and one end marker). This is extremely important if you are writing down large matrices, as you could balloon the occupation if improperly written.
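For illustration, a sketch of how those record markers look to a non-Fortran reader. It assumes 4-byte little-endian markers (gfortran's default for ordinary records); the function name and buffer handling are invented:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Parse one record of a Fortran sequential unformatted file from a byte
// buffer: a 4-byte length, the payload, then the same 4-byte length again.
std::vector<char> readRecord(const std::string& raw, std::size_t& offset) {
    std::int32_t head = 0, tail = 0;
    std::memcpy(&head, raw.data() + offset, 4);
    std::vector<char> payload(raw.begin() + offset + 4,
                              raw.begin() + offset + 4 + head);
    std::memcpy(&tail, raw.data() + offset + 4 + head, 4);
    assert(head == tail);                       // leading and trailing markers must match
    offset += 8 + static_cast<std::size_t>(head);
    return payload;
}
```

This also shows why one write of a whole array is smaller than many writes of single values: each write costs 8 extra bytes of markers.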
I am quite familiar with the differing unformatted IO outputs of the Intel and Gnu Fortran compilers. Fortunately, my vast experience dating back to 1970s IBM systems allowed me to decode things. Gnu pads records with 4-byte integer counters giving the record length. Intel uses a 1-byte counter and a number of embedded coding values to signify a continuation record or the end of a count. One can still have very long record lengths even though only one byte is used.
I have software compiled with the Gnu compiler that I had to modify so it could read an unformatted file generated by either compiler, so it has to detect which format it finds. Reading an unformatted file generated by the Intel compiler (which follows the "old" IBM days) takes "forever" using Gnu's fgetc or opening the file in stream mode. Converting the file to what Gnu expects makes it up to 100 times faster. Whether you want to bother with detection and conversion depends on your file size. I reduced my program's startup time (which opens a large unformatted file) from 5 minutes to 10 seconds. I had to add options to convert back again if the user wants to take the file back to an Intel-compiled program. It's all a pain, but there you go.

Most efficient to read file into separate variables using fstream

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char-vector?
I'd really like to use fstream, but I don't exactly know how (I got it working with std::string, but the BINARYDATA part was all messed up).
The most efficient method for reading data is to read many "chunks", or records into memory using the fewest I/O function calls, then parsing the data in memory.
For example, reading 5 records with one fread call is more efficient than 5 calls to fread to read in a record. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading the file using I/O functions. Profiling will determine which is most efficient.
Fixed length records are always more efficient than variable length records. Variable length records involve either reading until a fixed size is read or reading until a terminal (sentinel) value is found. For example, a text line is a variable record and must be read one byte at a time until the terminating End-Of-Line marker is found. Buffering may help in this case.
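A sketch of one way to do this with standard streams, assuming the three numbers are '-'-separated and everything after the third '-' is the payload (the struct and field names are invented):

```cpp
#include <cassert>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

struct Record {
    int a = 0, b = 0, c = 0;                    // the 12, 3 and 125
    std::vector<char> payload;                  // the BINARYDATA
};

Record parseRecord(std::istream& in) {
    Record r;
    char dash = 0;
    in >> r.a >> dash >> r.b >> dash >> r.c >> dash;   // consumes "12-3-125-"
    assert(dash == '-');
    r.payload.assign(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>()); // raw remainder, untranslated
    return r;
}
```

One caveat: operator>> skips leading whitespace, so if the binary payload could begin with a whitespace byte, switch to unformatted reads (istream::read) once the third '-' has been located.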
What do you mean by binary data? Is it "010101000" written out character by character, or "real" binary data? If it is real binary data, just read the file as a binary file: read 2 bytes for the first int, 1 byte for the '-', 1 byte for the '3', and so on, until you reach the first position of the binary data; then get the file length and read all the rest.

Write a program that takes text as input and produces a program that reproduces that text

Recently I came across a nice problem which turned out to be as simple to understand as it is hard to solve. The problem is:
Write a program, that reads a text from input and prints some other
program on output. If we compile and run the printed program, it must
output the original text.
The input text is supposed to be rather large (more than 10000 characters).
The only (and very strong) requirement is that the size of the archive (i.e., the program printed) must be strictly less than the size of the original text. This makes obvious solutions like the following impossible:
std::string s;
/* read the text into s */
std::cout << "#include <iostream>\nint main() { std::cout << \"" << s << "\"; }";
I believe some archiving techniques are to be used here.
Unfortunately, such a program does not exist.
To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.
Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is
2^0 + 2^1 + 2^2 + ... + 2^(n-1)
Using a bit of math, we can get that this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.
This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.
That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^-10000 any string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never hit the one bad string.
For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov Complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory," chapter 14.
Hope this helps!
If we are talking about ASCII text...
I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).
People here are saying that the string cannot be compressed, yet it can.
Why?
Requirement: OUTPUT THE ORIGINAL TEXT
Text is not arbitrary data. When you read input text, you read ASCII chars (bytes), which have both printable and non-printable values.
Take this for example:
ASCII values characters
0x00 .. 0x08 NUL, (other control codes)
0x09 .. 0x0D (white-space control codes: '\t','\f','\v','\n','\r')
0x0E .. 0x1F (other control codes)
... rest of printable characters
Since you have to print text as output, you are not interested in the range (0x00-0x08,0x0E-0x1F).
You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data, only the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text data anyway, and is therefore not printable or inputtable. If WinZip did that it would be a big fail, but for your stated requirements it simply does not matter.
Since the requirement states that the text is 10000 chars and you can save 26 of 255 values, if your packing does not have any loss, you are effectively saving around 10% of the space, which means that if you can code the 'decompression' in 1000 (10% of 10000) characters, you can achieve that. You would have to treat groups of 10 bytes as 11 chars, and from there extrapolate the 11th by some extrapolation method for your range of 229. If that can be done, the problem is solvable.
Nevertheless it requires clever thinking, and coding skills that can actually do that in 1 kilobyte.
Of course this is just a conceptual answer, not a functional one.
I don't know if I could ever achieve this.
But I had the urge to give my 2 cents on this, since everybody felt it cannot be done, by being so sure about it.
The real problem in your problem is understanding the problem and the requirements.
What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.
Pseudocode:
string originalData;
(read all of the input into originalData; note that cin >> originalData would stop at the first whitespace)
char * compressedData = compress(originalData);
cout << "#include<...> string decompress(char * compressedData) { ... }" << endl;
cout << "int main() { char compressedData[] = {";
(output the int values of the elements of the compressedData array)
cout << "}; cout << decompress(compressedData) << endl; return 0; }" << endl;
Assuming "character" means "byte", and assuming the input text may contain at least as many valid characters as the programming language, it's impossible to do this for all inputs, since, as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs of smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with a bit meaning "if this is 1, the following is just the unencoded input because it couldn't be compressed further".)
Assuming it's sufficient for this to work for most inputs (e.g., inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what it's good at. Nothing is going to be much better. You can either create self-extracting archives or treat the gzip format as the "language" of the output. In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead with a format designed for this problem, i.e. gzip, will be more efficient.
It's called a file archiver producing self-extracting archives.