Reading the nth element from a file - c++

I have a text file with one column of numbers. There are about 500 numbers in the column. How would I read only the nth number? For example, is there a way to read and store the 49th number in the column?

If the numbers are a fixed size (you don't show a sample file), then you can seek to size * n and read. Otherwise, just do a read/parse/count loop until you reach n.

If they're stored as text, so the space occupied by each number can vary, you're pretty much stuck with either reading through them until you get to the correct point, or else using a level of indirection--that is, creating an index into the data itself.
To avoid reading through the data every time, you could fix the size of each record. For example, you could store each number as a 32-bit binary number. On a typical machine that means every number occupies 4 bytes, so to get to the Nth item you multiply N by 4, seek to that point in the file, and read 4 bytes.
If you want to keep the numbers as text but still support seeking like this, you could pad every number with spaces so they all take up the same amount of space in the file (e.g., 10 characters for every number).
If you really want to avoid that (or have a pre-defined format so you can't do it), then you could create an index into the file. This makes sense primarily when/if you can justify the cost of reading through the entire file and storing the position of the beginning of each line. There are at least two obvious justifications. One is that the indexing can be separated from usage, such as building the index in a batch at night to optimize use of the data during the day. The other possibility is simply that you use the data file enough that the savings from being able to seek to a specific point outweighs the time to index the file once.
There are less obvious justifications as well though--for example, you might need to meet some (at least soft) real-time constraints, and if you read through the file to find each item, you might not be able to meet those constraints--in particular, the size of file you can process can be limited by the real-time constraints. In this case, an index may be absolutely necessary to meet your requirements, rather than just an optimization.
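As a rough illustration of the index idea, here is a minimal sketch (assuming one number per line; the file name and function names are made up for the example):
#include <fstream>
#include <string>
#include <vector>

// Build an index of the byte offset at which each line starts.
std::vector<std::streampos> build_line_index(std::ifstream &in) {
    std::vector<std::streampos> index;
    std::string line;
    index.push_back(in.tellg());      // offset of the first line
    while (std::getline(in, line))
        index.push_back(in.tellg());  // offset of the line after the one just read
    index.pop_back();                 // the last entry is end-of-file, not a line
    return index;
}

// Read the number on line n (0-based) by seeking straight to it.
long read_nth(std::ifstream &in, const std::vector<std::streampos> &index, std::size_t n) {
    in.clear();                       // clear the EOF state left by the indexing pass
    in.seekg(index.at(n));
    long value = 0;
    in >> value;
    return value;
}
Building the index costs one full pass over the file; every lookup after that is a single seek and a short read.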

To expand on pm100's answer (perhaps unnecessarily): "fixed size" means the same ASCII character count on every line.
001
01
Line 001 takes up 3 bytes while 01 only takes up two.
Thus if your file has numbers formatted like this:
1
2
3
100
10
Using lseek (or fseek) would only work if each column entry has the same number of ASCII chars for each line (as far as I am aware).
You also need to account for the \n character at the end of each line if you go this route.
lseek(fd, (size + 1) * n, SEEK_SET); /* size digits + 1 byte for '\n' on each line */
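The same idea in iostream terms -- a minimal sketch assuming every line is padded to a fixed width of W digits plus a single '\n' (W and the file name are assumptions you must guarantee when writing the file):
#include <fstream>
#include <iostream>

int main() {
    const std::streamoff W = 10;        // assumed fixed digit width per line
    const std::streamoff n = 48;        // 0-based index of the 49th number

    std::ifstream fin("numbers.txt");   // hypothetical file name
    fin.seekg(n * (W + 1));             // W digits + 1 byte for '\n' per line
    long value = 0;
    fin >> value;
    std::cout << "number " << n + 1 << " is " << value << '\n';
}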

You can do it this way:
ifstream fin("example.txt");
std::string line;
for(size_t i = 0; i < n; ++i) {
std::getline(fin, line);
}
// line contains line n;

Related

Interleaving bzip2 and non-bzip2 data

I am looking at making a file format that interleaves two types of chunks of raw bytes.
One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).
The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).
The two magic numbers would be used for seeking, identifying and processing the two data block types differently.
My question is: Is there a magic number I can pick for the second block type which would be very unlikely (or better, impossible) to appear inside a bzip2-compressed block of bytes?
In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?
One option is that, when I find the header bytes for the second block type, I simply try to process the data as that block type, and if the processing fails, I assume I am accidentally inside a compressed bzip2 block. But I'd like to know whether there are bytes that cannot appear in a bzip2 block, or at least are unlikely to.
No. bzip2 compressed data can contain any pair of bytes, essentially all with equal probability. All you could do would be to define a longer series of bytes as the signature, to reduce the probability that that series accidentally appears in the compressed data. But it still could.
The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
To answer the question in a comment, the entire bzip2 stream necessarily terminates on a byte boundary. The last byte may have 0 to 7 bits of zero pad. You can search backwards from the start of your second stream component to look for the bzip2 end marker 0x177245385090 (first 12 decimal digits of the square root of pi), which can start at any bit in a specific byte. It would be 80 to 87 bits back.
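For illustration, a minimal sketch of that backward scan (the 48-bit end-of-stream magic can start at any bit offset, so all eight alignments have to be tested; the function name and the brute-force bit assembly are assumptions, not part of the bzip2 API):
#include <cstddef>
#include <cstdint>

// Search a buffer for the bzip2 48-bit end-of-stream marker 0x177245385090,
// which may start at any bit offset. Returns the bit offset of the marker
// (counted from the start of data, MSB-first within each byte), or -1.
long long find_bzip2_end_marker(const unsigned char *data, std::size_t size) {
    const std::uint64_t magic = 0x177245385090ULL;   // first 12 digits of sqrt(pi)
    if (size * 8 < 48) return -1;
    for (long long bit = (long long)(size * 8) - 48; bit >= 0; --bit) {
        std::uint64_t v = 0;
        for (int i = 0; i < 48; ++i) {
            long long b = bit + i;
            v = (v << 1) | ((data[b / 8] >> (7 - (b % 8))) & 1u);  // MSB-first bit order
        }
        if (v == magic) return bit;
    }
    return -1;
}
In practice you would only scan the last dozen or so bytes before the start of your second block, since the marker is known to sit 80 to 87 bits back.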

rdbuf vs getline vs ">>"

I want to load a map from a text file (If you can come up with whatever else way to load a map to an array, I'm open for anything new).
What's written in the text file is something like this, but a bit larger in scale:
6 6 10 (never mind what the "10" is; the other two numbers are the map size)
1 1 1 1 1 1
1 0 2 0 0 1
1 0 0 0 2 1
1 2 2 0 0 1
1 0 0 0 0 1
1 1 1 1 1 1
Where 1 is border, 0 is empty, 2 is wall.
Now I want to read this text file, but I'm not sure what way would be best.
What I have in mind so far is:
Reading the whole text file at once into a stringstream, converting it to a string later via rdbuf(), and then splitting the string and putting it into the array.
Reading it number by number via getline().
Reading it number by number using the >> operator.
My question is which of the mentioned ways (or any other way, if available) is better in terms of RAM use and speed.
Note: I also wonder whether or not using rdbuf() is a good approach. I'd appreciate a good comparison between different ways of splitting a string, for example splitting a text into words on whitespace.
Where 1 is border, 0 is empty, 2 is wall. Now I want to read this text file, but I'm not sure what way would be best. What I have in mind so far is:
You don't have enough data to make a significant impact on performance by any of the means you mentioned. In other words, concentrate on correctness and robustness of your program then come back and optimize the parts that are slow.
Reading the whole text file at once in a stringstream and convert it to string later via rdbuf() and then split the string and put it in the array.
The best method for inputting data is to keep the input stream flowing. This usually means reading large chunks of data per transaction versus many small transactions of small quantities. Memory is a lot faster to search and process than an input stream.
I suggest using istream::read before using rdbuf. For either one, I recommend reading into a preallocated area of memory, that is, either an array or, if you use std::string, a string whose space you reserve up front when constructing it. You don't want reallocation of the std::string data to slow your program.
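For example, a rough sketch of the whole-file read into a preallocated std::string (the file name and the assumption that the file fits in memory are mine):
#include <fstream>
#include <string>

std::string read_whole_file(const char *path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return std::string();             // no such file: return empty
    in.seekg(0, std::ios::end);
    const std::streamsize size = in.tellg();   // file size in bytes
    in.seekg(0, std::ios::beg);

    std::string data(static_cast<std::size_t>(size), '\0');  // allocate once, up front
    in.read(&data[0], size);                   // one large read into the buffer
    return data;                               // split / parse in memory afterwards
}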
Reading it number by number via getline().
Since your data is line oriented, this could be beneficial. You read one row and process that one row. It is a good technique to start with: a bit more complicated than the one below, but simpler than the previous method.
Reading it number by number using the >> operator.
IMO, this is the technique you should be using. The technique is simple and easy to get working; enabling you to work on the remainder of your project.
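A minimal sketch of that approach for the map format shown in the question (variable names and the file name are assumptions):
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream fin("map.txt");        // hypothetical file name
    int width = 0, height = 0, extra = 0;
    fin >> width >> height >> extra;     // the "6 6 10" header line

    std::vector<std::vector<int>> map(height, std::vector<int>(width));
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            fin >> map[y][x];            // operator>> skips spaces and newlines

    std::cout << "cell (1,2) = " << map[1][2] << '\n';   // 2 (a wall) for the sample map
}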
Changing the Data Format
If you want to make the input faster, you can change the format of the data. Binary data, data that doesn't need translations, is the fastest format to read. It bypasses the translation of textual format to internal representation. The binary data is the internal representation.
One of the caveats to binary data is that it is hard to read and modify.
Optimizing
Don't. Focus on finishing the project: correctly and robustly.
Don't. Usually, the time you gain is wasted waiting for I/O or the user. Development time is costly; unnecessary optimization is a waste of development time and thus a waste of money.
Profile your executable. Optimize the parts that occupy the most execution time.
Reduce requirements / Features before changing code.
Optimize the design or architecture before changing the code.
Change compiler optimization settings before changing the code.
Change data structures & alignment for cache optimization.
Optimize I/O if your program is I/O bound.
Reduce branches / jumps / changes in execution flow.

Fast code for searching bit-array for contiguous set/clear bits?

Is there some reasonably fast code out there which can help me quickly search a large bitmap (a few megabytes) for runs of contiguous zero or one bits?
By "reasonably fast" I mean something that can take advantage of the machine word size and compare entire words at once, instead of doing bit-by-bit analysis which is horrifically slow (such as one does with vector<bool>).
It's very useful for e.g. searching the bitmap of a volume for free space (for defragmentation, etc.).
Windows has an RTL_BITMAP data structure one can use along with its APIs.
But I needed the code for this some time ago, so I wrote it here (warning, it's a little ugly):
https://gist.github.com/3206128
I have only partially tested it, so it might still have bugs (especially on reverse). But a recent version (only slightly different from this one) seemed to be usable for me, so it's worth a try.
The fundamental operation for the entire thing is being able to -- quickly -- find the length of a run of bits:
long long GetRunLength(
    const void *const pBitmap, unsigned long long nBitmapBits,
    long long startInclusive, long long endExclusive,
    const bool reverse, /*out*/ bool *pBit);
Everything else should be easy to build upon this, given its versatility.
I tried to include some SSE code, but it didn't noticeably improve the performance. However, in general, the code is many times faster than doing bit-by-bit analysis, so I think it might be useful.
It should be easy to test if you can get a hold of vector<bool>'s buffer somehow -- and if you're on Visual C++, then there's a function I included which does that for you. If you find bugs, feel free to let me know.
I can't figure out how to do this well directly on memory words, so I've made up a quick solution that works on bytes; for convenience, let's sketch the algorithm for counting contiguous ones:
Construct two tables of size 256 where, for each number between 0 and 255, you record the number of consecutive 1-bits at the beginning and at the end of the byte. For example, for the number 167 (10100111 in binary), put 1 in the first table and 3 in the second. Let's call the first table BBeg and the second table BEnd. Then, for each byte b, there are two cases: if it is 255, add 8 to the length of your current contiguous run of ones -- you are still inside a region of ones. Otherwise, you end a region with BBeg[b] bits and begin a new one with BEnd[b] bits.
Depending on what information you want, you can adapt this algorithm (this is one reason I don't put any code here: I don't know what output you want).
A flaw is that it does not count (small) contiguous sets of ones that lie entirely inside one byte ...
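For what it's worth, here is a small sketch of the byte-wise counting just described, finding the longest run of ones that crosses byte boundaries (it assumes bits are stored MSB-first within each byte; swap the two tables for LSB-first storage):
#include <algorithm>
#include <cstddef>

static int BBeg[256], BEnd[256];   // ones at the beginning / end of each byte value

static void init_tables() {
    for (int b = 0; b < 256; ++b) {
        int beg = 0, end = 0;
        while (beg < 8 && (b & (0x80 >> beg))) ++beg;  // leading ones (MSB side)
        while (end < 8 && (b & (1 << end)))    ++end;  // trailing ones (LSB side)
        BBeg[b] = beg;
        BEnd[b] = end;
    }
}

std::size_t longest_ones_run(const unsigned char *bytes, std::size_t n) {
    init_tables();
    std::size_t best = 0, current = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char b = bytes[i];
        if (b == 0xFF) {
            current += 8;                              // still inside a region of ones
        } else {
            best = std::max(best, current + BBeg[b]);  // the region ends inside this byte
            current = BEnd[b];                         // a new region may start at its end
        }
    }
    return std::max(best, current);
}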
Besides this algorithm, a friend tells me that if this is for disk compression, you can just look for bytes different from 0 (empty disk area) and 255 (full disk area). It is a quick heuristic to build a map of which blocks you have to compress. Maybe it is beyond the scope of this topic ...
Sounds like this might be useful:
http://www.aggregate.org/MAGIC/#Population%20Count%20%28Ones%20Count%29
and
http://www.aggregate.org/MAGIC/#Leading%20Zero%20Count
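If you can use C++20, those same word-at-a-time primitives are available directly in the <bit> header; a tiny sketch (the sample word is arbitrary):
#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t word = 0x00000000FFFF0000ULL;

    std::printf("set bits      : %d\n", std::popcount(word));      // population count
    std::printf("leading zeros : %d\n", std::countl_zero(word));   // zeros from the MSB side
    std::printf("trailing zeros: %d\n", std::countr_zero(word));   // zeros from the LSB side
}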
You don't say whether you want to do some sort of RLE or to simply count zero and one bits within each byte (e.g. 0b1001 should return 1x1 2x0 1x1).
A look-up table plus a SWAR algorithm for a fast check could give you that information easily.
A bit like this:
byte lut[0x10000] = { /* see below */ };
for (uint *word = words; word < words + bitmapSize; word++) {
    if (*word == 0 || *word == (uint)-1) { // Fast bailout
        // Do what you want if the word is all 0s or all 1s
        continue;
    }
    byte hiVal = lut[*word >> 16], loVal = lut[*word & 0xFFFF];
    // Do what you want with hiVal and loVal
}
The LUT has to be constructed depending on your intended algorithm. If you want to count the number of contiguous 0s and 1s in each word, you'd build it like this:
for (int i = 0; i < sizeof(lut); i++)
    lut[i] = countContiguousZero(i); // Or countContiguousOne(i)
// The implementation of countContiguousZero can be slow; you don't care here.
// It should return the largest number of contiguous zeros (0 to 15, in the 4 low
// bits of the byte) and could return the position of the run in the 4 high bits.
// Since you've already dismissed *word == 0, you don't need the 16-contiguous-zero case.

Most efficient to read file into separate variables using fstream

I have tons of files which look a little like:
12-3-125-BINARYDATA
What would be the most efficient way to save the 12, 3 and 125 as separate integer variables, and the BINARYDATA as a char-vector?
I'd really like to use fstream, but I don't exactly know how to (got it working with std::strings, but the BINARYDATA part was all messed up).
The most efficient method for reading data is to read many "chunks", or records, into memory using the fewest I/O function calls, then parse the data in memory.
For example, reading 5 records with one fread call is more efficient than 5 calls to fread that each read one record. Accessing memory is always faster than accessing external data such as files.
Some platforms have the ability to memory-map a file. This may be more efficient than reading with I/O functions. Profiling will determine which is the most efficient.
Fixed length records are always more efficient than variable length records. Variable length records involve either reading until a fixed size is read or reading until a terminal (sentinel) value is found. For example, a text line is a variable record and must be read one byte at a time until the terminating End-Of-Line marker is found. Buffering may help in this case.
What do you mean by binary data? Is it "010101000" stored character by character, or "real" binary data? If it is real binary data, just read the file as a binary file: read 2 bytes for the first int, the next byte for '-', 1 byte for '3', and so on, until you reach the first position of the binary data; then get the file length and read all the rest of it.
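As a rough sketch of that with fstream (assuming the header is exactly three integers separated by '-' and everything after the last '-' is the raw payload):
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream in("data.bin", std::ios::binary);   // hypothetical file name

    int a = 0, b = 0, c = 0;
    char sep;
    in >> a >> sep >> b >> sep >> c >> sep;           // parses "12-3-125-"

    // Everything that follows is the binary payload; read it to end of file.
    std::vector<char> payload((std::istreambuf_iterator<char>(in)),
                              std::istreambuf_iterator<char>());

    std::cout << a << ' ' << b << ' ' << c
              << " + " << payload.size() << " payload bytes\n";
}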

Write a program that takes text as input and produces a program that reproduces that text

Recently I came across a nice problem which turned out to be as simple to understand as it is hard to solve. The problem is:
Write a program, that reads a text from input and prints some other
program on output. If we compile and run the printed program, it must
output the original text.
The input text is supposed to be rather large (more than 10000 characters).
The only (and very strong) requirement is that the size of the archive (i.e. the program printed) must be strictly less than the size of the original text. This rules out obvious solutions like:
std::string s;
/* read the text into s */
std::cout << "#include<iostream> int main () { std::cout<<\"" << s << "\"; }";
I believe some archiving techniques are to be used here.
Unfortunately, such a program does not exist.
To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2^n binary strings of length n.
Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is
2^0 + 2^1 + 2^2 + ... + 2^(n-1)
Using a bit of math, we can see that this number is equal to 2^n - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
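Written out, the identity used here is the finite geometric series \sum_{k=0}^{n-1} 2^k = 2^n - 1; for example, with n = 3 we get 1 + 2 + 4 = 7 = 2^3 - 1.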
But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless: if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.
This is where we reach a problem. Since there are 2^n different bitstrings of length n and only 2^n - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string. This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2^10000 strings of length 10000 onto the 2^10000 - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.
That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that with probability 1 - 2^-10000 any string of length 10000 could be compressed. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.
For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov Complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory," chapter 14.
Hope this helps!
If we are talking about ASCII text...
I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).
People here are saying that the string cannot be compressed, yet it can.
Why?
Requirement: OUTPUT THE ORIGINAL TEXT
Text is not data. When you read input text you read ASCII chars (bytes), which include both printable and non-printable values.
Take this for example:
ASCII values characters
0x00 .. 0x08 NUL, (other control codes)
0x09 .. 0x0D (white-space control codes: '\t','\f','\v','\n','\r')
0x0E .. 0x1F (other control codes)
... rest of printable characters
Since you have to print text as output, you are not interested in the range (0x00-0x08,0x0E-0x1F).
You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data, only the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text anyway, and is therefore not printable or usable as input. If WinZip did that it would be a big fail, but for your stated requirements it simply does not matter.
Since the requirement states that the text is 10000 chars and you can save 26 of 255 values, if your packing has no loss you are effectively saving around 10% of the space, which means that if you can code the 'decompression' in 1000 (10% of 10000) characters you can achieve it. You would have to treat groups of 10 bytes as 11 chars, and from there extrapolate the 11th by some extrapolation method for your range of 229 values. If that can be done, the problem is solvable.
Nevertheless, it requires clever thinking and coding skills to actually do that in 1 kilobyte.
Of course this is just a conceptual answer, not a functional one.
I don't know if I could ever achieve this.
But I had the urge to give my 2 cents on this, since everybody felt it couldn't be done and was so sure about it.
The real problem in your problem is understanding the problem and the requirements.
What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.
Pseudocode:
string originalData;
/* read ALL of standard input into originalData
   (note: cin >> originalData would stop at the first whitespace) */
char *compressedData = compress(originalData);
cout << "#include <...>\nstring decompress(const char *compressedData) { ... }" << endl;
cout << "int main() { char compressedData[] = {";
/* output the int values of the elements of the compressedData array */
cout << "}; cout << decompress(compressedData) << endl; return 0; }" << endl;
Assuming "character" means "byte" and assuming the input text may contains at least as many valid characters as the programming language, its impossible to do this for all inputs, since as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs with smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with a "if this is 1, the following is just the unencoded input because it couldn't be compressed further" bit)
Assuming it's sufficient to have this work for most inputs (e.g. inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what it's good at. Nothing is going to be much better. You can either create self-extracting archives, or treat the gzip format as the "language" of the output. In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead by using a format designed for this problem, i.e. gzip, will be more efficient.
It's called a file archiver producing self-extracting archives.