fstream - How to seekg() to position x from end - c++

I'm looking for a way to set my get pointer at position x from the end of an fstream.
I tried
file.seekg(-x, ios_base::end);
But according to this question, this line is undefined behavior.
How can I, in any way, seek to position x from the end of a fstream?

If you want to set your pointer at position x from the end, you need to know where the end is, so you need to begin with:
file.seekg(0, ios_base::end);
streamoff length = file.tellg();
Once you know the file length, you can set your pointer:
file.seekg(length - x, ios_base::beg);

The problem is related to the text mode, which may convert certain sequences of bytes in the file into a sequence of different size.
Take for example text mode under Windows. Here the byte sequence '\r' '\n' in a file on the disk is converted at reading time into '\n'. Now imagine that you have a file:
Hello\r\n
World\r\n
If you position yourself at file.seekg(-1, ios_base::end); the result is undefined, because it's not clear what the result should be:
Should you simply be positioned at the '\n'? But in that case, reading the file in reverse order would be inconsistent with reading it in forward order.
Should you be positioned at the '\r', since '\r' '\n' is understood as a single character? But in that case the positioning would have to be done byte by byte, with the library checking the previous byte each time, just in case.
This is by the way also the reason why you should only directly seekg() to positions previously acquired by tellg().
If you really have to do this kind of positioning from the end, open the file with ios::binary; then you're assured that a byte is always a byte, whether counting from the end or from the beginning.

Related

How to correctly buffer lines using fread

I want to use fread() for IO for some reason (speed and ...). I have a file with lines of different sizes. Using code like:
while (!feof(file)) {
    fread(buffer, buffer_size /* 500 MB */, 1, file);
    // process buffer
}
the last line may be read incompletely, and we would have to re-read it on the next iteration. So how can I force fread() to continue from the beginning of the last line?
Or, if possible, how can I force fread() to read more than 500 MB until reaching a '\n' or another specific character?
Assuming a buffer of bytes in which you have reverse-searched and found a '\n' character at position pos, you want to roll back by the length of the buffer minus pos. Call this step.
You can use fseek to move the file pointer back by this much:
int fseek( FILE *stream, long offset, int origin );
In your case, to move back by step from the current position:
int ret = fseek(stream, -step, SEEK_CUR);
This will involve re-reading part of the file, and a fair bit of jumping around - the comments have suggested alternative ways that may be quicker.

C++ std::ifstream std::ios::in versus std::ios::in | std::ios::binary, why do I get junk data?

I'm curious why, when I call read on a std::ifstream object, I get junk data if I open the file with std::ios::in, whereas I don't get junk data with std::ios::in | std::ios::binary.
I included screenshots of some messy code I've been trying stuff out with. I'm just confused why I get junk data with the first picture, when the second picture produces the correct data with the std::ios::binary flag set.
Junk data, but correct file length:
No junk data, same file length:
In text mode, certain characters may be transformed.
On cppreference, it says this about binary vs. text mode for files:
Data read in from a text stream is guaranteed to compare equal to the data that were earlier written out to that stream only if all of the following is true:
- the data consist only of printing characters and the control characters \t and \n (in particular, on Windows, the character '\x1A' terminates input)
- no \n is immediately preceded by a space character (space characters that are written out immediately before a \n may disappear when read)
- the last character is \n
At a guess I would say that some of the characters in your input file do not obey these rules.
In text mode, file positions aren't the number of bytes you can read.
So when you seek to the end, and see the file position is 24, that doesn't mean you can read 24 bytes. In fact you only read 20 bytes, but your loop ran 24 times, printing the 20 read bytes and another 4 garbage bytes from whatever was already in memory.
To find out the actual number of bytes that were read, you can call file_data.gcount() after calling file_data.read.

c++: Istream counts every newline in a .txt file as two

I've got a slight problem. It appears that for some reason my function, when counting the size of a .txt file, counts a newline as if it were two chars instead of one. Here's the function:
#define IN_FILE "in_mat.txt"
#define IN_BUF

#ifdef IN_BUF
void inBuf(char *(&b)) {
    streampos size;
    ifstream f(IN_FILE, ios::in);
    f.seekg(0, ios::end);
    size = f.tellg();
    b = new char[size];
    f.seekg(0, ios::beg);
    f.read(b, size);
    f.close();
}
#endif
And here's the read file:
2 2
1 0
0 1
2 2
i 0
0 -i
2 2
0 1
-1 0
2 2
0 i
i 0
Earlier, I put in some couts, and it appears that size = 60, while the actual size is 49 (I checked), and the number of newlines in the file is 11, so exactly 60 - 49. Could somebody help me with that?
To add to the other answers, if you want to read special characters such as newline characters, you should open your file in binary mode, not text mode.
ifstream f(IN_FILE, ios::in | ios::binary);
If you don't open the file in binary mode, the actual characters that make up the '\n' are translated by the runtime to a single character (namely '\n'). So in text mode, you don't get the "real" version of the file in terms of all of the actual characters that the file consists of.
In addition, functions such as seekg() and tellg() will not work as expected with a file opened in text mode, or at the very least will give you "wrong" results (not wrong to the functions themselves, but wrong if your program tries to home in on a particular position within the file). Again, the newline (and EOF) translation done under the hood by the runtime gets in the way of these functions working as you would expect.
On the other hand, a file opened in binary mode allows these functions to work as expected -- no translation of newline, or EOF -- whatever the individual bytes that makes up the file contents are, that is what you get.
The next thing you need to determine is whether it is a Unix text file or a Windows text file. Depending on which one it is, the line endings will be different.
Windows uses "\r\n" to return to the beginning of the line ('\r') and begin a new one ('\n').
To remove them from your count you have to read the whole file and count the number of '\r's.
Windows stores newlines as two characters: '\r\n', known as carriage return and line feed. That's why it's counted twice: there are actually two characters to be counted.
I am assuming that you are running on Windows. If not, disregard my answer below.
Windows stores new line characters in text files as two characters (CR LF or '\r' '\n'). So, seeking to the end of the file and calling tellg() will return the binary size of the file (60), not the text size (49).
In order to get the correct text size (49), one solution would be to count each new line character (11) and subtract that number from the total byte size.

Searching in a big binary file by loading blocks into a buffer

I want to know the algorithm for searching a big file block by block, loading the blocks into a memory buffer.
So I have a huge file and will read it into a small buffer, scanning each block for the "needle" word:
while ((read = fread(buff, buff_size, 1, file)))
    if (strstr(buff, needle)) printf("found!\n");
But what if the "needle" in the "hay" is cut by a block border? It will be impossible to find.
One solution I see is to fseek back a little before reading each next block (reducing the offset by the length of the "needle" string):
offset += read - strlen(needle);
if (offset > 0) fseek(file, offset ,SEEK_SET);
Am I right?
You are right that you'll need to handle the case where the search pattern spans two blocks.
You are also right that seek can be one solution.
But there are other solutions which don't use seek.
Solution 1
A solution could be to copy the last part of the buffer, i.e. the last strlen(needle) characters, into a little buffer capable of holding 2 times strlen(needle) characters.
Then, when reading the next block, you copy the first strlen(needle) characters of the new buffer into the little buffer, so that they are concatenated with the part kept from the end of the previous buffer.
Finally, you search the little buffer for needle.
Solution 2
A solution could be to read from file into buffer + strlen(needle), i.e. avoid overwriting the first strlen(needle) characters of buffer. The number of characters read from the file must be decreased accordingly (i.e. buff_size - strlen(needle))
When done with a buffer, you copy the last strlen(needle) characters to the start of buffer and read more data from the file into buffer + strlen(needle)
For the first search in buffer, you'll have to skip the first strlen(needle) characters (or make sure they don't match your pattern, e.g. by initialization). Subsequent searches shall search the whole buffer.

Accessing information in a ".txt" file and going to a specific row

When accessing a text file, I want to read from a specific line. Let's suppose that my file has 1000 rows and I want to read row 330. Each row has a different number of characters and could possibly be quite long (let's say around 100,000,000 characters per row). I'm thinking fseek() can't be used effectively here.
I was thinking about a loop to track linebreaks, but I don't know exactly how to implement it, and I don't know if that would be the best solution.
Can you offer any help?
Unless you have some kind of index saying "line M begins at position N" in the file, you have to read characters from the file and count newlines until you find the desired line.
You can easily read lines using std::getline if you want to save the contents of each line, or std::istream::ignore if you want to discard the contents of the lines you read until you find the desired line.
There is no way to know where row 330 starts in an arbitrary text file without scanning the whole file, finding the line breaks, and then counting.
If you only need to do this once, then scan. If you need to do it many times, you can scan once and build a data structure listing where all of the lines start. Then you can figure out where to seek to read just that line. If you're still just deciding how to organize the data, I would suggest some other type of data structure built for random access; I can't recommend which one without knowing the actual problem you are trying to solve.
Create an index on the file. You can do this "lazily", but as you read each buffer full you may as well scan it for newlines.
If it is a Windows text file that uses a two-byte line ending ('\r' '\n'), the number of characters you have read up to the newline will not equal the file offset. So what you should do is a tellg() after each call to getline().
something like:
std::vector<off_t> lineNumbers;
std::string line;
lineNumbers.push_back(0); // first line begins at offset 0
while (std::getline(ifs, line))
{
    lineNumbers.push_back(ifs.tellg());
}
The last value will tell you where EOF is.
I think you need to scan the file and count the '\n' occurrences until you find the desired line. If this is a frequent operation, and you are the only one writing the file, you could maintain an index file with this information alongside the data file, a sort of poor man's index, which can save a lot of time.
Try running fgets in a loop:
/* fgets example */
#include <stdio.h>

int main()
{
    FILE *pFile;
    char mystring[100];

    pFile = fopen("myfile.txt", "r");
    if (pFile == NULL) perror("Error opening file");
    else {
        fgets(mystring, 100, pFile);
        puts(mystring);
        fclose(pFile);
    }
    return 0;
}