eof from string, not a stream - c++

I have a secret "mission" to write Vigenère cipher with it's analysis with ascii alphabet.
I have some troubles with encrypting text.
There are two kinds of them:
1) If I use whole ascii table, there are some troubles with decrypting text, because i use "system" chars that kills my text (by the way, it is "War and Peace" written by Tolstoy). Should i use it truncated version?
if yes, so - could i do operations from next question with truncated ascii table?
2) I want to have whole my text in one string. I can do it by this:
string s;
string p = "";
ifstream in("text_for_encryption.txt");
while (getline(in, s))
{
p+=s;
p+="\n";
}
"s" is the temporary string, and "p" is the string that has all text from file in it (with endl's and, of course, EOF)
i will make a cycle for "p" which looks like as
while (not eof in p)
{
take first keyword.length() chars from "p"? check every of them for EOF and encrypt them. (they will be deleted from p)
kick them in file "encrypted_text.txt"
}
in pseudocode (yeah, it is shit-like :( ).
so, the question is - how can i compare a string element with eof?
maybe, i can't google good, but i couldn't find the answer for this question.
Thanks in advance for every advice!
Update:
if i will encrypt string-by-string, it wll be easy to get a length of a key by Fridman's method (if the key is quite small).
so i want to encrypt text with endl's for more security

For encrypting, it depends largely on what you want to encrypt,
and what you want to do with the encrypted text. The usual
solution is to encrypt the bytes values (not the characters);
this means that you'll have to read and write the encrypted file
in binary mode, but since it's not meant to be readable anyway,
that's usually not an issue.
For the rest, strings do not have "EOF" characters. In fact,
there is no such thing as an EOF character[1]. (Nor en endl
character, either.) EOF is, in fact, an "event" which occurs
when reading from a stream; in C++, it is, in fact, treated as
a sort of an error. std::istream functions which can return
EOF (e.g. std::istream::get()) return int, and not char,
in order to be able to return an out of band value.
Strings do have a known length. To visit all of the characters
in a string:
for ( std::string::const_iterator current = s.begin();
current != s.end();
++ current ) {
// Do something with *current...
}
(If you have C++11, you can replace
std::string::const_iterator with auto. This is much simpler
to type, but until you master the iterator idioms, it's probably
better to write the type out, to ensure you understand what is
going on.)
[1] Historically, text files have had EOF characters on some
systems. This is not the end of file that you see with
std::istream::get(), but even today, if you open a file in
text mode under Windows, a 0x1A in the file will trigger the end
of file event in the input.

Related

Number getting stored as special character in C++

#include<fstream>
#include<string.h>
#include<iostream>
using namespace std;
class contact
{
long long ph;
unsigned char name[20],add[50],email[30];
public:
void create_contact()
{
cout<<"Phone: ";
cin>>ph;
cout<<"Name: ";
cin.ignore();
cin>>name;
cout<<"Address: ";
cin.ignore();
cin>>add;
cout<<"Email address: ";
cin.ignore();
cin>>email;
cout<<"\n";
}
void show_contact()
{
cout<<endl<<"Phone Number: "<<ph;
cout<<endl<<"Name: "<<name;
cout<<endl<<"Address: "<<add;
cout<<endl<<"Email Address : "<<email;
}
long long getPhone()
{
return ph;
}
unsigned char* getName()
{
return name;
}
unsigned char* getAddress()
{
return add;
}
unsigned char* getEmail()
{
return email;
}
};
fstream fp;
contact cont;
void save_contact()
{
fp.open("contactBook.txt",ios::out|ios::app);
cont.create_contact();
fp.write((char*)&cont,sizeof(contact));
fp.close();
cout<<endl<<endl<<"Contact Has Been Sucessfully Created...";
getchar();
Hey there, I am new to C++ as well as this community and in this is the code that I have been working on, the phone number of the contact is getting saved as random special characters. This is half of the code where I think the problem occurs Any ideas on how I could fix it? It would be of much help. Thanks!
I take it you expected to see the phone number written out in your text file as something like "15551234567." However, long long is not stored in this form in memory. It's actually stored as a 64-bit binary integer. The special characters you describe are likely the encoded version of that integer. If you read the data back in, you should find that it is still an integer.
However, there is one remaining issue. You are missing ios::binary on the fstream open command. Each of the ios flag imbues the stream with a particular behavior:
ios::out - indicates that this stream should be an output stream that you can write bytes to
ios::app - indicates that this stream should be opened in "append" mode. This means that it will not erase the contents of the file every time you open it, and any bytes outputted to the stream are appended to the end of the file.
ios::binary - opens the file in binary mode, which is needed when you want to input/output binary data, rather than just text.
You want to open the file with ios::out | ios::app | ios::binary. Forgetting binary is going to lead to very difficult to debug errors.
Now binary mode is a bit of a pest. Sorry for this being a long read, but its a lot easier to come to grips with this flag if you understand the history behind it.
Way back in the early days of computing, there was a disagreement about how to write a newline into a file. This way the days of typewriters, where starting a new line was broken into two actions. There was "carriage return" which moved the sliding bit of the typewriter back to the start of the line (this was the loud part of the motion), and there was a "line feed" which moved the paper up one spot. Each of these were separate actions, so they were given separate characters in ASCII, one of the the definitive ways to write text as a string of bytes. The 8-bit number 10 encoded a line feed (aka LF), and the 8-bit number 13 encoded a carriage return (aka CR). This would permit one to do things like overtyping, a trick where one types one character (like a letter) and then goes back to add another over the top (like an accent). You might write à by first typing a, doing a "carriage return" and then writing a `, just like you did on a typewriter.
Some operating systems (such as Windows) encoded the start of the next line as both of these characters, so you'd see CR LF in a text file. Other operating systems (such as Unix) decided that it wasn't worth wasting a precious byte at the end of every line, so they chose to represent the start of a new line just with a LF. Others (such as Macintosh), decided to represent the new line as CR. Nobody could agree.
To deal with this, many file reading/writing APIs treat these characters specially. fopen and fstream follow a pattern where if they see a CR LF or a CR in a text file, they silently turn it into a LF character when read. This lets you read every file type. Likewise, if it sees a LF character when writing, it expands it to whatever the platform specified a new line should look like. This lets you write cross-platform code which writes text files without having to pay attention to which new line character is used on each platform!
However, this causes huge problems for binary data. Consider the number 302,844,416 written as a 32 bit number. In hexadecimal, we would write that as 0x120D0A00 (hex is a popular way to write numbers in programming because every byte can be written as 2 characters in hex). The issue is the middle two bytes of the number, 0x0D and 0x0A. In decimal, these are 13 and 10, which you should recognize as the same bytes as CR and LF.
If the program tries to read that number in "text mode," it will see the CR LF pair, and turn it into just a single LF, per the C rules. Now, instead of our number being 0x120D0A00, its 0x120A00XX, where the XX is whatever the next byte was in the file. Very bad things! Not only is this data corrupted, but you probably needed the next byte for whatever came next in the file!
ios::binary and the "b" flag for fopen resolve this. They tell C/C++ that the data is going to be binary. There wont be any new lines to convert. If you write bytes to a binary stream, they get written directly to the file, without any clever attempts to handle new lines.
Your phone number is stored as a long long, which is a binary integer format. Without ios::binary, you run the risk of the number just happening to have a CR LF pair in it, and fstream will corrupt your data. ios::binary tells fstream to not mess with the data in that way.

what's exactly the string of "^A" is?

I run my code on an online judgement. I log the string, key. Below is my code:
fprintf(stderr, "key=%s, and key.size()=%d\n", key.c_str(), key.size());
But the result is this:
key=^A, and key.size()=8
I want to what is the ^A represent in ascii. ^A's size is 2 rather than 8, but it shows that it is 8. I view the result by vim, and the log_file is encoded by UTF-8. Why?
Your viewer is electing to show you the bytes interpreted using a character encoding of its choosing and electing to show the resulting characters in caret notation.
Other viewers could make different choices on both counts or allow you to indicate what you want. For example, control picture characters (␁) instead of caret notation.
For a std:string c_str() is terminated by an additional \x00 byte following the actual value. You often use c_str() with functions that expect a string to be \x00 terminated. This applies to fprintf. In such cases, what's read ends just before the first \x00 seen.
You have several \x00 bytes in your string, which, of course, contributes to size() but fprintf will stop right at the first one (and not count it).
I have solve it by myself. If you write a std::string "\x01\x00\x00\x00\x00end" to a file and open it with vim later, you will get '^A'.
This is my test code:
string sss("\x01\x00\x00\x00\x00end");
ofstream of("of.txt");
for (int i=0; i<sss.size(); i++) {
of.put(sss[i]);
}
of.close();
After I open the file "of.txt", I saw "^A";

std::getline deal with \n, \r and \r\n [duplicate]

Specifically I'm interested in istream& getline ( istream& is, string& str );. Is there an option to the ifstream constructor to tell it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.
Update: To clarify, I want to be able to write code that compiles almost anywhere, and will take input from almost anywhere. Including the rare files that have '\r' without '\n'. Minimizing inconvenience for any users of the software.
It's easy to workaround the issue, but I'm still curious as to the right way, in the standard, to flexibly handle all text file formats.
getline reads in a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n' that gets included into the string.
There are three types of line endings seen in text files:
'\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' following by '\n'.
The problem is that getline leaves the '\r' on the end of the string.
ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a non-empty line was read
// BUT, there might be an '\r' at the end now.
}
Edit Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.
I can remove it manually myself (see edit of this question), which is easy for the Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!
.. and that's not even considering Unicode :-)
.. maybe Boost has a nice way to consume one line at a time from any text-file type?
Edit I'm using this, to handle the Windows files, but I still feel I shouldn't have to! And this won't fork for the '\r'-only files.
if(!line.empty() && *line.rbegin() == '\r') {
line.erase( line.length()-1, 1);
}
As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."
However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):
std::istream& safeGetline(std::istream& is, std::string& t)
{
t.clear();
// The characters in the stream are read one-by-one using a std::streambuf.
// That is faster than reading them one-by-one using the std::istream.
// Code that uses streambuf this way must be guarded by a sentry object.
// The sentry object performs various tasks,
// such as thread synchronization and updating the stream state.
std::istream::sentry se(is, true);
std::streambuf* sb = is.rdbuf();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return is;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return is;
case std::streambuf::traits_type::eof():
// Also handle the case when the last line has no line ending
if(t.empty())
is.setstate(std::ios::eofbit);
return is;
default:
t += (char)c;
}
}
}
And here is a test program:
int main()
{
std::string path = ... // insert path to test file here
std::ifstream ifs(path.c_str());
if(!ifs) {
std::cout << "Failed to open the file." << std::endl;
return EXIT_FAILURE;
}
int n = 0;
std::string t;
while(!safeGetline(ifs, t).eof())
++n;
std::cout << "The file contains " << n << " lines." << std::endl;
return EXIT_SUCCESS;
}
Are you reading the file in BINARY or in TEXT mode? In TEXT mode the pair carriage return/line feed, CRLF, is interpreted as TEXT end of line, or end of line character, but in BINARY you fetch only ONE byte at a time, which means that either character MUST be ignored and left in the buffer to be fetched as another byte! Carriage return means, in the typewriter, that the typewriter car, where the printing arm lies in, has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. Then the line feed means that the paper roll is rotated a little bit up so the paper is in position to begin another line of typing. As fas as I remember one of the low digits in ASCII means move to the right one character without typing, the dead char, and of course \b means backspace: move the car one character back. That way you can add special effects, like underlying (type underscore), strikethrough (type minus), approximate different accents, cancel out (type X), without needing an extended keyboard, just by adjusting the position of the car along the line before entering the line feed. So you can use byte sized ASCII voltages to automatically control a typewriter without a computer in between. When the automatic typewriter is introduced, AUTOMATIC means that once you reach the farthest edge of the paper, the car is returned to the left AND the line feed applied, that is, the car is assumed to be returned automatically as the roll moves up! So you do not need both control characters, only one, the \n, new line, or line feed.
This has nothing to do with programming but ASCII is older and HEY! looks like some people were not thinking when they begun doing text things! The UNIX platform assumes an electrical automatic typemachine; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character, 0x07 if I remember well... Some forgotten texts must have been originally captured with control characters for electrically controlled typewriters and it perpetuated the model...
Actually the correct variation would be to just include the \r, line feed, the carriage return being unnecessary, that is, automatic, hence:
char c;
ifstream is;
is.open("",ios::binary);
...
is.getline(buffer, bufsize, '\r');
//ignore following \n or restore the buffer data
if ((c=is.get())!='\n') is.rdbuf()->sputbackc(c);
...
would be the most correct way to handle all types of files. Note however that \n in TEXT mode is actually the byte pair 0x0d 0x0a, but 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY, so \n and \r\n are equivalent... or should be. This is a very basic industry confusion actually, typical industry inertia, as the convention is to speak of CRLF, in ALL platforms, then fall into different binary interpretations. Strictly speaking, files including ONLY 0x0d (carriage return) as being \n (CRLF or line feed), are malformed in TEXT mode (typewritter machine: just return the car and strikethrough everything...), and are a non-line oriented binary format (either \r or \r\n meaning line oriented) so you are not supposed to read as text! The code ought to fail maybe with some user message. This does not depend on the OS only, but also on the C library implementation, adding to the confusion and possible variations... (particularly for transparent UNICODE translation layers adding another point of articulation for confusing variations).
The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \n characters after \r (automatic typewriter text). Then it also assumes BINARY mode where the C library is forced to ignore text interpretations (locale) and give away the sheer bytes. There should be no difference in the actual text characters between both modes, only in the control characters, so generally speaking reading BINARY is better than TEXT mode. This solution is efficient for BINARY mode typical Windows OS text files independently of C library variations, and inefficient for other platform text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer, make a test for \r vs \r\n line controls however way you like, then select the best getline user-code into the pointer and invoke it from it.
Incidentally I remember I found some \r\r\n text files too... which translates into double line text just as is still required by some printed text consumers.
The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:
#include <string>
#include <iostream>
using namespace std;
int main() {
string line;
while( getline( cin, line ) ) {
cout << line << endl;
}
}
Of course, if you are dealing with files from another platform, all bets are off.
As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return,, you can examine the last character of the line string in the above code to see if it is \r and if so remove it before doing your application-specific processing.
For example, you could provide yourself with a getline style function that looks something like this (not tested, use of indexes, substr etc for pedagogical purposes only):
ostream & safegetline( ostream & os, string & line ) {
string myline;
if ( getline( os, myline ) ) {
if ( myline.size() && myline[myline.size()-1] == '\r' ) {
line = myline.substr( 0, myline.size() - 1 );
}
else {
line = myline;
}
}
return os;
}
One solution would be to first search and replace all line endings to '\n' - just like e.g. Git does by default.
Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check to make sure line[line.length() - 1] is not '\r'. On Linux, this is superfluous as most lines will end up with '\n', meaning you'll lose a fair bit of time if this is in a loop. On Windows, this is also superfluous. However, what about classic Mac files which end in '\r'? std::getline would not work for those files on Linux or Windows because '\n' and '\r' '\n' both end with '\n', eliminating the need to check for '\r'. Obviously such a task that works with those files would not work well. Of course, then there exist the numerous EBCDIC systems, something that most libraries won't dare tackle.
Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows as old-style Mac line endings shouldn't be around for much longer, check for '\n' only and remove the trailing '\r' character.
Unfortunately the accepted solution does not behave exactly like std::getline(). To obtain that behavior (to my tests), the following change is necessary:
std::istream& safeGetline(std::istream& is, std::string& t)
{
t.clear();
// The characters in the stream are read one-by-one using a std::streambuf.
// That is faster than reading them one-by-one using the std::istream.
// Code that uses streambuf this way must be guarded by a sentry object.
// The sentry object performs various tasks,
// such as thread synchronization and updating the stream state.
std::istream::sentry se(is, true);
std::streambuf* sb = is.rdbuf();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return is;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return is;
case std::streambuf::traits_type::eof():
is.setstate(std::ios::eofbit); //
if(t.empty()) // <== change here
is.setstate(std::ios::failbit); //
return is;
default:
t += (char)c;
}
}
}
According to https://en.cppreference.com/w/cpp/string/basic_string/getline:
Extracts characters from input and appends them to str until one of the following occurs (checked in the order listed)
end-of-file condition on input, in which case, getline sets eofbit.
the next available input character is delim, as tested by Traits::eq(c, delim), in which case the delimiter character is extracted from input, but is not appended to str.
str.max_size() characters have been stored, in which case getline sets failbit and returns.
If no characters were extracted for whatever reason (not even the discarded delimiter), getline sets failbit and returns.
If it is known how many items/numbers each line has, one could read one line with e.g. 4 numbers as
string num;
is >> num >> num >> num >> num;
This also works with other line endings.

How to detect a CRLF in a stream

I got a stringstream of with HTTP request content. As you know HTTP request end up with CRLF break. But operator>> won't recognize CRLF as if it's a normal end-of-file.
How can I detect this CRLF break?
EDIT:
All right, actually I'm using boost.iostreams. But I don't think there should be any differences.
char head[] = "GET / HTTP1.1\r\nConnection: close\r\nUser-Agent: Wget/1.12 (linux-gnu)\r\nHost: www.baidu.com\r\n\r\n";
io::stream<My_InOut> in(head, sizeof head);
string s;
while(in >> s){
char c = in.peek(); // what I am doing here is to check if next character is a normal break so that 's' is a complete word.
switch( c ){
case -1:
// is it eof or an incomplete word?
break;
case 0x20: // a complete word
break;
case 0x0d:
case 0x0a: // also known as \r\n should indicate a complete word
break;
}
In this code, I assume that the request could possibly be split into parts because of its transmission, so I wanted to recognize whether '-1' stand for actual end-of-request or just a break word that I need to read more to complete the request.
First of all, peek returns an int, not a char (at least, std::istream::peek returns int--I don't know about boost). This distinction is important for recognizing -1 as the end of the file rather than a character with the value of 0xFF.
Also be aware that i/o streams in text mode will transform the platform's line separator into '\n' (which, in C and C++, usually has the same value as a line feed, but it might not). So if you're running this on Windows, where the native line separator is CR+LF, you'll never see the CR. But if you run the same code on a Linux box, where the native separator is simply LF, you will.
So given your question:
How can I detect this CRLF break?
The answer is to open the stream in binary mode and check for the character values 0x0D followed by 0x0A.
That said, it's not unheard of for HTML code to overlook that the network protocol requires CR+LF. If you want to be abide by the "be liberal in what you accept" maxim, you just watch for either CR or LF and then skip the next character if it's the complement.

Read text file step-by-step

I have a file which has text like this:
#1#14#ADEADE#CAH0F#0#0.....
I need to create a code that will find text that follows # symbol, store it to variable and then writes it to file WITHOUT # symbol, but with a space before. So from previous code I will get:
1 14 ADEADE CAH0F 0 0......
I first tried to did it in Python, but files are really big and it takes a really huge time to process file, so I decided to write this part in C++. However, I know nothing about C++ regex, and I'm looking for help. Could you, please, recommend me an easy regex library (I don't know C++ very well) or the well-documented one? It would be even better, if you provide a small example (I know how to perform transmission to file, using fstream, but I need help with how to read file as I said before).
This looks like a job for std::locale and his trusty sidekick imbue:
#include <locale>
#include <iostream>
struct hash_is_space : std::ctype<char> {
hash_is_space() : std::ctype<char>(get_table()) {}
static mask const* get_table()
{
static mask rc[table_size];
rc['#'] = std::ctype_base::space;
return &rc[0];
}
};
int main() {
using std::string;
using std::cin;
using std::locale;
cin.imbue(locale(cin.getloc(), new hash_is_space));
string word;
while(cin >> word) {
std::cout << word << " ";
}
std::cout << "\n";
}
IMO, C++ is not the best choice for your task. But if you have to do it in C++ I would suggest you have a look at Boost.Regex, part of the Boost library.
If you are on Unix, a simple sed 's/#/ /' <infile >outfile would suffice.
Sed stands for 'stream editor' (and supports regexes! whoo!), so it would be well-suited for the performance that you are looking for.
Alright, I'm just going to make this an answer instead of a comment. Don't use regex. It's almost certainly overkill for this task. I'm a little rusty with C++, so I'll not post any ugly code, but essentially what you could do is parse the file one character at a time, putting anything that wasn't a # into a buffer, then writing it out to the output file along with a space when you do hit a #. In C# at least two really easy methods for solving this come to mind:
StreamReader fileReader = new StreamReader(new FileStream("myFile.txt"),
FileMode.Open);
string fileContents = fileReader.ReadToEnd();
string outFileContents = fileContents.Replace("#", " ");
StreamWriter outFileWriter = new StreamWriter(new FileStream("outFile.txt"),
Encoding.UTF8);
outFileWriter.Write(outFileContents);
outFileWriter.Flush();
Alternatively, you could replace
string outFileContents = fileContents.Replace("#", " ");
With
StringBuilder outFileContents = new StringBuilder();
string[] parts = fileContents.Split("#");
foreach (string part in parts)
{
outFileContents.Append(part);
outFileContents.Append(" ");
}
I'm not saying you should do it either of these ways or my suggested method for C++, nor that any of these methods are ideal - I'm just pointing out here that there are many many ways to parse strings. Regex is awesome and powerful and may even save the day in extreme circumstances, but it's not the only way to parse text, and may even destroy the world if used for the wrong thing. Really.
If you insist on using regex (or are forced to, as in for a homework assignment), then I suggest you listen to Chris and use Boost.Regex. Alternatively, I understand Boost has a good string library as well if you'd like to try something else. Just look out for Cthulhu if you do use regex.
You've left out one crucial point: if you have two (or more) consecutive #s in the input, should they turn into one space, or the same number of spaces are there are #s?
If you want to turn the entire string into a single space, then #Rob's solution should work quite nicely.
If you want each # turned into a space, then it's probably easiest to just write C-style code:
#include <stdio.h>
int main() {
int ch;
while (EOF!=(ch=getchar()))
if (ch == '#')
putchar(' ');
else
putchar(ch);
return 0;
}
So, you want to replace each ONE character '#' with ONE character ' ' , right ?
Then it's easy to do since you can replace any portion of the file with string of exactly the same length without perturbating the organisation of the file.
Repeating such a replacement allows to make transformation of the file chunk by chunk; so you avoid to read all the file in memory, which is problematic when the file is very big.
Here's the code in Python 2.7 .
Maybe, the replacement chunk by chunk will be unsifficient to make it faster and you'll have a hard time to write the same in C++. But in general, when I proposed such codes, it has increased the execution's time satisfactorily.
def treat_file(file_path, chunk_size):
from os import fsync
from os.path import getsize
file_size = getsize(file_path)
with open(file_path,'rb+') as g:
fd = g.fileno() # file descriptor, it's an integer
while True:
x = g.read(chunk_size)
g.seek(- len(x),1)
g.write(x.replace('#',' '))
g.flush()
fsync(fd)
if g.tell() == file_size:
break
Comments:
open(file_path,'rb+')
it's absolutely obligatory to open the file in binary mode 'b' to control precisely the positions and movements of the file's pointer;
mode '+' is to be able to read AND write in the file
fd = g.fileno()
file descriptor, it's an integer
x = g.read(chunk_size)
reads a chunk of size chunk_size . It would be tricky to give it the size of the reading buffer, but I don't know how to find this buffer's size. Hence a good idea is to give it a power of 2 value.
g.seek(- len(x),1)
the file's pointer is moved back to the position from which the reading of the chunk has just been made. It must be len(x), not chunk_size because the last chunk read is in general less long than chink_size
g.write(x.replace('#',' '))
writes on the same length with the modified chunk
g.flush()
fsync(fd)
these two instructions force the writing, otherwise the modified chunk could remain in the writing buffer and written at uncontrolled moment
if g.tell() >= file_size: break
after the reading of the last portion of file , whatever is its length (less or equal to chunk_size), the file's pointer is at the maximum position of the file, that is to say file_size and the program must stop
.
In case you would like to replace several consecutive '###...' with only one, the code is easily modifiable to respect this requirement, since writing a shortened chunk doesn't erase characters still unread more far in the file. It only needs 2 files's pointers.