How to input an arbitrary number of text files in C++?

I'm working on a coding project for a class, and I understand the basic things I want to accomplish, but one thing that nobody seems to be able to help me with is inputting an unspecified number of text files. The user is prompted to enter the text files they want to compare (the overall purpose of my code), separated by spaces, which lets them compare an arbitrary number of text files (e.g. 2, 3, 8, 16, etc.). I know that the getline function is helpful here, as is counting the "." characters in a for loop (assuming each file name contains exactly one "."). After that logic I am utterly lost. Eventually, I'm going to have to open the text files and put them in sets to compare each file against every other file once, and output their similarities and differences into yet another text file. Any ideas?

Here is the general process I would try to follow (if I interpreted the prompt correctly)
Get the line of text files using getline
Put that into a stringstream
Open the next file in the stream while there is still information in the stringstream (not at eof)
Store all of that information in a vector of strings, appending each new file's contents after it is read
Compare the strings in the vector
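The first two steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name splitFileNames is made up for illustration, and it assumes file names never contain spaces.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one line of user input into
// whitespace-separated file names (names with spaces are not supported).
std::vector<std::string> splitFileNames(const std::string& line) {
    std::istringstream iss(line);
    std::vector<std::string> names;
    std::string name;
    while (iss >> name) {   // operator>> skips any run of whitespace
        names.push_back(name);
    }
    return names;
}
```

A caller would typically read the whole line with std::getline(std::cin, line), pass it to splitFileNames, and then open each returned name with std::ifstream.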

If you pass the text files on the command line rather than getting them from a little dialog with the user via stdin, life will be easier. Most users will type
compare *
which on Unix-type systems is expanded by the shell to a list of files. On DOS you need to match and expand the wildcard yourself.
You've got an N squared problem, but the logic is easy, it's just
int main(int argc, char **argv)
{
    int i, j;
    /* compare each pair of files exactly once */
    for (i = 1; i < argc; i++)
        for (j = i + 1; j < argc; j++)
            compare(argv[i], argv[j]);
    return 0;
}
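Since the question mentions putting each file's contents into a set, the pairwise loop could look something like the sketch below. The names commonWords and compareAllPairs are hypothetical, and "similarity" is assumed here to mean the words two files share; the actual assignment may define it differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: words that appear in both sets.
std::set<std::string> commonWords(const std::set<std::string>& a,
                                  const std::set<std::string>& b) {
    std::set<std::string> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}

// Compare every file against every other file exactly once.
// Results are ordered (0,1), (0,2), ..., (1,2), ...
std::vector<std::set<std::string>> compareAllPairs(
        const std::vector<std::set<std::string>>& files) {
    std::vector<std::set<std::string>> results;
    for (std::size_t i = 0; i < files.size(); ++i)
        for (std::size_t j = i + 1; j < files.size(); ++j)
            results.push_back(commonWords(files[i], files[j]));
    return results;
}
```

For N files this performs N*(N-1)/2 comparisons, which is the "N squared" cost mentioned above. std::set_difference could be used the same way to report the words unique to one file.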

Related

How to keep characters in C++ from combining when outputted to text file

I have a fairly simple program with a vector of characters which is then outputted to a .txt file.
ofstream op ("output.txt");
vector <char> outp;
for(int i=0;i<outp.size();i++){
op<<outp[i]; //the final output of this is incorrect
cout<<outp[i]; //this output is correct
}
op.close();
The text that is output by cout is correct, but when I open the text file that was created, the output is wrong, with what look like Chinese characters that shouldn't have been an option for the program to output. For example, when the program should output:
O dsof
And cout prints the right output, the .txt file has this:
O獤景
I have even tried adding the characters into a string before outputting it but it doesn't help. My best guess is that the characters are combining together and getting a different value for unicode or ascii but I don't know enough about character codes to know for sure or how to stop this from happening. Is there a way to correct the output so that it doesn't do this? I am currently using a windows 8.1 computer with code::blocks 12.11 and the GNU GCC compiler in case that helps.
Some text editors try to guess the encoding of a file and occasionally get it wrong. This can particularly happen with very small amounts of text, because whatever statistical analysis is being used just doesn't have enough data to reach a good conclusion. Windows Notepad has/had an infamous example with the text "Bush hid the facts".
More advanced text editors (for example Notepad++) may either not experience the same problem or may give you options to change what encoding is being assumed. You could use such to verify that the contents of the file are actually correct.
Hex editors/viewers are another way, since they allow you to examine the raw bytes of the file without interpretation. For instance, HxD is a hex editor that I have used in the past.
Alternatively, you can simply output more text. The more there is, generally the less likely something will guess wrong. From some of my experiences, newlines are particularly helpful in convincing the text editor to assume the correct encoding.
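If no hex editor is at hand, the raw bytes can also be checked from C++ itself. The following is only a sketch; hexDump is a made-up helper name, and it formats whatever bytes it is given without interpreting any encoding.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format each byte as two lowercase hex digits,
// separated by single spaces.
std::string hexDump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Reading output.txt in binary mode into a std::string and passing it to hexDump should show 4f 20 64 73 6f 66 for "O dsof" if the file itself is correct, in which case the problem lies in how the editor interprets those bytes.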
There is nothing wrong with your code.
Maybe the text editor you use is assuming a different default encoding.
Use a more advanced editor and you will get the right output.

c++ overwriting file data?

I am trying to run a program to replace certain data within a file. The relevant parts of the file attempting to be replaced look like the following:
1 Information 15e+10
2 Information 2e+16
3 Information 6e+2
And so on.
The files in question can be very large, in the multi-gigabyte range, and to my understanding this makes buffering the whole file and rewriting it impossible or unreasonable. That is fine; I just want to replace the values (e.g. the 15e+10).
This all works with a simple ios::in|ios::out stream and tellp() if I am replacing the value with one of the same size (15e+10->12e+12), or even a smaller one, since I can pad with an extra space that can be ignored down the line (e.g. 15e+10->4e+10 ). The problem comes when the replacement is longer than the value already in the file (e.g. 6e+2->16e+10): it writes over the newline character or starts overwriting the information on the next line.
I have searched the forums, and everyone says you can either overwrite in place, append to the end of the file, or buffer and recreate the whole file. Is there any way I can overwrite the value correctly without having to recreate the file?
If not then how can I have 2 files open (1 input 1 output) to do this if multiple files in question are too large for the memory?
Note: I would also like to avoid using boost:: as I need to be able to run this on a system without the boost library.
Open a stream to read from the input (IN) file and a second stream (OUT) to write to a new output (tmp) file.
Read from IN and write to OUT. When you get a value from IN that you want to replace write the replacement to OUT instead of the value you got from IN.
When parsing is complete replace the first file with the second (tmp) file.
Would this work for you?
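The approach above streams one line at a time, so memory use stays small regardless of file size. A rough sketch follows, assuming each line has the three-field layout shown in the question (id, label, value); the function name copyReplacingValue and its parameters are made up for illustration.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: copy `in` to `out` line by line. When the first
// field of a line equals `id`, write `newValue` in place of the third field.
// Lines that do not match are copied unchanged.
void copyReplacingValue(std::istream& in, std::ostream& out,
                        const std::string& id, const std::string& newValue) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key, label, value;
        if ((fields >> key >> label >> value) && key == id) {
            out << key << ' ' << label << ' ' << newValue << '\n';
        } else {
            out << line << '\n';
        }
    }
}
```

With real files, `in` would be a std::ifstream on the original and `out` a std::ofstream on the temporary file. Afterwards std::rename from <cstdio> can move the temporary file over the original; on some platforms the original may need to be removed with std::remove first.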
Use lseek()/fseek() to jump to a given position in a file.
You can use seekp to go to the location and rewrite it with <<
Example:
example.txt ( |?| = 1 byte of data )
|A|B|C|\n|1|2|3|D|E|F|\n|4|5|6|
//Somewhere in the code
fstream file;
file.open("example.txt", ios::in | ios::out);
//Somehow find the character distance and store it into "distance"
file.seekp(distance); //If distance = 0, it will go to "A" like rewind() but easier for me
If the distance is 4, the next character to be overwritten is 1
file << "987";
And the file will be
|A|B|C|\n|9|8|7|D|E|F|\n|4|5|6|
BUT the only problem here is when you need to increase/decrease the size:
Increase:
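You would overwrite the characters that follow, so you need to store the rest of the data in a temporary string first (or process it in smaller chunks if the data is too large), like
|A|B|C|\n|9|8|7|D|E|F|\n|4|5|6|
string tempstring;
file.seekg(distance);
getline(file, tempstring, '\0'); //read everything from distance to the end of the file (assumes no NUL bytes in it)
file.clear(); //reading to the end sets eofbit, so reset it before writing
file.seekp(distance);
file << content << tempstring; //content is the data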
Decrease:
The easiest solution is to write the NUL character \0 into the excess space, like
|A|B|C|\n|1|\0|\0|D|E|F|\n|4|5|6|
The only side effect is that the file size stays the same as before
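For the same-size and shrinking cases, padding with spaces (as the question already does) keeps the file readable as plain text. Below is a small sketch of a fixed-width overwrite that refuses values that do not fit; overwriteField and its parameters are hypothetical names, and it works on any std::iostream, including a std::fstream opened with ios::in | ios::out.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: overwrite `width` bytes starting at `pos` with
// `value`, padded on the right with spaces. Returns false (and writes
// nothing) if `value` is longer than the field.
bool overwriteField(std::iostream& file, std::streamoff pos,
                    std::size_t width, const std::string& value) {
    if (value.size() > width) return false;   // would clobber what follows
    std::string padded = value;
    padded.append(width - value.size(), ' ');
    file.seekp(pos, std::ios::beg);
    file << padded;
    return static_cast<bool>(file);
}
```

When this returns false, the value cannot be replaced in place without shifting the rest of the file, so the caller would have to fall back to rewriting through a temporary file.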

Retrieving file from .dat via getline() w/ c++

I posted this over at Code Review Beta but noticed that there is much less activity there.
I have the following code and it works just fine. Its function is to grab the input from a file and display it (to confirm that it's been grabbed). My task is to write a program that counts how many times a certain word (string) "abc" is found in the input file.
Is it better to store the input as a string or in arrays/vectors and have each line be stored separately? a[1], a[2] etc.? Perhaps someone could also point me to a resource that I can use to learn how to filter through the input data.
Thanks.
input_file.open ("in.dat");
while(getline(input_file,STRING)) // Reads one line into STRING per pass; stops when getline fails at end of file.
{
cout<<STRING; // Prints our STRING.
}
input_file.close();
Reading as much of the file into memory as you can in one go is generally more efficient than reading one letter or text line at a time. Disk drives take a lot of time to spin up and relocate to a sector, so your program will run faster if you can minimize the number of reads from the file.
Memory is fast to search.
My recommendation is to read the entire file, or as much as you can, into memory, then search the memory for a "word". Remember that in English, words can contain hyphens ('-') and apostrophes, as in "don't". Word recognition may become more difficult if a word is split across a line or you include abbreviations (with periods).
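Once the text is in memory, counting a fixed string such as "abc" can be done with std::string::find. This is a minimal sketch with a made-up function name; it counts non-overlapping substring matches and does not try to detect word boundaries, hyphens, or apostrophes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count non-overlapping occurrences of `word`
// anywhere in `text` (plain substring matches, not whole words).
std::size_t countOccurrences(const std::string& text, const std::string& word) {
    if (word.empty()) return 0;
    std::size_t count = 0;
    std::size_t pos = text.find(word);
    while (pos != std::string::npos) {
        ++count;
        pos = text.find(word, pos + word.size());  // continue after this match
    }
    return count;
}
```

The text argument could be the whole file, or each line from getline in turn with the per-line counts added together.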
Good luck.

getline() text with UNIX formatting characters

I am writing a C++ program which reads lines of text from a .txt file. Unfortunately the text file is generated by a twenty-something year old UNIX program and it contains a lot of bizarre formatting characters.
The first few lines of the file are plain, English text and these are read with no problems. However, whenever a line contains one or more of these strange characters mixed in with the text, that entire line is read as characters and the data is lost.
The really confusing part is that if I manually delete the first couple of lines so that the very first character in the file is one of these unusual characters, then everything in the file is read perfectly. The unusual characters obviously just display as little ascii squiggles -arrows, smiley faces etc, which is fine. It seems as though a decision is being made automatically, without my knowledge or consent, based on the first line read.
Based on some googling, I suspected that the issue might be with the locale, but according to the visual studio debugger, the locale property of the ifstream object is "C" in both scenarios.
The code which reads the data is as follows:
//Function to open file at location specified by inFilePath, load and process data
int OpenFile(const char* inFilePath)
{
string line;
ifstream codeFile;
//open text file
codeFile.open(inFilePath,ios::in);
//read file line by line
while ( codeFile.good() )
{
getline(codeFile,line);
//check non-zero length
if (line != "")
ProcessLine(&line[0]);
}
//close line
codeFile.close();
return 1;
}
If anyone has any suggestions as to what might be going on or how to fix it, they would be very welcome.
From reading about your issues it sounds like you are reading in binary data, which will cause getline() to throw out content or simply skip over the line.
You have a couple of choices:
If you simply need lines from the data file you can first sanitise them by removing all non-printable characters (that is the "official" name for those weird ascii characters). On UNIX a tool such as strings would help you with that process.
You can of course also do this programmatically in your code by simply reading in X amount of data, storing it in a string, and then removing those characters that fall outside of the standard ASCII character range. This will most likely cause you to lose any Unicode that may be stored in the file.
You change your program to understand the format and basically write a parser that allows you to parse the document in a more sane way.
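The programmatic variant of the first option might look like the sketch below. The name stripNonPrintable is made up, and keeping newlines and tabs is an assumption about what the rest of the program needs. In the default "C" locale this also drops every byte above 127, so any non-ASCII text is lost.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: keep printable characters plus newline and tab,
// drop everything else.
std::string stripNonPrintable(const std::string& s) {
    std::string out;
    for (char c : s) {
        // std::isprint requires a value representable as unsigned char
        unsigned char uc = static_cast<unsigned char>(c);
        if (std::isprint(uc) || c == '\n' || c == '\t') {
            out += c;
        }
    }
    return out;
}
```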
If you can, I would suggest trying solution number 1, simply to see if the results are sane and can still be used. You mention that this is medical data; do you by chance know what file format this is? If you are trying to find out and have access to a Unix/Linux machine, you can use the file utility, and maybe it can give you a clue (worst case, it will tell you it is simply data).
If possible, try getting a "clean" file that you can post the hex dump of, so that we can provide better help than we currently can. By clean I mean that there is no personally identifying information in the file.
For number 2, open the file in binary mode. You mentioned using Windows, where std::fstream objects handle binary and non-binary files differently; on UNIX systems this is not the case (on most systems, I'm sure I'll get a comment regarding the one system that doesn't match this description).
codeFile.open(inFilePath,ios::in);
would become
codeFile.open(inFilePath, ios::in | ios::binary);
Instead of getline() you will want to become intimately familiar with .read() which will allow unformatted operations on the ifstream.
Reading will be like this:
// This code has not been tested!
char input[1024];
codeFile.read(input, 1024);
int actual_read = codeFile.gcount();
// Here you can process input, up to a maximum of actual_read characters.
//ProcessLine() // We didn't necessarily read a line!
ProcessData(input, actual_read);
The other thing, as mentioned, is that you can change the locale for the current stream and change the separator it considers a new line. Maybe this will fix your issue without requiring the unformatted operations:
Imbue the stream with a new locale that only knows about the newline. This method may or may not let getline() work without issues.

Incorporating text files in applications?

Is there any way I can incorporate a pretty large text file (about 700 KB) into the program itself, so I don't have to ship the text file alongside the application in its directory? This is the first time I'm trying to do something like this, and I have no idea where to start.
Help is greatly appreciated (:
Depending on the platform that you are on, you will more than likely be able to embed the file in a resource container of some kind.
If you are programming on the Windows platform, then you might want to look into resource files. You can find a basic intro here:
http://msdn.microsoft.com/en-us/library/y3sk7e6b.aspx
With more detailed information here:
http://msdn.microsoft.com/en-us/library/zabda143.aspx
Have a look at the xxd command and its -include option. You will get a buffer and a length variable in a C formatted file.
If you can figure out how to use a resource file, that would be the preferred method.
It wouldn't be hard to turn a text file into a file that can be compiled directly by your compiler. This might only work for small files - your compiler might have a limit on the size of a single string. If so, a tiny syntax change would make it an array of smaller strings that would work just fine.
You need to convert your file by adding a line at the top, enclosing each line within quotes, putting a \n escape at the end of each line, escaping any quotes or backslashes in the text, and adding a semicolon at the end. You can write a program to do this, or it can easily be done in most editors.
This is my example document:
"Four score and seven years ago,"
can be found in the file c:\quotes\GettysburgAddress.txt
Convert it to:
static const char Text[] =
"This is my example document:\n"
"\"Four score and seven years ago,\"\n"
"can be found in the file c:\\quotes\\GettysburgAddress.txt\n"
;
This produces a variable Text which contains a single string with the entire contents of your file. It works because consecutive strings with nothing but whitespace between get concatenated into a single string.
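If a C++11 compiler is available, a raw string literal is an alternative to the per-line quoting described above. This is a different technique from the conversion the answer describes, shown only as a sketch; the delimiter txt is arbitrary and just has to be a sequence that never appears after a ")" in the text.

```cpp
#include <string>

// Raw string literal: no escaping of quotes or backslashes is needed,
// and line breaks inside the literal become '\n' characters.
static const char Text[] = R"txt(This is my example document:
"Four score and seven years ago,"
can be found in the file c:\quotes\GettysburgAddress.txt
)txt";
```

The same compiler limits on the length of a single string literal may still apply, so a very large file might need to be split into several literals.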