Read contents of a text file character by character into a vector without skipping whitespace or new lines - c++

So I have several text files. I need to figure out the 10 most common characters and words in the file. I've decided to use a vector, and load it with each character from the file. However, it needs to include white space and new lines.
This is my current function
void readText(ifstream& in1, vector<char> & list, int & spaces, int & words)
{
//Fills the list vector with each individual character from the text file
in1.open("test1");
in1.seekg(0, ios::beg);
std::streampos fileSize = in1.tellg();
list.resize(fileSize);
string temp;
char ch;
while (in1.get(ch))
{
//calculates words
switch(ch)
{
case ' ':
spaces++;
words++;
break;
default:
break;
}
list.push_back(ch);
}
in1.close();
}
But for some reason, it doesn't seem to properly hold all of the characters. I have another vector elsewhere in the program that has 256 ints, all set to 0. It goes through the vector with the text in it and tallies up the characters in the other vector, using each character's 0-255 int value as the index. It tallies most characters fine, but spaces and newlines are causing problems. Is there a more efficient way of doing this?

The problem with your code right now is that you're calling
list.resize(fileSize);
and also using
list.push_back(ch);
in your read loop. resize() already creates fileSize elements, and push_back() then appends every character after them.
You only need one or the other.
Is there a more efficient way of doing this?
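The easiest way is to resize the std::vector<char> to the file size and use std::ifstream::read() to read the whole file in one go. To get that size, seek to the end of the file before calling tellg(); in the code above tellg() is called at the beginning of the file, so it returns 0. Calculate everything else from the vector contents afterwards.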
Something along these lines:
list.resize(fileSize);
in1.read(&list[0],fileSize);
for(auto ch : list) {
switch(ch) {
// Process the characters ...
}
}

Related

C++: How to read multiple lines from file until a certain character, store them in array and move on to the next part of the file

I'm doing a hangman project for school, and one of the requirements is it needs to read the pictures of the hanging man from a text file. I have set up a text file with the '-' char which means the end of one picture and start of the next one.
I have this for loop set up to read the file until the delimiting character and store it in an array, but when testing I am getting incomplete pictures, cut off in certain places.
This is the code:
string s;
ifstream scenarijos("scenariji.txt");
for(int i = 0; i < 10; i++ ) {
getline(scenarijos, s, '-');
scenariji[i] = s;
}
For the record, scenariji is an array with type of string
And here is an example of the text file:
[image: example of the text file]
From your example input, it looks like '-' can be part of the input image (look at the "arms" of the hanged man). Unless you use some other, unique character to delimit the images, you won't be able to separate them.
If you know the dimensions of the images, you could read them without searching for the delimiter by reading a certain amount of bytes from the input file. Alternatively, you could define some more complex rules for image termination, e.g. when the '-' character is the only character in the line. For example:
ifstream scenarijos("scenariji.txt");
string scenariji[10];
for (int i = 0; i < 10; ++i) {
string& scenarij = scenariji[i];
string s;
// stop at end of input or at a line that contains only '-'
while (getline(scenarijos, s) && s != "-") {
scenarij += s;
scenarij.push_back('\n'); // the trailing newline was removed by getline
}
}

Pull out data from a file and store it in strings in C++

I have a file which contains records of students in the following format.
Umar|Ejaz|12345|umar#umar.com
Majid|Hussain|12345|majid#majid.com
Ali|Akbar|12345|ali#geeks-inn.com
Mahtab|Maqsood|12345|mahtab#myself.com
Juanid|Asghar|12345|junaid#junaid.com
The data has been stored according to the following format:
firstName|lastName|contactNumber|email
The total number of lines(records) can not exceed the limit 100. In my program, I've defined the following string variables.
#define MAX_SIZE 100
// other code
string firstName[MAX_SIZE];
string lastName[MAX_SIZE];
string contactNumber[MAX_SIZE];
string email[MAX_SIZE];
Now, I want to pull data from the file, and using the delimiter '|', I want to put data in the corresponding strings. I'm using the following strategy to put back data into string variables.
ifstream readFromFile;
readFromFile.open("output.txt");
// other code
int x = 0;
string temp;
while(getline(readFromFile, temp)) {
int charPosition = 0;
while(temp[charPosition] != '|') {
firstName[x] += temp[charPosition];
charPosition++;
}
while(temp[charPosition] != '|') {
lastName[x] += temp[charPosition];
charPosition++;
}
while(temp[charPosition] != '|') {
contactNumber[x] += temp[charPosition];
charPosition++;
}
while(temp[charPosition] != endl) {
email[x] += temp[charPosition];
charPosition++;
}
x++;
}
Is it necessary to attach the null character '\0' at the end of each string? And if I do not attach it, will it create problems when I actually use those string variables in my program? I'm new to C++, and I've come up with this solution. If anybody has a better technique, it is surely welcome.
Edit: Also I can't compare a char(acter) with endl, how can I?
Edit: The code that I've written isn't working. It gives me following error.
Segmentation fault (core dumped)
Note: I can only use .txt file. A .csv file can't be used.
There are many techniques to do this. I suggest searching Stack Overflow for "[C++] read file" to see some more methods.
Find and Substring
You could use the std::string::find method to find the delimiter and then use std::string::substr to return a substring between the position and the delimiter.
std::string::size_type position = 0;
position = temp.find('|');
if (position != std::string::npos)
{
firstName[x] = temp.substr(0, position);
}
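Extending the snippet above to every field of a record, a minimal sketch might be the following. The function name splitRecord is an assumption for illustration; it only relies on std::string::find and std::string::substr.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one record such as "first|last|number|email"
// into its fields using find() and substr().
std::vector<std::string> splitRecord(const std::string& line, char delimiter = '|')
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type end = line.find(delimiter, start);
        if (end == std::string::npos) {
            fields.push_back(line.substr(start));   // last field: everything that is left
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;                            // continue just past the delimiter
    }
    return fields;
}
```

For a line read with getline, fields[0] would then go to firstName[x], fields[1] to lastName[x], and so on, after checking that fields.size() is 4.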
If you don't terminate a C-style string with a null character, there is no way to determine where the string ends. Thus, you'll need to terminate such strings. std::string objects keep track of their own length, so they need no terminator.
I would personally read the data into std::string objects:
std::string first, last, etc;
while (std::getline(readFromFile, first, '|')
&& std::getline(readFromFile, last, '|')
&& std::getline(readFromFile, etc)) {
// do something with the input
}
std::endl is a manipulator implemented as a function template. You can't compare a char with that. There is also hardly ever a reason to use std::endl because it flushes the stream after adding a newline which makes writing really slow. You probably meant to compare to a newline character, i.e., to '\n'. However, since you read the string with std::getline() the line break character will already be removed! You need to make sure you don't access more than temp.size() characters otherwise.
Your record also contains arrays of strings rather than arrays of characters, and you assign individual chars to them. You either wanted to use char something[SIZE] or you'd store strings!
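Putting the getline() approach together, one possible sketch reads a line at a time and then splits it through a std::istringstream, so a malformed line cannot throw off the records that follow. The Student struct and readStudents function are invented for this example and take the place of the four parallel arrays.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type standing in for the four parallel arrays.
struct Student {
    std::string firstName, lastName, contactNumber, email;
};

// Hypothetical helper: parses "first|last|number|email" lines.
std::vector<Student> readStudents(std::istream& in)
{
    std::vector<Student> students;
    std::string line;
    while (std::getline(in, line)) {          // getline already strips the '\n'
        std::istringstream fields(line);
        Student s;
        if (std::getline(fields, s.firstName, '|') &&
            std::getline(fields, s.lastName, '|') &&
            std::getline(fields, s.contactNumber, '|') &&
            std::getline(fields, s.email)) {
            students.push_back(s);
        }
        // lines with fewer than four fields are silently skipped
    }
    return students;
}
```

No '\0' is appended anywhere, and no index ever runs past the end of a line, which is what caused the segmentation fault in the hand-written loops.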

Count the number of unique words and occurrence of each word

CSCI-15 Assignment #2, String processing. (60 points) Due 9/23/13
You MAY NOT use C++ string objects for anything in this program.
Write a C++ program that reads lines of text from a file using the ifstream getline() method, tokenizes the lines into words ("tokens") using strtok(), and keeps statistics on the data in the file. Your input and output file names will be supplied to your program on the command line, which you will access using argc and argv[].
You need to count the total number of words, the number of unique words, the count of each individual word, and the number of lines. Also, remember and print the longest and shortest words in the file. If there is a tie for longest or shortest word, you may resolve the tie in any consistent manner (e.g., use either the first one or the last one found, but use the same method for both longest and shortest). You may assume the lines comprise words (contiguous lower-case letters [a-z]) separated by spaces, terminated with a period. You may ignore the possibility of other punctuation marks, including possessives or contractions, like in "Jim's house". Lines before the last one in the file will have a newline ('\n') after the period. In your data files, omit the '\n' on the last line. You may assume that the lines will be no longer than 100 characters, the individual words will be no longer than 15 letters and there will be no more than 100 unique words in the file.
Read the lines from the input file, and echo-print them to the output file. After reaching end-of-file on the input file (or reading a line of length zero, which you should treat as the end of the input data), print the words with their occurrence counts, one word/count pair per line, and the collected statistics to the output file. You will also need to create other test files of your own. Also, your program must work correctly with an EMPTY input file – which has NO statistics.
Test file looks like this (exactly 4 lines, with NO NEWLINE on the last line):
the quick brown fox jumps over the lazy dog.
now is the time for all good men to come to the aid of their party.
all i want for christmas is my two front teeth.
the quick brown fox jumps over a lazy dog.
Copy and paste this into a small file for one of your tests.
Hints:
Use a 2-dimensional array of char, 100 rows by 16 columns (why not 15?), to hold the unique words, and a 1-dimensional array of ints with 100 elements to hold the associated counts. For each word, scan through the occupied lines in the array for a match (use strcmp()), and if you find a match, increment the associated count, otherwise (you got past the last word), add the word to the table and set its count to 1.
The separate longest word and the shortest word need to be saved off in their own C-strings. (Why can't you just keep a pointer to them in the tokenized data?)
Remember – put NO NEWLINE at the end of the last line, or your test for end-of-file might not work correctly. (This may cause the program to read a zero-length line before seeing end-of-file.)
This is not a long program – no more than about 2 pages of code
Here is what I have so far:
#include<iostream>
#include<iomanip>
#include<fstream>
#include<string>
#include<cstring>
using namespace std;
void totalwordCount(ifstream &inputFile)
{
char words[100][16]; // Holds the unique words.
char *token;
int totalCount = 0; // Counts the total number of words.
// Read every word in the file.
while(inputFile >> words[99])
{
totalCount++; // Increment the total number of words.
// Tokenize each word and remove spaces, periods, and newlines.
token = strtok(words[99], " .\n");
while(token != NULL)
{
token = strtok(NULL, " .\n");
}
}
cout << "Total number of words in file: " << totalCount << endl;
}
void uniquewordCount(ifstream &inputFile)
{
char words[100][16]; // Holds the unique words
int counter[100];
char *tok = "0";
int uniqueCount = 0; // Counts the total number of unique words
while(!inputFile.eof())
{
uniqueCount++;
tok = strtok(words[99], " .\n");
while(tok != NULL)
{
tok = strtok(NULL, " .\n");
inputFile >> words[99];
if(strcmp(tok, words[99]) == 0)
{
counter[99]++;
}
else
{
words[99][15] += 1;
}
uniqueCount++;
}
}
cout << counter[99] << endl;
}
int main(int argc, char *argv[])
{
ifstream inputFile;
char inFile[12] = "string1.txt";
char outFile[16] = "word result.txt";
// Get the name of the file from the user.
cout << "Enter the name of the file: ";
cin >> inFile;
// Open the input file.
inputFile.open(inFile);
// If successfully opened, process the data.
if(inputFile)
{
while(!inputFile.eof())
{
totalwordCount(inputFile);
uniquewordCount(inputFile);
}
}
return 0;
}
I already took care of how to count the total number of words in the file in the totalwordCount() function, but in the uniquewordCount() function, I am having trouble counting the total number of unique words and counting the number of occurrences of each word. Is there anything that I need to change in the uniquewordCount() function?
This program contains several issues which are to be considered harmful! To prevent bad software being created based on entirely nonsensical assignments like the above, here are a number of hints:
Always test the stream for success after reading from it. Using in.eof() to determine if the stream is in a good state does not work! One of the problems is that you will get an infinite loop if the stream goes bad for a different reason than end of file, e.g., failure to correctly parse a value (this will set std::ios_base::failbit but not std::ios_base::eofbit).
Reading to a fixed size char array a using in >> a without having set up limits for the number of characters to be read is the C++ way to spell gets()! If you really think that using in >> a is the right way to go (see next item), you absolutely need to set up the array's width, e.g., using in >> std::setw(sizeof(a)) >> a. You still need to check that this extraction was successful, of course.
From the looks of it, your teacher wants you to actually use std::istream::getline() to read the array, e.g., using in.getline(a, sizeof(a)) (which, of course, needs to be checked for success).
Note that the formatted input, i.e., in >> a already tokenizes the stream being received by spaces! There is no need to faff about with strtok() after that.
Once you have consumed a stream, it is consumed. Assuming the characters don't come from a file but rather from something like standard input, you also can't rewind the stream to read it again. I'd think you want to tokenize the values once and use them for both purposes.
This is more of a sidenote: after you created a stream, its nature should be entirely immaterial for the processing of the stream's content (although, e.g., for string streams you might want to eventually collect the result using the str() member): implement your stream processing functions in terms of std::istream rather than std::ifstream!
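To illustrate the first two hints together, here is a small sketch of a total word count that limits each extraction with std::setw and tests the stream after every read instead of calling eof(). The function name countWords is made up, and the 16-byte buffer follows the assignment's 15-letter limit.

```cpp
#include <iomanip>
#include <istream>

// Hypothetical helper: counts whitespace-separated words, reading each one
// into a fixed 16-byte buffer without risking an overflow.
int countWords(std::istream& in)
{
    char word[16];                            // 15 letters plus the terminating '\0'
    int total = 0;
    // setw caps each extraction at sizeof(word) - 1 characters. The loop
    // condition tests the stream after every read, so it ends on end of
    // file as well as on any other failure.
    while (in >> std::setw(static_cast<int>(sizeof(word))) >> word)
        ++total;
    return total;
}
```

A word longer than 15 characters would be delivered in several pieces rather than overflowing the buffer; the assignment rules that case out. The function takes a std::istream, so it works for files and string streams alike.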
Since you have a concrete question ("Is there anything that I need to change in the uniquewordCount() function?"): yes, everything! Throw away this function entirely and rethink what you need to do. Basically, the structure of the functionality should be along the lines of
char buffer[100];
while (in.getline(buffer, sizeof(buffer))) {
// tokenize buffer into words
// for each word check if it already exists
// if the word does not exist, append it to the array of known words and set count to 1
// if the word exists, increment the count
// determine if the word is shorter or longer than the shortest or longest word so far
// if it is the case, remember the word's index or a pointer to it
}

C++ Reading from a text file into a const char array

I want to read in lines from a text file into a 2-d char array but without the newline character.
Example of .txt:
TCAGC
GTAGA
AGCAG
ATGTC
ATGCA
ACAGA
CTCGA
GCGAC
CGAGC
GCTAG
...
So far, I have:
ifstream infile;
infile.open("../barcode information.txt");
string samp;
getline(infile,samp,',');
BARCLGTH = samp.length();
NUMSUBJ=1;
while(!infile.eof())
{
getline(infile,samp,',');
NUMSUBJ++;
}
infile.close(); //I read the file the first time to determine how many sequences
//there are in total and the length of each sequence to determine
//the dimensions of my array. Not sure if there is a better way?
ifstream file2;
file2.open("../barcode information.txt");
char store[NUMSUBJ][BARCLGTH+1];
for(int i=0;i<NUMSUBJ;i++)
{
for(int j=0;j<BARCLGTH+1;j++)
{
store[i][j] = file2.get();
}
}
However, I do not know how to ignore the newline character. I want the array to be indexed so that I can access a sequence with the first index and then a specific char within that sequence with the second index; i.e. store[0][0] would give me 'T', but I do not want store[0][5] to give me '\n'.
Also, as an aside, store[0][6], which I think should be out of bounds since BARCLGTH is 5, returns 'G',store[0][7] returns 'T',store[0][8] returns 'A', etc. These are the chars from the next line. Alternatively, store[1][0],store[1][1], and store[1][2] also return the same values. Why does the first set return values, shouldn't they be out of bounds?
As you're coding in C++, you could do it like this instead:
std::vector<std::string> barcodes;
std::ifstream infile("../barcode information.txt");
std::string line;
while (std::getline(infile, line))
barcodes.push_back(line);
infile.close();
After this the vector barcodes contains all the contents from the file. No need for arrays, and no need to count the number of lines.
And as both vectors and strings can be indexed like arrays, you can use syntax such as barcodes[2][0] to get the first character of the third entry.

Incrementing characters being read

I am trying to decode an input file that looks something like this:
abbaabbbbaababbaabababaabababaabbababaabababababababa...
and compare it to a makeshift mapping I have made using two arrays
int secretNumber[10];
string coding[10];
coding[0]="abb";
coding[1]="aabbbba";
coding[2]="abab";
...
I am not sure how I can start off by reading the first character which is 'a' then check if it's in the coding array. If it is print out the secretCoding and move the next character b. Else if it's not in the array then add the next character to the first in a string and check to see if "ab" is in the array and if that isn't either add the next character which makes "abb" and so on.
Something like this:
while (!(readFile.eof()) ){
for(int i=0; i<10; i++){
if(stringOfChars==coding[i]){
cout << secretNumber[i] <<endl;
//Now increment to next char
}
else{
//combine the current string with the next character
}
}
}
Question: How do I go about reading in a character if its a match move to next character if not combine current character and the next character until there's a match.
You should use the design pattern called Interpreter.
Here is a link to a c++ version.
If you want a solution that works for arbitrary input sizes, i.e. which doesn't store the entire input in memory, then you can use a queue (e.g. std::deque<char>) to read in a handful of characters at a time, pushing data in from the back. Then you check if the queue still has three, four or five characters left, and if so compare them to your patterns; if there's a match, you pop the corresponding characters off from the front of the queue.
I'm not sure, but it looks like you may be trying to implement the LZW compression algorithm. If that is the case, then you would have to change your approach a little. If you have decided that your secret codes are integers, then you would have to assign a code to every element of the initial dictionary. The initial dictionary is basically all the strings of length 1 over your source alphabet. In your case that would be "a" to "z", or only "a" and "b" if you are keeping it simple.
The other thing is that you need to look through your dictionary for any existing string which has been assigned a code. The best way to do that is to use the STL map container, which in your case could map strings to integers. Also, it's a good idea to place a limit on the size to which the dictionary can grow as new strings continue to be added to it.
Overall,
Use std::map< std::string, int > dictionary; as your dictionary for strings such as a, b, aa, ab, aab, etc... and the matching code for it.
The coding[0], coding[1] entries would not be required, as the strings would serve as the keys in this dictionary.
The secretNumber[0], secretNumber[1] entries also would not be needed, as the value stored for a key would give the secretNumber.
Here is what it may look like:
std::map< std::string, int > dictionary;
int iWordCount = 0;
/*
Initialize the dictionary with the code for strings of length 1 in your source alphabet.
*/
dictionary["a"] = 0;
dictionary["b"] = 1;
iWordCount = 2; // We would be incrementing this as we keep on adding more strings to the dictionary.
std::string newWord = "", existingWord = "";
char ch;
while (readFile.get(ch)) { // read the next character into "ch"; the loop ends at end of file or on a read error
newWord += ch;
if ( dictionary.count(newWord) != 0 ) { // Existing word.
/*
Do something
*/
}
else { // We encountered this word for the first time.
/*
Do something else
*/
}
}