Find out which worker had the most working hours - c++

The problem goes something like this. Given the number "n", the next "n" lines each contain text similar to "Worker x has worked y hours this month", where "x" and "y" are numbers from 1 to 10000. If a worker appears multiple times in the list, sum up the hours.
At the end, print the ID number of the worker with the most hours worked.
For example:
3
Worker 23 worked 5 hours.
Worker 5 worked 10 hours.
Worker 23 worked 7 hours.
The output will be: 23.
I read the file like this:
#include <iostream>
#include <fstream>
#include <cstring>
using namespace std;
char s[50];
int n;
int main() {
ifstream fin("date.in");
fin >> n;
fin.getline(s, 50);
for (int i = 0; i < n; ++i) {
fin.getline(s, 50);
}
return 0;
}
My problem is that I cannot differentiate "x" from "y". Can anyone give me an idea of how to start on the problem?

You can decompose a line with std::istringstream.
Example:
std::string line = "Worker 23 worked 5 hours.";
std::string word;
int id = 0;
int time = 0;
std::istringstream is(line);
if (is >> word >> id >> word >> time)
{
...
do whatever with 'id' and 'time'
...
}
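To tie this back to the original task, here is a minimal sketch that parses each line this way and sums the hours per worker in a std::map. The function name busiestWorker, the generic std::istream parameter and the -1 return value for "nothing read" are assumptions made for illustration. On a tie the smallest ID wins, which the problem statement does not specify.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: reads "n" followed by n lines of the form
// "Worker <id> worked <hours> hours." and returns the id with the
// largest summed hours, or -1 if no valid line was read.
int busiestWorker(std::istream& in) {
    int n = 0;
    in >> n;
    std::string line;
    std::getline(in, line);                 // discard the rest of the first line

    std::map<int, int> total;               // worker id -> summed hours
    for (int i = 0; i < n && std::getline(in, line); ++i) {
        std::istringstream is(line);
        std::string word;
        int id = 0, hours = 0;
        if (is >> word >> id >> word >> hours)
            total[id] += hours;             // new ids start at 0
    }

    int bestId = -1, bestHours = -1;
    for (const auto& entry : total) {
        if (entry.second > bestHours) {     // strict '>' keeps the smallest id on ties
            bestHours = entry.second;
            bestId = entry.first;
        }
    }
    return bestId;
}
```

With the file from the question, this could be called with a std::ifstream opened on "date.in" and the result printed with std::cout.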

There are two main approaches to solving this problem.
If you are still learning C++ and do not know the language and its algorithms that well, then you need to go the hard way and do everything yourself.
If you are an experienced developer, then you can use the full power of C++. This results in a compact solution, but it takes a while to learn enough to understand it.
Either way, I will show you a potential solution for both approaches.
Number 1. We will use only very basic C++ language features and as few library functions as possible. We will not even use std::string or std::vector, which are basically a "must have" here.
What will we do?
Open the file
Read line by line into a string (char array)
Scan the char array character by character to find the numbers (ID, hours)
Check whether the worker already exists. If yes, add up the hours; if no, create a new one. Use 2 arrays for that.
Last but not least, find the max element
This is tedious work with a lot of programming . . .
Please see:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib> // std::atoi
// The format of the line is known. A length of 50 will be sufficient for the given use cases.
const int MaxLineLength = 50;
int main() {
// Open the source file
std::ifstream fin("r:\\date.in");
// Check, if the file could be opened
if (fin) {
// This will contain the content of one line in the source file
char lineText[MaxLineLength];
// Now, read the number of lines that we shall process
int numberOfLinesToProcess = 0;
// Read a number, check, if that works and check for a valid range (>0)
if ((fin >> numberOfLinesToProcess >> std::ws) and (numberOfLinesToProcess > 0)) {
// We will now waste some space and reserve room for numberOfLinesToProcess entries,
// although we know that there most likely are duplicates
int* workerID = new int[numberOfLinesToProcess];
int* numberOfHours = new int[numberOfLinesToProcess];
// We need a counter to detect the different workers
int differentWorkersCount = 0;
// Now read line by line from the source file
for (int lineIndex = 0; lineIndex < numberOfLinesToProcess and fin; ++lineIndex) {
// Read one complete line from the file
fin.getline(lineText, MaxLineLength);
// Check, if the input function worked OK
if (fin) {
// Here will store the temporary extracted values
int wID = 0, whours = 0;
// We are looking for a number and will skip the text
bool waitForNumberToStart = true;
char* startOfNumber = nullptr;
bool waitForWorkerID = true;
// Now, iterate over the string and find digits, the worker ID
for (char* cptr = &lineText[0]; *cptr != '\0'; ++cptr) {
// Check, if we found a digit
if (waitForNumberToStart) {
if (*cptr >= '0' and *cptr <= '9') {
// First time that we saw a digit. Remember start position in string
startOfNumber = cptr;
// Now we wait no longer for a number to start. We wait for the end
waitForNumberToStart = false;
}
}
else {
// We found a digit and now wait for the end of the number, so a non-digit
if (not (*cptr >= '0' and *cptr <= '9')) {
if (waitForWorkerID) {
wID = std::atoi(startOfNumber);
waitForWorkerID = false;
}
else {
whours = std::atoi(startOfNumber);
break;
}
waitForNumberToStart = true;
}
}
}
// Now, we have read both numbers. The worker ID and the working hours.
// Check, if we saw this worker already
bool workerFound = false;
for (int i = 0; i < differentWorkersCount and not workerFound; ++i) {
// Check, if such a worker already exists
if (workerID[i] == wID) {
// Yes, the worker already exists, add working hours.
numberOfHours[i] += whours;
workerFound = true;
}
}
// If the worker did not yet exist, then we create a new one
if (not workerFound) {
// Create a new worker at the end of the array
workerID[differentWorkersCount] = wID;
numberOfHours[differentWorkersCount] = whours;
// Now we have one worker more
++differentWorkersCount;
}
}
else {
// Problem while reading
std::cout << "\nError: Problem while reading\n";
break;
}
}
// So, now we have found all workers and summed up all hours
// Search for max working time
int indexForMax = 0;
int maxWorkingTime = 0;
// Check all workers
for (int i = 0; i < differentWorkersCount; ++i) {
// Did we find a new max?
if (numberOfHours[i] >= maxWorkingTime) {
// If yes, then remember this new max and remember the worker's index
maxWorkingTime = numberOfHours[i];
indexForMax = i;
}
}
// Now, we found everything and can show the result
std::cout << "Max: Worker: " << workerID[indexForMax] << " with: " << numberOfHours[indexForMax] << " hours\n";
// Release memory
delete[] workerID;
delete[] numberOfHours;
}
else {
// Invalid file content
std::cout << "\nError: Invalid file content\n";
}
}
else {
// The file could not be opened. Show error message
std::cout << "The source file could not be opened\n";
}
}
Yes, indeed, the above works of course, but you really need to write a lot of code.
Modern C++ will help you here a lot. For many problems, we have ready-to-use functions.
Then the code will be rather compact in the end.
Please see:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <unordered_map>
#include <algorithm>
using Sum = std::unordered_map<int, int>;
namespace rng = std::ranges;
int main() {
// Open file and check, if it could be opened
if (std::ifstream ifs{ "r:\\date.in" }; ifs) {
Sum data{};
// Read number of entries to operate on. We will ignore this value, we do not need it
if (int numberOfEntries{}; (ifs >> numberOfEntries >> std::ws)) {
// For reading the text and throwing away
std::string dummy;
// Read all lines and add up
for (int id{}, hours{}; ifs >> dummy >> id >> dummy >> hours >> dummy; data[id] += hours);
}
// Get max. Guard against an empty map: dereferencing the end iterator would be undefined behavior
if (not data.empty()) {
const auto& [worker, hours] = *rng::max_element(data, {}, &Sum::value_type::second);
// Show result
std::cout << "Max: Worker: " << worker << " with: " << hours << " hours\n";
}
}
else std::cerr << "\n*** Error: Could not open source file\n";
}
Also many intermediate solutions are possible.
This depends on how much you learned already.

Save values from string array in 2D int array?

I have been taking a voluntary computer science course at school for 1 month and want to practice something. Task:
I am to store a series of numbers with symbols in a string array for the first time. The numbers are to be stored in a two-dimensional int array. But the symbols should look different on output. And the output should be done only using the values of the int array. The numbers are one-digit. There is no need to check if the user makes the input correctly.
This is how the program should look like when it is executed:
Input:
.1.2.3.|.4.5.6.|.7.8.9
-------|-------|-------
Output:
;1;2;3;//;4;5;6;//;7;8;9
=======//=======//=======
I know that you always have to proceed in small steps. I have made it so far that the input is output exactly the same again. I just can't get the solution, I have been sitting here for hours. How do I save the numbers I have saved to the string array to the 2d array? And how do I replace the symbols to look like the example?
My code:
#include <iostream>
#include <string>
using namespace std;
int main()
{
int ArrayTwo[3][2] = {0}; //From Task
string ArrayInput[2] = {""}; //From Task
cout << "Input:" << endl;
for (int i = 0; i < 2; i++)
{
cin >> ArrayInput[i];
}
cout << "Output" << endl;
for (int i = 0; i < 2; i++)
{
cout << ArrayInput[i] << endl;
}
system("pause");
return 0;
}

Probably it's simplest to iterate through the input string and convert anything that's a digit to a number:
#include <cctype>
...
void PrintInner(int (&inner)[3])
{
std::cout << ';';
for (auto element : inner)
{
std::cout << element << ';';
}
}
...
int ArrayTwo[3][3];
string ArrayInput[2];
...
auto readPos = ArrayInput[0].cbegin();
for (auto& inner : ArrayTwo)
{
for(auto& element : inner)
{
// skip non-digit chars
while(!std::isdigit(static_cast<unsigned char>(*readPos))) // cast: isdigit must not be given a negative char value
{
++readPos;
}
element = *readPos - '0'; // char codes of digits 0123456789 are next to each other
++readPos;
}
}
cout << "Output" << endl;
PrintInner(ArrayTwo[0]);
for (int i = 1; i != 3; ++i)
{
std::cout << "//";
PrintInner(ArrayTwo[i]);
}
...
Demo on godbolt
In case you're unfamiliar with this: for ( ... : ...) is a ranged for-loop; this sets the loop variable to the elements of the array in increasing order of indices.
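For reference, a self-contained sketch of the same idea is shown here, with a bounds check added so that malformed input cannot run past the end of the string. The names parseDigits and formatGrid are assumptions made for illustration, and only the first output line is produced; the second line could be built along the same lines.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Copies the first nine digits found in `input` into `grid`, row by row.
// Returns false if the input holds fewer than nine digits.
bool parseDigits(const std::string& input, int (&grid)[3][3]) {
    std::size_t pos = 0;
    for (auto& row : grid) {
        for (auto& cell : row) {
            // Skip separators such as '.' and '|', but never run past the end.
            while (pos < input.size() &&
                   !std::isdigit(static_cast<unsigned char>(input[pos])))
                ++pos;
            if (pos == input.size()) return false;
            cell = input[pos++] - '0';   // digit characters are contiguous
        }
    }
    return true;
}

// Builds the first output line using only the values of the int array.
std::string formatGrid(const int (&grid)[3][3]) {
    std::string out;
    for (int g = 0; g < 3; ++g) {
        if (g > 0) out += "//";          // group separator replaces '|'
        for (int value : grid[g]) {
            out += ';';                  // ';' replaces '.'
            out += static_cast<char>('0' + value);
        }
        if (g < 2) out += ';';           // the last group has no trailing ';'
    }
    return out;
}
```

Calling parseDigits on the first input line from the question and passing the filled array to formatGrid gives the first output line from the question.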

Error handling when reading integers from a file

Sorry for the somewhat beginner question, but I've been at this for a couple of days and can't figure out a solution.
I'm basically reading integers from a file, these files should have a set amount of numbers, for the purpose of this question let us say 40. I can return an error fine when the file has less than or more than 40 integers. However, if there happens to be a non-numeric character in there, I'm struggling to figure out how to return an error.
This is what I'm currently doing:
int number = 0;
int counter = 0;
while(inputstream >> number)
{
// random stuff
counter++;
}
if (counter < 40)
return error;
It is at this point I'm a bit confused where to go. My while loop will terminate when the input stream is not an int, but there are two cases when this could happen, a non-integer character is in there, or the end of file has been reached. If we're at eof, my error message is fine and there were less than 40 integers. However, we could also be at less than 40 if it encountered a non-int somewhere. I want to be able to determine the difference between the two but struggling to figure out how to do this. Any help would be appreciated. Thanks!

You can read one line at a time inside the loop and try to convert it to an integer (this assumes one integer per line). If the conversion fails, the line holds a non-integer, so break out of the loop immediately and report that a non-integer was found.
Otherwise, keep reading until the end of the file, then check whether the loop read the whole file or broke early because of a non-integer value, and whether the count is less or more than 40:
#include <iostream>
#include <string>
#include <fstream>
#include <cstdlib> // strtol
using namespace std;
enum ERRORFLAG{INIT, LESS_40, MORE_40, NON_INT}; // enumerate error
int main()
{
ifstream in("data.txt");
string sLine; // input one line for each read
int value; // value that will be assigned the return value of conversion
int count = 0; // counter for integer values
ERRORFLAG erFlag = INIT; // initialize the error flag
while(getline(in, sLine)) // iterate reading one line each time
{
char* end = 0; // strtol stores here where parsing stopped
value = static_cast<int>(strtol(sLine.c_str(), &end, 10)); // unlike atoi, this does not confuse a line containing "0" with a failure
if( end == sLine.c_str() || *end != '\0' ) // nothing was parsed, or non-numeric characters follow the number
{
erFlag = NON_INT; // setting the error flag to non-int and break
break;
}
else
count++; // otherwise continue reading incrementing count
}
if(INIT == erFlag) // check whether the loop finishes successfully or a non-int caused it to break
{
if( count < 40) // checking whether number of ints less than 40
erFlag = LESS_40; //
else
if(count > 40) // or more than 40
erFlag = MORE_40;
}
// printing the result
switch(erFlag)
{
case LESS_40:
cout << "Error: less than 40 integers" << endl;
break;
case MORE_40:
cout << "Error: More than 40 integers" << endl;
break;
case NON_INT:
cout << "Error: non-integer found!" << endl;
break;
default: // INIT: exactly 40 integers were read and no error occurred
cout << "OK: exactly 40 integers" << endl;
}
in.close();
std::cout << std::endl;
return 0;
}

#include <iostream>
using namespace std;
int main() {
int count = 0;
int x;
istream& is = cin; // works with every class that inherits this one
while (is >> x) ++count;
if (is.eof()) {} // end of file reached
else {} // a bad value has been read
cout << "Read count " << count << '\n';
}

This program works fine: first read through the file, checking for characters that are neither digits nor white space, and if you find one, set the error flag and break immediately.
Keep in mind that white space such as spaces and tabs is not considered invalid because it is used as a separator in your file, so any character other than a digit or white space breaks the loop with an error.
If no error occurred (no invalid character was found) and the end of the file was reached, read the file again, pushing the integer values into a vector, which avoids the need for a counter. Then check the size of the vector: if it is less or more than 40, issue an error; otherwise print the contents of the vector:
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <cctype> // isspace, isdigit
using namespace std;
enum ERRORFLAG{INIT, LESS_40, MORE_40, NON_INT};
int main()
{
ifstream in("data.txt");
char c;
string sLine;
int value;
vector<int> vec;
ERRORFLAG erFlag = INIT;
while(in >> c)
{
if(!isspace(c) && !isdigit(c))
{
erFlag = NON_INT;
break;
}
}
in.clear();
in.seekg(0, ios::beg); // move the get pointer back to the beginning of the file
while(in >> value)
vec.push_back(value);
if(NON_INT == erFlag)
cout << "non-int found!" << endl;
else
{
if(vec.size() < 40)
cout << "less than 40 integers!" << endl;
else
if(vec.size() > 40)
cout << "more than 40 integers found!" << endl;
else
for(int i(0); i < vec.size(); i++)
cout << vec[i] << ", ";
}
std::cout << std::endl;
return 0;
}

For-loop with strings not working

Here are the instructions:
Write a program that reads in a text file one word at a time. Store a word into a dynamically created array when it is first encountered. Create a parallel integer array to hold a count of the number of times that each particular word appears in the text file. If the word appears in the text file multiple times, do not add it into your dynamic array, but make sure to increment the corresponding word frequency counter in the parallel integer array. Remove any trailing punctuation from all words before doing any comparisons.
Create and use the following text file containing a quote from Bill Cosby to test your program.
I don't know the key to success, but the key to failure is trying to please everybody.
At the end of your program, generate a report that prints the contents of your two arrays.
Here is my Code:
#include <iostream>
#include <fstream>
#include <string>
#include <cstring>
#include <cctype>
using namespace std;
int main()
{
ifstream inputFile;
int numWords;
string filename;
string *readInArray = 0;
char testArray[300] = {0};
char *realArray = 0;
const char *s1 = 0;
string word;
int j =1;
int k = 0;
int start =0;
int ending = 0;
char wordHolder[20] = {0};
cout << "Enter the number of words the file contains: ";
cin >> numWords;
readInArray = new string[(2*numWords)-1];
cout << "Enter the filename you wish to read in: ";
cin >> filename;
inputFile.open(filename.c_str());
if (inputFile)
{
cout << "\nHere is the text from the file:\n\n";
for (int i=0; i <= ((2*numWords) -1); i +=2)
{
inputFile >> readInArray[i]; // Store word from file to string array
cout << readInArray[i];
strcat(testArray, readInArray[i].c_str()); // Copy c-string conversion of word
// just read in to c-string
readInArray[j] = " ";
cout << readInArray[j];
strcat(testArray, readInArray[j].c_str()); // This part is for adding spaces in arrays
++j;
}
inputFile.close();
}
else
{
cout << "Could not open file, ending program";
return 0;
}
realArray = new char[strlen(testArray)];
cout << "\n\n";
for(int i=0; i < strlen(testArray); ++i)
{
if (isalpha(testArray[i]) || isspace(testArray[i])) // Is makes another char array equal to
{ // the first one but without any
realArray[k]=testArray[i]; // Punctuation
cout << realArray[k] ;
k++;
}
}
cout << "\n\n";
for (int i=0; i < ((2*numWords) -1); i+=2)
{
while (isalpha(realArray[ending])) // Finds space in char array to stop
{
++ending;
}
cout << "ending: " << ending << " ";
for ( ; start < ending; ++start) // saves the array up to stopping point
{ // into a holder c-string
wordHolder[start] = realArray[start];
}
cout << "start: " << start << " ";
readInArray[i] = string(wordHolder); // Converts holder c-string to string and
cout << readInArray[i] << endl; // assigns to element in original string array
start = ending; // Starts reading where left off
++ending; // Increments ending counter
}
return 0;
}
Output:
Enter the number of words the file contains: 17
Enter the filename you wish to read in: D:/Documents/input.txt
Here is the text from the file:
I don't know the key to sucess, but the key to failure is trying to please everybody.
I dont know the key to sucess but the key to failure is trying to please everybody
ending: 1 start: 1 I
ending: 6 start: 6 I dont
ending: 11 start: 11 I dont know
ending: 15 start: 15 I dont know the
ending: 19 start: 19 I dont know the key
ending: 22 start: 22 I dont know the key to>
ending: 29 start: 29 I dont know the key to sucess
ending: 33 start: 33 I dont know the key to sucess but↕>
My Question:
Something is wrong with the last for-loop; the program crashes when I run it. I included the ending and starting variables to maybe help see what's going on. I know there are better ways of doing this problem, but the instructor wants it done this way. If you know where I went wrong with the last for-loop, any help would be very much appreciated!

You aren't null-terminating your strings as you go along. You copy the characters correctly, but without null terminators, your loops might go off into the weeds.

Trouble with dynamic arrays and string occurrence (C++)

I am working on a lab for my C++ class. I have a very basic working version of my lab running, however it is not quite how it is supposed to be.
The assignment:
Write a program that reads in a text file one word at a time. Store a word into a dynamically created array when it is first encountered. Create a parallel integer array to hold a count of the number of times that each particular word appears in the text file. If the word appears in the text file multiple times, do not add it into your dynamic array, but make sure to increment the corresponding word frequency counter in the parallel integer array. Remove any trailing punctuation from all words before doing any comparisons.
Create and use the following text file containing a quote from Bill Cosby to test your program.
I don't know the key to success, but the key to failure is trying to please everybody.
At the end of your program, generate a report that prints the contents of your two arrays in a format similar to the following:
Word Frequency Analysis
Word Frequency
I 1
don't 1
know 1
the 2
key 2
...
I can figure out if a word repeats more than once in the array, but I cannot figure out how to not add/remove that repeated word to/from the array. For instance, the word "to" appears three times, but it should only appear in the output one time (meaning it is in one spot in the array).
My code:
using namespace std;
int main()
{
ifstream file;
file.open("Quote.txt");
if (!file)
{
cout << "Error: Failed to open the file.";
}
else
{
string stringContents;
int stringSize = 0;
// find the number of words in the file
while (file >> stringContents)
{
stringSize++;
}
// close and open the file to start from the beginning of the file
file.close();
file.open("Quote.txt");
// create dynamic string arrays to hold the contents of the file
// these will be used to compare with each other the frequency
// of the words in the file
string *mainContents = new string[stringSize];
string *compareContents = new string[stringSize];
// holds the frequency of each word found in the file
int frequency[stringSize];
// initialize frequency array
for (int i = 0; i < stringSize; i++)
{
frequency[i] = 0;
}
stringContents = "";
cout << "Word\t\tFrequency\n";
for (int i = 0; i < stringSize; i++)
{
// if at the beginning of the iteration
// don't check for the reoccurence of the same string in the array
if (i == 0)
{
file >> stringContents;
// convert the current word to a c-string
// so we can remove any trailing punctuation
int wordLength = stringContents.length() + 1;
char *word = new char[wordLength];
strcpy(word, stringContents.c_str());
// set this to no value so that if the word has punctuation
// needed to remove, we can modify this string
stringContents = "";
// remove punctuation except for apostrophes
for (int j = 0; j < wordLength; j++)
{
if (ispunct(word[j]) && word[j] != '\'')
{
word[j] = '\0';
}
stringContents += word[j];
}
mainContents[i] = stringContents;
compareContents[i] = stringContents;
frequency[i] += 1;
}
else
{
file >> stringContents;
int wordLength = stringContents.length() + 1;
char *word = new char[wordLength];
strcpy(word, stringContents.c_str());
// set this to no value so that if the word has punctuation
// needed to remove, we can modify this string
stringContents = "";
for (int j = 0; j < wordLength; j++)
{
if (ispunct(word[j]) && word[j] != '\'')
{
word[j] = '\0';
}
stringContents += word[j];
}
// stringContents = "dont";
//mainContents[i] = stringContents;
compareContents[i] = stringContents;
// search for reoccurence of the word in the array
// if the array already contains the word
// don't add the word to our main array
// this is where I am having difficulty
for (int j = 0; j < stringSize; j++)
{
if (compareContents[i].compare(compareContents[j]) == 0)
{
frequency[i] += 1;
}
else
{
mainContents[i] = stringContents;
}
}
}
cout << mainContents[i] << "\t\t" << frequency[i];
cout << "\n";
}
}
file.close();
return 0;
}
I apologize if the code is difficult to understand/follow through. Any feedback is appreciated :]

If you use the STL, the entire problem can be solved easily, with less code.
#include <iostream>
#include <fstream>
#include <string>
#include <unordered_map>
#include <algorithm>
#include <cctype> // ispunct
using namespace std;
int main()
{
ifstream file("Quote.txt");
string aword;
unordered_map<string,int> wordFreq;
if (!file.good()) {
cout << "Error: Failed to open the file.";
return 1;
}
else {
while( file >> aword ) {
aword.erase(remove_if(aword.begin (), aword.end (), ::ispunct), aword.end ()); //Remove Punctuations from string
unordered_map<string,int>::iterator got = wordFreq.find(aword);
if ( got == wordFreq.end() )
wordFreq.insert(std::make_pair<string,int>(aword.c_str(),1)); //insert the unique strings with default freq 1
else
got->second++; //found - increment freq
}
}
file.close();
cout << "\tWord Frequency Analyser\n"<<endl;
cout << " Frequency\t Unique Words"<<endl;
unordered_map<string,int>::iterator it;
for ( it = wordFreq.begin(); it != wordFreq.end(); ++it )
cout << "\t" << it->second << "\t\t" << it->first << endl;
return 0;
}

The algorithm that you use is very complex for such a simple task. Here is what you should do:
First reading pass: determine the maximum size of the array.
Second reading pass: decide directly what to do with each word. If the string is already in the table, just increment its frequency; otherwise add it to the table.
Output the table.
The else block of your code would then look like:
string stringContents;
int stringSize = 0;
// find the number of words in the file
while (file >> stringContents)
stringSize++;
// close and open the file to start from the beginning of the file
file.close();
file.open("Quote.txt");
string *mainContents = new string[stringSize]; // dynamic array for strings found
int *frequency = new int[stringSize]; // dynamic array for frequency
int uniqueFound = 0; // no unique string found
for (int i = 0; i < stringSize && (file >> stringContents); i++)
{
//remove trailing punctuations
while (stringContents.size() && ispunct(stringContents.back()))
stringContents.pop_back();
// process string found
bool found = false;
for (int j = 0; j < uniqueFound; j++)
if (mainContents[j] == stringContents) { // if string already exist
frequency[j] ++; // increment frequency
found = true;
}
if (!found) { // if string not found, add it !
mainContents[uniqueFound] = stringContents;
frequency[uniqueFound++] = 1; // and increment number of found
}
}
// display results
cout << "Word\t\tFrequency\n";
for (int i=0; i<uniqueFound; i++)
cout << mainContents[i] << "\t\t" << frequency[i] <<endl;
}
Ok, it's an assignment, so you have to use arrays. Later you could simplify this code into:
string stringContents;
map<string, int> frequency;
while (file >> stringContents) {
while (stringContents.size() && ispunct(stringContents.back()))
stringContents.pop_back();
frequency[stringContents]++;
}
cout << "Word\t\tFrequency\n";
for (auto w:frequency)
cout << w.first << "\t\t" << w.second << endl;
and even have the words sorted alphabetically.

Depending on whether or not your assignment requires that you use an 'array', per se, you could consider using a std::vector or even a System::Collections::Generic::List for C++/CLI.
Using vectors, your code might look something like this:
#include <vector>
#include <string>
#include <fstream>
#include <iostream>
#include <cstdlib> // system
using namespace std;
int wordIndex(string); //Prototype a function to check if the vector contains the word
void processWord(string); //Prototype a function to handle each word found
vector<string> wordList; //The dynamic word list
vector<int> wordCount; //The dynamic word count
int main() {
ifstream file("Quote.txt");
if (!file) {
cout << "Error: Failed to read file" << endl;
} else {
//Read each word into the 'word' variable; the loop stops as soon as a read fails
string word;
while (file >> word) {
//Algorithm to remove punctuation here
processWord(word);
}
}
//Write the output to the console
for (int i = 0, j = wordList.size(); i < j; i++) {
cout << wordList[i] << ": " << wordCount[i] << endl;
}
system("pause");
return 0;
}
}
void processWord(string word) {
int index = wordIndex(word); //Get the index of the word in the vector - if the word isn't in the vector yet, the function returns -1.
//This serves a double purpose: check if the word exists in the vector, and if it does, what its index is.
if (index > -1) {
wordCount[index]++; //If the word exists, increment its word count in the parallel vector.
} else {
wordList.push_back(word); //If not, add a new entry
wordCount.push_back(1); //in both vectors.
}
}
int wordIndex(string word) {
//Iterate through the word list vector
for (int i = 0, j = wordList.size(); i < j; i++) {
if (wordList[i] == word) {
return i; //The word has been found. Return its index.
}
}
return -1; //The word is not in the vector. Return -1 to tell the program that the word hasn't been added yet.
}
I've tried to annotate any new code/concepts with comments to make it easy to understand, so hopefully you can find it useful.
As a side note, you may notice that I've moved a lot of the repetitive code out of the main function and into other functions. This allows for more efficient and readable coding because you can divide each problem into easily manageable, smaller problems.
Hope this can be of some use.
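The punctuation step left open in the code above could be filled in along these lines. This is only a sketch: stripTrailingPunct is a name made up for illustration, and it removes trailing punctuation only, so the apostrophe in "don't" is kept.

```cpp
#include <cctype>
#include <string>

// Returns `word` with any punctuation characters at its end removed.
std::string stripTrailingPunct(std::string word) {
    while (!word.empty() &&
           std::ispunct(static_cast<unsigned char>(word.back())))
        word.pop_back();
    return word;
}
```

Inside the reading loop, the call would then become processWord(stripTrailingPunct(word)).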

What is the most efficient way to read millions of integers separated by lines from a text file in C++

I have about 25 million integers, one per line, in my text file. My first task is to take those integers and sort them. I have managed to read the integers and put them into an array (since my sorting function takes an unsorted array as an argument). However, reading the integers from the file is a very long and expensive process. I have searched for cheaper and more efficient ways of doing this, but I was not able to find one that deals with such sizes. Therefore, what would your suggestion be for reading the integers from the huge (about 260MB) text file? And also, how can I get the number of lines efficiently for the same problem?
ifstream myFile("input.txt");
int currentNumber;
int nItems = 25000000;
int *arr = (int*) malloc(nItems*sizeof(*arr));
int i = 0;
while (myFile >> currentNumber)
{
arr[i++] = currentNumber;
}
This is just how I get the integers from the text file. It is not that complicated. I assumed the number of lines is fixed (actually it is fixed).
By the way, it is not too slow of course. It completes reading in approximately 9 seconds on OS X with a 2.2GHz i7 processor. But I feel it could be much better.

Most likely, any optimisation of this will have rather little effect. On my machine, the limiting factor for reading large files is the disk transfer speed. Yes, making the reading code faster can improve it a little bit, but you won't gain very much from that.
I found in a previous test [I'll see if I can find the answer with that in it - I couldn't find the source in my "experiment code for SO" directory] that the fastest way is to load the file using mmap. But it's only marginally faster than using ifstream.
Edit: my home-made benchmark for reading a file in a few different ways.
getline while reading a file vs reading whole file and then splitting based on newline character
As per usual, benchmarks measure what the benchmark measures, and small changes to either the environment or the way the code is written can sometimes make a big difference.
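One further variant that could be measured, assuming a C++17 compiler, is to load the whole file into memory and parse it with std::from_chars, which does no locale handling. The function name parseInts is made up for illustration, and whether this is actually faster on a given machine would need to be benchmarked like everything else here.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <vector>

// Parses whitespace-separated decimal integers from an in-memory buffer.
// Stops at the first token that is not a valid int.
std::vector<int> parseInts(std::string_view text) {
    std::vector<int> values;
    const char* p = text.data();
    const char* end = p + text.size();
    while (p < end) {
        // Skip spaces and line breaks between numbers.
        while (p < end && (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t'))
            ++p;
        if (p == end) break;

        int value = 0;
        const std::from_chars_result result = std::from_chars(p, end, value);
        if (result.ec != std::errc()) break;   // not a number, or out of range
        values.push_back(value);
        p = result.ptr;                        // continue after the parsed digits
    }
    return values;
}
```

The buffer could come from reading the file into a std::string in one go; calling reserve on the vector with the known number of lines would avoid reallocations.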
Edit:
Here are a few implementations of "read a number from a file and store it in a vector":
#include <iostream>
#include <fstream>
#include <vector>
#include <sys/time.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h> // lseek, close
using namespace std;
const char *file_name = "lots_of_numbers.txt";
void func1()
{
vector<int> v;
int num;
ifstream fin(file_name);
while( fin >> num )
{
v.push_back(num);
}
cout << "Number of values read " << v.size() << endl;
}
void func2()
{
vector<int> v;
v.reserve(42336000);
int num;
ifstream fin(file_name);
while( fin >> num )
{
v.push_back(num);
}
cout << "Number of values read " << v.size() << endl;
}
void func3()
{
int *v = new int[42336000];
int num;
ifstream fin(file_name);
int i = 0;
while( fin >> num )
{
v[i++] = num;
}
cout << "Number of values read " << i << endl;
delete [] v;
}
void func4()
{
int *v = new int[42336000];
FILE *f = fopen(file_name, "r");
int num;
int i = 0;
while(fscanf(f, "%d", &num) == 1)
{
v[i++] = num;
}
cout << "Number of values read " << i << endl;
fclose(f);
delete [] v;
}
void func5()
{
int *v = new int[42336000];
int num = 0;
ifstream fin(file_name);
char buffer[8192];
int i = 0;
int bytes = 0;
char *p;
int hasnum = 0;
int eof = 0;
while(!eof)
{
fin.read(buffer, sizeof(buffer));
p = buffer;
bytes = 8192;
while(bytes > 0)
{
if (*p == 26) // End of file marker...
{
eof = 1;
break;
}
if (*p == '\n' || *p == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
p++;
bytes--;
hasnum = 0;
}
else if (*p >= '0' && *p <= '9')
{
hasnum = 1;
num *= 10;
num += *p-'0';
p++;
bytes--;
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
memset(buffer, 26, sizeof(buffer)); // To detect end of files.
}
cout << "Number of values read " << i << endl;
delete [] v;
}
void func6()
{
int *v = new int[42336000];
int num = 0;
FILE *f = fopen(file_name, "r");
char buffer[8192];
int i = 0;
int bytes = 0;
char *p;
int hasnum = 0;
int eof = 0;
while(!eof)
{
fread(buffer, 1, sizeof(buffer), f);
p = buffer;
bytes = 8192;
while(bytes > 0)
{
if (*p == 26) // End of file marker...
{
eof = 1;
break;
}
if (*p == '\n' || *p == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
p++;
bytes--;
hasnum = 0;
}
else if (*p >= '0' && *p <= '9')
{
hasnum = 1;
num *= 10;
num += *p-'0';
p++;
bytes--;
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
memset(buffer, 26, sizeof(buffer)); // To detect end of files.
}
fclose(f);
cout << "Number of values read " << i << endl;
delete [] v;
}
void func7()
{
int *v = new int[42336000];
int num = 0;
FILE *f = fopen(file_name, "r");
int ch;
int i = 0;
int hasnum = 0;
while((ch = fgetc(f)) != EOF)
{
if (ch == '\n' || ch == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
hasnum = 0;
}
else if (ch >= '0' && ch <= '9')
{
hasnum = 1;
num *= 10;
num += ch-'0';
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
fclose(f);
cout << "Number of values read " << i << endl;
delete [] v;
}
void func8()
{
int *v = new int[42336000];
int num = 0;
int f = open(file_name, O_RDONLY);
off_t size = lseek(f, 0, SEEK_END);
char *buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);
int i = 0;
int hasnum = 0;
int bytes = size;
char *p = buffer;
while(bytes > 0)
{
if (*p == '\n' || *p == ' ')
{
if (hasnum)
v[i++] = num;
num = 0;
p++;
bytes--;
hasnum = 0;
}
else if (*p >= '0' && *p <= '9')
{
hasnum = 1;
num *= 10;
num += *p-'0';
p++;
bytes--;
}
else
{
cout << "Error..." << endl;
exit(1);
}
}
close(f);
munmap(buffer, size);
cout << "Number of values read " << i << endl;
delete [] v;
}
struct bm
{
void (*f)();
const char *name;
};
#define BM(f) { f, #f }
bm b[] =
{
BM(func1),
BM(func2),
BM(func3),
BM(func4),
BM(func5),
BM(func6),
BM(func7),
BM(func8),
};
double time_to_double(timeval *t)
{
return (t->tv_sec + (t->tv_usec/1000000.0)) * 1000.0;
}
double time_diff(timeval *t1, timeval *t2)
{
return time_to_double(t2) - time_to_double(t1);
}
int main()
{
for(int i = 0; i < sizeof(b) / sizeof(b[0]); i++)
{
timeval t1, t2;
gettimeofday(&t1, NULL);
b[i].f();
gettimeofday(&t2, NULL);
cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
}
for(int i = sizeof(b) / sizeof(b[0])-1; i >= 0; i--)
{
timeval t1, t2;
gettimeofday(&t1, NULL);
b[i].f();
gettimeofday(&t2, NULL);
cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
}
}
Results (two consecutive runs, forwards and backwards to avoid file-caching benefits):
Number of values read 42336000
func1: 6068.53ms
Number of values read 42336000
func2: 6421.47ms
Number of values read 42336000
func3: 5756.63ms
Number of values read 42336000
func4: 6947.56ms
Number of values read 42336000
func5: 941.081ms
Number of values read 42336000
func6: 962.831ms
Number of values read 42336000
func7: 2572.4ms
Number of values read 42336000
func8: 816.59ms
Number of values read 42336000
func8: 815.528ms
Number of values read 42336000
func7: 2578.6ms
Number of values read 42336000
func6: 948.185ms
Number of values read 42336000
func5: 932.139ms
Number of values read 42336000
func4: 6988.8ms
Number of values read 42336000
func3: 5750.03ms
Number of values read 42336000
func2: 6380.36ms
Number of values read 42336000
func1: 6050.45ms
In summary, as someone pointed out in the comments, the actual parsing of integers is quite a substantial part of the whole time, so reading the file isn't quite as critical as I first made out. Even a very naive way of reading the file (using fgetc()) beats the ifstream operator>> for integers.
As can be seen, using mmap to load the file (func8, about 816 ms) is faster than reading the file in blocks via fstream (func5, about 940 ms), but only marginally so.

You can use external sorting to sort values in your file without loading them all into memory. Sorting speed will be limited by your hard drive capabilities, but you will be able to mess with really huge files. Here is the implementation.

Try reading blocks of integers and parsing those blocks instead of reading line by line.
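A minimal sketch of that idea, assuming the file holds only non-negative integers separated by whitespace. The function name read_ints_blockwise and the default block size are made up for illustration. It uses gcount() to learn how many bytes each read actually delivered, and carries a number that straddles two blocks over to the next one.

```cpp
#include <cstddef>
#include <fstream>   // for the std::ifstream usage mentioned below
#include <istream>
#include <sstream>   // handy for trying the function on a string
#include <vector>

// Read non-negative integers from `in` one block at a time.
// Any non-digit byte is treated as a separator (an assumption of this sketch).
std::vector<int> read_ints_blockwise(std::istream& in, std::size_t block_size = 8192)
{
    std::vector<int> values;
    std::vector<char> buffer(block_size);
    int num = 0;            // number being assembled, may span two blocks
    bool has_num = false;   // true once at least one digit has been seen

    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();   // bytes actually read
        for (std::streamsize k = 0; k < got; ++k) {
            const char c = buffer[static_cast<std::size_t>(k)];
            if (c >= '0' && c <= '9') {
                num = num * 10 + (c - '0');
                has_num = true;
            } else if (has_num) {                  // a separator ends the number
                values.push_back(num);
                num = 0;
                has_num = false;
            }
        }
    }
    if (has_num)                                   // input did not end with a separator
        values.push_back(num);
    return values;
}
```

To use it on a file, open a std::ifstream (binary mode avoids newline translation) and pass it in. There is no overflow checking, so values must fit in an int.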

260MB is not that big. You should be able to load the whole thing into memory and then parse through it. Once it is in memory, you can use a nested loop to read the integers between line endings and convert them using the usual functions. I'd try to preallocate sufficient memory for your array of integers before you start.
Oh, and you may find the crude old C-style file access functions are the faster options for things like this.
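Something along these lines might work as a starting point. It is only a sketch: load_ints_whole_file and expected_count are hypothetical names, and it assumes the file fits in memory and that its size fits in a long. It reads everything with fopen/fread into one buffer and converts with strtol, which skips the whitespace between numbers by itself.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Load the whole file into memory, then convert number by number with strtol.
// Returns an empty vector if the file cannot be opened or is empty.
std::vector<long> load_ints_whole_file(const char* path, std::size_t expected_count = 0)
{
    std::vector<long> values;
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return values;

    std::fseek(f, 0, SEEK_END);
    const long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    if (size <= 0) { std::fclose(f); return values; }

    std::string text(static_cast<std::size_t>(size), '\0');
    const std::size_t got = std::fread(&text[0], 1, text.size(), f);
    std::fclose(f);
    text.resize(got);

    values.reserve(expected_count);   // preallocate if the count is roughly known
    const char* p = text.c_str();     // c_str() guarantees a terminating '\0'
    char* end = nullptr;
    for (;;) {
        const long v = std::strtol(p, &end, 10);
        if (end == p) break;          // nothing converted: end of data or bad input
        values.push_back(v);
        p = end;
    }
    return values;
}
```

Parsing stops silently at the first token that is not a number, so a real program would probably want to report where that happened.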

I would do it this way:
#include <cstdlib> // system()
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
int main() {
ifstream file("myFile.txt"); // Open to read
if (!file) {
// Streams do not throw by default, so check the state instead of using try/catch
cerr << "Failed to open myFile.txt" << endl;
return 1;
}
string line;
int lineCount = 0;
while (getline(file, line)) {
lineCount++;
try {
int intValue = stoi(line); // throws invalid_argument or out_of_range
// Do something with your value
cout << "Value for line " << lineCount << " : " << intValue << endl;
} catch (const exception& e) {
cerr << "Failed to convert line " << lineCount << " to an int : " << e.what() << endl;
}
}
cout << "Line count : " << lineCount << endl;
system("PAUSE"); // Windows only; the file closes itself when it goes out of scope
}

It will be pretty straightforward with Qt:
QFile file("h:/1.txt");
file.open(QIODevice::ReadOnly);
QDataStream in(&file);
QVector<int> ints;
ints.reserve(25000000);
while (!in.atEnd()) {
int integer;
qint8 line;
in >> integer >> line; // read an int into integer, a char into line
ints.append(integer); // append the integer to the vector
}
At the end, you have the ints QVector you can easily sort. The number of lines is the same as the size of the vector, provided the file was properly formatted.
On my machine, an i7 3770k @ 4.2 GHz, it takes about 490 milliseconds to read 25 million ints and put them into a vector. That is reading from a regular mechanical HDD, not an SSD.
Buffering the entire file into memory didn't help all that much, time dropped to 420 msec.

You don't say how you are reading the values, so it's hard to
say. Still, there are really only two solutions: `someIStream >> anInt`
and `fscanf( someFd, "%d", &anInt )`. Logically, these
should have similar performance, but implementations vary; it
might be worth trying and measuring both.
Another thing to check is how you're storing them. If you know
you have about 25 million, doing a reserve of 30 million on
the std::vector before reading them would probably help. It
might also be cheaper to construct the vector with 30 million
elements, then trim it when you've seen the end, rather than
using push_back.
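The two storage strategies could look roughly like this. It is a sketch with invented function names (read_with_reserve, read_presized), where `estimate` stands for the guessed element count, such as the 30 million mentioned above.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // handy for trying the functions on a string
#include <vector>

// Variant A: reserve capacity up front, then push_back as usual.
std::vector<int> read_with_reserve(std::istream& in, std::size_t estimate)
{
    std::vector<int> v;
    v.reserve(estimate);
    int x;
    while (in >> x) v.push_back(x);
    return v;
}

// Variant B: construct the vector with `estimate` elements, write by index,
// and trim the unused tail once the end of input is reached.
std::vector<int> read_presized(std::istream& in, std::size_t estimate)
{
    std::vector<int> v(estimate);
    std::size_t n = 0;
    int x;
    while (in >> x) {
        if (n == v.size()) v.resize(v.size() * 2 + 1);  // the estimate was too low
        v[n++] = x;
    }
    v.resize(n);   // drop the elements that were never filled
    return v;
}
```

Both return the same contents; which one is cheaper depends on the implementation, so it is worth timing them on the real data.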
Finally, you might consider writing a mmapstreambuf, and
using that to mmap the input, and read it directly from the
mapped memory. Or even iterating over it manually, calling
strtol (but that's a lot more work); all of the streaming
solutions probably end up calling strtol, or something
similar, but doing significant work around the call first.
EDIT:
FWIW, I did some very quick tests on my home machine (a fairly
recent Lenovo, running Linux), and the results surprised me:
As a reference, I did the trivial, naïve implementation, using
std::cin >> tmp and v.push_back( tmp );, with no attempts to
optimize. On my system, this ran in just under 10 seconds.
Simple optimizations, such as using reserve on the vector,
or initially creating the vector with a size of 25000000, didn't
change much—the time was still over 9 seconds.
Using a very simple mmapstreambuf, the time dropped to
around 3 seconds—with the simplest loop, no reserve,
etc.
Using fscanf, the time dropped to just under 3 seconds. I
suspect that the Linux implementation of FILE* also uses
mmap (and std::filebuf doesn't).
Finally, using a mmapbuffer, iterating with two char*, and
using strtol to convert, the time dropped to under a second.
These tests were done very quickly (less than an hour to write
and run all of them), and are far from rigorous (and of course,
don't tell you anything about other environments), but the
differences surprised me. I didn't expect as much difference.
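For illustration, the last variant (map the file, walk it with two char pointers) might be sketched as below. This is not the code used for the timings above. It is POSIX only, the function names are invented, and it swaps strtol for std::from_chars (C++17), because a mapped file is not guaranteed to be null-terminated and from_chars takes an explicit end pointer.

```cpp
#include <charconv>
#include <cstddef>
#include <system_error>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Parse integers from the half-open character range [p, end).
std::vector<int> parse_range(const char* p, const char* end)
{
    std::vector<int> values;
    while (p != end) {
        if (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t') { ++p; continue; }
        int v = 0;
        const std::from_chars_result r = std::from_chars(p, end, v);
        if (r.ec != std::errc()) break;   // not a number or out of range: stop
        values.push_back(v);
        p = r.ptr;                        // continue right after the parsed digits
    }
    return values;
}

// Map the file read-only and parse it in place.
// Returns an empty vector if the file cannot be opened or mapped.
std::vector<int> read_ints_mmap(const char* path)
{
    std::vector<int> values;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return values;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t size = static_cast<std::size_t>(st.st_size);
        void* map = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map != MAP_FAILED) {
            const char* begin = static_cast<const char*>(map);
            values = parse_range(begin, begin + size);
            munmap(map, size);
        }
    }
    close(fd);
    return values;
}
```

Keeping the parser separate from the mapping means the same parse_range can be run on any in-memory buffer, which makes it easy to compare mmap against a plain read.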

One possible solution would be dividing the large file into smaller chunks. Sort each chunk separately and then merge all the sorted chunks one by one.
EDIT:
Apparently this is a well-established method. See 'External merge sort' at http://en.wikipedia.org/wiki/External_sorting
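A compact sketch of that chunk-and-merge scheme follows, under some loud assumptions: the function name external_sort, the chunk_size parameter, and the "<out_path>.run<N>" temporary file names are all invented here, and every run file is opened at once during the merge, so very many chunks would need a multi-pass merge instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sort the integers in `in_path` into `out_path`, holding at most
// `chunk_size` values in memory at a time.
void external_sort(const std::string& in_path, const std::string& out_path,
                   std::size_t chunk_size)
{
    // Phase 1: read a chunk, sort it in memory, write it out as a sorted run.
    std::vector<std::string> runs;
    {
        std::ifstream in(in_path);
        std::vector<int> chunk;
        chunk.reserve(chunk_size);
        int x;
        for (;;) {
            chunk.clear();
            while (chunk.size() < chunk_size && in >> x) chunk.push_back(x);
            if (chunk.empty()) break;             // no more input
            std::sort(chunk.begin(), chunk.end());
            runs.push_back(out_path + ".run" + std::to_string(runs.size()));
            std::ofstream run(runs.back());
            for (int v : chunk) run << v << '\n';
        }
    }

    // Phase 2: k-way merge. The min-heap holds the next value of each run.
    std::vector<std::ifstream> readers;
    using Item = std::pair<int, std::size_t>;     // (value, index of its run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t i = 0; i < runs.size(); ++i) {
        readers.emplace_back(runs[i]);
        int v;
        if (readers[i] >> v) heap.push({v, i});
    }
    std::ofstream out(out_path);
    while (!heap.empty()) {
        const Item top = heap.top();
        heap.pop();
        out << top.first << '\n';
        int v;
        if (readers[top.second] >> v) heap.push({v, top.second});  // refill from same run
    }

    readers.clear();                              // close the runs before deleting them
    for (const std::string& r : runs) std::remove(r.c_str());
}
```

Memory use is bounded by chunk_size during the first phase and by the number of runs during the merge, which is the point of the technique.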
