This is quite a basic problem, so I guess the solution shouldn't be hard, but I haven't found a simple way to do it, and I couldn't phrase it well enough to find an answer on the internet. On to the question: I have a file with information like this:
1988 Godfather 3 33 42
1991 Dance with Wolves 3 35 43
1992 Silence of the lambs 3 33 44
I need to put all of this information into a data structure: say, an int year, a string name, and three more ints for the numbers. But how do I know whether the next thing I read is a number or not? I never know how long the title is. Thanks in advance to anyone who takes the time for such a basic problem. :)
EDIT: Don't consider movies with numbers in their title.
You're going to have some major issues when you try to parse other movies, like Free Willy 2.
You might try instead to treat it as a std::stringstream and rely on the last three chunks being the data you're looking for rather than generalizing with a Regular Expression.
Your best bet would be to use C++ regex (std::regex, available since C++11).
That gives you more fine-grained control over what you want to parse.
Examples:
year -> \d{4}
word -> \w+
number -> \d+
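As a rough sketch of how these patterns could be combined with std::regex: the Movie struct, its field names, and the function parseMovie are made up for illustration and are not from the question. The title group is lazy, so the three trailing integers are always taken from the end of the line.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical record type; the field names are illustrative only.
struct Movie {
    int year;
    std::string name;
    int a, b, c;
};

// Returns true and fills `out` if `line` looks like "<year> <title> <n> <n> <n>".
bool parseMovie(const std::string& line, Movie& out) {
    // (\d{4}) is the year, (.+?) the title, followed by three integers
    // that must be the last things on the line.
    static const std::regex pattern(
        R"(^\s*(\d{4})\s+(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s*$)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;
    out.year = std::stoi(m[1].str());
    out.name = m[2].str();
    out.a = std::stoi(m[3].str());
    out.b = std::stoi(m[4].str());
    out.c = std::stoi(m[5].str());
    return true;
}
```

Because the three numbers are anchored to the end of the line, a title that itself ends in a number (Free Willy 2) should still come out whole.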
If you do not have control over the file format, you may want to do something along these lines (pseudo-process):
1) read in the line from the file
2) reverse the order of the "words" in the line
3) read in the 3 ints first
4) read in the rest of the stream as a string
5) reverse the "words" in the new string
6) read in the year
7) the remainder will be the movie title
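The pseudo-process above could be sketched roughly as follows. The helper names reverseWords and parseReversed are invented here, and the sketch assumes words are separated by whitespace and that every line ends in exactly three integers.

```cpp
#include <array>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Returns the whitespace-separated words of `s` in reverse order, joined by single spaces.
std::string reverseWords(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    std::string out;
    for (auto it = words.rbegin(); it != words.rend(); ++it) {
        if (!out.empty()) out += ' ';
        out += *it;
    }
    return out;
}

// Parses "<year> <title> <n> <n> <n>" by reading the reversed line.
bool parseReversed(const std::string& line, int& year,
                   std::string& title, std::array<int, 3>& nums) {
    title.clear();
    std::istringstream rev(reverseWords(line));
    // In the reversed line the three ints come first, last one leading.
    if (!(rev >> nums[2] >> nums[1] >> nums[0])) return false;
    std::string rest;
    std::getline(rev, rest);                     // reversed title plus the year
    std::istringstream fwd(reverseWords(rest));  // back to the original order
    if (!(fwd >> year)) return false;
    std::getline(fwd >> std::ws, title);         // whatever remains is the title
    return !title.empty();
}
```

Reversing twice is not the cheapest route, but it keeps each read a plain stream extraction.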
Read every field as a string and then convert the appropriate strings to integers.
1) Initially
1983
GodFather
3
33
45
are all strings, stored in a vector of strings (vector<string>).
2) Then the first string (1983) is converted to an integer using atoi, and the last three strings are converted to integers as well. The remaining strings make up the movie_name.
The following code assumes that the input file has already been validated for format.
#include <cstdlib>   // atoi
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    // open the input file for reading
    ifstream ifile(argv[1]);
    string input_str;
    // read each line
    while (getline(ifile, input_str)) {
        stringstream sstr(input_str);
        vector<string> strs;
        string str;
        while (sstr >> str)
            strs.push_back(str);
        // use the vector of strings to initialize the variables:
        // year, movie name and last three integers
        size_t num_of_strs = strs.size();
        if (num_of_strs < 5) continue; // need a year, a title word and three numbers
        // first string is the year
        int year = atoi(strs[0].c_str());
        // last three strings are numbers
        int one_num = atoi(strs[num_of_strs-3].c_str());
        int two_num = atoi(strs[num_of_strs-2].c_str());
        int three_num = atoi(strs[num_of_strs-1].c_str());
        // the rest (strs[1] .. strs[num_of_strs-4]) form the movie name
        string movie_name;
        for (size_t i = 1; i + 3 < num_of_strs; i++) {
            if (i > 1) movie_name += " ";
            movie_name += strs[i];
        }
        // year, movie_name, one_num, two_num and three_num are ready to use here
    }
    return 0;
}
IMHO, changing the delimiter in the file from a space to some other character such as , or ; or : will simplify the parsing significantly.
For example, if the data specification later changes so that either the last three or the last four fields can be integers, the code above will need major refactoring.
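To illustrate why a dedicated delimiter helps, here is a minimal sketch that splits a line on one character with std::getline. The function name splitFields and the ';'-separated sample line are assumptions for illustration; the question's file does not use this format.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits `line` on `delim`; spaces inside a field (such as a title) are preserved.
std::vector<std::string> splitFields(const std::string& line, char delim) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    for (std::string field; std::getline(in, field, delim);)
        fields.push_back(field);
    return fields;
}
```

With a line such as 1992;Silence of the lambs;3;33;44 the title is simply the second field, and each numeric field can be converted with std::stoi, no matter how many words or digits the title contains.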
Related
Hi, this is my first question here. I'll write as clearly as possible; if I'm too much of a newbie, please bear with me. Thanks.
Background: I was asked to solve the longest common substring (LCS) problem for given input files in C++.
The goal is to optimize the algorithm, so there are limits on run time and RAM. (The comparison is case-insensitive.)
My approach: I used a stringstream to parse every input line and stored the results in a vector. I used something like a suffix tree to chop each string into suffixes, sorted them, and put them into a vector of vectors. I then compared every two lines (v1, v2) to find common substrings (using a nested for loop to compare each word in each vector), put the common substrings back into the array, and removed v1 and v2.
Suffix example: banana -> anana -> nana -> ana -> na -> a (I stored all 5 suffixes in the vector)
Result: it works for most files (normal text files).
Problem: I have 2 special test cases where finding the LCS takes forever.
1. 10000 input lines, averaging 3000 chars each (including spaces). It took 50 minutes to find the LCS; the requirement is not to exceed 5 minutes.
2. 100 input lines, averaging 60k chars each. It never finishes running.
What I tried: build a common-word dictionary from the first 2 sentences
read the first two lines and store them in vectors
use the suffix approach again to find the common elements (substrings) and call the result the dictionary
for the rest of the input lines,
if (the word read is in the dictionary)
fine, do what I did before, read the next one
else if (the word is not in the dictionary)
ignore this word, read the next one
Help needed: I still cannot process the first two lines when each line contains 60k chars, so building the dictionary itself would exceed the run-time limit. I am not sure whether a hash table would work much better than vectors. I know a bit about hash tables but have never written anything with one, so if you can patiently explain them, I would appreciate it.
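For reference, a hash-table dictionary of this kind could look roughly like the sketch below, using std::unordered_set. The function name commonWords is made up, and the sketch compares words case-sensitively, which is a simplification of the case-insensitive requirement. Lookups in an unordered_set take constant time on average, whereas searching an unsorted vector takes time proportional to its length.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <unordered_set>

// Returns the set of whitespace-separated words that occur in both lines.
std::unordered_set<std::string> commonWords(const std::string& first,
                                            const std::string& second) {
    // Hash every word of the first line once.
    std::unordered_set<std::string> seen;
    std::istringstream a(first);
    for (std::string w; a >> w;) seen.insert(w);

    // Keep only those words of the second line that were seen in the first.
    std::unordered_set<std::string> dict;
    std::istringstream b(second);
    for (std::string w; b >> w;)
        if (seen.count(w) != 0) dict.insert(w);
    return dict;
}
```

This only speeds up the word-membership test; it does not by itself find the longest common run of words.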
Update:
As suggested, I'm putting some code here (the first part parses each line and stores it in a vector; the second shows how I compare 2 strings and find the common elements)
vector< vector<string> > parsed_array;
vector<string> choped_element;
// Num1::read from file in a while loop
while (getline (myfile,line))
{
cout << "< InputlineLoopCounter: "<<InputlineLoopCounter<<endl;
choped_element.clear();
choped_element.push_back(line); //whole string as first element, e.g. "Hello World"
stringstream ss(line);
string copystr (line);
while (ss >> temp)
{
copystr.erase(0,copystr.find_first_of(" \t")+1); // here turns into "World"
choped_element.push_back(copystr);
}
choped_element.pop_back();//since I stored the whole string as the first element, the last one is not necessary
sort(choped_element.begin(),choped_element.end());
parsed_array.push_back(choped_element);//stored into vector array
InputlineLoopCounter ++;
}
//Num2:: compare two different strings and assemble the common part into a new string
//v1 and v2 are 2 vectors full of chopped strings, and v3 should hold the common elements
// eg. v1[0]="hello world"; v1[1]="world"
// eg. v2[0]="I dislike hello world"; v2[1]="dislike hello world"; v2[2]="hello world"; v2[3]="world"
// eg. v3 as result would be v3[0]="hello world"; v3[1]="world"
for (size_t i = 0; i < v1len; i++)
{
for (size_t j = 0; j< v2len; j++)
{
stringstream ss1(v1[i]);
string fword1;
ss1>>fword1;
stringstream ss2(v2[j]);
string fword2;
ss2>>fword2;
if(fword1 == fword2) //v1[i] and v2[j] are space-separated words
{
string nword1;
string nword2;
string lcommon;
int comlen = 1;
string combine;
combine.append(fword1);
combine.append(space);
while (ss1>>nword1 && ss2>>nword2)
{
if (nword1 == nword2)
{
combine.append(nword1);
combine.append(space);
comlen ++;
}
else
break;
}
combine.erase(combine.find_last_of(" "));
cout<< "common word: "<<combine<<endl;
v3.push_back(combine);
}
}
}
I'm a C++ beginner. I'm trying to read a file that is formatted like so:
5 Christine Kim # 30.00 3 1
15 Ray Allrich # 10.25 0 0
...
number string # number number number
where each number has its own significance and the name does as well. I can get the file open and read the first two items but after the name I can't get the numbers after. This is my function right now:
void getItems( ifstream& dataFile, // in file
Employee item[], // class array so I can store the data later
int &transNum) // number of transactions
{
int id; // employee ID
char name[20]; // employee name
double hourlyPay; // pay per hour
int numDeps; // number of dependents
int type; // employee type
transNum = 0;
dataFile >> id;
dataFile.ignore(); // discard space before name
dataFile.getline( name, '#');
dataFile >> hourlyPay >> numDeps >> type;
}
I need it to read the first number, read the name, then read the last 3 numbers. After every name (maximum 20 characters) there is a # symbol where we should stop reading the name. I've tried adjusting the size of my char array for my string and other small fixes but nothing works. I realize that I will only get 1 line right now... I was just trying to get the first line to work before I made a loop to grab the other lines.
istream::getline needs a length parameter when used with a char array. Currently your call passes '#' as the length and uses the default delimiter ('\n').
Change your code to
dataFile.getline( name, sizeof name, '#');
Alternatively, use a std::string with the free function std::getline; then you don't need to specify a maximum size.
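A minimal sketch of the std::string variant, assuming the record layout from the question (id, name, '#', then three numbers). The struct EmployeeRecord and the function readRecord are invented for illustration and are not the asker's Employee class.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical plain record; not the asker's Employee class.
struct EmployeeRecord {
    int id;
    std::string name;
    double hourlyPay;
    int numDeps;
    int type;
};

// Reads one "id name # pay deps type" record; returns false on failure or end of input.
bool readRecord(std::istream& in, EmployeeRecord& rec) {
    if (!(in >> rec.id)) return false;
    in >> std::ws;                                       // skip the blanks before the name
    if (!std::getline(in, rec.name, '#')) return false;  // read up to '#' and discard it
    // Drop the blank(s) left between the name and the '#'.
    while (!rec.name.empty() &&
           std::isspace(static_cast<unsigned char>(rec.name.back())))
        rec.name.pop_back();
    return static_cast<bool>(in >> rec.hourlyPay >> rec.numDeps >> rec.type);
}
```

Since it takes any std::istream, the same function can be called in a loop on the ifstream, or on a std::istringstream when trying it out.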
You're very close. The problem is likely to be the line
dataFile.ignore();
Called with no arguments, ignore() discards just one character (its defaults are a count of 1 and a delimiter of EOF). If you want to skip everything up to and including the next space, however many characters come before it, the call would instead be:
dataFile.ignore(100, ' ');
Here 100 is a purely arbitrary limit; substitute your own sensible value.
Next is your use of getline, which must be told how many characters it may store. Since your buffer is 20 characters, you need to tell getline like so:
dataFile.getline(name, 20, '#');
I'm trying to create a function where it allows the user to type in multiple amounts of integers, so if the user wanted to have 3 different storages that hold different integers, the input would look something like this:
5
97 12 31 2 1 //let's say this is held in variable "a"
1 3 284 3 8 // "b"
2 3 482 3 4 // "c"
2 3 4 2 3 // "d"
99 0 2 3 42 // "e"
Since we don't know what number the user will input every time, I'm not sure how to create a dynamically allocated array that will create an x amount of arrays every time.. I want to be able to access each index of a, b, c, d, e or however many arrays there are.
So far, this is what I have, but I'm having trouble creating the arrays since it's unpredictable. I'm purposely not using vectors because I don't really get how pointers work so I'm trying to play around with it.
int* x;
int length, numbers;
cin >> length;
x = new int[length]
for (int i=0;i<length;i++)
{
std::getline(std::cin, numbers); //this didn't work for me
x[i] = numbers
}
If anything seems unclear, please let me know! Thank you!
Your loop doesn't read whole lines. It reads 1 integer at a time, and since there are 5 integers per line and you entered 5 on the first line, you end up with only the numbers from the first data line. x in your code is an array of integers, and it needs room for all of your integers, which in this case is 25. If 5 integers per line is guaranteed, allocating space for 5 * length integers will work. You will also need an inner for loop: one loop over the lines and another over the integers on each line.
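A sketch of that idea with a single new[] allocation and two nested loops. The function name readGrid is made up, and it assumes every line holds exactly cols integers (5 in the example). Element (r, c) lives at index r * cols + c.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Reads rows * cols integers into one heap-allocated array.
// The caller owns the result and must release it with delete[].
// Returns nullptr if the input runs out or contains a non-integer.
int* readGrid(std::istream& in, int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    int* x = new int[rows * cols];
    for (int r = 0; r < rows; ++r) {          // one pass per line
        for (int c = 0; c < cols; ++c) {      // one pass per integer on that line
            if (!(in >> x[r * cols + c])) {
                delete[] x;
                return nullptr;
            }
        }
    }
    return x;
}
```

With std::cin, this would be called after reading length, as in readGrid(std::cin, length, 5).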
I would suggest using cin like so:
int d;
while (cin >> d) { // stops at end of input or on a non-integer
// Do something with d
if (cin.peek() == '\n') {
// End of this line: create a new row in your dynamic array
}
}
This reads one integer at a time, up to the next whitespace.
Another way to achieve this is to read each line into a string with getline() (using string.empty() to detect a blank line) and then split the line into tokens with strtok. Note that strtok works on a modifiable char buffer rather than on a std::string directly; each token can then be converted to an int with atoi.
To store these tokens you will want to use a vector, since vectors are dynamic by nature and can easily be resized to fit any need. For rows of rows, look into multi-dimensional vectors (a vector of vectors).
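Roughly, the getline plus strtok approach could look like this. The function name readRows is made up, and the sketch assumes that a blank line (or end of input) ends the data and that every token is a valid integer, since atoi reports no errors.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Reads lines of space-separated integers until a blank line or end of input.
std::vector<std::vector<int>> readRows(std::istream& in) {
    std::vector<std::vector<int>> rows;
    std::string line;
    while (std::getline(in, line) && !line.empty()) {
        std::vector<int> row;
        // strtok modifies its input, so it is given the string's own writable buffer.
        for (char* tok = std::strtok(&line[0], " \t"); tok != nullptr;
             tok = std::strtok(nullptr, " \t")) {
            row.push_back(std::atoi(tok));
        }
        rows.push_back(row);
    }
    return rows;
}
```

Each inner vector can have its own length, so the rows do not all need to hold the same number of integers.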
In a file which has an xml like format (but is not xml)
<mgrwt event="1">
<rscale> 1 1234</rscale>
<asrwt>0 4234</asrwt>
<pdfrwt beam="1"> 1 2 0.11790045E+00 0.22210436E+03</pdfrwt>
<pdfrwt beam="2"> 1 -2 0.92962177E-02 0.22210436E+03</pdfrwt>
<totfact> 0.34727485E+01</totfact>
<matchscale> 0.10000000E+11 0.41999999E+02 0.61496031E+02</matchscale>
</mgrwt>
I would like to read the file in C++ block by block (this part I know ;-) and then assign variables to the sub-components of each block. I know, for example, that all the values in
<pdfrwt beam="1"> 1 2 0.11790045E+00 0.22210436E+03</pdfrwt>
i.e. 1 2 0.11790045E+00 0.22210436E+03, are always numbers and never strings. So the question is: how can I read the space-separated numbers from each line in each block?
I also tried the answers to "Read input numbers separated by spaces", but they did not help me.
thanks
I suggest reading line by line and using std::istringstream to parse the string.
For example, you can find where the numbers start by searching for the first '>'.
Either advance the stringstream past that position, or ...
create a substring containing the text between that '>' and the next '<'.
Create an istringstream from this substring.
Then try something like:
int first, second; // signed, since the second value can be negative (e.g. -2)
double third, fourth;
string_stream >> first >> second >> third >> fourth;
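Put together, the substring variant might look like the following sketch. The function name parsePdfrwt is made up, and it assumes the whole element sits on one line with exactly four numbers between the tags.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Extracts the four numbers between the first '>' and the following '<'.
bool parsePdfrwt(const std::string& line, int& first, int& second,
                 double& third, double& fourth) {
    const std::size_t open = line.find('>');
    if (open == std::string::npos) return false;
    const std::size_t close = line.find('<', open + 1);
    if (close == std::string::npos) return false;
    // Only the text between the tags is handed to the stream.
    std::istringstream ss(line.substr(open + 1, close - open - 1));
    return static_cast<bool>(ss >> first >> second >> third >> fourth);
}
```

Stream extraction into a double accepts exponent notation such as 0.11790045E+00, so no extra conversion step is needed.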
I have possible inputs 1M, 2M, ..., 11M and 1Y (M stands for months, Y for years), and I want to output "somestring1", "somestring2", ... and "somestring12". Note that the M and Y are removed and the last string is changed to 12.
Example: input "11M" "hello" output: hello11
input "1Y" "hello" output: hello12
char * (const char * date, const char * somestr)
{
// just need to output final string no need to change the original string
cout<< finalStr<<endl;
}
The second string is output as a whole, so nothing changes there.
The first string should be output only until M or Y is encountered. Since Stack Overflow discourages handing over complete source code, I will give you only part of it. There is a condition to be placed, which is up to you to figure out. (The second answer gives that as well.)
The code would be somewhat like this.
//Code for first string. Just for output.
for (auto i = 0 ; date[i] != '\0' ; ++i)
{
// A condition comes here.
cout << date[i] ;
}
Note that this assumes you only need to output the string. Otherwise you can build another string from the two, or concatenate the existing ones.
Is this homework? If not, here's what I'd suggest. (I ask about homework because you may have restrictions, not because we're not here to help.)
1) Do a find on 'M' in your string (using find) and, if one is found, insert a '\0' at that position (by the way, I'm assuming you have well-formatted input).
2) Do a find on 'Y'. If one is found, insert a '\0' at that position, then do an atoi() or stringstream conversion on your string to convert it to a number, and multiply by 12.
3) Concatenate the string representation from step 1 or step 2 onto your somestr.
4) Output.
This can probably be done in fewer than 10 lines.
The a.find('M') part and its checks can use the conditional operator, and then the conversion and concatenation take two or three lines at most.
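A short sketch of these steps using std::string instead of in-place '\0' insertion. The function name appendMonths is made up, and the sketch assumes well-formatted input ("<digits>M" or "<digits>Y") and that "1Y" should produce 12, as the question's description and step 2 suggest.

```cpp
#include <cassert>
#include <string>

// Appends the number of months encoded in `date` ("11M", "1Y", ...) to `somestr`.
// Assumes `date` starts with digits; std::stoi throws std::invalid_argument otherwise.
std::string appendMonths(const std::string& date, const std::string& somestr) {
    if (date.empty()) return somestr;
    const char unit = date.back();   // 'M' or 'Y'
    int value = std::stoi(date);     // parses the leading digits and stops at the letter
    if (unit == 'Y') value *= 12;    // years are converted to months
    return somestr + std::to_string(value);
}
```

For output only, the result can be sent straight to cout; the original strings are left unchanged.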