I'm trying to create a spam filter, and I need to train the model first. I read the words from a text file which has the word "spam" or "ham" as the first word of a paragraph, followed by the words in the mail, each immediately followed by its number of occurrences. The file contains many paragraphs. My program is able to read the first paragraph, that is, its words and their numbers of occurrences.
The problem is that the program stops reading after encountering the newline and doesn't read the next paragraph. I have a feeling that the way I check for the newline character that ends a paragraph is not entirely correct.
I have included two paragraphs so you get an idea of the training text.
Training text file:
/000/003 ham need 1 fw 1 35 2 39 1 thanks 1 thread 2 40 1 copy 1 else 1 correlator 1 under 1 companies 1 25 1 he 2 26 2 168 1 29 2 content 4 1 1 6 1 5 1 4 1 review 2 we 1 john 3 17 1 use 1 15 1 20 1 classes 1 may 1 a 1 back 1 l 1 01 1 produced 1 i 1 yes 1 10 2 713 2 v6 1 p 1 original 2
/000/031 ham don 1 kim 5 dave 1 39 1 customer 1 38 2 thanks 1 over 1 thread 2 year 1 correlator 1 under 1 williams 1 mon 2 number 2 kitchen 1 168 1 29 1 content 4 3 2 2 6 system 2 1 2 7 1 6 1 5 2 4 1 9 1 each 1 8 1 view 2
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
int V = 0; // Total number of words
ifstream fin;
fin.open("train", ios::in);
string word;
int wordnum;
int N[2] = {0};
char c, skip;
for (int i = 0; i < 8; i++) fin >> skip; // There are 8 characters before the first word of the paragraph
while (!fin.fail())
{
fin >> word;
if (word == "spam") N[0]++;
else if (word == "ham") N[1]++;
else
{
V++;
fin >> wordnum;
}
int p = fin.tellg();
fin >> c; //To check for newline. If its there, we skip the first eight characters of the new paragraph because those characters aren't supposed to be read
if (c == '\n')
{
for (int i = 0; i < 8; i++) fin >> skip;
}
else fin.seekg(p);
}
cout << "\nSpam: " << N[0];
cout << "\nHam :" << N[1];
cout << "\nVocab: " << V;
fin.close();
return 0;
}
std::ifstream::operator>>() doesn't store \n in the variable; it skips it as whitespace, so c never receives a newline. If you need to handle whitespace and \n characters yourself, you can use std::ifstream::get().
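A minimal sketch of a line-based alternative, assuming each paragraph of the training file sits on one line in the form `<id> <label> <word> <count> ...`. The names `Counts` and `count_labels` are hypothetical and not part of the original program; `std::getline()` consumes the newline, so no manual check for `'\n'` is needed.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type: totals gathered from the training text.
struct Counts {
    int spam = 0;   // paragraphs labelled "spam"
    int ham = 0;    // paragraphs labelled "ham"
    int vocab = 0;  // number of word/count pairs seen
};

// Assumes each paragraph is one line: "<id> <label> <word> <count> ...".
Counts count_labels(std::istream& in) {
    Counts c;
    std::string line;
    while (std::getline(in, line)) {          // getline consumes the '\n'
        std::istringstream row(line);
        std::string id, label;
        if (!(row >> id >> label)) continue;  // blank or malformed line
        if (label == "spam") ++c.spam;
        else if (label == "ham") ++c.ham;
        std::string word;
        int n;
        while (row >> word >> n) ++c.vocab;   // one word and its count per step
    }
    return c;
}
```

To use it with the file from the question, open `std::ifstream fin("train");` and call `count_labels(fin)`. Reading the id as an ordinary token also removes the need to skip a fixed number of characters.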
I have made the following script, that is supposed to read from a file:
char match[] = "match";
int a;
int b;
inp >> lin;
while(!inp.eof()) {
if(!strcmp(lin, match)) {
inp >> a >> b;
cout << a << " " << b <<endl;
}
inp >> lin;
}
inp.close();
return num_atm;
}
It is supposed to read all words, and if a line starts with match, it should then also print the rest of the line.
My input file is this:
match 1 2 //1
match 5 2 //2
nope 3 6 //3
match 5 //4
match 1 4 //5
match 5 9 //6
It will correctly print 1 2 and 5 2, and skip 3 6. But then it gets stuck and keeps printing 5 0 forever. I get that match is put into b, which is an integer, but I don't get why this loops. Shouldn't the input read the match on line 4 once, try to read 5 and match, and then be done with line 4 and the match from line 5? It should then read the numbers 1 and 4, and then the match from line 6.
I would also understand that due to the word not fitting into the integer, it would read match in the fifth line again, but that's not what it does.
It goes back to the match in the fourth line which it already read, and reads it again. Why is this?
When you are reading with >>, line endings are handled the same as spaces: they are just more whitespace that is skipped. That means you see
match 1 2
match 5 2
nope 3 6
match 5
match 1 4
match 5 9
But the program sees
match 1 2 match 5 2 nope 3 6 match 5 match 1 4 match 5 9
Let's fast forward to where things go south
Contents of stream:
nope 3 6 match 5 match 1 4 match 5 9
Processing
inp >> lin; // reads nope stream: 3 6 match 5 match 1 4 match 5 9
if(!strcmp(lin, match)) { // nope != match skip body
}
inp >> lin; // reads 3 stream: 6 match 5 match 1 4 match 5 9
if(!strcmp(lin, match)) { // 3 != match skip body
}
inp >> lin; // reads 6 stream: match 5 match 1 4 match 5 9
if(!strcmp(lin, match)) { // 6 != match skip body
}
inp >> lin; // reads match stream: 5 match 1 4 match 5 9
if(!strcmp(lin, match)) { // match == match Enter body
inp >> a >> b; // reads 5 and fails to parse match into an integer.
// stream: match 1 4 match 5 9
// stream now in failure state
cout << a << " " << b <<endl; // prints 5 and 0: since C++11 a failed read stores 0 in b (before that, b was left as garbage)
}
inp >> lin; // reads nothing. Stream failed, lin still holds match
if(!strcmp(lin, match)) { // match == match Enter body
inp >> a >> b; // reads nothing. Stream failed
// stream: match 1 4 match 5 9
// stream still in failure state
cout << a << " " << b <<endl; // prints the same values again because a and b were not read
}
Because nothing is ever read, while(!inp.eof()) is utterly worthless: the end of the file can never be reached. The program will loop forever, printing whatever it last read successfully.
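The stuck stream can be reproduced in isolation. The sketch below uses a `std::istringstream` in place of the file, and the helper name `word_readable_after_bad_int` is made up for this illustration. It shows that once an integer extraction fails, later reads do nothing until `clear()` resets the error state.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: tries to read two ints from `text`, then one more word.
// Returns true if that word could be read.
bool word_readable_after_bad_int(const std::string& text, bool clear_first) {
    std::istringstream in(text);
    int a = 0, b = 0;
    in >> a >> b;                 // sets failbit on the first token that is not a number
    if (clear_first) in.clear();  // reset the error state so extraction works again
    std::string word;
    return static_cast<bool>(in >> word);
}
```

With the input `"5 match 1 4"` the function returns false without `clear()` and true with it. In the failed state `eof()` stays false, which is why a loop conditioned only on `eof()` never ends.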
Fixing this depends entirely on what you want to do if you have a match line without 2 numbers on it, but a typical framework looks something like
std::string line;
while(std::getline(inp, line)) // get a whole line. Exit if line can't be read for any reason.
{
std::istringstream strm(line);
std::string lin;
if(strm >> lin && lin == match) // enters if lin was read and lin == match
// if lin can't be read, it doesn't matter.
// strm is disposable
{
int a;
int b;
if (strm >> a >> b) // enters if both a and b were read
{
cout << a << " " << b <<"\n"; // endl flushes. Very expensive. just use a newline.
}
}
}
Output from this should be something like
1 2
5 2
1 4
5 9
If you want to make some use of match 5... Well it's up to you what you want to put in b if there is no b in the file.
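One possible policy, purely as an illustration, is to substitute a default value when the second number is missing. The helper name `parse_match` and the `default_b` parameter are assumptions made for this sketch; nothing in the question says which value b should take.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: parses "match <a> [<b>]" from one line.
// Returns false for lines that do not start with "match" followed by a number;
// a and b are only meaningful when true is returned.
bool parse_match(const std::string& line, int& a, int& b, int default_b = 0) {
    std::istringstream strm(line);
    std::string key;
    if (!(strm >> key >> a) || key != "match") return false;
    if (!(strm >> b)) b = default_b;  // no second number on this line
    return true;
}
```

Called on `"match 5 //4"` with `default_b = -1`, it reports a = 5 and b = -1, so the caller can decide whether to print or skip such lines.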
The starting row number and the length of the hockey stick will be taken as input. We need to print the elements of the hockey stick excluding the sum.
The following code prints the pascal triangle with 10 rows(row:0 to row:9). How to add code to get the elements of the hockey stick?
#include<iostream>
using namespace std;
int main()
{
int l, r, arr[10][10];
for (int i=0; i<=9; i++)
{
for(int j=0; j<=i; j++)
{
if((i==j)||(j==0))
{
arr[i][j] = 1;
cout << arr[i][j] << " ";
}
else
{
arr[i][j] = arr[i-1][j-1]+arr[i-1][j];
cout << arr[i][j] << " ";
}
}
cout << endl;
}
return 0;
}
It gives the output as below,
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
Now we need to take starting row and length of hockey stick,
let's take
starting row-3
length-4
1
1 1
1 2 1
**1** 3 3 1
1 **4** 6 4 1
1 5 **10** 10 5 1
1 6 15 **20** 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
So the hockey stick formation will be:
1+4+10+20 = 35
We need to print the final output as below,
1+4+10+20
Note: No need to print the sum element-35
=================================
I have added the code as below,
cout <<"enter starting row-\n";
cin >> r;
cout << "enter length of hockey stick-\n";
cin >> l;
cout << "\nelements of hockey stick-\n";
int j=0;
for (int i=r; i<=(r+l-1); i++)
{
int j = i-r;
cout << arr[i][j] << " ";
}
cout << endl;
Got output as -
enter starting row-
3
enter length of hockey stick
4
elements of hockey stick-
1 4 10 20
But I need it to be as below.
1+4+10+20
The hint of HolyBlackCat is in general right...
...except that the last element would be suffixed by + as well.
That's why I would recommend to turn it around: prefix every element except the first with +. This is achieved with an initial separator string which is empty. It is overridden at end of loop:
const char *sep = "";
//int j=0; // unused
for (int i=r; i<=(r+l-1); i++)
{
int j = i-r;
cout << sep << arr[i][j];
sep = "+";
}
cout << endl;
Note:
The assignment of sep in the loop is unnecessary for every iteration other than the first. AFAIK, this is usually cheaper than an extra if test.
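For reference, a self-contained sketch of the same separator idiom. The function names `pascal` and `hockey_stick` are hypothetical, and the result is returned as a string instead of being printed so that it is easy to check.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: builds Pascal's triangle with `rows` rows (row 0 first).
std::vector<std::vector<long long>> pascal(int rows) {
    std::vector<std::vector<long long>> t(rows);
    for (int i = 0; i < rows; ++i) {
        t[i].assign(i + 1, 1);  // both edges of each row are 1
        for (int j = 1; j < i; ++j)
            t[i][j] = t[i - 1][j - 1] + t[i - 1][j];
    }
    return t;
}

// Joins the stick elements with '+'. Assumes start + length <= rows.
std::string hockey_stick(int start, int length, int rows = 10) {
    const std::vector<std::vector<long long>> t = pascal(rows);
    std::ostringstream out;
    const char* sep = "";               // empty before the first element
    for (int k = 0; k < length; ++k) {
        out << sep << t[start + k][k];  // diagonal: row start+k, column k
        sep = "+";
    }
    return out.str();
}
```

`hockey_stick(3, 4)` yields `1+4+10+20`, matching the requested output. The bounds are not validated here, so the caller has to keep `start + length` within the number of rows.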
I have a file with this format:
11
1 0
2 8 0
3 8 0
4 5 10 0
5 8 0
6 1 3 0
7 5 0
8 11 0
9 6 0
10 5 7 0
11 0
The first line is the number of lines, so I can make a loop to read the file with the number of lines.
For the other lines, I would like to read the file line by line and store the data until I reach a "0" on the line; that's why there is a 0 at the end of each line.
The first column is the task name.
The other columns are the constraint names.
I tried to code something, but it doesn't seem to work:
printf("Constraints :\n");
for (int t = 1; t <= numberofTasks; t++)
{
F >> currentTask;
printf("%c\t", currentTask);
F >> currentConstraint;
while (currentConstraint != '0')
{
printf("%c", currentConstraint);
F >> currentConstraint;
};
printf("\n");
};
The "0" represents the end of the constraints for a task.
I think my code doesn't work properly because the constraint 10 for the task 4 contains a "0" too.
Thanks in advance for your help
Regards
The problem is that you are reading individual characters from the file, not reading whole integers, or even line-by-line. Change your currentTask and currentConstraint variables to int instead of char, and use std::getline() to read lines that you then read integers from.
Try this:
F >> numberofTasks;
F.ignore();
std::cout << "Constraints :" << std::endl;
for (int t = 1; t <= numberofTasks; ++t)
{
std::string line;
if (!std::getline(F, line)) break;
std::istringstream iss(line);
iss >> currentTask;
std::cout << currentTask << "\t";
while ((iss >> currentConstraint) && (currentConstraint != 0))
{
std::cout << currentConstraint << " ";
}
std::cout << std::endl;
}
That being said, the terminating 0 on each line is unnecessary. std::getline() will stop reading when it reaches the end of a line, and operator>> will stop reading when it reaches the end of the stream.
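A sketch of what reading could look like without the terminating 0, assuming the first line still holds the task count. The helper name `read_tasks` and the choice of a `std::map` are illustrative only and do not come from the original code.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: maps each task number to its list of constraints.
// Assumes the first line holds the task count and lines carry no trailing 0.
std::map<int, std::vector<int>> read_tasks(std::istream& in) {
    std::map<int, std::vector<int>> tasks;
    int count = 0;
    std::string line;
    if (!(in >> count)) return tasks;
    std::getline(in, line);  // discard the rest of the count line
    for (int t = 0; t < count && std::getline(in, line); ++t) {
        std::istringstream iss(line);
        int task;
        if (!(iss >> task)) continue;        // ignore a line without a task number
        std::vector<int>& deps = tasks[task];
        int c;
        while (iss >> c) deps.push_back(c);  // stops at the end of the line
    }
    return tasks;
}
```

Because each line is parsed from its own `std::istringstream`, a constraint such as 10 is read as one integer and is never confused with an end marker.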
I'm editing my post with the progress I've made so far. What I want to do for now is:
Read the text file from the first line without an asterisk (*), i.e. the line beginning with the number 1, to the end of the file.
When there is a blank space instead of ">sa0" (6th column), put a # in that variable, and put the next string in the next variable (i.e. fsa1).
Print this to the user line by line.
The code I have so far:
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
using namespace std;
int main()
{
string line, netlist;
int address, fout, fin;
string name, type, fsa0, fsa1;
cout << "Which netlist do you want to use?" << endl;
cin >> netlist;
ifstream file(netlist.c_str());
if (file.is_open())
{
do{
getline(file, line);
} while ( line[0] == '*' );
file >> address >> name >> type >> fout >> fin >> fsa0;
if (fsa0 != ">sa0") { fsa1 = fsa0; fsa0 = "#"; } else { file >> fsa1; }
cout << address << " " << name << " " << type << " " << fout << " " << fin << " " << fsa0 << " " << fsa1 << endl;
} else { cout << "File not found" << endl; }
file.close();
return 0;
}
Problems found:
The first line after the last line with an asterisk is not shown.
Only the second line is shown, not all of the lines.
Text file I'm trying to read:
*c17 iscas example (to test conversion program only)
*---------------------------------------------------
*
*
* total number of lines in the netlist .............. 17
* simplistically reduced equivalent fault set size = 22
* lines from primary input gates ....... 5
* lines from primary output gates ....... 2
* lines from interior gate outputs ...... 4
* lines from ** 3 ** fanout stems ... 6
*
* avg_fanin = 2.00, max_fanin = 2
* avg_fanout = 2.00, max_fanout = 2
*
*
*
*
*
1 1gat inpt 1 0 >sa1
2 2gat inpt 1 0 >sa1
3 3gat inpt 2 0 >sa0 >sa1
8 8fan from 3gat >sa1
9 9fan from 3gat >sa1
6 6gat inpt 1 0 >sa1
7 7gat inpt 1 0 >sa1
10 10gat nand 1 2 >sa1
1 8
11 11gat nand 2 2 >sa0 >sa1
9 6
14 14fan from 11gat >sa1
15 15fan from 11gat >sa1
16 16gat nand 2 2 >sa0 >sa1
2 14
20 20fan from 16gat >sa1
21 21fan from 16gat >sa1
19 19gat nand 1 2 >sa1
15 7
22 22gat nand 0 2 >sa0 >sa1
10 20
23 23gat nand 0 2 >sa0 >sa1
21 19
Another thing: can you give me some tips on what to do with the lines that contain only two integers, like the last one?
Thank you all. I appreciate all the help.
At least IMO, you're approaching this the wrong way. I'd start by reading a line of input. Then check how many items there are on that line. Then parse the items in the line appropriately.
Unless you're absolutely set on doing this in pure C++ on your own, something like AWK or yacc will make the job tremendously easier.
If you do insist on doing it without a parser generator or similar, you could at least use regular expressions to help out quite a bit.
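A rough sketch of the "read a line, count the items, then parse" approach in plain C++. The line categories are guessed from the sample file only, and `tokens_of`, `LineKind`, `classify` and `fault_or_hash` are all hypothetical names. The two-integer lines are presumably the list of inputs of the preceding gate, but that is an assumption about the format.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits a line into whitespace-separated tokens.
std::vector<std::string> tokens_of(const std::string& line) {
    std::istringstream iss(line);
    std::vector<std::string> out;
    std::string tok;
    while (iss >> tok) out.push_back(tok);
    return out;
}

// Hypothetical classification of one netlist line, guessed from the sample file.
enum class LineKind { Comment, Blank, Gate, Fanout, FaninList, Unknown };

LineKind classify(const std::string& line) {
    const std::vector<std::string> t = tokens_of(line);
    if (t.empty()) return LineKind::Blank;
    if (t[0][0] == '*') return LineKind::Comment;
    if (t.size() >= 3 && t[2] == "from") return LineKind::Fanout;  // e.g. "8 8fan from 3gat >sa1"
    if (t.size() >= 6) return LineKind::Gate;  // address name type fout fin faults...
    // Remaining short lines such as "1 8" must consist of digits only.
    for (const std::string& s : t)
        if (s.find_first_not_of("0123456789") != std::string::npos)
            return LineKind::Unknown;
    return LineKind::FaninList;
}

// Returns the fault marker if it appears among the tokens, otherwise "#".
std::string fault_or_hash(const std::vector<std::string>& t, const std::string& fault) {
    for (const std::string& s : t)
        if (s == fault) return fault;
    return "#";
}
```

A reading loop would call `std::getline()` for every line, skip `Comment` and `Blank` lines, and use `fault_or_hash(tokens, ">sa0")` to obtain either `>sa0` or `#` for the sixth column.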
I have a 4*4 matrix in each of 4 files. I need to read the first two elements of each row from each file and display them in a column. Here is an example:
File 1     File 2     File 3     File 4
1 2 3 4    2 3 4 5    3 5 8 9    1 4 6 9
3 4 4 5    3 4 5 6    6 7 9 2    6 0 8 6
1 2 4 5    4 5 6 6    8 7 6 5    4 5 6 7
1 2 3 4    4 4 7 9    3 4 5 6    5 6 7 9
I need to display the first two elements of the first row of File 1, then the first two elements of the first row of File 2, and so on:
1 + 2 (File 1, 1st row 2 elements)
2 + 3 (File 2, 1st row 2 elements)
3 + 5 (File 3, 1st row 2 elements)
1 + 4 (File 4, 1st row 2 elements)
3 + 4 (file 1, 2nd row 2 elements)
3 + 4 (file 2, 2nd row 2 elements)
6 + 7 (File 3, 2nd row 2 elements)
and so on..
//std::fstream infile;
string st1 = "file_";
string st2 = ".txt";
string st3 = "_";
string filename;
string mystring;
float fading[16][16];
for( int row = 0 ; row < 5 ; ++row)
{
for( int column = 0 ; column < 5 ; ++column)
{
for ( int i = 1; i < 3; i++)
{
for(int j = 1; j < 3 ;j++)
{
stringstream ss, ss1;
ss << i;
ss1 << j;
filename = st1 + ss.str() + st3 + ss1.str() + st2;
std::fstream infile;
infile.open(filename.c_str());
if(infile.is_open())
{
infile >> fading[row][column];
cout << "fading[" << row << "][" << column << "] " << fading[row][column] << std::endl;
}
else
std::cout << " file " << filename << " not open" << std::endl;
infile.close();
}
}
}
}
}
I am not able to get the first two elements of each row from each file in a loop. Each time the file is closed, the program starts again from the first row of the first file.
Why not read the entire matrices into memory buffers and fetch the fields you need from there? If it's really only four files with 16 entries each, that's not too expensive and you don't have the hassle of reading around in your file.
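A minimal sketch of that in-memory approach, assuming every file holds a 4x4 matrix of whitespace-separated numbers. `Matrix`, `read_matrix` and `first_pairs` are hypothetical names; the functions take any `std::istream`, so they work with `std::ifstream` as well as with string streams.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Reads up to rows x cols numbers from a stream; missing entries stay 0.
Matrix read_matrix(std::istream& in, int rows = 4, int cols = 4) {
    Matrix m(rows, std::vector<float>(cols, 0.0f));
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            in >> m[r][c];
    return m;
}

// For each row, lists the first two elements of every matrix in file order.
// Assumes all matrices have the same number of rows and at least two columns.
std::string first_pairs(const std::vector<Matrix>& files) {
    std::ostringstream out;
    if (files.empty()) return out.str();
    for (std::size_t r = 0; r < files[0].size(); ++r)
        for (const Matrix& m : files)
            out << m[r][0] << " + " << m[r][1] << "\n";
    return out.str();
}
```

Each file is opened and read exactly once, for example with `std::ifstream f(filename); files.push_back(read_matrix(f));`, and the row-by-row ordering is then produced purely from memory.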
I guess you want to open the file before the loop. That will keep the file's read position at the point where it last read.