I'm working on an automatic summarization system in my C++ class and have a question regarding one of the ASCII comparisons I'm doing. Here's the code:
char ch;
string sentence;
pair<char, char> sentenceCheck;
int counter = 0;
while (!ifs2.eof())
{
ch = ifs2.get();
ch = tolower(ch);
if (ch == 13)
ch = ifs2.get();
if (ch != 10 && ch != '?' && ch != '!' && ch != '.')
sentence += ch;
sentenceCheck.first = sentenceCheck.second;
sentenceCheck.second = ch;
cout << sentenceCheck.first << "-" << (int)sentenceCheck.first << " ---- " << sentenceCheck.second << "-" << (int)sentenceCheck.second << endl;
if(sentenceCheck.second == ' ' || sentenceCheck.second == 10 || sentenceCheck.second == -1)
{
if(sentenceCheck.first == '?' || sentenceCheck.first == '!' || sentenceCheck.first == '.')
{
istringstream s(sentence);
while(s >> wordInSentence)
{
sentenceWordMap.insert(pair<string, int>(wordInSentence, 0));
}
//sentenceList.push_back(pair<string, int>(sentence, 0));
sentence.clear();
}
}
}
What is being done here (with the two if statements) is checking whether a new sentence has begun in the text that is to be analyzed and dealt with later. The conditionals work but only because we discovered that we have to check for that -1 as well. Any ideas what that represents?
-1 doesn't represent anything in ASCII. All ASCII codes are in the range [0, 127]. It's not even guaranteed by C++ that -1 is a valid value for a char.
The problem is that you're not checking the return value from ifs2.get(), which returns an int (not a char!) that may be -1 on end of file. The proper way to check for this is
int ch = ifs2.get();
if (!ifs2)
// break the loop
because the EOF value is not guaranteed to be -1 (it's actually std::char_traits<char>::eof()).
(Btw., you shouldn't write ASCII codes as magic numbers; use '\n' for linefeed and '\r' for carriage return.)
The while is incorrectly structured: you need to check eof() immediately after get():
for (;;)
{
ch = ifs2.get();
if (ifs2.eof()) break;
ch = tolower(ch);
if (ch == 13)
{
ch = ifs2.get();
if (ifs2.eof()) break;
}
...
}
The -1 is probably the EOF indicator.
Note (as has already been stated) get() returns an int, not a char.
As an ASCII character -1 doesn't represent anything (which is to say -1 is not a valid ASCII value). As the return value from get() it means that the read operation failed - most likely due to the end of file being reached.
Note that the eof() function doesn't return true if the next call to get will fail because of the end of file being reached - it returns true if the previous call to get failed because of the end of file being reached.
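To make the point about eof() concrete, here is a minimal sketch, not taken from the question. The names slurp and countWithEofAtTop are made up for illustration. The first function tests the stream right after each get(); the second repeats the "eof() at the top of the loop" pattern and runs one extra iteration.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for the std::istringstream usage shown below
#include <string>

// Hypothetical helper: reads every character, testing the stream after each get().
std::string slurp(std::istream& in)
{
    std::string out;
    for (;;) {
        int ch = in.get();              // int, so the EOF value stays distinguishable
        if (!in) break;                 // the read failed: do not use ch
        out += static_cast<char>(ch);
    }
    return out;
}

// The pattern from the question: eof() is only true AFTER a read has failed,
// so the loop body runs once more than there are characters.
std::size_t countWithEofAtTop(std::istream& in)
{
    std::size_t iterations = 0;
    while (!in.eof()) {
        in.get();                       // the last call fails and returns the EOF value
        ++iterations;
    }
    return iterations;
}
```

Feeding both a std::istringstream that holds "ab", slurp returns "ab" while countWithEofAtTop goes around its loop three times; that third pass is where the mysterious -1 comes from.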
It's not ASCII, it's an error returned by istream::get
ch = ifs2.get();
It's probably EOF, i.e. you've run out of input.
The fact that checking for -1 works is an accident, and has nothing to
do with ASCII values (which only use 0 to 127). Your code will fail
if either plain char is unsigned (compile with /J with VC++, I think),
or EOF isn't -1 (rare, but all that's guaranteed is that it is
negative). Your code will also fail if the input happens to be
Latin-1, and it contains a 'ÿ'.
The basic problem in your code is that you're not checking for end of
file correctly. Putting the test at the top of the loop doesn't work;
you need to test for failure (not eof()) after input, before using
the value read. There are several ways of doing this; in your case, the
simplest is probably to use:
if ( !ifs2.get(ch) ) {
// Input failed...
}
Alternatively, you can make ch an int, and do:
ch = ifs2.get();
if ( ch == EOF ) {
// Input failed...
}
This has the advantage that the following call to tolower is no longer
undefined behavior (tolower takes an int, which must be in the range
[0...UCHAR_MAX] or EOF—if plain char is signed, you aren't
guaranteeing this). On the other hand, it doesn't allow chaining, i.e.
you can't write the equivalent of:
while ( ifs2.get( sentenceCheck.first )
&& ifs2.get( sentenceCheck.second ) ) {
// ...
}
(which could be useful in some cases).
FWIW: the technique you're using is something called a sliding window
into a stream, and it's worth pushing it off into a separate class to
handle the logic of keeping the window filled and up to date.
Alternatively, a simple state machine could be used for your problem.
And I'd definitely avoid using magic constants: if you want to check for
a carriage return, compare with '\r'. Similarly, newline is '\n',
and in the outer if, it looks like you want to check for whitespace
(isspace( static_cast<unsigned char>( sentenceCheck.second ) )),
rather than comparing for the values.
I might also add that your code fails to correctly handle sentences that
end with a quote, like This is the "text in your input."; it also
fails for abbreviations like Mr. Jones is here.. But those problems
may be beyond the scope of your assignment. (The abbreviations one is
probably not fully solvable: sometimes "etc." is the end of a
sentence, and sometimes it's not.)
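Putting the suggestions above together, a rough sketch could look like the following. It is not the asker's program: splitSentences is a hypothetical function, it keeps the terminating punctuation in each sentence (the original code dropped it), and it makes no attempt to handle quotes or abbreviations.

```cpp
#include <cctype>
#include <istream>
#include <sstream>   // only needed for the std::istringstream usage shown below
#include <string>
#include <vector>

// Hypothetical sketch: splits lower-cased text into sentences, using the
// previous character as a one-character sliding window.
std::vector<std::string> splitSentences(std::istream& in)
{
    std::vector<std::string> sentences;
    std::string current;
    char prev = '\0';
    char ch;
    while (in.get(ch)) {                    // the loop ends as soon as a read fails
        if (ch == '\r') continue;           // skip the CR of a CR/LF pair
        const bool afterTerminator = (prev == '.' || prev == '!' || prev == '?');
        if (afterTerminator && std::isspace(static_cast<unsigned char>(ch))) {
            sentences.push_back(current);   // whitespace after . ! ? ends the sentence
            current.clear();
        } else if (ch == '\n') {
            current += ' ';                 // a line break inside a sentence acts as a space
        } else {
            current += static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        }
        prev = ch;
    }
    if (!current.empty()) sentences.push_back(current);   // text left over at end of input
    return sentences;
}
```

Because the loop condition is the read itself, no -1 ever reaches tolower or the comparisons. For the input "Hi there. How are you?\nFine!" this yields three sentences.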
For my homework assignment, I need to implement Horners Algorithm for converting between bases.
I have been told to use getchar() for this assignment. But I am having a problem where when I hit enter, the program doesn't terminate and just takes in more chars.
Example:
bryce> ./pa1
Enter the fromRadix:16
Enter the toRadix:2
abc
abc
^C
bryce>
Code:
int readRadixA(int radixA)
{
char myChar = getchar();
int result = 0;
int run = 0;
while(myChar != EOF)
{
if(myChar == "\n")
break;
Horners();
myChar = getchar();
}
return result;
}
I am not asking for help implementing Horners; I am asking for help to terminate the getchar() correctly.
if(myChar=="\n")
^ ^
You're comparing myChar wrong. Try this instead:
if(myChar == '\n')
^ ^
A second problem is that getchar returns int, not char. Maybe you can rewrite it like this:
int myChar;
while((myChar = getchar()) != EOF && myChar != '\n')
{
/* Your stuff. */
}
EDIT
In light of comments, I think some stdio operation before that while is leaving a \n in the buffer.
Instead of scanf("%d", &radix) try:
scanf("%d ", &radix);
^
That space will make scanf eat the remaining blanks (including the newline).
Check the return type of getchar(). Yes, it's an int. That's because EOF must have a value that can be distinguished from a valid character. myChar must actually be made to be int.
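For illustration, here is how the int-returning read might be combined with a Horner-style accumulation. This is only a sketch under assumptions: readNumber and digitValue are invented names, the function takes a FILE* (pass stdin to read from the keyboard), and the letter arithmetic assumes an ASCII-like character set where 'a' to 'z' are contiguous.

```cpp
#include <cctype>
#include <cstdio>

// Hypothetical helper: value of one digit character in the given radix,
// or -1 if the character is not a valid digit for that radix.
int digitValue(int c, int radix)
{
    int v = -1;
    if (std::isdigit(c)) v = c - '0';
    else if (std::isalpha(c)) v = std::tolower(c) - 'a' + 10;   // assumes contiguous letters
    return (v >= 0 && v < radix) ? v : -1;
}

// Reads one line of digits and accumulates it with Horner's rule.
long readNumber(std::FILE* in, int radix)
{
    long result = 0;
    int c;                                    // int, so EOF is not confused with a character
    while ((c = std::fgetc(in)) != EOF && c != '\n') {
        const int v = digitValue(c, radix);
        if (v < 0) break;                     // stop at the first invalid digit
        result = result * radix + v;          // Horner step
    }
    return result;
}
```

Called as readNumber(stdin, 16) it behaves like the getchar() loop above, since getchar() is equivalent to reading from stdin.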
Try this code
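int readRadixA(int radixA)
{
    int myChar;   // int rather than char, so that EOF can be detected
    int result = 0;
    int run = 0;
    // stop at the end of the line (Enter arrives as '\n') or at end of input
    while ((myChar = getchar()) != EOF && myChar != '\n')
    {
        // implement horners here
    }
    return result;
}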
I checked your code; I think you are leaving a '\n' in the keyboard input buffer after reading toRadix.
One more thing: terminal input is normally line-buffered, so getchar() does not see anything until Enter is pressed. After that it returns the buffered characters one at a time, ending with the '\n'.
You have also made a mistake by comparing a char to a pointer, e.g. myChar == "\n".
More information about how you read toRadix would really help in answering your question.
On Linux, to end the standard input, you have to type Ctrl-D. The kernel and tty layers make that an end-of-file condition. Then getchar gives EOF (which is not a valid char; for example, on systems where chars are unsigned bytes between 0 and 255, EOF could be -1).
Notice that feof(3) is valid only after a read operation (e.g. getchar, fgets, etc.), so coding while(!feof(stdin)) is generally wrong (contrary to what I wrote in the previous version of this answer). You'd better test whether getchar is returning EOF, so your myChar should be an int (not a char).
I was seeking some help on an issue. I have to read certain "passwords" from a .txt file, like "abE13#" and do some simple error checking to make sure they fit certain requirements. But at the current moment, it's printing out the passwords (which is meant to be done), but it's ignoring the checking and gets stuck in a loop where new lines are being printed out. I'm sure it has to do something with while(ch!='\n') but I'm not all that sure what is needed there in place of that to check.
ch = inFile.get();
while(!inFile.eof())
{
while(ch != '\n')
{
cout << ch;
if(isalpha(ch))
{
charReq++;
if(isupper(ch))
uppercaseReq++;
else
lowercaseReq++;
}
else if(isdigit(ch))
{
charReq++;
digitReq++;
}
else if(isSpecial(ch))
{
charReq++;
specialCharReq++;
}
if(uppercaseReq < 1)
cout << "\n missing uppercase" << endl;
ch = inFile.get();
}
}
It's supposed to kind of follow this format,
Read a character from the password.txt file
while( there are characters in the file )
{
while( the character from the file is not a newline character )
{
Display the character from the file
Code a cascading decision statement to test for the various required characters
Increment a count of the number of characters in the password
Read another character from the password.txt file
}
Determine if the password was valid or not. If the password was invalid,
display the things that were wrong. If the password was valid, display that it
was valid.
Read another character from the file (this will get the character after the
newline character -- ie. the start of a new password or the end of file)
}
Display the total number of passwords, the number of valid passwords, and the
number of invalid passwords
It keeps printing because of this while(inFile). This is always true. Change it to an if statement that just checks whether the file is open:
if ( inFile )
EDIT: It will stop at the first password because of this while(ch != '\n'). When it gets to the end of the first password, ch will be '\n', so the while condition fails and it stops reading. Change it to:
while( !inFile.eof() )
while( the character from the file is not a newline character )
You have converted this line of pseudocode into this line of c++ code:
while (ch != '\t')
'\t' is the tab character, not the newline character. That could well be why the loop never ends and just keeps printing what look like new lines (really the EOF value, but you don't see that).
'\n' is the newline character.
Give that a try.
EDIT:
Also, you're only checking for the entire ifstream to be false. I don't quite know when that would happen, but I would recommend checking for the EOF flag. Your code should turn into something along the lines of this:
while( !inFile.eof() )
{
while(ch != '\n' && !inFile.eof() )
{
// ...
}
}
If you don't check for infile twice, you may end up in an infinite loop.
while(infile.good())
{
while (inFile.good() && ch != '\n')
{
...
}
if (ch == '\n')
{...}
else
{...}
}
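As a sketch of how the pseudocode from the question can be written so that both loops stop on a failed read: the code below is an illustration, not the assignment's solution. The names PasswordStats and checkPasswords are invented, and the validity rule (at least one uppercase letter, one lowercase letter and one digit) is an assumption, since the real requirements are not given.

```cpp
#include <cctype>
#include <istream>
#include <sstream>   // only needed for the std::istringstream usage shown below

// Hypothetical result type for the sketch.
struct PasswordStats { int total = 0; int valid = 0; };

PasswordStats checkPasswords(std::istream& in)
{
    PasswordStats stats;
    int length = 0, upper = 0, lower = 0, digits = 0;
    char ch;
    bool more = static_cast<bool>(in.get(ch));        // priming read
    while (more) {
        while (more && ch != '\n') {                  // one password per line
            const unsigned char u = static_cast<unsigned char>(ch);
            ++length;
            if (std::isupper(u)) ++upper;
            else if (std::islower(u)) ++lower;
            else if (std::isdigit(u)) ++digits;
            more = static_cast<bool>(in.get(ch));     // next character of this password
        }
        if (length > 0) {                             // ignore empty lines
            ++stats.total;
            if (upper > 0 && lower > 0 && digits > 0) ++stats.valid;   // assumed rule
        }
        length = upper = lower = digits = 0;          // reset for the next password
        if (more) more = static_cast<bool>(in.get(ch));   // step past the newline
    }
    return stats;
}
```

The read after the inner loop is the step the question's code is missing: without it, ch stays '\n' forever and the outer loop never makes progress.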
I have more than one input files like this:
>1aab_
GKGDPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKT
MSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
>1j46_A
MQDRVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAE
KWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK
>1k99_A
MKKLKKHPDFPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYK
ELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK
>2lef_A
MHIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQA
KYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK
Here, what I have to do:
vector <string> names;
vector <string> seqs;
names.resize(total); //"total" is already known.
seqs.resize(total);
counter=0;char input;
while ((input = myInput.get()) != EOF)
{
if(input=='>')
names[counter]= take all line (>1aab_, >1j46_A, so...)
else
until the next '>' is seen, append the character to sequence[counter]
counter++;
}
Finally it will be like this:
names[0]=">1aab_"
sequence[0]="GKGDPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE"
and so on..
I have been thinking about this for 2 hours and I couldn't figure it out. Can anyone help with that? Thanks in advance.
There's a few ways to solve it; I'll present some examples but I'm not testing/compiling this code, so there may be minor bugs - the logic is the important bit.
Since your pseudocode appears to be processing the input character by character, I've taken that as a requirement.
The way you seem to be thinking about it would be implemented with essentially a pair of loops - one for reading the name, the other for reading the sequence - which are enclosed in an outer loop, in order to process all records.
This would look something like the following:
// first character in file should be a '>', indicating the start
// of a record.
input = myInput.get();
if (input != '>')
{
std::cerr << "Malformed input file!" << std::endl;
return /*...*/;
}
do
{
// record name continues up until the newline
while ((input = myInput.get()) != EOF)
{
if (input == '\n' || input == '\r')
break;
names[counter].push_back(input);
}
// read sequence until we hit a '>' or EOF
while ((input = myInput.get()) != EOF)
{
if (input == '>')
{
// advance to next record number
counter++;
break;
}
sequence[counter].push_back(input);
}
} while (input != EOF && counter < total);
You'll also notice I moved the check for the initial '>' to before the loop, just as a way of ingesting (and discarding) the character, as well as a basic sanity check of the input. This is because we really use this character to mark the end of the sequence (rather than the "start of a record") - when we enter the loop, we assume we're already reading the record name.
Another way to approach it is to use a state machine. Essentially, this utilises additional variables to track the state the parser is in.
For this particular case, you only have two states: either you're reading a record name, or the sequence. So, we can just use a single boolean to track which state we're in.
Armed with the state variable, we can then make decisions about what to do with the character we just read based upon the state we're in. At the simplest level here, if we're in "read the record name" state, we add the character to the names variable, otherwise we add it to the sequence variable.
// state flag to indicate if we're currently reading a name line,
// i.e. a line starting with ">"
// This should be set true by the first record we encounter, so
// we'll set it false (to indicate we're reading a sequence) in
// order to allow us to detect bad input files.
bool reading_name = false;
// indicate we're on the first record, so we can avoid incrementing
// the record counter
bool first_record = true;
// process input character-by-character until end of file
while ((input = myInput.get()) != EOF)
{
// check for start of new record
if (input == '>')
{
// for robustness, verify we're not already reading a name,
// as this probably indicates invalid input
if (reading_name)
{
std::cerr << "Input is malformed?!" << std::endl;
break;
}
// switch to reading name state
reading_name = true;
// advance to next record, but only if it isn't the first record
if (first_record)
{
// disable the first_record flag, and explicitly set the
// record counter to 0.
first_record = false;
counter = 0;
}
else if (++counter >= total)
{
std::cerr << "Error: too many records!" << std::endl;
break;
}
}
// first character in file should start a new record
else if (first_record)
{
std::cerr << "Missing record start character at beginning of input!" << std::endl;
break;
}
// make sure we are processing a valid record number
else if (counter >= total)
{
std::cerr << "Invalid record number!" << std::endl;
break;
}
// continue reading the name
else if (reading_name)
{
// check if we've reached the end of the line; you
// may also want/need to check for \r if your input
// files may have Windows-style line endings
if (input == '\n')
{
// switch to reading sequence state
reading_name = false;
}
else
{
// add character to current name
names[counter].push_back(input);
}
}
// continue reading the sequence
else
{
// you might need to handle line ending characters here,
// maybe just skipping them?
// add character to current sequence
sequence[counter].push_back(input);
}
}
This adds a fair amount of complexity, which is of questionable value for this particular exercise, but does make adding additional states easier in future. It also has the benefit of only a single place in the code where I/O is done, which reduces the chances of errors (not checking for EOF, overflow array bounds, etc.).
In this case, we're actually using the '>' character as an indicator that a new record is starting, so we add a bit of extra logic to make sure that all works properly with the record counter. You could also just use a signed integer for your counter variable and start it at -1, so it will increment to 0 at the start of the first record, but using signed variables to index into arrays isn't a good idea.
There are more ways to approach this problem, but hopefully this gives you somewhere to start on your own solution.
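One of those other ways, if character-by-character processing is not actually required, is to read whole lines. The following is a sketch only: Record and parseRecords are hypothetical names, and the records are collected in a single vector instead of the separate names and seqs vectors from the question.

```cpp
#include <istream>
#include <sstream>   // only needed for the std::istringstream usage shown below
#include <string>
#include <vector>

// Hypothetical record type: the '>' header line plus the joined sequence lines.
struct Record {
    std::string name;
    std::string sequence;
};

std::vector<Record> parseRecords(std::istream& in)
{
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {                     // stops when a read fails
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                             // tolerate Windows line endings
        if (line.empty())
            continue;                                    // skip blank lines
        if (line[0] == '>')
            records.push_back(Record{line, ""});         // start a new record
        else if (!records.empty())
            records.back().sequence += line;             // append to the current record
        // sequence lines that appear before any '>' header are silently ignored
    }
    return records;
}
```

With the sample input from the question, records[0].name would be ">1aab_" and records[0].sequence the two sequence lines joined together. The vector grows as needed, so total does not have to be known in advance.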
I'm having a hard time understanding why while (cin.get(Ch)) doesn't see the EOF. I read in a text file with 3 words, and when I debug my WordCount is at 3 (just what I hoped for). Then it goes back to the while loop and gets stuck. Ch then has no value. I thought that after the newline it would read the EOF and break out. I am not allowed to use <fstream>, I have to use redirection in DOS. Thank you so much.
#include <iostream>
using namespace std;
int main()
{
char Ch = ' ';
int WordCount = 0;
int LetterCount = 0;
cout << "(Reading file...)" << endl;
while (cin.get(Ch))
{
if ((Ch == '\n') || (Ch == ' '))
{
++WordCount;
LetterCount = 0;
}
else
++LetterCount;
}
cout << "Number of words => " << WordCount << endl;
return 0;
}
while (cin >> Ch)
{ // we get in here if, and only if, the >> was successful
if ((Ch == '\n') || (Ch == ' '))
{
++WordCount;
LetterCount = 0;
}
else
++LetterCount;
}
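That's the safe, and common, way to write such a loop. Be aware, though, that operator>> skips whitespace by default, so with this change Ch will never be ' ' or '\n' unless you first apply the noskipws manipulator to cin; the original cin.get(Ch) does not have this problem.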
(Your code is unusual, trying to scan all characters and count whitespace and newlines. I'll give a more general answer to a slightly different question - how to read in all the words.)
The safest way to check whether a stream is finished is if(stream). Beware of if(stream.good()): it doesn't always work as expected and will sometimes quit too early. The last >> into a char will not take us to EOF, but the last >> into an int or string will take us to EOF. This inconsistency can be confusing. Therefore, it is not correct to use good(), or any other test that tests EOF.
string word;
while(cin >> word) {
++word_count;
}
There is an important difference between if(cin) and if(cin.good()). The former is the operator bool conversion. Usually, in this context, you want to test:
"did the last extraction operation succeed or fail?"
This is not the same as:
"are we now at EOF?"
After the last word has been read by cin >> word, the stream is at EOF. But the extraction succeeded, and word contains the last word.
TLDR: The eof bit is not important. The fail bit (together with the bad bit) is: it tells us that the last extraction was a failure.
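The difference between "at EOF" and "the last extraction failed" can be observed directly. In this small sketch, StreamState and readOneWord are invented names; it extracts one word from a string and reports the stream flags afterwards.

```cpp
#include <sstream>
#include <string>

// Hypothetical snapshot of the stream flags after a single extraction.
struct StreamState {
    bool eof;
    bool fail;
    std::string word;
};

StreamState readOneWord(const std::string& text)
{
    std::istringstream in(text);
    std::string word;
    in >> word;                                   // one extraction attempt
    return StreamState{in.eof(), in.fail(), word};
}
```

For "fun!" the extraction succeeds (fail is false) even though eof is already true, because the read ran into the end of the input while collecting the word. For "fun!\n" neither flag is set. Only for an empty input are both eof and fail set, and that is the case a while(in >> word) loop stops on.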
The Counting
The program counts newline and space characters as words. In your file contents "this is fun!" I see two spaces and no newline. This is consistent with the observed output indicating two words.
Have you tried looking at your file with a hex editor or something similar to be sure of the exact contents?
You could also change your program to count one more word if the last character read in the loop was a letter. This way you don't have to have newline terminated input files.
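A sketch of that last suggestion, with a hypothetical function name (countWords) and a stream parameter instead of cin so it can be tried on a string: it counts a word whenever a run of non-whitespace characters ends, including a run that is ended by the end of the input.

```cpp
#include <cctype>
#include <istream>
#include <sstream>   // only needed for the std::istringstream usage shown below

// Counts runs of non-whitespace characters; a trailing newline is not required.
int countWords(std::istream& in)
{
    int words = 0;
    bool inWord = false;
    char ch;
    while (in.get(ch)) {                                   // false once a read fails
        if (std::isspace(static_cast<unsigned char>(ch))) {
            if (inWord) ++words;                           // whitespace ends a word
            inWord = false;
        } else {
            inWord = true;
        }
    }
    if (inWord) ++words;                                   // last word ended by end of input
    return words;
}
```

Called as countWords(std::cin) it works with redirected input. It gives 3 for "this is fun!" with or without a trailing newline, and repeated spaces are not counted as extra words.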
Loop Termination
I have no explanation for your loop termination issues. The while-condition looks fine to me. istream::get(char&) returns a stream reference. In a while-condition, depending on the C++ level your compiler implements, operator bool or operator void* will be applied to the reference to indicate if further reading is possible.
Idiom
The standard idiom for reading from a stream is
char c = 0;
while( cin >> c )
process(c);
I do not deviate from it without serious reason.
Your input file is
this is fun!{EOF}
The two spaces make WordCount increase to 2,
and then comes EOF and the loop exits. If you add a newline, your input file is
this is fun!\n{EOF}
and the newline is counted as well, giving 3.
I took your program, loaded it into Visual Studio 2013, and changed cin to an fstream object that opened a file called stuff.txt containing the exact characters "This is fun!\n\r"; the program worked. As previous answers have indicated, be careful: if there's no \n at the end of the text, the program will miss the last word. However, I wasn't able to reproduce the application hanging in an infinite loop. The code as written looks correct to me.
cin.get(char) returns a reference to an istream object, which then has its operator bool() called; that returns false when any of the error bits are set. There are some better ways to write this code to deal with other error conditions... but this code works for me.
In your case, the correct way to bail out of the loop is:
while (cin.good()) {
char Ch = cin.get();
if (cin.good()) {
// do something with Ch
}
}
That said, there are probably better ways to do what you're trying to do.
So basically, I might have some string that looks like: "hey this is a string * this string is awesome 97 * 3 = 27 * this string is cool".
However, this string might be huge. I'm trying to remove all the asterisks from the string, unless that asterisk appears to represent multiplication. Efficiency is somewhat important here, and I'm having trouble coming up with a good algorithm to remove all the non-multiplication asterisks from this.
In order to determine whether an asterisk is for multiplication, I can obviously just check whether it's sandwiched in between two numbers.
Thus, I was thinking I could do something like (pseudocode):
wasNumber = false
Loop through string
if number
set wasNumber = true
else
set wasNumber = false
if asterisk
if wasNumber
if the next word is a number
do nothing
else
remove asterisk
else
remove asterisk
However, that^ is ugly and inefficient on a huge string. Can you think of a better way to accomplish this in C++?
Also, how could I actually check whether a word is a number? It's allowed to be a decimal. I know there's a function to check if a character is a number...
Fully functioning code:
#include <iostream>
#include <string>
using namespace std;
string RemoveAllAstericks(string);
void RemoveSingleAsterick(string&, int);
bool IsDigit(char);
int main()
{
string myString = "hey this is a string * this string is awesome 97 * 3 = 27 * this string is cool";
string newString = RemoveAllAstericks(myString);
cout << "Original: " << myString << "\n";
cout << "Modified: " << newString << endl;
system("pause");
return 0;
}
string RemoveAllAstericks(string s)
{
int len = s.size();
int pos;
for(int i = 0; i < len; i++)
{
if(s[i] != '*')
continue;
pos = i - 1;
char cBefore = s[pos];
while(cBefore == ' ')
{
pos--;
cBefore = s[pos];
}
pos = i + 1;
char cAfter = s[pos];
while(cAfter == ' ')
{
pos++;
cAfter = s[pos];
}
if( !(IsDigit(cBefore) && IsDigit(cAfter)) )
RemoveSingleAsterick(s, i);
}
return s;
}
void RemoveSingleAsterick(string& s, int i)
{
s[i] = ' '; // Replaces * with a space, but you can do whatever you want
}
bool IsDigit(char c)
{
return (c <= '9' && c >= '0');
}
Top level overview:
Code searches the string until it encounters an *. Then, it looks at the first non-whitespace character before AND after the *. If both characters are numeric, the code decides that this is a multiplication operation and leaves the asterisk alone. Otherwise, the asterisk is removed.
See the revision history of this post if you'd like other details.
Important Notes:
You should seriously consider adding boundary checks on the string (i.e. don't try to access an index that is less than 0 or greater than or equal to len), for example when the string starts or ends with an asterisk.
If you are worried about parentheses, then change the condition that checks for whitespaces to also check for parentheses.
Checking whether every single character is a number is a bad idea. At the very least, it will require two logical checks (see my IsDigit() function). (My code checks for '*', which is one logical operation.) However, some of the suggestions posted were very poorly thought out. Do not use regular expressions to check if a character is numeric.
Since you mentioned efficiency in your question, and I don't have sufficient rep points to comment on other answers:
A switch statement that checks for '0' '1' '2' ..., means that every character that is NOT a digit, must go through 10 logical operations. With all due respect, please, since chars map to ints, just check the boundaries (char <= '9' && char >= '0')
You can start by implementing the slow version; it could be much faster than you think. But let's say it's too slow. It then is an optimization problem. Where does the inefficiency lie?
"if number" is easy, you can use a regex or anything that stops when it finds something that is not a digit
"if the next word is a number" is just as easy to implement efficiently.
Now, it's the "remove asterisk" part that is an issue to you. The key point to notice here is that you don't need to duplicate the string: you can actually modify it in place since you are only removing elements.
Try to run through this visually before trying to implement it.
Keep two integers or iterators, the first one saying where you are currently reading your string, and the second one saying where you are currently writing your string. Since you only erase stuff, the read one will always be ahead of the writing one.
If you decide to keep the current character, copy it from the read position to the write position and advance both integers/iterators by one. If you don't want to keep it, just advance the read position! At the end you only have to shorten the string by the number of asterisks you removed. The complexity is simply O(n), without any additional buffer used.
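A sketch of this read/write scheme follows. The function name removeStrayAsterisks is made up, and the rule it applies (an asterisk is kept only when the nearest non-space characters on both sides are digits) is a simplification of "sandwiched between two numbers".

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// In-place removal of asterisks that do not look like multiplication.
void removeStrayAsterisks(std::string& s)
{
    std::size_t write = 0;            // where the next kept character goes
    bool digitBefore = false;         // was the last non-space, non-'*' character a digit?
    for (std::size_t read = 0; read < s.size(); ++read) {
        const char c = s[read];
        if (c == '*') {
            // look ahead to the first non-space character after the asterisk;
            // positions beyond read have not been overwritten yet
            std::size_t next = read + 1;
            while (next < s.size() && s[next] == ' ') ++next;
            const bool digitAfter = next < s.size()
                && std::isdigit(static_cast<unsigned char>(s[next])) != 0;
            if (!(digitBefore && digitAfter))
                continue;             // drop it: only the read position advances
        } else if (c != ' ') {
            digitBefore = std::isdigit(static_cast<unsigned char>(c)) != 0;
        }
        s[write++] = c;               // keep: copy down and advance both positions
    }
    s.resize(write);                  // cut off the leftover tail
}
```

Applied to the sample string from the question, this keeps the asterisk in "97 * 3" and removes the other two, leaving their surrounding spaces in place.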
Also note that your algorithm would be simpler (but equivalent) if written like this:
wasNumber = false
Loop through string
if number
set wasNumber = true
else
set wasNumber = false
if asterisk and wasNumber and next word is a number
do nothing // using my algorithm, "do nothing" actually copies what you intend to keep
else
remove asterisk
I found your little problem interesting and I wrote (and tested) a small and simple function that would do just that on a std::string. Here you go:
// TestStringsCpp.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <string>
#include <iostream>
using namespace std;
string& ClearAsterisk(string& iString)
{
    bool bLastCharNumeric = false;
    //no ++it in the for header: the iterator is advanced at the bottom of the loop
    for (string::iterator it = iString.begin(); it != iString.end(); ) {
        bool bErase = false;
        switch (*it) {
        case ' ': break;//ignore whitespace characters
        case '*':
            bErase = true; //erase unless it turns out to be a multiplication
            if (bLastCharNumeric) {
                //asterisk is preceded by numeric character. we have to check if
                //the following non space character is numeric also
                for (string::iterator it2 = it + 1; it2 != iString.end() ; ++it2) {
                    if (*it2 != ' ') {
                        if (*it2 <= '9' && *it2 >= '0') bErase = false;
                        break; //exit current for
                    }
                }
            }
            break;
        default:
            bLastCharNumeric = (*it <= '9' && *it >= '0'); //set or reset flag
        }
        //erase() invalidates 'it', so continue from the iterator it returns
        if (bErase) it = iString.erase(it);
        else ++it;
    }
    return iString;
}
int _tmain(int argc, _TCHAR* argv[])
{
string testString = "hey this is a string * this string is awesome 97 * 3 = 27 * this string is cool";
cout<<ClearAsterisk(testString).c_str();
cin >> testString; //this is just for the app to pause a bit :)
return 0;
}
It will work perfectly with your sample string, but it will fail if you have a text like this: "this is a happy 5 * 3day menu", because it checks only the first non-space character after the '*'. But frankly I can't imagine a lot of cases where you would have this kind of construct in a sentence.
HTH,JP.
A regular expression wouldn't necessarily be any more efficient, but it would let you rely on somebody else to do your string parsing and manipulation.
Personally, if I were worried about efficiency, I would implement your pseudocode version while limiting needless memory allocations. I might even mmap the input file. I highly doubt that you'll get much faster than that.