C++ Find Word in String without Regex - c++

I'm trying to find a certain word in a string, but find that word alone. For example, if I had a word bank:
789540132143
93
3
5434
I only want a match to be found for the value 3, as the other values do not match exactly. I used the normal string::find function, but that found matches for all four values in the word bank because they all contain 3.
There is no whitespace surrounding the values, and I am not allowed to use Regex. I'm looking for the fastest implementation of completing this task.

If you want to count the words, you should use a string-to-int map. Read each word from your file into a string using >>, then increment its entry in the map:
string word;
map<string,int> count;
ifstream input("file.txt");
while (input >> word) { // stops cleanly at end of file, so the last word is not counted twice
    count[word]++;
}
Using >> has the benefit that you don't have to worry about whitespace.

It all depends on the definition of a word: is it a string separated from the others by whitespace? Or are other word separators (e.g. comma, dot, semicolon, colon, parentheses...) relevant as well?
How to parse for words without regex:
Here is an acceptable approach using find() and its variant find_first_of():
string myline;                       // line to be parsed
string what = "3";                   // string to be found
string separator = " \t\n,;.:()[]";  // word separators
while (getline(cin, myline)) {
    size_t nxt = 0;
    while ((nxt = myline.find(what, nxt)) != string::npos) { // search occurrences of what
        if (nxt == 0 || separator.find(myline[nxt-1]) != string::npos) { // if at the beginning of a word
            size_t nsep = myline.find_first_of(separator, nxt+1); // check whether it runs to the end of the word
            if ((nsep == string::npos && myline.length()-nxt == what.length()) || nsep-nxt == what.length()) {
                cout << "Line: " << myline << endl; // bingo !!
                cout << "from pos " << nxt << " to " << nsep << endl;
            }
        }
        nxt++; // ready for next occurrence
    }
}
And here the online demo.
The principle is to check whether each occurrence found corresponds to a whole word, i.e. that it starts at the beginning of the string or of a word (the previous char is a separator) and that it extends to the next separator (or the end of the line).
How to solve your real problem:
Even with the fastest word search function, if you use it to solve your problem of counting words, as you explained in your comment, you will waste a lot of effort!
The best way to achieve this would certainly be to use a map<string, int> to store and update a counter for each string encountered in the file.
You then just have to parse each line into words (you could use find_first_of() as suggested above) and use the map:
mymap[word]++;
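As a rough sketch of that last step, the following splits a line on a set of separator characters with find_first_of() and updates the map. The function name countWords and its parameters are made up for illustration; the separator set is whatever you decide counts as a word boundary.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: split `line` on any character found in `separators`
// and increment the counter of each word. Empty tokens (two adjacent
// separators) are skipped.
void countWords(const std::string& line,
                const std::string& separators,
                std::map<std::string, int>& counts) {
    std::size_t start = 0;
    while (start < line.size()) {
        std::size_t end = line.find_first_of(separators, start);
        if (end == std::string::npos) {
            end = line.size();            // last word runs to the end of the line
        }
        if (end > start) {
            ++counts[line.substr(start, end - start)];
        }
        start = end + 1;                  // continue after the separator
    }
}
```

Called once per line read with getline(), this leaves the exact-match count for "3" in counts["3"], without "93" or "5434" contributing to it.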

Related

Using regex to parse out numbers

My problem is more or less self-explanatory, I want to write a regex to parse out numbers out of a string that user enters via console. I take the user input using:
getline(std::cin,stringName); //1 2 3 4 5
I assume that the user enters N numbers, each followed by a space except the last one.
I have solved this problem by analyzing string char by char like this:
std::string helper = "";
std::for_each(stringName.cbegin(), stringName.cend(), [&](char c)
{
if (c == ' ')
{
intVector.push_back(std::stoi(helper.c_str()));
helper = "";
}
else
helper += c;
});
intVector.push_back(std::stoi(helper.c_str()));
I want to achieve the same behavior by using regex. I've written the following code:
std::regex rx1("([0-9]+ )");
std::sregex_iterator begin(stringName.begin(), stringName.end(), rx1);
std::sregex_iterator end;
while (begin != end)
{
std::smatch sm = *begin;
int number = std::stoi(sm.str(1));
std::cout << number << " ";
}
Problem with this regex occurs when it gets to the last number since it doesn't have space behind it, therefore it enters an infinite loop. Can someone give me an idea on how to fix this?
You're going to get an endless loop there because you never increment begin. If you do that, you'll get all the numbers except the last one (which, as you say, is not followed by a space).
But I don't understand why you feel it necessary to include the whitespace in the regular expression. If you just match a string of digits, the regex will automatically select the longest possible match, so the following character (if any) cannot be a digit.
I also see no value in the capture in the regex. If you wanted to restrict the capture to the number itself, you would have used ([0-9]+). (But since stoi only converts until it finds a non-digit, it doesn't matter.)
So you just use this:
std::regex rx1("[0-9]+");
for (auto it = std::sregex_iterator{str.begin(), str.end(), rx1},
end = std::sregex_iterator{};
it != end;
++it) {
std::cout << std::stoi(it->str(0)) << '\n';
}
(Live on coliru)
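If the goal is to fill a vector, as in the original character-by-character loop, the same iteration can be wrapped in a small function. This is only a sketch: the name parseNumbers is invented here, and std::stoi will still throw std::out_of_range if a run of digits does not fit in an int.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every run of decimal digits in `input`.
std::vector<int> parseNumbers(const std::string& input) {
    static const std::regex digits("[0-9]+");
    std::vector<int> numbers;
    for (auto it = std::sregex_iterator(input.begin(), input.end(), digits),
              end = std::sregex_iterator();
         it != end; ++it) {
        numbers.push_back(std::stoi(it->str()));  // whole match, no capture group needed
    }
    return numbers;
}
```

Because the pattern no longer requires a trailing space, the last number on the line is picked up like the others.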

How to delimit this text file? strtok

so there's a text file where I have 1. languages, a 2. text of a number written in the said language, 3. the base of the number and 4. the number written in digits. Here's a sample:
francais deux mille quatre cents 10 2400
How I went about it:
struct Nomen{
char langue[21], nomNombre [31], baseC[3], nombreC[21];
int base, nombre;
};
and in the main:
if(myfile.is_open()){
{
while(getline(myfile, line))
{
strcpy(Linguo[i].langue, strtok((char *)line.c_str(), " "));
strcpy(Linguo[i].nomNombre, strtok(NULL, " "));
strcpy(Linguo[i].baseC, strtok(NULL, " "));
strcpy(Linguo[i].nombreC, strtok(NULL, "\n"));
i++;
}
Difficulty: I'm trying to put two whitespaces as a delimiter, but it seems that strtok() counts it as if there were only one whitespace. The fact there are spaces in the text number, etc. is messing up the tokenization. How should I go about it?
strtok treats every single character in the provided string as a delimiter. It does not treat the string itself as one delimiter. So "  " (two spaces) is the same as " " (one space).
strtok will also treat several adjacent delimiters as a single delimiter. So the input "t1  t2" will be tokenized as two tokens, "t1" and "t2".
As mentioned in the comments, strtok also writes the NUL character into the input to create the token strings. So it is an error to pass the result of string::c_str() as input to the function. The fact that you need to cast away the constness should have been enough to dissuade you from this approach.
If you want to treat a double space as a delimiter, you will have to scan the string and search for them yourself. Given you are using C APIs, you can consider strstr. However, in C++, you can use string::find.
Here's an algorithm to parse your string manually:
Given an input string input:
language is the substring from the start of input to the first SPC character.
From where language ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
text is the substring from the start of input to the first double SPC sequence.
From where text ends, skip over all whitespace, changing input to begin at the first non-whitespace character.
Parse base, and parse number.
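The steps above could be sketched with std::string as follows. The struct Entry and the function parseEntry are hypothetical names, and the code assumes the text field is always terminated by at least two spaces, as the question describes.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical record type for one line of the file.
struct Entry {
    std::string language;
    std::string text;
    int base = 0;
    long number = 0;
};

// Returns false if the line does not have the expected layout.
bool parseEntry(const std::string& line, Entry& out) {
    const std::string ws = " \t";

    // language: from the start up to the first space
    std::size_t langEnd = line.find(' ');
    if (langEnd == std::string::npos) return false;
    out.language = line.substr(0, langEnd);

    // skip whitespace to reach the start of the text
    std::size_t textStart = line.find_first_not_of(ws, langEnd);
    if (textStart == std::string::npos) return false;

    // text: up to the first double-space sequence
    std::size_t textEnd = line.find("  ", textStart);
    if (textEnd == std::string::npos) return false;
    out.text = line.substr(textStart, textEnd - textStart);

    // skip whitespace, then parse base and number
    std::size_t baseStart = line.find_first_not_of(ws, textEnd);
    if (baseStart == std::string::npos) return false;
    std::istringstream rest(line.substr(baseStart));
    return static_cast<bool>(rest >> out.base >> out.number);
}
```

Unlike the strtok version, this never modifies the input string, so it can safely be called on each line returned by getline().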

c++ How to extract the whitespace between words if there is one

I've got two questions. I need to write a program that extracts all non-alphabetic characters and displays them, then removes them.
I am using isalpha which is working for symbols, but only if the input string has no spaces like "hello world"
but if it is more than one word like "hello! world!", it will only extract the first exclamation mark but not the second.
Second question, which may be related: I want my program to detect the spaces between the words (I tried isspace but I must have used it wrong?), remove them, and put them in a char variable
so for example
if the input is hello4 world! How3 are you today?
I want it to tell me
removed: 4
removed:
removed: !
removed:
removed: 3
removed:
removed:
removed:
long story short, if there is no other way, I'd like to detect spaces as !isalpha, or find something similar to isalpha for space between text.
Thanks
# include <iostream>
# include <string>
using namespace std;
void main()
{
string message;
cin >> message;
for (int i = 0; message[i]; i++)
if(!isalpha(message[i]))
cout << "deleted following character: " << message[i] <<endl;
else
cout <<"All is good! \n";
}
>> reads a single word, stopping when a whitespace character is found. To read a whole line, you want
std::getline(std::cin, message);
Another way to find the non-alphabetic characters is to check the ASCII value of each character and compare it with the ASCII values of the alphabetic characters. If it is not one of them and not a space (compare with the ASCII value of space), then you have a non-alphabetic character.
You can find all the ASCII codes here: http://www.asciitable.com/
-Jayesh
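Putting the two answers together, a minimal sketch might look like the function below. The name stripNonAlpha is invented for illustration; it treats spaces as removable characters too, since that is what the expected output in the question shows.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: append every non-alphabetic character of `message`
// (spaces included) to `removed`, and return the letters that remain.
std::string stripNonAlpha(const std::string& message, std::string& removed) {
    std::string kept;
    for (unsigned char c : message) {     // unsigned char keeps std::isalpha well defined
        if (std::isalpha(c)) {
            kept += static_cast<char>(c);
        } else {
            removed += static_cast<char>(c);
        }
    }
    return kept;
}
```

Read the input with std::getline(std::cin, message) before calling it, so that the whole line, spaces included, reaches the loop.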

End of Line Word Counting (C++)

I need to create a program that reads in a file, counts the words inside of it, and lists unique words with their frequency. The program considers any series of characters without spaces a word (so things like "hello." "hello" and ",.?" are all different words). I am having difficulty with using an if statement and adding a word at the end of the line to my word count. It counts the words that have spaces after them but not '/n'. This is the code I have for counting the words:
in.get(last);
in.get(current);
while(!in.eof())
{
if((current == ' ' && last != ' ') || (current == '/n' && last != ' ' && last != '/n'))
count++;
last = current;
in.get(current);
}
This is a painful way to do it... You are better off reading strings, which are automatically delimited by whitespace.
string word;
map<string,int> freq;
while( in >> word ) {
freq[word]++;
}
Note that in the example you gave, you used '/n', which should be '\n'. In my example, you don't even need it.
I would create a map (http://www.cplusplus.com/reference/map/map/), and if the word already exists, increment its frequency; otherwise add the word to the map.
This way you quickly check if the word exists, to have a unique list.

Reaching a specific word in a string

Hi I have a string like this:
word1--tab--word2--tab--word3--tab--word4--tab--word5--tab--word6
I need to extract the third word from the string. I thought of reading character by character and getting the word after reading the second tab. But I guess it is inefficient. Can you show me a more specific way please?
std::string has the find method which returns an index. You can use
find("--", lastFoundIndex + 1)
three times to find the start index of your word, a fourth time for the end index, and then use substr.
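A generalised sketch of that idea is shown below. The function nthField and its parameters are hypothetical; pass whatever your real delimiter is (for example "\t" if the fields are separated by a tab character).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the n-th field (0-based) of `s` when split on
// `delim`, or an empty string if there are fewer fields than that.
std::string nthField(const std::string& s, const std::string& delim, std::size_t n) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t pos = s.find(delim, start);   // next delimiter
        if (pos == std::string::npos) return "";
        start = pos + delim.size();               // field begins after it
    }
    std::size_t end = s.find(delim, start);       // delimiter closing the field
    return s.substr(start, end == std::string::npos ? std::string::npos : end - start);
}
```

For the third word, call nthField(line, "\t", 2). Each call scans the string only up to the end of the requested field, so nothing after it is read.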
assuming "tab" is \t;
std::istringstream str(".....");
std::string temp, word;
str >> temp >> temp >> word;