Strtok and ability to ignore usernames? - c++

So I'm currently doing a project where we need to parse sentences (more specifically tweets) by word and store the frequencies of words and the words themselves in a vector pair (with a custom find function to increment frequencies).
I'm currently using strtok to parse the sentences, and I was wondering whether it can ignore any word that has a # symbol at the beginning. My delimiter string for strtok is a bunch of non-useful symbols and spaces, !##&()–[{}]:;',?/*\".+\\^, and it ignores them correctly. But say I have a word like #thisismyusername: is there a way to ignore the whole word, including the 'thisismyusername', and not just the #?
I've been looking for documentation on something like this but haven't found anything yet.
Here is my strtok parsing code:
char* tempMap;
tempMap = strtok (tempHolderPos," !##&()–[{}]:;',?/*\".+\\^");
*tempHolderPos is the full sentence.
Thanks guys!

You can do exactly that. For instance, something like the following will work with your strtok loop:
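/* yourdelims must not contain '#' */
for (ptr = strtok (tempHolderPos, yourdelims); ptr != NULL; ptr = strtok (NULL, yourdelims)) {
    if (*ptr == '#')    /* token begins with '#': skip it */
        continue;
    ...
}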
After getting a token from strtok, you simply check whether its first character is a '#' and, if so, go get the next word, effectively ignoring the word beginning with '#'. Note that this only works if '#' is not in your delimiter string; otherwise strtok strips the '#' before you ever see the token.
Recall that dereferencing a character pointer gives you the character it points to. A token pointer holds the address of the token's first character, so dereferencing it gives you that first character. If it is '#', fetch the next word and skip all the processing that would otherwise be done on the token.
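As a rough illustration of the check described above, here is a minimal sketch. The function name words_without_tags is hypothetical, and the delimiter string is an assumption: it is the one from the question with '#' removed (so tokens can still start with '#') and without the '–' character, which may be multi-byte depending on the source encoding.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits a tweet into words with strtok and drops
// every token whose first character is '#'.
std::vector<std::string> words_without_tags(std::string tweet) {
    // Assumption: '#' is deliberately NOT a delimiter here. If it were,
    // strtok would strip it and no token could ever start with '#'.
    const char* delims = " !&()[{}]:;',?/*\".+\\^";
    std::vector<std::string> words;
    // strtok modifies its input, so it works on the local copy of the tweet.
    for (char* tok = std::strtok(&tweet[0], delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        if (*tok == '#')   // first character of the token
            continue;      // skip the whole word
        words.push_back(tok);
    }
    return words;
}
```

Calling words_without_tags("hello #thisismyusername world!") would keep "hello" and "world" and drop the tagged word. The frequency counting from the question would then run on the returned vector.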

Looking at the strtok reference, I don't think you can do that directly. It would be easy, though, to ignore any token that starts with # and just continue without saving it.

Related

How to parse user input while ignoring noise words in C++?

I'm trying to write a simple text adventure game in C++. I want to allow the user to be able to type in phrases such as "GET THE DOG" where the code would ignore 'THE' and just give me the important things like 'GET' and 'DOG'. I also want the game to support movement, so another example of a phrase could be something like "MOVE TO THE LEFT" where the game would ignore 'TO' and 'THE' and only pay attention to 'MOVE' 'LEFT'.
Anyone have any tips on how to write a function to do this? I thought at first I could use getline, but the only way I think I can get that to work, is if I already know the position of the important words. My friend suggested using substr to put the strings into a vector, then iterating over that. But even that way I'm not too sure how I'd use substr to do such a thing.
Thanks!
char str[100];
cin.getline(str,100);
char* point;
point = strtok(str, " ");
while(point != NULL){
    cout<<point<<endl;
    point = strtok(NULL, " ");
}
Here is something I've put together while trying to figure out how to do this. I'm not really sure why it works, but it's doing something right. It's pointing to whole words, because whenever I print the pointer, it prints the word before the whitespace.
The usual approach would be to split the input up into words (probably in a std::vector<std::string>), and filter (std::remove_if) the words using a set (probably a std::unordered_set<std::string>) of "stop words". Then you can try to make sense of what's left.
Technically, a stop word is a word so common that it is pointless to use it in a search. I don't know why they are called "stop words", but it is definitely the usual term and you can use it to find some common lists. Not all of them are "noise", in your sense, but I think all your noise words will be on common stop word lists.
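The split-then-filter approach described above could look roughly like this. It is only a sketch: the function name keywords and the tiny stop-word set are made up for illustration, and the input is assumed to be upper case and whitespace-separated, as in the question's examples.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: splits a command line on whitespace and removes
// the stop words, leaving only the words the game should act on.
std::vector<std::string> keywords(const std::string& line) {
    // Assumption: a tiny hand-picked stop-word set; a real game would
    // probably load a longer list.
    static const std::unordered_set<std::string> stop_words = {"THE", "TO", "A", "AN"};

    // Split on whitespace.
    std::vector<std::string> words;
    std::istringstream in(line);
    for (std::string w; in >> w; )
        words.push_back(w);

    // Erase-remove idiom: drop every word found in the stop-word set.
    words.erase(std::remove_if(words.begin(), words.end(),
                               [](const std::string& w) {
                                   return stop_words.count(w) != 0;
                               }),
                words.end());
    return words;
}
```

With this sketch, "GET THE DOG" reduces to GET and DOG, and "MOVE TO THE LEFT" reduces to MOVE and LEFT.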

Unrecognizable character in C++

I'm programming an application that converts .txt files to bags of words for text mining. However, I keep getting non-alphabetic characters (like ¾ and =) even though my application filters non-alphabetic characters:
My vector passes through a loop which erases strings that begin with a char whose ASCII value is outside [65, 90] (from A to Z). These characters also pass the isalpha test. It seems like these characters can't be distinguished from alphabetic characters.
I don't see how I can remove these weird strings dynamically from my vector of strings. I need help.
My full code is posted separately because it is quite long for a forum post.
This part of my code fails to get rid of the strings beginning with non-alphabetic characters:
for (unsigned int i=0; i<token24.size();i++){
string temp = token24[i];
char c = temp[0];
if(c>90||c<65){
token24.erase(token24.begin()+i);
i--;
}
}
I also tried with the condition
(c>'Z'||c<'A')
You could always do a string replace that swaps those characters for whitespace, but that only handles specific characters, not the larger problem.
I don't think we can do anything for you until we see the code.
The most important part in programs like yours is handling the content of the .txt file. Such a file can be Unicode text, which in turn can be encoded, for example, as UTF-8. In that case a single byte may be only part of a character, not a character itself. Are you sure you load (and possibly decode) the file properly?
Also, don't you think that lowercase letters are also valid alpha characters?
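To make the two points above concrete, here is a minimal sketch of a byte-level filter that accepts both upper and lower case ASCII letters. The names starts_with_ascii_letter and drop_non_alpha are hypothetical. The first byte is converted to unsigned char so that bytes above 127 (for example the bytes of a UTF-8 encoded ¾) are never mistaken for letters. It does not decode UTF-8; it only rejects anything that does not start with a plain ASCII letter.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: true if the string starts with A-Z or a-z.
bool starts_with_ascii_letter(const std::string& s) {
    if (s.empty())
        return false;
    // Compare as unsigned so that bytes >= 128 are not negative values.
    unsigned char c = static_cast<unsigned char>(s[0]);
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Hypothetical helper: removes every token that does not start with an
// ASCII letter, using the erase-remove idiom instead of erasing inside
// an index loop.
void drop_non_alpha(std::vector<std::string>& tokens) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const std::string& s) {
                                    return !starts_with_ascii_letter(s);
                                }),
                 tokens.end());
}
```

If the files really are UTF-8 and accented letters should be kept, this byte-level test is too strict, and the text would need to be decoded first.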

C++ Parsing string to find occurrence

So I need to parse the input of the user in the following way:
If the user enters
C:\Program\Folder\NextFolder\File.txt
OR
C:\Program\Folder\NextFolder\File.txt\
Then I want to remove the file and just save
C:\Program\Folder\NextFolder\
I essentially want to find the first occurrence of \ starting at the end, and if they put a trailing slash then I can find the second occurrence. I can tell which of the two cases I am in with this code:
input.substr(input.size()-1,1)!="/"
But I don't understand how to find the first occurrence starting from the end. Any ideas?
This
input.substr(input.size()-1,1)!="/"
is very inefficient*. Use:
if( ! input.empty() && input[ input.length() - 1 ] == '/' )
{
// something
}
Finding the first occurrence of something, starting from the end, is the same as finding the last "something", starting from the beginning. You can use find_last_of or rfind, or even std::find combined with rbegin and rend.
*std::string::substr creates one temporary substring, and "/" may create another (depending on which std::string operator!= overload is used); the two strings are then compared and the temporaries destroyed.
Note that
C:\Program\Folder\NextFolder\File.txt\
is not a path to a file, it's a directory.
If your input is a std::string (which I think it is), you can search it with string::find for a forward search and string::rfind for a reverse search (end to start). To check the last character you don't need substr, and you shouldn't use it, since it creates a new string instance just to check one character. You can simply write if( !input.empty() && input.back() == '/' ) (back() requires C++11).
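Putting the rfind suggestion together, a minimal sketch might look like the following. The function name parent_dir is hypothetical, and it assumes the separator is a backslash, as in the question's example paths, and that at most one trailing separator has to be ignored.

```cpp
#include <string>

// Hypothetical helper: returns the path up to and including the last
// backslash, ignoring a single trailing backslash if there is one.
std::string parent_dir(std::string path) {
    // Drop one trailing separator so both input forms are handled alike.
    if (!path.empty() && path.back() == '\\')
        path.pop_back();

    // rfind searches from the end, i.e. it finds the last occurrence.
    std::string::size_type pos = path.rfind('\\');
    if (pos == std::string::npos)
        return std::string();          // no separator at all
    return path.substr(0, pos + 1);    // keep the separator itself
}
```

Both C:\Program\Folder\NextFolder\File.txt and C:\Program\Folder\NextFolder\File.txt\ would come back as C:\Program\Folder\NextFolder\ with this sketch. It treats the last component as a name to strip without checking whether it is actually a file or a directory.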
If you are using C++ strings, then try the reverse iterator on the strings, to write your own logic on what is acceptable and what is not. There is a clear example in the link I provided.
From what I can guess, you are trying to store the directory name given a path that could end with either a file or a directory.
If that is the case, you are better off removing the trailing '\' and checking whether the result is a directory: stop if it is, or else proceed if it is not.
Alternatively, you can try splitting the string on the last '\' into two parts. Some related notes here.
If those are actual file names (it looks like you are using Windows), try the _splitpath function as well.

How to read a message from a file, modifying only words?

Suppose I have the following text:
My name is myName. I love
stackoverflow .
Hi, Guys! There is more than one space after "Guys!" 123
And also after "123" there are 2 spaces and newline.
Now I need to read this text file exactly as it is and apply some changes only to the alphanumeric words. Afterwards I have to print it with the changed words, but with spaces, newlines and punctuation unchanged and in the same positions. The length of an alphanumeric word stays the same when it is changed. I have tried this using library functions that check for alphanumeric characters, but the code gets very messy. Is there any other way?
You can read your file line by line with the fgets() function. It fills a char array that you can then work with, e.g. iterate over the array, split it into alnum words, change the words, and then write the fixed string to a new file with the fwrite() function.
If you prefer the C++ way of working with files (iostream), you can use istream::getline. It preserves spaces, but it consumes the "\n". If you need to keep even the "\n" (which can sometimes be '\r' or '\r\n'), you can use istream::get.
Maybe you should look at Boost Tokenizer. It can break up a string into a series of tokens and iterate through them. The following sample breaks up a phrase into words:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>

int main()
{
    std::string s = "Hi, Guys! There is more...";
    boost::tokenizer<> tok(s);
    for(boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        std::cout << *beg << "\n";
    }
    return 0;
}
But in your case you need to provide a TokenizerFunc that will break up a string at alphanumeric/non-alphanumeric boundaries.
For more information see Boost Tokenizer documentation and implementation of an already provided char_separator, offset_separator and escaped_list_separator.
Code usually gets messy when the problem has not been broken down into clear functions and classes. If you do that, you end up with a few functions that each do precisely one thing (not messy), and your main function just calls these simple functions. If the function names are well chosen, the main function becomes short and clear, too.
In this case, your main function needs to:
Loop: read every line of the file.
On every line, check if and where a "special" word occurs.
If a special word occurs, replace it.
Extra hint: a line of text can be stored as a std::string and read with std::getline(std::cin, line).
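One way to keep this tidy with only the standard library is to walk the text once and hand each alphanumeric run to a small function, leaving every other character untouched. The sketch below is an illustration under assumptions: transform_words and reverse_words are hypothetical names, and reversing each word stands in for whatever same-length change is actually needed.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: calls change(first, last) on every maximal run of
// alphanumeric characters in text. Spaces, newlines and punctuation are
// never touched, so they stay in exactly the same positions.
template <typename F>
void transform_words(std::string& text, F change) {
    std::string::size_type i = 0;
    while (i < text.size()) {
        if (!std::isalnum(static_cast<unsigned char>(text[i]))) {
            ++i;                       // not part of a word: leave as is
            continue;
        }
        std::string::size_type j = i;  // find the end of this word
        while (j < text.size() &&
               std::isalnum(static_cast<unsigned char>(text[j])))
            ++j;
        change(text.begin() + i, text.begin() + j);
        i = j;
    }
}

// Example same-length change (an assumption): reverse every word in place.
std::string reverse_words(std::string text) {
    transform_words(text, [](std::string::iterator first,
                             std::string::iterator last) {
        std::reverse(first, last);
    });
    return text;
}
```

Because the change operates in place on an iterator range, the length of each word cannot change, which matches the constraint in the question. The whole file can be read into one std::string first, so no newline handling is needed in the loop.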

C++ new line not translating

First off, I'm a complete beginner at C++.
I'm coding something using an API, and would like to pass text containing new lines to it, and have it print out the new lines at the other end.
If I hardcode whatever I want it to print out, like so
printInApp("Hello\nWorld");
it does come out as separate lines at the other end. But if I retrieve the text from the app using a method that returns a const char* and pass it straight to printInApp (which takes a const char* argument), it comes out as a single line.
Why's this and how would I go about to fix it?
It is the compiler that processes escape codes in string literals, not the runtime functions. This is why you can, for example, write "char c = '\n';": the compiler simply compiles it as "char c = 10".
If your string contains escape codes as separate characters, such as a '\' followed by an 'n' (e.g. read as such from a file), you will need to write (or reuse) a string function that finds the escape codes and converts them to the intended values, e.g. converting a '\' followed by an 'n' into a newline (ASCII value 10).
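A minimal sketch of such a conversion function is shown below. The name unescape is hypothetical, and it only handles two escape codes as an assumption: a backslash followed by 'n' becomes a newline, and a doubled backslash becomes a single backslash. Anything else is copied through unchanged.

```cpp
#include <string>

// Hypothetical helper: converts the two-character sequence backslash + 'n'
// into a real newline, and a doubled backslash into a single one.
std::string unescape(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            char next = in[i + 1];
            if (next == 'n') {         // "\n" written as two characters
                out += '\n';
                ++i;
                continue;
            }
            if (next == '\\') {        // escaped backslash
                out += '\\';
                ++i;
                continue;
            }
        }
        out += in[i];                  // ordinary character
    }
    return out;
}
```

The text retrieved from the app could then be passed through it first, for example printInApp(unescape(text).c_str()), assuming printInApp copies or uses the string before the temporary is destroyed.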