How to implement diacritics in C++?

I need help reading words from a .txt file that also contains diacritics (words containing ěščř and so on; these are Czech diacritics, if that helps).
My function finds the words I type, but not words typed in the console that contain diacritics.
I think I have to change a setting in Microsoft Visual C++ 2010, but I'm not sure what or where. In case I'm wrong, here is the function.
bool find(char typedword[50])
{
    bool found = false;
    string word; // grows as needed; the original new char[50] was never freed and could overflow
    fstream dictionary;
    dictionary.open("Dictionary.txt", ios::in);
    while (dictionary >> word)
    {
        if (strcmp(typedword, word.c_str()) == 0)
        {
            found = true;
            break;
        }
    }
    dictionary.close();
    return found;
}
Thank you for all your help!

You need locale support, so that a sequence of combining characters and its precomposed equivalent compare equal.
The portable way is to call setlocale and then use strcoll instead of strcmp.
The Windows way is to use CompareStringEx (which automatically uses OS locale settings) instead of strcmp. NormalizeString may also be helpful.
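As a rough illustration of the portable route, here is a minimal sketch of the lookup rewritten around strcoll. The names same_word and find_word are hypothetical, and whether canonically equivalent sequences really collate as equal depends on the platform's locale data, so treat this as a starting point rather than a guaranteed fix.

```cpp
#include <clocale>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical helper: true when the two words collate as equal
// under the currently selected LC_COLLATE locale.
bool same_word(const std::string& a, const std::string& b)
{
    return std::strcoll(a.c_str(), b.c_str()) == 0;
}

// Hypothetical variant of find() from the question: scans a
// whitespace-separated word file and compares with strcoll.
bool find_word(const std::string& typed, const char* dictionary_path)
{
    std::ifstream dictionary(dictionary_path);
    std::string word;
    while (dictionary >> word) {
        if (same_word(typed, word))
            return true;
    }
    return false; // not found, or the file could not be opened
}
```

Call std::setlocale(LC_ALL, "") once at program start so that strcoll uses the user's locale; without that call the program stays in the "C" locale, where strcoll behaves like strcmp.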

Related

Reading from FileStream with arbitrary delimiter

I have run into a problem reading messages from a file in C++. The usual approach is to create a file stream and call getline() to fetch each message. getline() accepts an additional delimiter parameter, so that it returns each "line" separated by that delimiter instead of the default '\n'. However, the delimiter has to be a single char. In my use case the delimiter in the messages may be something like "|--|", so I am looking for a solution that accepts a string as the delimiter instead of a char.
I searched Stack Overflow a little and found some interesting posts.
Parse (split) a string in C++ using string delimiter (standard C++)
This one gives a solution that uses string::find() and string::substr() to parse with an arbitrary delimiter. However, all the solutions there assume the input is a string rather than a stream. In my case the file is too big to fit into memory at once, so it should be read message by message (or a batch of messages at a time).
Reading through the gcc implementation of std::getline(), it seems much easier to handle a delimiter that is a single char: every time you load a chunk of characters, you can search it for the delimiter and split there. It is different when the delimiter is more than one char, because the delimiter itself may straddle two chunks, which causes many other corner cases.
I am not sure whether anyone else has faced this requirement before and how you handled it elegantly. It seems it would be nice to have a standard function like istream& getNext (istream&& is, string& str, string delim). This looks like a general use case to me. Why is this not in the standard library, so that people no longer need to implement their own version?
Thank you very much
The STL simply does not natively support what you are asking for. You will have to write your own function (or find a 3rd party function) that does what you need.
For instance, you can use std::getline() to read up to the first character of your delimiter, and then use std::istream::get() to read subsequent characters and compare them to the rest of your delimiter. For example:
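#include <istream>
#include <stdexcept>
#include <string>
std::istream& my_getline(std::istream &input, std::string &str, const std::string &delim)
{
    if (delim.empty())
        throw std::invalid_argument("delim cannot be empty!");
    if (delim.size() == 1)
        return std::getline(input, str, delim[0]);
    str.clear();
    std::string temp;
    char ch;
    bool found = false;
    do
    {
        // read up to (and discard) the first character of the delimiter
        if (!std::getline(input, temp, delim[0]))
        {
            // nothing left to read, but keep text collected on earlier passes
            if (!str.empty() && input.eof())
                input.clear(std::ios_base::eofbit);
            break;
        }
        str += temp;
        if (input.eof())
            break; // hit the end of the stream without seeing delim[0]
        found = true;
        // then try to match the rest of the delimiter one character at a time
        for (std::size_t i = 1; i < delim.size(); ++i)
        {
            if (!input.get(ch))
            {
                // stream ended inside a partial delimiter: keep what was read
                if (input.eof())
                    input.clear(std::ios_base::eofbit);
                str.append(delim.c_str(), i);
                return input;
            }
            if (delim[i] != ch)
            {
                // not the delimiter after all: put the matched part back into the result
                // (note: ch is not re-examined as a possible start of the delimiter)
                str.append(delim.c_str(), i);
                str += ch;
                found = false;
                break;
            }
        }
    }
    while (!found);
    return input;
}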
If you are OK with reading byte by byte, you could build a state transition table (a finite state machine) to recognize your stop condition:
std::string delimeter="someString";
//initialize table with a row per target string character, a column per possible char and all zeros
std::vector<std::vector<int> > table(delimeter.size(),std::vector<int>(256,0));
int endState=delimeter.size();
//set the entry for the state looking for the next letter and finding that character to the next state
for(unsigned int i=0;i<delimeter.size();i++){
    table[i][(unsigned char)delimeter[i]]=i+1;
}
Now you can use it like this:
int currentState=0;
int read=0;
bool done=false;
//'in' is your std::istream; get() returns a value in 0..255, or a negative EOF
while(!done&&(read=in.get())>=0){
    currentState=table[currentState][read];
    if(currentState==endState){
        done=true;
    }
    //do your streamy stuff
}
Granted, this only works if the delimiter is in extended ASCII. Note also that the table sends every mismatch back to state 0 without re-examining the current character, so a delimiter that directly follows a partial match of itself (for example "||--|" with the delimiter "|--|") is missed.
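A table that resets to state 0 on every mismatch can skip a delimiter that begins inside a failed partial match. One way around that, swapped in here as a different technique from the answer's table, is the Knuth-Morris-Pratt failure function, which tells the matcher how much of the delimiter is still matched after a mismatch. The sketch below is only an illustration; read_until and its parameters are hypothetical names, not part of any library.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads characters from `in` into `out` until `delim`
// has been seen. The delimiter is consumed but not stored in `out`.
// Returns true if the delimiter was found, false if the stream ended first.
bool read_until(std::istream& in, std::string& out, const std::string& delim)
{
    out.clear();
    if (delim.empty())
        return false;

    // fail[i] = length of the longest proper prefix of delim[0..i]
    // that is also a suffix of it (the KMP failure function).
    std::vector<std::size_t> fail(delim.size(), 0);
    for (std::size_t i = 1, k = 0; i < delim.size(); ++i) {
        while (k > 0 && delim[i] != delim[k])
            k = fail[k - 1];
        if (delim[i] == delim[k])
            ++k;
        fail[i] = k;
    }

    std::size_t state = 0; // number of delimiter characters currently matched
    char ch;
    while (in.get(ch)) {
        out += ch;
        while (state > 0 && ch != delim[state])
            state = fail[state - 1]; // fall back instead of restarting at 0
        if (ch == delim[state])
            ++state;
        if (state == delim.size()) {
            out.erase(out.size() - delim.size()); // drop the delimiter itself
            return true;
        }
    }
    return false; // stream ended; `out` holds whatever was left
}
```

Because it reads one character at a time, the delimiter can never straddle a chunk boundary; the cost is a get() call per character, which the stream's own buffering usually keeps cheap. The include of <sstream> is only there so the function can be tried on an in-memory std::istringstream.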
It seems easiest to create something like getline(): read up to the last character of the separator, then check whether the string is long enough for the separator and, if so, whether it ends with the separator. If it does not, carry on reading:
std::istream& getline(std::istream& in, std::string& value, std::string const& separator) {
    value.clear();
    std::istreambuf_iterator<char> it(in), end;
    if (separator.empty()) { // empty separator -> read the entire stream
        value.assign(it, end);
        return in;
    }
    char last(separator.back());
    while (it != end) {
        value.push_back(*it);
        ++it; // consume the character before a possible early return
        if (value.back() == last
            && separator.size() <= value.size()
            && value.compare(value.size() - separator.size(), separator.size(), separator) == 0) {
            value.resize(value.size() - separator.size()); // strip the separator
            return in;
        }
    }
    // no separator was found: report end of stream, and failure if nothing was read
    in.setstate(value.empty() ? std::ios_base::eofbit | std::ios_base::failbit
                              : std::ios_base::eofbit);
    return in;
}

Error comparing french characters in a c++ string

I was wondering if any of you could help me with a problem I'm having. I have a function that takes a C-style string, copies it into a temporary C++ string, and uses find_first_not_of to look for invalid characters. The set of valid characters includes French characters like 'à'. However, when I pass in a string containing French characters, it doesn't recognize them as valid.
I am using Visual Studio 2013 on Windows 8, and a few people have told me that the issue is that VS encodes its source files differently than it encodes input from the command prompt, but I do not know how to fix that. Do any of you know how I would go about doing this? Or is it a different problem with my code entirely?
My code for the function is as follows:
bool checkValidCharacters(const char* input)
{
    std::string checkString(input);
    bool validCharacters = false;
    std::size_t found = checkString.find_first_not_of("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZàâäèêëîôùûüÿçÀÂÄÈÉÊÎÏÔÙÛÜŸÇ-. ");
    if (found != std::string::npos)
    {
        printf("Error: Invalid character: %c", input[found]);
    }
    else
    {
        printf("All characters valid\n");
        validCharacters = true;
    }
    return validCharacters;
}
Thanks a bunch.
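If the cause is indeed an encoding mismatch, the bytes that represent 'à' in the source file differ from the bytes the console delivers, so a byte-wise find_first_not_of cannot match them. One possible way around that is to compare code points instead of bytes. The sketch below assumes the input has already been converted to a wide string (on Windows that might be done with MultiByteToWideChar and the console code page, which is an assumption about the setup and not shown here). checkValidCharactersWide is a hypothetical name, and the allowed set is copied from the question and written with universal character names so it does not depend on the source file's encoding.

```cpp
#include <string>

// Hypothetical wide-character variant of checkValidCharacters.
// Returns true when every character of `input` is in the allowed set.
bool checkValidCharactersWide(const std::wstring& input)
{
    static const std::wstring allowed =
        L"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
        // lowercase accented letters from the question: àâäèêëîôùûüÿç
        L"\u00E0\u00E2\u00E4\u00E8\u00EA\u00EB\u00EE\u00F4\u00F9\u00FB\u00FC\u00FF\u00E7"
        // uppercase accented letters from the question: ÀÂÄÈÉÊÎÏÔÙÛÜŸÇ
        L"\u00C0\u00C2\u00C4\u00C8\u00C9\u00CA\u00CE\u00CF\u00D4\u00D9\u00DB\u00DC\u0178\u00C7"
        L"-. ";
    return input.find_first_not_of(allowed) == std::wstring::npos;
}
```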

C++ Escape Phrase Substring

I'm trying to parse web data coming from a server, and I'm looking for a more STL-style version of what I had.
My old code consisted of a for() loop that checked each character of the string against a set of escape characters and used a stringstream to collect the rest. As I'm sure you can imagine, this sort of loop is a likely point of failure when reading web data, as I need strict syntax checking.
I'm trying instead to use the string::find and string::substr functions, but I'm unsure of the best way to do it.
Basically, I want to read a string of data from a server, made up of different fields separated by commas (e.g. first,lastname,email#email.com), split it at the commas, and read the data in between.
Can anyone offer any advice?
I'm not sure what kind of data you are parsing, but it's always a good idea to use a multi-layer architecture. Each layer should implement an abstract function, and each layer should do only one job (like escaping characters).
The number of layers you use depends on the actual steps needed to decode the stream.
For your problem I suggest the following layers:
1st: tokenize by ',' and '\n': convert the input into some kind of vector of strings
2nd: resolve escapes: decode escape characters
You should use std::stringstream and process the characters with a loop. Unless your format is REALLY simple (like only a single separator character, without escapes), you can't really use any standard function.
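For the really simple case the answer mentions (a single separator character and no escapes), the first layer can be written with string::find and string::substr as the question asked. This is only a sketch under that assumption; split_fields is a hypothetical name, and escape handling would be a separate second layer applied to each returned field.

```cpp
#include <string>
#include <vector>

// Hypothetical first layer: split `line` at every `sep`.
// Empty fields are kept, so "a,,b" yields three fields.
std::vector<std::string> split_fields(const std::string& line, char sep = ',')
{
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = line.find(sep, start);
        if (pos == std::string::npos) {
            fields.push_back(line.substr(start)); // last (or only) field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1; // continue after the separator
    }
    return fields;
}
```

Keeping empty fields makes it easy to enforce strict syntax afterwards, for example by checking that the number of fields is exactly what the server is supposed to send.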
For the learning experience, this is the code I ended up using to parse data into a map. You can check web_parse_return.err to see whether an error was hit, or use it for specific error codes.
struct web_parse_return {
    map<int,string> parsedata;
    int err;
};
// The caller owns the returned struct and must delete it.
web_parse_return* parsewebstring(char* escapechar, char* input, int tokenminimum) {
    int err = 0;
    map<int,string> datamap;
    // compare contents with strcmp; input == "MISSING_INFO" would only compare pointers
    if(strcmp(input, "MISSING_INFO") == 0) { //a server-side string for data left out in the call
        err++;
    }
    else {
        char* nTOKEN;
        char* TOKEN = strtok_s(input, escapechar,&nTOKEN);
        if(TOKEN != 0) { //if the escape character is found
            int tokencount = 0;
            while(TOKEN != 0) {//since it finds the next occurrence, keep going
                datamap.insert(pair<int,string>(tokencount,TOKEN));
                TOKEN = strtok_s(NULL, escapechar,&nTOKEN);
                tokencount++;
            }
            if(tokencount < tokenminimum) //check that the right number was hit
                err++; //otherwise, up the error count
        }
        else {
            err++;
        }
    }
    web_parse_return* p = new web_parse_return; //initializing a new struct
    p->err = err;
    p->parsedata = datamap;
    return p;
}

C++: how to judge if the path of the file start with a given path

I have a path, for example
/my/path/test/mytestpath
and I want to check whether it starts with a given path, for example
/my/path
The C++17 filesystem library is probably the most robust solution. If C++17 is not available to you, Boost.Filesystem provides an implementation for earlier C++ versions. Try something like:
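// assumes #include <filesystem> and using std::filesystem::path (or boost::filesystem::path)
bool isSubDir(path p, const path& root)
{
    while (true) {
        if (p == root) {
            return true;
        }
        path parent = p.parent_path();
        // stop when no components are left; in std::filesystem "/" is its own parent
        if (parent.empty() || parent == p) {
            return false;
        }
        p = parent;
    }
}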
Take a substring of the original (/my/path/test/mytestpath) from the beginning, with the length of the given string (/my/path).
Check whether the two strings are equal.
You can do a string compare over the number of characters in the shorter string.
The fact that the characters match does not by itself mean it is a sub-path: you also need to check that the next character in the longer string is a '/'.
In C you can use strncmp(), which takes a number of characters to compare.
In C++ you can use the same function or the string compare functions. The find() function will work for this, but remember to also check that the next character in the main path is a directory separator.
You could "tokenize" your path, but that is unlikely to be worth it.
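The prefix comparison plus separator check described above can be sketched as follows. starts_with_path is a hypothetical name, and the sketch assumes '/' is the only directory separator and that both paths are already normalized (no "..", no doubled slashes).

```cpp
#include <string>

// Hypothetical helper: true when `path` equals `prefix` or lies below it.
bool starts_with_path(const std::string& path, const std::string& prefix)
{
    // The first prefix.size() characters must match exactly.
    if (prefix.empty() || path.compare(0, prefix.size(), prefix) != 0)
        return false;
    // Identical paths count as a match.
    if (path.size() == prefix.size())
        return true;
    // Otherwise the match must end on a directory boundary, so that
    // "/my/pathology" is not treated as being under "/my/path".
    return prefix[prefix.size() - 1] == '/' || path[prefix.size()] == '/';
}
```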
std::string::find() returns the index at which a string was found, with an index of 0 being the start of the string:
std::string path("/my/path/test/mytestpath");
// This will check if 'path' begins with "/my/path/".
// The trailing '/' keeps paths such as "/my/pathology" from matching.
if (0 == path.find("/my/path/"))
{
    // 'path' starts with "/my/path/".
}

C++: check whether a word is spelled correctly

I'm looking for an easy way to check whether a certain string is a correctly-spelled English word. For example, 'looked' would return True while 'hurrr' would return False. I don't need spelling suggestions or any spelling-correcting features. Just a simple function that takes a string and returns a boolean value.
I could do this easily with Python using PyEnchant, but it seems you have to compile the library yourself if you want to use it with MS Visual C++.
PyEnchant is based on Enchant, which is a C library providing C and C++ interfaces. So you can just use that for C++. The minimal example will be something like this:
#include <memory>
#include <cstdio>
#include "enchant.h"
#include "enchant++.h"
int main ()
{
    try
    {
        enchant::Broker *broker = enchant::Broker::instance ();
        std::auto_ptr<enchant::Dict> dict (broker->request_dict ("en_US"));
        const char *check_checks[] = { "hello", "helllo" };
        for (int i = 0; i < (sizeof (check_checks) / sizeof (check_checks[0])); ++i)
        {
            printf ("enchant_dict_check (%s): %d\n", check_checks[i],
                    dict->check (check_checks[i]) == false);
        }
    } catch (const enchant::Exception &) {
        return 1;
    }
}
For more examples/tests, see their SVN repository.
If you want to implement such function on your own, you'll need a database to query in order to find out whether a given word is valid (usually a plain text file is enough, like /usr/share/dict/words on Linux).
Otherwise you could rely upon a third party spellcheck library that does just that.
You could take one of the GNU dictionaries out there (like /usr/share/dict/words as mentioned) and build it into an appropriate data structure that'll give you fast lookup and membership checking depending on your performance needs, something like a directed acyclic word graph or even just a trie might be sufficient.
You'd need a word list, for starters. (/usr/share/dict/words maybe?)
You should read your word list into a std::set. Then a correct-spelling test consists simply of checking all the user input words for whether or not they are in the set.
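The set-based lookup described above can be sketched like this. load_words and is_spelled_correctly are hypothetical names, and the sketch assumes a word list with one word per line (or any whitespace-separated list) and an exact, case-sensitive match.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: reads whitespace-separated words from `in` into a set.
std::set<std::string> load_words(std::istream& in)
{
    std::set<std::string> words;
    std::string word;
    while (in >> word)
        words.insert(word);
    return words;
}

// Hypothetical helper: a word counts as correctly spelled if it is in the set.
bool is_spelled_correctly(const std::set<std::string>& words, const std::string& word)
{
    return words.count(word) != 0;
}
```

In a real program the stream would typically be a std::ifstream opened on the word list; <sstream> is included only so the functions can be tried on an in-memory std::istringstream. Lowercasing both the list and the query before the lookup is one way to make the check case-insensitive.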
bool spell_check(std::string const& str)
{
    std::cout << "Is '" << str << "' spelled correctly? ";
    std::string input;
    std::getline(std::cin, input); // getline needs the stream to read from
    return !input.empty() && (input[0] == 'y' || input[0] == 'Y');
}