C++ std::string capitalize in non-latin language (without third-party libraries) - c++

Considering the method:
void Capitalize(std::string &s)
{
bool shouldCapitalize = true;
for(size_t i = 0; i < s.size(); i++)
{
if (iswalpha(s[i]) && shouldCapitalize == true)
{
s[i] = (char)towupper(s[i]);
shouldCapitalize = false;
}
else if (iswspace(s[i]))
{
shouldCapitalize = true;
}
}
}
It works perfectly for ASCII characters, e.g.
"steve" -> "Steve"
However, once I use non-Latin characters, e.g. the Cyrillic alphabet, I don't get that result:
"стив" -> "стив"
What is the reason why that method fails for non-latin alphabets? I've tried using methods such as isalpha as well as iswalpha but I'm getting exactly the same result.
What would be a way to modify this method to capitalize non-latin alphabets?
Note: I'd prefer to solve this without a third-party library such as icu4c; with one, this would be a very simple problem to solve.
Update:
This solution doesn't work (for some reason):
void Capitalize(std::string &s)
{
bool shouldCapitalize = true;
std::locale loc("ru_RU"); // Creating a locale that supports cyrillic alphabet
for(size_t i = 0; i < s.size(); i++)
{
if (isalpha(s[i], loc) && shouldCapitalize == true)
{
s[i] = (char)toupper(s[i], loc);
shouldCapitalize = false;
}
else if (isspace(s[i], loc))
{
shouldCapitalize = true;
}
}
}

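std::locale works, at least where the requested locale is installed on the system, but you are using it incorrectly. Assuming your std::string holds UTF-8, each Cyrillic letter takes two bytes, so classifying and converting one char at a time never sees a whole letter. Use wide characters instead.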
This code works as expected on Ubuntu with Russian locale installed:
#include <iostream>
#include <locale>
#include <string>
#include <codecvt>
void Capitalize(std::wstring &s)
{
bool shouldCapitalize = true;
std::locale loc("ru_RU.UTF-8"); // Creating a locale that supports cyrillic alphabet
for(size_t i = 0; i < s.size(); i++)
{
if (isalpha(s[i], loc) && shouldCapitalize == true)
{
s[i] = toupper(s[i], loc);
shouldCapitalize = false;
}
else if (isspace(s[i], loc))
{
shouldCapitalize = true;
}
}
}
int main()
{
std::wstring in = L"это пример текста";
Capitalize(in);
std::wstring_convert<std::codecvt_utf8<wchar_t>> conv1;
std::string out = conv1.to_bytes(in);
std::cout << out << "\n";
return 0;
}
It's possible that on Windows you need to use a different locale name; I'm not sure.
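To see why the byte-wise version from the question cannot work, here is a small sketch that assumes the std::string holds UTF-8 (typical on Linux, not guaranteed elsewhere). The helper name count_non_ascii_bytes and the variable steve_utf8 are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Counts bytes with the high bit set, i.e. bytes that are part of a
// multi-byte UTF-8 sequence rather than a plain ASCII character.
std::size_t count_non_ascii_bytes(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if (byte >= 0x80) ++n;
    }
    return n;
}

// "стив" spelled out as explicit UTF-8 bytes, so the example does not
// depend on the encoding of the source file.
const std::string steve_utf8 = "\xD1\x81\xD1\x82\xD0\xB8\xD0\xB2";
```

The four Cyrillic letters occupy eight bytes, and none of those bytes is a letter by itself, so a loop over s[i] has nothing it could upper-case. With std::wstring each element is one whole letter, which is why the code above works.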

Well, an external library would be the only practical choice, IMHO. The standard functions work well with Latin, but any other locale would be a pain, and I wouldn't bother. Still, if you want support for Latin and Cyrillic without an external library, you can just write it yourself:
wchar_t to_upper(wchar_t c) {
// Latin
if (c >= L'a' && c <= L'z') return c - L'a' + L'A';
// Cyrillic
if (c >= L'а' && c <= L'я') return c - L'а' + L'А';
return towupper(c);
}
Still, it's important to note that this way you have to painstakingly implement support for every alphabet yourself, and not even all Latin characters are covered (accented letters, for example), so an external library is the best solution. Consider the given solution only if you're sure that English and Russian are the only languages that will be used.
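For completeness, here is a rough sketch of how such a hand-written mapping could be plugged into the capitalizing loop from the question. The names to_upper_basic and capitalize_words are invented for this example, and it only handles a-z and the basic Cyrillic range а-я (so not ё, and no accented Latin letters). Unlike the original loop, it treats the first non-space character of each word as the one to capitalize.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: upper-cases only a-z and U+0430..U+044F (а..я).
wchar_t to_upper_basic(wchar_t c) {
    if (c >= L'a' && c <= L'z') {
        return static_cast<wchar_t>(c - L'a' + L'A');
    }
    // U+0430..U+044F (а..я) sit exactly 0x20 above U+0410..U+042F (А..Я).
    if (c >= L'\u0430' && c <= L'\u044F') {
        return static_cast<wchar_t>(c - L'\u0430' + L'\u0410');
    }
    return c;  // anything else is left unchanged
}

// Capitalizes the first character of every whitespace-separated word.
void capitalize_words(std::wstring& s) {
    bool at_word_start = true;
    for (wchar_t& c : s) {
        if (std::iswspace(static_cast<wint_t>(c))) {
            at_word_start = true;
        } else if (at_word_start) {
            c = to_upper_basic(c);
            at_word_start = false;
        }
    }
}
```

Using \u escapes instead of literal Cyrillic characters keeps the code independent of the source file encoding.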

Related

Is this good enough to check an ascii string?

bool is_ascii(const string &word) {
if((unsigned char)(*word.c_str()) < 128){
return true;
}
return false;
}
I want to check whether a string is an ASCII string. I also saw this function for detecting whether a string consists of ASCII characters:
bool is_ascii(const string &str){
std::locale loc;
for(size_t i = 0; i < str.size(); i++)
if( !isalpha(str[i], loc) && !isspace(str[i], loc))
return false;
return true;
}
Which one is better or more reliable?
Other answers get the is-char-ASCII part already. I’m assuming it’s right. Putting it together I’d recommend:
#include <algorithm>
#include <string_view>
bool is_ascii_char(unsigned char c) {
return (c & 0x80) == 0;
}
bool is_ascii(std::string_view s) {
return std::ranges::all_of(s, is_ascii_char);
}
https://godbolt.org/z/nKb673vaM
Or before C++20, that could be return std::all_of(s.begin(), s.end(), is_ascii_char);.
ASCII is a lot more than just alpha characters and spaces. If you want to accept all ASCII, just use your second example and change the if:
if(str[i] < 0 || str[i] > 0x7f)
return false;
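Put together, that suggestion could look like the sketch below. The function name is_ascii_bytes is chosen here only to avoid clashing with the versions above. Iterating as unsigned char sidesteps the question of whether plain char is signed, so a single comparison is enough.

```cpp
#include <string>

// Returns true if every byte of the string is in the 7-bit ASCII range.
bool is_ascii_bytes(const std::string& str) {
    for (unsigned char c : str) {
        if (c > 0x7f) {
            return false;
        }
    }
    return true;  // also true for the empty string
}
```

Note that the first function in the question only inspects the first byte, and the second one rejects digits and punctuation, so neither checks for ASCII in general.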

Adding the next character on a string C++

I managed to replace each character of a string with the next one in the alphabet (hello --> ifmmp), but for an input like hello* I want the * to be kept as-is. The same goes for digits or anything else that is not in the alphabet.
This is my code. Where should the else if go?
There is another option, but it does not seem very efficient to me: adding this inside the first for loop:
string other="123456789!##$%^&*()";
for(int z=0;z<other.length();z++)
{
if(str[i]==other[z])
str2+=other[z];
}
This is the main code:
int main()
{
string str = "hello*";
string str2="";
string alphabet = "abcdefghijklmnopqrstuvwxyz";
for(int i=0;i<str.length();i++)
{
for(int j=0;j<alphabet.length();j++)
{
if(str[i]==alphabet[j])
{
str2+=alphabet[j+1];
}
}
}
cout<<str2<<endl;
return 0;
}
I like functions. They solve a lot of problems. For example, if you take the code you already have, paste it into a function, and give it a little tweak:
char findreplacement(char ch, const std::string & alphabet)
{
for (int j = 0; j < alphabet.length(); j++)
{
if (ch == alphabet[j])
{
return alphabet[(j+1) % alphabet.length()];
// return the replacement character
// using modulo, %, to handle wrap around z->a
}
}
return ch; // found no replacement. Return original character.
}
you can call the function
for (int i = 0; i < str.length(); i++)
{
str2 += findreplacement(str[i], alphabet);
}
to build str2. Consider using a range-based for here:
for (char ch: str)
{
str2 += findreplacement(ch, alphabet);
}
It's cleaner and a lot harder to screw up.
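As a variation on the same idea, here is a sketch that lets std::string::find do the inner search. The names next_in_alphabet and shift_letters are made up for this example; the behavior is meant to match findreplacement above.

```cpp
#include <cstddef>
#include <string>

// Returns the character that follows ch in alphabet (wrapping z -> a),
// or ch itself if it is not part of the alphabet.
char next_in_alphabet(char ch, const std::string& alphabet) {
    const std::size_t pos = alphabet.find(ch);
    if (pos == std::string::npos) {
        return ch;  // '*', digits, etc. pass through unchanged
    }
    return alphabet[(pos + 1) % alphabet.size()];
}

// Applies next_in_alphabet to every character of the input.
std::string shift_letters(const std::string& input) {
    static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::string result;
    result.reserve(input.size());
    for (char ch : input) {
        result += next_in_alphabet(ch, alphabet);
    }
    return result;
}
```

Calling shift_letters("hello*") gives "ifmmp*", with the asterisk kept as-is.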
There is a function isalpha in the standard library which is very useful for classification.
You could do something like this.
(This kind of exercise usually assumes the ASCII encoding of the English alphabet, and this is a very ASCII-specific solution. If you want a different alphabet or a different character encoding, you need to handle that yourself.)
#include <cctype>
#include <string>
#include <iostream>
int main()
{
std::string str = "Hello*Zzz?";
std::string str2;
for (char c: str)
{
if (std::isalpha(static_cast<unsigned char>(c)))
{
c += 1;
if (!std::isalpha(static_cast<unsigned char>(c)))
{
// Went too far; wrap around to 'a' or 'A'.
c -= 26;
}
}
str2 += c;
}
std::cout << str2 << std::endl;
}
Output:
Ifmmp*Aaa?

Retrieve each token from a file according to specific criteria

I'm trying to create a lexer for a functional language, one of the methods of which should allow, on each call, to return the next token of a file.
For example :
func main() {
var MyVar : integer = 3+2;
}
So I would like every time the next method is called, the next token in that sequence is returned; in that case, it would look like this :
func
main
(
)
{
var
MyVar
:
integer
=
3
+
2
;
}
Except that the result I get is not what I expected:
func
main(
)
{
var
MyVar
:
integer
=
3+
2
}
Here is my method:
token_t Lexer::next() {
token_t ret;
std::string token_tmp;
bool IsSimpleQuote = false; // check char --> '...'
bool IsDoubleQuote = false; // check string --> "..."
bool IsComment = false; // check comments --> `...`
bool IterWhile = true;
while (IterWhile) {
bool IsInStc = (IsDoubleQuote || IsSimpleQuote || IsComment);
std::ifstream file_tmp(this->CurrentFilename);
if (this->eof) break;
char chr = this->File.get();
char next = file_tmp.seekg(this->CurrentCharIndex + 1).get();
++this->CurrentCharInCurrentLineIndex;
++this->CurrentCharIndex;
{
if (!IsInStc && !IsComment && chr == '`') IsComment = true; else if (!IsInStc && IsComment && chr == '`') { IsComment = false; continue; }
if (IsComment) continue;
if (!IsInStc && chr == '"') IsDoubleQuote = true;
else if (!IsInStc && chr == '\'') IsSimpleQuote = true;
else if (IsDoubleQuote && chr == '"') IsDoubleQuote = false;
else if (IsSimpleQuote && chr == '\'') IsSimpleQuote = false;
}
if (chr == '\n') {
++this->CurrentLineIndex;
this->CurrentCharInCurrentLineIndex = -1;
}
token_tmp += chr;
if (!IsInStc && IsLangDelim(chr)) IterWhile = false;
}
if (token_tmp.size() > 1 && System::Text::EndsWith(token_tmp, ";") || System::Text::EndsWith(token_tmp, " ")) token_tmp.pop_back();
++this->NbrOfTokens;
location_t pos;
pos.char_pos = this->CurrentCharInCurrentLineIndex;
pos.filename = this->CurrentFilename;
pos.line = this->CurrentLineIndex;
SetToken_t(&ret, token_tmp, TokenList::ToToken(token_tmp), pos);
return ret;
}
Here is the function IsLangDelim :
bool IsLangDelim(char chr) {
return (chr == ' ' || chr == '\t' || TokenList::IsSymbol(CharToString(chr)));
}
TokenList is a namespace that contains the list of tokens, as well as some functions (like IsSymbol in this case).
I have already tried other versions of this method, but the result is almost always the same.
Do you have any idea how to improve this method?
One solution to your problem is std::regex. The syntax is a little difficult at first, but once you understand it you will use it all the time.
It is also well suited to finding tokens.
The specific criteria can be expressed in the regex string.
For your case I will use: std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
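This means:
Look for one or more word characters, \w+ (that is a word; note that \w also matches digits and the underscore)
Look for one or more digits, \d+ (that is an integer number; strictly speaking this alternative is redundant, because \w+ already matches digits)
Or look for any one of the meaningful operators (like '+', '-', '{' and so on)
You can extend the regex for all the other things you are searching for. You can also apply a regex to a regex result.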
Please see example below. That will create your shown output from your provided input.
With it, the task you describe comes down to a single statement in main.
#include <iostream>
#include <string>
#include <algorithm>
#include <regex>
#include <iterator>
// Our test data (raw string) .
std::string testData(
R"#(func main() {
var MyVar : integer = 3+2;
}
)#");
std::regex re(R"#((\w+|\d+|[;:\(\)\{\}\+\-\*\/\%\=]))#");
int main(void)
{
std::copy(
std::sregex_token_iterator(testData.begin(), testData.end(), re, 1),
std::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n")
);
return 0;
}
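If you would rather get the tokens back as a container than print them, the same idea can be wrapped in a function. This is only a sketch: the name tokenize is made up here, and the pattern is a slightly simplified form of the one above (no capture group, no \d+ alternative).

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits the source text into words/numbers and single-character symbols.
std::vector<std::string> tokenize(const std::string& source) {
    // \w+ covers identifiers and integers; the class lists the symbols.
    static const std::regex re(R"#(\w+|[;:(){}+\-*/%=])#");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(source.begin(), source.end(), re), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Characters that match nothing in the pattern, such as whitespace, are simply skipped, so a real lexer would still need its own error handling for unexpected input.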
You are trying to parse everything in a single loop, which makes the code very complicated. Instead I suggest something like this:
struct token { ... };
struct lexer {
vector<token> tokens;
string source;
unsigned int pos;
bool parse_ident() {
if (!is_alpha(source[pos])) return false;
auto start = pos;
while(pos < source.size() && is_alnum(source[pos])) ++pos;
tokens.push_back({ token_type::ident, source.substr(start, pos - start) });
return true;
}
bool parse_num() { ... }
bool parse_comment() { ... }
...
bool parse_whitespace() { ... }
void parse() {
while(pos < source.size()) {
// try each token kind in turn; exactly one should consume input
if (!parse_comment() && !parse_ident() && !parse_num() && ... && !parse_whitespace()) {
throw error{ "unexpected character at position " + std::to_string(pos) };
}
}
}
};
This is the standard structure I use when lexing files for the scripting languages I've written. Lexing is usually greedy, so you don't need to bother with regex (which works, but is slower unless you use some crazy template-based implementation). Just define your parse_* functions, make sure each one returns false if it did not parse a token, and make sure they are called in the correct order.
The order itself usually doesn't matter, except that:
operators need to be checked from longest to shortest
a number written as .123 might be incorrectly recognized as the . operator (so make sure that the . is not followed by a digit)
numbers and identifiers look very much alike, except that identifiers start with a non-digit
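To make the structure concrete, here is a compact runnable sketch of the same greedy approach for the language fragment in the question. It is deliberately simplified: the function name lex is made up, tokens are plain strings rather than a token struct, and only identifiers, integers and single-character symbols are handled (no comments, strings or multi-character operators).

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Greedy hand-written lexer: at each position, consume the longest run
// that belongs to one token kind.
std::vector<std::string> lex(const std::string& src) {
    const std::string symbols = ";:(){}+-*/%=";
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    // Cast to unsigned char before calling the <cctype> functions.
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(src[i]); };
    while (pos < src.size()) {
        if (std::isspace(at(pos))) { ++pos; continue; }  // skip whitespace
        const std::size_t start = pos;
        if (std::isalpha(at(pos)) || src[pos] == '_') {
            // identifier: letter or '_' followed by letters, digits, '_'
            while (pos < src.size() && (std::isalnum(at(pos)) || src[pos] == '_')) ++pos;
        } else if (std::isdigit(at(pos))) {
            // integer: a run of digits
            while (pos < src.size() && std::isdigit(at(pos))) ++pos;
        } else if (symbols.find(src[pos]) != std::string::npos) {
            ++pos;  // single-character symbol
        } else {
            throw std::runtime_error("unexpected character at position " +
                                     std::to_string(pos));
        }
        tokens.push_back(src.substr(start, pos - start));
    }
    return tokens;
}
```

Because each branch consumes as much as it can before the token is stored, main( comes out as main and ( rather than as one token.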

Splitting text into a list of words with ICU

I'm working on a text tokenizer. ICU is one of very few C++ libraries that have this feature, and probably the best maintained one, so I'd like to use it.
I've found the docs about BreakIterator, but there's one problem with it: how do I leave the punctuation out?
#include "unicode/brkiter.h"
#include <QFile>
#include <vector>
std::vector<QString> listWordBoundaries(const UnicodeString& s)
{
UErrorCode status = U_ZERO_ERROR;
BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status);
std::vector<QString> words;
bi->setText(s);
for (int32_t p = bi->first(), prevBoundary = 0; p != BreakIterator::DONE; prevBoundary = p, p = bi->next())
{
const auto word = s.tempSubStringBetween(prevBoundary, p);
char buffer [16384];
word.toUTF8(CheckedArrayByteSink(buffer, 16384));
words.emplace_back(QString::fromUtf8(buffer));
}
delete bi;
return words;
}
int main(int /*argc*/, char * /*argv*/ [])
{
QFile f("E:\\words.TXT");
f.open(QFile::ReadOnly);
QFile result("E:\\words.TXT");
result.open(QFile::WriteOnly);
const QByteArray strData = f.readAll();
for (const QString& word: listWordBoundaries(UnicodeString::fromUTF8(StringPiece(strData.data(), strData.size()))))
{
result.write(word.toUtf8());
result.write("\n");
}
return 0;
}
Naturally, the resulting file looks like this:
“
Come
outside
.
Best
if
we
do
not
wake
him
.
”
What I need is just the words. How can this be done?
The Qt library includes several useful methods for checking a character's properties; see QChar.
You could create a QString variable from the buffer and check all the properties you need before inserting it into the output vector.
For example:
auto token = QString::fromUtf8(buffer);
if (token.length() > 0 && token.data()[0].isPunct() == false) {
words.push_back(std::move(token));
}
That code looks at the first character of the string and checks whether it is a punctuation mark.
For something more robust, the check can be expressed as a function:
bool isInBlackList(const QString& str) {
const auto len = str.length();
if (len == 0) return true; // empty tokens are rejected too
for(int i = 0; i < len; ++i) {
const QChar c = str.at(i);
if (c.isPunct() || c.isSpace()) {
return true;
}
}
return false;
}
If that function returns true, the token should not be inserted into the vector.

booleans with constraints

How do I write a boolean that checks if a string has only letters, numbers and an underscore?
Assuming String supports iterators, use all_of:
using std::begin;
using std::end;
return std::all_of(begin(String), end(String),
[](char c) { return isalnum(c) || c == '_'; });
A simpler way is to loop over the string and check that every character has the property you mentioned; as soon as one does not, return false.
Code:
bool stringHasOnlyLettersNumbsandUndrscore(std::string const& str)
{
for(int i = 0; i < str.length(); ++i)
{
//Your character in the string does not fulfill the property.
if (!isalnum(str[i]) && str[i] != '_')
{
return false;
}
}
//The whole string fulfills the condition.
return true;
}
bool stringHasOnlyLettersNumbsandUndrscore(std::string const& str)
{
return ( std::all_of(str.begin(), str.end(),
[](char c) { return isalnum(c) || c == '_'; }) &&
(std::count_if(str.begin(), str.end(),
[](char c) { return (c == '_'); }) < 2));
}
Check whether each character is a letter, a number or an underscore.
For C and C++, this should do:
if(!isalnum(a[i]) && a[i]!='_')
cout<<"No";
You will have to include <cctype> (<ctype.h> in C) for this code to work.
This is just the quickest way that comes to mind; there may be other, more complex and faster ways.
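One caveat about the snippets above: passing a plain char with a negative value to isalnum is undefined behavior, which can happen with non-ASCII bytes on platforms where char is signed. A sketch that casts first is shown below; the name is_word_chars is made up for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true if the string contains only letters, digits and underscores.
bool is_word_chars(const std::string& str) {
    return std::all_of(str.begin(), str.end(), [](unsigned char c) {
        // c is already unsigned char, so the call to isalnum is well defined
        return std::isalnum(c) != 0 || c == '_';
    });
}
```

Like any std::all_of check, this returns true for an empty string, so add an explicit !str.empty() test if an empty string should not count.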