Is there a way to detect Chinese characters in C++? (using Boost) - c++

In a data processing project, I need to detect and split words in Chinese (Chinese text does not put spaces between words).
Is there a way to detect Chinese characters using a native C++ feature or the Boost.Locale library?

Generally speaking, if you want full Unicode support in C++, there is little to no way around ICU. Boost provides some access to its features (through Boost.Locale and Boost.Regex), but it requires Boost to be compiled with ICU support for this. So instead of making sure the Boost of the target platform is compiled thusly you are probably better off using the ICU API directly.
If you are looking for word boundaries, icu::BreakIterator (more specifically, icu::BreakIterator::createWordInstance) is the starting point. You then pass the text to be iterated over via setText and move the iterator via next et al. (yes, ICU is a bit non-idiomatic this way, as it originated in Java land).
Alternatively, if you don't want to go for the full C++ API, there's ublock_getCode which will tell you the UBlockCode of the code point in question.

Here is my attempt using only boost and standard library:
#include <algorithm>
#include <iostream>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
// Decodes UTF-8 bytes into UTF-32 code points on the fly.
using Iter = boost::u8_to_u32_iterator<std::string::const_iterator>;
// Predicate for the closed code point range [a, b].
template <::boost::uint32_t a, ::boost::uint32_t b>
class UnicodeRange
{
    static_assert(a <= b, "Proper range");
public:
    constexpr bool operator()(::boost::uint32_t x) const noexcept
    {
        return x >= a && x <= b;
    }
};
// Han ideograph blocks; note that these are shared with Japanese and Korean text.
using UnifiedIdeographs = UnicodeRange<0x4E00, 0x9FFF>;
using UnifiedIdeographsA = UnicodeRange<0x3400, 0x4DBF>;
using UnifiedIdeographsB = UnicodeRange<0x20000, 0x2A6DF>;
using UnifiedIdeographsC = UnicodeRange<0x2A700, 0x2B73F>;
using UnifiedIdeographsD = UnicodeRange<0x2B740, 0x2B81F>;
using UnifiedIdeographsE = UnicodeRange<0x2B820, 0x2CEAF>;
using CompatibilityIdeographs = UnicodeRange<0xF900, 0xFAFF>;
using CompatibilityIdeographsSupplement = UnicodeRange<0x2F800, 0x2FA1F>;
constexpr bool isChinese(::boost::uint32_t x) noexcept
{
    return UnifiedIdeographs{}(x)
        || UnifiedIdeographsA{}(x) || UnifiedIdeographsB{}(x) || UnifiedIdeographsC{}(x)
        || UnifiedIdeographsD{}(x) || UnifiedIdeographsE{}(x)
        || CompatibilityIdeographs{}(x) || CompatibilityIdeographsSupplement{}(x);
}
int main()
{
    std::string s;
    while (std::getline(std::cin, s))
    {
        // Print the first run of consecutive ideographs found on each input line.
        auto start = std::find_if(Iter{s.cbegin()}, Iter{s.cend()}, isChinese);
        auto stop = std::find_if_not(start, Iter{s.cend()}, isChinese);
        std::cout << std::string{start.base(), stop.base()} << '\n';
    }
    return 0;
}
https://wandbox.org/permlink/FtxKa8D2LtR3ko9t
You should be able to polish that approach into something fully functional.
I do not know how to cover this properly with tests, and I am not sure which characters should be included in this check.
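On the testing question, one possible starting point is to check the range predicate against a handful of code points whose block membership is known. The sketch below uses only the standard library; isHan, decodeUtf8 and countHan are hypothetical names, the decoder assumes well-formed UTF-8, and the block list is simply the one from the answer above, not an authoritative definition of "Chinese".

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical predicate covering the same blocks as the answer's isChineese.
constexpr bool isHan(std::uint32_t c) noexcept
{
    return (c >= 0x4E00 && c <= 0x9FFF)     // CJK Unified Ideographs
        || (c >= 0x3400 && c <= 0x4DBF)     // Extension A
        || (c >= 0x20000 && c <= 0x2A6DF)   // Extension B
        || (c >= 0x2A700 && c <= 0x2B73F)   // Extension C
        || (c >= 0x2B740 && c <= 0x2B81F)   // Extension D
        || (c >= 0x2B820 && c <= 0x2CEAF)   // Extension E
        || (c >= 0xF900 && c <= 0xFAFF)     // Compatibility Ideographs
        || (c >= 0x2F800 && c <= 0x2FA1F);  // Compatibility Ideographs Supplement
}

// Minimal UTF-8 decoder; assumes the input is well-formed UTF-8.
std::vector<std::uint32_t> decodeUtf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < s.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes implied by the lead byte.
        const std::size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        if (i + extra >= s.size())
            break; // truncated sequence at the end of the input
        std::uint32_t cp = extra == 0 ? lead : (lead & (0x3Fu >> extra));
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Counts the code points of a UTF-8 string that fall in the blocks above.
std::size_t countHan(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::uint32_t cp : decodeUtf8(utf8))
        if (isHan(cp))
            ++n;
    return n;
}
```

Useful boundary cases are the first and last code point of each block plus near misses such as U+3042 (hiragana) and U+4DC0 (a hexagram symbol just past Extension A).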

Related

Does C++ have template literals like JavaScript?

I am looking for a template literals feature like the one that was introduced in ES6 JavaScript. Is there something comparable?
JavaScript:
for (let i = 0; i < 10; i++) {
console.log(`Liftoff in ${i} seconds`)
}
I am looking for a clean way to iterate through several directories using a for loop.
If you have C++20 available, you could use std::format(). Here's a usage example from the linked page:
#include <iostream>
#include <format>
int main() {
std::cout << std::format("Hello {}!\n", "world");
}
If you don't have C++20 yet, Boost.Format offers a similar feature.
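If neither C++20 nor Boost is an option, the loop from the question can also be written with std::ostringstream. This is a plain stand-in rather than a formatting library, and the helper names liftoffMessage and countdownMessages are made up for illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Builds the same text as the JavaScript template literal in the question.
std::string liftoffMessage(int seconds)
{
    std::ostringstream out;
    out << "Liftoff in " << seconds << " seconds";
    return out.str();
}

// Collects the messages for i = 0 .. n-1, mirroring the for loop.
std::vector<std::string> countdownMessages(int n)
{
    std::vector<std::string> messages;
    for (int i = 0; i < n; ++i)
        messages.push_back(liftoffMessage(i));
    return messages;
}
```

The same pattern works for building directory names inside a loop, since any streamable value can be inserted between the literal parts.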

x3 modify parser at runtime

I wonder if it is possible to change the parser at runtime given it does not change the compound attribute.
Let's say I want to be able to modify at runtime the character my parser uses to detect whether a line has to be joined, from ; to ~. Both are just characters, and since the C++ types and the template instantiations don't vary (in both cases we are talking about a char), I think there must be some way, but I can't find it. So is this possible?
My concrete situation is that I am calling the X3 parser via C++/CLI and have the need that the character shall be adjustable from .NET. I hope the following example is enough to be able to understand my problem.
http://coliru.stacked-crooked.com/a/1cc2f2836dbfaa46
Kind regards
You cannot change the parser at runtime (except a DSO trick I described under your other question https://stackoverflow.com/a/56135824/3621421), but you can make your parser context-sensitive via semantic actions and/or stateful parsers (like x3::symbols).
The state for semantic actions (or probably for your custom parser) can also be stored in a parser context. However, usually I see that folks use global or function local variables for this purpose.
A simple example:
#include <boost/spirit/home/x3.hpp>
#include <cstring>
#include <iostream>
namespace x3 = boost::spirit::x3;
int main()
{
    char const* s = "sep=,\n1,2,3", * e = s + std::strlen(s);
    // The separator is read from the input, stored in the parser context,
    // and then compared against every delimiter in the list.
    auto p = "sep=" >> x3::with<struct sep_tag, char>('\0')[
        x3::char_[([](auto& ctx) { x3::get<struct sep_tag>(ctx) = _attr(ctx); })] >> x3::eol
        >> x3::int_ % x3::char_[([](auto& ctx) { _pass(ctx) = x3::get<struct sep_tag>(ctx) == _attr(ctx); })]
    ];
    if (parse(s, e, p) && s == e)
        std::cout << "OK\n";
    else
        std::cout << "Failed\n";
}

Access pre-compiled functions within a class C++/11

Sorry if the title is misleading, I'm currently looking for solutions to the following:
I'm developing a library for other people to use. They have to follow a strict design concept in how they structure any additional features within the library. They all use Linux and Vim, so they are able to use terminal commands (e.g. to compile), and we all use clang as the compiler.
My question is this: Let's suppose I write a function called: "checkCode":
template<typename T>
void checkCode(T&& codeSnippet)
{
//// code
}
I want this function to run whenever they type "checkCode" in a terminal. I know clang has similar functionality, but that is understandable, as there you are using the whole of clang. So:
1) Is it possible to just compile a class, and then access each of the functions through the .dylib | .so file?
2) Might it be a better idea to take a copy of the clang source, add this functionality, and roll it out to those using and contributing to the library? This would be like an add-on to clang.
Thanks
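You could use one executable and symbolic links to it, like BusyBox:
#include <string>
void function_1(); // defined elsewhere in the library
void function_2();
int main(int argc, char **argv)
{
    // argv[0] typically holds the name the program was invoked under,
    // which is the symlink name when started through a symlink.
    std::string programName = argv[0];
    std::size_t lastSlash = programName.find_last_of('/');
    if (lastSlash != std::string::npos)
        programName = programName.substr(lastSlash + 1);
    if (programName == "function_1")
    {
        function_1();
        return 0;
    }
    if (programName == "function_2")
    {
        function_2();
        return 0;
    }
    // ...
    // normal main code
    return 0;
}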
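When the number of commands grows, the chain of if statements above can be swapped for a lookup table. The following is only a sketch of that variation: checkCode and formatCode are placeholder commands, and dispatch and baseName are hypothetical helper names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Strips any leading directory components from argv[0].
std::string baseName(const std::string& path)
{
    const std::size_t slash = path.find_last_of('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}

// Placeholder commands standing in for the library's real functions.
int checkCode() { return 1; }
int formatCode() { return 2; }

// Looks up the invoked name in a table; returns -1 for unknown names.
int dispatch(const std::string& argv0)
{
    static const std::map<std::string, std::function<int()>> commands = {
        {"checkCode", checkCode},
        {"formatCode", formatCode},
    };
    const auto it = commands.find(baseName(argv0));
    return it == commands.end() ? -1 : it->second();
}
```

In main this would reduce to something like return dispatch(argv[0]);, with one symbolic link per command created next to the executable (for example with ln -s).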

What's the safest way to define short function name aliases in C++?

Suppose I have a class Utility in a file utility.h:
class Utility {
public:
static double longDescriptiveName(double x) { return x + 42; }
};
And then I find that I use the function longDescriptiveName(...) a LOT. So like an irresponsible C++ programmer that I am when I've had too much coffee, I create a new file utilitymacros.h and add the following there:
#define ldn Utility::longDescriptiveName
Now I include "utilitymacros.h" in any *.cpp where I use ldn(...) and my heart is filled with joy over how much more convenient it is to type 3 letters vs 28.
Question: Is there a safer (more proper) way of doing this than with #define?
I've noticed that I have to include "utilitymacros.h" after including boost headers, which I obviously don't like because it's a sign of clashes (though the Boost errors I get are not very clear as to what the clash is).
Clarification 1: On Code Readability
In case you might say that this negatively affects code readability, I assure you it does not, because it's a small set of functions that are used A LOT. An example that is widely known is stoi for stringToInteger. Another is pdf for probabilityDensityFunction, etc. So if I want to do the following, stoi is more readable in my opinion:
int x = stoi(a) + stoi(b) + stoi(c) + stoi(d);
Than:
int x = Utility::stringToInteger(a) + Utility::stringToInteger(b)
+ Utility::stringToInteger(c) + Utility::stringToInteger(d);
Or:
int x = Utility::stringToInteger(a);
x += Utility::stringToInteger(b);
x += Utility::stringToInteger(c);
x += Utility::stringToInteger(d);
Clarification 2: Editor Macro
I use Emacs as my IDE of choice and a Kinesis keyboard, so you KNOW I use a ton of keyboard macros, custom keyboard shortcuts, as well as actually modifying what I see in the editor vs what's actually stored in the h/cpp file. But still, I feel like the simplicity and visual readability (as argued above) of using a function abbreviation in a few select cases really is the result I'm looking for (this is certainly subjective to a degree).
Instead of a macro, you could write an inline function that forwards the call to the actual function:
inline double ldn(double x)
{
    return Utility::longDescriptiveName(x);
}
That is certainly safer than a macro.
You could use a function reference:
double (&ldn)(double) = Utility::longDescriptiveName;
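To compare the options side by side, here is a small sketch with three ways of giving the function a short name without the preprocessor. The names ldnRef, ldnPtr and ldn are chosen only for illustration; the forwarding template is an extra variation that is not part of the answers above.

```cpp
#include <cassert>
#include <utility>

class Utility {
public:
    static double longDescriptiveName(double x) { return x + 42; }
};

// Option 1: a reference bound to the function.
static double (&ldnRef)(double) = Utility::longDescriptiveName;

// Option 2: a constexpr function pointer with a deduced type.
constexpr auto ldnPtr = &Utility::longDescriptiveName;

// Option 3: a forwarding wrapper that keeps working if overloads are added later.
template <typename... Args>
auto ldn(Args&&... args)
    -> decltype(Utility::longDescriptiveName(std::forward<Args>(args)...))
{
    return Utility::longDescriptiveName(std::forward<Args>(args)...);
}
```

Unlike the macro, all three obey scope and type checking, so they can live in a namespace and do not interfere with identifiers in other headers.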
How about configuring a snippet/macro/similar thing in your text editor? This way you only have to type ldn or something like that, and the code doesn't have to run through the preprocessor, risking difficult-to-find bugs later.
I don't know if this helps, but I think part of the problem may be the use of overly general namespaces (or class names, in this case), such as Utility.
If instead of Utility::stringToInteger, we had
namespace utility {
namespace type_conversion {
namespace string {
int to_int(const std::string &s);
}
}
}
Then the function could locally be used like this:
void local_function()
{
using namespace utility::type_conversion::string;
int sum = to_int(a) + to_int(b) + to_int(c) + to_int(d);
}
Analogously, if classes/structs and static functions are used (and there can be good reasons for this), we have something like
struct utility {
    struct type_conversion {
        struct string {
            static int to_int(const std::string &s);
        };
    };
};
and the local function would look something like this:
void local_function()
{
typedef utility::type_conversion::string str;
int sum = str::to_int(a) + str::to_int(b)
+ str::to_int(c) + str::to_int(d);
}
I realize I am not telling you anything about syntax you didn't know already; it's more a reminder of the fact that the organization and structure of namespaces and classes itself plays an important role in making code more readable (and writable).
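Two narrower alternatives to using namespace are a namespace alias and a using-declaration for a single function. The sketch below assumes a to_int implemented with std::stoi purely for illustration; sum_of and sum_with_using are hypothetical names.

```cpp
#include <cassert>
#include <string>

namespace utility {
namespace type_conversion {
namespace string {
    // Illustrative implementation only: delegates to std::stoi.
    inline int to_int(const std::string& s) { return std::stoi(s); }
}
}
}

int sum_of(const std::string& a, const std::string& b)
{
    // Namespace alias: a short local name, nothing added to unqualified lookup.
    namespace str = utility::type_conversion::string;
    return str::to_int(a) + str::to_int(b);
}

int sum_with_using(const std::string& a, const std::string& b)
{
    // Using-declaration: imports exactly one name into this scope.
    using utility::type_conversion::string::to_int;
    return to_int(a) + to_int(b);
}
```

Both forms are limited to the enclosing block, so the short name never leaks into other translation units the way a macro does.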
One alternative is to rename your function and put it in a namespace instead of a class, since it is static anyway. utility.h becomes
namespace Utility {
// long descriptive comment
inline double ldn(double x) { return x + 42; }
}
Then you can put using namespace Utility; in your client code.
I know there are lots of style guides out there saying short names are a bad thing, but I don't see the point of obeying some style and then circumventing it.
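An alias declaration (using, since C++11) only names types, so it cannot alias a function directly. For a function, bind a reference to it instead:
auto& shortName = my::complicated::function::name;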

C++ string parsing ideas

I have the output of another program that was intended to be human readable rather than machine readable, but I am going to parse it anyway. It's nothing too complex.
Yet, I'm wondering what the best way to do this in C++ is. This is more of a 'general practice' type of question.
I looked into Boost.Spirit, and even got it working a bit. That thing is crazy! If I was designing the language that I was reading, it might be the right tool for the job. But as it is, given its extreme compile-times, the several pages of errors from g++ when I do anything wrong, it's just not what I need. (I don't have much need for run-time performance either.)
I thought about using C++ operator >>, but that seems worthless. If my file has lines like "John has 5 widgets", and others like "Mary works at 459 Ramsy street", how can I even make sure I have a line of the first type in my program, and not the second type? I have to read the whole line and then use things like string::find and string::substr, I guess.
And that leaves sscanf. It would handle the above cases beautifully
if( sscanf( str, "%s has %d widgets", chararr, & intvar ) == 2 )
// then I know I matched "foo has bar" type of string,
// and I now have the parameters too
So I'm just wondering if I'm missing something or if C++ really doesn't have much built-in alternative.
sscanf does indeed sound like a pretty good fit for your requirements:
you may do some redundant parsing, but you don't have performance requirements prohibiting that
it localises the requirements on the different input words and allows parsing of non-string values directly into typed variables, making the different input formats easy to understand
A potential problem is that it's error prone, and if you have lots of oft-changing parsing phrases then the testing effort and risk can be worrying. Keeping the spirit of sscanf but using istream for type safety:
#include <cstring>
#include <iostream>
#include <sstream>
#include <string>
// Str captures a string literal and consumes the same from an istream...
// (for non-literals, better to have `std::string` member to guarantee lifetime)
class Str
{
public:
    Str(const char* p) : p_(p) { }
    const char* c_str() const { return p_; }
private:
    const char* p_;
};
bool operator!=(const Str& lhs, const Str& rhs)
{
    return std::strcmp(lhs.c_str(), rhs.c_str()) != 0;
}
// Reads one whitespace-delimited word and fails the stream if it differs from str.
std::istream& operator>>(std::istream& is, const Str& str)
{
    std::string s;
    if (is >> s)
        if (s.c_str() != str)
            is.setstate(std::ios_base::failbit);
    return is;
}
// sample usage...
int main()
{
    std::stringstream is("Mary has 4 cats");
    int num_dogs, num_cats;
    if (is >> Str("Mary") >> Str("has") >> num_dogs >> Str("dogs"))
    {
        std::cout << num_dogs << " dogs\n";
    }
    else if (is.clear(), is.seekg(0), // "reset" the stream...
        (is >> Str("Mary") >> Str("has") >> num_cats >> Str("cats")))
    {
        std::cout << num_cats << " cats\n";
    }
}
The tools flex and bison (a lexer generator and a parser generator) are very powerful tools you could use that are along the lines of Spirit but (according to some people) easier to use, partially because the error reporting is a bit better since the tools have their own compilers. This, or Spirit, or some other parser generator, is the "correct" way to go with this because it affords you the greatest flexibility in your approach.
If you're thinking about using strtok, you might want to instead take a look at stringstream, which splits on whitespace and lets you do some nice formatting conversions between strings, primitives, etc. It can also be plugged into the STL algorithms, and avoids all the messy details of raw C-style string memory management.
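As a rough illustration of the stringstream route, the sketch below splits a line on whitespace and then tries to read the "John has 5 widgets" shape directly into typed variables. The function names splitWords and parseWidgets are invented for this example, and trailing extra words are not checked.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits a line on whitespace, the way repeated operator>> does.
std::vector<std::string> splitWords(const std::string& line)
{
    std::istringstream in(line);
    std::vector<std::string> words;
    for (std::string w; in >> w;)
        words.push_back(w);
    return words;
}

// Tries to read "<name> has <count> widgets"; returns false on any mismatch.
bool parseWidgets(const std::string& line, std::string& name, int& count)
{
    std::istringstream in(line);
    std::string has, widgets;
    return (in >> name >> has >> count >> widgets)
        && has == "has" && widgets == "widgets";
}
```

A line of the other type, such as "Mary works at 459 Ramsy street", fails at the integer extraction, so the caller can fall through to the next candidate format.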
I've written extensive parsing code in C++. It works just great for that, but I wrote the code myself and didn't rely on more general code written by someone else. C++ doesn't come with extensive code already written, but it's a great language to write such code in.
I'm not sure what your question is beyond just that you'd like to find code someone has already written that will do what you need. Part of the problem is that you haven't really described what you need, or asked a question for that matter.
If you can make the question more specific, I'd be happy to try and offer a more specific answer.
I've used Boost.Regex (which I think is also the basis of tr1::regex, standardized as std::regex in C++11). Easy to use.
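With the standard library's std::regex, matching one of the sample lines and pulling out its fields could look roughly like this. The pattern and the function name matchWidgets are assumptions made for the example, and std::stoi would throw on a number too large for int.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Matches "<name> has <count> widgets" and extracts both fields.
bool matchWidgets(const std::string& line, std::string& name, int& count)
{
    static const std::regex pattern(R"((\w+) has (\d+) widgets)");
    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return false; // the whole line must fit the pattern
    name = m[1].str();
    count = std::stoi(m[2].str());
    return true;
}
```

Because regex_match requires the entire line to match, a line of a different type is rejected outright instead of being partially consumed.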
there is always strtok() I suppose
Have a look at strtok.
Depending on exactly what you want to parse, you may well want a regular expression library.
See msdn or earlier question.
Personally, again depending on the exact format, I'd consider using Perl to do an initial conversion into a more machine readable format (e.g. variable record CSV) and then import into C++ much more easily.
If sticking to C++, you need to:
1) Identify a record - hopefully just a line
2) Determine the type of the record - use regex
3) Parse the record - scanf is fine
A base class along the lines of:
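#include <regex>
#include <string>
class Handler
{
public:
    Handler(const std::string& regexExpr)
        : regex_(regexExpr)
    {}
    // Handlers are stored and used through base pointers, so give the class a virtual destructor.
    virtual ~Handler() {}
    bool match(const std::string& s) const
    {
        return std::regex_match(s, regex_);
    }
    virtual bool process(const std::string& s) = 0;
private:
    std::regex regex_;
};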
Define a derived class for each record type, stick an instance of each in a set and search for matches.
class WidgetOwner : public Handler
{
public:
    WidgetOwner()
        : Handler(".* has .* widgets")
    {}
    virtual bool process(const std::string& s)
    {
        char name[32];
        int widgets = 0;
        // %31s leaves room for the terminating null in the 32-byte buffer.
        int fieldsRead = sscanf(s.c_str(), "%31s has %d widgets", name, &widgets);
        if (fieldsRead == 2)
        {
            std::cout << "Found widgets in " << s << std::endl;
        }
        return fieldsRead == 2;
    }
};
// Predicate: does a handler's pattern match this record?
struct Pred
{
    Pred(const std::string& record)
        : record_(record)
    {}
    bool operator()(Handler* handler)
    {
        return handler->match(record_);
    }
    std::string record_;
};
// The lines below belong inside a function; line holds the current record,
// and WorkLocation is another Handler subclass (not shown).
std::set<Handler*> handlers_;
handlers_.insert(new WidgetOwner);
handlers_.insert(new WorkLocation);
Pred pred(line);
std::set<Handler*>::iterator handlerIt =
    std::find_if(handlers_.begin(), handlers_.end(), pred);
if (handlerIt != handlers_.end())
    (*handlerIt)->process(line);
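The same record-type dispatch can be sketched more compactly with a table of pattern/action pairs instead of a class hierarchy. This is a variation on the design above, not a replacement for it: Rule, rules and handleLine are hypothetical names, the actions just return a summary string, and the regex capture groups stand in for the sscanf step.

```cpp
#include <cassert>
#include <functional>
#include <regex>
#include <string>
#include <vector>

// One entry per record type: a pattern plus the action to run on a match.
struct Rule
{
    std::regex pattern;
    std::function<std::string(const std::smatch&)> action;
};

// Illustrative rule set for the two sample lines from the question.
const std::vector<Rule>& rules()
{
    static const std::vector<Rule> table = {
        {std::regex(R"((\w+) has (\d+) widgets)"),
         [](const std::smatch& m) { return "widgets:" + m[1].str() + ":" + m[2].str(); }},
        {std::regex(R"((\w+) works at (.+))"),
         [](const std::smatch& m) { return "work:" + m[1].str() + ":" + m[2].str(); }},
    };
    return table;
}

// Runs the first rule whose pattern matches the whole line; returns "" if none does.
std::string handleLine(const std::string& line)
{
    std::smatch m;
    for (const Rule& rule : rules())
        if (std::regex_match(line, m, rule.pattern))
            return rule.action(m);
    return std::string();
}
```

Since the table owns its entries by value, there is nothing to delete, and rules are tried in the order they are listed rather than in pointer order as with std::set of raw pointers.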