ICU: simple case mapping for whole strings - c++

I would like to find a substring in a UTF-8 string, case-insensitively. From what I read, the usual way this is done is to case-fold the strings in order to bring them to a canonical form.
However, since case folding can change the length of the strings (and I don't want the lengths to change, because I need to know the exact offset of the substring match in the original string), it seems I should be using simple case mapping. Although the case-insensitive comparison won't be fully accurate, it will be a best effort.
However, I cannot find any functions in the ICU API that apply simple case mapping to whole strings; I can find it only in the single-character functions (u_foldCase() in uchar.h). Is there a way to use simple case folding on whole strings?
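In the meantime, a minimal sketch of how the single-character u_foldCase() could be applied per code point (simpleFold is just a placeholder name; since simple folding maps each code point to exactly one code point, the code-point count, and thus the offsets in question, should be preserved):

#include <unicode/uchar.h>
#include <unicode/unistr.h>

// Sketch: fold each code point of a UnicodeString individually with the
// single-character u_foldCase(), which performs simple (1:1) case folding.
icu::UnicodeString simpleFold(const icu::UnicodeString &src) {
    icu::UnicodeString dst;
    for (int32_t i = 0; i < src.length(); i = src.moveIndex32(i, 1)) {
        dst.append(u_foldCase(src.char32At(i), U_FOLD_CASE_DEFAULT));
    }
    return dst;
}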

Related

Java Regex to find if a given String contains a set of characters in the same order of their occurrence.

We need a Java regex to find whether a given String contains a set of characters in the same order as they occur.
E.g. if the given String is "TYPEWRITER",
the following strings should return a match:
"YERT", "TWRR" & "PEWRR" (character by character match in the order of occurrence),
but not
"YERW" or "YERX" (this contains characters either not present in the given string or doesn't match the order of occurrence).
This can be done by character by character matching in a for loop, but it will be more time consuming. A regex for this or any pointers will be highly appreciated.
First of all, regex has nothing to do with this. Regex is powerful, but not powerful enough to accomplish this.
What you are asking for is part of a Longest Common Subsequence (LCS) algorithm implementation. For your case you need to change the algorithm a bit: instead of matching partial strings from both sides, you need to match your one string as a whole subsequence of the larger one.
LCS is a dynamic-programming algorithm, and so far this is the fastest way to achieve this. If you take a look at the LCS example here you'll see what I am talking about.
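For a plain yes/no check, the full LCS table may not even be needed; a rough sketch of the subsequence test described above (shown in C++, but the same two-pointer logic ports directly to Java; isSubsequence is a hypothetical name):

#include <cstddef>
#include <string>

// Sketch: returns true if `pattern` appears in `text` as a subsequence,
// i.e. all its characters occur in the same relative order. Greedy
// two-pointer scan, linear in the length of `text`.
bool isSubsequence(const std::string &pattern, const std::string &text) {
    std::size_t p = 0;
    for (std::size_t t = 0; t < text.size() && p < pattern.size(); ++t) {
        if (text[t] == pattern[p]) {
            ++p;                        // matched one more pattern character
        }
    }
    return p == pattern.size();
}

// isSubsequence("YERT", "TYPEWRITER") -> true
// isSubsequence("YERW", "TYPEWRITER") -> false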

Finding match between optional tokens?

For the strings:
text::handle:e#ma.il::text
text::chat_identifier:chat0123456789&text
I have the current regex:
m/(handle:|chat_identifier:)(.+?)(:{2}|&)/
And I am currently using $2 in order to obtain the value I wish (in the first string e#ma.il and in the second, chat0123456789).
Is there a better/faster/simpler way to solve this problem, though?
Whether it's "better" or not depends on the context, but you could take this approach: split the string on ":" and take the fourth element of the resulting list. That's arguably more readable than the regex and more robust if the third field can be something other than "handle" or "chat_identifier".
I think the speed would be very similar for either approach, and probably for almost any implementation in Perl. I'd want to see that speed is critical for this step before worrying about it...
For a regex solution, this one is slightly simpler and doesn't need to backtrack:
m/(handle|chat_identifier):([^:&]+)/
Note the slight difference: yours allows single colons within the value, mine doesn't (it stops at the first colon encountered). If that is not a problem, you can use my variant. Or as I mentioned in a comment, split at : and use the fourth element in the result.
An equivalent version that stops only at double colons is this:
m/(handle|chat_identifier):((?:(?!::|&).)+)/
Not so beautiful, but it still avoids backtracking (the lookahead might make it slower, though... you will need to profile that, if speed matters at all).
Looks like you have a lot of good solutions here already. The split method seems like the simplest, but depending on your requirements you could also use a more generic regex that breaks the string into its basic pieces. It will work for datatypes and property names other than the ones in your examples.
([^:]+)::([^:]+):([^:&]+)(?:::|&)\1
The captures groups are as follows:
Group 1: The datatype. (The keyword "text" from your examples.)
Group 2: The property name. (The keywords "handle" and "chat_identifier" from your examples.)
Group 3: The property value.
If the values you want are always in the same position and it's safe to split on : and &, then perhaps the following will work for you:
use Modern::Perl;
say +( split /[:&]+/ )[2] for <DATA>;
__DATA__
text::handle:e#ma.il::text
text::chat_identifier:chat0123456789&text
Output:
e#ma.il
chat0123456789

Parse std::string for a selection of characters

Is there an easy way to parse a std::string in search of a list of certain characters? For example, let's say the user enters this<\is a.>te!st string. I'd like to be able to spot that those non-letter characters are there and do something about it. I'm looking for a general-purpose solution that allows me to simply specify a list of chars so I can reuse the function in different situations. I'm guessing regular expressions will play a key role in any solution, and obviously the more compact and efficient, the better.
You could use std::string::find_first_not_of() for this. It'll find the characters except those in the set you give it. Its counterpart, find_first_of(), will search for characters that are in the set.
Both functions allow you to specify the starting index, which enables you to continue the search from where you left off.
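As a rough sketch of that loop (the input string and the character set below are just the example from the question):

#include <iostream>
#include <string>

int main() {
    const std::string input = "this<\\is a.>te!st string";  // example from the question
    const std::string bad   = "<\\.>!";                     // characters we want to spot
    // Repeatedly search from just past where the previous hit left off.
    for (std::string::size_type pos = input.find_first_of(bad);
         pos != std::string::npos;
         pos = input.find_first_of(bad, pos + 1)) {
        std::cout << "found '" << input[pos] << "' at offset " << pos << '\n';
    }
}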
How about using a regex library like boost::regex?
This should do exactly what you are looking for.
If your compiler supports C++11 you can use std::regex.
Regex seems like overkill. You can use std::string's methods: find_first_of() and/or find_last_of(). Here you can find documentation and examples.

Most Efficient way to 'look up' Keywords

Alright, so I am writing a function as part of a lexical analyzer that looks up, or searches for, a match with a keyword. My lexer already catches all the obvious tokens such as single- and multi-character operators (+ - * / > < = == etc.), and comments and whitespace are already taken out, so I call this function after I've collected a stream of purely alphanumeric characters (including underscores) into a string; this string then needs to be matched as either a known keyword or an identifier.
So I was wondering how I might go about identifying it? I know I basically need to compare it against some list or array of all the built-in keywords, and if it matches one, return its corresponding enum value; otherwise, if there is no match, it must be a function or variable identifier. So how should I look for matches? I read somewhere that a Binary Search Tree is an efficient way to do it, or that Hash Tables can be used; the problem is I've never used either, so I am not sure if that's the right way. Could I possibly use a MySQL database?
If your set of keywords is fixed, a perfect hash can be built for O(1) lookup. Check out gperf or cmph.
A "trie" will surely be the most efficient way.
Whatever implementation of std::map you have will probably be sufficient.
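For illustration, a minimal sketch of that kind of lookup (the token enum and the keyword list are placeholders; std::unordered_map would work the same way, with average O(1) lookups instead of O(log n)):

#include <map>
#include <string>

enum class Token { KwIf, KwElse, KwWhile, Identifier };

Token classify(const std::string &word) {
    // Built once; the keyword set never changes at runtime.
    static const std::map<std::string, Token> keywords = {
        {"if", Token::KwIf},
        {"else", Token::KwElse},
        {"while", Token::KwWhile},
    };
    const auto it = keywords.find(word);
    return it != keywords.end() ? it->second : Token::Identifier;
}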
This is for a language, with a specific set of keywords that never change, and there aren't very many of them?
If so, it probably doesn't matter what you use. You will have bigger fish to fry.
However, since the list doesn't change, it would be hard to beat a hard coded search like this:
// search on first letter
switch (s[0]) {
case 'a':
    // search on 2nd letter, etc.
    break;
case 'b':
    // search on 2nd letter, etc.
    break;
// ... and so on for the remaining letters ...
case '_':
    // search on 2nd letter, etc.
    break;
}
For single-character keywords a lookup table would be perfect. For multi-character keywords (especially if the lengths differ): a hash table. If you need performance, you could even use source-code generation to create the hash tables (using a simple hash function that may or may not ignore case, depending on your syntax).
So I'd implement it with a LUT and a hash table: first you check the first character with the LUT (if it's a simple operator, it would start with a non-alpha-numeric value), and, if not found, check the hash table.

Most efficient method to parse small, specific arguments

I have a command line application that needs to support arguments of the following forms:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't recommend a full lex/yacc parser just for this. What you described fits a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that supports captures, it's easy to extract the individual pieces of information you need (most probably in captures 1, 3 and 4).
If I understood what you mean, you will probably want to check that capture 1 and capture 4 are not both non-empty at the same time.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
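As a rough sketch of how that could look with C++11's std::regex (note that with this particular pattern the useful pieces land in sub-matches 2, 3 and 4, since sub-match 1 includes the trailing '*'):

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Group 2: "all" or a count, group 3: the (possibly quoted) search,
    // group 4: "#Y" if present.
    const std::regex pattern(
        R"(((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?)");
    for (const std::string arg : {"2*foo", "bar#8", "all*'foo bar'"}) {
        std::smatch m;
        if (std::regex_match(arg, m, pattern)) {
            std::cout << arg << " -> count=" << m[2]
                      << " search=" << m[3] << " pos=" << m[4] << '\n';
        }
    }
}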
Even without regex, I would hand-write a function. It would be simpler than dealing with lex/yacc, and I guess you could put together something that is even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how many libraries you want to depend on. If your application can depend on other libraries, you can use any of the many regular expression libraries - e.g. POSIX regex, which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
In this case the strtok approach would be much better, since the number of commands to be parsed is small.
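A rough sketch of that splitting idea (using std::string::find rather than strtok, so no writable C buffer is needed; interpreting the pieces, including the bare "all" case and quoted searches, is left to the caller):

#include <iostream>
#include <string>

int main() {
    for (const std::string arg : {"2*foo", "bar#8", "all*'foo bar'"}) {
        std::string search = arg, count, index;
        const auto star = search.find('*');
        if (star != std::string::npos) {              // "all*search" or "X*search"
            count  = search.substr(0, star);
            search = search.substr(star + 1);
        }
        const auto hash = search.find('#');
        if (hash != std::string::npos) {              // "search#Y"
            index  = search.substr(hash + 1);
            search = search.substr(0, hash);
        }
        std::cout << arg << " -> count='" << count << "' search='" << search
                  << "' index='" << index << "'\n";
    }
}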