Regex collating symbols

I tried to understand how matching with 'collating symbols' works, but I could not figure it out. As I understand it, a collating symbol matches an exact sequence of characters instead of any one of the listed characters, that is:
echo "ciiiao" | grep '[oa]' --> output 'ciiiao'
echo "ciiiao" | grep '[[.oa.]]' --> no output
echo "ciiiao" | grep '[[.ia.]]' --> output 'ciiiao'
However, the third command does not work. Am I wrong, or am I misinterpreting something?
I have read this regexp tutorial.

Collating symbols are typically used when a digraph is treated like a single character in a language. They are an element of the POSIX regular expression specification, and are not widely supported.
For example, the Welsh alphabet has a number of digraphs that are treated as a single letter (marked with a * below)
a b c ch* d dd* e f ff* g ng* h i j l ll* m n o p ph* r rh* s t th* u w y
Assuming the locale defines it (a collating symbol only works if it is defined in the current locale), the collating symbol [[.ng.]] is treated like a single character. Likewise, a single-character expression like . or [^a] will also match "ff" or "th". This also affects collation order, so that [p-t] will include the digraphs "ph" and "rh" in addition to the expected single letters. In a locale that does not define "ia" as a collating element, [[.ia.]] therefore cannot match the sequence "ia".
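Python's re module does not support POSIX collating symbols, so the following is only a rough emulation of the idea, not the POSIX mechanism itself. It is a minimal sketch that assumes the eight Welsh digraphs listed above and treats each one as a single "letter" by trying the digraphs before any single character. The names DIGRAPHS, LETTER and letters are hypothetical.

```python
import re

# Assumption: the Welsh digraphs from the alphabet above; each counts as one letter.
DIGRAPHS = ["ch", "dd", "ff", "ng", "ll", "ph", "rh", "th"]

# Try every digraph first, then fall back to any single character.
LETTER = re.compile("|".join(DIGRAPHS) + "|.", re.DOTALL)

def letters(word):
    """Split a word into collating elements (digraphs count as one letter)."""
    return LETTER.findall(word)

print(letters("ffordd"))    # ['ff', 'o', 'r', 'dd']
print(letters("llanfair"))  # ['ll', 'a', 'n', 'f', 'a', 'i', 'r']
```

A real locale-aware engine makes this decision from the locale's collation data; the hard-coded list here only illustrates why "ng" can behave like one character.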

Related

Matching a string containing special characters against a string without special characters

I'm trying to match strings using a Google Sheets query where either string may or may not contain special characters. I'll try to explain as best as I can:
I have a table of data:
+-----+-----+-----+-----+
| A   | B   | C   | D   |
+-----+-----+-----+-----+
| x   | y   | ø   | z   |
| xx  | yy  | á   | zz  |
| xxx | yyy | e   | zzz |
+-----+-----+-----+-----+
My query function would look something like this:
=QUERY(A1:D3, "SELECT * WHERE (C = 'ø') OR (C = 'è') OR (C = 'a')")
Currently, this query returns only 1 row, because (C = 'ø') is an exact match with 'ø' but none of the other conditions have a match.
For (C = 'è'), we can just replace all of the accented characters in the string with their un-accented equivalent.
In this case 'è' becomes 'e' and has a match - now the query will return a second row.
(I found a nice way to replace all accented characters in a string here.)
Finally, here is where my main problem sits: (C = 'a'). I can't figure out a way to make it match 'á', unless I check every accented variant of 'a', but that just seems silly.
It's not possible to do something like "... WHERE (CUSTOM_FUNCTION(C) = 'a')" either, sadly.
As I previously mentioned, either side of the match may or may not contain accented/special characters.
I should also mention that it won't be just a single character; it will be a whole string.
If anyone has any possible solutions to this, it would be greatly appreciated.

Instead of the QUERY formula, you could use a FILTER formula.
=FILTER(A2:D22,REGEXMATCH(C2:C22,"ø|è|á")=TRUE)
(Please adjust ranges to your needs. You can also add/remove more special characters.)
Functions used:
FILTER
REGEXMATCH
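The accent-stripping idea from the question (compare both sides after replacing accented characters with their un-accented equivalents) can be sketched outside Sheets. This is a minimal Python sketch, assuming that Unicode NFD decomposition is an acceptable definition of "un-accented"; the helper names fold and same_ignoring_accents are hypothetical. Note that 'ø' has no canonical decomposition, so it is left unchanged and would still need an exact match.

```python
import unicodedata

def fold(text):
    """Strip combining marks after NFD decomposition, e.g. 'á' -> 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def same_ignoring_accents(a, b):
    """Compare two strings after folding accents on both sides."""
    return fold(a) == fold(b)

print(same_ignoring_accents("á", "a"))  # True
print(same_ignoring_accents("è", "e"))  # True
print(same_ignoring_accents("ø", "o"))  # False: 'ø' does not decompose
```

Folding both sides before comparing handles the case where either string may or may not contain accents.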

boost char_separator with digits

Can we use boost/tokenizer or boost::char_separator to split on digits as well?
Say we have a line such as:
1 *1:0 *2:0 0.01
We can break the line above on multiple delimiters with
boost::char_separator<char> space_star_sep{" ", ":"};
This will give me tokens as:
1
*1
:
0
*2
:
0
0.01
If I use a single delimiter, as in
boost::char_separator<char> space_star_sep{" "};
I will get:
1
*1:0
*2:0
0.01
Is there any way to break up the string by digits along with delimiters directly, instead of getting each token and parsing it? Say I want these tokens:
1
*1
*2
0.01
I tried passing generic patterns such as \d to char_separator, but it treats them as an unknown sequence.

If your question is whether you can tokenize the string by passing the delimiter string ":0" into char_separator or a similar function (let's say strtok), the answer is:
No.
By design, those functions treat each character of the delimiter string as a separate single-character delimiter; they cannot split on a multi-character sequence.

You're trying to do two things here.
Tokenize on whitespace
Strip a trailing :0 (or tokenize each token on : and get the first token)
Those are two operations, and will need to be performed as such.
Your char_separator examples perform one tokenization, just with multiple candidate delimiters.
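The two operations can be sketched as follows. This is a minimal Python illustration rather than Boost code, and it assumes that everything after the first ':' in a token can be discarded; the function name two_phase is hypothetical.

```python
def two_phase(line):
    """Phase 1: split on whitespace. Phase 2: keep the text before the first ':'."""
    return [token.split(":", 1)[0] for token in line.split()]

print(two_phase("1 *1:0 *2:0 0.01"))  # ['1', '*1', '*2', '0.01']
```

The same shape carries over to C++: one tokenizer pass over the line, then a second pass (or a find on ':') over each token.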
It sounds like your attempt is moving away from tokenisation and towards pattern matching, trying to extract subsequences of digits from the input string. That's fine (and may be a use case for regular expressions), though it doesn't match the sample output you've provided, since neither * nor . is a digit.
I would probably stick with the two-phase tokenisation, myself, though a regular expression for your use case may look a bit like this:
Pattern: /(\*)?(\d+(?:\.\d+)?)(?::0)?(?:\s+|$)/g
Input:   "1 *1:0 *2:0 0.01"

       | Captures:
       +-----+-------
Match: |  A  |  B
-------+-----+-------
#1     |     | 1
#2     |  *  | 1
#3     |  *  | 2
#4     |     | 0.01
(Disclaimer: we don't know enough about the input syntax and about your expectations to guarantee that this is accurate.)
I deliberately kept the '*' character in its own capture so that you can handle the numeric portion on its own without any further extraction from strings; that is, you could pass capture B to std::stod directly and use capture A == "*" as a boolean flag.
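To make the capture handling concrete, here is the same pattern applied in Python rather than C++. It is a sketch under the same caveats as the disclaimer above; the names PATTERN and parse are hypothetical, and float stands in for std::stod.

```python
import re

# Same pattern as above: optional '*', a number, optional ':0', then whitespace or end.
PATTERN = re.compile(r"(\*)?(\d+(?:\.\d+)?)(?::0)?(?:\s+|$)")

def parse(line):
    """Return (starred, value) pairs: capture A as a flag, capture B as a float."""
    return [(m.group(1) == "*", float(m.group(2))) for m in PATTERN.finditer(line)]

print(parse("1 *1:0 *2:0 0.01"))
# [(False, 1.0), (True, 1.0), (True, 2.0), (False, 0.01)]
```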

regular expression to represent two of the same vowel in a row

I am trying to write a regular expression that matches two of the same vowel in a row.
I know this pattern handles a, but what about e, i, o, u?
(a[aeiou]{2})
Should I write the pattern like this to match two of the same vowel?
(a[aeiou]{2}|e[aeiou]{2}|i[aeiou]{2}|o[aeiou]{2}|u[aeiou]{2})

You can simply use a backreference to a capturing group:
([aeiou])\1
See demo https://regex101.com/r/dI9kB9/1
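As a quick check of the backreference, here is a small Python sketch; the sample text and the names DOUBLED and doubled_vowels are made up for illustration. The group captures one vowel and \1 requires the very same vowel to follow.

```python
import re

# ([aeiou]) captures one vowel; \1 must then match that same vowel again.
DOUBLED = re.compile(r"([aeiou])\1")

def doubled_vowels(text):
    """Return every doubled vowel (e.g. 'oo') found in text."""
    return [m.group(0) for m in DOUBLED.finditer(text)]

print(doubled_vowels("book keeper aardvark queue"))  # ['oo', 'ee', 'aa']
```

Note that "queue" contributes nothing: its adjacent vowels are different from each other.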

Why not just do:
aa|ee|ii|oo|uu
The bar ( | ) is used for "or".
So this reads as:
aa OR ee OR ii OR oo OR uu
It is also known as "alternation".
See: http://www.regular-expressions.info/alternation.html
It has an example where you can search for dog|cat|mouse|fish, which I would read as "dog OR cat OR mouse OR fish".
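The alternation can be tried out the same way. A minimal Python sketch, with hypothetical names ALTERNATION and doubled_vowels_alt and a made-up sample string:

```python
import re

# Each alternative is one literal doubled vowel.
ALTERNATION = re.compile(r"aa|ee|ii|oo|uu")

def doubled_vowels_alt(text):
    """Return every doubled vowel matched by the alternation."""
    return ALTERNATION.findall(text)

print(doubled_vowels_alt("book keeper aardvark queue"))  # ['oo', 'ee', 'aa']
```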

Multiple character lookup within square brackets with regex

I’m using regex in JavaScript for certain text replacements to convert legacy-encoded text to Unicode (it’s an Indic language). Suppose that anywhere I find one of a, b, c followed by any of x, y, z followed by e, I have to replace it so that e comes first. So I have code like this:
modified_substring = modified_substring.replace( /([abc])([xyz]*)e/g , "e$1$2" ) ;
Now let us say I want to modify this rule to: a or b or c or klm, followed by any of x, y, z, followed by e. What would the code be?
modified_substring = modified_substring.replace( /([abc]klm)([xyz]*)e/g , "e$1$2" ) ;
That apparently doesn’t work. Is there a way to do this?

You need to use alternation operator |.
modified_substring = modified_substring.replace( /([abc]|klm)([xyz]*)e/g , "e$1$2" ) ;
                                                        ^
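For comparison, the same alternation behaves identically in Python's re.sub. This is a sketch with made-up input strings and a hypothetical function name move_e; Python writes the replacement groups as \1 and \2 instead of $1 and $2.

```python
import re

# ([abc]|klm): either one of a, b, c or the literal sequence klm.
RULE = re.compile(r"([abc]|klm)([xyz]*)e")

def move_e(text):
    """Move the trailing e in front of the matched prefix."""
    return RULE.sub(r"e\1\2", text)

print(move_e("axye"))   # eaxy
print(move_e("klmze"))  # eklmz
```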

How can I use a regular expression to match something in the form 'stuff=foo' 'stuff' = 'stuff' 'more stuff'

I need a regexp to match something like this,
'text' | 'text' | ... | 'text'(~text) = 'text' | 'text' | ... | 'text'
I just want to divide it up into two sections: the part on the left of the equals sign and the part on the right. Any of the 'text' entries can have "=" between the ' characters, though. I was thinking of trying to match an even number of ' characters followed by a =, but I'm not sure how to match an even number of something. Also note that I don't know how many entries there could be on either side. A couple of examples:
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33) = '51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
Should extract,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'QFN'(~51NL9637X33)
and,
'51NL9637X33' | 'ISL6262ACRZ-T' | 'INTERSIL' | 'QFN7SQ-HT1_P49' | '()'
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637) = '227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Should extract,
'227637' | 'SMTU2032_1' | 'SKT W/BAT'(~227637)
and,
'227637' | 'SMTU2032_1' | 'RENATA' | 'SKT28_5X16_1-HT5_4_P2' | '()' :SPECIAL_A ='BAT_CR2032', PART_NUM_A='202649'
Also note that the little tilde bit at the end of the first section is optional, so I can't just look for that.

Actually I wouldn't use a regex for that at all. Assuming your language has a split operation, I'd first split on the | character to get a list of:
'51NL9637X33'
'ISL6262ACRZ-T'
'QFN'(~51NL9637X33) = '51NL9637X33'
'ISL6262ACRZ-T'
'INTERSIL'
'QFN7SQ-HT1_P49'
'()'
Then I'd split each of them on the = character to get the key and (optional) value:
'51NL9637X33'          <null>
'ISL6262ACRZ-T'        <null>
'QFN'(~51NL9637X33)    '51NL9637X33'
'ISL6262ACRZ-T'        <null>
'INTERSIL'             <null>
'QFN7SQ-HT1_P49'       <null>
'()'                   <null>
You haven't specified why you think a regex is the right tool for the job but most modern languages also have a split capability and regexes aren't necessarily the answer to every requirement.

I agree with paxdiablo in that regular expressions might not be the most suitable tool for this task, depending on the language you are working with.
The question "How do I match an even number of characters?" is interesting nonetheless, and here is how I'd do it in your case:
(?:'[^']*'|[^=])*(?==)
This expression matches the left part of your entry by looking for a ' at its current position. If it finds one, it runs forward to the next ', thereby consuming quotes only in pairs (an even number of them). If it does not find a ', it matches any single character that is not an equals sign. The lookahead at the end then asserts that an equals sign follows the matched string. It works because the regex engine tries the alternatives from left to right.
You could get the left and right parts in two capturing groups by using
((?:'[^']*'|[^=])*)=(.*)
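To see the two capturing groups in action, here is a small Python sketch. The input string is a shortened, made-up variant of the examples in the question (it includes an = inside quotes), and the function name split_entry is hypothetical.

```python
import re

# Group 1: quoted strings or non-'=' characters; group 2: everything after the '='.
ENTRY = re.compile(r"((?:'[^']*'|[^=])*)=(.*)")

def split_entry(line):
    """Return (left, right) around the first '=' that is outside single quotes."""
    m = ENTRY.match(line)
    if m is None:
        raise ValueError("no unquoted '=' found")
    return m.group(1).strip(), m.group(2).strip()

print(split_entry("'a=b' | 'c'(~x) = 'd' | 'e=f'"))
# ("'a=b' | 'c'(~x)", "'d' | 'e=f'")
```

The strip calls only trim the whitespace around the equals sign; the pattern itself is unchanged from the one above.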
I recommend http://gskinner.com/RegExr/ for tinkering with regular expressions. =)

As paxdiablo said, you almost certainly don't want to use a regex here. The split suggestion isn't bad; I myself would probably use a parser here, since there's a lot of structure to exploit. The idea is that you formally specify the syntax of what you have, sort of like what you gave us, only rigorous. So, for instance:
a field is a sequence of non-single-quote characters surrounded by single quotes;
a fields is any number of fields separated by white space, a |, and more white space;
a tilde is non-right-parenthesis characters surrounded by (~ and );
an expr is a fields, optional whitespace, an optional tilde, a =, optional whitespace, and another fields.
How you express this depends on the language you are using. In Haskell, for instance, using the Parsec library, you write each of those parsers as follows:
import Text.ParserCombinators.Parsec

field :: Parser String
field = between (char '\'') (char '\'') $ many (noneOf "'\n")

tilde :: Parser String
tilde = between (string "(~") (char ')') $ many (noneOf ")\n")

fields :: Parser [String]
fields = field `sepBy` (try $ spaces >> char '|' >> spaces)

expr :: Parser ([String], Maybe String, [String])
expr = do left <- fields
          spaces
          opt <- optionMaybe tilde
          spaces >> char '=' >> spaces
          right <- fields
          (char '\n' >> return ()) <|> eof
          return (left, opt, right)
Understanding precisely how this code works isn't really important; the basic idea is to break down what you're parsing, express it in formal rules, and build it back up out of the smaller components. And for something like this, it'll be much cleaner than a regex.
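For readers who don't use Haskell, the same decomposition can be sketched by hand in Python. This is a loose translation under stated assumptions, not a port of Parsec: the names parse_fields and parse_expr are hypothetical, each small rule is a compiled regex, and unlike the Haskell version it returns any unparsed remainder instead of requiring end of input.

```python
import re

FIELD = re.compile(r"'([^'\n]*)'")    # field: text between single quotes
SEP = re.compile(r"\s*\|\s*")         # separator between fields
TILDE = re.compile(r"\(~([^)\n]*)\)") # tilde: text between (~ and )
WS = re.compile(r"\s*")               # optional whitespace

def parse_fields(text, pos):
    """Parse field ('|' field)* starting at pos; return (fields, new_pos)."""
    items = []
    m = FIELD.match(text, pos)
    while m:
        items.append(m.group(1))
        pos = m.end()
        sep = SEP.match(text, pos)
        if not sep:
            break
        m = FIELD.match(text, sep.end())
    return items, pos

def parse_expr(text):
    """Parse: fields, optional tilde, '=', fields. Return (left, tilde, right, rest)."""
    left, pos = parse_fields(text, 0)
    pos = WS.match(text, pos).end()
    tilde = None
    m = TILDE.match(text, pos)
    if m:
        tilde, pos = m.group(1), m.end()
    pos = WS.match(text, pos).end()
    if not text.startswith("=", pos):
        raise ValueError(f"expected '=' at offset {pos}")
    pos = WS.match(text, pos + 1).end()
    right, pos = parse_fields(text, pos)
    return left, tilde, right, text[pos:]

print(parse_expr("'a=b' | 'c'(~x) = 'd' | '()'"))
# (['a=b', 'c'], 'x', ['d', '()'], '')
```

Returning the remainder is a deliberate choice here, so that trailing text after the last field can be inspected by the caller rather than rejected.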
If you really want a regex, here you go (barely tested):
^\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?(\(~[^)\n]*\))?\s*=\s*('[^']*'((\s*\|\s*)'[^'\n]*')*)?\s*$
See why I recommend a parser? When I first wrote this, I got at least two things wrong which I picked up (one per test), and there's probably something else. And I didn't insert capturing groups where you wanted them because I wasn't sure where they'd go. Now yes, I could have made this more readable by inserting comments, etc. And after all, regexen have their uses! However, the point is: this is not one of them. Stick with something better.
