BigQuery REGEXP_MATCH and accents: boundary wildcard fails?

In GAS I can correctly match accents with regular expression having boundary characters, such as \bà\b. The character à is matched only when it is a separate word. This works in GAS:
function test_regExp() {
var str = "la séance est à Paris";
// Named "pattern" because a local variable called RegExp would shadow the RegExp constructor
var pattern = "\\bà\\b";
var patReg = new RegExp(pattern);
var found = patReg.exec(str);
if (found) {
Logger.log([str.substring(0, found.index), found[0], str.substring(found[0].length + found.index)]);
} else Logger.log("oops! Did not match");
}
In BigQuery, if boundary characters are next to accents the patterns do not match. \bséance\b matches séance:
SELECT [row],etext,ftext FROM [hcd.hdctextx] WHERE (REGEXP_MATCH(ftext,"\\bséance\\b") ) LIMIT 100;
\bà\b does not match à as a word:
SELECT [row],etext,ftext FROM [hcd.hdctextx] WHERE (REGEXP_MATCH(ftext,"\\bà\\b") ) LIMIT 100;
I'm assuming that BigQuery, unlike GAS, is including accents in the boundary character set. So \bséance\b works because é can function properly as a boundary in that configuration. \bà\b or \bétranger\b or \bmarché\b do not work because accent + \b is interpreted as \b\b, which never matches anything. (Ok, I'm grasping at straws here, because I can't find a better explanation....besides a bug.)
I don't think it is a unicode problem, because it only crops up at boundary positions.
So for the moment there is no way to use \b next to an accented character in those configurations.
Is there a way to set the Locale in BigQuery or other fix?
Workaround: substitute (?:[^a-zA-Zéàïëâê]) and so on for \b.
Thanks!

BigQuery's behavior is correct with respect to the RE2 syntax documentation. (No surprise, because BigQuery uses RE2 to implement regexps.)
The relevant RE2 definitions (empty-width assertions and Perl character classes) are:
\b = at word boundary (\w on one side and \W, \A, or \z on the other)
\w = word characters (≡ [0-9A-Za-z_])
\W = not word characters (≡ [^0-9A-Za-z_])
\A = beginning of text
\z = end of text
In other words, you can only use \b to match boundaries of non-accented characters. RE2 has plenty of support for Unicode characters, though, so you can most likely craft an alternative regexp using something like \pL.
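As a rough illustration of the point above, here is a minimal Python sketch. It is not BigQuery code: Python's re.ASCII flag is used only as a stand-in for RE2's ASCII-only \w, and the explicit boundary pattern (with [\W\d_] playing the role of RE2's \PL) is an assumption about how one might rewrite the expression.

```python
import re

text = "la séance est à Paris"

# re.ASCII limits \w (and therefore \b) to [0-9A-Za-z_], like the RE2 table above.
word_hit = re.search(r"\bséance\b", text, re.ASCII)
accent_hit = re.search(r"\bà\b", text, re.ASCII)

# Assumed alternative: spell the boundary out as "edge of text or a non-letter".
# [\W\d_] is Python's Unicode-aware stand-in for RE2's \PL (not a letter).
NON_LETTER = r"[\W\d_]"
explicit = re.compile(rf"(?:^|{NON_LETTER})(à)(?:{NON_LETTER}|$)")
explicit_hit = explicit.search(text)   # à as a separate word
inside_word = explicit.search("Voilà")  # à at the end of a longer word

print(bool(word_hit), bool(accent_hit), bool(explicit_hit), bool(inside_word))
# → True False True False
```

\bséance\b still matches because the characters next to each \b are the ASCII letters s and e; \bà\b fails because à itself counts as a non-word character, so there is no boundary on either side of it.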
I'm not sure why Google Apps Script doesn't follow the RE2 spec here, but I'll follow up with that team to figure out what's going on.

Check this out:
SELECT REGEXP_EXTRACT(StringToParse,r'\b?(à)\b?') as Extract,
REGEXP_MATCH(StringToParse,r'\b?(à)\b?') as match,
FROM
(SELECT 'la séance est à Paris' as StringToParse)
Hope this helps

The answer is: in BQ don't use \b with accents; rewrite the regular expression:
frenRegExp = frenRegExp.replace(/\\b/g, "(?:[- .,;!?()]|$|^)");
frenRegExp = frenRegExp.replace(/\\w/g, "(?:[A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ])");
frenRegExp = frenRegExp.replace(/\\W/g, "(?:[^A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ])");
Also, though the GAS specification has RE2 as its re engine (oops! I really don't know what it uses, since it does not exclude accented characters from \w like BQ), it is only partially implemented. For example \pL does not match a letter.
Here is some test code that works in Apps Script, but not in BQ without a substitution.
////////////////////// TEST ///////////////////
function test_regExp() {
var str = " Voilà la séance générale qui est à Paris";
var RegExpString ="\\bs\\w+an\\w*"
Logger.log(RegExpString);
var RegExpCompiled= new RegExp( RegExpString,"i");
Logger.log(RegExpCompiled.source);
var found=RegExpCompiled.exec(str);
if (found) {
Logger.log("|"+found[0]+"|")
Logger.log( [str.substring(0,found.index),found[0],str.substring(found[0].length+found.index)] );
} else Logger.log("Oops: not found");
}
Output:
[16-02-09 22:15:59:659 PST] \bs\w+an\w*
[16-02-09 22:15:59:660 PST] \bs\w+an\w*
[16-02-09 22:15:59:660 PST] |séance|
[16-02-09 22:15:59:661 PST] [ Voilà la , séance, générale qui est à Paris]
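For readers who want to try the \b, \w and \W substitutions outside Apps Script, here is a minimal Python sketch of the same idea. The function name rewrite_for_ascii_engine is hypothetical, and re.ASCII is used only to imitate an engine whose \w is limited to ASCII. The plain string replacement would also rewrite an escaped backslash followed by b, so treat it as a sketch rather than a full pattern translator.

```python
import re

# Letter set copied from the substitutions in this answer.
LETTERS = "A-Za-zÀàÂâÄäÆæÇçÈèÉéÊêËëÎîÏïÔôÙùÛûÜüñ"


def rewrite_for_ascii_engine(pattern: str) -> str:
    """Replace \\b, \\w and \\W with explicit French-aware groups."""
    pattern = pattern.replace(r"\b", "(?:[- .,;!?()]|$|^)")
    pattern = pattern.replace(r"\w", f"(?:[{LETTERS}])")
    pattern = pattern.replace(r"\W", f"(?:[^{LETTERS}])")
    return pattern


text = " Voilà la séance générale qui est à Paris"
rewritten = rewrite_for_ascii_engine(r"\bs\w+an\w*")

# The rewritten pattern no longer depends on the engine's idea of a word character.
found = re.search(rewritten, text, re.ASCII | re.IGNORECASE)
print(repr(found.group(0)))  # → ' séance'
```

Unlike \b, the replacement group consumes the delimiter, so the match includes the leading space; strip it or wrap the word in a capturing group if only the word is needed.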

Related

Remove spaces (apostrophes) around quotes with regex in ruby

I'm trying to remove all spaces around quotes with one Ruby regex. (not the same question as this)
Input: l' avant ou l 'après ou encore ' maintenant'
Output: l'avant ou l'après ou encore 'maintenant'
What I tried:
(/'\s|\s'/, '')
It's matching a few cases, but not all.
How can I do this? Thanks.
TLDR:
I assume the spaces were inserted by some automation software and there can only be single spaces around the words.
s = "l' avant ou l 'apres ou encore ' maintenant' ou bien 'ceci ' et ' encore de l ' huile ' d 'accord d' accord d ' accord Je n' en ai pas .... s ' entendre Je m'appelle Victor"
first_rx = /(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i
# If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc],
# i.e. first letters of word that are usually contracted
second_rx = /\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/
puts s.gsub(first_rx, "'")
.gsub(second_rx) { $~[1] ? "'#{$~[1]}'" : "" }
Output:
l'avant ou l'apres ou encore 'maintenant' ou bien 'ceci' et 'encore de l'huile' d'accord d'accord d'accord Je n'en ai pas .... s'entendre Je m'appelle Victor
Explanation
The problem is really complex. There are several words that can be abbreviated and used with an apostrophe in French, de, le/la, ne, se, me, te, ce to name a few, but these are all consonants. You may remove all spaces between a single, standalone consonant, apostrophe and the next word using
s.gsub(/(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i, "'")
If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc], i.e. first letters of word that are usually contracted. See the regex demo.
Next step is to remove spaces after initial and before trailing apostrophes. This is tricky:
s.gsub(/\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/) { $~[1] ? "'#{$~[1]}'" : "" }
where \b'\b is meant to match all apostrophes in between word chars, those that we fixed at the previous step. See this regex demo. As there is no (*SKIP)(*F) support in Onigmo regex, the regex is a bit simplified but the replacement is a conditional one: if Group 1 matched, replace with ' + Group 1 value ($1) + ', else, replace with an empty string (since \K reset the match, dropped all text from the match memory buffer).
NOTE: this approach can be extended to handle some specific cases like aujourd'hui, too.
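To remove all whitespace around the ', use gsub!, applied in several steps for proper whitespace removal:
str = "l' avant ou l 'apres ou encore ' maintenant'"
# Call gsub! one statement at a time: it returns nil when nothing was replaced, so chaining the calls can raise NoMethodError
str.gsub!(/\b'\s+\b/, "'")
str.gsub!(/\b\s+'\b/, "'")
str.gsub!(/\b(\s+')\s+\b/, '\1')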
puts str
# l'avant ou l'apres ou encore 'maintenant'
Here,
\b : word boundary,
\s+ : 1 or more whitespace,
string.gsub!(regex, replacement_string) : replace in the string argument regex with specified replacement_string (during this, the original string is changed),
\1 : in the replacement string, this refers to the first group captured in parenthesis in the regex: (...).
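The same three passes can be sketched in Python, where re.sub always returns a string, so each step can feed the next. This is only a port of the patterns above for experimentation, not a claim that it handles every French contraction.

```python
import re

s = "l' avant ou l 'apres ou encore ' maintenant'"

# Pass 1: apostrophe glued to the left word, spaces before the right word.
step1 = re.sub(r"\b'\s+\b", "'", s)
# Pass 2: spaces after the left word, apostrophe glued to the right word.
step2 = re.sub(r"\b\s+'\b", "'", step1)
# Pass 3: opening quote with spaces on both sides; keep the spaces before it.
cleaned = re.sub(r"\b(\s+')\s+\b", r"\1", step2)

print(cleaned)  # → l'avant ou l'apres ou encore 'maintenant'
```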
So if you have a lot of data like this, all the answers I have seen are wrong and will not work. No regex can guess whether the preceding word should have a space or not, unless you come up with a list of words (or patterns) that either do or don't.
The problem is, sometimes a space should be left, sometimes not. The only way to script that is to find a pattern which describes when the space should be there, or when not. You must teach your regex French grammar. It may be possible lol. But probably not, or difficult.
If this is a one off, my advice is to create regexes for 2 or 3 different situations, and use something like vim, to go through the data, and select manually yes or no to substitute each occurrence.
There may be some cases you can run - eg remove all spaces to the right of quotes? - but unfortunately I don't think you can automate this process.
I believe the following should work for you
s.gsub(/'.*?'/){ |e| "'#{e[1...-1].strip}'" }
The regex portion lazy matches all text within single quotes (including quotes). Then, for each match you substitute for the quoted text with leading and trailing whitespace removed, and return this text in quotes.

Split sentence to words containing apostrophe

Suppose I have a group of words forming a sentence like this:
Aujourd'hui séparer l'élément en deux
And I want the result to be individual words (after the split):
Aujourd'hui | séparer | l' | élément | en | deux
Note: as you can see, « aujourd'hui » is a single word.
What would be the best regex to use here?
With my current knowledge, all I can achieve is this basic operation:
QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" ");
Output :
Aujourd'hui / séparer / l'élément / en / deux
Here are the two questions closest to mine : this and this.
Since the contractions that you want to consider as separate words are usually a single letter + an apostrophe in French (like l'huile, n'en, d'accord) you can use a pattern that either matches 1+ whitespace chars, or a location that is immediately preceded with a start of a word, then 1 letter and then an apostrophe.
I also suggest taking into account curly apostrophes. So, use
\s+|(?<=\b\p{L}['’])\b
See the regex demo.
Details
\s+ - 1+ whitespaces
| - or
(?<=\b\p{L}['’])\b - a word boundary (\b) location that is preceded with a start of word (\b), a letter (\p{L}) and a ' or ’.
In Qt, you may use
QStringList result = text.split(
QRegularExpression(R"(\s+|(?<=\b\p{L}['’])\b)",
QRegularExpression::PatternOption::UseUnicodePropertiesOption)
);
The R"(...)" is a raw string literal notation, you may use "\\s+|(?<=\\b\\p{L}['’])\\b" if you are using a C++ environment that does not allow raw string literals.
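To experiment with the pattern outside Qt, here is a minimal Python sketch. Python's re module has no \p{L}, so [^\W\d_] is used as an assumed equivalent for "any letter", and splitting on the empty match requires Python 3.7 or later.

```python
import re

sentence = "Aujourd'hui séparer l'élément en deux"

# [^\W\d_] stands in for \p{L}: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

# Split on whitespace, or at the word boundary right after "<one letter>'"
# when that single letter itself starts a word.
splitter = re.compile(rf"\s+|(?<=\b{LETTER}['’])\b")
words = splitter.split(sentence)

print(words)
# → ["Aujourd'hui", 'séparer', "l'", 'élément', 'en', 'deux']
```

Aujourd'hui stays whole because the d before its apostrophe is not at the start of a word, so the lookbehind fails there.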
Not sure if I understood what you are saying but this might help you
QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" '");
I don't know C++ but I guess it supports negative lookbehind.
Have a try with:
(?: |(?<!\w{2})')
This will split on a space, or on an apostrophe that is not preceded by 2 letters.
Demo & explanation
Well, you're dealing with a natural language, here, and the first - and toughest - problem to answer is: Can you actually come up with a fixed rule, when splits should happen? In this particular case, there is really no logical reason, why French considers "aujourd'hui" as a single word (when logically, it could be parsed as "au jour de hui").
I'm not familiar with all the possible pitfalls in French, but if you really want to make sure to cover all obscure cases, you'll have to look for a natural language tokenizer.
Anyway, for the example you give, it may be good enough to use a QRegularExpression with negative lookbehind to omit splits when more than one letter precedes the apostrophe:
sentence.split(QRegularExpression("(?<![\\w][\\w])'"));

How can I use RegEx to remove certain words from a string

I need to clean some cells and only keep important words to generate a search index.
Eg. "How to make an account recovery request" would be trimmed to "Make Account Recovery Request" because "How, To, An" would be in a list of words to be filtered out.
The other complexity is that it will also be in French and Spanish, which means that I have to deal with part-word like d'.
So far I've been trying to use a function but it doesn't work with part-word (d') and if "de" and "des" are listed in the same cell, it will remove DE from DES and then only keep the lonely S because DES is not recognized anymore:
Function ClearWords(s As String, rWords As Range) As String
Static RX As Object
If RX Is Nothing Then
Set RX = CreateObject("VBScript.RegExp")
RX.Global = True
RX.IgnoreCase = True
End If
RX.Pattern = "\b" & Replace(Join(Application.Transpose(rWords), "|"), ".", "\.") & "\b"
ClearWords = Application.Trim(RX.Replace(s, ""))
End Function
If you plan to support English, French, and other European languages you may leverage the regex I posted at Regular expression not working for at least one European character, (?![×÷])[A-Za-zÀ-ÿ]. This is a pattern that is supposed to match all the alphabetic chars you need to support. Since you are going to use it in VBA, it makes sense to replace literal extended letters with \uXXXX entities, and convert it to a single character class, [A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF] ([A-Za-zÀ-ÖØ-öø-ÿ] with literal chars).
Now, you need to build the custom boundaries. The initial boundary is either start of the string, ^, or any char other than the letters above (and possibly digits, and _, if you want to fully emulate \b). Since you want to replace, you need to put these two patterns into a (^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]) capturing group and use $1 in the replacement pattern to restore the value in order not to lose it. The trailing boundary is any char other than the letters above (or digits / _) or the end of the string. Since VBA regex supports lookaheads, we may just use a negative lookahead, (?![A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]).
Putting it together:
RX.Pattern = "(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])(?:" & Replace(Join(Application.Transpose(rWords), "|"), ".", "\.") & ")(?![A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])"
ClearWords = Application.Trim(RX.Replace(s, "$1"))
See this regex demo.
To also remove spaces before, replace "(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF])(?:" with (?:\s+|(^|[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]))(?:. See this regex demo.
Bonus: you seem to need to escape the words to use them in a regex:
Dim regExEscape As New RegExp
With regExEscape
.pattern = "[-/\\^$*+?.()|[\]{}]"
.Global = True
.MultiLine = False
End With
Just make sure you process all words you have instead of Replace(Join(Application.Transpose(rWords), "|"), ".", "\.").
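The custom-boundary idea can be tried quickly in Python before wiring it into VBA. This is a minimal sketch under stated assumptions: clear_words is a hypothetical name mirroring ClearWords, re.escape plays the role of the escaping step from the bonus, and whitespace is normalised with split/join instead of Application.Trim.

```python
import re

# Same letter ranges as the VBA character class above.
LETTERS = "A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"


def clear_words(text, stop_words):
    """Remove whole stop words using explicit letter-based boundaries."""
    alternation = "|".join(re.escape(word) for word in stop_words)
    pattern = re.compile(
        rf"(^|[^{LETTERS}])(?:{alternation})(?![{LETTERS}])",
        re.IGNORECASE,
    )
    # \1 restores the character consumed by the leading boundary group.
    without_words = pattern.sub(r"\1", text)
    return " ".join(without_words.split())


print(clear_words("How to make an account recovery request", ["how", "to", "an"]))
# → make account recovery request
print(clear_words("liste des villes de France", ["de", "des"]))
# → liste villes France
```

Because the trailing lookahead rejects "de" when a letter follows, the engine backtracks to "des", so listing both words no longer leaves a stray "s".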

Building a Regex String - Any assistance provided

I'm very new to regex. I understand its purpose, but I'm still struggling to fully comprehend how to use it. I'm trying to build a regex to pull A8OP2B out of the following (or whatever gets dumped in that 5th group).
{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}
The other items in above line, will change in character length, so I cannot say the 51st to the 56th character. It will always be the 5th group in quotation marks though that I want to pull out.
I've tried building various regex strings, but it's still mostly a foreign language to me and I still have much reading to do on it.
Could anyone provide me a working example with the above, so I can reverse engineer and understand better?
Thanks
Demo 1: Reference the JSON to a var, then use either dot or bracket notation.
Demo 2: Using RegEx is not recommended, but here's one in JavaScript:
/\b(\w{6})(?=","RfKey":)/g
First Match
non-consuming match: :"A
meta border: \b: A non-word=:, any char=", and a word=A
consuming match: A8OP2B
begin capture: (, Any word =\w, 6 times={6}
end capture: )
non-consuming match: ","RfKey":
Look ahead: (?= for: ","RfKey": )
Demo 1
var obj = {"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}};
var dataDot = obj.RfReceived.Data;
var dataBracket = obj['RfReceived']['Data'];
console.log(dataDot);
console.log(dataBracket)
Demo 2
Note: This is consuming a string of 3 consecutive patterns. 3 matches are expected.
var rgx = /\b(\w{6})(?=","RfKey":)/g;
var str = `{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}},{"RfReceived":{"Sync":8080,"Low":102,"High":1200,"Data":"PFN07U","RfKey":"None"}},{"RfReceived":{"Sync":7580,"Low":471,"High":360,"Data":"XU89OM","RfKey":"None"}}`;
var res = str.match(rgx);
console.log(res);
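Both approaches can also be sketched in Python. The regex here is an assumption that differs from the one above: it keys on the "Data" field name instead of a fixed length of 6 word characters, so it keeps working if the value changes length.

```python
import json
import re

raw = '{"RfReceived":{"Sync":9480,"Low":310,"High":950,"Data":"A8OP2B","RfKey":"None"}}'

# Preferred: parse the JSON and read the field by name.
data_value = json.loads(raw)["RfReceived"]["Data"]

# Fallback: capture whatever sits between the quotes after "Data":
regex_value = re.search(r'"Data"\s*:\s*"([^"]*)"', raw).group(1)

print(data_value, regex_value)  # → A8OP2B A8OP2B
```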

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to an incorrect output, because
12\x
still gets detected as a correct string even though the order is wrong. It needs to be in the order specified below, and in no other order.
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!
To match the \, followed with x, followed with 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To force it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [0-9A-Fa-f] to [a-f0-9] if we add a case insensitive modifier (?i) to the start of the pattern, or just use \p{XDigit}, which matches any hexadecimal digit in a Java regex.
The String#matches method requires the whole string to match the pattern, so we do not need the leading ^ and trailing $ anchors when using the pattern inside it.
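The extraction and validation steps can be checked against the four sample strings from the question with a short Python sketch. The helper names extract_blocks and is_valid are hypothetical, and re.fullmatch plays the role of the ^...$ anchors.

```python
import re

# One "block": a backslash, the letter x, then exactly two hex digits.
BLOCK = r"\\x[0-9A-Fa-f]{2}"


def extract_blocks(s):
    """Return every non-overlapping block found anywhere in s."""
    return re.findall(BLOCK, s)


def is_valid(s):
    """True only if the whole string is one or more blocks."""
    return re.fullmatch(rf"(?:{BLOCK})+", s) is not None


samples = [r"\xyz\yzx", r"x\12", r"12\x", r"\x12\x13\x14\x00\xff\xff"]
for sample in samples:
    print(sample, is_valid(sample))
# Only the last sample prints True.
```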