Split sentence to words containing apostrophe - c++

Suppose I have a sentence like this:
Aujourd'hui séparer l'élément en deux
I want to split it into individual words:
Aujourd'hui | séparer | l' | élément | en | deux
Note : as you can see, « aujourd'hui » is a single word.
What would be the best regex to use here?
With my current knowledge, all I can achieve is this basic operation:
QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" ");
Output :
Aujourd'hui / Séparer / l'élément / en / deux
Here are the two questions closest to mine : this and this.

Since the contractions that you want to consider as separate words are usually a single letter + an apostrophe in French (like l'huile, n'en, d'accord) you can use a pattern that either matches 1+ whitespace chars, or a location that is immediately preceded with a start of a word, then 1 letter and then an apostrophe.
I also suggest taking into account curly apostrophes. So, use
\s+|(?<=\b\p{L}['’])\b
See the regex demo.
Details
\s+ - 1+ whitespaces
| - or
(?<=\b\p{L}['’])\b - a word boundary (\b) location that is preceded with a start of word (\b), a letter (\p{L}) and a ' or ’.
In Qt, you may use
QStringList result = text.split(
QRegularExpression(R"(\s+|(?<=\b\p{L}['’])\b)",
QRegularExpression::PatternOption::UseUnicodePropertiesOption)
);
R"(...)" is raw string literal notation; you may use "\\s+|(?<=\\b\\p{L}['’])\\b" instead if your C++ environment does not allow raw string literals.
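For readers who want to try the pattern outside Qt, here is a minimal Python sketch of the same split. It assumes Python 3.7 or later (where re.split also splits on zero-width matches) and uses [^\W\d_] as a stand-in for \p{L}, because the standard re module has no Unicode property classes. The helper name split_french is made up for this illustration.

```python
import re

# Stand-in for \p{L}: a word character that is neither a digit nor "_".
# (Assumption: Python's built-in `re` has no \p{L}, so this approximates it.)
LETTER = r"[^\W\d_]"

# Split on whitespace, or at the word boundary that directly follows
# "<start of word><one letter><apostrophe>".
SPLIT_RX = re.compile(rf"\s+|(?<=\b{LETTER}['’])\b")


def split_french(sentence):
    """Hypothetical helper: split on spaces and after one-letter elisions."""
    return [part for part in SPLIT_RX.split(sentence) if part]


print(split_french("Aujourd'hui séparer l'élément en deux"))
# ["Aujourd'hui", 'séparer', "l'", 'élément', 'en', 'deux']
```

"Aujourd'hui" stays whole because the letter before its apostrophe is not at the start of a word, so the lookbehind fails there.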

Not sure if I understood what you are saying but this might help you
QString sentence("Aujourd'hui séparer l'élément en deux");
QStringList list = sentence.split(" '");

I don't know C++ but I guess it supports negative lookbehind.
Have a try with:
(?: |(?<!\w{2})')
This will split on a space, or on an apostrophe if there are not 2 letters before it.
Demo & explanation

Well, you're dealing with a natural language, here, and the first - and toughest - problem to answer is: Can you actually come up with a fixed rule, when splits should happen? In this particular case, there is really no logical reason, why French considers "aujourd'hui" as a single word (when logically, it could be parsed as "au jour de hui").
I'm not familiar with all the possible pitfalls in French, but if you really want to make sure to cover all obscure cases, you'll have to look for a natural language tokenizer.
Anyway, for the example you give, it may be good enough to use a QRegularExpression with negative lookbehind to omit splits when more than one letter precedes the apostrophe:
sentence.split(QRegularExpression("(?<![\\w][\\w])'"));
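To see what the negative lookbehind does on the example sentence, here is a small Python sketch (not Qt). The second pattern, which adds whitespace splitting, is an addition for illustration and not part of the answer above.

```python
import re

sentence = "Aujourd'hui séparer l'élément en deux"

# Pattern from the answer: split at "'" unless two word characters precede it.
apostrophe_only = re.split(r"(?<!\w\w)'", sentence)
print(apostrophe_only)
# ["Aujourd'hui séparer l", 'élément en deux']

# Same rule combined with whitespace splitting (illustrative addition).
combined = re.split(r"\s+|(?<!\w\w)'", sentence)
print(combined)
# ["Aujourd'hui", 'séparer', 'l', 'élément', 'en', 'deux']
```

Note that splitting on the apostrophe itself drops it from the result ("l" rather than "l'"), which differs from the output requested in the question.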


Vi: Substitution pattern SQL file. Issue with the Regex

I have to modify a SQL file with vi to delete columns that we do not use. As we have a lot of data, I use the search and replace option with a Regex Pattern.
For instance we have :
(1,2956,2026442,4,NULL,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,
'9999','EVREUX',NULL,1,'27229',NULL,NULL,NULL,NULL,NULL,' Rue DU LUXEMBOURG, 9999 EVREUX',NULL,NULL,NULL,NULL,
NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'2020-07-08 16:34:40',NULL,NULL)
So we have 40 columns and I keep 13 of them. My regex is :
(1),2,(3),4-5,(6-14),15-22,(23),24-39,(40)
:%s/(\(.\{-}\),.\{-},\(.\{-}\),.\{-},.\{-},\(.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-}\),.\{-},
.\{-}, .\{-},.\{-},.\{-},.\{-},.\{-},.\{-},\(.\{-}\),.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},.\{-},
.\{-},.\{-},.\{-},.\{-},.\{-},\(.\{-}\))/(\1,\2,\3,\4,\5)/g
I put the parts that interest me in capture groups (only the column numbers shown in parentheses on the line above my regex are kept). Then, in the replacement, I recover these groups.
So normally my result is supposed to be :
(1,2026442,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,
'9999','EVREUX',' Rue DU LUXEMBOURG, 9999 EVREUX',NULL)
But because there is a comma (,) in ' Rue DU LUXEMBOURG, 9999 EVREUX', my result becomes :
(1,2026442,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,'Rue DU LUXEMBOURG',NULL,'9999','EVREUX',' Rue DU LUXEMBOURG',NULL,NULL)
Can someone who is good with regex help me? Thanks in advance. If I wasn't clear, tell me and I will try to explain better.
I suggest matching fields that can be strings with a %('[^']*'|\w*) pattern, that is, a non-capturing group that finds either ' + zero or more non-'s and then a ' char, or any zero or more alphanumeric characters.
Also, the use of non-capturing groups (in Vim, it is %(...) in very magic mode, or \%(...\) in a regular mode) and very magic mode can help shorten the pattern.
The whole pattern will look like
:%s/\v\(([^,]*),[^,]*,([^,]*),[^,]*,[^,]*,(%('[^']*'|\w*)%(,%('[^']*'|\w*)){8})%(,%('[^']*'|\w*)){8},('[^']*'|\w*)%(,%('[^']*'|\w*)){16},([^,]*)\)/(\1,\2,\3,\4,\5)/g
See the regex demo converted to a PCRE regex.
Note some fields that are not strings are matched with [^,]* that matches zero or more chars other than a comma. The %(,%('[^']*'|\w*)){8} like patterns match (here) 8 occurrences of a sequence of a , char + '...' substring or zero or more word chars.
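The same idea can be checked outside Vim. Below is a minimal Python sketch that ports the pattern to Python's re syntax (plain (?:...) groups instead of %(...)) and runs it on the sample row from the question, written on one line. The names FIELD, ROW_RX and row are made up for this illustration.

```python
import re

# One field: either a single-quoted string (which may contain commas)
# or a run of word characters such as NULL or 2956.
FIELD = r"(?:'[^']*'|\w*)"

ROW_RX = re.compile(
    r"\(([^,]*),[^,]*,([^,]*),[^,]*,[^,]*,"  # keep 1, skip 2, keep 3, skip 4-5
    rf"({FIELD}(?:,{FIELD}){{8}})"           # keep 6-14
    rf"(?:,{FIELD}){{8}},"                   # skip 15-22
    rf"({FIELD})"                            # keep 23
    rf"(?:,{FIELD}){{16}},"                  # skip 24-39
    r"([^,]*)\)"                             # keep 40
)

row = ("(1,2956,2026442,4,NULL,NULL,'ZAC DU BOIS DES COMMUNES','',NULL,NULL,"
       "'Rue DU LUXEMBOURG',NULL,'9999','EVREUX',NULL,1,'27229',NULL,NULL,NULL,"
       "NULL,NULL,' Rue DU LUXEMBOURG, 9999 EVREUX',NULL,NULL,NULL,NULL,"
       "NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"
       "'2020-07-08 16:34:40',NULL,NULL)")

result = ROW_RX.sub(r"(\1,\2,\3,\4,\5)", row)
print(result)
```

Because column 23 is matched as a whole quoted string, the comma inside ' Rue DU LUXEMBOURG, 9999 EVREUX' no longer shifts the later columns.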

Remove spaces (apostrophes) around quotes with regex in ruby

I'm trying to remove all spaces around quotes with one Ruby regex. (not the same question as this)
Input: l' avant ou l 'après ou encore ' maintenant'
Output: l'avant ou l'après ou encore 'maintenant'
What I tried:
(/'\s|\s'/, '')
It's matching a few cases, but not all.
How to perform this ? Thanks.
TLDR:
I assume the spaces were inserted by some automation software and there can only be single spaces around the words.
s = "l' avant ou l 'apres ou encore ' maintenant' ou bien 'ceci ' et ' encore de l ' huile ' d 'accord d' accord d ' accord Je n' en ai pas .... s ' entendre Je m'appelle Victor"
first_rx = /(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i
# If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc],
# i.e. first letters of word that are usually contracted
second_rx = /\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/
puts s.gsub(first_rx, "'")
.gsub(second_rx) { $~[1] ? "'#{$~[1]}'" : "" }
Output:
l'avant ou l'apres ou encore 'maintenant' ou bien 'ceci' et 'encore de l'huile' d'accord d'accord d'accord Je n'en ai pas .... s'entendre Je m'appelle Victor
Explanation
The problem is really complex. There are several words that can be abbreviated and used with an apostrophe in French, de, le/la, ne, se, me, te, ce to name a few, but these are all consonants. You may remove all spaces between a single, standalone consonant, apostrophe and the next word using
s.gsub(/(?<=\b[b-df-hj-np-tv-z]) ' ?(?=\p{L})|(?<=\b[b-df-hj-np-tv-z]) ?' (?=\p{L})/i, "'")
If you find it overmatches, replace [b-df-hj-np-tv-z] with [dlnsmtc], i.e. first letters of word that are usually contracted. See the regex demo.
Next step is to remove spaces after initial and before trailing apostrophes. This is tricky:
s.gsub(/\b'\b\K|' *((?:\b'\b|[^'])+)(?<=\S) *'/) { $~[1] ? "'#{$~[1]}'" : "" }
where \b'\b is meant to match all apostrophes in between word chars, those that we fixed at the previous step. See this regex demo. As there is no (*SKIP)(*F) support in Onigmo regex, the regex is a bit simplified but the replacement is a conditional one: if Group 1 matched, replace with ' + Group 1 value ($1) + ', else, replace with an empty string (since \K reset the match, dropped all text from the match memory buffer).
NOTE: this approach can be extended to handle some specific cases like aujourd'hui, too.
To remove all whitespace around the ', use gsub, applied in several steps for proper whitespace removal. The non-destructive gsub is chained here rather than gsub!, because gsub! returns nil when it makes no substitution, which would break the chain on input where one of the steps finds nothing to replace:
str = "l' avant ou l 'apres ou encore ' maintenant'"
str = str.gsub(/\b'\s+\b/, "'").gsub(/\b\s+'\b/, "'").gsub(/\b(\s+')\s+\b/, '\1')
puts str
# l'avant ou l'apres ou encore 'maintenant'
Here,
\b : word boundary,
\s+ : 1 or more whitespace characters,
string.gsub(regex, replacement_string) : returns a copy of string in which every match of regex is replaced with replacement_string (gsub! would modify the original string in place instead),
\1 : in the replacement string, this refers to the first group captured in parentheses in the regex: (...).
So if you have a lot of data like this, all the answers I have seen are wrong and will not work. No regex can guess whether the preceding word should have a space or not, unless you come up with a list of words (or patterns) that either do or don't.
The problem is, sometimes a space should be left, sometimes not. The only way to script that is to find a pattern which describes when the space should be there, or when not. You must teach your regex French grammar. It may be possible lol. But probably not, or difficult.
If this is a one off, my advice is to create regexes for 2 or 3 different situations, and use something like vim, to go through the data, and select manually yes or no to substitute each occurrence.
There may be some cases you can run - eg remove all spaces to the right of quotes? - but unfortunately I don't think you can automate this process.
I believe the following should work for you
s.gsub(/'.*?'/){ |e| "'#{e[1...-1].strip}'" }
The regex portion lazy matches all text within single quotes (including quotes). Then, for each match you substitute for the quoted text with leading and trailing whitespace removed, and return this text in quotes.
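For comparison, here is a minimal Python sketch of the same pair-and-strip idea, with a lambda playing the role of the Ruby block. The helper name strip_inside_quotes is made up for this illustration.

```python
import re


def strip_inside_quotes(text):
    """Hypothetical helper: trim whitespace just inside each '...' pair."""
    return re.sub(r"'.*?'", lambda m: "'" + m.group()[1:-1].strip() + "'", text)


s = "l' avant ou l 'après ou encore ' maintenant'"
print(strip_inside_quotes(s))
# l'avant ou l'après ou encore 'maintenant'
```

On this input the first "pair" runs from the apostrophe of l' to the one before après, so the expected output comes from positional pairing rather than from any knowledge of which apostrophes are quotes. An unpaired apostrophe earlier in the text would shift every later pair.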

regex doesn't match last word

I have this simple regex:
RegEx_Seek_1 := TDIPerlRegEx.Create{$IFNDEF DI_No_RegEx_Component}(nil){$ENDIF};
s1 := '(doesn''t|don''t|can''t|cannot|shouldn''t|wouldn''t|couldn''t|havn''t|hadn''t)';
// s1 contains this text: (doesn't|don't|can't|cannot|shouldn't|wouldn't|couldn't|havn't|hadn't)
RegEx_Seek_1.MatchPattern := '(*UCP)(?m)'+s1+' (a |the )(ear|law also|multitude|son)(?(?= of)( \* | \w+ )| )([^»Ô¶ ][^ »Ô¶]\w*)';
It is meant to find a noun with an article, which may be followed by of. If there is an of, then I need to search for the noun \w+ (and \* too; a substitute for a verb). The last word should be the verb.
The sample text:
. some text . Doesn't the ear try ...
. some text doesn't the law also say ...
. some text doesn't the son bear ...
. some text . Shouldn't the multitude of words be answered? ...
. some text . Why doesn't the son of * come to eat ...
My results:
Doesn't the ear try
doesn't the law also say
doesn't the son bear
Shouldn't the multitude of words
And it fails to get the last sentence:
doesn't the son of * come
My plan is to add \K before the last word to get the verb.
The exclusion of the characters:
[^»Ô¶] is there because », Ô and ¶ are already used as marks in the text to flag an existing verb. They may or may not be present. I am using spaces; tabs are delimiters and are not part of any sentence.
In this regex I included a space [^»Ô¶ ] to get the last word.
So the question is how to correct the regex to get one more line:
doesn't the son of * come
Edit:
I need to refer the verbs in the same group while replacing (I will refer to verb).
Your mistake is in (?(?= of)( \* | \w+ )| ).
Remember that lookaheads don't move the cursor forward, so the ( \* | \w+ ) will match of , so the remainder is now * come which can't be matched by ([^»Ô¶ ][^ »Ô¶]\w*) as the second character is a space.
I guess you should match the of already in your condition, like (?(?= of) of( \* | \w+ )| )
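The point about lookaheads not consuming text can be checked in isolation. The sketch below uses Python's re module rather than TDIPerlRegEx, and only the lookahead plus the group that follows it, applied to the tail of the failing sentence. The variable names are made up for this illustration.

```python
import re

tail = " of * come"

# Lookahead only: " of" is checked but not consumed, so the group swallows it.
unfixed = re.match(r"(?= of)( \* | \w+ )", tail)
print(repr(unfixed.group(1)), repr(tail[unfixed.end():]))
# ' of ' '* come'

# " of" consumed explicitly: the group is now free to match " * ".
fixed = re.match(r"(?= of) of( \* | \w+ )", tail)
print(repr(fixed.group(1)), repr(tail[fixed.end():]))
# ' * ' 'come'
```

In the first case the text left for the verb group starts with "* ", which ([^»Ô¶ ][^ »Ô¶]\w*) cannot match; in the second case it starts with "come".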
I modified Wiktor's pattern to match:
(*UCP)(?m)'+s1+' (a |the )(ear|law also|multitude|son)(?:\s+of Words|\s+of \*)*\s+\K(?P<verb>[^\s»Ô¶]+)
Now I can refer to the last group like this:
char(182)+'$<verb>'
Here are my results, showing how the verb was marked using the Replace2 function of TDIRegEx. As you can see, it works:
Why doesn't the son of * ¶come to eat
Doesn't the ear ¶try words,
Why doesn't the son ¶bear the
doesn't the law also ¶say the same thing?
Shouldn't the multitude of words ¶be answered?
Both answers, the one from Wiktor and the one from Sebastian, helped me to solve the question. Thank you.

How to use regex correctly with this REGEXEXTRACT function?

I'm dealing with a large spreadsheet of order data which uses a unique, sort of hashed string of syntax to represent order items and attributes.
I currently have this data in Google Sheets and I'm hoping to be able to make use of the REGEXEXTRACT function (https://support.google.com/docs/answer/3098244) to retrieve the pieces of information I need from each row.
Example of function: REGEXEXTRACT("Needle in a haystack", ".e{2}dle")
Order data is huge and I believe I can use this regex function to isolate the piece of information I want.
Examples of the portion of string snippets I need to work with. Keeping in mind the actual order string is much longer than this:
"Location\";s:5:\"value\";s:7:\"Atlanta\"
"Location\";s:2:\"value\";s:8:\"New York\"
"Location\";s:5:\"value\";s:15:\"barrio de boedo\"
So the common string in each row is Location, as value is used multiple times throughout each order.
Let me see if I can articulate this correctly: How would I use regex to specify the value between the 4th and 5th double quotes occurring immediately after the string 'Location' so that in my examples above, the results would be Atlanta, New York, barrio de boedo?
For reference, the barrio de boedo example in its entirety:
\";s:7:\"product\";s:2:\"31\";s:8:\"form_key\";s:16:\"aasdf\";s:7:\"options\";a:2:{i:1;s:1:\"2\";i:2;s:15:\"barrio de boedo\";}s:15:\"super_attribute\";a:2:{i:92;s:1:\"4\";i:132;s:1:\"9\";}s:3:\"qty\";s:1:\"1\";}s:7:\"options\";a:2:{i:0;a:7:{s:5:\"label\";s:15:\"Language-Gender\";s:5:\"value\";s:8:\"spa-male\";s:11:\"print_value\";s:8:\"spa-male\";s:9:\"option_id\";s:1:\"1\";s:11:\"option_type\";s:9:\"drop_down\";s:12:\"option_value\";s:1:\"2\";s:11:\"custom_view\";b:0;}i:1;a:7:{s:5:\"label\";s:8:\"Location\";s:5:\"value\";s:15:\"barrio de boedo\";s:11:\"print_value\";s:15:\"barrio de boedo\";s:9:\"option_id\";s:1:\"2\";s:11:\"option_type\";s:5:\"field\";s:12:\"option_value\";s:15:\"barrio de boedo\";s:11:\"custom_view\";b:0;}}s:15:\"attributes_info\";a:2:{i:0;a:2:{s:5:\"label\";s:5:\"Color\";s:5:\"value\";s:4:\"Grey\";}i:1;a:2:{s:5:\"label\";s:4:\"Size\";s:5:\"value\";s:1:\"L\";}}s:11:\"simple_name\";s:14:\"T-Shirt-Grey-L\";s:10:\"simple_sku\";s:14:\"t-shirt-Grey-L\";s:20:\"product_calculations\";i:1;s:13:\"shipment_type\";i:0;}"
You need to use the following pattern in REGEXEXTRACT:
"Location\\?(?:""[^""]*){3}""([^""]+)\\"""
See the regex demo.
The pattern is Location\\?(?:"[^"]*){3}"([^"]+)\\" and it matches:
Location - a substring Location
\\? - 1 or 0 \ symbols (the ? makes a pattern optional)
(?:"[^"]*){3} - exactly 3 occurrences (due to the limiting quantifier {3}) of a " followed with zero or more (due to the * quantifier) chars other than " (the [^...] is a negated character class that matches any chars but those defined in the class)
" - a single double quote
([^"]+) - capturing group #1 (whose contents will be returned with REGEXEXTRACT): 1 or more (due to + quantifier) chars other than "
\\" - a \" substring.
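As a quick sanity check outside Google Sheets, the sketch below runs the same pattern with Python's re module on the three snippets from the question. Sheets uses the RE2 engine, so this only shows that the pattern logic holds; the names LOCATION_RX and snippets are made up for this illustration.

```python
import re

# Same pattern as above, without the doubled quotes that a Sheets string needs.
LOCATION_RX = re.compile(r'Location\\?(?:"[^"]*){3}"([^"]+)\\"')

snippets = [
    r'"Location\";s:5:\"value\";s:7:\"Atlanta\"',
    r'"Location\";s:2:\"value\";s:8:\"New York\"',
    r'"Location\";s:5:\"value\";s:15:\"barrio de boedo\"',
]

locations = [LOCATION_RX.search(s).group(1) for s in snippets]
print(locations)
# ['Atlanta', 'New York', 'barrio de boedo']
```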

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
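The same pattern can be tried in Python as a minimal sketch. A raw string is used so that the backslashes really are part of the text being searched; the names QUOTED_RX and source are made up for this illustration.

```python
import re

# A quote, then any run of "non-quote, non-backslash" characters or
# "backslash + any character" pairs, then the closing quote.
QUOTED_RX = re.compile(r'"(?:[^"\\]|\\.)*"')

source = r' function(){ return " It\'s big \"problem "; }'

match = QUOTED_RX.search(source)
print(match.group())
# " It\'s big \"problem "
```

Each \" is consumed as a pair by the \\. branch, so it can never be taken for the closing quote.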
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
1. Every quoted string starts with the character ".
2. It may contain any number of any characters: .*? {lazy match}, ending with a non-escape character [^\\].
3. Statement (2) is lazy(!) optional because the string can be empty (""). So: (.*?[^\\])??
4. Finally, every quoted string ends with the character ", but it can be preceded by an even number of escape signs, i.e. pairs (\\\\)+; this part is greedy(!) optional: ((\\\\)+)?+ {greedy matching}, because the string can be empty or have no trailing pairs.
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
The regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or ').
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
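The two steps translate directly to other languages. Here is a minimal Python sketch of the same idea; the helper name mask_quoted and the sample line are made up for this illustration.

```python
import re


def mask_quoted(line):
    """Hypothetical helper: blank out double-quoted strings in two steps."""
    # Step 1: turn every escaped quote \" into a plain apostrophe.
    line = line.replace('\\"', "'")
    # Step 2: no escaped quotes are left, so a simple pattern is enough.
    return re.sub(r'"[^"]*"', '"x"', line)


print(mask_quoted(r'log("say \"hi\" now") + "b"'))
# log("x") + "x"
```

One caveat under this approach: an escaped backslash directly before a closing quote (\\") would also be rewritten in step 1, so it suits input where that sequence does not occur.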
If your IDE is IntelliJ Idea, you can forget all these headaches and store your regex into a String variable and as you copy-paste it inside the double-quote it will automatically change to a regex acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"