Removing everything between 2 strings with Google sheets RE2 - regex

I'm trying to remove something from a product title as part of a Google sheet
Example Johner Gladstone Pinot Noir 2015, 75CL
Stella Artois Premium Lager Bottle, 1 X 660 Ml
Pepesza Ppsh-40 Vodka Tommy Gun, 1 L
And I want to be able to remove everything from the , and either the CL, ML or L.
The problem I'm running into is that I don't know enough about regex and I'm struggling to find a good place to learn!
What I've tried so far is below
=REGEXREPLACE(A2,"[, ]\QML|CL\E","")
but this doesn't work and I think its because [, ] isn't a valid part.
=REGEXREPLACE(A2,"\*\QML|CL\E","")
because I know that , is the only punctuation in the titles - I've also tried this but not been successful.

What you are trying to get is
(?i), .*?[CM]?L
See the regex demo. Details:
(?i) - case insensitive flag
, .*? - comma, space, and then any zero or more chars other than line break chars, as few as possible (due to *?, if you need as many as possible use * instead)
[CM]?L - C or M (optionally due to ?) and then an L char.
However, you can simply match from a , + space till the end of the line:
", .*
See this regex demo. Here, the first comma+space is matched and then the rest of the string (line, since . does not match line breaks by default).
See the regular expression syntax accepted by RE2.

Related

Regex to remove double lines ignoring the punctation marks and spaces in Notepad++ [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 6 months ago.
Is it possible to remove duplications with ignoring the punctation marks and spaces in Notepad++? I would keep one of them matching lines (doesn't matter which to keep).
My examples are from the txt file:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rough work, iconoclasm, but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Rule No.1: Never lose money. Rule No.2: Never forget rule No.1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
Self-esteem isn't everything it's just that there's nothing without it. Gloria Steinem
You said she's a senior? Babe we're all crazy.
You said, she's a senior! Babe we're ALL crazy.
You said, she's a senior? Babe we're ALL crazy!
Result I need:
Rough work iconoclasm but the only way to get the truth. Oliver Wendell Holmes
Rule No. 1: Never lose money. Rule No. 2: Never forget rule No. 1. Warren Buffett
Self-esteem isn't everything, it's just that there's nothing without it. Gloria Steinem
You said, she's a senior! Babe we're ALL crazy.
I can delete 100% matching duplications with regex, but can't find a regex rule to ignore spaces and marks.
I don't think regex is the best tool for this task, but it's a nice challenge. You can match single words using a nested structure like:
((\w+)\W+((\w+)\W+( ... ((\w+)\W+)? ... )?)?(\w*))
When matching this, capture groups 2 to n contain the words 1 to n-1 of a line. The nested structure is necessary to make it non-ambiguous - otherwise, running the regex takes too long.
To match the duplicate lines, we use a similar structure with back-references:
\1\W+(\2\W+( ... (\9\W+)? ... )?)?
This will also match lines that are substrings of the previous line, which is again helpful to improve performance.
Notice that you have to use the \g{n}-notation when using more than 9 references in Notepad++. Moreover, to avoid matching line breaks you should use [^\w\n\r] instead of \W. To further improve performance, unnecessary groups should be non-matching, i.e., (?: ... ).
To generate the rather long regex that solves the problem for, e.g., up to 20 words per line, you can use the following script:
MAX_WORDS = 20
punct = "[^\\w\\n\\r]"
backref = (i) => `\\g{${i}}`
patternKeep = (i) => "(\\w+)[^\\w\\n\\r]+" + (i < 0 ? "" : `(?:${patternKeep(i-1)})?`)
patternRemove = (i) => `${backref(MAX_WORDS-i + 2)}(?:${punct}+` + (i < 0 ? "" : patternRemove(i-1)) + ")?"
console.log("^(" + patternKeep(MAX_WORDS) + "(\\w*))(\\r?\\n" + patternRemove(MAX_WORDS)+ `${punct}*${backref(MAX_WORDS+4)}${punct}*)+$`)
When copying this to Notepad++ with settings "Wrap around" on and "Match case" off and replacing with $1, it will remove all duplicate lines in your example.
I doubt that it can be done purely with regular expressions. If it can then I imagine that the expression would be difficult to understand and difficult to maintain. Instead I would suggest a multi-step approach.
Step 1 - modify each line to be: original-line separator original-line.
Step 2 - convert it to be line-without-punctuation separator original-line.
Step 3 - sort the lines
Step 4 - remove duplicated lines
Step 5 - remove line-without-punctuation and separator leaving just the original line.
In more detail:
In all the replaces below: select "Wrap around", unselect "Dot matches newline", unselect "Match whole word only" and unselect "Match case".
Step 1 - choose a separator, some text that is not punctuation and does not occur in the file. Here I use qqq. Do a regular expression replace of ^(.+)$ with \1qqq\1.
Step 2 - remove any punctuation before the separator. Repeatedly do a regular expression replace of [!',-.:?]+(.*qqq) with \1 until no more replacements are made. This expression matches all the punctuation in the example, but you may need to add more for your full text. Also need to reduce multiple spaces to singles, so repeatedly do a regular expression replace of +(.*qqq) with \1 until no more replacements are made. One final step to handle spaces before the qqq do a regular expression replace of qqq with qqq (this could also use a non-regular expression replace).
Step 3 - sort the lines lexicographically.
Step 4 - remove duplicated lines. Repeatedly do a regular expression replace of ^(.*qqq).*\R\1 with \1 until no more replacements are made.
Step 5 - Remove unwanted text leaving the original line. Do a regular expression replace of ^.*qqq with nothing (the empty string).
If all punctuation can be deleted and the result being a line without punctuation then could simple do a regular expression replace of [!',-.:? ]+ with , a sort and finally a remove duplicates.
Previously this question attracted an answer, but the author deleted it. To me it was so interesting because a special technique was illustrated. In a comment the answerer pointed me towards another thread to read more about it.
After experimenting a bit with that answer, an idea was the following pattern. Settings in NP++ are to uncheck: [ ] match case, [ ] .matches newline - Replace with emptystring.
^(?>[^\w\n]*(\w++)(?=.*\R(\2?+[^\w\n]*\1\b)))+[^\w\n]*\R(?=\2[^\w\n]*$)
Here is the demo in Regex101 - Assumption is, that duplicate lines are consecutive (like sample).
Most of the used regex-tokens can be looked up in the Stack Overflow Regex FAQ.
In short words, the mechanism used is to capture words from one line to the first group (\w++) while inside the lookahead (?=.*\R(\2?+...\1\b))) a second group in the consecutive line is "growing" from itself plus the captures until \R(?=\2...$) it either matches all words or fails.
Illustration of some steps from the regex101 debugger:
The second group holds the substring of the consecutive line that matches words and order of the previous line. It expands at each repetition from optionally itself and a word from the previous line. Separated by [^\w\n]* any amount of characters that are not word characters or newline.
For making it work, matching is done without giving back at crucial points (prevent backtracking).

PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Introduction:
I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).
I've managed to get a regex (in a CHECK constraint) which disallows spaces within strings (e.g. "12 34") and also disallows preceding zeros ("00343").
Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.
Is this beyond the capacities of regular expressions?
My table is as follows:
CREATE TABLE test
(
data TEXT NOT NULL,
CONSTRAINT d_csv_only_ck
CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')
);
And I can populate it as follows:
INSERT INTO test VALUES
('992,1005,1007,992,456,456,1008'), -- want to make this line unnacceptable - repeats!
('44,1005,1110'),
('13, 44 , 1005, 10078 '), -- acceptable - spaces before and after integers
('11,1203,6666'),
('1,11,99,2222'),
('3435'),
(' 1234 '); -- acceptable
But:
INSERT INTO test VALUES ('23432, 3433 ,00343, 567'); -- leading 0 - unnacceptable
fails (as it should), and also fails (again, as it should)
INSERT INTO test VALUES ('12 34'); -- spaces within numbers - unnacceptable
The question:
However, if you notice the first string, it has repeats of 992and 456.
I would like to be able to match these.
All of these rules do not have to be in the same regex - I can use a second CHECK constraint.
I would like to know if what I am asking is possible using Regular Expressions?
I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.
Please let me know should you require any further information.
p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.
Since PostegreSQL regex does not support backreferences, you cannot apply this restriction because you would need a negative lookahead with a backreference in it.
Have a look at this PCRE regex:
^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$
See this regex demo.
Details:
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - no same two numbers as whole word allowed anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
, * - comma and zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
$ - end of string.
Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, replacing text is not limited as empty string.
substitutions such as $1 are possible too.
Using Regex in 'Kustom' apps. (including KLCK, KLWP, KWGT)
I don't know which engine it's using because there are no information about it
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - an optional (due to ? at the end - if it is missing, then the group is obligatory) capturing group matching 1 or 0 occurrences of 4 to 7 digits
| - or
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex is Java based since all non-matching groups are replaced with null:
So, the only possible solution is to use a second pass to post-process the results, just replace null with some kind of a delimiter, a newline for example.
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Combining 2 regular expressions

I have 2 strings and I would like to get a result that gives me everything before the first '\n\n'.
'1. melléklet a 37/2018. (XI. 13.) MNB rendelethez\n\nÁltalános kitöltési előírások\nI.\nA felügyeleti jelentésre vonatkozó általános szabályok\n\n1.
'12. melléklet a 40/2018. (XI. 14.) MNB rendelethez\n\nÁltalános kitöltési előírások\n\nKapcsolódó jogszabályok\naz Önkéntes Kölcsönös Biztosító Pénztárakról szóló 1993. évi XCVI. törvény (a továbbiakban: Öpt.);\na személyi jövedelemadóról szóló 1995. évi CXVII.
I have been trying to combine 2 regular expressions to solve my problem; however, I could be on a bad track either. Maybe a function could be easier, I do not know.
I am attaching one that says that I am finding the character 'z'
extended regex : [\z+$]
I guess finding the first number is: [^0-9.].+
My problem is how to combine these two expressions to get the string inbetween them?
Is there a more efficient way to do?
You may use
re.findall(r'^(\d.*?)(?:\n\n|$)', s, re.S)
Or with re.search, since it seems that only one match is expected:
m = re.search(r'^(\d.*?)(?:\n\n|$)', s, re.S)
if m:
print(m.group(1))
See the Python demo.
Pattern details
^ - start of a string
(\d.*?) - Capturing group 1: a digit and then any 0+ chars, as few as possible
(?:\n\n|$) - a non-capturing group matching either two newlines or end of string.
See the regex graph:

Why the * regular expression indicates what can or cannot be it's previous character

Take this for an example which I found in some blog,
"How about searching for apple word which was spelled wrong in a given file where apple is misspelled as ale, aple, appple, apppple, apppppple etc. To find all patterns
grep 'ap*le' filename
Readers should observe that the above pattern will match even ale word as * indicates 0 or more of previous character occurrence."
Now it's saying that "ale" will be accept when we are having ap*le, isn't the "ap" and "le" fixed?
The * is a quantifier meaning 0 or more times for the previous pattern -- in this case a single literal p. You can also state the same as * with a quantifier:
ap{0,}le
The interesting question sometimes is 'what is the previous pattern?' It is often helpful to put a pattern in a group to aid understand of what the 'previous pattern' is.
Consider wanting to find any of:
ale, aple, appple, apppple, apppppple, able, abbbbbbble
Your first try might be:
/ap|b*le/
^ literal 'p' is the first alternative #WRONG regex will use 'ap'
^ or
^ literal 'b'
Demo
What you want in this case is:
/a(?:p|b)*le/
Demo
If you do not want to match ale and only match aple, appple, apppple, apppppple, use the + instead of the * which means one or more:
/ap+le/
And is equivalent to /ap{1,}le/
Demo
And if you want to only match aple, appple and leave out the variants with more than 3 'p's use the additional max quantifier:
/ap{1,3}le/
All the variants above will match apple correctly spelled. If you what only aple, appple, and not match apple, use alteration:
/a(?:p|p{3})le/
Demo
No its not.
"*" in your case means zero or any occurrence of p. While a and le is fixed. If you need fixed ap and le then this is what you need:
ap+le
"+" means at least once but no limit on number of occurrences.
This means now any number of p after a but before l. So it wont select ale now.