Regex: select groups not found in a pattern

I have been looking at the various Regex topics on SO, and they all say that to find the inverse (select everything that doesn't fit the criteria) you simply use the [^] syntax or a negative lookahead.
I have tried both of these methods on my regex, but the results are not adequate; the [^] in particular seems to take all its contents literally (even when escaped).
What I need this for:
I have a massive single line from an SQL dump, and I'm trying to remove all characters that are not the line id or the numerical value of one column.
My regex works in matching exactly what I'm looking for; what I need to do is to invert this match so I can remove all non-matching parts in my IDE.
My regex:
/(\),\(\d{1,4},)|(,\d{10},)/
This matches a "),(<number of up to 4 digits>," or a ",<number of ten digits>,".
The subject
My subject is a 500Kb line of an SQL dump looking something like this (I have already removed a-z and other unwanted characters in previous simple find/replaces):
),(39,' ',1,'01761472100','#','9 ','20',1237213277,0,1237215419,''),(40,' ',3,'01445731203','#',' ','-','22 2','210410//816',1237225423,0,1484651768,''),(4270,' /
My aim is to use a regex to achieve the following output:
),(39,,1237213277,,1237215419,),(40,,1237225423,,1484651768,),(4270,
Which I can then go over again and easily remove repetitions such as commas.
I have read that negation in regex is tricky. So, what is the syntax to make my regex work inverted, that is, to remove everything that does not match? What can you recommend as a way of solving this without spending hours manually reading the lines?

In PCRE, you can use the really helpful (*SKIP)(?!) construct (equivalent to (*SKIP)(*F) or (*SKIP)(*FAIL)) to match the text you want to keep, skip it, and then match all other text for removal:
/(?:\),\(\d{1,4},|,\d{10},)(*SKIP)(?!)|./s
Details:
(?:\),\(\d{1,4},|,\d{10},) - match 1 of the 2 alternatives:
\),\(\d{1,4}, - ),(, then 1 to 4 digits and then ,
| - or
,\d{10}, - a comma, 10 digits, a comma
(*SKIP)(?!) - discard the text matched so far and resume the search right after it
| - or
. - any char, including a newline (since the /s DOTALL modifier is passed to the regex)
The same can be done with
/(\),\(\d{1,4},|,\d{10},)?./s
and replacing with the $1 backreference (this puts back the text captured by the patterns we need to keep).
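For anyone doing the same cleanup in a script rather than an IDE, the capture-group variant can be sketched in Python. This is only an illustration under assumptions: Python's built-in re module has no (*SKIP)(*F) verbs, so the sketch uses the second pattern; the function name keep_matches and the shortened sample string are made up for the example.

```python
import re

# Second pattern from the answer: optionally capture a chunk to keep,
# then consume exactly one more character (DOTALL lets "." match newlines too).
PATTERN = re.compile(r"(\),\(\d{1,4},|,\d{10},)?.", re.DOTALL)


def keep_matches(text: str) -> str:
    """Put back group 1 when it took part in the match, drop everything else."""
    # Hypothetical helper; same idea as replacing with $1 in an editor.
    # "or ''" covers matches where group 1 did not participate.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


# Shortened version of the subject line from the question.
subject = "),(39,' ',1,'01761472100','#','9 ','20',1237213277,0,1237215419,''),(40,' /"
print(keep_matches(subject))
```

One caveat of this pattern: the trailing . has to consume a character, so a chunk sitting at the very end of the input with nothing after it would not be preserved.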

Related

Replace a sequence of 3 to 30 characters by another one with regex

I need to replace sequences (of various lengths) of one character with another character. I am working in Eclipse on XML files.
For example, -------- should be replaced by ********.
The replacement should be done only for sequences of at least 3 characters, not 1 or 2.
It is easy to find the matching sequences in regex for example with -{3,30} but I don't understand how to specify the replacement sequence.
I had this regex solution ready when the question was posted but didn't submit an answer, because I kept testing in Eclipse: although the regex worked in the Find feature, a * in the replacement wasn't changing the text in the Eclipse editor.
Here is a shorter and a bit more efficient regex:
(?!^)\G-|(?=-{3})-
Replace with a *
Breakdown:
(?!^)\G: Match from end of the previous match
-: Match a -
|: OR
(?=-{3}): Make sure we have 3 hyphens ahead
-: Match a -
You can use
(?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-)))-
Details:
(?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-))) - either
\G(?!\A) - end of the preceding match
| - or
(?<!-)(?=-{3,30}(?!-)) - a position that is not immediately preceded by a - char and is immediately followed by 3 to 30 hyphens (not followed by another hyphen).
- - a hyphen.
The (?:\G(?!\A)|(?<!-)(?=-{3,30}(?!-)))- regex goes into the "Find What" field, and * goes into the "Replace With" field.
Note that regular expressions are only meant to be used in search fields, replacement fields must only contain replacement patterns. Usually, the replacement pattern is a string containing literal string(s) and/or backreferences. Here, we do not need any backreference as the regex does not capture anything.
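Outside Eclipse, the same effect can be had without \G. Here is a minimal Python sketch, assuming the built-in re module (which does not support \G): it matches each whole run of 3 to 30 hyphens and builds a replacement of equal length in a callback. This swaps the per-character matching used above for a per-run callback; the function name mask_hyphen_runs is hypothetical.

```python
import re

# A run of 3 to 30 hyphens that is not part of a longer run.
RUN = re.compile(r"(?<!-)-{3,30}(?!-)")


def mask_hyphen_runs(text: str) -> str:
    """Replace each qualifying run with the same number of asterisks."""
    return RUN.sub(lambda m: "*" * len(m.group()), text)


print(mask_hyphen_runs("a-b--c---d--------e"))
```

Runs of 1 or 2 hyphens, and runs longer than 30, are left untouched by this pattern.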

RegEx to match only a specific column with lookaround

I have a .CSV which I'm handling in a large file editor (BssEditor):
DOC;NAME;A_TYPE;ADDRESS;NUMBER;COMPLEMENT;NEIGHBORHOOD;CITY;STATE;ZIPCODE
7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436
7971541;Anakim Skywalker;AV;VISCONDE OF JEQUITINHONHA;0000243;AP 601;GOOD VOYAGE;RECIFE;PE;51021190
7971974;Jabba the Hutt;;DOS ILHEUS;0000118;APT 600;CENTER;FLOWERPOLIS;SC;88010560
7972512;Mando;;JUNDIACANGA;0000037;HOUSE;IPAVA CITY;SAINT PAUL;SP;04950150
The column delimiter is ;, and I want to match all the leading zeros in the NUMBER column and replace them with nothing.
Ex.: 0000731→731
It's easy to match everything with ^((.*?;){4})0+ and replace by $1, but not with lookaround...
I tried regexes like these:
/^(?<=.*?;){4}0+/
/(?<=^.*?;.*?;.*?;.*?;)0+/
but it looks like an unbounded wildcard only works within a lookahead, not a lookbehind.
Is there a way?
And if there is, is there a performance issue when dealing with millions of entries?
An infinite quantifier in a lookbehind is only supported by a few regex engines (.NET, the Python PyPI regex module, newer JavaScript engines such as V8), but not by Notepad++, which uses Boost.
If you are using Notepad++, you don't need lookarounds or capture groups. You could repeat the semicolon-separated parts until you get to the number column and use \K to clear the current match buffer.
In the replacement use an empty string.
^(?:[^;\n]*;){4}\K0+
^ Start of the line
(?:[^;\n]*;){4} Repeat 4 times matching any char except ; or a newline, then match ;
\K Forget what is matched so far
0+ Match one or more times a zero
The capture group solution seems like a good solution, you could write it using a single capture group and use a negated character class instead of .*? to prevent some backtracking.
^((?:[^;\n]*;){4})0+
In the replacement use group 1, often written as $1 or \1
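The single-capture-group version can also be tried outside an editor. This is a minimal Python sketch, assuming the built-in re module with re.MULTILINE so that ^ anchors at every line; the function name strip_number_zeros is hypothetical and the sample rows are taken from the question.

```python
import re

# Capture the first four ;-separated fields, then match the leading zeros.
# re.MULTILINE makes ^ anchor at the start of every line.
LEADING_ZEROS = re.compile(r"^((?:[^;\n]*;){4})0+", re.MULTILINE)


def strip_number_zeros(csv_text: str) -> str:
    """Remove leading zeros from the fifth column of each line."""
    return LEADING_ZEROS.sub(r"\1", csv_text)


sample = (
    "DOC;NAME;A_TYPE;ADDRESS;NUMBER;COMPLEMENT;NEIGHBORHOOD;CITY;STATE;ZIPCODE\n"
    "7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436\n"
    "7971974;Jabba the Hutt;;DOS ILHEUS;0000118;APT 600;CENTER;FLOWERPOLIS;SC;88010560\n"
)
print(strip_number_zeros(sample))
```

Note that a NUMBER value made up only of zeros would be emptied by this pattern, since 0+ consumes all of them.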
I don't know about BssEditor, but the following works in Notepad++
(?<=;)0+(?=\d+;(?:[^;]*;){4}[^;]*?$)
A positive lookahead is used to only match if there are exactly five semicolons ahead in the string on that line.
As for "is there a performance issue when dealing with millions of entries?": possibly.

Regular Expression Replace Time Value between Date-Time Format

I have an XML file with date-time formats looking like this:
<published>2019-01-03T23:54:00.000+10:00</published>
and this
<published>2019-01-07T14:22:00.001+10:00</published>
and so on, where the time value is 23:54:00.000 and 14:22:00.001.
How do I replace just the time value between the <published></published> tags with regular expressions? For example, I want to replace both time values with 03:00:00.000 so the first example becomes
<published>2019-01-03T03:00:00.000+10:00</published>
My aim is to use an existing tool or app such as Notepad++, or a website, since that is much faster, rather than any specific programming language.
First, the obligatory warning to not try to parse xml/html with regex. It's fine if this is a once-off reformatting task and you have control over the data. A regex solution will not be very robust...
That out of the way, you will need a tool that can handle capture groups with regex, so you can match on the whole published tag and avoid false positives. A regex like this might do the trick (adjust the capture grouping as appropriate for your tool):
(\<published\>\d\d\d\d-\d\d-\d\dT)\d\d:\d\d:\d\d\.\d\d\d(\+\d\d:\d\d\<\/published\>)
Note that the above is a regex in PCRE format. You may need to adjust it to suit the format your tool uses.
In this regex, there are two capture groups, one before and one after the time you want to replace. An example string that you could use in the replace field of your chosen tool would be: \103:00:00.000\2 (using \1 syntax for backreferences).
Try this regex:
(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*<\/published>)
Replace each match with \103:00:00.000\2 i.e. Group 1 contents followed by 03:00:00.000 followed by Group 2 contents.
Explanation:
(<published>\d{4}(?:-\d{2}){2}T) - matches <published> followed by 4 digits followed by - followed by 2 digits followed by - followed by 2 digits followed by the letter T. This sub-match is captured in Group 1
\d{2}(?::\d{2}){2}\.\d{3} - matches time of the format XX:XX:XX.XXX where X is a digit.
([^<]*<\/published>) - matches 0+ occurrences of any character that is not a < followed by </published>. This sub-match is captured in Group 2.
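If a script turns out to be acceptable after all, the same replacement can be sketched in Python. This assumes the built-in re module; the names PUBLISHED, REPLACEMENT and reset_time are made up for the example. Python's replacement syntax would not read \103 as "group 1 followed by 03", so the sketch uses the explicit \g<1> form instead.

```python
import re

# Group 1: opening tag, date and "T".
# Group 2: whatever follows the time (here the UTC offset) plus the closing tag.
PUBLISHED = re.compile(
    r"(<published>\d{4}(?:-\d{2}){2}T)\d{2}(?::\d{2}){2}\.\d{3}([^<]*</published>)"
)

# \g<1> keeps the group number separate from the digits of the new time.
REPLACEMENT = r"\g<1>03:00:00.000\g<2>"


def reset_time(xml_text: str) -> str:
    """Swap the time portion inside every <published> element."""
    return PUBLISHED.sub(REPLACEMENT, xml_text)


print(reset_time("<published>2019-01-03T23:54:00.000+10:00</published>"))
```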

Capture number between two whitespaces (RegEx)

I have the following data:
SOMEDATA .test 01/45/12 2.50 THIS IS DATA
and I want to extract the number 2.50 out of this. I have managed to do this with the following RegEx:
(?<=\d{2}\/\d{2}\/\d{2} )\d+.\d+
However that doesn't work for input like this:
SOMEDATA .test 01/45/12 2500 THIS IS DATA
In this case, I want to extract the number 2500.
I can't seem to figure out a regex rule for that. Is there a way to extract something between two spaces? That is, to extract the text/number after the date up to the next whitespace? All I know is that the date will always have the same format, and there will always be a space after the date and then a space after the number I want to extract.
Can someone help me out with this?
Capture number between two whitespaces
A whitespace is matched with \s, and non-whitespace with \S.
So, what you can use is:
\d{2}\/\d{2}\/\d{2} +(\S+)
                      ^^^
The 1+ non-whitespace symbols are captured into Group 1.
If - for some reason - you need to only get the value as a whole match, use your lookbehind approach:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Or - if you are using PCRE - you may leverage the match reset operator \K:
\d{2}\/\d{2}\/\d{2} +\K\S+
                     ^^
NOTE: the \K and capture group approaches allow 1 or more spaces after the date and are thus more flexible.
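The capture-group variant is easy to try in a script. Here is a minimal Python sketch, assuming the built-in re module (which has no \K, so Group 1 is used); the helper name value_after_date is hypothetical.

```python
import re
from typing import Optional

# Date-like prefix, one or more spaces, then capture the next non-whitespace run.
AFTER_DATE = re.compile(r"\d{2}/\d{2}/\d{2} +(\S+)")


def value_after_date(line: str) -> Optional[str]:
    """Return the token that follows the date, or None if nothing matches."""
    match = AFTER_DATE.search(line)
    return match.group(1) if match else None


print(value_after_date("SOMEDATA .test 01/45/12 2.50 THIS IS DATA"))
print(value_after_date("SOMEDATA .test 01/45/12 2500 THIS IS DATA"))
```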
I see some people helped you already, but if you would want an alternative working one for some reason, here's what works too :)
.+ \d+\/\d+\/\d+ (\d+[\.\d]*)
So the .+ matches anything plus the first space
then the \d+/\d+/\d+ is the date parsing plus a space
the capturing group is the number, as you can see I made the last part optional, so both floating point values and normal values can be matched. Hope this helped!
Proof: https://regex101.com/r/fY3nJ2/1
Just make the fractional part optional:
(?<=\d{2}\/\d{2}\/\d{2} )\d+(?:\.\d+)?
Demo: https://regex101.com/r/jH3pU7/1
Update following clarifications in comments:
To match anything (except a space) that is surrounded by spaces and preceded by the date, use:
(?<=\d{2}\/\d{2}\/\d{2} )\S+
Demo: https://regex101.com/r/jH3pU7/3
Rather than capture, you can make your entire match be the target text by using a look behind:
(?<=\d\d(\/\d\d){2} )\S+
This matches the first series of non-whitespace that follows a "date like" part.
Note also the reduction in the length of the "date like" pattern. You may consider using this part of the regex in whatever solution you use.

Regular Expression in sas, not matching a word after a matching word

Maybe this is easy, but I could not find a solution.
I am working in SAS 9.3 with Perl regex.
I am looking for a regular expression that matches certain words only when they are not followed by a specific other word. For example, it should match any text that contains "the car" with no "not" anywhere after it. (Case can be ignored, because I upcase everything in my code.)
Should match
This is not the car i want
The car is green
should not match
The car is not green
This is the car i want, but its not available
One solution would be to split it in two matches:
prxmatch("/The car/",mytext) > 0 and prxmatch("/The car.+not/",mytext)=0
But I have to use this logic many times, also in more complex cases, so I don't want to always use 2 prxmatch calls; I would rather combine the logic in one prxmatch.
I read a lot about lookaheads and tried some examples, but they did not work correctly, e.g.:
"/The Car.+[^(not)]/"
or
"/The Car.+(?!not)/"
or
"/^(?!.*not.*).*?The car.*$/"
The first and second return all 4 texts as results; the third returns no result at all.
So can somebody provide me with a solution for this: a simple "not" operator for a word, or a correct lookahead/lookbehind approach?
You can use
(?im)^.*\bthe car\b(?!.*\bnot\b).*
Pattern breakdown:
(?im) - enable case-insensitive and multiline matching modes
^ - start of a line (since (?m) is used)
.* - match 0+ any characters but a newline
\bthe car\b - 2 whole words "the car" (a sequence of 2 words)
(?!.*\bnot\b) - a negative lookahead that fails the match if there is a whole word "not" somewhere to the right of "the car"
.* - the rest of the line up to the newline or end of string
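The pattern can be checked against the four sample sentences from the question. The sketch below uses Python's built-in re module rather than SAS prxmatch, purely as an assumption for illustration; the variable names are made up.

```python
import re

# Same pattern as above: "the car" as whole words, with no whole word "not"
# anywhere later on the same line.
CAR_WITHOUT_NOT = re.compile(r"(?im)^.*\bthe car\b(?!.*\bnot\b).*")

samples = [
    "This is not the car i want",
    "The car is green",
    "The car is not green",
    "This is the car i want, but its not available",
]

# Keep only the sentences the pattern accepts.
matching = [text for text in samples if CAR_WITHOUT_NOT.search(text)]
print(matching)
```

A "not" that appears before "the car" does not block the match, because the lookahead only inspects the text to the right.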