RegEx to match only a specific column with lookaround - regex

I have a .CSV which I'm handling in a large file editor (BssEditor):
DOC;NAME;A_TYPE;ADDRESS;NUMBER;COMPLEMENT;NEIGHBORHOOD;CITY;STATE;ZIPCODE
7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436
7971541;Anakim Skywalker;AV;VISCONDE OF JEQUITINHONHA;0000243;AP 601;GOOD VOYAGE;RECIFE;PE;51021190
7971974;Jabba the Hutt;;DOS ILHEUS;0000118;APT 600;CENTER;FLOWERPOLIS;SC;88010560
7972512;Mando;;JUNDIACANGA;0000037;HOUSE;IPAVA CITY;SAINT PAUL;SP;04950150
The column delimiter is ;, and I want to match all leading zeros in the NUMBER column and replace them with nothing.
Ex.: 0000731→731
It's easy to match everything with ^((.*?;){4})0+ and replace it with $1, but not with a lookaround...
I tried regexes like these:
/^(?<=.*?;){4}0+/
/(?<=^.*?;.*?;.*?;.*?;)0+/
but it looks like an unbounded quantifier such as .*? only works within a lookahead, not a lookbehind.
Is there a way?
And if there is, would performance be an issue when dealing with millions of entries?

An unbounded quantifier in a lookbehind is only supported by a few regex engines (.NET, the PyPI regex module for Python, newer JavaScript engines such as V8), but not by Notepad++, which uses Boost.
If you are using Notepad++, you don't need lookarounds or capture groups. You could repeat the semicolon-separated parts until you get to the NUMBER column and use \K to clear the current match buffer.
In the replacement use an empty string.
^(?:[^;\n]*;){4}\K0+
^ Start of the line
(?:[^;\n]*;){4} Repeat 4 times matching any char except ; or a newline, then match ;
\K Forget what is matched so far
0+ Match one or more times a zero
Regex demo
The capture group solution is a good one too. You could write it with a single capture group and use a negated character class instead of .*? to prevent some backtracking.
^((?:[^;\n]*;){4})0+
In the replacement use group 1, often written as $1 or \1
Regex demo
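For anyone scripting this outside an editor, here is a minimal Python sketch of the capture-group approach above. Python's built-in re module has no \K, which is why the capture-group form is used. The sketch assumes the data is available as a string with \n line endings; the names strip_number_zeros and sample are made up for illustration.

```python
import re

# Capture the first four ';'-terminated fields, then match the leading zeros
# of the fifth field (NUMBER). re.MULTILINE makes ^ match at every line start.
LEADING_ZEROS = re.compile(r"^((?:[^;\n]*;){4})0+", re.MULTILINE)


def strip_number_zeros(text: str) -> str:
    """Drop leading zeros from the fifth column of ';'-delimited text."""
    return LEADING_ZEROS.sub(r"\1", text)


# Hypothetical excerpt of the file from the question.
sample = (
    "DOC;NAME;A_TYPE;ADDRESS;NUMBER;COMPLEMENT;NEIGHBORHOOD;CITY;STATE;ZIPCODE\n"
    "7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436\n"
    "7972512;Mando;;JUNDIACANGA;0000037;HOUSE;IPAVA CITY;SAINT PAUL;SP;04950150\n"
)

print(strip_number_zeros(sample))
```

Note that a NUMBER value consisting only of zeros would end up empty, exactly as with the original pattern.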

I don't know about BssEditor, but the following works in Notepad++
(?<=;)0+(?=\d+;(?:[^;]*;){4}[^;]*?$)
A positive lookahead is used to only match if there are exactly five semicolons ahead in the string on that line.
is there a performance issue when dealing with millions of entries?
Possibly.
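The lookahead pattern can also be tried in Python, whose re module accepts the fixed-width lookbehind (?<=;). The sketch below is only a correctness check under stated assumptions, not a benchmark: it processes one line at a time so that [^;] cannot run across line breaks, and strip_zeros_lookaround is a hypothetical helper name. For the performance question, timing the candidate patterns on a slice of the real file (for example with the timeit module) is likely more reliable than guessing.

```python
import re

# Zeros directly after a ';' that are followed by more digits, with exactly
# five ';' remaining before the end of the line.
PATTERN = re.compile(r"(?<=;)0+(?=\d+;(?:[^;]*;){4}[^;]*?$)")


def strip_zeros_lookaround(line: str) -> str:
    """Apply the lookaround pattern to a single line."""
    return PATTERN.sub("", line)


line = (
    "7971541;Anakim Skywalker;AV;VISCONDE OF JEQUITINHONHA;"
    "0000243;AP 601;GOOD VOYAGE;RECIFE;PE;51021190"
)
print(strip_zeros_lookaround(line))
```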

Related

Regex, avoid matching consecutive characters

I'm trying to improve my regex skills.
I can't manage this exercise.
https://alf.nu/RegexGolf
You have to match words without consecutive identical characters.
To make it clear, we should avoid patterns like abba, or baab, czzc.
The only way I see is to use capture groups:
([a-z])([a-z])\2\1
Then have a negative lookahead:
(?!([a-z])([a-z])\2\1)
But on the site it doesn't work since it doesn't match anything.
Any advice?
Thank you
Use a negative lookahead:
^(?:(.)(?!\1))*$
Explanation:
^ from the start of the input
(?:
(.) match AND capture a single character
(?!\1) then assert that what follows is a different character (not the same)
)* match zero or more such matching characters
$ end of the input
Demo
Another, possibly cleaner, way to do this would be to just have a global negative lookahead at the very start of the pattern:
^(?!.*(.)\1).*$
This would assert at the very beginning that no character is immediately followed by itself, anywhere in the string.
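As a quick sanity check, the two patterns above can be compared in Python. This is only a sketch: the word list is invented, and the names STEPWISE and UPFRONT are illustrative.

```python
import re

# Pattern 1: consume characters one by one, each not followed by itself.
STEPWISE = re.compile(r"^(?:(.)(?!\1))*$")
# Pattern 2: a single lookahead up front that rejects any doubled character.
UPFRONT = re.compile(r"^(?!.*(.)\1).*$")

for word in ["abc", "banana", "hello", "aa", ""]:
    # Both patterns are expected to agree on every word.
    print(repr(word), bool(STEPWISE.match(word)), bool(UPFRONT.match(word)))
```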
^(?!cr|pal|tar)[a-z]{1,4}([a-z])\1[a-z]{0,5}$
This worked for me on the site you linked. I assumed the goal was to match words that do contain consecutive identical letters, with a few exceptions that I had to exclude with a negative lookahead at the beginning. I used ([a-z])\1 to match the doubled letter, surrounded by a bounded number of other letters. Hope this helps!
Attached the screenshot for reference.
https://i.stack.imgur.com/va1Uq.png
Thanks to Tim Biegeleisen, here is the answer.
^(?!.*(.)(.)\2\1).*$
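A minimal Python check of this final pattern, using the example words from the question plus two invented ones that should pass:

```python
import re

# Reject any string containing an "xyyx" run such as abba, baab or czzc.
NO_ABBA = re.compile(r"^(?!.*(.)(.)\2\1).*$")

words = ["abba", "baab", "czzc", "abab", "hello"]
accepted = [w for w in words if NO_ABBA.match(w)]
print(accepted)  # prints ['abab', 'hello']
```

Note that "hello" is accepted here: it has a doubled letter, but no mirrored four-character run.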

Regex to capture a group of delimited words that must end with a specific word

I'm normalizing a bunch of Ansible group names, which have to change to use underscores instead of hyphens (thanks, Ansible). However, there's tons of other stuff in the file that is hyphenated, so I want to leave those lines alone. The ones I want to change always end with -servers. So, with a small sample, we might have:
foo-bar
foo-bar-servers
foo-bar-baz-servers
(\w)-(\w?)? very nicely captures things so I can just sub to $1_$2 to change the hyphens to underscores. However, as soon as I add -servers or ervers on the end, it grabs only the very last pair around the hyphen. I have tried many variations, read up a little on lookaheads, and I am thoroughly stumped. It seems like it ought to be simple. What is the magic incantation to match all the groups around the hyphens, for lines ending in -servers? Many thanks in advance.
Edit: desired results, with apologies:
foo-bar
foo_bar_servers
foo_bar_baz_servers
As long as your regex engine supports positive lookaheads and (fixed-length) positive lookbehinds (as do most engines, including PCRE (PHP) and Python, for example), you may use the following regular expression to match the desired hyphens, which may then be replaced with underscores.
(?<=\w)-(?=(?:\w+-)*servers$)
Demo
The regex engine performs the following operations.
(?<=\w) match a word char in a positive lookbehind
- match a hyphen
(?= begin a positive lookahead
(?:\w+-) match 1+ word chars then '-', in a non-capture group
* execute non-capture group 0+ times
servers match string
$ match end of line
) end positive lookahead
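The steps above can be tried in Python as follows. This is a sketch that assumes the group names sit one per line in a \n-separated string; re.MULTILINE is needed so that $ matches at the end of every line. The name HYPHEN is illustrative.

```python
import re

# A hyphen after a word character, followed only by "word-" groups and a
# final "servers" at the end of the line.
HYPHEN = re.compile(r"(?<=\w)-(?=(?:\w+-)*servers$)", re.MULTILINE)

groups = "foo-bar\nfoo-bar-servers\nfoo-bar-baz-servers"
print(HYPHEN.sub("_", groups))
```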

Notepad++ Regex Find all endline without periods

I'm trying to find all lines that do not end with a period (dot), without also matching blank (empty) lines. After that I want to add an ending period to each of those lines.
Example:
The good is whatever stops such things from happening.
Meaning as the Higher Good
It was from this that I drew my fundamental moral conclusions.
I have tried a few regexes, but they also match empty lines.
Is there a regex for Notepad++ that can achieve that?
Enable Regular Expression match, then search for:
\S(?<!\.)\K\s*$
and replace with:
.$0
Breakdown:
\S Match a non-whitespace character
(?<!\.) It shouldn't be a period
\K Reset match
\s* Match optional whitespace characters
$ End of line
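Outside Notepad++, the same idea can be sketched in Python. Python's built-in re module does not support \K, so this version swaps in a lookbehind plus a lookahead to find the insertion point instead. It assumes \n line endings, and the name MISSING_PERIOD is just illustrative.

```python
import re

# Zero-width spot right after the last non-space, non-period character of a
# line; only spaces or tabs may follow before the line ends.
MISSING_PERIOD = re.compile(r"(?<=[^\s.])(?=[ \t]*$)", re.MULTILINE)

text = (
    "The good is whatever stops such things from happening.\n"
    "Meaning as the Higher Good\n"
    "\n"
    "It was from this that I drew my fundamental moral conclusions."
)
fixed = MISSING_PERIOD.sub(".", text)
print(fixed)
```

Blank lines are skipped because the lookbehind requires a non-whitespace character immediately before the insertion point.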
You could use something like this to find the lines you are interested in, then add a capture group to it and append the characters you need.
(?<!\.)\r\n
This works by using negative look behind (?<!\.) to check that there is no . before \r
There is a group of regex operators that can be used to accomplish this type of task.
Look ahead positive (?=)
Look ahead negative (?!)
Look behind positive (?<=)
Look behind negative (?<!)
Try this short and effective solution too.
Search: \w$
Replace: $0.

Mixing Lookahead and Lookbehind in 1 Regexp

I'm trying to match first occurrence of window.location.replace("http://stackoverflow.com") in some HTML string.
Especially I want to capture the URL of the first window.location.replace entry in whole HTML string.
So for capturing the URL I formulated these 2 rules:
it should be after this string: window.location.redirect("
it should be before this string ")
To achieve it I think I need to use lookbehind (for 1st rule) and lookahead (for 2nd rule).
I end up with this Regex:
.+(?<=window\.location\.redirect\(\"?=\"\))
It doesn't work. I'm not even sure that it's legal to mix both rules like I did.
Can you please help me with translating my rules to Regex? Other ways of doing this (without lookahead(behind)) also appreciated.
The pattern you wrote is really not the one you need as it matches something very different from what you expect: text window.location.redirect("=") in text window.location.redirect("=") something. And it will only work in PCRE/Python if you remove the ? from before \" (as lookbehinds should be fixed-width in PCRE). It will work with ? in .NET regex.
If it is JS, note that lookbehinds are only available in newer engines (ES2018 and later); older JavaScript regex engines do not support them.
Instead, use a capturing group around the unknown part you want to get:
/window\.location\.redirect\("([^"]*)"\)/
or
/window\.location\.redirect\("(.*?)"\)/
See the regex demo
Omitting the /g modifier means only the first occurrence is matched. Access the value you need inside Group 1.
The ([^"]*) captures 0+ characters other than a double quote (URLs you need should not have it). If these URLs you have contain a ", you should use the second approach as (.*?) will match any 0+ characters other than a newline up to the first ").
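The capturing-group approach carries over to other languages unchanged. Here is a minimal Python sketch of the first pattern; the html string is an invented sample and the names REDIRECT and first_url are illustrative.

```python
import re

# Group 1 captures everything between the quotes (no double quote allowed).
REDIRECT = re.compile(r'window\.location\.redirect\("([^"]*)"\)')

# Hypothetical HTML snippet with two redirect calls.
html = (
    '<script>window.location.redirect("http://stackoverflow.com");</script>'
    '<script>window.location.redirect("http://example.com");</script>'
)

match = REDIRECT.search(html)  # search() returns only the first occurrence
first_url = match.group(1) if match else None
print(first_url)
```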

Regex to match number in #define statement

I have a line like this:
#define PROG_HWNR "36084"
or this:
#define PROG_HWNR "#37595"
I'd like to extract the number (and increase it, but that's not the matter here)
I wrote a regex, but it's not working (at least in http://gskinner.com/RegExr/ )
(?<="#?)(.*?)(?=")
I also tried variations like
(?<=("#?))(.*?)(?=")
or
(?<=("|"#)))(.*?)(?=")
But no success. The problem is, that I want to match only the number, no matter if there is a # or not ...
Can you point me in the right direction? Thanks!!
Try this regex:
"#?(\d+)"$
It will match:
" a quote
#? optional hash
( (start capturing)
\d+ one or more digits
) (stop capturing)
" a quote
$ anchor to end
Here is a JSFiddle, and here is a RegExr
The problem is the variable length of the lookbehind. Only a few regex engines can deal with this. Because there are only two possible lookbehinds (including the # or not), you can expand that into two lookbehinds:
(?:(?<="#)|(?<=")).*?(?=")
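Note that you don't need to capture the .*? if you use lookarounds, as they are excluded from the match anyway. One caveat: for "#37595" the plain (?<=") branch already succeeds right after the opening quote, so the match would still start at the #; adding (?!#) to that branch avoids this. Also, a better way than using non-greedy .*? is to use a greedy expression that can never go past the ending delimiter:
(?:(?<="#)|(?<=")(?!#))[^"]*(?=")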
Alternatively (if you can access captured submatches), you can use a capturing approach and get rid of the lookarounds:
"#?([^"]*)"
Try this:
^#define \w+ "#?(\d+)"$
That will match the whole line, with the first/single group being the number you are looking for.
This is actually pretty basic regex functionality: match an optional character (?) and match a group of characters (the parentheses).
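For illustration, here is how the anchored pattern might be used from Python. This is a sketch, not tied to any particular tool from the question; the helper name extract_number is made up.

```python
import re
from typing import Optional

# The optional '#' sits outside the group, so group 1 holds only the digits.
DEFINE = re.compile(r'^#define \w+ "#?(\d+)"$')


def extract_number(line: str) -> Optional[str]:
    """Return the digits from a #define line, or None if it does not match."""
    match = DEFINE.match(line)
    return match.group(1) if match else None


for line in ['#define PROG_HWNR "36084"', '#define PROG_HWNR "#37595"']:
    print(extract_number(line))
```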
You can go even simpler:
\d+
will match a string of digits. Only the digits. And ignore the rest of the input string.
Use this tool for testing this stuff, I found it pretty handy: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx