Regex to match number in #define statement - regex

I have a line like this:
#define PROG_HWNR "36084"
or this:
#define PROG_HWNR "#37595"
I'd like to extract the number (and increase it, but that's not the matter here)
I wrote a regex, but it's not working (at least in http://gskinner.com/RegExr/ )
(?<="#?)(.*?)(?=")
I also tried variations like
(?<=("#?))(.*?)(?=")
or
(?<=("|"#)))(.*?)(?=")
But no success. The problem is, that I want to match only the number, no matter if there is a # or not ...
Can you point me in the right direction? Thanks!!

Try this regex:
"#?(\d+)"$
It will match:
" a quote
#? optional hash
( (start capturing)
\d+ one or more digits
) (stop capturing)
" a quote
$ anchor to end
Here is a JSFiddle, and here is a RegExr

The problem is the variable length of the lookbehind. Only few regex engines can deal with this. Because there are only two possible lookbehinds (including the # or not), you can expand that into two lookbehinds:
(?:(?<="#)|(?<=")).*?(?=")
Note that you don't need to capture the .*? if you use lookarounds, as they are excluded from the match anyway. Also, a better way than using non-greedy .*? is to use a greedy expression that can never go past the ending delimiter:
(?:(?<="#)|(?<="))[^"]*(?=")
Alternatively (if you can access captured submatches), you can use a capturing approach and get rid of the lookarounds:
"#?([^"]*)"

Try this:
^#define \w+ "#?(\d+)"$
That will match the whole line, with the first/single group being the number you are looking for.
This is actually pretty basic regex functionality: match an optional character (?) and match a group of characters (the parentheses).
You can even go one simpler:
\d+
will match a string of digits. Only the digits. And ignore the rest of the input string.
Use this tool for testing this stuff, I found it pretty handy: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

Related

RegEx to match only a specific column with lookaround

I have a .CSV which I'm handling in a large file editor (BssEditor):
DOC;NAME;A_TYPE;ADDRESS;NUMBER;COMPLEMENT;NEIGHBORHOOD;CITY;STATE;ZIPCODE
7971530;Obi Wan Kenobi;R;OF THE PITANGUEIRAS;0000731;;MATATU;DUBAI;BA;40255436
7971541;Anakim Skywalker;AV;VISCONDE OF JEQUITINHONHA;0000243;AP 601;GOOD VOYAGE;RECIFE;PE;51021190
7971974;Jabba the Hutt;;DOS ILHEUS;0000118;APT 600;CENTER;FLOWERPOLIS;SC;88010560
7972512;Mando;;JUNDIACANGA;0000037;HOUSE;IPAVA CITY;SAINT PAUL;SP;04950150
The column delimiter is ;, and I wanna match all zeros in the beginning of the NUMBER column to replace with nothing.
Ex.: 0000731→731
It's easy to match everything with ^((.*?;){4})0+ and replace by $1, but not with lookaround...
I tried RegEx like that
/^(?<=.*?;){4}0+/
/(?<=^.*?;.*?;.*?;.*?;)0+/
but it looks like the greedy wildcard only works within a lookahead, not a lookbehind.
There are a way?
And having a way, is there a performance issue when dealing with millions of entries?
An infinite quantifier in a lookbehind is only supported by a few regex engines (.NET, Python PyPi module, newer Javascript like V8), but not in notepad++ which uses boost.
If you are using notepad++, you don't need lookarounds or capture groups. You could repeat semicolon separated parts until you get to the number column and use \K to clear the current match buffer.
In the replacement use an empty string.
^(?:[^;\n]*;){4}\K0+
^ start of string
(?:[^;\n]*;){4} Repeat 4 times matching any char except ; or a newline, then match ;
\K Forget what is matched so far
0+ Match one or more times a zero
Regex demo
The capture group solution seems like a good solution, you could write it using a single capture group and use a negated character class instead of .*? to prevent some backtracking.
^((?:[^;\n]*;){4})0+
In the replacement use group 1, often notated as $ or \1
Regex demo
I don't know about BssEditor, but the following works in Notepad++
(?<=;)0+(?=\d+;(?:[^;]*;){4}[^;]*?$)
A positive lookahead is used to only match if there are exactly five semicolons ahead in the string on that line.
is there a performance issue when dealing with millions of entries?
Possibly.

Notepad++: Can I use regex to find some values and remove only one character instead of the whole pattern?

I want to use regex in notepad to find this pattern: "[0-9]+[\.][0-9]+[,][0-9]+" e.g. 1.010,80260
However from these kind of numbers I just want to remove the '.' , so the new value should be 1010,80260 .
So far I can only replace the whole pattern. Is there a way to do it?
Thank you in advance!
You can make use of the \K meta escape since PCRE doesn't support variable width lookbehinds:
regex:
[0-9]+\K[\.](?=[0-9]+[,][0-9]+)
[0-9]+ - capture digits
\K - forget what we've captured
[\.] - capture a period; just \. can be used, no need for the char class brackets
(?=[0-9]+[,][0-9]+) - ahead of me should be digits followed by a comma and digits
replace:
Nothing
\K is bugged in Notepad++ so you could use this regex instead since you only care that at least one digit is behind the period:
(?<=\d)\.(?=[0-9]+[,][0-9]+)
You can use \K, which basically says throw away whatever was matched up until that point, then add a lookahead. Like so
[0-9]+\K\.(?=[0-9]+[,][0-9]+)
Change the regular expression to: ([0-9]+)[\.]([0-9]+[,][0-9]+)
The () pieces are groups which you can refer to in the replace with \1 for the first group, and \2 for the second group.
The docs also explain this here: https://npp-user-manual.org/docs/searching/#substitution-grouping (even better, and in more detail, than my usage in this answer...)
EDIT: I just wanted to share the animated gif showing that 'Replace' in Notepad++ 7.9.5. does not seem to work.

Regex, avoid matching consecutive characters

I m trying to improve my regex skills.
I can't manage this exercise.
https://alf.nu/RegexGolf
You have to match words without consecutive identical characters.
To make it clear, we should avoid patterns like abba, or baab, czzc.
The only way I see is to use capture groups:
([a-z])([a-z])\2\1
Then have a negative lookahead:
(?!([a-z])([a-z])\2\1)
But on the site it doesn't work since it doesn't match anything.
Any advice?
Thank you
Use a negative lookahead:
^(?:(.)(?!\1))*$
Explanation:
^ from the start of the input
(?:
(.) match AND capture a single character
(?!\1) then assert that what follows is a different character (not the same)
)* match zero or more such matching characters
$ end of the input
Demo
Another, possibly cleaner, way to do this would be to just have a global negative lookahead at the very start of the pattern:
^(?!.*(.)\1).*$
This would assert at the very beginning that no character is duplicated, anywhere in the string.
^(?!cr|pal|tar)[a-z]{1,4}([a-z])\1[a-z]{0,5}$
This worked for me in the link you gave. I guess we had to match patterns with consecutive letters. But there were some exceptions for which I had to use negative look ahead at the beginning. I have used ([a-z])\1 to match consecutive characters surrounded by possible characters of possible limit. Hope this helps!
Attached the screenshot for reference.
https://i.stack.imgur.com/va1Uq.png
Thanks to Tim Biegeleisen, here is the answer.
^(?!.*(.)(.)\2\1).*$

Regex for selecting words ending in 'ing' unless

I want to select words ending in with a regular expression, but I want exclude words that end in thing. For example:
everything
running
catching
nothing
Of these words, running and catching should be selected, everything and nothing should be excluded.
I've tried the following:
.+ing$
But that selects everything. I'm thinking look aheads/look arounds could be the solution, but I haven't been able to get one that works.
Solutions that work in Python or R would be helpful.
In python you can use negative lookbehind assertion as this:
^.*(?<!th)ing$
RegEx Demo
(?<!th) is negative lookbehind expression that will fail the match if th comes before ing at the end of string.
Note that if you are matching words that are not on separate lines then instead of anchors use word boundaries as:
\w+(?<!th)ing\b
Something like \b\w+(?<!th)ing\b maybe.
You might also use a negative lookahead (?! to assert that what is on the right is not 0+ times a word character followed by thing and a word boundary:
\b(?!\w*thing\b)\w*ing\b
Regex demo | Python demo

Notepad++ Regex Find all endline without periods

I'm trying to find all lines without ending period (dot) but without finding blank (empty) lines. And after that I want to add ending period to that sentence.
Example:
The good is whatever stops such things from happening.
Meaning as the Higher Good
It was from this that I drew my fundamental moral conclusions.
I have tried few regex but they also find empty lines as well.
Is there a regex for Notepad++ that can achieve that?
Enable Regular Expression match, then search for:
\S(?<!\.)\K\s*$
and replace with:
.$0
Breakdown:
\S Match a non-whitespace character
(?<!\.) It shouldn't be a period
\K Reset match
\s* Match optional whitespace characters
$ End of line
You could use something like this to find the lines that you are interested in adding capture group to it and appending you needed chars.
(?<!\.)\r\n
This works by using negative look behind (?<!\.) to check that there is no . before \r
There is a group or regex operators that can be used to accomplish this type of tasks.
Look ahead positive (?=)
Look ahead negative (?!)
Look behind positive (?<=)
Look behind negative (?
Try this short and effective solution too.
Search: \w$
Replace: $0.