Eclipse regex: replace multiple groups

someText
1
2
3
4
moreText
I would like to add a prefix before each digit.
However, using (\w+\R)(\d+\R)+(\w+) with the replacement \1prefix\2\3 only prefixes the last digit and erases the others.
Is there a way to do it with a single regex, or should I write a script on the side?

The problem with your regex is the repeated capture group (\d+\R)+, specifically the trailing +. That quantifier reads, "match this group as many times as you can without causing the overall match to fail", and each repetition overwrites what the group captured before. So for your text it gobbles up 1, 2, 3, and 4, and only the last repetition, 4, is left in the second capture group. Regex engines generally cannot expose a variable number of captures from a single group; how would you address them in the replacement anyway? So the short answer, I think, is that a single find/replace is the wrong tool for a fully automated process and you'll have to write a script.
However, for a slightly less automated process that still incorporates your surrounding text, you could try
find: (\w+\R)((?:\d+\R)+)(\w+)
replace: \1prefix\2\3
We wrap the second group plus its greedy quantifier in an extra set of capturing parentheses and turn the inner group into a non-capturing one. Now the full run of digit lines sits in its own group, and we can add the prefix to the first of them. A useful side effect is that the newly prefixed line (for example prefix1) then matches the first group (\w+\R), so running the find/replace again prefixes the next number, and so on until the pattern no longer matches.
This way, you should be able to run through your files touching only the areas where you want the prefix, and it shouldn't take nearly as long as finding every digit in every file.
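The repeated find/replace can also be scripted, which is the route suggested for a fully automated run. Here is a minimal sketch in Python rather than Eclipse, assuming \n line endings (Python's re module has no \R) and a made-up helper name, prefix_all. It also shows that the original repeated group keeps only its last repetition.

```python
import re

text = "someText\n1\n2\n3\n4\nmoreText"

# The original pattern: the repeated group keeps only its final repetition.
original = re.search(r"(\w+\n)(\d+\n)+(\w+)", text)
last_capture = original.group(2)

# The reworked pattern: one capture group around the whole run of digit lines.
PATTERN = re.compile(r"(\w+\n)((?:\d+\n)+)(\w+)")


def prefix_all(s: str, prefix: str = "prefix") -> str:
    """Apply the find/replace repeatedly until it no longer matches.

    Assumes a prefix such as "prefix" that contains a non-digit character,
    so a prefixed line is no longer all digits (otherwise this never stops).
    """
    while True:
        s, count = PATTERN.subn(
            lambda m: m.group(1) + prefix + m.group(2) + m.group(3), s, count=1
        )
        if count == 0:
            return s


result = prefix_all(text)
print(repr(last_capture))
print(result)
```

Each pass prefixes one more digit line, because the line prefixed on the previous pass now plays the role of group 1.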

Related

Recurse subpattern doesn't seem to work with alternation

I want to match strings with numbers separated by commas. In a nutshell I want to match at most 8 numbers in range 1-16. So that string 1,2,3,4,5,6,7,8 is OK and 1,2,3,4,5,6,7,8,9 is not since it has 9 numbers. Also 16 is OK but 17 is not since 17 is not in range.
I tried to use this regex ^(?:(?:[1-9]|1[0-6]),){0,7}(?:[1-9]|1[0-6])$
and it worked fine. I use alternation to match numbers from 1-16, then a 0..7 repetition with a comma at the end, and then the same subpattern without a trailing comma. But I do not like repeating the subpattern, so I tried (?1) to recurse into the first capturing group. My regex looks like ^(?:([1-9]|1[0-6]),){0,7}(?1)$. However, this does not produce a match when the last number has two digits (10-16). It does match 1,1, but not 1,10. I do not understand why.
I created an example of the problem.
https://regex101.com/r/VkuPqP/1
In the debugger I see that the engine does not try the second alternative from the group when the pattern recurses. I expected it to work. Where is the problem?
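That happens because regex subroutine calls in PCRE are atomic (PCRE2 changed this in release 10.30, from which subroutine calls are no longer treated as atomic groups).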
The regex you have can be rewritten as ^(?:([1-9]|1[0-6]),){0,7}(?>[1-9]|1[0-6])$. An atomic group (?>...|...) does not allow backtracking into it, so if the first branch "wins" (as in your example), the subsequent ones will not be tried when a later part of the pattern fails (here, $ fails to match the end of the string after matching 1, because it is followed by 0).
You may swap the alternatives in this case so that the longer one comes first:
^(?:(1[0-6]|[1-9]),){0,7}(?1)$
In general, the best practice is that each alternative in a group should match at a different location in the string; no two alternatives should be able to match at the same location.
If you can't rewrite an alternation group so that each alternative matches at unique locations in the string, you should repeat the group without using a regex subroutine.
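Where the pattern is assembled in code, the repetition can be avoided at the source level instead of with a subroutine call. Here is a minimal sketch in Python (whose re module has no (?1) at all), assuming the pattern is built from a shared string. The names NUM, LIST_RE and is_valid are made up for illustration.

```python
import re

# One definition of "a number from 1 to 16", longer alternative first.
NUM = r"(?:1[0-6]|[1-9])"

# Up to seven "number," prefixes followed by a final number: 1 to 8 numbers.
LIST_RE = re.compile(rf"(?:{NUM},){{0,7}}{NUM}")


def is_valid(s: str) -> bool:
    """Return True if s is a comma-separated list of 1 to 8 numbers in 1-16."""
    return LIST_RE.fullmatch(s) is not None


for sample in ["1,2,3,4,5,6,7,8", "1,2,3,4,5,6,7,8,9", "16", "17", "1,10"]:
    print(sample, is_valid(sample))
```

The subpattern is written once but expanded twice in the compiled regex, so ordinary backtracking into the alternation still applies.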

Regex how to match two similar numbers in separate match groups?

I got the following string:
[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000
The two numbers (4.126 and 20.000) should be matched, each into its own capture group.
My current expression is (\d+.\d{3}), which matches 4.126. How can I match 20.000 into a second capture group? Adding the same capture group again makes it find nothing. So what I basically need is, "search for the first number, then ignore everything until you find the next number."
You could use something like this: (\d+\.\d{3}).+?(\d+\.\d{3})$, which is essentially your regex (plus a minor fix) twice, with the difference that the second copy must match at the end of the string.
Another minor note: your regex has a potential issue in that it matches the decimal point with an unescaped period. In regular expressions, the period means "any character", so your expression would also match 4s222. Adding a \ in front makes the regex engine treat it as a literal period rather than a special character.
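For a quick check, the suggested pattern can be tried in Python. This is only a sketch: the question does not say which language or tool is in use, and the variable names are made up.

```python
import re

line = "[13:49:38 INFO]: Overall : Mean tick time: 4.126 ms. Mean TPS: 20.000"

# Escaped dots; the lazy .+? skips the text between the two numbers.
match = re.search(r"(\d+\.\d{3}).+?(\d+\.\d{3})$", line)
tick_time, tps = match.groups()
print(tick_time, tps)

# An unescaped dot accepts any character in place of the decimal point.
loose = re.fullmatch(r"\d+.\d{3}", "4s222") is not None
strict = re.fullmatch(r"\d+\.\d{3}", "4s222") is not None
print(loose, strict)
```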

Regex PCRE: Validate string to match first set of string instead of last

I tried quite a few things, but I'm stuck with my regex whenever the input meets the criteria two consecutive times. In that case it treats them as one match instead of two.
\[ame\=[^\.]+(.+)youtube\.(.+)v\=([^\]\&\"]+)[\]\'\"\&](.+)\[\/ame\]
E.g.
[ame="http://www.youtube.com/watch?v=brfr5CD2qqY"][B][COLOR=yellow]http://www.youtube.com/watch?v=brfrx5D2qqY[/COLOR][/B][/ame][/U]
[B][COLOR=yellow]or[/COLOR][/B] [B][COLOR=yellow]B[/COLOR][/B]
[ame="http://www.youtube.com/watch?v=M9ak3rKIBAU"][B][COLOR=yellow]http://www.youtube.com/watch?v=M9a3arKIBAU[/COLOR][/B][/ame]
[B][COLOR=yellow]or[/COLOR][/B] [B][COLOR=yellow]C[/COLOR][/B]
[ame="http://www.youtube.com/watch?v=7vh--3pyq5U"][COLOR=yellow]http://www.youtube.com/watch?v=7vh--3pyq5U[/COLOR][/ame]
In that case, instead of matching all 3 occurrences separately, this regex takes them as one.
Any ideas how to make an expression that would say match the first "[/ame]"?
The problem is the use of .+; these quantifiers are "greedy", meaning they will consume as much input as possible while still allowing a match.
Change them to reluctant quantifiers, .+?, which won't skip forward over the end of the first match to reach the end of the last match.
I'm not sure what your objective is (you haven't made that clear yet), but this will match and capture the YouTube URL for you, ensuring you match each single instance between [ame= and [/ame] separately:
/\[ame=["'](.*?)["'](.*?)\/ame\]/i
Here's a working example, and a great sandbox to play around in: http://regex101.com/r/jR4lK2
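The difference between greedy and reluctant quantifiers can be seen in a small Python sketch. The sample is a simplified single-line version of the post (the link texts are placeholders), and the pattern that extracts the video id is an illustration, not one of the patterns given above.

```python
import re

# Simplified single-line sample; "first", "second", "third" are placeholders.
sample = (
    '[ame="http://www.youtube.com/watch?v=brfr5CD2qqY"]first[/ame] or '
    '[ame="http://www.youtube.com/watch?v=M9ak3rKIBAU"]second[/ame] or '
    '[ame="http://www.youtube.com/watch?v=7vh--3pyq5U"]third[/ame]'
)

# Greedy: .+ runs to the last [/ame], so everything is one match.
greedy = re.findall(r"\[ame=.+\[/ame\]", sample)

# Reluctant: .+? stops at the first [/ame] after each opening tag.
lazy = re.findall(r"\[ame=.+?\[/ame\]", sample)

# Illustrative pattern that captures only the video id of each block.
video_ids = re.findall(r"\[ame=[\"'].*?v=([^\"'&\]]+).*?\[/ame\]", sample)

print(len(greedy), len(lazy))
print(video_ids)
```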

Notepad++ regex group capture

I have a txt file like this:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
I'm trying to delete all subdomains with this regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Result:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why does the last line become com.ua instead of jwbefw.com.ua?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. Since some TLDs (top-level domains) contain a period, and you could theoretically have a list with multiple levels of subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua, .+ first consumes the whole line and then gives characters back one at a time.
The first point at which the rest of the pattern can match is when .+ has backed off to srfsf.jwbefw:
\. matches the next period, (.*?) matches com, and ua (the second option) matches at the end of the line.
The engine will not keep looking to find "com.ua" because ".ua" already met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy":
.+? will match any character (at least one) and then try to match the next part of the regex. If that fails, .+? consumes one more character and the rest of the regex is evaluated again.
With srfsf.jwbefw.com.ua, the .+? consumes only srfsf before the period matches; (.*?) then matches jwbefw, and the TLD group matches com.ua.
But adding the ? also creates an issue.
Making that first .+ lazy causes group 1 to match bb.prontube.ru instead of prontube.ru for xx.bb.prontube.ru.
This is because the first period, the one after xx, will match; then inside group 1, (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the inner group from (.*?) to ([\w-]*?) so that it cannot match a period, only letters, digits, underscores, or a dash.
Resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
Last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
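The behaviour of the final pattern can be checked outside Notepad++. Here is a minimal sketch using Python's re module, applied line by line. The leading label of the first host is typed as plain ASCII xxx, and the names LAZY and GREEDY are made up for illustration.

```python
import re

hosts = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

# Final pattern: lazy prefix, no periods allowed inside the kept label.
LAZY = re.compile(r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$")

# Pattern from the question: greedy prefix.
GREEDY = re.compile(r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

# Hosts without a subdomain do not match and are left unchanged.
stripped = [LAZY.sub(r"\1", host) for host in hosts]
greedy_last = GREEDY.sub(r"\1", hosts[-1])

print(stripped)
print(greedy_last)
```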
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1

Regex multiple matches and $1, $2 variables (quick and easy!)

I need to extract numeric values from strings like "£17,000 - £35,000 dependent on experience".
That string is just an example; I can have 17k, 17.000, 17 or 17,000. In every string there can be 0, 1 or 2 numbers (not more than 2), they can be anywhere in the string, separated by anything else. I just need to extract them and put the first one in one place and the second in another.
I came up with this:
([0-9]+k?[.,]?[0-9]+)
but it gives me two matches (don't mind the k?[,.], it's correct), both in the $1 group. I need to have 17,000 in $1 and 35,000 in $2. How can I accomplish this? I could also use 2 different regexes.
Using regex
With every opening round bracket you create a new capturing group. So to have a second capturing group $2, you need to match the second number with another part of your regex that is within brackets, and of course you need to match the part between the two numbers.
([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+)
It could be, though, that Solr has regex functions that put all matches into an array, which might be easier to use.
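Both ideas, two capture groups and collecting all matches into a list, can be sketched in Python. This is not Solr code; the helper extract_two and the second sample string are made up for illustration.

```python
import re
from typing import Optional, Tuple

s = "£17,000 - £35,000 dependent on experience"

# Two capture groups in one pattern: first number in group 1, second in group 2.
pair = re.search(r"([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+)", s)
first, second = pair.groups()

NUMBER = re.compile(r"[0-9]+k?[.,]?[0-9]+")


def extract_two(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return up to two numbers found anywhere in text, padded with None."""
    found = NUMBER.findall(text)[:2]
    found += [None] * (2 - len(found))
    return found[0], found[1]


print(first, second)
print(extract_two(s))
print(extract_two("£25,000 per year"))
```

The list-based helper does not depend on a dash between the numbers, which fits the requirement that they can be separated by anything.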
Match the entire salary range with 2 capture groups rather than matching every amount with one capture group:
([0-9]+k?[.,]?[0-9]+) - ([0-9]+k?[.,]?[0-9]+)
However, I'm worried (yeah, I'm minding it :p) about that regex, as it will match some strange things. In
182k,938 - 29.233333
both numbers will be matched. It can definitely be improved if you can give more information on your input types.
What about something along the lines of
[£]?([0-9]+k?[.,]?[0-9]+) - [£]?([0-9]+k?[.,]?[0-9]+)
This should now give you two groups.
Edit: Might need to clean up the spaces too