Regex href match a number - regex

Well, here I am back at regex and my poor understanding of it. Spent more time learning it and this is what I came up with:
/(.*)
I basically want the number in this string:
510973
My regex is almost good? my original was:
"/<a href=\"travis.php?theTaco(.*)\">(.*)<\/a>/";
But sometimes it returned me huge strings. So, I just want to get numbers only.
I searched through other posts but there is such a large amount of unrelated material, please give an example, resource, or a link directing to a very related question.
Thank you.

Try using a HTML parser provided by the language you are using.
Reason why your first regex fails:
[0-9999999] is not what you think. It is same as [0-9] which matches one digit. To match a number you need [0-9]+. Also .* is greedy and will try to match as much as it can. You can use .*? to make it non-greedy. Since you are trying to match a number again, use [0-9]+ again instead of .*. Also if the two number you are capturing will be the same, you can just match the first and use a back reference \1 for 2nd one.
And there are a few regex meta-characters which you need to escape like ., ?.
Try:
<a href=\"travis\.php\?theTaco=([0-9]+)\">\1<\/a>

To capture a number, you don't use a range like [0-99999], you capture by digit. Something like [0-9]+ is more like what you want for that section. Also, escaping is important like codaddict said.

Others have already mentioned some issues regarding your regex, so I won't bother repeating them.
There are also issues regarding how you specified what it is you want. You can simply match via
/theTaco=(\d+)/
and take the first capturing group. You have not given us enough information to know whether this suits your needs.

Related

How to write a regular expression inside awk to IGNORE a word as a whole? [duplicate]

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.

Regular Expression look ahead in log

https://regex101.com/r/9kfa7D/4
I can never get the look ahead portion correct. I've tried a few different things, but I'm trying to get to the next date and parse it like that. Mainly because I don't know what the message will look like and it could be pretty random. Any help would be great.
I need to group the message portion of it.
Edit: Updated to make it a little more clear of what I'm trying to do. Never everything from each date.
You can just tweak your regex without tinkering lookahead like this:
^\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}
Updated Regex Demo
EDIT:
As per updated question OP can use this negative lookahead based regex to capture log text:
^[^\[]+\[[^\]]+\] +[^:]+ +(.*(?:\n(?!\d{2}-[a-zA-Z]{3}-).*)*)
This regex doesn't use DOTALL flag by unrolling the loop in last segment. This makes above regex pretty fast to complete the parsing.
New Demo
If you care for the message between log timestamps use this (it's in the 2-nd group):
/(\d{2}-\w{3}-\d{4} \S+ \S+ \[[^\]]++\] )(?=(.+)((?1)|\z))/gms
^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) ((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)
The first part is the date pattern, which is non-grouping since you do not want to keep the date.
The second part is [^\n]+ which is followed by a \n provided it is not followed by \d{2}-\w{3}-\d{4} (hence the negative look ahead).
The second part is then repeated any number of times.
You can see the demo on regex101.
What you need
(^\d+.[A-Z].*?)[A-Z]
how it works
Lots of people like the complex thinking when they are confront a regex. But you should know exactly what you want.
you just need to match this: 29-Jun-2016 09:33:43.565 INFO and nothing else. So let's begin:
First: two digit,
next: A word with capital letter
next: everything from this word to the next capitalize word
finish.
the main rule
Non-greedy mantch: .*?
prove
NOTE
do you want to match from beginning to log
very easy just add .*?log at the end. that's it.
Do you ever pay attention to how many steps it take?
First of mine: 7952
Second of mine: 13751
Compare it with other
After putting the picture here. some guys update their regex. I do
not want to argue. no problem. I just wanted to show it.
Otherwise I can ( as you can ) makes it less by choice the specific
pattern For example: ^\d+-[A-Za-z]+-\d+\s\d+:\d+:\d+\.\d+ Now 7952 become 3878
Do you want to learn how lock-head assertion works?
Very easy. The main concept is that (?=) is never matches anything. It only matches the position just one point before you want.
like:
^\d+-[A-Z].+(?=[A-Z]+ ).
It still matches: 29-Jun-2016 09:33:43.565 INFO
Pay attention to . at the end. So here the look head assertion point to between F and O
If would like to match this 29-Jun-2016 09:33:43.565 then what can you do?
Think about this:
^\d+-[A-Za-z].+(?=[\d] ).
and figure out it by yourself.

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to omit capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested about.
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

Look behinds: all the rage in regex?

Many regex questions lately have some kind of look-around element in the query that appears to me is not necessary to the success of the match. Is there some teaching resource that is promoting them? I am trying to figure out what kinds of cases you would be better off using a positive look ahead/behind. The main application I can see is when trying to not match an element. But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
And this one from another question:
$url = "www.example.com/id/1234";
preg_match("/\d+(?<=id\/[\d])/",$url,$matches);
When is it truly better to use a positive look-around? Can you give some examples?
I realize this is bordering on an opinion-based question, but I think the answers would be really instructive. Regex is confusing enough without making things more complicated... I have read this page and am more interested in some simple guidelines for when to use them rather than how they work.
Thanks for all the replies. In addition to those below, I recommend checking out m.buettner's great answer here.
You can capture overlapping matches, and you can find matches which could lie in the lookarounds of other matches.
You can express complex logical assertions about your match (because many engines let you use multiple lookbehind/lookahead assertions which all must match in order for the match to succeed).
Lookaround is a natural way to express the common constraint "matches X, if it is followed by/preceded by Y". It is (arguably) less natural to add extra "matching" parts that have to be thrown out by postprocessing.
Negative lookaround assertions, of course, are even more useful. Combined with #2, they can allow you do some pretty wizard tricks, which may even be hard to express in usual program logic.
Examples, by popular request:
Overlapping matches: suppose you want to find all candidate genes in a given genetic sequence. Genes generally start with ATG, and end with TAG, TAA or TGA. But, candidates could overlap: false starts may exist. So, you can use a regex like this:
ATG(?=((?:...)*(?:TAG|TAA|TGA)))
This simple regex looks for the ATG start-codon, followed by some number of codons, followed by a stop codon. It pulls out everything that looks like a gene (sans start codon), and properly outputs genes even if they overlap.
Zero-width matching: suppose you want to find every tr with a specific class in a computer-generated HTML page. You might do something like this:
<tr class="TableRow">.*?</tr>(?=<tr class="TableRow">|</table>)
This deals with the case in which a bare </tr> appears inside the row. (Of course, in general, an HTML parser is a better choice, but sometimes you just need something quick and dirty).
Multiple constraints: suppose you have a file with data like id:tag1,tag2,tag3,tag4, with tags in any order, and you want to find all rows with tags "green" and "egg". This can be done easily with two lookaheads:
(.*):(?=.*\bgreen\b)(?=.*\begg\b)
There are two great things about lookaround expressions:
They are zero-width assertions. They require to be matched, but they consume nothing of the input string. This allows to describe parts of the string which will not be contained in a match result. By using capturing groups in lookaround expressions, they are the only way to capture parts of the input multiple times.
They simplify a lot of things. While they do not extend regular languages, they easily allow to combine (intersect) multiple expressions to match the same part of a string.
Well one simple case where they are handy is when you are anchoring the pattern to the start or finish of a line, and just want to make sure that something is either right ahead or behind the pattern you are matching.
I try to address your points:
some kind of look-around element in the query that appears to me is not necessary to the success of the match
Of course they are necessary for the match. As soon as a lookaround assertions fails, there is no match. They can be used to ensure conditions around the pattern, that have additionally to be true. The whole regex does only match, if:
The pattern does fit and
The lookaround assertions are true.
==> But the returned match is only the pattern.
When is it truly better to use a positive look-around?
Simple answer: when you want stuff to be there, but you don't want to match it!
As Bergi mentioned in his answer, they are zero width assertions, this means they don't match a character sequence, they just ensure it is there. So the characters inside a lookaround expression are not "consumed", the regex engine continues after the last "consumed" character.
Regarding your first example:
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
I think there is a misunderstanding on your side, when you write "has a simple solution to capturing the .*". The .* is not "captured", it is the only thing that the expression does match. But only those characters are matched that have a "<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">" before and a "<\/a><span" after (those two are not part of the match!).
"Captured" is only something that has been matched by a capturing group.
The second example
\d+(?<=id\/[\d])
Is interesting. It is matching a sequence of digits (\d+) and after the sequence, the lookbehind assertion checks if there is one digit with "id/" before it. Means it will fail if there is more than one digit or if the text "id/" before the digit is missing. Means this regex is matching only one digit, when there is fitting text before.
teaching resources
www.regular-expressions.info
perlretut on Looking ahead and looking behind
I'm assuming you understand the good uses of lookarounds, and ask why they are used with no apparent reason.
I think there are four main categories of how people use regular expressions:
Validation
Validation is usually done on the whole text. Lookarounds like you describe are not possible.
Match
Extracting a part of the text. Lookarounds are used mainly due to developer laziness: avoiding captures.
For example, if we have in a settings file with the line Index=5, we can match /^Index=(\d+)/ and take the first group, or match /(?<=^Index=)\d+/ and take everything.
As other answers said, sometimes you need overlapping between matches, but these are relatively rare.
Replace
This is similar to match with one difference: the whole match is removed and is being replaced with a new string (and some captured groups).
Example: we want to highlight the name in "Hi, my name is Bob!".
We can replace /(name is )(\w+)/ with $1<b>$2</b>,
but it is neater to replace /(?<=name is )\w+/ with <b>$&</b> - and no captures at all.
Split
split takes the text and breaks it to an array of tokens, with your pattern being the delimiter. This is done by:
Find a match. Everything before this match is token.
The content of the match is discarded, but:
In most flavors, each captured group in the match is also a token (notably not in Java).
When there are no more matches, the rest of the text is the last token.
Here, lookarounds are crucial. Matching a character means removing it from the result, or at least separating it from its token.
Example: We have a comma separated list of quoted string: "Hello","Hi, I'm Jim."
Splitting by comma /,/ is wrong: {"Hello", "Hi, I'm Jim."}
We can't add the quote mark, /",/: {"Hello, "Hi, I'm Jim."}
The only good option is lookbehind, /(?<="),/: {"Hello", "Hi, I'm Jim."}
Personally, I prefer to match the tokens rather than split by the delimiter, whenever that is possible.
Conclusion
To answer the main question - these lookarounds are used because:
Sometimes you can't match text that need.
Developers are shiftless.
Lookaround assertions can also be used to reduce backtracking which can be the main cause for a bad performance in regexes.
For example: The regex ^[0-9A-Z]([-.\w]*[0-9A-Z])*#(1) can also be written ^[0-9A-Z][-.\w]*(?<=[0-9A-Z])#(2) using a positive look behind (simple validation of the user name in an e-mail address).
Regex (1) can cause a lot of backtracking essentially because [0-9A-Z] is a subset of [-.\w] and the nested quantifiers. Regex (2) reduces the excessive backtracking, more information here Backtracking, section Controlling Backtracking > Lookbehind Assertions.
For more information about backtracking
Best Practices for Regular Expressions in the .NET Framework
Optimizing Regular Expression Performance, Part II: Taking Charge of Backtracking
Runaway Regular Expressions: Catastrophic Backtracking
I typed this a while back but got busy (still am, so I might take a while to reply back) and didn't get around to post it. If you're still open to answers...
Is there some teaching resource that is promoting them?
I don't think so, it's just a coincidence I believe.
But, for example, this query from a recent question has a simple solution to capturing the .*, but why would you use a look behind?
(?<=<td><a href="\/xxx\.html\?n=[0-9]{0, 5}">).*(?=<\/a><span
This is most probably a C# regex, since variable width lookbehinds are not supported my many regex engines. Well, the lookarounds could be certainly avoided here, because for this, I believe it's really simpler to have capture groups (and make the .* lazy as we're at it):
(<td><a href="\/xxx\.html\?n=[0-9]{0,5}">).*?(<\/a><span)
If it's for a replace, or
<td><a href="\/xxx\.html\?n=[0-9]{0,5}">(.*?)<\/a><span
for a match. Though an html parser would definitely be more advisable here.
Lookarounds in this case I believe are slower. See regex101 demo where the match is 64 steps for capture groups but 94+19 = 1-3 steps for the lookarounds.
When is it truly better to use a positive look-around? Can you give some examples?
Well, lookarounds have the property of being zero-width assertions, which mean they don't really comtribute to matches while they contribute onto deciding what to match and also allows overlapping matches.
Thinking a bit about it, I think, too, that negative lookarounds get used much more often, but that doesn't make positive lookarounds less useful!
Some 'exploits' I can find browsing some old answers of mine (links below will be demos from regex101) follow. When/If you see something you're not familiar about, I probably won't be explaining it here, since the question's focused on positive lookarounds, but you can always look at the demo links I provided where there's a description of the regex, and if you still want some explanation, let me know and I'll try to explain as much as I can.
To get matches between certain characters:
In some matches, positive lookahead make things easier, where a lookahead could do as well, or when it's not so practical to use no lookarounds:
Dog sighed. "I'm no super dog, nor special dog," said Dog, "I'm an ordinary dog, now leave me alone!" Dog pushed him away and made his way to the other dog.
We want to get all the dog (regardless of case) outside quotes. With a positive lookahead, we can do this:
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*$)
to ensure that there are even number of quotes ahead. With a negative lookahead, it would look like this:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
to ensure that there are no odd number of quotes ahead. Or use something like this if you don't want a lookahead, but you'll have to extract the group 1 matches:
(?:"[^"]+"[^"]+?)?(\bdog\b)
Okay, now say we want the opposite; find 'dog' inside the quotes. The regex with the lookarounds just need to have the sign inversed, first and second:
\bdog\b(?!(?:[^"]*"[^"]*")*[^"]*$)
\bdog\b(?=(?:[^"]*"[^"]*")*[^"]*"[^"]*$)
But without the lookaheads, it's not possible. the closest you can get is maybe this:
"[^"]*(\bdog\b)[^"]*"
But this doesn't get all the matches, or you can maybe use this:
"[^"]*?(\bdog\b)[^"]*?(?:(\bdog\b)[^"]*?)?"
But it's just not practical for more occurrences of dog and you get the results in variables with increasing numbers... And this is indeed easier with lookarounds, because they are zero width assertions, you don't have to worry about the expression inside the lookaround to match dog or not, or the regex wouldn't have obtained all the occurrences of dog in the quotes.
Of course now, this logic can be extended to groups of characters, such as getting specific patterns between words such as start and end.
Overlapping matches
If you have a string like:
abcdefghijkl
And want to extract all the consecutive 3 characters possible inside, you can use this:
(?=(...))
If you have something like:
1A Line1 Detail1 Detail2 Detail3 2A Line2 Detail 3A Line3 Detail Detail
And want to extract these, knowing that each line starts with #A Line# (where # is a number):
1A Line1 Detail1 Detail2 Detail3
2A Line2 Detail
3A Line3 Detail Detail
You might try this, which fails because of greediness...
[0-9]+A Line[0-9]+(?: \w+)+
Or this, which when made lazy no more works...
[0-9]+A Line[0-9]+(?: \w+)+?
But with a positive lookahead, you get this:
[0-9]+A Line[0-9]+(?: \w+)+?(?= [0-9]+A Line[0-9]+|$)
And appropriately extracts what's needed.
Another possible situation is one where you have something like this:
#ff00fffirstword#445533secondword##008877thi#rdword#
Which you want to convert to three pairs of variables (first of the pair being a # and some hex values (6) and whatever characters after them):
#ff00ff and firstword
#445533 and secondword#
#008877 and thi#rdword#
If there were no hashes inside the 'words', it would have been enough to use (#[0-9a-f]{6})([^#]+), but unfortunately, that's not the case and you have to resort to .*? instead of [^#]+, which doesn't quite yet solve the issue of stray hashes. Positive lookaheads however make this possible:
(#[0-9a-f]{6})(.+?)(?=#[0-9a-f]{6}|$)
Validation & Formatting
Not recommended, but you can use positive lookaheads for quick validations. The following regex for instance allow the entry of a string containing at least 1 digit and 1 lowercase letter.
^(?=[^0-9]*[0-9])(?=[^a-z]*[a-z])
This can be useful when you're checking for character length but have patterns of varying length in the a string, for example, a 4 character long string with valid formats where # indicates a digit and the hyphen/dash/minus - must be in the middle:
##-#
#-##
A regex like this does the trick:
^(?=.{4}$)\d+-\d+
Where otherwise, you'd do ^(?:[0-9]{2}-[0-9]|[0-9]-[0-9]{2})$ and imagine now that the max length was 15; the number of alterations you'd need.
If you want a quick and dirty way to rearrange some dates in the 'messed up' format mmm-yyyy and yyyy-mm to a more uniform format mmm-yyyy, you can use this:
(?=.*(\b\w{3}\b))(?=.*(\b\d{4}\b)).*
Input:
Oct-2013
2013-Oct
Output:
Oct-2013
Oct-2013
An alternative might be to use a regex (normal match) and process separately all the non-conforming formats separately.
Something else I came across on SO was the indian currency format, which was ##,##,###.### (3 digits to the left of the decimal and all other digits groupped in pair). If you have an input of 122123123456.764244, you expect 1,22,12,31,23,456.764244 and if you want to use a regex, this one does this:
\G\d{1,2}\K\B(?=(?:\d{2})*\d{3}(?!\d))
(The (?:\G|^) in the link is only used because \G matches only at the start of the string and after a match) and I don't think this could work without the positive lookahead, since it looks forward without moving the point of replacement.)
Trimming
Suppose you have:
this is a sentence
And want to trim all the spaces with a single regex. You might be tempted to do a general replace on spaces:
\s+
But this yields thisisasentence. Well, maybe replace with a single space? It now yields " this is a sentence " (double quotes used because backticks eats spaces). Something you can however do is this:
^\s*|\s$|\s+(?=\s)
Which makes sure to leave one space behind so that you can replace with nothing and get "this is a sentence".
Splitting
Well, somewhere else where positive lookarounds might be useful is where, say you have a string ABC12DE3456FGHI789 and want to get the letters+digits apart, that is you want to get ABC12, DE3456 and FGHI789. You can easily do use the regex:
(?<=[0-9])(?=[A-Z])
While if you use ([A-Z]+[0-9]+) (i.e. the captured groups are put back in the resulting list/array/etc, you will be getting empty elements as well.
Note that this could be done with a match as well, with [A-Z]+[0-9]+
If I had to mention negative lookarounds, this post would have been even longer :)
Keep in mind that a positive/negative lookaround is the same for a regex engine. The goal of lookarounds is to perform a check somewhere in your "regular expression".
One of the main interest is to capture something without using capturing parenthesis (capturing the whole pattern), example:
string: aaabbbccc
regex: (?<=aaa)bbb(?=ccc)
(you obtain the result with the whole pattern)
instead of: aaa(bbb)ccc
(you obtain the result with the capturing group.)

Regex - Get string between two words that doesn't contain word

I've been looking around and could not make this happen. I am not totally noob.
I need to get text delimited by (including) START and END that doesn't contain START. Basically I can't find a way to negate a whole word without using advanced stuff.
Example string:
abcSTARTabcSTARTabcENDabc
The expected result:
STARTabcEND
Not good:
STARTabcSTARTabcEND
I can't use backward search stuff. I am testing my regex here: www.regextester.com
Thanks for any advice.
Try this
START(?!.*START).*?END
See it here online on Regexr
(?!.*START) is a negative lookahead. It ensures that the word "START" is not following
.*? is a non greedy match of all characters till the next "END". Its needed, because the negative lookahead is just looking ahead and not capturing anything (zero length assertion)
Update:
I thought a bit more, the solution above is matching till the first "END". If this is not wanted (because you are excluding START from the content) then use the greedy version
START(?!.*START).*END
this will match till the last "END".
START(?:(?!START).)*END
will work with any number of START...END pairs. To demonstrate in Python:
>>> import re
>>> a = "abcSTARTdefENDghiSTARTjlkENDopqSTARTrstSTARTuvwENDxyz"
>>> re.findall(r"START(?:(?!START).)*END", a)
['STARTdefEND', 'STARTjlkEND', 'STARTuvwEND']
If you only care for the content between START and END, use this:
(?<=START)(?:(?!START).)*(?=END)
See it here:
>>> re.findall(r"(?<=START)(?:(?!START).)*(?=END)", a)
['def', 'jlk', 'uvw']
The really pedestrian solution would be START(([^S]|S*S[^ST]|ST[^A]|STA[^R]|STAR[^T])*(S(T(AR?)?)?)?)END. Modern regex flavors have negative assertions which do this more elegantly, but I interpret your comment about "backwards search" to perhaps mean you cannot or don't want to use this feature.
Update: Just for completeness, note that the above is greedy with respect to the end delimiter. To only capture the shortest possible string, extend the negation to also cover the end delimiter -- START(([^ES]|E*E[^ENS]|EN[^DS]|S*S[^STE]|ST[^AE]|STA[^RE]|STAR[^TE])*(S(T(AR?)?)?|EN?)?)END. This risks to exceed the torture threshold in most cultures, though.
Bug fix: A previous version of this answer had a bug, in that SSTART could be part of the match (the second S would match [^T], etc). I fixed this but by the addition of S in [^ST] and adding S* before the non-optional S to allow for arbitrary repetitions of S otherwise.
May I suggest a possible improvement on the solution of Tim Pietzcker?
It seems to me that START(?:(?!START).)*?END is better in order to only catch a START immediately followed by an END without any START or END in between. I am using .NET and Tim's solution would match also something like START END END. At least in my personal case this is not wanted.
[EDIT: I have left this post for the information on capture groups but the main solution I gave was not correct.
(?:START)((?:[^S]|S[^T]|ST[^A]|STA[^R]|STAR[^T])*)(?:END)
as pointed out in the comments would not work; I was forgetting that the ignored characters could not be dropped and thus you would need something such as ...|STA(?![^R])| to still allow that character to be part of END, thus failing on something such as STARTSTAEND; so it's clearly a better choice; the following should show the proper way to use the capture groups...]
The answer given using the 'zero-width negative lookahead' operator "?!", with capture groups, is: (?:START)((?!.*START).*)(?:END) which captures the inner text using $1 for the replace. If you want to have the START and END tags captured you could do (START)((?!.*START).*)(END) which gives $1=START $2=text and $3=END or various other permutations by adding/removing ()s or ?:s.
That way if you are using it to do search and replace, you can do, something like BEGIN$1FINISH. So, if you started with:
abcSTARTdefSTARTghiENDjkl
you would get ghi as capture group 1, and replacing with BEGIN$1FINISH would give you the following:
abcSTARTdefBEGINghiFINISHjkl
which would allow you to change your START/END tokens only when paired properly.
Each (x) is a group, but I have put (?:x) for each of the ones except the middle which marks it as a non-capturing group; the only one I left without a ?: was the middle; however, you could also conceivably capture the BEGIN/END tokens as well if you wanted to move them around or what-have-you.
See the Java regex documentation for full details on Java regexes.