Transform negative regex lookahead to greedy needed - regex

The task I'm trying to solve seems pretty simple: I need to select all font-changing tags except those for one particular font (AIGDT). I'm going to cut them out in order to simplify further text processing.
I'm trying to use negative regex lookahead like this:
Font='(?!(AIGDT))(.*)'
But for the single-line text sample:
<StyleOverride Font='Arial' FontSize='0,32971'>[</StyleOverride><StyleOverride FontSize='0,21558'> </StyleOverride><StyleOverride Font='AIGDT' Italic='False'>n</StyleOverride><DimensionValue/> <StyleOverride Font='Arial' FontSize='0,32971'>]</StyleOverride>
It returns a single match of 200+ characters, while I'm expecting two 12-character matches (Font='Arial').
I believe this is because the lookahead is greedy.
Can anybody point out my mistake?
Thanks in advance.

How does Font='(?!(AIGDT))([^']+)' work for you?
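Basically, narrow down the second capture to "anything but a single quote". The lookahead itself isn't greedy; the .* after it is, so it runs on to the last quote on the line.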
(Full disclosure: On my phone at the moment so I haven't run it, but in theory it works nicely)
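For anyone who wants to check this without a phone in hand, here is a minimal Python sketch comparing the original pattern with the suggested one on the sample line from the question. The variable names are mine, and Python's re module is only an assumption; the question does not say which regex engine is in use.

```python
import re

# Sample line from the question.
sample = (
    "<StyleOverride Font='Arial' FontSize='0,32971'>[</StyleOverride>"
    "<StyleOverride FontSize='0,21558'> </StyleOverride>"
    "<StyleOverride Font='AIGDT' Italic='False'>n</StyleOverride>"
    "<DimensionValue/> "
    "<StyleOverride Font='Arial' FontSize='0,32971'>]</StyleOverride>"
)

# Original pattern: the greedy .* runs on to the last quote on the line.
greedy = [m.group(0) for m in re.finditer(r"Font='(?!(AIGDT))(.*)'", sample)]

# Suggested pattern: [^']+ cannot cross the closing quote.
bounded = [m.group(0) for m in re.finditer(r"Font='(?!AIGDT)([^']+)'", sample)]

print(len(greedy))  # number of matches with the original pattern
print(bounded)      # matches with the suggested pattern
```

The lookahead is the same in both patterns; only the part after it changes.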

Related

Regular Expression look ahead in log

https://regex101.com/r/9kfa7D/4
I can never get the look ahead portion correct. I've tried a few different things, but I'm trying to get to the next date and parse it like that. Mainly because I don't know what the message will look like and it could be pretty random. Any help would be great.
I need to group the message portion of it.
Edit: Updated to make it a little clearer what I'm trying to do. I need everything from each date.
You can just tweak your regex, without tinkering with the lookahead, like this:
^\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}
Updated Regex Demo
EDIT:
As per the updated question, the OP can use this negative-lookahead-based regex to capture the log text:
^[^\[]+\[[^\]]+\] +[^:]+ +(.*(?:\n(?!\d{2}-[a-zA-Z]{3}-).*)*)
This regex avoids the DOTALL flag by unrolling the loop in the last segment, which makes it pretty fast at parsing.
New Demo
If you care about the message between log timestamps, use this (it's in the second group):
/(\d{2}-\w{3}-\d{4} \S+ \S+ \[[^\]]++\] )(?=(.+)((?1)|\z))/gms
^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) ((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)
The first part is the date pattern, which is non-grouping since you do not want to keep the date.
The second part is [^\n]+ which is followed by a \n provided it is not followed by \d{2}-\w{3}-\d{4} (hence the negative look ahead).
The second part is then repeated any number of times.
You can see the demo on regex101.
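To make the "line break not followed by a date" idea concrete, here is a small Python sketch of the last pattern above. The log excerpt is hypothetical (the real log lives behind the regex101 link), and I'm assuming the multiline flag so that ^ matches at each line start.

```python
import re

# Hypothetical log excerpt; the real format in the question may differ.
log = (
    "29-Jun-2016 09:33:43.565 INFO [main] Starting service\n"
    "    at line two\n"
    "    at line three\n"
    "29-Jun-2016 09:33:44.001 WARN [main] Disk low"
)

pattern = re.compile(
    # Date and time, non-capturing because we do not want to keep them.
    r"^(?:\d{2}-\w{3}-\d{4} (?:\d{2}:){2}\d{2}\.\d{3}) "
    # Message: lines joined by newlines that are NOT followed by a new date.
    r"((?:[^\n]+(?:\n+(?!\d{2}-\w{3}-\d{4})|))+)",
    re.MULTILINE,
)

# With one capturing group, findall returns just the message texts.
messages = pattern.findall(log)
for message in messages:
    print(repr(message))
```

The continuation lines stay attached to the first entry because the newline before them is not followed by a date.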
What you need
(^\d+.[A-Z].*?)[A-Z]
How it works
Many people reach for complicated constructs when they are confronted with a regex, but what matters is knowing exactly what you want to match.
You just need to match this: 29-Jun-2016 09:33:43.565 INFO and nothing else. So let's begin:
First: two digits,
next: a word starting with a capital letter,
next: everything from this word up to the next capitalized word.
Done.
The main rule
Non-greedy match: .*?
Proof (screenshot omitted)
Note
Do you want to match from the beginning up to log?
Very easy: just add .*?log at the end. That's it.
Have you ever paid attention to how many steps a regex takes?
My first one: 7952
My second one: 13751
Compare that with the others.
After I posted the picture here, some people updated their regexes. I do not want to argue, no problem; I just wanted to show it.
Besides, I can (as you can) reduce the step count by choosing a more specific pattern. For example, with ^\d+-[A-Za-z]+-\d+\s\d+:\d+:\d+\.\d+ the 7952 steps become 3878.
Do you want to learn how a lookahead assertion works?
Very easy. The main concept is that (?=...) never consumes any characters. It only asserts that the current position is just before the text you want.
For example:
^\d+-[A-Z].+(?=[A-Z]+ ).
It still matches: 29-Jun-2016 09:33:43.565 INFO
Pay attention to the . at the end. Here the lookahead assertion points to the position between F and O.
If you would like to match just 29-Jun-2016 09:33:43.565, what can you do?
Think about this:
^\d+-[A-Za-z].+(?=[\d] ).
and figure it out by yourself.
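The "asserts a position, consumes nothing" point can be checked with a short Python sketch. The log line below is hypothetical, built around the timestamp quoted above, and the variable names are mine.

```python
import re

# Hypothetical log line modelled on the timestamp quoted above.
line = "29-Jun-2016 09:33:43.565 INFO [main] Server startup in 1234 ms"

# The lookahead checks for "capital letters, then a space" but consumes
# nothing; the final . then consumes exactly one character.
with_dot = re.match(r"^\d+-[A-Z].+(?=[A-Z]+ ).", line).group(0)

# Without the trailing ., the match stops at the asserted position.
without_dot = re.match(r"^\d+-[A-Z].+(?=[A-Z]+ )", line).group(0)

print(with_dot)
print(without_dot)
```

The two results differ by exactly one character, which is the one the trailing . picks up after the lookahead position.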

RegEx to match acronyms

I am trying to write a regular expression that will match values such as U.S., D.C., U.S.A., etc.
Here is what I have so far -
\b([a-zA-Z]\.){2,}+
Note how this expression matches but does not include the last letter in the acronym.
Can anyone help explain what I am missing here?
SOLUTION
I'm posting the solution here in case this helps anyone.
\b(?:[a-zA-Z]\.){2,}
It seems as if a non-capturing group is required here.
Try (?:[a-zA-Z]\.){2,}
?: (non-capturing group) is there because you want to avoid capturing the last iteration of the repeated group.
For example, without ?:, 'U.S.A.' will yield a group match 'A.', which you are not interested in.
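Here is a quick Python sketch of that difference. Python's re is my assumption (the question does not name an engine), and I dropped the trailing + from the question's {2,}+ since it is not needed to show the effect.

```python
import re

# A repeated capturing group keeps only its LAST iteration in group(1),
# even though the overall match covers the whole acronym.
capturing = re.search(r"\b([a-zA-Z]\.){2,}", "U.S.A.")
whole = capturing.group(0)
last_iteration = capturing.group(1)

# With a non-capturing group there is no group to report, so findall
# returns the full matches.
acronyms = re.findall(r"\b(?:[a-zA-Z]\.){2,}", "U.S., D.C., U.S.A.")

print(whole, last_iteration, acronyms)
```

So the original expression may well have been matching the whole acronym all along; a tool that displays group 1 would only show the last letter and dot.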
None of these proposed solutions do what yours does - make sure that there are at least 2 letters in the acronym. Also, yours works on http://rubular.com/ . This is probably some issue with the regex implementation - to be fair, all of the matches that you got were valid acronyms. To fix this, you could either:
Make sure there's a space or EOF succeeding your expression ((?=\s|$) in ruby at least)
Surround your regex with ^ and $ to make sure it catches the whole string. You'd have to split the whole string on spaces to get matches with this though.
I prefer the former solution - to do this you'd have:
\b([a-zA-Z]\.){2,}(?=\s|$)
Edit: I've realized this doesn't actually work with other punctuation in the string, and a couple of other edge cases. This is super ugly, but I think it should be good enough:
(?<=\s|^)((?:[a-zA-Z]\.){2,})(?=[[:punct:]]?(?:\s|$))
This assumes that you've got this [[:punct:]] character class, and allows for 0-1 punctuation marks after an acronym that won't be captured. I've also fixed it up so that there's a single capture group that gets the whole acronym. Check out validation at http://rubular.com/r/lmr0qERLDh
Bonus: you now get to make this super confusing to anyone reading it.
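For readers not on Ruby, here is a rough Python sketch of the same idea. It is an adaptation, not the pattern above: Python's re has no [[:punct:]] class and requires fixed-width lookbehind, so I substituted (?<!\S) for (?<=\s|^) and a character class built from string.punctuation. The names PUNCT and ACRONYM are mine.

```python
import re
import string

# ASCII punctuation as a character class (stand-in for [[:punct:]]).
PUNCT = "[" + re.escape(string.punctuation) + "]"

ACRONYM = re.compile(
    r"(?<!\S)"                      # start of string or preceded by whitespace
    r"((?:[a-zA-Z]\.){2,})"         # the acronym itself, in one capture group
    r"(?=" + PUNCT + r"?(?:\s|$))"  # optional punctuation, then space or end
)

found = ACRONYM.findall("She flew from the U.S. to D.C., then on to the U.S.A.")
rejected = ACRONYM.findall("See a.b.com for details")

print(found, rejected)
```

The second call shows why the trailing lookahead matters: a.b. inside a.b.com is not reported as an acronym.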
This should work:
/([a-zA-Z]\.)+/g
I have slightly modified the solution above:
\b(?:[a-zA-Z]+\.){2,}
to enable capturing acronyms containing more than one letter between the dots, like in 'GHQ.AFP.X.Y'

Smallest possible match / nongreedy regex search

I first thought that this answer would totally solve my issue, but it did not.
I have a string url like this one:
http://www.someurl.com/some-text-1-0-1-0-some-other-text.htm#id_76
I would like to extract some-other-text so basically, I come with the following regex:
/0-(.*)\.htm/
Unfortunately, this matches 1-0-some-other-text because regexes are greedy. I cannot manage to make it nongreedy using .*?; it just does not change anything, as you can see here.
I also tried the U modifier, but it did not help.
Why does the "nongreedy" tip not work?
In case you need to get the closest match, you can make use of a tempered greedy token.
0-((?:(?!0-).)*)\.htm
See demo
The lazy version of your regex does not work because the regex engine analyzes the string from left to right. It always takes the leftmost position where a match can start and checks whether it can match from there. So, in your case, it found the first 0- and was happy with it. Laziness only affects where the match ends, i.e. the rightmost position. In your case there is only one possible rightmost position (the single .htm), so lazy matching could not help achieve the expected result.
You also can use
0-((?!.*?0-).*)\.htm
It will work if you have individual strings to extract the values from.
You want to exclude the 1-0? If so, you can use a non capturing group:
(?:1-0-)+(.*?)\.htm
Demo
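The patterns from this thread can be compared side by side with a short Python sketch on the URL from the question. Python's re is my choice here (the question's delimiters and U modifier suggest PCRE); the variable names are mine.

```python
import re

url = "http://www.someurl.com/some-text-1-0-1-0-some-other-text.htm#id_76"

# Greedy and lazy both start at the FIRST "0-"; laziness cannot move the start.
greedy = re.search(r"0-(.*)\.htm", url).group(1)
lazy = re.search(r"0-(.*?)\.htm", url).group(1)

# Tempered greedy token: each consumed character must not begin another "0-".
tempered = re.search(r"0-((?:(?!0-).)*)\.htm", url).group(1)

# Lookahead variant: only start where no later "0-" exists in the string.
no_later = re.search(r"0-((?!.*?0-).*)\.htm", url).group(1)

# Consume the repeated "1-0-" prefix in a non-capturing group.
prefix = re.search(r"(?:1-0-)+(.*?)\.htm", url).group(1)

print(greedy, lazy, tempered, no_later, prefix)
```

Greedy and lazy give the same capture here because there is only one .htm to stop at; the other three move the start of the capture instead.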

Regex for any string not ending on .js

This has been driving me nuts. I'm trying to match everything that doesn't end in .js. I'm using perl, so ?<! etc. is more than welcome.
What I'm trying to do:
Do match these
mainfile
jquery.1.1.11
my.module
Do NOT match these
mainfile.js
jquery.1.1.11.js
my.module.js
This should be an insanely simple task, but I'm just stuck. I looked in the docs for regex, sed, and perl, and even fiddled around for half an hour on regexr. Intuitively, this example (/^.*?(?!\.js)$/) should do it. I guess I've just stared myself blind.
Thanks in advance.
You can use this regex to make sure your match doesn't end with .js:
^(?!.+\.js$).+$
RegEx Demo
(?!.+\.js$) is a negative lookahead condition to fail the match if line has .js at the end.
This one should suit your needs:
^.*(?<![.]js)$
The simplest approach when you only have negative matching conditions is to construct a positive regex and then check that it doesn't match.
if ($string !~ /\.js$/) {
    print "Doesn't end in .js\n";
}
This is easier to understand and more efficient than a negative look-around.
Look-arounds are only needed when you need to mix positive and negative conditions (for example, "I need to match 'foo' out of a string, but only when it is not followed by 'bar'"). Even then, it is sometimes easier to use multiple simple patterns and logic rather than meeting all your requirements with one complex pattern.
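The three suggestions, plus the attempt from the question, can be compared on the question's file names. This is a sketch in Python rather than Perl, so treat the engine as an assumption; the patterns themselves are unchanged.

```python
import re

names = ["mainfile", "jquery.1.1.11", "my.module",
         "mainfile.js", "jquery.1.1.11.js", "my.module.js"]

# Negative lookahead anchored at the start of the string.
lookahead = [n for n in names if re.search(r"^(?!.+\.js$).+$", n)]

# Negative lookbehind checked at the end of the string.
lookbehind = [n for n in names if re.search(r"^.*(?<![.]js)$", n)]

# Positive pattern, negated in code.
negated = [n for n in names if not re.search(r"\.js$", n)]

# The question's attempt: at the end of the string nothing follows, so
# (?!\.js) always succeeds and every name matches.
attempt = [n for n in names if re.search(r"^.*?(?!\.js)$", n)]

print(lookahead, lookbehind, negated, attempt, sep="\n")
```

The first three agree with each other; the attempt lets everything through because a lookahead placed right before $ has nothing left to look at.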

Regex href match a number

Well, here I am back at regex and my poor understanding of it. Spent more time learning it and this is what I came up with:
/(.*)
I basically want the number in this string:
510973
Is my regex almost right? My original was:
"/<a href=\"travis.php?theTaco(.*)\">(.*)<\/a>/";
But sometimes it returned huge strings. So I just want to get the numbers only.
I searched through other posts but there is such a large amount of unrelated material, please give an example, resource, or a link directing to a very related question.
Thank you.
Try using a HTML parser provided by the language you are using.
Reason why your first regex fails:
[0-9999999] is not what you think. It is the same as [0-9], which matches a single digit. To match a number you need [0-9]+. Also, .* is greedy and will try to match as much as it can; you can use .*? to make it non-greedy. Since the second thing you are matching is a number again, use [0-9]+ there instead of .*. And if the two numbers you are capturing are always the same, you can just capture the first and use a back reference \1 for the second.
There are also a few regex metacharacters that you need to escape, such as . and ?.
Try:
<a href=\"travis\.php\?theTaco=([0-9]+)\">\1<\/a>
To capture a number, you don't use a range like [0-99999], you capture by digit. Something like [0-9]+ is more like what you want for that section. Also, escaping is important like codaddict said.
Others have already mentioned some issues regarding your regex, so I won't bother repeating them.
There are also issues regarding how you specified what it is you want. You can simply match via
/theTaco=(\d+)/
and take the first capturing group. You have not given us enough information to know whether this suits your needs.
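A small Python sketch of the points made in these answers. The html string is an assumption reconstructed from the answers (the question's exact markup did not survive), and Python stands in for whatever language the original code used.

```python
import re

# Hypothetical input; the question's exact markup is not available.
html = '<a href="travis.php?theTaco=510973">510973</a>'

# [0-9999999] is a character class: it matches ONE digit, not a number.
single_digit_ok = re.fullmatch(r"[0-9999999]", "5") is not None
whole_number_ok = re.fullmatch(r"[0-9999999]", "510973") is not None

# Strict version: escaped . and ?, digits only, backreference for the text.
strict = re.search(r'<a href="travis\.php\?theTaco=([0-9]+)">\1</a>', html)

# Loose version: just grab the digits after theTaco=.
loose = re.search(r"theTaco=(\d+)", html)

print(single_digit_ok, whole_number_ok, strict.group(1), loose.group(1))
```

If the markup varies at all (attribute order, quotes, whitespace), the loose version or an HTML parser is likely to hold up better than the strict one.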