Negative Lookbehind stops after first occurrence of an optional regex - regex

I'm removing protocol from links in HTML files using the following regex in Python:
re.sub(r"((http:|https:)?(\/\/website.com))", r"\3", result)
This works as expected, but I don't want to replace the protocol when the attribute is content. So I started looking into using Regex Negative Lookbehind.
(?<!content=")(http:|https:)?(\/\/website.com)
This regex should basically mean that if the string starts with <content=", then it should not match the rest. But the problem is that it only rejects the optional regex, (http:|https:)?, likely because it's optional. It rejects the whole line if it's not optional.
Here's a screenshot that shows the problem clearly. The first line should be rejected completely, but it only rejected the protocol.
Any suggestions? :)
Thanks!

The problem with the original regex is that it matches any //website.com that does not have content=" directly before it, because the http:/https: part is optional. To work around it, you can include the protocol in the negative lookbehind.
As variable length lookbehinds are not supported in Python, you can do the following:
(?<!content=")(?<!content="https:)(?<!content="http:)((https?:)?(//website.com))
Demo
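A minimal Python sketch of the chained lookbehinds above, wired into re.sub. The helper name strip_protocol and the sample HTML lines are made up for illustration, and the dot in website\.com is escaped here, which the pattern in the answer does not do.

```python
import re

# Three fixed-width negative lookbehinds, one per possible prefix,
# because Python's re module rejects variable-width lookbehinds.
PATTERN = re.compile(
    r'(?<!content=")(?<!content="https:)(?<!content="http:)'
    r'((https?:)?(//website\.com))'
)

def strip_protocol(html: str) -> str:
    """Hypothetical helper: drop http:/https: except right after content=\"."""
    # Group 3 is the protocol-relative part, as in the question's re.sub call.
    return PATTERN.sub(r"\3", html)

# Made-up sample lines, not taken from the question.
sample = "\n".join([
    '<meta content="https://website.com/a">',
    '<a href="https://website.com/b">',
    '<img src="http://website.com/c">',
])
print(strip_protocol(sample))
```

Only the href and src lines should lose their protocol; the content attribute is left alone because every position where the match could start is blocked by one of the three lookbehinds.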

The regex finds a //website.com that does not have content=" directly in front of it, so it returns a match.
How about
(?<!content="|content="http:|content="https:)(http:|https:)?(\/\/website.com)
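A caveat for Python users, as a small check: the alternatives inside this single lookbehind have different lengths, and Python's re module refuses to compile such a pattern. Some other engines (PCRE, for example) do accept top-level alternatives of different lengths, so whether this form is usable depends on the flavor. The variable names below are only for illustration.

```python
import re

# The combined lookbehind from the answer above, copied verbatim.
combined = r'(?<!content="|content="http:|content="https:)(http:|https:)?(\/\/website.com)'

try:
    re.compile(combined)
    compiled = True
except re.error as exc:
    # Python's re requires every lookbehind to have one fixed width.
    compiled = False
    print("re.error:", exc)
```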

Related

Matching Everything That Does Not Contain Anything in an Array with Regex

I have the following regex query I'm trying to use to exclude assets from being cached:
^((?!(\.css|\.js|\.|\.json|\.xml|\.svg|\.ico|\.png|\.mp3|\.jpg|\.svg|\.woff|\.woff2|\.eot|\.ttf|\/api\/play\/add|\/api\/favorite|\/Listen\/channel|getAccountInfo)).)*$
Except it doesn't match https://exampl.com/home for some reason. Does anyone know how I can fix this? Also, is there any way I can make the regex better?
Your regex contains a |\.| part (after |\.js). That alternative makes your regex fail the match with any string containing a dot. You need to remove that alternative:
^((?!(\.css|\.js|\.json|\.xml|\.svg|\.ico|\.png|\.mp3|\.jpg|\.svg|\.woff|\.woff2|\.eot|\.ttf|\/api\/play\/add|\/api\/favorite|\/Listen\/channel|getAccountInfo)).)*$
See the regex demo
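To see the effect of the stray \. alternative, here is a small Python sketch that builds both versions of the pattern from one list. The names EXCLUDED and build are hypothetical; the duplicate \.svg entry is dropped and the slashes are left unescaped, which Python does not require.

```python
import re

# Alternatives from the corrected pattern (duplicate \.svg removed).
EXCLUDED = [
    r"\.css", r"\.js", r"\.json", r"\.xml", r"\.svg", r"\.ico", r"\.png",
    r"\.mp3", r"\.jpg", r"\.woff", r"\.woff2", r"\.eot", r"\.ttf",
    r"/api/play/add", r"/api/favorite", r"/Listen/channel", r"getAccountInfo",
]

def build(alternatives):
    """Match a string only if none of the alternatives occurs anywhere in it."""
    return re.compile(r"^((?!(" + "|".join(alternatives) + r")).)*$")

fixed = build(EXCLUDED)
# Re-insert the bare \. after \.js to reproduce the original behaviour.
broken = build(EXCLUDED[:2] + [r"\."] + EXCLUDED[2:])

url = "https://exampl.com/home"
print(bool(fixed.match(url)))   # matches: no excluded token present
print(bool(broken.match(url)))  # fails: the bare \. rejects the dot in exampl.com
```

Keeping the tokens in a list also makes it easier to spot duplicates or an accidental catch-all entry.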

Negative lookahead Regex Issue

I started looking into lookaheads and tried to create a simple example, but for some reason it's not working properly when I use a negative lookahead.
I have the following simple regex:
href="(.+?)"(?!\s)
and this string:
test
test
Testing environment: https://regex101.com/r/JztPUe/1
I'm trying to take the URL between the href quotes only if it's not followed by a space, but it doesn't seem to understand me, since it's getting both the first and the second URL.
When I change it to a positive lookahead it's working as it should be and it takes only the second URL, but the negative one is not working as expected.
Can someone point out where my mistake is?
You should consider using an HTML parser instead of trying to do this with a regex. That being said, you could just phrase your regex by insisting that what follows the href clause is not a space:
href="([^"]*)"[^ ]
Demo
Your current regex:
href="(.+?)"(?!\s)
works as expected in Regex 101 when slightly rewritten as this:
href="([^"]*)"(?!\s)
Demo
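The issue you were having is caused by backtracking rather than by the lookahead itself: the dot in .+? can also match a double quote. When the negative lookahead fails after the first closing quote (because a space follows), the lazy quantifier simply expands past that quote until it reaches a later " that is not followed by whitespace, so the match swallows more than the first URL. The negated class [^"]* cannot cross a quote, so that attempt fails cleanly instead.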
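The difference between the two patterns can be reproduced in Python, where lazy quantifiers are supported. The input string below is hypothetical, since the original test string is not preserved in the post; it simply has one href followed by a space and one that is not.

```python
import re

# Hypothetical input: first href is followed by a space, second is not.
html = '<a href="url1" >test</a> <a href="url2">test</a>'

# "." may match a quote, so after the lookahead fails the capture
# grows past the first closing quote.
lazy = re.findall(r'href="(.+?)"(?!\s)', html)

# [^"]* can never cross a quote, so the first href is skipped entirely.
strict = re.findall(r'href="([^"]*)"(?!\s)', html)

print(lazy)
print(strict)
```

With this input the lazy version returns a single capture that runs from url1 into the second tag, while the negated-class version returns only url2.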
With space href="\K(\S+)"\s\K demo
Without space href="\K(\S+)">\K demo
\K resets the start of the reported match, so everything matched before it is left out of the result.

Having trouble parsing tcpdump output with regex

In particular, I'm trying to get the "Host: ..." part of an HTTP header of the HTTP request packet.
One instance is something like this:
.$..2~.:Ka3..E..D'.#.#..M....J}.e...P...q...W................g.o3GET./.HTTP/1.1...$..2~.:Ka3..E..G'.#.#..I....J}.e...P.......W................g..\host:.domain.com..
Another is this:
.$..2~.:Ka3..E..D'.#.#..M....J}.e...P...q...W................g.o3GET./.HTTP/1.1...$..2~.:Ka3..E..G'.#.#..I....J}.e...P.......W................g..\host:.domain.com..Connection:.Keep-Alive....
Note this is the ascii output. I want to extract that host. My initial regex was:
[hH]ost:\.(.*)..
This works for the first case, but does not work for the second one. In particular, for the second one it will extract: "domain.com..Connection.Keep-Alive.."
I would appreciate some help with creating a general regex that works in all cases.
Use this:
(?<=host:\.)(?:\.?[^.])+
See demo
The lookbehind (?<=host:\.) asserts that what precedes is host:.
(?:\.?[^.]) matches an optional period, then one character that is not a period.
And the + makes us match one or more of these groups.
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
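A short Python sketch of the lookbehind pattern above, run against the tails of the two samples from the question. The helper name extract_host is hypothetical, and re.IGNORECASE is an added assumption so that "Host:" is accepted as well as "host:".

```python
import re

# Lookbehind is fixed-width ("host:" plus one literal dot), so Python accepts it.
HOST = re.compile(r"(?<=host:\.)(?:\.?[^.])+", re.IGNORECASE)

# Tails of the two tcpdump samples from the question.
first = r"W................g..\host:.domain.com.."
second = r"W................g..\host:.domain.com..Connection:.Keep-Alive...."

def extract_host(dump):
    """Hypothetical helper: return the host after 'host:.' or None."""
    m = HOST.search(dump)
    return m.group(0) if m else None

print(extract_host(first))
print(extract_host(second))
```

The match stops at the first pair of consecutive dots, so it relies on the host being followed by at least two unprintable bytes (rendered as dots) in the ASCII dump.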

Parse with Regex without trailing characters

How can I parse the text below so that I get just
To: User <test#test.com>
and
To: <test#test.com>
When I try to parse the text below with
/To:.*<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>/mi
It grabs
Message-ID <CC2E81A5.6B9%test#test.com>,
which I don't want in my result.
I have tried using $ and \z and neither works. What am I doing wrong?
Information to parse
To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>
To:
<test#test.com>
This is my parsing information in Rubular http://rubular.com/r/DQMQC4TQLV
Since you haven't specified exactly what your tool/language is, assumptions must be made.
In general regex pattern matching tends to be aggressive, matching the longest possible pattern. Your pattern starts off with .*, which means that you're going to match the longest possible string that ENDS WITH the remainder of your pattern <[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>, which was matched with <CC2E81A5.6B9%test#test.com> from the Message-ID.
Both Apalala's and nhahtdh's comments give you something to try. Avoid the all-inclusive .* at the start and use something that's a bit more specific: match leading spaces, or match anything EXCEPT the first part of what you're really interested in.
You need to make the wildcard match non-greedy (lazy) by adding a question mark after it:
To:.*?<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>
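For comparison, a Python sketch of the greedy and lazy versions side by side. The line layout of the sample text is assumed from the question, and re.DOTALL is used on the assumption that it mirrors Ruby's /m option (which lets the dot match newlines); the names FLAGS and ADDRESS are only for illustration.

```python
import re

# Layout assumed from the question's "Information to parse".
text = (
    "To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>\n"
    "To:\n"
    "<test#test.com>"
)

FLAGS = re.IGNORECASE | re.DOTALL
ADDRESS = r"<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>"

# Greedy: .* runs to the end and backtracks to the LAST address.
greedy = re.search(r"To:.*" + ADDRESS, text, FLAGS).group(0)

# Lazy: .*? stops at the FIRST address after each "To:".
lazy = re.findall(r"To:.*?" + ADDRESS, text, FLAGS)

print(repr(greedy))
print(lazy)
```

Under these assumptions the greedy pattern returns the whole text as one match (Message-ID included), while the lazy pattern returns the two To: entries separately.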

Can you put optional tokens within a positive look behind of a regular expression?

I have the following content with what I think are the possible cases of someone defining a link:
hello link what <a href=something.jpg>link</a>
I also have the following regular expression with a positive look behind:
(?<=href=["\'])something
The expression matches the word "something" in the first two links. In an attempt to capture the third instance of "something" in the link without any quotes, I thought making the ["\'] token optional (using ?) would capture it. The expression now looks like this:
(?<=href=["\']?)something
Unfortunately it now does not match any of the instances of "something". What could I be doing incorrectly? I'm using http://gskinner.com/RegExr/ to test this out.
Many regex flavors only support fixed-length lookbehind assertions. If you have an optional token in your lookbehind, its length isn't fixed, rendering it invalid.
So the real question is: What regex flavor are you actually targeting with your regex?
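Python's re module is one of the flavors with this restriction. The sketch below first shows the optional-quote lookbehind being rejected, then one possible alternative that is not from the answer above: match the href= prefix normally and read the target from a capturing group. The three link strings are hypothetical reconstructions of the quoted and unquoted cases.

```python
import re

# Hypothetical examples of the three link styles from the question.
links = [
    '<a href="something.jpg">link</a>',
    "<a href='something.jpg'>link</a>",
    "<a href=something.jpg>link</a>",
]

# The optional quote makes the lookbehind variable-width, which re rejects.
try:
    re.compile(r'''(?<=href=["']?)something''')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Alternative sketch: consume the prefix and capture what follows it.
capture = re.compile(r'''href=["']?(something)''')
found = []
for link in links:
    m = capture.search(link)
    if m:
        found.append(m.group(1))

print(lookbehind_ok)
print(found)
```

The capturing-group version works in any flavor, at the cost of having to read group 1 instead of the whole match.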