RegEx for IP Address with no range limit - regex

I'm looking for a regex that finds IP Addresses with no range limit (i.e. 0-999). This is "simpler" than a regular IP Address regex but I'm learning regex and am stumped on how to essentially end the regex and not match IP Addresses with more than 4 periods or characters before/after it.
This is what I have: "/\b(\d{1,3}\.){3}(\d{1,3})\b/"
So, with this regex it will find most IP Addresses but will fail when there is an IP Address like this:
1.2.3.4.5
Appreciate the help. And it doesn't matter what flavor of regex, I just need to know how to not match the case above.

You may use lookarounds to restrict the context around your expected matches:
\b(?<!\d\.)(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)
  ^^^^^^^^^                         ^^^^^^^^
See the regex demo
Here,
(?<!\d\.) is a negative lookbehind that fails the match if, immediately to the left of the current location, there is a digit + .
(?!\.\d) is a negative lookahead that fails the match if, immediately to the right of the current location, there is a . + a digit.
To also make sure the octets of 1 to 3 digits are matched, you may add more restriction:
\b(?<!\d\.|\d)(?:\d{1,3}\.){3}\d{1,3}\b(?!\.?\d)
  ^^^^^^^^^^^^                         ^^^^^^^^^
See another regex demo.
Here, (?<!\d\.|\d) also fails the match if there is a digit immediately before the current location, and the lookahead (?!\.?\d) also fails it if the expected match is immediately followed by a digit, with or without a dot in between.
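For anyone who wants to try the lookaround approach outside a regex tester, here is a minimal sketch in Python (my choice of language; the question does not name one). The sample text and variable names are made up for illustration. One assumption worth flagging: Python's re module only accepts fixed-width lookbehinds, so the stricter variant splits (?<!\d\.|\d) into two separate lookbehinds.

```python
import re

# Pattern from the answer: four dot-separated groups of 1-3 digits,
# not preceded by "digit." and not followed by ".digit".
IP_LIKE = re.compile(r"\b(?<!\d\.)(?:\d{1,3}\.){3}\d{1,3}\b(?!\.\d)")

# Stricter variant. Python's re needs fixed-width lookbehinds, so the
# alternation (?<!\d\.|\d) is written as two lookbehinds in a row.
IP_LIKE_STRICT = re.compile(
    r"\b(?<!\d\.)(?<!\d)(?:\d{1,3}\.){3}\d{1,3}\b(?!\.?\d)"
)

# Hypothetical sample text, not taken from the question.
sample = "ok: 10.20.30.40, too long: 1.2.3.4.5, ok: 999.0.1.2"

matches = IP_LIKE.findall(sample)
strict_matches = IP_LIKE_STRICT.findall(sample)

print(matches)         # the two 4-part addresses
print(strict_matches)  # same result for this sample
```

The 1.2.3.4.5 case is rejected from both sides: the lookahead blocks 1.2.3.4 and the lookbehind blocks 2.3.4.5.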

You can use this one also.
^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$
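A quick check, sketched in Python (an assumption, any flavor behaves the same here), of why the dots in an anchored pattern like this need a backslash: an unescaped . matches any character, so plain digit strings slip through. Note also that the ^ and $ anchors mean this only validates a whole string and will not find addresses inside longer text.

```python
import re

# Unescaped dots: "." matches any character, not just a period.
UNESCAPED = re.compile(r"^[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$")

# Escaped dots: only a literal period is accepted between the groups.
ESCAPED = re.compile(r"^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$")

for candidate in ["1.2.3.4", "1a2b3c4", "1234567", "1.2.3.4.5"]:
    print(candidate, bool(UNESCAPED.match(candidate)), bool(ESCAPED.match(candidate)))
```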

Related

Match specific pattern that does not contain other pattern in one expression

I'm looking for a regex to use in nginx location matching, that would match a specified end pattern not being anywhere preceded by a specified other pattern.
Like, I have files:
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb
webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb
I want to match all \.unityweb except those that are anywhere preceded by dev. Basically, I need to match last two lines. I cannot hardcode it, as the files/directories might be named arbitrary.
The usual ((?!dev\/).)*$ doesn't suffice, because it still matches the ends. (?<!dev) also cannot be added anywhere, as it will only check directly before the current position.
I am out of clues and also out of regex fu!
The solution does not have to be strictly regex, might be nginx based too.
It might have been asked before, but I cannot seem to know the correct keywords to find it.
Try
^(?!.*?dev\/.*).+\.unityweb$
See the demo here
Description:
^ From the start of the line
(?! _______ ) Negative Lookahead
.*?dev\/ Match any character any amount of times, until you reach dev followed by a slash
.* Match any characters any amount of times
Negative lookahead closes
.+ Match any character, one or more times
\.unityweb - until you reach .unityweb
$ End of the line
Use the full match for what you need
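The breakdown above can be checked against the four file names from the question. This is only a sketch in Python rather than nginx (nginx uses PCRE, so the pattern should carry over, but I have not tested it in a location block). The / is left unescaped because Python does not need it escaped.

```python
import re

# Same pattern as in the answer; the anchored negative lookahead rejects
# any line that contains "dev/" anywhere.
NOT_UNDER_DEV = re.compile(r"^(?!.*?dev/.*).+\.unityweb$")

paths = [
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip-dev/streaming-wasm-gzip-dev.data.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.wasm.framework.unityweb",
    "webgl-0.4.0-alpha.1-gzip/streaming-wasm-gzip.data.unityweb",
]

# Keep only the paths the pattern accepts.
kept = [p for p in paths if NOT_UNDER_DEV.match(p)]
for p in kept:
    print(p)
```

Because the lookahead sits right after ^, it scans the whole line once, which is what the per-character ((?!dev\/).)*$ attempt could not guarantee.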
EDIT
Just realised that you also state a contradiction in your question, as you say you don't want to match anything preceded by dev/ but you also want to match the first two examples you gave.
That can be done by changing the negative lookahead to a positive lookahead:
^(?=.*?dev\/.*).+\.unityweb$
See the demo here
You can use this
^(?!.*dev.*\.unityweb)(?=.*\.unityweb).*$
Demo

REGEX for search and exclude combined

Overview:
I am trying to combine two REGEX queries into one:
\d+\.\d+\.\d+\.\d+
^(?!(10\.|169\.)).*$
I wrote this as a two-part query. The first part isolates IPs in a block of text; after I copy and paste the results, the second part selects everything that does not begin with 10 or 169.
Questions:
It seems like I am over complicating this:
Can anybody see a better way to do this?
Is there a way to combine these two queries?
Sure. Just put the anchored negative look ahead at the start:
^(?!10\.|169\.)\d+\.\d+\.\d+\.\d+$
Note: Unnecessary brackets have been removed.
To match within a line, remove the anchors and use a "word boundary" \b as the anchor instead:
\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+
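As a rough illustration of the within-a-line version, here is a small Python sketch (the sample sentence is invented). It uses the pattern exactly as written above and collects every IP-looking token that does not start with 10. or 169.

```python
import re

# Word boundary first, then the negative lookahead, then the four groups.
NOT_10_OR_169 = re.compile(r"\b(?!10\.|169\.)\d+\.\d+\.\d+\.\d+")

# Hypothetical block of text containing several addresses.
text = "hosts 10.0.0.1, 192.168.1.5, 169.254.0.9 and 8.8.8.8"

found = NOT_10_OR_169.findall(text)
print(found)
```

The excluded addresses cannot sneak in from their second octet either, because fewer than four groups remain from that position.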
A quick-and-gimme-regex style answer
Basic one (whole string looks like an IP): ^\d+\.\d+\.\d+\.\d+$
Lite (four period-separated digit chunks, as a whole word): \b\d+\.\d+\.\d+\.\d+\b
Medium (excluding junk like 1.2.4.6.7.9.0): (?<!\d\.)\b\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 1 (not starting with 10 or 169): (?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)
Advanced 2 (not ending with 8 or 10): (?<!\d\.)\b\d+\.\d+\.\d+\.(?!(?:8|10)\b)\d+\b(?!\.\d+)
Details for the curious
The \b is a word boundary that makes it possible to match exact "words" (entities consisting of [a-zA-Z0-9_] characters) inside a longer text. So, if we do not want to match 12.12.23.56 inside g12.12.23.56g, we use the Lite version.
The lookarounds together with the word boundary, make it possible to further restrict the matches. (?<!\d\.) - a negative lookbehind - and a (?!\.\d+) - a negative lookahead - will fail a match if the IP-resembling substring is preceded with a digit+. or followed with a .+digit. So, we do not match 12.12.34.56.78.90899-like entities with this regex. Choose Medium regex for that case.
Now, you need to restrict the matches to those that do not start with some numeric value. You need to make use of either a lookbehind or a lookahead. When choosing between a lookbehind and a lookahead solution, prefer the lookahead, because 1) it is less resource consuming, and 2) more flavors support it. Thus, to fail all matches where the first number of the IP is equal to 10 or 169, we can use a negative lookahead anchored after the leading word boundary: (?!(?:1(?:0|69))\.). The syntax is (?!...) and inside, we match either 1 followed with 0 and then a ., or 1 followed with 69 and then a . character. Note that we could write (?!10\.|169\.) but there is some redundant backtracking overhead then, as the 1 part is repeated. Best practice is to "contract" alternations so that the beginning of each branch does not repeat, which makes the alternation group more linear. So, use the Advanced 1 regex version to get those IPs.
A similar case is the Advanced 2 regex for getting some IPs that do not end with some value.
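To see the Advanced 1 and Advanced 2 patterns side by side, here is a short Python sketch. The sample string is made up; the patterns are copied from the list above without changes.

```python
import re

# Advanced 1: IP-like token whose first number is not 10 or 169.
ADVANCED_1 = re.compile(
    r"(?<!\d\.)\b(?!(?:1(?:0|69))\.)\d+\.\d+\.\d+\.\d+\b(?!\.\d+)"
)

# Advanced 2: IP-like token whose last number is not 8 or 10.
ADVANCED_2 = re.compile(
    r"(?<!\d\.)\b\d+\.\d+\.\d+\.(?!(?:8|10)\b)\d+\b(?!\.\d+)"
)

# Hypothetical input: two excluded prefixes, two candidates, one 5-part junk token.
text = "10.1.2.3 169.254.1.1 172.16.0.8 172.16.0.80 12.12.34.56.78"

not_10_or_169 = ADVANCED_1.findall(text)
not_ending_8_or_10 = ADVANCED_2.findall(text)

print(not_10_or_169)
print(not_ending_8_or_10)
```

Note how the \b inside the Advanced 2 lookahead keeps 172.16.0.80 while rejecting 172.16.0.8: the restriction applies to the whole last number, not to its first digit.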

regex to match first instance of a word but only when preceded by match from another pattern

I've found some info on finding the first instance of a word in a string, but I'm trying to find the first instance of a word (two, actually, but in separate calls) only when it is preceded by some very specific text (an IP address delimited by underscores) that varies slightly. Also, these words are separated by underscores, so for some reason \b isn't working for me.
Here's some example strings to test against one line at a time. Only bolded words should be matched.
192_168_10_2_card02_port01_other_text_with_card_or_port
10_22_1_200_card4_port5_another_string_with_port_or_card
something_else_with_card_or_port_in_it
And in a second call, I'd like to match a different word in these strings.
192_168_10_2_card02_port01_other_text_with_card_or_port
10_22_1_200_card4_port5_another_string_with_port_or_card
something_else_with_card_or_port_in_it
My regex flavor is POSIX regex (for PostgreSQL 9.4). I've been able to run with anything that works in here http://regexpal.com/ so far.
Even if it can't solve for all 3 examples at once, if it could just solve for the first two, that would be very helpful.
Edit: To be absolutely clear, my intent is to replace the first string 'card' with the character 'c' and then to replace the first string 'port' with the letter 'p' without affecting any instance of 'card' or 'port' that are not immediately followed by numbers. This is why my match needs to include just those first words without their corresponding numbers.
If you can use negative lookahead, you can use card((?!port).)*port to match card, then any number of characters that do not begin the string port, then port.
EDIT:
if the input is always in the same format, then you can be more specific by using card[0-9]{1,2}_port. This will keep it from matching any other extraneous instances of card and port
EDIT2:
To match only the word in the first case, you can use a positive lookahead: card(?=[0-9]{1,2}_port). I'm not sure if your flavor allows positive lookbehind (the tester doesn't, but that is in JS), but give (?<=card[0-9]{1,2}_)port a shot. If positive lookbehind doesn't work, you may need to look into alternatives.
The \b assertion is not working in this case because _ is considered a word character.
Demo
You can use a look behind:
(?<=_)(card).*?(?<=_)(port)
Demo
To be even more specific, use the IP address pattern:
(^(?:\d+_){4})(card\d+)_(port\d+)
Demo
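If the goal is the replacement described in the question's edit (card to c, port to p, only right after the underscore-delimited IP), capture groups can carry the prefix and the numbers through so no lookbehind is needed. Below is a sketch in Python, not PostgreSQL; the function name shorten and the slightly regrouped pattern are my own, so treat it as an illustration of the idea rather than a drop-in query.

```python
import re

# Group 1: the four "digits_" chunks; group 2: card number; group 3: port number.
PREFIXED = re.compile(r"^((?:\d+_){4})card(\d+)_port(\d+)")

def shorten(s: str) -> str:
    """Replace the first card/port pair after the IP prefix with c/p."""
    # \g<n> avoids any ambiguity between a group number and following text.
    return PREFIXED.sub(r"\g<1>c\g<2>_p\g<3>", s)

lines = [
    "192_168_10_2_card02_port01_other_text_with_card_or_port",
    "10_22_1_200_card4_port5_another_string_with_port_or_card",
    "something_else_with_card_or_port_in_it",
]

for line in lines:
    print(shorten(line))
```

Lines without the IP prefix, like the third example, are returned unchanged because the pattern never matches them.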
I had to solve this in two steps. In the first, I matched only lines with the IP string in the beginning (this excludes lines like my 3rd example). In the second step, I used regexp_replace to replace the first match of each word.
Unfortunately, I had completely missed the fact that regexp_replace only replaces the first match unless told otherwise with the 'g' flag:
WHEN (SELECT regexp_matches(mystring, '^1(?:[0-9]{1,3}_){4}card[0-9]{1,2}_port[0-9]{1,2}')) IS NOT NULL
THEN regexp_replace(regexp_replace(mystring, 'card', 'c'), 'port', 'p')
Though I still wish I could figure out how to match one of those words in a single expression, and I would accept any answer that could achieve that.

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top level domain followed by five digits. My problem arises because my existing pcre matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest you simply prepend ^[^\/]*\. to your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase characters, then a / and 5 digits).
Here is a permalink to a working example on regex101.
Cheers!
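For reference, the same pattern run against the two URLs from the question, as a small Python sketch (the question is about PCRE; this part of the syntax is the same in Python, and the / needs no escaping there):

```python
import re

# ^[^/]* guarantees no "/" has been seen before the TLD that is matched.
FIRST_TLD = re.compile(r"^[^/]*\.([a-z]{2,4})/\d{5}")

unwanted = "domain.net/stuff/stuff=www.google.com/12345"
wanted = "domain.net/12345/stuff=www.google.com/12345"

m_unwanted = FIRST_TLD.search(unwanted)
m_wanted = FIRST_TLD.search(wanted)

print(m_unwanted)         # no match: the first TLD is followed by /stuff
print(m_wanted.group(0))  # the part up to and including the five digits
print(m_wanted.group(1))  # the captured TLD
```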
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Notepad++ regex group capture

I have such txt file:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
Trying to delete all subdomains with such regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Receive:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why last line becomes com.ua instead of jwbefw.com.ua ?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLD (top level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLD in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua the regex engine will first fail to match a,
then it will move the token one place to the left to look at "ua"
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because ".ua" met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, with .+? matching one more character, and then evaluate the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
But the implementation of ? also creates issues.
Adding in the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru
This is because the first period (the one after xx) will match, then inside group 1 (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change that second group from (.*?) to ([\w-]*?) so it won't match a period, only letters, digits, underscores, or a dash.
resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Dropping the inner parentheses and adding ?: to the TLD options makes them non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
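The greedy versus lazy difference described above can be reproduced outside Notepad++. The sketch below uses Python's re.sub as a stand-in (an assumption: Notepad++ uses the Boost regex engine, but these particular constructs behave the same way in both). The helper name strip_subdomains is made up.

```python
import re

TLDS = r"ru|ua|com\.ua|com|net|info"

# Original pattern from the question: greedy .+ leaves as little as possible
# for the captured group.
GREEDY = re.compile(r".+\.((.*?)\.(" + TLDS + r"))$")

# Final pattern from this answer: lazy .+? and a dot-free domain label.
LAZY = re.compile(r".+?\.([\w-]*?\.(?:" + TLDS + r"))$")

def strip_subdomains(pattern, hosts):
    """Replace each host with capture group 1 when the pattern matches."""
    return [pattern.sub(r"\g<1>", h) for h in hosts]

hosts = [
    "salo.ru",
    "bbb.antichat.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

greedy_result = strip_subdomains(GREEDY, hosts)
lazy_result = strip_subdomains(LAZY, hosts)

print(greedy_result)  # last entry collapses to com.ua
print(lazy_result)    # last entry keeps jwbefw.com.ua
```

Hosts with no subdomain (salo.ru, zzz.com) do not match either pattern, so they pass through untouched.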
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
Look at the picture above to see what each part of the regex captures.
It is color-coded, so you should not need a further explanation of the regex.