Retrieve characters after nth occurrence of an another with Regex - regex

I'm writing a simple bot that broadcasts messages to clients based on messages from a server. This will be done in JavaScript but I am trying to understand Regex. I've been Googling for the past hour and I've come so close but I am simply unable to solve this one.
Basically I need to retrieve everything between the second / and the first [. It sounds really simple but I cannot figure out how to do this.
Here's some sample code:
192.168.1.1:33291/76561198014386231/testName joined [linux/76561198014386231]
Here's the Regex I've come up with:
\/(.*?)\[
I've found lots of similar questions here on StackOverflow but most of them seem specific to a particular language or end up being too complex and I'm unable to whittle down the query.
I know this is a simple one, but I am totally stumped.

Instead of .*?. Then you could match everything but a forward slash by doing [^\/]*.
([^\/]*)\s*\[
Live preview
If it needs to be after the second slash. As in the contents between the second slash and the square bracket can contain slashes. Then you could do:
(?:.*?\/){2}(.*)\s*\[
Live preview
Remove the \s* if you want to. I'm just assuming you don't care about that whitespace.

Related

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
We match whatever comes between the ? and the first occurrence of q= without capturing it.
We capture a second group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the second capture group.

Trying to regex YouTube ads with pihole

EDIT:
As far as I know, Pihole does not block YouTube ads.
Original Post:
Trying to regex urls like:
r4---sn-vgqsrnez.googlevideo.com
r1---sn-vgqsknlz.googlevideo.com
r5---sn-vgqskn7e.googlevideo.com
r3---sn-vgqsknez.googlevideo.com
r6---sn-vgqs7ney.googlevideo.com
r4---sn-vgqskne6.googlevideo.com
r4---sn-vgqsrnez.googlevideo.com
r5---sn-vgqskn76.googlevideo.com
r6---sn-vgqs7ns7.googlevideo.com
r1---sn-vgqsener.googlevideo.com
r1---sn-vgqskn7z.googlevideo.com
r1---sn-vgqsknek.googlevideo.com
r6---sn-vgqsener.googlevideo.com
r3---sn-vgqs7nly.googlevideo.com
r1---sn-vgqsknes.googlevideo.com
r4---sn-vgqsrnes.googlevideo.com
r6---sn-vgqskn76.googlevideo.com
I've tried:
(^|\.)r[0-100]---sn-vgqs?n??\.googlevideo\.com$
(^|\.)r[0-100]?*\.googlevideo\.com$
^r[0-100]---sn-vgqs(?:.*)n(?:.*)(?:.*).googlevideo.com$
^r[0-100]---sn-vgqs(?:.*)n(?:.*).googlevideo.com$
but nothing works
I am probably using regex wrong because I don't have much experience with it but looking online some people have said it could be a thing with Pihole.
I'm guessing that you'd like to have restricted boundaries, if not though, this expression might be somewhat close to what you have in mind:
^r\d+---sn-vgqs[a-z0-9]{4}\.googlevideo\.com$
Demo 1
You can add more boundaries, if necessary, such as:
^r(?:100|[1-9]\d|\d)---sn-vgqs[a-z0-9]{4}\.googlevideo\.com$
Demo 2
or:
^r(?:100|[1-9]\d|\d)---sn-vgqs(?:rne(?:s|z)|kne(?:s|z)|knlz|kn7e|7ney|kne6|kn76|7ns7|ener|kn7z|knek|7nly)\.googlevideo\.com$
Demo 3
which I'm just guessing.
If you wish to explore/simplify/modify the expression, it's been
explained on the top right panel of
regex101.com. If you'd like, you
can also watch in this
link, how it would match
against some sample inputs.
The following Regex match all the url start with "r" then followed by anything else without limiting number of character then followed by "sn" then followed by any number of characters then end with ".googlevideo.com" the expression was anchor with ^ and $.
I try it on my pihole with great success but have to remove it later. all r....sn...googlevideo.com was blocked in the query list but it also rendered my smart tv youtube app broken. It will not play any video at all unless I remove it from pihole. use it at your own risk.
^r.+sn.+(\.googlevideo\.com)$
The post is a bit older but because I tried myself with regexes I just want to say that your regexes can't work because of one "little" point.
Pi-Hole uses the POSIX ERE (POSIX Extended Regular Expressions) standard.
So there are no lazy quantifiers or shorthand character classes.
It also does not support non-capturing groups like in your third and fourth line.
You can check such regexes in tools like RegexBuddy. Maybe other free tools can check it too and help to convert it.
My current regex is:
^r[[:digit:]]+---sn-4g5e[a-z0-9]{4}\.googlevideo\.com$
It correctly blocks all ads BUT also videos.
If you use it you have to do the following.
Open a youtube video and check if the video loads.
If not, go to your pi hole dashboard to the query log.
For your device you will have two dns queries
r5---sn-4g5e6nze.googlevideo.com
and
r5---sn-4g5ednse.googlevideo.com
The last one (upper) in the query log is the video. So whitelist
the dns. You have to do it sometimes.
Greetings

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. Made some modification and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Check it out here.
Edit:
After some experimenting I think this one is about as simple it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
Example here.

RegEx filter links from a document

I am currently learning regex and I am trying to filter all links (eg: http://www.link.com/folder/file.html) from a document with notepad++. Actually I want to delete everything else so that in the end only the http links are listed.
So far I tried this : http\:\/\/www\.[a-zA-Z0-9\.\/\-]+
This gives me all links which is find, but how do I delete the remaining stuff so that in the end I have a neat list of all links?
If I try to replace it with nothing followed by \1, obviously the link will be deleted, but I want the exact opposite to have everything else deleted.
So it should be something like:
- find a string of numbers, letters and special signs until "http"
- delete what you found
- and keep searching for more numbers, letters ans special signs after "html"
- and delete that again
Any ideas? Thanks so much.
In Notepad++, in the Replace menu (CTRL+H) you can do the following:
Find: .*?(http\:\/\/www\.[a-zA-Z0-9\.\/\-]+)
Replace: $1\n
Options: check the Regular expression and the . matches newline
This will return you with a list of all your links. There are two issues though:
The regex you provided for matching URLs is far from being generic enough to match any URL. If it is working in your case, that's fine, else check this question.
It will leave the text after the last matched URL intact. You have to delete it manually.
The answer made previously by #psxls was a great help for me when I have wanted to perform a similar process.
However, this regex rule was written six years ago now: accordingly, I had to adjust / complete / update it in order it can properly work with the some recent links, because:
a lot of URL are now using HTTPS instead of HTTP protocol
many websites less use www as main subdomain
some links adds punctuation mark (which have to be preserved)
I finally reshuffle the search rule to .*?(https?\:\/\/[a-zA-Z0-9[:punct:]]+) and it worked correctly with the file I had.
Unfortunately, this seemingly simple task is going to be almost impossible to do in notepad++. The regex you would have to construct would be...horrible. It might not even be possible, but if it is, it's not worth it. I pretty much guarantee that.
However, all is not lost. There are other tools more suitable to this problem.
Really what you want is a tool that can search through an input file and print out a list of regex matches. The UNIX utility "grep" will do just that. Don't be scared off because it's a UNIX utility: you can get it for Windows:
http://gnuwin32.sourceforge.net/packages/grep.htm
The grep command line you'll want to use is this:
grep -o 'http:\/\/www.[a-zA-Z0-9./-]\+\?' <filename(s)>
(Where <filename(s)> are the name(s) of the files you want to search for URLs in.)
You might want to shake up your regex a little bit, too. The problems I see with that regex are that it doesn't handle URLs without the 'www' subdomain, and it won't handle secure links (which start with https). Maybe that's what you want, but if not, I would modify it thusly:
grep -o 'https\?:\/\/[a-zA-Z0-9./-]\+\?' <filename(s)>
Here are some things to note about these expressions:
Inside a character group, there's no need to quote metacharacters except for [ and (sometimes) -. I say sometimes because if you put the dash at the end, as I have above, it's no longer interpreted as a range operator.
The grep utility's syntax, annoyingly, is different than most regex implementations in that most of the metacharacters we're familiar with (?, +, etc.) must be escaped to be used, not the other way around. Which is why you see backslashes before the ? and + characters above.
Lastly, the repetition metacharacter in this expression (+) is greedy by default, which could cause problems. I made it lazy by appending a ? to it. The way you have your URL match formulated, it probably wouldn't have caused problems, but if you change your match to, say [^ ] instead of [a-zA-Z0-9./-], you would see URLs on the same line getting combined together.
I did this a different way.
Find everything up to the first/next (https or http) (then everything that comes next) up to (html or htm), then output just the '(https or http)(everything next) then (html or htm)' with a line feed/ carriage return after each.
So:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace with: \1\2\3\r\n
Saves looking for all possible (incl non-generic) url matches.
You will need to manually remove any text after the last matched URL.
Can also be used to create url links:
Find: .*?(https:|http:)(.*?)(html|htm)
Replace: \1\2\3\r\n
or image links (jpg/jpeg/gif):
Find: .*?(https:|http:)(.*?)(jpeg|jpg|gif)
Replace: <img src="\1\2\3">\r\n
I know my answer won't be RegEx related, but here is another efficient way to get lines containing URLs.
This won't remove text around links like Toto mentioned in comments.
At least if there is nice pattern to all links, like https://.
CTRL+F => change tab to Mark
Insert https://
Tick Mark to bookmark.
Mark All.
Find => Bookmarks => Delete all lines without bookmark.
I hope someone who lands here in search of same problem will find my way more user-friendly.
You can still use RegEx to mark lines :)

Regex pattern to format url

I have this pattern ^(?:http://)?(?:www.)?(.*?)/?(.*?)$ but it's still not perfect.
Let's say we have these urls to test against it:
example.com
example.com/
www.example.com/
http://example.com/
example.com/param
http://example.com/params/
The final output should be example.com/ if there's no parameters and example.com/params/ if with parameters. My problem is that it matches only second group. It doesn't look like /? is working otherwise it would stop on slash character. Is it possible to achieve what I want using only one pattern?
So you want the host name in $1? Your regex is ambiguous, there are many ways to match it; the regex engine will prefer the longest, leftmost possible match. If you don't want slashes in the first part, then say so. Explicitly. (?:http://)?(?:www\.)?([^/]*)?/?(.*)?$
One that I've used is:
((?:(?:https?://)?[\w\d:##%/;$()~_?\+\-=&]+|www|ftp)\.[\w\d:##%/;$()~_?\+\-=&\.]+)
The problem with URLs is that there are SO many ways one can be written, which is why the above code looks so congested. This will match all your examples above, but it will also match things like:
alkasi.jaias
Hopefully this will get you headed to where you need or want to go, and perhaps someone might be able to come up behind me and clean it up some (it's early morning, I'm getting ready for work, and am exhausted. :P)