RegEx: a group to match any string beginning with a specific character - regex

I'm creating some reports in Google Analytics.
I am trying to write a RegEx that will match
www.website.com/
www.website.com/?_string_begins_with_question_mark
But will not match
www.website.com/string_doesnt_begin_with_question_mark
Using Reggy (with POSIX Extended), I tried to create an optional group that would match a string beginning with a question mark, followed by any number of characters. I thought
(\?.+)?
would do the trick, but it ignores the question mark requirement, and matches any string.
I tried some variations:
www.website.com/(\?(.+))?
www.website.com/(\?.+)?
www.website.com/(?.+)?
Et cetera.
Any help is appreciated - Sorry if this has already been asked! I'm new to RegEx.
Thank you!

Your regexp
www.website.com/(\?.+)?
will still match anything that contains www.website.com/, no matter what comes after the slash. Have you tried appending a $ (end-of-input marker)?
www\.website\.com/(?:\?.+)?$
(Escape the dots for more precision; the ?: is just a way of indicating that the group is of no special meaning and does not have to be remembered -- if you omit this, you can access the contents of the group by \1, e.g., in a replace operation.)
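A quick way to check the effect of the trailing $ is a small Python sketch. Python's re module is not POSIX Extended, so this only approximates what Reggy or Google Analytics would do; in a strict POSIX ERE tool the (?:...) syntax may be unavailable, and a plain (...) group does the same job. The helper name matches is made up for this illustration.

```python
import re

# Pattern from the answer above: escaped dots, an optional non-capturing
# query group, and a trailing $ so nothing else may follow the slash.
PATTERN = re.compile(r"www\.website\.com/(?:\?.+)?$")


def matches(url: str) -> bool:
    """Return True if the URL is the bare path or the path plus a query."""
    return PATTERN.search(url) is not None


urls = [
    "www.website.com/",
    "www.website.com/?_string_begins_with_question_mark",
    "www.website.com/string_doesnt_begin_with_question_mark",
]

for url in urls:
    print(url, matches(url))
```

Without the $, the optional group can match nothing and the third URL is accepted as well, which is the behaviour described in the question.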

Try this one
www.website.com/\?(.+)

I would suggest
www.website.com/($|\?.+$)
Either empty or question mark plus some string

Related

How to extract characters from a string with optional string afterwards using Regex?

I am in the process of learning Regex and have been stuck on this case. I have a url that can be in two states EXAMPLE 1:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA
OR EXAMPLE 2:
spotify.com/track/1HYcYZCOpaLjg51qUg8ilA
I need to extract the 1HYcYZCOpaLjg51qUg8ilA ID
So far I am using this: (?<=track\/)(.*)(?=\?)? which works well for Example 2 but it includes the ?si=Nf5w1q9MTKu3zG_CJ83RWA when matching with Example 1.
BUT if I remove the ? at the end of the expression then it works for Example 1 but not Example 2! Doesn't that mean that last group (?=\?) is optional and should match?
Where am I going wrong?
Thanks!
I searched a handful of "Questions that may already have your answer" suggestions from SO, and didn't find this case, so I hope asking this is okay!
The capturing group in your regular expression is trying to match anything (.) as much as possible due to the greediness of the quantifier (*).
When you use:
(?<=track\/)(.*)(?=\?)
only 1HYcYZCOpaLjg51qUg8ilA from the first example is captured, as there is no question mark in your second example.
When using:
(?<=track\/)(.*)(?=\??)
You are effectively making the positive lookahead optional, so the capturing group will try to match as much as possible (including the question mark), so that 1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA and 1HYcYZCOpaLjg51qUg8ilA are matched, which is not the desired output.
Rather than matching anything, it is perhaps more appropriate for you to match word characters \w (letters, digits, and the underscore) only.
(?<=track\/)(\w*)(?=\??)
Alternatively, if you are expecting other characters, let's say a hyphen - or an underscore _, you may use a character class.
(?<=track\/)([a-zA-Z0-9_-]*)(?=\??)
Or you might want to capture everything except a question mark ? with a negated character class.
(?<=track\/)([^?]*)(?=\??)
As pointed out by gaganso, a look-behind is not necessary in this situation (or indeed the lookahead), however it is indeed a good idea to start playing around with them. The look-around assertions do not actually consume the characters in the string. As you can see here, the full match for both matches only consists of what is captured by the capture group. You may find more information here.
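The difference between the greedy (.*) and the negated character class can be sketched in Python, whose re module supports these fixed-width look-arounds. The helper name extract is an assumption for the example; the patterns are the ones quoted above.

```python
import re

WITH_QUERY = "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"
WITHOUT_QUERY = "spotify.com/track/1HYcYZCOpaLjg51qUg8ilA"

# Greedy dot: swallows the query string because the lookahead is optional.
GREEDY = re.compile(r"(?<=track\/)(.*)(?=\??)")
# Negated class: stops at the first question mark, if there is one.
NEGATED = re.compile(r"(?<=track\/)([^?]*)(?=\??)")


def extract(pattern, url):
    """Return the first capture group, or None when nothing matches."""
    m = pattern.search(url)
    return m.group(1) if m else None


print(extract(GREEDY, WITH_QUERY))      # includes ?si=...
print(extract(NEGATED, WITH_QUERY))     # ID only
print(extract(NEGATED, WITHOUT_QUERY))  # ID only
```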
This should work:
track\/(\w+)
Please see here.
Since track is part of both the strings, and the ID is formed from alphanumeric characters, the above regex which matches the string "track/" and captures the alphanumeric characters after that string, should provide the required ID.
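For comparison, here is a minimal Python sketch of the capture-group approach without any look-arounds. The function name track_id is hypothetical, and it assumes the ID only ever contains \w characters.

```python
import re

# Match the literal "track/" and capture the word characters after it.
TRACK = re.compile(r"track\/(\w+)")


def track_id(url):
    """Return the captured ID, or None if the URL has no track segment."""
    m = TRACK.search(url)
    return m.group(1) if m else None


print(track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA?si=Nf5w1q9MTKu3zG_CJ83RWA"))
print(track_id("spotify.com/track/1HYcYZCOpaLjg51qUg8ilA"))
```

The ID is read from group 1 rather than from the full match, which is why no lookbehind is needed.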
Regex : (\w+(?=\?))|(\w+$)
See the demo for the regex, https://regexr.com/3s4gv .
This will first try to find a word that has '?' just after it, and if that's unsuccessful it will fetch the last word.

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest simply prepending ^[^\/]*\. to your current regex.
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and exactly 5 digits).
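The three parts above can be tried out with a short Python sketch. Python's re module is close to PCRE for these constructs, but that is an assumption about your rule engine; the helper name first_match is made up for the example.

```python
import re

# The original rule: matches a TLD plus five digits anywhere in the URL.
UNANCHORED = re.compile(r"(\.[a-z]{2,4})/\d{5}")
# The anchored rule: no "/" may appear before the TLD.
ANCHORED = re.compile(r"^[^\/]*\.([a-z]{2,4})/\d{5}")


def first_match(pattern, url):
    """Return the text of the first match, or None."""
    m = pattern.search(url)
    return m.group(0) if m else None


late = "domain.net/stuff/stuff=www.google.com/12345"
early = "domain.net/12345/stuff=www.google.com/12345"

print(first_match(UNANCHORED, late))  # the unwanted late match
print(first_match(ANCHORED, late))    # no match
print(first_match(ANCHORED, early))   # match on the first TLD
```

Note that the full match of the anchored version starts at the beginning of the URL, so it includes the host name; the TLD itself is in group 1.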
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Notepad++ regex group capture

I have such txt file:
ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua
Trying to delete all subdomains with such regex:
Find: .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1
Receive:
prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua
Why last line becomes com.ua instead of jwbefw.com.ua ?
This works without look around:
Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2
It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.
Incorrect
Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:
Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2
It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.
The .+ part is matching as much as possible. Try using .+? instead, and it will capture the least possible, allowing the com.ua option to match.
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
This answer still uses the specific domain names that the original question was looking at. As some TLDs (top-level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.
Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy.
.+ will zip all the way to the right at the end of the line capturing everything,
then work its way backwards (to the left) looking for a match from here:
(ru|ua|com\.ua|com|net|info)
With srfsf.jwbefw.com.ua, the regex engine will first fail to match at the final "a",
then it will move the token one place to the left to look at "ua".
At that point, ua from the regex (the second option) is a match.
The engine will not keep looking to find "com.ua" because "ua" already met that requirement.
Niet the Dark Absol's answer tells the regex to be "lazy"
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, with .+? matching one more character, and then evaluate the rest of the regex again.
The .+? will eventually consume: srfsf.jwbefw before matching the period, and then matching com.ua.
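The greedy and lazy behaviour described above can be reproduced with a short Python sketch. It uses Python's re module rather than Notepad++, so treat it as an approximation of the editor's engine; the helper name kept_part is hypothetical.

```python
import re

TLDS = r"(ru|ua|com\.ua|com|net|info)"
# The question's original pattern: greedy .+ at the front.
GREEDY = re.compile(r".+\.((.*?)\." + TLDS + r")$")
# The same pattern with a lazy .+? at the front.
LAZY = re.compile(r".+?\.((.*?)\." + TLDS + r")$")


def kept_part(pattern, host):
    """Return group 1, i.e. the text the replacement \\1 would keep."""
    m = pattern.search(host)
    return m.group(1) if m else None


host = "srfsf.jwbefw.com.ua"
print(kept_part(GREEDY, host))  # greedy prefix leaves only the short TLD
print(kept_part(LAZY, host))    # lazy prefix lets com.ua match
```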
But the implementation of ? also creates issues.
Adding in the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru
This is because the first period, the one after xx, will match \., then inside group 1 (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
To avoid this, change the second group from (.*?) to ([\w-]*?) so it won't capture a period, only letters, numbers, underscores, or a dash.
resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$
Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.
last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
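To see the final pattern at work on the whole list, here is a sketch in Python. It assumes Python's re behaves like Notepad++ for these constructs, uses a plain ASCII xxx for the first host, and the function name strip_subdomains is invented for the example.

```python
import re

# Final pattern from the answer above: lazy prefix, a label without dots,
# and a non-capturing whitelist of TLDs.
PATTERN = re.compile(r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$")


def strip_subdomains(host):
    """Replace the whole match with group 1; unmatched hosts stay as-is."""
    return PATTERN.sub(r"\1", host)


hosts = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]

for host in hosts:
    print(host, "->", strip_subdomains(host))
```

Hosts that have no subdomain do not match at all, so the substitution leaves them untouched.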
Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1
Look at the picture above to see what each part of the regex captures.
It is colored in such a way that you will not need a regex explanation anymore.

Regex match everything after question mark?

I have a feed in Yahoo Pipes and want to match everything after a question mark.
So far I've figured out how to match the question mark using..
\?
Now just to match everything that is after/follows the question mark.
\?(.*)
You want the content of the first capture group.
Try this:
\?(.*)
The parentheses are a capturing group that you can use to extract the part of the string you are interested in.
If the string can contain new lines you may have to use the "dot all" modifier to allow the dot to match the new line character. Whether or not you have to do this, and how to do this, depends on the language you are using. It appears that you forgot to mention the programming language you are using in your question.
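As one concrete illustration of the "dot all" modifier, here is how it looks in Python, where the flag is re.DOTALL. Yahoo Pipes or another language may spell this differently, so take the flag name as specific to this sketch; the sample text is made up.

```python
import re

text = "feed?first line\nsecond line"

# By default the dot stops at the newline.
default = re.search(r"\?(.*)", text).group(1)
# With DOTALL the dot also matches newline characters.
dotall = re.search(r"\?(.*)", text, re.DOTALL).group(1)

print(repr(default))
print(repr(dotall))
```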
Another alternative that you can use if your language supports fixed width lookbehind assertions is:
(?<=\?).*
With the positive lookbehind technique:
(?<=\?).*
(We're searching for a text preceded by a question mark here)
Input: derpderp?mystring blahbeh
Output: mystring blahbeh
Example
Basically, (?<=...) is a lookbehind construct: it requires the escaped question mark to appear immediately before the position where the match begins, without making it part of the match.
Lookbehinds perform really well, but not all implementations support them.
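The input and output above can be reproduced in Python, which is one of the implementations that supports fixed-width lookbehinds. The function name after_question_mark is hypothetical.

```python
import re

# The lookbehind asserts a "?" just before the match without consuming it.
AFTER_QMARK = re.compile(r"(?<=\?).*")


def after_question_mark(text):
    """Return everything after the first question mark, or None."""
    m = AFTER_QMARK.search(text)
    return m.group(0) if m else None


print(after_question_mark("derpderp?mystring blahbeh"))
```

Here the wanted text is the full match itself, so no capture group has to be read.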
\?(.*)$
If you want to match all chars after "?" you can use a group to match any char, and you'd better use the "$" sign to indicate the end of line.
\?(.*\n)+
With this you can get everything, even across new lines (the question mark has to be escaped, otherwise it is read as a quantifier).
Check out this site: http://rubular.com/ Basically the site allows you to enter some example text (what you would be looking for on your site) and then as you build the regular expression it will highlight what is being matched in real time.
str.replace(/^.+?\"|^.|\".+/, '');
This is a poor fit when you want finer control over what gets removed around the quotes, and you cannot use it more than twice in one string. All it does is select whatever is not in between the double quotes and replace it with nothing.
Even for me it is a bit confusing, but I'll try to explain it. ^.+?\" lazily matches everything from the start of the string up to and including the first ". The | means "or" (alternation): the second alternative, ^., matches just the first character, and the third, \".+, matches a " and everything that comes after it.

Problem with basic regex to match ending optional character

Hi all, I was hoping someone could help me with some basic regex I am really struggling with.
Basically, I need to match a URL for redirection. I have been using
^~/abc(/)?
however, I need to change the end part to check only the last optional character, as this will also match ~/abcd
How about ^~/abc(/?)
or more generally: ^~/[a-zA-Z0-9]+/?
Assuming PCRE, you will want:
^~/abc(.)?$
Which will match "~/abc" followed (optionally) by any single character, which will be captured. Leave the () off if you don't need to capture said character.
Just like ^ matches the beginning of string (or line, depending upon mode), $ matches the end of string (or line).
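The effect of the $ anchor can be checked with a small Python sketch. The STRICT pattern below, ^~/abc/?$, is an assumption about what the question is after (only an optional trailing slash), not a pattern taken from the answers, and the helper name is_match is made up.

```python
import re

# The question's pattern: no end anchor, so extra characters are allowed.
LOOSE = re.compile(r"^~/abc(/)?")
# Assumed intent: optional trailing slash, then the end of the string.
STRICT = re.compile(r"^~/abc/?$")


def is_match(pattern, path):
    """Return True if the pattern matches the path."""
    return pattern.search(path) is not None


for path in ["~/abc", "~/abc/", "~/abcd"]:
    print(path, is_match(LOOSE, path), is_match(STRICT, path))
```

Note that ^~/abc(.)?$ from the answer above accepts any single trailing character, so it would still match ~/abcd; use /? instead of (.)? if only a slash should be allowed.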
I'll do something like this :
^~/([a-zA-Z0-9]+/?)*$