Regex to match a string within a url

I need to pull out links with a certain set of numbers and a string in their URL in Google Analytics, so I'm setting up a filter.
This is my input url: http://website.com/content/123/12/1234?utm_source=ABC&utm_campaign=ThisIsWhatINeed
In this link, I need the regex to match /content/123/12/1234 (or any numbers in the xxx/xx/xxxx format) and also to match the exact string ThisIsWhatINeed
I have the regex \/content\/\d+\/\d+\/\d+ to match the number part /content/123/12/1234, and this works fine. But I can't figure out how to also match the ThisIsWhatINeed. I've tried \/content\/\d+\/\d+\/\d+ThisIsWhatINeed but some vital part is missing.
I've been using a regex tester and it says that everything matches, but then at the end I get the message 'Global pattern flags g modifier: global. All matches (don't return after first match)'
I will confess I'm very new to regex and am just learning what all the tokens mean.
PS - I know I can pull out campaigns by other means in GA - I have a specific reason for needing to set up this filter

If you want to match the whole string:
To match the /123/12/1234 part you can use a character class.
To match a more generic link, you could replace http://website.com/ with just .*?
To match your string after the campaign attribute, you can use a negated character class, marked by ^ at the start of the character class. This means the pattern matches every character as long as it's not an & sign.
http://website.com/content/[\d/]+.*?utm_campaign=[^&]*
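As a rough illustration, the pattern above can be tried with Python's re module. This is only a sketch: the two capture groups, the escaped dots in the host, and the names URL_RE and extract are additions made for the example, and the regex flavor used by Google Analytics filters may behave differently.

```python
import re

# Pattern from the answer, with the dots in the host escaped and two
# capture groups added (both are assumptions made for this sketch).
URL_RE = re.compile(r"http://website\.com(/content/[\d/]+).*?utm_campaign=([^&]*)")

def extract(url):
    """Return (path, campaign) if the URL matches, otherwise None."""
    m = URL_RE.search(url)
    return (m.group(1), m.group(2)) if m else None

url = "http://website.com/content/123/12/1234?utm_source=ABC&utm_campaign=ThisIsWhatINeed"
print(extract(url))
```

To require the exact campaign name rather than any value, the [^&]* part could be swapped for the literal ThisIsWhatINeed.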
To explain the global modifier:
Usually a regex search returns after the first match. So if you're trying to match multiple links, only the first one is returned and the search stops there.
When the global flag is set, the pattern keeps matching as often as possible and stops only when there is no match left. The message in your tester just describes this flag; it is not an error.
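Python has no g flag, but the same distinction shows up as a choice between two functions, which may make the behaviour easier to see. A minimal sketch, with a made-up input string:

```python
import re

# The asker's pattern for the numeric part of the path.
LINK_RE = re.compile(r"/content/\d+/\d+/\d+")

# Hypothetical text containing two matching links.
text = "see /content/123/12/1234 and /content/456/78/9012"

# search() stops at the first match, like a tester without the g flag.
first_only = LINK_RE.search(text).group(0)

# findall() keeps going until no match is left, like the g flag.
all_matches = LINK_RE.findall(text)

print(first_only)
print(all_matches)
```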
Hope this helps!

Related

Regex Email validation with some special cases [duplicate]

I am trying to make a regex match which is discarding the lookahead completely.
\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*
This is the match and this is my regex101 test.
But when an email starts with - or _ or . it should not match it completely, not just remove the initial symbols. Any ideas are welcome, I've been searching for the past half an hour, but can't figure out how to drop the entire email when it starts with those symbols.

You can use the word boundary near @ with a positive lookbehind to check if we are at the beginning of a string or right after a whitespace, then check that the 1st symbol is not one of the unwanted characters, using the negated class [^\s\-_.]:
(?<=^|\s)[^\s\-_.]\w*(?:[-+.]\w+)*\b@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
See demo
List of matches:
support@github.com
s.miller@mit.edu
j.hopking@york.ac.uk
steve.parker@soft.de
info@company-hotels.org
kiki@hotmail.co.uk
no-reply@github.com
s.peterson@mail.uu.net
info-bg@software-software.software.academy
Additional notes on usage and alternative notation
Note that it is best practice to use as few escaped chars as possible in the regex, so, the [^\s\-_.] can be written as [^\s_.-], with the hyphen at the end of the character class still denoting a literal hyphen, not a range. Also, if you plan to use the pattern in other regex engines, you might find difficulties with the alternation in the lookbehind, and then you can replace (?<=\s|^) with the equivalent (?<!\S). See this regex:
(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*
And last but not least, if you need to use it in a language or engine that does not support lookbehinds (older JavaScript engines, for example), replace the (?<!\S)/(?<=\s|^) with a non-capturing group (?:\s|^), wrap the whole email pattern part with a set of capturing parentheses and use the language means to grab Group 1 contents:
(?:\s|^)([^\s_.-]\w*(?:[-+.]\w+)*\b@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*)
See the regex demo.
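For what it's worth, here is a small Python sketch of the (?<!\S) variant. Python's built-in re requires fixed-width lookbehinds, so the (?<=^|\s) form is not used here. The sample text and the names EMAIL_RE and found are made up for illustration.

```python
import re

# The (?<!\S) variant from the answer: the address must start at the
# beginning of the string or right after whitespace, and its first
# character must not be '_', '.' or '-'.
EMAIL_RE = re.compile(
    r"(?<!\S)[^\s_.-]\w*(?:[-+.]\w+)*\b@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*"
)

# Hypothetical input mixing acceptable and unwanted addresses.
text = "support@github.com -bad@x.com _foo@bar.org s.miller@mit.edu .dot@a.io"

# All groups in the pattern are non-capturing, so findall returns
# the full matches.
found = EMAIL_RE.findall(text)
print(found)
```

Addresses starting with -, _ or . are dropped entirely instead of being matched from their second character, because the lookbehind rejects any start position that follows a non-whitespace character.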

I use this for multiple email addresses, separated by ';':
([A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4};)*
For a single mail:
[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

Regex Last Character or Position from Found String

I've looked through Regex Last occurrence? but cannot get the regex to work for my example string ("https://www.fakesite.com test one"). I need to return the last character of the website name only (or the position). I have the expressions for both capturing the site and obtaining the last character but cannot get the expression to get the right look behind.
(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,}) <- Regular Expression for website
(?=.?$). <- Regular Expression for retrieving last character
I've been using https://regex101.com/ to try and find, but no luck.
How can I retrieve the last character or position?
--Edit--
How can I retrieve just the last character of any string? (I need just the letter 'r' in System Engineer). 'System Engineer' is dynamic.
"for the position of System Engineer located in"
(?<=position of )(.*)(?= located) <- regex to capture System Engineer between words 'position of' and 'located'

You may try the regex below. It checks for a valid URL and also captures the last character of the site name.
https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}(\w)\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*)
Explanation of the above regex:
https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256} - Matches the http:// or https:// part of the URL along with www and the domain name up to the dot before its last part.
[a-zA-Z0-9()]{1,6} - Matches the last part of the domain (for example com), except for its final character.
(\w) - Represents a capturing group capturing the last character of the url. You may use ([a-zA-Z0-9]) manually if you don't want to include _.
\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*) - Matches the rest of the URL, such as a path or query string, if there is any.
You can find the demo of the above regex in here.
Reference: The regex for matching the valid url is taken from this answer.
If you would rather amend your own regex, just add [a-zA-Z] after it. You can find the demo here.
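A short Python sketch of how the capture group can be read back, using the asker's example string. The names URL_LAST_CHAR, last_char and last_pos are chosen for this example only; the position is a zero-based index into the input.

```python
import re

# The regex from the answer; group 1 holds the last character of the
# site name.
URL_LAST_CHAR = re.compile(
    r"https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}(\w)\b"
    r"(?:[-a-zA-Z0-9()@:%_\+.~#?&\/\/=]*)"
)

text = "https://www.fakesite.com test one"

m = URL_LAST_CHAR.search(text)
if m:
    last_char = m.group(1)   # the captured character
    last_pos = m.start(1)    # zero-based index of that character in text
    print(last_char, last_pos)
```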

Parsing multiple groups from a regular expression

I am having a problem parsing some fields with the following regular expression, which I uploaded to rubular. The string that I am parsing is a special header from the banner of an FTP server. In order to process this banner, I need to parse the line
special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^
I thought that: (?i)^special(:[pmu](.*?)\^)?* would do the trick, however unfortunately this only gives me the last match and I am not sure why as I am lazily trying to capture each group. Also note that I should be able to capture an empty string also, i.e. if for ex the match string contains :u^
Match result:
special:pTXT1TOMATCH^:mTXT2TOMATCH^:uTXT3TOMATCH^
Match groups:
:uTXT3TOMATCH^
TXT3TOMATCH
The idea is that the line must start with the text 'special' followed by up to 3 capture groups, each introduced by p, m or u and matched lazily up to the next ^ symbol. I need to capture the text indicated above - basically I need to find TXT1TOCAPTURE, TXT2TOCAPTURE, and TXT3TOCAPTURE. There should be at least one of these three capture groups.
Thanks in advance

You have two problems with your RegEx, one is syntactic and one is conceptual.
Syntactic:
?* is not a valid quantifier in PCRE, but Ruby treats it the same as *, which is a greedy quantifier. When a quantifier is applied to a capturing group, the group keeps only the last match.
Conceptual:
Using a lazy quantifier .*? doesn't give you continuous matches. It stops as soon as the engine is satisfied. Even with the g modifier on, the next match will never occur, as there is no ^special at the position where the last match ended.
The solution is to use the \G token, which makes a match start at the end of the previous match:
(?:special|(?!\A)\G):([pmu][^^]*\^)
Live demo

You might want to use the \G anchor:
(?:(?:^special:)|\G(?!\A)\^:)[pmu]([^^]+)
See it working on rubular.com.
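Python's built-in re module has no \G anchor, so the sketch below swaps in a different, two-step technique to get the same result: first check that the line starts with 'special' followed by one or more fields, then collect every field from the matched part. The names LINE_RE, FIELD_RE and parse_banner are hypothetical. Unlike [^^]+, the [^^]* used here also accepts an empty field such as :u^.

```python
import re

# Step 1: the line must start with 'special' followed by one or more
# fields of the form :p...^, :m...^ or :u...^ (case-insensitive).
LINE_RE = re.compile(r"^special((?::[pmu][^^]*\^)+)", re.IGNORECASE)

# Step 2: pull every field out of the part matched in step 1.
FIELD_RE = re.compile(r":([pmu])([^^]*)\^", re.IGNORECASE)

def parse_banner(line):
    """Return a list of (letter, text) pairs, or [] if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return []
    return FIELD_RE.findall(m.group(1))

print(parse_banner("special:pTXT1TOCAPTURE^:mTXT2TOCAPTURE^:uTXT3TOCAPTURE^"))
```

Collecting the fields with a second pattern sidesteps the problem described above, where a repeated capturing group keeps only its last iteration.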

Regex - Match the pattern of a string representing a variable and its assigned value

I am trying to find a regex pattern that would enable me to swiftly search through my source code to find the following string pattern:
placeholder="any text here"
I have tried the following regex pattern however, it does not exclusively capture strings beginning with the sub-string "placeholder".
placeholder=\".+\"

You need to make the quantifier lazy, with ?. Otherwise it captures the longest possible match. Also, there is no need to escape the quotes.
placeholder=".+?"
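The difference is easiest to see on a line that contains more than one quoted attribute. A minimal Python sketch, with a made-up input line:

```python
import re

# Hypothetical source line with two placeholder attributes.
line = 'placeholder="first" class="x" placeholder="second"'

# Greedy: .+ runs to the last quote on the line.
greedy = re.findall(r'placeholder=".+"', line)

# Lazy: .+? stops at the first closing quote.
lazy = re.findall(r'placeholder=".+?"', line)

print(greedy)
print(lazy)
```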

RegEx for 15 character alpha-numeric character within JSON

I am using RegEx to assert the response of an API call, but it's currently a little too 'greedy' and ends up matching all kinds of responses. The RegEx bits are needed since the actual IDs in the response will be different each time.
The RegEx assertion is this:
{data:\[{"name":"Mat","~id":"(.*)"},{"name":"Laurie","~id":"(.*)"}\]},"something":true}
Which matches this correct response:
{data:[{"name":"Mat","~id":"4fd5ec146fc2ee0fff234234"},{"name":"Laurie","~id":"4fd5ec146fc2ee0fff234227"}]},"something":true}
as well as this incorrect response:
{data:[{"name":"Mat","~id":"4fd5ec146fc2ee0fff234234"},{"name":"Laurie","~id":"4fd5ec146fc2ee0fff234227"},{"name":"John","~id":"4fd5ec146fc2ee0fff234237"},{"name":"Paul","~id":"4fd5ec146fc2ee0fff234238"},{"name":"George","~id":"4fd5ec146fc2ee0fff234239"}]},"something":true}
The second (.*) is not just matching the ID of the second item, but it's matching the ID and all the other unwanted objects.
So I guess I need to make my RegEx be a little more strict when it comes to the ~id fields. Since the IDs will always be 24 hex characters, I'd like to replace the (.*) with something more appropriate.
I am writing this in Go, and therefore using Go's RegExp package.
And am using http://regexpal.com/ to test the RegEx

You can use [^"]*, [^"]{24} or [0-9a-fA-F]{24} instead of .* for your ID fields.
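The question uses Go, but the idea can be sketched the same way in Python. The literal braces and brackets are escaped, the (.*) groups are replaced with [0-9a-fA-F]{24}, and fullmatch is used so that trailing content cannot slip through. The names HEX_ID, RESPONSE_RE, good and bad are chosen for this example.

```python
import re

# Exactly 24 hexadecimal characters.
HEX_ID = r"[0-9a-fA-F]{24}"

# The asker's assertion with (.*) replaced by the stricter ID pattern.
RESPONSE_RE = re.compile(
    r'\{data:\[\{"name":"Mat","~id":"(' + HEX_ID + r')"\},'
    r'\{"name":"Laurie","~id":"(' + HEX_ID + r')"\}\]\},"something":true\}'
)

good = ('{data:[{"name":"Mat","~id":"4fd5ec146fc2ee0fff234234"},'
        '{"name":"Laurie","~id":"4fd5ec146fc2ee0fff234227"}]},"something":true}')
bad = ('{data:[{"name":"Mat","~id":"4fd5ec146fc2ee0fff234234"},'
       '{"name":"Laurie","~id":"4fd5ec146fc2ee0fff234227"},'
       '{"name":"John","~id":"4fd5ec146fc2ee0fff234237"}]},"something":true}')

print(RESPONSE_RE.fullmatch(good) is not None)
print(RESPONSE_RE.fullmatch(bad) is not None)
```

Because the ID pattern cannot match quotes, commas or braces, the second group can no longer swallow the extra objects.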

. (dot) in a regular expression will match anything, since a dot in RegEx is a special character that matches any single character (the exception is newline characters).
To always match exactly 24 hex characters, use this RegEx (the ^ and $ anchors apply when the ID is tested on its own):
^[A-Fa-f0-9]{24}$
Peace
