Regex Last Character or Position from Found String - regex

I've looked through Regex Last occurrence? but cannot get the regex to work for my example string ("https://www.fakesite.com test one"). I need to return the last character of the website name only (or the position). I have the expressions for both capturing the site and obtaining the last character but cannot get the expression to get the right look behind.
(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,}) <- Regular Expression for website
(?=.?$). <- Regular Expression for retrieving last character
I've been using https://regex101.com/ to try and find, but no luck.
How can I retrieve the last character or position?
--Edit--
How can I retrieve just the last character of any string? (I need just the letter 'r' in System Engineer). 'System Engineer' is dynamic.
"for the position of System Engineer located in"
(?<=position of )(.*)(?= located) <- regex to capture System Engineer between words 'position of' and 'located'

You may try the regex below. It checks for a valid URL and also captures the last character of that URL.
https?:\/\/(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}(\w)\b(?:[-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*)
Explanation of the above regex:
https?:\/\/(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256} - Matches the http:// or https:// part of the URL, an optional www., and the domain name before the dot.
[a-zA-Z0-9()]{1,6} - Matches the part after that dot, except for its last character.
(\w) - A capturing group that captures the last character of the URL. Use ([a-zA-Z0-9]) instead if you don't want to include _.
\b(?:[-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*) - Matches whatever remains of the URL, such as .uk or .in, if anything.
You can find the demo of the above regex in here.
Reference: The regex for matching the valid url is taken from this answer.
If you would rather amend your own regex, just append [a-zA-Z] to it. You can find the demo here.
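As a rough illustration of the answer above, here is a minimal Python sketch that applies the suggested pattern and reads both the captured last character and its position. The variable names are made up for the example, and the second pattern (for the "System Engineer" edit) is an assumed variation on the asker's lookaround, not something taken from the answer.

```python
import re

# Pattern from the answer; group 1 is the last character of the URL.
URL_LAST_CHAR = re.compile(
    r"https?:\/\/(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}(\w)\b"
    r"(?:[-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*)"
)

text = "https://www.fakesite.com test one"
m = URL_LAST_CHAR.search(text)
last_char = m.group(1)   # 'm'
last_pos = m.start(1)    # zero-based index of that character in `text`

# Assumed variation for the edit: let a greedy .* consume the title and
# capture only the single character right before " located".
sentence = "for the position of System Engineer located in"
title_last = re.search(r"(?<=position of ).*(.)(?= located)", sentence).group(1)

print(last_char, last_pos, title_last)
```

Reading `m.start(1)` gives the position without a second regex, which may be simpler than trying to combine the two original expressions with a lookbehind.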

Related

REGEX: Match all instances of text, digits, + , _ and -, Between colons, which are NOT part of an URL

I'd like to find and replace (with nothing) all instances of text between colons, like such:
:smile:
:thumbs_up:
:+1:
:-1:
but NOT if the colons are part of the url, like this URL for example:
http://pdf.reuters.com/htmlnews/htmlnews.asp?i=43059c3bf0e37541&u=urn:newsml:reuters.com:20190417:nPn5XHnXBa
As you can see, this URL has several colons and any such matches should be ignored.
The complete text can have some text before and after as well. In addition, these can also show up in succession, without any spaces in between. For example:
I was browsing and found this url :smile: http://pdf.reuters.com/htmlnews/htmlnews.asp?i=43059c3bf0e37541&u=urn:newsml:reuters.com:20190417:nPn5XHnXBa it's fantastic :smile::+1: Remember: don't forget to upvote!
I would expect the result to be:
I was browsing and found this url http://pdf.reuters.com/htmlnews/htmlnews.asp?i=43059c3bf0e37541&u=urn:newsml:reuters.com:20190417:nPn5XHnXBa it's fantastic Remember: don't forget to upvote!
I am using python regex module for my replacements.
My thinking is:
"Ok, I should find any URL and tell the regex to IGNORE any matches that are part of the URL"
So I have the regex to successfully match any URL as such:
(http[^\s]+)
This will find http and anything after it up to the next whitespace character (including a newline), which would indicate the end of the URL.
I also have regex to match the text between (including) colons:
(:[\w+-]+:)
SO... I was hoping to use negative lookahead and combine these 2 like this:
(?!http[^\s]+)(:[\w+-]+:)
This is ALMOST perfect but it ends up matching these 2 parts of the URL:
:newsml:
and
:20190417:
How can I build this regex so that it matches everywhere in the text, EXCEPT if the colons are part of the URL?
Thanks a million!
PS. I've been using this awesome site to test my patterns...
https://regexr.com/
One option is to have your regex match a URL pattern (captured in a group), or match something enclosed in :s, and then you can replace with the first captured group:
(https?://\S+)|:[\w+-]+:
replace with
\1
This ensures that URLs stay where they are in the text (each one is matched and replaced with itself), while the colon sections that you want to remove are matched and replaced with nothing.
https://regex101.com/r/d7mM1s/2
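Since the question mentions Python, the same idea could be sketched roughly as follows. The function name is invented for the example, and a replacement callback is used instead of the literal \1 so the behaviour does not depend on how a given version handles an unmatched group.

```python
import re

# Either a whole URL (captured in group 1) or a :token: to be dropped.
PATTERN = re.compile(r"(https?://\S+)|:[\w+-]+:")

def strip_colon_tokens(text: str) -> str:
    """Keep URLs untouched and delete :token: sequences elsewhere."""
    # group(1) is None when the :token: branch matched, so fall back to "".
    return PATTERN.sub(lambda m: m.group(1) or "", text)

url = ("http://pdf.reuters.com/htmlnews/htmlnews.asp?i=43059c3bf0e37541"
       "&u=urn:newsml:reuters.com:20190417:nPn5XHnXBa")
sample = (f"I was browsing and found this url :smile: {url} "
          "it's fantastic :smile::+1: Remember: don't forget to upvote!")
cleaned = strip_colon_tokens(sample)
print(cleaned)
```

Note that the whitespace around a removed token is left in place, so a separate pass to collapse repeated spaces may still be wanted.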

Regex - returning a match without a period

I'm using the below regex string to match the word "kohls" which is located in a group of other words.
\W*((?i)kohls(?-i))\W*
It works great when the word is alone, but if the word is in a url, the match includes a period on both sides.
See the below examples:
Thank you for shopping at Kohls - returns a match for kohls.
https://www.kohls.com - returns a match for .kohls.
Edit. https://www.KohlsAndMichaels.com - doesn't return any match for kohls.
I want it to only extract the exact match for kohls without periods or any other symbols/text in front or behind it. Can you tell me what I'm doing wrong?
In cases like that you can always use a site like regex101.com, which explains the regular expression and shows the matches with colors. So this is how your regular expression currently works:
As you can see in blue color, the problem with the dots is in the \W*, which matches any non-word character. In order to fix this, you can use the following regular expression:
\b((?i)kohls(?-i))\b
The \b (before and after the word you want to match) asserts the position at a word boundary. See how this works on that website now:
If you still have questions, look at the explanation of the regular expression provided by that website. It is worth looking.
The \W metacharacter matches non-word characters, so adding a star operator matches zero or more of them (such as periods). Did you mean to use a word boundary instead?
\b(?i)kohls(?-i)\b
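To see the word-boundary version against the three examples from the question, here is a small Python sketch. It passes re.IGNORECASE rather than using the inline (?i)...(?-i) toggles, because inline-flag syntax differs between regex engines; that substitution is an assumption of this example.

```python
import re

# \b on both sides: "kohls" must stand alone as a whole word.
KOHLS = re.compile(r"\bkohls\b", re.IGNORECASE)

samples = [
    "Thank you for shopping at Kohls",
    "https://www.kohls.com",
    "https://www.KohlsAndMichaels.com",
]

results = []
for s in samples:
    m = KOHLS.search(s)
    results.append(m.group(0) if m else None)

print(results)
```

The third sample yields no match because "Kohls" is immediately followed by another word character, so there is no boundary after it; dropping the trailing \b would be needed if that case should match too.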
Replace both \W* with [\W,\.\-]* etc.
Should be enough.

Regex to match a string within a url

I need to pull out links with a certain set of numbers and a string in their URL in Google Analytics, so I'm setting up a filter.
This is my input url: http://website.com/content/123/12/1234?utm_source=ABC&utm_campaign=ThisIsWhatINeed
In this link, I need the regex to match /content/123/12/1234 (or any numbers in the xxx/xx/xxxx format) and also to match the exact string ThisIsWhatINeed
I have the regex \/content\/\d+\/\d+\/\d+ to match the number part /content/123/12/1234, and this works fine. But I can't figure out how to also match the ThisIsWhatINeed. I've tried \/content\/\d+\/\d+\/\d+ThisIsWhatINeed but some vital part is missing.
I've been using a regex tester and it says that everything matches, but then at the end I get the message 'Global pattern flags g modifier: global. All matches (don't return after first match)'
I will confess I'm very new to regex and am just learning what all the tokens mean.
PS - I know I can pull out campaigns by other means in GA - I have a specific reason for needing to set up this filter
If you want to match the whole string:
To match the /123/12/1234 part you can use a character class.
To match a more generic link, you could exchange http://website.com/ to just .*?
To match your string after the campaign attribute, you can use a negated character class, marked by ^ at the start of the class. This means the pattern matches any character as long as it is not an & sign.
http://website.com/content/[\d/]+.*?utm_campaign=[^&]*?
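As an illustration only (Google Analytics uses its own regex engine, so treat the Python below as a stand-in), the two requirements can be combined in one pattern. The sketch uses a greedy [^&]* because a lazy [^&]*? at the very end of a pattern is satisfied by matching nothing.

```python
import re

url = ("http://website.com/content/123/12/1234"
       "?utm_source=ABC&utm_campaign=ThisIsWhatINeed")

# Path with three numeric segments, then the exact campaign value later on.
exact = re.compile(r"/content/\d+/\d+/\d+.*?utm_campaign=ThisIsWhatINeed")

# More generic variant: capture whatever the campaign value is, up to the next &.
generic = re.compile(r"/content/[\d/]+.*?utm_campaign=([^&]*)")

exact_match = exact.search(url)
campaign = generic.search(url).group(1)

print(exact_match.group(0))
print(campaign)
```

The original attempt failed because it placed ThisIsWhatINeed directly after the digits, leaving nothing to consume the ?utm_source=ABC&utm_campaign= text in between; the .*? fills that gap.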
To explain the global modifier:
Usually your attempt to match something with regex would return on the first match. So if you're trying to match multiple links, the first match would return and stop your request.
When global flag is set, the pattern will match as often as possible and returns, when there is no match left.
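Python has no g flag, but the difference described above can be sketched with search (first match only) versus findall (every match). The sample text is invented for the example.

```python
import re

pattern = re.compile(r"/content/\d+/\d+/\d+")
text = "see /content/123/12/1234 and also /content/456/78/9012"

# Without "global": stop at the first match.
first_only = pattern.search(text).group(0)

# With "global": keep going until no match is left.
all_matches = pattern.findall(text)

print(first_only)
print(all_matches)
```

So the message from the regex tester is just a description of the flag that is switched on, not an error.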
Hope this helps!

regex capturing to start at \b or end of (www\.)

I am trying to capture the first occurrence of anything that looks like a domain name in a string. For example, my.domain.home.com from 'dfasdf https://www.my.domain.home.com fadsfas'. I am using the \b assertion or the non-capturing group (?:www\.) to mark the start of my capturing group, but instead I get www.my.domain.home.com, i.e. the www. is not stripped out.
This is my full regex:
\b(?:www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
this is the part that I am unsure of:
\b(?:www\.)
how can I make my capturing start at the beginning of the word OR end of 'www.'?
[CLARIFICATION]
If there is no 'www.' it should capture at the beginning of the word. If there is 'www.' it should start capturing after the dot in 'www.' at the beginning of the possible domain string.
I have checked it with https://www.regex101.com/r/NjR11m/1/tests as well, but my final destination is Teradata 15.10 regex, which is said to be compliant with the Perl dialect. So if you could help me within the Perl context, I guess I will be fine.
SELECT 'dfasdf https://www.my.domain.home.com fadsfas' AS string,
REGEXP_SUBSTR(string,
'\b(?:www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b'
) AS url_to_match;
For 'dfasdf https://my.domain.home.com fadsfas' it should return my.domain.home.com as well.
Additional examples of the strings that should also return my.domain.home.com
'dfasdf my.domain.home.com fadsfas'
'dfasdf ,my.domain.home.com-- fadsfas'
'dfasdf www.my.domain.home.com#fadsfas'
[SOLUTION]
REGEXP_SUBSTR(LOWER(string),
'\b(?!www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b'
)
The problem with www. being included in the match seems to be because you're using the 0th group (which is the full match, not just the capturing groups). While I don't know how to change that, it is possible to reformulate the regex so that group 0 and group 1 have the same value, like so:
\b(?!www\.)([-a-z0-9]{1,63}(?:\.[-a-z0-9]{1,63})+)
This just says the match can't start at www., rather than allowing the match to start there and then having to ignore it.
I've made a modified version of your regex that shows how it works. Note that if you want to match names with mixed-case alphanumerics you'll need to add A-Z to the a-z0-9, or turn on case-insensitivity; matching non-ascii domain names is more work, and left for the interested reader to work out.
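The negative-lookahead idea can be checked against the example strings from the question with a short Python sketch. It uses the pattern from the [SOLUTION] block and Python's re module as a stand-in for Teradata's Perl-style engine, which is an assumption; behaviour there may differ.

```python
import re

# \b(?!www\.) : start at a word boundary, but never at a leading "www."
DOMAIN = re.compile(
    r"\b(?!www\.)((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"
)

samples = [
    "dfasdf https://www.my.domain.home.com fadsfas",
    "dfasdf https://my.domain.home.com fadsfas",
    "dfasdf my.domain.home.com fadsfas",
    "dfasdf ,my.domain.home.com-- fadsfas",
    "dfasdf www.my.domain.home.com#fadsfas",
]

# group(0) is the whole match, which is what REGEXP_SUBSTR returns.
found = [DOMAIN.search(s.lower()).group(0) for s in samples]
print(found)
```

Because the lookahead rejects a start at www., the engine moves on to the next word boundary, which is right after the dot, so the full match itself begins at my.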

Regular Expression extract substring (.)*

I am trying to match with the following regex.
\d{11}(.*)
Which is any 11 digits followed by a string. I want to extract the trailing string, whatever it is.
I used RE2::FullMatch but it gives the first half (the 11 digits). How to get the sub-string matched with (.*) ?
string subStr;
RE2::FullMatch("<sip:+19073381121#216.67.108.201:5060;user=phone>;npi=ISDN", "(<sip:\\+(\\d{11}))(.*)", &subStr);
I am trying to extract everything starting from # in above string. Basically I want what matches to (.*) but the above function returns <sip:+19073381121.
I am not very familiar with regex, but I looked at different APIs for extracting substrings and found this one useful.
Remove the extra capturing groups from your regular expression.
<sip:\+\d{11}(.*)
The substring matched by (.*) is then the first capturing group ($1), the one you delimited with the parentheses, so it is what ends up in subStr.
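The effect of the extra groups can be illustrated with Python's re.fullmatch as a stand-in for RE2::FullMatch (an assumption of this sketch; the RE2 call itself is not shown). In both, groups are numbered by their opening parenthesis, so the first output receives group 1.

```python
import re

text = "<sip:+19073381121#216.67.108.201:5060;user=phone>;npi=ISDN"

# Original pattern: group 1 wraps the prefix, so the first capture is the prefix.
original = re.fullmatch(r"(<sip:\+(\d{11}))(.*)", text)

# Answer's pattern: (.*) is the only group, so it becomes group 1.
fixed = re.fullmatch(r"<sip:\+\d{11}(.*)", text)

print(original.group(1))
print(fixed.group(1))
```

Keeping the original groups and passing additional output arguments would be another option, but removing the unused groups keeps the call simpler.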