Google Sheets Regex Match all URL between two strings (In a paragraph) - regex

I have try to build my regex, to using Google Sheet to extract the domain url from any paragraph:
Website: https://www.interprism.co.jp/ => interprism.co.jp
Website: https://growupwork.com => growupwork.com
Email: contact#interprism.com website: None => interprism.com
HP: onetech.jp => onetech.jp
Web:interprism.jp/index.html => interprism.jp
I have tried with this look ok, =iferror(regexextract(A11,".+?[#|www.](.*\n?)( )")) but not match all case, any one can help me on this?
Best Regards
Nim

You can try:
=iferror(regexextract(A11,"(?:https?:\/\/)?(?:[^#]+#)?(?:www\.)?([^:\/?]+)"))
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
http matches the characters http literally (case sensitive)
s? matches the character s literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
Non-capturing group (?:[^#]+#)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
Match a single character not present in the list below [^#]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
# matches the character # literally (case sensitive)
Non-capturing group (?:www\.)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
www matches the characters www literally (case sensitive)
\. matches the character . literally (case sensitive)
1st Capturing Group ([^:\/?]+)
Match a single character not present in the list below [^:\/?]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
? matches the character ? literally (case sensitive)

Related

Regex Extract in Google Data Sttudio

How am I able to extract a string from an array within a field? I am able to use Breakdown Dimension to get the array, but I can't seem to figure out how to use REGEXP_EXTRACT to get just the 'friendly_name'.
I tried this REGEXP_EXTRACT(shared_attrs, '^friendly_name:\\s?"?([^";,]*)') but that didn't work.
Use
REGEXP_EXTRACT(shared_attrs, 'friendly_name"?\\s*:\\s*"?([^"]*)')
See proof.
EXPLANATION
- "friendly_name" - matches the characters friendly_name literally (case sensitive)
- '"?' - matches the character " with index 3410 (2216 or 428) literally (case sensitive) between zero and one times, as many times as possible, giving back as needed (greedy)
- "\s*" - matches any whitespace character (equivalent to [\r\n\t\f\v ]) between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- ":" - matches the character : with index 5810 (3A16 or 728) literally
- "\s*" - matches any whitespace character (equivalent to [\r\n\t\f\v ]) between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- '"?' - matches the character " with index 3410 (2216 or 428) literally (case sensitive) between zero and one times, as many times as possible, giving back as needed (greedy)
1st Capturing Group ([^"]*)
- Match a single character not present in the list below [^"]
- "*" - matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- '"' - matches the character " with index 3410 (2216 or 428) literally (case sensitive)

PCRE to find all urls without file extensions that include #, ?, & etc

Very much a rookie question - have the following which I am using to look for any urls without the following extensions (seems to work):
href="\S*(?i)(?<!\.html)(?<!\.pdf)(?<!\.doc)(?<!\.docx)(?<!\.ppt)(?<!\.pptx)(?<!\.xls)(?<!\.xlsx)(?<!\.jpg)(?<!\.jpeg)(?<!\.eps)"
Where I am struggling is trying to figure out how to also find file names with extensions and exclude such as:
test.html#help
test.html?help
test.html?help&please
Not sure how to take something like this (?<!\.html) and add a wildcard to handle anything after .html
Did some more testing via an online regex tester site and this seems to work - matches any of the file extensions including test.html#help etc :
href="\S*(?i)(((?<=\.html)\S*)|((?<=\.pdf)\S*)|((?<=\.doc)\S*)|((?<=\.ppt)\S*)|((?<=\.xls)\S*)|((?<=\.jpg)\S*)|((?<=\.jpeg)\S*)|((?<=\.eps)\S*))"
but this does not work at all:
href="\S*(?i)((?<!\.html)\S*)"
Any help greatly appreciated.
Does this work for you?
href="(?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
https://regex101.com/r/SPjvkR/1
href=" matches the characters href=" literally (case sensitive)
Named Capture Group url (?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
(?<url> is what gives the capturing group the name. It can be omitted if you don't want the capturing group to have a name or can be replaced with ?: to make the it a non-capturing group. Naming can just make it more convenient to get the group's value in later code if needed, but in your case I don't think it matters
.+? matches any character (except for line terminators) between one and unlimited times, as few times as possible, expanding as needed (lazy)
\. matches the character . with index 4610 (2E16 or 568) literally (case sensitive)
Negative Lookahead (?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
2nd Capturing Group (html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
.*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 3410 (2216 or 428) literally (case sensitive)
Update
New regex based on comments
href="((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
Regex Demo
href=" matches the characters href=" literally (case sensitive)
1st Capturing Group ((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
Negative Lookahead (?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\. matches the character . with index 4610 (2E16 or 568) literally (case sensitive)
Non-capturing group (?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 3410 (2216 or 428) literally (case sensitive)

How to match string starting from multiple characters (can be nested) via regex?

Im trying to make regex to get domain from different kinds of url.
Im using regex, that works properly either with links w/o # in domain part, e.g:
https://stackoverflow.com/questions/ask
https://regexr.com/
/(?<=(\/\/))[^\n|\/|:]+/g
For links with # (e.g. http://regex#regex.com) works with replasing \/\/ to \#:/(?<=(#))[^\n|\/|:]+/g
But when im trying to make regex to match both of these cases and making
/(?<=((\/\/)|(\#)))[^\n|\/|:]+/g
it dosen't work.
You should look for the string ://,(Positive Look Behind) if it comes in the string means it is domain and you need to capture everything after that. Whether it has # or not.
Case 1
Capture the whole string after ://
Regex:
(?<=\:\/\/).*
Explanation:
Positive Lookbehind (?<=\:\/\/)
Assert that the Regex below matches
\: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
\/ matches the character / literally (case sensitive)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Example
https://regex101.com/r/jsqqw8/1/
Case 2
Capture just the domain after ://
Regex:
(?<=:\/\/)[^\n|\/|:]+
Explanation:
Positive Lookbehind (?<=:\/\/)
Assert that the Regex below matches
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
\/ matches the character / literally (case sensitive)
Match a single character not present in the list below [^\n|\/|:]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\n matches a line-feed (newline) character (ASCII 10)
| matches the character | literally (case sensitive)
\/ matches the character / literally (case sensitive)
|: matches a single character in the list |: (case sensitive)
Case 3:
Capture domain after :// if there is not # in the text and if # present in the text, capture text after that.
Regex:
(?!:\/\/)(?:[A-z]+\.)*[A-z][A-z]+\.[A-z]{2,}
Explanation:
Negative Lookahead (?!:\/\/)
Assert that the Regex below does not match
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
\/ matches the character / literally (case sensitive)
Non-capturing group (?:[A-z]+\.)*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [A-z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
A-z a single character in the range between A (index 65) and z (index 122) (case sensitive)
\. matches the character . literally (case sensitive)
Match a single character present in the list below [A-z]
A-z a single character in the range between A (index 65) and z (index 122) (case sensitive)
Match a single character present in the list below [A-z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
A-z a single character in the range between A (index 65) and z (index 122) (case sensitive)
\. matches the character . literally (case sensitive)
Match a single character present in the list below [A-z]{2,}
{2,} Quantifier — Matches between 2 and unlimited times, as many times as possible, giving back as needed (greedy)
A-z a single character in the range between A (index 65) and z (index 122) (case sensitive)
Example:
https://regex101.com/r/jsqqw8/4

Validating emails in file with batch

I have a file with emails and I need to validate them.
The sequence is:
First name.
Dot.
Last name.
Number (optional - for same names).
static string domain(#utp.ac.pa).
I wrote this:
egrep -E [a-z]\.+[a-z][0-9]*#["utp.ac.pa"] test.txt
It should match this email: "anell.zheng#utp.ac.pa"
But it is also matching:
test4#utp.ac.pa
2anell#utp.ac.pa
Although they don't follow the sequence. What am I doing wrong?
Your regex doesn't even match the first email. If I understand your requirements correctly, this should work:
[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa
Note that to match a dot, it needs to be escaped (i.e., \.) because . matches any character.
You can get rid of A-Z if you don't want to match upper-case letters.
Try it online.
Let me know if this isn't what you want.
Regex: ^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$
Demo
Regex Details:
^ asserts position at start of a line
Match a single character present in the list below [A-Za-z]+
. matches the character . literally (case sensitive)
Match a single character present in the list below [A-Za-z]+
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:_\d+)*
Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
_ matches the character _ literally (case sensitive)
\d+ matches a digit (equal to [0-9])
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
#utp matches the characters #utp literally (case sensitive)
. matches the character . literally (case sensitive)
ac matches the characters ac literally (case sensitive)
. matches the character . literally (case sensitive)
pa matches the characters pa literally (case sensitive)
$ asserts position at the end of a line

Regex to match URL end-of-line or "/" character

I have a URL, and I'm trying to match it to a regular expression to pull out some groups. The problem I'm having is that the URL can either end or continue with a "/" and more URL text. I'd like to match URLs like this:
http://server/xyz/2008-10-08-4
http://server/xyz/2008-10-08-4/
http://server/xyz/2008-10-08-4/123/more
But not match something like this:
http://server/xyz/2008-10-08-4-1
So, I thought my best bet was something like this:
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)[/$]
where the character class at the end contained either the "/" or the end-of-line. The character class doesn't seem to be happy with the "$" in there though. How can I best discriminate between these URLs while still pulling back the correct groups?
To match either / or end of content, use (/|\z)
This only applies if you are not using multi-line matching (i.e. you're matching a single URL, not a newline-delimited list of URLs).
To put that with an updated version of what you had:
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|\z)
Note that I've changed the start to be a non-greedy match for non-whitespace ( \S+? ) rather than matching anything and everything ( .* )
You've got a couple regexes now which will do what you want, so that's adequately covered.
What hasn't been mentioned is why your attempt won't work: Inside a character class, $ (as well as ^, ., and /) has no special meaning, so [/$] matches either a literal / or a literal $ rather than terminating the regex (/) or matching end-of-line ($).
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)(/.*)?$
1st Capturing Group (.+)
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (\d{4}-\d{2}-\d{2})
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
4th Capturing Group (.*)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string
In Ruby and Bash, you can use $ inside parentheses.
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)
(This solution is similar to Pete Boughton's, but preserves the usage of $, which means end of line, rather than using \z, which means end of string.)