PCRE to find all urls without file extensions that include #, ?, & etc

Very much a rookie question. I am using the following to look for any URLs that do not end in one of these extensions (it seems to work):
href="\S*(?i)(?<!\.html)(?<!\.pdf)(?<!\.doc)(?<!\.docx)(?<!\.ppt)(?<!\.pptx)(?<!\.xls)(?<!\.xlsx)(?<!\.jpg)(?<!\.jpeg)(?<!\.eps)"
Where I am struggling is figuring out how to also exclude file names that have one of these extensions followed by something else, such as:
test.html#help
test.html?help
test.html?help&please
Not sure how to take something like this (?<!\.html) and add a wildcard to handle anything after .html
Did some more testing via an online regex tester site and this seems to work - matches any of the file extensions including test.html#help etc :
href="\S*(?i)(((?<=\.html)\S*)|((?<=\.pdf)\S*)|((?<=\.doc)\S*)|((?<=\.ppt)\S*)|((?<=\.xls)\S*)|((?<=\.jpg)\S*)|((?<=\.jpeg)\S*)|((?<=\.eps)\S*))"
but this does not work at all:
href="\S*(?i)((?<!\.html)\S*)"
Any help greatly appreciated.

Does this work for you?
href="(?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
https://regex101.com/r/SPjvkR/1
href=" matches the characters href=" literally (case sensitive)
Named Capture Group url (?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
(?<url> is what gives the capturing group its name. It can be omitted if you don't want the capturing group to have a name, or it can be replaced with ?: to make it a non-capturing group. Naming can make it more convenient to get the group's value in later code if needed, but in your case I don't think it matters.
.+? matches any character (except for line terminators) between one and unlimited times, as few times as possible, expanding as needed (lazy)
\. matches the character . with index 46 decimal (2E hex or 56 octal) literally (case sensitive)
Negative Lookahead (?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
2nd Capturing Group (html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
.*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 34 decimal (22 hex or 42 octal) literally (case sensitive)
Update
New regex based on comments
href="((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
href=" matches the characters href=" literally (case sensitive)
1st Capturing Group ((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
Negative Lookahead (?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\. matches the character . with index 46 decimal (2E hex or 56 octal) literally (case sensitive)
Non-capturing group (?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 34 decimal (22 hex or 42 octal) literally (case sensitive)
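As a rough illustration of the lookahead idea, here is a minimal Python sketch. It is not either of the patterns above: it assumes the extension list from the question, keeps the lookahead inside the quoted value with [^"]*, and adds an optional tail so that a listed extension followed by # or ? still counts as a listed extension. The sample markup is made up.

```python
import re

# Extensions taken from the question. The optional (?:[?#][^"]*)? tail is an
# assumption added here so that "test.html#help" and "test.html?help&please"
# are treated the same way as "test.html".
NO_EXT_HREF = re.compile(
    r'href="(?![^"]*\.(?:html|pdf|docx?|pptx?|xlsx?|jpe?g|eps)(?:[?#][^"]*)?")'
    r'([^"]*)"',
    re.IGNORECASE,
)

# Made-up sample markup for illustration only.
sample = '''
<a href="test.html">a</a>
<a href="test.html#help">b</a>
<a href="test.html?help&please">c</a>
<a href="/about">d</a>
<a href="/contact?ref=nav">e</a>
<a href="REPORT.PDF">f</a>
'''

# findall returns the single capturing group: the URL between the quotes.
unlisted = NO_EXT_HREF.findall(sample)
print(unlisted)  # ['/about', '/contact?ref=nav']
```

Python's re module is used here only for convenience; it is not PCRE, although this particular pattern uses no syntax that differs between the two.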

Related

Google Sheets Regex Match all URL between two strings (In a paragraph)

I am trying to build a regex in Google Sheets that extracts the domain from any paragraph:
Website: https://www.interprism.co.jp/ => interprism.co.jp
Website: https://growupwork.com => growupwork.com
Email: contact#interprism.com website: None => interprism.com
HP: onetech.jp => onetech.jp
Web:interprism.jp/index.html => interprism.jp
I have tried this, which looks OK, =iferror(regexextract(A11,".+?[#|www.](.*\n?)( )")) but it does not match all cases. Can anyone help me with this?
Best Regards
Nim

You can try:
=iferror(regexextract(A11,"(?:https?:\/\/)?(?:[^#]+#)?(?:www\.)?([^:\/?]+)"))
Non-capturing group (?:https?:\/\/)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
http matches the characters http literally (case sensitive)
s? matches the character s literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
Non-capturing group (?:[^#]+#)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
Match a single character not present in the list below [^#]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
# matches the character # literally (case sensitive)
Non-capturing group (?:www\.)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
www matches the characters www literally (case sensitive)
\. matches the character . literally (case sensitive)
1st Capturing Group ([^:\/?]+)
Match a single character not present in the list below [^:\/?]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
: matches the character : literally (case sensitive)
\/ matches the character / literally (case sensitive)
? matches the character ? literally (case sensitive)
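To check the sample rows from the question outside of Sheets, here is a small Python sketch. It uses a variation of the pattern rather than the formula above: the captured part must contain at least one dot, which is an assumption added so that labels such as "Website:" or "HP:" are skipped when the cell holds more than the bare URL. The helper name extract_domain is hypothetical.

```python
import re

# Variation on the answer's pattern (an assumption, not the original formula):
# the captured host must contain at least one dot, so a label such as
# "Website:" cannot be captured by mistake.
DOMAIN = re.compile(
    r'(?:https?://)?(?:[^\s#]+#)?(?:www\.)?([a-z0-9-]+(?:\.[a-z0-9-]+)+)',
    re.IGNORECASE,
)

def extract_domain(text):
    """Return the first domain-like token in text, or '' if there is none."""
    match = DOMAIN.search(text)
    return match.group(1) if match else ''

# Sample rows copied from the question.
rows = [
    'Website: https://www.interprism.co.jp/',
    'Website: https://growupwork.com',
    'Email: contact#interprism.com website: None',
    'HP: onetech.jp',
    'Web:interprism.jp/index.html',
]

for row in rows:
    print(row, '=>', extract_domain(row))
```

The pattern uses no lookarounds, so it should also be usable in regexextract, but that has not been checked here.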

Validating emails in file with batch

I have a file with emails and I need to validate them.
The sequence is:
First name.
Dot.
Last name.
Number (optional - for same names).
static string domain(#utp.ac.pa).
I wrote this:
egrep -E [a-z]\.+[a-z][0-9]*#["utp.ac.pa"] test.txt
It should match this email: "anell.zheng#utp.ac.pa"
But it is also matching:
test4#utp.ac.pa
2anell#utp.ac.pa
Although they don't follow the sequence. What am I doing wrong?

Your regex doesn't even match the first email. If I understand your requirements correctly, this should work:
[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa
Note that to match a dot, it needs to be escaped (i.e., \.) because . matches any character.
You can get rid of A-Z if you don't want to match upper-case letters.
Let me know if this isn't what you want.
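A quick way to check the pattern against the addresses from the question is a short Python sketch like the one below. It assumes the whole line must conform, so it uses fullmatch; the second candidate is a hypothetical address added to exercise the optional number, and the # separator is kept exactly as it appears in the question.

```python
import re

# Pattern from the answer above. fullmatch requires the entire string to
# conform, which plays the role of ^ and $ anchors.
EMAIL = re.compile(r'[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa')

candidates = [
    'anell.zheng#utp.ac.pa',
    'anell.zheng2#utp.ac.pa',  # hypothetical: same name with a number
    'test4#utp.ac.pa',         # no dot between first and last name
    '2anell#utp.ac.pa',        # starts with a digit, no last name
]

valid = [c for c in candidates if EMAIL.fullmatch(c)]
print(valid)  # ['anell.zheng#utp.ac.pa', 'anell.zheng2#utp.ac.pa']
```

Without anchoring, a line such as 2anell.zheng#utp.ac.pa would still produce a match starting at the first letter, so with grep it is worth adding ^ and $ or the -x (whole line) option.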

Regex: ^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$
Regex Details:
^ asserts position at start of a line
Match a single character present in the list below [A-Za-z]+
\. matches the character . literally (case sensitive)
Match a single character present in the list below [A-Za-z]+
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:_\d+)*
Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
_ matches the character _ literally (case sensitive)
\d+ matches a digit (equal to [0-9])
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
#utp matches the characters #utp literally (case sensitive)
\. matches the character . literally (case sensitive)
ac matches the characters ac literally (case sensitive)
\. matches the character . literally (case sensitive)
pa matches the characters pa literally (case sensitive)
$ asserts position at the end of a line

Regex / grep match on: "not this" and "that"

From a Linux command line, I would like to find all the instances in multiple files where I reference a figure without writing Fig. in front of the reference.
In other words, I'm looking in each line for a \ref{fig that is not prefaced by exactly Fig. followed by a space.
(1) Fig. \ref{fig:myFigure}
(2) A sentence with Fig. \ref{fig:myFigure} there.
(3) \ref{fig:myFigure}
(4) A sentence with \ref{fig:myFigure} there.
The regex should ignore cases (1) and (2), but find cases (3) and (4).

You can use Negative Lookahead like:
^((?!Fig\. {0,1}\\ref\{fig).)*$
https://regex101.com/r/wSw9iI/2
Negative Lookahead (?!Fig\.\s*\\ref\{fig) (this breakdown uses \s* where the pattern above uses a space followed by {0,1})
Assert that the Regex below does not match
Fig matches the characters Fig literally (case sensitive)
\. matches the character . literally (case sensitive)
\s* matches any whitespace character (equal to [\r\n\t\f\v ])
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
ref matches the characters ref literally (case sensitive)
\{ matches the character { literally (case sensitive)
fig matches the characters fig literally (case sensitive)
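Note that a pattern of the form ^((?!X).)*$ matches every line that does not contain X, including lines with no \ref at all. One possible alternative, sketched here in Python as an assumption rather than as the answer's method, is a negative lookbehind that flags \ref{fig only when it is not directly preceded by "Fig. " with exactly one space.

```python
import re

# Negative lookbehind: match \ref{fig only when the five characters just
# before it are not "Fig. " (exactly one space, as in the question).
BARE_REF = re.compile(r'(?<!Fig\. )\\ref\{fig')

# The four cases from the question, in order.
lines = [
    r'Fig. \ref{fig:myFigure}',
    r'A sentence with Fig. \ref{fig:myFigure} there.',
    r'\ref{fig:myFigure}',
    r'A sentence with \ref{fig:myFigure} there.',
]

# Collect the 1-based numbers of the cases that contain a bare reference.
flagged = [n for n, line in enumerate(lines, start=1) if BARE_REF.search(line)]
print(flagged)  # [3, 4]
```

The lookbehind has a fixed width, so the same pattern should be usable with a PCRE-based tool, though that is not tested here.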

Removing all strings except specific except some in Regex

I have tried to find the solution of this problem, but I still can't get the correct answer. Therefore, I decided to ask you all here for help.
I have some text :
CommentTimestamps:true,showVODCommentTimestamps:false,enableVODStreamingComments:false,enablePinLiveComments:false,enableFacecastAnimatedComments:false,permalink:"1",isViewerTheOwner:false,isLiveAudio:false,mentionsinput:{inputComponent:{__m:"LegacyMentionsInput.react"}},monitorHeight:false,viewoptionstypeobjects:null,viewoptionstypeobjectsorder:null,addcommentautoflip:true,autoplayLiveVODComments:true,disableCSSHiding:true,feedbackMode:"none",instanceid:"u_0_w",lazyFetch:true,numLazyComments:2,pagesize:50,postViewCount:"78,762",shortenTimestamp:true,showaddcomment:true,showshares:true,totalPosts:1,viewCount:"78,762",viewCountReduced:"78K"},{comments:[],pinnedcomments:[],profiles:{},actions:[],commentlists:{comments:{"1":{filtered:{range:{offset:32,length:0},values:[],count:32,clienthasall:false}}},replies:null},featuredcommentlists:{comments:null,replies:null},featuredcommentids:null,servertime:1492916773,feedb.........`
What I want to get is only : postViewCount:"78,762"
I have tried using [^(postViewCount\b.......)] but it is not what I want to get.

This should do it
(postViewCount:\"\d{2}\,\d{3}\")
https://regex101.com/r/9JENH0/1
postViewCount: matches the characters postViewCount: literally (case sensitive)
\" matches the character " literally (case sensitive)
\d{2} matches a digit (equal to [0-9]) {2} Quantifier — Matches exactly 2 times
\, matches the character , literally (case sensitive)
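This only matches counts written as two digits, a comma, and three digits. If the count can be shorter or longer (for example one million or larger), use (postViewCount:"(?:.*?)")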

Regex: postViewCount:"[^"]+"
1. postViewCount:" will match postViewCount:"
2. [^"]+ will match everything up to the next "

Try to match:
.*(postViewCount:"[0-9,]*").*
and replace it with the captured group, that is \1
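For a quick check of the quote-delimited approach, here is a minimal Python sketch. The variable blob is a shortened excerpt of the text in the question, used only for illustration.

```python
import re

# Shortened excerpt of the text in the question.
blob = 'pagesize:50,postViewCount:"78,762",shortenTimestamp:true,viewCount:"78,762"'

# [^"]+ stops at the closing quote, so any number of digit groups is accepted.
match = re.search(r'postViewCount:"[^"]+"', blob)
found = match.group(0) if match else None
print(found)  # postViewCount:"78,762"
```

Because the pattern stops at the closing quote rather than counting digits, it keeps working if the value grows to something like "1,234,567".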

Regex to match URL end-of-line or "/" character

I have a URL, and I'm trying to match it to a regular expression to pull out some groups. The problem I'm having is that the URL can either end or continue with a "/" and more URL text. I'd like to match URLs like this:
http://server/xyz/2008-10-08-4
http://server/xyz/2008-10-08-4/
http://server/xyz/2008-10-08-4/123/more
But not match something like this:
http://server/xyz/2008-10-08-4-1
So, I thought my best bet was something like this:
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)[/$]
where the character class at the end contained either the "/" or the end-of-line. The character class doesn't seem to be happy with the "$" in there though. How can I best discriminate between these URLs while still pulling back the correct groups?

To match either / or end of content, use (/|\z)
This only applies if you are not using multi-line matching (i.e. you're matching a single URL, not a newline-delimited list of URLs).
To put that with an updated version of what you had:
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|\z)
Note that I've changed the start to be a non-greedy match for non-whitespace ( \S+? ) rather than matching anything and everything ( .+ )

You've got a couple regexes now which will do what you want, so that's adequately covered.
What hasn't been mentioned is why your attempt won't work: inside a character class, $ (as well as ., /, and ^ anywhere but the first position) has no special meaning, so [/$] matches either a literal / or a literal $ rather than terminating the regex (/) or matching end-of-line ($).
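The effect is easy to see in a short Python sketch (Python's re treats $ inside a character class the same way). The test strings are made up for illustration.

```python
import re

# Inside [...] the $ is an ordinary character, not an end-of-line anchor.
in_class = re.compile(r'-(\d+)[/$]')

with_slash = bool(in_class.search('2008-10-08-4/'))   # literal slash follows
with_dollar = bool(in_class.search('2008-10-08-4$'))  # literal dollar follows
at_end = bool(in_class.search('2008-10-08-4'))        # nothing follows the 4

print(with_slash, with_dollar, at_end)  # True True False
```

The last case is the one the question cares about: a URL that simply ends after the number is rejected, because the class still demands one more character.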

/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)(/.*)?$
1st Capturing Group (.+)
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (\d{4}-\d{2}-\d{2})
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
4th Capturing Group (/.*)?
/ matches the character / literally (case sensitive)
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string

In Ruby and Bash, you can use $ inside parentheses.
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)
(This solution is similar to Pete Boughton's, but preserves the usage of $, which means end of line, rather than using \z, which means end of string.)
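The alternation can be tried against the URLs from the question with a minimal Python sketch. It assumes one URL per string; in Python's re, $ also matches just before a trailing newline, and \Z is the strict end-of-string anchor.

```python
import re

# Pattern from the answer above: after the number, require either a slash
# or the end of the string.
URL = re.compile(r'/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)')

# The four URLs from the question.
urls = [
    'http://server/xyz/2008-10-08-4',
    'http://server/xyz/2008-10-08-4/',
    'http://server/xyz/2008-10-08-4/123/more',
    'http://server/xyz/2008-10-08-4-1',
]

# Map each URL to its (date, number) groups, or None when it does not match.
parsed = {}
for url in urls:
    m = URL.search(url)
    parsed[url] = (m.group(2), m.group(3)) if m else None

for url, groups in parsed.items():
    print(url, '->', groups)
```

The first three URLs yield the date and number groups, and the fourth is rejected because the number is followed by a hyphen rather than a slash or the end of the string.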
