Removing all strings except a specific one in Regex - regex

I have tried to find the solution of this problem, but I still can't get the correct answer. Therefore, I decided to ask you all here for help.
I have some text :
CommentTimestamps:true,showVODCommentTimestamps:false,enableVODStreamingComments:false,enablePinLiveComments:false,enableFacecastAnimatedComments:false,permalink:"1",isViewerTheOwner:false,isLiveAudio:false,mentionsinput:{inputComponent:{__m:"LegacyMentionsInput.react"}},monitorHeight:false,viewoptionstypeobjects:null,viewoptionstypeobjectsorder:null,addcommentautoflip:true,autoplayLiveVODComments:true,disableCSSHiding:true,feedbackMode:"none",instanceid:"u_0_w",lazyFetch:true,numLazyComments:2,pagesize:50,postViewCount:"78,762",shortenTimestamp:true,showaddcomment:true,showshares:true,totalPosts:1,viewCount:"78,762",viewCountReduced:"78K"},{comments:[],pinnedcomments:[],profiles:{},actions:[],commentlists:{comments:{"1":{filtered:{range:{offset:32,length:0},values:[],count:32,clienthasall:false}}},replies:null},featuredcommentlists:{comments:null,replies:null},featuredcommentids:null,servertime:1492916773,feedb.........`
What I want to get is only: postViewCount:"78,762"
I have tried using [^(postViewCount\b.......)], but it does not give me what I want.

This should do it
(postViewCount:\"\d{2}\,\d{3}\")
https://regex101.com/r/9JENH0/1
postViewCount: matches the characters postViewCount: literally (case sensitive)
\" matches the character " literally (case sensitive)
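\d{2} matches a digit (equal to [0-9]); the {2} quantifier matches it exactly 2 times
\, matches the character , literally (case sensitive)
\d{3} matches a digit exactly 3 times
Note that this only matches counts from 10,000 to 99,999. If the count can have any other number of digits (for example one million or larger), use (postViewCount:"(?:.*?)") instead.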

Regex: postViewCount:"[^"]+"
1. postViewCount:" will match postViewCount:" literally
2. [^"]+ will match everything up to the next "
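As a minimal sketch of how this pattern could be used from code, the snippet below runs it with Python's re module. The sample text is a shortened stand-in for the text in the question, and the variable names are only illustrative.

```python
import re

# Shortened stand-in for the text in the question (an assumption for this demo).
text = 'pagesize:50,postViewCount:"78,762",shortenTimestamp:true,viewCount:"78,762"'

# postViewCount:" literally, then every character up to the next double quote.
pattern = re.compile(r'postViewCount:"[^"]+"')

match = pattern.search(text)
result = match.group(0) if match else None
print(result)  # postViewCount:"78,762"
```

Because [^"]+ stops at the first closing quote, the match does not depend on how many digits or commas the count contains.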

Try to match the whole text with
.*(postViewCount:"[0-9,]*").*
and replace it with the captured group, that is, \1
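The match-and-replace idea can be sketched in Python with re.sub. This assumes the input is a single line (without a flag such as re.DOTALL, .* does not cross newlines); the sample text is a shortened stand-in for the one in the question.

```python
import re

# Shortened, single-line stand-in for the text in the question (an assumption).
text = 'lazyFetch:true,postViewCount:"78,762",viewCount:"78,762",viewCountReduced:"78K"'

# Match the whole line, capture only the part we want, and keep just that group.
cleaned = re.sub(r'.*(postViewCount:"[0-9,]*").*', r'\1', text)
print(cleaned)  # postViewCount:"78,762"
```

If the text does not contain postViewCount at all, the pattern does not match and re.sub returns the input unchanged.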

Related

PCRE to find all urls without file extensions that include #, ?, & etc

Very much a rookie question - have the following which I am using to look for any urls without the following extensions (seems to work):
href="\S*(?i)(?<!\.html)(?<!\.pdf)(?<!\.doc)(?<!\.docx)(?<!\.ppt)(?<!\.pptx)(?<!\.xls)(?<!\.xlsx)(?<!\.jpg)(?<!\.jpeg)(?<!\.eps)"
Where I am struggling is figuring out how to also exclude file names whose extension is followed by something else, such as:
test.html#help
test.html?help
test.html?help&please
Not sure how to take something like this (?<!\.html) and add a wildcard to handle anything after .html
Did some more testing via an online regex tester site and this seems to work - matches any of the file extensions including test.html#help etc :
href="\S*(?i)(((?<=\.html)\S*)|((?<=\.pdf)\S*)|((?<=\.doc)\S*)|((?<=\.ppt)\S*)|((?<=\.xls)\S*)|((?<=\.jpg)\S*)|((?<=\.jpeg)\S*)|((?<=\.eps)\S*))"
but this does not work at all:
href="\S*(?i)((?<!\.html)\S*)"
Any help greatly appreciated.
Does this work for you?
href="(?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
https://regex101.com/r/SPjvkR/1
href=" matches the characters href=" literally (case sensitive)
Named Capture Group url (?<url>.+?\.(?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
(?<url> is what gives the capturing group its name. It can be omitted if you don't want the capturing group to have a name, or replaced with ?: to make it a non-capturing group. Naming just makes it more convenient to get the group's value in later code if needed, but in your case I don't think it matters.
.+? matches any character (except for line terminators) between one and unlimited times, as few times as possible, expanding as needed (lazy)
\. matches the character . with index 46 decimal (2E hex or 56 octal) literally (case sensitive)
Negative Lookahead (?!(html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
2nd Capturing Group (html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
.*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 34 decimal (22 hex or 42 octal) literally (case sensitive)
Update
New regex based on comments
href="((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"
href=" matches the characters href=" literally (case sensitive)
1st Capturing Group ((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)
Negative Lookahead (?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps))
Assert that the Regex below does not match
. matches any character (except for line terminators)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\. matches the character . with index 46 decimal (2E hex or 56 octal) literally (case sensitive)
Non-capturing group (?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)
. matches any character (except for line terminators)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)
" matches the character " with index 34 decimal (22 hex or 42 octal) literally (case sensitive)

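To see how the updated href pattern above behaves, here is a small Python sketch. The sample attributes are made up for illustration, and each one is tested as its own string: the lookahead's .* scans to the end of the line, so another link later on the same line could cause a rejection.

```python
import re

# The updated pattern from the answer, unchanged.
pattern = re.compile(r'href="((?!.*\.(?:html|xlsm|pdf|doc|ppt|jpg|jpeg|eps)).*?)"')

# Hypothetical sample attributes, one per string.
samples = [
    'href="/about"',
    'href="test.html"',
    'href="test.html#help"',
    'href="test.html?help&please"',
    'href="report.pdf"',
    'href="/contact/"',
]

# Keep only the URLs that contain none of the listed extensions.
kept = [m.group(1) for s in samples if (m := pattern.search(s))]
print(kept)  # ['/about', '/contact/']
```

Since the extension list is not anchored at the end, doc also rejects docx and test.html#help is rejected along with test.html.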
Validating emails in file with batch

I have a file with emails and I need to validate them.
The sequence is:
First name.
Dot.
Last name.
Number (optional - for same names).
static string domain(#utp.ac.pa).
I wrote this:
egrep -E [a-z]\.+[a-z][0-9]*#["utp.ac.pa"] test.txt
It should match this email: "anell.zheng#utp.ac.pa"
But it is also matching:
test4#utp.ac.pa
2anell#utp.ac.pa
Although they don't follow the sequence. What am I doing wrong?
Your regex doesn't even match the first email. If I understand your requirements correctly, this should work:
[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa
Note that to match a dot, it needs to be escaped (i.e., \.) because . matches any character.
You can get rid of A-Z if you don't want to match upper-case letters.
Try it online.
Let me know if this isn't what you want.
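For a quick check outside the shell, the same pattern can be tried in Python. This is only a sketch: the address list is made up from the examples in the question, and fullmatch is used so the whole address must fit the pattern, which is stricter than the unanchored egrep search.

```python
import re

# The pattern from the answer; fullmatch below anchors it at both ends.
EMAIL = re.compile(r'[A-Za-z]+\.[A-Za-z]+[0-9]*#utp\.ac\.pa')

# Sample addresses based on the question (the second one is an added example).
addresses = [
    "anell.zheng#utp.ac.pa",
    "anell.zheng2#utp.ac.pa",
    "test4#utp.ac.pa",
    "2anell#utp.ac.pa",
]

valid = [a for a in addresses if EMAIL.fullmatch(a)]
print(valid)  # ['anell.zheng#utp.ac.pa', 'anell.zheng2#utp.ac.pa']
```

Without anchoring, an address such as 2anell.zheng#utp.ac.pa would still be reported, because the pattern matches the part after the leading digit.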
Regex: ^[A-Za-z]+\.[A-Za-z]+(?:_\d+)*#utp\.ac\.pa$
Regex Details:
^ asserts position at start of a line
Match a single character present in the list below [A-Za-z]+
\. matches the character . literally (case sensitive)
Match a single character present in the list below [A-Za-z]+
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
Non-capturing group (?:_\d+)*
Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
_ matches the character _ literally (case sensitive)
\d+ matches a digit (equal to [0-9])
Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
#utp matches the characters #utp literally (case sensitive)
\. matches the character . literally (case sensitive)
ac matches the characters ac literally (case sensitive)
\. matches the character . literally (case sensitive)
pa matches the characters pa literally (case sensitive)
$ asserts position at the end of a line

How to replace any_string#a.net to any_string#b.com

How to replace any_string#a.net to any_string#b.com using RegEx?
I want to strip the #a.net and replace it with #b.com
I've tried
(.*#a.net)
but $1 contains the whole string.
So when I try to replace it, the result is
any_string#a.net#b.com
And can someone point me to a nice tutorial regarding RegEx?
The () indicates the capture group. Put the parts of the expression you don't want to capture outside the parens:
(.*)#a.net
A great site to play around with regular expressions is http://refiddle.com/.
I fiddled this problem already.
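A minimal Python sketch of the capture-group fix follows. It assumes one address per string and escapes the dot in a.net, which is a small tightening of the pattern above; Python writes the backreference as \1 where other tools use $1.

```python
import re

# Capture everything before #a.net; the part outside the parentheses is dropped.
pattern = re.compile(r'(.*)#a\.net')

replaced = pattern.sub(r'\1#b.com', 'any_string#a.net')
untouched = pattern.sub(r'\1#b.com', 'someone#c.org')

print(replaced)   # any_string#b.com
print(untouched)  # someone#c.org
```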
You can use
\b#[a-zA-Z].net\b
\b to set word boundaries before #
# matches the character # literally
a-z a single character in the range between a and z (case sensitive)
A-Z a single character in the range between A and Z (case sensitive)
. matches any character (except newline)
net matches the characters net literally (case sensitive)
\b word boundary
The above regex matches # followed by a single letter, any one character, and net, which you can then replace with #b.com
And if you simply want to match only #a.net, then you can use
\b#a.net\b

Regex Matching Behaviour Of \w

I noticed some interesting behaviour with some regex work I am doing, and I'd like some insight.
From what I understand, the word character, \w should match the following [a-zA-Z_0-9]
Given this input,
0000000060399301+0000000042456971+0000000
What should this regex
(\d+)\w
Capture?
I would expect it to capture 0000000060399301 but it actually captures 000000006039930
Is there something I am missing? Why is the 1 dropped from the end?
I noticed if I changed the regex to
(\d+\w)
It captures correctly i.e. including the 1
Anyone care to explain? Thanks
You require the regex to match a trailing word character - that would be the 1.
It cannot be another character, because
+ is not a word class character
+ is not a digit
matching is greedy
\d+ - matches one or more digit characters.
\w+ - matches one or more word characters. [A-Za-z\d_]
So with the string 0000000060399301+, the \d+ in (\d+)\w first matches all the digits, including the 1 before the +. The pattern still requires a \w, and + is not a word character, so the regex engine backtracks one character to the left and lets \w match the last digit. The captured group therefore contains 000000006039930, and the final 1 is matched by \w.
The 1 is being dropped because \w isn't in the capture group.
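The backtracking described above can be observed directly. The following Python sketch uses the input from the question and compares the captured group with the overall match for both patterns.

```python
import re

text = "0000000060399301+0000000042456971+0000000"

# \w sits outside the group: it takes the last digit away from \d+.
outside = re.search(r'(\d+)\w', text)
# \w sits inside the group: the same characters match, but all are captured.
inside = re.search(r'(\d+\w)', text)

print(outside.group(1))  # 000000006039930
print(outside.group(0))  # 0000000060399301
print(inside.group(1))   # 0000000060399301
```

The overall match (group 0) is the same in both cases; only what falls inside the parentheses differs.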

Regex to match URL end-of-line or "/" character

I have a URL, and I'm trying to match it to a regular expression to pull out some groups. The problem I'm having is that the URL can either end or continue with a "/" and more URL text. I'd like to match URLs like this:
http://server/xyz/2008-10-08-4
http://server/xyz/2008-10-08-4/
http://server/xyz/2008-10-08-4/123/more
But not match something like this:
http://server/xyz/2008-10-08-4-1
So, I thought my best bet was something like this:
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)[/$]
where the character class at the end contained either the "/" or the end-of-line. The character class doesn't seem to be happy with the "$" in there though. How can I best discriminate between these URLs while still pulling back the correct groups?
To match either / or end of content, use (/|\z)
This only applies if you are not using multi-line matching (i.e. you're matching a single URL, not a newline-delimited list of URLs).
To put that with an updated version of what you had:
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|\z)
Note that I've changed the start to be a non-greedy match for non-whitespace ( \S+? ) rather than matching anything and everything ( .+ )
You've got a couple regexes now which will do what you want, so that's adequately covered.
What hasn't been mentioned is why your attempt won't work: inside a character class, $ (as well as ., /, and a ^ that is not the first character) has no special meaning, so [/$] matches either a literal / or a literal $ rather than terminating the regex (/) or matching end-of-line ($).
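A short Python sketch makes the literal behaviour of [/$] visible. The pattern is a reduced version of the tail of the question's regex, and the test strings are made up for illustration.

```python
import re

# Tail of the question's pattern: a number followed by the character class [/$].
char_class = re.compile(r'-(\d+)[/$]')

ends_with_slash = char_class.search("2008-10-08-4/")
ends_with_dollar = char_class.search("2008-10-08-4$")
ends_at_eol = char_class.search("2008-10-08-4")

print(bool(ends_with_slash))   # True
print(bool(ends_with_dollar))  # True, $ is matched as a literal character
print(bool(ends_at_eol))       # False, the class always needs one more character
```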
/(.+)/(\d{4}-\d{2}-\d{2})-(\d+)(/.*)?$
1st Capturing Group (.+)
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
2nd Capturing Group (\d{4}-\d{2}-\d{2})
\d{4} matches a digit (equal to [0-9])
{4} Quantifier — Matches exactly 4 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
\d{2} matches a digit (equal to [0-9])
{2} Quantifier — Matches exactly 2 times
- matches the character - literally (case sensitive)
3rd Capturing Group (\d+)
\d+ matches a digit (equal to [0-9])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
4th Capturing Group (/.*)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
/ matches the character / literally (case sensitive)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of the string
In Ruby and Bash, you can use $ inside parentheses.
/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)
(This solution is similar to Pete Boughton's, but preserves the usage of $, which means end of line, rather than using \z, which means end of string.)
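As a rough check of the (/|$) variant, the Python sketch below runs it against the four URLs from the question. The helper name date_and_number is made up for this example; Python's re module also accepts $ inside a group.

```python
import re

# The pattern from the answer above, with (/|$) as the final group.
URL = re.compile(r'/(\S+?)/(\d{4}-\d{2}-\d{2})-(\d+)(/|$)')

urls = [
    "http://server/xyz/2008-10-08-4",
    "http://server/xyz/2008-10-08-4/",
    "http://server/xyz/2008-10-08-4/123/more",
    "http://server/xyz/2008-10-08-4-1",
]

def date_and_number(url):
    """Return (date, number) for a matching URL, or None if it does not match."""
    m = URL.search(url)
    return (m.group(2), m.group(3)) if m else None

results = [date_and_number(u) for u in urls]
for url, result in zip(urls, results):
    print(url, "->", result)
```

The first three URLs yield ('2008-10-08', '4'), while the last one yields None because the number is followed by -1 rather than a slash or the end of the line.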