notepad++ regex how to extract userId from this list

I have this list below:
originalscrape,scrapeDate,userId,username,full_name,is_private,follower_count,following_count,media_count,biography,hasProfilePic,external_url,email,contact_phone_number,address_street,isbusiness,Engagement %,MostRecentPostDate,AvgLikes,AvgComments,category,businessJoinDate,businessCountry,businessAds,countryCode,cityName,isverified
,07/03/2020 05:54 AM,="189389157",stronger_together_forever,stronger_together_forever 🌈🏖☀️,False,0,0,0,,False,,,,,No,0,Has no posts.,0,0,,,,,,,No
,07/03/2020 05:54 AM,="51807820",aaronistattoo,Aaron Is.,False,0,0,0,,False,,,,,No,0,Has no posts.,0,0,,,,,,,No
,07/03/2020 05:54 AM,="194962598",djcoley727,djcoley727,False,0,0,0,,False,,,,,No,0,Has no posts.,0,0,,,,,,,No
,07/03/2020 05:54 AM,="4182106610",cesararce1985,Cesar Arce,False,0,0,0,,False,,,,,No,0,Has no posts.,0,0,,,,,,,No
,07/03/2020 05:54 AM,="8957742561",minkwhiz,𝕄𝕚𝕟𝕜𝕎𝕙𝕚𝕫,False,0,0,0,,False,,,,,No,0,Has no posts.,0,0,,,,,,,No
I would like to get the userIds only as below:
189389157
51807820
194962598
4182106610
8957742561
I've used ^(?:[^,\r\n]*,){3}([^,\r\n]+).* but it returns the usernames; what I want is the userIds.
I hope somebody can help me find the right regex to extract only the userIds.
Thank you

Take advantage of the fact that the time in AM/PM format appears before each ID, and that the ID is surrounded by " characters:
(?:AM|PM),=\"(\d+)\"
Check the demo at Regex101.

You could repeat the group 2 times instead of 3, then match =" and capture 1+ digits.
Note that the character class [^,\r\n] is repeated with * (0 or more times), because a field can be empty.
If you want the digits only, replace the match with group 1 using $1.
^(?:[^,\r\n]*,){2}="(\d+)".*
^ Start of string
(?:[^,\r\n]*,){2} Repeat 2 times matching 0 or more times any char except a comma or a newline, then match ,
=" Match literally
(\d+) Capture group 1, match 1+ digits
".* Match " and match the rest of the line
Regex demo
If you want the match only, you could make use of \K to reset the match buffer, then match the digits and assert a double quote on the right.
^(?:[^,\r\n]*,){2}="\K\d+(?=")
Regex demo
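As a rough cross-check outside Notepad++, the two capture-group patterns above can be tried in Python. This is only a sketch: Python's re module does not support \K, so the capture-group form is used, and the sample rows are copied from the question but truncated after the is_private column.

```python
import re

# Rows from the question, truncated after the is_private column.
rows = "\n".join([
    'originalscrape,scrapeDate,userId,username,full_name,is_private',
    ',07/03/2020 05:54 AM,="51807820",aaronistattoo,Aaron Is.,False',
    ',07/03/2020 05:54 AM,="194962598",djcoley727,djcoley727,False',
    ',07/03/2020 05:54 AM,="4182106610",cesararce1985,Cesar Arce,False',
])

# Skip two comma-terminated fields, then capture the digits inside ="...".
by_position = re.compile(r'^(?:[^,\r\n]*,){2}="(\d+)"', re.MULTILINE)

# Alternative from the first answer: anchor on the AM/PM marker instead.
by_am_pm = re.compile(r'(?:AM|PM),="(\d+)"')

user_ids = by_position.findall(rows)
user_ids_alt = by_am_pm.findall(rows)
print("\n".join(user_ids))
```

The header line is skipped by both patterns because its third field is not wrapped in ="...".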

Related

How to find regex for multiple conditions

I am trying to find regex which would find below matches. I would replace these with blank. I am able to create regex for few of these conditions individually, but I am not able to figure out how to create one regex for all of these
Strings:
song1 artist (SiteWithMp3Keyword.com).mp3
02.song2 | siteWithdownloadKeyword.in 320 Kbps
song3 [SitewithDjKeyword.in] 128kbps.mp3
Output
song1 artist.mp3
song2
song3.mp3
Criteria for match:
Case Insensitive
Find Strings with particular keyword and remove whole word, even if inside any braces
Find kbps keyword and remove it along with any number before it (128/320)
if string ends in .mp3, keep it as it is.
Remove junk characters (like | ) and replace _ with space.
Remove number if present at start of string, like 001_ 02. etc.
Trim whitespaces before and after remaining string
Example Regex for 2.
\S+(mp3|dj|download)\S+
https://regex101.com/r/nxp4d3/1

Try this regex:
Find: ^[0-9. ]*(song\d+ (\w+ )?).*?(\.mp3 ?)?$
Replace with: $1$3
P.S. If this doesn't solve your problem, please share a sample of your real data so that someone can understand it better.
Thanks.

For the example data, you might use:
^\h*(?:\d+\W*)?(\w+(?:\h+\w+)*).*?(\.mp3)?\h*$
The pattern matches:
^ Start of string
\h* Match optional leading spaces
(?:\d+\W*)? Match 1+ digits followed by optional non word characters
(\w+(?:\h+\w+)*) Capture group 1, match word characters optionally repeated with a space in between
.*? Match any character except a newline, as few as possible
(\.mp3)? Optionally capture .mp3 in group 2
\h* Match optional trailing spaces
$ End of string
Regex demo
Replace with capture group 1 and group 2
$1$2
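A minimal Python sketch of the same replacement, assuming the three example strings from the question are representative. Python's re module has no \h, so [ \t] is used in its place, and clean_title is a hypothetical helper name.

```python
import re

# Same pattern as above, with \h written as [ \t] for Python's re module.
PATTERN = re.compile(
    r'^[ \t]*(?:\d+\W*)?(\w+(?:[ \t]+\w+)*).*?(\.mp3)?[ \t]*$'
)

def clean_title(title: str) -> str:
    """Keep group 1 (the leading words) plus group 2 (.mp3, if present)."""
    m = PATTERN.match(title)
    if m is None:
        return title  # leave unexpected input untouched
    return m.group(1) + (m.group(2) or "")

titles = [
    "song1 artist (SiteWithMp3Keyword.com).mp3",
    "02.song2 | siteWithdownloadKeyword.in 320 Kbps",
    "song3 [SitewithDjKeyword.in] 128kbps.mp3",
]
cleaned = [clean_title(t) for t in titles]
print(cleaned)
```

The pattern keeps only the leading run of words, so it relies on the junk always starting with a non-word character such as (, | or [.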

How do I create a regex expression for a 10 digit phone number with the same separator?

I am trying to create a basic regular expression to match a phone number which can either use dots [.] or hyphens [-] as the separator.
The format is 123.456.7890 or 123-456-7890.
The expression I am currently using is:
\d\d\d[-.]\d\d\d[-.]\d\d\d\d
The issue here is that it also matches the phone numbers that have both separators in them which I want to be termed as invalid/not a match. For example, with my expression, 123.456-7890 and 123-456.7890 show up as a match, something I do not want happening.
Is there a way to do that?

Use a backreference:
^\d{3}([.-])\d{3}\1\d{4}$
Here is an explanation of the regex:
^ from the start of the number
\d{3} match any 3 digits
([.-]) then match AND capture either a dot or a dash separator
\d{3} match any 3 digits
\1 match the SAME separator seen earlier
\d{4} match any 4 digits
$ end of the number
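The backreference behaviour can be checked quickly in Python. This is a small sketch; is_valid_phone is a hypothetical helper name, and the test numbers are the ones from the question.

```python
import re

# ([.-]) captures the first separator; \1 requires the same one again.
PHONE = re.compile(r'^\d{3}([.-])\d{3}\1\d{4}$')

def is_valid_phone(number: str) -> bool:
    return PHONE.match(number) is not None

results = {
    n: is_valid_phone(n)
    for n in ("123.456.7890", "123-456-7890", "123.456-7890", "123-456.7890")
}
print(results)
```

Only the two numbers with a consistent separator are accepted; the mixed forms fail at \1.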

You can use this regex:
^\d{3}([-.])\d{3}\1\d{4}$
You can see that it works here.
The key point is that you capture the separator with parentheses, ([-.]),
and then reuse it with the backreference \1.

Regex to get value from <key, value> by asserting conditions on the value

I have a regex which takes the value from the given key as below
Regex .*key="([^"]*)".* InputValue key="abcd-qwer-qaa-xyz-vwxc"
output abcd-qwer-qaa-xyz-vwxc
But on top of this, I need to validate that the value starts with abcd- and contains the pattern -xyz somewhere after it.
Thus, the inputs and outputs have to be as follows:
I tried below which is not working as expected
.*key="([^"]*)"?(/Babcd|-xyz).*
The key value pair is part of the large string as below:
object{one="ab-vwxc",two="value1",key="abcd-eest-wd-xyz-bnn",four="obsolete Values"}
I think that by matching the key it takes the value, and that's why I used .*key="([^"]*)".*
Note:
It's a dashboard. You can refer to this link and search for Regex: /"([^"]+)"/. This regex is applied to the query result, which is the string I referred to. It works with the regex .*key="([^"]*)".* above, and I'm trying to alter that regex group itself. Hope this helps.
Can anyone guide me or suggest something? That would be helpful. Thanks!

Looks like you could do with:
\bkey="(abcd(?=.*-xyz\b)(?:-[a-z]+){4})"
See the demo online
\bkey=" - A word-boundary and literally match 'key="'
( - Open 1st capture group.
abcd - Literally match 'abcd'.
(?=.*-xyz\b) - Positive lookahead for zero or more characters (other than newline) followed by literally '-xyz' and a word-boundary.
(?: - Open non-capturing group.
-[a-z]+ - Match a hyphen followed by at least a single lowercase letter.
){4} - Close non-capture group and match it 4 times.
) - Close 1st capture group.
" - Match a literal double quote.
I'm not 100% sure you'd only want to allow lowercase letters, so you can adjust that part if need be. The whole pattern validates the input value, and you can use capture group 1 to grab the value.
Update after edited question with new information:
Prometheus uses the RE2 engine for all regular expressions. Therefore the above suggestion won't work, because RE2 does not support lookarounds. A less restrictive but workable alternative for OP could be:
\bkey="(abcd(?:-\w+)*-xyz(?:-\w+)*)"
See the online demo
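A quick sketch of how the lookaround-free pattern behaves, run here with Python's re module rather than RE2 itself (an assumption made for convenience; the pattern uses no constructs that RE2 lacks). The first input is the object string from the question; the second is a variant made up for illustration whose value lacks -xyz.

```python
import re

# Lookaround-free pattern from the update above.
KEY = re.compile(r'\bkey="(abcd(?:-\w+)*-xyz(?:-\w+)*)"')

def extract_key(text: str):
    """Return the key's value if it starts with abcd and contains -xyz."""
    m = KEY.search(text)
    return m.group(1) if m else None

good = 'object{one="ab-vwxc",two="value1",key="abcd-eest-wd-xyz-bnn",four="obsolete Values"}'
bad = 'object{one="ab-vwxc",key="abcd-qwer-qaa",four="obsolete Values"}'  # illustrative, no -xyz

value = extract_key(good)
missing = extract_key(bad)
print(value, missing)
```

Note that \w also admits digits and underscores, which is what makes this version less restrictive than the [a-z]+ one.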

Will this work?
Pattern
\bkey="(abcd-[^"]*\bxyz\b[^"]*)"
Demo

You could use the following regular expression to verify the string has the desired format and to match the portion of the string that is of interest.
(?<=\bkey=")(?=.*-xyz(?=-|$))abcd(?:-[a-z]+)+(?=")
Start your engine!
Note there are no capture groups.
The regex engine performs the following operations.
(?<=\bkey=") : positive lookbehind asserts the current
position in the string is preceded by 'key="'
(?= : begin positive lookahead
.*-xyz : match 0+ characters, then '-xyz'
(?=-|$) : positive lookahead asserts the current position is
: followed by '-' or is at the end of the string
) : end positive lookahead
abcd : match 'abcd'
(?: : begin non-capture group
-[a-z]+ : match '-' followed by 1+ characters in the class
)+ : end non-capture group and execute it 1+ times
(?=") : positive lookahead asserts the current position is
: followed by '"'

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot), but even on records without a . it does not work. I'm not sure whether it has anything to do with using the + sign more than once.

I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
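For readers who want to run the same idea outside Notepad++, here is a minimal Python sketch. Python's re module supports neither \K nor \h, so the part before \K is captured in group 1 and put back, and \h is written as [ \t]. The two sample lines come from the question.

```python
import re

text = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
])

# Group 1 plays the role of everything before \K; the last quoted field is dropped.
LAST_FIELD = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)

stripped = LAST_FIELD.sub(r'\1', text)
print(stripped)
```

Because .+ is greedy, it backtracks only as far as the last whitespace that is directly followed by a quote, so earlier quoted fields are kept.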

The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.

You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
This is not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot: it's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of a literal space followed by +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
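The effect of the quantifier fixes can be seen by running the original and the corrected pattern side by side. This is a sketch in Python rather than Notepad++, using the four records from the question; the multiline flag is needed so that ^ and $ apply to each line.

```python
import re

records = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])

# Pattern from the question: wrong digit counts, no quantifier on the third field.
original = r'^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$'
# Corrected pattern from this answer.
corrected = r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$'

original_hits = re.findall(original, records, flags=re.MULTILINE)
trimmed = re.sub(corrected, r'\1', records, flags=re.MULTILINE)
print(original_hits)
print(trimmed)
```

The original pattern matches nothing, so a replace with it leaves the file unchanged, while the corrected one keeps only the first three fields of each record.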

If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match ", then 1+ characters other than " or a newline, then "
In the replacement use group 1 $1
Regex demo

Regular expression in Hive - Get number after specific text

How can I use regular expressions in order to get the number 4968 from the following text?
"category_path":["XXX1430","XXX109026","XXX3120","XXX4968","XXX377357"]
Many thanks!

Try Regex: category_path":\[(?:"[X]+\d+",){3}"[X]+\K\d+
Demo

If you want to use a regex you could capture 4968 in a group (\d+).
"category_path":\["XXX\d+"(?:,"XXX\d+"){2},"XXX(\d+)"(?:,"XXX\d+")*\]
Explanation
"category_path":\[ Match literally
"XXX\d+" Match the first XXX-digits item, which has no leading comma
(?:,"XXX\d+"){2} Repeat the XXX-digits pattern with a leading comma 2 times
,"XXX(\d+)" Match ,"XXX, capture one or more digits in group 1, and match the closing "
(?:,"XXX\d+")*\] Match the pattern zero or more times, then the closing ]
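The capture-group pattern can be sanity-checked before it goes into a Hive query. The sketch below uses Python's re module purely as a test bench (an assumption; Hive's own regex functions are not exercised here), with the input string from the question.

```python
import re

text = '"category_path":["XXX1430","XXX109026","XXX3120","XXX4968","XXX377357"]'

# Group 1 captures the digits of the fourth XXX item in the array.
FOURTH = re.compile(
    r'"category_path":\["XXX\d+"(?:,"XXX\d+"){2},"XXX(\d+)"(?:,"XXX\d+")*\]'
)

m = FOURTH.search(text)
fourth_id = m.group(1) if m else None
print(fourth_id)
```

The pattern is positional: it returns the fourth item whatever its value, so it assumes 4968 always sits in that slot.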
