Notepad++ regex to extract usernames from this list - regex

I have this list below:
scrapeDate,username,full_name,is_private,follower_count,following_count,media_count,biography,hasProfilePic,external_url,email,contact_phone_number,address_street,category,businessJoinDate,businessCountry,businessAds,countryCode,cityName,isverified
07/05/2020 05:37 AM,maplethenorwich,Maple the Norwich,False,0,0,0,,False,,,,,,,,,,,No
07/05/2020 05:37 AM,baby_yoda_militia,Baby Yoda,False,0,0,0,,False,,,,,,,,,,,No
07/05/2020 05:37 AM,caciquegoldendoodle,CaciqueGoldenDoodle,False,0,0,0,,False,,,,,,,,,,,No
07/05/2020 05:37 AM,ja_watts,Julie Anna Watts,False,0,0,0,,False,,,,,,,,,,,No
07/05/2020 05:37 AM,lets_go_zumba_and_travel,Mrsirenetakamoto,False,0,0,0,,False,,,,,,,,,,,No
07/05/2020 05:37 AM,bunnyslash,Bunnyslash,False,0,0,0,,False,,,,,,,,,,,No
I would like to get the Usernames only as below:
maplethenorwich
baby_yoda_militia
caciquegoldendoodle
ja_watts
lets_go_zumba_and_travel
bunnyslash
I've tried ^(?:[^,\r\n]*,){3}([^,\r\n]+).* but it gets me "False".
I hope somebody can help me find the right regex to extract only the usernames.

You may try:
.*?,(.*?),.*
Explanation of the above regex:
.*? - Lazily matches everything except the new line.
, - Matches , literally.
(.*?) - First capturing group; lazily matches the username, i.e. the second value in each CSV row.
,.* - Matches the next comma, then greedily matches the rest of the line (everything except a newline). If you don't want to remove that content, leave this part out, capture the group above, and write it to a new file or wherever you need it.
$1 - For the replacement part replace all the matched text with just the captured group using $1.
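As a rough cross-check outside Notepad++, the same find-and-replace can be sketched in Python. This is only an illustration: the leading ^ and the re.MULTILINE flag are additions not present in the pattern above, and \1 stands in for Notepad++'s $1.

```python
import re

# Two rows copied from the question's sample list.
rows = """07/05/2020 05:37 AM,maplethenorwich,Maple the Norwich,False,0,0,0,,False,,,,,,,,,,,No
07/05/2020 05:37 AM,baby_yoda_militia,Baby Yoda,False,0,0,0,,False,,,,,,,,,,,No"""

# Same pattern as in the answer; "^" and re.MULTILINE are assumptions added
# here so that every match is pinned to the start of a line.
pattern = re.compile(r"^.*?,(.*?),.*", re.MULTILINE)

# r"\1" plays the role of Notepad++'s $1: keep only the captured username.
usernames = pattern.sub(r"\1", rows).splitlines()
print(usernames)
```

Because . does not match a newline by default, each line is replaced independently and the line breaks survive the substitution.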

You are repeating the group 3 times using quantifier {3}, but there is no need to repeat it because you want the second value.
^(?:[^,\r\n]*,){3}([^,\r\n]+).*
 ^^^          ^^^^
You can omit the quantifier and the non capturing group as there is nothing to repeat.
^[^,\r\n]*,([^,\r\n]+).*
^ Start of the string
[^,\r\n]*, Match 0+ times any char except a comma or newline, then match ,
( Capture group 1
[^,\r\n]+ Match 1+ times any char except a comma or newline
) Close group 1
.* Match the rest of the line
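To see why the original attempt returns "False", here is a small Python sketch that runs both patterns against one row from the question. It assumes Python's re module behaves like Notepad++ for these simple constructs.

```python
import re

# One row copied from the question's sample list.
line = "07/05/2020 05:37 AM,ja_watts,Julie Anna Watts,False,0,0,0,,False,,,,,,,,,,,No"

# Original attempt: {3} skips three comma-terminated fields,
# so group 1 ends up holding the 4th field.
attempt = re.match(r"^(?:[^,\r\n]*,){3}([^,\r\n]+).*", line)

# Corrected pattern: skip exactly one field, so group 1 is the 2nd field.
fixed = re.match(r"^[^,\r\n]*,([^,\r\n]+).*", line)

print(attempt.group(1))
print(fixed.group(1))
```

The first print shows the is_private column, the second the username.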

Related

Comma separated prefix list with commas inside

I'm trying to match a comma-separated list of prefixed values where the values themselves can also contain commas.
So far I have only managed to match the entries that do not contain a comma.
Sample String (With NL for visualization - original string doesn't have NL):
field01=Value 1,
field02=Value 2,
field03=<xml value>,
field04=127.0.0.1,
field05=User-Agent: curl/7.28.0\r\nHost: example.org\r\nAccept: */*,
field06=Location, Resource,
field07={Item 1},{Item 2}
My current regex is this unoptimized piece:
(?'fields'(field[0-9]{2,3})=?([\s\w\d_<>.:="*?\-\/\\(){}<>'#]+))([^,](?&fields))*
Does anyone have a clue how to solve this?
EDIT:
The first pattern is close to my expected result.
This is an anonymized full example of the string:
asm01=Predictable Resource Location,Information Leakage,asm02=N/A,asm04=Uncategorized,asm08=2021-02-15 09:18:16,asm09=127.0.0.1,asm10=443,asm11=N/A,asm15=,asm16=DE,asm17=User-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n,asm18=/Common/_www.example.com_live_v1,asm20=127.0.0.1,asm22=,asm27=HEAD,asm34=/Common/_www.example.com_live_v1,asm35=HTTPS,asm39=blocked,asm41=0,asm42=3,asm43=0,asm44=Error,asm46=200000028,200100015,asm47=Unix hidden (dot-file) access,.htaccess access,asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622,asm52=200000028,asm53=Unix hidden (dot-file) access,asm54={Unix/Linux Signatures},asm55=,asm61=,asm62=,asm63=8985143867830069446,asm64=example-waf.example.com,asm65=/.htaccess,asm67=Attack signature detected,asm68=<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>13020008202d8a-f803000000000000</block><alarm>417020008202f8a-f803000000000000</alarm><learn>13000008202f8a-f800000000000000</learn><staging>200000-0</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200000028</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>2</length></kw_data></sig_data><sig_data><sig_id>200000028</sig_id><blocking_mask>4</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>3</length></kw_data></sig_data><sig_data><sig_id>200100015</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>1</offset><length>9</length></kw_data></sig_data></violation></request-violations></BAD_MSG>,asm69=5,asm71=/Common/_dev.example.com_SSL,asm75=127.0.0.1,asm100=,asm101=HEAD /.htaccess HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n#015
The pattern does not work because the named group fields requires the literal string field, which the full example string does not contain (its keys start with asm), so there is nothing for the repeated group to match.
Note that [^,] matches any char except a comma. You can omit the capture group inside the named group fields, as the named group is already a group, and you can drop \d from the character class, as \w also matches digits.
With 2 capture groups:
\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)
\b A word boundary
(asm[0-9]+) Capture group 1, match asm and 1+ digits
= Match literally
(.*?) Capture group 2, match any char as few times as possible
(?= Positive lookahead, assert what is at the right is
,asm[0-9]+= Match ,asm followed by 1+ digits and =
| Or
$ Assert the end of the string
) Close lookahead
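A minimal Python sketch of the lookahead approach, run against a shortened excerpt of the anonymized string from the question. The excerpt and the dict-building step are illustrative choices, not part of the original answer.

```python
import re

# Shortened excerpt of the anonymized sample from the question.
text = (
    "asm01=Predictable Resource Location,Information Leakage,"
    "asm02=N/A,asm15=,asm46=200000028,200100015,"
    "asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},"
    "asm50=40622"
)

# Group 1 is the key, group 2 the value; the lookahead stops the lazy value
# right before the next ",asmNN=" or at the end of the string.
pattern = re.compile(r"\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)")

fields = dict(pattern.findall(text))
for key, value in fields.items():
    print(key, "->", repr(value))
```

Values that contain commas (asm01, asm46, asm48) stay in one piece, and the empty asm15 value is kept as an empty string.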
A simple solution would be (see regexr.com/5mg1b):
/((asm\d{2,3})=(.*?))(?=,asm|$)/g
Match groupings will be:
group #1 - asm01=Predictable Resource Location,Information Leakage
group #2 - asm01
group #3 - Predictable Resource Location,Information Leakage
Conditions:
This will match everything including empty values
The key here is to make sure that each match is delimited by either a comma and your field descriptor, or an end of string. A look ahead will be handy here: (?=,asm|$).

RegEx - Return pattern to the right of a text string for URL

I'm looking to return the URL string to the right of a specific set of text using RegEx:
URL:
www.websitename/countrycode/websitename/contact/thank-you/whitepaper/countrycode/whitepapername.pdf
What I would like to just return:
/whitepapername.pdf
I've tried using ^\w+"countrycode"(\w.*) but the match won't recognize countrycode.
In Google Data Studio, I want to create a new field that removes the beginning of the URL using the REGEXP_REPLACE function.
Ideally using:
REGEXP_REPLACE(Page,......)
The REGEXP_REPLACE function below does the trick, capturing all (.*) the characters after the last countrycode, where Page represents the respective field:
REGEXP_REPLACE(Page, ".*(countrycode)(.*)$", "\\2")
Alternatively - Adapting the RegEx by The fourth bird to Google Data Studio:
REGEXP_REPLACE(Page, "^.*/countrycode(/[^/]+\\.\\w+)$", "\\1")
You could use a capturing group and replace with group 1. You could match /countrycode literally, or match the country code with a pattern such as /[a-z]{2}_[a-z]{2} (two chars a-z, an underscore, then two more chars a-z).
In the replacement use group 1 \\1
^.*/countrycode(/[^/]+\.\w+)$
Or using a country code pattern from the comments:
^.*/[a-z]{2}_[a-z]{2}(/[^/]+\.\w+)$
The second pattern in parts
^ Start of string
.*/ Match as much as possible, then backtrack to the last forward slash that is followed by the rest of the pattern
[a-z]{2}_[a-z]{2} Match the country code part, an underscore between 2 times 2 chars a-z
( Capture group 1
/[^/]+ Match a forward slash, then match 1+ occurrences of any char except / using a negated character class
\.\w+ Match a dot and 1+ word chars
) Close group
$ End of string
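Both patterns can be tried outside Data Studio with a short Python sketch. Note the assumptions: Python's re module stands in for Data Studio's regex engine, the replacement is written \1 rather than "\\1", and the second URL (with an en_us country code) is a made-up example, since the question only shows the literal placeholder countrycode.

```python
import re

# URL copied from the question; "countrycode" is a literal placeholder there.
page = ("www.websitename/countrycode/websitename/contact/thank-you/"
        "whitepaper/countrycode/whitepapername.pdf")
literal = re.sub(r"^.*/countrycode(/[^/]+\.\w+)$", r"\1", page)

# Hypothetical URL with an "xx_yy" style country code, for the second pattern.
page_cc = "www.example.com/en_us/site/contact/thank-you/whitepaper/en_us/guide.pdf"
by_code = re.sub(r"^.*/[a-z]{2}_[a-z]{2}(/[^/]+\.\w+)$", r"\1", page_cc)

print(literal)
print(by_code)
```

In both cases only the final path segment, including its leading slash, remains.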

Regular Expressions Notepad++

When using character delimited text, what code allows me to pull out specific segments within a given row? Out of a given set of data (focusing on bold):
1194459945,11/07/2007 18:25:45,2,vnta,287.78,2,7.783,2,34.111,2,1.3,2,89.54,2,1485.31,26.612
Trying to get it like:
11/07/2007 7.783 89.54
Currently, the progress I've made has been: (\w+,)(.+) (
which has given me the first two columns, but I'm stuck as to how to reach 7.783 and segment that out. Without including the entire row. I cannot put \, because that doesn't help.
Something like this might work: ^.*?,([^ ,]+)(?:.*?,){5}([^ ,]+)(?:.*?,){6}([^ ,]+).*$
Explanation:
^ - Start of the string / line
.*?, - matches anything up until the first comma
([^ ,]+) - matches anything not a space or comma and stores it in capture group 1 (your date)
(?:.*?,){5} - non capture group that consumes the next 5 comma-terminated chunks (the rest of the date/time field, then 4 more fields)
([^ ,]+) - matches anything not a space or comma and stores it in capture group 2 (your 7.783)
(?:.*?,){6} - another non capture group that consumes the next 6 comma-terminated chunks (the comma right after 7.783, then 5 more fields)
([^ ,]+) - matches anything not a space or comma and stores it in capture group 3 (your 89.54)
.*$ - matches anything trailing after this match to the end of string / line
Notepad++:
You can use the find and replace tool in Notepad++ to replace the strings with only the capture groups which can be accessed by using a dollar sign followed by the capture group number like so:
Find: ^.*?,([^ ,]+)(?:.*?,){5}([^ ,]+)(?:.*?,){6}([^ ,]+).*$
Replace: $1 $2 $3
Test:
Before:
1194459945,11/07/2007 18:25:45,2,vnta,287.78,2,7.783,2,34.111,2,1.3,2,89.54,2,1485.31,26.612
After:
11/07/2007 7.783 89.54
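The same find/replace can be reproduced in Python as a quick check, using \1 \2 \3 in place of Notepad++'s $1 $2 $3. The comma-split cross-check at the end is an extra not in the original answer; it is included only to confirm which columns the regex lands on.

```python
import re

row = ("1194459945,11/07/2007 18:25:45,2,vnta,287.78,2,7.783,2,"
       "34.111,2,1.3,2,89.54,2,1485.31,26.612")

# Same pattern as in the Find box above.
pattern = r"^.*?,([^ ,]+)(?:.*?,){5}([^ ,]+)(?:.*?,){6}([^ ,]+).*$"
result = re.sub(pattern, r"\1 \2 \3", row)
print(result)

# Cross-check without a regex: split on commas and pick columns by index
# (column 1 holds "date time", so keep only the part before the space).
cols = row.split(",")
by_index = " ".join([cols[1].split()[0], cols[6], cols[12]])
print(by_index)
```

Both approaches pick the date from the second column, then the 7th and 13th columns.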

Regex pattern in vbscript to match Text with multiple line

I have a long string containing serial numbers (Slno.). I want to split the string into separate sentences at each Slno.
Sample text:
1. Able to click new button and proceed to ONB-002 dialogue.
2. - Partner connection name **(text field empty)(MANDATORY)**
- GS1 company prefix **(text field empty)(MANDATORY)**
I tried using a VBScript regex to match a pattern, but it matches only the first line of the string (1. text), not the second one.
^\d+\.\s(-?).*[\r\n].[\r\n\*+]*.*|^\d+\.\s(-?).*[\r\n]
And while splitting the string, for Slno. 2 I also want to get the sentence on the line below it, which I am having difficulty with.
Please assist me.
Set regex = CreateObject("VBScript.RegExp")
With regex
.Pattern = "^\d+\.\s(-?).*[\r\n].[\r\n\*+]*.*|^\d+\.\s(-?).*[\r\n]"
.Global = True
End With
Set matches = regex.Execute(txt)
I am looking for a regex pattern that matches
1. Able to click new button and proceed to ONB-002 dialogue.
&
2. - Partner connection name **(text field empty)(MANDATORY)**
- GS1 company prefix **(text field empty)(MANDATORY)**
as separate sentences or groups.
If I am not mistaken, to get the 2 separate parts, including the line that follows, you could use:
^\d+\..*(?:\r?\n(?!\d+\.).*)*
Explanation
^ Start of string (to match at the start of every line, the VBScript RegExp object also needs .MultiLine = True)
\d+\. Match 1+ digits followed by a dot
.* Match any character except a newline 0+ times
(?: Non capturing group
\r?\n(?!\d+\.).* Match a newline, use a negative lookahead to assert that what is on the right is not 1+ digits followed by a dot, then match the rest of that line
)* Close non capturing group and repeat 0+ times
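The pattern can be checked quickly in Python before porting it back to VBScript. This is only a sketch under the assumption that both engines treat these constructs the same way; re.MULTILINE is what lets ^ match at the start of each line here.

```python
import re

# Sample text from the question, joined with "\n" line breaks.
txt = (
    "1. Able to click new button and proceed to ONB-002 dialogue.\n"
    "2. - Partner connection name **(text field empty)(MANDATORY)**\n"
    "- GS1 company prefix **(text field empty)(MANDATORY)**"
)

# A numbered line, followed by any lines that do not start with "digits + dot".
pattern = re.compile(r"^\d+\..*(?:\r?\n(?!\d+\.).*)*", re.MULTILINE)

items = pattern.findall(txt)
for item in items:
    print(repr(item))
```

The first item is the single line for number 1; the second item spans both lines that belong to number 2.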

Regex to remove all except XML

I need help with a Regex for notepad++ to match all but XML
The regex I'm using:
(!?\<.*\>) <-- I want the opposite of this (in first three lines)
The example code:
[20173003] This text is what I want to delete [<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.
[20173003] This is another text to delete [<Person><Name>Bar</Name><Surname>Foo</Surname></Person>]
[20173003] This text too... [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!
[20173003] But things like this make the regex to fail < [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>
Expected result:
<Person><Name>Foo</Name><Surname>Bar</Surname></Person>
<Person><Name>Bar</Name><Surname>Foo</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>
<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>
Thanks in advance!
This is not perfect, but should work with your input that looks quite simple and well-structured.
If you need to handle just a single unnested <Person> tag, you may use the simple regex (<Person>.*?</Person>)|. (it matches and captures any <Person> element into Group 1, and otherwise matches any other single char) and replace with the conditional replacement pattern (?{1}$1\n:) (it reinserts the Person element followed by a newline, or replaces the match with an empty string).
To make it a bit more generic, you may capture the opening and corresponding closing XML tags with a recursion-based Boost regex, and the appropriate conditional replacement pattern:
Find What: (<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>)|.
Replace With: (?{1}$1\n:)
. matches newline: ON
Regex Details:
(<(\w+)[^>]*>(?:(?!</?\2\b).|(?1))*</\2>) - Capturing group 1 (that will be later recursed with the (?1) subroutine call) matching
  <(\w+)[^>]*> - any opening tag with its name captured into Group 2
  (?:(?!</?\2\b).|(?1))* - zero or more occurrences of:
    (?!</?\2\b). - any char (.) that does not start a sequence of <, an optional /, and the Group 2 tag name as a whole word
    | - or
    (?1) - the whole Group 1 subpattern is recursed (repeated)
  </\2> - the corresponding closing tag
| - or
. - any single char.
Replacement pattern:
(?{1} - if Group 1 matches:
$1\n - replace with its contents + a newline
: - else replace with an empty string
) - end of the replacement pattern.
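For readers who want to try the idea outside Notepad++, here is a minimal Python sketch of the simple, unnested <Person> variant only. Python's built-in re module supports neither the (?1) recursion nor Boost's (?{1}...:...) conditional replacement, so the condition is emulated with a replacement function; that function and the final strip() are illustrative choices, not part of the original answer.

```python
import re

# Sample lines from the question.
text = """[20173003] This text is what I want to delete [<Person><Name>Foo</Name><Surname>Bar</Surname></Person>], and this text too.
[20173003] This is another text to delete [<Person><Name>Bar</Name><Surname>Foo</Surname></Person>]
[20173003] This text too... [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], delete me!
[20173003] But things like this make the regex to fail < [<Person><Name>Lorem</Name><Surname>Ipsum</Surname></Person>], or this>"""

# Group 1 captures a whole <Person> element; the bare "." alternative
# consumes every other char (re.DOTALL mirrors ". matches newline: ON").
pattern = re.compile(r"(<Person>.*?</Person>)|.", re.DOTALL)

def keep_person(match):
    # Emulates (?{1}$1\n:): keep group 1 plus a newline, else drop the char.
    return match.group(1) + "\n" if match.group(1) else ""

result = pattern.sub(keep_person, text).strip()
print(result)
```

Each <Person> element ends up on its own line, and the stray < and > in the last input line are discarded along with the rest of the surrounding text.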