How to find regex for multiple conditions - regex

I am trying to write a regex that finds the matches below so that I can replace them with an empty string. I can create a regex for a few of these conditions individually, but I can't figure out how to create one regex for all of them.
Strings:
song1 artist (SiteWithMp3Keyword.com).mp3
02.song2 | siteWithdownloadKeyword.in 320 Kbps
song3 [SitewithDjKeyword.in] 128kbps.mp3
Output
song1 artist.mp3
song2
song3.mp3
Criteria for match:
Case Insensitive
Find Strings with particular keyword and remove whole word, even if inside any braces
Find kbps keyword and remove it along with any number before it (128/320)
if string ends in .mp3, keep it as it is.
Remove junk characters (like | ) and replace _ with space.
Remove number if present at start of string, like 001_ 02. etc.
Trim whitespaces before and after remaining string
Example Regex for 2.
\S+(mp3|dj|download)\S+
https://regex101.com/r/nxp4d3/1

Try this regex ....
Find:^[0-9. ]*(song\d+ (\w+ )?).*?(\.mp3 ?)?$
Replace with:$1$3
P.S. If this doesn't solve your problem, please share a sample of your real data so that someone can better understand it.
Thanks...

For the example data, you might use:
^\h*(?:\d+\W*)?(\w+(?:\h+\w+)*).*?(\.mp3)?\h*$
The pattern matches:
^ Start of string
\h* Match optional leading spaces
(?:\d+\W*)? Match 1+ digits followed by optional non word characters
(\w+(?:\h+\w+)*) Capture group 1, match word characters optionally repeated with a space in between
.*? Match any character except a newline, as few as possible
(\.mp3)? Optionally capture .mp3 in group 2
\h* Match optional trailing spaces
$ End of string
Regex demo
Replace with capture group 1 and group 2
$1$2
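As a rough sketch, the same replacement can be tried in Python. Python's built-in re module does not support \h, so [ \t] stands in for it here. The function name clean_title is made up for illustration, and the pattern is only known to fit the three sample strings from the question.

```python
import re

# Same pattern as above, with \h swapped for [ \t] because Python's re lacks \h.
PATTERN = re.compile(r"^[ \t]*(?:\d+\W*)?(\w+(?:[ \t]+\w+)*).*?(\.mp3)?[ \t]*$")

def clean_title(name: str) -> str:
    """Keep the leading words (group 1) plus an optional .mp3 suffix (group 2)."""
    m = PATTERN.match(name)
    if m is None:
        return name  # leave input that does not match untouched
    return m.group(1) + (m.group(2) or "")

for raw in [
    "song1 artist (SiteWithMp3Keyword.com).mp3",
    "02.song2 | siteWithdownloadKeyword.in 320 Kbps",
    "song3 [SitewithDjKeyword.in] 128kbps.mp3",
]:
    print(clean_title(raw))
```

Real file names will probably need extra handling (underscores, keywords in the middle of the title), so treat this as a starting point rather than a complete cleaner.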

Related

Notepad++ Search for and replace Underscore Characters in "GUIDs"

A colleague has written some C# code that outputs GUIDs to a CSV file. The code has been running for a while but it has been discovered that the GUIDs contain underscore characters, instead of hyphens :-(
There are several files which have been produced already and rather than regenerate these, I'm thinking that we could use the Search and Replace facility in Notepad++ to search across the files for "GUIDs" in this format:
{89695C16_C0FF_4E7C_9BB2_8B50FAC9D371}
and replace it with a properly formatted GUID like this:
{89695C16-C0FF-4E7C-9BB2-8B50FAC9D371}.
I have a RegEx to find the offending GUIDs (probably not very efficient):
(([A-Z]|[0-9]){8}_)(([A-Z]|[0-9]){4})_(([A-Z]|[0-9]){4})_(([A-Z]|[0-9]){4}_(([A-Z]|[0-9]){12}))
but I don't know what RegEx to use to replace the underscores with. Does anybody know how to do this?
You can use the following solution:
Find What: (?:\G(?!\A)|{(?=[a-f\d]{8}(?:_[a-f\d]{4}){4}[a-f\d]{8}\}))[a-f\d]*\K_
Replace with: -
Match case: OFF
See the regex demo online. Details:
(?:\G(?!\A)|{(?=[a-f\d]{8}(?:_[a-f\d]{4}){4}[a-f\d]{8}\})) - either the end of the previous match or a { char immediately followed by eight hex chars, four repetitions of an underscore followed by four hex chars, then eight more hex chars and a } char
[a-f\d]* - zero or more hex chars
\K - match reset operator that discards the text matched so far from the overall match memory buffer
_ - an underscore.
You can match the pattern with 5 capture groups where you would match the underscores in between.
Then you can use the capture groups in the replacement with $1-$2-$3-$4-$5
{\K([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})(?=})
{ Match {
\K Clear the match buffer (forget what is matched so far)
([A-Z0-9]{8})_ Capture group 1, match 8 times a char A-Z0-9
([A-Z0-9]{4})_ Capture 4 times a char A-Z0-9 in group 2
([A-Z0-9]{4})_ Same for group 3
([A-Z0-9]{4})_ Same for group 4
([A-Z0-9]{12}) Capture 12 times a char A-Z0-9 in group 5
(?=}) Positive lookahead, assert } to the right
Regex demo
If the pattern should also match without the curly braces { and }, you can use word boundaries instead
\b([A-Z0-9]{8})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{4})_([A-Z0-9]{12})\b
Regex demo
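If the files end up being processed in code rather than in Notepad++, a minimal Python sketch of the same idea might look like this. Python's re module has no \K, so this version matches the whole GUID and rewrites it in a callback. The name fix_guids and the sample CSV line are assumptions for illustration.

```python
import re

# Underscore-separated GUID in braces: 8-4-4-4-12 characters from A-Z0-9.
GUID = re.compile(
    r"\{[A-Z0-9]{8}_[A-Z0-9]{4}_[A-Z0-9]{4}_[A-Z0-9]{4}_[A-Z0-9]{12}\}"
)

def fix_guids(text: str) -> str:
    """Swap underscores for hyphens, but only inside matched GUIDs."""
    return GUID.sub(lambda m: m.group(0).replace("_", "-"), text)

# Hypothetical CSV line: underscores outside the GUID are left alone.
print(fix_guids("id,{89695C16_C0FF_4E7C_9BB2_8B50FAC9D371},some_other_value"))
```

Restricting the replacement to the matched span is what keeps other underscores in the file intact.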

How to extract a word that could possibly be followed with another word

I want to extract [games, games, things, things] from
the following array.
Today_games
Today_games_freq
Today_things
Today_things_freq
I have tried Today_(\w+)(?=_freq)?
Which will give me the extra "freq"
And some other combinations, but I couldn't figure out how to get just the part after the first underscore.
You can use
Today_(\w+?)(?:_freq)?$
See the regex demo. This matches Today_, then captures any one or more word chars (as few as possible) into Group 1 (with (\w+?)), and then (?:_freq)?$ matches an optional occurrence of a _freq substring and asserts the position at the end of string.
Or,
Today_([^\W_]+)
See this regex demo.
Here, Today_ is matched and the ([^\W_]+) pattern captures one or more alphanumeric chars into Group 1 (same as \w+ with _ subtracted from \w).
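A quick way to compare the two patterns is to run both over the sample array. This is a small Python sketch; the helper names extract_lazy and extract_no_underscore are made up for the example.

```python
import re

names = ["Today_games", "Today_games_freq", "Today_things", "Today_things_freq"]

def extract_lazy(name):
    """Lazy \\w+? plus an optional _freq suffix anchored at the end."""
    m = re.search(r"Today_(\w+?)(?:_freq)?$", name)
    return m.group(1) if m else None

def extract_no_underscore(name):
    """Word characters excluding the underscore, so the capture stops at _."""
    m = re.search(r"Today_([^\W_]+)", name)
    return m.group(1) if m else None

print([extract_lazy(n) for n in names])
print([extract_no_underscore(n) for n in names])
```

The two differ once the wanted part itself contains an underscore: for a hypothetical Today_big_games, the first pattern captures big_games and the second captures big.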

Match certain string on second line of text with regex

I'm new to regex, and would appreciate some guidance/help.
Currently, I'm looking to write an expression, that derives a certain part of text from the 2nd line of the provided text.
Here is the text:
123 anywhere Avenue
Winnipeg, Manitoba R3E 0L7
Canada
Pharmacy Manager: person person
Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd.
My goal is to derive the 'Manitoba' string from the second line, however I'd like to make it dynamic rather than writing an expression to always fetch Manitoba as a static. I used the below code to target the second line:
(.*)(?=(\n.*){3}$)
(It matches 3 lines up from the last line, thus targeting the desired line)
I noticed, that within the dataset, that the Province (Manitoba) is always in between two spaces.
Is there any addition I can make to the code, so that the expression only targets the second line, then matches the first string in-between spaces?
Perhaps using a lazy expression with a positive lookaround?
If I target all matches in between spaces, it would take both 'Manitoba' and 'R3E 0L7' which I dont want.
I want it to only match the first piece of text in between spaces on the second line.
Any help is much appreciated :-)
Thanks.
One option could be to match the first line, then capture the second word of the second line in capturing group 1.
Then match the rest of the second line and assert that exactly 3 more lines follow.
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)
In parts:
^ Start of string
.*\r?\n Match the whole first line and a newline
\S+ Match 1+ non whitespace char (the first "word")
[^\S\r\n]+ Match 1+ times a whitespace char except newlines
(\S+) Capture group 1, match 1+ times a non whitespace char (the second "word")
.* Match the rest of the line
(?= Positive lookahead, assert what follows on the right is
(?:\r?\n.*){3}$ Match 3 times a newline followed by 0+ times any except a newline and assert the end of the string
) Close lookahead
Regex demo
You could also turn the lookahead into a match instead
^.*\r?\n\S+[^\S\r\n]+(\S+).*(?:\r?\n.*){3}$
Regex demo
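For reference, here is how the lookahead version behaves on the sample address when run through Python's re module. This is a sketch only: it assumes the block always has exactly five lines, as in the question, and the function name second_word_of_second_line is hypothetical.

```python
import re

text = (
    "123 anywhere Avenue\n"
    "Winnipeg, Manitoba R3E 0L7\n"
    "Canada\n"
    "Pharmacy Manager: person person\n"
    "Pharmacy Licence Holder/Owner: 123456 Manitoba Ltd."
)

# Without re.MULTILINE, ^ and $ anchor to the start and end of the whole string.
PATTERN = re.compile(r"^.*\r?\n\S+[^\S\r\n]+(\S+).*(?=(?:\r?\n.*){3}$)")

def second_word_of_second_line(block):
    """Return capture group 1, or None if the block has a different shape."""
    m = PATTERN.search(block)
    return m.group(1) if m else None

print(second_word_of_second_line(text))
```

Because the lookahead demands exactly three trailing lines, a block with more or fewer lines returns None instead of a wrong value.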

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) sign; however, even without a . (dot) in the data it would not work. I'm not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
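Outside Notepad++, the \K trick is not always available; Python's built-in re module, for one, does not support it. A rough equivalent captures the part to keep in a group instead. The name drop_last_quoted and the two-line sample are illustrative assumptions.

```python
import re

# Group 1 plays the role of everything before \K; [ \t] replaces \h.
LAST_QUOTED = re.compile(r'^(.+)[ \t]+".*?"$', re.MULTILINE)

def drop_last_quoted(text):
    """Remove the last quoted field (and the spaces before it) on each line."""
    return LAST_QUOTED.sub(r"\1", text)

sample = (
    '201910031044 "00059" "11.31AG" "Senior Champion"\n'
    '201910031044 "00999" "10.12G" "ProAM"'
)
print(drop_last_quoted(sample))
```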
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV in code.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character count problem, but the final field can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot. It's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ instead of a literal space followed by + for whitespace. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
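To check the corrected pattern outside the editor, a short Python sketch can apply it line by line. The helper name trim_record is made up, and the sample lines are copied from the question.

```python
import re

# Shortened pattern from above: 12 digits, a quoted 5-digit field,
# a quoted alphanumeric/dot field, then anything up to the end of the line.
RECORD = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$')

def trim_record(line):
    """Keep only the first three fields; non-matching lines pass through."""
    return RECORD.sub(r"\1", line)

lines = [
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    "some other line",
]
for line in lines:
    print(trim_record(line))
```

Working one line at a time sidesteps the fact that \s can also match a newline.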
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2 or more spaces or tabs
"[^"\r\n]+" Match the last quoted field: ", then 1+ chars other than " or a newline, then "
In the replacement use group 1 $1
Regex demo

RegEx for matching uppercase and dash followed by a comma

Trying to remove strings that follow the pattern
Tag Starts With
Size:
and before the next COMMA (,) includes the - character.
Example:
Size: XS-S-M-L-XL-2XL,
or
Size: XS-S-M,
etc.
WOULD get selected (including ,)
but Size_S, would be ignored because there is no -
I'm close with:
Size:(.*)-*(.?),
But still not stopping at ,
Here is 1 line of tags:
Athletics, Fitted, Mesh, Feature_Moisture Wicking, Material_Polyester 100%, , Material_Polyester 100%, Material_Polyester Over 50%, School, Style_Short Sleeves, Size_2XL, Size_L, Size_M, Size_S, Size_XL, Size_XS, Size: XS-S-M-L-XL-2XL, Uniforms, Unisex, V-Neck, VisibleLogos, Youth
To remove all size 'range' tags from my cells and only leave the single size tag.
Solution can be found here: regex101.com/r/VuTzba/1
In your pattern Size:(.*)-*(.?), you are first matching until the end of the string using (.*).
After that the hyphen -* and single character in the group (.?) are optional so it will backtrack until the last comma as that is the only character that has to be matched.
To get a more exact match, you could use a repeating pattern to match the sizes:
Size: (?:\d*X[SL]|L|M|S)(?:-(?:\d*X[LS]|L|M|S))*,
Explanation
Size: Match Size followed by a space
(?: Non capturing group
\d*X[SL]|L|M|S match one of the listed items in the alternation
) Close group
(?: Non capturing group
-(?:\d*X[LS]|L|M|S) Match a hyphen followed by any of the listed items
)*, Close group and repeat 0+ times and match a comma
Regex demo
A broader pattern could use a character class listing all the allowed characters, Size: [XSML\d]+(?:-[XSML\d]+)*, or match until the first comma with Size:[^,]+,
Edit
To also match Size: 28W-30W-32W-34W-36W-38W-40W, or Size: 28W-30W-32W-34W, you could extend the alternation by adding |\d+W to it and end the pattern by matching either a comma or asserting the end of the string $
Size: (?:\d*X[SL]|L|M|S|\d+W)(?:-(?:\d*X[LS]|L|M|S|\d+W))*(?:,|$)
Regex demo
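Here is a small Python sketch of removing the range tags with the extended pattern. One deliberate tweak, not part of the answer above: \s* is added after the comma so the removal does not leave a double space behind. The names SIZE, SIZE_RANGE and remove_size_ranges are assumptions for the example.

```python
import re

# One size token: XS, XL, 2XL, L, M, S or a waist size such as 28W.
SIZE = r"(?:\d*X[SL]|L|M|S|\d+W)"

# "Size: " + token + zero or more "-token", ending at a comma or end of string.
# The \s* after the comma is an added tweak to swallow the following space.
SIZE_RANGE = re.compile(rf"Size: {SIZE}(?:-{SIZE})*(?:,\s*|$)")

def remove_size_ranges(tags):
    """Drop 'Size: ...' range tags; single-size tags like Size_S are kept."""
    return SIZE_RANGE.sub("", tags)

tags = "Size_S, Size_XL, Size_XS, Size: XS-S-M-L-XL-2XL, Uniforms, Unisex"
print(remove_size_ranges(tags))
```

As written, the * quantifier means a tag such as Size: XS, with no hyphen is removed as well; changing it to + would require at least one hyphen.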
We might want to add more boundaries in our expression here. Let's start with something similar to:
Size:\s+([A-Z0-9-]+),
where the capturing group () collects our desired data.
RegEx
If this expression isn't what you need, it can be modified or changed in regex101.com.
RegEx Circuit
jex.im also helps to visualize the expressions.
Size:\s*(.*?), will grab everything after the colon and before the next comma skipping leading white space.