Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records in a .txt document.
Some of them look like this:
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record like this occurs, I'd like to get rid of the last quoted field, i.e. the words/numbers/etc. in the last pair of quotation marks (like "Senior Champion", "Junior Champion", etc.; there are many possible values).
E.g. before:
201910031044 "00059" "11.31AG" "Senior Champion"
After:
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) character; however, even where there is no . (dot) it does not work. I'm not sure whether it has anything to do with using the + sign more than once.

I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
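For readers scripting this outside Notepad++, the same idea can be sketched in Python. Python's re module supports neither \K nor \h, so this sketch assumes a capture group in place of \K and [ \t] in place of \h; the sample records are the ones from the question.

```python
import re

records = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])

# Group 1 keeps everything before the last quoted field (it stands in for \K);
# [ \t]+ stands in for \h+. MULTILINE makes ^ and $ work per line.
last_field = re.compile(r'^(.+)[ \t]+".*?"$', flags=re.MULTILINE)

trimmed = last_field.sub(r"\1", records)
print(trimmed)
```

Because .+ is greedy, the engine backtracks from the end of each line, so only the final quoted field is removed.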

The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it with code as a CSV.
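A minimal sketch of the "parse it as a CSV" route, using Python's standard csv module. It assumes fields are separated by a single space and that only the first field is unquoted, as in the question's sample; drop_last_field is a hypothetical helper name.

```python
import csv

records = [
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00362" "113.1LI" "Abcd"',
]

def drop_last_field(line):
    # csv.reader accepts any iterable of strings; here a one-element list.
    fields = next(csv.reader([line], delimiter=" ", quotechar='"'))
    kept = fields[:-1]
    # Re-quote every field except the leading timestamp, mirroring the input layout.
    return " ".join([kept[0]] + ['"{}"'.format(f) for f in kept[1:]])

trimmed = [drop_last_field(r) for r in records]
print("\n".join(trimmed))
```

Parsing this way keeps quoted values that contain spaces, such as "Senior Champion", together as one field.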

You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
This is not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets, where it matches only a literal dot. Outside a character class it is a wildcard, but it loses its special meaning inside one ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ instead of " +" (a space followed by +) for whitespace. As a benefit, \s also matches other whitespace such as tabs, so you don't have to account for those manually. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
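The effect of the wrong counts can be checked quickly in Python, whose re module accepts both patterns unchanged. This is only a sketch on the four sample records from the question.

```python
import re

records = [
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
]

# The pattern from the question: 17 digits, 8 digits, a single character.
original = re.compile(r'^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$')
# The corrected pattern: 12 digits, 5 digits, one or more letters/digits/dots.
corrected = re.compile(r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$')

original_hits = [original.search(r) is not None for r in records]
cleaned = [corrected.sub(r"\1", r) for r in records]

print(original_hits)
print("\n".join(cleaned))
```

The original pattern matches none of the records, while the corrected one keeps the first three fields of each.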

If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
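[ \t]{2,} Match 2 or more spaces or tabs (use [ \t]+ if a single space may separate the last field)
"[^"\r\n]+" Match the last quoted field: 1+ characters other than " or a line break, between quotes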
In the replacement use group 1 $1
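The remark about \s also matching a newline can be illustrated with a short Python sketch. The two-line input below is hypothetical and exists only to show the difference between \s+ and [ \t]+.

```python
import re

# Hypothetical input: two quoted values separated only by a line break.
text = '"11.31AG"\n"Senior Champion"'

# \s+ happily crosses the line break and joins the two lines into one match.
with_s = re.search(r'"[^"\r\n]+"\s+"[^"\r\n]+"', text)
# [ \t]+ only accepts spaces and tabs, so nothing matches here.
with_blank = re.search(r'"[^"\r\n]+"[ \t]+"[^"\r\n]+"', text)

print(with_s is not None, with_blank is not None)
```

With line-oriented data, restricting the separator to [ \t] avoids accidentally matching across records.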

Related

Regex to disregard partial matches across lines / matching too much

I have three lines of tab-separated values:
SELL 2022-06-28 12:42:27 39.42 0.29 11.43180000 0.00003582
BUY 2022-06-28 12:27:22 39.30 0.10 3.93000000 0.00001233
_____2022-06-28 12:27:22 39.30 0.19 7.46700000 0.00002342
The first two have 'SELL' or 'BUY' as the first value, but the third one does not; it starts with a tab, shown as _____ above.
I would like to capture the following using Regex:
My expression ^(BUY|SELL).+?\r\n\t does not work, as it gets me this:
I do know why it outputs this; adding a lazy marker '?' obviously won't help. I can't get lookarounds to work either, if they are the right tool at all. I need something like 'Match \r\n\t only, or \r\n(?:^\t), at the end of each line'.
The final goal is to make the three lines look like this in the end, so I will need to replace the match using capturing groups:
Can anyone point me to the right direction?

Ctrl+H
Find what: ^(BUY|SELL).+\R\K\t
Replace with: $1\t
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(BUY|SELL) # group 1, BUY or SELL
.+ # 1 or more any character but newline
\R # any kind of linebreak
\K # forget all we have seen until this position
\t # a tabulation
Replacement:
$1 # content of group 1
\t # a tabulation
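Outside Notepad++, a rough Python equivalent could look like the sketch below. Python's re module has no \K or \R, so it assumes an extra capture group in place of \K and \r?\n in place of \R; the sample text assumes the date and time form a single tab-separated field.

```python
import re

text = (
    "SELL\t2022-06-28 12:42:27\t39.42\t0.29\t11.43180000\t0.00003582\n"
    "BUY\t2022-06-28 12:27:22\t39.30\t0.10\t3.93000000\t0.00001233\n"
    "\t2022-06-28 12:27:22\t39.30\t0.19\t7.46700000\t0.00002342\n"
)

# Group 1: a whole BUY/SELL line including its line break (re-inserted unchanged).
# Group 2: the keyword itself, copied in front of the next line's leading tab.
fill_down = re.compile(r"^((BUY|SELL).+\r?\n)\t", flags=re.MULTILINE)

result = fill_down.sub(r"\1\2\t", text)
print(result)
```

As with the Notepad++ version, a single pass fills in only the line directly after a BUY or SELL line.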

You can use the regex ((BUY|SELL)[^\n]+\n)\s+ and replace with \1\2\t.
Regex Match Explanation:
((BUY|SELL)[^\n]+\n): Group 1, the whole line including its newline
(BUY|SELL): Group 2
BUY: the literal characters "BUY"
|: or
SELL: the literal characters "SELL"
[^\n]+: one or more characters other than a newline
\n: newline character
\s+: one or more whitespace characters (the leading tab of the next line)
Regex Replace Explanation:
\1: Reference to Group 1
\2: Reference to Group 2
\t: a tab, restoring the separator consumed by \s+
Tested on Notepad++ in a private environment too.
Note: Make sure to check the "Regular expression" checkbox.

How to find regex for multiple conditions

I am trying to find regex which would find below matches. I would replace these with blank. I am able to create regex for few of these conditions individually, but I am not able to figure out how to create one regex for all of these
Strings:
song1 artist (SiteWithMp3Keyword.com).mp3
02.song2 | siteWithdownloadKeyword.in 320 Kbps
song3 [SitewithDjKeyword.in] 128kbps.mp3
Output
song1 artist.mp3
song2
song3.mp3
Criteria for match:
Case Insensitive
Find Strings with particular keyword and remove whole word, even if inside any braces
Find the kbps keyword and remove it along with any number before it (128/320)
if string ends in .mp3, keep it as it is.
Remove junk characters (like | ) and replace _ with space.
Remove number if present at start of string, like 001_ 02. etc.
Trim whitespaces before and after remaining string
Example Regex for 2.
\S+(mp3|dj|download)\S+
https://regex101.com/r/nxp4d3/1

Try this regex:
Find: ^[0-9. ]*(song\d+ (\w+ )?).*?(\.mp3 ?)?$
Replace with: $1$3
P.S. If this doesn't solve your problem, please share a sample of your real data so that someone can better understand it.
Thanks.

For the example data, you might use:
^\h*(?:\d+\W*)?(\w+(?:\h+\w+)*).*?(\.mp3)?\h*$
The pattern matches:
^ Start of string
\h* Match optional leading spaces
(?:\d+\W*)? Match 1+ digits followed by optional non word characters
(\w+(?:\h+\w+)*) Capture group 1, match word characters optionally repeated with a space in between
.*? Match any character except a newline, as few as possible
(\.mp3)? Optionally capture .mp3 in group 2
\h* Match optional trailing spaces
$ End of string
Replace with capture group 1 and group 2
$1$2
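To try the pattern from a script, here is a sketch in Python. Python's re module does not support \h, so [ \t] is assumed in its place; clean is a hypothetical helper name. Like the pattern itself, it only covers the example data.

```python
import re

names = [
    "song1 artist (SiteWithMp3Keyword.com).mp3",
    "02.song2 | siteWithdownloadKeyword.in 320 Kbps",
    "song3 [SitewithDjKeyword.in] 128kbps.mp3",
]

cleanup = re.compile(r"^[ \t]*(?:\d+\W*)?(\w+(?:[ \t]+\w+)*).*?(\.mp3)?[ \t]*$")

def clean(name):
    # Group 2 is None when the name does not end in .mp3, so fall back to "".
    return cleanup.sub(lambda m: m.group(1) + (m.group(2) or ""), name)

cleaned = [clean(n) for n in names]
print(cleaned)
```

The leading words end up in group 1 because \w+ stops at the first bracket, pipe, or other non-word character.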

Regex for text file

I have a text file with the following text:
andal-4.1.0.jar
besc_2.1.0-beta
prov-3.0.jar
add4lib-1.0.jar
com_lab_2.0.jar
astrix
lis-2_0_1.jar
Is there any way I can split the name and the version using regex? I want to use the results to make two columns, 'Name' and 'Version', in Excel.
So I want the results from the regex to look like
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
So far I have used ^(?:.*-(?=\d)|\D+) to get the Version and -\d.*$ to get the Name separately. The problem with this is that when I do it for a large text file, the results from the two regexes are not in the same order. So is there any way to get the results in the way I have mentioned above?

Ctrl+H
Find what: ^(.+?)[-_](\d.*)$
Replace with: $1\t$2
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(.+?) # group 1, 1 or more any character but newline, not greedy
[-_] # a dash or underscore
(\d.*) # group 2, a digit then 0 or more any character but newline
$ # end of line
Replacement:
$1 # content of group 1
\t # a tabulation, you may replace with what you want
$2 # content of group 2
Result for given example:
andal 4.1.0.jar
besc 2.1.0-beta
prov 3.0.jar
add4lib 1.0.jar
com_lab 2.0.jar
astrix
lis 2_0_1.jar
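The same find/replace can be sketched in Python, whose re module accepts this pattern unchanged; the MULTILINE flag is needed so that ^ and $ apply to each line. The input is the list from the question.

```python
import re

text = (
    "andal-4.1.0.jar\n"
    "besc_2.1.0-beta\n"
    "prov-3.0.jar\n"
    "add4lib-1.0.jar\n"
    "com_lab_2.0.jar\n"
    "astrix\n"
    "lis-2_0_1.jar\n"
)

# Lazy group 1 stops at the first dash/underscore that is followed by a digit.
split_re = re.compile(r"^(.+?)[-_](\d.*)$", flags=re.MULTILINE)

# Name and version end up separated by a tab; lines without a version are untouched.
columns = split_re.sub(r"\1\t\2", text)
print(columns)
```

Tab-separated output pastes into two spreadsheet columns, and each name stays on the same row as its version.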

I'm not quite sure what you meant by the problem with large files, and I believe the two regexes you showed do the opposite of what you said: the first one gets the name and the second one gets the version.
Anyway, here are the assumptions I had to make about what makes sense for you:
"Name" may be followed by - or _, followed by the version string.
"Version" is something preceded by - or _ that starts with some digits, followed by a dot or underscore, followed by some digits, and then any string.
If these assumptions make sense, you may use
^(.+?)(?:[-_](\d+[._]\d+.*))?$
as your regex. Group 1 will be the name, Group 2 will be the version.
Demo in regex101: https://regex101.com/r/RnwMaw/3
Explanation of regex
^                   start of line
(.+?)               "Name" part, using a reluctant match of at least 1 character
(?:            )?   optional group for the "Version String", which consists of:
   [-_]             - or _
   (          )     followed by the "Version", which is
    \d+             at least 1 digit,
    [._]            then 1 dot or underscore,
    \d+             then at least 1 digit,
    .*              then any string
$                   end of line
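Because the version group is optional, every line yields a (name, version) pair, which keeps the two columns in the same order. A sketch in Python, assuming non-empty input lines and tab-separated output for pasting into Excel:

```python
import csv
import io
import re

lines = [
    "andal-4.1.0.jar", "besc_2.1.0-beta", "prov-3.0.jar", "add4lib-1.0.jar",
    "com_lab_2.0.jar", "astrix", "lis-2_0_1.jar",
]

split_re = re.compile(r"^(.+?)(?:[-_](\d+[._]\d+.*))?$")

rows = []
for line in lines:
    m = split_re.match(line)  # matches any non-empty line
    # Group 2 is None when a line has no version part (e.g. "astrix").
    rows.append((m.group(1), m.group(2) or ""))

# Write the pairs as tab-separated text with a header row.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["Name", "Version"])
writer.writerows(rows)
print(buffer.getvalue())
```

Processing one line at a time is what guarantees that a missing version leaves an empty cell instead of shifting later rows.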

Remove characters from regex query

I have trouble understanding why my regex query takes one extra character besides the symbols I have told regex to include into the query, so this is my regex:
([\-:, ]{1,})[^0-9]
This is my test text:
Test- Product-: 1 --- 3 hour ,--kayak:--rental
It always includes the first character of each starting word, like P on Product or h on hour, how can I prevent regex from including those first characters?
I am trying to get all dashes, double points, comma and spaces excluding numbers or any characters.

The [^0-9] part of your regex matches any char but a digit, so it consumes the first character of the following word. You should remove it from your pattern.
There is no need to wrap the character class in a capturing group, and {1,} is equivalent to +, so the whole regex can be shortened to
[-:, ]+
Note that a - at the start or end of a character class does not have to be escaped.
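A quick check in Python (the pattern syntax is the same there) shows both the problem and the fix on the test text from the question:

```python
import re

text = "Test- Product-: 1 --- 3 hour ,--kayak:--rental"

# Original pattern: the trailing [^0-9] swallows one extra non-digit character.
first_original = re.search(r"([\-:, ]{1,})[^0-9]", text).group(0)

# Shortened pattern: only dashes, colons, commas and spaces are matched.
separators = re.findall(r"[-:, ]+", text)

print(repr(first_original))
print(separators)
```

The first match of the original pattern includes the P of Product, while the shortened pattern returns only runs of separator characters.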

Regex with optional, lazy, greedy group

Let's take this source string from a word document:
A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF
sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK
"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk
I'm looking for a pattern that is composed of the literal text "SOURCE: " followed by a 1 digit number a space and a 2 digit number.
For example, in the first line of the source string, I want to find "SOURCE: 3 55".
Now, some clever boffin has decided to embed a hyperlink for the 1 digit number and another hyperlink for the 2 digit number. Lines 2 and 3 show the two embedded hyperlinks. MATCH1 refers to the first embedded hyperlink, MATCH2 is the second, and so on. I have no way of knowing how many hyperlinks will be placed before these, so one can't assume MATCH9 and MATCH10.
The text I want to extract is the "3 55" portion. I want to put it into a named group I'll call "KeepMe".
I don't mind using two different patterns, one for the hyperlink and one without.
Here's a pattern that works for the non-hyperlinked text:
SOURCE:\s+(?<KeepMe>\d*\s+\d*)
I get "3 55" in the KeepMe group just like I want.
I haven't been able to keep the hyperlink match pattern from being greedy.
Here's a failed regex pattern, (one of many):
SOURCE:\s+(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe1>\d*)\s+
(?<Hyperlink>HYPERLINK.*MATCH\d*\u0022\s+)??(?<KeepMe2>\d*)
In the above pattern, I'm trying to say:
Look for the literal SOURCE: followed by one or more spaces.
Then, optionally look for the literal text "HYPERLINK followed by some characters, followed by the literal text MATCH, followed by some digits and a double quote character in a lazy, non-greedy manner, followed by one or more spaces, followed by some digits I want to keep. Then, do another HYPERLINK pattern match like we just did and keep the digits after that, too.
Remember, in both cases, I want to extract "3 55". It can be extracted in one or two pieces though one would be best.
Any ideas???

This should do the trick:
\bSOURCE:\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe1>\d+)\s+(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?(?<KeepMe2>\d+)\b
The main difference is that I replaced the .* between HYPERLINK and MATCH with "[^"]*", a negated character class that cannot run past the closing quote.
Fiddle: https://regex101.com/r/yE3fP4/1
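The pattern can also be tried in Python. The one change assumed here is the named-group syntax: Python's re module spells it (?P<name>...) rather than (?<name>...). The text is the source string from the question.

```python
import re

text = (
    'A;SDLFJA;SDJFA;KSDJF;ALKSJDF SOURCE: 3 55 ASDKLFJA;KDSJF\n'
    'sa;ldkjfa SOURCE: HYPERLINK "ASDLFA;SDFA;SKD" "MATCH9" 3 HYPERLINK\n'
    '"ASDLFA;SDFA;SKD" "MATCH10" 55 a;sdkfja;ksdfj;aklsdjf;lk\n'
)

# Optional hyperlink prefix; [^"]* cannot run past the closing quote.
link = r'(?:HYPERLINK\s+"[^"]*"\s+"MATCH\d+"\s+)?'
source_re = re.compile(
    r"\bSOURCE:\s+" + link + r"(?P<KeepMe1>\d+)\s+" + link + r"(?P<KeepMe2>\d+)\b"
)

pairs = [(m.group("KeepMe1"), m.group("KeepMe2")) for m in source_re.finditer(text)]
print(pairs)
```

Both the plain and the hyperlinked occurrence yield the digits 3 and 55, since \s+ also bridges the line break inside the second hyperlink.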

A Regex that works for just the hyperlinked case is:
/(?<SourceToken>SOURCE:)                # Start with a source tag
\s+                                     # Followed by whitespace
(?<HyperlinkMatchGroup>                 # Save the hyperlink & match combo.
  (?<Hyperlink>                         # Save the hyperlink (to be discarded)
    (?<HyperlinkToken>HYPERLINK\s+)     # Hyperlinks start with the literal tag "HYPERLINK"
    (?<HyperlinkText>".*?")             # Hyperlink text contained in quotes, non-greedy
    \s*)                                # Followed by whitespace
  *                                     # Repeating any number of times
  (?<MatchToken>"MATCH\d*")             # Followed by a literal tag "MATCH" and a digit string
  \s*                                   # Followed by whitespace
  (?<KeepMe>\d+)                        # Finally, the match, which is just a series of digits
  \s*                                   # Followed by whitespace
)+                                      # The whole hyperlink & match pair must occur at least once
/x
It may or may not cover all your cases; I haven't spent much time digging into it.
