How to transpose pieces of data using Regular expression in Notepad++ - regex

I am very new to the world of regular expressions. I am trying to use Notepad++ using Regex for the following:
Input file is something like this and there are multiple such files:
Code:
abc
17
015
0 7
4.3
5/1
***END***
abc
6
71
8/3
9 0
***END***
abc
10.1
11
9
***END***
I need to be able to edit the text in all of these files so that all the files look like this:
Code:
abc
1,2,3,4,5
***END***
abc
6,7,8,9
***END***
abc
10,11,12
***END***
Also:
In some files the number of * around the word END varies, is there a way to generalize the number of * so I don't have to worry about it?
There is some additional data before abcs which does not need to be transposed, how do I keep that data as it is along with transposing the data between abc and ***END***.
Kindly help me. Your help is much appreciated!

Try the following find and replace, in regex mode:
Find: ^(\d+)\R(?!\*{1,}END\*{1,})
Replace: $1,
Here is an explanation of the regex pattern:
^ from the start of the line
(\d+) match AND capture a number
\R followed by a platform independent newline, which
(?!\*{1,}END\*{1,}) is NOT followed by ***END***
Note carefully the negative lookahead at the end of the pattern, which makes sure that we don't do the replacement on the final number in each section. Without this, the last number would bring the END marker onto the same line.
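For anyone who wants to try the same idea outside Notepad++, here is a minimal Python sketch. It assumes Python's re module, which has no \R, so \r?\n stands in for it. The function name join_numbers is only illustrative and not part of the answer above.

```python
import re

def join_numbers(text):
    # A digits-only line plus its line break, unless the next line is the END marker.
    pattern = re.compile(r"^(\d+)\r?\n(?!\*+END\*+)", re.MULTILINE)
    # Keep the number and swap the line break for a comma.
    return pattern.sub(r"\1,", text)

sample = "abc\n1\n2\n3\n***END***\n"
print(join_numbers(sample))
```

As in the Notepad++ version, the negative lookahead leaves the last number of each section on its own line, so the END marker is not pulled up.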

This will replace line breaks only between "abc" and "***END***", with any number of asterisks.
Ctrl+H
Find what: (?:(?<=^abc)\R|\G(?!^)).+\K\R(?!\*+END\*+)
Replace with: ,
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # non capture group
(?<=^abc) # positive look behind, make sure we have "abc" at the beginning of line before
\R # any kind of linebreak
| # OR
\G # restart from last match position
(?!^) # negative look ahead, make sure we are not at the beginning of line
) # end group
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\R # any kind of linebreak
(?!\*+END\*+) # negative lookahead, make sure we haven't ***END*** after
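Python's re module has neither \G nor \K, so the pattern above cannot be copied over directly. A rough sketch of the same effect, under that assumption, is to match each abc ... END block as a whole and join its inner lines in a callback. The names BLOCK and transpose_blocks are hypothetical, and the sketch assumes every block holds at least one data line.

```python
import re

# Group 1: the "abc" line, group 2: the data lines, group 3: the END line
# with any number of asterisks on either side.
BLOCK = re.compile(r"(^abc\r?\n)(.*?)(\r?\n\*+END\*+)", re.MULTILINE | re.DOTALL)

def transpose_blocks(text):
    def join_lines(match):
        # Turn the data lines into one comma-separated line.
        body = ",".join(match.group(2).splitlines())
        return match.group(1) + body + match.group(3)
    # Text outside the abc ... END blocks is left untouched.
    return BLOCK.sub(join_lines, text)

sample = "header\nkeep me\nabc\n1\n2\n3\n***END***\nabc\n4\n5\n*END*\n"
print(transpose_blocks(sample))
```

Anything before the first abc, or between blocks, passes through unchanged because only the matched blocks are rewritten.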

Related

Regex to disregard partial matches across lines / matching too much

I have three lines of tab-separated values:
SELL 2022-06-28 12:42:27 39.42 0.29 11.43180000 0.00003582
BUY 2022-06-28 12:27:22 39.30 0.10 3.93000000 0.00001233
_____2022-06-28 12:27:22 39.30 0.19 7.46700000 0.00002342
The first two have 'SELL' or 'BUY' as first value but the third one has not, hence a Tab mark where I wrote ______:
I would like to capture the following using Regex:
My expression ^(BUY|SELL).+?\r\n\t does not work as it gets me this:
I do know why it outputs this; adding a lazy marker '?' obviously won't help. I can't get lookarounds to work either, if they are the right tool at all. I need something like 'Match \r\n\t only or \r\n(?:^\t) at the end of each line'.
The final goal is to make the three lines look like this at the end, so I will need to replace the match with capturing groups:
Can anyone point me in the right direction?
Ctrl+H
Find what: ^(BUY|SELL).+\R\K\t
Replace with: $1\t
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
(BUY|SELL) # group 1, BUY or SELL
.+ # 1 or more any character but newline
\R # any kind of linebreak
\K # forget all we have seen until this position
\t # a tabulation
Replacement:
$1 # content of group 1
\t # a tabulation
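The same replacement can be sketched in Python, assuming the re module. It has no \K, so the part that \K would discard is captured in a second group and written back. The function name fill_missing_side is made up for this illustration.

```python
import re

def fill_missing_side(text):
    # Group 1: BUY or SELL. Group 2: the rest of that line plus its line break.
    # The final \t is the empty first column of the following line.
    pattern = re.compile(r"^(BUY|SELL)(.+\r?\n)\t", re.MULTILINE)
    # Rebuild the first line, then prefix the next line with the same keyword.
    return pattern.sub(r"\1\2\1\t", text)

sample = (
    "SELL\t2022-06-28 12:42:27\t39.42\n"
    "BUY\t2022-06-28 12:27:22\t39.30\n"
    "\t2022-06-28 12:27:22\t39.30\n"
)
print(fill_missing_side(sample))
```

Like the Notepad++ pattern, this only fills a line that directly follows a BUY or SELL line.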
You can use the following regex ((BUY|SELL)[^\n]+\n)\s+ and replace with \1\2.
Regex Match Explanation:
((BUY|SELL)[^\n]+\n): Group 1
(BUY|SELL): Group 2
BUY: the literal characters "BUY"
|: or
SELL: the literal characters "SELL"
[^\n]+: one or more characters other than newline
\n: newline character
\s+: one or more whitespace characters
Regex Replace Explanation:
\1: Reference to Group 1
\2: Reference to Group 2
Tested on Notepad++ in a private environment.
Note: Make sure to check the "Regular expression" checkbox.

Regex in Notepad++ to remove certain CRLFs

Given this sample data:
00-1234T|`CRLF`
Data|Commments|`CRLF`
12-3456|Some data|Notes|`CRLF`
65-8436ZZ|Data|`CRLF`
|`CRLF`
45-4576AA|Some data|Comments|`CRLF`
98-4392REV|Data|`CRLF`
|`CRLF`
00-5432|Some Data|Some Comments|
(I added the "CRLF"s to each line to more clearly illustrate what is there and what needs to be replaced)
Each record should only have three pipes in a line, with a CRLF after the third pipe. So lines 1, 4, and 7 (pre-find/replace) need to be fixed, which means any CRLF before the third pipe needs to be replaced with a "placeholder", which will be "#CRLF#".
The closest I've been able to come up with is ^((?:[^\v|]*\|){3})(.+), which will match (highlight) lines 3 & 4, 6 & 7, and 9 & 10. My expectation (requirement) is to find the CRLFs in lines 2, 5, & 8 and replace those with "#CRLF#".
[UPDATE]
After sleeping on this question, I woke up realizing that, for the purpose of more accurately finding the beginning of a given record - whether on one line or multiple - I should add that the first column will always start with the pattern [0-9][0-9]-[0-9][0-9][0-9][0-9] and possibly have up to three alphanumeric characters after that.
I modified the sample data above to reflect that.
Ctrl+H
Find what: \R(?!\d\d-\d{4}\w{0,3}\|)
Replace with: #CRLF#
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
\R # any kind of linebreak (i.e. \r, \n, \r\n),
if you want to match only windows EOL, use \r\n
(?! # negative lookahead, make sure what follows is not:
\d\d-\d{4} # 2 digits, a dash, 4 digits
\w{0,3} # 0 to 3 word characters
\| # a pipe
) # end lookahead
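A small Python sketch of the same lookahead, assuming Windows line endings so that \r\n can replace \R (which Python's re module does not support). The function name mark_inner_crlf is hypothetical.

```python
import re

def mark_inner_crlf(text):
    # A CRLF that is NOT followed by the start of a new record
    # (2 digits, dash, 4 digits, up to 3 word characters, pipe).
    pattern = re.compile(r"\r\n(?!\d\d-\d{4}\w{0,3}\|)")
    return pattern.sub("#CRLF#", text)

sample = "00-1234T|\r\nData|Comments|\r\n12-3456|Some data|Notes|"
print(mark_inner_crlf(sample))
```

Note that a CRLF at the very end of the input is also replaced, since no record start follows it. The sample in the question has no trailing CRLF, so that case does not arise there.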
This should get you started.
The regex just captures the parts between pipes then re-writes on the substitution.
Any CRLF's are not captured and get stripped out.
But this is very simplistic and may need to change if your input is any more complex.
(?m)^([^|\r\n]*)[\r\n]*\|[\r\n]*([^|\r\n]*)[\r\n]*\|[\r\n]*([^|\r\n]*)[\r\n]*\|[\r\n]*
Replace using: $1|$2|$3|\n
https://regex101.com/r/WzDLwf/1
updated answer
To answer your updated question, if you need to make it like mail merge,
it could also be done like this (as an alternative to Toto's method).
(?m)
(?:
^ \d{2} - \d{4} [^|\r\n]* \|
| \G
)
(?: [^|\r\n]* \| )*
\K
[\r\n]+ (?! [\r\n]* (?: ^ \d{2} - \d{4} | $ ) )
https://regex101.com/r/qK4SJP/1

Modify raw input to look like a bus timetable in specific format using regex?

I'm trying to figure this out for quite some time already, but can't seem to find the solution that would work at once or in the way I prefer it.
I have an input that looks like this:
0430
0500 25 50
0615 34 51
0708 26 43
And I need to turn it into this:
04:30
05:00,05:25,05:50
06:15,06:34,06:51
07:08,07:26,07:43
Since this is only part of the input and manually replacing everything isn't an option, my guess is that the best option is to go with regex.
What needs to be done:
Insert a colon after the first two digits (something like (^\d{2}) and then doing replace/substitution with $1:)
Replace each space with comma + first two digits + colon.
My idea was to capture group (^\d{2}:) and then replace all spaces with ,$1 (per each line), but I can't seem to find the way to do it.
I use regex101.com for doing it, so if you have any advice on how to do it, or where to do it (or even if regex isn't the way to do it, what other way would you recommend) any help would be appreciated.
Thanks in advance!
Here is a way to do the job with Notepad++:
Ctrl+H
Find what: ^(\d\d)(\d\d)(?:\h+(\d\d)\h+(\d\d))?
Replace with: $1:$2(?3,$1\:$3,$1\:$4:)
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
^ # beginning of line
(\d\d) # group 1, 2 digits
(\d\d) # group 2, 2 digits
(?: # non capture group
\h+ # 1 or more horizontal spaces
(\d\d) # group 3, 2 digits
\h+ # 1 or more horizontal spaces
(\d\d) # group 4, 2 digits
)? # end group, optional
Replacement:
$1 # content of group 1
: # a colon
$2 # content of group 2
(?3 # if group 3 exists
,$1\:$3 # a comma then content of group 1 and 3
,$1\:$4 # a comma then content of group 1 and 4
: # else nothing
) # end conditional
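Python's re.sub has no conditional replacement syntax, so an equivalent sketch has to build the line in code instead. The version below is an assumption-laden variant rather than a translation: it accepts any number of minute fields after the first time, not exactly two, and the names expand_line and expand_timetable are illustrative only.

```python
import re

def expand_line(line):
    # Hour, first minute field, then zero or more further minute fields.
    m = re.match(r"(\d\d)(\d\d)((?:[ \t]+\d\d)*)$", line)
    if m is None:
        return line  # leave lines that do not fit the shape untouched
    hour, first, rest = m.groups()
    minutes = [first] + rest.split()
    # Prefix every minute field with the hour and a colon.
    return ",".join(f"{hour}:{mm}" for mm in minutes)

def expand_timetable(text):
    return "\n".join(expand_line(line) for line in text.splitlines())

print(expand_timetable("0430\n0500 25 50\n0615 34 51\n0708 26 43"))
```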

Regex to find strings not containing a specified value

I'm using notepad++'s regular expression search function to find all strings in a .txt document that do not contain a specific value (HIJ in the below example), where all strings begin with the same value (ABC in the below example).
How would I go about doing this?
Example
Every String starts with ABC
ABC is never used in a string other than at the beginning,
ABCABC123 would be two strings --"ABC" and "ABC123"
HIJ may appear multiple times in a string
I need to find the strings that do not contain HIJ
Input is one long file with no line breaks, but does contain special characters (*, ^, #, ~, :) and spaces
Example Input:
ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J
Example Input would be viewed as the following strings
ABC1234HIJ56
ABC7#HIJ
ABC89
ABCHIJ0ABE:HIJ
ABC12~34HI456J
The Third and Fifth strings both lack "HIJ" and therefore are included in the output, all others are not included in the output.
Example desired output:
ABC89
ABC12~34HI456J
I am 99% new to RegEx and will be looking more into it in the future, as my job description suddenly changed earlier this week when someone else in the company left suddenly, and therefore I have been doing this manually by searching (ABC|HIJ) and going through the search function's results looking for "ABC" appearing twice in a row. Supposedly the former employee was able to do this in an automated way, but left no documentation.
Any help would be appreciated!
This question is a repeat of a prior question I asked, but I was very very bad at formatting a question and it seems to have sunk beyond noticeable levels.
You can find the items you want with:
ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+(?=ABC|$)
Note: in this first pattern, you can replace (?=ABC|$) with (?!HIJ)
pattern details:
ABC
(?: # non-capturing group
[^HA]+ # all that is not a H or an A
| # OR
H(?!IJ) # an H not followed by IJ
|
A(?!BC) # an A not followed by BC
)*+ # repeat the group
(?=ABC|$) # followed by "ABC" or the end of the string
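To check the idea against the example input in Python, here is a hedged sketch. Possessive quantifiers such as *+ are only available in Python 3.11 and later, so this version swaps in a different technique: a lookahead before every character that refuses to step onto "ABC" or "HIJ". The function name items_without_hij is hypothetical.

```python
import re

def items_without_hij(text):
    # "ABC", then characters that start neither "ABC" nor "HIJ",
    # and the item must end right before the next "ABC" or at the end.
    pattern = re.compile(r"ABC(?:(?!ABC|HIJ).)*(?=ABC|$)")
    return pattern.findall(text)

sample = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
print(items_without_hij(sample))
```

On the example input this returns the third and fifth strings, matching the desired output in the question.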
Note: if you want to remove everything except the items you want, you can use this search and replace:
search: (?:ABC(?:[^HA]+|H(?!IJ)|A(?!BC))*+HIJ.*?(?=ABC|$))+|(?=ABC)
replace: \r\n
you could use this pattern
(ABC(?:(?!HIJ).)*?)(?=ABC|\R)
( # Capturing Group (1)
ABC # "ABC"
(?: # Non Capturing Group
(?! # Negative Look-Ahead
HIJ # "HIJ"
) # End of Negative Look-Ahead
. # Any character except line break
) # End of Non Capturing Group
*? # (zero or more)(lazy)
) # End of Capturing Group (1)
(?= # Look-Ahead
ABC # "ABC"
| # OR
\R # <line break>
) # End of Look-Ahead
You can use the following expression to match your criterion:
(^ABC(?:(?!HIJ).)*$)
This starts with ABC and looks ahead (negative) for HIJ pattern. The pattern works for the separated strings.
For a single line pattern (as provided in your question), a slight modification of this works (as follows):
(ABC(?:(?!HIJ).)*?)(?=ABC|$)
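If a script is an option, a two-step alternative avoids the trickier lookaheads: first cut the input into strings that each start with ABC, then keep the ones without HIJ. This is a sketch in Python, not a Notepad++ pattern, and the function names are made up for the example.

```python
import re

def split_on_abc(text):
    # Each item is "ABC" followed by everything up to the next "ABC".
    return re.findall(r"ABC(?:(?!ABC).)*", text)

def keep_without_hij(text):
    # Plain substring test on each item, no regex needed for the filter.
    return [item for item in split_on_abc(text) if "HIJ" not in item]

sample = "ABC1234HIJ56ABC7#HIJABC89ABCHIJ0ABE:HIJABC12~34HI456J"
print(split_on_abc(sample))
print(keep_without_hij(sample))
```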

Find & Replace digit by digit and space in Sublime Text

I have lot of digits and I want to add spaces in between them, like so:
0123456789 --> 0 1 2 3 4 5 6 7 8 9
I'm trying to do this using the search and replace function in Sublime Text. What I've tried so far is using \S to find all characters (there are only digits so it doesn't matter) and \1\s to replace them. However, this deletes the digits and replaces them with s. Does anybody know how to do this?
You can use a combination of Lookahead and Lookbehind assertions to do this. Use Ctrl + H to open the Search and Replace, enable Regular Expression, input the following and click Replace All
Find What: (?<=\d)(?=\d)
Replace With: a single space character
Explanation:
(?<= # look behind to see if there is:
\d # digits (0-9)
) # end of look-behind
(?= # look ahead to see if there is:
\d # digits (0-9)
) # end of look-ahead
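The same zero-width match works in Python's re module (3.7 or later, where substitutions on adjacent empty matches behave as expected). A minimal sketch, with the hypothetical function name space_out_digits:

```python
import re

def space_out_digits(text):
    # Match the empty position between two digits and insert a space there.
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(space_out_digits("0123456789"))
```

Because nothing is consumed by the match, no digit is lost and no space is added after the last digit.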
Press Ctrl + H to open the Search and Replace dialog (make sure you click the .* at the left to enable regular expression mode):
Search for: (\d)(?!$)
Replace with: $1[space]
The regular expression (\d)(?!$) matches all the digits except the one at the end of the line.
You could use the pattern (\d)(?=\d) and replace with $1 followed by a space.