How to split multi-line text by regex

I have a multi-line text:
SUBJECT=Testing001
TEXT=TestingLine001-Test
TEXT=TestingLine002-Test
REFER=Reference001
SUBJECT=Testing002
TEXT=TestingLine003-Test
SUBJECT=Testing003
TEXT=TestingLine004-Test
REFER=Reference002
I just want to split it into text blocks, where each block starts with a "SUBJECT" line (three blocks in this case), like this:
SUBJECT=Testing001
TEXT=TestingLine001-Test
TEXT=TestingLine002-Test
REFER=Reference001

SUBJECT=Testing002
TEXT=TestingLine003-Test

SUBJECT=Testing003
TEXT=TestingLine004-Test
REFER=Reference002

(?=\bSUBJECT\b)(?!^)
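You can split on this pattern. It matches the empty position right before each SUBJECT, except at the very start of the text. See the demo: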
https://regex101.com/r/mG8kZ9/9
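As a rough illustration, the pattern above can be tried in Python. This is only a sketch: the question does not name a language, the function name split_blocks is made up for this example, and splitting on a zero-width match needs Python 3.7 or later.

```python
import re

# Zero-width boundary: a position followed by SUBJECT, but not the very
# start of the string (without re.MULTILINE, ^ only matches at position 0).
BLOCK_BOUNDARY = re.compile(r"(?=\bSUBJECT\b)(?!^)")


def split_blocks(text):
    """Split text into blocks that each begin with a SUBJECT line."""
    return BLOCK_BOUNDARY.split(text)


sample = (
    "SUBJECT=Testing001\n"
    "TEXT=TestingLine001-Test\n"
    "TEXT=TestingLine002-Test\n"
    "REFER=Reference001\n"
    "SUBJECT=Testing002\n"
    "TEXT=TestingLine003-Test\n"
    "SUBJECT=Testing003\n"
    "TEXT=TestingLine004-Test\n"
    "REFER=Reference002\n"
)

for block in split_blocks(sample):
    print(block)
```

Each block keeps its trailing line break because the split point is an empty position, so nothing is consumed from the text.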

Related

RegEx in Notepad++ to find lines with less or more than n pipes

I have a large pipe-delimited text file that should have one 3-column record per line. Many of the records are split up by line breaks within a column.
I need to do a find/replace to get three, and only three, pipes per line/record.
Here's an example (I added the line breaks (\r\n) to demonstrate where they are and what needs to be replaced):
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\r\n
\r\n
on to multiple lines|More text|\r\n
09-1234AS|\r\n
||\r\n
\r\n
56-1234|Some text|Some more text\r\n
|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
The caveat is that I need to retain those mid-record line breaks for the target system. They need to be replaced with \.br\. So the final result of the above should look like this:
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\.br\\.br\on to multiple lines|More text|\r\n
09-1234AS|\.br\||\.br\\r\n
56-1234|Some text|Some more text\.br\|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
As you can see the mid-record line breaks have all been replaced with \.br\ and the end-of-line line breaks have been retained to keep each three-column/pipe record on its own line. Note the last record's text, explaining how each line/record begins. I included that in case that would help in building a regex to properly identify the beginning of a record.
I'm not sure if this can be done in one find/replace step or if it needs to be (or just should be) split up into a couple of steps.
I had the thought to first search for |\r\n, since all records end with a pipe and a CRLF, and replace those with dummy text !##$. Then search for the remaining line breaks with \r\n, which will be mid-column line breaks and replace those with \.br\, then replace the dummy text with the original line breaks that I want to keep |\r\n.
That worked for all but records that looked like the third record in the first example, which has several line breaks after a pipe within the record. In such a large file as I am working with it wasn't until much later that I found that the above process I was using didn't properly catch those instances.
You can use
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)\K\R+
Replace with \\.br\\. Details:
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?) - either the end of the previous match (\G(?!^(?<!.))) or (|) the start of a line, two digits, a -, one or more digits, zero or more uppercase letters, a |, then zero or more chars other than |, as few as possible, and then an optional sequence of | and zero or more chars other than |, as few as possible (^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)
\K - discard the text matched so far, so that only the line breaks that follow are replaced
\R+ - one or more line breaks.
If you need to remove empty lines after this, use Edit > Line Operations > Remove Empty Lines.
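If the file is too large to handle comfortably in the editor, the same cleanup can be scripted. The sketch below is a different, line-based technique, not the regex above (Python's re module has no \G, \K or \R). It assumes the record-start shape described in the question (two digits, a dash, four digits, up to three uppercase letters, then a pipe) and treats every other physical line as a continuation of the previous record. The function name and its parameters are made up for this example.

```python
import re

# Assumed record-start shape, taken from the question's description.
RECORD_START = re.compile(r"\d{2}-\d{4}[A-Z]{0,3}\|")


def join_broken_records(text, marker="\\.br\\", eol="\r\n"):
    """Join continuation lines onto their record, separated by marker."""
    lines = text.split(eol)
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final line break

    records = []
    for line in lines:
        if RECORD_START.match(line) or not records:
            records.append(line)            # a new record begins here
        else:
            records[-1] += marker + line    # mid-record line break
    return "".join(record + eol for record in records)
```

A continuation line that happens to begin with something shaped like a record ID would be treated as a new record, so the result is worth spot-checking by counting pipes per line.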

How to split a string based on empty/blank lines?

I'm writing a C++ application (Qt Widgets) that is supposed to parse an .srt subtitle file. Each part of the file is separated by an empty line, like this:
1
00:00:08,000 --> 00:00:11,000
[Line]

2
00:00:56,034 --> 00:00:57,492
[Line]
[Another line]

3
00:01:13,676 --> 00:01:15,420
[Line]
Basically, I want to read the entire file to a QString, and split it by empty lines into QString array, each item containing one of those sections like this:
2
00:00:56,034 --> 00:00:57,492
[Line]
[Another line]
However, I cannot figure out how to do this. I tried splitting the string by \r and \n, but that split everything into separate lines, not by empty lines.
This is the routine I had in mind to get the data from the .srt file:
1. Read all of the contents of the file to a QString (named something along the lines of content).
2. Split the QString by empty lines, and append to a QStringList (named something along the lines of sections).
3. For each item in sections, split the second line by the --> identifier, and assign indexes 0 and 1 to QString variables called startTime and endTime, respectively.
4. Take the rest of the lines (everything after line 2 is the subtitle text), and append them to a QString called subtitleText.
5. Add all the gathered information to an SrtSubtitle instance, and append it to QList<SrtSubtitle>.
How can I achieve this?
New lines are usually represented as \n.
To split the string when there are 2 new lines without anything between them, you can use \n\n as delimiter.
I would improve upon ziarra's answer. You certainly want the solution to be robust and work also with Windows line endings which are "\r\n" instead of "\n". In that case ziarra's solution would not suffice.
So my proposal is to do it in two steps:
replace all occurrences of "\r\n" with "\n"
split the text by "\n\n" (as ziarra suggests)
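The two steps can be sketched as follows. The question is about Qt and C++; this sketch uses Python only to show the logic, and the function name parse_srt and the dictionary keys are made up for the example. In Qt the same two steps would presumably map onto QString::replace and QString::split.

```python
def parse_srt(content):
    """Split .srt text into sections on blank lines and pull out the fields."""
    # Step 1: normalise Windows line endings so a blank line is always "\n\n".
    normalized = content.replace("\r\n", "\n").strip("\n")

    subtitles = []
    # Step 2: split on empty lines.
    for section in normalized.split("\n\n"):
        lines = section.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip sections that do not look like a subtitle entry
        start, end = (part.strip() for part in lines[1].split("-->", 1))
        subtitles.append({
            "start": start,
            "end": end,
            "text": "\n".join(lines[2:]),  # everything after line 2
        })
    return subtitles
```

Sections separated by more than one blank line are skipped by the malformed-section check in this sketch; a stricter parser would need to handle that case explicitly.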

Mass regex search-and-replace BETWEEN patterns

I have a directory with a bunch of text files, all of which follow this structure:
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
- Again, some list items of random text
- Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
And I need to run a replace operation (let's say, I need to prepend CCC at the beginning of the line, just after the dash) on only those "list items", which are between PATTERN_A and PATTERN_B. The problem is they aren't really much different from the text above PATTERN_A, or below PATTERN_B, so an ordinary regex can't really catch them without also affecting the remaining text.
So, my question would be, what tool and what regex should I use to perform that replacement?
(Just in case, I'm fine with Vim, and I can collect those files in a QuickFix for a further :cdo, for example. I'm not that good with awk, unfortunately, and absolutely bad with Perl :))
Thanks!
If I have understood your question, you can do this quite easily with a pattern-range selection and the general substitution form of sed (stream editor). For example, in your case:
$ sed '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' file
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
(note: to substitute in place within the file add the -i option, and to create a backup of the original add -i.bak which will save the original file as file.bak)
Explanation
/PATTERN_A/,/PATTERN_B/ - select lines between PATTERN_A and PATTERN_B
s/^\([ ]*-\)/\1CCC/ - substitute (general form 's/find/replace/') where find is from beginning of line ^ capturing text between \(...\) that contains [ ]*- (any number of spaces and a hyphen) and then replace with \1 (called a backreference that contains all characters you captured with the capture group \(...\)) and appending CCC to its end.
Look things over and let me know if you have questions or if I misinterpreted your question.
You can get the same result with Perl, using the range operator (..):
> perl -pe 's/^(\s*-)/${1}CCC/ if /PATTERN_A/../PATTERN_B/' mass_replace.txt
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
>
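For comparison, here is the same range logic spelled out step by step. This is only a sketch in Python, not something either answer proposes; the function name and its parameters are made up, and the markers are matched as plain substrings rather than regexes.

```python
import re

# Leading spaces plus the dash, mirroring the sed pattern ^\([ ]*-\).
LIST_DASH = re.compile(r"^( *-)")


def prefix_items_in_range(lines, start="PATTERN_A", end="PATTERN_B", prefix="CCC"):
    """Insert prefix after the leading dash, only between start and end."""
    inside = False
    result = []
    for line in lines:
        if inside:
            result.append(LIST_DASH.sub(lambda m: m.group(1) + prefix, line))
            if end in line:
                inside = False  # the end line closes the range (inclusive)
        elif start in line:
            inside = True       # the start line opens the range (inclusive)
            result.append(LIST_DASH.sub(lambda m: m.group(1) + prefix, line))
        else:
            result.append(line)
    return result
```

Like the sed address range, the end marker is only looked for on lines after the one that opened the range.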

regex to match two characters and one equal operator

I'm reading a file.
I split that file on a row separator, but the row separators are not constant.
Here is my file example.
CN=100
adshnxhndxghdngfhdsfs
CN=200
jhnxrhewxrgewhgxew
XN=300
jskhd sa
ZP=400
jhnxrhewxrgewhgxew
XX=500
jhnxrhewxrgewhgxew
My row separators in the file above look like CN=, ZP=, XX=, XN=. There can be more, because it's going to be a very big file.
What regex can I use to find row separators of this pattern (CN=, ZP=, XX=, XN=)?
Simple as
^\w{2}=\d+
See a demo on regex101.com (mind the multiline modifier and tell us your programming language, though!)
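Since the question does not name a language, here is a minimal sketch in Python of how the pattern could be applied with the multiline flag. The function names are made up for this example.

```python
import re

# re.MULTILINE makes ^ match at the start of every line, not just the text.
ROW_SEPARATOR = re.compile(r"^\w{2}=\d+", re.MULTILINE)


def find_row_separators(text):
    """Return every separator line such as CN=100 or ZP=400."""
    return ROW_SEPARATOR.findall(text)


def split_records(text):
    """Return (separator, body) pairs, one per row."""
    matches = list(ROW_SEPARATOR.finditer(text))
    records = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        records.append((match.group(0), text[match.end():end].strip()))
    return records
```

Note that \w also matches digits and underscores, so [A-Z]{2} would be a tighter choice if the separators are always two uppercase letters.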

Format a text file by regex match and replace

I have a text file that looks like the following:
Chanelle
Jettie
Winnie
Jen
Shella
Krysta
Tish
Monika
Lynwood
Danae
2649
2466
2890
2224
2829
2427
2816
2648
2833
2453
I need to make it look like this
Chanelle 2649
Jettie 2466
... ...
I tried a lot in the Sublime editor but couldn't figure out a regex to do that. Can somebody demonstrate whether it can be done?
I tested the following in Notepad++ but it should work universally.
Use this as the search string:
(?:(\s+[A-Za-z]+)(\r?\n))((?:\s*[A-Za-z]*\r?\n)+)\s+(\d+)
and this as the replacement:
$1 $4$2$3
Running the replace once handles one line at a time. If you run it repeatedly, it keeps pairing lines until no matching lines are left.
Alternatively, you can use this as the replacement if you want to have the values aligned by tabs, but it's not going to match in all cases:
$1\t\t$4$2$3
While the regex answer by SeinopSys will work, you don't need a regex to do this. Instead, you can take advantage of Sublime's multiple cursors.
1. Place your cursor at the beginning of line 1, then hold Shift and press ↓ to select all the names.
2. Hit Ctrl+Shift+L (Selection -> Split into Lines) to split the selection into lines.
3. Press Ctrl+C to copy.
4. Place your cursor on line 11 (the first number line) and press Ctrl+Shift+↓ (Windows/OS X) or Alt+Shift+↓ (Linux) to place a cursor at the beginning of each number line.
5. Hit Ctrl+V to paste the names before the numbers.
You can now delete the names at the top and you're all set. Alternatively, you could use Ctrl+X to cut the names in step 3.
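If the file is too long to fix by hand, the same pairing can be sketched as a small script. This is not part of either answer. It assumes that every non-blank line is either a name or an all-digit number, and that the first name belongs to the first number, the second to the second, and so on; the function name is made up for this example.

```python
def pair_names_with_numbers(text):
    """Pair the n-th name line with the n-th number line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    names = [line for line in lines if not line.isdigit()]
    numbers = [line for line in lines if line.isdigit()]
    if len(names) != len(numbers):
        raise ValueError("expected the same number of names and numbers")
    return "\n".join(f"{name} {number}" for name, number in zip(names, numbers))


print(pair_names_with_numbers("Chanelle\nJettie\nWinnie\n2649\n2466\n2890\n"))
```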