How do I search and replace with a multiline regular expression? - regex

I need to edit a large EDI message, which is basically a text file of thousands of short lines. The message must comply with the standard specification, but it doesn't, because some segments contain an extra QTY+220 line that must be removed. The QTY+220 line must be deleted only in segments that have 4 QTY lines. Here is a correct segment:
SEQ++79'
MOA+9:1.87945:NOK'
QTY+58:0'
QTY+136:5'
QTY+260:5'
Here is an incorrect segment:
SEQ++365'
MOA+9:1.31896:NOK'
QTY+58:0'
QTY+136:4'
QTY+220:0' <---- this line must be removed
QTY+260:4'
The complete text file is about 75,000 lines, and there are more than 2,200 of these validation errors in the XML schema. I tried a search and replace in Notepad++ with regular expressions, but I can't make it match over multiple lines. Here is a pattern that matches a single line:
^QTY.*'
But I want it to find matches of 4 QTY-lines and remove the 3rd line. How can I do that?

Use \n to match linebreaks.
In your example, replace
(QTY[^\n]+)\n(QTY[^\n]+)\n(QTY[^\n]+)\n(QTY[^\n]+)
with
$1\n$2\n$4
to remove the third line
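For readers who would rather script the fix than run it in an editor, here is a minimal Python sketch of the same replacement. It assumes the message uses plain \n line breaks; the variable names are made up for illustration, and the two sample segments are copied from the question.

```python
import re

# Two segments from the question: the first is valid,
# the second has the extra QTY+220 line.
edi = "\n".join([
    "SEQ++79'", "MOA+9:1.87945:NOK'", "QTY+58:0'", "QTY+136:5'", "QTY+260:5'",
    "SEQ++365'", "MOA+9:1.31896:NOK'", "QTY+58:0'", "QTY+136:4'", "QTY+220:0'", "QTY+260:4'",
])

# Four consecutive QTY lines; each group holds one line without its line break.
four_qty = re.compile(r"(QTY[^\n]+)\n(QTY[^\n]+)\n(QTY[^\n]+)\n(QTY[^\n]+)")

# Keep lines 1, 2 and 4 of every match, which drops the third QTY line.
fixed = four_qty.sub(r"\1\n\2\n\4", edi)
print(fixed)
```

Python writes the backreferences as \1, \2 and \4 where Notepad++ uses $1, $2 and $4. The segment with only three QTY lines is left alone because the pattern needs four in a row.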

Related

Regex problem with three dots in text (...): the search pattern fails after the first dot

The pattern fails to match anything after the first dot of the "..." sequence.
test expression at regex101
https://regex101.com/r/7fJG8W/1
search pattern
"(<textarea [a-zA-Z=\s\d\w\"]*>)([|~a-zA-Z\s\w\d.:,\ #\-()/[\]?=!*%$#&`\{}\^;:'\"+ ]*)"gm
sample text
<textarea id="source">
```markdown
1. First ordered list item
2. Another item* Unordered sub-list.
1. Actual numbers don't matter, just that it's a number1. Ordered sub-list
4. And another item.
You can have properly indented paragraphs within list items. Notice the blank line above, and the leading spaces (at least one, but we'll use three here to also align the raw Markdown).
To have a line break without a paragraph, you will need to use two trailing spaces...
...Note that this line is separate but within the same paragraph.⋅⋅
⋅⋅⋅(This is contrary to the typical GFM line break behavior, where trailing spaces are not required.)
The reason is simple: those characters at the end of the line (⋅⋅) are not in the character class you have in the regular expression. But there are many, many more characters that would be allowed in a textarea element.
It is not advised to parse HTML with a regular expression, but to use a DOM parser instead.
But a quick fix for the actual problem you encountered is to make the match stop at </textarea> and nothing else:
(<textarea\b[^>]*>)((?!</textarea>).)*
This regex needs the s flag so . can also match newline characters.
See regex101.com
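As a rough illustration, the same idea can be tried in Python. This is only a sketch: the sample string is a shortened stand-in for the question's text, and the inner non-capturing group is my own adaptation so that group 2 holds the whole textarea body rather than just its last character.

```python
import re

# Shortened stand-in for the question's sample; "⋅⋅" is outside the original character class.
html = (
    '<textarea id="source">\n'
    "1. First ordered list item\n"
    "...Note that this line is separate.⋅⋅\n"
    "</textarea>\n"
    "<p>not part of the textarea</p>\n"
)

# re.S is Python's name for the s flag: "." also matches newlines.
# The lookahead is checked before every character, so the body stops at </textarea>.
pattern = re.compile(r"(<textarea\b[^>]*>)((?:(?!</textarea>).)*)", re.S)

match = pattern.search(html)
opening_tag, body = match.group(1), match.group(2)
print(repr(opening_tag))
print(repr(body))
```

Because the body is defined by where it ends rather than by an allow-list of characters, any character can appear inside the textarea.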

How do I insert new lines in a fixed length text file in NiFi using the ReplaceText processor with regular expressions?

I have a fixed-length text file that has only one line containing all of the 500-character records. I want to insert new lines so that the file contains only one 500-character record per line, with the number of lines being the number of records. Using a regular expression, I put
(.{1000})*
in the Search Value field, and
((.{500})<shift+enter>(.{500}))*
in the Replacement Value field. The resulting file contains just the literal replacement.
((.{500})
(.{500})(.{500})
(.{500}))
What am I missing in the configuration? Is there something wrong in my regular expressions?
Thanks, everybody, for your help. Here is what I did to get it to work.
Search Value:
(.{500})(.{500})
Replacement Value:
$1<shift+enter>$2<shift+enter>
The $1 backreferences the first (.{500}) group, and the $2 backreferences the second (.{500}) group. The <shift+enter> introduces the newlines.
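The backreference idea can be checked outside NiFi. The sketch below uses Python's re module, not the ReplaceText processor, and the three records are made up. It uses a single group instead of two, which is my own simplification so that an odd number of records is split as well.

```python
import re

RECORD_LEN = 500  # record width from the question

# Three made-up records of 500 characters each, joined on a single line.
data = "A" * RECORD_LEN + "B" * RECORD_LEN + "C" * RECORD_LEN

# \1 re-inserts the captured record; the "\n" after it ends the line.
per_line = re.sub(r"(.{%d})" % RECORD_LEN, r"\1\n", data)

lines = per_line.splitlines()
print(len(lines), [len(line) for line in lines])
```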

Regex Match Paragraph Pattern

I am trying to match a paragraph pattern and I am having trouble.
The pattern is:
[image.gif]
some words, usually a few lines
name
emailaddress<mailto:theemailaddress#mail.com>
I tried matching everything between the gif image and the <mailto: but this happens multiple times in the file meaning I get a bad result.
I tried it with this
(?<=\[image.gif\].*?(\[image.gif\])).*?(?=<mailto:)
Is there a way to use Regex to match the general layout of a paragraph?
"the general layout of a paragraph" needs a better definition. Given the lack of an input plus expected output, I'm having to guess what you want here. I'm also guessing that you will accept any language. Here's perl, almost certainly not a language you're familiar with.
Assumed input:
do not match this line
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
don't match this line either
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
this line is also not for matching
Expected output:
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
Solution using perl:
#!/usr/bin/perl -n0777
my $sep = "";
while (/(\[image\.gif\].*?<mailto:[^>]*>(\r)?\n)/gms) {
    print $sep . $1;
    $sep = "---$2\n";
}
Perl is the king of regex languages; many would say that's all it is good for. Here, we use the -n0777 option (-0777 is Perl's slurp mode) to tell it to read the entire contents of each file and run the code on it as the default variable.
$sep starts blank because there's nothing to separate until the second match.
Then we loop over each block of text that matches the regex:
matches a literal [image.gif]
then matches as little content following that as possible
then matches a literal <mailto: and continues until the next >
then captures the line break (including optional support for DOS line endings)
(see full regex explanation and example at regex101)
We then print the match and finally set the separator to three dashes and a line break (DOS line endings added when needed).
Now you can run it:
$ perl answer.pl input.txt
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
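For comparison, roughly the same extraction can be written in Python. This is a sketch under the same assumed input as above, not a translation the answer itself provides; it handles \n and \r\n line endings but always prints the separator with \n.

```python
import re

text = (
    "do not match this line\n"
    "[image.gif]\n"
    "some words, usually a few lines\n"
    "Bobert McBobson\n"
    "emailaddress<mailto:bobertmb#example.com>\n"
    "don't match this line either\n"
    "[image.gif]\n"
    "another few words\n"
    "on another few lines\n"
    "Bobina Robertsdaughter\n"
    "emailaddress<mailto:bobinard#example.info>\n"
    "this line is also not for matching\n"
)

# Literal [image.gif], then as little as possible up to <mailto:...> and its line break.
block = re.compile(r"\[image\.gif\].*?<mailto:[^>]*>\r?\n", re.S)

blocks = block.findall(text)  # no capture groups, so findall returns the full matches
print("---\n".join(blocks), end="")
```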

Changing order of csv-file entries with regex replacement in Notepad++

I am trying to change the order of the entries in a *.csv file with Notepad++'s built-in find/replace function. This is what the file looks like now:
ABC;DEF;Here comes some long text with ,.- in it;true;false;
QWE;RTY;Here comes some long text with ,.- in it;true;false;
And this is what it should look like after find/replace:
DEF;Here comes some long text with ,.- in it;ABC;true;false;;
RTY;Here comes some long text with ,.- in it;QWE;true;false;;
So column #1 should move to the position of column #3, and columns #2 and #3 should each shift one position to the left.
What I tried so far:
I tried to capture the first three columns with a regular expression in the find field, put parentheses around them, and reorder them with the $ sign in the replace field. But my regex matches nearly the whole line, not only the first three columns. What am I doing wrong? Here is my regex:
([A-Z]{3})\;([A-Z]{3})\;(.*[^\;])\;
The first two columns and the following ; are selected properly, so the problem must be in the third pair of parentheses. But I have no clue what the problem is. The third expression should match everything except ; and be terminated by a ;.
The content of the replacement field should be $2;$3;$1;, I guess that's right.
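Escaping the semicolons is unnecessary, but it is not what breaks the match. The real problem is the greedy .* in the third group: it runs to the end of the line and backtracks only as far as the last ;, so the group swallows the remaining columns as well. Use a negated character class instead. Search for ^(?s)([A-Z]{3};)([A-Z]{3};)([^\n\r;]*;) and replace it with $2$3$1
The character class also excludes the line delimiters \r and \n, so a line with fewer columns cannot match across a line break. The start-of-line anchor ^ keeps the match on the first three columns even if a line has more columns.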
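A quick way to check the replacement outside Notepad++ is a small Python sketch like the one below. It also shows what the question's third group captures on the sample data. The variable names are illustrative, and re.M is needed in Python so that ^ matches at the start of every line.

```python
import re

rows = (
    "ABC;DEF;Here comes some long text with ,.- in it;true;false;\n"
    "QWE;RTY;Here comes some long text with ,.- in it;true;false;\n"
)

# The question's pattern: the greedy .* in group 3 runs past the third column.
greedy = re.compile(r"([A-Z]{3})\;([A-Z]{3})\;(.*[^\;])\;")
greedy_third = greedy.match(rows).group(3)

# Negated character class: group 3 cannot cross a ";" or a line break.
columns = re.compile(r"^([A-Z]{3};)([A-Z]{3};)([^\n\r;]*;)", re.M)
reordered = columns.sub(r"\2\3\1", rows)

print(greedy_third)
print(reordered, end="")
```

On this input the question's third group captures everything up to the last semicolon of the line, while the negated class stops at the end of the third column.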

I need a regex to repair lines split at column 80

Problem - Multiline, Semi-colon delimited file has been split at column 79 or 80 (not always the same for some strange reason).
It seems to me that a Regex would be the appropriate solution, so now I have two problems.
Lines are:
1sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
2sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
3sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
4sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
.....|.....|.....|.....|.....|.....|[cr][lf]
... 10000 rows ...
Where the pipe is a non-space whitespace character (possibly a tab)
I need:
1sdf.............................mnopqr........xyz......................[cr][lf]
2sdf.............................mnopqr........xyz......................[cr][lf]
3sdf.............................mnopqr........xyz......................[cr][lf]
4sdf.............................mnopqr........xyz......................[cr][lf]
I managed to get the job done with
Pass 1:
Replace ^\s*\r\n with \rxxx\n
// Replace Blank lines with \rxxx\n leaving
1sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
[cr]xxx[lf]
2sdf.............................mno[cr][lf]
pqr........xyz......................[cr][lf]
Pass 2:
Replace \r\n with [empty]
//leaving:
1sdf.............................mnopqr........xyz......................[cr]
xxx[lf]
2sdf.............................mnopqr........xyz......................
Pass 3:
Replace \rxxx\n with \r\n
//leaving:
1sdf.............................mnopqr........xyz......................[cr][lf]
2sdf.............................mnopqr........xyz......................
And the rest of the cleanup is trivial.
Is there any way of doing this in a single step? The output is from a common financial application, and I'd rather fix the files myself than try to get many clients to adjust their output.
In Notepad++ (using regular expression mode) you can use this:
Find what: \r\n(\s*\r\n)?
Replace with: \1
Then run "Replace All" exactly once. However, make sure you update to Notepad++ 6! Otherwise matching \r\n with a regular expression won't work in Notepad++.
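The single-pass replacement can be tried in Python as a rough sketch. The miniature input is made up (two records, real whitespace instead of the dots and pipes in the question), and a replacement function stands in for \1 so that an unmatched group yields an empty string.

```python
import re

# Made-up miniature of the broken file: each record is split in two and
# followed by a whitespace-only line; every line ends in CRLF.
raw = (
    "1sdf..mno\r\n" "pqr..xyz\r\n" "  \t  \r\n"
    "2sdf..mno\r\n" "pqr..xyz\r\n" "  \t  \r\n"
)

# A CRLF, optionally followed by a whitespace-only line (captured with its CRLF).
line_break = re.compile(r"\r\n(\s*\r\n)?")

# Replace each match with group 1, or with nothing when the group did not take part.
joined = line_break.sub(lambda m: m.group(1) or "", raw)

records = [line.rstrip() for line in joined.split("\r\n") if line]
print(records)
```

Note that the whitespace of the filler line survives as trailing whitespace on each joined record, which is why the sketch strips it afterwards.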
Assuming that ^\s*\r\n matches the line you want to remove, as you said above, I believe you could do it by replacing \r\n\s*\r\n|\r\n with \r\n
It's my first regex, so if it doesn't work, don't be too harsh :-)
Good luck