Given the following ; delimited string
a;; z
toy;d;hh
toy
;b;;jj
z;
d;23
d;23td
;;io;
b y;b;12
z
a;b;bb;;;34
z
and this regex
^(?!(?:(a|d))(?:;|$)).*(\s*\z|$)\R*
I want to match the full lines whose first column is not a or d and remove them, so that substituting the matches with an empty string leaves this:
a;; z
d;23
d;23td
a;b;bb;;;34
Please see the demo
In the Substitution panel, there is a 5th empty line, which needs to be removed.
I have used \s*\z in the past for this purpose, but as implemented here it does not seem to work.
Any help is appreciated
I think the reason your regex won't remove the last newline is that this newline belongs to the end of the last line you want to keep, so you can't remove it without matching it.
So I rewrote the regex to match the line you want to keep, but also to include everything above and below the match that is not another match.
The key difference is using a conditional to only match the newline of the group you want to keep if it is followed by another match.
regex (linebreaks for readability):
((?!(a|d)).*(\s*\z|$)\R*)*
(^(a|d).*(?(?=\R*(.*\s*\R+)*(a|d))\R))
((?!(a|d)).*(\s*\z|$)\R*)*
replace with $4 -->
a;; z
d;23
d;23td
a;b;bb;;;34
For readability I removed some of the non-capturing and string separator logic you had, if they are necessary you can add them back in.
Logic breakdown of the parts:
(?(?=\R*(.*\s*\R+)*(a|d))\R) is the conditional: it matches the newline \R only if (?) it is followed by (?=) any number of non-matching lines (.*\s*\R+)*, each ending in a newline, and then a line starting with (a|d).
The middle part (^(a|d).*(?(?=\R*(.*\s*\R+)*(a|d))\R)), which contains this conditional, is the group used in the replacement, $4. It matches lines starting with (a|d), and every match except the last also includes the newline at the end of its line.
The beginning and end of the regex, ((?!(a|d)).*(\s*\z|$)\R*)*, are identical and match all the unneeded content so that it gets removed.
You could match what you want to remove, and capture in a group what you want to keep.
To avoid removing the newline sequences between the lines you keep, you could use an if clause (? that only matches 0+ Unicode newline sequences when no later line starts with [ad];
In the replacement use group 1 $1
^(?:(?![ad];).*\R*)*|^([ad];.*(?:\R[ad];.*)*)(?(?![\s\S]*\R[ad];)\R*)
Explanation
^ Start of line
(?: Non capture group
(?![ad];) If the line does not start with a or d followed by ;
.*\R* Match the whole line and 0+ times a unicode newline sequence
)* Close group and repeat 0+ times to match all consecutive lines
| Or
^ Start of line
( Capture group 1
[ad];.* Match a or d followed by ; and the rest of the line
(?: Non capture group
\R[ad];.* Match newline, a or d followed by ; and the rest of the line
)* Close group and repeat 0+ times to match all consecutive lines
) Close group 1
(? If clause, only match a unicode newline sequence if the [ad]; pattern does not occur anymore
(?! Negative lookahead, assert what follows is not
[\s\S]*\R[ad]; Match the [ad]; pattern
) Close lookahead.
\R* If the assertion is true, Match 0+ unicode newline sequences
) Close if clause
See a Regex demo
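Python's built-in re module supports neither \R nor lookahead conditionals, so the pattern above cannot be run there unchanged. As a way to cross-check the expected result, here is a minimal sketch that swaps the regex substitution for plain line filtering. The helper name keep_a_or_d and the SAMPLE constant are made up for this illustration.

```python
import re

SAMPLE = """a;; z
toy;d;hh
toy
;b;;jj
z;
d;23
d;23td
;;io;
b y;b;12
z
a;b;bb;;;34
z"""

def keep_a_or_d(text: str) -> str:
    """Keep only lines whose first ;-delimited column is exactly 'a' or 'd'."""
    kept = [line for line in text.splitlines()
            if re.match(r"[ad](?:;|$)", line)]
    # Joining with "\n" leaves no trailing empty line after the last kept row.
    return "\n".join(kept)

print(keep_a_or_d(SAMPLE))
```

Because the kept lines are joined rather than substituted in place, the trailing-newline problem from the question does not arise in this variant.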
I need to come up with a regular expression with flavor PCRE. It must be a regular expression <
I want to grab all lines of text that end in a newline character up until I encounter <zz> where zz is a digit enclosed in '<' and '>'.
e.g.
111a z
222 aset
333 //+
12 <zz> 11
abc
def
It would need to capture "111a z", "222 aset", "333 //+" in this case [and nothing else].
Right now I have ^(?!.*<zz>)[^\n]+(?=\n) but it's pretty far off from what it needs to be.
For clarification purposes, the regex I was using shows <zz>, but definitely looking for a digit enclosed in angle brackets.
Would really appreciate some help.
Edit
This is /really/ difficult for me, because at least one of the answers looks like it does the job. I'll try to mark one... Thank you, everyone.
You could repeat matching all lines including a Unicode newline sequence while the <\d+> pattern does not occur in the line.
\A(?:(?!.*<\d+>).*\R)+
Explanation
\A Start of string
(?: Non capture group
(?!.*<\d+>) Negative lookahead, assert that the pattern <\d+> does not occur
.*\R Match any char except a newline followed by matching a Unicode newline sequence
)+ Close the non capturing group, and repeat it 1+ times to match at least a single line
Regex demo
If the <\d+> has to be present, you could assert that with a positive lookahead at the end
\A(?:(?!.*<\d+>).*\R)+(?=.*<\d+>)
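To try both variants outside a PCRE tester, here is a small Python sketch. It is an adaptation rather than the exact patterns: Python's re has no \R, so a plain \n stands in for it, and the marker <7> in the sample is an arbitrary stand-in for the question's <zz>. The function name leading_lines is hypothetical.

```python
import re

# Python adaptations of the two patterns: "\n" replaces PCRE's \R.
LEADING = re.compile(r"\A(?:(?!.*<\d+>).*\n)+")
LEADING_REQUIRED = re.compile(r"\A(?:(?!.*<\d+>).*\n)+(?=.*<\d+>)")

def leading_lines(text, require_marker=False):
    """Return the lines at the start of text that precede the first <digits> line."""
    pattern = LEADING_REQUIRED if require_marker else LEADING
    m = pattern.match(text)
    return m.group(0).splitlines() if m else []

sample = "111a z\n222 aset\n333 //+\n12 <7> 11\nabc\ndef\n"
print(leading_lines(sample))
print(leading_lines("abc\ndef\n", require_marker=True))
```

With require_marker=True, a text that never contains a <digits> marker produces no match at all, whereas the first pattern would return every line.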
I'm not sure why you're using a negative lookahead, but I think you want a positive lookahead. This lets you only match the line if you see the <zz> in a lookahead. I would solve the problem using something like this:
^.*(?=.*(?:\n.*)*<\d+>)\n
^ Anchors match to beginning of line (like yours)
.* Matches all the characters it can. In this case it matches the whole line because it has to satisfy the \n at the end.
(?=...) Performs a positive lookahead (makes sure the string exists somewhere ahead)
.*(?:\n.*)* Allows any number of characters on any number of lines
<\d+> Only matches one or more digits enclosed in angle brackets
\n ensures that there is a newline at the end of the line.
I have assumed that the text may have more than one line that contains one or more digits bracketed in '<' and '>', and that those lines are not themselves to be matched.
You can use the following expression to match the lines of interest.
^(?!.*<\d+>).*\r?\n(?=[\s\S]*?<\d+>)
PCRE Demo
The regex engine performs the following operations.
^ match beginning of line
(?! begin negative lookahead (prevent matching a line with '<12>')
.* match 0+ characters other than newlines
<\d+> match '<', 1+ digits, '>'
) end negative lookahead
.* match 0+ characters other than newlines
\r?\n match newline optionally preceded by '\r'
(?= begin positive lookahead
[\s\S]*? match 0+ characters (incl. newlines), non-greedily
<\d+> match '<', 1+ digits, '>'
) end positive lookahead
'\r', a carriage return, will be present if the file was produced when using the Windows operating system.
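This expression uses only constructs that Python's re module also understands, so it can be checked with a short sketch. The multiline flag is assumed, as in the demo; the function name lines_before_marker and the sample strings are made up for illustration.

```python
import re

# Same expression as above; MULTILINE makes ^ match at every line start.
PATTERN = re.compile(r"^(?!.*<\d+>).*\r?\n(?=[\s\S]*?<\d+>)", re.MULTILINE)

def lines_before_marker(text):
    """Lines without a <digits> marker that still have such a marker somewhere after them."""
    return [m.group(0).rstrip("\r\n") for m in PATTERN.finditer(text)]

print(lines_before_marker("111a z\n222 aset\n333 //+\n12 <7> 11\nabc\ndef\n"))
print(lines_before_marker("a\nx <1>\nb\ny <2>\nc\n"))
```

The second call illustrates the stated assumption: with several marker lines, every unmarked line before the last marker is returned, and the marker lines themselves are skipped.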
Given a string such as below:
word.hi. bla. word.
I want to construct a regex which will match all "."s preceded by "word" and any other non space character
So, in the above example I would want the first, second and last dots to be matched.
While matching the first and last dots would be easy with global flag (/(?:word.*)\K./gU), I'm not sure how to construct a regex that would also match the second dot.
Appreciate any pointers.
You might match word and then get all consecutive matches using the \G anchor excluding matching whitespace chars or a dot.
(?:\bword|\G(?!\A))[^.\s]*\K\.
In parts
(?: Non capture group
\bword Match word preceded by a word boundary
| Or
\G(?!\A) Assert the position at the end of the previous match, not at the start
) Close non capture group
[^.\s]* Match 0+ occurrences of any char except . or a whitespace char
\K Clear the match buffer (forget what is matched until now)
\. Match a dot
Regex demo
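Python's re module has neither \G nor \K, so the pattern above does not run there as written. A rough two-step equivalent, assuming the same rules (a chain starts at \bword and continues over runs of characters that are neither dots nor whitespace), could look like the sketch below. The function name dots_after_word is hypothetical.

```python
import re

# One "chain": word, then one or more groups of non-dot, non-space chars ending in a dot.
CHAIN = re.compile(r"\bword(?:[^.\s]*\.)+")

def dots_after_word(text):
    """Return the indexes of the dots that the \\G-based pattern would match."""
    positions = []
    for run in CHAIN.finditer(text):
        # Every dot inside a chain corresponds to one match of the original pattern.
        positions.extend(run.start() + i
                         for i, ch in enumerate(run.group())
                         if ch == ".")
    return positions

print(dots_after_word("word.hi. bla. word."))
```

For the sample string this reports the first, second and last dots; the dot after bla is skipped because the whitespace before it breaks the chain.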
I'm trying to clean up some assembly code and I'd like to convert the spaces between the instruction and argument to tabs. However, I'd like to avoid inadvertently converting the spaces between the words in the comments after the semicolon.
So here is an example with some lines of code:
label: bcf INTCON,2 ; comment comment and more comment.
btfss PORTA,2
The closest I've come is (?<=^).+(?=;). This not only matches EVERYTHING between the beginning of the line and the semicolon, but also includes all semicolons except the very last one (imagine lines of code containing comments that were themselves commented out). It also doesn't take lines without comments into consideration.
How do I do this?
Maybe,
^([^:\r\n]+:)\s*([^\r\n]+?)(?:$|\s{2,})(;.*)?$
and a replacement of,
$1 $2 $3
might be OK to start with.
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
Ctrl+H
Find what: ^(\w+:)\h+|^\h+
Replace with: (?1$1\t:\t\t)
check Wrap around
check Regular expression
Replace all
Explanation:
^ # beginning of line
(\w+:) # group 1, 1 or more word characters followed by colon
\h+ # 1 or more horizontal spaces
| # OR
^ # beginning of line
\h+ # 1 or more horizontal spaces
Replacement:
(?1 # if group 1 exists, then
$1\t # content of group 1 and a tab
: # else
\t\t # 2 tabs
) # end conditional replace
If you want to change the space between bcf and INTCON,2 to 2 tabs, you might match the 2 "words" and make sure that they don't start with a ;
^(?:\S+:)?\h+(?!;)\S+\K\h+(?=[^\s;])
^ Start of string
(?:\S+:)? Optionally match 1+ non whitespace chars and :
\h+(?!;) Match 1+ horizontal whitespace chars, then assert what is on the right is not a ;
\S+\K Match 1+ non whitespace chars, forget what was matched
\h+ Match 1+ horizontal whitespace chars (this match will be replaced)
(?=[^\s;]) Assert what is on the right is not a whitespace char or ;
In the replacement use 2 tabs \t\t
Regex demo
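For engines without \K and \h, the same idea can be expressed with a capture group. The sketch below is a Python adaptation under these assumptions: group 1 stands in for everything before \K, [ \t] approximates \h, and the multiline flag is set. The function name tabify and the sample indentation are made up for illustration.

```python
import re

# Group 1 plays the role of the text before \K; [ \t] approximates \h.
PATTERN = re.compile(r"^((?:\S+:)?[ \t]+(?!;)\S+)[ \t]+(?=[^\s;])", re.MULTILINE)

def tabify(source):
    """Replace the spaces between instruction and argument with two tabs."""
    return PATTERN.sub(lambda m: m.group(1) + "\t\t", source)

code = (
    "label: bcf INTCON,2 ; comment comment and more comment.\n"
    "    btfss PORTA,2\n"
    "    ; a comment-only line\n"
)
print(tabify(code))
```

Only the gap after the instruction is rewritten; the spaces inside comments and the comment-only line are left untouched.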
Edit
If you want to find the first space between non whitespace chars, you might use
^.*?\S\K (?=\S)
Let's suppose we have, in a text file, many rows containing each one multiple names joined with ";" delimiter except last name (which doesn't end with it).
We can use the following regex :
^(\w+;)+$ // Not good
The previous regex won't work because it forces the last name, and hence the whole row, to end with a ";" as well.
You could add matching a single \w+ after it. If you don't need the capturing group, you might make it non capturing.
This way you are repeating matching word characters followed by a ; and end the match with word characters.
^(?:\w+;)+\w+$
Explanation
^ Start of string
(?: Non capturing group
\w+; Match 1+ word chars followed by ;
)+ Close non capturing group and repeat 1+ times
\w+ Match 1+ word chars
$ End of string
Regex demo
If a single word should also match, you could repeat the group 0+ times using * instead of +
^(?:\w+;)*\w+$
Regex demo
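Both variants work unchanged in Python's re module, so the difference between + and * is easy to check. The names STRICT and LOOSE and the sample rows are made up for this sketch.

```python
import re

STRICT = re.compile(r"^(?:\w+;)+\w+$")  # at least two names
LOOSE = re.compile(r"^(?:\w+;)*\w+$")   # a single name is accepted too

rows = ["alice;bob;carol", "alice;bob;", "alice"]
for row in rows:
    # Print whether each row is accepted by the strict and the loose pattern.
    print(row, bool(STRICT.match(row)), bool(LOOSE.match(row)))
```

A row with a trailing ; is rejected by both patterns, and the single-name row is accepted only by the * variant.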
I have the following template :
1251 Left Random Text I want to fill
It can go through multiple lines
As you can see
9841 Right Again we see a lot of random text with 3115 numbers
And this also goes
To multiple lines
0121 Right
5151 Right This one is just one line
I was wrong
9731 Left This one is just a line
5123 NA Instruction 5151 was wrong
4113 Right Instr 9841 was correct
We checked
I want to have 3 groups:
1251
Left
Random Text I want to fill
It can go through multiple lines
As you can see
I'm using
(\d+)\s(\w+)\s(.*)
but it stops at the current line only (so I get only Random Text I want to fill in group 3, although I want including As you can see)
If I'm using Single line flag I get only 1 match for each group, group 3 almost being all
Here is live : https://regex101.com/r/W3x0mH/4
You could use a repeating group that matches all the following lines, asserting each time that the next line does not start with a digit:
(\d+)\s(\w+)\s(.*(?:\r?\n(?!\d).*)*)
Explanation
(\d+)\s(\w+)\s Match the first 2 groups
( Third capturing group
.* Match 0+ times any char except a newline
(?: Non capturing group
\r?\n(?!\d).* Match newline, assert what is on the right is not a digit
)* Close non capturing group and repeat 0+ times
) Close capturing group
Regex demo
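This pattern also runs as-is in Python, which makes it easy to inspect the three groups. The sketch below uses a shortened version of the template from the question with no trailing newline; the variable names are illustrative.

```python
import re

PATTERN = re.compile(r"(\d+)\s(\w+)\s(.*(?:\r?\n(?!\d).*)*)")

text = "\n".join([
    "1251 Left Random Text I want to fill",
    "It can go through multiple lines",
    "As you can see",
    "9841 Right Again we see a lot of random text with 3115 numbers",
    "And this also goes",
    "To multiple lines",
    "9731 Left This one is just a line",
])

# findall returns one (number, side, text) tuple per record.
records = PATTERN.findall(text)
for number, side, body in records:
    print(number, side, repr(body))
```

Numbers inside the free text, such as 3115, do not start a new record because they are consumed by group 3 before the next search begins.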
You may use this regex with a lookahead:
^(\d+)\s(\w+)\s(.*?)(?=\n\d|\z)
with DOTALL and MULTILINE modifiers.
Updated Regex Demo
RegEx Details:
^: Line start
(\d+): Match and capture 1+ digits in group #1
\s: match a whitespace
(\w+): Match and capture 1+ word characters in group #2
\s: match a whitespace
(.*?): Match 0 or more of any character (non-greedy) provided next lookahead assertion is satisfied
(?=\n\d|\z): Lookahead assertion to assert that we have a newline followed by a digit or there is end of input
Faster Regex:
If you are using this regex on a long string then you should also keep overall performance in mind, as a regex with the DOTALL modifier tends to get slow on large input. For that I suggest using this regex, which doesn't need the DOTALL modifier:
^(\d+)\s(\w+)\s(.*(?:\n.*)*?)(?=\n\d|\z)
RegEx Demo 2
On regex101 demo this regex takes just 181 steps as compared to first one that takes 1300 steps.
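As a quick check that the two versions capture the same groups, here is a Python sketch. It is an adaptation: \Z is used in place of \z, since \Z is the end-of-string anchor that all Python versions accept. The constant names and the sample text are made up, and step counts are not measured here.

```python
import re

# \Z replaces PCRE's \z; both mean "very end of the string".
DOTALL_VERSION = re.compile(r"^(\d+)\s(\w+)\s(.*?)(?=\n\d|\Z)",
                            re.DOTALL | re.MULTILINE)
FASTER_VERSION = re.compile(r"^(\d+)\s(\w+)\s(.*(?:\n.*)*?)(?=\n\d|\Z)",
                            re.MULTILINE)

text = (
    "5151 Right This one is just one line\n"
    "I was wrong\n"
    "9731 Left This one is just a line\n"
    "5123 NA Instruction 5151 was wrong"
)

slow = DOTALL_VERSION.findall(text)
fast = FASTER_VERSION.findall(text)
print(slow == fast)
for record in fast:
    print(record)
```

The second version consumes whole lines at a time instead of testing the lookahead before every character, which is presumably where the lower step count comes from.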
For the third group, repeat any character while using negative lookahead for ^\d, which would indicate the start of a new match:
(\d+)\s(\w+)\s((?:(?!^\d)[\s\S])*)
https://regex101.com/r/W3x0mH/5
You may try with this regex:
^(\d+)\s+(\w+)\s+(.*?)(?=^\d|\z)
^(\d+)\s+ The line begins with 1+ digits (^\d+), followed by 1+ whitespace characters (\s+)
(\w+)\s+ 1+ word characters (\w+, e.g. Left, Right, NA or something else), followed by 1+ whitespace characters (\s+)
(.*?) Matches everything until it finds a line beginning with a digit (^\d) or the end of the string (\z). This relies on the multiline and single-line (dotall) flags being enabled.
I think it fits your requirement....
Regex101