Using a regex to append to the end of non-blank lines - regex

I would think that this would be a common question, but I can't find anybody asking how to do this. There are people asking how to do the opposite (find blank lines). I want to add a <br><br> at the end of each non-blank line. For human readability, this document has blank lines between paragraphs.
(I don't want to replace the blank lines with <br><br>. I know this would achieve the same result, but for human readability and personal preference, I don't like how this makes the document one giant block of text.)
How can I write a regex that captures -- I don't know if this is the right word to use; maybe "groups"? -- the end of lines that aren't blank so that I can append to the end of them?
I am using Visual Studio Code, so I'd like this to work in its search/replace box.
I'm assuming that in the replacement box I'd need to say $some group number(s?), so I just said $x as a temporary placeholder. Here's what I've tried as search patterns:
^(?!:($))$
^(?!:(\S$))$
^(?!:([^\S]$))$
^(?!:([^\s]$))$
^(?!([^\S]+))$
All of these seem to grab the inverse of what I'm trying to find. I guess my strategy has been, between the beginning and end of the line, there shouldn't be only whitespace. But I'm pretty sure that's not what I'm saying.

You can use
Find What:      (\S)[^\S\n]*(\n)
Replace With: $1<br><br>$2
NOTE: The above replacement will not add the <br>s at the end of the last line if it is not blank. If you need that, use
Find What:      (\S)[^\S\n]*$
Replace With: $1<br><br>
See the regex demo. The first regex matches the last non-whitespace char on a line (capturing it in Group 1 to keep it), then matches horizontal whitespace (if any), and then captures the line break in Group 2 so that it is kept in the output.
Details
(\S) - Group 1: any non-whitespace char
[^\S\n]* - zero or more horizontal whitespace chars
(\n) - Group 2: line break.
$ - end of a line, used in the second regex in place of (\n) (note that the m flag, in its PCRE meaning, is always on by default in VSCode regex).
The replacement is $1<br><br>$2, Group 1 value + <br><br> + Group 2 value (if you use the first regex).
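To check the second pattern outside the editor, here is a minimal Python sketch. It assumes that Python's re module with re.MULTILINE is a fair stand-in for VSCode's always-on multiline behaviour; the helper name and the sample text are made up for illustration, and Python writes the group reference as \1 rather than $1.

```python
import re

# Last non-whitespace char on a line, then any trailing horizontal whitespace.
NON_BLANK_LINE_END = re.compile(r"(\S)[^\S\n]*$", re.MULTILINE)


def append_breaks(text: str) -> str:
    """Append <br><br> to every non-blank line (hypothetical helper)."""
    return NON_BLANK_LINE_END.sub(r"\1<br><br>", text)


sample = "First paragraph.  \n\nSecond paragraph.\nLast line"
print(append_breaks(sample))
```

Note that trailing spaces on a line are consumed by the match and therefore dropped, while blank and whitespace-only lines are left untouched.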

This works to retain the spaces at the end of lines:
Find: (?<=^.*)(\S+.*)
Replace: $1<br><br>

Related

How would I match all data between 2 symbols with Regex?

I'm trying to find all data (including and after) a dash (-) appears, only up to the first delimiter which is a colon.
Example data:
Input:
bart23-testaccount#test.test:Test:Test:Test
Desired output:
bart23:Test:Test:Test
I've done some research and found this regex, but it's not fit for purpose -(.*):
My purpose is for thousands of lines which are all in various types of order, however the purpose remains the same, highlight all text between the - and the first : (which I will then proceed to delete). I will be using Notepad++
I can answer any questions or make my post more specific if need be, it's kind of hard to explain.
In Notepad++ you can use regex find/replace. Look for:
^([^-]+)-[^:]+(:.*)$
which captures everything up to the first - in group 1, and everything after (and including) the first : in group 2, and replace with
\1\2
Using Notepad++, without any capture group:
Ctrl+H
Find what: -[^:]+
Replace with: LEAVE EMPTY
check Wrap around
check Regular expression
Replace all
Explanation:
- # a hyphen (by default, the first one in a line)
[^:]+ # 1 or more not colon
Result for given example:
bart23:Test:Test:Test
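Both replacements can be tried outside Notepad++ with a short Python sketch. This is only an illustration on the sample line from the question; count=1 is an assumption added here so that only the first hyphen in a line is touched, whereas Replace All in Notepad++ would replace every occurrence.

```python
import re

line = "bart23-testaccount#test.test:Test:Test:Test"

# Variant with capture groups: keep group 1 and group 2, drop what lies between.
with_groups = re.sub(r"^([^-]+)-[^:]+(:.*)$", r"\1\2", line)

# Variant without capture groups: delete the first hyphen and everything
# after it up to (but not including) the next colon.
without_groups = re.sub(r"-[^:]+", "", line, count=1)

print(with_groups)
print(without_groups)
```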

regex Pattern matching across all lines until end of file or string

I have been working on the following regex expression:
/(?<=\#Comment\{Annotation: key:START;\})( )/
which is designed to try and find an annotation that looks like: #Comment{Annotation: key:START;} in a text file. These annotations represent the possible lines where the file could be broken down into smaller files.
I am having trouble completing my capture group (if that is the right term for my last ( )) so that it captures all the remaining lines in the string (which might run to EOF) or stops when the next annotation fitting this pattern is detected.
I am hoping not to have to convert this to a line based approach with checks performed on each line...
I thought one of the following might have worked but so far nothing has:
\s
\Z
\s*(.*) --> this works in the sense that I can manually repeat this sequence to add each line, one at a time, but that's highly impractical
This regex should work:
(.*?)((\#Comment\{Annotation: key:START;\})|$)
See example online.
The (.*?) matches the text up until your separator expression. Then follows an expression which matches either your separator, or the end of the document ($).
For each match, the first group gives you the text before the separator, and the second group is the matched separator text.
This expression needs single-line mode s and global mode g.
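A minimal Python sketch of the same idea follows. It makes a few assumptions: re.DOTALL plays the role of single-line mode, finditer plays the role of global mode, and \Z replaces $ because Python's $ can also match just before a trailing newline. The function name and sample text are hypothetical.

```python
import re

SEPARATOR = r"#Comment\{Annotation: key:START;\}"
# Lazily capture text up to the next separator or the end of the string.
CHUNK = re.compile(r"(.*?)(" + SEPARATOR + r"|\Z)", re.DOTALL)


def split_on_annotations(text: str) -> list[str]:
    """Return the text chunks found between annotations (hypothetical helper)."""
    chunks = []
    for match in CHUNK.finditer(text):
        if match.group(0) == "":
            # Skip the empty match the engine may report at the very end.
            continue
        chunks.append(match.group(1))
    return chunks


sample = (
    "header\n"
    "#Comment{Annotation: key:START;}\n"
    "part one\n"
    "#Comment{Annotation: key:START;}\n"
    "part two\n"
)
print(split_on_annotations(sample))
```

Each chunk keeps its surrounding newlines, so the caller can decide whether to strip them before writing the smaller files.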

Find lines with same characters set

I have situation like this.
Car Driver
Cat Mouse
Door House
Driver Car
I need help with a regex to find all lines with the same set of characters or words, no matter how they are placed in the line.
Car Driver
Driver Car
Edited list:
A0JLS3 Q9NUA2 <
A0JLT2 Q9Y3C7
A0N0L5 P26441
A0N0Q1 O00626
A0N0Q1 P35626
A0PJF8 P27361
Q9NUA2 A0JLS3 <
EDIT: after taking a look at your file, it seems that there is one tab character after the first word and a variable number of tab characters after the second, so you must change the pattern to:
^(\w+)\h+(\w+)\h*$(?=(?>\R.*)*?\R(?:\1\h+\2|\2\h+\1)\h*$)
where \h stands for a horizontal whitespace character.
Since you seem to have huge files and I don't see how to avoid a reluctant quantifier in the lookahead assertion, you can try this modified pattern, where all the quantifiers are possessive (when possible) and all groups are atomic. It seems to be a little faster:
^(\w++)\h++(\w++)\h*+$(?=(?>\R.*+)*?\R(?>\1\h++\2|\2\h++\1)\h*+$)
Previous answer:
You can use this pattern:
^(\w+) (\w+)$(?=(?>\R.*)*?\R(?:\1 \2|\2 \1)$)
This will find lines that have a "duplicate line" with the same two words later in the text. If you want to use it to remove duplicates, keep in mind that this will preserve the last occurrence and remove the first.
pattern details:
^(\w+) (\w+)$ : this describes a whole line (note the anchors for start ^ and end $ of the line) and put each word in a capturing group (group 1 and group 2)
The second part of the pattern checks whether there is a "similar line" (a line with the same words) later in the text. Since it is embedded in a lookahead assertion ((?=...), i.e. "followed by"), this part isn't included in the match result.
(?>\R.*)*?: the lines up to the duplicate. \R stands for CRLF or LF, and .* matches all characters except newlines. The group is repeated with a lazy quantifier to stop at the first duplicate line. (Note that this works with a greedy quantifier too; the best choice depends on what your document looks like. For example, if duplicates are often at the end of the document, a greedy quantifier is a better choice.)
(?:\1 \2|\2 \1) describes the two possibilities using backreferences to group 1 and 2.
$ is added to ensure that the last word is whole. (otherwise something like A0N0L5 P26441 ... A0N0L5 P26441XXX will succeed)
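The pattern details above can be tried in Python with a small adaptation. This is a sketch under stated assumptions: Python's re module does not support \h or \R, so [ \t] and \n are used instead (LF line endings only), and the atomic group is replaced by a plain non-capturing group. The function name is hypothetical.

```python
import re

# A two-word line that is followed, somewhere later, by a line holding
# the same two words in either order.
PAIR_WITH_LATER_TWIN = re.compile(
    r"^(\w+)[ \t]+(\w+)[ \t]*$"
    r"(?=(?:\n.*)*?\n(?:\1[ \t]+\2|\2[ \t]+\1)[ \t]*$)",
    re.MULTILINE,
)


def lines_with_later_twin(text: str) -> list[str]:
    """Return lines that have a same-words twin further down (hypothetical helper)."""
    return [match.group(0) for match in PAIR_WITH_LATER_TWIN.finditer(text)]


sample = "Car Driver\nCat Mouse\nDoor House\nDriver Car"
print(lines_with_later_twin(sample))
```

As with the original pattern, only the earlier line of each pair is reported; the last occurrence has no twin after it and is not matched.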
I'm not sure exactly what you are trying to achieve. If you're looking for all lines containing both of the words Car and Driver, you can mark all lines containing this regular expression:
Car Driver|Driver Car
Here's a guide on regular expressions in Notepad++: http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=Regular_Expressions
And consider taking a look at the Stack Overflow Regular Expressions FAQ for some more useful information.

get a line with regex

I'm having trouble doing simple things with regex in dot net.
Suppose I want to find all lines that contain the word "pizza". I would think I would do the following:
^ .* pizza .* $
The idea is the first character indicates the start of a line, the dollar sign indicates the end of the line, and the dot-star indicates any number of characters.
This doesn't seem to work.
Then I tried something else that doesn't work either. I thought I would find all routines in my visual basic project that start with "Sub Page_Load" and end with "End Sub". I did a search for:
Sub Page_Load .* End Sub
But this found pretty much EVERY subroutine in the project.
In other words, it didn't limit itself to the Page_Load sub.
So I thought I'd be smart and notice that every End Sub is at the end of a line, so all I have to do is put a $ after it like this:
Sub Page_Load .* End Sub$
But that finds absolutely zero strings.
So what am I doing wrong? (One note: I put extra blanks around .* here so you can see it, but normally the blanks would not be there.)
You may need a non-greedy approach. Try this:
^.*?pizza.*$
So, now complete new answer.
Search for the word "pizza" (not "pizzas")
If you have a multiline string and want to find a single row, you need to use the Multiline option. That changes the behaviour of the anchors ^ and $ so that they match the start and the end of each row.
To ensure to match only the complete word "pizza" and no partial match, use word boundaries
If you don't use the Singleline option, you don't need to worry about greediness
So your regex would be:
Regex optionRegex = new Regex(@"^.*\bpizza\b.*$", RegexOptions.Multiline);
For the Sub Page_Load.*End Sub thing, you need to match more than one line:
Use the Singleline option to allow . to also match newline characters.
You need non-greedy (lazy) matching behaviour for the quantifier.
So your regex would be:
Regex optionRegex = new Regex(@"Sub Page_Load.*?End Sub", RegexOptions.Singleline);
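The effect of the two options can also be seen in a short Python sketch, where re.MULTILINE and re.DOTALL behave analogously to RegexOptions.Multiline and RegexOptions.Singleline. The sample strings are invented for illustration and are not from the question.

```python
import re

menu = "We sell pizza here\nWe sell pizzas too\nNo food today\n"

# MULTILINE: ^ and $ match at the start and end of each line.
pizza_lines = re.findall(r"^.*\bpizza\b.*$", menu, flags=re.MULTILINE)

source = (
    "Sub Page_Load()\n"
    "    x = 1\n"
    "End Sub\n"
    "Sub Other()\n"
    "    y = 2\n"
    "End Sub\n"
)

# DOTALL: . also matches newlines; the lazy .*? stops at the first End Sub.
page_load = re.search(r"Sub Page_Load.*?End Sub", source, flags=re.DOTALL)

print(pizza_lines)
print(page_load.group(0))
```

With a greedy .* instead of .*?, the second search would run on to the last End Sub in the text, which is the behaviour described in the question.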

Vim multiline regex gives overlapping matches

I was surprised when I noticed that my greedy multiline regex was giving overlapping matches in Vim. The regex is designed to match an entire block of text, or consecutive non-blank lines.
The regex apparently matched everything I expected it to (highlight looked correct), but when using n to skip to the next match instead of skipping to the next block, it went to the next line in the current block.
Here is the regex I was using (equivalent to (.+\n){1,} for most regex engines):
\(.\+\n\)\{1,}
This should match at least one non-empty line, and as many consecutive non-empty lines as possible, here is an example text file:
block 1
some stuff
more stuff

block 2
foo bar
baz qux
After applying this regex (/\(.\+\n\)\{1,}+Enter) the two blocks are highlighted correctly, but I expect there to be only two matches of the regex, one for each block. However when I press n to advance to the next regex match it appears that each non-empty line matches the regex, so my cursor would start on the first line, n would take it to the second line, then third, then to the start of block 2 etc.
How can I change my regex so that I see the expected behavior of each block being a single match so that n advances to the next block, instead of the next line?
I am also interested in knowing if this behavior is in the documentation somewhere, or if there is an option to change this behavior. Note that when using the same regex in a search/replace the behavior is what I expect (replacement would only be applied twice, once for each block).
The following regex seems to work:
\(\%^\|^\n\)\zs\(.\+\n\)\+
Explanation:
\( # start of group
\%^ # beginning of file
\| # OR
^\n # a blank line
\) # end of group
\zs # start matching here
\(.\+\n\)\+ # at least one non-blank line
By using the very magic option the length can be reduced a bit:
\v(%^|^\n)\zs(.+\n)+
Looking forward to seeing if anyone can come up with a shorter solution!
zigdon's answer helped me to understand better why the behavior is the way it is. When n is used to jump to the next match it searches for the first match of the regex from the cursor's current position, even if the next matching position was included in the previous match. This is why anchoring the regex to the start of the block appears to be necessary.
Thanks to Nolen Royalty for helping me get rid of an unnecessary lookahead in the first group.
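For comparison, the one-match-per-block result that the question expected (and that Vim gives in a search/replace) is what an engine scanning for non-overlapping matches reports. The following is a Python sketch of that scan, not Vim; the sample text reproduces the two blocks from the question with a blank line between them.

```python
import re

text = (
    "block 1\n"
    "some stuff\n"
    "more stuff\n"
    "\n"
    "block 2\n"
    "foo bar\n"
    "baz qux\n"
)

# One or more consecutive non-empty lines; each scan resumes after the previous match.
blocks = [match.group(0) for match in re.finditer(r"(?:.+\n)+", text)]

for block in blocks:
    print(repr(block))
```

Because each scan resumes where the previous match ended, the lines inside a block are never offered as new starting points, which is the difference from pressing n in Vim.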
Since your match says "match one or more non-empty lines", it can certainly match multiple times within the same paragraph. To fix this, you can specify that the cursor should be placed at the end of the match; this means the next match will start from the end of the paragraph. You can do this with the \zs zero-width atom, available in Vim:
\zs Matches at any position, and sets the start of the match there: The
next char is the first char of the whole match. |/zero-width|
So your match will become:
\(.\+\n\)\{1,}\zs