How can I delete everything that is not matched by a pattern - regex

I have a text document in Notepad++ with information separated by line, but want to delete everything but every seventh line. This line is always matched by the pattern (\d{4} :.*?\r\n).
How can I delete everything that does not match this pattern so that I just get every seventh line separated by \r\n?

You could maybe try:
^(?!\d{4} :)[\s\S]*?(?=\r\n\d{4} :)
regex101 demo
[Note, I couldn't put \r in there because I couldn't insert carriage returns in the input box somehow...]
^ is a beginning of line anchor and matches the beginning of a line.
(?!\d{4} :) is a negative lookahead and will make the whole regex match only if there's no \d{4} : at the beginning of the line (the position being indicated by ^).
[\s\S]*? is a character class that matches any character, including newlines. The quantifier *? is lazy, so matching stops as soon as possible (as determined by what follows).
(?=\r\n\d{4} :) is a positive lookahead, and matches only when there's a \r\n\d{4} : ahead.
If I understood your question correctly, this is what you're looking for. All lines except every 7th line get deleted, leaving only one empty line between each of those 7th lines.
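Outside Notepad++, the same substitution can be tried in Python as a rough sketch. The sample text is made up, and Python's re module stands in for Notepad++'s regex engine, so edge cases may behave differently:

```python
import re

# Pattern from the answer above; re.MULTILINE lets ^ match at every line start.
PATTERN = re.compile(r"^(?!\d{4} :)[\s\S]*?(?=\r\n\d{4} :)", re.MULTILINE)

# Hypothetical sample: two unwanted lines before each line to keep.
sample = "a\r\nb\r\n1234 : keep1\r\nc\r\nd\r\n5678 : keep2\r\n"

# Replace every unwanted run with nothing.
result = PATTERN.sub("", sample)
print(repr(result))
```

Each removed run leaves its final \r\n behind, which is the empty line mentioned above. Lines after the last matching line are not removed, because the lookahead needs a following \d{4} : line.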

Open the search dialogue and select the Mark tab. In the Find what field enter a search string to find the lines to be kept. Make sure that Bookmark line and Regular expression are selected, then click Mark all. Next visit the menu => Search => Bookmark => Remove unmarked lines.
The question says the lines to be retained match (\d{4} :.*?\r\n). The capture brackets ( and ) are not needed as the capture is not used. Searches for \r\n may often be rewritten as searches for $, i.e. an end-of-line. The .*?\r\n part only looks for the first end-of-line after the earlier items, so the search may be reduced to \d{4} :.
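The effect of "Mark all" followed by "Remove unmarked lines" can be approximated in Python. This is only an illustration of the idea, not how Notepad++ implements it, and keep_marked_lines is a hypothetical helper name:

```python
import re

def keep_marked_lines(text, pattern):
    """Keep only the lines in which the pattern is found (like bookmarked lines)."""
    keep = re.compile(pattern)
    # splitlines(keepends=True) preserves each line's own \r\n or \n ending.
    return "".join(
        line for line in text.splitlines(keepends=True) if keep.search(line)
    )

sample = "x\r\n1234 : a\r\ny\r\n5678 : b\r\n"
print(repr(keep_marked_lines(sample, r"\d{4} :")))
```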

Related

Find everything except the search match

I need to write a notepad ++ regex to match everything besides my search criteria.
For example, if I have
James Bond (E1R1)
I have a regex to match E1R1. But I need to reverse it so I can get rid of everything besides E1R1.
So far I have ^(?!(?<=\().+?(?=\))$).*$. But it seems to match everything.
Use
^.*\(([^()\n\r]*)\)$|^(?!.*\(([^()\n\r]*)\)$).*\R?
Replace with $1.
See regex proof.
The expression finds lines that end with a pair of round brackets and removes all text outside those brackets. It removes entirely any line that has no brackets at the end.
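A sketch of the same replacement in Python, under two assumptions: the text uses \n line endings, and (?:\r?\n)? is used as a stand-in because Python's re module does not support \R. The sample lines other than the first are made up:

```python
import re

# Same alternation as above, with (?:\r?\n)? standing in for \R?.
PATTERN = re.compile(
    r"^.*\(([^()\n\r]*)\)$|^(?!.*\(([^()\n\r]*)\)$).*(?:\r?\n)?",
    re.MULTILINE,
)

def keep_bracket_content(text):
    # Group 1 is set only for lines ending in (...); other lines become "".
    return PATTERN.sub(lambda m: m.group(1) or "", text)

sample = "James Bond (E1R1)\nno brackets here\nJane Doe (X2Y3)\n"
print(keep_bracket_content(sample))
```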
You could match from an opening till closing parenthesis and skip that match. Then match any single character which should be replaced by an empty string.
\([^()\r\n]*\)(*SKIP)(*F)|.
Explanation
\([^()\r\n]*\) Match from an opening till closing parenthesis (....)
(*SKIP)(*F) Skip the match
| Or
. Match any character except a newline
Regex demo
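Python's re module has no (*SKIP)(*F), so the following is only an approximation of the idea: capture the parenthesised part, match every other character singly, and put back only the captured part. The function name is hypothetical:

```python
import re

# Group 1 captures "(...)"; the bare dot matches any other single character.
PATTERN = re.compile(r"(\([^()\r\n]*\))|.")

def drop_outside_parens(text):
    # Keep the bracketed match, replace everything else with nothing.
    return PATTERN.sub(lambda m: m.group(1) or "", text)

print(drop_outside_parens("James Bond (E1R1)"))
```

Unlike the first answer, this keeps the parentheses themselves, and newlines stay in place because the dot does not match them.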

Regex - how do I match this?

I've been trying hard to get this Regex to work, but am simply not good enough at this stuff apparently :(
Regex - Trying to extract sources
I thought this would work... I'm trying to get all of the content where:
It starts with ds://
Ends with either carriage return or line feed
That's it! Essentially I'm going to then do a negative lookahead such that I can remove all content that is NOT conforming to above (in Notepad++) which allows for Regex search/replace.
Search for lines that contain the pattern, and mark them
Search menu > Mark
Find what: ds://.*\R
check Regular expression
Check Mark the lines
Find all
Remove the non marked lines
Search menu > Bookmark
Remove unmarked lines
You don't need to add the \w specifier to look for a word after the ds:// in the look ahead. Removing that and altering the final specification from "zero or one carriage return, then zero or one newline" to "either a carriage return or a newline" in capture group should do it for you:
(?=ds:\/\/).*(?:\r|\n)
Update: Carriage return or Line feed group does not need to be captured.
Update 2: The following regex will actually work for your proposed use case in the comments, matching everything but the pattern you described in the question.
^(?:(?!ds:\/\/.*(?:\r|\n)).)*$
Your regex (?=ds:\w+).*\r?\n? does not match because the content contains ds:// and \w does not match a forward slash. To make your regex work you could change it to:
(?=ds://\w+).*\r?\n? which can be shortened to ds://.*\R?
Note that you don't have to escape the forward slash.
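A quick check of that explanation in Python, using a made-up ds:// value (the name example/source is not from the question):

```python
import re

line = "ds://example/source\n"  # hypothetical sample line

# \w cannot match the "/" that follows "ds:", so the lookahead never succeeds.
original = re.search(r"(?=ds:\w+).*\r?\n?", line)

# Unescaped forward slashes are fine; this matches up to the end of the line.
fixed = re.search(r"ds://.*", line)

print(original, fixed.group())
```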
If you want to do a find and replace to keep the lines that contain ds:// you could use a negative lookahead:
Find what
^(?!.*ds://).*\R?
Replace with
Leave empty
Explanation
^ Start of the line
(?!.*ds://) Negative lookahead to assert the line does not contain ds://
.* Match any character 0+ times
\R? An optional unicode newline sequence to also match the last line if it is not followed by a newline
See the Regex demo
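The find-and-replace above can be sketched in Python. Since Python's re module lacks \R, (?:\r?\n)? is used here as a stand-in, and the sample lines are invented:

```python
import re

# ^(?!.*ds://).*\R? with (?:\r?\n)? substituted for \R?.
PATTERN = re.compile(r"^(?!.*ds://).*(?:\r?\n)?", re.MULTILINE)

def keep_ds_lines(text):
    # Delete every line that does not contain ds://, including its line ending.
    return PATTERN.sub("", text)

sample = "foo\r\nds://a\r\nbar\r\nds://b\r\n"
print(repr(keep_ds_lines(sample)))
```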
Here you go, Andrew:
Regex: ds:\/\/.*
Link: https://regex101.com/r/ulO9GO/2
Let me know if you have any questions.

What regular expression will select all lines that have more than one punctuation mark?

I have this regular expression:
\..*?\.
But it only selects between two periods, not every punctuation mark, and it also selects across multiple lines.
Would modifying this expression to only take in one line at a time work somehow, if there's also a way to group punctuation into where we have a period?
Just to make things simpler, at this time I only need the expression to recognize periods, exclamation points, and question marks. I don't need it to register commas.
Thanks to Nathan and Agumander below, I know to substitute [.!?] in place of \. now, but I'm still having trouble with the other half of my question.
Just to make sure I'm being more clear, using [.!?].*?[.!?]\s will highlight text between punctuation marks, but across multiple lines. So I can't use it to bookmark only the lines that have multiple punctuation marks.
Placing characters inside a pair of square brackets will match to any of the enclosed characters. In your case you'd want [.?!]
If you want to match any sentence that has two of these, then you'll be looking for a pair of [.!?] separated by zero or more of any character.
The regex that matches strings with more than one of the set [.?!] would then be [.!?].*[.!?]
To make . match newlines, you'd add the s modifier to your regex.
...so the full regex would be /[.!?].*[.!?]/s
Ok I figured it out. Thanks to Agumander and Nathan above I substituted [.!?] in for the two \. in my original regex:
\..*?\. became [.!?].*[.!?]
Putting \s at the end of the regex made it select the entire document in Notepad++.
The last issue I had was remembering to turn off "matches newline."
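The effect of the "matches newline" option can be reproduced in Python, where re.DOTALL plays the role of the s modifier. The two-line sample is made up so that neither line has two punctuation marks on its own:

```python
import re

text = "One. two\nthree!\n"

# Without DOTALL the dot stops at the newline, so no single line matches.
per_line = re.search(r"[.!?].*[.!?]", text)

# With DOTALL the dot crosses the newline and the match spans both lines.
across_lines = re.search(r"[.!?].*[.!?]", text, re.DOTALL)

print(per_line, repr(across_lines.group()))
```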
Agumander, I think you're asking for a regex that basically finds multiple punctuation marks on a single line. So here's one way to do it.
Here's the text I'm going to match. The regex will match the first line in its entirety, but will not match the second.
Here's a line with multiple punctuation. The entire line will match the regex!
This line does not have multiple punctuation.
Regex
^.*(?:[\.?!].*){2,}$
Explanation
^ -- Start matching at the beginning of a line
.* -- match any character 0 or more times
(?: -- start a new non-capturing group
[.?!] -- find a character matching a period, question mark, or exclamation point.
.* -- match any character 0 or more times
)
{2,} -- repeat the previous group 2 or more times. This is how we ensure there's at least two punctuation marks before considering it a match.
$ -- end of line anchor, basically stop matching at the end of a line
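The regex can be checked against the two sample lines in Python. This is a sketch with re.MULTILINE assumed so that ^ and $ work per line:

```python
import re

PATTERN = re.compile(r"^.*(?:[\.?!].*){2,}$", re.MULTILINE)

text = (
    "Here's a line with multiple punctuation. The entire line will match the regex!\n"
    "This line does not have multiple punctuation.\n"
)

# findall returns whole matches here because the only group is non-capturing.
matches = PATTERN.findall(text)
print(matches)
```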

how can i remove every thing before ":" string in notepad++?

I have a file like this in notepad++
n1:n1:n1
n1:n1:n2
n1:n1:n3
i want to delete everything before the first ":" including the ":" itself
and be like this
n1:n1
n1:n2
n1:n3
and thanks..
hope i was clear enough in my explanation of my problem
Ken White :
thanks but the problem is my file have over 10k lines and the first "n1" changes to "n2" after about 1000 lines
and then it become "o1" instead of "n1"
i want to delete everything before the first ":"
Use Replace and use a regular expression to find any chars at the start of the line that are not a colon :, followed by a colon, and replace them with nothing
Find what: ^([^:]+:)(.)
Replace with: \2
Search Mode: Regular Expression
This actually answers your question and doesn't assume anything about what is before or after the first colon.
1. The first ^ indicates that the search must start at the beginning of a line
2. Parentheses are groupers and savers. They're not actually needed for this first bit, since you are just deleting the stuff before the colon, but this makes it parallel with Ken White's solution
3. Square brackets [ ] indicate which characters you want to look for
   a. The second ^ right after the first square bracket switches from chars you want to look for to chars you do not want to look for
   b. So [^:] means look for any char other than a colon
4. The plus + means look for 1 or more occurrences of this set of chars
   a. If some lines may start with a colon, and you still want to replace that colon, you'd want to look for 0 or more occurrences of non-colon chars at the start of a line
   b. To do that, replace the + with a *
5. Select the colon (so it will be deleted also)
6. Right Paren ends the first group
7. Left Paren starts the 2nd group
8. Dot . says look for any char. If you don't have this here, then it will delete everything before the first colon and then the next set will be at the start of the line, so you'll delete too much. You could technically put a plus or star here, but you don't need it.
9. Right Paren ends the 2nd group
10. In the Replace with box, \2 (that's a backslash or reverse solidus if you prefer) will take the contents of the 2nd group and replace everything it found with those contents
Here is the test input and output:
Input (stuck some tabs and spaces and other stuff in there for good measure)
n1:n1:n1
n1:n1:n2
n1:n1:n3
n2:n1:n3
n4:n7:n5
o1:n1:n1:m1:m1:l1:l7b:l1011
z99:
-- Here's some more data
o1:o2:o3:o4:o5
:o2:o3:o4:o5:o6
o1:o1:o3:x37:n99
n2:o1:o3:o44:z76
n4:n7:n5:u72:j9:
Output
n1:n1
n1:n2
n1:n3
n1:n3
n7:n5
n1:n1:m1:m1:l1:l7b:l1011
z99:
o2:o3:o4:o5
:o2:o3:o4:o5:o6
o1:o3:x37:n99
o1:o3:o44:z76
n7:n5:u72:j9:
Notice it removed any line without a colon, which in some cases may be preferable. It also missed the two lines I threw in there with a colon at the beginning or end of the line.
If you wanted to leave these blank lines in, add an \r\n in the brackets in step 3 above (and again these are backslashes). Then it will look for any char that's not a colon or end-of-line (Step 3), followed by a colon (Step 5). Therefore, it only removes chars on the line with a colon. Change Find what to this string:
Find what: ^([^:\r\n]+):(.)
To catch the lines starting with a colon or with nothing after the first colon, change the plus to a star and add a question mark after the dot:
Find what: ^([^:\r\n]*):(.?)
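To see the difference between the first and the last pattern, here is a small Python sketch. Python's re module stands in for Notepad++'s engine, the shortened inputs are made up, and strip_prefix is a hypothetical helper:

```python
import re

FIRST = re.compile(r"^([^:]+:)(.)", re.MULTILINE)        # original pattern
FINAL = re.compile(r"^([^:\r\n]*):(.?)", re.MULTILINE)   # last pattern above

def strip_prefix(pattern, text):
    # \2 puts back the single character captured after the first colon.
    return pattern.sub(r"\2", text)

# [^:]+ also matches newlines, so the colon-free line is swallowed.
print(repr(strip_prefix(FIRST, "n1:n1:n1\n-- note\no1:o2:o3\n")))

# Excluding \r\n keeps colon-free lines; * and .? handle ":" at either end.
print(repr(strip_prefix(FINAL, "n1:n1:n1\nz99:\n-- note\n:o2:o3\n")))
```

With the last pattern, a line such as z99: becomes an empty line, since everything up to and including its only colon is removed.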

Vim multiline regex gives overlapping matches

I was surprised when I noticed that my greedy multiline regex was giving overlapping matches in Vim. The regex is designed to match an entire block of text, or consecutive non-blank lines.
The regex apparently matched everything I expected it to (highlight looked correct), but when using n to skip to the next match instead of skipping to the next block, it went to the next line in the current block.
Here is the regex I was using (equivalent to (.+\n){1,} for most regex engines):
\(.\+\n\)\{1,}
This should match at least one non-empty line, and as many consecutive non-empty lines as possible, here is an example text file:
block 1
some stuff
more stuff
block 2
foo bar
baz qux
After applying this regex (/\(.\+\n\)\{1,}+Enter) the two blocks are highlighted correctly, but I expect there to be only two matches of the regex, one for each block. However when I press n to advance to the next regex match it appears that each non-empty line matches the regex, so my cursor would start on the first line, n would take it to the second line, then third, then to the start of block 2 etc.
How can I change my regex so that I see the expected behavior of each block being a single match so that n advances to the next block, instead of the next line?
I am also interested in knowing if this behavior is in the documentation somewhere, or if there is an option to change this behavior. Note that when using the same regex in a search/replace the behavior is what I expect (replacement would only be applied twice, once for each block).
The following regex seems to work:
\(\%^\|^\n\)\zs\(.\+\n\)\+
Explanation:
\( # start of group
\%^ # beginning of file
\| # OR
^\n # a blank line
\) # end of group
\zs # start matching here
\(.\+\n\)\+ # at least one non-blank line
By using the very magic option the length can be reduced a bit:
\v(%^|^\n)\zs(.+\n)+
Looking forward to seeing if anyone can come up with a shorter solution!
zigdon's answer helped me to understand better why the behavior is the way it is. When n is used to jump to the next match it searches for the first match of the regex from the cursor's current position, even if the next matching position was included in the previous match. This is why anchoring the regex to the start of the block appears to be necessary.
Thanks to Nolen Royalty for helping me get rid of an unnecessary lookahead in the first group.
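The reasoning above can be illustrated outside Vim with a rough Python analogue. Python's re module has no \%^ or \zs, so a lookbehind is used as a stand-in, and the sample assumes the two blocks are separated by one blank line. The sketch lists every line at which each pattern is able to start a match:

```python
import re

TEXT = "block 1\nsome stuff\nmore stuff\n\nblock 2\nfoo bar\nbaz qux\n"

# Unanchored: a run of non-empty lines can begin at any non-empty line.
UNANCHORED = re.compile(r"^(?=(?:.+\n)+)", re.MULTILINE)

# Anchored stand-in: the line must not follow a non-empty line, i.e. it is
# at the start of the text or right after a blank line.
ANCHORED = re.compile(r"^(?<!.\n)(?=(?:.+\n)+)", re.MULTILINE)

def start_lines(pattern, text):
    """Return the 1-based line numbers where a match can start."""
    return [text.count("\n", 0, m.start()) + 1 for m in pattern.finditer(text)]

print(start_lines(UNANCHORED, TEXT))
print(start_lines(ANCHORED, TEXT))
```

The unanchored pattern can start on every non-empty line, which corresponds to n stepping line by line, while the anchored one can start only at the first line of each block.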
Since your match says "match one or more non-empty lines" it can certainly match multiple times within the same paragraph. To fix this, you can specify that the cursor should be placed at the end of the match. This means the next match will start from the end of the paragraph. You can do this with the \zs zero-width character, available in vim:
\zs Matches at any position, and sets the start of the match there: The
next char is the first char of the whole match. |/zero-width|
So your match will become:
\(.\+\n\)\{1,}\zs