Delete lines between and including two patterns - regex

I have a scalar variable that contains some information inside of a file. My goal is to strip that variable (or file) of any multi-line entry containing the words "Administratively down."
The format is similar to this:
Ethernet2/3 is up
... see middle ...
a blank line
VlanXXX is administratively down, line protocol is down
... a bunch of text indented by two spaces on multiple lines ...
a blank line
Ethernet2/5 is up
... same format as previously ...
I was thinking that if I could match "administratively down" and a leading newline (for the blank line), I would be able to apply some logic to the variable to also remove the lines between those lines.
I'm using Perl at the moment, but if anyone can give me an ios way of doing this, that would also work.

Use Perl's Paragraph Mode
Perl has a rarely-used syntax for using blank lines as record separators: the -00 flags; see Command Switches in perl(1) for details.
Example
For example, given a corpus of:
Ethernet2/3 is up
... see middle ...
VlanXXX is administratively down, line protocol is down
... a bunch of text indented by two spaces on multiple lines ...
Ethernet2/5 is up
You can extract all paragraphs except the ones you don't want with the following one-liner:
$ perl -00ne 'print unless /administratively down/' /tmp/corpus
Sample Output
When tested against your corpus, the one-liner yields:
Ethernet2/3 is up
... see middle ...
Ethernet2/5 is up
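For comparison, here is a minimal Python sketch of the same paragraph-mode idea, assuming the whole file fits in memory and that entries are separated by at least one blank line. The function name drop_paragraphs and its phrase parameter are hypothetical, chosen only for illustration.

```python
import re

def drop_paragraphs(text, phrase="administratively down"):
    """Return text without the blank-line-separated paragraphs containing phrase."""
    # Split on runs of blank lines, roughly what perl -00 does
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    # Keep only the paragraphs that do not mention the phrase
    kept = [p for p in paragraphs if phrase not in p]
    return "\n\n".join(kept) + "\n"
```

Like the one-liner, this compares case-sensitively, so "Administratively down" with a capital A would not be dropped.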

So, you want to delete from the beginning of a line containing "administratively down" to and including the next blank line (two consecutive newlines)?
$log =~ s/[^\n]+administratively down.+?\n\n//sg;
s/ = regex substitution
[^\n]+ = one or more characters, not including newlines, followed by
administratively down = the literal text, followed by
.+? = any amount of text, including newlines, matched non-greedily, followed by
\n\n = two newlines
// = replace with nothing (i.e. delete)
s = single line mode, allows . to match newlines (it usually doesn't)
g = global, repeats the substitution for every matching entry instead of only the first
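The same substitution can be sketched in Python, where re.DOTALL plays the role of the s modifier and re.sub replaces every occurrence by default. The names DOWN_ENTRY and strip_down_entries are hypothetical.

```python
import re

# Same pattern as the Perl substitution; DOTALL lets "." match newlines
DOWN_ENTRY = re.compile(r"[^\n]+administratively down.+?\n\n", re.DOTALL)

def strip_down_entries(log):
    """Delete each entry from its first line through the following blank line."""
    return DOWN_ENTRY.sub("", log)
```

Note that an entry at the very end of the text is only removed if a blank line follows it, because the pattern insists on two consecutive newlines.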

You can use this pattern:
(?<=\n\n|^)(?>[^a\n]++|\n(?!\n)|a(?!dministratively down\b))*+administratively down(?>[^\n]++|\n(?!\n))*+
details:
(?<=\n\n|^) # preceded by a blank line (two newlines) or the beginning of the string
# all that is not "administratively down" or a blank line, details:
(?> # open an atomic group
[^a\n]++ # all that is not an "a" or a newline
| # OR
\n(?!\n) # a newline not followed by a newline
| # OR
a(?!dministratively down\b) # "a" not followed by "dministratively down"
)*+ # repeat the atomic group zero or more times
administratively down # "administratively down" itself
# the end of the paragraph
(?> # open an atomic group
[^\n]++ # all that is not a newline
| # OR
\n(?!\n) # a newline not followed by a newline
)*+ # repeat the atomic group zero or more times
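As a rough illustration of the same paragraph-matching idea in Python: the re module needs fixed-width lookbehinds and only gained possessive quantifiers and atomic groups in version 3.11, so this sketch swaps them for ordinary groups and a lazy quantifier. It is a simplified variant, not a port of the pattern above, and the names are hypothetical.

```python
import re

PARAGRAPH_WITH_PHRASE = re.compile(
    r"(?:(?<=\n\n)|\A)"        # start of the text, or right after a blank line
    r"(?:[^\n]|\n(?!\n))*?"    # paragraph text before the phrase (never crosses a blank line)
    r"administratively down"   # the phrase itself
    r"(?:[^\n]|\n(?!\n))*"     # the rest of the paragraph
)

def find_down_paragraphs(text):
    """Return every paragraph that contains the phrase, without the separating blank line."""
    return PARAGRAPH_WITH_PHRASE.findall(text)
```

As with the original pattern, the blank line that separates paragraphs is not part of the match, so deleting the matches leaves the separators behind.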

Related

AutoHotkey: How to extract text between two words with multiple occurrences in a large text document

Using Autohotkey, I would like to copy a large text file to the clipboard, extract text between two repeated words, delete everything else, and paste the parsed text. I am trying to do this to a large text file with 80,000+ lines of text where the start and stop words repeat 100s of times.
Any help would be greatly appreciated!
Input Text Example
Delete this text
De l e te this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text
Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text
Delete this text
Desired Output Text
Apples Oranges
Pears Grapes
Peas Carrots
Peas Carrots
I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.
!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return
You can start the match at StartWord, and then match all lines that do not start with either StartWord or StopWord
^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
^ Start of a line (with the multiline option enabled)
StartWord\s*\K Match StartWord and optional whitespace chars, then forget what is matched so far using \K
(?: Non capture group to repeat as a whole
\R Match a newline
(?!StartWord|StopWord).* Negative lookahead, assert that the line does not start with StartWord or StopWord
)+ Close the non capture group and repeat 1 or more times to match at least a single line
See a regex demo.
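To show how this handles repeated StartWord/StopWord pairs, here is a small Python sketch of the same approach. Python's re module has no \K or \R, so a capture group and a plain \n stand in for them (which assumes \n line endings). BLOCK and extract_blocks are hypothetical names.

```python
import re

# After a StartWord line, capture consecutive lines that start with
# neither StartWord nor StopWord (same idea as the pattern above)
BLOCK = re.compile(r"^StartWord[ \t]*((?:\n(?!StartWord|StopWord).*)+)", re.MULTILINE)

def extract_blocks(text):
    """Return the text of every block, in order of appearance."""
    return [m.group(1).strip("\n") for m in BLOCK.finditer(text)]
```

Joining the returned list with newlines gives the desired output from the question.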
This is only slightly different from @Thefourthbird's solution.
You can match the following regular expression with general, multiline and dot-all flags set (see note 1):
^StartWord\R+\K.*?\R(?=\R*^StopWord\R)
Demo
The regular expression can be broken down as follows:
^StartWord # match 'StartWord' at the beginning of a line
\R+ # match >= 1 line terminators to avoid matching empty lines
# below
\K # reset start of match to current location and discard
# all previously-matched characters
.*? # match >= 0 characters lazily
\R # match a line terminator
(?= # begin a positive lookahead
\R* # match >= 0 line terminators to avoid matching empty lines
# above
^StopWord\R # Match 'StopWord' at the beginning of a line followed
# by a line terminator
) # end positive lookahead
1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.

Regex finding all commas between two words

I am trying to clean up a large .csv file that contains many comma-separated words, parts of which I need to consolidate. So I have a subsection where I want to change all the commas to slashes. Let's say my file contains this text:
Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool
I want to select all commas between the unique words bar and blah. The idea is to then replace the commas with slashes (using find and replace), such that I get this result:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
As per @EganWolf's input:
How do I include words in the search but exclude them from the selection (for the unique words) and how do I then match only the commas between the words?
Thus far I have only managed to select all the text between the unique words including them:
bar,.*,blah, bar:*, *,blah, (bar:.+?,blah)*,*\2
I experimented with negative lookahead but can't get any search results from my statements.
Using Notepad++, you can do:
Ctrl+H
Find what: (?:\bbar,|\G(?!^))\K([^,]*),(?=.+\bblah\b)
Replace with: $1/
check Wrap around
check Regular expression
UNCHECK . matches newline
Replace all
Explanation:
(?: # start non capture group
\bbar, # word boundary then bar then a comma
| # OR
\G # restart from last match position
(?!^) # negative lookahead, make sure we are not at the beginning of a line
) # end group
\K # forget all we've seen until this position
([^,]*) # group 1, 0 or more non comma
, # a comma
(?= # positive lookahead
.+ # 1 or more any character but newline
\bblah\b # word boundary, blah, word boundary
) # end lookahead
Result for given example:
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
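Outside Notepad++, engines without \G and \K need a different route. One possible Python sketch matches the whole stretch between the two words and rewrites its commas in a replacement callback. This is a swapped-in technique rather than a translation of the pattern above, and SEGMENT and commas_to_slashes are hypothetical names.

```python
import re

# Everything after "bar," up to (not including) ",blah"
SEGMENT = re.compile(r"(?<=\bbar,).*?(?=,blah\b)")

def commas_to_slashes(line):
    """Replace the commas between the words bar and blah with slashes."""
    return SEGMENT.sub(lambda m: m.group(0).replace(",", "/"), line)
```

The callback only ever sees the text between the two words, so commas elsewhere on the line are left untouched.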
The following regex will capture the minimally required text to access the commas you want:
(?<=bar,)(.*?(,))*(?=.*?,blah)
See Regex Demo.
If you want to replace the commas, you will need to replace everything in capture group 2. Capture group 0 has your entire match.
An alternative approach would be to split your string by comma to create an array of words. Then join words between bar and blah using / and append the other words joined by ,.
Here is a PowerShell example of split and join:
$a = "Foo,bar,spam,eggs,extra,parts,spoon,eggs,sudo,test,example,blah,pool"
$split = $a -split ","
$slashBegin = $split.indexof("bar")+1
$commaEnd = $split.indexof("blah")-1
$str1 = $split[0..($slashbegin-1)] -join ","
$str2 = $split[($slashbegin)..$commaend] -join "/"
$str3 = $split[($commaend+1)..$split.count] -join ","
@($str1,$str2,$str3) -join ","
Foo,bar,spam/eggs/extra/parts/spoon/eggs/sudo/test/example,blah,pool
This could easily be made into a function with your entire line and keywords as inputs.
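As a sketch of what such a function might look like, here is the same split-and-join idea in Python. The name slash_between and its parameters are hypothetical, and the sketch assumes both keywords occur in the line, with the start keyword first.

```python
def slash_between(line, start="bar", stop="blah"):
    """Join the comma-separated words between start and stop with slashes."""
    words = line.split(",")
    begin = words.index(start) + 1      # first word after the start keyword
    end = words.index(stop, begin)      # position of the stop keyword
    # Only add a middle field when there is something between the keywords
    middle = ["/".join(words[begin:end])] if begin < end else []
    return ",".join(words[:begin] + middle + words[end:])
```

If either keyword is missing, list.index raises ValueError, which the caller would need to handle.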

Using RegEx to Extract Multiple Lines Using Alternate End of Line Characters (newline, return, end of string)

I am presenting both the problem and the solution here. I developed this after extensive research in SO, found numerous related examples, but none that matched my exact use case.
Use Case:
You have a source string with multiple lines, and you need to extract
the first n lines.
Each line could end with any of these characters:
newline \n, return \r, or end of string $
Example Data:
The content of each line is not of interest/concern here. The number of lines in the source string can vary, but I want to limit the number of lines to a max number.
Clip 08.jpg
Clip 31.jpg
Clip 31b.jpg
Clip 32.jpg
Clip 40.jpg
Clip 40b.jpg
Clip 53.jpg
Clip 54.jpg
Maui Clip 53b.jpg
Answer:
^((?:.*(?:\n|\r|$)){1,5})
where the max number of lines, the number you want to extract, is the second number in the quantifier {1,5}, in this case "5".
If anyone can improve on this solution, or sees any issues with it, please post here.
I've found this to be a better solution.
(?m)(?:^.*\R?){1,20}
https://regex101.com/r/o2D6iG/2
(?m) # Inline modifier: Multi-line mode
(?: # Cluster
^ # BOL
.* # optional not newlines
\R? # optional line terminator
){1,20} # End Cluster, 1-20 times
If you make the line terminator optional, it takes care of EOS.
Also, when the Multi-line mode is in effect, it forces \R to match
or it will not advance.
If you don't have the \R construct, you can use the underwhelming series
of alternations for it.
(?m)(?:^.*(?:\r?\n|\r)?){1,20}
https://regex101.com/r/KxxeAK/1
(?m) # Inline modifier: Multi-line mode
(?: # Cluster
^ # BOL
.* # optional not newlines
(?: \r? \n | \r )? # optional line terminator
){1,20} # End Cluster, 1-20 times
And, you could likely do away with multi-line mode (it's just insurance)
(?:.*(?:\r?\n|\r)?){1,20}
JavaScript
https://regex101.com/r/jDTIMH/1
python
https://regex101.com/r/uqoP8Q/1
The following RegEx Pattern is used:
^((?:.*(?:\n|\r|$)){1,<NumOfLinesToExtract>})
Where <NumOfLinesToExtract> is the number of lines to extract from the top of the source list. For example:
^((?:.*(?:\n|\r|$)){1,5})
would result in:
Clip 08.jpg
Clip 31.jpg
Clip 31b.jpg
Clip 32.jpg
Clip 40.jpg
For details, see https://regex101.com/r/Xp1jwT/2
This RegEx does the following:
Extracts one or more lines, up to the maximum set by the upper bound of the quantifier
If the source has fewer lines than the maximum, then all lines are returned.
It matches a line that ends in any of the following characters:
New Line \n
Return \r
End of String $
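For anyone applying this from Python, a minimal sketch follows. In Python's re module "." also matches \r, so the sketch uses a negated character class instead of ".*" to keep \r-terminated lines separate. The function name first_lines is hypothetical.

```python
import re

def first_lines(text, max_lines):
    """Return at most max_lines leading lines of text, line terminators included."""
    if max_lines < 1:
        return ""
    # One line = run of non-terminator characters plus an optional terminator
    pattern = r"(?:[^\r\n]*(?:\r\n|\r|\n)?){1,%d}" % max_lines
    return re.match(pattern, text).group(0)
```

Because the terminator is optional, a last line without a trailing newline is still returned.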

How can I match a Markdown code block with RegEx?

I am trying to extract a code block from a Markdown document using PCRE RegEx. For the uninitiated, a code block in Markdown is defined thus:
To produce a code block in Markdown, simply indent every line of the
block by at least 4 spaces or 1 tab.
A code block continues until it reaches a line that is not indented (or the end of the article).
So, given this text:
This is a code block:
    I need capturing along with
    this line
This is a code fence below (to be ignored):
``` json
This must have three backticks
flanking it
```
I love `inline code` too but don't capture
and one more short code block:
Capture me
So far I have this RegEx:
(?:[ ]{4,}|\t{1,})(.+)
But it simply captures each line prefixed with at least four spaces or one tab. It doesn't capture the whole block.
What I need help with is how to set the condition to capture everything after 4 spaces or 1 tab until you either get to a line that is not indented or the end of the text.
Here's an online work in progress:
https://www.regex101.com/r/yMQCIG/5
You should use begin/end-of-line markers (^ and $ in combination with the m modifier). Also, your test text had only 3 leading spaces in the final block:
^((?:(?:[ ]{4}|\t).*(\R|$))+)
With \R and the repetition you match one whole block with each single match, instead of a line per match.
See demo on regex101
Disclaimer: The rules of markdown are more complicated than the presented example text shows. For instance, when (nested) lists have code blocks in them, these need to be prefixed with 8, 12 or more spaces. Regular expressions are not suitable to identify such code blocks, or other code blocks embedded in markdown notation that uses the wider range of format combinations.
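A small Python sketch of the block-per-match idea, limited to the simple case described in the question (top-level blocks indented by 4 spaces or a tab). Python's re has no \R, so \n is assumed as the line terminator, and CODE_BLOCK and find_code_blocks are hypothetical names.

```python
import re

# One match per run of consecutive lines that start with 4 spaces or a tab
CODE_BLOCK = re.compile(r"^(?:(?: {4}|\t).*(?:\n|\Z))+", re.MULTILINE)

def find_code_blocks(markdown):
    """Return each indented code block as a single string."""
    return CODE_BLOCK.findall(markdown)
```

Fenced blocks and inline code are not indented, so they are skipped, as the question asks.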
There are 3 ways to highlight code: 1) using start-of-line indentation 2) using 3 or more backticks enclosing a multiline block of code or 3) inline code.
1 and 3 are part of John Gruber's original Markdown specification.
Here is the way to achieve this. You need to perform 3 separate regexp tests:
1) Using indentation
(?:\n{2,}|\A) # Starting at beginning of string or with 2 new lines
(?<code_all>
(?:
(?<code_prefix> # Lines must start with a tab or a tab-width of spaces
[ ]{4}
|
\t
)
(?<code_content>.*\n+) # with some content, possibly nothing followed by a new line
)+
)
(?<code_after>
(?=^[ ]{0,4}\S) # Lookahead for non-space at line-start
|
\Z # or end of doc
)
2a) Using code block with backticks (vanilla markdown)
(?:\n+|\A)? # Necessarily at the beginning of a new line or start of string
(?<code_all>
(?<code_start>
[ ]{0,3} # Possibly up to 3 leading spaces
\`{3,} # 3 code marks (backticks) or more
)
\n+
(?<code_content>.*?) # enclosed content
\n+
(?<!`)
\g{code_start} # balanced closing block marks
(?!`)
[ \t]* # possibly followed by some space
\n
)
(?<code_trailing_new_line>\n|\Z) # and a new line or end of string
2b) Using code block with backticks with some class specifier (extended markdown)
(?:\n+|\A)? # Necessarily at the beginning of a new line
(?<code_all>
(?<code_start>
[ ]{0,3} # Possibly up to 3 leading spaces
\`{3,} # 3 code marks (backticks) or more
)
[ \t]* # Possibly some spaces or tab
(?:
(?:
(?<code_class>[\w\-\.]+) # or a code class like html, ruby, perl
(?:
[ \t]*
\{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
)? # Possibly followed by class and id definition in curly braces
)
|
(?:
[ \t]*
\{(?<code_def>[^\}]+)\} # a definition block like {.class#id}
) # Followed by class and id definition in curly braces
)
\n+
(?<code_content>.*?) # enclosed content
\n+
(?<!`)
\g{code_start} # balanced closing block marks
(?!`)
)
(?:\n|\Z) # and a new line or end of string
3) Using 1 or more backticks for inline code
(?<!\\) # Ensuring this is not escaped
(?<code_all>
(?<code_start>\`{1,}) # One or more backtick(s)
(?<code_content>.+?) # Code content in between backticks
(?<!`) # Not preceded by a backtick
\g{code_start} # Balanced closing backtick(s)
(?!`) # And not followed by a backtick
)
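To illustrate the balanced-backtick idea from the inline-code pattern, here is a hedged Python sketch. Python spells named groups (?P<name>...) and named backreferences (?P=name) instead of \g{name}. The group names ticks and code and the function name are hypothetical.

```python
import re

INLINE_CODE = re.compile(
    r"(?<!\\)"           # the opening backtick must not be escaped
    r"(?P<ticks>`+)"     # one or more opening backticks
    r"(?P<code>.+?)"     # the code content, as short as possible
    r"(?<!`)"            # the content must not end in a backtick
    r"(?P=ticks)"        # the same number of closing backticks
    r"(?!`)"             # and no further backtick after them
)

def find_inline_code(text):
    """Return the content of each inline code span in text."""
    return [m.group("code") for m in INLINE_CODE.finditer(text)]
```

The backreference is what lets a span opened with two backticks contain a single backtick in its content.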
Try this?
[a-z]*\n[\s\S]*?\n
It will extract from your example
This must have three backticks
flanking it

Notepad++ Regex to find string on a line and remove duplicates of the exact string

Does anyone know how to match a random string and then remove any re-occurrences of the same string on each line in a file?
Essentially I have a file:
00101 blah 0000202 thisisasentencethisisasentence 99929
00102 blah 0000202 thisisasentenc1thisisasentenc1 999292
I want to remove the duplicate sentence so it returns:
00101 blah 0000202 thisisasentence 99929
00102 blah 0000202 thisisasentenc1 999292
The width isn't fixed or anything like that.
I think this is close but I don't understand regex well and it highlights everything in the file except the last line - correctly finding the duplicate but only once.
Removing duplicate strings/words(not lines) using RegEx(notepad++)
Note I can also use the following to identify which parts of each line is duplicated - it highlights the duplicated values (thisisasentencethisisasentence) but I don't know how to split it
(.{5,})\1
Any help would be appreciated,
thanks.
EDIT I can reformat to create comma delimited (to some extent): (note with this, there is a chance a comma exists in the duplicated string but don't worry about that)
00101,blah,0000202,thisisasentencethisisasentence,99929
00102,blah,0000202,thisisasentenc1thisisasentenc1,999292
You can use this pattern in notepad++ with an empty string as replacement:
^(?>\S+[^\S\n]+){3,}?(\S+?)\K\1(?!\S)
pattern details:
^ # anchors for the start of the line (by default in notepad++)
(?> # atomic group: a column and the following space
\S+ # all that is not a white-space character
[^\S\n]+ # white-spaces except newlines
){3,}? # repeat 3 or more times (non-greedy) until you find
(\S+?)\K\1(?!\S) # a column with a duplicate
details of the last subpattern:
(\S+?) # capture one or more non-white characters
# (non-greedy: until \1(?!\S) succeeds)
\K # discard all previous characters from whole match result
\1 # back-reference to the capture group 1
(?!\S) # ensure that the end of the column is reached
Note: using {5,} instead of + in \S+? (so \S{5,}?) is a good idea, if you are sure that columns contain at least five characters.
You say you are happy with what (.{5,})\1 matches. So, just use $1 as the replacement value. It will automatically replace the repeated part and its copy with a single copy of the text.
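The same replacement can be tried outside Notepad++. A minimal Python sketch, where \1 in the replacement string plays the role of $1 and the names DUPLICATE and collapse_duplicates are hypothetical:

```python
import re

# A run of 5 or more characters immediately followed by an identical copy
DUPLICATE = re.compile(r"(.{5,})\1")

def collapse_duplicates(line):
    """Replace each doubled run with a single copy of it."""
    return DUPLICATE.sub(r"\1", line)
```

Unlike the column-anchored pattern above, this collapses any doubled run of five or more characters wherever it appears on the line, so it suits data where such repeats only occur in the duplicated column.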