Remove first line if blank using RegEx in YML file

I'm running a script to grab text from websites, but the current code sometimes has the output start with a blank line.
data1 data2 data3
data4 data5 data6
data7 data8 data9
I also have other files that don't have a blank line to start.
Running this regex script on all the files at once, how can I remove the first line of the file only if the first line is blank, while keeping the blank lines in the middle of the files?
I am using regex in a yml config file.

You can use the following regex to check whether the file's first line is blank:
^\s*$
together with two regex flags: multiline (m) and Anchored (A).
Explanation:
^ # line start
\s* # match between 0 and unlimited amount of whitespace chars
$ # end of line
The Anchored flag restricts the match to the start of the string, so only the first line can match rather than every blank line.
See demo here.
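For anyone who wants to try the idea outside the YML config, here is a minimal Python sketch. It is an adaptation, not the answer's exact regex: Python's re module has no Anchored flag, so the sketch uses \A (start of string) instead, and it uses [ \t]* rather than \s* so the match cannot run past the first line. The function name strip_leading_blank_line is hypothetical.

```python
import re

# \A anchors at the very start of the text, so only the first line can match.
# [ \t]* allows spaces or tabs on the otherwise blank line; \r?\n consumes its line ending.
LEADING_BLANK = re.compile(r"\A[ \t]*\r?\n")

def strip_leading_blank_line(text: str) -> str:
    """Remove the first line only if it is blank; leave every other line alone."""
    return LEADING_BLANK.sub("", text, count=1)

with_blank = "\ndata1 data2 data3\n\ndata4 data5 data6\n"
without_blank = "data1 data2 data3\n\ndata4 data5 data6\n"

print(repr(strip_leading_blank_line(with_blank)))
print(repr(strip_leading_blank_line(without_blank)))
```

Blank lines in the middle of the file are untouched because the pattern can only match at position 0.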

Related

Autohotkey: How to extract text between two words with multiple occurrences in a large text document

Using Autohotkey, I would like to copy a large text file to the clipboard, extract text between two repeated words, delete everything else, and paste the parsed text. I am trying to do this to a large text file with 80,000+ lines of text where the start and stop words repeat 100s of times.
Any help would be greatly appreciated!
Input Text Example
Delete this text
Delete this text
StartWord
Apples Oranges
Pears Grapes
StopWord
Delete this text
Delete this text
StartWord
Peas Carrots
Peas Carrots
StopWord
Delete this text
Delete this text
Desired Output Text
Apples Oranges
Pears Grapes
Peas Carrots
Peas Carrots
I think I found a regex statement to extract text between two words, but don't know how to make it work for multiple instances of the start and stop words. Honestly, I can't even get this to work.
!c::
Send, ^c
Fullstring = %clipboard%
RegExMatch(Fullstring, "StartWord *\K.*?(?= *StopWord)", TrimmedResult)
Clipboard := %TrimmedResult%
Send, ^v
return
You can start the match at StartWord, and then match all lines that do not start with either StartWord or StopWord
^StartWord\s*\K(?:\R(?!StartWord|StopWord).*)+
^ Start of a line (this requires the multiline flag)
StartWord\s*\K Match StartWord and optional whitespace chars, then forget what has been matched so far using \K
(?: Non capture group to repeat as a whole
\R Match a newline
(?!StartWord|StopWord).* Negative lookahead, assert that the line does not start with StartWord or StopWord, then match the rest of the line
)+ Close the non capture group and repeat 1 or more times to match at least a single line
See a regex demo.
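To see how repeated StartWord/StopWord blocks can be collected in one pass, here is a rough Python sketch. It is an adaptation rather than a port: Python's re module supports neither \K nor \R, so the sketch uses a capture group and \r?\n instead. The function name extract_blocks is hypothetical, and the sample text is the example from the question.

```python
import re

# ^StartWord ... ^StopWord with MULTILINE so ^ matches at each line start,
# and DOTALL so the lazy (.*?) can span several lines.
BLOCK = re.compile(r"^StartWord[ \t]*\r?\n(.*?)^StopWord", re.MULTILINE | re.DOTALL)

def extract_blocks(text: str) -> str:
    """Return the lines found between every StartWord/StopWord pair, concatenated."""
    return "".join(BLOCK.findall(text))

sample = (
    "Delete this text\n"
    "Delete this text\n"
    "StartWord\n"
    "Apples Oranges\n"
    "Pears Grapes\n"
    "StopWord\n"
    "Delete this text\n"
    "StartWord\n"
    "Peas Carrots\n"
    "Peas Carrots\n"
    "StopWord\n"
    "Delete this text\n"
)

print(extract_blocks(sample))
```

findall returns one captured string per StartWord/StopWord pair, which is what handles the "multiple occurrences" part of the question.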
This is only slightly different from @Thefourthbird's solution.
You can match the following regular expression with the general, multiline and dot-all flags set (see note 1 below):
^StartWord\R+\K.*?\R(?=\R*^StopWord\R)
Demo
The regular expression can be broken down as follows:
^StartWord      # match 'StartWord' at the beginning of a line
\R+             # match >= 1 line terminators to avoid matching empty lines below
\K              # reset start of match to current location and discard all previously-matched characters
.*?             # match >= 0 characters lazily
\R              # match a line terminator
(?=             # begin a positive lookahead
  \R*           # match >= 0 line terminators to avoid matching empty lines above
  ^StopWord\R   # match 'StopWord' at the beginning of a line followed by a line terminator
)               # end positive lookahead
1. Click on /gms at the link to obtain explanations of the effects of each of the three flags.

Match multiple line comment blocks composed of one or more single line comments

I need a regex that will match comment blocks composed of one or more single line comments.
Single Line Comment:
# This is a single line comment
Comment Block Composed of Multiple Single Line Comments:
# This is a multiple line comment
# which is just a block of single line comments
# that are strung together
The first character of a comment line can begin with any of the following characters: ;#%|*
I have found the following regex matches individual comment lines: [;#%|*]{1}(.+)
But I cannot figure out how to match for blocks that have more than one line.
I want to keep all characters in the whole block, including new lines.
Match the start of a comment and the rest of its line, then repeat 0 or more occurrences of a group that starts with a newline and optional spaces, followed by a comment start character and the rest of the line:
[;#%|*].*(?:(?:\r\n|\r|\n) *[;#%|*].*)*
See this regex demo.
[;#%|*] - Initial comment character
.* - Rest of first line
(?:(?:\r\n|\r|\n) *[;#%|*].*)* - Repeat 0 or more times:
(?:\r\n|\r|\n) - Newline (if you know the format of your newline characters in advance, you can simplify this, eg, perhaps to just \n)
space followed by * - 0 or more spaces
[;#%|*] - Initial comment character
.* - Rest of line
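As a quick check of the breakdown above, here is a small Python sketch that applies the same pattern with re.finditer. The helper name find_comment_blocks and the sample text are made up for illustration. Note that the pattern is not anchored to the start of a line, so a comment character in the middle of a code line would also start a match.

```python
import re

# The pattern from the answer: one comment line, then any number of
# following lines that (after optional spaces) also start with a comment character.
COMMENT_BLOCK = re.compile(r"[;#%|*].*(?:(?:\r\n|\r|\n) *[;#%|*].*)*")

def find_comment_blocks(text: str) -> list[str]:
    """Return each run of consecutive single-line comments as one string."""
    return [m.group(0) for m in COMMENT_BLOCK.finditer(text)]

sample = (
    "code line\n"
    "# This is a multiple line comment\n"
    "# which is just a block of single line comments\n"
    "x = 1\n"
    "; a lone comment\n"
)

for block in find_comment_blocks(sample):
    print(repr(block))
```

Each returned string keeps the newlines inside the block, as the question asked.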
My guess is that here we might want an expression that'd pass newlines, such as
[;#%|*]([\s\S].*?)(?=[\r\n])
DEMO

How to select parts of text from markdown with regexp?

I have the following text:
#Header
my header text
##SubHeader
my sub header text
###Sub3Header
my sub 3 text
#Header2
my header2 text
I need to select text from "#Header" to "#Header2".
I tried to write a regexp: http://regexr.com/3ffva but it does not match what I need.
^#[^#\n]+([\W\w]*?)^#[^#\n]+
Basic idea: find first level-1 heading, find any text until... second level-1 heading.
^#[^#\n]+ first level-1 heading
^ start of line (because of multi-line flag)
[^#\n]+ Any character that isn't # or a newline character. Repeat 1 or more times.
([\W\w]*?) any text until next matching part
^#[^#\n]+ second level-1 heading (see above)
Flags: multiline.
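Here is a small Python sketch of that pattern, assuming the sample text from the question; the variable names are illustrative only. Group 1 holds everything between the two level-1 headings, including the ## and ### sub-headings, because ^#[^#\n]+ cannot match a line that starts with two or more # characters.

```python
import re

# Same pattern as above; re.MULTILINE makes ^ match at the start of every line.
LEVEL1_SPAN = re.compile(r"^#[^#\n]+([\W\w]*?)^#[^#\n]+", re.MULTILINE)

markdown = (
    "#Header\n"
    "my header text\n"
    "##SubHeader\n"
    "my sub header text\n"
    "###Sub3Header\n"
    "my sub 3 text\n"
    "#Header2\n"
    "my header2 text\n"
)

match = LEVEL1_SPAN.search(markdown)
between = match.group(1)   # text between the two level-1 headings
print(between)
```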
Using a lookahead to end the capture just before the next heading:
1- without multi-line flag
(^|\n)#([^#]+?)\n([^]+?)(?=\n#[^#]|$)
Demo without multi-line flag
Description:
Group 1 captures the start of the string or a newline; it must be followed by a single # (not ##), which means a new heading starts there.
Group 2 captures the heading title.
Group 3 captures anything up to the next heading or the end of the string.
The final group is a lookahead and captures nothing; it asserts that a new heading or the end of the text follows.
2- with multi-line flag
^#([^#]+?)\n([^]+?)(?=^#[^#])
Demo with Multi-line flag
Description:
First, append #-- to the end of the text so that this regex can also match the last heading.
Matching starts at the first character of a line (^) and requires a # whose heading text contains no further #. Group 1 captures the heading title before \n.
Group 2 captures the text up to the start of the next heading, which is identified by a single # at the start of a line.
Depending on your regex flavor you can use:
(^#{1}.+)(.*\n)*
As shown here: http://regexr.com/3fg08
Alternately, you can use Vim's very magic mode:
\v(^#{1}.+)(.*\n)*(^#{1}\w+)

Delete lines between and including two patterns

I have a scalar variable that contains some information inside of a file. My goal is to strip that variable (or file) of any multi-line entry containing the words "Administratively down."
The format is similar to this:
Ethernet2/3 is up
... see middle ...
a blank line
VlanXXX is administratively down, line protocol is down
... a bunch of text indented by two spaces on multiple lines ...
a blank line
Ethernet2/5 is up
... same format as previously ...
I was thinking that if I could match "administratively down" and a leading newline (for the blank line), I would be able to apply some logic to the variable to also remove the lines between those lines.
I'm using Perl at the moment, but if anyone can give me an IOS way of doing this, that would also work.
Use Perl's Paragraph Mode
Perl has a rarely-used syntax for using blank lines as record separators: the -00 flags; see Command Switches in perl(1) for details.
Example
For example, given a corpus of:
Ethernet2/3 is up
... see middle ...
VlanXXX is administratively down, line protocol is down
... a bunch of text indented by two spaces on multiple lines ...
Ethernet2/5 is up
You can extract all paragraphs except the ones you don't want with the following one-liner:
$ perl -00ne 'print unless /administratively down/' /tmp/corpus
Sample Output
When tested against your corpus, the one-liner yields:
Ethernet2/3 is up
... see middle ...
Ethernet2/5 is up
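For comparison, the same paragraph-mode idea can be sketched in Python, assuming the data is already in a string and that records are separated by one or more blank lines. The function name drop_paragraphs is hypothetical, and it uses a plain substring test where the Perl one-liner uses a regex match.

```python
import re

def drop_paragraphs(text: str, needle: str) -> str:
    """Split text on blank lines and keep only paragraphs that do not contain needle."""
    # Two or more consecutive newlines count as one separator, like perl -00.
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    kept = [p for p in paragraphs if needle not in p]
    return "\n\n".join(kept) + "\n"

corpus = (
    "Ethernet2/3 is up\n"
    "  ... see middle ...\n"
    "\n"
    "VlanXXX is administratively down, line protocol is down\n"
    "  ... a bunch of text indented by two spaces ...\n"
    "\n"
    "Ethernet2/5 is up\n"
)

print(drop_paragraphs(corpus, "administratively down"))
```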
So, you want to delete from the beginning of a line containing "administratively down" to and including the next blank line (two consecutive newlines)?
$log =~ s/[^\n]+administratively down.+?\n\n//s;
s/ = regex substitution
[^\n]+ = any number of characters, not including newlines, followed by
administratively down = the literal text, followed by
.+? = any amount of text, including newlines, matched non-greedily, followed by
\n\n = two newlines
// = replace with nothing (i.e. delete)
s = single line mode, allows . to match newlines (it usually doesn't)
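The same substitution can be tried in Python as a rough sketch; the sample log below is made up for illustration. One difference to keep in mind: re.sub replaces every match by default, whereas the Perl statement above has no g modifier and so removes only the first matching entry.

```python
import re

# Same pattern as the Perl substitution; re.DOTALL plays the role of the s modifier,
# letting . match newlines so .+? can run to the next blank line.
ADMIN_DOWN = re.compile(r"[^\n]+administratively down.+?\n\n", re.DOTALL)

log = (
    "Ethernet2/3 is up\n"
    "  detail\n"
    "\n"
    "Vlan10 is administratively down, line protocol is down\n"
    "  detail a\n"
    "  detail b\n"
    "\n"
    "Ethernet2/5 is up\n"
    "  detail\n"
)

cleaned = ADMIN_DOWN.sub("", log)   # removes all matching entries, not just the first
print(cleaned)
```

An entry at the very end of the text is only removed if it is followed by a blank line, since the pattern requires \n\n.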
You can use this pattern:
(?<=\n\n|^)(?>[^a\n]++|\n(?!\n)|a(?!dministratively down\b))*+administratively down(?>[^\n]++|\n(?!\n))*+
details:
(?<=\n\n|^) # preceded by a blank line (two newlines) or the beginning of the string
# all that is not "administratively down" or a blank line, details:
(?> # open an atomic group
[^a\n]++ # all that is not an "a" or a newline
| # OR
\n(?!\n) # a newline not followed by a newline
| # OR
a(?!dministratively down\b) # "a" not followed by "dministratively down"
)*+ # repeat the atomic group zero or more times
administratively down # "administratively down" itself
# the end of the paragraph
(?> # open an atomic group
[^\n]++ # all that is not a newline
| # OR
\n(?!\n) # a newline not followed by a newline
)*+ # repeat the atomic group zero or more times

Vim: Match spaces at end of line but not lines consisting of a single space

I realise that in vim, I can highlight trailing spaces at the end of a line using
match /\s\+$/
Now I would like to exclude those lines that contain exactly one space from being matched. How do I go about doing this? (It does not need to be a single line/regex.)
match /\(\S\zs\s\+$\)\|\(^\s\{2,}$\)/
This should work - breaking it down into 2 sections
Part 1 - search for spaces at the end of a line that has other stuff on the line: \(\S\zs\s\+$\)
not a space \S,
then start matching \zs,
1 or more spaces at the end of the line \s\+$
OR match \|
Part 2 - Search for more than one space which is the entire line: \(^\s\{2,}$\)
start at the beginning of the line ^
search for at least 2 spaces \s\{2,}
at the end of the line $
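To check the two-part logic outside Vim, here is a rough Python translation applied to one line at a time. It is an assumption-laden sketch: Vim's \zs has no direct equivalent in Python's re, so a lookbehind (?<=\S) stands in for it, and [ \t] replaces \s because Vim's \s matches only space and tab. The helper name trailing_whitespace is hypothetical.

```python
import re

# Part 1: whitespace at end of line, preceded by a non-space character.
# Part 2: a line made up entirely of two or more whitespace characters.
TRAILING = re.compile(r"(?<=\S)[ \t]+$|^[ \t]{2,}$")

def trailing_whitespace(line: str):
    """Return the whitespace that would be highlighted, or None (line has no newline)."""
    m = TRAILING.search(line)
    return m.group(0) if m else None

for line in ["foo  ", "baz ", " ", "   ", "a  b"]:
    print(repr(line), "->", repr(trailing_whitespace(line)))
```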
This matches trailing runs of two or more whitespace characters, so a line consisting of exactly one space is left out. Note that a single trailing space after other text is also not matched.
match /\s\s\+$/