Why are blank lines being matched in this regexp? - regex

G'day,
I am using the following Perl fragment to extract output from a Solaris cluster command.
open(CL,"$clrg status |");
my #clrg= grep /^[[:lower:][:space:]]+/,<CL>;
close(CL);
I get the following when I print the content of the elements of the array #clrg BTW "=>" and "<=" line delimiters are inserted by my print statement:
=><=
=>nas-rg mcs0.cwwtf.bbc.co.uk No Online<=
=> mcs1.cwwtf.bbc.co.uk No Offline<=
=><=
=>apache-rg mcs0.cwwtf.bbc.co.uk No Online<=
=> mcs1.cwwtf.bbc.co.uk No Offline<=
=><=
When I replace it with the following Perl fragment the blank lines are not matched.
open(CL,"$clrg status |");
my #clrg= grep /^[[:lower:][:space:]]{3,}/,<CL>;
close(CL);
And I get the following:
=>nas-rg mcs0.cwwtf.bbc.co.uk No Online<=
=> mcs1.cwwtf.bbc.co.uk No Offline<=
=>apache-rg mcs0.cwwtf.bbc.co.uk No Online<=
=> mcs1.cwwtf.bbc.co.uk No Offline<=
Simple question is why?
BTW Using {1,} in the second Perl fragment also matches blank lines!
Any suggestions gratefully received!
cheers,

That'll be because [:space:] matches newlines and carriage returns as well.
So [[:space:]]+ would match \n, \r\n, or \n\n.
But [[:space:]]{3,} would require three characters, and an empty line is just a \n.
{1,} and + mean the same thing: match the preceding group one or more times.
P.S. A typical newline is \n on Unix and \r\n on Windows.

Hm. According to the Perl regular expression documentation, the [:space:] character class should not include newlines, as it is supposed be the equivalent of \s (except that it recognizes an additional character, vertical-tab, to maintain POSIX compliance).
However, having just tested this on 5.10.0, I can verify that it is matching newlines as well. Whether this qualifies as a bug in Perl or in the documentation, I'll leave for the Perl maintainers. But to avoid the immediate problem, use the previous answerer's solution and just use \s instead of the POSIX class.

Related

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another parenthesis inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to be working partially and I am not too sure if its right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
You can trying using sed command.
sed 's/\$\{[^}]*\}//g' <input_file> > <output_file>
Sed here excludes all the characters between '{' and '}' and writes the new content in a new output file.
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,0,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
while (match($0,/\$\{[^}]*\}/)) {
$0=substr($0,0,RSTART-1) substr($0,RSTART+RLENGTH)
}
}
1
The idea here is that for each line, we'll check to see whether the regex matches anything on the line. If it does, we'll replace the line with the parts around the matched regex. (We could alternate sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
Just use the global wilcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use wilcard at both sides, to extend the regexp to both ends (it doesn't matter if you anchor it with ^ and $ as the greedy algorithm will try to extend as much as possible) Note that the $, { and } must be escaped, as they are reserved by the regexp language.
This can be seen in action here.
note
the title of this question doesn't specify that the substring between the two curly braces should not have a }, and as you want only to match the whole line, then it is not necessary to check for something except a }, the only requirement is that } must be after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.

Regex: Exact match string ending with specific character

I'm using Java. So I have a comma separated list of strings in this form:
aa,aab,aac
aab,aa,aac
aab,aac,aa
I want to use regex to remove aa and the trailing ',' if it is not the last string in the list. I need to end up with the following result in all 3 cases:
aab,aac
Currently I am using the following pattern:
"aa[,]?"
However it is returning:
b,c
If lookarounds are available, you can write:
,aa(?![^,])|(?<![^,])aa,
with an empty string as replacement.
demo
Otherwise, with a POSIX ERE syntax you can do it with a capture:
^(aa(,|$))+|(,aa)+(,|$)
with the 4th group as replacement (so $4 or \4)
demo
Without knowing your flavor, I propose this solution for the case that it does know the \b.
I use perl as demo environment and do a replace with "_" for demonstration.
perl -pe "s/\baa,|,aa\b/_/"
\b is the "word border" anchor. I.e. any start or end of something looking like a word. It allows to handle line end, line start, blank, comma.
Using it, two alternatives suffice to cover all the cases in your sample input.
Output (with interleaved input, with both, line ending in newline and line ending in blank):
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
aa,aab,aac
_aab,aac
aab,aa,aac
aab_,aac
aab,aac,aa
aab,aac_
If the \b is unknown in your regex engine, then please state which one you are using, i.e. which tool (e.g. perl, awk, notepad++, sed, ...). Also in that case it might be necessary to do replacing instead of deleting, i.e. to fine tune a "," or "" as replacement. For supporting that, please show the context of your regex, i.e. the replacing mechanism you are using. If you are deleting, then please switch to replacing beforehand.
(I picked up an input from comment by gisek, that the cpaturing groups are not needed. I usually use () generously, including in other syntaxes. In my opinion not having to think or look up evaluation orders is a benefit in total time and risks taken. But after testing, I use this terser/eleganter way.)
If your regex engine supports positive lookaheads and positive lookbehinds, this should work:
,aa(?=,)|(?<=,)aa,|(,|^)aa(,|$)
You could probably use the following and replace it by nothing :
(aa,|,aa$)
Either aa, when it's in the begin or the middle of a string
,aa$ when it's at the end of the string
Demo
As you want to delete aa followed by a coma or the end of the line, this should do the trick: ,aa(?=,|$)|^aa,
see online demo

Perl: regular expression: capturing group

In a code file, I want to remove any (one or more) consecutive white lines (lines that may include only zero or more spaces/tabs and then a newline) that go between a code text and the concluding } of a block. This concluding } may have spaces for indentation before it, so I want to keep them.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
The only issues I can see with your regex are
you don't need the parenthesis around the matching variable,
and
the use of a character class when extracting the match is
redundant (unless you want to match tabs as well as spaces).
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only capture the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
You can try the following one liner
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

Trouble converting regex

This regex:
"REGION\\((.*?)\\)(.*?)END_REGION\\((.*?)\\)"
currently finds this info:
REGION(Test) my user typed this
END_REGION(Test)
I need it to instead find this info:
#region REGION my user typed this
#endregion END_REGION
I have tried:
"#region\\ (.*?)\\\n(.*?)#endregion\\ (.*?)\\\n"
It tells me that the pattern assignment has failed. Can someone please explain what I am doing wrong? I am new to Regex.
It seems the issue lies in the multiline \n. My recommendation is to use the modifier s to avoid multiline complexities like:
/#region\ \(.*?\)(.*?)\s#endregion\s\(.*?\)/s
Online Demo
s modifier "single line" makes the . to match all characters, including line breaks.
Try this:
#region(.*)?\n(.*)?#endregion(.*)?
This works for me when testing here: http://regexpal.com/
When using your original text and regex, the only thing that threw it off is that I did not have a new line at the end because your sample text didn't have one.
Constructing this regex doesn't fail using boost, even if you use the expanded modifier.
Your string to the compiler:
"#region\\ (.*?)\\\n(.*?)#endregion\\ (.*?)\\\n"
After parsed by compiler:
#region\ (.*?)\\n(.*?)#endregion\ (.*?)\\n
It looks like you have one too many escapes on the newline.
if you present the regex as expanded to boost, an un-escaped pound sign # is interpreted as a comment.
In that case, you need to escape the pound sign.
\#region\ (.*?)\\n(.*?)\#endregion\ (.*?)\\n
If you don't use the expanded modifier, then you don't need to escape the space characters.
Taking that tack, you can remove the escape on the space's, and fixing up the newline escapes, it looks like this raw (what gets passed to regex engine):
#region (.*?)\n(.*?)#endregion (.*?)\n
And like this as a source code string:
"#region (.*?)\\n(.*?)#endregion (.*?)\\n"
Your regular expression has an extra backslash when escaping the newline sequence \\\n, use \\s* instead. Also for the last capturing group you can use a greedy quantifier instead and remove the newline sequence.
#region\\ (.*?)\\s*(.*?)#endregion\\ (.*)
Compiled Demo

Extract info from the large file, use newline character in awk regex

I need to extract some information from the very large file.
I want to extract specific lines using regular expressions.
What is the fastest way to do this?
I'm coding in c++ on linux.
I want to use grep, but seems my regex is not working as expected.
For example \s, \w are not working properly.
In man grep is written that \wand [:alnum:] are synonyms, so, \w should work properly but it shouldn't.
I need to use newline characters in my regex, so, I'couldn't use grep, therefore, I decided to use awk.
How should I use newline character in awk regex?
Let's consider we have a file (test.txt) with the content below:
HELLO worl_d5 ; some statement HELLO world1 ; some
statement hi hi some statement ...
And I want to get only these lines:
HELLO worl_d5 ; some statement HELLO world1 ; some
statement
I.e., I want to find lines that start with HELLO word followed by the space character(s), then some alphanumeric( or containing /) word followed by the space character(s) and then, a single ;. But I want to get this kind of lines when they are followed by the some statement line only.
I wrote:
awk '/HELLO[[:space:]]([[:alnum:]]|\/)+[[:space:]];\n[[:space:]]*some[[:space:]] statement [[:space:]];/ { print }' test.txt
But I couldn't get needed results.
Or just provide an example where newline is used in regex.
I solved this by using pcregrep and newline just worked fine!
pcregrep -M '(HELLO[[:space:]]([[:alnum:]]|\/|_)+[[:space:]];)[\r\n]([[:space:]]*some[[:space:]]statement[[:space:]];)' test.txt