Let's say I have
def
abc
xyz
abc
And I want to match
xyz
abc
as a whole
Is this possible using the most generic RegEx possible?
That is not the perl RegEx or .Net Regex which have multi line flags.
I guess it would be BNF to match this.
Many regex implementations allow explicit line terminators. If \n is the line separator, then just search for xyz\nabc.
Regexes work on whatever text you give them, multiline or otherwise. If it happens to contain linefeeds, then it's nominally "multiline" text, but you don't have to do anything special to match it with regexes. Linefeed is just another character.
The name "multiline flag" (or "multiline mode") confuses many people. All that flag does is change the meaning of the ^ and $ anchors, allowing them to match at the beginning and end of logical lines as well as the beginning and end of the whole text.
grep -A2 "xyz" <file_name>
from https://stackoverflow.com/a/34808071/5556553
Related
I am attempting to create a regex that will find any line that contains exactly one word on it. Words separated by a hyphen or symbol (e.g test-word) or leading white space should still be treated as a single word.
$cat file1
this line has many words
hello
test-hi
this does aswell
Using the regular expression
'/^\s*(\w+)\s$/GM'
Returns only "hello" and ignores "test-hi"
I am able to capture all single words but not ones with hyphens etc!
This is easier to do with awk, by default it will separate each record into fields based on one or more continuous whitespaces and whitespace at start/end of line won't be part of field calculations
$ awk 'NF==1' ip.txt
hello
test-hi
$ awk 'NF>1' ip.txt
this line has many words
this does aswell
NF is a built-in variable that indicates number of fields in the input record
You can use
^\s*([\w-]+)\s*$
which adds support for hyphens, makes the second \s match "zero or more" spaces. Keep your GM flags.
Demo
Try using \S to match any non whitespace character:
'/^\s*(\S+)\s$/GM'
I cannot figure a way to make regular expression match stop not on end of line, but on end of file in VS Code? Is it a tool limitation or there is some kind of pattern that I am not aware of?
It seems the CR is not matched with [\s\S]. Add \r to this character class:
[\s\S\r]+
will match any 1+ chars.
Other alternatives that proved working are [^\r]+ and [\w\W]+.
If you want to make any character class match line breaks, be it a positive or negative character class, you need to add \r in it.
Examples:
Any text between the two closest a and b chars: a[^ab\r]*b
Any text between START and the closest STOP words:
START[\s\S\r]*?STOP
START[^\r]*?STOP
START[\w\W]*?STOP
Any text between the closest START and STOP words:
START(?:(?!START)[\s\S\r])*?STOP
See a demo screenshot below:
To matcha multi-line text block starting from aaa and ending with the first bbb (lazy qualifier)
aaa(.|\n)+?bbb
To find a multi-line text block starting from aaa and ending with the last bbb. (greedy qualifier)
aaa(.|\n)+bbb
If you want to exclude certain characters from the "in between" text, you can do that too. This only finds blocks where the character "c" doesn't occur between "aaa" and "bbb":
aaa([^c]|\n)+?bbb
I am not a programmer but I am having to use RegEx for a particular purpose. How do I add specific characters to what is being returned from RegEx?
For example, if I have a list as follows:
XYZ ABC 123
How do I use RegEx to add something specific to the end of each? For example, if I want all three to end with .com for example?
You can try this script, replacing XYZ abc 123 with your full list:
echo "XYZ abc 123" | sed -E 's/([a-zA-Z0-9]+)/\1.com/g'
Explanation:
s/ Starts a substitution regex
([a-zA-Z0-9]+) Capture at least one alphanumeric
/ End regex
\1.com replaces the capture with itself plus adds .com
/g Global modifier (for all matches)
Without knowing which regex engine you want to use, there are many other ways to do this. In the future, please give more information.
I am using GNU sed version 4.2.1 and I am trying to write a non-greedy SED regex to extract a string that delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how to I write a regex for SED that matches a string that is delimited bt two other strings ?
You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.
According the Perl documentation on regexes:
By default, the "^" character is guaranteed to match only the beginning of the string ... Embedded newlines will not be matched by "^" ... You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string ... you can do this by using the /m modifier on the pattern match operator.
The "after any newline" part means that it will only match at the beginning of the 2nd and subsequent lines. What if I want to match at the beginning of any line (1st, 2nd, etc.)?
EDIT: OK, it seems that the file has BOM information (3 chars) at the beginning and that's what's messing me up. Any way to get ^ to match anyway?
EDIT: So in the end it works (as long as there's no BOM), but now it seems that the Perl documentation is wrong, since it says "after any newline"
The ^ does match the 1st line with the /m flag:
~:1932$ perl -e '$a="12\n23\n34";$a=~s/^/:/gm;print $a'
:12
:23
:34
To match with BOM you need to include it in the match.
~:1939$ perl -e '$a="12\n23\n34";$a=~s/^(\d)/<\1>:/mg;print $a'
12
<2>:3
<3>:4
~:1940$ perl -e '$a="12\n23\n34";$a=~s/^(?:)?(\d)/<\1>:/mg;print $a'
<1>:2
<2>:3
<3>:4
You can use the /^(?:\xEF\xBB\xBF)?/mg regex to match at the beginning of the line anyway, if you want to preserve the BOM.
Conceptually, there's assumed to be a newline before the beginning of the string. Consequently, /^a/ will find a letter 'a' at the beginning of a string.
Put a empty line at the beginning of the file, this cool things down, and avoid to make regex hard to read.
Yes, the BOM. It might appear at the beginning of the file, so put an empty at the beginning of the file. The BOM will not be \s, or something can be seen by bare eye. It kills my hours when a BOM make my regex fail.