I have the following regular expression for eliminating spaces, tabs, and new lines: [^ \n\t]
However, I want to expand this for certain additional characters, such as > and <.
I tried [^ \n\t<>], which works well for now, but I want the expression to not match if the < or > is preceded by a \.
I tried [^ \n\t[^\\]<[^\\]>], but this did not work.
Can any one of the sequences below occur in your input?
\\>
\\\>
\\\\>
\blank
\tab
\newline
...
If so, how do you propose to treat them?
If not, then zero-width look-behind assertions will do the trick, provided that your regular expression engine supports it. This will be the case in any engine that supports Perl-style regular expressions (including Perl's, PHP, etc.):
(?<!\\)[ \n\t<>]
The above will match any un-escaped space, newline, tab or angled braces. More generically (using \s to denote any space characters, including \r):
(?<!\\)\s
Alternatively, using complementary notation without the need for a zero-width look-behind assertion (but arguably less efficiently):
(?:[^ \n\t<>]|\\[<>])
You may also use a variation of the latter to handle the \\>, \\\>, \\\\> etc. cases as well up to some finite number of preceding backslashes, such as:
(?:[^ \n\t<>]|(?:^|[^<>])[\\]{1,3,5,7,9}[<>])
According to the grep man page:
A bracket expression is a list of
characters enclosed by [ and ]. It
matches any single character in that
list; if the first character of the
list is the caret ^ then it matches
any character not in the list.
This means that you can't match a sequence of characters such as \< or \> only single characters.
Unless you have a version of grep built with Perl regex support then you can use lookarounds like one of the other posters mentioned. Not all versions of grep have this support though.
Maybe you can use egrep and put your pattern string inside quotes. This should obliterate the need for escaping.
Related
I have this monit syntax,
check file access_log_1 with path /app/DNIF_logs/access_log_1
ignore content = ".*favicon.*"
if content = "^([:digit:]{1,3}\.){3}[:digit:]{1,3}[:space:]((((([:digit:]{1,3}\.){3}[:digit:]{1,3})|\-)[:space:]([:digit:]{6,7}|\-)[:space:][-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*)[:space:])|)-[:space:]-[:space:]\[[:digit:]{2}\/([A-Z]|[a-z]){3}\/[:digit:]{4}\:[:digit:]{2}\:[:digit:]{2}\:[:digit:]{2}[:space:]\+[:digit:]{4}\][:space:]\"(.*)\"[:space:](500|502|503)([:digit:]|[:space:])*\"https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*)\"(.*)" then alert
While running this monit I'm getting this error
/monitrc:40: syntax error '.*'
I tried removing the first .* and got
monitrc:41: syntax error '"[:space:](500|502|503)([:digit:]|[:space:])*\"'
so got to know error is at the first occurance of (.*)
As per the answer here, we've to use posix regex syntax. Is there a different posix syntax for .* or what should I do?
In POSIX ERE,
Bare POSIX character classes are not allowed, you should always use them inside bracket expressions (i.e. square brackets)
Note [[:digit:]] is the same as [0-9]
Inside bracket expressions, you can't use regex escapes, all literal backslashes are treated as literal backslash matching patterns
\b is not recognized, but you can usually use \< as a starting and \> as a trailing word boundaries
To use double quotes in the regex, use single quotes to delimit the regex string.
You may fix the regex like
if content = '^([0-9]{1,3}\.){3}[0-9]{1,3}[[:space:]]((((([0-9]{1,3}\.){3}[0-9]{1,3})|-)[[:space:]]([0-9]{6,7}|-)[[:space:]][-a-zA-Z0-9#:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\>([-a-zA-Z0-9()#:%_+.~#?&=]*)[[:space:]])|)-[[:space:]]-[[:space:]]\[[0-9]{2}/[[:alpha:]]{3}/[0-9]{4}:[0-9]{2}:[0-9]{2}:[0-9]{2}[[:space:]]\+[0-9]{4}][[:space:]]"(.*)"[[:space:]](500|502|503)[0-9[:space:]]*"https?://(www\.)?[-a-zA-Z0-9#:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\>([-a-zA-Z0-9()#:%_+.~#?&/=]*)"(.*)' then alert
Note:
([A-Z]|[a-z])* is better written as just [[:alpha:]]* (any zero or more letters
([0-9]|[[:space:]])* has a shorter variant, [0-9[:space:]]*.
This is my first time trying to use a regex for deletion.
The regex:
/net=.+\.net/
as shown here matches a string that starts with net= some random characters and ends with .net
However, when using it in vim:
:g/net=.+\.net/d
I simply get Pattern not found: net=.+\.net
I am guessing that vim uses a slightly different format, or do I need to escape the characters =, . and + ?
:help pattern is your friend. In your case, you need to escape + or prefix your whole pattern with \v to turn it “verymagic”.
Do not escape =, it would turn it into the same thing as {0,1} in some regexp engine, namely a greedy optional atom matcher.
I am trying to replace all the words except the first 3 words from the String (using textpad).
Ex value: This is the string for testing.
I want to extract just 3 words: This is the from above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?
Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which captures the first three words into capturing group 1, as well as the rest of the string. For your replace string, you use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, #"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^(\w+\s+){3}(?:[^\n\r]+)$
Just move the ?: from the first to the second grouping.
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)
Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alpha:]] for \w, [^[:alpha:]] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes are necessary to protect the regex from the shell (double quotes would work here, too, but are weaker, or backslashing every character in the regex which has a significance to the shell as a metacharacter) or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
How to use VS Find/Replace to replace:
this: $('a[name="lnkFind"]').on('click', function
with this: $(document).on("click", "a[name='lnkFind']", function
I'm not sure which characters need to be escaped - single or double quotes or both? None of the patters I've tried seem to find a match.
You'll need to escape many of these characters.
Find/Replace will complain about the un-escaped ( and ), even the bare ( at the end because it's missing a matching ). Also the square brackets, which are used for character sets, and finally the $.
So this should work as the pattern:
\$\('a\[name="lnkFind"\]'\).on\('click', function
You should look at a list of special characters in Regular Expressions.
$, ., [, ] should all be escaped.
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html
Except in special cases (such as vim regex), in general you can escape any and all special characters in regex to get their literal form, i.e. escaping a special character that doesn't need to be escaped, won't do any harm.
That said, here's the minimum that needs to be escaped:
\$\('a\[name="lnkFind"]')\.on\('click', function
I don't think you'll need to escape anything in the replacement, because only a $ or \ followed by a number will be interpreted.
I am trying to find some comprehensive documentation on character classes in regular expressions that could be used while using grep. I tried
info grep
man grep
man 7 regex
but could not find all the characters classes listed down in the documentation.
I am looking for some comprehensive documentation on regex that grep uses. Is there such a documentation available?
grep has three options for regex -E or --extended-regexp -G or --basic-regexp and -P or --perl-regexp.
Extended / Basic Regex Classes: Follow POSIX Classes
Perl Regex Classes: Follow Perl Classes
From the command line POSIX regex information can be accessed via man 7 regex where as Perl regex data can be accessed via perldoc perlre
http://linux.about.com/od/commands/l/blcmdl1_grep.htm
Finally, certain named classes of characters are predefined within
bracket expressions, as follows. Their names are self explanatory, and
they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:],
[:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
For example, [[:alnum:]] means [0-9A-Za-z], except the latter form
depends upon the C locale and the ASCII character encoding, whereas
the former is independent of locale and character set.
When I do man grep, this is what I get:
REGULAR EXPRESSIONS
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed
analogously to arithmetic expressions, by using various operators to combine smaller expressions.
grep understands two different versions of regular expression syntax: "basic" and "extended." In GNU grep,
there is no difference in available functionality using either syntax. In other implementations, basic
regular expressions are less powerful. The following description applies to extended regular expressions;
differences for basic regular expressions are summarized afterwards.
The fundamental building blocks are the regular expressions that match a single character. Most characters,
including all letters and digits, are regular expressions that match themselves. Any meta-character with
special meaning may be quoted by preceding it with a backslash.
The period . matches any single character.
Character Classes and Bracket Expressions
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that
list; if the first character of the list is the caret ^ then it matches any character not in the list. For
example, the regular expression [0123456789] matches any single digit.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It
matches any single character that sorts between the two characters, inclusive, using the locale's collating
sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many
locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to
[abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of
bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their
names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:],
[:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z],
except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is
independent of locale and character set. (Note that the brackets in these class names are part of the
symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most
meta-characters lose their special meaning inside bracket expressions. To include a literal ] place it
first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a
literal - place it last.
Anchoring
The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the
beginning and end of a line.
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b
matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the
edge of a word. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].
Repetition
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{,m} The preceding item is matched at most m times.
{n,m} The preceding item is matched at least n times, but not more than m times.
Concatenation
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by
concatenating two substrings that respectively match the concatenated expressions.
Alternation
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any
string matching either alternate expression.
Precedence
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole
expression may be enclosed in parentheses to override these precedence rules and form a subexpression.
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring previously matched by the nth
parenthesized subexpression of the regular expression.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead
use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead,
so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start
of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character
string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an
extension, but portable scripts should avoid it.
Regular expressions are awesome!