Recursively wrapping a regular expression with given text - regex

For a given path, I wish to wrap a given regular expression in all files in that path or that path's sub-directories with some given text using standard Linux shell commands.
More specifically, wrap all my syslog commands with an assert command such as syslog(LOG_INFO,json_encode($obj)); becomes assert(syslog(LOG_INFO,json_encode($obj)));.
I thought the following might work, but received sed: -e expression #1, char 47: Invalid preceding regular expression error.
sed -i -E "s/(?<=syslog\()(.*)(?=\);)/assert(syslog(\1));/" /path/to/somewhere
BACKUP INFO IN RESPONSE TO Wiktor Stribiżew's ANSWER
I've never used sed before. Please confirm my understanding of your answer:
sed -i "s/syslog(\(.*\));/assert(syslog(\1));/g" /path/to/somewhere
-i edit files in place. One could first leave out to see on the screen what will be changed.
s substitute text
The three /'s surrounding the pattern and replacement (i.e. /pattern/replacement/) are deliminator and can be any single character and not just /.
syslog(\(.*\)); The pattern with one placeholder. Uses escaped parentheses.
assert(syslog(\1)); The replacement using escaped 1 (or 2, 3, etc) for replacement sub-strings.
g Replace all and not just the first match.
Would sed -i "s/syslog(.*);/assert(&);/g" /path/to/somewhere work as well?

sed patterns do not support lookarounds like (?<=...) and (?=...).
You may use a capturing group/replacement backreference:
sed -i "s/syslog(\(.*\));/assert(syslog(\1));/g" /path/to/somewhere
The pattern is of BRE POSIX flavor (no -E option is passed), so to define a capturing group you need to use escaped parentheses, and unescaped ones will match literal parentheses.
Details
syslog( - syslog( substring
\(.*\) - Group 1: any 0+ chars as many as possible
); - a ); substring
The replacement is assert(syslog(\1));, that is, the match is replaced with assert(syslog(, the contents of Group 1, and then ));.

If you need Perl-compatible regex constructs, you can use Perl (sic).
perl -i -pe 's/(?<=syslog\()(.*)(?=\);)/assert(syslog($1));/' /path/to/somewhere
Regardless of this specific solution I switched to single quotes on the assumption that you are on a Unix-ish platform. Backslashes inside double quotes are pesky (sometimes you need to double them, sometimes not).
Perl prefers $1 over \1 in the replacement pattern, though the latter will also technically work.

Related

How to comment a include line using sed [duplicate]

I am using sed in a shell script to edit filesystem path names. Suppose I want to replace
/foo/bar
with
/baz/qux
However, sed's s/// command uses the forward slash / as the delimiter. If I do that, I see an error message emitted, like:
▶ sed 's//foo/bar//baz/qux//' FILE
sed: 1: "s//foo/bar//baz/qux//": bad flag in substitute command: 'b'
Similarly, sometimes I want to select line ranges, such as the lines between a pattern foo/bar and baz/qux. Again, I can't do this:
▶ sed '/foo/bar/,/baz/qux/d' FILE
sed: 1: "/foo/bar/,/baz/qux/d": undefined label 'ar/,/baz/qux/d'
What can I do?
You can use an alternative regex delimiter as a search pattern by backslashing it:
sed '\,some/path,d'
And just use it as is for the s command:
sed 's,some/path,other/path,'
You probably want to protect other metacharacters, though; this is a good place to use Perl and quotemeta, or equivalents in other scripting languages.
From man sed:
/regexp/
Match lines matching the regular expression regexp.
\cregexpc
Match lines matching the regular expression regexp. The c may be any character other than backslash or newline.
s/regular expression/replacement/flags
Substitute the replacement string for the first instance of the regular expression in the pattern space. Any character other than backslash or newline can be used instead of a slash to delimit the RE and the replacement. Within the RE and the replacement, the RE delimiter itself can be used as a literal character if it is preceded by a backslash.
Perhaps the closest to a standard, the POSIX/IEEE Open Group Base Specification says:
[2addr] s/BRE/replacement/flags
Substitute the replacement string for instances of the BRE in the
pattern space. Any character other than backslash or newline can
be used instead of a slash to delimit the BRE and the replacement.
Within the BRE and the replacement, the BRE delimiter itself can be
used as a literal character if it is preceded by a backslash."
When there is a slash / in theoriginal-string or the replacement-string, we need to escape it using \. The following command is work in ubuntu 16.04(sed 4.2.2).
sed 's/\/foo\/bar/\/baz\/qux/' file

How Can I Match the Last String Using this REGEX

I'm working on a bash script and need to use SED and a REGEX to match this line in a text file:
database.system = "pgsql://hostaddr=127.0.0.1 port=5432 dbname=mydb user=myuser password=mypassword options='' application_name='myappname'";
This is the regex I've come up with:
database.system\s=\s((?=")(.*)(?=;))
So far my regex is matching everything except for the last semi-colon. How do I modify the regex to catch the semi-colon as well?
You're using look-ahead assertions in your regular expression ((?=...)), which sed doesn't support.
However, you don't need them, if all you're trying to do is to extract the string inside the double quotes (using GNU sed syntax):
line=$'database.system = "pgsql://hostaddr=127.0.0.1 port=5432 dbname=mydb user=myuser password=mypassword options=\'\' application_name=\'myappname\'";'
sed -rn 's/database\.system\s*=\s*"(.*)";/\1/p' <<<"$line"
# use var=$(sed ...) to capture command output in a variable.
will extract
pgsql://hostaddr=127.0.0.1 port=5432 dbname=mydb user=myuser password=mypassword options='' application_name='myappname'
-r activates support for extended regular expressions, which function more like regular expressions in other languages (without -r, sed only supports basic regular expressions, whose feature set is limited and whose escaping rules are different).
-n suppresses printing of each input line by default, so that an explicit output command is needed to produce output.
s/<regex>/<replacement>/p matches each input line against <regex>, replaces it with <replacement>, and prints the result (p), but only if a match was found; \1 refers to the first (and only, in this case) capture group ((...)) defined in .
The basic approach is to match the entire line, yet limit the (one and only) capture group to the substring of interest, and then replace the line with only the capture group, which effectively outputs only the substring of interest for each matching line.

Regex to extract first 3 words from a string

I am trying to replace all the words except the first 3 words from the String (using textpad).
Ex value: This is the string for testing.
I want to extract just 3 words: This is the from above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?
Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which captures the first three words into capturing group 1, as well as the rest of the string. For your replace string, you use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, #"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^(\w+\s+){3}(?:[^\n\r]+)$
Just move the ?: from the first to the second grouping.
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)
Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alpha:]] for \w, [^[:alpha:]] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes are necessary to protect the regex from the shell (double quotes would work here, too, but are weaker, or backslashing every character in the regex which has a significance to the shell as a metacharacter) or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy SED regex to extract a string that delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how to I write a regex for SED that matches a string that is delimited bt two other strings ?
You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.

Regular expression to match beginning and end of a line?

Could anyone tell me a regex that matches the beginning or end of a line? e.g. if I used sed 's/[regex]/"/g' filehere the output would be each line in quotes? I tried [\^$] and [\^\n] but neither of them seemed to work. I'm probably missing something obvious, I'm new to these
Try:
sed -e 's/^/"/' -e 's/$/"/' file
To add quotes to the start and end of every line is simply:
sed 's/.*/"&"/g'
The RE you were trying to come up with to match the start or end of each line, though, is:
sed -r 's/^|$/"/g'
Its an ERE (enable by "-r") so it will work with GNU sed but not older seds.
matthias's response is perfectly adequate, but you could also use a backreference to do this. if you're learning regular expressions, they are a handy thing to know.
here's how that would be done using a backreference:
sed 's/\(^.*$\)/"\1"/g' file
at the heart of that regex is ^.*$, which means match anything (.*) surrounded by the start of the line (^) and the end of the line ($), which effectively means that it will match the whole line every time.
putting that term inside parenthesis creates a backreference that we can refer to later on (in the replace pattern). but for sed to realize that you mean to create a backreference instead of matching literal parentheses, you have to escape them with backslashes. thus, we end up with \(^.*$\) as our search pattern.
the replace pattern is simply a double quote followed by \1, which is our backreference (refers back to the first pattern match enclosed in parentheses, hence the 1). then add your last double quote to end up with "\1".