How can I express this regex with sed?

I have this regex that I would like to use with sed, since I want to batch process a few thousand files and my editor cannot handle that.
Find: "some_string":"ab[\s\S\n]+"other_string_
Replace: "some_string":"removed text"other_string_
Find basically matches everything between some_string and other_string, including special chars like , ; - or _ and replaces it with a warning that text was removed.
I was thinking about combining the character classes [[:space:]] and [[:alnum:]], which did not work.

With the FreeBSD sed that ships with macOS, you can use
sed -i '' -e '1h;2,$H;$!d;g' -e 's/"some_string":"ab.*"other_string_/"some_string":"removed text"other_string_/g' file
The 1h;2,$H;$!d;g part reads the whole file into memory so that all line breaks are exposed to the regex, and then "some_string":"ab.*"other_string_ matches text from "some_string":"ab till the last occurrence of "other_string_ and replaces with the RHS text.
With FreeBSD sed, -i requires a backup-suffix argument; passing an empty string ('') edits the file in place without keeping a backup.
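A minimal sketch of the slurp idiom, run on made-up input from standard input and without -i, so no file is modified. The sample text is hypothetical; only the two sed expressions are taken from the answer.

```shell
# Hypothetical three-line input; the text to replace spans all three lines.
out=$(printf '%s\n' \
  'x "some_string":"ab first' \
  'second line' \
  'third"other_string_ y' |
  sed -e '1h;2,$H;$!d;g' \
      -e 's/"some_string":"ab.*"other_string_/"some_string":"removed text"other_string_/g')

# The lines are joined in the pattern space before the substitution runs,
# so the result is a single line.
printf '%s\n' "$out"
```

Trying the command this way first makes it easy to check the result before adding -i for the real files.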
By the way, if you decide to use Perl, you can use the -0777 option to slurp the whole file, together with the s modifier (which makes . match any character, including line breaks), and use something like
perl -i -0777 -pe 's/"some_string":"\Kab.*(?="other_string_)/removed text/gs' file
Here,
"some_string":" - matches literal text
\K - omits the text matched so far from the current match memory buffer
ab - matches ab
.* - any zero or more chars as many as possible
OR .*? - any zero or more chars as few as possible
(?="other_string_) - a positive lookahead (that matches the text but does not append to the match value) making sure there is "other_string_ immediately on the right.

Related

How to use Perl to replace multiple lines containing character '/' and new line?

I'm trying to modify block of several lines in several files. Initially, I tried sed but read that Perl might be a better choice. However, my Perl is very basic and I'm not sure how to deal with an empty (new) line and the special character '/'. To sum things up, I'd like to have a one-liner, something like ($perl -i -pe ...), to convert
(new line)
#include <item_b/item_bC.h>
into
#include <item_a/item_aC.h>
#include <item_b/item_bC.h>
Thanks.
One way -- slurp the file into a string, then match a line with only possibly spaces followed by a line starting with #include..., and replace what's matched with that #include line twice
perl -0777 -wpe's{ ^\s*\n ( \#include.*\n ) }{$1$1}mxg' file.c
With -0777 it slurps the whole file into $_ and with -p it prints $_ after processing each line (only once under -0777, since the whole file is in $_ and so there is only one "line"); see switches in perlrun. The /m modifier makes ^ (and $) also match at line boundaries inside a (multiline) string.
Or, with the same general approach (slurp the file) but use a lookahead
perl -0777 -wpe's{ ^\s*\n (?= (\#include.*\n) ) }{$1}mxg' file.c
This matches an empty line after which a lookahead finds a line starting with #include; that line is also captured, so that the empty line can be replaced with it. Since lookarounds don't consume anything, there is no need to replace that line (with itself).
Note, the .* is greedy and matches as much as possible up to the pattern that follows it, and here we have the whole file ahead of it so it may appear that .*\n will match all the way to the very last \n in the file! However, . doesn't match a line-feed (with /s modifier it does) so .*\n here stops at the first newline, so it matches the rest of the line.
If a more specific include statement needs to be matched, add details following the #include pattern.†
Otherwise, one can process line by line, by copying the current line and printing it when on the next line, depending on what's on the saved and next line. There are some picky details to straighten there, not super amenable to one-liners.
Both tested with input file.c (Note: it does start with an empty line)
#include<item_b/item_bC.h>
#include<item_a/item_aC.h>
#include<item_c/item_cC.h>
int main() {
return 1;
}
where we end up with two item_b, one item_a, and two item_c includes and no empty lines, and the rest of the file is unaffected.
† Special characters are mentioned so I'll comment. But please consult more complete resources, like tutorial perlretut and reference perlre. See also perlrebackslash
Characters special for regex can mostly be matched as literal characters in a pattern when escaped with \. But in this case that's not needed: the role of / in a regex is only to delimit the pattern, commonly given as /.../, but here I use {}{} as delimiters; so / isn't special here and can be used freely. For example
perl -0777 -wpe's{ ^\s*\n (?= (\#include<item_./.*\n) ) }{$1}mxg' file.c
matches lines from the input file I used, shown above.
There is clearly a more general pattern instead of item in the actual problem, and it's a filename. Most characters that are allowed in a filename can be used literally in a regex. Exceptions, like ., can be escaped, like \. to match a literal ..
For example, a string item_bC.h, where bC characters vary but item and .h are always the same, can be matched with the pattern /item_..\.h/.

Substring using Regex in Shell or bash

I've a huge text file having row items like following
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570655"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570656"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1042.html"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1043.html?piid=47570657"
"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1043.html?piid=47570658"
I want to extract alpha-numeric character after last occurrence of '-' and before '.html' ('agcd1043' only) and save those values to another file.
Kindly help me do this using regex ( .*-(.+).html.* is the regex I used in Notepad++ for smaller files) or any other method. TIA
You could extract the string with sed:
sed 's/.*-\([^-]*\)\.html.*/\1/' <<< "https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570655"
If you have all your strings in a file you can iterate on it:
while IFS= read -r line
do
variable=$(sed 's/.*-\([^-]*\)\.html.*/\1/' <<< "$line")
# ... use the value from $variable
done < /path/to/file
The sed script is a substitution, where:
.*-\([^-]*\)\.html.* is the pattern
\1 is the replacement
The pattern is written so that it captures any sequence of non-hyphen characters, i.e. [^-]*, trapped between a hyphen character - and the .html string. The dot character is escaped for regex purposes, hence the \.html pattern. The leading and trailing .* make sure that anything before the hyphen and after html is matched too (and therefore removed by the substitution); otherwise it would appear in the output.
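Since sed already reads its input line by line, the same substitution can also be applied to the whole file in one call and redirected to another file, which is what the question asks for. A sketch, using temporary files as stand-ins for the real input and output paths (both hypothetical):

```shell
urls=$(mktemp)   # stand-in for the huge input file
ids=$(mktemp)    # stand-in for the output file

printf '%s\n' \
  '"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1041.html?piid=47570655"' \
  '"https://www.wayfair.ca/appliances/pdp/agua-canada-30-500-cfm-ducted-wall-mount-range-hood-agcd1042.html"' \
  > "$urls"

# One sed process handles every line; no shell loop is needed.
sed 's/.*-\([^-]*\)\.html.*/\1/' "$urls" > "$ids"

result=$(cat "$ids")
rm -f "$urls" "$ids"
printf '%s\n' "$result"
```

If repeated values are not wanted, piping the sed output through sort -u before the redirection would drop the duplicates.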

Recursively wrapping a regular expression with given text

For a given path, I wish to wrap a given regular expression in all files in that path or that path's sub-directories with some given text using standard Linux shell commands.
More specifically, wrap all my syslog commands with an assert command such as syslog(LOG_INFO,json_encode($obj)); becomes assert(syslog(LOG_INFO,json_encode($obj)));.
I thought the following might work, but received sed: -e expression #1, char 47: Invalid preceding regular expression error.
sed -i -E "s/(?<=syslog\()(.*)(?=\);)/assert(syslog(\1));/" /path/to/somewhere
BACKUP INFO IN RESPONSE TO Wiktor Stribiżew's ANSWER
I've never used sed before. Please confirm my understanding of your answer:
sed -i "s/syslog(\(.*\));/assert(syslog(\1));/g" /path/to/somewhere
-i edit files in place. One could first leave out to see on the screen what will be changed.
s substitute text
The three /'s surrounding the pattern and replacement (i.e. /pattern/replacement/) are delimiters and can be any single character, not just /.
syslog(\(.*\)); The pattern with one placeholder. Uses escaped parentheses.
assert(syslog(\1)); The replacement using escaped 1 (or 2, 3, etc) for replacement sub-strings.
g Replace all and not just the first match.
Would sed -i "s/syslog(.*);/assert(&);/g" /path/to/somewhere work as well?
sed patterns do not support lookarounds like (?<=...) and (?=...).
You may use a capturing group/replacement backreference:
sed -i "s/syslog(\(.*\));/assert(syslog(\1));/g" /path/to/somewhere
The pattern is of BRE POSIX flavor (no -E option is passed), so to define a capturing group you need to use escaped parentheses, and unescaped ones will match literal parentheses.
Details
syslog( - syslog( substring
\(.*\) - Group 1: any 0+ chars as many as possible
); - a ); substring
The replacement is assert(syslog(\1));, that is, the match is replaced with assert(syslog(, the contents of Group 1, and then ));.
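The substitution can be checked on a single line before touching any files. The sketch below also tries the & form from the question; the sample line is the one from the question, and the third command is only one possible adjustment, not part of the original answer.

```shell
line='syslog(LOG_INFO,json_encode($obj));'

# Capture-group form from the answer.
with_group=$(printf '%s\n' "$line" | sed 's/syslog(\(.*\));/assert(syslog(\1));/g')

# & form as written in the question: & includes the trailing ";",
# so a stray ";" ends up inside assert(...).
with_amp=$(printf '%s\n' "$line" | sed 's/syslog(.*);/assert(&);/g')

# Possible adjustment: leave the ";" out of the match, so & is just the call.
with_amp_fixed=$(printf '%s\n' "$line" | sed 's/syslog(.*)/assert(&)/')

printf '%s\n' "$with_group" "$with_amp" "$with_amp_fixed"
```

Single quotes are used throughout so that the shell leaves $obj and the backslashes alone.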
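If you need Perl-compatible regex constructs, you can use Perl (sic). Note that lookarounds do not consume text: with (?<=syslog\() and (?=\);) only the arguments are matched, so a replacement of assert(syslog($1)); would end up nested inside the original call. For this wrap it is enough to capture the call and look ahead for the trailing ;:
perl -i -pe 's/(syslog\(.*\))(?=;)/assert($1)/' /path/to/somewhere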
Regardless of this specific solution I switched to single quotes on the assumption that you are on a Unix-ish platform. Backslashes inside double quotes are pesky (sometimes you need to double them, sometimes not).
Perl prefers $1 over \1 in the replacement pattern, though the latter will also technically work.

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy SED regex to extract a string that is delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how do I write a regex for SED that matches a string that is delimited by two other strings?
You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
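This equivalence is easy to check with the two example strings from the question (GNU sed assumed, as in the question):

```shell
broken='{{{cannot match {this broken} example}}}'

# [^}}}] and [^}] behave identically: neither can cross the single "}" in the
# middle, so the substitution does not match and the line is left unchanged.
triple=$(printf '%s\n' "$broken" | sed 's:{{{\([^}}}]*\)}}}:\1:g')
single=$(printf '%s\n' "$broken" | sed 's:{{{\([^}]*\)}}}:\1:g')

# With no brace inside the delimited text, the same expression does match.
ok=$(printf '%s\n' '{{{can match this example}}}' | sed 's:{{{\([^}}}]*\)}}}:\1:g')

printf '%s\n' "$triple" "$single" "$ok"
```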
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, this is not possible in Sed. In fact, I don't think there is any way in Sed of doing this match. The only other way to do this is to use advanced features like look-ahead, which Sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.

Regular expression to match beginning and end of a line?

Could anyone tell me a regex that matches the beginning or end of a line? E.g. if I used sed 's/[regex]/"/g' file here, the output would be each line in quotes? I tried [\^$] and [\^\n] but neither of them seemed to work. I'm probably missing something obvious, I'm new to these.
Try:
sed -e 's/^/"/' -e 's/$/"/' file
To add quotes to the start and end of every line, simply use:
sed 's/.*/"&"/g'
The RE you were trying to come up with to match the start or end of each line, though, is:
sed -r 's/^|$/"/g'
It's an ERE (enabled by -r), so it will work with GNU sed but not with older seds.
matthias's response is perfectly adequate, but you could also use a backreference to do this. If you're learning regular expressions, they are a handy thing to know.
Here's how that would be done using a backreference:
sed 's/\(^.*$\)/"\1"/g' file
At the heart of that regex is ^.*$, which means match anything (.*) surrounded by the start of the line (^) and the end of the line ($), which effectively means that it will match the whole line every time.
Putting that term inside parentheses creates a backreference that we can refer to later on (in the replace pattern). But for sed to realize that you mean to create a backreference instead of matching literal parentheses, you have to escape them with backslashes. Thus, we end up with \(^.*$\) as our search pattern.
The replace pattern is simply a double quote followed by \1, which is our backreference (it refers back to the first pattern match enclosed in parentheses, hence the 1). Then add your last double quote to end up with "\1".
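The approaches from this thread can be compared side by side on made-up input. This sketch covers the two-expression form, the & form, and the backreference form; GNU sed is assumed for the anchors inside the escaped parentheses.

```shell
input=$(printf '%s\n' 'first line' 'second')

# Two substitutions: one for the start of the line, one for the end.
two_step=$(printf '%s\n' "$input" | sed -e 's/^/"/' -e 's/$/"/')

# Whole line matched by .*, reinserted with &.
whole=$(printf '%s\n' "$input" | sed 's/.*/"&"/')

# Whole line captured in group 1, reinserted with \1.
backref=$(printf '%s\n' "$input" | sed 's/\(^.*$\)/"\1"/')

printf '%s\n' "$two_step"
```

All three should give the same quoted lines for this input.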