Printing a matched regexp with sed - regex

So I'm trying to match a regexp with any string in the middle of it and then print out just that string. The syntax is sort of like this...
sed -n 's/<title>.*<\/title>/"what do I put here"/p' input.file
and I just want to print out whatever .* is where I typed "what do I put here". I'm not very comfortable with sed at this point so this is likely a very simple answer and I'm having trouble finding one in any of the other questions. Thanks in advance!

Capture the pattern you want to extract within \(...\), and then you can refer to it as \1 in the replacement string:
sed -n 's/<title>\(.*\)<\/title>/\1/p' input.file
You can have multiple \(...\) expressions, and refer to them with \1, \2, \3, and so on.
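As a minimal sketch of using more than one group, the following reorders a name. The input string and the variable name are made up for illustration:

```shell
# Hypothetical input: "Last, First" rewritten as "First Last".
# \1 is the text before the comma, \2 the text after it.
name=$(printf '%s\n' 'Doe, John' | sed -n 's/^\(.*\), \(.*\)$/\2 \1/p')
printf '%s\n' "$name"   # prints: John Doe
```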
If you have the GNU version of sed, or gsed, then you could simplify a bit:
sed -rn 's/<title>(.*)<\/title>/\1/p' input.file
With the -r flag (-E on BSD/macOS sed, which GNU sed also accepts), sed can use "extended regular expressions", which in practice lets you write (...) instead of \(...\), + instead of \+, and other goodies.
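Two practical details are worth a quick sketch. The closing tag contains a /, so it is often easier to pick another delimiter such as | than to escape it. And if the line holds anything besides the title, wrapping the pattern in .* keeps that extra text out of the output. The HTML line below is an invented sample:

```shell
# Invented sample line with markup around the title.
page='<html><head><title>Hello World</title></head></html>'

# "|" as the s command delimiter avoids escaping the "/" in </title>;
# the leading and trailing .* swallow the rest of the line.
title=$(printf '%s\n' "$page" | sed -n 's|.*<title>\(.*\)</title>.*|\1|p')
printf '%s\n' "$title"   # prints: Hello World
```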

Related

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try the following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
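To make the RSTART/RLENGTH arithmetic concrete, here is a small sketch run on the sample line from the question. It uses a plain space followed by + instead of [[:space:]]+, on the assumption that only spaces occur there, so that it also runs on awk implementations without POSIX character classes:

```shell
line='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'

# match() sets RSTART (1 here, because of ^) and RLENGTH (length of the
# matched prefix); substr() then returns everything after that prefix.
kept=$(printf '%s\n' "$line" |
  awk 'match($0, /^TIME\[[0-9]+\.[0-9]+ms\] +-\([0-9]+\)-\.+/) {
         print substr($0, RSTART + RLENGTH)
       }')
printf '%s\n' "$kept"   # prints: "TEXT I WANT TO KEEP"
```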
This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
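sub() returns the number of substitutions it made (0 or 1), so used as a pattern it is true, and the line is printed, only when a replacement happened.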
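A side effect of using sub() as the pattern is that lines without the prefix are dropped rather than passed through. A short sketch with two invented input lines and a simplified regex (everything from TIME[ up to the first double quote):

```shell
# Two invented lines: one with the TIME prefix, one without.
out=$(printf '%s\n' 'TIME[1.5ms] -(3)-...."keep"' 'unrelated line' |
  awk 'sub(/^TIME\[[^"]*/, "")')

# Only the first line matched, so only it is printed (minus the prefix).
printf '%s\n' "$out"   # prints: "keep"
```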
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
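The BRE points above can be checked on the sample line. This sketch assumes the original line has two spaces after the closing bracket, as the \s\s in the question suggests, and writes "one or more spaces" the BRE way, as [[:space:]][[:space:]]*:

```shell
# Sample line, assumed to contain two spaces after "]".
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# Plain BRE: ( ) and - match themselves, [^][]* stops at the first bracket,
# and \.* removes the run of dots.
kept=$(printf '%s\n' "$line" |
  sed 's/^TIME\[[^][]*][[:space:]][[:space:]]*-(3)-\.*//')
printf '%s\n' "$kept"   # prints: "TEXT I WANT TO KEEP"
```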
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the text you want to keep is always surrounded by quotes like this, and those are the only quotes on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

Regex with sed to search in files

I want to search recursively in files for a given pattern and replace it. The search is for a string like "['DB']['1']['HOST'] = 'localhost'". When I test the regex, the following doesn't print anything. I can't see an error in this regex. Could anyone help?
sed -n '/\[\'HOST\'\]\s?=\s?(?:\'|")(.+)(?:\'|")/p' /path/to/file
POSIX regex does not support non-capturing groups. Besides, you have not specified the -E option, so the pattern is parsed as a POSIX BRE pattern, where capturing parentheses have to be escaped. Also, a single quote cannot be escaped inside a single-quoted shell string; use \x27 (a GNU sed escape) instead.
Use
sed -En '/\[\x27HOST\x27\]\s?=\s?[\x27"][^\x27"]+[\x27"]/p'
See an online demo:
s="a string like ['DB']['1']['HOST'] = 'localhost'."
sed -En '/\[\x27HOST\x27\]\s?=\s?[\x27"][^\x27"]+[\x27"]/p' <<< "$s"
Besides, instead of \s, it might be a good idea to use [[:space:]].
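For a sed without \s or \x27, a more portable variant is possible. This is a sketch that goes one step further than the answer and captures the value: it uses [[:space:]] and puts the sed script in double quotes so the single quotes can appear literally, at the cost of escaping the double quotes.

```shell
s="a string like ['DB']['1']['HOST'] = 'localhost'."

# Inside the shell's double quotes, \" becomes a literal " for sed;
# the other backslashes are passed through to sed unchanged.
host=$(printf '%s\n' "$s" |
  sed -n "s/.*\['HOST'][[:space:]]*=[[:space:]]*['\"]\([^'\"]*\)['\"].*/\1/p")
printf '%s\n' "$host"   # prints: localhost
```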

Using regex and sed to replace a string inside of a file

Having the following string inside of a text file.
{"_job":"delete","query":{"query":{"bool":{"must":[{"term":{"_id":"28381"}}],"should":[]}}},"script":{"inline":"ctx._source.meta='This is a ' test string Peedr'"},"timestamp":1518165383,"host":"","port":"9200","index":"","docType":"","customIndexer":""}
I would like to replace all the ' that are inside the ctx._source.meta='' part with \' using sed.
In the example above I've This is a ' test string Peedr which I would like to convert to This is a \' test string Peedr, so the desired output would be:
{"_job":"delete","query":{"query":{"bool":{"must":[{"term":{"_id":"28381"}}],"should":[]}}},"script":{"inline":"ctx._source.meta='This is a \' test string Peedr'"},"timestamp":1518165383,"host":"","port":"9200","index":"","docType":"","customIndexer":""}
I'm using the following regex to get the ' that is inside the ctx._source.meta string (3rd capture group).
(meta=')(.*?)(')(.*?)(')
I have the regex, but I don't know how to use the sed command to replace the 3rd capture group with \'.
Can someone give me a hand and tell me which sed command I have to use?
Thanks in advance
sed generally does not support the Perl regex extensions, so the non-greedy .*? will probably not do what you hope. If you want to use Perl regex, use Perl!
perl -pe "s/(meta='.*?)(')(.*?')/\$1\\\\\$2\$3/"
This will still not necessarily work if the input is malformed; a better approach would be to specifically exclude single quotes from the match, and then you don't need the non-greedy matching.
sed "s/\\(meta='[^']*\\)'\\([^']*'\\)/\\1\\\\'\\2/"
In both cases, the number of backslashes required to escape the backslashes inside the shell's double quotes is staggering.
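One way to sidestep the shell's quoting, offered here as an assumption rather than something the answer proposes, is to keep the sed script in a file and run it with sed -f. The script then needs only sed's own escaping. The temporary file and the shortened input string are illustrative:

```shell
# Write the sed script to a temporary file; the quoted 'EOF' stops the
# shell from interpreting any backslashes or quotes in the here-document.
script=$(mktemp)
cat > "$script" <<'EOF'
s/\(meta='[^']*\)'\([^']*'\)/\1\\'\2/
EOF

# Shortened stand-in for the JSON line in the question.
out=$(printf '%s\n' "meta='This is a ' test string Peedr'" | sed -f "$script")
rm -f "$script"
printf '%s\n' "$out"   # prints: meta='This is a \' test string Peedr'
```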
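Use back-references for every group except the one you want to replace, and write the replacement text in its place. Note that inside the shell's double quotes you need four backslashes to get one literal backslash out of sed:
sed -E "s/(ctx\._source\.meta=')([^']*)(')([^']*')/\1\2\\\\'\4/"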
You may use:
sed "s/ ' / \\\' /g" sample.txt
The first part will instruct sed to only look for a single quote between 2 spaces, as such ctx._source.meta='This and string Peedr'"} will not match, hence will not be changed.
Edit:
At the poster's request, I edited my sed command to apply to extra use cases:
sed "s/\(ctx._source.meta='.*\)'\(.*Peedr'\"\)/\1\\\'\2/g"

Replace some dots(.) with commas(,) with RegEx and awk or sed

I want to replace dots with commas for some but not all matches:
hostname_metric (Index: 1) to hostname;metric (avg);22.04.2015 13:40:00;3.0000;22.04.2015 02:05:00;2.0000;22.04.2015 02:00:00;650.7000;2.2594;
The outcome should look like this:
hostname_metric (Index: 1) to hostname;metric (avg);22.04.2015 13:40:00;3,0000;22.04.2015 02:05:00;2,0000;22.04.2015 02:00:00;650,7000;2,2594;
I was able to identify the RegEx which should work to find the correct dots.
;[0-9]{1,}\.[0-9]{4}
But how can I replace them with a comma with awk or sed?
Thanks in advance!
Adding some capture groups to the regex in your question, you can use this sed one-liner:
sed -r 's/(;[0-9]{1,})\.([0-9]{4})/\1,\2/g' file
This matches and captures the part before and after the . and uses them in the replacement string.
On some versions of sed, you may need to use -E instead of -r to enable Extended Regular Expressions. If your version of sed doesn't understand either switch, you can use basic regular expressions and add a few escape characters:
sed 's/\(;[0-9]\{1,\}\)\.\([0-9]\{4\}\)/\1,\2/g' file
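As a quick check, here is the basic-regular-expression form run on a shortened version of the sample line. The dots in the date stay untouched because no dot in it comes right after a ;digits run and right before four digits:

```shell
# Shortened version of the sample line from the question.
row='metric (avg);22.04.2015 13:40:00;3.0000;650.7000;2.2594;'

converted=$(printf '%s\n' "$row" |
  sed 's/\(;[0-9]\{1,\}\)\.\([0-9]\{4\}\)/\1,\2/g')
printf '%s\n' "$converted"
# prints: metric (avg);22.04.2015 13:40:00;3,0000;650,7000;2,2594;
```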
sed 's/\(;[0-9]\+\)\.\([0-9]\{4\}\)/\1,\2/g'
should do the trick.

PCRE regex to sed regex

First of all, sorry for my bad English. I'm a German guy.
The code given below is working fine in PHP:
$string = preg_replace('/href="(.*?)(\.|\,)"/i','href="$1"',$string);
Now I need the same for sed. I thought it should be:
sed 's/href="(.*?)(\.|\,)"/href="{$\1}"/g' test.htm
But that gives me this error:
sed: -e expression #1, char 36: invalid reference \1 on `s' command's RHS
sed does not support non-greedy regex match.
sed -e 's|href=\"\(.[^"][^>]*\)\([.,]\)\">|href="\1">|g' file
You need a backslash in front of the parentheses you want to reference, thus
sed 's/href="\(.*?\)(.|\,)"/href="{$\1}"/g' test.htm
You have to escape the block selector characters ( and ) as follows.
sed 's/href="\(.*?\)\(.|\,\)"/href="{$\1}"/g' test.htm
Here is a solution. It is not perfect: it only deals with the case of one extra "," or "." before the closing quote.
sed -r -e 's/href="([^"]*)([.,]+)"/href="\1"/g' test.htm
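Since sed has no non-greedy quantifier, the usual substitute is a negated character class that cannot run past the closing quote. The sketch below is a variation on the commands above, not one of the posted answers: it also forces the captured part to end in a character that is not "." or ",", so several trailing dots or commas are all removed. The HTML snippet is invented:

```shell
# Invented sample: the first link has two stray dots, the second is clean.
html='<a href="http://example.com/page..">x</a> <a href="http://example.com/ok">y</a>'

# [^"]* cannot cross the closing quote; [^".,] pins the end of the real URL;
# [.,]\{1,\} then matches the stray punctuation before the quote.
fixed=$(printf '%s\n' "$html" |
  sed 's/href="\([^"]*[^".,]\)[.,]\{1,\}"/href="\1"/g')
printf '%s\n' "$fixed"
# prints: <a href="http://example.com/page">x</a> <a href="http://example.com/ok">y</a>
```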
If you want to match a literal ".", you need to escape it or use it in a character class. As an alternative to slashing the capturing parentheses (which you need to do with basic REs), you can use the -E option to tell sed to use extended REs. Lastly, the REs used by sed use \N to refer to subpatterns, where N is a digit.
sed -E "s/href=([\"'])([^\"']*)[.,]\1/href=\1\2\1/i"
This has its own issue that will prevent matches of href attributes that use both types of quotes.
man sed and man re_format will give more information on REs as used in sed.