Replace a string that matches a pattern using sed - regex

How can I use a wildcard such as .* in sed to search for a pattern (e.g. STRING.*) and append "*" to the end of the string that matches?
Below is the example:
cat file1.txt
MAC BOOK
MODERN MACHINE
MECHANICS
MOUNT
DISK
DATA INFORMATICS
cat file2.txt
MAC
DATA
for line in $(cat file2.txt|uniq)
do
sed -i "/$line.*/s/$line.*/$line.**/" file1.txt
done
Expected output:
cat file1.txt
MAC* BOOK
MODERN MACHINE*
MECHANICS
MOUNT
DISK
DATA* INFORMATICS

A one-liner:
$ sed -r '/'"$(paste -sd'|' file2.txt)"'/s/$/*/' file1.txt
MAC*
MACHINE*
MECHANICS
MOUNT
DISK
DATA*
The paste command creates a regular expression from file2:
$ paste -sd'|' file2.txt
MAC|DATA
Then the sed command looks for lines matching this regex and replaces the end-of-line with an asterisk.
Add -i to the sed command to complete the task.
Update for your new input:
awk -v patt="$(paste -sd'|' file2.txt)" '{
for (i=1; i<=NF; i++)
if ($i ~ patt)
$i = $i "*"
print
}' file1.txt
MAC* BOOK
MODERN MACHINE*
MECHANICS
MOUNT
DISK
DATA* INFORMATICS
To save the output back into the file:
tmp=$(mktemp)
awk ... file1.txt > "$tmp" && mv "$tmp" file1.txt
Or, with the latest GNU awk:
gawk -i inplace -v patt="$(paste -sd'|' file2.txt)" '{
for (i=1; i<=NF; i++)
if ($i ~ patt)
$i = $i "*"
print
}' file1.txt
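The temp-file variant above leaves the awk program elided. A minimal sketch of the whole step as one function might look like the following. The name mark_words_in_place and its argument order are made up for illustration, and it assumes every line of the pattern file is a valid regular expression.

```shell
# Hypothetical helper: append "*" to every word in the target file that
# matches any pattern from the pattern file, then replace the target.
# $1: pattern file (one regex per line), $2: file to edit.
mark_words_in_place() {
    tmp=$(mktemp) || return 1
    # Join the pattern lines into one alternation, e.g. MAC|DATA.
    patt=$(paste -sd'|' "$1")
    awk -v patt="$patt" '{
        for (i = 1; i <= NF; i++)
            if ($i ~ patt)
                $i = $i "*"
        print
    }' "$2" > "$tmp" && mv "$tmp" "$2"
    # Note: after mv the target has the permissions mktemp gave the temp file.
}
```

It would be called as mark_words_in_place file2.txt file1.txt.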

You can just replace the "end of line" with * when you match like:
for line in $(uniq file2.txt); do
sed -i "/$line/s/\$/*/" file1.txt
done
Note that this only works with GNU sed, and it matches $line anywhere in the line, so hopefully that's what you expect.

awk is better suited for this:
awk 'FNR==NR{a[$1];next} {for (i in a) if (index($1, i)) $0 = $0 "*"}1' file2.txt file1.txt
MAC*
MACHINE*
MECHANICS
MOUNT
DISK
DATA*
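The command above only tests the first field and stars the end of the line. If the goal is the per-word output from the question, a sketch along these lines may be closer. It checks every word and treats the entries of file2.txt as literal prefixes rather than regexes; the function name is invented for the example.

```shell
# Hypothetical helper: star each word that starts with one of the literal
# prefixes listed in the first file.
# $1: prefix file (one prefix per line), $2: file to process.
star_prefixed_words() {
    awk 'FNR == NR { if (NF) keys[$1]; next }   # first file: collect prefixes
    {
        for (i = 1; i <= NF; i++)
            for (k in keys)
                if (index($i, k) == 1) {        # word begins with prefix k
                    $i = $i "*"
                    break
                }
        print
    }' "$1" "$2"
}
```

Because index() compares plain strings, characters such as . or * in the prefix file need no escaping. Like any FNR == NR script, it assumes the first file is not empty.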

I think what you are looking for is this:
for line in $(cat file2.txt|uniq)
do
sed -i "s/\(${line}.*\)/\1\*/" file1.txt
done
You can use \( \) in the sed search to capture the matched text and \1 to reuse it in the replacement.
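To see the capture group on its own, here is a small sketch that stops the match at the end of the word instead of the end of the line. The function name is hypothetical, and the argument is assumed to contain no slashes or regex metacharacters.

```shell
# Hypothetical helper: reads stdin and appends "*" to every word
# containing the string given as $1.
star_word() {
    # \( ... \) captures the string plus the rest of the word ([^ ]*);
    # \1 puts the captured text back, followed by a literal asterisk.
    sed "s/\($1[^ ]*\)/\1*/g"
}

printf 'MAC BOOK\nMODERN MACHINE\n' | star_word MAC
# MAC* BOOK
# MODERN MACHINE*
```

Since the pattern is not anchored to the start of a word, a word that contains the string in the middle would get a star as well.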

Related

SED - Regex fails

Given the following files:
input_file:
if_line1
if_line2
template_file_1:
temp_file_line1
temp_file_line2
##regex_match## <= must be replaced by input_file
temp_file_line3
template_file_2:
temp_file_line1
temp_file_line2
{my_file.global} <= must be replaced by input_file
temp_file_line3
output_file:
temp_file_line1
temp_file_line2
if_line1
if_line2
temp_file_line3
For template_file_1 the following sed command works:
sed -n -e '/##regex_match##/{r input_file' -e 'b' -e '}; p' template_file_1 > output_file
However, for template_file_2 the analog sed command fails:
sed -r -n -e '/(?<={).+\.global(?=})/{r input_file' -e 'b' -e '}; p' template_file_2 > output_file
sed complains the regular expression was invalid
The given regex is at least PCRE valid, for example grep -oP '(?<={).+\.global(?=})' template_file_2 works. Any idea how to deal with that?
perl one-liners:
perl -pe 'do {local $/; open $f, "<input_file"; $_ = <$f>; close $f} if /\{.+?\.global\}/' template_file_2
or perhaps this one, not "pure" perl
perl -ne 'if (/\{.+?\.global\}/) {system("cat","input_file")} else {print}' template_file_2
Using CPAN modules can make this really tidy:
perl -MPath::Tiny -pe '$_ = path("input_file")->slurp if /\{.+?\.global\}/' template_file_2
idk exactly what that PCRE is intended to do but taking a guess at it, this will work using any awk in any shell on every UNIX box:
$ awk 'NR==FNR{new=new s $0; s=ORS; next} /##regex_match##/{$0=new} 1' input_file template_file_1
temp_file_line1
temp_file_line2
if_line1
if_line2
temp_file_line3
$ awk 'NR==FNR{new=new s $0; s=ORS; next} /\{[^.{}]+\.global}/{$0=new} 1' input_file template_file_2
temp_file_line1
temp_file_line2
if_line1
if_line2
temp_file_line3
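As for sed itself: its regular expressions (basic, or extended with -r/-E) have no lookbehind or lookahead, which is why the PCRE pattern is rejected. Since the whole line is replaced anyway, matching the braces literally should be enough. A sketch, assuming the placeholder always looks like {name.global}; the function name is made up.

```shell
# Hypothetical helper: print the template with each {name.global} line
# replaced by the contents of another file.
# $1: file to insert, $2: template file.
fill_template() {
    # On a matching line, queue the file with r and branch past the p,
    # so the placeholder line itself is never printed.
    sed -n -e '/{[^{}]*\.global}/{r '"$1" -e 'b' -e '}' -e 'p' "$2"
}
```

It would be used as fill_template input_file template_file_2 > output_file.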

Find regular expression in a file matching a given value

I have some basic knowledge on using regular expressions with grep (bash).
But I want to use regular expressions the other way around.
For example I have a file containing the following entries:
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
Now I want to use bash to figure out to which line a particular number matches.
For example:
grep 8 file
should return:
line_three=[7-9]
Note: I am aware that the example of "grep 8 file" doesn't make sense, but I hope it helps to understand what I am trying to achieve.
Thanks for your help,
Marcel
As others have pointed out, awk is the right tool for this:
awk -F'=' '8~$2{print $0;}' file
... and if you want this tool to feel more like grep, a quick bash wrapper:
#!/bin/bash
awk -F'=' -v seek_value="$1" 'seek_value~$2{print $0;}' "$2"
Which would run like:
./not_exactly_grep.sh 8 file
line_three=[7-9]
My first impression is that this is not a task for grep, maybe for awk.
Trying to do things with grep I only see this:
for line in $(cat file); do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done
Using while for file reading (following comments):
while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
This can be done in native bash using the syntax [[ $value =~ $regex ]] to test:
find_regex_matching() {
local value=$1
while IFS= read -r line; do # read from input line-by-line
[[ $line = *=* ]] || continue # skip lines not containing an =
regex=${line#*=} # prune everything before the = for the regex
if [[ $value =~ $regex ]]; then # test whether we match...
printf '%s\n' "$line" # ...and print if we do.
fi
done
}
...used as:
find_regex_matching 8 <file
...or, to test it with your sample input inline:
find_regex_matching 8 <<'EOF'
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
EOF
...which properly emits:
line_three=[7-9]
You could replace printf '%s\n' "$line" with printf '%s\n' "${line%%=*}" to print only the key (contents before the =), if so inclined. See the bash-hackers page on parameter expansion for a rundown on the syntax involved.
This is not built-in functionality of grep, but it's easy to do with awk, with a change in syntax:
/[0-3]/ { print "line one" }
/[4-6]/ { print "line two" }
/[7-9]/ { print "line three" }
If you really need to, you could programmatically change your input file to this syntax, if it doesn't contain any characters that need escaping (mainly / in the regex or " in the string):
sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#'
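Putting the two steps together, the generated program can be handed straight to awk. This is only a sketch: the function name is invented, the rules file is assumed to contain key=regex lines, and the same escaping caveat applies.

```shell
# Hypothetical helper: print the key of every rule whose regex matches the value.
# $1: rules file with key=regex lines, $2: value to test.
match_rules() {
    # Turn each "key=regex" line into the awk rule: /regex/ { print "key" }
    prog=$(sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#' "$1")
    # Feed the value to awk as its only input line.
    printf '%s\n' "$2" | awk "$prog"
}
```

With the sample file from the question, match_rules file 8 should print line_three.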
As I understand it, you are looking for a range that includes some value.
You can do this in gawk:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
$ awk -v n=8 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[7-9]
Since the digits are being treated as numbers (vs a regex) it supports larger ranges:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[75-95]
line_four=[55-105]
$ awk -v n=92 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[75-95]
line_four=[55-105]
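The three-argument form of match() is a gawk extension. For other awks, a sketch along these lines should behave similarly. It assumes every line looks like key=[low-high] with non-negative integers, treats both bounds as inclusive, and uses a function name invented for the example.

```shell
# Hypothetical helper: print the lines whose numeric range contains a value.
# $1: number to look up, $2: file of key=[low-high] lines.
lines_containing() {
    awk -v n="$1" '{
        range = $0
        sub(/^[^=]*=\[/, "", range)   # strip "key=[" from the front
        sub(/\]$/, "", range)         # strip the closing "]"
        split(range, bound, "-")      # bound[1] = low, bound[2] = high
        if (bound[1] + 0 <= n + 0 && n + 0 <= bound[2] + 0)
            print
    }' "$2"
}
```

It would be called as lines_containing 92 /tmp/file.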
If you are just looking to interpret the right hand side of the = as a regex, you can do:
$ awk -F= -v tgt=8 'tgt~$2' /tmp/file
You would like to do something like
grep -Ef <(cut -d= -f2 file) <(echo 8)
This will grep what you want but will not display where.
With sed you can show some message:
echo "8" | sed -n '/[7-9]/ s/.*/Found it in line_three/p'
Now you would like to transfer your regexp file into such commands:
sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file
Store these commands in a virtual command file and you will have
echo "8" | sed -nf <(sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file)

Pipe awk's results to sed (deletion)

I am using an awk command (someawkcommand) that prints these lines (awkoutput):
>Genome1
ATGCAAAAG
CAATAA
and then, I want to use this output (awkoutput) as the input of a sed command. Something like that:
someawkcommand | sed 's/awkoutput//g' file1.txt > results.txt
file1.txt:
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The final objective is to delete all lines in a file (file1.txt) containing the exact pattern found previously by awk.
The file results.txt contains (output of sed):
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
How should I write the sed command? Is there any simple way that sed will recognize the output of awk as its input?
Using GNU awk for multi-char RS:
$ cat file1
>Genome1
ATGCAAAAG
CAATAA
$ cat file2
>Genome1
ATGCAAAAG
CAATAA
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
$ gawk -v RS='^$' -v ORS= 'NR==FNR{rmv=$0;next} {sub(rmv,"")} 1' file1 file2
>Genome2
ATGAAAAA
AAAAAAAA
CAA
>Genome3
ACCC
The parts that might be non-obvious to newcomers but are very common awk idioms:
-v RS='^$' tells awk to read the whole file as one string (instead of its default of one line at a time).
-v ORS= sets the Output Record Separator to the null string (instead of its default newline) so that when the file is printed as a string awk doesn't add a newline after it.
NR==FNR is a condition that is only true for the first input file.
1 is a true condition invoking the default action of printing the current record.
Here is a possible sed solution:
someawkcommand | sed -n 's_.*_/&/d;_;H;${x;s_\n__g p}' | sed -f - file1.txt
First sed command turns output from someawkcommand into a sed expression.
Concretely, it turns
>Genome1
ATGCAAAAG
CAATAA
into:
/>Genome1/d;/ATGCAAAAG/d;/CAATAA/d;
(in sed language: delete lines containing those patterns; mind that you will have to escape /,[,],*,^,$ in your awk output if there are some, with another substitution for instance).
Second sed command reads it as input expression (-f - reads sed commands from file -, i.e. gets it from pipe) and applies to file file1.txt.
Remark for other readers:
OP wants to use sed, but as noted in the comments, it may not be the easiest way to solve this question. Deleting lines with awk could be simpler. Another (easy) solution could be to use grep with the -v (invert match) and -f (read patterns from file) options, like this:
someawkcommand | grep -v -f - file1.txt
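With plain -v -f, each awk output line is used as a regex and may match part of a longer line. If the intent is to drop only lines that are exactly equal to one of the awk output lines, adding -F (fixed strings) and -x (whole-line match) is one way to do it. A sketch, with a made-up function name:

```shell
# Hypothetical helper: delete from a file every line that is exactly
# equal to one of the lines arriving on stdin.
# stdin: lines to delete, $1: file to filter.
remove_exact_lines() {
    # -v invert, -x whole line, -F literal strings, -f - patterns from stdin
    grep -vxFf - "$1"
}
```

It would be used as someawkcommand | remove_exact_lines file1.txt > results.txt. Note that this removes the listed lines wherever they occur, not only when they appear together as one block.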
Edit: Following @rici's comments, here is a new command that takes output from awk as a single multiline pattern.
Disclaimer: It gets dirty. Kids, don't do this at home. Grown-ups are strongly encouraged to consider avoiding sed for that.
someawkcommand | \
sed -n 'H;${x;s_\n__;s_\n_\\n_g;s_.*_H;${x;s/\\n//;s/&//g p}_ p}' | \
sed -n -f - file1.txt
Output from inner sed is:
H;${x;s/\n//;s/>Genome1\nATGCAAAAG\nCAATAA//g p}
Additional drawback: it will add an empty line instead of removed pattern. Can't fix it easily (problems if pattern is at beginning/end of file). Add a substitution to remove it if you really feel like it.
This can more easily be done in awk, but the usual "eliminate duplicates" code is not correct. As I understand the question, the goal is to remove entire stanzas from the file.
Here's a possible solution which assumes that the first awk script outputs a single stanza:
awk 'NR == FNR {stanza[nstanza++] = $0; next}
$0 == stanza[i] {++i; next}
/^>/ && i == nstanza {i=0; next}
i {for (j=0; j<i; ++j) print stanza[j]; i=0}
{print $0;}
' <(someawkcommand) file1.txt
This might work for you (GNU sed):
sed '1{h;s/.*/:a;$!{N;ba}/p;d};/^>/!{H;$!d};x;s/\n/\\n/g;s|.*|s/&\\n*//g|p;$s|.*|s/\\n*$//|p;x;h;d' file1 |
sed -f - file2
This builds a script from file1 and then runs it against file2.
The script slurps in file2 and then does global substitution(s) using the contents of file1. Finally it removes any blank lines at the end of the file caused by the deletion.
To see the script produced from file1, remove the pipe and the second sed command.
An alternative way would be to use diff and sed:
diff -e file2 file1 | sed 's/d/p/g' | sed -nf - file2

bash regex multiple match in one line

I'm trying to process my text.
For example, I have:
asdf asdf get.this random random get.that
get.it this.no also.this.no
My desired output is:
get.this get.that
get.it
So the regexp should catch only this pattern (get.\w), but it has to do so repeatedly because of multiple occurrences in one line, so the easiest way with sed
sed 's/.*(REGEX).*/\1/'
does not work (it shows only one occurrence).
Probably the best way is to use grep -o, but I have an old version of grep and the -o flag is not available.
This grep may give what you need:
grep -o "get[^ ]*" file
Try awk:
awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
You might need to tweak the regex between the slashes for your specific issue. Sample output:
$ awk '{for(i=1;i<=NF;i++){if($i~/get\.\w+/){print $i}}}' file.txt
get.this
get.that
get.it
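The \w shorthand is a GNU awk extension and may not be available in other awks. A sketch that spells the character class out and keeps the matches of each input line together on one line, as in the desired output, could look like this (the function name is invented):

```shell
# Hypothetical helper: for each input line, print the words that start
# with "get." followed by at least one word character.
# Arguments: input files (reads stdin if none are given).
extract_get_words() {
    awk '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^get\.[A-Za-z0-9_]+/)
                out = (out == "" ? $i : out " " $i)
        if (out != "")      # skip lines without any match
            print out
    }' "$@"
}
```

It can be run as extract_get_words file.txt.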
With awk:
awk -v patt="^get" '{
for (i=1; i<=NF; i++)
if ($i ~ patt)
printf "%s%s", $i, OFS;
print ""
}' <<< "$text"
bash
while read -a words; do
for word in "${words[@]}"; do
if [[ $word == get* ]]; then
echo -n "$word "
fi
done
echo
done <<< "$text"
perl
perl -lane 'print join " ", grep {$_ =~ /^get/} @F' <<< "$text"
This might work for you (GNU sed):
sed -r '/\bget\.\S+/{s//\n&\n/g;s/[^\n]*\n([^\n]*)\n[^\n]*/\1 /g;s/ $//}' file
or if you want one per line:
sed -r '/\n/!s/\bget\.\S+/\n&\n/g;/^get/P;D' file

Regular expression not showing multiple line content

I have a file with following format.
<hello>
<random1>
<random2>
....
....
....
<random100>
<bye>
I want to find whether bye and hello are there, and bye is below hello. I tried this regular expression.
grep "hello.*bye" filename
but it fails to match what I expected.
You could use pcregrep:
pcregrep -M 'hello(\n|.)*bye' filename
The -M option makes it possible to search for patterns that span line boundaries.
For your input, it'd produce:
<hello>
<random1>
<random2>
....
....
....
<random100>
<bye>
If the input file is small enough, you can try:
grep "hello.*bye" <(tr $'\n' ' ' < filename)
This replaces all newlines with spaces and thus turns the file contents into a single line that grep searches at once.
If you'd rather simply remove newlines, use:
grep "hello.*bye" <(tr -d $'\n' < filename)
$ cat file1.txt
<hello>
<bye>
$ awk '/<hello>/ {hello=1} /<bye>/&&hello {bye=1; exit} END {exit !(hello && bye)}' \
file1.txt \
&& echo found || echo not found
found
$ cat file2.txt
<bye>
<hello>
$ awk '/<hello>/ {hello=1} /<bye>/&&hello {bye=1; exit} END {exit !(hello && bye)}' \
file2.txt \
&& echo found || echo not found
not found
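The same ordering check can also be sketched with only sed and grep, which may be handy where the exit status is all that is needed. The function name is made up, and it assumes the tags are written exactly as <hello> and <bye>.

```shell
# Hypothetical helper: succeed if a <bye> line appears somewhere below
# the first <hello> line.
# $1: file to check.
bye_below_hello() {
    # Print from the first <hello> line to the end of the file, drop that
    # first line, then look for <bye> in what remains.
    sed -n '/<hello>/,$p' "$1" | sed 1d | grep -q '<bye>'
}
```

For example: bye_below_hello file1.txt && echo found || echo not found.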
Perl:
perl -0777 -lne 'print (/hello.*bye/s ? "y" : "n")'
or
perl -0777 -ne 'exit(! /hello.*bye/s)'
The -0777 option slurps the whole file as a single string. The "s" flag tells perl to allow "." to match a newline.
With GNU awk for a multi-char RS:
awk -v RS='^$' '{print (/hello.*bye/ ? "y" : "n")}'