Recognise complicated pattern [duplicate] - regex

#!/bin/sh
old="hello"
new="world"
sed -i s/"${old}"/"${new}"/g $(grep "${old}" -rl *)
The preceding script only works for single-line text. How can I write a script that
replaces multi-line text?
old='line1
line2
line3'
new='newtext1
newtext2'
What command can I use?

You could use perl or awk and change the record separator to something other than a newline, so you can match against bigger chunks. For example, with awk (a multi-character RS is not portable, but GNU awk supports it):
echo -e "one\ntwo\nthree" | awk 'BEGIN{RS="\n\n"} sub(/two\nthree\n/, "foo")'
or with perl (-00 == paragraph buffered mode)
echo -e "one\ntwo\nthree" | perl -00 -pne 's/two\nthree/foo/'
I don't know whether it is possible to have no record separator at all. With perl you could read the whole file first, but that is not nice with regard to memory usage.

awk can do that for you.
awk 'BEGIN { RS="" }
FILENAME==ARGV[1] { s=$0 }
FILENAME==ARGV[2] { r=$0 }
FILENAME==ARGV[3] { sub(s,r) ; print }
' FILE_WITH_CONTENTS_OF_OLD FILE_WITH_CONTENTS_OF_NEW ORIGINALFILE > NEWFILE
But you can do it with vim like described here (scriptable solution).
Also see this and this in the sed faq.
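If old and new should be treated as literal text rather than as a regex, one option is to slurp the file in awk and search with index() instead of sub(), so characters such as . or & are not interpreted. The following is only a sketch: the function name replace_block is hypothetical, and it assumes the file fits in memory and ends with a newline.

```shell
#!/bin/sh
# replace_block OLD NEW FILE  (hypothetical helper, not from the answers above)
# Prints FILE with every literal occurrence of OLD replaced by NEW.
# OLD and NEW may span several lines; they are passed through the
# environment so awk does not process backslash escapes in them.
replace_block() {
    OLD="$1" NEW="$2" awk '
        BEGIN { old = ENVIRON["OLD"]; repl = ENVIRON["NEW"] }
        { text = text $0 "\n" }                  # slurp the whole file
        END {
            out = ""
            # index() does a plain substring search, not a regex match
            while (length(old) > 0 && (i = index(text, old)) > 0) {
                out  = out substr(text, 1, i - 1) repl
                text = substr(text, i + length(old))
            }
            printf "%s", out text
        }
    ' "$3"
}
```

It could be called as replace_block "$old" "$new" ORIGINALFILE > NEWFILE, with old and new set as in the question.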

Related

Find regular expression in a file matching a given value

I have some basic knowledge on using regular expressions with grep (bash).
But I want to use regular expressions the other way around.
For example I have a file containing the following entries:
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
Now I want to use bash to figure out to which line a particular number matches.
For example:
grep 8 file
should return:
line_three=[7-9]
Note: I am aware that the example of "grep 8 file" doesn't make sense, but I hope it helps to understand what I am trying to achieve.
Thanks for your help,
Marcel
As others have pointed out, awk is the right tool for this:
awk -F'=' '8~$2{print $0;}' file
... and if you want this tool to feel more like grep, a quick bash wrapper:
#!/bin/bash
awk -F'=' -v seek_value="$1" 'seek_value~$2{print $0;}' "$2"
Which would run like:
./not_exactly_grep.sh 8 file
line_three=[7-9]
My first impression is that this is not a task for grep, maybe for awk.
Trying to do things with grep I only see this:
for line in $(cat file); do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done
Using while for file reading (following comments):
while IFS= read -r line; do echo 8 | grep "${line#*=}" && echo "${line%=*}" ; done < file
This can be done in native bash using the syntax [[ $value =~ $regex ]] to test:
find_regex_matching() {
  local value=$1
  while IFS= read -r line; do          # read from input line-by-line
    [[ $line = *=* ]] || continue      # skip lines not containing an =
    regex=${line#*=}                   # prune everything before the = for the regex
    if [[ $value =~ $regex ]]; then    # test whether we match...
      printf '%s\n' "$line"            # ...and print if we do.
    fi
  done
}
...used as:
find_regex_matching 8 <file
...or, to test it with your sample input inline:
find_regex_matching 8 <<'EOF'
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
EOF
...which properly emits:
line_three=[7-9]
You could replace printf '%s\n' "$line" with printf '%s\n' "${line%%=*}" to print only the key (contents before the =), if so inclined. See the bash-hackers page on parameter expansion for a rundown on the syntax involved.
This is not built-in functionality of grep, but it's easy to do with awk, with a change in syntax:
/[0-3]/ { print "line one" }
/[4-6]/ { print "line two" }
/[7-9]/ { print "line three" }
If you really need to, you could programmatically change your input file to this syntax, if it doesn't contain any characters that need escaping (mainly / in the regex or " in the string):
sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#'
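To make the conversion concrete, the generated rules can be written to a temporary file and run with awk -f. This is a minimal sketch under the same assumption as above (no / in the regex and no " in the key); the function name which_rule is hypothetical.

```shell
#!/bin/sh
# which_rule VALUE FILE  (hypothetical helper)
# Converts each key=regex line of FILE into an awk rule of the form
#   /regex/ { print "key" }
# and then feeds VALUE to the generated awk program.
which_rule() {
    prog=$(mktemp) || return 1
    sed -e 's#\(.*\)=\(.*\)#/\2/ { print "\1" }#' "$2" > "$prog"
    printf '%s\n' "$1" | awk -f "$prog"
    rm -f "$prog"
}
```

With the sample file from the question, which_rule 8 file would be expected to print line_three.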
As I understand it, you are looking for a range that includes some value.
You can do this in gawk:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[7-9]
$ awk -v n=8 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[7-9]
Since the digits are being treated as numbers (vs a regex) it supports larger ranges:
$ cat /tmp/file
line_one=[0-3]
line_two=[4-6]
line_three=[75-95]
line_four=[55-105]
$ awk -v n=92 'match($0, /([0-9]+)-([0-9]+)/, a){ if (a[1]<=n && a[2]>=n) print $0 }' /tmp/file
line_three=[75-95]
line_four=[55-105]
If you are just looking to interpret the right hand side of the = as a regex, you can do:
$ awk -F= -v tgt=8 'tgt~$2' /tmp/file
You would like to do something like
grep -Ef <(cut -d= -f2 file) <(echo 8)
This will grep what you want but will not display where.
With sed you can show some message:
echo "8" | sed -n '/[7-9]/ s/.*/Found it in line_three/p'
Now you would like to transfer your regexp file into such commands:
sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file
Store these commands in a virtual command file and you will have
echo "8" | sed -nf <(sed 's#\(.*\)=\(.*\)#/\2/ s/.*/Found at \1/p#' file)

Output the first regex match against STDIN

In bash, I have the following:
#!/bin/bash
curl $1 | tac | tac | perl -e '/(\d\d(?=:\d\d))/g; print $1' > $2
All I want is to take the first match from the output of curl and print it to the output file. I run the script with ./scriptname url outputfile.txt but nothing is printed. My regex is valid on http://regexr.com, so I'm sure it's something I don't know about Perl. What am I doing wrong? Thanks.
You can use the following:
#!/bin/bash
curl "$1" | perl -nle'print for /\d\d(?=:\d\d)/g' > "$2"
If you change the match to /script/g, you can see it working with something like
./scriptname http://www.ucsd.edu outputfile.txt
I suppose this means perl -ne is reading the input line by line. Is there a simple way to have perl return only the first result?
Consider using sed:
... | sed '/^.*\([[:digit:]]\{2\}\):[[:digit:]]\{2\}.*/{s//\1/;q};d'
In Perl, that would be:
... | perl -nle 'if (s/^.*(\d\d):\d\d.*/$1/) { print; exit }'
And with GNU grep built with PCRE support (-P, --perl-regexp):
... | grep -m1 -Po '\d\d(?=:\d\d)'
There are a few problems:
You never read from STDIN.
You don't stop trying to match after the first match.
You print unconditionally.
If you want all matches (as per your original question):
perl -nle'print for /\d\d(?=:\d\d)/g'
If you want the first match (as per your comment):
perl -nle'if (/\d\d(?=:\d\d)/) { print $&; exit }'
perl -nle'if (/(\d\d):\d\d/) { print $1; exit }'
grep -Pom1 '\d\d(?=:\d\d)'
Notes:
-n wraps the code with a loop that reads from STDIN.
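If neither Perl nor grep -P is available, a similar "first match only" behaviour can be sketched with POSIX awk. awk has no lookahead, so this version matches the whole dd:dd and prints only its first two characters; the function name first_hour is hypothetical.

```shell
#!/bin/sh
# first_hour  (hypothetical helper)
# Reads STDIN and prints the two digits that precede the first ":dd"
# in the input, then stops reading.
first_hour() {
    awk 'match($0, /[0-9][0-9]:[0-9][0-9]/) {
        print substr($0, RSTART, 2)   # keep only the digits before the colon
        exit                          # stop after the first match
    }'
}
```

In the script from the question it would be used as curl "$1" | first_hour > "$2".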

Match a string that could have a newline anywhere in it - bash

I have a string containing a number that is represented as follows:
\S2=number_goes_here\
The number could be anything from 0.00000 and up. However, there could be a newline anywhere in that string, and I am not entirely sure how to go about matching that. Ultimately, I just want the number from this. Importantly, this string is amidst a large chunk of text that can be represented by this sample (S2 is found on the last line there):
1.454187\H,0,0.719618,3.525801,1.633708\H,0,-0.454651,2.80328,2.23844\
Ru,0,0.025774,1.557599,-0.253913\\Version=EM64L-G09RevD.01\State=6-A\H
F=-1238.5377983\S2=8.75446\S2-1=0.\S2A=8.750006\RMSD=2.314e-09\Dipole=
I'm open to bash, sed, awk, gawk; whatever thoughts you have to address this.
EDIT:
Here is an example. The first answer below does not seem to work correctly for it; it only prints "2."
.631441,-2.132979\H,0,0.20151,-1.464802,-2.95553\H,0,0.377883,-2.50668
5,-1.874761\\Version=EM64L-G09RevD.01\State=3-A\HF=-1265.9035096\S2=2.
053325\S2-1=0.\S2A=2.000966\RMSD=1.590e-04\Dipole=0.7197616,-2.1253769
grep -Po '(?<=S2=)[\d.]+' <(tr -d '\n' < file)
gives
8.75446
You can use perl, read the whole file in slurp mode, remove newline characters and search it using a regular expression:
perl -0777 -nE '
$_ = join q||, split /\n/;
printf qq|%s\n|, $1 if m/\\S2=([\d.]+)/
' infile
It yields:
8.75446
Also possible using just bash, though this won't work so well for very large files.
#!/bin/bash
IFS=$'\n'
string=$(<"test.txt")
var=$(echo $string) # word-splitting will replace each newline with a space here
while IFS= read -r word; do
[[ $word =~ '\S2='([0-9]*\.[0-9]*)'\' ]] && echo ${BASH_REMATCH[1]}
done <<< "$var"
e.g.
> ./abovescript
8.75446
Here is a GNU awk version (GNU awk is needed because RS has multiple characters):
awk -F'\' 'NR==2 {print $1}' RS="S2=" file
8.75446
A version that works with most awk implementations:
awk -F\\ '{for (i=1;i<=NF;i++) if ($i~/S2=/) {split($i,a,"=");print a[2]}}' file
8.75446
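For the case in the EDIT, where the line break falls inside the number itself, the newlines have to be removed before matching. A minimal sketch with tr and sed follows; the function name extract_s2 is hypothetical, and it assumes the value is always terminated by a backslash as in the samples.

```shell
#!/bin/sh
# extract_s2 FILE  (hypothetical helper)
# Joins all lines of FILE, then prints the number between "\S2=" and
# the next backslash. "\S2-1=" and "\S2A=" are not matched because the
# pattern requires "=" directly after "S2".
extract_s2() {
    tr -d '\n' < "$1" | sed -n 's/.*\\S2=\([0-9.]*\)\\.*/\1/p'
}
```

Run against the EDIT sample, this would be expected to print 2.053325.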

I have a file and I need to extract the string that follows 'LN:' on the second line

Please refer to the file contents below.
#HD VN:1.0 SO:unsorted
#SQ SN:Chr1 LN:30427680
#PG ID:bowtie2 PN:bowtie2 VN:2.1.0
How can I extract just the number 30427680 using awk or any other Unix command?
Using sed
sed -n 's/.*LN://p' < input.txt
This will erase everything up until LN:, and print what's left, and only if a substitution did take place.
Using awk
awk -v FS=: '/LN:/ { print $3; }' < input.txt
This will match lines that contain LN:, use : as field separator, and print the 3rd column.
Using grep
grep -o '[0-9]\{3,\}' < input.txt
This will match sequences of 3 or more digits, and print only the matched pattern thanks to the -o.
Depending on other cases not included in your question, you might have to make the patterns more strict.
Using grep:
grep -oP 'LN:\K.*' filename
Just use grep:
grep -o 30427680 file
-o, --only-matching
Prints only the matching part of the lines.
Using perl :
perl -ne 'print $& if /LN:\K.*/' filename
or
perl -ne 'print $1 if /LN:(.*)/' filename
Another awk
awk -F"LN:" 'NF>1 {print $2}' file
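Since the question asks specifically about the second line, a stricter variant could restrict the substitution to line 2 and keep only the digits after LN:. This is a sketch; the function name ln_of_line2 is hypothetical.

```shell
#!/bin/sh
# ln_of_line2 FILE  (hypothetical helper)
# Looks only at line 2 of FILE and prints the run of digits that
# follows "LN:"; prints nothing if line 2 has no such field.
ln_of_line2() {
    sed -n '2s/.*LN:\([0-9][0-9]*\).*/\1/p' "$1"
}
```

For the sample file above, ln_of_line2 input.txt would be expected to print 30427680.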

Replacing first and second occurrence of the same text with different values

I'm searching for a way to replace the first occurrence of a certain text in a text file with a value ${A} and the second occurrence of the same text, on a different line, with ${B}. Can this be achieved with sed or awk or any other UNIX tool?
The toolset is fairly limited: bash, common UNIX tools like sed, grep, awk etc. Perl, Python, Ruby etc. cannot be used...
Thanks in advance for any advice
Robert
Example:
...
Text
Text
Text
Text
TEXT_TO_BE_REPLACED
Text
Text
Text
TEXT_TO_BE_REPLACED
Text
Text
Text
...
should be replaced with
...
Text
Text
Text
Text
REPLACEMENT_TEXT_A
Text
Text
Text
REPLACEMENT_TEXT_B
Text
Text
Text
...
Sed with one run:
sed -e 's/\(TEXT_TO_BE_REPLACED\)/REPLACEMENT_TEXT_A/1' \
-e 's/\(TEXT_TO_BE_REPLACED\)/REPLACEMENT_B/1' < input_file > output_file
Just run your script twice - once to replace the first occurrence with ${A}, once to replace the (now first) occurrence with ${B}.
To replace just one occurrence (with GNU sed):
sed '0,/RE/s//to_that/' file
(shamelessly stolen from How to use sed to replace only the first occurrence in a file?)
Here is a possible solution using awk:
#!/usr/bin/awk -f
/TEXT_TO_BE_REPLACED/ {
    if ( n == 0 ) {
        sub( /TEXT_TO_BE_REPLACED/, "REPLACEMENT_TEXT_A", $0 );
        n++;
    }
    else if ( n == 1 ) {
        sub( /TEXT_TO_BE_REPLACED/, "REPLACEMENT_TEXT_B", $0 );
        n++;
    }
}
{
    print
}
awk 'BEGIN { a[0]="REPLACEMENT_A"; a[1]="REPLACEMENT_B"; } \
/TEXT_TO_BE_REPLACED/ { gsub( "TEXT_TO_BE_REPLACED", a[i++]); i%=2 }; 1'
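Because the question wants the replacements to come from the shell variables ${A} and ${B}, the awk approach can be sketched as a small shell function that passes them through the environment. The name replace_first_two is hypothetical; the sketch treats the search text literally and assumes, as in the question, that the two occurrences are on different lines.

```shell
#!/bin/sh
# replace_first_two TEXT A B FILE  (hypothetical helper)
# Prints FILE with the first line containing TEXT changed to use A and
# the second such line changed to use B; later occurrences are kept.
# Values go through ENVIRON so "&" and backslashes stay literal.
replace_first_two() {
    PAT="$1" A="$2" B="$3" awk '
        BEGIN { pat = ENVIRON["PAT"] }
        n < 2 && (i = index($0, pat)) {
            repl = (n == 0) ? ENVIRON["A"] : ENVIRON["B"]
            $0 = substr($0, 1, i - 1) repl substr($0, i + length(pat))
            n++
        }
        { print }
    ' "$4"
}
```

A call such as replace_first_two TEXT_TO_BE_REPLACED "${A}" "${B}" input_file > output_file would fit the example in the question.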
So, you can use sed to do this like so:
First, I made a file named test.txt that contained:
well here is an example text example
and here is another example text
I choose to use the word "example" to be the value to change.
Here is the command: cat test.txt | sed -e 's/\(example\)/test2/2' -e 's/\(example\)/test1/1'
which provides the following output:
well here is an test1 text test2
and here is another test1 text
Now the sed command broken down:
s - begins search + replace
/ - starts the search pattern, which is ended by another /
The escaped parentheses \( \) group our text, i.e. example (unescaped parentheses would match literal ( and ) in sed's default syntax)
/test2/ is what we are putting in place of example
The number after the last slash is the occurrence on each line that we want to replace.
the -e allows you to run both commands on one command line.
You may also use the text editor ed:
# cf. http://wiki.bash-hackers.org/howto/edit-ed
cat <<-'EOF' | sed -e 's/^ *//' -e 's/ *$//' | ed -s file
H
/TEXT_TO_BE_REPLACED/s//REPLACEMENT_TEXT_A/
/TEXT_TO_BE_REPLACED/s//REPLACEMENT_TEXT_B/
wq
EOF