Write a regular expression to parse following string - regex

I am using the sed command and want to parse the following string:
Mr. XYZ Mr. ABC, PQR
Ward-2, abc vs. MG Road, Pune,
Pune Dist.,
(Appellant) (Respondent)
I want to split this text into the appellant part and the respondent part.
That is, I want the following output:
Mr. XYZ Ward-2, abc(Appellant) as one output and Mr. ABC, PQR MG Road, Pune, Pune Dist.,(Respondent) as the other, using the sed command.
I used the following command but did not get the expected output:
sed -n '/assessment year/I{ :loop; n; /Respondent/Iq; p; b loop}' abc.txt

sed is the wrong tool for any job that involves looking at multiple lines; awk handles that kind of task far more easily. Here's a GNU awk script (it relies on the FIELDWIDTHS and \s extensions, and the widths assume the column layout of the original file):
$ cat tst.awk
BEGIN { FIELDWIDTHS="30 7 99" }
{
for (i=1;i<=NF;i++) {
gsub(/^\s*|\s*$/,"",$i)
if ($i != "") {
rec[i] = (rec[i]=="" ? "" : rec[i] " ") $i
}
}
}
/^\(/ {
print rec[1]
print rec[3]
delete rec
}
$
$ awk -f tst.awk file
Mr. XYZ Ward-2, abc (Appellant)
Mr. ABC, PQR MG Road, Pune, Pune Dist., (Respondent)
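For readers who want to experiment with the same idea outside awk, here is a minimal Python sketch of the fixed-width approach. The widths are copied from FIELDWIDTHS above, and the sample lines are a hypothetical layout built to fit those widths, since the real file's spacing is not shown in the question.

```python
from itertools import accumulate

# Column widths copied from FIELDWIDTHS in the awk answer; they are an
# assumption about the original (unflattened) file layout.
WIDTHS = (30, 7, 99)


def split_fixed(line, widths=WIDTHS):
    """Cut one line into fixed-width fields and strip the padding."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[s:e].strip() for s, e in zip(starts, ends)]


def join_columns(lines, widths=WIDTHS):
    """Collect the non-empty pieces of each column, joined by single spaces."""
    columns = [[] for _ in widths]
    for line in lines:
        for parts, field in zip(columns, split_fixed(line, widths)):
            if field:
                parts.append(field)
    return [" ".join(parts) for parts in columns]


# Hypothetical sample laid out to match WIDTHS (left column, "vs.", right column).
sample = [
    "Mr. XYZ".ljust(30) + "".ljust(7) + "Mr. ABC, PQR",
    "Ward-2, abc".ljust(30) + "vs.".ljust(7) + "MG Road, Pune,",
    "".ljust(30) + "".ljust(7) + "Pune Dist.,",
    "(Appellant)".ljust(30) + "".ljust(7) + "(Respondent)",
]

appellant, _, respondent = join_columns(sample)
print(appellant)
print(respondent)
```

If the real file uses different column positions, only WIDTHS needs to change.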

I achieved this in the following way using Ruby:
appellant_respondent = %x(sed -n '/assessment year/I{ :loop; n; /respondent/Iq; p; b loop}' #{@file_name}).split("\n")
appellant_name_array = []
respondent_name_array = []
appellant_respondent.delete("")
appellant_respondent.each do |names|
names_array = names.split(/\s+\s+/)
appellant_name_array << names_array.first if names_array.first != ""
respondent_name_array << names_array.last if names_array.last != ""
end
@item[:appellant] = appellant_name_array.join(' ').gsub(/\s+vs\.*\s+/i, ' ').strip
@item[:respondent] = respondent_name_array.join(' ').gsub(/\s+vs\.*\s+/i, ' ').strip

Related

Using gawk to extract rows with string in a column

I was trying to extract rows from a tab separated file, if it contained a certain word in the 4th column. For example, if input file test.txt is:
chr 8 1234 abc ; xyz
chr 8 1255 abc
chr 8 987 xyz
chr 8 5467 jxyzm
The following code correctly outputs only the 1st and 3rd line:
gawk -F"\t" ' { if($4 ~ /\<xyz\>/) print $0 } ' test.txt >> test.out
However, when I try to run this in a loop in a bash script, my output file is empty. The code I am using is:
while read id
do
OFILE=${ODIR}/${id}.txt
gawk -v id="$id" -F"\t" ' { if($4 ~ /\<id\>/) print $0 } ' ${IFILE} >> ${OFILE}
done < ${GFILE}
The file ${GFILE} has one word per line, e.g.:
xyz
fg45
tre2y
What am I doing wrong?
thanks!
Edited to:
Add fourth row in input file
Added -v id="$id" to command...script still doesn't work!
You can have awk read the search words from one file and look for matches in the other. This uses GNU awk for the \< and \> word boundaries, so that xyz does not match jxyzm:
gawk -F '\t' '
NR == FNR {
words[$1]
next
}
{
for (w in words)
if ($4 ~ ("\\<" w "\\>")) {
print > (w ".txt")
break
}
}' "$GFILE" "$IFILE"
Then check output:
cat xyz.txt
chr 8 1234 abc ; xyz
chr 8 987 xyz
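If you would rather fix your shell script, note that inside /.../ the name id is literal text, not the variable, so the regex has to be built from a string instead (gawk is needed for \< and \>):
while IFS= read -r id; do
gawk -F '\t' -v id="$id" '$4 ~ ("\\<" id "\\>")' "$IFILE" > "$id.txt"
done < "$GFILE"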
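The whole-word test on the fourth column can also be sketched in Python. This is only an illustration: it assumes the rows really are split by single tabs into the four columns shown in the question, and the function name rows_with_word is made up for the example.

```python
import re


def rows_with_word(lines, word, column=3):
    """Return tab-separated rows whose given column contains `word` as a whole word."""
    # re.escape keeps characters in the search word from acting as regex syntax.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    matches = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and pattern.search(fields[column]):
            matches.append(line)
    return matches


# Assumed tab layout of the sample rows from the question.
rows = [
    "chr\t8\t1234\tabc ; xyz",
    "chr\t8\t1255\tabc",
    "chr\t8\t987\txyz",
    "chr\t8\t5467\tjxyzm",
]
print(rows_with_word(rows, "xyz"))
```

As in the awk version, the pattern is built from the variable's value at run time rather than written as a literal.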

Use sed to replace letters [a-z] and [A-Z] and ['] with underscores

...for all characters but the first letter of every word on a line excluding the first word. All text is English language.
Would like to use sed to convert input like this:
Mary had a little lamb
It's fleece was white as snow
to this:
Mary h__ a l_____ l___
It's f_____ w__ w____ a_ s___
For a project that looks at cued recall.
I have looked at several introductions to sed and regex. I would be using the version of sed that ships with the terminal on macOS 10.14.5.
This might work for you (GNU sed):
sed -E 'h;y/'\''/x/;s/\B./_/g;G;s/\S+\s*(.*)\n(\S+\s*).*/\2\1/' file
Make a copy of the current line in the hold space. Translate each ' to x so that a word containing an apostrophe is treated as a single word, then replace every character that is not at the start of a word with an underscore (s/\B./_/g). Append the saved copy and use grouping and back-references to restore the first word of the line unaltered.
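The same transformation can be sketched in Python for comparison. This is not a translation of the sed command, just an illustration of the rule "keep the first word, and in every later word keep only the first character"; the function name mask_line is hypothetical, and it assumes words are separated by single spaces.

```python
import re


def mask_line(line):
    """Keep the first word; in later words keep only the first character."""
    first, sep, rest = line.partition(" ")
    # (?<=\S)\S matches any non-space character that follows another
    # non-space character, i.e. everything in a word except its first character.
    return first + sep + re.sub(r"(?<=\S)\S", "_", rest)


print(mask_line("Mary had a little lamb"))
print(mask_line("It's fleece was white as snow"))
```

Because the pattern works on non-space characters rather than word characters, apostrophes inside later words are masked too and need no special handling.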
sed is for doing simple s/old/new operations on individual strings, that is all. For anything else you should be using awk, e.g. with GNU awk for the 3rd arg to match():
$ awk '{
out = $1
$1 = ""
while ( match($0,/(\S)(\S*)(.*)/,a) ) {
out = out OFS a[1] gensub(/./,"_","g",a[2])
$0 = a[3]
}
print out $0
}' file
Mary h__ a l_____ l___
It's f_____ w__ w____ a_ s___
With any awk in any shell on every UNIX box including the default awk on MacOS:
$ awk '{
out = $1
$1 = ""
while ( match($0,/[^[:space:]][^[:space:]]*/) ) {
str = substr($0,RSTART+1,RLENGTH-1)
gsub(/./,"_",str)
out = out OFS substr($0,RSTART,1) str
$0 = substr($0,RSTART+RLENGTH)
}
print out $0
}' file
Mary h__ a l_____ l___
It's f_____ w__ w____ a_ s___
Here is another awk script (works in all awk versions) that I enjoyed writing for this question.
script.awk
{
for (i = 2; i <= NF; i++) { # for each input word starting from 2nd word
head = substr($i,1,1); # output word head is first letter from current field
tail = substr("____________________________", 1, length($i) - 1); # output word tail is computed from template word
$i = head tail; # recreate current input word from head and tail
}
print; # output the converted line
}
input.txt
Mary had a little lamb
It's fleece was white as snow
run:
awk -f script.awk input.txt
this could be also condensed into single line:
awk '{for (i = 2; i <= NF; i++) $i = substr($i,1,1) substr("____________________________", 1, length($i) - 1); print }' input.txt
output is:
Mary h__ a l_____ l___
It's f_____ w__ w____ a_ s___
I enjoyed this task.

AWK - add value based on regex

I have to add up the numbers matched by a regex, using awk on Linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does anyone have any idea?
UPDATE:
it also should work for
john123 result - > 123
123john result - > 123
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123
With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^\+/,"","g",a); a=gensub(/\+$/,"","g",a); print a}' inputFile | bc
The regex part replaces every sequence of letters with '+' (e.g., '12johnny34' becomes 12+34). The resulting arithmetic expression is then evaluated by bc.
(To be safe, I remove leading and trailing '+' signs with ^\+ and \+$.)
You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
Demo:
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - records are split into fields with : char
n=split($1, a, /[^0-9]+/) - splits Field 1 on runs of non-digit characters, saving the digit-only chunks in the array a; n holds the number of chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterates over the array a and sums the values
print b - prints the result.
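The steps listed above can be mirrored in a few lines of Python. This sketch collects the digit runs directly instead of splitting on non-digits, which gives the same sum; the function name sum_digit_runs is made up for the example.

```python
import re


def sum_digit_runs(passwd_line):
    """Add up every run of digits in the first colon-separated field."""
    # Only the text before the first ':' is inspected, like $1 with -F ':'.
    login = passwd_line.split(":", 1)[0]
    return sum(int(run) for run in re.findall(r"[0-9]+", login))


for line in (
    "123john456:x:98:98::/home/john123:/bin/bash",
    "john123:x:98:98::/home/john123:/bin/bash",
    "123john:x:98:98::/home/john123:/bin/bash",
):
    print(sum_digit_runs(line))
```

A login name with no digits at all simply yields 0, because the sum of an empty sequence is 0.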
I used awk's split() to separate the first field on any string not containing numbers.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123
You can use [^0-9]+ as a field separator, and :[^\n]*\n as a record separator instead:
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123
You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$
Here is another awk variant that adds up all the numbers present in the first field (fields are separated by :):
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

how to use regular expression in awk or sed, for find all homopolymers in DNA sequence?

Background
Homopolymers are sub-sequences of DNA made of consecutive identical bases, like AAAAAAA. Here is a Python example that extracts them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
my effort
I wrote a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
FS=""
}
{
homopolymer = $1;
base = $1;
for(i=2; i<=NF; i++){
if($i == base){
homopolymer = homopolymer""base;
}else{
print homopolymer;
homopolymer = $i;
base = $i;
}
}
print homopolymer;
}'
output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
question
How can I use regular expressions in awk or sed to get the same result?
grep -o will get you that in one-line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
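The same back-reference pattern works in Python's re module, shown here as a small sketch restricted to the four DNA bases. The function name homopolymers is hypothetical. re.finditer is used because re.findall would return only the captured group (a single base per match) rather than the whole run.

```python
import re


def homopolymers(dna):
    """Return the maximal runs of identical bases, in order."""
    # ([ACGT]) captures one base; \1* extends the match while that base repeats.
    return [m.group(0) for m in re.finditer(r"([ACGT])\1*", dna)]


print(homopolymers("ACCCGGGTTTAACCGGACCCAA"))
```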
sed is not the best tool for this but since OP has asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This is gnu-sed.
Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{ split($0, chars, "")
for (i=1; i <= length($0); i++) {
if (chars[i]!=chars[i+1])
{
printf("%s\n", chars[i])
}
else
{
printf("%s", chars[i])
}
}
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
split() breaks the line passed to awk into single characters and stores them in the array chars[]. The loop then walks the array and compares each character with the next one (chars[i]!=chars[i+1]). If they are equal, the character is printed without a newline, so the run continues on the same line. If they differ, the character is printed followed by \n, which ends the current homopolymer.
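The compare-with-the-next-character loop described here is easy to try in Python as well. This is a minimal sketch of that idea, collecting the runs in a list instead of printing them; the function name runs is made up for the example.

```python
def runs(seq):
    """Split a string into runs of identical adjacent characters."""
    out = []
    current = ""
    for i, ch in enumerate(seq):
        current += ch
        # A run ends at the last character or when the next one differs.
        if i + 1 == len(seq) or seq[i + 1] != ch:
            out.append(current)
            current = ""
    return out


print(runs("ACCCGGGTTTAACCGGACCCAA"))
```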

replace all to "hello" except string that is in between double quotes

$ cat file1
"rome" newyork
"rome"
rome
What do I need to fill in the blank?
$ sed ____________________ file1
I want output like
"rome" newyork
"rome"
hello
if my input is like this
$ cat file1
/temp/hello/ram
hello
/hello/temp/ram
if I want to change the hello that does not have slashes what should I do? (change hello to happy)
/temp/hello/ram
happy
/hello/temp/ram
sed -E 's/(^|[^"])rome([^"]|$)/\1hello\2/g' your_file
tested below:
> cat temp
"rome" newyork
"rome"
rome
> sed -E 's/(^|[^"])rome([^"]|$)/\1hello\2/g' temp
"rome" newyork
"rome"
hello
>
Why is rome changed to hello but newyork is not? If I'm reading the question correctly, you're trying to replace everything not in double quotes with hello?
Depending on the exact use cases you want (what happens to the input string ""?), you probably want something like this:
sed 's/\".*\"/hello/'
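I don't see a direct way to replace every occurrence except those enclosed in " ".
However, a brute-force approach with three chained substitutions works: temporarily rename the quoted word to a placeholder, replace the rest, then restore it (this assumes "italy" does not otherwise occur in the file).
sed -e 's/"rome"/"italy"/g' -e 's/rome/hello/g' -e 's/"italy"/"rome"/g' file1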
The second problem can be solved with a simple perl one-liner (assuming only one hello per line):
perl -pe 'next if /\//; s/hello/happy/;'
The first problem requires some internal book keeping to keep track of whether you are inside a string or not. This can also be solved with perl:
#!/usr/bin/perl -w
use strict;
use warnings;
my $state_outside_string = 0;
my $state_inside_string = 1;
my $state = $state_outside_string;
while (my $line = <>) {
my @chars = split(//,$line);
my $not_yet_printed = "";
foreach my $char (@chars) {
if ($char eq '"') {
if ($state == $state_outside_string) {
$state = $state_inside_string;
$not_yet_printed =~ s/rome/hello/;
print $not_yet_printed;
$not_yet_printed = "";
} else {
$state = $state_outside_string;
}
print $char;
next;
}
if ($state == $state_inside_string) {
print $char;
} else {
$not_yet_printed .= $char;
}
}
$not_yet_printed =~ s/rome/hello/;
print $not_yet_printed;
}
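The inside/outside bookkeeping can also be sketched in Python by splitting each line into quoted and unquoted pieces. Two simplifying assumptions differ from the Perl script: quoted strings are assumed not to span lines, and every occurrence outside quotes is replaced, not just the first. The function name replace_outside_quotes is hypothetical.

```python
import re


def replace_outside_quotes(line, old="rome", new="hello"):
    """Replace `old` with `new` everywhere except inside double quotes."""
    # Splitting on a capturing group keeps the quoted pieces in the result:
    # even indexes are outside quotes, odd indexes are the quoted strings.
    parts = re.split(r'("[^"]*")', line)
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(old, new)
    return "".join(parts)


for text in ('"rome" newyork', '"rome"', 'rome'):
    print(replace_outside_quotes(text))
```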