I have a file like this. It is a 7-column file whose separator is a single space (sep=" ").
However, the 4th column is a string of several words that itself contains spaces, and the last 3 columns are numbers.
test_find.txt
A UTR3 0.760 Sterile alpha motif domain|Sterile alpha motif domain;Sterile alpha motif domain . . 0.0007
G intergenic 0.673 BTB/POZ domain|BTB/POZ domain|BTB/POZ domain . . 0.0015
I want to replace the spaces with underscores (e.g. replace "Sterile alpha motif domain" with "Sterile_alpha_motif_domain"). My idea was: first find the pattern that starts with letters and ends with "|", treat it as one string, and replace all of its spaces with "_"; then move to the next line and find the next pattern. (Is there an easier way to do it?)
I was able to use sed -i -e 's/Sterile alpha motif domain/Sterile_alpha_motif_domain/g' test_find.txt for the first row only, but I cannot generalize it.
I tried to find all patterns using sed -n 's/^[^[a-z]]*{\(.*\)\\[^\|]*$/\1/p' test_find.txt but it doesn't work.
Can anyone help me?
I want output like this:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Thank you!!!!
This needs two-step processing: first extract the 4th column, which may
contain spaces; then replace the spaces in that column with underscores.
With GNU awk:
gawk '{
if (match($0, /^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$/, a)) {
gsub(/ /, "_", a[3])
print a[1] a[3] a[4]
}
}' test_find.txt
Output:
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
The regex ^(([^ ]+ ){3})(.+)(( [0-9.]+){3})$ matches the whole line and captures
each submatch.
The 3rd argument of match() (a GNU awk extension), a, is the name of an array
that receives the capture groups: a[1] holds the 1st-3rd columns,
a[3] holds the 4th column, and a[4] holds the 5th-7th columns.
The gsub function replaces each space with an underscore.
Then the pieces are concatenated and printed.
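If your awk lacks the 3-argument match() (e.g. mawk or a strictly POSIX awk), here is a sketch of the same idea that rebuilds the 4th column from the middle fields instead; like the above, it assumes only the 4th column contains spaces:
awk '{
    # Rebuild the 4th "column" from fields 4 .. NF-3
    text = $4
    for (i = 5; i <= NF - 3; i++) text = text " " $i
    # Replace its spaces with underscores and reprint the 7 columns
    gsub(/ /, "_", text)
    print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test_find.txt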
Assuming you have a special character at the end, before the final numeric column, you can try this sed:
$ sed -E 's~([[:alpha:]/]+) ~\1_~g;s/_([[:punct:]])/ \1/g' input_file
0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Without making any assumptions about the content of each field, you can 'brute force' the expected result by counting the number of characters (plus field separators) before and after the 4th column, and using those counts to extract and manipulate the 4th column, e.g.
awk '{start=length($1)+length($2)+length($3)+4; end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6; text=substr($0, start, end); gsub(" ", "_", text); print $1, $2, $3, text, $(NF-2), $(NF-1), $NF}' test.txt
'Neat' version:
awk '{
start=length($1)+length($2)+length($3)+4
end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6
text=substr($0, start, end)
gsub(" ", "_", text)
print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
A UTR3 0.760 Sterile_alpha_motif_domain|Sterile_alpha_motif_domain;Sterile_alpha_motif_domain . . 0.0007
G intergenic 0.673 BTB/POZ_domain|BTB/POZ_domain|BTB/POZ_domain . . 0.0015
Breakdown:
awk '{
# How many characters precede column 4: the lengths of the first 3 fields, plus 3 separators, plus 1 to land on column 4's first character
start=length($1)+length($2)+length($3)+4;
# How many characters are in column 4: total length minus the first 3 fields, the last 3 fields and the 6 field separators
end=length($0)-length($NF)-length($(NF-1))-length($(NF-2))-length($1)-length($2)-length($3)-6;
# Use the substr function to extract column 4
text=substr($0, start, end);
# Replace spaces with underscores in column 4
gsub(" ", "_", text);
# Print everything
print $1, $2, $3, text, $(NF-2), $(NF-1), $NF
}' test.txt
Related
I am trying to match and then extract a pattern from a text string. I need to extract any pattern that matches the following in the text string:
10289 20244
Text File:
KBOS 032354Z 19012KT 10SM FEW060 SCT200 BKN320 24/17 A3009 RMK AO2 SLP187 CB DSNT NW T02440172 10289 20244 53009
I am trying to achieve this using the following bash code:
Bash Code:
cat text_file | grep -Eow '\s10[0-9].*\s' | head -n 4 | awk '{print $1}'
The above code attempts to search for a group of roughly five numeric characters that begins with 10 and is followed by three more digits. After matching this pattern, the code prints the rest of the text string, capturing the second group of five numeric characters, which begins with 20.
I need a better, more reliable way to accomplish this because this code currently fails. The numeric groups I need are separated by a space; I have attempted to account for this by inserting \s into the grep portion of the code.
grep solution:
grep -Eow '10[0-9]{3}\b.*\b20[0-9]{3}' text_file
The output:
10289 20244
[0-9]{3} - matches 3 digits
\b - word boundary
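If you would rather get each group on its own line, a sketch with GNU grep (assuming no other whole-word 5-digit token starting with 10 or 20 appears in the line):
grep -Eow '(10|20)[0-9]{3}' text_file
10289
20244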
awk '{print $(NF-2),$(NF-1)}' text_file
10289 20244
Prints the third-from-last and second-from-last fields.
awk '$17 ~ /^10[0-9]{3}$/ && $18 ~ /^20[0-9]{3}$/ { print $17, $18 }' text_file
This will check field 17 for "10xxx" and field 18 for "20xxx", and when BOTH match, print them.
Is it possible using sed/awk to match the last k occurrences of a pattern in a line?
For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):
10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9
I want to match only the commas starting from 299 until the end of the line in both cases.
Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.
There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.
Another sed alternative. Replace the last 3 commas with tabs by reversing the line, substituting the first 3, and reversing back:
$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev
10, 5, "Sally went to the store, and then , 299 ABD F 10
with GNU sed, you can simply write
$ sed 's/,/\t/g5' file
10, 5, "Sally went to the store, and then , 299 ABD F 10
replace all occurrences starting from the 5th (note that this lines up with "the last 3" only when every line has the same total number of commas).
You can use Perl to add the missing double quote into each line:
perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output
or, to turn the commas into tabs:
perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output
-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the fields from the third up to the fifth-from-the-right with those elements joined by ", ", and separates the resulting array with tabs.
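To see what that splice does in isolation, here is a small sketch on a hypothetical, already-split array:
perl -E '
    @F = ("10", "5", "Sally went", "and then", "299", "ABD", "F", "10");
    # replace elements 2 .. $#F-4 with a single element joined by ", "
    splice @F, 2, -4, join ", ", @F[2 .. $#F - 4];
    say join "|", @F;   # prints: 10|5|Sally went, and then|299|ABD|F|10
'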
To fix the CSV, I would do this:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -lne '
@F = split /, /; # field separator is comma and space
@start = splice @F, 0, 2; # first 2 fields
@end = splice @F, -4, 4; # last 4 fields
$string = join ", ", @F; # the stuff in the middle
$string =~ s/"/""/g; # any double quotes get doubled
print join(",", @start, "\"$string\"", @end);
'
outputs
10,5,"""Sally went to the store, and then ",299,ABD,F,10
One regex that matches each of the three last commas separately would require a negative lookahead, which sed does not support.
You can use the following sed-regex to match the last three fields and the commas directly before them all at once:
,[^,]*,[^,]*,[^,]*$
$ matches the end of the line.
[^,] matches anything but ,.
Groups allow you to re-use the field values in sed:
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
For awk, have a look at How to print last two columns using awk.
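For completeness, a small awk sketch in the same spirit (not taken from that link): split on every comma, keep the early commas as they are, and put tabs before the last 3 fields:
awk '{
    n = split($0, f, ",")
    out = f[1]
    for (i = 2; i <= n - 3; i++) out = out "," f[i]   # keep the leading commas
    for (i = n - 2; i <= n; i++) out = out "\t" f[i]  # tab before each of the last 3 fields
    print out
}' file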
There's probably a smarter way to do this
In case all your wanted commas are followed by a space and the unwanted commas are not, how about
sed 's/,\([^ ]\)/.\1/g'
This transforms a, b, 12,3, c into a, b, 12.3, c.
Hi, I guess this does the job:
echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'
Returns
d e f
But you need to ensure each line has more than 3 fields.
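One way to add that guard is an NF check, e.g. (a sketch of the same approach):
echo 'a,b,c,d,e,f' | awk -F',' 'NF > 3 { for (i = NF-2; i <= NF; i++) printf "%s%s", $i, (i < NF ? "\t" : "\n") }'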
This will do what you're asking for with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
gsub(/\t/," ")
match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
gsub(/,/,"\t",a[1])
gsub(/,/,"\t",a[4])
print a[1] a[3] a[4]
}
$ awk -f tst.awk file
10 5 "Sally went to the store, and then , 299 ABD F 10
10 6 If this is the case, and also this happened, then, 299 A F 9
but I'm not convinced what you're asking for is a good approach so YMMV.
Anyway, note the first gsub(), which makes sure there are no tabs on the input line - that is crucial if you are converting some commas to tabs in order to use tabs as the output field separators!
The 9th column has multiple values separated by ";". I am trying to find the first occurrence of the string after "name_id" in column $9 of a tab-delimited file. The first line of the file looks like this, e.g.:
1 NY state 3102016 3102125 . + . name_id "ENSMUSG8868"; trans_id "ENSMUST00000082908"; number "1"; id_name "Gm26206";ex_id "ENSMUSE000005";
There are multiple values separated by ";" in the 9th column. I could come up with this command, which pulls out the last id, "ENSMUSE000005":
sed 's|.*"\([0-9_A-Z]\+\)".*|\1|' input.txt | head
Can it be done with a regex in awk? Thanks a lot!
echo $x |awk -F';' '{split($1,a," ");gsub(/"/ ,"" ,a[10]);print a[10]}'
ENSMUSG8868
Where x is your line.
Based on OP's comments :
echo $x |awk -F';' '{split($1,a," ");gsub(/"/ ,"" ,a[10]);print a[1],a[10]}'
1 ENSMUSG8868
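If GNU awk is available, a regex-based sketch that answers the "can it be done with a regex in awk" part directly (it assumes the value always appears in the form name_id "..."):
awk 'match($0, /name_id "([^"]+)"/, m) { print m[1] }' input.txt
ENSMUSG8868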
For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use Linux awk regular expressions to find the line that contains the longest repeated¹ word, so in this case it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
¹ Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for line numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
Alternative solution with more awk, less sort
As pointed out by tripleee in the comments, this can be simplified: the two awk steps and the sort step can be combined into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command in the first solution, or not keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
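For reference, the single-longest-match simplification of the first solution could look something like this sketch:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
  awk '{ print length(), $0 }' | sort -k 1,1 -nr |
  head -n 1 | cut -d ' ' -f 2- | grep -f - infile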
Portability remark
Apparently, BSD grep can't do grep -f - to read patterns from stdin. In that case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file then used with grep -f, as sketched below.
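A sketch of that workaround (the mktemp call is my addition, not part of the original pipeline):
patterns=$(mktemp)
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
  awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
       END{for (i in a){print a[i]}}' > "$patterns"
grep -f "$patterns" infile
rm -f "$patterns"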
A way with perl:
perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
  @m=sort { length $b <=> length $a} @m;
  $cl=length @m[0];
  if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array (@m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line options:
-F, sets the field separator to ,
-ane (-e executes the following code, -n reads a line at a time and puts its content in $_, -a autosplits the line on the defined FS and puts the fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).
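A quick way to see what the lookahead captures, on a tiny made-up string:
perl -E '@m = "aabcabc" =~ /(?=(.+)\1)/g; say for @m'
a
abc
The "a" comes from position 0 (aa) and "abc" from position 1 (abcabc); sorting by length then picks the longest one.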
I have a tab-delimited text file and wish to efficiently remove whole rows that fulfil either of the following criteria:
values in the ALT column that are equal to .
values in the NA00001 column and subsequent columns that have the same digit before and after either of the two delimiters, | or /, e.g. 0|0, 1|1, 2/2, etc.
An example input file is below:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 0|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1110696 rs6040360 A . 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
Example output file is:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
Your example doesn't appear to include any lines that meet the "values in the ALT column that are equal to ." criterion, or lines that don't meet the second criterion (except the header line). So I added some lines of my own to your example for testing; I hope I've understood your criteria.
The first criterion is easily matched by testing the particular field, if we're using something like awk: $5 == "." {next} in an awk script would skip that line. Just using a regular expression is pretty simple too: ^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I, where ^I is a tab character, matches lines with just "." in the fifth (ALT) field.
With strict regular expressions you can't express "the same digit before and after [a delimiter]" directly. You have to do it with alternation of sub-expressions with specific values: 0[|/]0|1[|/]1|2[|/]2... But there are only 10 digits, so this isn't particularly burdensome. So, for example, you can do this filtering with one long egrep command line:
egrep -v '^[^^I]*^I[^^I]*^I[^^I]*^I[^^I]*^I\.^I|0[|/]0|1[|/]1|2[|/]2|3[|/]3|4[|/]4|5[|/]5|6[|/]6|7[|/]7|8[|/]8|9[|/]9' input-file
Obviously that's not something you'd want to type by hand on a regular basis, and isn't ideal for maintenance. A little awk script is better:
#! /usr/bin/awk -f
# Skip lines with "." in the fifth (ALT) field
$5 == "." {next}
# Skip lines with the same digit before and after the delimiter in any field
/0[|/]0/ {next}
/1[|/]1/ {next}
/2[|/]2/ {next}
/3[|/]3/ {next}
/4[|/]4/ {next}
/5[|/]5/ {next}
/6[|/]6/ {next}
/7[|/]7/ {next}
/8[|/]8/ {next}
/9[|/]9/ {next}
# Copy all other lines to the output
{print}
I've put the individual digit checks as separate awk statements for readability.
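If you prefer not to spell out all ten digits, the same enumeration can be generated in a loop with a dynamic regex (a sketch; like the script above, it checks the whole line rather than only the genotype columns):
awk '{
    if ($5 == ".") next
    for (d = 0; d <= 9; d++)
        if ($0 ~ (d "[|/]" d)) next    # builds and tests "0[|/]0", "1[|/]1", ...
    print
}' input-file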
With a richer regular expression flavor, you can express "same character before and after the delimiter" directly, using a back-reference. Back-references should be used with caution, since they can create pathological performance characteristics; and, of course, you'll have to use a language that supports them, such as Perl. POSIX awk and GNU gawk don't. Here's a Perl one-liner that handles the second criterion:
LINE: while (<STDIN>) { next LINE if /(\d)[|\/]\g1/; print }
That's probably not very good Perl - I almost never use the language - but it works in my testing. The (\d) matches and remembers the digit before the delimiter, and the \g1 matches the remembered digit after the delimiter.
perl -alnE '$F[4] eq "." and
$F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
say'
Update: Sorry, the OP asked for the opposite...
perl -alnE 'say unless (
$f[4] eq "." or
( $F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
$F[11] =~ m!(\d)[|/]\1!
)
)'
or equivalent
perl -ane 'next if ( $F[4] eq ".");
next if ( $F[9] =~ m!(\d)[|/]\1! and
$F[10] =~ m!(\d)[|/]\1! and
$F[11] =~ m!(\d)[|/]\1! );
print '