Matching multiple patterns in the same line using unix utilities


I am trying to match several patterns on each line and display only the first match of each. One of them, the fourth field, can match either of two patterns, i.e. A,BCD.EF or AB.CD. An example would be:
Example 1:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00
The expected output would be
Expected Result 1:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
I have got this far using my little knowledge of grep and stackoverflow.
< test_data.txt grep -one "[0-9]/[0-9][0-9]\|[0-9]*,[0-9]*.[0-9][0-9]\|[0-9]*.[0-9][0-9]" | awk -F ":" '$1 == y { sub(/[^:]:/,""); r = (r ? r OFS : "") $0; next } x { print x, r; r="" } { x=$0; y=$1; sub(/[^:]:/,"",x) } END { print x, r }'
Any ideas on how to make this simpler or cleaner and achieve the complete functionality?
Update 1: A few other examples could be:
Example 2:
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
There could be more fields in some lines.
The order of the fields is not necessarily preserved either. I could get around this by treating the files that have a different order separately, or by transforming them to this order somehow. So this condition can be relaxed.
Update 2: It seems my question was not clear. One way of looking at it: on each line, find the first "time", the first alphanumeric string, and the first decimal value (with or without a comma in it), and print all of them on the same output line. More generically: given an input line, print the first occurrence of pattern 1, the first occurrence of pattern 2, and the first occurrence of pattern 3 (which is itself an "or" of two patterns) on one output line, and the result must be stable (i.e. preserve the order in which they appeared in the input). Sorry that the example is a little complicated; I am also trying to learn whether this is the point at which to leave Unix utilities for a full language like Perl/Python. Here are the expected results for the second set of examples.
Expected Result 2:
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00

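#!/usr/bin/awk -f
BEGIN {
    # one anchored regex per wanted field: time, word, decimal amount
    p[0] = "^[0-9]+:[0-9]{2}$"
    p[1] = "^[[:alpha:]][[:alnum:]]*$"
    p[2] = "^[0-9]+[0-9,]*[.][0-9]{2}$"
}
{
    i = 0
    for (j = 1; j <= NF; ++j) {
        for (k = 0; k in p; ++k) {
            # keep field j only for the first match of pattern k on this line
            if ($j ~ p[k] && !q[k]++ && j > ++i) {
                $i = $j
            }
        }
    }
    q[0] = q[1] = q[2] = 0   # reset the "seen" flags for the next line
    NF = i                   # drop the leftover fields
    print
}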
Input:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
Output:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
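Since the question also asks whether this is the point to move to Perl/Python, here is a minimal Python sketch of the same first-match-per-pattern idea. The function name `first_matches` and the `PATTERNS` list are made up for illustration, and the three regexes are only assumed to mirror the awk patterns above.

```python
import re

# Assumed token patterns, mirroring the three awk regexes above.
PATTERNS = [
    re.compile(r"[0-9]+:[0-9]{2}"),         # time, e.g. 12:23
    re.compile(r"[A-Za-z][A-Za-z0-9]*"),    # word starting with a letter
    re.compile(r"[0-9][0-9,]*\.[0-9]{2}"),  # decimal amount, commas allowed
]


def first_matches(line):
    """Return the first field matching each pattern, in input order."""
    seen = set()    # indexes of patterns that already matched on this line
    picked = []
    for field in line.split():
        for idx, pattern in enumerate(PATTERNS):
            if idx not in seen and pattern.fullmatch(field):
                seen.add(idx)
                picked.append(field)
                break
    return picked


for sample in ("12:23 23:23 ASDFGH 1,232.00 22.00",
               "22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00"):
    print(" ".join(first_matches(sample)))
```

Because fields are appended in the order they are encountered, the output stays stable even if the field order differs between lines.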

A Perl-style regex should solve the problem:
(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))
It will capture the following data (processing each line you provided separately):
RESULT$VAR1 = [
'12:23',
'ASDFGH',
'1,232.00'
];
RESULT$VAR1 = [
'21:22',
'ASDSDS',
'22.00'
];
RESULT$VAR1 = [
'12:21',
'ASADSS',
'11.00'
];
RESULT$VAR1 = [
'22:22',
'BASDASD',
'1,231.00'
];
Example perl script.pl:
#!/usr/bin/perl
use strict;
use Data::Dumper;
open my $F, '<', shift @ARGV;
my @strings = <$F>;
my $qr = qr/(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))/;
foreach my $string (@strings) {
    chomp $string;
    next if not $string;
    my @tab = $string =~ $qr;
    print join(" ", @tab) . "\n";
}
Run as:
perl script.pl test_data.txt
Cheers!

Related

perl - range between with regex

I have a file like
$ cat num_range.txt
rate1, rate2, rate3, rate4, rate5
pay1, pay2, rate1, rate2, rate3, rate4
rev1, rev2
rate2, rate3, rate4
And I need to filter the comma-separated rows by matching against a prefix and a numeric range.
For example - if the input is "rate" and range is 2 to 5, then I should get
rate2, rate3, rate4, rate5
rate2, rate3, rate4
rate2, rate3, rate4
If it is 5 to 10, then I should get
rate5
When I use perl -ne ' while ( /rate(\d)/g ) { print "$&," } ; print "\n" ' num_range.txt I get all the matches for the prefix,
but the one below does not work:
perl -ne ' while ( /rate(\d){2,5}/g ) { print "$&," } ; print "\n" ' num_range.txt

A straightforward way
perl -wnE'
print join",", grep { /rate([0-9]+)/ and $1 >= 2 and $1 <= 5 } split /\s*,\s*/
' file
The hard-coded keyword rate and the limits (2 and 5) can of course be variables set from input.
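As a rough illustration of turning the keyword and limits into parameters, here is a Python sketch. The function name `filter_by_range` is hypothetical, and unlike the Perl one-liner it assumes each comma-separated item must consist of exactly the prefix followed by digits.

```python
import re


def filter_by_range(line, prefix, low, high):
    """Keep comma-separated items of the form <prefix><number> with low <= number <= high."""
    pattern = re.compile(re.escape(prefix) + r"([0-9]+)")
    kept = []
    for item in re.split(r"\s*,\s*", line.strip()):
        match = pattern.fullmatch(item)
        if match and low <= int(match.group(1)) <= high:
            kept.append(item)
    return kept


rows = [
    "rate1, rate2, rate3, rate4, rate5",
    "pay1, pay2, rate1, rate2, rate3, rate4",
    "rev1, rev2",
    "rate2, rate3, rate4",
]
for row in rows:
    kept = filter_by_range(row, "rate", 2, 5)
    if kept:                      # skip rows with nothing in range
        print(", ".join(kept))
```

The comparison is done on the integer value, so a range such as 5 to 10 works without changing the regex.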

Your code does nothing to compare the matched number to the range.
Also, you are gratuitously printing a comma after the last entry.
Try this instead.
perl -ne '$sep = ""; while (/(rate(\d+))/g ) {
if ($2 >= 2 and $2 <= 5) {
print "$sep$1"; $sep=", ";
}
}
print "\n" if $sep' num_range.txt
Notice also how \d+ matches any number after rate; the captured value is then checked in a separate numeric comparison. This is slightly clumsy in isolation, but easy to adapt to different number ranges.

To explain why your code isn't working:
/rate(\d){2,5}/g
This doesn't do what you think it does. The {x,y} syntax sets how many times the preceding item (here the group (\d)) may repeat; it does not express a numeric range.
So this matches "the string 'rate' followed by between 2 and 5 digits", and that won't match anything in your data.
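The difference between counting digits and comparing values can be shown with a small Python sketch. The helper names and the extra input "rate12345" are invented for the demonstration.

```python
import re


def by_digit_count(text):
    # {2,5} repeats the preceding group: "rate" followed by 2 to 5 digits
    return [m.group(0) for m in re.finditer(r"rate(\d){2,5}", text)]


def by_numeric_value(text, low, high):
    # capture the whole number, then compare it as an integer
    return [m.group(0) for m in re.finditer(r"rate(\d+)", text)
            if low <= int(m.group(1)) <= high]


sample = "rate1, rate2, rate3, rate4, rate5"
print(by_digit_count(sample))          # [] since every number has one digit
print(by_digit_count("rate12345"))     # ['rate12345']
print(by_numeric_value(sample, 2, 5))  # ['rate2', 'rate3', 'rate4', 'rate5']
```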

This does the job:
perl -anE '@rates=();while(/rate(\d+)/g){push @rates,$& if $1>=2 && $1<=5}say"@rates" if @rates' file.txt
Output:
rate2 rate3 rate4 rate5
rate2 rate3 rate4
rate2 rate3 rate4

How to locate the data column instead of character position NOT using loop to search in Perl and Bash?

How can I locate the atom H (as a column number rather than a character position) in a string/array using Perl or Bash? I am trying to avoid unnecessary loops to search for the H because my data has more than a million lines.
I have research data like the line shown below:
FRAM_# 20000000 5000000(fs) CN= 1 PRMRYTGT 16652 O 16654 H 1.036 8140 CA 2.586 7319 AL 1.963
Here there are O, H, CA, and AL atoms. The first atom is the target atom, oxygen, and the rest are neighbors bonded to the target oxygen. Except for the first atom (oxygen), the integer before each atom is the atom ID, and the float after it is the bond length to the first atom O (ID=16652).
$line = 'FRAM_# 20000000 5000000(fs) CN= 1 PRMRYTGT 16652'
      . ' O 16654 H 1.036 8140 CA 2.586'
      . ' 7319 AL 1.963';
@values = split(/\s+/, $line);
my $bondlength;
my $neighbor_ID;
# the first neighbor symbol sits at index 9 in this line, then every 3rd field
for (my $i = 9; $i <= $#values; $i = $i + 3) {
    if ($values[$i] eq 'H') {
        $neighbor_ID = $values[$i-1];
        $bondlength = $values[$i+1];
    } else {
        next;
    }
}
I can use a loop to search for the position of H in the array @values. However, is there any other way (not a loop), such as a regex or a Bash script, to get the position of H in the array? I would highly appreciate any further suggestions and help.
I want to find the hydrogen bond (bond length longer than 1.5 angstrom) between H and the target oxygen. So I have to get the atom ID of H and the related bond length: first I need to find the location of H, then the atom ID and bond length next to it, and then I can do further data analysis.
NOTE: I have more than 1 million data lines, so I have to consider the efficiency of the code. H is my target atom in this example. The number of H atoms in a data line may vary.

It is not clear what exactly the expected result is for the given input.
If it is the pair of numbers before and after the letter H, the following will do.
sed -E 's/.*\s+(\S+)\s+H\s+(\S+)\s.*/\1,\2/' < input.txt
Sample input:
FRAM_# 20000000 5000000(fs) CN= 1 PRMRYTGT 16652 O 16654 H 1.036 8140 CA 2.586 7319 AL 1.963
Sample output:
16654,1.036

open(FH, "data.txt") or die "Can't open data.txt: $!";
while (<FH>)
{
    if (@d = /\bO\s+(\d+)\s+H\s+(1\.[5-9]\d*|[2-9][\d.]*)/) { print "$_\n" for @d }
    $ID = $d[0];
    $len = $d[1];
}
Each data line yields only the ID, stored in $d[0], and the bond length, stored in $d[1]; the array @d is filled only if the bond length is greater than or equal to 1.5.

perl: use firstidx from List::MoreUtils
use List::MoreUtils qw/ firstidx /;
my $line = '...';
my @values = split ' ', $line;
my $h_idx = firstidx {$_ eq 'H'} @values;
print "H appears at index $h_idx\n";
This will use a loop under the hood. I don't see how you can avoid it. If your list was sorted, you could use a binary search, but your list is not sorted.
On the other hand, you don't need to split your line into a list at all, which should save you some time:
my ($neighbour_id, $bond_length) = $line =~ /(\d+) \s+ H \s+ (\S+)/x;
return $neighbour_id if $bond_length > 1.5;
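The question notes that a line may contain a varying number of H atoms. A hedged Python sketch of the same regex idea that collects every H neighbor follows; the name `hydrogen_neighbors` and the assumed "ID H length" layout are illustrative, not taken from the original data format description.

```python
import re

# Assumed layout: "<atom ID> H <bond length>", separated by whitespace.
H_NEIGHBOR = re.compile(r"(?<!\S)(\d+)\s+H\s+(\d+(?:\.\d+)?)")


def hydrogen_neighbors(line, min_length=1.5):
    """Return (atom_id, bond_length) for every H whose bond exceeds min_length."""
    found = []
    for match in H_NEIGHBOR.finditer(line):
        bond_length = float(match.group(2))
        if bond_length > min_length:
            found.append((int(match.group(1)), bond_length))
    return found


line = ("FRAM_# 20000000 5000000(fs) CN= 1 PRMRYTGT 16652 O 16654 H 1.036 "
        "8140 CA 2.586 7319 AL 1.963")
print(hydrogen_neighbors(line))                  # [] because 1.036 is below 1.5
print(hydrogen_neighbors(line, min_length=1.0))  # [(16654, 1.036)]
```

The regex engine still scans the line internally, but no explicit split or index arithmetic is needed in the script itself.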

Here's your code, reformatted slightly for readability (whitespace is not in short supply!)
$line = "FRAM_# 20000000 5000000(fs) CN= 1 PRMRYTGT 16652 O 16654 H 1.036 8140 CA 2.586 7319 AL 1.963";
@values = split(/\s+/, $line);
my ($bondlength, $neighbor_ID);
# atom symbols sit at indexes 9, 12, 15, ... in this line
for (my $i = 9; $i <= $#values; $i += 3) {
    if ($values[$i] eq 'H') {
        $neighbor_ID = $values[$i-1];
        $bondlength = $values[$i+1];
    }
    else {
        next;
    }
}
The else clause in your loop is completely unnecessary. The loop will just go on to the next iteration anyway.
One obvious optimisation is to stop looking once you've found the hydrogen atom. So your code would look like this:
$line = "FRAM_# 20000000 5000000(fs) CN= 1 PRMRYTGT 16652 O 16654 H 1.036 8140 CA 2.586 7319 AL 1.963";
@values = split(/\s+/, $line);
my ($bondlength, $neighbor_ID);
# atom symbols sit at indexes 9, 12, 15, ... in this line
for (my $i = 9; $i <= $#values; $i += 3) {
    if ($values[$i] eq 'H') {
        $neighbor_ID = $values[$i-1];
        $bondlength = $values[$i+1];
        last; # stop looking once you've found it
    }
}
I don't know if it's enough of an optimisation to solve all of your problems, but it's a start.

how to use regular expression in awk or sed, for find all homopolymers in DNA sequence?

Background
Homopolymers are sub-sequences of DNA made of consecutive identical bases, like AAAAAAA. An example in Python that extracts them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print(homopolymers)
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
My effort
I made a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
FS=""
}
{
homopolymer = $1;
base = $1;
for(i=2; i<=NF; i++){
if($i == base){
homopolymer = homopolymer""base;
}else{
print homopolymer;
homopolymer = $i;
base = $i;
}
}
print homopolymer;
}'
output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Question
How can I use regular expressions in awk or sed to get the same result?

grep -o will get you that in one line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
sed is not the best tool for this but since OP has asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This requires GNU sed.
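For comparison with the Python snippet in the question, the same back-reference pattern also works with Python's re module. This is only a sketch; the function name `homopolymers` is chosen for the example.

```python
import re


def homopolymers(dna):
    """Split a sequence into runs of one repeated letter using a back-reference."""
    # ([A-Z]) captures one letter, \1* extends the match while that letter repeats
    return [m.group(0) for m in re.finditer(r"([A-Z])\1*", dna)]


print(homopolymers("ACCCGGGTTTAACCGGACCCAA"))
```

re.finditer is used instead of re.findall because findall would return only the captured single letter when the pattern contains a group.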

Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{ split($0, chars, "")
for (i=1; i <= length($0); i++) {
if (chars[i]!=chars[i+1])
{
printf("%s\n", chars[i])
}
else
{
printf("%s", chars[i])
}
}
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
The split function breaks the line passed to awk into single characters and stores them in the array chars[]. We then walk through the array and compare each character with the next one (chars[i] != chars[i+1]). If they are equal, we print the character without a newline and move on. If the next one is different, we print the character followed by \n, a newline, which ends the current homopolymer.
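The compare-with-the-neighbor idea can also be expressed without a regex in Python, where itertools.groupby groups consecutive equal characters. A minimal sketch, with a function name invented for the example:

```python
from itertools import groupby


def homopolymers_by_grouping(dna):
    """Group consecutive identical characters, as the character-by-character loop does."""
    return ["".join(run) for _, run in groupby(dna)]


print("\n".join(homopolymers_by_grouping("ACCCGGGTTTAACCGGACCCAA")))
```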

awk with joined field

I am trying to extract data from one file, based on another.
The substring from file1 serves as an index to find matches in file2.
Everything works when the string to be searched for in file2 is between spaces or isolated, but when it is joined to other fields awk cannot find it. Is there a way to have awk match any part of the strings in file2?
awk -vv1="$Var1" -vv2="$var2" '
NR==FNR {
if ($4==v1 && $5==v2) {
s=substr($0,4,8)
echo $s
a[s]++
}
next
}
!($1 in a) {
print
}' /tmp/file1 /tmp/file2
example that works:
file1:
1 554545352014-01-21 2014-01-21T16:18:01 FS 14001 1 1.10
1 554545362014-01-21 2014-01-21T16:18:08 FS 14002 1 5.50
file2:
55454535 11 17 102 850Sande Fiambre 1.000
55454536 11 17 17 238Pesc. Dourada 1.000
example that does not work:
file2:
5545453501/21/20142 1716:18 1 1 116:18
5545453601/21/20142 1716:18 1 1 216:18
The string to be searched for, for instance 55454535, finds a match in the working example but not in the bottom one.

You probably want to replace this:
!($1 in a) {
print
}
with this (or similar - your requirements are unclear):
{
found = 0
for (s in a) {
if ($1 ~ "^"s) {
found = 1
}
}
if (!found) {
print
}
}
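The prefix test in that loop can be sketched in Python as follows. The function name and the sample key set are assumptions for illustration; note that str.startswith compares literally, whereas the awk version treats each key as a regex.

```python
def lines_without_known_prefix(lines, keys):
    """Keep lines whose first field does not start with any of the keys."""
    prefixes = tuple(keys)          # str.startswith accepts a tuple of prefixes
    kept = []
    for line in lines:
        fields = line.split()
        first = fields[0] if fields else ""
        if not first.startswith(prefixes):
            kept.append(line)
    return kept


file2 = [
    "5545453501/21/20142 1716:18 1 1 116:18",
    "5545453601/21/20142 1716:18 1 1 216:18",
]
print(lines_without_known_prefix(file2, {"55454535"}))
```

With one key per row of file1 this does a prefix check per key for every line of file2, the same cost profile as the awk loop above.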

Use a regex comparison ~ instead of ==
e.g. if ($4 ~ v1 && $5 ~ v2)
Prepend ^ to v1/v2 if you want the field to only begin with the string, and append $ if you want the field to only end with it.

Identifying pseudo-duplicates with Perl

I have a list that contains names. There are multiples of the same name. I want to catch the first instance of these pseudo-dupes and anchor them.
Example input
Josh Smith
Josh Smith0928340938
Josh Smith and friends
hello
hello1223
hello and goodbye.
What I want to do is identify the first occurrence of Josh Smith or hello and put an anchor such as a pipe | in front of it to validate. The names are effectively wildcards because the list is large, so I cannot look specifically for the first match of Josh Smith and so on.
My desired output would be this:
|Josh Smith
Josh Smith0928340938
Josh Smith and friends
|hello
hello1223
hello and goodbye.
I did not provide any code. I am a little in the dark on how to go about this and was hoping maybe someone had been in a similar situation using regex or Perl.

Based on what I understand of your requirements, I think you are looking for something like this:
$prefix = '';
$buffered = '';
$count = 0;
while ($line = <>) {
$linePrefix = substr($line,0,length($prefix));
if ($buffered ne '' && $linePrefix eq $prefix) {
$buffered .= $line;
$count++;
} else {
if ($buffered ne '') {
print "|" if ($count > 1);
print $buffered;
}
$buffered = $line;
$prefix = $line;
chomp $prefix;
$count = 1;
}
}
if ($buffered ne '') {
if ($count > 1) {
print "|";
}
print $buffered;
}
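The buffering logic above can be sketched in Python as well. This is an illustrative version, not a port verified against the original data: the function name `mark_group_heads` is hypothetical, and it assumes the input is a list of lines without trailing newlines.

```python
def mark_group_heads(lines, marker="|"):
    """Prefix a line with marker when the lines right after it start with the same text."""
    marked = []
    group = []                      # head line plus the lines that extend it

    def emit(run):
        if len(run) > 1:            # only heads that have followers get the marker
            marked.append(marker + run[0])
            marked.extend(run[1:])
        else:
            marked.extend(run)

    for line in lines:
        if group and line.startswith(group[0]):
            group.append(line)
        else:
            emit(group)
            group = [line]
    emit(group)                     # flush the last group
    return marked


names = ["Josh Smith", "Josh Smith0928340938", "Josh Smith and friends",
         "hello", "hello1223", "hello and goodbye."]
print("\n".join(mark_group_heads(names)))
```

Like the Perl version, this relies on the pseudo-duplicates being adjacent and on the shortest form of the name appearing first.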

Actually, IMO this is a rather interesting question, because you can be creative. Since you do not know how to identify the root name, I have to ask whether you need to. I have a feeling that you do not need a perfect solution. Therefore, I would go for something simple:
#!/usr/bin/perl -wn
$N = 4;
if (@prev) {
    $same_start = length $_ >= $N &&
        substr($prev[0], 0, $N) eq substr($_, 0, $N);
    unless ($same_start) {
        print "|", shift @prev if $#prev;
        @prev = grep { print;0 } @prev;
    }
}
push @prev, $_;
}{ print for @prev
edit: fixed bug: <print "|", shift @prev;> to <print "|", shift @prev if $#prev;>
Sample output:
$ perl josh.pl <josh-input.txt
|Josh Smith
Josh Smith0928340938
Josh Smith and friends
|hello
hello1223
hello and goodbye.
