awk unix - match regex - regex string size limit | ideas?

The following code works as a minimal example. It searches a regular expression with one mismatch inside a text (later a large DNA file).
awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'
Later I am interested in the position where the regular expression is found, so the awk command becomes more complex, similar to how it is solved here.
If I want to search with more mismatches and a longer string, I end up with very long regular expressions:
example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches "." allowed:
/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
... and so on (4060 alternatives in total, i.e. C(30,3))
/
The problem with my solution is:
a very long regex will not be accepted by awk! (the limit seems to be at roughly 80,000 characters)
Error: "bash: /usr/bin/awk: Argument list too long"
possible solution: SO-Link but I don't find the solution...
My question is:
Can I somehow still use the long regex expression?
Splitting the string and running the command multiple times could be a solution, but then I would get duplicated results.
Is there another way to approach this?
("agrep" will work, but not to find the positions)

As Jonathan Leffler points out in the comments, your issue in the first case (bash: /usr/bin/awk: Argument list too long) comes from the shell, and you can solve it by putting your awk script in a file.
As he also points out, your fundamental approach is not optimal. Below are two alternatives.
Perl has many features that will aid you with this.
You can use the ^ (XOR) operator on two strings; it returns \x00 at each position where the strings match and some other character where they don't. March through the longer string, XORing against the shorter one with a maximum substitution count, and there you are:
use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "AAAAAA";
my $max_subs = 3;

my $len_in  = length $seq;
my $len_pat = length $pat;
my %posn;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

for my $i ( 0 .. $len_in - $len_pat ) {
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}

say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;
Running this prints:
6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT
Substituting:
my $seq = "AAATCGAAAAGCDFAAAACGT";
my $pat = "AATC";
my $max_subs = 1;
Prints:
1 => AATC
8 => AAGC
15 => AAAC
It is also easy (in the same style as awk) to convert this to 'magic input' from either stdin or a file.
You can also write a similar approach in awk:
echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v seq="AATC" '
{
for(i=1; i<=length($1)-length(seq)+1; i++) {
cnt=0
for(j=1;j<=length(seq); j++)
if(substr($1,i+j-1,1)!=substr(seq,j,1)) cnt++
if (cnt<=mc) print i-1 " => " substr($1,i, length(seq))
}
}'
Prints:
1 => AATC
8 => AAGC
15 => AAAC
And the same result with the longer example above. Since the input is moved to STDIN (or a file) and the regex does not need to be HUGE, this should get you started either with Perl or Awk.
(Be aware that the first character of a string is offset 1 in awk and offset 0 in Perl...)
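A minimal illustration of that offset difference:
$ awk 'BEGIN { print index("ABC", "A") }'
1
$ perl -e 'print index("ABC", "A"), "\n"'
0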

The "Argument list too long" problem is not from Awk. You're running into the operating system's memory size limit on the argument material that can be passed to a child process. You're passing the Awk program to Awk as a very large command line argument.
Don't do that; put the code into a file, and run it with awk -f file, or make the file executable and put a #!/usr/bin/awk -f or similar hash-bang line at the top.
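For example, a minimal sketch (match.awk is just a placeholder name):
$ cat match.awk
BEGIN { print match("CTGGGTCATTAAATCGTTAGC", /.ATC|A.TC|AA.C|AAT./) }
$ awk -f match.awk
12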
That said, it's probably not such a great idea to include your data in the program source code as a giant literal.

Is there another way to approach this?
Looking for fuzzy matches is easy with Python. You just need to install the PyPi regex module by running the following in the terminal:
pip install regex # or pip3 install regex
and then create the Python script (named, say, script.py) like
#!/usr/bin/env python3
import regex

filepath = r'myfile.txt'
with open(filepath, 'r') as file:
    for line in file:
        for x in regex.finditer(r"(?:AATC){s<=1}", line):
            print(f'{x.start()}:{x.group()}')
Use the pattern you want; here, (?:AATC){s<=1} means you want to match the AATC char sequence allowing at most one substitution in the match. You can also prepend (?e) to make the engine attempt to find a better fit.
Run the script using python3 script.py.
If myfile.txt contains just one AAATCGAAAAGCDFAAAACGT line, the output is
1:AATC
8:AAGC
15:AAAC
meaning that there are three matches at positions 1 (AATC), 8 (AAGC) and 15 (AAAC).
The script prints both the starting position (x.start()) and the matched text itself (x.group()).
See an online Python demo:
import regex

line = 'AAATCGAAAAGCDFAAAACGT'
for x in regex.finditer(r"(?:AATC){s<=1}", line):
    print(f'{x.start()}:{x.group()}')

Related

Remove character from the middle of a string

I have a SAM file with an RX:Z: field containing 12 bases separated in the middle by a hyphen, i.e. RX:Z:CTGTGC-TCGTAA.
I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file, as the read names contain them too, like 1713704_EP0004-T.
I have mostly been trying tr, but this just removes all hyphens from the whole file:
tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam
input is a large SAM file of >10,000,000 lines like this:
1902336-103-016_C1D1_1E-T:34 99 chr1 131341 36 146M = 131376 182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MC:Z:147M MD:Z:83T62cD:i:4 cE:f:0 PG:Z:bwa RG:Z:A MI:Z:34 NM:i:1 cM:i:3 MQ:i:36 UQ:i:45 AS:i:141 XS:i:136 RX:Z:CTGTGC-TCGTAA
Desired output (the change is in the last field):
1902336-103-016_C1D1_1E-T:34 99 chr1 131341 36 146M = 131376 182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MC:Z:147M MD:Z:83T62cD:i:4 cE:f:0 PG:Z:bwa RG:Z:A MI:Z:34 NM:i:1 cM:i:3 MQ:i:36 UQ:i:45 AS:i:141 XS:i:136 RX:Z:CTGTGCTCGTAA
How do I solve this problem?
awk
awk '{sub(/-/,"",$NF)}1' file
is what you need.
Explanation
From this it is clear that you're concerned only about the last field.
NF is the total number of fields that a record contains, hence $NF is the last field.
sub(/-/,"",$NF) replaces the - in the last field with an empty string, making the change persistent.
GNU sed
For this same reason,
sed -Ei 's/^(.*)-/\1/' file
will work. It has an added advantage that it can perform an inplace edit.
Explanation
The -E option enables the extended regular expression engine.
(.*) is greedy: it matches any character (.) any number of times (*). Because it is greedy, it matches everything up to the last hyphen.
The () makes sed remember what was matched.
In the substitution part, we put back just the captured part \1 (1 because we have only one pair of parentheses; note that you can have as many as you like) without the hyphen, effectively removing the last hyphen, which should occur in the last field.
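A tiny demonstration of that greediness on a made-up string:
$ echo 'a-b-c' | sed -E 's/^(.*)-/\1/'
a-bc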
Note: GNU awk supports -i inplace, but I'm not sure from which version on.
I've solved this problem using pysam, which is faster, safer and requires less disk space since an intermediate SAM file is not required. It's not perfect; I'm still learning Python and have used pysam for half a day.
import pysam
import sys
from re import sub

# Provide a bam file
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Makes output filehandle
inbamfn = sys.argv[1]
outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)
inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print('Processed', n, 'reads:\n',
      w, 'UMIs written.\n',
      str(int((w / n) * 100)) + '% of UMIs fixed')
The best solution is to work with BAM rather than SAM files, and to use a proper BAM parser/writer library, such as htslib.
Lacking that, you can cobble something together by searching for the regular expression ^RX:Z: in the optional tags (columns 12 and up).
Working with columns, while possible, is hard with sed. Instead, here’s how to do this in awk:
awk -F '[[:space:]]+' '{
    for (i = 12; i <= NF; i++) {
        if ($i ~ /^RX:Z:/) gsub("-", "", $i)
    }
}
1' file.sam
And here’s a roughly equivalent solution as a Perl “one-liner”:
perl -ape '
    for (@F[11 .. $#F]) {
        s/-//g if /^RX:Z:/;
    }
    $_ = join("\t", @F) . "\n";
' file.sam
To perform the replacement in the original file, you can pass the option -i.bak to perl (this will create a backup file.sam.bak; if you don’t want the backup, omit the extension).
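For instance, the in-place variant would be invoked like this:
$ perl -i.bak -ape 'for (@F[11 .. $#F]) { s/-//g if /^RX:Z:/ } $_ = join("\t", @F) . "\n";' file.sam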
Is this pattern on many records that you want to edit, and always at the end of the line? If so:
sed -E 's/^(.*)(\s..:.:......)-(......\s*)$/\1\2\3/' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

BASH: Search a string and exactly display the exact number of times a substring happens inside it

I've searched all over and still can't find this simple answer. I'm sure it's so easy. Please help if you know how to accomplish this.
sample.txt is:
AAAAA
I want to find the exact number of times the combination "AAA" happens. If you just use, for example,
grep -o 'AAA' sample.txt | wc -l
We receive 1. This is the same count a standard text-editor search-box search gives, which treats each AAA hit as a block. However, I want the complete number of matches, counted starting from each individual character, which here is exactly 3.
I am looking for the literal, exact number of occurrences of "AAA" in sample.txt counted from every individual character, with overlaps allowed, not just the non-overlapping blocks a normal text-editor search finds.
How do we accomplish this, preferably in AWK? SED, GREP and anything else I can include in a Bash script are fine as well.
This might work for you (GNU sed & wc):
sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' file | wc -l
Lose any characters other than A's, and single or double A's. Then print a triple A, lose the first A, and repeat. Finally, count the number of lines printed.
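A quick check with AAAAA on stdin:
$ echo AAAAA | sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' | wc -l
3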
This isn't a trivial problem in bash. As far as I know, standard utils don't support this kind of searching. You can however use standard bash features to implement this behavior in a function. Here's how I would attack the problem, but there are other ways:
#!/bin/bash

search_term="AAA"
text=$(cat sample.txt)
term_len=${#search_term}
occurrences=0

# While the text is greater than or equal to the search term length
while [ "${#text}" -ge "$term_len" ]; do
    # Look at just the length of the search term
    text_substr=${text:0:${term_len}}

    # If we see the search term, increment occurrences
    if [ "$text_substr" = "$search_term" ]; then
        ((occurrences++))
    fi

    # Remove the first character from the main text
    # (e.g. "AAAAA" becomes "AAAA")
    text=${text:1}
done

printf "%d occurrences of %s\n" "$occurrences" "$search_term"
This is the awk version:
echo "AAAAA AAA AAAABBAAA" \
| gawk -v pat="AAA" '{
    for (i = 1; i <= NF; i++) {
        # current field length
        m = length($i)
        # search pattern length
        n = length(pat)
        for (l = 1; l < m; l++) {
            sstr = substr($i, l, n)
            #print i " " $i " sub:" sstr
            # substring matches pattern
            if (sstr ~ pat) {
                count++
            } else {
                print "contiguous count on field " i " = " count
                # uncomment the next line if non-contiguous matches are not needed
                #break
            }
        }
        print "total count on field " i " = " count
        count = 0
    }
}'
I posted this on another of the OP's posts, but it was ignored, maybe because I did not add notes and explanation. It is just a different approach, and any discussion is welcome.
$ awk -v sample="$(<sample.txt)" '{ x = sample; n = 0 }
$0 != "" {
    while (t = index(x, $0)) { n++; x = substr(x, t+1) }
    print $0, n
}' combinations
Explanation:
The variables:
sample: is the raw sample text, slurped in from the file sample.txt with the -v argument
x: is the targeting string; before each test, its value is reset to sample
$0: is the testing string from the file combinations; each line feeds one testing string
n: is the counter, the number of occurrences of the testing string ($0)
t: is the position of the first character of the matched testing string ($0) in the targeting string (x)
Update: Added $0 != "" before the main block to skip EMPTY strings, which would lead to an infinite loop.
The code:
awk -v sample="$(<sample.txt)" '
# reset the targeting string (with the sample text) and the counter "n"
{ x = sample; n = 0 }

# below is the main block, where $0 != "" skips EMPTY testing strings
($0 != "") {
    # index(x, $0) returns the position (assigned to "t") of the first character
    # of the matched testing string ($0) in the targeting string (x).
    # when no match is found, it returns zero, and we step out of the while loop.
    while (t = index(x, $0)) {
        n++                 # increment the number of matches
        x = substr(x, t+1)  # drop all characters up to and including position t
    }
    print $0, n             # print the testing string and the count
}
' combinations
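For reference, index() returns the 1-based position of the first occurrence, or 0 when there is none:
$ awk 'BEGIN { print index("HAAAH", "AA"); print index("HAAAH", "ZK") }'
2
0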
awk's index() is much faster than regex matching, and it avoids expensive brute-force string comparisons. Attached are the tested sample.txt and combinations files:
$ more sample.txt
AAAAAHHHAAHH
HAAAAHHHAAHH
AAHH
$ more combinations
AA
HH
AAA
HHH
AAH
HHA
ZK
Tested Environment: GNU Awk 4.0.2, Centos 7.3

perl's $-[0] produces unexpected results for non-ASCII data

Consider the following input data in file y.txt (encoded in UTF-8).
bar
föbar
and a file y.pl, which puts the two input lines into an array and processes them, looking for substring start positions.
use open qw(:std :utf8);
my @array;
while (<>) {
    push @array, $_;
    print $-[0] . "\n" if /bar/;
}
# $array[0] = "bar", $array[1] = "föbar"
print $-[0] . "\n" if $array[1] =~ /$array[0]/u;
If I call perl y.pl < y.txt, I get
0
2
3
as the output. However, I would expect the last number to be 2 as well, but for some reason the second /.../ regexp behaves differently. What am I missing? I guess it's an encoding issue, but whatever I tried, I didn't succeed. This is Perl 5.18.2.
It appears to be a bug in 5.18.
$ 5.18.2t/bin/perl a.pl a
0
2
3
$ 5.20.1t/bin/perl a.pl a
0
2
2
I can't find a workaround. Adding utf8::downgrade($array[0]); or utf8::downgrade($array[0], 1); works in the case you presented, but not using the following data or any other where the interpolated pattern contains characters >255.
♠bar
f♠♠bar
It appears that this can only be fixed by upgrading your Perl, which is actually quite simple. (Just make sure to install it to a different directory than your system perl by following the instructions in INSTALL!)

Can adding a particular number to a bunch of "time" strings be done in regex?

I have a "srt" file(like standard movie-subtitle format) like shown in below link:http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitle timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds (+9) to every time entry with regex?
Even if the milliseconds are just set to 000 that's fine, but the addition of 9 seconds should adhere to the "60 seconds = 1 minute & 60 minutes = 1 hour" rules.
Also, the subtitle text after each timing entry must not get altered by the regex.
By the way, the time format for each time string is "Hours:Minutes:Seconds,Milliseconds".
Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash

offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)

while read line; do
    if [[ "$line" =~ $datematch ]]; then
        # Gather the start and end times from the regex
        start=${BASH_REMATCH[1]}
        end=${BASH_REMATCH[3]}
        # Replace the time in this line with a printf pattern
        linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
        # Calculate new times
        case "$os" in
            Darwin|*BSD)
                newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
                newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
                ;;
            Linux)
                newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
                newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
                ;;
        esac
        # And print the result
        printf "$linefmt" "$newstart" "$newend"
    else
        # No adjustments required, print the line verbatim.
        echo "$line"
    fi
done
Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc.
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.
No, sorry, you can't. Regexes describe regular languages (see the Chomsky hierarchy, e.g. https://en.wikipedia.org/wiki/Chomsky_hierarchy) and cannot calculate.
But with a full programming language like Perl it will work.
It could be a one-liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}' movie.srt
with movie.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in the "sub plus9{...}", if you want another delta.
How does it work?
We are looking for lines that match
dd:dd:dd something dd:dd:dd something
and then we call a sub which adds 9 seconds to match group one ($1) and group three ($3). All other lines are printed unchanged.
Added:
If you want to put the Perl one-liner in a file, say plus9.pl, you can add newlines ;-)
if (/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
    print plus9($1) . $2 . plus9($3) . $4 . "\n";
} else {
    print $_;
}

sub plus9 {
    ($h, $m, $s) = split(/:/, shift);
    $t = (($h * 60 + $m) * 60 + $s + 9);
    $h = int($t / 3600);
    $r = $t - ($h * 3600);
    $m = int($r / 60);
    $s = $r - ($m * 60);
    return sprintf "%02d:%02d:%02d", $h, $m, $s;
}
Regular expressions strictly do matching and cannot add/subtract. You can match each datetime string using Python, for example, add 9 seconds to it, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups, so it's really easy to get each section (you won't need msecond, but it's there for visualization, I guess). Note that Python's built-in re module spells named groups (?P<hour>...); the (?<hour>...) form shown here works in PCRE and in the third-party regex module.
Regex101

Get digit from filename immediately preceding file extension, with other digits in filename

I'm trying to extract the last number before a file extension in a bash script. The format varies, but it'll be some combination of numbers and letters, and the last character will always be a digit. I need to pull those digits and store them in a variable.
The format is generally:
sdflkej10_sdlkei450_sdlekr_1.txt
I want to store just the final digit 1 into a variable.
I'll be using this to loop through a large number of files, and the last number will get into double and triple digits.
So for this file:
kej10_sdlkei450_sdlekr_310.txt
I'd need to return 310.
The number of alphanumeric characters and underscores varies with each file, but the number I want is always immediately before the .txt extension and immediately after an underscore.
I tried:
bname=${f%%.*}
number=$(echo $bname | tr -cd '[[:digit:]]')
but this returns all digits.
If I try
number=$(echo $(bname -2))
it changes the number it returns.
The problem I'm having is mostly related to the variability, and the fact that I've been asked to do it in bash. Any help would really be appreciated.
regex='([0-9]+)\.[^.]*$'
[[ $file =~ $regex ]] && number=${BASH_REMATCH[1]}
This uses bash's underappreciated =~ regex operator which stores matches in an array named BASH_REMATCH.
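For example, with the filename from the question:
$ file=kej10_sdlkei450_sdlekr_310.txt
$ regex='([0-9]+)\.[^.]*$'
$ [[ $file =~ $regex ]] && number=${BASH_REMATCH[1]}
$ echo "$number"
310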
You could do this using parameter substitution
var=kej10_sdlkei450_sdlekr_310.txt
var=${var%.*}
var=${var##*_}
echo $var
310
Use a Series of Bash Shell Expansions
While not the most elegant solution, this one uses a sequence of shell parameter expansions to achieve the desired result without having to define a specific extension. For example, this function uses the length and offset expansions to find the digit after removing filename extensions:
extract_digit() {
    local basename=${1%%.*}
    echo "${basename:$(( ${#basename} - 1 ))}"
}
Capturing Function Output
You can capture the output in a variable with something like:
$ foo=$(extract_digit sdflkej10_sdlkei450_sdlekr_1.txt)
$ echo $foo
1
Sample Output from Function
$ extract_digit sdflkej10_sdlkei450_sdlekr_1.txt
1
$ extract_digit sdflkej10_sdlkei450_sdlekr_9.txt
9
$ extract_digit sdflkej10_sdlkei450_sdlekr_10.txt
0
This should take care of your situation:
INPUT="some6random7numbers_12345_moreletters_789.txt"
SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)'`
echo $SUBSTRING
This will output 789
No need for regex here; you can utilize IFS:
var="kej10_sdlkei450_sdlekr_310.txt"
v=$(IFS=[_.] read -ra arr <<< "$var" && echo "${arr[@]:(-2):1}")
echo "$v"
310