perl's $-[0] produces unexpected results for non-ASCII data

Consider the following input data in file y.txt (encoded in UTF-8).
bar
föbar
and a file y.pl, which puts the two input lines into an array and processes them, looking for substring start positions.
use open qw(:std :utf8);
my @array;
while (<>) {
    push @array, $_;
    print $-[0] . "\n" if /bar/;
}
# $array[0] = "bar", $array[1] = "föbar"
print $-[0] . "\n" if $array[1] =~ /$array[0]/u;
If I call perl y.pl < y.txt, I get
0
2
3
as the output. However, I would expect that the last number is 2 also, but for some reason the second /.../ regexp behaves differently. What am I missing? I guess it's an encoding issue, but whatever I tried, I didn't succeed. This is Perl 5.18.2.

It appears to be a bug in 5.18.
$ 5.18.2t/bin/perl a.pl a
0
2
3
$ 5.20.1t/bin/perl a.pl a
0
2
2
I can't find a workaround. Adding utf8::downgrade($array[0]); or utf8::downgrade($array[0], 1); works in the case you presented, but not using the following data or any other where the interpolated pattern contains characters >255.
♠bar
f♠♠bar
It appears that this can only be fixed by upgrading your Perl, which is actually quite simple. (Just make sure to install it to a different directory than your system perl by following the instructions in INSTALL!)
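The stray 3 is what you would get if the offset were counted in bytes of the UTF-8 encoding instead of in characters, since ö occupies two bytes. That reading is only an inference from the numbers above; the answer does not confirm it. A small Python sketch of the two ways of counting (the function names are made up for illustration):

```python
def char_offset(haystack: str, needle: str) -> int:
    """Offset of needle counted in characters (code points)."""
    return haystack.find(needle)


def byte_offset(haystack: str, needle: str) -> int:
    """Offset of needle counted in bytes of the UTF-8 encoding."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


if __name__ == "__main__":
    # "f\u00f6bar" is "föbar" with a precomposed ö (U+00F6).
    for text in ("bar", "f\u00f6bar"):
        print(text, char_offset(text, "bar"), byte_offset(text, "bar"))
```

For the second line the two counts differ by one, which mirrors the 2 versus 3 in the question.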

Related

awk unix - match regex - regex string size limit | ideas?

The following code works as a minimal example. It searches a regular expression with one mismatch inside a text (later a large DNA file).
awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'
Later I am interested in the position where the regular expression is found, so the awk command becomes more complex, like the one solved here.
If I want to search with more mismatches and a longer string, I end up with very long regular expressions.
Example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches ("." allowed):
/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
-
- and so on. (actually 4060 possibilities)
/
The problem with my solution is:
A very long regex is not accepted by awk! (The limit seems to be roughly 80,000 characters.)
Error: "bash: /usr/bin/awk: Argument list too long"
possible solution: SO-Link but I don't find the solution...
My question is:
Can I somehow still use the long regex expression?
splitting the string and running the command multiple times could be a solution, but then I will get duplicated results.
Is there another way to approach this?
("agrep" will work, but not to find the positions)

As Jonathan Leffler points out in the comments, the error in your first case (bash: /usr/bin/awk: Argument list too long) comes from the shell, and you can solve it by putting your awk script in a file.
As he also points out, your fundamental approach is not optimal. Below are two alternatives.
Perl has many features that will aid you with this.
You can use the ^ XOR operator on two strings that will return \x00 where the strings match and another character where they don't match. March through the longer string XORing against the shorter with a max substitution count and there you are:
use strict;
use warnings;
use 5.014;
my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat = "AAAAAA";
my $max_subs = 3;
my $len_in = length $seq;
my $len_pat = length $pat;
my %posn;
sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}
for my $i ( 0 .. $len_in - $len_pat ) {
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}
say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;
Running this prints:
6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT
Substituting:
my $seq = "AAATCGAAAAGCDFAAAACGT";
my $pat = "AATC";
my $max_subs = 1;
Prints:
1 => AATC
8 => AAGC
15 => AAAC
It is also easy (in the same style as awk) to convert this to 'magic input' from either stdin or a file.
You can also write a similar approach in awk:
echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v seq="AATC" '
{
for(i=1; i<=length($1)-length(seq)+1; i++) {
cnt=0
for(j=1;j<=length(seq); j++)
if(substr($1,i+j-1,1)!=substr(seq,j,1)) cnt++
if (cnt<=mc) print i-1 " => " substr($1,i, length(seq))
}
}'
Prints:
1 => AATC
8 => AAGC
15 => AAAC
And the same result with the longer example above. Since the input is moved to STDIN (or a file) and the regex does not need to be HUGE, this should get you started either with Perl or Awk.
(Be aware that the first character of a string is offset 1 in awk and offset 0 in Perl...)
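For comparison, the same sliding-window idea can be sketched in Python. This is only an illustration under the same assumptions as the Perl and awk versions above (substitutions only, no insertions or deletions); the function name fuzzy_positions is made up here, and offsets are 0-based as in the Perl output.

```python
def fuzzy_positions(text: str, pattern: str, max_subs: int):
    """Return (offset, window) pairs where the window differs from
    pattern in at most max_subs positions (substitutions only)."""
    hits = []
    width = len(pattern)
    for i in range(len(text) - width + 1):
        window = text[i:i + width]
        # Count the positions where the two equal-length strings differ.
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_subs:
            hits.append((i, window))
    return hits


if __name__ == "__main__":
    for pos, window in fuzzy_positions("AAATCGAAAAGCDFAAAACGT", "AATC", 1):
        print(f"{pos} => {window}")
```

Like the other two versions, this compares every window character by character, so the run time grows with text length times pattern length; for a large DNA file that may or may not be fast enough.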

The "Argument list too long" problem is not from Awk. You're running into the operating system's memory size limit on the argument material that can be passed to a child process. You're passing the Awk program to Awk as a very large command line argument.
Don't do that; put the code into a file, and run it with awk -f file, or make the file executable and put a #!/usr/bin/awk -f or similar hash-bang line at the top.
That said, it's probably not such a great idea to include your data in the program source code as a giant literal.

Is there another way to approach this?
Looking for fuzzy matches is easy with Python. You just need to install the PyPI regex module by running the following in the terminal:
pip install regex # or pip3 install regex
and then create the Python script (named, say, script.py) like
#!/usr/bin/env python3
import regex
filepath = r'myfile.txt'
with open(filepath, 'r') as file:
    for line in file:
        for x in regex.finditer(r"(?:AATC){s<=1}", line):
            print(f'{x.start()}:{x.group()}')
Use the pattern you want. Here, (?:AATC){s<=1} matches the AATC character sequence while allowing at most one substitution in the match; prefixing the pattern with (?e), as in (?e)(?:AATC){s<=1}, makes the engine attempt to find a better fit.
Run the script using python3 script.py.
If myfile.txt contains just one AAATCGAAAAGCDFAAAACGT line, the output is
1:AATC
8:AAGC
15:AAAC
meaning that there are three matches at positions 1 (AATC), 8 (AAGC) and 15 (AAAC).
If you only need the matched text, print x.group() alone instead of x.start() and x.group() together.
See an online Python demo:
import regex
line='AAATCGAAAAGCDFAAAACGT'
for x in regex.finditer(r"(?:AATC){s<=1}", line):
    print(f'{x.start()}:{x.group()}')

Fetch specific value from output using grep+regex and store in variable in shell

Apologies in advance if this is a silly question; I am no expert on this.
I get output like the following from a task:
image4.png PNG 1656x839 1656x839+0+0 8-bit sRGB 155KB 0.040u 0:00.039
image4.png PNG 1656x839 1656x839+0+0 8-bit sRGB 155KB 0.020u 0:00.030
Image: image4.png
Channel distortion: AE
red: 0
green: 0
blue: 0
all: 0
image4.png=>tux_difference.png PNG 1656x839 1656x839+0+0 8-bit sRGB
137KB 0.500u 0:00.140
From this, I only want to get the value of all.
I am trying to do it like this:
var="$(compare -verbose -metric ae path/actual.png path/dest.png path/tux_difference.png 2>&1 | grep 'all:\s(\d*)')"
But it does nothing.
To make compare available, run
sudo apt-get install imagemagick
I recommend using the same image for source and destination; otherwise you may get an error about mismatched images.

You might need to escape your parentheses (or just remove them) in:
grep 'all:\s\(\d*\)'
However, grep prints the whole line by default, which is not what you want. Printing only the matched text is possible, but extracting the number from it requires a more complex regex, which may or may not be available in your version of grep. GNU grep has the -P flag to enable Perl-like regexes, and the -o flag outputs only the match.
On the other hand, my recommendation is to use Perl directly:
perl -ne 'print $1 if /all: (\d+)/'
Note that you also don't need those quotes around $(). Considering your compare call is working properly and outputting the text in your question, then this should do what you asked:
var=$( compare [...] | perl -ne 'print $1 if /all: (\d+)/' )
echo $var
You can also use variations like /all:\s*(\d+)/ if the white space before the number is not guaranteed to be there.
The Perl code used here is largely based on the -n flag, which assumes the following loop around the program:
while (<>) {
# ...
}
This loop iterates over the input line by line, and <> takes its input from either stdin or the filenames given as arguments.
The -e flag precedes the code itself:
print $1 if /all: (\d+)/;
Which is just a shorthand for:
if (/all: (\d+)/) {
print $1;
}
Here the match operator m// (or /<regex>/ for short) tests the default variable $_ to see if there is a match for the regex. What set the $_ variable? The loop itself, in its (<>) construct: it automatically sets $_ to each line being read.
If the regex matches, we print its first set of parentheses (group), whose contents are stored in $1. If the regex had other groups, they would be stored in $2, $3, and so forth.
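If Python is closer to hand than Perl, the same match-and-capture logic might look roughly like the sketch below. SAMPLE is just a fragment of the output shown in the question, and extract_all is a made-up helper name; in practice the text would come from the compare command's combined stdout and stderr.

```python
import re

# Fragment of the tool output from the question, used as stand-in input.
SAMPLE = """\
Image: image4.png
Channel distortion: AE
red: 0
green: 0
blue: 0
all: 0
"""


def extract_all(output: str):
    """Return the integer after 'all:' or None if no such line exists."""
    # re.MULTILINE lets ^ match at the start of every line.
    m = re.search(r"^\s*all:\s*(\d+)", output, flags=re.MULTILINE)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    print(extract_all(SAMPLE))
```

Here m.group(1) plays the role of Perl's $1.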

Can adding a particular number to a bunch of "time" strings, be done in Regex

I have an "srt" file (the standard movie-subtitle format), as shown at this link: http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitle timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds (+9) to every time entry with regex?
It is fine even if the milliseconds are set to 000, but the addition of 9 seconds should adhere to the "60 seconds = 1 minute, 60 minutes = 1 hour" rules.
Also, the subtitle text after each timing entry must not be altered by the regex.
By the way, the format of each time string is "Hours:Minutes:Seconds,Milliseconds".

Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash
offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)
while IFS= read -r line; do
    if [[ "$line" =~ $datematch ]]; then
        # Gather the start and end times from the regex
        start=${BASH_REMATCH[1]}
        end=${BASH_REMATCH[3]}
        # Replace the time in this line with a printf pattern
        linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
        # Calculate new times
        case "$os" in
            Darwin|*BSD)
                # BSD date -v needs an explicit sign in the offset, e.g. -9 or +9
                newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
                newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
                ;;
            Linux)
                newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
                newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
                ;;
        esac
        # And print the result
        printf "$linefmt" "$newstart" "$newend"
    else
        # No adjustments required, print the line verbatim.
        echo "$line"
    fi
done
Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc.
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.

No, sorry, you can’t. Regular expressions describe regular languages (see the Chomsky hierarchy, e.g. https://en.wikipedia.org/wiki/Chomsky_hierarchy), and you cannot calculate with them.
But with a general-purpose programming language like Perl it will work.
It could be a one liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}' movie.srt
with movie.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in the "sub plus9{...}", if you want another delta.
How does it work?
We are looking for lines that match
dd:dd:dd something dd:dd:dd something
and then we call a sub, which adds 9 seconds to matched group one ($1) and group three ($3). All other lines are printed unchanged.
Added:
If you want to put the Perl one-liner in a file, say plus9.pl, you can add newlines ;-) and run it with perl -n plus9.pl movie.srt
if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
    print plus9($1).$2.plus9($3).$4."\n";
} else {
    print $_
}
sub plus9{
    ($h,$m,$s)=split(/:/,shift);
    $t=(($h*60+$m)*60+$s+9);
    $h=int($t/3600);
    $r=$t-($h*3600);
    $m=int($r/60);
    $s=$r-($m*60);
    return sprintf "%02d:%02d:%02d", $h, $m, $s;
}

Regular expressions strictly do matching and cannot add or subtract. You can match each datetime string using Python, for example, add 9 seconds to it, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups so it's really easy to get each section (you won't need msecond but it's there for visualization, I guess)
Regex101
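The match-then-rewrite approach described above could be sketched in Python roughly as follows. Treat it as an illustration and not a finished tool: shift_line and TIMESTAMP are names invented here, Python's re module spells named groups (?P<name>...), and the choice to clamp negative results to zero and to touch only lines containing "-->" are assumptions of this sketch.

```python
import re

# Same fields as the pattern above, in Python's (?P<name>...) spelling.
TIMESTAMP = re.compile(r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}),(?P<ms>\d{3})")


def shift_line(line: str, seconds: int) -> str:
    """Shift both timestamps of an SRT timing line; leave other lines alone."""
    if "-->" not in line:
        return line

    def repl(m: re.Match) -> str:
        total_s = (int(m["h"]) * 60 + int(m["m"])) * 60 + int(m["s"]) + seconds
        # Assumption: clamp to zero so a negative shift never yields a negative time.
        total_ms = max(total_s * 1000 + int(m["ms"]), 0)
        s, ms = divmod(total_ms, 1000)
        h, rem = divmod(s, 3600)
        mi, sec = divmod(rem, 60)
        return f"{h:02d}:{mi:02d}:{sec:02d},{ms:03d}"

    return TIMESTAMP.sub(repl, line)


if __name__ == "__main__":
    print(shift_line("00:00:53,000 --> 00:00:57,000", 9))
```

The regex only finds the timestamps; the arithmetic happens in the replacement function, which is the division of labour the answers above describe.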

Edit line names with a new name containing an incremented value

This seems like a simple task to me, but getting it to work is proving more difficult than I thought.
I have a FASTA file containing several million lines of text (only a few hundred individual sequence entries), and the sequence names are long. I want to replace all characters after the header > with Contig $n, where $n is an integer starting at 1 and incremented for each replacement.
an example input sequence name:
>NODE:345643RD:Cov_456:GC47:34thgd
ATGTCGATGCGT
>NODE...
ATGCGCTTACAC
Which I then want to output like this
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
so maybe a Perl script? I know some basics but I'd like to read in a file and then output the new file with the changes, and I'm unsure of the best way to do this? I've seen some Perl one liner examples but none did what I wanted.
$n = 1
if {
s/>.*/(Contig)++$n/e
++$n
}

$ awk '/^>/{$0=">Contig "++n} 1' file
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC

Try something like this:
#!/usr/bin/perl -w
use strict;
open (my $fh, '<', 'example.txt') or die "Cannot read example.txt: $!";
open (my $fh1, '>', 'example2.txt') or die "Cannot write example2.txt: $!";
my $n = 1;
# For each line of the input file
while (<$fh>) {
    # Try to update the name; if successful, increment $n
    if ($_ =~ s/^>.*/>Contig $n/) { $n++; }
    print $fh1 $_;
}
close $fh1 or die "Cannot close example2.txt: $!";

When you use the /e modifier, Perl expects the replacement part of the substitution to be a valid Perl expression. Try something like
s/>.*/">Contig " . ++$n/e

perl -i -pe 's/>.*/">Contig " . ++$c/e;' file.txt
Output:
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC

I'm no awk expert (far from it), but I solved this out of curiosity, and because sed has no variables (so its possibilities are limited).
One possible gawk solution could be
awk -v n=1 '/^>/{print ">Contig " n; n++; next}1' <file

Get digit from filename immediately preceding file extension, with other digits in filename

I'm trying to extract the last number before a file extension in a bash script. So the format varies but it'll be some combination of numbers and letters, and the last character will always be a digit. I need to pull those digits and store them in a variable.
The format is generally:
sdflkej10_sdlkei450_sdlekr_1.txt
I want to store just the final digit 1 into a variable.
I'll be using this to loop through a large number of files, and the last number will get into double and triple digits.
So for this file:
kej10_sdlkei450_sdlekr_310.txt
I'd need to return 310.
The number of alphanumeric characters and underscores varies with each file, but the number I want always is immediately before the .txt extension and immediately after an underscore.
I tried:
bname=${f%%.*}
number=$(echo $bname | tr -cd '[[:digit:]]')
but this returns all digits.
If I try
number = $(echo $(bname -2) it changes the number it returns.
The problem I'm having is mostly related to the variability, and the fact that I've been asked to do it in bash. Any help would really be appreciated.

regex='([0-9]+)\.[^.]*$'
[[ $file =~ $regex ]] && number=${BASH_REMATCH[1]}
This uses bash's underappreciated =~ regex operator which stores matches in an array named BASH_REMATCH.

You could do this using parameter substitution
var=kej10_sdlkei450_sdlekr_310.txt
var=${var%.*}
var=${var##*_}
echo $var
310

Use a Series of Bash Shell Expansions
While not the most elegant solution, this one uses a sequence of shell parameter expansions to achieve the desired result without having to define a specific extension. For example, this function uses the length and offset expansions to find the digit after removing filename extensions:
extract_digit() {
    local basename=${1%%.*}
    echo "${basename:$(( ${#basename} - 1 ))}"
}
Capturing Function Output
You can capture the output in a variable with something like:
$ foo=$(extract_digit sdflkej10_sdlkei450_sdlekr_1.txt)
$ echo $foo
1
Sample Output from Function
$ extract_digit sdflkej10_sdlkei450_sdlekr_1.txt
1
$ extract_digit sdflkej10_sdlkei450_sdlekr_9.txt
9
$ extract_digit sdflkej10_sdlkei450_sdlekr_10.txt
0

This should take care of your situation:
INPUT="some6random7numbers_12345_moreletters_789.txt"
SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)'`
echo $SUBSTRING
This will output 789

No need of regex here, you can utilize IFS
var="kej10_sdlkei450_sdlekr_310.txt"
v=$(IFS=[_.] read -ra arr <<< "$var" && echo "${arr[@]:(-2):1}")
echo "$v"
310
