Tcl: Regsub does not substitute a string while parsing HTML snippet - regex

I'm trying to find a specific string within an array element. Since an array element is a string that can contain multiple occurrences of the string, I repeatedly substitute each hit away. The algorithm works on a simple example, but when I use it with HTML (which is the purpose of the program) it gets stuck in an infinite while loop.
Here is an (ugly) expression that I'm using:
set expression {\<div\sclass\=\"fileText\"\sid\=\"[^\"]+\"\>File\:\s\<a\s(title\=\"[^\"]+\"\s)?href\=\"([^\"]+)\"\starget\=\"\_blank\"\>([^\<]+)\<\/a\>[^\<]+\<\/div\>};
Here is an element of the array from which I want to extract strings (it contains 2 occurrences of the given expression):
set htmlForParse(0) {file" id="f51456520"><div class="fileText" id="fT51456520">File: 48912-arduinouno_r3_front.jpg (1022 KB, 1800x1244)</div><a class="fileThumb" href="//example.com" target="_blank"><img " title="Reply to this post">YesNo?</a></span></div><div class="file" id="f51456769"><div class="fileText" id="fT51456769">File: 892991578.jpg (32 KB, 400x422)</div><a class="fileThumb" href="//example.com" target="_blank"><img src};
And here are the loops that I'm using to achieve this:
for {set k 0} {$k < [array size htmlForParse]} {incr k} {
    while {[regexp $expression $htmlForParse($k) exString]} {
        regsub -- $exString $htmlForParse($k) {} htmlForParse($k);
        puts $htmlForParse($k);
    }
}
The purpose of the regsub is to substitute one hit from regexp at a time, until no hits are left and regexp returns 0. At that point the while loop finishes and the next element of the array can be examined. But that doesn't happen: it loops forever, and it seems that regsub does not replace the found string with an empty string (nor with anything else). Why?

The problem is that the string you are matching contains unquoted RE metacharacters. The ones I notice are parentheses (around the sizes):
% regexp $expression $htmlForParse($k) exString
1
% puts $exString
<div class="fileText" id="fT51456520">File: 48912-arduinouno_r3_front.jpg (1022 KB, 1800x1244)</div>
This means that the substring you extract doesn't actually match as a regular expression in the regsub, and no change is made. Next time round the loop, you get to match everything exactly as it was once again. Not what you want!
The easiest fix is to tell the regsub that the string it is using as a pattern is a literal string. This is done by preceding the RE with ***=, like this:
while {[regexp $expression $htmlForParse($k) exString]} {
    regsub -- ***=$exString $htmlForParse($k) {} htmlForParse($k)
    puts $htmlForParse($k)
}
With your sample text, this will perform two replacements. I hope that's what you want.
Also, your initial RE has far too many backslashes in it. None of /, < and > are RE metacharacters. It's not harmful to quote them, but I hope you are generating that RE from something, not writing it by hand!
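The same pitfall can be reproduced outside Tcl. The following is a rough Python analogue, not the Tcl from the answer: the sample HTML and pattern are made up for illustration, and `re.escape` plays the role that the `***=` prefix plays in Tcl.

```python
import re

# Hypothetical, simplified sample (not the asker's full HTML)
html = '<div class="fileText">File: a.jpg (32 KB, 400x422)</div><img>'
pattern = r'<div class="fileText">[^<]+</div>'

found = re.search(pattern, html).group(0)

# Reusing the matched text as a pattern: "(32 KB, 400x422)" is now read as a
# capture group, so the literal parentheses in html are never matched.
unchanged = re.sub(found, "", html, count=1)

# Escaping the text first makes every character literal, so it matches itself.
changed = re.sub(re.escape(found), "", html, count=1)

print(unchanged == html)
print(changed)
```

When the thing to remove is already a literal string, a plain string replacement avoids regular expressions altogether.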

Related

Perl String Regular Expression - Need Explanation

I am pretty new to Perl. I have the following code fragment that works just fine, but I don't fully understand it:
for ($i = 1; $i <= $pop->Count(); $i++) {
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
}
$pop->Head is a string or an array of strings returned by the function Mail::POP3Client, and it is the headers of a bunch of emails. Line 3 is some kind of regular expression that extracts the FROM and the SUBJECT from the header.
My question is how does the print function only print the From and the Subject without all the other stuff in the header? What does "and" mean - this surely can't be a boolean and can it? Most important, I want to put the From string into its own variable (my $fromline). How do I do this?
I am hoping that this will be easy for some Perl professional, it has got me baffled!
Thanks in advance.
ARGHHH... The question was edited while I was typing the answer. OK, throwing out the part of my answer that's no longer relevant, and focusing on the specific questions:
The outer loop iterates over all the messages in the mailbox.
The inner loop doesn't specify a loop variable, so the special variable $_ is used.
In each iteration through the inner loop, $_ is one header line from message number $i.
/^(From|Subject):\s+/i and print $_, "\n";
The first part of this line, up to the and is a pattern. We didn't specify what to do with the pattern, so it's implicitly matched against $_. (That's one of the things that makes $_ special.) This gives us a yes/no test: does the pattern match the header line or not?
The pattern tests whether that item begins with (^) either of the words "From" or "Subject", followed immediately by a colon and one or more whitespace characters. (This is not the correct pattern to match an RFC 822 header. Whitespace is optional on both sides of the colon. The pattern should more properly be /^(From|Subject)\s*:\s*/i. But that's a separate issue.) The i at the end of the pattern says to ignore case, so from or SUBJECT would be OK.
The and says to continue evaluating (i.e., executing) the expression if there is a match. If there's no match, whatever follows and is ignored.
The rest of the expression prints the header line ($_) and a newline ("\n").
In Perl, and and or are boolean operators. They're synonyms for && and ||, except that they have much lower precedence, making it easier to write short-circuit expressions without clutter from lots of parentheses.
The smallest change that captures the From line into a separate variable would be to add the following line to the inner loop:
/^From\s*:\s*(.*)$/i and $fromline = $1;
You should probably also put
$fromline = undef
before the loop so you can test, after the loop, whether there was a From: line.
There are other ways to do it. In fact, that's one of the mantras of perl: "There's more than one way to do it." I've stripped out the "From: " from the beginning of the line before storing the balance in $fromline, but I don't know your needs.
It's a logical and with short-circuiting. If the left side evaluates to true -- say, if that regular expression matches -- it'll evaluate the right side, the print.
If the expression on the left is false, it doesn't need to evaluate the right hand side, because the net result would still be false, so it skips it.
See also: perldoc perlop
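Short-circuit evaluation is not specific to Perl. As a hedged illustration, here is the same filter-and-capture idea sketched in Python, whose `and` also skips its right-hand side when the left is false. The header lines are invented sample data, and `printed` and `fromline` are illustrative names.

```python
import re

# Invented sample header lines
headers = [
    "Received: by mail.example.org",
    "From: Alice <alice@example.org>",
    "subject: Hello",
    "To: bob@example.org",
]

printed = []
fromline = None
for line in headers:
    # The right-hand side runs only when the match on the left succeeds
    re.match(r"(From|Subject)\s*:\s*", line, re.IGNORECASE) and printed.append(line)

    # Capture everything after "From:" into its own variable
    m = re.match(r"From\s*:\s*(.*)$", line, re.IGNORECASE)
    if m:
        fromline = m.group(1)

print(printed)
print(fromline)
```

Leaving `fromline` as `None` before the loop serves the same purpose as `$fromline = undef`: after the loop you can tell whether a From line was seen.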

Match Regular Expression to other regular expression in perl

I want to find whether a given regular expression is a subset of larger regular expression.
For example given a larger regular expression ((a*)(b(a*))), I want to find if a regular expression like (aab.*) or (a.*) matches to it. I am developing a program where I need to find all sub string of given length that can be formed from given regular expression.
$count=0;
$len=0;
sub match{
my $c=$_[1];
my $str=$_[0];
my $reg=$_[2];
#if($str.".*"!~/^$reg$/){
# return;
#}
if($c==$len){
if($str=~/^$reg$/){
$count++;
}
return;
}
my $t=$str.'a';
&match($t,$c+1,$reg);
my $t=$str.'b';
&match($str.'b',$c+1,$reg);
}
for(<>){
@arr=split(/\s/,$_);
$len=$arr[1];
&match('a',1,$arr[0]);
&match('b',1,$arr[0]);
print $count;
}
So I thought that I would build strings of the given length using recursion, and when the string reaches the desired length, I would compare it to the original expression. This works fine for small substrings but runs into a stack overflow for larger ones. So I was thinking that while generating each partial string I would check it against the given regular expression. But that didn't work. For the regular expression given above, ((a*)(b(a*))), comparing it to the partial string (aa) fails because the regular expression doesn't match. So in order for it to work, I need to compare two regular expressions by adding .* behind every partial substring. I tried to find an answer on the web but was unsuccessful.
I tried the following code but naturally it failed. Can anyone suggest some other approach?
if("a.*"=~/((a*)(b(a*)))/){
print match;
}
But here the first part is treated as an actual string. Can you help me convert the code so that I can compare (a.*) as a regular expression instead of a string?
I think one approach is to find the length of the matched string, if that can be done. For instance, if you match (aab) against (aac), then you can obtain the position where the match stopped.
Now compare the position where the match stopped: if it's equal to the length of your string, then that's equivalent to matching the regex str(.*). I read that it can be done in some other languages but I am not sure about Perl.
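For comparison, the brute-force enumeration from the question can be written without recursion. This is only a sketch in Python, not Perl: `count_matches` is a hypothetical helper, it checks every candidate string in full, and it does not implement the prefix pruning discussed above, so it is still exponential in the length.

```python
import re
from itertools import product

def count_matches(pattern, length, alphabet="ab"):
    """Count the strings of exactly `length` characters over `alphabet`
    that the whole pattern matches (brute force, no pruning)."""
    compiled = re.compile(pattern)
    return sum(
        1
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    )

# The pattern from the question requires exactly one "b",
# so a string of length 4 has four possible positions for it.
print(count_matches(r"((a*)(b(a*)))", 4))
```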

How do I use regex capture group as array index?

I'm trying to use regsub in TCL to replace a string with the value from an array.
array set myArray "
one 1
two 2
"
set myString "\[%one%\],\[%two%\]"
regsub -all "\[%(.+?)%\]" $myString "$myArray(\\1)" newString
My goal is to convert a string from "[%one%],[%two%]" to "1,2". The problem is that the capture group index is not resolved. I get the following error:
can't read "myArray(\1)": no such element in array
while executing
"regsub -all "\[%(.+?)%\]" $myString "$myArray(\\1)" newString"
This is a 2 step process in Tcl. Your main mistake here is using double quotes everywhere:
array set myArray {one 1 two 2}
set myString {[%one%],[%two%]}
regsub -all {\[%(.+?)%\]} $myString {$myArray(\1)} new
puts $new
puts [subst -nobackslash -nocommand $new]
$myArray(one),$myArray(two)
1,2
So we use regsub to search for the expression and replace it with the string representation of the variable we want to expand. Then we use the rarely-used subst command to perform the variable (only) substitution.
Apart from using regsub+subst (which is a decidedly tricky pair of commands to use safely in general) you can also do relatively simple transformations using string map. The trick is in how you prepare the mapping:
# It's conventional to use [array set] like this…
array set myArray {
one 1
two 2
}
set myString "\[%one%\],\[%two%\]"
# Build the transform
set transform {}
foreach {from to} [array get myArray] {
lappend transform "\[%$from%\]" $to
}
# Apply the transform
set changedString [string map $transform $myString]
puts "transformed from '$myString' to '$changedString'"
As long as each individual thing you want to go from and to is a constant string at the time of application, you can use string map to do it. The advantage? It's obviously correct. It's very hard to make a regsub+subst transform obviously correct (but necessary if you need a more complex transform; that's the correct way to do %XX encoding and decoding in URLs for example).
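Both approaches above have close counterparts in other languages. As a hedged Python analogue (not Tcl), a replacement function can look each capture group up in a dictionary, and a loop of plain string replacements stands in for string map. The names `my_array` and `my_string` simply mirror the Tcl variables.

```python
import re

my_array = {"one": "1", "two": "2"}
my_string = "[%one%],[%two%]"

# Regex route: the replacement function receives the match object,
# so the capture group can be used directly as a dictionary key.
new_string = re.sub(r"\[%(.+?)%\]", lambda m: my_array[m.group(1)], my_string)

# Literal route: replace each constant "[%key%]" token in turn.
# Unlike Tcl's string map this makes one pass per key, so a replacement
# value that itself looks like a token could be replaced again.
changed_string = my_string
for key, value in my_array.items():
    changed_string = changed_string.replace(f"[%{key}%]", value)

print(new_string)
print(changed_string)
```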

tcl regsub will not work

I'm trying to write an extremely simple piece of code and tcl is not cooperating. I can only imagine there is a very simple error I am missing in my code but I have absolutely no idea what it could be please help I'm so frustrated!!
My code is the following ...
proc substitution {stringToSub} {
set afterSub $stringToSub
regsub {^.*?/projects} "$stringToSub" "//Path/projects" afterSub
regsub {C:/projects} "$stringToSub" "//Path/projects" afterSub
return $afterSub
}
puts "[substitution /projects] "
puts "[substitution C:/projects] "
The substitution works fine for the second expression but not the first one. Why is that??
I have also tried using
regsub {^/projects} "$stringToSub" "//Path/projects" afterSub
and
regsub {/projects} "$stringToSub" "//Path/projects" afterSub
but neither are working. What is going on??
Your two regsub calls don't change the input string ($stringToSub); each one puts its result in afterSub, which the function returns. So you always get the result of the last regsub call, and the result the first call stored in afterSub is overwritten.
Note that the first pattern is more general and includes all the strings matched by the second (assuming that $stringToSub is always a path). If you hope to obtain "//Path/projects" for your sample strings, you can simply remove the second regsub call:
proc substitution {stringToSub} {
set afterSub $stringToSub
regsub {^.*?/projects} "$stringToSub" "//Path/projects" afterSub
return $afterSub
}
The first two lines in your procedure will effectively do nothing, since regsub always overwrites the destination variable (afterSub) even when there's 0 matches/substitutions made. From the regsub manual:
This command matches the regular expression exp against string, and either copies string to the variable whose name is given by varName or returns string if varName is not present. (Regular expression matching is described in the re_syntax reference page.) If there is a match, then while copying string to varName (or to the result of this command if varName is not present) the portion of string that matched exp is replaced with subSpec.
There's no need to match C:/projects specifically, because ^.*?/projects will match that text too.
The issue is that your second use of the regsub operation is overwriting the substituted value from the first regsub use.
We could simplify the code to just this:
proc substitution {stringToSub} {
return [regsub {^.*?/projects} $stringToSub "//Path/projects"]
}
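The overwriting problem is easy to reproduce in any language where each substitution starts again from the original input. A minimal Python sketch of the buggy and the simplified versions, with hypothetical function names:

```python
import re

def substitution_buggy(string_to_sub):
    after_sub = re.sub(r"^.*?/projects", "//Path/projects", string_to_sub, count=1)
    # This call starts again from the original input, so the result of the
    # first substitution is thrown away.
    after_sub = re.sub(r"C:/projects", "//Path/projects", string_to_sub, count=1)
    return after_sub

def substitution(string_to_sub):
    # The general pattern alone covers both "/projects" and "C:/projects"
    return re.sub(r"^.*?/projects", "//Path/projects", string_to_sub, count=1)

print(substitution_buggy("/projects"))
print(substitution("/projects"))
print(substitution("C:/projects"))
```

`count=1` is used because regsub without -all also replaces only the first match.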

find ORF with minimal size of 45 bases using perl regular expression - why this regex doesn't work

I am using Perl and a regular expression to find an ORF (open reading frame) with a minimal size of 45 bases.
Basically it means:
Find a substring of a string that is composed ONLY of the letters ATGC (no spaces or newlines) that:
starts with "ATG",
ends with "TAG" or "TAA" or "TGA",
is at least 39 chars long,
has a length divisible by 3.
My first code was:
$CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATGA";
if($CDSString =~ m/(ATG.{45,}(TAG|TAA|TGA))/)
{
my $CDSCurrent = $1;
if ((length($CDSCurrent) % 3) == 0)
{
# do something
}
}
which works fine, but I thought there might be a better way.
So I tried:
$CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATGA";
if ($CDSString =~ m/ATG(...){13,}(TAG|TAA|TGA)/ )
{
# do something
}
but for some reason it doesn't match the string above it, and I can't figure out why.
Can anyone figure it out? Thank you in advance.
Your regex is not making sure that everything between the start and stop codons is in fact composed of the letters ATGC only. You should be using:
if ($CDSString =~ m/ATG(?:[ATGC]{3}){13,}(?:TAG|TAA|TGA)/i) {...}
(But your original regex works, too, it just won't reject invalid matches. So there may be another problem somewhere else.)
There is a problem with the code thus far. What you should be looking for is the FIRST instance of a stop codon. If your CDS is no good, it might contain internal stops. Internal stop codons make an invalid ORF, so you need something more finessed:
if($CDSString =~ m/ATG(?:[ATGC]{3}(?<!TAG|TAA|TGA)){13,}(?:TAG|TAA|TGA)/i) {...}
This will return a sequence without internal stops that has at least 13 codons between the start and the first stop.
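This portion of the code: (?:[ATGC]{3}(?<!TAG|TAA|TGA)) says "match three nucleotides that are not TAG, TAA, or TGA". The (?<!TAG|TAA|TGA) part is a negative lookbehind: the match fails at that point if the three characters just consumed spell one of the stop codons.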
Here's how it looks in action:
perl -e '$CDSString = "ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATAGTAGTAGTGA";if ($CDSString =~ m/(ATG(?:[ATGC]{3}(?<\!TAG|TAA|TGA)){13,}(TAG|TAA|TGA))/ ){print "$1\n"}'
ATGCACACACACACACACACACACACACACACACACACACACACACACACACACACATAG
Note, the last 3 stop codons (TAGTAGTGA) are not returned as part of the sequence.
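The lookbehind pattern carries over almost unchanged to other regex engines that accept fixed-width lookbehind. Below is a hedged Python sketch of the same expression. The test sequences are built programmatically for illustration and are not the strings from the question.

```python
import re

# Same structure as the Perl pattern: start codon, at least 13 non-stop
# codons, then the first stop codon.
ORF = re.compile(
    r"ATG(?:[ATGC]{3}(?<!TAG|TAA|TGA)){13,}(?:TAG|TAA|TGA)",
    re.IGNORECASE,
)

# 13 codons, a stop, then trailing stop codons that must not be included
good = "ATG" + "CAC" * 13 + "TAG" + "TAGTGA"
# Only 5 codons between start and stop: too short
short = "ATG" + "CAC" * 5 + "TGA"
# An internal stop (TAA) after 6 codons interrupts the reading frame
internal_stop = "ATG" + "CAC" * 6 + "TAA" + "CAC" * 8 + "TGA"

good_match = ORF.search(good)
print(good_match.group(0) if good_match else None)
print(ORF.search(short))
print(ORF.search(internal_stop))
```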