If I want to match DEF_23 using the following regexp:
expect {
-re "DEF_\[0-9]*"
set result $expect_out(1,string)
}
why does it say no such element in array?
How does $expect_out work, and how can I capture the DEF using a regexp and assign it to the variable result?
You're looking for expect_out(0,string) -- the array element 1,string would be populated if you had capturing parentheses in your regular expression.
The expect manpage documents the use of expect_out in the documentation of the expect command:
Upon matching a pattern (or eof or full_buffer), any matching and previously unmatched output is saved in the variable expect_out(buffer). Up to 9 regexp substring matches are saved in the variables expect_out(1,string) through expect_out(9,string). If the -indices flag is used before a pattern, the starting and ending indices (in a form suitable for lrange) of the 10 strings are stored in the variables expect_out(X,start) and expect_out(X,end) where X is a digit, corresponds to the substring position in the buffer. 0 refers to strings which matched the entire pattern and is generated for glob patterns as well as regexp patterns.
There is an illustrative example in the manpage.
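The split between element 0 and element 1 is the usual regex convention of "whole match" versus "first capture group". As a rough illustration in Python rather than Expect (the sample `output` string is invented, and Python's `group(n)` only stands in for `expect_out(n,string)` here):

```python
import re

# Hypothetical program output containing the token from the question.
output = "ABC_12 DEF_23 GHI_34"

# No parentheses in the pattern: only the whole match (group 0) exists.
m = re.search(r"DEF_[0-9]*", output)
whole = m.group(0)
try:
    m.group(1)
    group1_exists = True
except IndexError:  # Python's counterpart of "no such element in array"
    group1_exists = False

# With capturing parentheses, groups 1 and 2 are populated as well.
m2 = re.search(r"(DEF)_([0-9]*)", output)
prefix, digits = m2.group(1), m2.group(2)
```

In Expect terms, the first pattern would fill only expect_out(0,string), while the second would also fill expect_out(1,string) and expect_out(2,string).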
It seems that the above explanation is not precise!
Check this example:
$ cat test.exp
#!/usr/bin/expect
set timeout 5
log_user 0
spawn bash
send "ls -1 db*\r"
expect {
-re "^db.*$" {
set bkpfile $expect_out(0,string)
}
}
send_user "The filename is: $bkpfile\n"
close
$ ls -1 db*
dbupgrade.log
$ ./test.exp
can't read "bkpfile": no such variable
while executing
"send_user "The filename is: $bkpfile\n""
(file "./test.exp" line 15)
$
The test result is the same when $expect_out(1,string) or $expect_out(buffer) is used.
Am I missing something, or is this the expected behavior?
Aleksandar - it should work if you change the match to "\ndb.*$".
If you turn on exp_internal 1, you will see the buffer contains something like this: "ls -1 db*\r\ndbupgrade.log\r\n08:46:09"
So, the caret (^) will throw your pattern match off.
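The effect of the caret can be reproduced outside Expect. The sketch below uses Python with the buffer contents quoted above. It assumes that Expect's regexps follow Tcl's defaults, where `.` also matches newlines and `^`/`$` anchor only at the ends of the buffer, and it emulates that with `re.DOTALL`. The third pattern is an extra suggestion, not something from the thread.

```python
import re

# Buffer contents as reported by exp_internal in the answer above.
buffer = "ls -1 db*\r\ndbupgrade.log\r\n08:46:09"

# ^ anchors at the start of the buffer, which begins with the echoed
# command, so this pattern never matches.
anchored = re.search(r"^db.*$", buffer, re.DOTALL)

# Matching the preceding newline instead of ^ finds the file name, but
# .*$ then runs to the end of the buffer.
after_newline = re.search(r"\ndb.*$", buffer, re.DOTALL)

# A tighter variant: capture only up to the end of that line.
tight = re.search(r"\n(db[^\r\n]*)", buffer)
```

Note that the second pattern matches more than the file name, so a capture group like the one in the third pattern may be the more convenient way to pull out just `dbupgrade.log`.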
The overall goal here is to remove a block of text starting with a particular string and ending with a positive lookahead. From the testing I've done, it seems that newlines are causing the problem, but I'm not sure what exactly is going on or the best way to fix it.
More context: I want to remove taxa from a .fasta file, including the taxon name and header information and the associated sequence. (fasta format begins with a header >locusname-locusnumber-species_name |locusname-locusnumber \n). Missing data in the sequence is coded as "-". Eventually I would like to do this for several species_names and do so for each of several thousand files in a directory.
I presumed this would be a simple task to do as a perl one-liner in bash (Ubuntu 18.04.2).
As an example, from the excerpt below I would like to remove the entire sequence of Pseudomyrmex seminole D1367, i.e. the string that starts with >uce-483_Pseudomyrmex_seminole_D1367 |uce-483 and ends with the newline before >uce-483_Pseudomyrmex_seminole_D1435.
For this, I have: perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367)[\s\S]+(?=>)//' infile.fasta > outfile.fasta
or equivalently perl -pe 's/>(.)+(Pseudomyrmex_seminole_D1367(.)+(?=>)//s' infile.fasta > outfile.fasta
Both of these seem to have no effect at all (i.e. diff infile.fasta outfile.fasta is empty.) If I remove the positive lookahead, it works correctly but only up to the first newline.
Here's an excerpt from the .fasta for context and testing:
>uce-483_Pseudomyrmex_seminole_D1366 |uce-483
------------------------------------------------------------
---------------------------------------------------tgtaaacgt
tataatacatgcgtatgaaaaaaaaaagtgaacacccggtacgtacccgtgctgaaacgt
tcagatttacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata----tgtgtgtgtgtgcgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaatgttgaattattttcactcgctc
cacgcgcgagtgttcgctccttttacgcacaacgagtccttctgctgcagc--gagatag
aaaatatttttgcgcggtaatcgtaaacgtatgagtgcctttcgacgtgaattctcttat
ggcagttctcacggtgtaaattataatcgaattaacattgcgagtgtgatctcaatataa
ttatagcgtctaagaacaaacacgtaacatgcacacacacacacacacac----------
---
>uce-483_Pseudomyrmex_seminole_D1367 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ttcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatg---atatatatgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactctgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtacgcgcgcacacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatg-----------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
>uce-483_Pseudomyrmex_seminole_D1435 |uce-483
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------tacatccatttgtagtagcattttcgctagttttttcaagagcaaaaaggaca
cattcaaaactgaatatacatgtcacagatgtttgtttgtgtgcaggtacctgtaatttt
gcaaacatatacctatatatgtgtgtcgcatatatatcatgtagtagatttccatgttat
gcaacatcttctcacaatgacaatcggtcgtttccttcactccgaaatgttcatgcgaac
agttaatctatatcccaagcagcgatgtaatgttatgcggcgcgcaagtctcattagact
tgtaaaccgtccgagtttcgacttaccata--tgtgtgtgtgtgtgtgcgcgtatgtgca
cgtac------acacgtttgtttatacatttgtctatacatttgcgtgtgaacgcgggat
gaacagagatttgcgcacacatagacatgagaaacgtcacttgtcgatgtagatactaat
tgtggaaaatacatattcctcttcagatacacgggaa-----------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---
With -p (or -n) the one-liner reads one line at a time, so it simply cannot match multiline patterns. One solution is to "slurp" the whole file in, if it isn't too large (see the end for a line-by-line solution):
perl -0777 -pe'...' in > out
See Command Switches in perlrun.
Then, the code shown in the question has an unbalanced parenthesis and doesn't compile. Further, there is no reason to capture those .s, so drop the parentheses around them. Next, the pattern
s/>.+Pseudomyrmex_seminole_D1367...//;
matches everything from the very first > to the name of interest, so all preceding sequences are matched and removed as well. Instead, match >[^>]+...D1367, for example: a > followed by anything that isn't a >, up to that phrase.
Finally, the last .+(?=>) will match everything to the very last > and thus the regex will remove all following sequences, not what you want according to the description. Instead, limit it to match to the first following >, either by making it "non-greedy" with .+?(?=>) or, more simply, with [^>]+.
All corrected
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367[^>]+//' in > out
Note that there is no need for /s modifier now, since its purpose is to make . match a newline and here we don't need that since the [^>] does match newlines as well (anything other than >). The quantifier is +? to (hopefully) prevent backtracking each whole sequence that doesn't match.
Or, with your original use of lookahead
perl -0777 -pe's/>[^>]+?Pseudomyrmex_seminole_D1367.+?(?=>)//s' in > out
These work as expected with your sample, as well as with an extended example I made up with further sequences (>...) added.
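For readers who want to try the corrected pattern without Perl, here is a minimal Python sketch of the same substitution on a whole-file string. The records are shortened stand-ins for the real sequences, and `fasta` and `cleaned` are illustrative names only.

```python
import re

# Shortened stand-in for the FASTA excerpt: three records, one to remove.
fasta = (
    ">uce-483_Pseudomyrmex_seminole_D1366 |uce-483\n"
    "--tgtaaacgt\n"
    ">uce-483_Pseudomyrmex_seminole_D1367 |uce-483\n"
    "--ttcaaaact\n"
    "---\n"
    ">uce-483_Pseudomyrmex_seminole_D1435 |uce-483\n"
    "--tacatccat\n"
)

# [^>] also matches newlines, so no DOTALL flag is needed. The pattern can
# only start at the header that contains the name, and it stops right
# before the next '>'.
pattern = r">[^>]+?Pseudomyrmex_seminole_D1367[^>]+"
cleaned = re.sub(pattern, "", fasta)
```

The records before and after the target are left untouched because neither `[^>]+?` nor `[^>]+` can cross a `>`.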
For reference, and since a fasta file can be too big to slurp into a string, here it is line by line.
Once you see the >... line of interest set a flag; print a line if that flag isn't set (and if we aren't on that very line). Once you reach the next > clear the flag (print that line, too).
perl -ne'
if (/^>.+?Pseudomyrmex_seminole_D1367/) { $f = 1 }
elsif (not $f) { print }
elsif (/^>/) { $f = 0; print }
' in > out
I suspect that this may also perform considerably better on very large files.
The regex in the first solution has to scan each sequence whole in order to find that it is not the one of interest; it is only once it hits the next > that it can decide that the sequence doesn't match (and with no backtracking, hopefully, since +? would've stopped it had the right phrase been encountered).
Here the code mostly checks the first character and a flag.
So it's an incomparably lesser workload here -- but here the regex engine is started up on every line, and that is expensive. I can't tell with confidence how they stack against each other without trying.
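The flag-based, line-by-line idea can also be sketched as a Python generator. The function name `drop_record` and the shortened records are made up for illustration, and the flag is set or cleared on every header line, which is a slight restructuring of the Perl logic rather than a literal translation.

```python
def drop_record(lines, name):
    """Yield every line except those of the record whose header contains name."""
    skipping = False
    for line in lines:
        if line.startswith(">"):
            # A header starts a new record: decide whether to skip it.
            skipping = name in line
        if not skipping:
            yield line

# Shortened, hypothetical input lines.
lines = [
    ">uce-483_Pseudomyrmex_seminole_D1366 |uce-483\n",
    "--tgtaaacgt\n",
    ">uce-483_Pseudomyrmex_seminole_D1367 |uce-483\n",
    "--ttcaaaact\n",
    "---\n",
    ">uce-483_Pseudomyrmex_seminole_D1435 |uce-483\n",
    "--tacatccat\n",
]
kept = list(drop_record(lines, "Pseudomyrmex_seminole_D1367"))
```

Because it only ever holds one line, the same function could be fed an open file object instead of a list.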
You can also use > as the input record separator. This way you avoid slurping the whole file, and since the main loop reads your file block by block, you only have to test which block is the target and skip printing it (without having to describe the whole block in a pattern):
perl -ln076e's/\n$//;print ">$_" if $_ && !/Pseudomyrmex_seminole_D1367/' file
The l switch sets the output record separator to the input record separator (a newline by default).
The 0 switch sets the input record separator to > (76 in octal).
Is it possible with Notepad++ (or maybe from the Linux bash shell) to create multiple lines from a pattern, one for each time the pattern is found, and also include the single found pattern in each newly created line?
The multi pattern is val=[0-9]+
The single pattern is id=[a-zA-Z0-9]+
Example:
Input lines:
id=af2477,val=333,val=777
id=af3456,val=222,val=444,val=678
id=af3327,val=3234,val=123,val=701
Output lines:
id=af2477,val=333
id=af2477,val=777
id=af3456,val=222
id=af3456,val=444
id=af3456,val=678
id=af3327,val=3234
id=af3327,val=123
id=af3327,val=701
I have tried with 2 subgroups, but it won't work. It only replaces the second group once:
find what:(id=[a-zA-Z0-9]+,)(val=[0-9]+,)*
replace:\n\1,\2
UPDATE: Both answers from Toto and Wiktor Stribiżew seem to do the job. Haven't tested them yet. I would still like to see how this can work with the use of Notepad++ (even if multiple steps are needed)
Since you also consider using Linux tools for this, an awk solution looks much more viable:
awk 'BEGIN{FS=OFS=","} /^id=[a-zA-Z0-9]+(,val=[0-9]+)*$/{
for(i=2; i<=NF; i++) {
print $1,$i
}; next;
}{print $0}' file > outfile
See the online demo.
Here, any line that matches ^id=[a-zA-Z0-9]+(,val=[0-9]+)*$ (i.e. matches the format of the lines you need to expand) is split the way you need with for(i=2; i<=NF; i++) {print $1,$i}; next;. Else, the line is written as is (print $0).
The BEGIN{FS=OFS=","} part sets the input and output field separator to a comma.
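The same split-and-repeat logic, sketched in Python for comparison. The function name `expand` is hypothetical, and the behavior is meant to mirror the awk script: lines in the expected format are expanded, and everything else is passed through unchanged.

```python
import re

# Same line format as the awk condition.
LINE_RE = re.compile(r"^id=[a-zA-Z0-9]+(,val=[0-9]+)*$")

def expand(lines):
    """Yield one 'id=...,val=...' line per val field; pass other lines through."""
    for line in lines:
        line = line.rstrip("\n")
        if LINE_RE.match(line):
            fields = line.split(",")
            for value in fields[1:]:
                yield f"{fields[0]},{value}"
        else:
            yield line

sample = [
    "id=af2477,val=333,val=777",
    "id=af3456,val=222,val=444,val=678",
    "some other line",
]
result = list(expand(sample))
```

As in the awk version, a line that has an id but no val fields matches the format and produces no output.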
This perl one-liner does the job (output on STDOUT):
perl -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file
id=af2477,val=333
id=af2477,val=777
id=af3456,val=222
id=af3456,val=444
id=af3456,val=678
id=af3327,val=3234
id=af3327,val=123
id=af3327,val=701
Explanation:
($id,$vals)=/(id=\w+),(.+)$/; # explode id and values for each line in input file
say "$id,$_" for split/,/,$vals # print id and each value
You can redirect the output to another file:
perl -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file > outputfile
Or do the change in-place:
perl -i -anE '($id,$vals)=/(id=\w+),(.+)$/;say "$id,$_" for split/,/,$vals' file
It is possible, yet very complex, to do that with one regular expression, for which you would have to use (?R) and conditional statements.
With multiple steps it would be pretty simple. For instance, you can do a find and replace using the maximum number of val entries found in the longest lines. Say 4 is the largest number of val entries; then we'd have four copies of (,val=[^\r\n,]*) in our initial expression:
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
and replace that with four lines,
$1$2\n$1$3\n$1$4\n$1$5
---- ---- ---- ----
Demo for Step 1
For any additional step, we can simply remove one val and one line from the end of initial expression and replacement. For example, our expression would look like
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
in the second step, for which we'd replace it with:
$1$2\n$1$3\n$1$4
---- ---- ----
Demo for Step 2
In the third and final step, our expression has two vals,
^(id=[^\r\n,]*)(,val=[^\r\n,]*)(,val=[^\r\n,]*)$
and our replacement will have two lines:
$1$2\n$1$3
---- ----
Demo for Step 3
For the case given in the question, only two steps are required, and the second and third expressions would likely work just fine.
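To check the multi-step idea outside Notepad++, the steps can be simulated in Python. This is only a sketch: the helper `expand_step` is hypothetical, it builds the expression for a given number of val groups, and Python writes backreferences as `\g<1>` where Notepad++ uses `$1`.

```python
import re

def expand_step(text, n_vals):
    """Apply one find-and-replace step for lines with exactly n_vals entries."""
    pattern = r"^(id=[^\r\n,]*)" + r"(,val=[^\r\n,]*)" * n_vals + r"$"
    # One output line per val group: \g<1>\g<2>, \g<1>\g<3>, ...
    replacement = "\n".join(f"\\g<1>\\g<{i}>" for i in range(2, n_vals + 2))
    return re.sub(pattern, replacement, text, flags=re.MULTILINE)

text = (
    "id=af2477,val=333,val=777\n"
    "id=af3456,val=222,val=444,val=678\n"
    "id=af3327,val=3234,val=123,val=701"
)
# Run the steps from the longest expression down to two vals.
for n in (4, 3, 2):
    text = expand_step(text, n)
```

Lines already expanded by an earlier step contain a single val, so the later, shorter expressions leave them alone.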
I have a file path from which I need to take just the file's name:
/var/www/foo/dog.tur-tles.chickens.txt
I want to match just the:
dog.tur-tles.chickens
I have tried this in regexer:
([^\/]*)$
This matches:
dog.tur-tles.chickens.txt
I can't figure out how to only exclude that last period.
You can assume it will always be a .txt, but I wanted to build in the ability that if a file was named dog-turtles.txt.txt it would see that the name is dog-turtles.txt.
You could use something like so: ([^\/]*)(\.).+?$.
An example is available here. Note, though, that this will fail for extensions such as .tar.gz and so on.
You may use fileparse from File::Basename to get the file name, then use rindex to find the last index of . and substr to extract the required substring:
use File::Basename;
$x = fileparse('/var/www/foo/dog.tur-tles.chickens.txt');
print substr($x, 0, rindex($x, '.')) . "\n";
Output of a sample program:
dog.tur-tles.chickens
$name = ($pathname =~ s{.*/}{}r =~ s{\.[^.]+$}{}r)
Substitution 1: remove the directory.
Substitution 2: remove the extension, if present.
Just add .txt to your regex and since * is greedy by default it will match everything till last .txt
([^\/]*)\.txt$
Input:
/var/www/foo/dog.tur-tles.chickens.txt.txt
/var/www/foo/dog.tur-tles.chickens.txt
Output:
dog.tur-tles.chickens.txt
dog.tur-tles.chickens
See DEMO
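The regex-only approaches in this thread are easy to compare side by side. Below is a small Python sketch using the paths from the question. The function names `stem_txt` and `stem_any` are invented here, and `stem_any` is a Python rendering of the two-substitution idea rather than the original Perl.

```python
import re

def stem_txt(path):
    """Greedy match up to the final '.txt', as in ([^\\/]*)\\.txt$."""
    return re.search(r"([^/]*)\.txt$", path).group(1)

def stem_any(path):
    """Strip the directory, then the last extension, whatever it is."""
    name = re.sub(r".*/", "", path)
    return re.sub(r"\.[^.]+$", "", name)

single = "/var/www/foo/dog.tur-tles.chickens.txt"
double = "/var/www/foo/dog.tur-tles.chickens.txt.txt"
results = [stem_txt(single), stem_txt(double), stem_any(single), stem_any(double)]
```

Both variants remove only the last extension, so a name ending in `.txt.txt` keeps its first `.txt`.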
Following is a line from an ftp log:
2013-03-05 18:37:31 543.21.12.22 []sent /home/mydomain/public_html/court-9746hd/Chairman-confidential-video.mpeg 226 court-9746hd#mydomain.com 256
I am using a program called Simple Event Correlator (SEC), which pulls values from inside the parentheses of a regular expression and sets those values to variables.
So, here is an entry in a SEC config file which is supposed to operate on the previous log file line:
pattern=sent \/home\/mydomain\/public_html\/(.*)\/(.*)
This succeeds in pulling out the logged in user, court-9746hd, and setting it to a variable, but fails to properly extract the file name downloaded, or, Chairman-confidential-video.mpeg
Instead, it pulls out the file downloaded as: Chairman-confidential-video.mpeg 226 court-9746hd#mydomain.com 256
So you see, I'm having difficulty getting the second extraction to stop at the first white space after the file name. I've tried:
pattern=sent \/home\/mydomain\/public_html\/(.*)\/(.*)\s
but I only get the same result. Any help would be greatly appreciated.
If you only want to match non-whitespace, replace .* with \S* or if space is the only character you want to exclude then use [^ ]* instead.
Also, man perlre is a good reference.
Rather than using the .* construct, use something narrower in scope, as a general rule. In this case what you want is something which is not a white space, so say that explicitly:
pattern=sent \/home\/mydomain\/public_html\/([^\s]+)\/([^\s]+)
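The difference between the greedy .* and a non-whitespace class can be seen on the log line from the question. Here is a sketch in Python, not SEC itself; it assumes SEC's Perl-style regexes behave the same way for these simple patterns.

```python
import re

line = (
    "2013-03-05 18:37:31 543.21.12.22 []sent "
    "/home/mydomain/public_html/court-9746hd/Chairman-confidential-video.mpeg "
    "226 court-9746hd#mydomain.com 256"
)

# Greedy .* in the second group runs to the end of the line.
greedy = re.search(r"sent /home/mydomain/public_html/(.*)/(.*)", line)

# \S+ cannot cross whitespace, so the second group stops after the file name.
narrow = re.search(r"sent /home/mydomain/public_html/(\S+)/(\S+)", line)

user, filename = narrow.group(1), narrow.group(2)
```

In both cases the first group ends at the last slash of the path; only the second group changes.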
One option is to first capture the full path from the line, and then use File::Spec to get the user and file info:
use strict;
use warnings;
use File::Spec;
my $line = '2013-03-05 18:37:31 543.21.12.22 []sent /home/mydomain/public_html/court-9746hd/Chairman-confidential-video.mpeg 226 court-9746hd#mydomain.com 256';
my ( $path ) = $line =~ m!\s+(/home\S+)\s+!;
my ( $user, $file ) = ( File::Spec->splitdir($path) )[ -2, -1 ];
print "User: $user\nFile: $file";
Output:
User: court-9746hd
File: Chairman-confidential-video.mpeg
However, if you want to only use a regex, the following will work:
m!/home/.+/.+/([^/]+)/(\S+)!
I'm trying to find a regex that works to match a string of escape characters (an Expect response, see this question) and a six digit number (with alpha-numeric first character).
Here's the whole string I need to identify:
\r\n\u001b[1;14HX76196
Ultimately I need to extract the string:
X76196
Here's what I have already:
interact {
#...
#...
#this expression does not identify the screen location
#I need to find "\r\n\u001b[1;14H" AND "([a-zA-Z0-9]{1})[0-9]{5}$"
#This regex was what I was using before.
-nobuffer -re {^([a-zA-Z0-9]{1})?[0-9]{5}$} {
set number $interact_out(0,string)
}
}
I need to identify the escape characters to verify that it is a field in that screen region. So I need a regex that includes that first portion, but the backslashes are confusing me...
Also once I have the full string in the $number variable, how do I isolate just the number in another variable in Tcl?
If you just want the number at the end, then this should be enough...
[0-9]{6}
Update with new information
Assuming \n is a newline character, rather than a literal \ followed by a literal n, you can do this...
\r\n\u001B\[1;14H(X[0-9]{5})
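A quick way to sanity-check a pattern like this is to run it against the raw string. The sketch below is Python, not Tcl, so the escaping rules differ from what an Expect script would need. It writes the escape character as `\x1b`, and the second, more general pattern (any alphanumeric first character) is an assumption based on the question's description.

```python
import re

# The raw screen data from the question: CR, LF, ESC, then "[1;14H" and the field.
data = "\r\n\x1b[1;14HX76196"

# The literal '[' must be escaped; the parentheses capture just the number.
fixed_x = re.search(r"\r\n\x1b\[1;14H(X[0-9]{5})", data)

# Variant allowing any alphanumeric first character (an assumption).
general = re.search(r"\x1b\[1;14H([a-zA-Z0-9][0-9]{5})", data)

number = fixed_x.group(1)
```

With a capture group around the field, the number is available directly, without slicing the matched string by fixed offsets.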
I found out a few things with some more digging. First of all I wasn't looking at the output of the program but the input of the user. I needed to add the "-o" flag to look at the program output. I also shortened the regex to just the necessary part.
The regex example from @rikh led me to look at why both his regex and my own were failing: I was looking at the user's input rather than the program's output. So the original regex I tried wasn't at fault; the data being matched was (the "-o" flag was missing).
Here's the complete answer to my problem.
interact {
#...
-o -nobuffer -re {(\[1;14H[a-zA-Z0-9]{1})[0-9]{5}} {
#get number in place
set numraw $interact_out(0,string)
#get just number out
set num [string range $numraw 6 11]
#switch to lowercase
set num [string tolower $num]
send_user " stored number: $num"
}
}
I'm a noob with Expect and Tcl so if any of this doesn't make sense or if you have any more insights into the interact flags, please set me straight.