Regular expression, tcl - regex

I'm trying to extract the specific lines from a trace file like below:
- 0.118224 0 7 ack 40 ------- 1 2.0 7.0 0 2
r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1
+ 0.118436 1 2 ack 40 ------- 2 3.1 2.1 0 3
- 0.118436 1 2 ack 40 ------- 2 4.1 2.1 0 3
r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2
I want to extract any line that has the following form:
r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x.
Note: x means any value, and "numbers" can be anything from 3 to 7.
Here is my attempt, which is not working:
if {[regexp \r+ ([0-9.]+) 1 2.*- ([3-7.]+) 2.*- ([0-9.]+) $line -> time]}
Any suggestions?

Here's another approach: extract the fields you want to use for comparison.
while {[gets $f line] != -1} {
    lassign [split $line] a - b c - - - - d e - -
    if {
        $a eq "r" &&
        $b == 1 &&
        $c == 2 &&
        3 <= floor($d) && floor($d) <= 7 &&
        floor($e) == 2
    } {
        puts $line
    }
}

You have to escape the . as \. because an unescaped . means "any character" in a regexp.
So your regexp could look like:
if {[regexp {r \d\.\d{5} 1 2 \d{3} \d{2} ------- \d [3-7]\.\d 2\.\d \d \d} $line -> time ]} {
# ...
}
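Now you have to place () around the part you want to capture. Note that this pattern is a literal translation of your description: \d matches digits only, so \d{3} will not match a protocol name such as tcp or ack, and \d{5} will not match a six-digit fraction such as the ones in your sample. You may need something like \S+ and \d+ in those places.
Btw: I used the following transformation on your description of what you want to match:
set input {r x.xxxxx 1 2 xxx xx ------- x numbers.x 2.x x x}
set re [subst [regsub -all {x{2,}} $input {\\\\d{[string length \0]}}]]
set re [string map {. {\.} x {\d} numbers {[3-7]}} $re]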
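
For comparison, here is a minimal Python sketch of the same filter, written against the sample trace lines from the question. The field layout (event, time, from, to, type, size, flags, flow id, source, destination, sequence, id) is an assumption read off that sample, and the names TRACE_RE and matching_times are made up for illustration.

```python
import re

# Assumed layout, based only on the sample lines in the question:
# event time from to type size flags fid src dst seq id
TRACE_RE = re.compile(
    r"^r (?P<time>\d+\.\d+) 1 2 \S+ \d+ ------- \d+ [3-7]\.\d+ 2\.\d+ \d+ \d+$"
)

def matching_times(lines):
    """Return the time field of every line that fits the pattern."""
    times = []
    for line in lines:
        m = TRACE_RE.match(line.strip())
        if m:
            times.append(m.group("time"))
    return times

sample = [
    "- 0.118224 0 7 ack 40 ------- 1 2.0 7.0 0 2",
    "r 0.118436 1 2 tcp 40 ------- 2 7.1 2.1 0 1",
    "+ 0.118436 1 2 ack 40 ------- 2 3.1 2.1 0 3",
    "r 0.120256 0 7 ack 40 ------- 1 2.0 7.0 0 2",
]
print(matching_times(sample))  # ['0.118436']
```

The dots are escaped, the protocol field uses \S+ rather than \d, and the time is the only named capture group.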

regex match non white space characters except new line with perl

I want to match the fifth column, i.e. ",,,", ",,,," and ",", but not the newline "\n", and replace the matches with some value. The following file is delimited by spaces. I tried the following code:
Note: Although the example shows commas in the fifth column, it could contain any characters (including tab, \t) other than newline (\n).
my $delimiter="**";
my $dir_to_check=$DIR;
opendir my $DIR, $dir_to_check or die "Error in opening dir '$dir_to_check' because: $!";
my @files = readdir($DIR);
closedir($DIR);
foreach my $file (@files)
{
if($file =~ /\.fmt/)
{
unless ( open( CONTRL_FILE, "< $dir_to_check/$file" ) ) {
print "error while opening file $dir_to_check/$file \n"
} # UNLESS
if ($file eq 'test.fmt')
{
unless ( open( CONTRL_FILE_1, "> $dir_to_check/$file.temp" ) ) {
print "error while opening file $file \n"
} # UNLESS
while(<CONTRL_FILE>)
{
$_ =~ s/"[^\s]+"/"$delimiter"/ ;
print CONTRL_FILE_1 $_;
}
close(CONTRL_FILE_1);
}
}
}
Data:
1 SQLCHAR 0 5 ",,," 1 ""
2 SQLCHAR 0 25 ",,,," 2 ""
3 SQLCHAR 0 1 "," 3 ""
4 SQLCHAR 0 12 "," 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
Result:
1 SQLCHAR 0 5 "*****" 1 ""
2 SQLCHAR 0 25 "*****" 2 ""
3 SQLCHAR 0 1 "*****" 3 ""
4 SQLCHAR 0 12 "*****" 4 ""
5 SQLCHAR 0 1 "*****" 5 ""
Expected Result :
1 SQLCHAR 0 5 "**" 1 ""
2 SQLCHAR 0 25 "**" 2 ""
3 SQLCHAR 0 1 "**" 3 ""
4 SQLCHAR 0 12 "**" 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
If you are using an older version of Perl then that may be a factor. In any case, I would suggest you make a minor modification ...
$_ =~ s/"[^\s"]+"/"$delimiter"/;
... that is a ", one or more NOT whitespace OR ", then a "
C:\Users\Ken>type test.pl
#!C:\Strawberry\perl\bin\perl -w
$\="\n";
my $d="**";
my $L1="5 SQLCHAR 0 1 \",,,\" 5 \"\"";
my $L2="5 SQLCHAR 0 1 \"\n\" 5 \"\"";
foreach my $L ($L1,$L2)
{
print "LineIn=$L";
if ($L=~ s/"[^\s"]+"/"$d"/) {print "#YES L=$L";}
else {print "#NO L=$L";}
}
C:\Users\Ken>test.pl
LineIn=5 SQLCHAR 0 1 ",,," 5 ""
#YES L=5 SQLCHAR 0 1 "**" 5 ""
LineIn=5 SQLCHAR 0 1 "
" 5 ""
#NO L=5 SQLCHAR 0 1 "
" 5 ""
The OP says in a comment that the 4th column can contain "any combination of non white-space characters", and that no substitution should happen when the 4th column literally contains "\n". I therefore suggest matching the contents of the 4th column first and then testing, in a second step, whether the quoted text includes a literal representation of something Perl would treat as whitespace.
For doing that, we could use eval or we could use a regexp with the ee modifier, which is better and safer.
Here is an example using the latter (update - dataset correctly includes the OP's and additional cases):
#!/usr/bin/perl
use strict;
use warnings;
my $delimiter="**";
while (<DATA>) {
# we capture the contents of the quotes in
# 4th column, checking also the expected format
if (/(^([^\s]+\s+){4})"([^"]+)"(.*)/) {
my $st = $3;
# "\n" in the file is actually "\\n" for Perl
# so, to have Perl understand it as "\n", we need
# to have Perl effectively escape it, we can
# do that with a regexp and the ee modifier
$st =~ s/\\([tnfr])/"qq{\\$1}"/gee;
# now this will match an "\n", "\r", "\f" or "\t"
if (!($st =~ /\s/)) {
print "$1\"$delimiter\"$4\n";
} else {
print $_;
}
} else {
print "error: wrong line format: $_\n";
}
}
__DATA__
1 SQLCHAR 0 5 ",,," 1 ""
2 SQLCHAR 0 25 ",,,," 2 ""
3 SQLCHAR 0 1 "," 3 ""
4 SQLCHAR 0 12 "," 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
6 SQLCHAR 0 8 "a b" 6 ""
7 SQLCHAR 0 8 "\t" 7 ""
8 SQLCHAR 0 9 "\" 8 ""
9 SQLCHAR 0 9 "stuff\" 8 ""
which would result in:
1 SQLCHAR 0 5 "**" 1 ""
2 SQLCHAR 0 25 "**" 2 ""
3 SQLCHAR 0 1 "**" 3 ""
4 SQLCHAR 0 12 "**" 4 ""
5 SQLCHAR 0 1 "\n" 5 ""
6 SQLCHAR 0 8 "a b" 6 ""
7 SQLCHAR 0 8 "\t" 7 ""
8 SQLCHAR 0 9 "**" 8 ""
9 SQLCHAR 0 9 "**" 8 ""
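
The same two-step idea can be sketched in Python without evaluating any code: capture the quoted field, then skip the line if the field contains real whitespace or a literal escape for one. This is only a sketch under assumptions: the set of escapes (\t, \n, \f, \r) mirrors the [tnfr] class used above, and the names LINE_RE, ESCAPED_WS and replace_field are invented for the example.

```python
import re

DELIMITER = "**"
# Literal backslash escapes that stand for whitespace (assumed set).
ESCAPED_WS = re.compile(r"\\[tnfr]")
# Four whitespace-separated columns, a quoted field, then the rest of the line.
LINE_RE = re.compile(r'^((?:\S+\s+){4})"([^"]+)"(.*)$')

def replace_field(line):
    """Replace the quoted field unless it holds (escaped) whitespace."""
    m = LINE_RE.match(line)
    if m is None:
        return line  # unexpected format: leave the line untouched
    head, content, tail = m.groups()
    if ESCAPED_WS.search(content) or re.search(r"\s", content):
        return line
    return f'{head}"{DELIMITER}"{tail}'

for row in [
    r'1 SQLCHAR 0 5 ",,," 1 ""',
    r'5 SQLCHAR 0 1 "\n" 5 ""',
    r'6 SQLCHAR 0 8 "a b" 6 ""',
    r'8 SQLCHAR 0 9 "\" 8 ""',
]:
    print(replace_field(row))
```

Unlike the Perl version, this sketch returns badly formatted lines unchanged instead of printing an error.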
Please note that there is no easy way to determine what a given script running on a given environment could understand as being a "Perl whitespace", since it depends on many factors, and that [\t\n\f\r ] is just a simplified view of what Perl can understand as whitespace.
Quoting a bit of perlrecharclass:
Whitespace
\s matches any single character considered whitespace.
If the /a modifier is in effect ...
In all Perl versions, \s matches the 5 characters [\t\n\f\r ]; that is, the horizontal tab, the newline, the form feed, the carriage
return, and the space. Starting in Perl v5.18, it also matches the
vertical tab, \cK . See note 1 below for a discussion of this.
otherwise ...
For code points above 255 ...
\s matches exactly the code points above 255 shown with an "s" column in the table below.
For code points below 256 ...
if locale rules are in effect ...
\s matches whatever the locale considers to be whitespace. (...)

What is the regular expression for a total 10 digit number with a decimal precision of 1 or 2?

I am trying to write a regex that validates a number with at most 10 digits in total, of which 1 or 2 may be decimal places.
I have tried this so far:
^(\d){0,8}(\.){0,1}(\d){0,2}$
It works in most cases but fails for the following input:
123456789.0
Valid example:
1234567890 (total 10 digits)
1234567.1 (total 8 digits)
12345678.10 (total 10 digits)
123456789.1 (total 10 digits)
Invalid example :
12345678901 (11 characters)
Here is a way to go:
^(?:\d{1,10}|(?=\d+\.\d\d?$)[\d.]{3,11})$
Explanation:
^ : beginning of string
(?: : start non-capturing group
\d{1,10} : 1 up to 10 digits
| : OR
(?= : start lookahead
\d+\.\d\d?$ : 1 or more digits, then a dot, then 1 or 2 digits
) : end lookahead
[\d.]{3,11} : only digits or dots are allowed, with a length from 3 up to 11
) : end group
$ : end of string
In action:
#!/usr/bin/perl
use Modern::Perl;
my $re = qr~^(?:\d{1,10}|(?=\d+\.\d\d?$)[\d.]{3,11})$~;
while(<DATA>) {
chomp;
say (/$re/ ? "OK: $_" : "KO: $_");
}
__DATA__
1
123
1.2
1234567890
1234567.1
12345678.10
123456789.1
12345678901
1.2.3
Output:
OK: 1
OK: 123
OK: 1.2
OK: 1234567890
OK: 1234567.1
OK: 12345678.10
OK: 123456789.1
KO: 12345678901
KO: 1.2.3
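
As a cross-check, the same pattern can be exercised from Python. This is a small sketch rather than a port of the Perl script; the names TEN_DIGITS and is_valid are made up here, and re.fullmatch is used so that a trailing newline is not silently accepted.

```python
import re

# Same pattern as above; fullmatch anchors both ends of the string.
TEN_DIGITS = re.compile(r"(?:\d{1,10}|(?=\d+\.\d\d?$)[\d.]{3,11})")

def is_valid(text):
    """True if text has at most 10 digits, with 0, 1 or 2 decimals."""
    return TEN_DIGITS.fullmatch(text) is not None

for candidate in ["1234567890", "123456789.0", "12345678.10",
                  "12345678901", "123456789.12", "1.2.3"]:
    print(candidate, is_valid(candidate))
```

The length limit of 11 on [\d.] is what caps the total at 10 digits: one of the 11 characters is always the dot, because the lookahead has already required exactly one.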
A solution using the String.prototype.match() and RegExp.prototype.test() functions:
var isValid = function (num) {
return /^\d+(\.\d+)?$/.test(num) && String(num).match(/\d/g).length <= 10;
};
console.log(isValid(1234567890));
console.log(isValid(12345678.10));
console.log(isValid(12345678901));
console.log(isValid('123d3457'));
You can build your pattern in two steps:
First step
You need 8 digits, plus 1 or 2 decimal digits that are both optional
\d{8}\.?\d?\d? Here the . and both digits are optional
Second step
You need 9 digits, plus 1 decimal digit and nothing more
\d{9}\.?\d? Here the . and the digit are optional
Then you can combine these two rules with the alternation operator |
^(\d{8}\.?\d?\d?|\d{9}\.?\d?)$
Now this regex only matches numbers with 8 to 10 digits in total, including 1 or 2 decimal digits.
It never matches fewer than 8 digits. The trick is to change \d{8} in the first step to \d{1,8}; the pattern then matches everything from 1 to 9999999999, plus 1 or 2 decimal digits.
What you want:
^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$
echo 1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1
echo 9999999999 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
9999999999
echo 1.1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1.1
echo 1.12 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1.12
echo 1234567.1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1234567.1
echo 1234567.12 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
1234567.12
echo 99999999.9 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
99999999.9
echo 99999999.99 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
99999999.99
No match:
echo 1.111 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 1234567.111 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 123456781.11 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 1234567891.1 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'
echo 123456789101 | perl -lne '/^(\d{1,8}\.?\d?\d?|\d{9}\.?\d?)$/ && print $&'

Get next 5 lines after regexp is matched in tcl

How can I get the next 5 lines after a certain pattern is matched in Tcl?
I have some 30 lines of output and need only a few lines in between.
Might be easier to split the output into a list of lines so you can use lsearch:
% set output [exec seq 10]
1
2
3
4
5
6
7
8
9
10
% set lines [split $output \n]
1 2 3 4 5 6 7 8 9 10
% set idx [lsearch -regexp $lines {4}]
3
% set wanted [lrange $lines $idx+1 $idx+5]
5 6 7 8 9
Just append something to your regular expression! Like this:
([^\n]*\n){5}
Glenn Jackman's solution is probably better, but the line processing command in fileutil can be preferable for some variations.
package require fileutil
Given a file that looks like this:
% cat file.txt
1
2
3
4
5
6
7
8
9
10
Now, for each line in the file:
set n 0
set re 4
set nlines 5
::fileutil::foreachLine line file.txt {
    if {$n > 0} {
        puts $line
        incr n -1
    }
    if {$n == 0 && [regexp $re $line]} {
        set n $nlines
    }
}
If the counter n is greater than 0, print the line and decrement n. If n is equal to 0 and the regular expression matches the line, set n to $nlines (5).
# output:
5
6
7
8
9
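
For readers who want to compare, the list-based approach (find the index of the first matching line, then take a slice) might look like this in Python. The function name lines_after_match and its parameters are invented for this sketch.

```python
import re

def lines_after_match(lines, pattern, count=5):
    """Return up to `count` lines following the first line matching `pattern`."""
    regex = re.compile(pattern)
    for idx, line in enumerate(lines):
        if regex.search(line):
            return lines[idx + 1 : idx + 1 + count]
    return []  # no line matched

output = "\n".join(str(n) for n in range(1, 11))  # same as `seq 10`
print(lines_after_match(output.split("\n"), r"4"))
# ['5', '6', '7', '8', '9']
```

Slicing past the end of the list is safe, so a match near the end simply returns fewer lines.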
Documentation: fileutil package, if, incr, package, puts, Syntax of Tcl regular expressions, regexp, set

Perl: Regex - matching values with alphabets

I have written a small perl "hack" to replace 1's with alphabets in a range of columns in a tab delimited file. The file looks like this:
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
The goal is to take columns 11 through 16 and append letters A to Z if a value of 1 is seen in those columns. My code so far is producing an empty output, and this is my first time doing regular expressions.
cat infile.txt \
| perl -ne '@alphabet=("A".."Z");
$is_known_intron = 0;
$is_known_donor = 1;
$is_known_acceptor = 1;
chomp;
$_ =~ s/^\s+//;
@d = split /\s+/, $_;
@d_bool=@d[$11-$16];
$ct=1;
$known_intron = $d[$10];
$num_of_overlapping_gene = $d[$9];
$known_acceptor = $d[$8];
$known_donor = $d[$7];
$k="";
if (($known_intron == $is_known_intron) and ($known_donor == $is_known_donor) and ($known_acceptor == $is_known_acceptor)) {
for ($i = 0; $i < scalar @d_bool; $i++){
$k.=$alphabet[$i] if ($d_bool[$i])
}
$alphabet_ct{$k}+=$ct;
}
END
{
foreach $k (sort keys %alphabet_ct){
print join("\t", $k, $alphabet_ct{$k}), "\n";
}
} '\
> Outfile.txt
What should I be doing instead?
Thanks!
* Edit *
Expected Output
ABCD 45
BCD 23
ABCDEF 1215
so on and so forth.
I converted your code into a script for ease of debugging. I've put comments in the code to point out dodgy bits:
use strict;
use warnings;
my %alphabet_ct;
my @alphabet = ( "A" .. "Z" );
my $is_known_intron = 0;
my $is_known_donor = 1;
my $is_known_acceptor = 1;
while (<DATA>) {
# don't process the first line
next unless /chr10/;
chomp;
# this should remove whitespace at the beginning of the line but is doing nothing as there is none
$_ =~ s/^\s+//;
my @d = split /\s+/, $_;
# the range operator in perl is .. (not "-")
my @d_bool = @d[ 10 .. 15 ];
my $known_intron = $d[9];
my $known_acceptor = $d[7];
my $known_donor = $d[6];
my $k = "";
# this expression is false for all the data in the sample you provided as
# $is_known_intron is set to 0
if ( ( $known_intron == $is_known_intron )
and ( $known_donor == $is_known_donor )
and ( $known_acceptor == $is_known_acceptor ) )
{
for ( my $i = 0; $i < scalar @d_bool; $i++ ) {
$k .= $alphabet[$i] if $d_bool[$i];
}
# it is more idiomatic to write $alphabet_ct{$k}++;
# $alphabet_ct{$k} += $ct;
$alphabet_ct{$k}++;
}
}
foreach my $k ( sort keys %alphabet_ct ) {
print join( "\t", $k, $alphabet_ct{$k} ) . "\n";
}
__DATA__
Chr Start End Name Score Strand Donor Acceptor Merged_Transcript Gencode Colon Heart Kidney Liver Lung Stomach
chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0
chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0
chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0
chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0
chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0
With $is_known_intron set to 1, the sample data gives the results:
ABCDE 1
ABCE 1
ACD 1
CE 2
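
The core of the script (turn the flags in columns 11 to 16 into a letter string and count each string) can also be sketched in Python. This sketch deliberately leaves out the intron, donor and acceptor filter, which does not change the counts for the sample rows; count_patterns and sample_rows are illustrative names only.

```python
from collections import Counter
import string

def count_patterns(rows, first=10, last=15):
    """Count letter patterns built from the 0/1 flags in fields first..last."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        flags = fields[first : last + 1]
        # A for the first flag column, B for the second, and so on.
        key = "".join(
            letter
            for letter, flag in zip(string.ascii_uppercase, flags)
            if flag == "1"
        )
        counts[key] += 1
    return counts

sample_rows = [
    "chr10 100177483 100177931 . . - 1 1 1 1 1 0 1 1 0 0",
    "chr10 100178014 100179801 . . - 1 1 1 1 1 1 1 1 1 0",
    "chr10 100179915 100182125 . . - 1 1 1 1 1 1 1 0 1 0",
    "chr10 100182270 100183359 . . - 1 1 1 1 0 0 1 0 1 0",
    "chr10 100183644 100184069 . . - 1 1 1 1 0 0 1 0 1 0",
]
for key, total in sorted(count_patterns(sample_rows).items()):
    print(f"{key}\t{total}")
```

Field indices are zero-based here, as in the Perl script, so columns 11 to 16 are indices 10 to 15.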

Break down text file in bash

I have a text file in the following format:
variableStep chrom=chr1 span=10
10161 1
10171 1
10181 2
10191 2
10201 2
10211 2
10221 2
10231 2
10241 2
10251 1
variableStep chrom=chr10 span=10
70711 1
70721 2
70731 2
70741 2
70751 2
70761 2
70771 2
70781 2
70791 1
71161 1
71171 1
71181 1
variableStep chrom=chr11 span=10
104731 1
104741 1
104751 1
104761 1
104771 1
104781 1
104791 1
104801 1
128711 1
128721 1
128731 1
I need a way to break this down into several files named, for example, "chr1.txt", "chr10.txt" and "chr11.txt". How would I go about doing this?
I thought about the following approach:
cat file.txt | \
while IFS=$'\t' read -r -a rowArray; do
echo -e "${rowArray[0]}\t${rowArray[1]}\t${rowArray[2]}"
done > $file.mod.txt
That reads line by line and then saves line by line. However, I need something a little more elaborate that spans rows. "chr1.txt" would include everything from the row 10161 1 to row 10251 1, "chr10.txt" would include everything from the row 70711 1 to row 71181 1, etc. It's also specific in that I have to read in the actual chr# from each line as well, and save that as the file name.
The help is really appreciated.
awk -F'[ =]' '
$1 == "variableStep" {file = $3 ".txt"; next}
file != "" {print > file}' < input.txt
This worked for me:
IFS=$'\n'
curfile=""
content=($(< file.txt))
for ((idx = 0; idx < ${#content[@]}; idx++)); do
if [[ ${content[idx]} =~ ^.*chrom=(\\b.*?\\b)\ .*$ ]]; then
curfile="${BASH_REMATCH[1]}.txt"
rm -rf ${curfile}
elif [ -n "${curfile}" ]; then
echo ${content[idx]} >> ${curfile}
fi
done
Awk is appropriate for this problem domain because the text file is already (more or less) organized into columns. Here's what I would use:
awk 'NF == 3 && index($2, "=") { filename = substr($2, index($2, "=") + 1) }
NF == 2 && filename { print $0 > (filename ".txt") }' < input.txt
Explanation:
Think of the lines starting with variableStep as "three columns" and the other lines as "two columns". The above script says, "Parse the text file line-by-line; if a line has three columns and the second column contains an '=' character, assign 'all of the characters in the second column that occur after the '=' character' to a variable called filename. If a line has two columns and the filename variable's been assigned, write the entire line to the file that's constructed by concatenating the string in the filename variable with '.txt'".
Notes:
NF is a built-in variable in Awk that represents the "number of fields", where a "field" (in this case) can be thought of as a column of data.
$0 and $2 are built-in variables that represent the entire line and the second column of data, respectively. ($1 represents the first column, $3 represents the third column, etc...)
substr and index are built-in functions described here: http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
The redirection operator (>) acts differently in Awk than it does in a shell script; subsequent writes to the same file are appended.
String concatenation is performed simply by writing expressions next to each other. The parentheses ensure the concatenation happens before the file gets written to.
More details can be found here: http://www.gnu.org/software/gawk/manual/gawk.html#Two-Rules
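
The two rules described above (remember the name from a header line, then send every data line to that name) can be sketched in Python as well. This is an illustration under assumptions, not a replacement for the awk script: the helper names group_by_chrom and write_groups are invented, and the grouping is kept separate from the file writing so it can be checked without touching the disk.

```python
import re
from pathlib import Path

HEADER_RE = re.compile(r"^variableStep\s+chrom=(\S+)")

def group_by_chrom(text):
    """Map each chrom name to the data lines that follow its header."""
    groups = {}
    current = None
    for line in text.splitlines():
        m = HEADER_RE.match(line)
        if m:
            current = m.group(1)           # rule 1: remember the name
            groups.setdefault(current, [])
        elif current is not None and line.strip():
            groups[current].append(line)   # rule 2: collect data lines
    return groups

def write_groups(groups, out_dir="."):
    """Write each group to <out_dir>/<chrom>.txt."""
    for name, rows in groups.items():
        Path(out_dir, f"{name}.txt").write_text("\n".join(rows) + "\n")

demo = "variableStep chrom=chr1 span=10\n10161 1\n10171 1\n" \
       "variableStep chrom=chr10 span=10\n70711 1\n"
print(group_by_chrom(demo))
```

Calling write_groups(group_by_chrom(text)) would then produce chr1.txt, chr10.txt and so on in the current directory.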
I used sed to filter.
Code:
Kaizen ~/so_test $ cat zsplit.sh
cntr=1;
prev=1;
for curr in `cat ztmpfile2.txt | nl | grep variableStep | tr -s " " | cut -d" " -f2 | sed -n 's/variableStep//p'`
do
sed -n "$prev,$(( ${curr} - 1))p" ztmpfile2.txt > zchap$cntr.txt ;
#echo "displaying : : zchap$cntr.txt " ;
#cat zchap$cntr.txt ;
prev=$curr; cntr=$(( $cntr + 1 ));
done
sed -n "$prev,$ p" ztmpfile2.txt > zchap$cntr.txt ;
#echo "displaying : : zchap$cntr.txt " ;
#cat zchap$cntr.txt ;
Output:
Kaizen ~/so_test $ ./zsplit.sh
+ ./zsplit.sh
zchap1.txt :: 1 :: 1
displaying : : zchap1.txt
variableStep chrom=chr1 span=10
zchap2.txt :: 1 :: 12
displaying : : zchap2.txt
variableStep chrom=chr1 span=10
10161 1
10171 1
10181 2
10191 2
10201 2
10211 2
10221 2
10231 2
10241 2
10251 1
zchap3.txt :: 12 :: 25
displaying : : zchap3.txt
variableStep chrom=chr10 span=10
70711 1
70721 2
70731 2
70741 2
70751 2
70761 2
70771 2
70781 2
70791 1
71161 1
71171 1
71181 1
displaying : : zchap4.txt
variableStep chrom=chr11 span=10
104731 1
104741 1
104751 1
104761 1
104771 1
104781 1
104791 1
104801 1
128711 1
128721 1
128731 1
From the resulting zchap* files, if you want, you can remove the line variableStep chrom=chr11 span=10 by using sed: sed -i '/variableStep/d' zchap*
Does this help?