Regex to capture data - regex

I am trying to capture a set of strings using regular expressions.
The strings are of the following format
CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694
The expression that I came up with is
[A-Za-z][a-z0-9A-Z_-]*\s*[0-6]\s*[0-4]\s\s[\s\d]\d\s*[0-9,]*\s*[0-9,]*\s*[0-9,]*\s*[0-9,]*\s*0\s*[0-9,]*
Although this expression works for me and gives the necessary output, I feel that it is not optimized.
Can someone help me optimize the expression?

<?php
$text = 'CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694';
preg_match("/^[A-Z]+\s+[A-Za-z_]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9,]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9]+\s+[0-9,]+$/", $text, $m);
print_r($m);
preg_match("/^([A-Z]+)\s+([A-Za-z_]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9,]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9]+)\s+([0-9,]+)$/", $text, $m);
print_r($m);
/*
Output:
Array
(
[0] => CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694
)
Array
(
[0] => CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694
[1] => CID
[2] => _At
[3] => 1
[4] => 2
[5] => 99
[6] => 1,198,498,377
[7] => 414
[8] => 0
[9] => 0
[10] => 0
[11] => 3,694
)
*/
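Remove the * quantifiers and, if you need the individual values, group the internal matches with ().
The example above is in PHP. If a field has a fixed length, replace "+" with "{1}" (exactly one character).
If you need a minimum of 1 and a maximum of 3, use {1,3}.
If you need a minimum of 1 and no upper limit, use {1,} (which is the same as +).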
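For readers who do not use PHP, here is a minimal Python sketch of the same idea: one capturing group per field and explicit quantifiers instead of `*`. The field widths (`\d`, `\d{1,3}`) are assumptions based only on the single sample line in the question, and the name `FIELD_PATTERN` is made up for the illustration.

```python
import re

# Assumed layout, inferred from the one sample line in the question.
FIELD_PATTERN = re.compile(
    r"^([A-Z]+)\s+([A-Za-z_]+)"           # two name-like fields: CID, _At
    r"\s+(\d)\s+(\d)"                     # exactly one digit each (same as \d{1})
    r"\s+(\d{1,3})"                       # between 1 and 3 digits
    r"\s+([\d,]+)"                        # digits with thousands separators
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)"   # four plain integers
    r"\s+([\d,]+)$"                       # final number, commas allowed
)

line = "CID _At 1 2 99 1,198,498,377 414 0 0 0 3,694"
match = FIELD_PATTERN.match(line)
fields = match.groups() if match else ()
print(fields)
```

Anchoring with `^` and `$` and requiring at least one whitespace character between fields (`\s+`) keeps the pattern from matching partial or run-together lines.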


How can I split a variable by line in TCL?

I have a variable named "results" with this value:
{0 0 0 0 0 0 0 0 0 0 0 3054 11013}
{0 0 0 0 0 0 0 0 0 0 0 5 13 15}
{0.000 3272.744 12702.352 30868.696}
I'd like to store each line (values between the '{}') in a separate variable and then, compare each of the elements of each line with a threshold (this threshold will be different for each line, that's why I need to split them).
I've tried
set result [split $results \n]
But it doesn't really give me a neat list of elements. Is there any way to get 3 lists from the variable "results"?
If I understand correctly, and the representation of your example data is accurate, then you do not have to process the data held by results with [split]; you can leave that to Tcl's list parser. In other words, the input is already a valid string representation of a Tcl list and is eligible for further processing. Watch:
set results {
{0 0 0 0 1}
{2 2 3 3 3}
{1 1 2 3 4}
};
set thresholds {
3
2
1
}
lmap values $results threshold $thresholds {
lmap v $values {expr {$v >= $threshold}}
}
This will produce:
{0 0 0 0 0} {1 1 1 1 1} {1 1 1 1 1}
Background: when $results is worked on by [lmap], it will be turned into a list automatically.
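Outside Tcl there is no built-in parser for brace-delimited lists, so the same comparison needs an explicit extraction step. The following is only a rough Python sketch of the per-row threshold check, using the sample data from this answer; the regular expression assumes the braces are never nested.

```python
import re

results = """
{0 0 0 0 1}
{2 2 3 3 3}
{1 1 2 3 4}
"""
thresholds = [3, 2, 1]  # one threshold per row, as in the Tcl example

# Take the text between each pair of braces and split it on whitespace.
rows = [
    [float(value) for value in group.split()]
    for group in re.findall(r"\{([^{}]*)\}", results)
]

# Compare every element of a row with that row's own threshold.
flags = [
    [int(value >= threshold) for value in row]
    for row, threshold in zip(rows, thresholds)
]
print(flags)
```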
I think it's better to split on the newline character and then apply regexp to fetch the data. Here is some sample code.
set results "{0 0 0 0 1}
{2 2 3 3 3}
{1 1 2 3 4}";
set result [split $results \n];
foreach line $result {
if {[regexp {^\s*\{(.+)\}\s*} $line Complete_Match Content]} {
puts "$Content\n";
}
}

SAS: Wildcard characters in IF statements

I have a dataset with patient diagnosis codes, and I need to use wildcard characters to categorize their diagnoses.
patientID diagnosis cancer age gender
1 250.0 0 65 M
1 250.00 1 65 M
2 250.01 1 23 M
2 250.02 0 23 M
3 250.11 0 50 F
3 250.12 0 50 F
4. 513.01. 1 34 M
Diagnoses with the 5th character as 0 or 2 need to be classified as type 2 diabetes, and those ending in 1 and 3 need to be classified as type 1 diabetes. However, 250.0 only has 4 characters and needs to be classified as type 2.
This in the data step doesn't work
if diagnosis_code ='250.%0' then t2dm = 1;
if diagnosis_code ='250.%1' then t1dm = 1;
No need for wildcards for that test. Use the colon modifier to test prefix of the code and substr() function to test the 6th character (5th digit).
if diagnosis_code='250.0' or
(diagnosis_code=:'250.' and substr(diagnosis_code,6)='0') then t2dm = 1;
if diagnosis_code=:'250.' and substr(diagnosis_code,6)='1' then t1dm = 1;
Wildcard matches in DATA step if statements can be done using the PRXMATCH function. PRX means Perl regular expression.
PRXMATCH (regular-expression-pattern,text-to-evaluate)
PRXMATCH Function documentation
Sample data
data have; input
patientID diagnosis_code $ cancer age gender $; datalines;
1 250.0 0 65 M
1 250.00 1 65 M
2 250.01 1 23 M
2 250.02 0 23 M
3 250.11 0 50 F
3 250.12 0 50 F
4. 513.01. 1 34 M
run;
Example code
data want;
set have;
t2dm = prxmatch('/^250\.\d*0$/', trim(diagnosis_code)) > 0;
t1dm = prxmatch('/^250\.\d*1$/', trim(diagnosis_code)) > 0;
run;
Notes for the sample code
/ bounds a regex pattern
^ match at the beginning
250 match 250
\. match an actual period
\d match a digit
\d* match zero or more digits
0 1 match a 0 or 1
0$ 1$ match the 0 or 1 at the end
trim() trim the text to evaluate so the match at the end works
> 0 a match will return position p in text or 0 if no match, p > 0 will logically evaluate to 0 or 1 and be assigned to the flag variable
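The same two patterns can be tried outside SAS to check how they behave on the sample codes. This Python sketch is an illustration only: it widens the final character class to `[02]` and `[13]`, which is an assumption based on the question's wording (last digit 0 or 2 means type 2, last digit 1 or 3 means type 1), and `classify` is a made-up helper name.

```python
import re

# Assumption: the last digit after "250." decides the type, as described in the question.
T2DM = re.compile(r"^250\.\d*[02]$")
T1DM = re.compile(r"^250\.\d*[13]$")

def classify(code):
    """Return (t2dm, t1dm) flags as 0/1, mirroring the SAS flag variables."""
    code = code.strip()  # same purpose as trim() in the SAS example
    return int(bool(T2DM.search(code))), int(bool(T1DM.search(code)))

for code in ["250.0", "250.00", "250.01", "250.02", "250.11", "250.12", "513.01."]:
    print(code, classify(code))
```

Because `\d*` can match zero digits, the four-character code `250.0` is still caught by the type 2 pattern without a special case.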

Positive look ahead in R - passing variables

I got stuck in a regular expression.
I usually use this line of code to find overlapping repetitions in strings:
gregexpr("(?=ATGGGCT)",text,perl=TRUE)
[[1]]
[1] 16 45 52 75 203 210 266 273 327 364 436 443 480 506 534 570 649
attr(,"match.length")
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
attr(,"useBytes")
[1] TRUE
Now I want to give to gregexpr a pattern contained in a variable:
x="GGC"
and of course if I pass the variable x, gregexpr is going to search "x" and not what the variable contains
gregexpr("(?=x)",text,perl=TRUE)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
How can I pass my variable to gregexpr in this case of positive look ahead?
I'd play with the sprintf function:
x <- "AGA"
text <- "ACAGAGACTTTAGATAGAGAAGA"
gregexpr(sprintf("(?=%s)", x), text, perl=TRUE)
## [[1]]
## [1] 3 5 12 16 18 21
## attr(,"match.length")
## [1] 0 0 0 0 0 0
## attr(,"useBytes")
## [1] TRUE
sprintf substitutes the occurrence of %s by the value of x.
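For comparison, the same technique (building the lookahead from a variable and collecting zero-width matches) could look like this in Python. It is a sketch, not part of the R answer: `overlapping_starts` is a hypothetical helper, and `re.escape` is added on the assumption that the variable holds literal text rather than a regex.

```python
import re

def overlapping_starts(needle, text):
    """Return 1-based start positions of all (possibly overlapping) occurrences."""
    # Interpolate the variable into the lookahead; escape it so it is matched literally.
    lookahead = re.compile(f"(?={re.escape(needle)})")
    # +1 converts Python's 0-based offsets to the 1-based positions R reports.
    return [m.start() + 1 for m in lookahead.finditer(text)]

x = "AGA"
text = "ACAGAGACTTTAGATAGAGAAGA"
print(overlapping_starts(x, text))
```

With the R example's inputs this yields the same positions that `gregexpr` prints.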
You could use paste0 which is short for paste(x, sep="") ...
x <- "GGC"
text <- 'ATGGGCTATGGGCTATGGGCTATGGGCT'
gregexpr(paste0('(?=', x, ')'), text, perl=TRUE)
# [[1]]
# [1] 4 11 18 25
# attr(,"match.length")
# [1] 0 0 0 0
# attr(,"useBytes")
# [1] TRUE
And if you want to access the overlapping matches, take a look at Overlapping matches in R
The fn$ prefix in gsubfn package supports string interpolation:
library(gsubfn)
# test data
text <- "ATGGGCTAAATGGGCT"
x <- "GGGC"
fn$gregexpr("(?=$x)", text, perl = TRUE)
See ?fn , the gsubfn home page and the gsubfn vignette, vignette("gsubfn") .
ok I solved it in this way:
text="ATGGGCTAAATGGGCT"
x="GGC"
c=paste("(?=",x,")",sep="")
r=gregexpr(c,text,perl=TRUE)

tcl:loop and extract with Regex

I have large data and I want to extract two types of data based on two conditions. I wrote a tcl script to extract the data by using regex (newbie to regex).
I have used the following condition which works fine and produces part of the desired output:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time ] {
I'm using the variable time somewhere in the script. The above condition produces the following o/p(this is just a sample as the file is large):
+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073
+ 30.808416 1 2 tcp 40 ------- 128 8.16 2.159 81 2069
+ 30.809513 1 2 tcp 40 ------- 156 12.19 2.187 1 2077
+ 30.809641 1 2 tcp 80 ------- 156 12.19 2.187 1 2078
+ 30.809878 1 2 tcp 40 ------- 151 7.18 2.182 41 2079
+ 30.813096 1 2 tcp 40 ------- 161 9.20 2.192 0 2083
+ 30.813352 1 2 tcp 40 ------- 157 13.19 2.188 1 2085
+ 30.81348 1 2 tcp 80 ------- 157 13.19 2.188 1 2086
+ 30.815362 1 2 tcp 40 ------- 148 12.18 2.179 41 2088
+ 30.815426 1 2 tcp 40 ------- 148 5.9 2.179 41 2089
+ 30.818096 1 2 tcp 40 ------- 162 10.20 2.193 0 2091
+ 30.818544 1 2 tcp 40 ------- 158 3.78 2.189 1 2093
+ 30.818672 1 2 tcp 80 ------- 158 14.19 2.189 1 2094
+ 30.820657 1 2 tcp 40 ------- 153 9.19 2.184 41 2096
+ 30.821579 1 2 tcp 40 ------- 154 10.19 2.185 41 2097
Then, inside the above if condition, I want check the 9th column :
//condition 1
if (9th between [3-6].*) ( such as 3.78,6.7, 5.9)
The second condition is :
//condition 2
if (9th between [7-14].*) ( such as 14.19,12.18,10.19, 9.19,.....)
I'm struggling with the two conditions above. I tried the following and didn't get an error, but nothing matched.
condition 1:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([3-9])\..*/ } $line ] {
I know I'm repeating part of the main if condition, because I don't know how to skip the columns.
condition 2:
if [regexp {\+ ([0-9.]+) 1 2.*-* ([7-9]|1[0-4])\..*/} $line ] {
Any suggestions?
Why don't you split on space? You can achieve pretty much the same outcome using a few more lines. It will be more readable, and people will understand the code better:
if [regexp {\+ ([0-9.]+) 1 2.*- } $line -> time] {
set elements [split $line " "] ;# You can actually omit the " " in this case
set 9th [lindex $elements 8]
# Condition 1
if {$9th >= 3 && $9th < 7} { do something }
# Condition 2
if {$9th >= 7 && $9th < 15} { do something }
}
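Match 7-14: \+ ([0-9.]+) 1 2.*- \d+\s(?:[7-9]|1[0-4])\.
Match 3-6: \+ ([0-9.]+) 1 2.*- \d+\s[3-6]\.
The trailing \. ties the match to the integer part of the 9th column, so a value such as 30.5 is not mistaken for one starting with 3.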
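The split-and-compare approach from the first answer is not specific to Tcl. A small Python sketch of it follows, assuming whitespace-separated columns laid out as in the sample trace lines; `classify_ninth_column` and its return labels are hypothetical names chosen for the illustration.

```python
def classify_ninth_column(line):
    """Return 'condition 1', 'condition 2', or None for one trace line."""
    fields = line.split()
    # Keep only lines shaped like the sample: "+ <time> 1 2 ..." with at least 9 columns.
    if len(fields) < 9 or fields[0] != "+" or fields[2:4] != ["1", "2"]:
        return None
    value = float(fields[8])  # the 9th column, index 8
    if 3 <= value < 7:
        return "condition 1"
    if 7 <= value < 15:
        return "condition 2"
    return None

sample = "+ 30.808352 1 2 tcp 40 ------- 30 6.7 2.30 81 2073"
print(classify_ninth_column(sample))
```

Comparing the column as a number avoids the digit-by-digit alternations that a pure regex needs for ranges such as 7 to 14.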

How to split a string with a Regex when the pattern "prefix" is variable

I have the following string:
Giants 2 9 : 10 L.Tynes 22 yd . Field Goal ( 4 - - 3 , 1 : 20 ) 0 3 Cowboys 2 1 : 01 K.Ogletree 10 yd . pass from T.Romo ( D.Bailey kick ) ( 7 - 73 , 2 : 33 ) 7 3 Cowboys 3 10 : 24 K.Ogletree 40 yd . pass from T.Romo ( D.Bailey kick ) ( 9 - 80 , 4 : 36 ) 14 3 Giants 3 5 : 11 A.Bradshaw 10 yd . run ( L.Tynes kick ) ( 9 - 89 , 5 : 13 ) 14 10 Cowboys 3 0 : 40 D.Bailey 33 yd . Field Goal ( 8 - 65 , 4 : 31 ) 17 10 Cowboys 4 5 : 57 M.Austin 34 yd . pass from T.Romo ( D.Bailey kick ) ( 8 - 82 , 7 : 06 ) 24 10 Giants 4 2 : 36 M.Bennett 9 yd . pass from E.Manning ( L.Tynes kick ) ( 12 - 79 , 3 : 21 ) 24 17 Time : 2 : 53
The prefix of each substring will be either "Cowboys" or "Giants". Each substring always ends with a right parenthesis ) and two numbers.
I can't even imagine what Regex to use. I can use string functions and loop over the string, but a Regex would help me later on. Maybe I could use the split function, but that's over my head.
I suppose I could parse "Cowboys" then "Giants".
I think this RegEx gives what you want:
(Cowboys|Giants).*?\)\s\d+\s\d+
"Cowboys" or "Giants" followed by arbitrary characters until you get a right paren, a space, some digits, a space, and some more digits.
I don't know ColdFusion, but this does the job in python:
match = re.findall(re.compile('((Giants|Cowboys)(.(?!Cowboys|Giants))*.)', re.DOTALL), s)
where s is the provided string. re.DOTALL makes . match newline characters as well, so a record may span several lines. re.findall performs a global search, which reFindAll probably does as well.
The regex does this:
Create a spanning group
Look for "Giants" or "Cowboys" as the starting string
Look for any character (.) that's not followed by the string "Cowboys" or "Giants", and match as many of them as possible (that is, match all characters up to the one that is followed by "Cowboys" or "Giants").
Match another character.
Since there are three groups, the group you're interested in might be numbered differently in ColdFusion. In Python, they're embedded in the parent group.
>>> match[0]
('Giants 2 9 : 10 L.Tynes 22 yd . Field Goal ( 4 - - 3 , 1 : 20 ) 0 3', 'Giants', '3')
>>> match[1]
('Cowboys 2 1 : 01 K.Ogletree 10 yd . pass from T.Romo ( D.Bailey kick ) ( 7 - 73 , 2 : 33 ) 7 3', 'Cowboys', '3')
>>> match[2]
('Cowboys 3 10 : 24 K.Ogletree 40 yd . pass from T.Romo ( D.Bailey kick ) ( 9 - 80 , 4 : 36 ) 14 3', 'Cowboys', '3')
I think in most other languages you would address match[1], match[4], match[7], ... instead.
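The shorter pattern from the first answer can be tried in Python as well. The sketch below uses a non-capturing group so that `findall` returns whole records, and it runs on a shortened copy of the question's string (the first two plays plus the trailing "Time" text) to keep the example small; `PLAY` is a made-up name.

```python
import re

# Team name, then as little as possible up to ") <digits> <digits>".
PLAY = re.compile(r"(?:Cowboys|Giants).*?\)\s\d+\s\d+")

s = (
    "Giants 2 9 : 10 L.Tynes 22 yd . Field Goal ( 4 - - 3 , 1 : 20 ) 0 3 "
    "Cowboys 2 1 : 01 K.Ogletree 10 yd . pass from T.Romo ( D.Bailey kick ) "
    "( 7 - 73 , 2 : 33 ) 7 3 "
    "Time : 2 : 53"
)

plays = PLAY.findall(s)
for play in plays:
    print(play)
```

The lazy `.*?` skips over the first closing parenthesis in the Cowboys record because it is not followed by two numbers, so each record ends at its score.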