regular expression to extract tokens from block of text

I have the following block of text retrieved from a log file:
SELECT statement with ID: AE12400 SELECT /*+ ALL_ROWS */
T1.CONFLICT_ID, T1.LAST_UPD, T1.CREATED,
T1.LAST_UPD_BY, T1.CREATED_BY, T1.MODIFICATION_NUM,
T1.ROW_ID, T1.DFLT_LIC_FLG, T1.NAME, T1.VAL,
:1 FROM SIEBEL.S_LST_OF_VAL T1 WHERE
(T1.ACTIVE_FLG = :2 OR T1.ACTIVE_FLG IS NULL) AND (T1.TYPE = :3
AND T1.BU_ID IS NULL) ORDER BY T1.TYPE, T1.ORDER_BY, T1.VAL
Bind variable 1: ,,,SADMIN,00000002579c129c:0,,List Of Values
(Internal), Bind variable 2: Y Bind variable 3: ZERO_DTIME_MODE
***** SQL Statement Execute Time: 0.028 seconds ***** 3 row(s) retrieved by ID: AE0EF18
I need to get the following tokens out of this block:
Statement Id : AE12400
SQL_Query: SELECT /*+ ALL_ROWS */
T1.CONFLICT_ID, T1.LAST_UPD, T1.CREATED,
T1.LAST_UPD_BY, T1.CREATED_BY, T1.MODIFICATION_NUM,
T1.ROW_ID, T1.DFLT_LIC_FLG, T1.NAME, T1.VAL,
:1 FROM SIEBEL.S_LST_OF_VAL T1 WHERE
(T1.ACTIVE_FLG = :2 OR T1.ACTIVE_FLG IS NULL) AND (T1.TYPE = :3 AND T1.BU_ID IS NULL) ORDER BY T1.TYPE, T1.ORDER_BY, T1.VAL
Bind Variable : [",,,SADMIN,00000002579c129c:0,,List Of Values (Internal)","Y","ZERO_DTIME_MODE"]
SQL Time: 0.028
SQL Rows: 3
I have come up with the following regular expressions so far to extract the statement ID, time and rows:
SQL Rows : \s\d{1,4}\s
SQL Time: \d{1,3}\.\d{1,4}
Statement Id: (ID:)(\s\w+)
But I am not sure how to extract the SQL along with Bind Variables from the text.

Your current patterns are imprecise: they can match substrings other than the ones you expect. Here are expressions for each of the tokens you need:
SQL rows:
\d{1,4}(?=\s*row)
Query running time:
(\d+(?:\.\d+)?)(?=\s*second)
Statement ID:
ID:\s*(\w+)
SQL statement (this needs the dot to match newlines, which is the s flag in PCRE, Python and Java; Ruby spells that flag m):
(?s)ID:\s\w+\s(.*?)(?=Bind variable)
Bind variables:
(?m)Bind variable\s*\d+:\s*(.*?)(?=Bind variable|$)
For the bind variables, use your language's find-all method (matchAll(), findAll() or similar) and read capture group 1 from each match.
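As a rough illustration of how these patterns fit together, here is a minimal Python sketch (Python is only my choice; the question does not name a language). It runs over an abbreviated copy of the log block from the question. The names squash and parse_block, the whitespace collapsing, the trailing-comma stripping and the stop condition for bind values (the next "Bind variable", a run of five asterisks, or end of text) are assumptions of this sketch, not part of the patterns above.

```python
import re

# Abbreviated copy of the log block from the question (column list shortened).
LOG = """SELECT statement with ID: AE12400 SELECT /*+ ALL_ROWS */
T1.NAME, T1.VAL,
:1 FROM SIEBEL.S_LST_OF_VAL T1 WHERE
(T1.ACTIVE_FLG = :2 OR T1.ACTIVE_FLG IS NULL) AND (T1.TYPE = :3
AND T1.BU_ID IS NULL) ORDER BY T1.TYPE, T1.ORDER_BY, T1.VAL
Bind variable 1: ,,,SADMIN,00000002579c129c:0,,List Of Values
(Internal), Bind variable 2: Y Bind variable 3: ZERO_DTIME_MODE
***** SQL Statement Execute Time: 0.028 seconds ***** 3 row(s) retrieved by ID: AE0EF18
"""


def squash(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())


def parse_block(text):
    """Hypothetical helper: pull the five tokens out of one log block."""
    statement_id = re.search(r"ID:\s*(\w+)", text).group(1)
    # re.DOTALL lets the dot cross line breaks inside the SQL text.
    sql = re.search(r"ID:\s\w+\s(.*?)(?=Bind variable)", text, re.DOTALL).group(1)
    seconds = re.search(r"(\d+(?:\.\d+)?)(?=\s*second)", text).group(1)
    rows = re.search(r"\d{1,4}(?=\s*row)", text).group(0)
    # Assumption: a value ends at the next "Bind variable", at "*****",
    # or at the end of the text.
    binds = re.findall(
        r"Bind variable\s*\d+:\s*(.*?)(?=Bind variable|\*{5}|\Z)", text, re.DOTALL
    )
    return {
        "statement_id": statement_id,
        "sql": squash(sql),
        # Assumption: a trailing comma is a separator, not part of the value.
        "binds": [squash(b).rstrip(",") for b in binds],
        "seconds": float(seconds),
        "rows": int(rows),
    }


result = parse_block(LOG)
print(result["statement_id"], result["seconds"], result["rows"])
print(result["binds"])
```

Note that re.search returns only the first match, so the statement ID comes from the "SELECT statement with ID:" header and not from the "retrieved by ID:" trailer.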

These could be cleaned up; they're not very efficient as they are. But this should head you in the right direction.
SQL_Query: SELECT(?! statement with ID)[\W\w]*?(?=Bind variable \d)
If you're running this over a whole log with more than one of those text blocks, first grab each block of bind variables, and then extract the individual values from each block. If there is only one block, you can skip the first step.
Find bind vars: Bind variable \d+:[\W\w]*?(?=\s+\*\*\*\*\*)
Extract vars (the |$ alternative lets the last variable in a block match): Bind variable \d+:\s*([\W\w]*?)(?=\s*Bind variable|$)
Also there could be problems if, for example, there is the text "Bind variable" within your SQL Query... But it would be difficult to get 100% on that, and it's unlikely there would be such things mixed into the other parts of the log, I'm guessing.
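A minimal Python sketch of that two-step approach over a log with several blocks. The two-block log below is made up for illustration, the function name extract_blocks is hypothetical, and pairing queries with bind blocks by position assumes every query is followed by exactly one bind block. The value pattern used here also accepts end of string as a terminator, so that the last variable of each block is captured.

```python
import re

# Made-up log with two blocks, shaped like the one in the question.
LOG = (
    "SELECT statement with ID: A1 SELECT T1.VAL FROM T T1 WHERE T1.X = :1\n"
    "Bind variable 1: Y\n"
    "***** SQL Statement Execute Time: 0.028 seconds ***** 3 row(s) retrieved by ID: A1\n"
    "SELECT statement with ID: B2 SELECT T1.NAME FROM T T1 WHERE T1.A = :1 AND T1.B = :2\n"
    "Bind variable 1: N Bind variable 2: ZERO_DTIME_MODE\n"
    "***** SQL Statement Execute Time: 0.100 seconds ***** 1 row(s) retrieved by ID: B2\n"
)

# The SQL text: a SELECT that is not the "SELECT statement with ID" header.
QUERY = re.compile(r"SELECT(?! statement with ID)[\W\w]*?(?=Bind variable \d)")
# Step 1: each run of bind variables, up to the "*****" timing line.
BIND_BLOCK = re.compile(r"Bind variable \d+:[\W\w]*?(?=\s+\*\*\*\*\*)")
# Step 2: the individual values inside one run.
BIND_VALUE = re.compile(r"Bind variable \d+:\s*([\W\w]*?)(?=\s*Bind variable|$)")


def extract_blocks(log):
    """Return a list of (sql, [bind values]) pairs, one per block."""
    queries = [" ".join(m.group(0).split()) for m in QUERY.finditer(log)]
    binds = [BIND_VALUE.findall(block) for block in BIND_BLOCK.findall(log)]
    # Assumption: the n-th query belongs to the n-th bind block.
    return list(zip(queries, binds))


pairs = extract_blocks(LOG)
for sql, values in pairs:
    print(sql, values)
```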

Related

Truncation when using CASE in SQL statement in SAS (Enterprise Guide)

I am trying to manipulate some text files in SAS Enterprise Guide, loading them line by line into a character variable "text", which gets a length of 1677 characters.
I can use the Tranwrd() function to create a new variable text21 on this variable and get the desired result as shown below.
But if I try to put some conditions on the execution of exactly the same Tranwrd() to form the variable text2 (as shown below) it goes wrong as the text in the variable is now truncated to around 200 characters, even though the text2 variable has the length 1800 characters:
PROC SQL;
CREATE TABLE WORK.Area_Z_Added AS
SELECT t1.Area,
t1.pedArea,
t1.Text,
/* text21 */
( tranwrd(t1.Text,'zOffset="0"',compress('zOffset="'||put(t2.Z,8.2)||'"'))) LENGTH=1800 AS text21,
/* text2 */
(case when t1.type='Area' then
tranwrd(t1.Text,'zOffset="0"',compress('zOffset="'||put(t2.Z,8.2)||'"'))
else
t1.Text
end) LENGTH=1800 AS text2,
t1.Type,
t1.id,
t1.x,
t1.y,
t2.Z
FROM WORK.VISSIM_IND t1
LEFT JOIN WORK.AREA_Z t2 ON (t1.Type = t2.Type) AND (t1.Area = t2.Area)
ORDER BY t1.id;
QUIT;
Anybody got a clue?
This is a known problem with using character functions inside a CASE expression. See this thread on SAS Communities: https://communities.sas.com/t5/SAS-Programming/Truncation-when-using-CASE-in-SQL-statement/m-p/852137#M336855
Instead, reuse the result already calculated for the other variable by referring to it with the CALCULATED keyword.
CREATE TABLE WORK.Area_Z_Added AS
SELECT
t1.Area
,t1.pedArea
,t1.Text
,(tranwrd(t1.Text,'zOffset="0"',cats('zOffset="',put(t2.Z,8.2),'"')))
AS text21 length=1800
,(case when t1.type='Area'
then calculated text21
else t1.Text
end) AS text2 LENGTH=1800
,t1.Type
,t1.id
,t1.x
,t1.y
,t2.Z
FROM WORK.VISSIM_IND t1
LEFT JOIN WORK.AREA_Z t2
ON (t1.Type = t2.Type)
AND (t1.Area = t2.Area)
ORDER BY t1.id
;
If you don't need the extra TEXT21 variable then use the DROP= dataset option to remove it.
CREATE TABLE WORK.Area_Z_Added(drop=text21) AS ....

Is there a way to extract year range from wide data?

I have a series of wide panel datasets. In each of these, I want to generate a series of new variables. E.g., in Dataset1, I have variables Car2009 Car2010 Car2011 in a dataset. Using this, I want to create a variable HadCar2009, which is 1 if Car2009 is non-missing, and 0 if missing, similarly HadCar2010, and so on. Of course, this is simple to do but I want to do it for multiple datasets which could have different ranges in terms of time. E.g., Dataset2 has variables Car2005, Car2006, Car2008.
These are all very large datasets (I have about 60 such datasets), so I wouldn't want to convert them to long either.
For now, this is what I tried:
forval j = 1/2{
use Dataset`j', clear
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
save Dataset`j', replace
}
This works, but I am reluctant to use capture, because perhaps some datasets have a variable called car2008 instead of Car2008, and this would be an error I would like the program to stop at.
Also, the ranges of years across my 60-odd datasets are different. Ideally, I would like to somehow get this range in a local (perhaps somehow using describe? I'm not sure) and then just generate these variables using that local with a simple for loop.
But I'm not sure I can do this in Stata.
Your inner loop could be rewritten from
forval i=2005/2011{
capture gen HadCar`i' = .
capture replace HadCar`i' = 1 if !missing(Car`i')
capture replace HadCar`i' = 0 if missing(Car`i')
}
to
foreach v of var Car???? {
gen Had`v' = !missing(`v')
}
noting the fact in Stata that true or false expressions evaluate to 1 or 0 directly.
https://www.stata-journal.com/article.html?article=dm0099
https://www.stata-journal.com/article.html?article=dm0087
https://www.stata.com/support/faqs/data-management/true-and-false/
This code is going to ignore variables beginning with car. There are other ways to check for their existence. However, if there are no variables Car???? the loop will trigger an error message. A loop over ?ar???? would catch car???? and Car???? (but just possibly other variables too).

How to fix bound sparql query not working as expected?

I have this SPARQL query that I need to modify. I have an example instance with the exampleID 12345. If the :hasRelatedExample connection exists, the ?test variable becomes 2. If the :hasRelatedExample connection doesn't exist for the example instance, the ?test variable is not assigned the value 1 as it should be. How could I fix this query to get the needed behavior?
PREFIX : <http://www.example.com#>
Select distinct ?test
where
{
?ex a :Example ;
:exampleID "12345" ;
:hasRelatedExample ?relatedExample .
BIND (IF(BOUND(?relatedExample),2,1) as ?test)
}

Check that data are constant within group

I often find myself needing to check whether or not variables are constant within a group. This is how I currently go about this (assume that the group is defined by a-b-c and the variable in question is var):
bys a b c (var): gen isconstant=var[1]==var[_N]
*manually inspect the results of the below tabulation; if all 1's, then it is constant
tab isconstant
drop isconstant
(Note that the above approach assumes that there are no missing observations within a group. I would have to think more about how to approach it if there were missings. And instead of manually checking, could use something along the lines of assert.)
This works fine, but is there a more succinct way to do this? Perhaps a one line solution, roughly analogous to isid ..., but of course checking for something else.
The principle behind your approach is also explained in this FAQ but I am not aware of a dedicated command. Still, it is programmable and you are a programmer, so where is yours?
Here is a quick stab:
*! 1.0.0 NJC 2 March 2020
program homog, sortpreserve
    version 8
    syntax varname [if] [in] [, MISSing BY(varlist) ]
    * missings are ignored by default
    if "`missing'" == "" {
        marksample touse, strok
        if "`by'" != "" markout `touse' `by', strok
    }
    else marksample touse, novarlist
    tempvar OK
    bysort `touse' `by' (`varlist') : gen byte `OK' = `varlist'[1] == `varlist'[_N]
    quietly summarize `OK' if `touse'
    if r(min) == 0 display as err "assertion is false"
end
and some silly examples:
. sysuse auto, clear
(1978 Automobile Data)
. homog mpg
assertion is false
. homog rep78, by(rep78)
. gen one = 1
. homog one
. replace one = . in L
(1 real change made, 1 to missing)
. homog one
. homog one, missing
assertion is false
So, the principles are:
No news is good news. The only possible output, other than error messages, is a message "assertion is false". This isn't treated as an error. If your taste runs otherwise, clone the program, rename it and change the way it works.
by() is an option and if specified causes all comparisons to be by the distinct groups of observations so identified.
Missings are ignored by default. The option missing changes that so that for example 42 and missing are reported as different. This applies also to missing values of any by() variables.

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and their corresponding regions stored as integers.
I want to create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
...so the code gets really long.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence were defined in a separate dataset, you could use merge. See e.g. this FAQ.
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start with
gen div_name = "New England" if div_no == 1