SAS where statement - sas

SAS newbie - trying to complete a practice exercise. I'm probably going to face palm myself once someone points out what I'm doing wrong, but for now, I can't tell what the issue is.
I have a dataset with 3 variables: ID $ avgNumDonations DonationAmt.
I'm asked to create a subset (I'm doing it in my proc print statement) that contains no records with avgNumDonations below 20 and DonationAmt under a million. (I believe this is a trick question, as there are no cases in the original data set that meet both criteria.)
I wrote my where clause as follows:
where DonationAmt >= 1000000 and avgNumDonations >= 20
However, it seems to be acting as an OR statement instead of an AND statement, because my subset is eliminating IDs 45 and 78.
Can someone tell me what I'm missing? As I mentioned, no cases meet the criteria so I expected to have the same cases in my "subset".

I think you may be misunderstanding either the WHERE or AND/OR logic.
WHERE is an inclusion criterion: only records that satisfy it are kept. Almost all of your records meet it, but not all. Note that with AND a record has to meet both of your conditions; if either is false, then it is excluded. It sounds like you want an OR instead of AND.
So to determine which records are excluded, look for records where either condition is false: avgNumDonations < 20 (ID 45) or DonationAmt < 1000000 (ID 78). Those two records are excluded, which is what you're seeing.

If a record should be dropped only when it fails both conditions, you have to use OR instead of AND:
data a;
id=12;
avgdon = 58.3;
sumdon=4833722;
output;
id=45;
avgdon = 15.3;
sumdon=14833722;
output;
id=56;
avgdon = 50.3;
sumdon=9833722;
output;
id=78;
avgdon = 39.3;
sumdon=833722;
output;
id=910;
avgdon = 28.3;
sumdon=2833722;
output;
run;
proc print data=a(where=(sumdon>=1000000 OR avgdon>=20));
run;
Otherwise it is correct to use AND. Then 2 rows are eliminated.
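The difference between the two filters can be checked outside SAS. The following is a minimal Python sketch, not SAS code, that applies both conditions to the five rows from the DATA step above; the function names are made up for illustration. It also shows that "drop a row only when both values are low" is the same as the OR filter (De Morgan's law).

```python
# Rows copied from the DATA step above: (id, avgdon, sumdon).
rows = [
    (12, 58.3, 4833722),
    (45, 15.3, 14833722),
    (56, 50.3, 9833722),
    (78, 39.3, 833722),
    (910, 28.3, 2833722),
]

def keep_and(avgdon, sumdon):
    # Keep only rows that pass BOTH thresholds.
    return sumdon >= 1_000_000 and avgdon >= 20

def keep_or(avgdon, sumdon):
    # Keep rows that pass AT LEAST ONE threshold.
    return sumdon >= 1_000_000 or avgdon >= 20

def drop_if_both_low(avgdon, sumdon):
    # The exercise as worded: remove a row only when BOTH values are low.
    return not (avgdon < 20 and sumdon < 1_000_000)

ids_and = [i for i, a, s in rows if keep_and(a, s)]
ids_or = [i for i, a, s in rows if keep_or(a, s)]
ids_demorgan = [i for i, a, s in rows if drop_if_both_low(a, s)]

print(ids_and)                 # [12, 56, 910]
print(ids_or)                  # [12, 45, 56, 78, 910]
print(ids_or == ids_demorgan)  # True
```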

Related

Where statement is not capturing my condition correctly

I want SAS to capture specific values of the variable "rashloc_spcfy" (and others), namely the strings "B", "P/G", "Peri" and "Gen". However, when I look at the results, SAS is capturing other values not listed in my statement. Is there anything I can do to modify my code?
proc print data=k.dataset;
var rashloc_GNT rashloc_PER rashloc_Spcfy;
where ((rashloc_GNT = "GNT") OR (rashloc_PER = "PER")) OR rashloc_Spcfy in ("B", "P/G", "Peri", "Gen");
run;
I should be getting only the quoted keyterms in the variable of interest (rashloc_spcfy)
So you want to exclude cases where the third variable is some other non missing value if they meet the first two criteria?
So perhaps?
where ((rashloc_GNT = "GNT") OR (rashloc_PER = "PER"))
and not (rashloc_Spcfy not in (" ","B","P/G","Peri","Gen"))
;
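To see which rows that condition keeps, here is a small Python sketch of the same logic. The rows are hypothetical, and an empty string stands in for a blank SAS character value. The double negative `not (x not in ...)` reduces to a plain membership test.

```python
# "" stands in for a blank SAS character value (an assumption of this sketch).
ALLOWED = {"", "B", "P/G", "Peri", "Gen"}

def keep(gnt, per, spcfy):
    # First two criteria: either flag variable carries its code.
    first_two = gnt == "GNT" or per == "PER"
    # Same shape as the suggested WHERE: equivalent to "spcfy in ALLOWED".
    return first_two and not (spcfy not in ALLOWED)

# Hypothetical rows: (rashloc_GNT, rashloc_PER, rashloc_Spcfy)
rows = [
    ("GNT", "", "B"),     # kept: flag set, allowed value
    ("GNT", "", "Arm"),   # dropped: value not in the list
    ("", "PER", ""),      # kept: flag set, blank value
    ("", "", "Gen"),      # dropped: neither flag set
]

kept = [r for r in rows if keep(*r)]
print(kept)
```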

Searching for Easy Way in SAS to Vertically Stack Different Variable Names to One

Faced with situation where vendor had different varname counts when one variable will actually suffice. Imagine the following very simplified version of my SAS code. Here there are five variables of interest but I am OUTPUTing it to 2 tables/dataframes that may be stacked once I RENAME the key variables in each file.
data
ABA ADDI;
set zach.COMMERCIAL_A12;
keep
PRODUCT_DESC
ABA_NUMERCNT
ADD_INITIATION_NUMERCNT;
if ABA_DENOMCNT = 1 then output ABA;
if ADD_INITIATION_DENOMCNT = 1 then output ADDI;
run;
Right now the program creates the two new OUTPUT files = ABA and ADDI. Each of the files has the same three variables from my KEEP. Later on I will stack them. So for ABA I wish to keep only PRODUCT_DESC & ABA_NUMERCNT and for ADDI I wish to keep PRODUCT_DESC & ADD_INITIATION_NUMERCNT. But before stacking them I would like to sort of automate it so that ABA_NUMERCNT becomes VarTemp and ADD_INITIATION_NUMERCNT again becomes VarTemp before they are stacked.
Is there an easy way to do this?
Looks like you want to use dataset options on your output datasets. It is a little hard to follow the details of your question but it looks like you want something like:
data
ABA (rename=(ABA_NUMERCNT=newvar ) drop=ADD_INITIATION_NUMERCNT )
ADDI (rename=(ADD_INITIATION_NUMERCNT=newvar) drop=ABA_NUMERCNT )
;
set zach.COMMERCIAL_A12;
keep PRODUCT_DESC ABA_NUMERCNT ADD_INITIATION_NUMERCNT;
if ABA_DENOMCNT = 1 then output ABA;
if ADD_INITIATION_DENOMCNT = 1 then output ADDI;
run;
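The intended result (two subsets that share one numerator column name and can then be stacked) can be sketched in Python as follows. The input rows and the helper `subset` are hypothetical stand-ins; only the variable names come from the question.

```python
# Hypothetical input rows standing in for zach.COMMERCIAL_A12.
source = [
    {"PRODUCT_DESC": "HMO", "ABA_DENOMCNT": 1, "ABA_NUMERCNT": 1,
     "ADD_INITIATION_DENOMCNT": 0, "ADD_INITIATION_NUMERCNT": 0},
    {"PRODUCT_DESC": "PPO", "ABA_DENOMCNT": 0, "ABA_NUMERCNT": 0,
     "ADD_INITIATION_DENOMCNT": 1, "ADD_INITIATION_NUMERCNT": 1},
    {"PRODUCT_DESC": "POS", "ABA_DENOMCNT": 1, "ABA_NUMERCNT": 0,
     "ADD_INITIATION_DENOMCNT": 1, "ADD_INITIATION_NUMERCNT": 1},
]

def subset(rows, denom, numer, new_name="newvar"):
    """Keep rows whose denominator flag is 1; expose the numerator as new_name."""
    return [
        {"PRODUCT_DESC": r["PRODUCT_DESC"], new_name: r[numer]}
        for r in rows
        if r[denom] == 1
    ]

aba = subset(source, "ABA_DENOMCNT", "ABA_NUMERCNT")
addi = subset(source, "ADD_INITIATION_DENOMCNT", "ADD_INITIATION_NUMERCNT")

# Both subsets now share the same two columns, so stacking is a plain append.
stacked = aba + addi
print(stacked)
```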

How to recode a qualitative variable with over 100 dummies into several levels as quantitative in SAS

I am working with SAS and want to recode a variable that has 50+ different qualitative levels, for example U.S. states.
I just want to reduce them to a 4- or 5-level variable coded as numbers.
I have several ideas, for example using if/else statements. The problem is that I have to write out each state name in SAS, and the code gets very heavy.
Is there any other way to do this in SAS without redundant code, or without writing out each specific value?
Any ideas are appreciated!!
Method 1:
Use IN, but you still have to list the values. You can also do it via a format, but you have to define the format first anyway.
if state in ('AL', 'AK', 'AZ' ... etc) then state_group = 1;
else if state in ( .... ) then state_group = 2;
Method 2:
For a format, you create the format using PROC FORMAT and then apply it.
proc format;
value $state_grp_fmt
'AL', 'AK', 'AZ' = '1'
'DC', 'NC' = '2' ;
run;
And then you can use it with the PUT function, which returns a character value.
State_Group = put(state, $state_grp_fmt.);
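Conceptually, a format is a lookup table that is defined once and applied wherever it is needed. A rough Python analogue, using only the states listed above (the name `state_group` and the `default` argument are assumptions of this sketch), might look like this:

```python
# Group definitions written once, grouped by level, as in the PROC FORMAT step.
GROUPS = {
    1: ("AL", "AK", "AZ"),
    2: ("DC", "NC"),
}

# Invert to a state -> group lookup so each state is typed only once.
STATE_GROUP = {}
for group, states in GROUPS.items():
    for st in states:
        STATE_GROUP[st] = group

def state_group(state, default=None):
    # States that are not listed fall back to `default`.
    return STATE_GROUP.get(state, default)

print(state_group("AK"))  # 1
print(state_group("NC"))  # 2
print(state_group("TX"))  # None
```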

What is the Stata-equivalent of this SAS macro?

I will present the simplified version of what I want to do. I know how to do it easily in SAS but not in Stata.
Let's say I am trying to create a "poor" binary variable = 1 if an observation is classified as poor and 0 otherwise. I want to have two classifications, one is based on real income, and another based on real consumption (these are variables in the dataset).
The SAS macro would be
%MACRO poverty_bin(type=, measure=);
DATA dataset;
SET dataset;
IF &measure. <= poverty_line THEN poor_&type. = 1;
ELSE poor_&type. = 0;
RUN;
%MEND poverty_bin;
%poverty_bin(type=con, measure=real_consumption);
%poverty_bin(type=inc, measure=real_income);
which should create two binary variables poor_con and poor_inc.
I have no idea how to do this in Stata. I tried doing something like this just to see if nested foreach is what I'm looking for:
foreach x of newlist con inc {
foreach y of newlist real_income real_consumption{
display "`x' and `y'"
}
}
But it gives an error message saying "variable real_income already defined"
The error message you cite implies that earlier code you do not show us created a variable real_income.
I do not know SAS but I can tell you that given a numeric variable x
gen y = x <= 42
will create a variable y with value 1 if x <= 42 and 0 otherwise.
For another such variable, use another similar statement. In Stata and perhaps any other language, setting up a nested loop or defining a program instead of making two statements directly seems overkill. For a number of new variables much larger than 2, that might not be true.
foreach v in x y {
gen new`v' = `v' <= 42
}
For completely arbitrary existing names, new names and thresholds it is likely to be easier to write out statements individually.
This is documented. See for example 13.2.2 in [U] or this FAQ.
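For readers who know neither language well, the parameterised pattern (one binary flag per suffix/measure pair) can be sketched in Python. The data rows and the poverty line value are hypothetical; only the variable names come from the question.

```python
POVERTY_LINE = 100.0  # hypothetical threshold

# Hypothetical observations.
rows = [
    {"real_consumption": 80.0, "real_income": 120.0},
    {"real_consumption": 100.0, "real_income": 90.0},
    {"real_consumption": 150.0, "real_income": 200.0},
]

# suffix -> measure, mirroring the two macro calls in the question.
measures = {"con": "real_consumption", "inc": "real_income"}

for row in rows:
    for suffix, measure in measures.items():
        # A comparison yields True/False; int() turns it into 1/0.
        row[f"poor_{suffix}"] = int(row[measure] <= POVERTY_LINE)

print([r["poor_con"] for r in rows])  # [1, 1, 0]
print([r["poor_inc"] for r in rows])  # [0, 1, 0]
```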

PL/SQL optimize searching a date in varchar

I have a table that contains a date field (let it be date s_date) and a description field (varchar2(n) desc). What I need is to write a script (or a single query, if possible) that will parse the desc field and, if it contains a valid Oracle date, extract this date and update s_date if it is null.
But there is one more condition: there must be exactly one occurrence of a date in desc. If there are 0 or more than 1, nothing should be updated.
So far I have come up with this pretty ugly solution using regular expressions:
----------------------------------------------
create or replace function to_date_single( p_date_str in varchar2 )
return date
is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
begin
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)((.|\n|\t|\s)*((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d))?';
pResStr := regexp_substr(p_date_str, pRegEx);
if not (length(pResStr) = 10)
then return null;
end if;
l_date := to_date(pResStr, 'dd.mm.yyyy');
return l_date;
exception
when others then return null;
end to_date_single;
----------------------------------------------
update myTable t
set t.s_date = to_date_single(t.desc)
where t.s_date is null;
----------------------------------------------
But it works extremely slowly (more than a second per record, and I need to update about 30000 records). Is it possible to optimize the function somehow? Maybe there is a way to do this without regexp? Any other ideas?
Any advice is appreciated :)
EDIT:
OK, maybe it'll be useful for someone. The following regular expression performs check for valid date (DD.MM.YYYY) taking into account the number of days in a month, including the check for leap year:
(((0[1-9]|[12]\d|3[01])\.(0[13578]|1[02])\.((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\.(0[13456789]|1[012])\.((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\.02\.((19|[2-9]\d)\d{2}))|(29\.02\.((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))
I used it with the query suggested by @David (see accepted answer), but I tried select instead of update (so it's 1 regexp less per row, because we don't do regexp_substr), just for benchmarking purposes.
Numbers probably won't tell much here, because it all depends on hardware, software and specific DB design, but it took about 2 minutes to select 36K records for me. Update will be slower, but I think it'll still be a reasonable time.
I would refactor it along the lines of a single update query.
Use two regexp_instr() calls in the where clause to find rows for which a first occurrence of the match occurs and a second occurrence does not, and regexp_substr() to pull the matching characters for the update.
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where regexp_instr(desc,pattern,1,1) > 0 and
regexp_instr(desc,pattern,1,2) = 0
You might get even better performance with:
update my_table
set my_date = to_date(regexp_substr(desc,...),...)
where case regexp_instr(desc,pattern,1,1)
when 0 then 'N'
else case regexp_instr(desc,pattern,1,2)
when 0 then 'Y'
else 'N'
end
end = 'Y'
... as it only evaluates the second regexp if the first is non-zero. The first query might also do that but the optimiser might choose to evaluate the second predicate first because it is an equality condition, under the assumption that it's more selective.
Or reordering the Case expression might be better -- it's a trade-off that's difficult to judge and probably very dependent on the data.
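The "first occurrence exists, second does not" test amounts to "the pattern matches exactly once". A minimal Python sketch of that check, using the date part of the pattern from the question, could look like this; `single_match` is a hypothetical helper, and like the queries above it checks only the pattern, not whether the date is real.

```python
import re

# Date part of the pattern from the question (dd.mm.yyyy, years 19xx/20xx).
DATE_PATTERN = re.compile(
    r"(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[012])\.(19|20)\d\d"
)

def single_match(text):
    """Return the one matching substring, or None for 0 or 2+ matches."""
    matches = [m.group(0) for m in DATE_PATTERN.finditer(text)]
    return matches[0] if len(matches) == 1 else None

print(single_match("paid 05.03.2012 ok"))         # 05.03.2012
print(single_match("no date here"))               # None
print(single_match("01.01.2010 and 02.02.2011"))  # None
```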
I think there's no way to improve this task. Actually, in order to achieve what you want it should get even slower.
Your regular expression matches text like 31.02.2013 or 31.04.2013, where the day is outside the range of the month. If you bring the year into it, it gets even worse: 29.02.2012 is valid, but 29.02.2013 is not.
That's why you have to test whether the result is a valid date.
Since there isn't a complete regular expression for that, you would really have to do it in PL/SQL.
In your to_date_single function you return null when an invalid date is found.
But that doesn't mean there won't be other valid dates later in the text.
So you have to keep trying until you either find two valid dates or hit the end of the text:
create or replace function fn_to_date(p_date_str in varchar2) return date is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
vn_findings number;
vn_loop number;
begin
vn_findings := 0;
vn_loop := 1;
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)';
loop
pResStr := regexp_substr(p_date_str, pRegEx, 1, vn_loop);
if pResStr is null then exit; end if;
begin
l_date := to_date(pResStr, 'dd.mm.yyyy');
vn_findings := vn_findings + 1;
-- your crazy requirement :)
if vn_findings = 2 then
return null;
end if;
exception when others then
null;
end;
-- you have to keep trying :)
vn_loop := vn_loop + 1;
end loop;
return l_date;
end;
Some tests:
select fn_to_date('xxxx29.02.2012xxxxx') c1 --ok
, fn_to_date('xxxx29.02.2012xxx29.02.2013xxx') c2 --ok, 2nd is invalid
, fn_to_date('xxxx29.02.2012xxx29.02.2016xxx') c3 --null, both are valid
from dual
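The same "keep trying until two valid dates or end of text" loop can be sketched in Python to check the logic against the three test strings. This is an illustration rather than a port of the PL/SQL: `single_valid_date` is a made-up name, and `datetime.strptime` plays the role of to_date by raising an error on impossible dates.

```python
import re
from datetime import date, datetime

# Same loose candidate pattern as in the function above.
CANDIDATE = re.compile(
    r"(0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[012])\.(19|20)\d\d"
)

def single_valid_date(text):
    """Return the date if exactly one candidate is a real calendar date."""
    found = None
    for m in CANDIDATE.finditer(text):
        try:
            parsed = datetime.strptime(m.group(0), "%d.%m.%Y").date()
        except ValueError:
            continue  # e.g. 29.02.2013: matches the pattern, not a real date
        if found is not None:
            return None  # second valid date: nothing should be updated
        found = parsed
    return found

print(single_valid_date("xxxx29.02.2012xxxxx"))             # 2012-02-29
print(single_valid_date("xxxx29.02.2012xxx29.02.2013xxx"))  # 2012-02-29
print(single_valid_date("xxxx29.02.2012xxx29.02.2016xxx"))  # None
```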
As you are going to have to use trial and error anyway, one idea would be to use a simpler regular expression.
Something like \d\d[.]\d\d[.]\d\d\d\d would suffice. That would depend on your data, of course.
Using @David's idea you could reduce the number of rows to which you apply your to_date_single function (because it's slow),
but regular expressions alone won't do what you want:
update my_table
set my_date = fn_to_date( )
where regexp_instr(desc,pattern,1,1) > 0