sas drop value through the whole data - sas

I want to turn invalid values to a single value and them print only the ID's of the obs that has these values=> I have multiple error is the second part (the loop), is it possible to fix somehow this code, or is it simply impossible to do that this way?
data comb8;set comb;
if q2 not in (1,2,3,4,5)then q2=666;
if q3 not in (1,2,3,4,5)then q3=666;
if q10 not in (1,2,3,4,5)then q10=666;
if gender not in(0,1)then gender=666;
if married not in(0,1)then married=666;
run;
data comb10; set comb8;
do i=1 to n;
if i NE 666 then drop(?????????);
end;
keep id;
run;

Hoping I've understood this right - the end result you're after is to just keep the obervation number of any observation that contains an invalid value for any of the criteria you're checking for? If so, try this:
data comb8(keep=id where=(not missing(obid)));
set comb;
if q2 not in (1,2,3,4,5) then obid=_n_;
if q3 not in (1,2,3,4,5) then obid=_n_;
if q10 not in (1,2,3,4,5) then obid=_n_;
if gender not in(0,1) then obid=_n_;
if married not in(0,1) then obid=_n_;
run;
_n_ is an automatic variable that identifies the observation number that has been loaded in from the set statement. You can set obid to this value when an issue is found and then use (where=(not missing(obid))) to only keep the observations that had invalid values
Just in case it's of help to anyone else later on...
data comb8;set comb;
if q2 not in (1,2,3,4,5)then q2=666;
if q3 not in (1,2,3,4,5)then q3=666;
if q10 not in (1,2,3,4,5)then q10=666;
if gender not in(0,1)then gender=666;
if married not in(0,1)then married=666;
run;
data comb10(keep=obid where=(not missing(obid)));
set comb8;
Array Q q1-q10 gender married . ;
do i=1 to dim(Q) ;
if Q[i] EQ 666 then obid=_n_;
end;
run;

Related

SAS - Flag if variable is present in another column of same dataset

I have a SAS dataset with 2 columns that I want to compare (VAR1 and VAR2). I would like to check if for each value of VAR1 this value exists anywhere in the column VAR2. If the VAR1 value does not exist anywhere in the column VAR2 I want to flag it as 1.
For exemple :
I have this :
TABLE in
VAR1
VAR2
k3
t7
t7
g7
p8
k3
...
...
And would want this
TABLE out
VAR1
VAR2
FLAG
k3
t7
0
t7
g7
0
p8
k3
1
...
...
...
I tried using
FLAG = ifn(indexw(VAR2,VAR1,0,1)
But this method only compare the two columns for the current row.
Thank you in advance for your help !
Edit : I tried running this code as suggested by Joe but ran into an error.
Code :
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
data for_fmt;
set your_table;
fmtname = 'VAR2F';
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'o';
start = .;
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set your_table;
flag = put(var1,var2f.);
run;
Error:
ERROR: This range is repeated, or values overlap: .- ..
In SAS, everything is based on one row at a time in the data step, so you can't do what you're looking to directly.
What you can do, though, is use a lookup technique - there are quite a few - and that will let you get what you're after.
The easiest one to use in your case is probably a format.
data for_fmt;
set your_table;
fmtname = 'VAR2F';
start = var2;
label = '0';
output;
if _n_ eq 1 then do;
hlo = 'o'; *this is for "other" (not found) records;
start = .;
label = '1';
output;
end;
run;
proc sort nodupkey data=for_fmt;
by start;
run;
proc format cntlin=for_fmt;
quit;
data want;
set your_table;
flag = put(var1,var2f.);
run;
This is pretty fast (only limited by dataset read/write time) unless you have millions of unique rows.
You could also merge the dataset to itself, or do this in SQL, or use a hash table, but the format approach is probably simplest.
As #Joe says, it is always on one row at a time in the data step. So if you can make all possible value of var1 in one row, you can do the character match work easily.
data your_table;
length VAR1 $2;
length VAR2 $2;
input VAR1 VAR2;
datalines;
k3 t7
t7 g7
p8 k3
;
data want;
array _char_[&sysnobs.]$200._temporary_;
do until(eof1);
set your_table end=eof1;
i+1;
_char_[i]=var2;
end;
do until(eof2);
set your_table end=eof2;
flag=1;
do i=1to dim(_char_) until(flag=0);
if var1=_char_[i] then flag=0;
end;
output;
end;
run;
Here's the method I typically use in situations like this. I first create a list of the variables to check against, then merge with that and can easily pick out the ones that are found.
proc sort data=have (keep=var2) out=var2levels (rename=(var2=var1)) nodupkey; by var2;
proc sort data=have; by var1;
data want;
merge have (in=in1) var2levels (in=in2);
by var1;
if in1;
flag = in2;
run;
So here the first proc sort creates a list of all the unique values of var2. The output data set renames that to var1 for merging purposes (this can be done more clearly but less efficiently by renaming multiple variables). Then we simply merge the original data set (keeping all records) with the list of existing var2 values and set the flag accordingly.

Deleting missing values when dealing with panel data

I am working with a panel dataset, so many countries and many variables throughout a period. The problem is that some countries have no value for certain variables across the whole period and I would like to get rid of them. I found this code for deleting rows with missing values :
DATA data0;
SET data1;
IF cmiss(of _all_) then delete;
RUN;
But all this does is check every row, while I would like to delete a whole country if it has no observations in at least one variable.
Here's a part of the data :
If you want to delete the whole country if it has any information missing, you are on the right track, you just need to add a (group) by statement.
If your data is already sorted by country, as it appears to be in the picture, you can just run:
data want;
set have;
IF cmiss(of _all_) then delete;
by country;
If it is not sorted, you need to first run:
proc sort data=have;
by country;
However, if you have 60 years of data for every country, my guess is that you will not find a single one that have all the information for every year. It will be probably better to do some substantive choices of countries and periods you want to analyze, and then perform multiple imputatiom of missing data: https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf
You can use a DOW loop to compute which variable(s) contain only missing values within a group.
A second DOW loop outputs only those groups in which all variables contain at least on value.
Example:
data have;
call streaminit (2020);
do country = 1 to 6;
do year = 1960 to 1999;
array x gini kof tradegdp fdi gdp age_dep educ;
do over x;
x = rand('integer', 20, 100);
end;
if country = 1 then call missing (gini);
if country = 2 then call missing (educ);
if country = 4 then call missing (fdi);
output;
end;
end;
run;
data want;
* count number of non-missing values over group for each arrayed variable;
do _n_ = 1 by 1 until (last.country);
set have;
by country;
array x gini kof tradegdp fdi gdp age_dep educ;
array flag(100) _temporary_; * flag if variable has a non-missing value in group;
do _index = 1 to dim(x);
if not(flag(_index)) then flag(_index) = 1 - missing(x(_index));
end;
end;
* check if at least one variable has no values;
_remove_group_flag = sum(of flag(*)) ne dim(x);
do _n_ = 1 to _n_;
set have;
if not _remove_group_flag then output;
end;
call missing (of flag(*));
run;
Will LOG
NOTE: There were 240 observations read from the data set WORK.HAVE. First DOW loop
NOTE: There were 240 observations read from the data set WORK.HAVE. Second DOW loop
NOTE: The data set WORK.WANT has 120 observations and 11 variables. Conditional output

condense multiple records into single record in sas

I have row data by account level and I wish to group them by the account owner as a new data. Yes will take the priority.
Account_Owner Account_No Ever_Purchase Ever_Purchase_within_2days Ever_Deliver_in_2weeks
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
Desired Outcome
Account_Owner Ever_Purchase Ever_Purchase_within_2days Ever_Deliver_in_2weeks
Tom Yes Yes Yes
I am sorry that I don't have any code because I don't know where to start.
You can use a DOW loop to track the group result for each ever_* variable in a temporary array.
proc format;
value yesno .,0 = 'No' other='Yes';
data have; input
Account_Owner $ Account_No Ever_Purchase $ Ever_Purchase_within_2days $ Ever_Deliver_in_2weeks $;
datalines;
Tom 12345 Yes Yes No
Tom 34567 Yes No Yes
Tom 09876 No No No
;
data want;
array evals(100) _temporary_; * presume never more than 100 flag variables;
call missing (of evals(*));
* dow loop;
do until (last.account_owner);
set have;
by account_owner;
array flags ever:;
do _n_ = 1 to dim(flags);
evals(_n_) = evals(_n_) or flags(_n_) = 'Yes'; * compute aggregate result;
end;
end;
* move results back into original variables;
do _n_ = 1 to dim(flags);
flags(_n_) = put(evals(_n_), yesno.);
end;
* implicit output, one row per group combination;
run;
Note: In an alternative solution you can convert Yes/No to numeric 1/0 you can use Proc SUMMARY or Proc MEANS to computed the group result (max of var would be 1 if any Yes and 0 if all No)

SAS - Row by row Comparison within different ID Variables of Same Dataset and delete ALL Duplicates

I need some help in trying to execute a comparison of rows within different ID variable groups, all in a single dataset.
That is, if there is any duplicate observation within two or more ID groups, then I'd like to delete the observation entirely.
I want to identify any duplicates between rows of different groups and delete the observation entirely.
For example:
ID Value
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
The output I desire is:
ID Value
1 D
3 Z
I have looked online extensively, and tried a few things. I thought I could mark the duplicates with a flag and then delete based off that flag.
The flagging code is:
data have;
set want;
flag = first.ID ne last.ID;
run;
This worked for some cases, but I also got duplicates within the same value group flagged.
Therefore the first observation got deleted:
ID Value
3 Z
I also tried:
data have;
set want;
flag = first.ID ne last.ID and first.value ne last.value;
run;
but that didn't mark any duplicates at all.
I would appreciate any help.
Please let me know if any other information is required.
Thanks.
Here's a fairly simple way to do it: sort and deduplicate by value + ID, then keep only rows with values that occur only for a single ID.
data have;
input ID Value $;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;
run;
proc sort data = have nodupkey;
by value ID;
run;
data want;
set have;
by value;
if first.value and last.value;
run;
proc sql version:
proc sql;
create table want as
select distinct ID, value from have
group by value
having count(distinct id) =1
order by id
;
quit;
This is my interpretation of the requirements.
Find levels of value that occur in only 1 ID.
data have;
input ID Value:$1.;
cards;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
;;;;
proc print;
proc summary nway; /*Dedup*/
class id value;
output out=dedup(drop=_type_ rename=(_freq_=occr));
run;
proc print;
run;
proc summary nway;
class value;
output out=want(drop=_type_) idgroup(out[1](id)=) sum(occr)=;
run;
proc print;
where _freq_ eq 1;
run;
proc print;
run;
A slightly different approach can use a hash object to track the unique values belonging to a single group.
data have; input
ID Value:& $1.; datalines;
1 A
1 B
1 C
1 D
1 D
2 A
2 C
3 A
3 Z
3 B
run;
proc delete data=want;
proc ds2;
data _null_;
declare package hash values();
declare package hash discards();
declare double idhave;
method init();
values.keys([value]);
values.data([value ID]);
values.defineDone();
discards.keys([value]);
discards.defineDone();
end;
method run();
set have;
if discards.find() ne 0 then do;
idhave = id;
if values.find() eq 0 and id ne idhave then do;
values.remove();
discards.add();
end;
else
values.add();
end;
end;
method term();
values.output('want');
end;
enddata;
run;
quit;
%let syslast = want;
I think what you should do is:
data want;
set have;
by ID value;
if not first.value then flag = 1;
else flag = 0;
run;
This basically flags all occurrences of a value except the first for a given ID.
Also I changed want and have assuming you create what you want from what you have. Also I assume have is sorted by ID value order.
Also this will only flag 1 D above. Not 3 Z
Additional Inputs
Can't you just do a sort to get rid of the duplicates:
proc sort data = have out = want nodupkey dupout = not_wanted;
by ID value;
run;
So if you process the observations by VALUE levels (instead of by ID levels) then you just need keep track of whether any ID is ever different than the first one.
data want ;
do until (last.value);
set have ;
by value ;
if first.value then first_id=id;
else if id ne first_id then remapped=1;
end;
if not remapped;
keep value id;
run;

SAS - if and then condition statements

My data is more than 70,000. I have more than 50 variables. (Var1 to Var50). In each variable, there are about about 30 groups (I'll use a to z). I am trying to get a selection of data using if statements. I'd like to select every data with the same group. Eg data in var 1 to 30 with a, data with var 1 to 30 in b.
I seem to be writing
If (Var1="a" and Var2="a" and Var3="a" and Var4="a" and all the way to var50=
"a") or (Var1="b" and Var2="a" and Var3="b" and Var4="b" and all the way to var50=
"b")...
How do I consolidate? I tried using an array but it didnt work and i was not sure if arrays work in the IF and then statement.
IF (VAR2="A" or VAR2="B" or VAR2="C" or VAR2="D"
or VAR3="A" or VAR3="B" or VAR3="C" or VAR3="D"
or VAR4="A" or VAR4="B" or VAR4="C" or VAR4="D"
or VAR5="A" or VAR5="B" or VAR5="C" or VAR5="D"
or VAR6="A" or VAR6="B" or VAR6="C" or VAR6="D"
or VAR7="A" or VAR7="B" or VAR7="C" or VAR7="D"
or VAR8="A" or VAR8="B" or VAR8="C" or VAR8="C"
or VAR9="A" or VAR9="B" or VAR9="C" or VAR9="D"
or VAR10="A" or VAR10="B" or I10_D10="C" or VAR10="D"
or VAR12="A" or VAR12="B" or VAR12="C" or VAR12="D"
or VAR13="A" or VAR13="B" or VAR13="C" or VAR13="D"
or VAR14="A" or VAR14="B" or VAR14="C" or VAR14="D"
or VAR15="A" or VAR15="B" or VAR15="C" or VAR15="D"
or VAR6="A" or VAR16="B" or VAR16="C" or VAR16="D"
or VAR17="A" or VAR17="B" or VAR17="C" or VAR17="D"
or VAR18="A" or VAR18="B" or VAR18="C" or VAR18="C"
or VAR19="A" or VAR19="B" or VAR19="C" or I10_D19="D"
or VAR20="A" or VAR20="B" or I10_D20="C" or VAR20="D"
or VAR21="D" or VAR22="A" or VAR22="B" or VAR22="C" or VAR22="D"
or VAR23="A" or VAR23="B" or VAR23="C" or VAR23="D"
or VAR24="A" or VAR24="B" or VAR24="C" or VAR24="D"
or VAR25="A" or VAR25="B" or VAR25="C" or VAR25="D"
or VAR26="A" or VAR26="B" or VAR26="C" or VAR26="D"
or VAR27="A" or VAR27="B" or VAR27="C" or VAR27="D"
or VAR28="A" or VAR28="B" or VAR28="C" or VAR28="C"
or VAR29="A" or VAR29="B" or VAR29="C" or VAR29="D"
or VAR30="A" or VAR30="B" or I10_D30="C" or VAR30="D")
then Group=1; else Group=0;
You probably don't need a macro, however a macro might be faster.
%let value=a;
data want;
set have;
array var[50];
keepit=1;
do i=1 to 50;
keepit = keepit and (var[i]="&value");
if ^keepit then
leave;
end;
if keepit;
drop i keepit;
run;
I create a signal variable and update it's value, it will be false if any value in the var[] array is not the &value. I leave the loop early if we find 1 non-matching value, to make it more efficient.
It's not exactly clear what you want. If you want to avoid checking all variables you can use WHICHC to find if any in a list are A.
X = whichc('a', of var1-var30);
If you want to see what different groups you have across all the variables, I think a big proc freq is what you want:
proc freq data=have noprint;
table var1*var2*var3*var4....*var30*gender*age / list out=table_counts;
run;
And then check the table_counts data set to see if that has what you want.
If neither of these are what you want, you need to add more details to your question. A sample of data and expected output would be perfect.
When I need to search several variables for a particular value what I will do is - combine all variables into one string and then search that string. Like this:
*** CREATE TEST DATA ***;
data have;
infile cards;
input VAR1 $ VAR2 $ VAR3 $ VAR4 $ VAR5 $;
cards;
J J K A M
S U I O P
D D D D D
l m n o a
Q U J C S
;
run;
data want;
set have;
*** USE CATS FUNCTION TO CONCATENATE ALL VAR# INTO ONE VARIABLE ***;
allvar = cats(var1, var2, var3, var4, var5);
*** IF NEEDED, APPLY UPCASE TO CONCATENATED VARIABLE ***;
*allvar = upcase(allvar);
*** USE INDEXC FUNCTION TO SEARCH NEW VARIABLE ***;
if indexc(allvar, "ABCD") > 0 then group = 1;
else group = 0;
run;
I'm not sure if this is exactly what you need, but hopefully this is something you can modify for your particular task.
The code as posted is testing if ANY of a list of variables match ANY of a list of values.
Let's make a simple test dataset.
data have ;
input id (var1-var5) ($);
cards;
1 E F G H I
2 A B C D E
;;;;
Make one array of the values you want to find and one array of the variables you want to check. Loop over the list of variables until you either find one that contains one of the values or you run out of variables to test.
data want ;
set have;
array values (4) $8 _temporary_ ('A' 'B' 'C' 'D');
array vars var1-var5 ;
group=0;
do i=1 to dim(vars) until (group=1);
if vars(i) in values then group=1;
end;
drop i;
run;
You could avoid the array for the list of values if you want.
if vars(i) in ('A' 'B' 'C' 'D') then group=1;
But using the array will allow you to make the loop run over the list of values instead of the list of variables.
do i=1 to dim(values) until (group=1);
if values(i) in vars then group=1;
end;
Which might be important if you wanted to keep the variable i to indicate which value (or variable) was first matched.