I am working in SAS Enterprise Guide and want to create a table that contains all possible combinations of the values in some columns. Here is an example:
Let's say I have three columns:
apple pear   plum
0     good   blue
1     middle violet
      bad
I would want my output table to look as follows:
apple pear plum
0 good blue
0 good violet
0 middle blue
0 middle violet
0 bad blue
0 bad violet
1 good blue
1 good violet
1 middle blue
1 middle violet
1 bad blue
1 bad violet
My actual code has more columns with more distinct values, so hard coding is definitely not an option. How can I create such a table in SAS?
Thanks up front for the help!
You can use PROC SQL to create the full cross product of the distinct values.
proc sql ;
create table want as
select *
from (select distinct apple from have where not missing(apple))
, (select distinct pear from have where not missing(pear))
, (select distinct plum from have where not missing(plum))
;
quit;
Another option is PROC SUMMARY with the COMPLETETYPES option:
data testx;
input apple pear $ plum $;
cards;
0 good blue
1 middle violet
1 bad blue
;;;;
run;
proc summary nway completetypes chartype;
class _all_;
output out=testb(drop=_:);
run;
proc print;
run;
Obs apple pear plum
1 0 bad blue
2 0 bad violet
3 0 good blue
4 0 good violet
5 0 middle blue
6 0 middle violet
7 1 bad blue
8 1 bad violet
9 1 good blue
10 1 good violet
11 1 middle blue
12 1 middle violet
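To sanity-check the expected row count outside SAS, the cross product that both answers build can be sketched in Python with itertools.product. This is only an illustration of the logic, not part of either SAS solution; the value lists are typed in by hand from the example in the question.

```python
from itertools import product

# Distinct non-missing values per column, copied from the question's example.
apple = [0, 1]
pear = ["good", "middle", "bad"]
plum = ["blue", "violet"]

# One row for every combination of one value per column: 2 * 3 * 2 = 12 rows.
rows = list(product(apple, pear, plum))

for row in rows:
    print(*row)
```

The number of rows is the product of the distinct-value counts, so it grows quickly as columns are added.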
I have a dataframe with a column containing:
1 Tile 1 up Red 2146 (75) Green 1671 (75)
The two 1s can be any number up to 10
up can also be down
The 2146 and 1671 can be any number up to 9999
What's the best way to break each of these out into separate columns without using split? I was looking at regex but am not sure how to handle this (especially the white space). I liked the idea of putting the new column names in too and started with
Pixel.str.extract(r'(?P<num1>\d)(?P<text>[Tile])(?P<Tile>\d)')
Thanks for any help
To avoid an overly complicated regex pattern, perhaps you can use str.extractall to get all numbers, and then concat to your current df. For up or down, use str.findall:
import pandas as pd
df = pd.DataFrame({"title":["1 Tile 1 up Red 2146 (75) Green 1671 (75)",
"10 Tile 10 down Red 9999 (75) Green 9999 (75)"]})
df = pd.concat([df, df["title"].str.extractall(r'(\d+)').unstack().loc[:,0]], axis=1)
df["direction"] = df["title"].str.findall(r"\bup\b|\bdown\b").str[0]
print (df)
#
title 0 1 2 3 4 5 direction
0 1 Tile 1 up Red 2146 (75) Green 1671 (75) 1 1 2146 75 1671 75 up
1 10 Tile 10 down Red 9999 (75) Green 9999 (75) 10 10 9999 75 9999 75 down
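Since the question asked for named columns, a single pattern with named groups is another option. The following is a minimal sketch using Python's re module. The group names (num1, tile, direction, red, red_aux, green, green_aux) are made up for illustration, and the pattern assumes every row follows exactly the layout of the two sample strings.

```python
import re

# Hypothetical group names; \s+ tolerates any run of white space between fields.
PATTERN = re.compile(
    r"(?P<num1>\d+)\s+Tile\s+(?P<tile>\d+)\s+(?P<direction>up|down)\s+"
    r"Red\s+(?P<red>\d+)\s+\((?P<red_aux>\d+)\)\s+"
    r"Green\s+(?P<green>\d+)\s+\((?P<green_aux>\d+)\)"
)

def parse(text):
    """Return the named fields as a dict of strings, or None if the layout differs."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

print(parse("1 Tile 1 up Red 2146 (75) Green 1671 (75)"))
print(parse("10 Tile 10 down Red 9999 (75) Green 9999 (75)"))
```

With pandas, the same pattern string (PATTERN.pattern) can be passed to Series.str.extract, which names the resulting columns after the groups. The extracted values are strings, so convert them where numbers are needed.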
I want to keep only the row with the highest rank1 for each team. If there is a tie, I want the row with the higher rank2. And then the higher rank3.
For example,
data test;
input name $ team $ rank1 rank2 rank3 country $;
datalines;
Bob A 5 6 5 US
Joe A 8 2 6 UK
Dav B 9 7 2 GER
Jim B 9 4 4 FRA
Bob C 3 4 1 FRA
Dan D 5 2 7 GER
Ike D 5 2 7 US
Jay D 5 2 8 UK
run;
I want:
Joe A 8 2 6 UK
Dav B 9 7 2 GER
Bob C 3 4 1 FRA
Jay D 5 2 8 UK
What is the most efficient way to do this? The dataset I'm working with is pretty big and is not sorted. I tried the below code but the sorts take forever to run. And the second sort sorts already sorted data. What if most teams only appear once in the dataset? Is it faster to split into duplicates and non-duplicates, sort only the duplicates and then append?
proc sort data=test;
by team descending rank1 descending rank2 descending rank3;
run;
proc sort data=test nodupkey;
by team;
run;
You can do that with PROC SUMMARY. Not sure about performance compared to what you are already doing.
proc summary data=test nway;
class team;
output out=ranked(drop=_:) idgroup(max(rank:) out(name rank: country)=);
run;
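The underlying idea of keeping one best row per team without sorting can be sketched as a single pass over the data. The Python below only illustrates that logic with the rows from the question; it is not a description of how PROC SUMMARY works internally, and the function name best_per_team is made up for this example.

```python
# Rows copied from the question: (name, team, rank1, rank2, rank3, country).
rows = [
    ("Bob", "A", 5, 6, 5, "US"),
    ("Joe", "A", 8, 2, 6, "UK"),
    ("Dav", "B", 9, 7, 2, "GER"),
    ("Jim", "B", 9, 4, 4, "FRA"),
    ("Bob", "C", 3, 4, 1, "FRA"),
    ("Dan", "D", 5, 2, 7, "GER"),
    ("Ike", "D", 5, 2, 7, "US"),
    ("Jay", "D", 5, 2, 8, "UK"),
]

def best_per_team(rows):
    """Keep, per team, the row with the largest (rank1, rank2, rank3) tuple."""
    best = {}
    for row in rows:
        team = row[1]
        key = row[2:5]  # tuples compare left to right: rank1, then rank2, then rank3
        if team not in best or key > best[team][2:5]:
            best[team] = row  # on a complete tie the earlier row is kept
    return best

winners = best_per_team(rows)
for team in sorted(winners):
    print(*winners[team])
```

Each input row is visited once and only one row per team is held in memory, which is why this kind of approach can avoid the cost of a full sort.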
I am currently restructuring my package from SAS Base to SAS Enterprise Guide as part of a knowledge transfer to a client. Unfortunately, one change I am having to make is switching from compress to strip in my proc sql left joins. For example, the following code doesn't work:
data have;
input ID VarA;
datalines;
1 2
2 3
3 4
4 5
;
run;
data have1;
input ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9;
datalines;
1 3 4 6 7 3 6 6 7 8
2 2 2 2 2 5 6 7 2 1
3 5 6 7 8 4 5 3 4 3
4 3 4 6 7 4 6 8 3 6
;
run;
proc sql;
create table Want as
select a.*
,b.Var1
,b.Var2
,b.Var3
,b.Var4
,b.Var5
,b.Var6
,b.Var7
,b.Var8
,b.Var9
from Have as a
left join Have1 as b
on compress(a.ID) = compress(b.ID);
quit;
I can use the strip function at times, but it is safer to deliver a package with compress because there are often misplaced spaces in observations. Any ideas?
Edit: to avoid further confusion, I usually use the compress function to look up reference rates of bonds like EURIBOR 006m. This makes my generic example incorrect, but the left join typically uses character variables.
You need a character variable to use the compress function. Your ID variables are numeric.
Try converting to character:
on compress(put(a.ID,8.)) = compress(put(b.ID,8.));
I have read the online documentation, and from it I think that the trailing @ only works with the column input method. How can it be used with the list input method?
/* This Works */
data new;
input height 25-26 @;
if height = 6 ;
input name $ 1-8 colour $ 9-13 place $ 16-24 ;
datalines;
Deepak  Red    Delhi    6
Aditi   Yellow Delhi    5
Anup    Blue   Delhi    5
Era     Green  Varanasi 5
Avinash Black  Noida    5
Vivek   Grey   Agra     5
;
run;
/* But This Doesn't*/
data new;
input height @;
if height = 6;
input name $ colour $ place $ height;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;
run;
LOG:
NOTE: Invalid data for height in line 79 1-6.
79 Deepak Red Delhi 6
height=. name= colour= place= _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 80 1-5.
80 Aditi Yellow Delhi 5
height=. name= colour= place= _ERROR_=1 _N_=2
The fixed layout of the first set of data lines makes it possible to read a field from a specific column position.
The second set has a variable layout, so it is harder to grab a specific field arbitrarily.
So, what is wrong? In the second DATA step, list input starts reading at the beginning of the line, so height is read from the position of the name, which is not a number.
Don't worry about 'reducing processing' by reading only part of a line. Held input and conditional processing are more often used for data lines that contain some sort of variant or conditional data items.
For both of those formats I would read all of the variables and then add logic to filter based on the values.
If you really need to check whether the last "word" on the line matches some criterion before deciding HOW to read the line, then you might want to try using the automatic _infile_ variable.
data new;
input @ ;
if scan(_infile_,-1,' ') = '6';
input name $ colour $ place $ height;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;
I've tried to Google and read around this problem, but I can't seem to find an adequate solution. I'm hoping someone here can help me. I'm sorry if it's too simple but I would appreciate any advice or help.
I'm working with a longitudinal dataset and I would like to assign an encounter number for each person (ID) who may have had one or more interactions with our laboratory (accession). The dataset looks something like this, and I would like to create a new variable (encounter) that numbers each unique encounter for each individual sequentially.
ID accession encounter
----------------------------------
1 1234 1
1 1234 1
1 1235 2
1 1236 3
1 1236 3
2 1000 1
2 1001 2
2 1001 2
3 1111 1
3 1112 2
4 1001 1
4 1001 1
I've tried using first.variable statements such as:
data new; set old;
by id accession;
if first.id & first.accession then encounter=1;
else encounter+1;
run;
I haven't been successful because it won't retain the same encounter number if both the id and accession number remain the same.
Thank you in advance for helping to point me in the right direction.
You're close. At the first row of each ID you want to reset the counter to 0, and at the first row of each accession you want to increment it.
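data new;
set old;
by id accession; /* OLD must already be sorted by ID and ACCESSION */
retain encounter;
if first.id then encounter = 0; /* restart the count for each new ID */
if first.accession then encounter + 1; /* advance once per new accession within the ID */
run;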
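The reset-then-increment logic can also be traced by hand outside SAS. The Python below is a rough sketch of the same idea, assuming the rows are already sorted by ID and accession; the function name number_encounters is made up for illustration, and the data is copied from the question.

```python
def number_encounters(rows):
    """rows: (id, accession) pairs already sorted by id, then accession."""
    out = []
    prev_id = prev_acc = None
    encounter = 0
    for pid, acc in rows:
        new_id = pid != prev_id
        if new_id:
            encounter = 0        # like first.id: restart the counter
        if new_id or acc != prev_acc:
            encounter += 1       # like first.accession: a new encounter
        out.append((pid, acc, encounter))
        prev_id, prev_acc = pid, acc
    return out

data = [(1, 1234), (1, 1234), (1, 1235), (1, 1236), (1, 1236),
        (2, 1000), (2, 1001), (2, 1001),
        (3, 1111), (3, 1112),
        (4, 1001), (4, 1001)]

for row in number_encounters(data):
    print(*row)
```

Note that a change of ID counts as a new accession even when the accession number repeats (ID 3 to ID 4 above would still restart at 1), which mirrors how first.accession behaves under a BY statement on both variables.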