Transposing observations into one row in SAS - sas

I have a table of car registrations (https://i.stack.imgur.com/Qjnl6.png). I need to transpose it into one row per VIN number with all the information about its registrations, so that I will have something like this:
vin|company_1|start_date_1|end_date_1|company_2|start_date_2|end_date_2|...|company_n|start_date_n|end_date_n, where n is the maximum number of registrations. Please help with code or hints.
I tried proc transpose, but I got start_date and end_date in separate rows, so it doesn't work:
proc transpose data = test_vin name=VarName out= outdata;
by vin_number;
var company start_date_date9 end_date_date9;
run;

In the future, please do not post data as images.
You will need to do this manually with a data step. First scan your dataset to find the maximum number of registrations for any VIN, so that the arrays have enough columns for every VIN.
We'll do this with a data step, arrays, and the retain statement. We keep adding values to each array until we reach the last row for a VIN. At that point we output the accumulated results; the arrays are reset when the next VIN starts.
Sample data:
data have;
input vin$ company$ start_date9:date9. end_date9:date9.;
format start_date9 end_date9 date9.;
datalines;
A company1 01JAN2020 05JAN2020
A company2 06JAN2020 10JAN2020
A company3 11JAN2020 15JAN2020
A company4 16JAN2020 19JAN2020
B company5 01FEB2020 02FEB2020
B company6 03FEB2020 10FEB2020
B company7 11FEB2020 20FEB2020
B company8 21FEB2020 28FEB2020
;
run;
Code:
data want;
set have;
by vin;
array company_[4] $;
array start_date_[4];
array end_date_[4];
/* Do not reset values at the start of each row */
retain company_: start_date_: end_date_:;
/* Reset the counter and values for each VIN */
if(first.vin) then do;
i = 1;
call missing(of company_:, of start_date_:, of end_date_:);
end;
else i+1;
/* Store each company and date */
company_[i] = company;
start_date_[i] = start_date9;
end_date_[i] = end_date9;
/* Only output one row per VIN */
if(last.VIN) then output;
format start_date_: end_date_: date9.;
keep vin company_: start_date_: end_date_:;
run;
Output:
vin company_1 company_2 company_3 company_4 start_date_1 ...
A company1 company2 company3 company4 01JAN2020 ...
B company5 company6 company7 company8 01FEB2020 ...
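The array size of 4 in the code above is hard-coded to fit the sample data. The preliminary scan mentioned at the start of the answer could be sketched with PROC SQL as below. The macro variable name max_regs is only an illustrative choice, not something from the original code.

```sas
/* Count the registrations per VIN, then keep the largest count */
proc sql noprint;
    select max(n) into :max_regs trimmed
    from (select count(*) as n
          from have
          group by vin);
quit;

/* Write the value to the log to check it */
%put &=max_regs;
```

For the sample data this resolves to 4. The array statements can then be written as array company_[&max_regs] $; and so on, instead of using a literal 4.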

Related

Deleting missing values when dealing with panel data

I am working with a panel dataset, so many countries and many variables throughout a period. The problem is that some countries have no value for certain variables across the whole period and I would like to get rid of them. I found this code for deleting rows with missing values :
DATA data0;
SET data1;
IF cmiss(of _all_) then delete;
RUN;
But all this does is check every row, while I would like to delete a whole country if it has no observations in at least one variable.
Here's a part of the data :
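If you want to delete the whole country when any of its information is missing, you are on the right track; you just need to add a (group) BY statement.
If your data is already sorted by country, as it appears to be in the picture, you can just run:
data want;
set have;
by country;
IF cmiss(of _all_) then delete;
run;
Note that, as written, this step still removes only the individual rows that contain a missing value; the BY statement makes first.country and last.country available but does not by itself drop the whole group.
If it is not sorted, you need to first run:
proc sort data=have;
by country;
run;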
However, if you have 60 years of data for every country, my guess is that you will not find a single one that has all the information for every year. It will probably be better to make some substantive choices about the countries and periods you want to analyze, and then perform multiple imputation of the missing data: https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf
You can use a DOW loop to compute which variable(s) contain only missing values within a group.
A second DOW loop outputs only those groups in which every variable contains at least one value.
Example:
data have;
call streaminit (2020);
do country = 1 to 6;
do year = 1960 to 1999;
array x gini kof tradegdp fdi gdp age_dep educ;
do over x;
x = rand('integer', 20, 100);
end;
if country = 1 then call missing (gini);
if country = 2 then call missing (educ);
if country = 4 then call missing (fdi);
output;
end;
end;
run;
data want;
* count number of non-missing values over group for each arrayed variable;
do _n_ = 1 by 1 until (last.country);
set have;
by country;
array x gini kof tradegdp fdi gdp age_dep educ;
array flag(100) _temporary_; * flag if variable has a non-missing value in group;
do _index = 1 to dim(x);
if not(flag(_index)) then flag(_index) = 1 - missing(x(_index));
end;
end;
* check if at least one variable has no values;
_remove_group_flag = sum(of flag(*)) ne dim(x);
do _n_ = 1 to _n_;
set have;
if not _remove_group_flag then output;
end;
call missing (of flag(*));
run;
The log will show:
NOTE: There were 240 observations read from the data set WORK.HAVE. First DOW loop
NOTE: There were 240 observations read from the data set WORK.HAVE. Second DOW loop
NOTE: The data set WORK.WANT has 120 observations and 11 variables. Conditional output
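As an alternative sketch (not the DOW-loop method above), the same group-level filter can be expressed in PROC SQL. It relies on COUNT(variable) counting only non-missing values and on PROC SQL remerging the group summary with the detail rows. The output table name want_sql is an assumption for illustration, and the variable list has to be typed out by hand.

```sas
/* Keep a country only if every variable has at least one
   non-missing value somewhere in that country's rows */
proc sql;
    create table want_sql as
    select *
    from have
    group by country
    having count(gini) > 0 and count(kof) > 0 and count(tradegdp) > 0
       and count(fdi) > 0 and count(gdp) > 0 and count(age_dep) > 0
       and count(educ) > 0
    order by country, year;
quit;
```

The log should include a note about remerging summary statistics with the original data, which is expected here. With the generated sample data, countries 1, 2 and 4 should be dropped, just as in the DOW-loop version.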

SAS: Creating dummy variables from categorical variable

I would like to turn the following long dataset:
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
Into a wide dataset that looks like this:
ID Ankle Shoulder Head
1 1 1 0
2 1 0 1
3 0 1 1
This answer seemed the most relevant but was falling over at the proc freq stage (my real dataset is around 1 million records, and has around 30 injury types):
Creating dummy variables from multiple strings in the same row
Additional help: https://communities.sas.com/t5/SAS-Statistical-Procedures/Possible-to-create-dummy-variables-with-proc-transpose/td-p/235140
Thanks for the help!
Here's a basic method that should work easily, even with several million records.
First you sort the data, then add a count variable that is always 1. Next you use PROC TRANSPOSE to flip the data from long to wide, and finally you fill in the missing values with 0. This is a fully dynamic method: it doesn't matter how many different Injury types you have or how many records there are per person. There are other methods that probably need less code, but I think this one is simple and easy to understand and modify if required.
data test;
input Id Injury $;
datalines;
1 Ankle
1 Shoulder
2 Ankle
2 Head
3 Head
3 Shoulder
;
run;
proc sort data=test;
by id injury;
run;
data test2;
set test;
count=1;
run;
proc transpose data=test2 out=want prefix=Injury_;
by id;
var count;
id injury;
idlabel injury;
run;
data want;
set want;
array inj(*) injury_:;
do i=1 to dim(inj);
if inj(i)=. then inj(i) = 0;
end;
drop _name_ i;
run;
Here's a solution involving only two steps... Just make sure your data is sorted by id first (the injury column doesn't need to be sorted).
First, create a macro variable containing the list of injuries
proc sql noprint;
select distinct injury
into :injuries separated by " "
from have
order by injury;
quit;
Then, let RETAIN do the magic -- no transposition needed!
data want(drop=i injury);
set have;
by id;
format &injuries 1.;
retain &injuries;
array injuries(*) &injuries;
if first.id then do i = 1 to dim(injuries);
injuries(i) = 0;
end;
do i = 1 to dim(injuries);
if injury = scan("&injuries",i) then injuries(i) = 1;
end;
if last.id then output;
run;
EDIT
Following OP's question in the comments, here's how we could use codes and labels for injuries. It could be done directly in the last data step with a label statement, but to minimize hard-coding, I'll assume the labels are entered into a sas dataset.
1 - Define Labels:
data myLabels;
infile datalines dlm="|" truncover;
informat injury $12. labl $24.;
input injury labl;
datalines;
S460|Acute meniscal tear, medial
S520|Head trauma
;
2 - Add a new query to the existing proc sql step to prepare the label assignment.
proc sql noprint;
/* Existing query */
select distinct injury
into :injuries separated by " "
from have
order by injury;
/* New query */
select catx("=",injury,quote(trim(labl)))
into :labls separated by " "
from myLabels;
quit;
3 - Then, at the end of the data want step, just add a label statement.
data want(drop=i injury);
set have;
by id;
/* ...same as before... */
* Add labels;
label &labls;
run;
And that should do it!

SAS: Using Do/Loop in a Proc Transpose

I'm not very familiar with Do Loops in SAS and was hoping to get some help. I have data that looks like this:
Product A: 1
Product A: 2
Product A: 4
I'd like to transpose (easy) and flag that Product A: 3 is missing, but I need to do this iteratively to the i-th degree since the number of products is large.
If I run the transpose part in SAS, my first column will be 1, second column will be 2, and third column will be 4 - but I'd really like the third column to be missing and the fourth column to be 4.
Any thoughts? Thanks.
Get some sample data:
proc sort data=sashelp.iris out=sorted;
by species;
run;
Determine the largest column we will need to transpose to. Depending on your situation you may just want to hardcode this value using a %let max=somevalue; statement:
proc sql noprint;
select cats(max(sepallength)) into :max from sorted;
quit;
%put &=max;
Transpose the data using a data step:
data want;
set sorted;
by species;
retain _1-_&max;
array a[1:&max] _1-_&max;
if first.species then do;
do cnt = lbound(a) to hbound(a);
a[cnt] = .;
end;
end;
a[sepallength] = sepallength;
if last.species then do;
output;
end;
keep species _1-_&max;
run;
Notice we are defining an array of columns: _1,_2,_3,..._max. This happens in our array statement.
We then use by-group processing to populate these newly created columns for a single species at a time. For each species, on the first record, we clear the array. For each record of the species, we populate the appropriate element of the array. On the final record for the species, we output the array contents.
You need a way to tell SAS that you have 4 products and that the values are 1-4. In this example I create rows with a dummy (missing) ID that carry the needed information, then transpose using the ID statement to name the new variables from the value of product.
data product;
input id product @@;
cards;
1 1 1 2 1 4
2 2 2 3
;;;;
run;
proc print;
run;
data productspace;
if 0 then set product;
do product = 1 to 4;
output;
end;
stop;
run;
data productV / view=productV;
set productspace product;
run;
proc transpose data=productV out=wide(where=(not missing(id))) prefix=P;
by id;
var product;
id product;
run;
proc print;
run;

how to display the total count of individual words from a list

My data contains item descriptions for 10,000 item_ids, and the item_ids repeat. How can I count the frequency of each individual word in the item description column for a particular item_id, using SAS (without using arrays)?
Goal is to identify the keywords for a particular item_id.
The following approach leverages PROC FREQ to get the 'keyword' distribution.
data have;
infile cards truncover;
input id var $100.;
cards;
1 This is test test
2 failed
1 be test
2 failed is
3 success
3 success ok
;
/*This is to break down the description into single word*/
data want;
set have;
do _n_=1 to countw(var);
new_var=scan(var,_n_);
output;
end;
run;
/*This is to give you words freq by id*/
ods output list=mylist (keep=id new_var frequency);
PROC FREQ DATA = want
ORDER=FREQ
;
TABLES id * new_var /
NOCOL
NOPERCENT
NOCUM
SCORES=TABLE
LIST
ALPHA=0.05;
RUN; QUIT;
ods _all_ close;
ods listing;
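If capturing ODS output feels heavy, the same per-id word counts can also be sketched with PROC SQL on the one-word-per-row dataset want created above. This is an alternative to the PROC FREQ step, not part of the original answer, and the table name word_counts is an assumption.

```sas
/* Count how often each word occurs within each id */
proc sql;
    create table word_counts as
    select id,
           new_var,
           count(*) as frequency
    from want
    group by id, new_var
    order by id, frequency desc;
quit;
```

The comparison is case-sensitive, so "This" and "this" would be counted separately unless new_var is first passed through lowcase().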
Arrays are used to read across multiple columns, so aren't of any particular use here. This does sound a bit like a homework question and you should really show some attempt that you've made. However, this is not an easy problem to solve, so I will post a solution.
My thoughts on how to approach this are:
1. Sort the data by item_id.
2. For every item_id, scan through each word and check if it already exists for that item_id. If so then go to the next word, otherwise add the word to the unique list and increment the counter by 1.
3. When the last of the current item_id's is processed, output the unique word list and count.
I've hopefully commented the code below sufficiently for you to follow what's going on, if not then look up the particular function or statement online.
/* create dummy dataset */
data have;
input item_id item_desc $30.;
datalines;
1 this is one
1 this is two
2 how many words are here
2 not many
3 random selection
;
run;
/* sort dataset if necessary */
proc sort data=have;
by item_id;
run;
/* extract unique words from description */
data want;
set have;
by item_id;
retain unique_words unique_count; /* retain value from previous row */
length unique_words $200; /* set length for unique word list */
if first.item_id then do; /* reset unique word list and count when item_id changes */
call missing(unique_words);
unique_count = 0;
end;
do i = 1 by 1 while(scan(item_desc,i) ne ''); /* scan each word in description until the end */
if indexw(unique_words,scan(item_desc,i),'|') > 0 then continue; /* check if word already exists in unique list, if so then go to next word */
else do;
call catx('|',unique_words,scan(item_desc,i)); /* add to list of unique words, separated by | */
unique_count+1; /* count number of unique words */
end;
end;
drop item_desc i; /* drop unwanted columns */
if last.item_id then output; /* output id, unique word list and count when last id */
run;

SAS: Insert Blank Rows

I'm calculating some interval statistics (standard deviation of one minute intervals for example) of financial time series data. My code managed to get results for all intervals that contain data, but for intervals that do not contain any observations in the time series, I'd like to insert an empty row just to maintain the timestamp consistency.
For example, if there's data between 10:00 to 10:01, 10:02 to 10:03, but not 10:01 to 10:02, my output would be:
10:01 stat1 stat2 stat3
10:03 stat1 stat2 stat3
It would be ideal if the result could be (I want some values to be 0, some missing '.'):
10:01 stat1 stat2 stat3
10:02 0 0 .
10:03 stat1 stat2 stat3
What I did:
data v_temp/view = v_temp;
set &taq_ds;
where TIME_M between &start_time and &end_time;
INTV = hms(00, ceil(TIME_M/'00:01:00't),00); *create one minute interval;
format INTV tod.; *format hh:mm:ss;
run;
proc means data = sorted noprint;
by SYM_ROOT DATE INTV;
var PRICE;
weight SIZE;
output
out=oneMinStats(drop=_TYPE_ _FREQ_)
n=NTRADES mean=VWAP sumwgt=SUMSHS max=HI min=LO std=SIGMAPRC
idgroup(max(TIME_M) last out(price size ex time_m)=LASTTRD LASTSIZE LASTEX LASTTIME);
run;
For some non-active stocks, there're many gaps like this. What would be an efficient way to generate those filling rows?
If you have SAS/ETS licensed, PROC EXPAND is a good choice for adding blank rows in a time series. Here's a very short example:
data mydata;
input timevar stat1 stat2 stat3;
format timevar TIME5.;
informat timevar HHMMSS5.;
datalines;
10:01 1 3 5
10:03 2 4 6
;;;;
run;
proc expand data=mydata out=mydata_exp from=minute to=minute observed=beginning method=none;
id timevar;
run;
The documentation has more details if you want to perform inter/extrapolation or anything like that. The important options are from=minute, observed=beginning, method=none (no extrapolation or interpolation), and id (which identifies the time variable).
If you don't have ETS, then a data step should suffice. You can either merge to a dataset of known times, or add your own rows in a data step; the size of your dataset determines somewhat which is easier. Here's the merge variation. The variation that adds its own rows in a data step generates them in much the same way as I create the extra rows here.
*Select the maximum time available.;
proc sql noprint;
select max(timevar) into :endtime from mydata;
quit;
*Create the empty dataset with just times;
data mydata_tomerge;
set mydata(obs=1);
do timevar = timevar to &endtime by 60; *by 60 = minutes!;
output;
end;
keep timevar;
run;
*Now merge the one with all the times to the one with all the data!;
data mydata_fin;
merge mydata_tomerge(in=a) mydata;
by timevar;
if a;
run;
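The "add your own rows" variation is only described above, so here is a minimal sketch of one way it could look. It assumes mydata is sorted by timevar with times on exact minute boundaries, and it uses a one-to-one merge (no BY statement) of the dataset with itself, offset by one row, to look ahead at the next timestamp. The names mydata_filled and next_time are assumptions for illustration.

```sas
data mydata_filled;
    /* Look-ahead: pair each row with the timevar of the following row.
       On the last row next_time is missing. */
    merge mydata
          mydata(firstobs=2 keep=timevar rename=(timevar=next_time));

    /* Write the real observation first */
    output;

    /* Then write one blank row for every minute skipped before the next row */
    if not missing(next_time) then do;
        call missing(of stat1-stat3);
        do timevar = timevar + 60 to next_time - 60 by 60;
            output;
        end;
    end;

    drop next_time;
run;
```

With the two sample rows (10:01 and 10:03) this should add a single row for 10:02 whose statistics are all missing. To get 0 rather than missing for some statistics, assign them explicitly after the call missing line.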