Transform the set of data into 5 values - sas

Suppose that I was given the following data
ID Birthday Monthly Salary
P222 2 March 1976 9,600
P013 13 June 1955 31,450
S015 12 September 1966 27,500
The ID number starts with a character followed by three digits.
The first character is the abbreviation of the occupation ("P" for Professor, "S" for Staff, etc.).
Consider the following data, denoted by (*) and (**):
(*):
P222 2Mar1976 9,60000
P013 13Jun1955 31,45000
S015 12Sep1966 27,50000
(**):
P222 2Mar1976 $9,6,00
***************
P013 13Jun1955 $31,450
**************
S015 12Sep1966 $27,500
***********
Suppose I have to write SAS programs to read the aforementioned data (*) and (**) respectively to create a temporary SAS data file, called PERSONEL, which contains five variables, namely ID, OCCUPATION, BIRTHDAY, YEAR and SALARY.
Here YEAR means the year of birth. The variables BIRTHDAY, YEAR and SALARY are numeric, while ID and OCCUPATION are character variables.
For example, the first record should have
ID="P222", OCCUPATION="P", BIRTHDAY=5905 (the SAS date value for 2 March 1976), YEAR=1976, SALARY=9600
Is it possible for me to do this without using an assignment statement?

If you have a fixed-column text file, like your first example:
RULE: ----+----1----+----2----+----3
1311 P222 2Mar1976 9,60000
1312 P013 13Jun1955 31,45000
1313 S015 12Sep1966 27,50000
Then you could read the variables directly from the proper columns.
data want;
infile 'myfile' truncover;
input id $ 1-4 occupation $ 1 @7 birthday date9. year 12-15 @16 salary comma12.2 ;
format birthday date9. salary dollar12.2;
run;
Result:
Obs id occupation birthday year salary
1 P222 P 02MAR1976 1976 $9,600.00
2 P013 P 13JUN1955 1955 $31,450.00
3 S015 S 12SEP1966 1966 $27,500.00
The second version has the values in slightly different positions and an extra line that would need to be skipped.
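For the second layout, one option is to read the data line by column and then use the / line pointer control to step over the row of asterisks. This is a minimal sketch, not a tested solution for the real file: the in-line data and every column position in it are assumptions, so adjust them to match where the values actually sit.

```sas
data personel;
  infile datalines truncover;
  input id         $ 1-4       /* full ID, e.g. P222                  */
        occupation $ 1         /* first character of the ID           */
        @6  birthday date9.    /* columns 6-14 (assumed position)     */
        year       11-14       /* last four characters of the date    */
        @16 salary   comma10.  /* COMMAw. strips the $ and the commas */
        / ;                    /* move past the line of asterisks     */
  format birthday date9.;
datalines;
P222  2Mar1976 $9,6,00
***************
P013 13Jun1955 $31,450
**************
S015 12Sep1966 $27,500
***********
;
```

No assignment statement is needed here because OCCUPATION and YEAR are simply read a second time from columns that ID and BIRTHDAY already cover.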

Related

SAS colon format modifier

What do the numbers in the grey box represent? And what's a simple way of understanding how the colon modifier affects the way sas reads in values?
The answer depends on information not provided. Answer B is the best choice in the sense that you should use the colon modifier whenever you use informats in the INPUT statement, so that SAS stays in list input mode instead of switching to formatted input mode. Otherwise formatted input could read too many or too few characters and might also leave the pointer in the wrong place for reading the next field.
But if you try to read that data from in-line cards, it works fine for those two lines. That is because in-line data lines are padded to the next multiple of 80 bytes.
If you put those lines into a file without any trailing spaces on the lines then the second line fails because there are not 10 characters to read for the last field. But if you add the TRUNCOVER option (or PAD) to the INFILE statement then it will work.
Try it yourself. TEST1 and TEST3 work. TEST2 gets a LOST CARD note.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
options parmcards=test;
filename test temp ;
parmcards;
Donny 5MAR2008 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
data test2;
infile test;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
data test3;
infile test truncover;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
run;
With different data the first formatted input can also cause trouble. For example, if a date value uses only 2 digits for the year, the 9 characters read by DATE9. run into the age field and throw everything after it off. In the first line below, SAS then tries to read FL as the AGE, reads the first 8 characters of the salary as the STATE, and reads just blanks as the SALARY.
data test1;
input name $ hired date9. age state $ salary comma10.;
format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
Results:
Obs name hired age state salary
1 Donny 05MAR2008 . $43,123. .
2 Margaret 20FEB2008 43 NC 65150
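Adding the colon modifier to both informats keeps every field in list input mode, so each value ends at the next blank regardless of its width. A minimal sketch reusing the data above (TEST4 is just an illustrative data set name):

```sas
data test4;
  /* the colon means: read up to the next delimiter, then apply the informat */
  input name $ hired :date9. age state $ salary :comma10.;
  format hired date9.;
cards;
Donny 5MAR08 25 FL $43,123.50
Margaret 20FEB2008 43 NC 65,150
;
```

With this version the two-digit year should no longer shift the remaining fields, and a short last field no longer depends on trailing blanks being present.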

sas reading files with space as delimiter

I have a datafile which uses blank space as delimiter. I want to write a data step to read this file into sas.
The fields are not separated by single blanks; in most cases they are separated by more than 10 blank spaces. I have checked using Notepad++ and the delimiters are not tabs.
137          3.35          Afghanistan                      2009-07-08
154          2.43          Albania                          2009-07-22
101          1.22          Antigua and Barbuda              2009-06-24
155          4.13          Federated States of Micronesia   2009-07-22
I have tried writing informat statements for these and have been unsuccessful.
Here's what I have done so far
input casedt1id :$3. contntid :4 country :&$32. casedt1 yymmdd10.
This reads only the first field properly and the rest get missing values.
The question is: how do I write an INPUT statement with informats to read this data?
thanks for the help.
regards
jana
You can use the @ symbol to control which column the pointer reads from on the line. It looks like you have a fixed starting column for each variable.
data want;
input @1 casedt1id :$3. @14 contntid :4. @28 country :&$32. @61 casedt1 :yymmdd10.;
format casedt1 yymmdd10.;
datalines;
137          3.35          Afghanistan                      2009-07-08
154          2.43          Albania                          2009-07-22
101          1.22          Antigua and Barbuda              2009-06-24
155          4.13          Federated States of Micronesia   2009-07-22
;
That looks like fixed column data to me. The problem then is using INFORMATs with fixed column data. This should work:
input casedt1id $ 1-3 contntid 4-27 country $28-60 casedt1 yymmdd10.;
format casedt1 yymmdd10.;
The trick is to make sure the pointer is in the right place when it tries to read the formatted text. In the statement above that is done by telling it to read through column 60 for COUNTRY, so you are at column 61 when you are ready to read the date. You could also use + or @ to move the pointer.
... @61 casedt1 yymmdd10. ...
If you are reading from a variable length file (most files now are variable length) then make sure to add the TRUNCOVER option to the INFILE statement just in case the date is missing or written using fewer than 10 characters.

Calculate Column Percentage sas

I have the following dataset:
Date Occupation Total_Employed
1/1/2005 Teacher 45
1/1/2005 Economist 76
1/1/2005 Artist 14
2/1/2005 Doctor 26
2/1/2005 Economist 14
2/1/2005 Mathematician 10
and so on until November 2014
What I am trying to do is to calculate a column of percentage of employed by occupation such that my data will look like this:
Date Occupation Total_Employed Percent_Emp_by_Occupation
1/1/2005 Teacher 45 33.33
1/1/2005 Economist 76 56.29
1/1/2005 Artist 14 10.37
2/1/2005 Doctor 26 52.00
2/1/2005 Economist 14 28.00
2/1/2005 Mathematician 10 20.00
where percent_emp_by_occupation is calculated by dividing each occupation's total_employed by the sum of total_employed over all occupations for the same date (month and year), then multiplying by 100:
Example for Teacher: (45/135)*100 = 33.33, where 135 is the sum 45+76+14
I know I can get a table via proc tabulate, but was wondering if there is any way of getting it through another procedure, especially since I want this as a separate dataset.
What is the best way to go about doing this? Thanks in advance.
Extract month and year from the date and create a key:
data ds;
set ds;
month=month(date);
year=year(date);
key=catx("_",month,year);
run;
Roll up the total at month level:
Proc sql;
create table month_total as
select key,sum(total_employed) as monthly_total
from ds
group by key;
quit;
Update the original data with the monthly total:
Proc sql;
create table ds as
select a.*,b.monthly_total
from ds as a left join month_total as b
on a.key=b.key;
quit;
This would lead to the following data set:
Date Occupation Total_Employed monthly_total
1/1/2005 Teacher 45 135
1/1/2005 Economist 76 135
1/1/2005 Artist 14 135
Finally calculate the percentage as:
data ds;
set ds;
percentage=100*total_employed/monthly_total;
run;
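The steps above can also be collapsed into a single query. The sketch below relies on PROC SQL remerging the group total back onto each row (SAS writes a NOTE about remerging to the log). It assumes a data set named occ built from the sample rows, dates in month/day/year order, and an output name occ_pct; all three are illustrative choices, not part of the original question.

```sas
data occ;
  /* assumption: the dates are month/day/year */
  input date :mmddyy10. occupation :$20. total_employed;
  format date mmddyy10.;
datalines;
1/1/2005 Teacher 45
1/1/2005 Economist 76
1/1/2005 Artist 14
2/1/2005 Doctor 26
2/1/2005 Economist 14
2/1/2005 Mathematician 10
;

proc sql;
  create table occ_pct as
  select date,
         occupation,
         total_employed,
         /* sum() is taken per date and merged back onto each row */
         100 * total_employed / sum(total_employed)
           as percent_emp_by_occupation format=6.2
  from occ
  group by date;
quit;
```

The row order of occ_pct is not guaranteed to match the input, so add an ORDER BY clause or a PROC SORT step if the order matters.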
Here you go:
proc sql;
create table occ2 as
select
a.*,
total_employed/employed_by_date as percentage_employed_by_date format=percent7.1
from
occ a
join
(select
date,
sum(total_employed) as employed_by_date
from occ
group by date) b
on
a.date = b.date
;
quit;
Produces a table like so:
One last thought: you can create all of the totals you desire for this calculation in one pass of the data. I looked at a prior question you asked about this data and assumed that you used proc means to summarize your initial data by date and occupation. You can calculate the totals by date as well in the same procedure. I don't have your data, so I'll illustrate the concept with sashelp.class data set that comes with every SAS installation.
In this example, I want to get the total number of students by sex and age, but I also want to get the total students by sex because I will calculate the percentage of students by sex later. Here's how to summarize the data and get counts for 2 different levels of summary.
proc summary data=sashelp.class;
class sex age;
types sex sex*age;
var height;
output out=summary (drop=_freq_) n=count;
run;
The types statement identifies the levels of summary of my class variables. In this case, I want counts of just sex, as well as the counts of sex by age. Here's what the output looks like.
The _TYPE_ variable identifies the level of summary. The total count of sex is _TYPE_=2 while the count of sex by age is _TYPE_=3.
Then a simple SQL query to calculate the percentages within sex.
proc sql;
create table summary2 as
select
a.sex,
a.age,
a.count,
a.count/b.count as percent_of_sex format=percent7.1
from
summary (where=(_type_=3)) a /* sex * age */
join
summary (where=(_type_=2)) b /* sex */
on
a.sex = b.sex
;
quit;
The answer is to look back at the questions you have asked in the last few days about this same data and study those answers. Your answer is there.
While you are reviewing those answers, take time to thank them and give someone a check for helping you out.

How to delete all the duplicate observations but add a column with the frequency in SAS?

In a dataset in SAS, I have some observations multiple times. What I am trying to do is: I am trying to add a column with the frequency of each observation and make sure I keep it only one time in my dataset. I have to do this for a dataset with many rows and around 8 variables.
name id address age
jack 2 chicago 50
peter 4 new york 45
jack 2 chicago 50
This would have to become:
name id address age frequency
jack 2 chicago 50 2
peter 4 new york 45 1
Is there anybody who knows how to do this in SAS (preferably without using SQL)?
Thank you a lot!
@kl78 is right, proc summary is the best non-SQL solution here. This runs in memory, which can cause problems with very large datasets, but you should be OK with 8 columns.
class _all_ will group by all the variables and the frequency is output by default, so there's no need to specify any measures. I've dropped the other automatic variable, _type_, as it isn't relevant here and renamed _freq_.
data have;
input name $ id address &$ age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;
run;
proc summary data=have nway;
class _all_;
output out=want (drop=_type_ rename=(_freq_=frequency));
run;
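If the in-memory class processing of proc summary is a concern, a sort followed by BY-group counting in a DATA step gives the same kind of result without SQL. A minimal sketch using the sample data (the names sorted and want2 are illustrative, and the BY list has to name every variable that defines a duplicate):

```sas
data have;
  /* & lets address contain single blanks; two blanks end the value */
  input name $ id address $ & age;
datalines;
jack 2 chicago  50
peter 4 new york  45
jack 2 chicago  50
;

proc sort data=have out=sorted;
  by name id address age;
run;

data want2;
  set sorted;
  by name id address age;
  if first.age then frequency = 0;  /* start of a group of identical rows  */
  frequency + 1;                    /* sum statement, retained across rows */
  if last.age then output;          /* keep one row per group              */
run;
```

Because age is the last BY variable, first.age and last.age flag a change in any of the four variables, which is what makes them mark the boundaries of each set of duplicates.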

Combining data from different rows into one variable

I have a table as below:
id sprvsr phone name
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
I would like to combine the rows that share the same id and name into one row, with the sprvsr values joined together in one column and the phone values joined together in another, as below:
id sprvsr phone name
2 123-128 5232-5458 ali
3 145-125 7845-4785 oya
Edit:
I have one more question related to this one.
I followed the way you showed me and it works. Thank you! Another problem is, for example:
sprvsr name
5232-5458 ali
5232-5458 ali
5458-5232 ali
Is there any way that I can put them in the same order?
If you need the values in the same order, you'll need to use a temporary array and sort it. This requires having some idea of how many rows you might have per id, and it also requires the input data to be sorted by id. This is a bit more complicated than the previous solution (in a previous revision).
data have;
input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
4 128 5458 ali
4 123 5232 ali
;
run;
data want;
array phones[99] $8 _temporary_; *initialize these two to some reasonably high number;
array sprvsrs[99] $3 _temporary_;
length phone_all sprvsr_all $200; *same;
set have;
by id;
if first.id then do; *for each id, start out clearing the arrays;
call missing(of phones[*] sprvsrs[*]);
_counter=0;
end;
_counter+1; *increment counter;
phones[_counter]=phone; *assign current phone/sprvsr to array elements;
sprvsrs[_counter]=sprvsr;
if last.id then do; *now, create concatenated list and output;
call sortc(of phones[*]); *sort the lists;
call sortc(of sprvsrs[*]);
phone_all = catx('-',of phones[*]); *concatenate them together;
sprvsr_all= catx('-',of sprvsrs[*]);
output;
end;
drop phone sprvsr _counter;
rename
phone_all=phone
sprvsr_all=sprvsr;
run;
The construction array[*] means "all elements of that array". So catx('-',of phones[*]) means put all phones elements in the catx (fortunately, missing ones are ignored by catx).
This is a way to do that:
data have;
input id sprvsr $ phone $ name $;
datalines;
2 123 5232 ali
2 128 5458 ali
3 145 7845 oya
3 125 4785 oya
;
run;
data want (drop=lag_sprvsr lag_phone);
format id;
length sprvsr $7 phone $9;
set have;
by id;
lag_sprvsr=lag(sprvsr);
lag_phone=lag(phone);
if lag(id)=id then do;
sprvsr=catx('-',lag_sprvsr,sprvsr);
phone=catx('-',lag_phone,phone);
end;
if last.id then output;
run;
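Just pay attention to the possible lengths of the input variables and of the concatenated string. The input dataset must be sorted by id, and because only the immediately preceding row is carried forward, this works when each id has at most two rows.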
The catx() function removes the leading and trailing blanks and concatenates with a delimiter.