I have a large data file with data in the following format: country, datatype, year1month1 to year2018month7.
Reading the data using proc import did not work for all data fields. I ended up modifying the SAS datastep code to ensure data format was correct.
However, I am having trouble simplifying the code. I would like a do loop to go through all the years and months. That way I could use the current date to work out the range of dates in the file, and the code that creates each Year/Month variable would not have to be repeated 100 times.
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
informat _(i)M(j) best32.;
end;
end;
informat Base_Year $1. ;
format Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
format _(i)M(j) best12.;
end;
end;
format Base_Year $1. ;
input
Country_Name $
do i = 1940 to 2018;
do j = 1 to 12;
_(i)M(j) $;
end;
end;
Base_Year $;
run;
There are a few approaches here that could work. The most directly translatable to your approach is to use the macro language.
You need to translate those two loops to something like this:
%do i = 1940 %to 2018;
%do j = 1 %to 12;
informat _&i.M&j. best32.;
%end;
%end;
Notice the % there. This also has to be in a macro; you can't do this in normal datastep code.
I would rewrite it to use a macro like so:
%macro make_ym(startyear=, endyear=, separator=);
%local i j;
%do i = &startyear. %to &endyear.;
%do j = 1 %to 12;
_&i.&separator.&j.
%end;
%end;
%mend make_ym;
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
informat %make_ym(startyear=1940,endyear=2018,separator=M) best32.;
informat Base_Year $1. ;
format %make_ym(startyear=1940,endyear=2018,separator=M) best12.;
format Base_Year $1. ;
input
Country_Name $
%make_ym(startyear=1940,endyear=2018,separator=M)
Base_Year $;
run;
I took out the $ after the yMm bits in the input since you declared them as numeric.
Don't model your data step after the code generated by PROC IMPORT. It does a lot of useless things, like attaching formats and informats to variables that don't need them.
For your problem you just need a simple program like this:
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. Y1940M01 .... Y2018M08 Base_Year :$1. ;
run;
Now the only tricky part is building that list of numeric variables. If the list is small enough you can just put it into a macro variable, and that is not a problem in this case. With 8-character names (YyyyyMmm) plus one separating blank, each name takes 9 bytes and each year takes 108 bytes, so a data step character variable (maximum 32,767 bytes) has room for over 300 years' worth. A variable of length 10,800 bytes has room for 100 years of month names.
So just run this data step first.
data _null_;
length names $10800 ;
basedate = mdy(1,1,1940);
lastdate = today();
do i=0 to intck('month',basedate,lastdate);
date=intnx('month',basedate,i);
names=catx(' ',names,cats('Y',year(date),'M',put(month(date),Z2.)));
end;
call symputx('names',names);
run;
Now you can use the macro variable in your INPUT statement.
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. &names Base_Year :$1. ;
run;
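One thing to watch with today(): if the file is not refreshed every month, the list can contain more names than the file has monthly columns, and the Base_Year field would then be read into a month variable. A minimal variant of the step above, assuming the last monthly column is July 2018 as described in the question; the macro variables first_ym, last_ym and n_names are hypothetical names introduced here for illustration:

```sas
/* Hypothetical settings: first and last month present in the file */
%let first_ym = 01JAN1940;
%let last_ym  = 01JUL2018;

data _null_;
  length names $10800;
  basedate = "&first_ym"d;
  lastdate = "&last_ym"d;
  do i = 0 to intck('month', basedate, lastdate);
    date = intnx('month', basedate, i);
    names = catx(' ', names, cats('Y', year(date), 'M', put(month(date), z2.)));
  end;
  call symputx('names', names);
  /* keep the count so it can be compared with the file's column count */
  call symputx('n_names', countw(names, ' '));
run;

/* Write the count and the first and last generated names to the log */
%put NOTE: n_names=&n_names first=%scan(&names, 1, %str( )) last=%scan(&names, -1, %str( ));
```

Comparing n_names and the last name against the header row of the CSV is a quick way to confirm that the INPUT statement lines up with the file.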
I have a SAS dataset where I keep 50 diagnoses codes and 50 diagnoses descriptions.
It looks something like this:
data diags;
set diag_list;
keep claim_id diagcode1-diagcode50 diagdesc1-diagdesc50;
run;
I need to print all of the variables but I need diagnosis description right next to corresponding diagnosis code. Something like this:
proc print data=diags;
var claim_id diagcode1 diagdesc1 diagcode2 diagdesc2 diagcode3 diagdesc3; *(and so on all the way to 50);
run;
Is there a way to do this (possibly using arrays) without having to type it all up?
Here's one approach then, using Macros. If you have other variables make sure to include them BEFORE the %loop_names(n=50) portion in the VAR statement.
*generate fake data to test/run solution;
data demo;
array diag(50);
array diagdesc(50);
do claim_id=1 to 100;
do i=1 to 50;
diag(i)=rand('normal');
diagdesc(i)=rand('uniform');
end;
output;
end;
run;
%macro loop_names(n=);
%local i;
%do i=1 %to &n;
diag&i diagdesc&i.
%end;
%mend;
proc print data=demo;
var claim_ID %loop_names(n=50);
run;
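If you would rather not define a macro, the same interleaved list can be built in a data step and stored in a macro variable. This is an alternative technique, not part of the answer above; it is only a sketch that assumes the demo variable names diag1-diag50 and diagdesc1-diagdesc50, and the macro variable name vars is a hypothetical choice:

```sas
data _null_;
  length vars $2000;  /* 50 pairs need roughly 900 bytes */
  do i = 1 to 50;
    /* append "diagN diagdescN" so each description follows its code */
    vars = catx(' ', vars, cats('diag', i), cats('diagdesc', i));
  end;
  call symputx('vars', vars);
run;

/* Usage, assuming the demo data set created above exists:
   proc print data=demo;
     var claim_id &vars;
   run;
*/
```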
Here is some example SAS code that uses actual ICD-10-CM codes and their descriptions, together with @Reeza's PROC PRINT approach:
%* Copy government provided Medicare code data zip file to local computer;
filename cms_cm url 'https://www.cms.gov/Medicare/Coding/ICD10/Downloads/2020-ICD-10-CM-Codes.zip' recfm=s;
filename zip_cm "%sysfunc(pathname(work))/2020-ICD-10-CM-Codes.zip" lrecl=200000000 recfm=n ;
%let rc = %sysfunc(fcopy(cms_cm, zip_cm));
%put %sysfunc(sysmsg());
%* Define fileref to the zip file member that contains ICD 10 CM codes and descriptions;
filename cm_codes zip "%sysfunc(pathname(zip_cm))" member="2020 Code Descriptions/icd10cm_codes_2020.txt";
%* input the codes and descriptions, there are 72,184 of them;
%* I cheated and looked at the data (more than once) in order
%* to determine the variable sizes needed;
data icd10cm_2020;
infile cm_codes lrecl=250 truncover;
attrib
code length=$7
desc length=$230
;
input
code 1-7 desc 9-230;
;
run;
* simulate claims sample data with mostly up to 8 diagnoses, and
* at least one claim with 50 diagnoses;
data have;
call streaminit(123);
do claim_id = 1 to 10;
array codes(50) $7 code1-code50;
array descs(50) $230 desc1-desc50;
call missing(of code:, of desc:);
if mod(claim_id, 10) = 0
then top = 50;
else top = ceil(8 * rand('uniform')); %* between 1 and 8 diagnoses;
do _n_ = 1 to top;
p = ceil(n * rand('uniform')); %* pick a random diagnosis code, 1 of 72,184;
set icd10cm_2020 nobs=n point=p; %* read the data for that random code;
codes(_n_) = code;
descs(_n_) = desc;
end;
output;
end;
stop;
drop top;
run;
%macro loop_names(n=);
%do i=1 %to &n;
code&i desc&i.
%end;
%mend;
ods _all_ close;
ods html;
proc print data=have;
var claim_id %loop_names(n=50);
run;
I need the macro to have one parameter which is the path of the text files.
This is what I have now. I have 6 text files named NY2012.txt, NY2013.txt, NY2014.txt, and so on.
%macro data ;
%let list= 2012 2013 2014 2015 2016 2017;
data allData;
delete; (what should I put here?)
run;
%do i=1 %to 6;
%let currItem = %scan(&list, &i);
filename f&list "/folders/myfolders/NY&list.txt"; (should this be here?)
data currentData;
x = &currItem;
infile f&list truncover;
input value $ 1-20;
retain year;
if _n_=1 then year=year_1;
run;
* Keeping adding things to the cumulative data set;
data allData;
set allData currentData;
run;
%end;
%mend data;
%data;
Then I should end up with a dataset for each year and one large dataset that included all the years. How should I fix this? Thank you.
Loop over the elements of the list held in the macro variable, and inside the loop work with the element you scanned out (&currItem), not with the whole list (&list). The code would look like this:
%let list=2017 2018 2019;
%macro data(tlist);
%local i currItem;
/* remove any earlier allData so that reruns do not append duplicates */
proc datasets lib=work nolist nowarn;
delete allData;
quit;
%do i=1 %to %sysfunc(countw(&tlist,%str( )));
%let currItem = %scan(&tlist, &i, %str( ));
filename f&currItem. "/folders/myfolders/NY&currItem..txt";
data currentData&currItem;
infile f&currItem truncover;
input value $ 1-20;
year = &currItem;
run;
/* keep adding to the cumulative data set; PROC APPEND creates it on the first pass */
proc append base=allData data=currentData&currItem;
run;
filename f&currItem. clear;
%end;
%mend data;
%data(&list);
A simple wildcard on the infile will suffice.
In this case ???? means any 4 characters. You could also use NY*.txt.
data allData ;
length _f f $256. ; /* temporary & permanent variables to hold the filename being read */
infile "/folders/myfolders/NY????.txt" truncover filename=_f ;
input value $ 1-20 ;
f = _f ;
/* derive the year from the filename */
/* compress(var,,'kd') means Keep Digits */
year = input(compress(scan(f,-1,'/'),,'kd'),8.) ;
run ;
I am trying to run this code
data swati;
input facility_id$ loan_desc : $50. sys_name :$50.;
cards;
fac_001 term_loan RM_platform
fac_001 business_loan IQ_platform
fac_002 business_loan BUSES_termloan
fac_002 business_loan RM_platform
fac_003 overdrafts RM_platform
fac_003 RCF IQ_platform
fac_003 term_loan BUSES_termloan
;
proc contents data=swati out=contents(keep=name varnum);
run;
proc sort data=contents;
by varnum;
run;
data contents;
set contents ;
where varnum in (2,3);
run;
data contents;
set contents;
summary=catx('_',name, 'summ');
run;
data _null_;
set contents;
call symput ("name" || put(_n_ , 10. -L), name);
call symput ("summ" || put (_n_ , 10. -L), summary);
run;
options mlogic symbolgen mprint;
%macro swati;
%do i = 1 %to 2;
proc sort data=swati;
by facility_id &&name&i.;
run;
data swati1;
set swati;
by facility_id &&name&i.;
length &&summ&i. $50.;
retain &&summ&i.;
if first.facility_id then do;
&&summ&i.="";
end;
if first.&&name&i. = last.&&name&i. then &&summ&i.=catx(',',&&name&i., &&summ&i.);
else if first.&&name&i. ne last.&&name&i. then &&summ&i.=&&name&i.;
run;
if last.facility_id ;
%end;
%mend;
%swati;
This code should create two new variables, loan_desc_summ and sys_name_summ, which hold all of a facility's loan_desc values on one line and all of its sys_name values on one line, separated by commas, for example (term_loan, business_loan) and (RM_platform, IQ_platform). If a facility has the same loan_desc twice, loan_desc_summ should contain that value only once.
The problem is that after running the do loop I get a dataset with only sys_name_summ and not loan_desc_summ. I want the dataset to have all five variables: facility_id, loan_desc, sys_name, loan_desc_summ and sys_name_summ.
Could you please help me find out whether there is a problem in the do loop?
Your loop is always starting with the same input dataset (swati) and generating a new dataset (SWATI1). So only the last time through the loop has any effect. Each loop would need to start with the output of the previous run.
You also need to fix your logic for eliminating the duplicates.
For example you could change the macro to:
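%macro swati;
%local i;
data swati1;
set swati;
run;
%do i = 1 %to 2;
proc sort data=swati1;
by facility_id &&name&i;
run;
data swati1;
length &&summ&i $500;
/* first pass over the facility: build the list of distinct values */
do until (last.facility_id);
set swati1;
by facility_id &&name&i;
if first.&&name&i then &&summ&i = catx(',', &&summ&i, &&name&i);
end;
/* second pass: write every row of the facility with the finished list */
do until (last.facility_id);
set swati1;
by facility_id &&name&i;
output;
end;
run;
%end;
/* keep one row per facility once every summary has been built */
data swati1;
set swati1;
by facility_id;
if last.facility_id;
run;
%mend;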
Also your program could be a lot smaller if you just used arrays.
data want ;
set have ;
by facility_id ;
array one loan_desc sys_name ;
array two $500 loan_desc_summ sys_name_summ ;
retain loan_desc_summ sys_name_summ ;
do i=1 to dim(one);
if first.facility_id then two(i)=one(i) ;
else if not findw(two(i),one(i),',','t') then two(i)=catx(',',two(i),one(i));
end;
if last.facility_id;
drop i loan_desc sys_name ;
run;
If you want to make it more flexible you can put the list of variable names into a macro variable.
%let varlist=loan_desc sys_name;
You could then generate the list of new names easily.
%let varlist2=%sysfunc(tranwrd(&varlist,%str( ),_summ%str( )))_summ ;
Then you can use the macro variables in the ARRAY, RETAIN and DROP statements.
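Putting those pieces together, a minimal sketch of the array step driven by the two macro variables might look like this. The small have data set is an assumption made here so the example runs on its own; it reuses four rows from the question and is already sorted by facility_id:

```sas
%let varlist=loan_desc sys_name;
%let varlist2=%sysfunc(tranwrd(&varlist,%str( ),_summ%str( )))_summ;

/* Sample rows taken from the question, already in facility_id order */
data have;
  input facility_id $ loan_desc :$50. sys_name :$50.;
  datalines;
fac_001 term_loan RM_platform
fac_001 business_loan IQ_platform
fac_002 business_loan BUSES_termloan
fac_002 business_loan RM_platform
;

data want;
  set have;
  by facility_id;
  array one &varlist;          /* source variables */
  array two $500 &varlist2;    /* matching summary variables */
  retain &varlist2;
  do i = 1 to dim(one);
    if first.facility_id then two(i) = one(i);
    /* append only values that are not already in the list */
    else if not findw(two(i), one(i), ',', 't') then two(i) = catx(',', two(i), one(i));
  end;
  if last.facility_id;
  drop i &varlist;
run;
```

With this sample, fac_002 ends up with loan_desc_summ equal to business_loan (listed once) and sys_name_summ equal to BUSES_termloan,RM_platform. Adding a third variable then only means extending varlist.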
I found some code on the obseveupdate website that is used for IV (information value) calculation. The code runs without errors, but every IV and WOE (weight of evidence) value is zero. I tried it on another data set and also got zeros for all variables. Could you help me figure out why?
data inputdata;
length Region $ 20 age $ 20 Gender $ 20;
infile datalines dsd dlm= ':' truncover;
input Region $ age $ Gender $ target ;
datalines;
Scotland:18-25:Male:1
Scotland:18-25:Female:0
Scotland:26-35:Male:0
Wales:26-35:Male:1
Wales:36-45:Female:0
Wales:26-35:Male:1
London:36-45:Male:1
London:26-35:Male:0
London:18-25:Unknown:1
London:36-45:Male:0
Northern Ireland:36-45:Female:0
Northern Ireland:26-35:Male:1
Northern Ireland:36-45:Male:0
Engand (Not London):45+:Female:0
Engand (Not London):18-25:Male:1
Engand (Not London):26-35:Female:0
Engand (Not London):45+:Female:0
Engand (Not London):36-45:Female:1
Engand (Not London):45+:Female:1
;
data _tempdata;
set inputdata;;
n=_n_;
run;
proc sort data=_tempdata;
by target n;
run;
proc transpose data=_tempdata out = _tempdata;
by target n;
var _character_ _numeric_;
run;
proc sort data=_tempdata out=_tempdata;
by _name_ target;
run;
proc freq data=_tempdata;
by _name_ target;
tables col1 /out=_tempdata;
run;
proc sort data=_tempdata;
by _name_ col1;
run;
proc transpose data=_tempdata out=_tempdata;
by _name_ col1;
id target;
var percent;
run;
data IV_Table(keep=variable IV) WOE_Table(keep=variable attribute woe);
set _tempdata;
by _name_;
rename col1=attribute _name_=variable;
_0=sum(_0,0)/100; *Convert to percent and convert null to zero;
_1=sum(_1,0)/100; *Convert to percent and convert null to zero;
woe=log(_0/_1)*100;output WOE_Table;*Output WOE;
if _1 ne 0 and _0 ne 0 then do;
raw=(_0-_1)*log(_0/_1);
end;
else raw=0;
IV+sum(raw,0);*Culmulativly add to IV, set null to zero;
if last._name_ then do; *only _tempdata the last final row;
output IV_table;
IV=0;
end;
where upcase(_name_) ^='TARGET' and upcase(_name_) ^= 'N';run;
proc sort data=IV_table;by descending IV;run;
title1 "IV Listing";proc print data=IV_table;run;
proc sort data=woe_table;
by variable WOE;
run;
title1 "WOE Listing";
proc print data=WOE_Table;run;
I am trying to parse a delimited dataset with over 300 fields. Instead of listing all the input fields like
data test;
infile "delimited_filename.txt"
DSD delimiter="|" lrecl=32767 STOPOVER;
input field_A:$200.
field_B :$200.
field_C:$200.
/*continues on */
;
I am thinking I can dump all the field names into a file, read in as a sas dataset, and populate the input fields - this also gives me the dynamic control if any of the field names changes (add/remove) in the dataset. What would be some good ways to accomplish this?
Thank you very much - I just started sas, still trying to wrap my head around it.
This worked for me: basically, "write" the data step code using the macro language and then run it.
Note: my indata_header_file contains 5 columns: Variable_Name, Variable_Length, Variable_Type, Variable_Label, and Notes.
%macro ReadDsFromFile(filename_to_process, indata_header_file, out_dsname);
%local filename_to_process indata_header_file out_dsname;
/* This macro var contain code to read data file*/
%local read_code input_in_line;
%put *** Processing file: &filename_to_process ...;
/* Read in the header file */
proc import OUT = ds_header
DATAFILE = &indata_header_file.
DBMS = EXCEL REPLACE; /* REPLACE flag */
SHEET = "Names";
GETNAMES = YES;
MIXED = NO;
SCANTEXT = YES;
run;
%let id = %sysfunc(open(ds_header));
%let NOBS = %sysfunc(attrn(&id.,NOBS));
%syscall set(id);
/*
Generates:
data &out_dsname.;
infile "&filename_to_process."
DSD delimiter="|" lrecl=32767 STOPOVER FIRSTOBS=3;
input
'7C'x
*/
%let read_code = data &out_dsname. %str(;)
infile &filename_to_process.
DSD delimiter=%str("|") lrecl=32767 STOPOVER %str(;)
input ;
/*
Generates:
<field_name> : $<field_length>;
*/
%do i = 1 %to &NObs;
%let rc = %sysfunc(fetchobs(&id., &i));
%let VAR_NAME = %sysfunc(getvarc(&id., %sysfunc(varnum(&id., Variable_Name)) ));
%let VAR_LENGTH = %sysfunc(getvarn(&id., %sysfunc(varnum(&id., Variable_Length)) ));
%let VAR_TYPE = %sysfunc(getvarc(&id., %sysfunc(varnum(&id., Variable_Type)) ));
%let VAR_LABEL = %sysfunc(getvarc(&id., %sysfunc(varnum(&id., Variable_Label)) ));
%let VAR_NOTES = %sysfunc(getvarc(&id., %sysfunc(varnum(&id., Notes)) ));
%if %upcase(%trim(&VAR_TYPE.)) eq CHAR %then
%let input_in_line = &VAR_NAME :$&VAR_LENGTH..;
%else
%let input_in_line = &VAR_NAME :&VAR_LENGTH..;
/* append in_line statement to main macro var*/
%let read_code = &read_code. &input_in_line. ;
%end;
/* Close the fid */
%let rc = %sysfunc(close(&id));
%let read_code = &read_code. %str(;) run %str(;) ;
/* Run the generated code*/
&read_code.
%mend ReadDsFromFile;
Sounds like you want to generate code based on metadata. A data step is actually a lot easier to code and debug than a macro.
Let's assume you have metadata that describes the input data. For example, let's use the metadata about SASHELP.CARS. We can build our metadata from the existing DICTIONARY.COLUMNS metadata for that dataset. Let's set the INFORMAT to the FORMAT, since that table does not have INFORMAT values assigned.
proc sql noprint ;
create table varlist as
select memname,varnum,name,type,length,format,format as informat,label
from dictionary.columns
where libname='SASHELP' and memname='CARS'
order by varnum
;
quit;
Now let's make a sample text file with the data in it.
filename mydata temp;
data _null_;
set sashelp.cars ;
file mydata dsd ;
put (_all_) (:);
run;
Now we just need to use the metadata to write a program that could read that data. All we really need to do is define the variables and then add a simple INPUT firstvar -- lastvar statement to read the data.
filename code temp;
data _null_;
set varlist end=eof ;
by varnum ;
file code ;
if _n_=1 then do ;
firstvar=name ;
retain firstvar ;
put 'data ' memname ';'
/ ' infile mydata dsd truncover lrecl=1000000;'
;
end;
put ' attrib ' name 'length=' @;
if type = 'char' then put '$' @;
put length ;
if informat ne ' ' then put @10 informat= ;
if format ne ' ' then put @10 format= ;
if label ne ' ' then put @10 label= :$quote. ;
put ' ;' ;
if eof then do ;
put ' input ' firstvar '-- ' name ';' ;
put 'run;' ;
end;
run;
Now we can just run the generated code using %INCLUDE.
%include code / source2 ;
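The generated program relies on the INPUT firstvar -- lastvar form, which reads every variable between the two names in the order they were defined, using the types and informats already attached to them. A small standalone sketch of that behaviour, using a made-up data set and made-up variable names (demo, name, age, city) that are not part of the example above:

```sas
data demo;
  /* define the variables first; their order fixes what "--" expands to */
  length name $10 age 8 city $12;
  infile datalines dsd truncover;
  /* name -- city expands to name age city */
  input name -- city;
  datalines;
Ann,31,Leeds
Bob,45,York
;
run;
```

This is why the generated code only needs the ATTRIB statements followed by a single INPUT statement, however many variables the metadata describes.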