I have the variable below:
mhstdtc
-----------
2011-01-01
2015-02-01
2002
2001
2003-03
2003-12
Here is the code I used to convert the variable:
ASTDTMC=INPUT(MHSTDTC,is8601da.);
PUT ASTDTMC DATE9.;
It works only when the variable has yyyy-mm-dd values; the remaining values are returned as missing. Please help me convert the yyyy and yyyy-mm values as well.
Thanks in advance.
One way is to use SUBSTR on the left side of an assignment to overlay each partial date onto a mask that supplies default month and day values:
data _null_;
   input iso :$10.;
   mask = '....-06-15';                /* default month and day for partial dates  */
   substr(mask,1,length(iso)) = iso;   /* overlay the known part onto the mask     */
   ASTDTM = input(mask, is8601da.);
   format astdtm date9.;
   put 'NOTE: ' (_all_)(=);
cards;
2011-01-01
2015-02-01
2002
2001
2003-03
2003-12
;
run;
The log shows:
NOTE: iso=2011-01-01 mask=2011-01-01 ASTDTM=01JAN2011
NOTE: iso=2015-02-01 mask=2015-02-01 ASTDTM=01FEB2015
NOTE: iso=2002 mask=2002-06-15 ASTDTM=15JUN2002
NOTE: iso=2001 mask=2001-06-15 ASTDTM=15JUN2001
NOTE: iso=2003-03 mask=2003-03-15 ASTDTM=15MAR2003
NOTE: iso=2003-12 mask=2003-12-15 ASTDTM=15DEC2003
If your original dates are character values, append "-01-01" to each one and then do the conversion; because the yymmdd10. informat reads only the first 10 characters, the extra text appended to already-complete dates is ignored. If your dates are not character, convert them to character first, then try this code:
data have ;
input mhstdtc $10.;
cards;
2011-01-01
2015-02-01
2002
2001
2003-03
2003-12
;
data want;
set have;
ASTDTMC2 = cats(MHSTDTC, "-01-01");    /* append a default month and day */
ASTDTMC3 = input(ASTDTMC2, yymmdd10.); /* the informat reads only the first 10 characters */
run;
The result is in ASTDTMC3. The values below are unformatted SAS date values (days since 01JAN1960); see the sketch after the list for displaying them as dates.
18628
20120
15341
14976
15765
16040
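To display the values as calendar dates, apply a date format; a minimal addition, assuming the want data set and ASTDTMC3 variable from the step above:
proc print data=want;
   var MHSTDTC ASTDTMC3;
   format ASTDTMC3 date9.;   /* shows 01JAN2011 rather than 18628 */
run;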
Related
I am pretty new to SAS. I am trying to see which songs/artists/albums have appeared most in my Spotify most-played CSVs (2017-2020). I am getting stuck very early, just trying to read the 2017 CSV into a data set. Can anyone see what I am doing wrong? It seems like this step should be pretty straightforward.
data Spotify_2017;
infile='C:\Users\your_top_songs_2017.csv' dlm=’09’x dsd firstobs=2;
input Track URI Track Name Artist URI Artist Name Album URI Album Name Album Release Date Disc Number Track Number Track Duration Explicit Popularity Added By Added At;
run;
and here is the log:
1 The SAS System 10:18 Friday, January 15, 2021
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Spotify.sas';
4 %LET _CLIENTPROCESSFLOWNAME='Standalone Not In Project';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='C:\Users\xxx\Desktop\Spotify\Spotify.sas';
9 %LET _SASPROGRAMFILEHOST='USRDUL-PC0NNXU1';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=SVG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 %macro HTML5AccessibleGraphSupported;
15 %if %_SAS_VERCOMP_FV(9,4,4, 0,0,0) >= 0 %then ACCESSIBLE_GRAPH;
16 %mend;
17 FILENAME EGHTML TEMP;
18 ODS HTML5(ID=EGHTML) FILE=EGHTML
19 OPTIONS(BITMAP_MODE='INLINE')
20 %HTML5AccessibleGraphSupported
21 ENCODING='utf-8'
22 STYLE=HtmlBlue
23 NOGTITLE
24 NOGFOOTNOTE
25 GPATH=&sasworklocation
26 ;
NOTE: Writing HTML5(EGHTML) Body file: EGHTML
27
28 data Spotify_2017;
29 infile='C:\Users\pcardella\Desktop\Spotify\C:\Users\xxx\Desktop\Spotify\your_top_songs_2017.csv' dlm=’09’x dsd
___
388
76
29 ! firstobs=2;
ERROR 388-185: Expecting an arithmetic operator.
ERROR 76-322: Syntax error, statement will be ignored.
30 input Track URI Track Name Artist URI Artist Name Album URI Album Name Album Release Date Disc Number Track Number Track
30 ! Duration Explicit Popularity Added By Added At;
31 run;
ERROR: No DATALINES or INFILE statement.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.SPOTIFY_2017 may be incomplete. When this step was stopped there were 0 observations and 16 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
32
33 %LET _CLIENTTASKLABEL=;
34 %LET _CLIENTPROCESSFLOWNAME=;
35 %LET _CLIENTPROJECTPATH=;
36 %LET _CLIENTPROJECTPATHHOST=;
37 %LET _CLIENTPROJECTNAME=;
38 %LET _SASPROGRAMFILE=;
39 %LET _SASPROGRAMFILEHOST=;
2 The SAS System 10:18 Friday, January 15, 2021
40
41 ;*';*";*/;quit;run;
42 ODS _ALL_ CLOSE;
43
44
45 QUIT; RUN;
46
infile is a statement, not a variable assignment, so it does not take an equals sign. Also note that the quotes around '09'x must be straight quotes, not the curly quotes shown in your code. The syntax is:
infile 'file location here' <options>;
data Spotify_2017;
infile 'C:\Users\your_top_songs_2017.csv' dlm='09'x dsd firstobs=2;
input Track URI Track Name Artist URI Artist Name Album URI Album Name Album Release Date Disc Number Track Number Track Duration Explicit Popularity Added By Added At;
run;
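Note that the INPUT statement still reads every space-separated word as a separate variable, because SAS variable names cannot contain spaces. Here is a hedged sketch with underscore-separated names; the character lengths are guesses rather than values taken from the actual file, and the tab delimiter is carried over from your original code:
data Spotify_2017;
   infile 'C:\Users\your_top_songs_2017.csv' dlm='09'x dsd firstobs=2 truncover;
   input Track_URI :$60. Track_Name :$200. Artist_URI :$60. Artist_Name :$200.
         Album_URI :$60. Album_Name :$200. Album_Release_Date :$10.
         Disc_Number Track_Number Track_Duration Explicit :$5.
         Popularity Added_By :$60. Added_At :$30.;
run;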
One way to learn how to read raw files with a data step is to start with proc import. When importing CSV files, proc import writes the data step code it generates to the log; you can study that code to see how it works and try to replicate it.
proc import
   datafile = 'C:\Users\your_top_songs_2017.csv'
   out      = spotify_2017
   dbms     = csv
   replace;
run;
Also, a great option to help make logs more readable in Enterprise Guide is to disable autogenerated code. Go to Tools -> Options -> Results -> General -> uncheck "Show generated wrapper code in SAS log"
In SAS, how can I read the following dates, which come in different formats (especially 01/05/2018 and 1/6/2018)?
01/05/2018
1/6/2018
Jan 05 2018
Jan 6 2018
Any help is greatly appreciated. Thanks!
The ANYDTDTM informat will parse most varieties of human-readable date, time, or datetime representations into a SAS datetime value; the DATEPART function then returns the corresponding SAS date value.
The ANYDTDTE informat will also parse a variety of date, time, or datetime representations and return the date part directly. However, it fails on some of your data items where ANYDTDTM does not.
data _null_;
input
#1 a_datetime_value anydtdtm.
#1 a_date_value anydtdte.
;
hot_date = datepart(a_datetime_value);
put
'_infile_ ' _infile_
/ 'anydtdtm. ' a_datetime_value datetime16.
/ 'datepart() ' hot_date yymmdd10.
/ 'anydtdte. ' a_date_value yymmdd10.
/;
datalines;
01/05/2018
1/6/2018
Jan 05 2018
Jan 6 2018
run;
==== LOG ====
_infile_ 01/05/2018
anydtdtm. 05JAN18:00:00:00
datepart() 2018-01-05
anydtdte. .
_infile_ 1/6/2018
anydtdtm. 06JAN18:00:00:00
datepart() 2018-01-06
anydtdte. 2018-01-06
_infile_ Jan 05 2018
anydtdtm. 05JAN18:00:00:00
datepart() 2018-01-05
anydtdte. .
_infile_ Jan 6 2018
anydtdtm. 06JAN18:00:00:00
datepart() 2018-01-06
anydtdte. .
Read the SAS documentation and conference papers for a greater exploration of the ANYDT** family of informats.
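One related caveat: when a numeric date is ambiguous, such as 01/05/2018, the ANYDT informats resolve it according to the DATESTYLE= system option, which defaults to the session locale's convention (month-day-year for US English). A minimal sketch of making that explicit, reusing the test values above:
/* Tell the ANYDT informats how to resolve ambiguous numeric dates. */
options datestyle=mdy;   /* use DMY or YMD if your data is day-first or year-first */

data dates;
   input raw_text $20.;
   date = datepart(input(raw_text, anydtdtm19.));  /* datetime value -> date value */
   format date yymmdd10.;
datalines;
01/05/2018
1/6/2018
Jan 05 2018
Jan 6 2018
;
run;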
Problem Statement: I have a text file and I want to read it using the SAS INFILE statement, but SAS is not giving me the output I expect.
Text File:
Akash     18 19 20
Tejas        20 16
Shashank  16 20
Meera     18    20
The Code that I have tried:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover;
INPUT Name$ SAS DS R;
RUN;
PROC PRINT DATA=arr;
RUN;
The result I got is:
Obs Name SAS DS R
1 Akash 18 19 20
2 Tejas 20 16 .
3 Shashank 16 20 .
4 Meera 18 20 .
This is not what I expect. What is wrong with the code? I need to read the file in SAS so that the marks keep the same positions as in the text file. Please help.
Expected result:
Obs Name SAS DS R
1 Akash 18 19 20
2 Tejas . 20 16
3 Shashank 16 20 .
4 Meera 18 . 20
Thanks in advance.
If that text file is tab-delimited, you should specify the delimiter in the infile statement and use the dsd option to account for missing values:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover dlm='09'x dsd;
INPUT Name $ SAS DS R;
RUN;
PROC PRINT DATA=arr;
RUN;
EDIT: after editing, your sample text file now looks fixed-width rather than space-delimited. In that case you should be using column input:
DATA Arr;
INFILE "/folders/myfolders/personal/SAS_Array .txt" missover;
INPUT Name $1-9 SAS 10-12 DS 13-15 R 16-18;
RUN;
example with datalines:
DATA Arr;
INFILE datalines missover;
INPUT Name $1-9 SAS 10-12 DS 13-15 R 16-18;
datalines;
Akash     18 19 20
Tejas        20 16
Shashank  16 20
Meera     18    20
;
RUN;
I need to delete duplicates from a data set. My issue is that once I sort the data and flag the duplicates (using the LAG function), some of the information is present only in the duplicate observation and some only in the original observation. I need to retain the information from all variables while also deleting the duplicates.
My thought was to first fill in all the information between both the original and duplicate before deleting the duplicate.
Example of observations after sorting data and flagging duplicates (fake data values):
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
What I want:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
AB 36 1980 45654 . 2135 1
ON 26 1990 . 35464 8868 0
ON 26 1990 . 35464 8868 1
So I can delete duplicates and eventually have this:
Province AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate
AB 36 1980 45654 . 2135 0
ON 26 1990 . 35464 8868 0
I created lag and lead variables to try to fill in the information, but it only seems to work for part of the data set.
Here is the code for the lead variables:
data uncleaned_data;
merge uncleaned_data
uncleaned_data(
firstobs=2
keep= TRANS_ID MORB_ID Varx
rename=(TRANS_ID=lead_TRANS_ID MORB_ID=lead_MORB_ID Varx=lead_Varx ));
if lag(flag_duplicate=1) then do;
if TRANS_ID=. then do;
TRANS_ID= lead_TRANS_ID;
end;
if MORB_ID=. then do;
MORB_ID= lead_MORB_ID;
end;
if Varx=. then do;
Varx= lead_Varx;
end;
end;
run;
I did the same kind of thing for lag variables except my initial if statement is 'if flag_duplicate=1 then do;'
This method does not seem to work for many duplicate pairs in my data set.
Is there a better way to approach my problem overall, possibly through PROC SQL?
Thanks for reading and any advice offered!
I'm assuming that you don't have conflicting values of Trans_id, for example, for the same Province. If that is the case, then you can flatten the original data in one pass using an UPDATE statement with a BY statement. In the code below, the first reference to the data set, with obs=0, just creates the variables; the second reference supplies the values; and the BY statement ensures that only one row is output per Province.
Using this method means you don't need to identify the duplicate values beforehand.
data have;
input Province $ AGE BRTHYEAR Trans_id Morb_id VarX flag_duplicate;
datalines;
AB 36 1980 45654 . . 0
AB 36 1980 . . 2135 1
ON 26 1990 . . 8868 0
ON 26 1990 . 35464 8868 1
;
run;
data want;
update have(obs=0)  /* master: zero observations, just defines the variables      */
       have;        /* transactions: non-missing values fill in each BY group     */
by province;
run;
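Since the question also asks about PROC SQL, here is a hedged sketch of an equivalent approach under the same assumption of no conflicting non-missing values within a group. MAX() ignores missing values, so each group collapses to a single row (the flag_duplicate column is no longer needed):
proc sql;
   create table want_sql as
   select Province,
          AGE,
          BRTHYEAR,
          max(Trans_id) as Trans_id,  /* picks the non-missing value, if any */
          max(Morb_id)  as Morb_id,
          max(VarX)     as VarX
   from have
   group by Province, AGE, BRTHYEAR;
quit;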
Something like this should work...
proc sort data=uncleaned_data; by Province AGE BRTHYEAR; run;
data cleaned_data (DROP=TRANS_ID RENAME=(KEEP_TRANS_ID=TRANS_ID) ...);
set uncleaned_data;
by Province AGE BRTHYEAR;
retain keep_TRANS_ID ...;  /* carry the captured values forward within each BY group */
if first.BRTHYEAR then do;
keep_TRANS_ID=TRANS_ID;
...
end;
else do;
if keep_TRANS_ID=. then keep_TRANS_ID=TRANS_ID;
...
end;
if last.BRTHYEAR then output;
run;
I have the following data set:
Date jobboardid Sales
Jan05 3 256
Jan05 6 70
Jan05 54 90
Feb05 32 456
Feb05 11 89
Feb05 16 876
March05
April05
.
.
.
Jan06 6 678
Jan06 54 87
Jan06 13 56
Feb06 McDonald 67
Feb06 11 281
Feb06 16 876
March06
April06
.
.
.
Jan07 6 567
Jan07 54 76
Jan07 34 87
Feb07 10 678
Feb07 11 765
Feb07 16 67
March07
April06
I am trying to calculate a 12-month growth rate for the Sales column whenever the jobboardid column has the same value 12 months apart. I have the following code:
data Want;
set Have;
by Date jobboardid;
format From Till monyy7.;
from = lag12(Date);
oldsales = lag12(sales);
if lag12 (jobboardid) EQ jobboardid
and INTCK('month', from, Date) EQ 12 then do;
till = Date;
rate = (sales - oldsales) / oldsales;
output;
end;
run;
However I keep getting the following error message:
Note: Missing values were created as a result of performing operation on missing values.
But when I checked my dataset, there aren't any missing values. What's the problem?
Note: my Date column is in monyy7. format; jobboardid is numeric and so is Sales.
The NOTE is being thrown by the INTCK() function. When you write from=lag12(Date), the first 12 records have a missing value for from, and INTCK('month', from, Date) then throws the NOTE. Even though INTCK is not used in an assignment statement, it still throws the NOTE because one of its arguments is missing. Below is an example; the log reports that missing values were created 12 times because LAG12 was used.
77 data have;
78 do Date=1 to 20;
79 output;
80 end;
81 run;
NOTE: The data set WORK.HAVE has 20 observations and 1 variables.
82 data want;
83 set have;
84 from=lag12(Date);
85 if intck('month',from,today())=. then put 'Missing: ' (_n_ Date)(=);
86 else put 'Not Missing: ' (_n_ Date)(=);
87 run;
Missing: _N_=1 Date=1
Missing: _N_=2 Date=2
Missing: _N_=3 Date=3
Missing: _N_=4 Date=4
Missing: _N_=5 Date=5
Missing: _N_=6 Date=6
Missing: _N_=7 Date=7
Missing: _N_=8 Date=8
Missing: _N_=9 Date=9
Missing: _N_=10 Date=10
Missing: _N_=11 Date=11
Missing: _N_=12 Date=12
Not Missing: _N_=13 Date=13
Not Missing: _N_=14 Date=14
Not Missing: _N_=15 Date=15
Not Missing: _N_=16 Date=16
Not Missing: _N_=17 Date=17
Not Missing: _N_=18 Date=18
Not Missing: _N_=19 Date=19
Not Missing: _N_=20 Date=20
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
12 at 85:6
NOTE: There were 20 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 20 observations and 2 variables.
One way to avoid the problem would be to add another DO block, something like this (untested):
if lag12(jobboardid) EQ jobboardid and _n_ > 12 then do;
   if INTCK('month', from, Date) EQ 12 then do;
      till = Date;
      rate = (sales - oldsales) / oldsales;
      output;
   end;
end;
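An equivalent hedged variation guards on from itself instead of the observation counter (from and oldsales are assigned earlier in the step, as in the original code), which also covers the case where Date itself is ever missing:
if lag12(jobboardid) EQ jobboardid and not missing(from) then do;
   if intck('month', from, Date) EQ 12 then do;
      till = Date;
      rate = (sales - oldsales) / oldsales;
      output;
   end;
end;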