How do I use numeric functions to correct date typos? - sas

I know it's easy enough to do manual corrections on date typos, but I want to automate such corrections using one or more SAS functions, given that my dataset is large and typos are frequent.
For instance, it seems that whoever created the dataset I am cleaning often transposed digits in the year of someone's birthdate (e.g., '2102' rather than '2012', '2110' instead of '2010', etc.). I'm aware of string functions such as INDEX() that find certain characters or substrings and then allow those characters to be replaced in place (i.e., replace "ABCD" with "ABBB", regardless of the string's location in a value). Can the same process be replicated with numeric (and specifically date) values?

I don't think SAS has any functions that would check numeric values for digit patterns. I often do data cleaning and address this issue by making a character variable out of the numeric date variable, then using character functions and Perl regex to clean the character values, and then storing the cleaned values as numeric date.
For specifically date values, you could try using SAS date functions (e.g. DAY(), MONTH(), YEAR(), MDY(), etc.) to extract parts of the date value, error-check them, and put them all back together into a date value. This could be a good quick solution if you expect a limited set of typos and you roughly know what they are. For a more thorough error check, converting the numeric values to character and using char or regex functions would give you more options.
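A minimal sketch of that character round trip, assuming a numeric date variable named datevar and assuming the only typos are years written as 210x or 211x where 201x was meant. The dataset names, the sample rows, and the regex rule are illustrative assumptions, not a general fix:

```sas
/* Hypothetical sample data: two mistyped years and one valid date */
data have;
  input datevar :yymmdd10.;
  format datevar yymmdd10.;
  datalines;
2102-03-15
2110-07-04
2009-12-31
;
run;

data want;
  set have;
  length date_txt $10;
  /* 1. numeric date -> character, e.g. 2102-03-15 */
  date_txt = put(datevar, yymmdd10.);
  /* 2. assumed rule: a year starting 210 or 211 was meant to start 201 */
  date_txt = prxchange('s/^21[01](\d)/201$1/', 1, date_txt);
  /* 3. character -> numeric date; ?? suppresses notes if the result
        is not a valid date (e.g. 29 Feb moved to a non-leap year) */
  datevar_fixed = input(date_txt, ?? yymmdd10.);
  format datevar_fixed yymmdd10.;
  drop date_txt;
run;
```

In practice the rule in step 2 would be replaced by whatever patterns actually occur in the data.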

The only really concise suggestion I can think of is to use MDY() (assuming these are date, not datetime, variables).
For example:
data want;
set have;
if year(datevar) > 2100 then
datevar = mdy(month(datevar),day(datevar),year(datevar)-90);
run;
This would correct any '2104' to '2014'. That's a very simple correction (and may well do as much harm as good, since '2114' is also a possible typo), but the general approach is along those lines: break the date up into its pieces, verify the pieces, and reconstruct it using MDY().
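Along the same lines, the year can be checked and repaired with plain arithmetic, which is about as close as numeric functions get to the character-style replacement asked about. The sketch below assumes the typo is a swap of the hundreds and tens digits (2102 for 2012) and only accepts the swapped year if it falls in a plausible range. The 1900 lower bound, the variable names, and the sample data are assumptions for illustration:

```sas
/* Hypothetical sample data */
data have;
  input datevar :date9.;
  format datevar date9.;
  datalines;
15MAR2102
04JUL2014
;
run;

data want;
  set have;
  yr = year(datevar);
  if yr > year(today()) then do;
    /* pull out the two digits that may have been swapped */
    hundreds = mod(int(yr / 100), 10);
    tens     = mod(int(yr / 10), 10);
    /* swap the hundreds and tens digits: 2102 -> 2012 */
    yr_swapped = yr - hundreds*100 - tens*10 + tens*100 + hundreds*10;
    if 1900 <= yr_swapped <= year(today()) then do;
      candidate = mdy(month(datevar), day(datevar), yr_swapped);
      /* mdy() returns missing for impossible dates such as 29 Feb
         in a non-leap year, so keep the original in that case */
      if not missing(candidate) then datevar = candidate;
    end;
  end;
  drop yr hundreds tens yr_swapped candidate;
run;
```

A year like 2110 is unchanged by the swap, fails the range check, and is left as it was, so it would still need a separate rule.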

Related

Destring a time variable using Stata

How to destring a time variable (7:00) using Stata?
I have tried destring: however, the : prevents the destring. I then tried destring, ignore(:) but was unable to then make a double and/or format %tc. encode does not work; recast does not do the job.
I also have a separate string date that I was able to destring and convert to a double.
Am I missing that I could be combining these two string variables (one date, one time) into a date/time variable or is it correct to destring them individually and then combine them into a date/time variable?
Short answer
To give the bottom line first: two string variables that hold date and time information can be converted to a single numeric date-time variable using some operation like
generate double datetime = clock(date + " " + time, "DMY hm")
format datetime %tc
except that the exact details will depend on exactly how your dates are held.
For understanding dates and times in Stata there is no substitute for
help dates and times
Everything else tried is likely to be wrong or irrelevant or both, as your experience shows.
Longer answer, addressing misconceptions
destring, encode and recast are all (almost always) completely wrong in Stata for converting string dates and/or times to numeric dates and/or times. (I can think of one exception: if somehow a date in years had been imported as string with values "1960", "1961", etc. then destring would be quite all right.)
In reverse order,
recast is not for any kind of numeric to string or string to numeric conversion. It only recasts among numeric or among string types.
encode is essentially for mapping obvious strings to numeric and (unless you specify otherwise) will produce integer values 1, 2, 3, and so forth which will be quite wrong for times or dates in general.
destring as you applied it implies that the string times "7:00", "7:59", "8:00" should be numeric, except that someone stupidly added irrelevant punctuation. But if you strip the colons :, you get times 700, 759, 800, etc. which will not match the standard properties of times. For example, the difference between "8:00" and "7:59" is clearly one minute, but removing the informative punctuation would just yield numbers 800 and 759, which differ by 41, which makes no sense.
For a pure time, you can set up your own system, or use Stata's date-time functions.
For a time between "00:00" and "23:59" you can use Stata's date-times:
. di %tc clock("7:00", "hm")
01jan1960 07:00:00
. di %tc_HH:MM clock("7:00", "hm")
07:00
With variables you would need to generate a new variable and make sure that it is created as double.
A pure time less than 24 hours is (notionally) a time on 1 January 1960, but you can ignore that. But you need to hold in mind (constantly!) that the underlying numeric units are milliseconds. Only the format gives you a time in conventional terms.
If you have times of more than 24 hours, holding them as Stata date-times is probably not a good idea.
Your own system could just be to convert string times in the form "hh:mm" to minutes and do calculations in those terms. For times held as variables, the easiest way forward would be to use split, destring to produce numeric variables holding hours and minutes and then use 60 * hours + minutes.
However, despite your title, the real problem here seems to be dealing jointly with date and time information, not just time information, so at this point, you might like to read the short answer again.

How does the reverse function in SAS work?

I have a time data field, say, 10/1/2014.
I want to extract the month and the year information dynamically in SAS, given any date.
I wrote the following code in SAS to extract the month info:
month = substr(time_field, 1, index(time_field, '/')-1);
This worked fine.
I wrote the following snippet to extract the year info:
year = substr(reverse(time_field), 1, 4);
This doesn't work; it throws a blank. Have I missed something? Please help.
SAS will return the year for you. No need to write any custom function for this purpose. Look:
data _null_;
length year 4.;
year=year(today());
put "we are on the year of " year;
run;
Your variable has trailing spaces most likely. So when you reverse it, the trailing spaces become leading spaces and then you take the first four characters which are blanks.
You can verify this by running the reverse function alone on the variable and see the results.
Try adding the compress function.
year = reverse(substr(reverse(compress(time_field)), 1, 4));
Though this may solve your problem, you should really convert your date to a SAS date and then use the Month/Day/Year functions.
data have;
length time_field $20.;
time_field="10/1/2014";
year_bad = substr(reverse(time_field),1, 4);
year_good = reverse(substr(reverse(compress(time_field)),1, 4));
year_better = year(input(time_field, mmddyy10.));
put "year_bad:" year_bad;
put "year_good:" year_good;
put "year_better:" year_better;
run;
Your data is either a date in a character field, or it is a numeric value formatted as a date. While you can use text expressions on numerics, you shouldn't; you should explicitly convert them.
When you don't, then you end up with things like this - ie, improper lengths of fields, because the automatic conversion is very loose. It tends to allow a huge amount of extra space where it's not required to.
If your data is numeric, use MONTH() or YEAR() and be done with it; there's no reason to play in text here. Look at the field in the data explorer; it will tell you if it's numeric or not. (Numeric with a format can still look like text, so actually look at it!)
If your data is text, then you have some better options than REVERSE.
First is SCAN. SCAN splits by word, similar to many other languages; often strsplit (R) or similar.
month=scan(mdy_var,1,'/');
day =scan(mdy_var,2,'/');
year =scan(mdy_var,3,'/');
Second, you could still use SUBSTR, along with LENGTH.
year = substr(mdy_var,length(mdy_var)-3,4);
LENGTH tells you how long the string really is (minus trailing spaces), so '10/1/2014' is 9 long; 6th character (9-3) is the 2, and then 4 characters after that [which should be unnecessary]. This method wouldn't really work with Day, of course, only with year (and only with 4 digit year). Scan is better really, but this is a good example of how this works.
Going along the same lines, you can use FIND and look backwards, also, using a negative start position.
year = substr(mdy_var,find(mdy_var,'/',-99)+1,4);
That starts the search at the 99th character (which is realistically your maximum, right?), goes left, and returns the position of the first '/' it finds; adding 1 gives the start of the year.
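Putting the text approaches side by side, here is a small sketch on a padded character variable (the name mdy_var and the $20 length are just assumptions for the example). It also uses SCAN with a negative word count, which counts words from the right:

```sas
data _null_;
  length mdy_var $20;
  mdy_var = '10/1/2014';   /* stored with trailing blanks up to 20 characters */

  /* third '/'-delimited word */
  year_scan   = scan(mdy_var, 3, '/');
  /* last word, counted from the right */
  year_last   = scan(mdy_var, -1, '/');
  /* last 4 non-blank characters */
  year_substr = substr(mdy_var, length(mdy_var) - 3, 4);
  /* 4 characters after the last '/' */
  year_find   = substr(mdy_var, find(mdy_var, '/', -99) + 1, 4);

  put year_scan= year_last= year_substr= year_find=;
run;
```

All four should show 2014 in the log. The SCAN versions are the only ones that would also cope with a two-digit year.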

Extract left part of the string in SAS?

Is there a function in SAS PROC SQL that I can use to extract the left part of a string? It would be something similar to the LEFT function in SQL Server. In SQL I have left(11111111, 4) * 9 = 9999, and I would like to do something similar in SAS PROC SQL. Any help will be appreciated.
I had the impression you wanted to repeat the substring rather than multiply, so I'm adding the REPEAT function just out of curiosity.
proc sql;
select
INPUT(SUBSTR('11111111', 1, 4), 4.) * 9 /* if source is char */
, INPUT(SUBSTR(PUT(11111111, 16. -L), 1, 4), 4.) * 9 /* if source is number */
, REPEAT(SUBSTR(PUT(11111111, 16. -L), 1, 4), 9) /* repeat instead of multiply; note REPEAT(x, n) returns n+1 copies */
FROM SASHELP.CLASS (obs=1)
;
quit;
substr("some text",1,4) will give you "some". This function works the same way in a lot of SQL implementations.
Also, note that this is a string function, but in your example you're applying it to a number. SAS will let you do this, but in general it's wise to control your conversion between strings and numbers with the put() and input() functions to keep your log clean and be sure that you're only converting where you actually intend to.
You might be looking for the SUBSTRN function.
SUBSTRN(string, position <, length>)
Arguments
string specifies a character or numeric constant, variable, or expression. If string is numeric, then it is converted to a character value that uses the BEST32. format. Leading and trailing blanks are removed, and no message is sent to the SAS log.
position is an integer that specifies the position of the first character in the substring.
length is an integer that specifies the length of the substring. If you do not specify length, the SUBSTRN function returns the substring that extends from the position that you specify to the end of the string.
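Applied to the number in the question, that might look like the sketch below. It is written as a DATA step for brevity, and the variable name result is just an assumption. SUBSTRN does the numeric-to-character conversion itself, and INPUT turns the four characters back into a number:

```sas
data _null_;
  /* substrn converts 11111111 to text and takes the first 4 characters;
     input() reads them back as the number 1111 before multiplying */
  result = input(substrn(11111111, 1, 4), 4.) * 9;
  put result=;
run;
```

This should write result=9999 to the log.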
As others have pointed out, substr() is the function you are looking for, although I feel that a more useful answer would also 'teach you how to fish'.
A great way to find out about SAS functions is to google sas functions by category which at the time of writing this post will direct you here:
SAS Functions and CALL Routines by Category
It's worth scanning through this list at least once just to get an idea of all of the functions available.
If you're after a specific version, you may want to include the SAS version number in your search. Note that the link above is for 9.2.
If you have scanned through all the functions, and still can't find what you are looking for, then your next option may be to write your own SAS function using proc fcmp. If you ever need assistance with doing this then I suggest posting a new question.

SAS: Where statement not working with string value

I'm trying to use PROC FREQ on a subset of my data called dataname. I would like it to include all rows where varname doesn't equal "A.Never Used". I have the following code:
proc freq data=dataname(where=(varname NE 'A.Never Used'));
run;
I thought there might be a problem with trailing or leading blanks so I also tried:
proc freq data=dataname(where=(strip(varname) NE 'A.Never Used'));
run;
My guess is for some reason my string values are not "A.Never Used" but whenever I print the data this is the value I see.
This is a common issue in dealing with string data (and a good reason not to!). You should consider the source of your data - did it come from web forms? Then it probably contains nonbreaking spaces ('A0'x) instead of regular spaces ('20'x). Did it come from a unicode environment (say, Japanese characters are legal)? Then you may have transcoding issues.
A few options that work for a large majority of these problems:
Compress out everything but alphabet characters. where=(compress(varname,,'ka') ne 'ANeverUsed') for example. 'ka' means 'keep only' and 'alphabet characters'.
UPCASE or LOWCASE to ensure you're not running into case issues.
Use put varname $HEX.; in a data step to look at the underlying characters. Every two hex digits represent one byte, which is one character in a single-byte encoding. 20 is a space (which strip would remove). Sort by varname before doing this so that you can easily see the rows that you think should have this value next to each other - what is the difference? Probably some special character, or multibyte characters, or who knows what, but it should be apparent here.
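As a sketch of that last check, the steps below build a hypothetical dataname whose first value contains a non-breaking space ('A0'x, as it would appear in a single-byte encoding such as Latin-1), dump the values in hex, and then replace that byte with a regular space before filtering. The sample rows and the varname_clean variable are assumptions for illustration; in a UTF-8 session a non-breaking space is the two bytes 'C2A0'x and would need different handling:

```sas
/* Hypothetical data: the first value looks like 'A.Never Used' when printed */
data dataname;
  length varname $20;
  varname = 'A.Never' || 'A0'x || 'Used'; output;
  varname = 'B.Used Once';                output;
run;

data checked;
  set dataname;
  /* two hex digits per byte: a normal space shows as 20, the odd one as A0 */
  put varname= $hex40.;
  /* translate(source, to, from): turn non-breaking spaces into spaces */
  varname_clean = strip(translate(varname, ' ', 'A0'x));
run;

proc freq data=checked(where=(varname_clean ne 'A.Never Used'));
  tables varname_clean;
run;
```

Once the hex dump shows which byte is responsible, the same TRANSLATE or COMPRESS call can go straight into the where= option.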

regexes in calendar

I have a string in date format, 06/09/2011 03:00 PM. I want to remove all of the forward slashes and, if the first digit of the month (06) is a zero, remove it; likewise for the first digit of the day (09). Can anybody help me out?
thanks!
The usual way to do this is by taking an available date parser where you hand in the input format and output it to a different output format.
Patterns differ, and implementations differ as well. Date parsing via regex is neither convenient nor practical.
Something like this, replacing the match with the three captured groups joined together:
0?([0-9]+)/0?([0-9]+)/([0-9]+)
Of course, it will only work on valid dates; it does not parse or validate the date in any way.
BTW: I find fyr's answer better (more readable, and it detects errors in a more meaningful manner). This is just to show that it can be done with regex, if fyr's solution is not available on your platform.
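For what it's worth, here is how such a replacement could look with SAS's PRXCHANGE, one regex-capable tool among many. The question does not say which platform is in use, so treat the choice of language, and the variable names stamp and result, as assumptions. The pattern makes each leading zero optional and rebuilds the date from the captured groups without the slashes:

```sas
data _null_;
  length stamp result $30;
  stamp = '06/09/2011 03:00 PM';
  /* 0? consumes an optional leading zero; $1$2$3 rejoins month, day, year */
  result = prxchange('s/^0?(\d+)\/0?(\d+)\/(\d+)/$1$2$3/', 1, stamp);
  put result=;
run;
```

For the sample value this should give 692011 03:00 PM. The time part is left untouched, and nothing here checks that the digits form a real date.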