Destring a time variable using Stata - stata

How to destring a time variable (7:00) using Stata?
I have tried destring: however, the : prevents the destring. I then tried destring, ignore(:) but was unable to then make a double and/or format %tc. encode does not work; recast does not do the job.
I also have a separate string date that I was able to destring and convert to a double.
Am I missing that I could be combining these two string variables (one date, one time) into a date/time variable or is it correct to destring them individually and then combine them into a date/time variable?

Short answer
To give the bottom line first: two string variables that hold date and time information can be converted to a single numeric date-time variable using some operation like
generate double datetime = clock(date + " " + time, "DMY hm")
format datetime %tc
except that the exact details will depend on exactly how your dates are held.
For understanding dates and times in Stata there is no substitute for
help dates and times
Everything else tried is likely to be wrong or irrelevant or both, as your experience shows.
Longer answer, addressing misconceptions
destring, encode and recast are all (almost always) completely wrong in Stata for converting string dates and/or times to numeric dates and/or times. (I can think of one exception: if somehow a date in years had been imported as string with values "1960", "1961", etc. then destring would be quite all right.)
In reverse order,
recast is not for any kind of numeric to string or string to numeric conversion. It only recasts among numeric or among string types.
encode is essentially for mapping obvious strings to numeric and (unless you specify otherwise) will produce integer values 1, 2, 3, and so forth which will be quite wrong for times or dates in general.
destring as you applied it implies that the string times "7:00", "7:59", "8:00" should be numeric, except that someone stupidly added irrelevant punctuation. But if you strip the colons :, you get times 700, 759, 800, etc. which will not match the standard properties of times. For example, the difference between "8:00" and "7:59" is clearly one minute, but removing the informative punctuation would just yield numbers 800 and 759, which differ by 41, which makes no sense.
For a pure time, you can set up your own system, or use Stata's date-time functions.
For a time between "00:00" and "23:59" you can use Stata's date-times:
. di %tc clock("7:00", "hm")
01jan1960 07:00:00
. di %tc_HH:MM clock("7:00", "hm")
07:00
With variables you would need to generate a new variable and make sure that it is created as double.
A pure time less than 24 hours is (notionally) a time on 1 January 1960, but you can ignore that. But you need to hold in mind (constantly!) that the underlying numeric units are milliseconds. Only the format gives you a time in conventional terms.
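To make the millisecond units concrete, here is a minimal sketch in Python (not Stata) that mimics the kind of number clock() returns, assuming the convention described above of milliseconds counted from 1 January 1960 00:00:00. The helper names ms_since_1960 and as_float32 and the sample date of 1 October 2014 are illustrative assumptions. It also suggests why the variable should be created as double: a 4-byte float cannot hold a typical date-time exactly.

```python
import struct
from datetime import datetime, timedelta

STATA_EPOCH = datetime(1960, 1, 1)  # assumed origin of %tc date-times

def ms_since_1960(dt):
    """Milliseconds elapsed since 01jan1960 00:00:00 (hypothetical helper)."""
    return (dt - STATA_EPOCH) // timedelta(milliseconds=1)

def as_float32(x):
    """Round x to the nearest 4-byte float, as a float variable would store it."""
    return struct.unpack("f", struct.pack("f", x))[0]

pure_time = ms_since_1960(datetime(1960, 1, 1, 7, 0))   # "7:00" on the epoch day
date_time = ms_since_1960(datetime(2014, 10, 1, 7, 0))  # 1 Oct 2014 07:00

print(pure_time)                         # 25200000, i.e. 7 * 60 * 60 * 1000
print(date_time, as_float32(date_time))  # the float32 copy no longer matches
```

The pure time survives the round trip to 4 bytes only because it is a small number; the full date-time does not.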
If you have times of more than 24 hours, holding them as Stata date-times is probably not a good idea.
Your own system could just be to convert string times in the form "hh:mm" to minutes and do calculations in those terms. For times held as variables, the easiest way forward would be to use split, destring to produce numeric variables holding hours and minutes and then use 60 * hours + minutes.
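The arithmetic behind that home-made system can be sketched as follows. This is Python rather than Stata, used only to make the numbers checkable; the function names to_minutes and strip_colon are hypothetical.

```python
def to_minutes(hhmm):
    """Convert a string time such as "7:59" to minutes since midnight."""
    hours, minutes = hhmm.split(":")        # plays the role of split
    return 60 * int(hours) + int(minutes)   # int() plays the role of destring

def strip_colon(hhmm):
    """What removing the colon effectively produces: "7:59" becomes 759."""
    return int(hhmm.replace(":", ""))

print(to_minutes("8:00") - to_minutes("7:59"))    # 1 (one minute)
print(strip_colon("8:00") - strip_colon("7:59"))  # 41 (meaningless)
```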
However, despite your title, the real problem here seems to be dealing jointly with date and time information, not just time information, so at this point, you might like to read the short answer again.

Related

Can you use setfill() to set fill 2 digits?

Say if I want to display 2018 when the input is 18, how can I make it display the "20" in 2018? I tried using setfill("20") and it won't work as it's a string rather than a char. But when I used setfill('2'); it will display 2218.
The formatting facility setfill() is designed to fill the empty space by repeating one single char, so it is not possible to obtain this effect using a string.
For time formatting, you could use put_time() with a format string to indicate the desired format (e.g. %Y for 2018 and %y for 18), provided you have a real date (so year 2018) passed in a tm structure.
In your case, you could try something like:
cout << (year < 100 ? year + 2000 : year);
But a better approach would be either to ensure that the year is entered with 4 digits (which spares you a future bug in the year 2100), or to put the conversion in a function, so that, if needed, your future self can easily find the delicate places in your code.
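As a rough illustration of isolating the conversion in one function, here is a small sketch written in Python rather than C++. The name expand_year and the default century of 2000 are assumptions made for the example, not part of any library.

```python
def expand_year(year, century=2000):
    """Return a four-digit year; two-digit input is assumed to lie in `century`."""
    return year + century if 0 <= year < 100 else year

print(expand_year(18))    # 2018
print(expand_year(2018))  # 2018
```

Keeping the century assumption in a single default argument makes it easy to find and change later.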

Controlling newlines when writing out arrays in Fortran

So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the breaking-up-lines screws up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(also, while we're at it, how do I control the nr of decimal digits? I know these are all integer values so I'd like to leave out any decimals all together, but I can't change the data type to INTEGER in my code because of reasons).
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field (including any sign and the decimal point) and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point; the decimal point itself is still printed.
Spaces can be added with the X control edit descriptor.
Repetitions of edit descriptors can be indicated with the number of repetitions before a symbol.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to completely remove the decimal symbol, you would instead need to print the nearest integers using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)
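For checking what downstream tooling should expect, the intended layout (10 integer values per row, fields of width 6 separated by one space, no trailing empty line) can be mimicked with a short Python sketch. This is only an analogue of the format strings above, not Fortran output; rows_of and the sample data are made up for the illustration.

```python
def rows_of(values, per_row=10, width=6):
    """Lay values out per_row to a line, right-aligned in fields of `width`."""
    lines = []
    for start in range(0, len(values), per_row):
        chunk = values[start:start + per_row]
        # round() stands in for nint here; note that Python rounds halves to even
        lines.append(" ".join(f"{round(v):{width}d}" for v in chunk))
    return "\n".join(lines)

month_data = [10.0 * d for d in range(30)]  # made-up values for a 30-day month
text = rows_of(month_data)
print(text)
```

With 30 values this yields exactly three lines, and with 31 values a fourth line holding a single field.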

Stata does not replace variable value

Stata does not replace a value, as I am commanding. What is happening?
I have this variable Shutouts, which is a float variable (%9.0g).
One observation has the value = 5.08; that is an error, it should be 5.
I type: replace Shutouts = 5 if Shutouts == 5.08.
And, surprisingly to me, Stata responds:
replace Shutouts=5 if Shutouts==5.08
(0 real changes made)
I have a similar problem for a variable with the same characteristics, with the name Save_perc; one value is 9.2 but should be .92. And, also this time, I receive this response from Stata:
replace Save_perc=.92 if Save_perc==9.2
(0 real changes made)
Why "0 real changes"?
It seems like a very banal problem, but I have been working on it for like 30' and I cannot really figure it out.
It has to do with how floating-point numbers are stored in memory. Shutouts is stored as a float, but Stata evaluates the literal 5.08 in double precision, and 5.08 has no exact binary representation, so the two approximations differ and the == comparison fails.
In your case, you could use
replace Shutouts = 5 if Shutouts > 5.07 & Shutouts < 5.09
or
replace Shutouts = 5 if Shutouts == float(5.08)
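The same effect can be reproduced outside Stata. The following Python sketch rounds a value to 4-byte precision with the standard struct module and compares it with the 8-byte original; as_float32 is a hypothetical helper that only imitates the rounding a float variable applies.

```python
import struct

def as_float32(x):
    """Round a double to the nearest 4-byte float, as a float variable stores it."""
    return struct.unpack("f", struct.pack("f", x))[0]

stored = as_float32(5.08)          # what a float variable actually holds
print(stored == 5.08)              # False: stored value vs. double literal
print(stored == as_float32(5.08))  # True: both sides rounded the same way
print(as_float32(5.0) == 5.0)      # True: 5 is exactly representable
```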

How does the reverse function in SAS work?

I have a time data field, say, 10/1/2014.
I want to extract the month and the year information dynamically in SAS, given any date.
I wrote the following code in SAS to extract the month info:
month = substr(time_field, 1, index(time_field, '/')-1);
This worked fine.
I wrote the following snippet to extract the year info:
year = substr(reverse(time_field), 1, 4);
This doesn't work; it throws a blank. Have I missed something? Please help.
SAS will return the year for you. No need to write any custom function for this purpose. Look:
data _null_;
length year 4.;
year=year(today());
put "we are on the year of " year;
run;
Your variable has trailing spaces most likely. So when you reverse it, the trailing spaces become leading spaces and then you take the first four characters which are blanks.
You can verify this by running the reverse function alone on the variable and see the results.
Try adding the compress function, and reverse the result once more so that the digits come out in the right order.
year = reverse(substr(reverse(compress(time_field)), 1, 4));
Though this may solve your problem, you should really convert your date to a SAS date and then use the Month/Day/Year functions.
data have;
length time_field $20.;
time_field="10/1/2014";
year_bad = substr(reverse(time_field),1, 4);
year_good = reverse(substr(reverse(compress(time_field)),1, 4));
year_better = year(input(time_field, mmddyy10.));
put "year_bad:" year_bad;
put "year_good:" year_good;
put "year_better:" year_better;
run;
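The trailing-blank effect does not depend on SAS. Here is a minimal Python sketch that imitates a fixed-width character field by padding the value to 20 characters; the padding via ljust is an assumption standing in for the declared length.

```python
time_field = "10/1/2014".ljust(20)   # fixed-width field, padded with blanks

year_bad = time_field[::-1][:4]                 # first 4 chars of the reversal
year_flipped = time_field.strip()[::-1][:4]     # stripped, but still backwards
year_good = time_field.strip()[::-1][:4][::-1]  # strip, reverse, take 4, reverse back

print(repr(year_bad))      # '    '
print(repr(year_flipped))  # '4102'
print(repr(year_good))     # '2014'
```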
Your data is either a date in a character field, or it is a numeric value formatted as a date. While you can use text expressions on numerics, you shouldn't; you should explicitly convert them.
When you don't, then you end up with things like this - ie, improper lengths of fields, because the automatic conversion is very loose. It tends to allow a huge amount of extra space where it's not required to.
If your data is numeric, use MONTH() or YEAR() and be done with it; there's no reason to play in text here. Look at the field in the data explorer; it will tell you if it's numeric or not. (Numeric with a format can still look like text, so actually look at it!)
If your data is text, then you have some better options than REVERSE.
First is SCAN. SCAN splits by word, like strsplit in R or similar functions in many other languages.
month=scan(mdy_var,1,'/');
day =scan(mdy_var,2,'/');
year =scan(mdy_var,3,'/');
Second, you could still use SUBSTR, along with LENGTH.
year = substr(mdy_var,length(mdy_var)-3,4);
LENGTH tells you how long the string really is (minus trailing spaces), so '10/1/2014' is 9 long; 6th character (9-3) is the 2, and then 4 characters after that [which should be unnecessary]. This method wouldn't really work with Day, of course, only with year (and only with 4 digit year). Scan is better really, but this is a good example of how this works.
Going along the same lines, you can use FIND and look backwards, also, using a negative start position.
year = substr(mdy_var,find(mdy_var,'/',-99)+1,4);
That starts the search at the 99th character (which is realistically your maximum, right?), goes left, and returns the position of the first '/' it finds.
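For comparison, the three text-based approaches above (split on the delimiter, count back from the true length, search for the delimiter from the right) look like this in a Python sketch. The padded sample value is an assumption that imitates a fixed-length character field.

```python
mdy_var = "10/1/2014".ljust(20)      # padded, as a fixed-length character field
s = mdy_var.rstrip()                 # work with the length minus trailing blanks

month, day, year = s.split("/")      # analogue of SCAN with '/' as delimiter
year_by_length = s[len(s) - 4:]      # analogue of SUBSTR combined with LENGTH
year_by_find = s[s.rfind("/") + 1:]  # analogue of FIND searching right to left

print(month, day, year)              # 10 1 2014
print(year_by_length, year_by_find)  # 2014 2014
```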

How do I use numeric functions to correct date typos?

I know it's easy enough to do manual corrections on date typos, but I want to automate such corrections using one or more SAS functions, given that my dataset is large and typos are frequent.
For instance, it seems that whoever created the dataset I am cleaning often transposed digits in the year of someone's birthdate (e.g., '2102' rather than '2012', '2110' instead of '2010', etc). I'm aware of string functions such as INDEX() that find certain character values or strings and then allow for the replacement of said characters in the same position (i.e., replace "ABCD" with "ABBB", regardless of the string's location in a value). Can the same process be replicated with numeric (and specifically date) values?
I don't think SAS has any functions that would check numeric values for digit patterns. I often do data cleaning and address this issue by making a character variable out of the numeric date variable, then using character functions and Perl regex to clean the character values, and then storing the cleaned values as numeric date.
For specifically date values, you could try using SAS date functions (e.g. DAY(), MONTH(), YEAR(), MDY(), etc.) to extract parts of the date value, error-check them, and put them all back together into a date value. This could be a good quick solution if you expect a limited set of typos and you roughly know what they are. For a more thorough error check, converting the numeric values to character and using char or regex functions would give you more options.
The only really concise suggestion I can imagine is using mdy (assuming these are date, not datetime, variables).
For example:
data want;
set have;
if year(datevar) > 2100 then
datevar = mdy(month(datevar),day(datevar),year(datevar)-90);
run;
would correct any '2104' to '2014'. That's a very simple correction (and may well do as much harm as good, since '2114' is also a possible typo), but the general idea is along those lines: break the date up into its pieces, verify the pieces, and reconstruct using mdy.
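The break-up, verify, rebuild idea can be sketched in Python as below. The function fix_year, its cutoff of 2100 and its shift of 90 years simply mirror the example above and are not a general typo detector. One thing the sketch makes visible is that the rebuilt date may not exist, in which case it returns None; I would expect mdy to yield a missing value in the same situation, but treat that as an assumption to check.

```python
from datetime import date

def fix_year(d, cutoff=2100, shift=90):
    """Rebuild a date with year - shift when the year is beyond cutoff.

    Returns None when the shifted date does not exist (e.g. 29 February).
    """
    if d.year <= cutoff:
        return d
    try:
        return date(d.year - shift, d.month, d.day)
    except ValueError:
        return None

print(fix_year(date(2104, 3, 15)))  # 2014-03-15
print(fix_year(date(2012, 3, 15)))  # 2012-03-15 (left alone)
print(fix_year(date(2104, 2, 29)))  # None: 2014 was not a leap year
```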