Is there a way to drop values from only one variable? - stata

I want to split 100 observations in Stata into 2 sets of 50 and drop all missing values, so I used:
gen A = x if _n <= 50
gen B = x if _n > 50
How do I remove the missing values from each variable without affecting the other?

You can’t do that and you don’t need to, as Stata generally ignores missing values.
You can drop entire observations (rows in spreadsheet jargon) or entire variables (columns ditto) but the equivalent for some values in a variable is just setting values to missing, which you have done.
The question in return is why you want to do this.
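The point about missing values can be illustrated outside Stata. The sketch below is in Python rather than Stata, and every name in it is hypothetical. It only mimics the idea that A and B keep the same length as x, and that missing entries are skipped when something is calculated.

```python
# Hypothetical illustration (Python, not Stata): None stands in for Stata's missing value.
x = list(range(1, 101))  # 100 made-up observations

# Analogue of: gen A = x if _n <= 50   and   gen B = x if _n > 50
A = [v if i <= 50 else None for i, v in enumerate(x, start=1)]
B = [v if i > 50 else None for i, v in enumerate(x, start=1)]


def mean_ignoring_missing(values):
    """Average the non-missing entries, skipping missing ones."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


mean_A = mean_ignoring_missing(A)
mean_B = mean_ignoring_missing(B)
```

Both lists still have 100 entries; nothing was dropped, yet each mean uses only its own 50 non-missing values.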

Related

How to create variable from 1 to n (increments of 1)?

Assuming that in Stata e.g. I have one stacked variable (in column 2) of stock returns with data populating range of 1 to 2000000 (some blanks are replaced with dots). How can I create another variable next to it, which will start at 1 and jump in the increments of one (1, 2, 3, 4...) all the way down to 2000000? I need this kind of variable to merge datasets. Advice would be much appreciated.
If it helps, if I was to use VBA, I would find the last row of the stacked column and then create a variable on this basis moving in the increments of one (that would be of course if Excel allowed 2 million rows)
gen long id = _n
will populate a variable with the observation number.
Note that you can merge on observation number. You don't need any identifier variable(s) to do it. In practice, I would almost always be very queasy about any merge not based on explicit identifiers, unless the datasets were visibly compatible (not so with 2 million observations).
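The caution about merging on explicit identifiers can be sketched as follows. This is a Python analogue rather than Stata code, and the data and the names returns, other and size are made up for illustration.

```python
# Hypothetical data: two "datasets" held as lists of dicts.
returns = [{"ret": 0.01}, {"ret": None}, {"ret": -0.02}]

# Analogue of: gen long id = _n  (observation numbers start at 1)
for n, row in enumerate(returns, start=1):
    row["id"] = n

# The second dataset is deliberately in a different order and has no row for id 2.
other = [{"id": 3, "size": 30}, {"id": 1, "size": 10}]

# Merge on the explicit identifier, so row order no longer matters.
size_by_id = {row["id"]: row["size"] for row in other}
merged = [{**row, "size": size_by_id.get(row["id"])} for row in returns]
```

A purely positional merge would have paired the first row of returns with id 3, which is the kind of silent mismatch an explicit identifier guards against.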

How to reference a cell in a SAS database

I have a quick question. In excel you can refer to the value in row 20, column c as the C20 cell. What is the equivalent expression in a SAS database?
It's not really useful to think of a SAS dataset like a spreadsheet. Rather, think about it like a database table. Extracting a particular row is easy, but extracting a column requires a name rather than a position, like C in Excel.
If this is the dataset
    x1   x2   x3
  +----+----+----
1 |  0 |  1 |  0
2 |  1 |  2 |  3
Then in a data step, you can get the equivalent of B2 like so:
data b2;
set dataset;
if _n_ = 2 then output;
keep x2;
run;
The output dataset will then contain only the value you want. But you have to know that x2, for example, is the variable you want.
This isn't really what SAS is for, though.
You cannot explicitly refer to a single cell external to a dataset out of context, like you can in Excel. SAS processes rows one at a time, and does not naturally have the ability to directly access a cell.
In general, if you're referring to the value in a discussion, you would refer to the column as a variable. You could refer to the row number, although that has very little meaning in most instances (particularly as you can sort a dataset, changing all of the row numbers); instead, you would refer to it by its primary key. This would be whatever defines a unique row in your data. It might be a subject ID, for example, or some combination of several variables that together define a unique row.
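The difference between addressing a value by row number and by primary key can be sketched like this. It is a Python analogue, not SAS, and the subject_id column is a hypothetical key added for illustration.

```python
# The example dataset from above, plus a hypothetical primary key (subject_id).
dataset = [
    {"subject_id": "S01", "x1": 0, "x2": 1, "x3": 0},
    {"subject_id": "S02", "x1": 1, "x2": 2, "x3": 3},
]

# By position: row 2, variable x2 (the analogue of _n_ = 2 with keep x2).
b2 = dataset[2 - 1]["x2"]


def lookup(rows, key, column):
    """Return one value identified by primary key and variable name."""
    for row in rows:
        if row["subject_id"] == key:
            return row[column]
    return None


# Sorting changes row numbers, but not what the key points to.
reordered = sorted(dataset, key=lambda r: r["x3"], reverse=True)
by_key = lookup(reordered, "S02", "x2")
by_position = reordered[2 - 1]["x2"]
```

After the sort, the key-based lookup still returns the same value, while "row 2" now refers to a different subject.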

SAS macros to average between a range of dates with missing dates in the data

I'm completely new to SAS and its macros. I have this dataset, named mydata:
Obs SYMBOL DATE kx y
1 A 20120128 5 6
2 B 20120128 10 7
3 C 20120128 20 9
4 D 20120128 6 10
5 E 20120128 9 20
My problem is to find this function:
New_i = ( ∑_{j ∈ [-10, -2]} (x + y)_{i,j} ) / N,
where
i = any date (user defined),
j ∈ [-10, -2] = offset in days from i, i.e. from 10 days before i to 2 days before i,
N = total number of days with data available for (x + y) within that window.
There can be missing dates in the available data.
Can anyone help me with possible SAS macros for this problem?
Thanks in Advance!!
I'm assuming your date data are stored as dates and can accept numeric calculations. I'm also assuming that you want to get average of X and Y for a particular date around d, where d is user defined. Last, I'm assuming that if you have two unique ids on the same day, you keep the first one at random. Obviously those assumptions might need to be tweaked a bit but, from what I believe you are asking (I confess I'm only mostly sure I understand your question), hopefully this is close enough to what you need that you can tweak the rest pretty easily.
Okay...
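PROC SORT DATA = in;
    BY date uniqueid;
RUN;

%MACRO summarize(userdate);
    /* Keep rows dated from 10 days to 2 days before the user-supplied date */
    DATA out;
        SET in (WHERE = (date >= &userdate - 10 AND date <= &userdate - 2));
        BY date uniqueid;
        xy = SUM(x, y);      /* missing only when both x and y are missing */
        IF first.uniqueid;   /* keep the first row per date and uniqueid */
    RUN;

    /* Average xy over the retained rows; missing xy values are skipped */
    PROC SUMMARY DATA = out;
        VAR xy;
        OUTPUT OUT = Averages MEAN(xy) = mean_xy;
    RUN;
%MEND summarize;

/* SAS date literals take the form 'ddMMMyyyy'd */
%summarize('28JAN2012'd);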
What's going on here? Well, I sort the data first by date and uniqueid. I could use NODUPKEY, but I imagine you might want to control how duplicate uniqueids on a given date are handled. The DATA step throws out the dups by keeping the first one it comes across, but you could modify the deduping logic (which comes from the BY statement and the IF first.uniqueid statement in the DATA step).
You want a set of dates around a particular user-defined date, d. So get d and filter the dataset with WHERE. You could also do this in your PROC SORT step, and there might be reasons for doing so if your raw data will be updated frequently. If you don't need to run the sort every time a user defines a date range, keep it outside the macro and only run it when needed. Sorts can be slow.
In the data step, I'm using sum(x, y) to account for the fact that either x or y might be missing, or both, or neither. x + y would return missing whenever either one is missing. I assume that's not what you want, but do keep in mind that we'll be averaging sum(x, y) over N, where N counts the rows in which at least one of x and y is not missing. If you wanted to ignore rows with any missing value entirely, use x + y and add IF xy NE . in your DATA step.
The last part, the PROC SUMMARY step that takes the mean, should be pretty self-explanatory.
Hope this helps.
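For readers who want to check the logic independently of SAS, here is a minimal Python sketch of the same steps: filter to the window, keep the first row per date and uniqueid, sum x and y while tolerating missing values, then average. The function names and the sample rows are assumptions made for illustration, not part of the original data.

```python
from datetime import date, timedelta


def sum_ignoring_missing(*values):
    """Sum the non-missing values; return None only if all are missing."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


def window_average(rows, user_date, start=-10, end=-2):
    """Mean of sum(x, y) over rows dated user_date+start .. user_date+end."""
    lo = user_date + timedelta(days=start)
    hi = user_date + timedelta(days=end)
    seen = set()
    totals = []
    # sorted() is stable, so the first duplicate in input order is the one kept.
    for row in sorted(rows, key=lambda r: (r["date"], r["uniqueid"])):
        if not (lo <= row["date"] <= hi):
            continue
        key = (row["date"], row["uniqueid"])
        if key in seen:
            continue  # drop later duplicates of the same date and uniqueid
        seen.add(key)
        xy = sum_ignoring_missing(row["x"], row["y"])
        if xy is not None:
            totals.append(xy)
    return sum(totals) / len(totals) if totals else None


# Made-up rows: one duplicate and one row outside the window.
rows = [
    {"date": date(2012, 1, 20), "uniqueid": "A", "x": 5, "y": 6},
    {"date": date(2012, 1, 22), "uniqueid": "A", "x": None, "y": 7},
    {"date": date(2012, 1, 22), "uniqueid": "A", "x": 100, "y": 100},
    {"date": date(2012, 1, 27), "uniqueid": "A", "x": 9, "y": 20},
]
average = window_average(rows, date(2012, 1, 28))
```

With these rows the window runs from 18 to 26 January 2012, so the average is taken over the values 11 and 7.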

How to import dates from Excel into Stata

I'm using Stata 12.0.
I have a CSV file of exposures for days of the year e.g. 01/11/2002 (DMY).
I want these imported into Stata and it to recognise that it is a date variable. I've been using:
insheet using "FILENAME", comma
But by doing this I am only getting the dates as labels rather than names of the variables. I guess this is because Stata doesn't allow variable names to start with numbers. I have tried to reformat the cells as Dates in Excel and import but then Stata thinks the whole column is a Date and changes the exposure data into dates.
Any advice on the best course of action is appreciated...
As commented elsewhere, I too think you probably have a dataset that is best formatted as panel data. However, I address first the specific problem I think you have according to your question. Then I show some code in case you are interested in switching to a panel structure.
Here is an example CSV file open as a spreadsheet:
And here the same file, open in a text editor. Imagine the ; are ,. This is related to my system's language settings.
Running this (in your case, replace delimiter(";") with comma):
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
results in
which I think is the problem you describe: dates as variable labels. You would like to have the dates as variable names. One solution is to use a loop and strtoname() to rename the variables based on the variable labels. The following goes after importing with insheet:
foreach var of varlist * {
local j = "`: variable l `var''"
local newname = strtoname("`j'", 1)
rename `var' `newname'
}
The result is
The function strtoname() replaces the illegal characters with underscores. See help strtoname.
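To make the renaming rule concrete, here is a rough Python approximation of what strtoname() does to a label such as 01/11/2002. It is a sketch of the documented behaviour (illegal characters become underscores, a leading digit gets an underscore prefix when the second argument is nonzero, and the result is truncated to 32 characters), not Stata's implementation; the name strtoname_like is hypothetical.

```python
import re


def strtoname_like(s, prefix_digit=True):
    """Approximate Stata's strtoname(s, 1); edge cases are not covered."""
    # Anything other than ASCII letters, digits and underscores becomes "_".
    name = re.sub(r"[^A-Za-z0-9_]", "_", s)
    # A name may not start with a digit, so prefix an underscore.
    if prefix_digit and name[:1].isdigit():
        name = "_" + name
    # Stata names are at most 32 characters long.
    return name[:32]


new_name = strtoname_like("01/11/2002")
```

This is why the renamed variables all start with an underscore, which the reshape further down uses as its stub.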
Now, if you want to work with a panel structure, one way would be:
clear all
set more off
insheet using "D:\xlsdates.csv", delimiter(";")
* Rename variables
foreach var of varlist * {
local j = "`: variable l `var''"
local newname = strtoname("`j'", 1)
rename `var' `newname'
}
* Generate ID
generate id = _n
* Change to long format
reshape long _, i(id) j(dat) string
* Sensible name
rename _ metric
* Generate new date variable
gen dat2 = date(dat,"DMY", 2050)
format dat2 %d
list, sepby(id)
As you can see, there's no need to do anything beforehand in Excel or in an editor. Stata seems to be enough in this case.
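The wide-to-long step can be hard to picture, so here is a small Python sketch of what the reshape does to the data. It is an analogue only: the values are invented, and reshape_long is a hypothetical helper, not a Stata command.

```python
# Hypothetical wide data: one row per subject, one column per day.
wide = [
    {"id": 1, "_01_11_2002": 3.2, "_02_11_2002": 4.1},
    {"id": 2, "_01_11_2002": 1.0, "_02_11_2002": 0.5},
]


def reshape_long(rows, stub="_", i="id", j="dat", value="metric"):
    """Turn each stub-prefixed column into its own row, keyed by i and j."""
    long_rows = []
    for row in rows:
        for name, val in row.items():
            if name != i and name.startswith(stub):
                # The part of the name after the stub becomes the j value.
                long_rows.append({i: row[i], j: name[len(stub):], value: val})
    return long_rows


long_data = reshape_long(wide)
```

Two subjects with two days each give four rows, which is the same multiplication that turns 10,000 rows and 122 day columns into 1,220,000 observations.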
Note: I've reused code from http://www.stata.com/statalist/archive/2008-09/msg01316.html.
A further note on performance: A CSV file with 122 variables or days (columns) and 10,000 observations or subjects (rows) + 1 header row, will produce 1,220,000 observations after the reshape. I have tested this on some old machine with a 1.79 GHz AMD processor and 640 MB RAM and the reshape takes approximately 8 minutes. Stata 12 has a hard-limit of 2,147,483,647 observations (although available RAM determines if you can actually achieve it) and Stata SE of 32,767 variables.
There seems to be some confusion here between the names that variables may have, the values that variables may have and the types that they may have.
Thus, the statement "Stata doesn't allow variables to start with numbers" appears to be a reference to Stata's rules for variable names; if it were true, numeric variables would be impossible.
Stata has no variable (i.e. storage) type that is a date. Strictly, it has no concept of a date variable, but dates may be held as strings or numbers. Dates may be held as strings insofar as any text indicating a date is likely to be a string that Stata can hold. This is flexible, but not especially useful. For almost all useful work, dates need to be converted to integers and then assigned a display format that matches their content to be readable by people. Stata has various conventions here, e.g. that daily dates are held as integers with 0 meaning 1 January 1960.
It seems likely in your case that daily dates are being imported as strings: if so, the function date() (also known as daily()) may be used to convert to an integer date. The example here just uses the minimal default display format for daily dates: friendlier formats exist.
. set obs 1
obs was 0, now 1
. gen sdate = "12/03/12"
. gen ndate = daily(sdate, "DMY", 2050)
. format ndate %td
. l
     +----------------------+
     |    sdate       ndate |
     |----------------------|
  1. | 12/03/12   12mar2012 |
     +----------------------+
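The integer behind that display format can be reproduced with a short Python sketch. It assumes dates written as dd/mm/yy only, and it mimics the 2050 argument as "the largest year a two-digit year may map to"; the function names are hypothetical.

```python
from datetime import date

STATA_EPOCH = date(1960, 1, 1)  # daily dates count days from here


def expand_two_digit_year(yy, topyear=2050):
    """Largest year not exceeding topyear whose last two digits are yy."""
    year = (topyear // 100) * 100 + yy
    return year if year <= topyear else year - 100


def daily(sdate, topyear=2050):
    """Convert a dd/mm/yy string to days since 1 January 1960."""
    day, month, yy = (int(part) for part in sdate.split("/"))
    year = expand_two_digit_year(yy, topyear)
    return (date(year, month, day) - STATA_EPOCH).days


ndate = daily("12/03/12")  # 19064
```

So ndate holds the number 19064, and the %td format is what makes it display as 12mar2012.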
If your variable names are being misread, as guessed by @ChrisP, you may need to tell us more. A short and concrete example is worth more than a longer verbal description.

Extracting sub-data from a SAS dataset & applying to a different dataset

I have written a macro that uses proc univariate to calculate custom quantiles for variables in a dataset (say dsn1): %cust_quants(dsn= , varlist= , quant_list= ). The output is a summary dataset (say dsn2) that looks something like the following:
q_1      q_2.5   q_50   q_80   q_97.5   q_99    var_name
1        2.5     50     80     97.5     99      ex_var_1_100
-2       10      25     150    500      20000   ex_var_pos_skew
-20000   -500    -150   0      10       50      ex_var_neg_skew
What I would like to do is to use the summary dataset to cap/floor extreme values in the original dataset. My idea is to extract the column of interest (say q_99) and put it into a vector of macro-variables (say q_99_1, q_99_2, ..., q_99_n). I can then do something like the following:
/* create summary of dsn1 as above example */
%cust_quants(dsn= dsn1, varlist= ex_var_1_100 ex_var_pos_skew ex_var_neg_skew,
quant_list= 1 2.5 50 80 97.5 99);
/* cap dsn1 var's at 99th percentile */
data dsn1_cap;
set dsn1;
if ex_var_1_100 > &q_99_1 then ex_var_1_100 = &q_99_1;
if ex_var_pos_skew > &q_99_2 then ex_var_pos_skew = &q_99_2;
/* don't cap neg skew */
run;
In R, it is very easy to do this. One can extract sub-data from a data-frame using matrix like indexing and assign this sub-data to an object. This second object can then be referenced later. R example--extracting b from data-frame a:
> a <- as.data.frame(cbind(c(1,2,3), c(4,5,6)))
> print(a)
  V1 V2
1  1  4
2  2  5
3  3  6
> a[, 2]
[1] 4 5 6
> b <- a[, 2]
> b[1]
[1] 4
Is it possible to do the same thing in SAS? I want to be able to assign one or more columns of sub-data to a macro variable / array, such that I can then use the macro / array within a 2nd data step. One thought is proc sql with into:
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
run;
However, this creates a single string variable when what I really want is a vector of variables (or array--no vectors in SAS). Another thought is to add %scan (assuming this is inside a macro):
proc sql noprint;
select v2 into :v2_macro separated by " "
from a;
run;
%let i = 1;
%do %until(%scan(&v2_macro, &i, %str( )) = );
%let var_&i = %scan(&v2_macro, &i, %str( ));
%let i = %eval(&i + 1);
%end;
This seems inefficient and takes a lot of code. It also requires the programmer to remember which var_&i corresponds to each future purpose. Is there a simpler / cleaner way to do this?
Please let me know in the comments if this is enough background / example. I'm happy to give a more complete description of why I'm doing what I'm attempting if needed.
First off, I assume you are talking about SAS/Base not SAS/IML; SAS/IML is essentially similar to R and has the same kind of operations available in the same manner.
SAS/Base is more similar to a database language than a matrix language (though has some elements of both, and some elements of an OOP language, as well as being a full-featured functional programming language).
As a result, you do things somewhat differently in order to achieve the same goal. Additionally, because of the cost of moving data in a large data table, you are given multiple methods to achieve the same result; you can choose the appropriate method for the required situation.
To begin with, you generally should not store data in a macro variable in the manner you suggest. It is bad programming practice, and it is inefficient (as you have already noticed). SAS Datasets exist to store data; SAS macro variables exist to help simplify your programming tasks and drive the code.
Creating the dataset "b" as above is trivial in Base SAS:
data b;
set a;
keep v2;
run;
That creates a new dataset with the same rows as A, but only the second column. KEEP and DROP allow you to control which columns are in the dataset.
However, there would be very little point in this dataset, unless you were planning on modifying the data; after all, it contains the same information as A, just less. So for example, if you wanted to merge V2 into another dataset, rather than creating b, you could simply use a dataset option with A:
data c;
merge z a(keep=v2);
by id;
run;
(Note: I presuppose an ID variable of some form to combine A and Z.)
This merge combines the v2 column onto z, in a new dataset, c. This is equivalent to horizontally concatenating two matrices (although a straight-up concatenation would remove the 'by id;' requirement, in databases you do not typically do that, as order is not guaranteed to be what you expect).
If you plan on using b to do something else, how you create and/or use it depends on that usage. You can create a format, which is a mapping of values (i.e., 1='Hello' 2='Goodbye') and thus allows you to convert one value to another with a single programming statement. You can load it into a hash table. You can transpose it into a row (proc transpose). Supply more detail and a more specific answer can be provided.
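As an illustration of the lookup-table idea (the role a format or hash table would play in SAS), here is a Python sketch of capping at the 99th percentile without numbered macro variables. It is an analogue under stated assumptions: the q_99 values come from the summary table in the question, while the dsn1 rows and the helper cap_row are invented for the example.

```python
# Summary rows as in dsn2 (only q_99 and var_name shown).
summary = [
    {"var_name": "ex_var_1_100", "q_99": 99},
    {"var_name": "ex_var_pos_skew", "q_99": 20000},
    {"var_name": "ex_var_neg_skew", "q_99": 50},
]

# Key the caps by variable name instead of by position (q_99_1, q_99_2, ...).
caps = {row["var_name"]: row["q_99"] for row in summary}
to_cap = ["ex_var_1_100", "ex_var_pos_skew"]  # the neg-skew variable is left uncapped


def cap_row(row, caps, variables):
    """Return a copy of row with the listed variables capped at their q_99."""
    capped = dict(row)
    for name in variables:
        if capped[name] is not None and capped[name] > caps[name]:
            capped[name] = caps[name]
    return capped


# Made-up rows standing in for dsn1.
dsn1 = [
    {"ex_var_1_100": 150, "ex_var_pos_skew": 10, "ex_var_neg_skew": 75},
    {"ex_var_1_100": 42, "ex_var_pos_skew": 50000, "ex_var_neg_skew": -3},
]
dsn1_cap = [cap_row(row, caps, to_cap) for row in dsn1]
```

Because each cap is looked up by variable name, nobody has to remember which numbered macro variable belongs to which column.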