Suppose that in Stata I have one stacked variable of stock returns (in column 2) with 2,000,000 observations (some blanks are replaced with dots). How can I create another variable next to it that starts at 1 and increases in increments of one (1, 2, 3, 4, ...) all the way down to 2,000,000? I need this kind of variable to merge datasets. Advice would be much appreciated.
If it helps: in VBA, I would find the last row of the stacked column and then create a variable on that basis, increasing in increments of one (assuming, of course, that Excel allowed 2 million rows).
gen long id = _n
will populate a variable with the observation number.
Note that you can merge on observation number. You don't need any identifier variable(s) to do it. In practice, I would almost always be very queasy about any merge not based on explicit identifiers, unless the datasets were visibly compatible (not so with 2 million observations).
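To illustrate the caution above, here is a minimal Python sketch (not Stata) of what an observation-number id amounts to and why a purely positional merge is fragile. The values and the names `returns`, `labels` and `label_by_id` are made up for illustration.

```python
# Hypothetical data: None stands in for Stata's missing value ".".
returns = [0.012, None, -0.004, 0.020]

# Same idea as `gen long id = _n`: number the observations 1, 2, 3, ...
with_id = [(i, r) for i, r in enumerate(returns, start=1)]

# A second dataset carrying the same ids, but stored in a different order.
labels = [(3, "C"), (1, "A"), (4, "D"), (2, "B")]

# Positional merge: pairs row k with row k, whatever the ids say.
positional = [(i, r, lab) for (i, r), (_, lab) in zip(with_id, labels)]

# Keyed merge: looks each id up explicitly, so row order no longer matters.
label_by_id = dict(labels)
keyed = [(i, r, label_by_id[i]) for i, r in with_id]

print(positional[0])  # (1, 0.012, 'C')  <- wrong label for id 1
print(keyed[0])       # (1, 0.012, 'A')
```

The positional result is only right if both datasets happen to be in the same order, which is hard to verify by eye with 2 million observations.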
I have a table with several million records. In it, I have a column that looks like this (it cycles from 1 to 7 hundreds of times)
I would like to add an index (say nweeks) that looks like this,
Any ideas?
Thanks
Without seeing more of the data table and its potential natural ordering columns, you could create a DATA step view:
data work.big_with_week / view=work.big_with_week;
set big;
if list = 1 then nweek + 1;
run;
The syntax variable+expression is known as a SUM statement.
The sum statement is equivalent to using the SUM function and the RETAIN statement, as shown here:
retain variable 0;
variable=sum(variable,expression);
Thus, the retained variable nweek is incremented only when the list value is 1. If your big data ever becomes disordered or otherwise fails to uphold the implicit contract of list being sequenced 1..7, the view will not be accurate.
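The retained-counter logic can be sketched outside SAS as well. The following Python is only an illustration of the idea, not a replacement for the view; the function name `add_week_index` and the sample values are made up.

```python
def add_week_index(list_values):
    """Return a running week number that goes up each time the value 1 appears."""
    nweek = 0                # kept across rows, like `retain nweek 0`
    result = []
    for value in list_values:
        if value == 1:       # start of a new 1..7 cycle
            nweek += 1       # like the sum statement `nweek + 1`
        result.append(nweek)
    return result

ordered = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3]
print(add_week_index(ordered))     # [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]

# If the 1..7 sequence is broken, the counter still advances on every 1.
disordered = [1, 2, 1, 3, 4]
print(add_week_index(disordered))  # [1, 1, 2, 2, 2]
```

The second call shows the failure mode mentioned above: a stray 1 in the middle of a week starts a new week number.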
I'm trying to find a way to calculate the number of unique strings in a single column, excluding blank cells. So far I've seen solutions such as the following:
=SUM(1/COUNTIF(X2:X99;X2:X99))
Plus another similar formula using FREQUENCY instead of COUNTIF. However, applying this to my spreadsheet gives me a decimal value that has no apparent meaning. For example, if my column contains 20 cells containing "ABC", and 30 cells with "XYZ", I should have an output value of 2. However, this is not the case, and even I can clearly see that the above formula won't output anything larger than a 1, which has left me rather confused as to its usage.
Pivot tables seem to show the most promise, but I can't get that to work either. Here's what I tried:
Select the column, including the header
Select a new pivot table and use the selected range
Drag the header from Available Fields to Row Fields
Select the ignore empty rows option
Create the table
This then creates a table with one row per unique entry in the first column, and an empty second column. One row below is a Total Result cell, with the adjacent cell empty.
From this, I can see that there must be some sort of capability of the software to find unique strings, so it would stand to reason that there must also be a way of counting them and displaying that value in a cell. The question is, how do I do that?
Your first attempt should work if (a) wrapped in SUMPRODUCT and (b) the range does not contain blank cells:
=SUMPRODUCT(1/COUNTIF(X2:X51;X2:X51))
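To see why this counts unique values, here is a small Python sketch of the same arithmetic (an illustration only, with a made-up function name `count_unique`). Each cell contributes 1/k, where k is the number of times its value occurs, so the k cells sharing a value add up to exactly 1. Without array evaluation, the formula likely computes just one such term (for example 1/20), which would explain the decimal result.

```python
from collections import Counter
from fractions import Fraction

def count_unique(cells):
    """Mimic SUMPRODUCT(1/COUNTIF(range; range)) for a range with no blank cells."""
    counts = Counter(cells)  # plays the role of COUNTIF for every value
    # Exact fractions avoid floating-point rounding in the sum.
    return sum(Fraction(1, counts[c]) for c in cells)

cells = ["ABC"] * 20 + ["XYZ"] * 30
print(count_unique(cells))   # 2
```

Here the 20 "ABC" cells contribute 20 × 1/20 = 1 and the 30 "XYZ" cells contribute 30 × 1/30 = 1, giving 2.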
I have a DataPrep dataset which contains a series of ~10 columns, each of which indicates whether or not a particular brochure was selected:
BRO_AF BRO_SAF BRO_SE ...
1 1
1 1
1
I'd like to sum/count these values into a BrochuresSelected column.
I was hoping to use ADD with a column range (ie BRO_AF~BRO_ITA), but ADD only takes two numbers.
I can't use COUNT, as it counts rows not columns.
I can use NEST to create a column storing a map or array of brochures, but there doesn't seem to be a function for adding these. I can't use ARRAYLEN on this column, as even empty columns are represented in the column (eg ["1","","","","",""] would have an array length of six, not one).
Has anyone solved a similar issue?
If you know the column names, you can use the + operator in a derive transform. For example:
I am working in SAS Enterprise Guide on a line of code that is supposed to read the number of observations in a dataset. The dataset contains 3 rows (observations).
I write the following line of code to get the number of observations and store it in the number_observations variable:
call symputx("number_observations", put(attrn(dsid, "nobs"),best.));
However, instead of getting a result of 3, this line returns 9 for me.
Any idea what is happening? I should maybe also note that I manually edited this table (it used to have 9 rows).
Use nlobs instead of nobs. nlobs gives the number of logical observations, that is, it excludes any records marked for deletion.
In some situations nlobs returns -1 because it doesn't know the number of observations. My favorite countobs paper is http://www2.sas.com/proceedings/sugi26/p095-26.pdf.
I have a quick question. In Excel you can refer to the value in row 20, column C as the C20 cell. What is the equivalent expression in a SAS database?
It's not really useful to think of a SAS dataset like a spreadsheet. Rather, think about it like a database table. Extracting a particular row is easy, but extracting a column requires a name rather than a position, like C in Excel.
If this is the dataset
row | x1 | x2 | x3
----+----+----+----
1   | 0  | 1  | 0
2   | 1  | 2  | 3
Then in a data step, you can get the equivalent of B2 like so:
data b2;
set dataset;
if _n_ = 2 then output;
keep x2;
run;
The output dataset will then contain only the value you want. But you have to know that x2, for example, is the variable you want.
This isn't really what SAS is for, though.
You cannot explicitly refer to a single cell external to a dataset out of context, like you can in Excel. SAS processes rows one at a time, and does not naturally have the ability to directly access a cell.
In general, if you're referring to the value in a discussion, you would refer to the column as a variable. You could refer to the row number, although that has very little meaning in most instances (particularly as you can sort a dataset, changing all of the row numbers); instead, you would refer to it by its primary key. This would be whatever defines a unique row in your data. It might be a subject ID, for example, or some combination of several variables that together define a unique row.
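The difference between addressing a value by row number and by primary key can be sketched in a few lines of Python. This is only an illustration: the `subject_id` column and its values are hypothetical, added to the small x1/x2/x3 example to play the role of a key.

```python
# Hypothetical dataset: each row is a dict keyed by variable name.
rows = [
    {"subject_id": "S01", "x1": 0, "x2": 1, "x3": 0},
    {"subject_id": "S02", "x1": 1, "x2": 2, "x3": 3},
]

def x2_for(subject_id):
    """Look up x2 by primary key rather than by position."""
    return next(r["x2"] for r in rows if r["subject_id"] == subject_id)

# By row number: the second row's x2 (the "B2" of the example).
print(rows[1]["x2"])   # 2
# By primary key: the same value, found through subject_id.
print(x2_for("S02"))   # 2

# After sorting, the row number points at a different record,
# while the key lookup still returns the same value.
rows.sort(key=lambda r: r["x2"], reverse=True)
print(rows[1]["x2"])   # 1
print(x2_for("S02"))   # 2
```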