I have a dataset that results from joining several outputs of PROC UNIVARIATE.
After some more joins, I have a final dataset with a variable called "Measure", which holds the names of certain measures ('mean' and 'standard deviation', for example), and other variables, one per month of a certain year, each holding the values for these measures.
I'd like to sort these measures in a particular order. For now, I'm doing a PROC TRANSPOSE, using a RETAIN to establish the order I want, and then transposing back. The problem is that this is a really naive solution, and I feel it takes longer than it should.
Is there a simpler/more effective way to do this sort?
An example of what I want to do, with random values:
What I have:
Measures | 2013/01 | 2013/02 | 2013/03
Mean | 10 | 9 | 11
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
What I want:
Measures | 2013/01 | 2013/02 | 2013/03
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
Mean | 10 | 9 | 11
I hope I was clear enough.
Thanks in advance
A couple of straightforward solutions. First, you could simply add a variable that you sort by and then drop. There's no need to transpose; just do it in the data step or PROC SQL after the join: if measures='Mean' then sortorder=3; else if measures='Median' then sortorder=2; ... then sort by sortorder and drop it in the PROC SORT step.
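A minimal sketch of that approach, assuming the joined dataset is called HAVE and the measure column is called MEASURES (adjust the names and values to your data):

data have_ordered;
    set have;
    /* assign the desired rank to each measure */
    if measures = 'Std Devi.' then sortorder = 1;
    else if measures = 'Median' then sortorder = 2;
    else if measures = 'Mean' then sortorder = 3;
run;

/* sort by the rank, dropping the helper variable on output */
proc sort data=have_ordered out=want (drop=sortorder);
    by sortorder;
run;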
Second, if your values are entirely numeric, you can have PROC MEANS do the sorting for you: define a custom format that encodes the order, use NOTSORTED and ORDER=DATA on the CLASS statement, and use the IDGROUP functionality to output the right values. This is overkill in most cases, but if the dataset is huge it might be appropriate.
Third, if you're doing the joins in SQL, you can ORDER BY an expression that maps the measure values into the order you want - I can explain that in more detail if you find it the most useful.
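A hedged sketch of the SQL route, again assuming a joined table called HAVE with a MEASURES column:

proc sql;
    create table want as
    select *
    from have
    /* map each measure name to its rank, then order by that rank */
    order by case measures
                 when 'Std Devi.' then 1
                 when 'Median' then 2
                 when 'Mean' then 3
                 else 4
             end;
quit;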
I have a dataset in Stata and would like to create a descriptive statistics table. The problem is that my variables are both numerical and categorical. For the numerical variables, I know I can easily create a table with the mean, standard deviation, and so on; I have just had a problem with the categorical variables. For example, education encompasses 5 different levels, and I would like to show the proportion of observations for each option within the education variable. This is just part of it: I want to create an overall table that has descriptive statistics for other variables, like gender, age, income, level of education, and so on.
I like to use the user-contributed command table1 for this purpose. Type ssc install table1 to access the package.
sysuse auto
table1, vars(price contn \ rep78 cat)
+------------------------------------------------+
| Factor Level Value |
|------------------------------------------------|
| N 74 |
|------------------------------------------------|
| Price, mean (SD) 6,165.3 (2,949.5) |
|------------------------------------------------|
| Repair record 1978 1 2 (3%) |
| 2 8 (12%) |
| 3 30 (43%) |
| 4 18 (26%) |
| 5 11 (16%) |
+------------------------------------------------+
Type help table1 for additional options.
asdocx has a comprehensive template for creating table1. The template can summarize different types of variables, such as continuous and binary/categorical variables, in a single table. The table1 template allows different statistics for categorical/factor variables, continuous variables, and binary variables. The allowed statistics are given below:
mean        Mean of the variable
sd          Standard deviation of the variable
ci          95% confidence interval
n           Counts
N           Counts
frequency   Counts
percentage  Count as percentage of total
%           Count as percentage of total
The statistics above can be selectively used with categorical, binary, and continuous variables. The default statistics for each type of variable are given below:
(1) Binary variables : Count (Percentages)
(2) Categorical variables : Count (Percentages)
(3) Continuous variables : Mean (95% confidence interval)
The table1 template also supports survey weights. I have posted several examples on this page.
I need help checking whether several variables are not empty. Normally a "where VarName is not missing" would suffice; however, the number of variables that are generated will vary.
I have the following macro I found which correctly determines the number of variables when called in a data step:
%macro var_count(var_count_name);
    /* arrays spanning every character and every numeric variable defined so far */
    array vars_char _character_;
    array vars_num _numeric_;
    /* total variable count = character count + numeric count */
    &var_count_name = dim(vars_char) + dim(vars_num);
%mend;
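A minimal sketch of calling it in a data step (HAVE is a stand-in dataset name; the macro call must come after the SET statement so the incoming variables already exist):

data _null_;
    set have;
    %var_count(total_vars)
    put total_vars=;
    stop;
run;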
My dataset is created with a variable number of COLs (i.e. COL1, COL2, COL3, etc.), depending on the dataset I use. I would like to use this in a data step that returns observations after checking each of my generated COL1, COL2, COL3, etc. variables. I envision something like below:
Data Want;
set Have;
where cats(COL, %var_count(total_vars)) is not missing;
run;
But this does not work. I would very much like to avoid having to write "where COL1 is not missing or COL2 is not missing or ..." every time the program is run. Any and all help will be appreciated.
EDIT: I fear I may have been too vague about my needs above. I will try to be clearer below. Not sure if I should make a new post, but here goes.
Dataset that I have (CVal = Character value)
| ID | COL1 | COL2 | COL3 | COL4 | COL5 | COL6 | COL7 |
| 1 | | | | | | | CVal |
| 2 | CVal | CVal | | | | | |
| 3 | | | | | | | |
| 4 | | CVal | | | | | |
I would like to keep IDs 1, 2, and 4, because each of these has information in at least one of COL1 through COL7.
Essentially I would like a piece of code that can do the following:
Data Want;
Set Have;
if missing(COL1) and missing(COL2) and missing(COL3) and missing(COL4)
and missing(COL5) and missing(COL6) and missing(COL7) then delete;
run;
My problem is that the number of COLs will vary depending on the input dataset. Sometimes it may be just COL1-COL5, sometimes COL1-COL20. How can this be made "automatic", so that the step automatically registers the number of COL columns, checks whether they are all empty, and deletes the observation if so?
In your case, to test whether any of the COL: variables is non-empty, you can just test whether the concatenation of all of them is non-empty.
data want;
set have;
if not missing(cats(of COL:));
run;
You need to use subsetting IF because you cannot use variable lists in a WHERE statement.
Example:
35 data test;
36 set sashelp.class;
37 where nmiss(of height weight) > 0 ;
------
22
76
ERROR: Syntax error while parsing WHERE clause.
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, (, ), *, **, +, ',', -, /, <, <=, <>, =, >, >=, ?, AND, BETWEEN, CONTAINS, EQ,
GE, GT, IN, IS, LE, LIKE, LT, NE, NOT, NOTIN, OR, ^, ^=, |, ||, ~, ~=.
ERROR 76-322: Syntax error, statement will be ignored.
38 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.TEST may be incomplete. When this step was stopped there were 0 observations and 5 variables.
WARNING: Data set WORK.TEST was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
Note that if any of the COL... variables is numeric, you would need to modify the test a little. If the MISSING system option is set to ' ', the line above will work; but if it is set to the normal '.', numeric missing values will appear as periods in the concatenation. If you don't mind also treating character values that consist only of periods as missing, you can just use COMPRESS() to remove the periods:
if not missing(compress(cats(of COL:),'.'));
You can use the N function to count the non-missing numeric values and CATS to check whether any character values are non-missing.
Example:
Presume the numeric and character variables are segregated with prior variable-based ARRAY statements, such as
array chars col1-col7;
array nums x1-x10;
The subsetting if would be
if n(of nums(*)) or not missing(cats(of chars(*)));
or test using COALESCE and COALESCEC
if not missing(coalesce(of nums(*))) or
not missing(coalesceC(of chars(*)));
If you don't know the variable names ahead of time, you will need to examine the dataset before the subsetting DATA step and generate the ARRAY statements (or variable lists) into macro variables.
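A hedged sketch of that codegen step, assuming a WORK dataset named HAVE in which at least one character COL variable exists (the macro variable name CHARVARS is just for illustration):

proc sql noprint;
    /* collect the names of the character COL variables into a macro variable */
    select name into :charvars separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'HAVE'
      and upcase(name) like 'COL%' and type = 'char';
quit;

data want;
    set have;
    /* keep rows where at least one COL variable is non-missing */
    if not missing(cats(of &charvars));
run;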
My table looks like this:
Candidate |Current Status | Interviewer 1 | Interview 1 Date | Interviewer 2 | Interview 2 Date
Candidate 1 | Int1 clear | aaa | 1/1/2020 | bbb | 2/1/2020
Candidate 2 | Int1 pending | bbb | 10/1/2020 | aaa | 10/2/2020
There're more columns but I'm ignoring them for now.
I want to create a view to find out how many interviews were conducted by "aaa", drilling down to the interview date and the current status. The issue is that "aaa" will show up as both Interviewer 1 and Interviewer 2.
I tried to unpivot Interviewer 1 and Interviewer 2, but that gives me the irrelevant dates of interviews conducted by "bbb". Something like:
Candidate 1 | Int 1 clear | 1/1/2020 | 2/1/2020 | Interviewer 1 | aaa
Candidate 1 | Int 1 clear | 1/1/2020 | 2/1/2020 | Interviewer 1 | bbb
Candidate 2 | Int 1 pending | 1/1/2020 | 2/1/2020 | Interviewer 2 | aaa
Candidate 2 | Int 1 pending | 1/1/2020 | 2/1/2020 | Interviewer 2 | bbb
Now there's data (Interview 2 date) of aaa for interviews conducted by bbb.
Clarification - Interview 1 and Interview 2 are for the same candidate. Each candidate goes through a series of interviews, so we're trying to keep track of the candidate and the interviews they go through.
Each interview is conducted by a different panelist - I want to count the number of interviews conducted by each panelist and drill down to the details of each interview.
I don't know exactly what you want to do, since your explanation is somewhat vague. If I understand you correctly, you might be better off assigning interviewer labels to the correct interview by hand.
For example:
(this is without unpivoting)
Interview |Interviewer|Candidate. |status
____________________________________________
Interview 1|aaa. |Candidate 1|Pending
Interview 2|bbb. |Candidate 2|Pending
Interview 3|aaa. |Candidate 3|Clear
and so on
Or, you could also try making interviewer columns like the following:
aaa. |bbb. |Candidate. |status
____________________________________________
Interview 1|Interview 2|Candidate 1|Pending
Interview 3|Interview 5|Candidate 2|Pending
Interview 4|interview 6|Candidate 3|Clear
and so on
In case of the latter, you can unpivot aaa and bbb. This will create a table where you find the interviewer in one column and the interviews that interviewer has conducted in a values column. This will, however, make it look as though every candidate was interviewed by both interviewers; I do not know if this is what you want. You can work around this, but for that we would need more information and a clearer question.
Both ways described above would let you create a filter for the interviewer and thus calculate whatever you want for the corresponding interviewer.
Hope this helps.
Are you 100% married to the idea of keeping everything in one table? There are some advantages to the approach of creating separate tables for the interviewers, candidates, and possibly the interview status.
However, let's assume that you prefer to keep everything in one table. There's actually no need to unpivot columns to solve what you're looking for.
I recommend using a tidy data approach and creating one column for each variable. In this case the variables are the candidate, the interviewer, the date of the interview, which interview it was, and what the interview status is. Personally I would make the interview status a calculated column either directly in the query or after the table loads and using DAX.
This is how I would approach it - first make a duplicate of the original query. Drop the interview status column for now in both queries.
In your original query, also get rid of the columns for the interviewer and interview date for the second interview. You should have three columns left in the original query - candidate, interviewer 1, and interview 1 date. Create a new column for the interview stage. Populate it with something like "1" or "First".
In your duplicate query keep the information for candidate, interviewer 2, and interview 2 date. Get rid of interviewer 1 and interview 1 date. You should have three columns, candidate, interviewer 2, and interview 2 date. Create a new column for the interview stage. Populate it with something like "2" or "Second".
In both queries change the column names so they're the same in both queries. I recommend simply dropping the 1 or 2 from the interviewer and interview date columns.
Append the two queries together. You should now have one table with four columns: candidate, interviewer, interview date, and interview stage. Since your primary interest is in the interviewer, move that column to the far left. Sort by the interviewer first (ascending or descending by whichever works better for you), then by the candidate ascending or descending, and then by the date in ascending order. Add an index column and either leave it at the end or move it to the far left as you choose. It doesn't matter if you start at 0 or 1 on the index column.
At this point you can either load the table or try to create a status column using whatever logic determines pending vs cleared or other statuses you might have. Personally I find it easier to create columns for that type of logic using DAX but it may be easier to do it in the query depending on how complex the logic is.
Once you have that calculated column for the status you should have everything you need to generate the visuals for what you want to see. The index column is there to give you more options with how you approach the status column. It also gives you a way to put the table in the exact order you had it in the query prior to load. As I'm sure you've noticed when looking at your tables in the datasheet view after load, the rows probably aren't in the same order that they were in the query. Also, you can't sort on more than one column at a time in the datasheet view. Sorting by the index column takes care of both those concerns.
If you do the status column in DAX, you will probably want to look at the EARLIER function if you're not already familiar with it.
I have a table that contains multiple columns whose names have either the suffix _EXPECTED or _ACTUAL. For example, looking at my sold items from my SoldItems table, I have the following columns: APPLES_EXPECTED, BANANAS_EXPECTED, KIWIS_EXPECTED, APPLES_ACTUAL, BANANAS_ACTUAL, KIWIS_ACTUAL. (The identifier of the table is the date, so we have results per date.) I want to show that data in table form, something like this (for a date selected in the filters):
+------------+----------+--------+
| Sold items | Expected | Actual |
+------------+----------+--------+
| Apples | 10 | 15 |
| Bananas | 8 | 5 |
| Kiwis | 2 | 1 |
+------------+----------+--------+
How can I manage something like this in Power BI? I tried playing with the matrix/table visualization, but I can't figure out a way to merge all the expected and actual columns together.
It looks like the easiest option for you would be to mould the data a bit differently using Power Query. You can UNPIVOT your data so that all the expected and actual values become rows instead of columns. For example, take the following sample:
Date Apples_Expected Apples_Actual
1/1/2019 1 2
Once you unpivot this it will become:
Date Fruit Count
1/1/2019 Apples_Expected 1
1/1/2019 Apples_Actual 2
Once you unpivot, it should be fairly straightforward to get the view you are looking for. The following link should walk you through the steps to unpivot:
https://support.office.com/en-us/article/unpivot-columns-power-query-0f7bad4b-9ea1-49c1-9d95-f588221c7098
Hope this helps.
I am working on a research project that requires me to run a linear regression of stock returns (for thousands of companies) on the market return, for every single day between 1993 and 2014.
The data would be similar to (This is dummy data):
| Ticker | Time | Stock Return | Market Return |
|----------|----------|--------------|---------------|
| Facebook | 12:00:01 | 1% | 1.5% |
| Facebook | 12:00:02 | 1.5% | 2% |
| ... | | | |
| Apple | 12:00:01 | -0.5% | 1.5% |
| Apple | 12:00:03 | -0.3% | 2% |
The data volume is pretty huge: there is around 1.5 GB of data for each day, and there are 21 years of data that I need to analyze and run regressions on.
The regression formula is something like
Stock_Return = beta * Market_Return + alpha
where beta and alpha are two coefficients we are estimating. The coefficients are different for every company and every day.
Now, my question is: how do I output the beta and alpha for each company and each day into a data structure?
I was reading the SAS regression documentation, but it seems that the output is text rather than a data structure.
The code from documentation:
proc reg;
model y=x;
run;
The output from the documentation is a printed report, not a dataset.
There is no way that I can read over every beta for every company on every single day; there are tens of thousands of them.
Therefore, I was wondering if there is a way to output and extract the betas into a data structure?
I have a background in OOP languages (Python and Java), so SAS can be really confusing sometimes ...
SAS is in many ways very similar to an object-oriented programming language, though of course it also has features of functional languages and 4GLs.
In this case, there is an object: the Output Delivery System (ODS). Every procedure in SAS 9 that produces printed output produces it via the Output Delivery System, and you can generally capture that output as a dataset via ODS OUTPUT if you know the name of the output object.
You can use ODS TRACE to see the names of the output produced by a particular proc.
data stocks;
set sashelp.stocks;
run;
ods trace on;
proc reg data=stocks;
by stock;
model close=open;
run;
ods trace off;
Note the names in the log. Then, for whatever output you want, you just wrap the proc in ODS OUTPUT statements.
So if I want parameter estimates, I can grab them:
ods output ParameterEstimates=stockParams;
proc reg data=stocks;
by stock;
model close=open;
run;
ods output close;
You can have as many ODS OUTPUT statements as you want, if you want multiple datasets output.
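For example, a sketch capturing two of PROC REG's output objects at once (ParameterEstimates and FitStatistics are names you would see in the ODS TRACE log):

ods output ParameterEstimates=stockParams FitStatistics=stockFit;
proc reg data=stocks;
    by stock;
    model close=open;
run;
ods output close;

For your problem, sort the data by company and day and put those variables on the BY statement; the ParameterEstimates dataset will then contain one alpha/beta pair per company per day.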