I have a dataset in Stata that looks like this:
entityID  indicator  indicatordescr  indicatorvalue
1         gdp        Gross Domestic  100
1         pop        Population      15
1         area       Area            50
2         gdp        Gross Domestic  200
2         pop        Population      10
2         area       Area            300
and there is a one-to-one mapping between values of indicator and values of indicatordescr.
I want to reshape it to wide, i.e. to:
entityID gdp pop area
1 100 15 50
2 200 10 300
where I would like the variable label of gdp to be "Gross Domestic", that of pop to be "Population" and that of area to be "Area".
Unfortunately, as I understand, it is not possible to assign the value of indicatordescr as a value label of indicator, so the reshape can't transform these value labels into variable labels.
I have looked at this : Bring value labels to variable labels when reshaping wide
and this : http://www.stata.com/support/faqs/data-management/apply-labels-after-reshape/
but did not understand how to apply those to my case.
NB: the variable labeling after reshape must be done programmatically, because indicator and indicatordescr have many values.
"String labels" here is informal; Stata does not support value labels for string variables. However, what is wanted here is that the distinct values of a string variable become variable labels on reshaping.
Various work-arounds exist. Here's one: put the information in the variable name and then take it out again.
clear
input entityID str4 indicator str14 indicatordescr indicatorvalue
1 gdp "Gross Domestic" 100
1 pop "Population" 15
1 area "Area" 50
2 gdp "Gross Domestic" 200
2 pop "Population" 10
2 area "Area" 300
end
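* pack code and description into one string that can serve as a variable-name suffix
gen what = indicator + "_" + subinstr(indicatordescr, " ", "_", .)
keep entityID what indicatorvalue
reshape wide indicatorvalue , i(entityID) j(what) string
foreach v of var indicator* {
    * turn underscores back into spaces, e.g. "indicatorvaluegdp Gross Domestic"
    local V : subinstr local v "_" " ", all
    * the first word is the new variable name (its prefix is removed below)
    local new : word 1 of `V'
    rename `v' `new'
    * everything after the first space is the description, used as the variable label
    local V = substr("`V'", strpos("`V'", " ") + 1, .)
    label var `new' "`V'"
}
renpfix indicatorvalue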
EDIT: If the 32-character limit on variable names bites, try another work-around:
clear
input entityID str4 indicator str14 indicatordescr indicatorvalue
1 gdp "Gross Domestic" 100
1 pop "Population" 15
1 area "Area" 50
2 gdp "Gross Domestic" 200
2 pop "Population" 10
2 area "Area" 300
end
mata : sdata = uniqrows(st_sdata(., "indicator indicatordescr"))
keep entityID indicator indicatorvalue
reshape wide indicatorvalue , i(entityID) j(indicator) string
renpfix indicatorvalue
mata : for(i = 1; i <= rows(sdata); i++) stata("label var " + sdata[i, 1] + " " + char(34) + sdata[i,2] + char(34))
LATER EDIT Although the above is called a work-around, it is a much better solution than the previous.
Related
I have a balanced panel with a set of dummies for 'countries' and observations for several years. I want to generate a new set of variables that assigns a number in the sequence 1:n for each year observation of country i, and 0 for any other observation that is not from country i.
As an example, suppose I have two countries and two years. Below on the left is an example of my database. I want a new set of variables as shown on the right:
*Example of Database Example of Desired Output
*country1 country2 year output1 output2
* 1 0 1 1 0
* 1 0 2 2 0
* 0 1 1 0 1
* 0 1 2 0 2
How can I get the desired output? Intuitively I need to multiply 'country*' by 'year' to get 'output*', but I have been unable to make it work in Stata.
Below is what I tried.
gen output = year * country
* country is ambiguous
gen output = year * country*
* invalid syntax
foreach var in country*{
gen output_`var' = year * `var'
}
* invalid name
Your last attempt almost solved it. To use the wildcards * and ? in foreach, you need to tell Stata that you are passing a varlist. With foreach ... in, the list is taken literally, so country* is never expanded into variable names. Use foreach ... of varlist instead:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end
foreach var of varlist country* {
gen `var'_year = year * `var'
}
The full name country1, country2 etc. is stored in `var', so I took the liberty of naming the result variables country1_year, country2_year etc. rather than output_country1, output_country2 etc.
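The difference between the two forms of foreach can be seen by just displaying the loop macro. This is a minimal sketch that reuses the example data above; it only prints names and creates no variables.

```stata
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end

* "in" takes the list literally: one pass, with var holding the text country*
foreach var in country* {
    display "`var'"
}

* "of varlist" expands the wildcard: one pass each for country1 and country2
foreach var of varlist country* {
    display "`var'"
}
```

The first loop prints country* once, which is why gen output_`var' failed with "invalid name"; the second prints country1 and then country2.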
Note that this solution only works if every country* variable takes only the values 0 and 1, no observation has a missing value in any country* variable, and no observation has the value 1 in more than one country* variable.
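Those conditions can be checked before the multiplication. The following is a sketch only, again on the example data; the helper variable n_ones is a name made up for this illustration.

```stata
clear
input byte(country1 country2 year)
1 0 1
1 0 2
0 1 1
0 1 2
end

* Run these checks before creating the country*_year variables,
* because the wildcard country* would otherwise match them as well.

* every dummy is 0 or 1; a missing value fails this check too
foreach var of varlist country* {
    assert inlist(`var', 0, 1)
}

* no observation has more than one dummy equal to 1
* (n_ones is a hypothetical helper variable, dropped afterwards)
egen byte n_ones = rowtotal(country*)
assert n_ones <= 1
drop n_ones
```

If an assert fails, Stata stops with an error, so the loop that generates the new variables should only be run once both checks pass.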
I'm working in SAS as a novice. I have two datasets:
Dataset1
UniqueID  ColumnA
1         15
1         39
2         20
3         10
Dataset2
UniqueID  ColumnB
1         40
2         55
2         10
For each UniqueID, I want to subtract each value of ColumnA from all values of ColumnB, and I would like to create a NewColumn that is 1 whenever 1 < ColumnB - ColumnA < 30. For the first row of Dataset1, where UniqueID = 1, I would want SAS to go through all the rows in Dataset2 that also have UniqueID = 1 and determine whether there is any row in Dataset2 where the difference between ColumnB and ColumnA is greater than 1 and less than 30. For the first row of Dataset1, NewColumn should be assigned a value of 1 because 40 - 15 = 25. For the second row of Dataset1, NewColumn should be assigned a value of 0 because 40 - 39 = 1 (which is not greater than 1). For the third row of Dataset1, I again want SAS to go through every row of ColumnB in Dataset2 that has the same UniqueID as in Dataset1, so 55 - 20 = 35 (which is greater than 30), but NewColumn would still be assigned a value of 1 because (moving to row 3 of Dataset2, which has UniqueID = 2) 20 - 10 = 10, which satisfies the if statement.
So I want my output to be:
UniqueID  ColumnA  NewColumn
1         15       1
1         30       0
2         20       1
I have tried concatenating Dataset1 and Dataset2 into a FullDataset. Then I tried using a do loop statement but I can't figure out how to do the loop for each value of UniqueID. I tried using BY but that of course produces an error because that is only used for increments.
DATA FullDataset;
set Dataset1 Dataset2; /*Concatenate datasets*/
do i=ColumnB-ColumnA by UniqueID;
if 1<ColumnB-ColumnA<30 then NewColumn=1;
output;
end;
RUN;
I know I'm probably way off but any help would be appreciated. Thank you!
So, the way that answers your question most directly is the keyed set. This isn't necessarily how I'd do this, but it is fairly simple to understand (as opposed to a hash table, which is what I'd use, or a SQL join, probably what most people would use). This does exactly what you say: grabs a row of A, says for each matching row of B check a condition. It requires having an index on the datasets (well, at least on the B dataset).
data colA(index=(id));
input ID ColumnA;
datalines;
1 15
1 39
2 20
3 10
;;;;
data colB(index=(id));
input ID ColumnB;
datalines;
1 40
2 55
2 30
;;;;
run;
data want;
*base: the colA dataset - you want to iterate through that once per row;
set colA;
*now, loop while the check variable shows 0 (match found);
do while (_iorc_ = 0);
*bring in other dataset using ID as key;
set colB key=ID ;
* check to see if it matches your requirement, and also only check when _IORC_ is 0;
if _IORC_ eq 0 and 1 lt ColumnB-ColumnA lt 30 then result=1;
* This is just to show you what is going on, can remove;
put _all_;
end;
*reset things for next pass;
_ERROR_=0;
_IORC_=0;
run;
I have the following dataset:
dataseta:
No.  Name1  Name2                       Sales  Inv  Comp
1    TC     Tribal Council Inc          100    100  0
2    TC     Tribal Council Limited INC  20     25   65
desired output:
datasetb:
No.  Name1  Name2                       Sales  Inv  Comp
1    TC     Tribal Council Limited Inc  120    125  0
Basically, I need to choose the row with the maximum length of characters for the column name2.
I tried the following, but it didn't work
proc sql;
create table datasetb as select no,name1,name2,sum(sales),sum(inv),min(comp) from dataseta group by 1,2,3 having length(name2)=max(length(name2));quit;
If I do the following code, it only partially resolves it, and I get duplicate rows
proc sql;
create table datasetb as select no,name1,max(length(name2)),sum(sales),sum(inv),min(comp) from dataseta group by 1,2 having length(name2)=max(length(name2));quit;
You appear to be joining the results of two separate aggregate computations.
Presuming that no is unique, so that it can serve as a tie-breaker, and that the first (by no) longest name2 is to be joined with the sales, inv and comp totals over name1, the query will have a lot going on:
1. The first longest name2 within name1. Nested subqueries are needed to determine the longest name2 and then, if there is more than one, to select the first according to no.
2. The totals over name1. These come from a subquery that is joined to the first one to deliver the desired result set.
Example (SQL)
data have;
length no 8 name1 $6 name2 $35 sales inv comp 8;
* the ampersand modifier allows single blanks inside a value so fields must be separated by two or more blanks;
input no name1& name2& sales inv comp;
datalines;
1  TC  Tribal Council Inc            100  100   0   * name1=TC group
2  TC  Tribal Council Limited INC     20   25  65
3  TC  Tribal council co               0    0   0
4  TC  The Tribal council Assoctn     10   10  10
7  LS  Longshore association          10   10   0   * name=LS group
8  LS  The Longshore Group, LLC        2    4   8
9  LS  The Longshore Group, llc       15   15   6
run;
proc sql;
create table want as
select
first_longest_name2.no,
first_longest_name2.name1,
first_longest_name2.name2,
name1_totals.sales,
name1_totals.inv,
name1_totals.comp
FROM
(
select
no, name1, name2
from
( select
no, name1, name2
from have
group by name1
having length(name2) = max(length(name2))
) longest_name2s
group by name1
having no = min(no)
) as
first_longest_name2
LEFT JOIN
(
select
name1,
sum(sales) as sales,
sum(inv) as inv,
sum(comp) as comp
from
have
group by name1
) as
name1_totals
ON
first_longest_name2.name1 = name1_totals.name1
;
quit;
Example (DATA Step)
Processing the data in a serial manner, when name1 groups are contiguous rows, can be accomplished using a DOW loop technique -- that is a loop with a SET statement within it.
data want2;
do until (last.name1);
set have;
by name1 notsorted;
if length(name2) > longest then do;
longest = length(name2);
no_at_longest = no;
name2_at_longest = name2;
end;
sales_sum = sum(sales_sum,sales);
inv_sum = sum(inv_sum,inv);
comp_sum = sum(comp_sum,comp);
end;
drop name2 no sales inv comp longest;
rename
no_at_longest = no
name2_at_longest = name2
sales_sum = sales
inv_sum = inv
comp_sum = comp
;
run;
I'm trying to make an automated Excel file that documents the number of observations dropped during my sample construction, using putexcel and a simple program.
I'm pretty new to programming, but the program below does the job. It stores 4 global macros for each time I drop some observations: 1) Number of observations dropped, 2) Share of observations dropped, 3) Number of observations left in the data set and 4) a string that describes why I drop the observations.
To export the results to excel I use the putexcel-command -- which is working fine. The problem is that I need to drop observations a lot of times in the dofile and I wondered if I could somehow incorporate the putexcel part in the program to make it loop over cells.
In other words, what I want is the program to automatically save the description ($why) in A1 the first time, in A8 the second time and so on.
I have provided an example of my code below:
** Generate some data:
clear
input id year wage
1 1 200
1 2 250
1 3 300
2 1 152
2 2 150
2 3 140
3 1 300
3 2 320
3 3 360
end
** Define program
cap program drop dropdata
program define dropdata
count
global N = r(N)
count if `1'
global drop = r(N)
global share = ($drop/$N)
drop if `1'
count
global left = r(N)
global why = "`2'"
end
** Drop if first year
dropdata year==1 "Drop if first year"
** Export to excel
putexcel set "documentation.xlsx", modify
putexcel A1 = ("$why")
putexcel A3 = ("Obs. dropped") A4 = ("Share dropped") A5 = ("Observations left")
putexcel B3 = ($drop) B4 = ($share) B5=($left)
** Now drop if wage is < 300
dropdata wage<300 "Drop if wage<300"
putexcel A8 = ("$why")
putexcel A10 = ("Obs. dropped") A11 = ("Share dropped") A12 = ("Observations left")
putexcel B10 = ($drop) B11 = ($share) B12 = ($left)
The issue with this is that Stata does not know what cells are filled and which are not, so I think it would probably be easiest to include another argument in your program define that says the number of times you have run the program.
Here is an example:
** Generate some data:
clear
input id year wage
1 1 200
1 2 250
1 3 300
2 1 152
2 2 150
2 3 140
3 1 300
3 2 320
3 3 360
end
** Define program
cap program drop dropdata
program define dropdata
    * arguments: 1 = drop condition, 2 = description, 3 = number of earlier calls (0, 1, 2, ...)
    count
    local N = r(N)
    count if `1'
    local drop = r(N)
    * use the locals defined above; the globals no longer exist
    local share = `drop'/`N'
    drop if `1'
    count
    local left = r(N)
    local why = "`2'"
    * each call writes a block that starts 7 rows below the previous one
    local row1 = `3'*7 + 1
    local row3 = `row1' + 2
    local row4 = `row1' + 3
    local row5 = `row1' + 4
    putexcel set "documentation.xlsx", modify
    putexcel A`row1' = ("`why'")
    putexcel A`row3' = ("Obs. dropped") A`row4' = ("Share dropped") A`row5' = ("Observations left")
    putexcel B`row3' = (`drop') B`row4' = (`share') B`row5' = (`left')
end
** Drop if first year
dropdata year==1 "Drop if first year" 0
** Now drop if wage is < 300
dropdata wage<300 "Drop if wage<300" 1
Note that the change is to include the number of calls already done as the third argument in dropdata, then we add the putexcel commands to rows based on that number.
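To see which cells each call writes to, the row arithmetic can be run on its own. This sketch only displays cell addresses and does not touch any Excel file; the loop counter k stands in for the third argument and is a name chosen for this illustration.

```stata
* k plays the role of the third argument of dropdata (number of earlier calls)
forvalues k = 0/2 {
    local row1 = `k'*7 + 1
    local row3 = `row1' + 2
    local row5 = `row1' + 4
    display "call `k': description in A`row1', results in B`row3' to B`row5'"
}
```

This puts the description in A1, A8 and A15 for the first three calls, matching the layout asked for in the question.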
As an aside:
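I changed all of your globals to locals because they're safer: a local exists only inside the program that defines it, so it cannot clash with a macro of the same name defined elsewhere. Also, in general, if you want to return macros from a program that you write, you declare the program as, for example, rclass and then use return statements like those below: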
program define return_2, rclass
return local asdf 2
end
You can then access the returned value (which is equal to 2) as r(asdf), and you can check the values of everything returned by the program with the command return list.
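A short usage sketch for the program above: call it, then read the result before running another command that overwrites r().

```stata
capture program drop return_2
program define return_2, rclass
    return local asdf 2
end

return_2
display "`r(asdf)'"    // shows 2
return list            // lists everything return_2 left in r()
```

The same pattern could be applied to dropdata, returning the counts in r() instead of storing them in globals, although that variant is not shown in the answer above.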
This is the output that I need in RTF format:
**DEMOGRAPHICS A-B**
Age
    n 18
    Mean 30.4
    SD 6.29
    Min 18
    Median 30.5
    Max 39
but I am getting this result:
**DEMOGRAPHICS A-B**
Age
n 18
Mean 30.4
SD 6.29
Min 18
Median 30.5
Max 39
How do I left align age and center the remaining variables?
Here is my code:
proc report data = FINAL2 split = "#"
STYLE(REPORT)=[BACKGROUND=WHITE BORDERCOLOR=BLACK BORDERWIDTH=0.1 ASIS=on FRAME=HSIDES RULES=GROUPS]
STYLE(HEADER)=[BACKGROUND=WHITE];
COLUMN DESC STAT1;
define DESC / "Demographic Characteristics" style(column)=[cellwidth=30%] style(header)=[just=left asis = on] ;
define STAT1 /"A - B#(N=18)" style(column header)=[cellwidth = 20%] style(header)=[just = left asis = yes];
You can use a compute block to do this. This would be executed per row but you could conditionally apply a column-specific style from there based on the variable's value being 'Age' or something else.
For example (you can add this after the define statements in your report step):
compute desc;
if desc ^= 'Age' then
call define(_COL_, "style", "style=[paddingleft=3em]");
endcomp;
This would apply 3em of left padding to every desc cell whose value is not 'Age'.