I have a dataset (BASE) with the following structure: a column with an index for every record, a column with the classification type, a column with the classification value, and a column I'd like to populate.
NAME |CLASSIFICATION|VALUE|STANDARD VALUE
FIDO |ALFABET |F |
ALFA |STANDARD |2 |
BETA |STANDARD |5 |
ETA |MIXED |B65 |
THETA|MIXED |A40 |
Not all records use the same classification; however, I have an additional table (TRANSCODE) to convert the different classification methods into the standard one (the STANDARD classification):
ALFABET|STANDARD|MIXED
A |1 |A1
B |5 |A30
C |3 |A40
D |5 |A31
E |8 |B65
F |6 |C54
My goal is to populate the fourth column with the corresponding value I can find in the second table (records with the STANDARD classification will simply end up with the same value in the VALUE and STANDARD VALUE columns).
After that, my data should look like the following:
NAME |CLASSIFICATION|VALUE|STANDARD VALUE
FIDO |ALFABET |F |6
ALFA |STANDARD |2 |2
BETA |STANDARD |5 |5
ETA |MIXED |B65 |8
THETA|MIXED |A40 |3
In order to do so, I'm trying a PROC SQL UPDATE with a join condition, but it doesn't seem to work:
proc sql;
update BASE
left join TRANSCODE
on BASE.VALUE = (
case
when BASE.CLASSIFICATION = 'ALFABET' then TRANSCODE.ALFABET
when BASE.CLASSIFICATION = 'STANDARD' then TRANSCODE.STANDARD
when BASE.CLASSIFICATION = 'MIXED' then TRANSCODE.MIXED
end
)
set BASE.STANDARD_VALUE = TRANSCODE.STANDARD
;
quit;
Can someone help me?
Thanks a lot
The value selection for the standard value is a lookup query, so you cannot join directly to TRANSCODE in a PROC SQL UPDATE.
Try this UPDATE query that uses a different lookup selection for each classification:
data base;
infile cards missover;
input NAME $ CLASSIFICATION $ VALUE $ STANDARD_VALUE $;
datalines;
FIDO ALFABET F
ALFA STANDARD 2
BETA STANDARD 5
ETA MIXED B65
THETA MIXED A40
run;
data transcode;
input ALFABET $ STANDARD $ MIXED $;
datalines;
A 1 A1
B 5 A30
C 3 A40
D 5 A31
E 8 B65
F 6 C54
run;
proc sql;
update base
set standard_value =
case /* look up the standard value according to the record's classification */
when classification = 'ALFABET' then (select standard from transcode where alfabet = value)
when classification = 'MIXED' then (select standard from transcode where mixed = value)
when classification = 'STANDARD' then value
else 'NOTRANSCODE'
end;
quit;
%let syslast = base;
I have a variable that looks like this:
I want a new variable that multiplies the labels by the frequency, so for example the first row would be 1 * 70,105 = 70,105, row 2 would be 2 * 36,377 = 72,754, and so on. I want my new variable to look like this:
How can I do this?
On the face of it you have at least 119167 observations. The "at least" refers to the possibility of missing values, not tabulated by default.
You don't say whether you want these values in the same observations or in a much reduced new dataset. If the former, then consider this (noting that 3845 * 4 = 15380).
clear
input apple freq
1 70105
2 36377
3 8840
4 3845
end
expand freq
tab apple
bysort apple : gen new = apple * _N
tabdisp apple, c(new)
----------------------
apple | new
----------+-----------
1 | 70105
2 | 72754
3 | 26520
4 | 15380
----------------------
I would like to put column percentages into an Excel file. The first step, therefore, would be to capture the percentages (or counts, if percentages are not possible) into a matrix and then post the values into Excel using putexcel. I cannot use the matcell and matrow options with svy: tab, so I attempted to check the stored results using e(name). The issue I am facing is how to capture the values from the following tabulation into a matrix:
webuse nhanes2b, clear
svyset psuid [pweight=finalwgt], strata(stratid)
svy: tabulate sex race , format(%11.3g) percent
--------------------------------------
1=male, | 1=white, 2=black, 3=other
2=female | White Black Other Total
----------+---------------------------
Male | 42.3 4.35 1.33 47.9
Female | 45.7 5.2 1.2 52.1
|
Total | 87.9 9.55 2.53 100
--------------------------------------
Key: cell percentages
I would like to put the values above in a matrix. I tried the following which worked:
mat pct = e(b)' * 100
matrix list pct
pct[6,1]
y1
p11 42.254909
p12 4.3497373
p13 1.3303765
p21 45.660537
p22 5.2008547
p23 1.2035865
But what I am interested in is the column percentages given by the following tabulation:
svy: tabulate sex race , format(%11.3g) col percent
--------------------------------------
1=male, | 1=white, 2=black, 3=other
2=female | White Black Other Total
----------+---------------------------
Male | 48.1 45.5 52.5 47.9
Female | 51.9 54.5 47.5 52.1
|
Total | 100 100 100 100
--------------------------------------
Key: column percentages
I tried this which did not return the desired values in the table above:
mat pct = e(b)' * 100
matrix list pct
pct[6,1]
y1
p11 42.254909
p12 4.3497373
p13 1.3303765
p21 45.660537
p22 5.2008547
p23 1.2035865
After checking through various stored objects using ereturn list I did not seem to find anything corresponding to column percentages.
How can I get the column percentages into a matrix?
easy peasy
ssc install estout
webuse nhanes2b, clear
estpost svy: tabulate race diabetes, row percent
esttab . using "C:/table.csv", b(2) se(2) scalars(F_Pear) nostar unstack mtitle(`e(colvar)')
see also here http://repec.org/bocode/e/estout/hlp_estpost.html#svy_tabulate
I have a situation where I need to order several dates to see if there is a gap in coverage. My data set looks like this, where id is the panel id and start and end are dates.
id start end
a 01.01.15 02.01.15
a 02.01.15 03.01.15
b 05.01.15 06.01.15
b 07.01.15 08.01.15
b 06.01.15 07.01.15
I need to identify any cases where there is a gap in coverage, meaning that the next start date for an id is greater than the preceding end date for the same id. It should also be noted that the same id can have an undetermined number of observations, and they might not be in any particular order. I wrote the code below for a case where there are only two observations per id.
bys id: gen y=1 if end < start[_n+1]
However, this code does not produce the desired results. I'm thinking that there should be another way to approach this problem.
Your approach seems sound in essence, assuming that your date variables are really Stata daily date variables formatted suitably. You don't explain at all what "does not produce the desired results" means to you.
The code below creates a sandbox similar to your example, but with string variables converted to daily dates.
Key details include:
Observations must be sorted by start date within each panel; the (start) in bysort id (start) takes care of that.
For the last observation in each panel, start[_n+1] evaluates to missing, and missing counts as greater than any known date, so the comparison would otherwise be spuriously true. The code here returns the corresponding indicator as missing for that observation instead (the if _n < _N qualifier).
clear
input str1 id str8 (s_start s_end)
a "01.01.15" "02.01.15"
a "02.01.15" "03.01.15"
b "05.01.15" "06.01.15"
b "07.01.15" "08.01.15"
b "06.01.15" "07.01.15"
b "10.01.15" "12.01.15"
end
foreach v in start end {
gen `v' = daily(s_`v', "DMY", 2050)
format `v' %td
}
// the important line here
bysort id (start) : gen first = end < start[_n+1] if _n < _N
list , sepby(id)
+----------------------------------------------------------+
| id s_start s_end start end first |
|----------------------------------------------------------|
1. | a 01.01.15 02.01.15 01jan2015 02jan2015 0 |
2. | a 02.01.15 03.01.15 02jan2015 03jan2015 . |
|----------------------------------------------------------|
3. | b 05.01.15 06.01.15 05jan2015 06jan2015 0 |
4. | b 06.01.15 07.01.15 06jan2015 07jan2015 0 |
5. | b 07.01.15 08.01.15 07jan2015 08jan2015 1 |
6. | b 10.01.15 12.01.15 10jan2015 12jan2015 . |
+----------------------------------------------------------+
My understanding of SAS is very elementary. I am trying to do something like the following and I need help.
I have a primary dataset A with 20,000 observations, where Col1 stores the CITY and Col2 stores the MILES. Col2 contains a lot of missing data, as shown below.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Gary,IN | 242.34 |
+----------------+---------------+
| Lafayette,OH | . |
+----------------+---------------+
| Ames, IA | 123.19 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | . |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
I have another, secondary dataset B with around 5,000 observations, very similar to dataset A: Col1 stores the CITY and Col2 stores the MILES. However, in dataset B, Col2 DOES NOT CONTAIN MISSING DATA.
+----------------+---------------+
| Col1 | Col2 |
+----------------+---------------+
| Lafayette,OH | 321.45 |
+----------------+---------------+
| San Jose,CA | 212.55 |
+----------------+---------------+
| Schuaumburg,IL | 176.34 |
+----------------+---------------+
| Santa Cruz,CA | 454.44 |
+----------------+---------------+
My goal is to fill the missing miles in Dataset A based on the miles in Dataset B by matching the city names in col1.
In this example, I am trying to fill in 321.45 in Dataset A from Dataset B and similarly 176.34 by matching Col1 (city names) between the two datasets.
I need help doing this in SAS.
You just have to merge the two datasets. Note that the values of Col1 need to match exactly in the two datasets.
Also, I am assuming that Col1 is unique in dataset B. Otherwise you need to decide which value you want to use, or remove the duplicates (for example by adding nodupkey to the PROC SORT statement, as sketched below).
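A minimal sketch of that deduplication, under the assumption that keeping one arbitrary row per city is acceptable:
proc sort data=B nodupkey;
by Col1; /* keeps the first row for each Col1 value in sorted order and drops the rest */
run;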
Here is an example how to merge in SAS:
proc sort data=A;
by Col1;
run;
proc sort data=B;
by Col1;
run;
data AB;
merge A(in=a) B(keep=Col1 Col2 rename=(Col2 = Col2_new));
by Col1;
if a;
if missing(Col2) then Col2 = Col2_new;
drop Col2_new;
run;
This includes all observations and columns from dataset A. If Col2 is missing in A then we use the value from B.
Pekka's solution works perfectly; I'll add an alternative solution for the sake of completeness.
Sometimes in SAS a PROC SQL step lets you skip some passes compared to a DATA step (with a corresponding saving in storage and computation time), and a MERGE is a typical example.
Here you can avoid sorting both input datasets and handling the renaming of variables (in this case the matching key has the same name, Col1, in both tables, but in general that is not the case).
proc sql;
create table want as
select A.col1,
coalesce(A.col2,B.col2) as col2
from A left join B
on A.col1=B.col1
order by A.col1;
quit;
The coalesce() function returns the first non-missing value encountered in its argument list.
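For example, here is a minimal standalone sketch (the variable names are made up purely for illustration):
data _null_;
a = .;                      /* missing */
b = 321.45;
c = 999;
picked = coalesce(a, b, c); /* returns 321.45, the first non-missing argument */
put picked=;
run;
For character arguments, the companion function coalescec() does the same thing.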
I have a table of this form
id1|A|
id1| |var1
id1|B|var2
id2|C|
I would like to retrieve the records that have information for all variables, i.e.
id1|B|var2
To perform this task I want to count the number of non-missing values in each row and keep only the rows that are complete:
id|name|age |cntrow
id1| A | |2
id1| |var1|2
id1| B |var2|3
id2| C | |2
Any guess how to perform this task?
You can use the CMISS function. Something along the lines of:
Data nomissing missing;
Set input_dataset;
if CMISS(of _ALL_)=0 then output nomissing;
if CMISS(of _ALL_)>0 then output missing;
run;
The N function would work if all the variables were numeric. Since the data are not, you can use CMISS to find out how many values are missing:
data have;
infile datalines dlm='|';
input id $ charvar1 $ charvar2 $ numvar;
vars_missing = cmiss(of _all_)-1; *because vars_missing is also missing at this point!;
put _all_;
datalines;
id1|A| |3
id1| |var1|2
id1|B|var2|.
id2|C| |2
;;;;
run;
Then subtract that from the known number of variables. If you don't know it, you can create _CHARACTER_ and _NUMERIC_ arrays and use dim() on those to find out, as in the sketch below.
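A minimal sketch of that idea, building on the have dataset above (the array and new variable names are made up for illustration):
data counts;
set have;
array chars[*] _character_;  /* every character variable read from HAVE */
array nums[*]  _numeric_;    /* every numeric variable read from HAVE  */
n_vars = dim(chars) + dim(nums);                          /* total number of variables */
cntrow = n_vars - cmiss(of chars[*]) - cmiss(of nums[*]); /* count of non-missing values per row */
run;
Because the arrays are declared before n_vars and cntrow are created, they cover only the variables that came from have, so the counts are not thrown off by the new variables.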