Trailing zeros in string format in Stata

I am trying to convert a string variable A in Stata to a string variable B such that each observation has a fixed length. For example, the string variable A is
85
01
3
and I want to convert it to another string variable B with trailing zeros, so that each observation has length 5:
85000
01000
30000
I know that this code adds leading zeros:
gen B = string(real(A),"%05.0f")
How should it be modified to get trailing zeros?

The issue is that your new variable is not the old one with a new format, but different values altogether. One way is:
clear
set more off
*----- example data -----
input ///
str2 A
85
01
3
end
list
*------ what you want -----
// desired length of string
local len = 5
// factor
gen xdif = 10 ^ (`len' - length(A))
// new values with new format
gen B0 = real(A) * xdif
gen B = string(B0,"%0`len'.0f")
list
You can make that a one-liner, if you like.
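The arithmetic behind this approach can be sketched outside Stata. The Python snippet below is only an illustration, and the function name pad_numeric is hypothetical rather than part of the answer. It scales the numeric value by 10^(len - length(A)) and then formats the result with leading zeros, which is the same idea as string(real(A) * xdif, "%05.0f"). It assumes A contains digits only and is no longer than the target length.

```python
def pad_numeric(a: str, length: int = 5) -> str:
    """Right-pad a digit string with zeros by scaling its numeric value."""
    factor = 10 ** (length - len(a))   # e.g. "85" -> 10 ** 3
    scaled = int(a) * factor           # 85 -> 85000, 1 -> 1000
    return f"{scaled:0{length}d}"      # zero-fill restores a lost leading 0


for a in ["85", "01", "3"]:
    print(a, "->", pad_numeric(a))
```

The zero-filled format is what keeps "01" from turning into "1000": the leading zero disappears in the numeric conversion and is put back when the value is formatted.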

An alternative approach works equally well for padding strings that contain non-numeric values.
clear
set more off
*----- example data -----
input ///
str3 A
85
01
3
XYZ
"D F"
end
list
*------ what you want -----
// desired length of string
local len 5
// desired trailing character
local pad 0
// new values
gen B0 = "`pad'"*`len'
gen B1 = A+B0
generate str`len' B3 = B1
// new values in a single command
generate str`len' B = A+"`pad'"*`len'
list
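The string-based approach amounts to "append more padding than you could ever need, then keep the first len characters". A minimal Python sketch of that idea follows; the name pad_right is hypothetical and the snippet is an illustration, not a translation of the Stata commands.

```python
def pad_right(a: str, length: int = 5, pad: str = "0") -> str:
    """Append `length` pad characters, then truncate to `length` characters."""
    return (a + pad * length)[:length]


for a in ["85", "01", "3", "XYZ", "D F"]:
    print(repr(a), "->", repr(pad_right(a)))
```

Because nothing is converted to a number, values such as XYZ or "D F" are padded just like the digit strings.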

Related

Concatenation option? '!'

I have been studying basic level SAS and here is a problem that I don't understand.
data test;
A='Ipswich, England';
B=substr(A,1,7);
C=B!!';'!!'England';
run;
According to the problem, the value of C must be Ipswich , England.
I tried the code and there are three things that I would like to ask in particular.
1) Why is it okay to use !! instead of ||? Is !! a different concatenation option?
2) The result I got was Ipswich ;England. So I don't know why the expected answer has a comma instead of the semicolon.
3) Why is there an extra space after Ipswich? Shouldn't B be only the first 7 letters of A, as in I p s w i c h?
The text I am working on has some weird expressions so there is a chance that it is a typo, but I do not want to go there yet.
Thank you.
You can use !! as an alias for ||. Old keyboards didn't have the | character. Also old ASCII/EBCDIC transcoders didn't always translate that character properly.
Your code is definitely using a semi-colon and not a comma. So either a typo or a transcription error is why the suggested answer has a comma.
Since you didn't tell SAS what length to use for variable B it had to guess. So it guessed it should use the same length as the input to the SUBSTR() function call. So both A and B are defined as 16 bytes long. The || operator does not trim the trailing spaces so the semi-colon is the 17th byte of C.
data test;
A='Ipswich, England';
B=substr(A,1,7);
C=B!!';'!!'England';
put (a b c) (=$quote.);
run;
A="Ipswich, England" B="Ipswich" C="Ipswich         ;England"
NOTE: The data set WORK.TEST has 1 observations and 3 variables.
Contents:
Alphabetic List of Variables and Attributes
#    Variable    Type    Len
1    A           Char     16
2    B           Char     16
3    C           Char     24
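The byte counts above can be checked with a small simulation. The Python sketch below is an assumption-laden model, not SAS: the helper assign is hypothetical and simply imitates a fixed-length character variable by padding with blanks or truncating.

```python
def assign(value: str, length: int) -> str:
    """Imitate storing into a fixed-length character variable."""
    return value.ljust(length)[:length]


A = "Ipswich, England"             # 16 characters
B = assign(A[:7], len(A))          # SUBSTR result inherits the length of A
C = B + ";" + "England"            # concatenation keeps the trailing blanks

print(repr(B))
print(repr(C))
print(len(C), C.index(";") + 1)    # total length and 1-based position of ';'
```

In this model B holds Ipswich followed by 9 blanks, so the semicolon lands in position 17 and C is 16 + 1 + 7 = 24 characters long, matching the lengths reported in the listing.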
1) Back in the day, not all keyboards had a pipe character.
2) There is more than one extra space.
3) The length of B defaults to the length of the first argument of SUBSTR.
data _null_;
A='Ipswich, England';
B=substr(A,1,7);
C=B!!';'!!'England';
l1=vlength(a);
l2=vlength(b);
l3=vlength(c);
put _all_;
put 'NOTE: **' c $varying. l3 '**';
run;
A=Ipswich, England B=Ipswich C=Ipswich         ;England l1=16 l2=16 l3=24 _ERROR_=0 _N_=1
NOTE: **Ipswich         ;England**

Wide to Long Dataset in SAS

I have a dataset that has multiple measures taken as multiple time points.
The data look like this:
UserID  Var1_2008  Var1_2009  Var1_2010  Var2_2008  Var2_2009  Var2_2010  Race
1       Y          N          Y          20         30         20         1
2       N          N          N          15         30         35         0
I want the data to look like this:
Year  UserID  Var1  Var2  Race
2008  1       Y     20    1
2009  1       N     30    1
....
How can I do this? I'm totally lost.
You could use an array, assuming you have the same years for all of the var1_ and var2_ variables.
data want ;
set have ;
/* Need two arrays, as one is character, the other numeric */
array v1{*} var1_: ; /* wildcard all 'var1_'-prefixed variables */
array v2{*} var2_: ; /* same for var2_ */
/* loop along v1 array */
do i = 1 to dim(v1) ;
/* use vname function to get variable name associated to this array element */
year = input(scan(vname(v1{i}),-1,'_'),8.) ;
var1 = v1{i} ;
var2 = v2{i} ;
output ;
end ;
drop i var1_: var2_: ; /* drop the loop counter and the original wide columns */
run ;
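The logic of the array loop (one output row per year, with the year parsed from the variable name) can be sketched in Python as follows. This is an illustration only; the function name wide_to_long and the dictionary layout are assumptions, and the sample rows are the two shown in the question.

```python
def wide_to_long(rows):
    """Emit one row per (UserID, year), parsing the year from the column name."""
    long_rows = []
    for row in rows:
        # Years are taken from the Var1_ columns; Var2_ is assumed to match.
        years = sorted(k.rsplit("_", 1)[1] for k in row if k.startswith("Var1_"))
        for year in years:
            long_rows.append({
                "Year": int(year),
                "UserID": row["UserID"],
                "Var1": row[f"Var1_{year}"],
                "Var2": row[f"Var2_{year}"],
                "Race": row["Race"],
            })
    return long_rows


have = [
    {"UserID": 1, "Var1_2008": "Y", "Var1_2009": "N", "Var1_2010": "Y",
     "Var2_2008": 20, "Var2_2009": 30, "Var2_2010": 20, "Race": 1},
    {"UserID": 2, "Var1_2008": "N", "Var1_2009": "N", "Var1_2010": "N",
     "Var2_2008": 15, "Var2_2009": 30, "Var2_2010": 35, "Race": 0},
]

want = wide_to_long(have)
for r in want:
    print(r)
```

As in the SAS answer, this relies on Var1 and Var2 being recorded for the same set of years.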
There's a macro for that! I think running the following will do exactly what you want to accomplish:
filename ut url 'https://raw.githubusercontent.com/FriedEgg/Papers/master/An_Easier_and_Faster_Way_to_Untranspose_a_Wide_File/src/untranspose.sas';
%include ut ;
%untranspose(data=have, out=want, by=UserID, id=year, delimiter=_,
var=Var1 Var2, copy=Race)

CAT vs CATS in data step do loop

Why does the code below work (&ds is 12345678910), but when I change cats to cat, &ds is just blank? I would expect changing cats to cat would mean &ds is 1 2 3 4 5 6 7 8 9 10.
data new;
length ds $500;
ds = "";
do i = 1 to 10;
ds = cats(ds, i, " ");
end;
call symputx('ds', ds);
run;
%put &ds;
The cat() function does not trim its arguments. DS is already padded with blanks to its full length of 500, so when you concatenate anything to DS and store the result back into DS, whatever you added is lost because there is no room for it.
It appears you actually want the catx() function.
ds = catx(' ',ds, i);
SAS tends to add leading and trailing spaces when you use the input buffer or do text manipulation. You can use either the strip() or the catx() function to remove them. With catx() you also have the option of specifying a delimiter.
ds = cat(strip(ds), i, " ");
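Why the three functions behave differently can be seen in a rough simulation. The Python sketch below uses simplified, hypothetical stand-ins for cat, cats and catx (they ignore details such as how SAS formats numbers) and a helper assign that imitates storing into a 500-character variable.

```python
LENGTH = 500  # declared length of ds in the data step


def assign(value: str, length: int = LENGTH) -> str:
    """Imitate a fixed-length variable: pad with blanks, truncate overflow."""
    return value.ljust(length)[:length]


def cat(*args) -> str:
    return "".join(str(a) for a in args)            # no trimming at all


def cats(*args) -> str:
    return "".join(str(a).strip() for a in args)    # strip every argument


def catx(sep: str, *args) -> str:
    parts = [str(a).strip() for a in args]
    return sep.join(p for p in parts if p)          # blank arguments are skipped


def build(step) -> str:
    ds = assign("")
    for i in range(1, 11):
        ds = assign(step(ds, i))
    return ds.strip()                               # like call symputx


print(repr(build(lambda ds, i: cat(ds, i, " "))))
print(repr(build(lambda ds, i: cats(ds, i, " "))))
print(repr(build(lambda ds, i: catx(" ", ds, i))))
```

In this model the cat version stays blank because each new piece is appended after 500 blanks and then truncated away, the cats version strips the separator along with everything else, and only the catx version keeps a single space between the numbers.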

Compare averages of 2 columns

For example, I have a data set like this (the values a1 a2 a3 b1 b2 b3 are numeric):
A B
a1 b1
a2 b2
a3 b3
I want to compare the averages of the two classes A and B using proc ttest, but it seems that I have to change my data set in order to use this proc. I have read lots of tutorials about proc ttest, and all of them use data sets in the form below:
class value
A a1
A a2
A a3
B b1
B b2
B b3
So my question is: is there a way to run proc ttest without changing my data set?
Thank you, and sorry for my bad English :D
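The short answer is no: you can't run an independent two-sample t-test in SAS that compares two columns directly. (The paired statement in proc ttest does compare two columns, but it performs a paired test, which is a different analysis.) For two independent samples, proc ttest relies on the variable in the class statement to define the groups. Only one class variable can be entered and it must have exactly 2 levels, so the structure of your data is not compatible with this.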
You will therefore need to change the data layout, although you could do this in a view so that you don't create a new physical dataset. Here's one way to do that.
/* create dummy data */
data have;
input A B;
datalines;
10 11
15 14
20 21
25 24
;
run;
/* create a view that turns vars A and B into a single variable */
data have_trans / view=have_trans;
set have;
array vals{2} A B;
length grouping $2;
do i = 1 to 2;
grouping = vname(vals{i}); /* extracts the current variable name (A or B) */
value = vals{i}; /* extracts the current value */
output;
end;
drop A B i; /* drop unwanted variables */
run;
/* perform ttest */
proc ttest data=have_trans;
class grouping;
var value;
run;
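The reshaping that the view performs can be illustrated in a few lines of Python. This is a sketch under the assumption that the data are the four dummy rows above; the names have_trans and groups are chosen for illustration and no t-test is computed here.

```python
from statistics import mean

have = [(10, 11), (15, 14), (20, 21), (25, 24)]   # columns A and B

# One output row per original cell: (grouping, value)
have_trans = []
for a, b in have:
    have_trans.append(("A", a))
    have_trans.append(("B", b))

# Collect the values per group, as a class variable would
groups = {}
for grouping, value in have_trans:
    groups.setdefault(grouping, []).append(value)

for grouping, values in groups.items():
    print(grouping, len(values), mean(values))
```

Each original row contributes two rows to the long layout, which is the shape a class-based comparison expects.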

How do I assign numeric values to the alphabet in SAS

I'm trying to convert a character string to a numeric variable and then sum the values of each character to use as a unique identifier for that field.
So for example, I would like A=1, B=2, C=3.....X=24 Y=25 Z=26.
Say my string is "CAB". After running the code, I would like an intermediary column of numbers, where the value for CAB is 3 1 2, and a result column derived by summing those values, 3+1+2 = 6, so the final value would be 6.
Here is the sas code I used to convert the characters to numbers, but I need help with the result column.
DATA CHAR_VALUE;
SET WORK.XYZ;
CHAR_2_NUM=TRANSLATE(MY_VAR_CHAR, '1 2 3 ...24 25 26', 'A B C ...X Y Z');
NUM_CHAR=INPUT(CHAR_2_NUM,32.);
RUN;
Thanks in advance...I appreciate any help or suggestions.
-rachel
RANK will give the ASCII numeric value underlying a character; so A=65, B=66, Z=90, a=97, z=122.
So this should work (if you want only the uppercase values - not a different value for a than A):
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,rank(char(upcase(charval),_t))-64);
end;
put _all_;
run;
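The same character-code arithmetic can be sketched in Python, where ord plays the role of RANK. The function name letter_values is hypothetical, and skipping anything that is not a letter A to Z is an assumption added here, not something the SAS code does.

```python
def letter_values(text: str) -> list[int]:
    """Map A..Z (case-insensitive) to 1..26 via the ASCII code minus 64."""
    return [ord(ch) - 64 for ch in text.upper() if "A" <= ch <= "Z"]


values = letter_values("CAB")
print(values, sum(values))
```

Keeping the per-letter values in a list gives both the intermediate column and the sum from a single pass.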
Another option (based on the comments below) is to build an informat that holds the relationship between each letter and its value. My loop iterates over each character from A to Z; you can then assign whatever value you want to each letter as the label (I just used 1, 2, 3, 4, ..., but changing label= will change that).
data fmts;
retain fmtname 'CHARNUM' type 'i';
do _t=65 to 90;
start=byte(_t); *the character, so byte(65)='A';
label=_t-64; *the resulting number;
output;
end;
run;
proc format cntlin=fmts;
quit;
data test;
charval='CAB';
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
end;
put _all_;
run;
Finally, if you want to be able to construct this in the same datastep, you could construct the relationships in a hash table and look up the result. I can explain that if desired, though I'd like to see a more detailed example of what you want to do in terms of defining the relationship between a letter and its code.
If you need to see the intermediate values, you can do that by inserting a CAT function in the loop; I recommend CATX:
data test;
charval='CAB';
format intermed $100.;
do _t=1 to length(Charval);
numval=sum(numval,input(char(upcase(charval),_t),CHARNUM.));
intermed=catx('|',intermed,input(char(upcase(charval),_t),CHARNUM.)); *or the RANK portion from earlier;
end;
put _all_;
run;
That would give you 3|1|2, which you could then do math on via SCAN:
do _t = 1 to countc(intermed,'|')+1;
numval2 = sum(numval2,scan(intermed,_t,'|'));
end;
Your method to try and translate is a good attempt, but it will not really work. Here is a simple solution:
DATA CHAR_VALUE;
retain all_chars 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
set XYZ;
length CHAR_2_NUM $200;
CHAR_2_NUM = ' ';
NUM_CHAR = 0;
do i=1 to length(MY_VAR_CHAR);
if i=1 then CHAR_2_NUM = substr(MY_VAR_CHAR,i,1);
else CHAR_2_NUM = trim(CHAR_2_NUM) || ' ' || substr(MY_VAR_CHAR,i,1);
NUM_CHAR + index(all_chars,substr(MY_VAR_CHAR,i,1));
end;
drop i all_chars;
RUN;
This takes advantage of the fact that the indexed position of each character of your source variable in the all_chars variable corresponds to the mapping you desired.
UPDATED to also create your CHAR_2_NUM variable, which I overlooked in the original question.
Another simple solution is based on the collate function:
To convert a variable called MyNumbers (in the range of 1 to 26) to English upper-case characters, one can use:
collate(64 + MyNumbers, 64 + MyNumbers)
To obtain lower-case characters, one can use:
collate(96 + MyNumbers, 96 + MyNumbers)
Here's a quick example:
data _null_;
do MyNumbers = 1 to 26;
MyLettersUpper = collate(64 + MyNumbers, 64 + MyNumbers);
MyLettersLower = collate(96 + MyNumbers, 96 + MyNumbers);
put MyNumbers MyLettersUpper MyLettersLower;
end;
run;
1 A a
2 B b
3 C c
4 D d
5 E e
6 F f
7 G g
8 H h
9 I i
10 J j
11 K k
12 L l
13 M m
14 N n
15 O o
16 P p
17 Q q
18 R r
19 S s
20 T t
21 U u
22 V v
23 W w
24 X x
25 Y y
26 Z z
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds