I am trying to create a SAS data set with two variables. Y should hold the whole name. The variable Names should hold the surname followed by the initials of the given names, e.g. Johnson Mike should be "Johnson M." and Smith Robert John should be "Smith R. J.". I'm not sure how to create the Names variable. Can anyone help?
data names;
Length y $ 40;
Input y &;
Names = y;
DATALINES;
Johnson Mike
Smith Robert John
Jones Linda Mary
Brown Marcus
run;
This should work with a DO loop:
data names_final;
set names;
do _n_ = 1 to countw(Y,' ');
if _n_ = 1 then name =scan(Y,1);
else name = catx(' ', name, cats(first(scan(y,_n_)),'.'));
end;
run;
You can also do:
data names_final;
set names;
name = cats(catx(' ', scan(y,1), catx('. ', first(scan(y,2)),first(scan(y,3) ))),'.');
run;
The functions COUNTW, SCAN, FIRST and CATX may be helpful.
1. Get the number of words in the name;
2. Keep the first word;
3. Loop over the remaining words, appending the first letter of each word (except the first word) to the first word.
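The three steps above could be sketched roughly as follows. This is an untested sketch that assumes the NAMES data set from the question; the output data set name names_steps, the loop index i and the $60 length are illustrative choices, not part of the original posts.

```sas
data names_steps;
  set names(keep=y);            /* read only Y so Names can be rebuilt here */
  length Names $ 60;            /* assumed length: surname plus initials */
  Names = scan(y, 1, ' ');      /* step 2: keep the first word */
  do i = 2 to countw(y, ' ');   /* steps 1 and 3: loop over the remaining words */
    /* append the first letter of word i, followed by a period */
    Names = catx(' ', Names, cats(first(scan(y, i, ' ')), '.'));
  end;
  drop i;
run;
```

Starting the loop at 2 avoids the special case for the first word and leaves the automatic variable _N_ untouched.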
Hi, I have a question about SAS.
How can I split a string into multiple columns in SAS?
The value before the first space should be treated as the first name, the value after the last space as the last name, and everything between the first and last spaces as the middle name.
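data my_data1;
/* truncover stops INPUT from moving to the next line when a record is shorter than 500 characters */
infile datalines truncover;
input name $500.;
datalines;
Andy Lincoln Bernard ravni
Barry Michael
Chad Simpson Smith
Eric
Frank Giovanni Goodwill
;
run;
proc print data=my_data1;
run;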
Based on this data, the expected output is:
Fname | Middlename      | Lname
Andy  | Lincoln Bernard | ravni
Barry |                 | Michael
Chad  | Simpson         | Smith
Eric  |                 |
Frank | Giovanni        | Goodwill
I tried the following:
data my_data2;
set my_data1;
Fname=scan(name, 1, ' ');
Middlename=scan(name, 2, ' ');
Lname=scan(name, -1, ' ');
run;
proc print data=my_data2;
run;
The above logic does not give the expected output.
Can you please tell me how to write code to achieve this in SAS?
Code:
data want;
length first_name middle_name last_name $50.;
set have;
n_names = countw(name);
if(n_names) = 1 then first_name = name;
else if(n_names = 2) then do;
first_name = scan(name, 1);
last_name = scan(name, -1);
end;
else do;
first_name = scan(name, 1);
last_name = scan(name, -1);
middle_name = substr(name, length(first_name)+2, length(name) - (length(first_name) + length(last_name))-2 );
end;
run;
How it works
We know:
If there's one word, it's a first name
If there are two words, it's a first and last name
If there are three or more words, it's a first, last, and middle name
To get the middle name, we know:
Where the first name starts and how long it is
Where the last name starts and how long it is
How long the entire name is
By simply doing some subtraction, we can get a substring of the middle name:
Len  ----------------- 17
     ----5        ---4
     First Middle Last
Pos        7   12
The length of the string is 17. "Middle" starts at position 7 and ends at position 12. We can get the length of the middle name by subtracting the lengths of the first and last names from the total length of the string, and then subtracting 2 more to account for the two spaces on either side of the middle name.
17 - (5 + 4) - 2 = 6
Our start position is 5 + 2 = 7 (i.e. the length of the first name + 2): one for the space after the first name and one to move onto the first character of the middle name. Translating this to substr:
substr(name, length(first_name)+2, length(name) - (length(first_name) + length(last_name))-2 )
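To check the arithmetic, the worked example can be run on the literal string 'First Middle Last'. This is a small sketch; the variable names start and len are made up for illustration, and it assumes the words are separated by single blanks.

```sas
data _null_;
  length name $ 50;
  name = 'First Middle Last';
  first_name = scan(name, 1);
  last_name  = scan(name, -1);
  /* 5 + 2 = 7: position of the first character of the middle name */
  start = length(first_name) + 2;
  /* 17 - (5 + 4) - 2 = 6: length of the middle name */
  len = length(name) - (length(first_name) + length(last_name)) - 2;
  middle_name = substr(name, start, len);
  put start= len= middle_name=;   /* writes the three values to the log */
run;
```

If the input can contain repeated blanks between words, the subtraction no longer holds; normalizing the string first, for example with compbl(name), would be one way to guard against that.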
Adapted from How to separate first name and middle name and last name
data want;
set my_data1;
length first middle middle1 middle2 last $ 40;
array parts[*] first middle1 middle2 last;
do i = 1 to countw(name);
if i = countw(name) and i < dim(parts) then do;
parts[dim(parts)] = scan(name, i);
end;
else do;
parts[i] = scan(name, i);
end;
end;
if middle1 ne "" and middle2 ne "" then middle = catx(" ", middle1, middle2);
else middle = middle1;
if first = "" and last ne "" then do;
first = last;
last = "";
end;
drop name i middle1 middle2;
run;
I am using SAS Enterprise Guide.
I have a new file and was asked to generate output from it.
Source:
Name feeder_in feeder_out NickName
ABBA 1,2 A,B ABBA
POLA 1,2 C,D,E CONS POLA
and the desired output:
Name feeder_final
ABBA 1
ABBA 2
ABBA A
ABBA B
POLA 1
POLA 2
CONS POLA C
CONS POLA D
CONS POLA E
I have been trying to handle this myself, but with no luck.
I tried:
data test;
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
output;
end;
xequipment=newequipment;
run;
Does anyone have a clue how to do this?
Here's my understanding of your requirements, based on the desired output: you want your output to have one observation for each combination of NAME and FEEDER_IN, plus another observation for each combination of NICKNAME and FEEDER_OUT.
On that assumption, the code would look something like (not tested):
data want;
set have;
keep name feeder_final;
* Loop over FEEDER_IN and output one obs for each delimited value;
do i = 1 to countw(feeder_in, ',');
feeder_final = scan(feeder_in, i, ',');
output;
end;
* Move the NICKNAME value into NAME;
name = nickname;
* Loop over FEEDER_OUT and output one obs for each delimited value;
do i = 1 to countw(feeder_out, ',');
feeder_final = scan(feeder_out, i, ',');
output;
end;
run;
When transposing multiple columns, you might also want to keep the source row and column identifiers for downstream analytics. The sequence of the values in the CSV might also be important if you need to do pairwise joining on sequence position, such as needing to match 1A 2B in row 1 and 1C 2D in row 2.
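data have;
length name feeder_in feeder_out nickname $20;
/* the & modifier lets a value contain single blanks, so fields must be separated by two or more blanks */
input
Name& feeder_in& feeder_out& NickName&; datalines;
ABBA  1,2  A,B  ABBA
POLA  1,2  C,D,E  CONS POLA
run;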
data want;
_row_ + 1;
set have;
feeder = 'in ';
do seq = 1 to countw(feeder_in,',');
value = scan(feeder_in,seq,',');
OUTPUT;
end;
feeder = 'out';
do seq = 1 to countw(feeder_out,',');
value = scan(feeder_out,seq,',');
OUTPUT;
end;
keep _row_ Name feeder seq value NickName;
run;
OK, my last question: I am having a hard time formatting this.
data practice;
input
Datalines;
employee_id Name gender years dept salary Birthday
1 Mitchell, Jane A f 6 shoe 22,450 12/30/1960
2 Miller, Frances T f 8 appliance . 11/27/1965
3 Evans, Richard A m 9 appliance 42,900 02/15/1973
4 Fair, Suzanne K f 3 clothing 29,700 03/09/1958
5 Meyers, Thomas D m 5 appliance 33,700 10/22/1961
6 Rogers, Steven F m 3 shoe 27,000 09/12/1960
7 Anderson, Frank F m 5 clothing 33,000 03/09/1958
10 Baxter, David T m 2 shoe 23,900 11/25/1966
11 Wood, Brenda L f 3 clothing 33,000 01/14/1962
12 Wheeler, Vickie M f 7 appliance 31,500 12/23/1975
13 Hancock, Sharon T f 1 clothing 21,000 01/17/1972
14 Looney, Roger M m 10 appliance 31,500 06/09/1973
15 Fry, Marie E f 6 clothing 29,700 05/25/1967
;
run;quit;
Proc print data=practice;
run;quit;
My question is: is there a way to do this without having to count each individual space? Even when I do count, the data still does not print out properly. What am I doing wrong? Thanks in advance. This should be my last question; afterwards I should be ready for this final.
If you don't assign a length, a character variable read with list input gets the default length of 8 bytes, so longer values are truncated. You can use the statement length var $w; before your input statement to set your own length. The dsd option tells SAS to use a comma as the variable delimiter, to read a string enclosed in quotation marks as a single value, and to strip the quotation marks before saving the value. If you use blank spaces as your delimiter, make sure there are no blank spaces in front of each row below the datalines statement.
data practice;
infile datalines dsd;
length Name $50 dept $9;
/* the colon modifier makes the informat stop at the delimiter */
input employee_id Name $ gender $ years dept $ salary $ Birthday :MMDDYY10.;
format Birthday MMDDYY10.;
datalines;
1,"Mitchell, Jane A",f,6,shoe,"22,450",12/30/1960
2,"Miller, Frances T",f,8,appliance,,11/27/1965
;
run;
proc print data=practice;
run;
I would like to create a variable called DATFL that has the following value at the last observation:
DATFL
gender/scan
Here is the code:
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M F
2 jill F L
3 james F M
4 jonas M M
;
run;
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
run;
Unfortunately, the value I get for DATFL at the last observation is 'gender/scan/gender/scan'. Because of the RETAIN statement that I used for DATFL, I ended up with duplicates. At the end of this data step I was planning to use a CALL SYMPUT statement to load the last value into a macro variable, but I won't do that until I fix this issue. Can anyone give me guidance on how to prevent DATFL from having duplicate values at the end of the data set? Cheers
sas_kappel
Don't retain DATFL. Instead, retain DATFL_.
data mix_3; set mix_;
length datfl datfl_ $ 50;
array m4(*) id name gender scan;
retain datfl_;
do i=1 to dim(m4);
if index(m4(i) ,'M') then do;
datfl_=vname(m4(i)) ;
if missing(datfl) then datfl=datfl_;
else datfl=strip(datfl)||"/"||datfl_;
end;
end;
if missing(datfl) then datfl = datfl_;
run;
It doesn't work. Let me change the data set (mix_) so you can see that RETAIN DATFL_ does not work in this scenario.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
To summarize, what I want is the DISTINCT values of DATFL in a macro variable. For each record, my code searches for variables containing the letter M; if one is found, DATFL receives the name of that array variable. If there are multiple variable names, they are separated by '/'. For the following records it should do the same, BUT add only variable names that satisfy the condition AND are not already in DATFL. Currently, if you run my program, at observation 4 I get DATFL=gender/scan/name/scan/scan, but I would like DATFL=gender/scan/name, because those are the distinct values. Ultimately, I will then write the following code:
if eof then CALL SYMPUT('DATFL',datfl);
sas_kappel
Your revised data makes it much clearer what you're looking for. Here is some code that should give the correct result.
I've used the CALL CATX routine to add new values to DATFL, separated by a /. Before appending, the code checks that the relevant variable name doesn't already exist in the string.
data mix_ ;
input id $ name $ gender $ scan $;
datalines;
1 jon M M
2 Marc F L
3 james F M
4 jonas H M
;
run;
data _null_;
set mix_ end=eof;
length datfl $100; /*or whatever*/
retain datfl;
array m4{*} $ id name gender scan;
do i = 1 to dim(m4);
if index(m4{i},'M') and not index(datfl,vname(m4{i})) then call catx('/',datfl,vname(m4{i}));
end;
if eof then call symput('DATFL', datfl);
run;
%put datfl = &DATFL.;
I have a table with a few thousand records sorted by distinct subject id. However, in some cases the subject name appears multiple times because the subject used more than one id type: one time the subject uses their social security number, another time their passport, and maybe a third time their driver's license.
The data is structured like this.
name id_type id_num
suzy smith passport 123
suzy smith ssn 123456789
suzy smith drivers liscense A3456789
I would like it to look like this.
name id_type id_num
suzy smith ssn 123456789
suzy smith ssn 123456789
suzy smith ssn 123456789
Any help would be greatly appreciated.
Thanks,
First sort the original data set ("have") by name:
proc sort data=have;
by name;
run;
Then set have by name and rename id_type to old_id_type. Do the same for id_num. Retain id_type and id_num. Then set id_type equal to old_id_type if a given record is the first instance of a person's name. Do the same for id_num.
data final;
set have (rename=(id_type=old_id_type id_num = old_id_num));
by name;
retain id_type id_num;
if first.name then do;
id_type = old_id_type;
id_num = old_id_num;
end;
drop old_id_type old_id_num;
run;
When you retain a variable, the value is kept from one observation to the next unless you reset the value. Thus each person will have the first id_type and id_num for all instances of that name.
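The desired output in the question shows the ssn row repeated, whereas the step above keeps whichever record happens to come first within each name. If the ssn record should win whenever one exists, a two-pass loop over each BY group is one possible variation. This is an untested sketch; the data set name final_ssn and the exact spelling 'ssn' of the id type are assumptions.

```sas
proc sort data=have;
  by name;
run;

data final_ssn;
  /* pass 1: start with the first record, switch to the ssn record if one exists */
  do until (last.name);
    set have (rename=(id_type=old_id_type id_num=old_id_num));
    by name;
    if first.name or lowcase(old_id_type) = 'ssn' then do;
      id_type = old_id_type;
      id_num  = old_id_num;
    end;
  end;
  /* pass 2: re-read the same BY group and write one row per original record */
  do until (last.name);
    set have (keep=name);
    by name;
    output;
  end;
  drop old_id_type old_id_num;
run;
```

A subject without an ssn record keeps the values from the first record of the group, which matches the behavior of the RETAIN version.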
I used the retain statement and added more code. Here is the full code:
DATA TEST;
SET C;
BY SUBJ_NAME;
RETAIN N(0);
IF FIRST.SUBJ_NAME THEN N=1;
ELSE N=N+1;
RUN;
PROC SORT
DATA=WORK.TEST
OUT=TTSORTED;
BY SUBJ_NAME N;
RUN;
PROC TRANSPOSE DATA=TTSORTED
OUT=TTTEST
PREFIX=Column
NAME=Source
LABEL=Label;
BY SUBJ_NAME N ;
VAR SUBJ_INDENT;
RUN; QUIT;
DATA TEST2;
SET TTTEST;
NEW_ID=CATS(SOURCE,N);
RUN;
PROC SORT
DATA=WORK.TEST2(KEEP=Column1 NEW_ID SUBJ_NAME SBJT_ID)
OUT=WORK.TMP0_INPUT;
BY SUBJ_NAME SBJT_ID;
RUN;
PROC TRANSPOSE DATA=TMP0_INPUT
OUT=SPLIT_TEST2;
BY SUBJ_NAME;
WHERE SUBJ_NAME NE ' ';
ID NEW_ID ;
VAR Column1;
RUN; QUIT;
DATA X;
SET SPLIT_TEST2(RENAME = (SUBJ_NAME=SUBJ_NAME_NEW));
IF SUBJ_INDENT2 NE ' ' THEN SUBJ_INDENT3= SUBJ_INDENT2;
IF SUBJ_INDENT2 = ' ' THEN SUBJ_INDENT3= SUBJ_INDENT1;
KEEP SUBJ_NAME_NEW SUBJ_INDENT1 SUBJ_INDENT2 SUBJ_INDENT3 ;
RUN;