Split string by space into multiple columns in SAS

Hi, I have a question about SAS.
How do I split a string into multiple columns in SAS?
Everything before the first space should be treated as the first name, everything after the last space as the last name, and whatever lies between the first and last spaces as the middle name.
data my_data1;
infile datalines truncover; /* keep $500. from reading past the end of each data line */
input name $500.;
datalines;
Andy Lincoln Bernard ravni
Barry Michael
Chad Simpson Smith
Eric
Frank Giovanni Goodwill
;
run;
proc print data=my_data1;
run;
Based on this data, the expected output looks like this:
Fname | Middlename      | Lname
Andy  | Lincoln Bernard | ravni
Barry |                 | Michael
Chad  | Simpson         | Smith
Eric  |                 |
Frank | Giovanni        | Goodwill
I tried the following:
data my_data2;
set my_data1;
Fname=scan(name, 1, ' ');
Middlename=scan(name, 2, ' ');
Lname=scan(name, -1, ' ');
run;
proc print data=my_data2;
run;
The above logic does not give the expected output.
Can you please tell me how to write code to achieve this in SAS?

Code:
data want;
length first_name middle_name last_name $50.;
set have;
n_names = countw(name);
if(n_names) = 1 then first_name = name;
else if(n_names = 2) then do;
first_name = scan(name, 1);
last_name = scan(name, -1);
end;
else do;
first_name = scan(name, 1);
last_name = scan(name, -1);
middle_name = substr(name, length(first_name)+2, length(name) - (length(first_name) + length(last_name))-2 );
end;
run;
How it works
We know:
If there's one word, it's a first name
If there are two words, it's a first and last name
If there are three or more words, it's a first, last, and middle name
To get the middle name, we know:
Where the first name starts and how long it is
Where the last name starts and how long it is
How long the entire name is
By simply doing some subtraction, we can get a substring of the middle name:
Len  -----------------  17
     -----        ----  5 and 4
     First Middle Last
Pos        7   12
The length of the string is 17. "Middle" starts at position 7 and ends at position 12. We get the length of the middle name by subtracting the lengths of the first and last names from the total length of the string, then subtracting 2 more to account for the two spaces on either side of the middle name.
17 - (5 + 4) - 2 = 6
Our start position is 5 + 2 (i.e. the first name + 2) to account for the space. Translating this to substr:
substr(name, length(first_name)+2, length(name) - (length(first_name) + length(last_name))-2 )
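The arithmetic above can be checked with a small DATA _NULL_ step. This is only a sketch: the literal "First Middle Last" is the example string from the diagram, and the variables start and len are hypothetical helper names that do not appear in the answer's code.

```sas
data _null_;
    length first_name middle_name last_name $50;
    name = "First Middle Last";                 /* 17 characters */
    first_name = scan(name, 1);
    last_name  = scan(name, -1);
    /* hypothetical helpers: start position and length of the middle part */
    start = length(first_name) + 2;             /* 5 + 2 = 7 */
    len   = length(name) - (length(first_name) + length(last_name)) - 2;
    middle_name = substr(name, start, len);
    put start= len= middle_name=;               /* start=7 len=6 middle_name=Middle */
run;
```

Note that the subtraction relies on exactly one blank between words; a name with doubled blanks would shift the result.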

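Adapted from "How to separate first name and middle name and last name". This version handles names of up to four words; a longer name would run past the end of the array.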
data want;
set my_data1;
length first middle middle1 middle2 last $ 40;
array parts[*] first middle1 middle2 last;
do i = 1 to countw(name);
if i = countw(name) and i < dim(parts) then do;
parts[dim(parts)] = scan(name, i);
end;
else do;
parts[i] = scan(name, i);
end;
end;
if middle1 ne "" and middle2 ne "" then middle = catx(" ", middle1, middle2);
else middle = middle1;
if first = "" and last ne "" then do;
first = last;
last = "";
end;
drop name i middle1 middle2;
run;
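If names can have any number of middle words, a loop over the words between the first and the last avoids both the fixed-size array and the length arithmetic. The following is a minimal sketch, assuming the my_data1 dataset from the question; the output dataset name want_loop and the $500 lengths are choices made for this illustration only.

```sas
data want_loop;
    set my_data1;
    length fname middlename lname $500;
    n = countw(name, ' ');
    fname = scan(name, 1, ' ');
    /* a one-word name has no last name */
    if n >= 2 then lname = scan(name, -1, ' ');
    /* words 2 .. n-1 form the middle name; the loop does not run when n <= 2 */
    do i = 2 to n - 1;
        middlename = catx(' ', middlename, scan(name, i, ' '));
    end;
    drop n i;
run;
```

Because scan and countw treat consecutive blanks as a single delimiter, this version does not depend on the words being separated by exactly one space.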

Related

SAS: How do I split by comma and transpose?

I am using SAS Enterprise Guide.
I have a new file and was asked to generate output from it.
Source:
Name feeder_in feeder_out NickName
ABBA 1,2 A,B ABBA
POLA 1,2 C,D,E CONS POLA
and the desired output:
Name feeder_final
ABBA 1
ABBA 2
ABBA A
ABBA B
POLA 1
POLA 2
CONS POLA C
CONS POLA D
CONS POLA E
I have been trying to handle this myself, but with no luck at all.
I tried:
data test;
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
catequipment=catx(',',strip(feeder_in),strip(feeder_out));
do i=1 to countw(catequipment,',');
output;
end;
xequipment=newequipment;
run;
Does anyone have a clue about this?

Here's my understanding of your requirements, based on the desired output: you want your output to have one observation for each combination of NAME and FEEDER_IN, plus another observation for each combination of NICKNAME and FEEDER_OUT.
On that assumption, the code would look something like (not tested):
data want;
set have;
keep name feeder_final;
* Loop over FEEDER_IN and output one obs for each delimited value;
do i = 1 to countw(feeder_in, ',');
feeder_final = scan(feeder_in, i, ',');
output;
end;
* Move the NICKNAME value into NAME;
name = nickname;
* Loop over FEEDER_OUT and output one obs for each delimited value;
do i = 1 to countw(feeder_out, ',');
feeder_final = scan(feeder_out, i, ',');
output;
end;
run;

When transposing multiple columns you might want to also maintain the source row and column identifiers for further downstream analytics. The sequence of the values in the csv might also be important if you need to do pairwise joining on sequence position of the categorical form -- such as needing to match 1A 2B in row 1 and 1C 2D in row 2.
data have;
length name feeder_in feeder_out nickname $20;
/* the & modifier needs at least two blanks between fields */
input Name & feeder_in & feeder_out & NickName &;
datalines;
ABBA  1,2  A,B  ABBA
POLA  1,2  C,D,E  CONS POLA
;
run;
data want;
_row_ + 1;
set have;
feeder = 'in ';
do seq = 1 to countw(feeder_in,',');
value = scan(feeder_in,seq,',');
OUTPUT;
end;
feeder = 'out';
do seq = 1 to countw(feeder_out,',');
value = scan(feeder_out,seq,',');
OUTPUT;
end;
keep _row_ Name feeder seq value NickName;
run;

Define a name variable with the given names represented by initials

I am trying to create a SAS data set with two different variables. Y should be the whole name. The variable Names should be the name with the given names represented by their initials, e.g. Johnson Mike should be "Johnson M." and Smith Robert John should be "Smith R. J.". I'm not sure how to create the Names variable. Can anyone help?
data names;
Length y $ 40;
Input y &;
Names = y;
DATALINES;
Johnson Mike
Smith Robert John
Jones Linda Mary
Brown Marcus
run;

This should work with a do loop:
data names_final;
set names;
do _n_ = 1 to countw(Y,' ');
if _n_ = 1 then name =scan(Y,1);
else name = catx(' ', name, cats(first(scan(y,_n_)),'.'));
end;
run;
For names of up to three words, you can also do:
data names_final;
set names;
name = cats(catx(' ', scan(y,1), catx('. ', first(scan(y,2)),first(scan(y,3) ))),'.');
run;

The functions countw, scan, first and catx may be helpful.
1. Get the number of words in the name.
2. Keep the first word.
3. Loop over the remaining words and append the first letter of each one (every word except the first) to the first word.

I need to know how to get the last chars of some Strings

I have these strings:
DDSPRJ11
DDSPRJ12
DDSPRJ12
DDRJCT
For the first three I want the last 4 characters, and for the last one I want the last 3 characters. How can I get them using substr, in the correct order, e.g. RJ11?

You can do this with regular expression matching using prxchange:
data have;
infile datalines;
input mystr $ @@;
datalines;
DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT
;
run;
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)(.*)/$2/', 1, mystr);
run;
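Since the question asks specifically about substr, here is a hedged sketch of the same idea without a regular expression. It assumes, as the pattern above does, that every value starts with either DDSP or DDR; the dataset name want_substr is hypothetical, and the step reads its own copy of the sample values so that it stands alone.

```sas
data want_substr;
    input mystr $ @@;
    length suffix $8;
    /* =: compares only as many characters as the shorter operand has */
    if mystr =: 'DDSP' then suffix = substr(mystr, 5);
    else if mystr =: 'DDR' then suffix = substr(mystr, 4);
    datalines;
DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT
;
run;
```

With the sample values this should produce RJ11, RJ12, RJ12 and JCT, in input order. Values with any other prefix are left with a blank suffix.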

@user667489's answer is perfect if you can read all of the values separately. If they are all in the same variable, as shown below, you can use the same prxchange approach together with the scan function. prxnext can also be used to achieve the same result. Both examples are shown below.
data have;
val= "DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT";
run;
/* using prxchange with scan*/
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)//', -1, val);
do i = 1 to countw(suffix,' ');
newstr= scan(suffix, i);
output;
end;
drop suffix val;
run;
/* using prxnext */
data want;
length val1 $200;
set have;
start = 1;
stop = length(val);
/* prxparse returns a numeric pattern id, so re is left numeric */
re = prxparse('/(DDSP|DDR)/');
call prxnext(re, start, stop, trim(val), position, length);
do while (position > 0);
/* in this data the suffix happens to be as long as the matched prefix */
val1 = substr(val, position+length, length);
output;
call prxnext(re, start, stop, trim(val), position, length);
end;
drop re start stop position length val;
run;

Here is how you can do it in simple Python.
I assumed that you want the last 4 characters of every word except the last one, which gets its last 3.
string_1 = 'DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT'
list_string = string_1.split()
new_list = []
for i in range(len(list_string)):
    if i == len(list_string) - 1:
        new_list.append(list_string[i][-3:])
    else:
        new_list.append(list_string[i][-4:])
print(new_list)
Output:
['RJ11', 'RJ12', 'RJ12', 'JCT']

SAS - Remove duplicated words in a string

string = "spanner, span, spaniel, span";
From this string I would like to remove all the duplicates keeping one occurrence of the word and then output the revised string using SAS.
The revised string should look like this:
var string = "spanner, span, spaniel";

data a;
string = "spanner,span,spaniel,span,abc,span,bcc";
length word $100;
i = 2;
do while(scan(string, i, ',') ^= '');
word = scan(string, i, ',');
do j = 1 to i - 1;
if word = scan(string, j, ',') then do;
start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
string = cats(substr(string, 1, start - 2), substr(string, start + length(word)));
leave;
end;
end;
i = i + 1;
end;
keep string;
run;

First create a data set with one column containing the words. With cats() the space is eliminated.
data temp(keep=text);
string = "spanner, span, spaniel, span";
do i=1 to count(cats(string),",")+1;
text = scan(string,i);
output;
end;
run;
Eliminate duplicates with nodup (nodupkey also works).
proc sort data=temp nodup;
by text;
run;
Create a macro variable new_string with the unique words.
proc sql noprint;
SELECT text
INTO :new_string separated by ","
FROM temp
;
quit;
Better solution for new specifications:
data temp(keep=i text);
string = tranwrd("I hate the product. I hate it because it smells bad. I hate wasting money.","."," .");
do i=1 to count(string," ")+1;
text = scan(string,i," ");
if text ne "" then do;
output;
end;
end;
run;
proc sort data=temp;
by text i;
run;
data temp2;
set temp;
by text i;
if first.text OR text eq ".";
run;
proc sort data=temp2;
by i;
run;
proc sql noprint;
SELECT text
INTO :new_string separated by ","
FROM temp2
;
quit;

Build the list of unique words into a new variable.
data test;
input string $80.;
length newstring $80;
do i=1 to countw(string,',');
if not findw(newstring,scan(string,i,','),',','t') then
newstring=catx(', ',newstring,scan(string,i,','))
;
end;
cards;
spanner, span, spaniel, span
;

Thanks Robert. Just wanted to let you know I found a flaw in your code. The inner loop modifies the string by removing the duplicate word, but the outer loop then moves on to the next position regardless. Example: "A,B,C,B,B" becomes "A,B,C,B", because the inner loop removes the fourth item (the first duplicate "B") and the outer loop then never checks the last "B", which has moved into the position of the fourth item.
My solution:
data a;
string = "spanner,span,spaniel,span,abc,span,bcc";
length word $100;
i = 2;
do while(scan(string, i, ',') ^= '');
hit = 0;
word = scan(string, i, ',');
do j = 1 to i - 1;
if word = scan(string, j, ',') then do;
start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
string = cats(substr(string, 1, start - 2), substr(string, start + length(word)));
hit = 1;
leave;
end;
end;
if hit = 0 then i = i + 1;
end;
keep string;
run;
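Another way to avoid the shifting-position problem described above is to leave the original string untouched and build the result in a second variable, in the same spirit as the findw answer earlier. The sketch below uses the "A,B,C,B,B" example; the variable names newstring and word are chosen for this illustration.

```sas
data _null_;
    length string newstring word $80;
    string = "A,B,C,B,B";
    do i = 1 to countw(string, ',');
        word = strip(scan(string, i, ','));
        /* 't' trims trailing blanks so the padded word still matches */
        if not findw(newstring, word, ',', 't') then
            newstring = catx(',', newstring, word);
    end;
    put newstring=;    /* newstring=A,B,C */
run;
```

Because the loop index always refers to the unmodified string, no word is skipped when a duplicate is dropped.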

Reading text file in SAS with delimiter in wrong places

I am reading a .txt file into SAS, that uses "|" as the delimiter. The issue is there is one column that is using "|" as a word separator as well instead of acting like delimiter, this needs to be in one column.
For example the txt file looks like:
apple|fruit|Healthy|choices|of|food|12|2012|chart
needs to look like this in the SAS dataset:
apple | fruit | Healthy choices of Food | 12 | 2012 | chart
How do I eliminate "|" between "Healthy choices of Food"?

I think this will do what you want:
data tmp1;
length tmp $100;
input tmp $;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
data tmp2;
set tmp1;
num_delims=length(tmp)-length(compress(tmp,"|"));
expected_delims=5;
extra_delims=num_delims-expected_delims;
length new_var $100;
i=1;
do while(scan(tmp,i,"|") ne "");
if i<=2 or (extra_delims+2)<i<=num_delims then new_var=trim(new_var)||scan(tmp,i,"|")||"|";
else new_var=trim(new_var)||scan(tmp,i,"|")||"#";
i+1;
end;
new_var=left(tranwrd(new_var,"#"," "));
run;

This isn't particularly elegant, but it will work:
data tmp;
input tmp $50.;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
;
run;
data tmp;
set tmp;
var1 = scan(tmp,1,'|');
var2 = scan(tmp,2,'|');
var4 = scan(tmp,-3,'|');
var5 = scan(tmp,-2,'|');
var6 = scan(tmp,-1,'|');
var3 = tranwrd(tmp,trim(var1)||"|"||trim(var2),"");
var3 = tranwrd(var3,trim(var4)||"|"||trim(var5)||"|"||trim(var6),"");
var3 = tranwrd(var3,"|"," ");
run;

Expanding a little on Itzy's answer, here is another possible solution:
data want;
/* Define variables */
attrib item length=$10 label='Item';
attrib class length=$10 label='Family';
attrib desc length=$80 label='Item Description';
attrib count length=8 label='Some number';
attrib year length=$4 label='Year';
attrib somevar length=$10 label='Some variable';
length countc $8; /* A temp variable */
infile 'c:\temp\delimited_temp.txt' lrecl=1000 truncover;
input;
item = scan(_infile_,1,'|','mo');
class = scan(_infile_,2,'|','mo');
countc = scan(_infile_,-3,'|','mo'); /* Temp var for numeric field */
count = inputn(countc,'8.'); /* Re-read the numeric field */
year = scan(_infile_,-2,'|','mo');
somevar = scan(_infile_,-1,'|','mo');
desc = tranwrd(
substr(_infile_
,length(item)+length(class)+3
,length(_infile_)
- ( length(item)+length(class)+length(countc)
+length(year)+length(somevar)+5))
,'|',' ');
drop countc;
run;
The key in this case is to read your file directly and handle the delimiters yourself. This can be tricky and requires that your data file is exactly as described. A much better solution would be to go back to whoever gave you this data and ask them to deliver it in a more appropriate form. Good luck!
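The substr arithmetic in this answer can be checked against the sample record without reading a file. This is a sketch only: the record is assigned to a hypothetical variable rec instead of coming from _infile_, and start and len are helper names introduced here.

```sas
data _null_;
    length rec $200 desc $80;
    rec = 'apple|fruit|Healthy|choices|of|food|12|2012|chart';
    item    = scan(rec, 1, '|');
    class   = scan(rec, 2, '|');
    countc  = scan(rec, -3, '|');
    year    = scan(rec, -2, '|');
    somevar = scan(rec, -1, '|');
    /* skip item, class and their two delimiters, then land on the next character */
    start = length(item) + length(class) + 3;
    /* remove the five fixed fields and the five delimiters that bound them */
    len = length(rec) - (length(item) + length(class) + length(countc)
                         + length(year) + length(somevar) + 5);
    desc = tranwrd(substr(rec, start, len), '|', ' ');
    put start= len= desc=;   /* start=13 len=23 desc=Healthy choices of food */
run;
```

The record is 49 characters long and the five fixed fields plus their five delimiters take 26 of them, which leaves 23 for the description.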

Another possible workaround.
data tmp;
infile '/path/to/textfile';
input tmp :$100.;
array varlst (*) $30 v1-v6;
a=countw(tmp,'|');
do i=1 to dim(varlst);
if i<=2 then
varlst(i) = scan(tmp,i,'|');
else if i>=4 then
varlst(i) = scan(tmp,a-(dim(varlst)-i),'|');
else do j=3 to a-(dim(varlst)-i); /* words 3 .. a-3 make up the description */
varlst(i)=catx(' ', varlst(i),scan(tmp,j,'|'));
end;
end;
drop tmp a i j;
run;
