SAS - Remove duplicated words in a string

string = "spanner, span, spaniel, span";
From this string I would like to remove all duplicate words, keeping one occurrence of each, and then output the revised string using SAS.
The revised string should look like this:
var string = "spanner, span, spaniel";

data a;
string = "spanner,span,spaniel,span,abc,span,bcc";
length word $100;
i = 2;
do while(scan(string, i, ',') ^= '');
word = scan(string, i, ',');
do j = 1 to i - 1;
if word = scan(string, j, ',') then do;
start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
string = cats(substr(string, 1, start - 2), substr(string, start + length(word)));
leave;
end;
end;
i = i + 1;
end;
keep string;
run;

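First create a data set with one column containing the words. count() on the comma gives the number of words, and scan() with its default delimiters treats both blanks and commas as separators, so the spaces are dropped.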
data temp(keep=text);
string = "spanner, span, spaniel, span";
do i=1 to count(cats(string),",")+1;
text = scan(string,i);
output;
end;
run;
Eliminate duplicates with nodup (nodupkey also works). Note that sorting puts the words in alphabetical order rather than their original order.
proc sort data=temp nodup;
by text;
run;
Create a macro variable new_string with the unique words.
proc sql noprint;
SELECT text
INTO :new_string separated by ","
FROM temp
;
quit;
Better solution for new specifications:
data temp(keep=i text);
string = tranwrd("I hate the product. I hate it because it smells bad. I hate wasting money.","."," .");
do i=1 to count(string," ")+1;
text = scan(string,i," ");
if text ne "" then do;
output;
end;
end;
run;
proc sort data=temp;
by text i;
run;
data temp2;
set temp;
by text i;
if first.text OR text eq ".";
run;
proc sort data=temp2;
by i;
run;
proc sql noprint;
SELECT text
INTO :new_string separated by ","
FROM temp2
ORDER BY i
;
quit;

Build the list of unique words into a new variable.
data test;
input string $80.;
length newstring $80;
do i=1 to countw(string,',');
if not findw(newstring,scan(string,i,','),',','t') then
newstring=catx(', ',newstring,scan(string,i,','))
;
end;
cards;
spanner, span, spaniel, span
;
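
One edge case the step above may not catch (an observation, not part of the original answer): scan() with only ',' as the delimiter returns every word after the first with a leading blank, so a word that first appears at the start of the string and then again later (for example span, spanner, span) would not be matched by findw(). A minimal sketch that strips each word before the lookup, assuming every item is a single word with no embedded blanks; the names test2 and word are illustrative only:

```sas
data test2;
  input string $80.;
  length newstring word $80;
  do i = 1 to countw(string, ',');
    /* remove the leading blank that scan() leaves after each comma */
    word = strip(scan(string, i, ','));
    /* ' ,' makes blank and comma both delimiters; the blank is listed
       first so that the 't' modifier does not trim it away */
    if not findw(newstring, word, ' ,', 't') then
      newstring = catx(', ', newstring, word);
  end;
  drop i word;
  cards;
span, spanner, span
;
```

Under those assumptions newstring should end up as span, spanner. Items that contain blanks would need a different delimiter strategy.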

Thanks Robert. Just wanted to let you know I found a flaw in your code. The inner loop modifies the string by removing the duplicate word, but the outer loop then advances to the next position no matter what. Example: "A,B,C,B,B" becomes "A,B,C,B". The inner loop removes the "B" in position four, the last "B" shifts into position four, and the outer loop moves on to position five without ever checking it.
My solution:
data a;
string = "spanner,span,spaniel,span,abc,span,bcc";
length word $100;
i = 2;
do while(scan(string, i, ',') ^= '');
hit = 0;
word = scan(string, i, ',');
do j = 1 to i - 1;
if word = scan(string, j, ',') then do;
start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
string = cats(substr(string, 1, start - 2), substr(string, start + length(word)));
hit = 1;
leave;
end;
end;
if hit = 0 then i = i + 1;
end;
keep string;
run;

Related

I need to know how to get the last chars of some Strings

I have these strings:
DDSPRJ11
DDSPRJ12
DDSPRJ12
DDRJCT
For the first 3 I want the last 4 characters, and for the last one I want the last 3 characters. How can I get them using substr, in the correct order (e.g. RJ11)?

You can do this with regular expression matching using prxchange:
data have;
infile datalines;
input mystr $ @@;
datalines;
DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT
;
run;
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)(.*)/$2/', 1, mystr);
run;
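
Since the question asks about substr specifically, here is a minimal sketch of that route as an alternative to the regular expression. It assumes the rule is "values starting with DDSP keep their last 4 characters, everything else keeps its last 3"; that rule is inferred from the four sample values and is not stated in the question. The names want_substr and n are illustrative.

```sas
data want_substr;
  set have;
  length suffix $8;
  /* Assumption: the DDSP prefix means a 4-character suffix, otherwise 3 */
  if mystr =: 'DDSP' then n = 4;
  else n = 3;
  /* length() ignores trailing blanks, so this counts back from the last real character */
  suffix = substr(mystr, length(mystr) - n + 1);
  drop n;
run;
```

The =: operator compares only as many characters as the shorter operand has, which makes it a convenient "starts with" test here.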

@user667489's answer is perfect if you can read all of the values separately. If they are all in the same variable, as shown below, you can use the same code and add the scan function. prxnext can also be used to achieve the same result. Both examples are shown below.
data have;
val= "DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT";
run;
/* using prxchange with scan*/
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)//', -1, val);
do i = 1 to countw(suffix,' ');
newstr= scan(suffix, i);
output;
end;
drop suffix val;
run;
/* using prxnext*/
data want;
length val1 $200;
set have;
start = 1;
stop = length(val);
re = prxparse('/(DDSP|DDR)/');
call prxnext(re, start, stop, trim(val), position, length);
do while (position > 0);
/* take the characters after the matched prefix; in this data the suffix */
/* is exactly as long as the prefix (4 after DDSP, 3 after DDR) */
val1 = substr(val, position+length, length);
output;
call prxnext(re, start, stop, trim(val), position, length);
end;
drop re start stop position length val;
run;

Here is how you can do it in simple Python.
I assumed that you want the last 4 characters of every word except the last one, which gets the last 3.
string_1 = 'DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT'
list_string = string_1.split()
new_list = []
for i in range(len(list_string)):
    if i == len(list_string) - 1:
        new_list.append(list_string[i][-3:])
    else:
        new_list.append(list_string[i][-4:])
print(new_list)
Output:
['RJ11', 'RJ12', 'RJ12', 'JCT']

SAS select records (or columns) where the value sums up to a specific number

I have a table containing values like this
a b
110 1024
120 987
130 456
140 312
Is it possible in SAS to find all combinations of variable b that sum up to a specific value?

As user667489 has said in the comments, this is not a simple problem. For a small number of values, you can try the following approach:
data test;
do i= 1 to 20;
b=int(abs(rand('norm', 0, 1)) * 10);
output;
end;
run;
proc transpose data = test out = testFormatted prefix=b;
id i;
var b;
run;
data _null_;
set testFormatted;
array values [*] b:;
numValues = dim(values);
length workingComb $32676;
do k = 1 to numValues;
combNum = comb(numValues, k);
do i = 1 to combNum;
rc=lexComb(i, k, of values[*]);
result = 0;
workingComb = '';
do j = 1 to k;
result = result + values[j];
end;
if result = 100 then do;
do m = 1 to k;
workingComb = catx(' ', workingComb, put(values[m],best.));
end;
put 'Combination found: ' rc= k= workingComb= result=;
output;
end;
end;
end;
run;

split string to columns with content fill

I have data that looks like this:
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
I need to add:
A new column for each sequence, with a column heading made by adding the prefix 'RT' and dropping the letters that follow the numeric part of the sequence
The new column filled with the letters that follow the numeric part of the sequence
I need to create this:
ID   Sequence              RTE6  RTK11  RTQ174  RTD177  RTV245
-----------------------------------------------------------------------
101  E6S,K11T,Q174K,D177E  S     T      K       E
102  K11T,V245EKQ                T                      EKQ

I assume you want a SAS data set and not a report. ANYDIGIT makes it pretty easy to find the last non-digit sub-string.
data seq;
infile cards firstobs=3;
input id:$3. sequence :$50.;
cards;
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
;;;;
run;
proc print;
run;
data seq2V / View=seq2V;
set seq;
length w name sub $32 subl 8;
do i = 1 by 1;
w = scan(sequence,i,',');
if missing(w) then leave;
subl = anydigit(w,-99);
name = substrn(w,1,subl);
sub = substrn(w,subl+1);
output;
end;
run;
proc transpose data=seq2V out=seq3(drop=_name_) prefix=RT;
by id sequence;
var sub;
id name;
run;
proc print;
run;
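
To make the ANYDIGIT step above easier to follow, here is a small stand-alone sketch of what it does to a single token. The token value is taken from the question; the variable p is illustrative. A negative start position makes ANYDIGIT search from right to left, and a magnitude larger than the string length starts the search at the end of the string.

```sas
data _null_;
  w = 'V245EKQ';
  p = anydigit(w, -99);     /* position of the last digit, found by searching right to left */
  name = substrn(w, 1, p);  /* everything up to and including the last digit */
  sub = substrn(w, p + 1);  /* the letters that follow the numeric part */
  put p= name= sub=;
run;
```

For this token the log should show p=4 name=V245 sub=EKQ, which is how V245EKQ ends up as the value EKQ in column RTV245.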

I had a similar problem a while ago. The code is adapted to your problem.
I found this solution to work faster than anything I tried with proc transpose.
Still, overall performance on huge datasets (especially with many different sequences) is not great, as we loop over all strings in both data steps and, in the second one, also over all of the final variables.
Can anyone offer a faster solution?
(Caution: MacroVar is limited to 65534 Characters.)
data var_name ;
set in_data;
length var string $30.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
var = substr(string,1,anydigit(string,-99));
output;
keep var;
end;
run;
proc sql noprint;
select distinct compress("RT"||var) into :var_list separated by ' '
from var_name;
quit;
%put &var_list.;
data out_data;
set in_data;
length string &var_list. $30. n 8. ;
array a_var [*] &var_list.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
do j = 1 to dim(a_var);
n = anydigit(string,-99) ;
if substr(vname(a_var[j]),3) eq substr(string,1,n) then a_var[j] = substr(string,n+1);
end;
end;
drop string i j n;
run;

syntax search over a string with sas

Got the following example
I'm trying to find out whether any part of the string in column nomvar of table tata exists in col1 of table toto and, if so, return its definition from col2.
For I2010,RT,IS-IPI,F_CC11_X_CCXBA, I would have "yes,toto,tata,well" in the column intitule.
I thought about using proc sql with an insert and a select, but I have two tables and I would need to do a join.
At the same time, I thought about putting everything in one table, but I'm unsure if that is a good idea.
Any suggestions are welcome as I'm deeply stuck.

The SAS data step hash object is a nice way to do this. It allows you to read the Toto table into memory and it becomes a lookup table for you. Then you just walk the string from the Tata table using the scan function, tokenize, and lookup the col2 value. Here is the code.
By the way, turning table Tata into a structure like Toto and performing join is a perfectly rational way to do this, too.
/*Create sample data*/
data toto;
length col1 col2 $ 100;
col1='I2010';
col2='yes';
output;
col1='RT';
col2='toto';
output;
col1='IS-IPI';
col2='tata';
output;
col1='F_CC11_X_CCXBA';
col2='well';
output;
run;
data tata;
length nomvar intitule $ 100;
nomvar='I2010,RT,IS-IPI,F_CC11_X_CCXBA';
run;
/*Now for the solution*/
/*You can do this lookup easily with a data step hash object*/
data tata;
set tata;
length col1 col2 token $ 100;
drop col1 col2 token i sepchar rc;
/*slurp the data in from the Toto data set into the hash*/
if (_n_ = 1) then do;
declare hash toto_hash(dataset: 'work.toto');
rc = toto_hash.definekey('col1');
rc = toto_hash.definedata('col2');
toto_hash.definedone();
end;
/*now walk the tokens in data set tata and perform the lookup to get each value*/
i = 1;
sepchar = ''; /*this will be a comma after the first iteration of the loop*/
intitule = '';
do until (token = '');
/*grab nth item in the comma-separated list*/
token = scan(nomvar, i, ',');
/*lookup the col2 value from the toto data set*/
rc = toto_hash.find(key:token);
if (rc = 0) then do;
/*lookup successful so tack the value on*/
intitule = strip(intitule) || sepchar || col2;
sepchar = ',';
end;
i = i + 1;
end;
run;

Assuming your data is all structured like this (you're looking at the different strings in between . characters), I would think the easiest way is to normalize TATA (splitting by .), then do a straight join, and then (if you need to) transpose back. (It might be better to leave it vertical; very likely you would find this a more useful structure for analysis.)
data tata_v;
set tata;
call scan(nomvar,1,position,length,'.');
do _i = 1 by 1 while (position gt 0);
nomvar_out = substr(nomvar,position,length);
output;
call scan(nomvar,_i+1,position,length,'.');
end;
run;
Now you can join on nomvar_out and then (if needed) recombine things.
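
The join and the optional recombination could look roughly like this. It is a sketch, assuming tata_v holds one row per token in nomvar_out (with the token index in _i) and that toto has col1 and col2 as in the question; the names tata_joined and tata_final are illustrative.

```sas
/* Look up the definition of each token */
proc sql;
  create table tata_joined as
  select v.nomvar, v._i, t.col2
  from tata_v as v
  inner join toto as t
    on v.nomvar_out = t.col1
  order by v.nomvar, v._i;
quit;

/* Optional: rebuild one comma-separated definition list per original string */
data tata_final;
  length intitule $100;
  retain intitule;
  set tata_joined;
  by nomvar;
  if first.nomvar then intitule = '';
  intitule = catx(',', intitule, col2);
  if last.nomvar then output;
  keep nomvar intitule;
run;
```

The inner join silently drops tokens that have no match in toto; a left join would keep them with a missing col2 if that is preferable.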

Reading text file in SAS with delimiter in wrong places

I am reading a .txt file into SAS, that uses "|" as the delimiter. The issue is there is one column that is using "|" as a word separator as well instead of acting like delimiter, this needs to be in one column.
For example the txt file looks like:
apple|fruit|Healthy|choices|of|food|12|2012|chart
needs to look like this in the SAS dataset:
apple | fruit | Healthy choices of Food | 12 | 2012 | chart
How do I eliminate "|" between "Healthy choices of Food"?

I think this will do what you want:
data tmp1;
length tmp $100;
input tmp $;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
data tmp2;
set tmp1;
num_delims=length(tmp)-length(compress(tmp,"|"));
expected_delims=5;
extra_delims=num_delims-expected_delims;
length new_var $100;
i=1;
do while(scan(tmp,i,"|") ne "");
if i<=2 or (extra_delims+2)<i<=num_delims then new_var=trim(new_var)||scan(tmp,i,"|")||"|";
else new_var=trim(new_var)||scan(tmp,i,"|")||"#";
i+1;
end;
new_var=left(tranwrd(new_var,"#"," "));
run;

This isn't particularly elegant, but it will work:
data tmp;
input tmp $50.;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
;
run;
data tmp;
set tmp;
var1 = scan(tmp,1,'|');
var2 = scan(tmp,2,'|');
var4 = scan(tmp,-3,'|');
var5 = scan(tmp,-2,'|');
var6 = scan(tmp,-1,'|');
var3 = tranwrd(tmp,trim(var1)||"|"||trim(var2),"");
var3 = tranwrd(var3,trim(var4)||"|"||trim(var5)||"|"||trim(var6),"");
var3 = tranwrd(var3,"|"," ");
run;

Expanding a little on Itzy's answer, here is another possible solution:
data want;
/* Define variables */
attrib item length=$10 label='Item';
attrib class length=$10 label='Family';
attrib desc length=$80 label='Item Description';
attrib count length=8 label='Some number';
attrib year length=$4 label='Year';
attrib somevar length=$10 label='Some variable';
length countc $8; /* A temp variable */
infile 'c:\temp\delimited_temp.txt' lrecl=1000 truncover;
input;
item = scan(_infile_,1,'|','mo');
class = scan(_infile_,2,'|','mo');
countc = scan(_infile_,-3,'|','mo'); /* Temp var for numeric field */
count = inputn(countc,'8.'); /* Re-read the numeric field */
year = scan(_infile_,-2,'|','mo');
somevar = scan(_infile_,-1,'|','mo');
desc = tranwrd(
substr(_infile_
,length(item)+length(class)+3
,length(_infile_)
- ( length(item)+length(class)+length(countc)
+length(year)+length(somevar)+5))
,'|',' ');
drop countc;
run;
The key in this case is to read your file directly and handle the delimiters yourself. This can be tricky and requires that your data file is exactly as described. A much better solution would be to go back to whoever gave you this data and ask them to deliver it in a more appropriate form. Good luck!

Another possible workaround.
data tmp;
infile '/path/to/textfile';
input tmp :$100.;
array varlst (*) $30 v1-v6;
a=countw(tmp,'|');
do i=1 to dim(varlst);
if i<=2 then
varlst(i) = scan(tmp,i,'|');
else if i>=4 then
varlst(i) = scan(tmp,a-(dim(varlst)-i),'|');
else do j=3 to a-(dim(varlst)-i);
varlst(i)=catx(' ', varlst(i),scan(tmp,j,'|'));
end;
end;
drop tmp a i j;
run;
