I have a number of text entries (municipalities) from which I need to remove the s at the end.
Data test;
input city $;
datalines;
arjepogs
askers
Londons
;
run;
data cities;
set test;
if prxmatch("/^(.*?)s$/",city)
then city=prxchange("s/^(.*?)s$/$1/",-1,city);
run;
Strangely enough, my s's are only removed from my first entry.
What am I doing wrong?
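CITY is defined with length $8 (the default for list input), so shorter values are padded with trailing blanks. The s in Londons is in the 7th position of the stored string, not the LAST position, so the pattern s$ does not match. Use the TRIM() function to remove the trailing spaces from the value of the variable before matching.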
data have;
input city $20.;
datalines;
arjepogs
Kent
askers
Londons
;
data want;
set have;
length new_city $20 ;
new_city=prxchange("s/^(.*?)s$/$1/",-1,trim(city));
run;
Result
Obs city new_city
1 arjepogs arjepog
2 Kent Kent
3 askers asker
4 Londons London
You could also just change the REGEX to account for the trailing spaces.
new_city=prxchange("s/^(.*?)s\ *$/$1/",-1,city);
Here is another solution using only SAS string functions and no regex. Note that in this case there is no need to trim the variable, because LENGTH() returns the position of the last non-blank character:
data cities;
set test;
if substr(city,length(city)) eq "s" then
city=substr(city,1,length(city)-1);
run;
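To see why the trailing blanks matter, here is a minimal sketch that compares the storage length of the variable with the position of its last non-blank character. The variable names stored, visible and last are made up for illustration.

```sas
data _null_;
  length city $8;
  city = 'Londons';
  stored  = lengthc(city);                  /* storage length, padding included */
  visible = length(city);                   /* position of the last non-blank character */
  last    = substr(city, length(city), 1);  /* the character the regex should be testing */
  put stored= visible= last=;
run;
```

For this value the log should show stored=8, visible=7 and last=s: the regex sees a blank in position 8, while the LENGTH()-based approach looks at position 7.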
Related
I have some data in the form of a column in a dataset (named Person_details), where each entry holds an unknown number of names. Each name (split up by spaces) is followed by an underscore, followed by that person's identifier (7 characters).
Is there a way to split these entries up automatically, rather than repeatedly finding the position of the underscore, and then taking the substring before and after?
Person_details:
Evan Davies_123F323 Adam John Smith_342D427 Karl Marx_903C943
There are an unknown number of names in each cell, e.g. some have just one name and some have 20. Also complicated by the fact that some entries have middle name(s).
The ideal output would be in the form
Name Code
Evan Davies 123F323
Adam John Smith 342D427
Karl Marx 903C943
You could just use SCAN() instead.
data have;
string='Evan Davies_123F323 Adam Smith_342D427 Karl Marx_903C943';
length name $50 code $7 ;
do index=1 to countw(string,' ');
name = catx(' ',name,scan(string,index,' '));
if index(name,'_') then do;
code = scan(name,-1,'_');
name = substr(name,1,length(name)-length(code)-1);
output;
name=' ';
end;
end;
run;
You can use a Perl regular expression (regex) to detect and extract pieces from patterned text. SAS routine PRXNEXT iterates through matches, and function PRXPOSN extracts pieces.
Example:
data have;
text = 'Evan Davies_123F323 Adam John Smith_342D427 Karl Marx_903C943';
run;
data want(keep=name code);
rx = prxparse('/(.+?)_(.{7}( |$))/');
set have;
start = 1;
stop = length(text);
do seq = 1 by 1;
call prxnext(rx,start,stop,text,position,length);
if position=0 then leave;
name = prxposn(rx,1,text);
code = prxposn(rx,2,text);
output;
end;
run;
I have a SAS string that always starts with a date. I want to remove the date from the string.
Example of data is below (data does not have bullets, included bullets to increase readability)
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
I want the data to look like this (data does not have bullets, included bullets to increase readability)
test_num15
recom_1_test1
test_0_8_i0|vacc_previous0
Use INDEX to find the position of the first '|' in the string, then SUBSTR to take everything after it; or use a regular expression.
data have;
input x $50.;
x1=substr(x,index(x,'|')+1);
x2=prxchange('s/([^_]+\|)(?=\w+)//',1,x);
cards;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;
run;
This is a great use case for call scan. If the length of the date is constant (always 10), you don't actually need it: start would always be 12, so you could skip straight to the substr, as user667489 noted in the comments. If the length is not constant, call scan is helpful.
data have;
length textstr $100;
input textstr $;
datalines;
10/01/2016|test_num15
11/15/2016|recom_1_test1
03/04/2017|test_0_8_i0|vacc_previous0
;;;;
run;
data want;
set have;
call scan(textstr,2,start,length,'|');
new_textstr = substr(textstr,start);
run;
It would also let you grab only the second word, if that's useful, by passing length as the third argument to substr.
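As a rough sketch of that last point, the step below takes only the second '|'-delimited word. The variable names len and second_only are assumptions chosen for illustration (len is used instead of length only to avoid confusion with the LENGTH statement).

```sas
data _null_;
  length textstr $100;
  textstr = '03/04/2017|test_0_8_i0|vacc_previous0';
  /* CALL SCAN returns the start position and length of the 2nd word */
  call scan(textstr, 2, start, len, '|');
  /* The third SUBSTR argument limits the result to that word only */
  second_only = substr(textstr, start, len);
  put start= len= second_only=;
run;
```

Here second_only should hold test_0_8_i0, without the trailing |vacc_previous0 part.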
Is it possible to use the number in this string:
'xx8xx'
by replacing the number with 8 spaces to get this string:
'xx        xx'
I can identify the number between the xx but the replacement syntax does not work as intended:
PRXCHANGE(s/xx([\d]*)xx/' ' x $1/io, -1, 'xx8xx')
Is there a way to use the number being held in $1 to repeat the space character by that number i.e. something like ' ' x $1?
Any help much appreciated!
Tiaan
Suppose you need to replace the number with three blanks.
data _null_;
x=prxchange('s/(xx)\d+(xx)/$1   $2/', -1, 'xx8xx');
_x=prxchange('s/(?=\w+)(\d+)/   /',1,'xx8xx');
put _all_;
run;
Edit:
I missed important information. Tranwrd and repeat could be used to get it.
data _null_;
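/* REPEAT(' ',n) returns n+1 blanks, so subtract 1; \D* keeps a multi-digit number intact */
x=tranwrd('xx8xx', prxchange('s/\D*(\d+).*/$1/',1,'xx8xx'), repeat(' ',input(prxchange('s/\D*(\d+).*/$1/',1,'xx8xx'),8.)-1));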
put _all_;
run;
You'll need to extract first, then compile a new regex. This will be expensive since you have to compile once per line.
data have;
input xstr $;
datalines;
xx8xx
xx3xx
xx4xx
;;;;
run;
data want;
set have;
rx1 = prxparse('/xx(\d+)xx/io'); /* capture the whole number, not only its last digit */
rc1 = prxmatch(Rx1,xstr);
num_x = prxposn(rx1,1,xstr);
rx2 = prxparse(cat('s/(xx)[\d]*(xx)/$1',repeat(" ",num_x-1),'$2/i'));
newstr = prxchange(rx2,-1,xstr);
run;
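If compiling a regex per row is a concern, one possible alternative is to match once with a constant pattern and rebuild the string with SUBSTR and REPEAT. This is only a sketch, not the approach from the answer above: the names newstr, pos, len and n are made up, it assumes one xx&lt;number&gt;xx pattern per value, and it assumes the padded result fits in 40 characters.

```sas
data have;
  input xstr $;
  datalines;
xx8xx
xx3xx
xx12xx
;
run;

data want;
  set have;
  length newstr $40;
  /* Constant pattern, so it is compiled only once */
  rx = prxparse('/xx(\d+)xx/io');
  if prxmatch(rx, xstr) then do;
    /* Position and length of the captured digits */
    call prxposn(rx, 1, pos, len);
    n = input(substr(xstr, pos, len), 8.);
    /* REPEAT(' ', n-1) yields n blanks; guard against n=0 */
    if n > 0 then
      newstr = cat(substr(xstr, 1, pos-1), repeat(' ', n-1), substr(xstr, pos+len));
    else
      newstr = cat(substr(xstr, 1, pos-1), substr(xstr, pos+len));
  end;
  else newstr = xstr;
  drop rx pos len;
run;
```

CAT is used rather than CATS or CATX because it does not strip the blanks being inserted.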
I wish to extract the two words "blood" and "loss" where they occur in the closest proximity within a string. I have the code below, but it doesn't work for ID 4. I wish to get the substring "blood loss", not "bloods but blood loss".
data test;
infile datalines truncover;
input id $2. string $80.;
datalines;
1 there is one blood something loss
2 no something else here
3 three blood loss again blood loss can not believe loss of blood
4 two bloods but blood loss
;
run;
data test1;
set test;
rx=prxparse("/blood.*?loss|loss.*?blood/i");
start=1;
stop =length(trim(string));
do until (p=0);
call prxnext(rx,start,stop,trim(string),p,l);
if p>0 then do;
sub=substr(string,p,l);
output;
end;
end;
run;
Only a very small change is needed if "bloods" should be ignored. Add a space between blood and the . in the first part of the regex, so that it matches only the word blood followed by a space. Below is the replacement prxparse statement.
rx=prxparse("/blood .*?loss|loss.*?blood/i");
As per the updated comment, to match "blood loss" when the string is like "blood something blood loss", a negative lookahead can help.
prxparse("/blood (.(?!blood))*?loss/i")
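A small sketch of how that pattern behaves on the example string, using CALL PRXSUBSTR to get the position and length of the first match. The variable names p, l and sub follow the question's code; the test string is the one from the comment.

```sas
data _null_;
  length string sub $80;
  string = 'blood something blood loss';
  rx = prxparse('/blood (.(?!blood))*?loss/i');
  /* Position and length of the first match, 0 if there is none */
  call prxsubstr(rx, trim(string), p, l);
  if p > 0 then sub = substr(string, p, l);
  put p= l= sub=;
run;
```

The lookahead stops the match from running across a later "blood", so the first "blood" fails and the match should start at the second one, giving sub=blood loss.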
I would like to read following instream datalines
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
I used the code below, but it read the AAA items into the team variable and not into div. Also, where should I place the & (ampersand) to read character values with embedded blanks?
data scores2;
infile datalines dlm=",";
input name : $10. score1-score3 team $20. div $;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
Notice that I have also used : before team (you already used the colon modifier for name, so I am not sure why you left it out here). As I mentioned in your other question, use the : colon modifier (see "tilde, dlm and colon format modifier in list input"), which tells SAS to use the informat supplied but to stop reading the value for this variable when a delimiter is encountered. Because you had not used it here, SAS tried to read 20 characters for team, even though there was a delimiter in between.
Tested
data scores2;
infile datalines dlm=",";
input name : $10.
score1-score3
team : $20.
div : $3.;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
Another way to do this that's often a bit easier to read is to use the informat statement.
data scores2;
infile datalines dlm=",";
informat name $10.
team $20.
div $4.;
input name $ score1-score3 team $ div $;
datalines;
Smith,12,22,46,Green Hornets,AAA
FriedmanLi,23,19,25,High Volts,AAA
Jones,09,17,54,Las Vegas,AA
;
run;
That accomplishes the same thing as using the colon (input name :$10.) but organizes it a bit more cleanly.
And just to be clear, embedded blanks are irrelevant in comma-delimited input; '20'x (i.e., space) is just another character when it's not the delimiter. What the ampersand does is addressed in this article; more specifically, if space is the delimiter, it lets you require two consecutive delimiters to end a field. Example:
data scores2;
infile datalines dlm=" ";
informat name $10.
team $20.
div $4.;
input name $ score1-score3 team & $ div $;
datalines;
Smith 12 22 46 Green Hornets  AAA
FriedmanLi 23 19 25 High Volts  AAA
Jones 09 17 54 Las Vegas  AA
;
run;
Note the double space after each team name; that's required by the &. But this is only because the delimiter is a space (which is the default, so the double space would also be needed if you removed the dlm=" ").