I want to create a variable that holds the character immediately before a specified character (*) in a string. When the specified character appears several times in the string, as in the example below, how can I build a single variable that concatenates all of the preceding characters, separated by commas?
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
run;
Code:
data want;
set have;
cnt = count(string, "*");
_startpos = 0;
do i=0 to cnt until(_startpos=0);
before = catx(",",substr(string, find(string, "*", _startpos+1)-1,1));
end;
drop i _startpos;
run;
That code outputs before=C for both the first and second observations. However, I want before=C,E for the first observation and before=C,W,d for the second.
You can use a Perl regular expression replacement pattern to transform the original string.
Example:
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
;
data want;
set have;
csl = prxchange('s/([^*]*?)([^*])\*/$2,/',-1,string); /* comma separated letters */
csl = prxchange('s/, *$//',1,csl); /* remove trailing comma */
run;
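As a quick cross-check of what the two substitutions above do, here is a minimal Python sketch that applies the same pattern with the standard re module. The function name chars_before_star is made up for illustration; only the regular expression is taken from the answer.

```python
import re

def chars_before_star(s: str) -> str:
    # Replace each run "...X*" with "X," where X is the character
    # immediately before the star (same pattern as the prxchange call).
    out = re.sub(r'([^*]*?)([^*])\*', r'\2,', s)
    # Remove the trailing comma left by the last replacement.
    return re.sub(r',\s*$', '', out)

print(chars_before_star('ABC*EDE*'))   # C,E
print(chars_before_star('EFCC*W*d*'))  # C,W,d
```

Note that in this sketch a star in the first position has no preceding character to match, so it is left in the output unchanged.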
Make sure to increment _STARTPOS so your loop will finish. You can use CATX() to add the commas. Simplify selecting the character by using CHAR() instead of SUBSTR(). Also make sure to tell the data step how to define the new variable instead of forcing it to guess. I also included a test to handle the situation where * is in the first position.
data have;
input string $20.;
datalines;
ABC*EDE*
EFCC*W*d*
*XXXX*
asdf
;
data want;
set have;
length before $20 ;
_startpos = 0;
do cnt=0 to length(string) until(_startpos=0);
_startpos = find(string,'*',_startpos+1);
if _startpos>1 then before = catx(',',before,char(string,_startpos-1));
end;
cnt=cnt-(string=:'*');
drop _startpos;
run;
Results:
Obs    string       before    cnt
1      ABC*EDE*     C,E       2
2      EFCC*W*d*    C,W,d     3
3      *XXXX*       X         1
4      asdf                   0
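The loop logic described above (advance the search position on every pass, and skip a marker that sits in the first position) can be sketched in Python as follows. The function name before_each_marker is hypothetical, and Python's str.find is 0-based and returns -1 when nothing is found, whereas SAS FIND is 1-based and returns 0.

```python
def before_each_marker(s: str, marker: str = '*') -> str:
    found = []
    pos = s.find(marker)               # 0-based index, -1 when absent
    while pos != -1:
        if pos > 0:                    # a marker in first position has nothing before it
            found.append(s[pos - 1])
        pos = s.find(marker, pos + 1)  # advance so the loop terminates
    return ','.join(found)

for value in ['ABC*EDE*', 'EFCC*W*d*', '*XXXX*', 'asdf']:
    print(repr(value), '->', repr(before_each_marker(value)))
```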
CALL SCAN is also a good choice for getting the position of each *.
data have;
infile datalines delimiter=",";
input string :$20.;
datalines;
ABC*EDE*,
EFCC*W*d*
****
asdf
;
data want;
length before $20.;
set have;
do i = 1 to count(string,'*');
call scan(string,i,pos,len,'*');
before = catx(',',before,substrn(string,pos+len-1,1));
end;
put _n_ = +7 before=;
run;
Result:
_N_=1 before=C,E
_N_=2 before=C,W,d
_N_=3 before=
_N_=4 before=
I have these strings:
DDSPRJ11
DDSPRJ12
DDSPRJ12
DDRJCT
For the first three I want the last 4 characters, and for the last one I want the last 3 characters. How can I get them using substr, in the correct order, e.g. RJ11?
You can do this with regular expression matching using prxchange:
data have;
infile datalines;
input mystr $ @@;
datalines;
DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT
;
run;
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)(.*)/$2/', 1, mystr);
run;
@user667489's answer is perfect if you can read all of the values separately. If they are all in the same variable, as shown below, you can use the same code and add the scan function. call prxnext can also be used to achieve the same result. Both examples are shown below.
data have;
val= "DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT";
run;
/* using prxchange with scan*/
data want;
set have;
suffix = prxchange('s/(DDSP|DDR)//', -1, val);
do i = 1 to countw(suffix,' ');
newstr= scan(suffix, i);
output;
end;
drop suffix val;
run;
/* using call prxnext */
data want;
length val1 $200;
set have;
start = 1;
stop = length(val);
re = prxparse('/(DDSP|DDR)/'); /* prxparse returns a numeric pattern id */
call prxnext(re, start, stop, val, position, length);
do while (position > 0);
/* take the characters that follow the matched prefix; this relies on */
/* the suffix being as long as the prefix (4 after DDSP, 3 after DDR) */
val1 = substr(val, position+length, length);
output;
call prxnext(re, start, stop, val, position, length);
end;
drop re start stop position length val;
run;
Here is how you can do it in simple Python.
I assumed that you want the last 4 characters of every word except the last one, where you want the last 3.
string_1 = 'DDSPRJ11 DDSPRJ12 DDSPRJ12 DDRJCT'
list_string = string_1.split()
new_list = []
for i in range(len(list_string)):
    if i == len(list_string) - 1:
        new_list.append(list_string[i][-3:])
    else:
        new_list.append(list_string[i][-4:])
print(new_list)
output:
['RJ11', 'RJ12', 'RJ12', 'JCT']
Most of my data is read in in a fixed width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. & z6. only work on columns formatted as numeric so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use cats to force the csv columns to character without knowing what types the csv import determined they were. Right-justify the result to the expected or needed variable length and translate the leading spaces to zeroes.
For example:
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
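The same idea (convert to text, truncate to the target width, right-justify, fill with zeroes) can be sketched in Python to check the expected values. The helper name pad_left is made up for this illustration, and the sample values come from the question and the answer above.

```python
def pad_left(value, width: int) -> str:
    # Convert to text regardless of the incoming type, like cats() does.
    text = str(value).strip()
    # Truncate values that are too long, then right-justify with zeroes.
    return text[:width].rjust(width, '0')

print(pad_left(12, 5))        # 00012
print(pad_left('MNQ', 6))     # 000MNQ
print(pad_left(123456, 5))    # 12345
```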
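Use SUBSTR on the left side of the assignment: initialize each target variable with zeroes, then overwrite its rightmost characters with the value that was read.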
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;
I have a large data file with data in the following format: country, datatype, year1month1 to year2018month7.
Reading the data using proc import did not work for all data fields. I ended up modifying the SAS datastep code to ensure data format was correct.
However, I am having trouble simplifying the code. I would like a do loop to go through all the years and months. That way I could use the current date to work out the range of dates for the file, and the code that creates each year/month variable would not have to be repeated 100 times in the file.
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
informat _(i)M(j) best32.;
end;
end;
informat Base_Year $1. ;
format Country_Name $34. ;
do i = 1940 to 2018;
do j = 1 to 12;
format _(i)M(j) best12.;
end;
end;
format Base_Year $1. ;
input
Country_Name $
do i = 1940 to 2018;
do j = 1 to 12;
_(i)M(j) $;
end;
end;
Base_Year $;
run;
There are a few approaches here that could work. The most directly translatable to your approach is to use the macro language.
You need to translate those two loops to something like this:
%do i = 1940 %to 2018;
%do j = 1 %to 12;
informat _&i.M&j. best32.;
%end;
%end;
Notice the % there. This also has to be in a macro; you can't do this in normal datastep code.
I would rewrite it to use a macro like so:
%macro make_ym(startyear=, endyear=, separator=);
%local i j;
%do i = &startyear. %to &endyear.;
%do j = 1 %to 12;
_&i.&separator.&j.
%end;
%end;
%mend make_ym;
data test;
infile 'abc.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat Country_Name $34. ;
informat %make_ym(startyear=1940,endyear=2018,separator=M) best32.;
informat Base_Year $1. ;
format %make_ym(startyear=1940,endyear=2018,separator=M) best12.;
format Base_Year $1. ;
input
Country_Name $
%make_ym(startyear=1940,endyear=2018,separator=M)
Base_Year $;
run;
I took out the $ after the yMm bits in the input since you declared them as numeric.
Don't model your data step after the code generated by PROC IMPORT. It does a lot of useless things, like attaching formats and informats to variables that don't need them.
For your problem you just need a simple program like this:
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. Y1940M01 .... Y2018M08 Base_Year :$1. ;
run;
Now the only tricky part is building that list of numeric variable names. If the list is small enough you can just put it into a macro variable. That is not a problem in this case: with 8-character names (YyyyyMmm) there is room for over 300 years' worth in a data step character variable, and a variable of length 10,800 bytes has room for 100 years of month names.
So just run this data step first.
data _null_;
length names $10800 ;
basedate = mdy(1,1,1940);
lastdate = today();
do i=0 to intck('month',basedate,lastdate);
date=intnx('month',basedate,i);
names=catx(' ',names,cats('Y',year(date),'M',put(month(date),Z2.)));
end;
call symputx('names',names);
run;
Now you can use the macro variable in your INPUT statement.
data test;
infile 'abc.csv' dsd dlm= ',' truncover firstobs=2 ;
input Country_Name :$34. &names Base_Year :$1. ;
run;
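To see what the generated list of names looks like, here is a small Python sketch that builds the same YyyyyMmm names. The function name month_names is hypothetical, and the end date is fixed at August 2018 for reproducibility instead of using today's date as the data step does.

```python
from datetime import date

def month_names(start: date, end: date):
    names = []
    year, month = start.year, start.month
    # Walk month by month until the end month is included.
    while (year, month) <= (end.year, end.month):
        names.append(f"Y{year}M{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

names = month_names(date(1940, 1, 1), date(2018, 8, 1))
print(names[0], names[-1], len(names))  # Y1940M01 Y2018M08 944
print(len(' '.join(names)))             # 8495
```

The joined string is well under the 10,800 bytes reserved for the names variable.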
I have a SAS dataset that I have to export to a .csv-file. I have the following two contradicting requirements.
I have to use the semicolon as the delimiter in the .csv-file.
Some of the character variables are manually inputted strings from formulas, hence they may contain semicolons.
My solution to the above is to either escape the semicolon or to replace it with a comma.
How can I, in a nice, clean and efficient way use e.g. tranwrd on an entire dataset?
My attempt:
For each variable, use the tranwrd(.., ";", ",") function on a variable in the data set. Update the dataset and loop through all variables. This, however, is naturally a very inefficient way of doing it for even semi-large datasets, since I have to do a datastep for each variable. The code for it is quite ugly, since I have to get the variable names by a few steps, but the inefficiency definitely takes the cake.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
* Get the variable names;
proc contents data=test out=vars(keep=name type varnum) order=varnum noprint;
run;
* Sort by variable number;
proc sort data=vars;
by varnum;
run;
* Put variable names into a space-separated string;
proc sql noprint;
select compress(name)
into :name_list separated by ' '
from vars;
quit;
%let len = %sysfunc(countw(&name_list));
*Initialize loop dataset;
data a;
set test;
run;
%macro loop;
%do i = 1 %to &len;
%let j = %scan(&name_list,&i);
data a(rename=(v_&j = &j) drop=&j);
set a;
v_&j.=compress(tranwrd(&j,";",","));
run;
%end;
%mend;
%loop;
I think I may have a more elegant solution to your problem:
data class;
set sashelp.class;
array vars [*] _character_;
do i = 1 to dim(vars);
vars[i] = compress(tranwrd(vars[i],"a","X"));
end;
drop i;
run;
You can use an array to reference all character columns in your data set and then loop through them.
The most widely used standard for csv files whose fields can contain delimiters is to quote fields that contain them, and double up any quotes. In SAS you can do this automatically using the dsd and dlm= options on the file statement:
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb" 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
data _null_;
set test;
file "c:\temp\test.csv" dsd dlm=';';
put (_ALL_) (&);
run;
This results in the following semicolon-delimited csv (minus a header row, but that's a separate issue):
"Aaa;;";50;11;1;222;"a;s"
"Bbb""";35;12;2;250;qw
Comma,;75;13;3;foo;zx
Sorry, didn't notice your comment about the workaround until after I posted this. I'll leave it here in case anyone finds it helpful.
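The quoting convention described in this answer (quote only the fields that contain the delimiter or a quote, and double any embedded quotes) is the same one Python's standard csv module applies with QUOTE_MINIMAL. The sketch below is only an illustration of that convention, not a replacement for the SAS code. It writes the three test rows with a semicolon delimiter and reads them back.

```python
import csv
import io

rows = [
    ["Aaa;;", 50, 11, 1, "222", "a;s"],
    ['Bbb"', 35, 12, 2, "250", "qw"],
    ["Comma,", 75, 13, 3, "foo", "zx"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_MINIMAL,
                    lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text, end="")

# Reading the text back recovers the original field values (as strings).
parsed = list(csv.reader(io.StringIO(text), delimiter=";"))
print(parsed[0][0], parsed[1][0])
```

The written lines match the semicolon-delimited output shown above, and the embedded semicolons and quote survive the round trip.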
Fields in a properly formatted delimited file are quoted. PROC EXPORT will do that. There is no need to change the data.
data test;
input w $ c b d e $ f $;
datalines4;
Aaa;; 50 11 1 222 a;s
Bbb 35 12 2 250 qw
Comma, 75 13 3 foo zx
;;;;
run;
filename FT45F001 temp;
proc export data=test outfile=FT45F001 dbms=csv;
delimiter=';';
run;
data _null_;
infile FT45F001;
input;
list;
run;
proc import replace datafile=FT45F001 dbms=csv out=test2;
delimiter=';';
run;
proc print;
run;
proc compare base=test compare=test2;
run;
I have to retrieve the X-axis and Y-axis positions from the ADDITIONAL_DETAILS field, which is more than 300 bytes long.
Somewhere in this string I get the location details in a form like RETLOCID=2312.4892.
I am trying to use Perl regular expressions in SAS.
Problem: I am able to get the starting position into postn1 from call prxsubstr(MATCH_PATTERN1, ADDITIONAL_DETAILS, postn1,length1); but the length is always returned as 8 even though the value is longer than that.
TRANSACTION_ID = substrn(ADDITIONAL_DETAILS, postn1, length1); This does not give me the proper value when the length is restricted to 8. Any help is appreciated. Below is the code:
DATA WORK.LOCATION;
INFILE DATALINES;
INPUT ADDITIONAL_DETAILS $50.;
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
;
RUN;
data WORK.POSITION;
set WORK.POSITION;
if _N_ = 1 then do;
MATCH_PATTERN1 = PRXPARSE("/(RETLOCID=)/");
MATCH_PATTERN2 = PRXPARSE("/([0-9]{1,}\.[0-9]{1,})/");
end;
retain MATCH_PATTERN1 MATCH_PATTERN2;
call prxsubstr(MATCH_PATTERN1, ADDITIONAL_DETAILS, postn1,length1);
call prxsubstr(MATCH_PATTERN2, ADDITIONAL_DETAILS, postn2,length2);
if postn1 > 0 and not missing(ADDITIONAL_DETAILS) then
TRANSACTION_ID = substrn(ADDITIONAL_DETAILS, postn1 + 8, length1);
RUN;
data work.POSITION;
set work.POSITION;
drop MATCH_PATTERN1 postn1 length1;
run;
I need to pull 121.5435 and 7821.5487
Try this:
DATA WORK.LOCATION;
INPUT ADDITIONAL_DETAILS $50.;
string=prxchange('s/[a-z=_]+//i',-1,ADDITIONAL_DETAILS);
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
DISTR_QUOTE=66.92
gfdgf
;
run;
Or
DATA WORK.LOCATION;
INPUT ADDITIONAL_DETAILS $50.;
length string $20.;
if prxmatch('/\=/',ADDITIONAL_DETAILS)=0 then string='';
else string=prxchange('s/.*(?<=\=)([^a-z]+).*/$1/i',-1,ADDITIONAL_DETAILS);
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
DISTR_QUOTE=66.92
;
proc print;
run;
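For comparison, here is a minimal Python sketch of a narrower extraction: capture only the decimal number that directly follows an equals sign. It assumes the value always contains a decimal point and comes immediately after "=", which holds for the sample rows but is an assumption about the real data. The names KEY_VALUE and extract_number are made up for illustration.

```python
import re

# A run of digits, a dot, and another run of digits right after "=".
KEY_VALUE = re.compile(r'=([0-9]+\.[0-9]+)')

def extract_number(details: str) -> str:
    match = KEY_VALUE.search(details)
    # Return an empty string when the row has no key=value pair.
    return match.group(1) if match else ''

for row in ['afdsf RFTXNID=121.5435 xx',
            'fdsg RFTXNID=7821.5487 xx fdsg',
            'gfdgf',
            'DISTR_QUOTE=66.92']:
    print(repr(extract_number(row)))
```

Capturing the number in a group avoids having to compute its length separately, which is where the original prxsubstr approach went wrong: length1 there is the length of the matched text RETLOCID=, not of the number.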