I am reading a .txt file into SAS that uses "|" as the delimiter. The problem is that one column also uses "|" as a word separator rather than as a delimiter, and those words need to end up in a single column.
For example the txt file looks like:
apple|fruit|Healthy|choices|of|food|12|2012|chart
needs to look like this in the SAS dataset:
apple | fruit | Healthy choices of Food | 12 | 2012 | chart
How do I eliminate "|" between "Healthy choices of Food"?
I think this will do what you want:
data tmp1;
length tmp $100;
input tmp $;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
data tmp2;
set tmp1;
num_delims=length(tmp)-length(compress(tmp,"|"));
expected_delims=5;
extra_delims=num_delims-expected_delims;
length new_var $100;
i=1;
do while(scan(tmp,i,"|") ne "");
if i<=2 or (extra_delims+2)<i<=num_delims then new_var=trim(new_var)||scan(tmp,i,"|")||"|";
else new_var=trim(new_var)||scan(tmp,i,"|")||"#";
i+1;
end;
new_var=left(tranwrd(new_var,"#"," "));
run;
This isn't particularly elegant, but it will work:
data tmp;
input tmp $50.;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
;
run;
data tmp;
set tmp;
var1 = scan(tmp,1,'|');
var2 = scan(tmp,2,'|');
var4 = scan(tmp,-3,'|');
var5 = scan(tmp,-2,'|');
var6 = scan(tmp,-1,'|');
var3 = tranwrd(tmp,trim(var1)||"|"||trim(var2),"");
var3 = tranwrd(var3,trim(var4)||"|"||trim(var5)||"|"||trim(var6),"");
var3 = tranwrd(var3,"|"," ");
run;
Expanding a little on Itzy's answer, here is another possible solution:
data want;
/* Define variables */
attrib item length=$10 label='Item';
attrib class length=$10 label='Family';
attrib desc length=$80 label='Item Description';
attrib count length=8 label='Some number';
attrib year length=$4 label='Year';
attrib somevar length=$10 label='Some variable';
length countc $8; /* A temp variable */
infile 'c:\temp\delimited_temp.txt' lrecl=1000 truncover;
input;
item = scan(_infile_,1,'|','mo');
class = scan(_infile_,2,'|','mo');
countc = scan(_infile_,-3,'|','mo'); /* Temp var for numeric field */
count = inputn(countc,'8.'); /* Re-read the numeric field */
year = scan(_infile_,-2,'|','mo');
somevar = scan(_infile_,-1,'|','mo');
desc = tranwrd(
substr(_infile_
,length(item)+length(class)+3
,length(_infile_)
- ( length(item)+length(class)+length(countc)
+length(year)+length(somevar)+5))
,'|',' ');
drop countc;
run;
The key in this case is to read your file directly and handle the delimiters yourself. This can be tricky and requires that your data file is exactly as described. A much better solution would be to go back to whoever gave you this data and ask them to deliver it in a more appropriate form. Good luck!
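As a rough sketch of "handling the delimiters yourself", CALL SCAN can report where a word starts and how long it is, so the free-text column can be cut out with a single SUBSTR. This assumes, as in the question, exactly two fixed fields before the description and three after it, and no empty fields; the variable names (line, first_pos, last_pos and so on) are only illustrative.

```sas
data want;
  infile datalines truncover;
  length line $200 item class $10 desc $80 count 8 year $4 somevar $10;
  input line $200.;
  n = countw(line, '|');                             /* total number of fields     */
  call scan(line, 3, first_pos, first_len, '|');     /* first word of description  */
  call scan(line, n - 3, last_pos, last_len, '|');   /* last word of description   */
  item    = scan(line, 1, '|');
  class   = scan(line, 2, '|');
  /* cut out everything from the first to the last description word, */
  /* then turn the embedded delimiters into blanks                    */
  desc    = translate(substr(line, first_pos, last_pos + last_len - first_pos), ' ', '|');
  count   = input(scan(line, -3, '|'), 8.);
  year    = scan(line, -2, '|');
  somevar = scan(line, -1, '|');
  drop line n first_pos first_len last_pos last_len;
datalines;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
```

Because SCAN and COUNTW treat consecutive delimiters as one by default, a record with an empty field would shift the positions, so this is only safe if the file really is as described.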
Another possible workaround.
data tmp;
infile '/path/to/textfile';
input tmp :$100.;
array varlst (*) $30 v1-v6;
a=countw(tmp,'|');
do i=1 to dim(varlst);
if i<=2 then
varlst(i) = scan(tmp,i,'|');
else if i>=4 then
varlst(i) = scan(tmp,a-(dim(varlst)-i),'|');
else do j=3 to a-(dim(varlst)-i); /* last description word is field a-3; stopping one earlier would drop it */
varlst(i)=catx(' ', varlst(i),scan(tmp,j,'|'));
end;
end;
drop tmp a i j;
run;
Most of my data is read in in a fixed width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. & z6. only work on columns formatted as numeric so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use cats to force the csv columns to character without knowing which types the csv import assigned to them. Then right-justify the result within the expected variable length and translate the padding spaces to zeroes.
For example
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
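For completeness, here is a small sketch that reads the question's csv sample directly and pads both columns. It assumes colA only ever contains digits (so it can take a round trip through the Z5. format) and that colB contains no embedded blanks, since those would be turned into zeroes as well. The names rawA and rawB are made up for the illustration.

```sas
data padded;
  infile datalines dsd firstobs=2 truncover;
  length colA $5 colB $6;
  input rawA :$5. rawB :$6.;
  /* digits only: convert to a number, then write it back with leading zeros */
  colA = put(input(rawA, 5.), z5.);
  /* arbitrary text: right-align within the 6 bytes, then turn padding blanks into zeros */
  colB = translate(right(rawB), '0', ' ');
  drop rawA rawB;
datalines;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;
run;
```

This is not the formats-only solution the question hoped for, because the Z format applies to numeric values only, but it keeps the padding logic down to one line per column.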
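Another option is to use SUBSTR on the left side of the assignment, writing each value into the tail of a zero-filled variable.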
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;
libname Prob 'Y:\alsdkjf\alksjdfl';
* GEOID here is character and I want to convert it to numeric to be able to merge by id;
data Problem2_1;
set Prob.geocode;
id = substr(GEOID, 8, 2);
id = input(id, best5.);
output;
run;
* GEOID here is numeric;
data Problem2_2;
set Prob.households;
id = GEOID;
output;
run;
data Problem2_3;
merge Problem2_1
Problem2_2 ;
by ID;
run;
proc print data = Problem2_3;
*ERROR: Variable geoid has been defined as both character and numeric.
*ERROR: Variable id has been defined as both character and numeric.
It looks like you could replace these two lines:
id = substr(GEOID, 8, 2);
id = input(id, best5.);
With:
id = input(substr(GEOID, 8, 2), best.);
This would mean that both merge datasets contain numeric id variables.
SAS requires the linking id to be the same data type in both datasets. This means that you have to convert the int to string or vice versa. Personally, I prefer to convert to numeric whenever possible.
Here is a worked example:
/*Create some dummy data for testing purposes:*/
data int_id;
length id 3 dummy $3;
input id dummy;
cards;
1 a
2 b
3 c
4 d
;
run;
data str_id;
length id $1 dummy2 $3;
input id dummy2;
cards;
1 aa
2 bb
3 cc
4 dd
;
run;
/*Convert string to numeric. Int in this case.*/
data str_id_to_int;
set str_id;
id2 =id+0; /* or you could use something like input(id, 8.)*/
/*The variable must be new. id=id+0 does _not_ work.*/
drop id; /*move id2->id */
rename id2=id;
run;
/*Same, but other way around. Imho, trickier.*/
data int_id_to_str;
set int_id;
id2=put(id, 1.); /*note that '1.' refers to length of 1 */
/*There are other ways to convert int to string as well.*/
drop id;
rename id2=id;
run;
/*Testing. Results should be equivalent */
data merged_by_str;
merge str_id(in=a) int_id_to_str(in=b);
by id;
if a and b;
run;
data merged_by_int;
merge int_id(in=a) str_id_to_int(in=b);
by id;
if a and b;
run;
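A slightly shorter variant of the same conversion, sketched here as an alternative, is to rename the character variable as it is read in. That frees the original name, so the numeric version can be created under it in one step. The name id_char is just a placeholder.

```sas
/* Same dummy data as above: id is stored as character */
data str_id;
  length id $1 dummy2 $3;
  input id dummy2;
datalines;
1 aa
2 bb
3 cc
4 dd
;
run;

data str_id_num;
  set str_id(rename=(id=id_char));  /* rename on the way in frees the name "id"     */
  id = input(id_char, 8.);          /* new numeric variable under the original name */
  drop id_char;
run;
```

The same pattern works in the other direction with put() instead of input().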
For Problem2_1, if your substring contains only numbers you can coerce it to numeric by adding zero. Something like this should make ID numeric and then you could merge with Problem2_2.
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = temp + 0;
drop temp;
run;
EDIT:
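Your original code first defines ID as the output of substr, which makes it character, so the later input() assignment cannot turn it into a numeric variable. Reading the substring into a temporary variable first should work as well: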
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = input(temp, 8.0);
drop temp;
run;
I'm working with some SAS data, and am trying to figure out how to find a record's sort position in a datastep while using as few steps as possible.
Here's an example --
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
Proc Sort data=WORK.PLACES;
by STATE CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by STATE CITY;
ST_CITY_RNK = _N_;
run;
Proc Sort data=WORK.PLACES;
by CITY;
run;
data WORK.PLACES;
set WORK.PLACES;
by CITY;
CITY_RNK = _N_;
run;
In this example, is there a way to calculate ST_CITY_RNK and CITY_RNK without sorting multiple times? It feels like this should be possible with ordered hash tables, but I'm not sure how to go about doing it.
Thank you!
Hash table would be doable. Temporary arrays would have roughly the same effect and might be a bit easier.
The major limitation of either is how to handle non-unique city names, such as Salem, Oregon and Salem, Massachusetts. In the state-city rank that's mostly fine, though you may find states with more than one Lincoln or similar; but in the city-only rank you'll certainly find several Columbias, Lincolns, Charlestons, etc. My solution gives the same sort rank to all of them (and then skips forward by the number of duplicates to the next rank). The data step solution you posted above would give them unique ranks. The hash iterator could probably do either one. You could tweak this to give unique ranks, but it would take some work.
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data sortrank;
*Init pair of arrays - the one that stores the original values, and one to mangle by sorting;
array states[32767] $ _temporary_;
array states_cities_sorted[32767] $40. _temporary_ (32767*'ZZZZZ');
array cities[32767] $40. _temporary_;
array cities_sorted[32767] $40. _temporary_ (32767*'ZZZZZ');
*Iterate over the dataset, load into arrays;
do _n_ = 1 by 1 until (Eof);
set places end=eof;
states[_n_] = state;
states_cities_sorted[_n_] = catx(',',state,city);
cities[_n_] = city;
cities_sorted[_n_] = city;
end;
*Sort the to-be-sorted arrays;
call sortc(of states_cities_sorted[*]);
call sortc(of cities_sorted[*]);
do _i = 1 to _n_;
*For each array element, look up the rank using `whichc`, looking for the value of the unsorted element in the sorted list;
city_rank = whichc(cities[_i],of cities_sorted[*]);
state_cities_rank = whichc(catx(',',states[_i],cities[_i]),of states_cities_sorted[*]);
*And put the array elements back in their proper variables;
city = cities[_i];
state= states[_i];
*And finally make a row output;
output;
end;
run;
For reference, here's a hash approach:
data Places;
infile datalines delimiter=',';
input state $ city $40. ;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;
data places;
set places;
if _n_ = 1 then do;
declare hash h1(ordered:'a',dataset:'places');
rc = h1.definekey('city');
rc = h1.definedata('city');
rc = h1.definedone();
declare hiter hi1('h1');
declare hash h2(ordered:'a',dataset:'places');
rc = h2.definekey('state','city');
rc = h2.definedata('state','city');
rc = h2.definedone();
declare hiter hi2('h2');
end;
t_city = city;
t_state = state;
rc = hi1.first();
do city_rank = 1 by 1 until(t_city = city);
rc = hi1.next();
end;
rc = hi2.first();
do state_city_rank = 1 by 1 until(t_city = city and t_state = state);
rc = hi2.next();
end;
state = t_state;
city = t_city;
drop t_:;
run;
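If performance is not a concern, the tie behaviour described earlier (equal names share a rank and the next rank skips ahead) can also be sketched without any sort step by counting, for each row, how many rows come strictly before it. This is a different technique from the array and hash answers: it uses correlated subqueries in PROC SQL, which compare every row with every other row, so it is only reasonable for small tables. The output name ranked is an assumption.

```sas
data Places;
  infile datalines delimiter=',';
  input state :$2. city :$40.;
datalines;
WA,Seattle
OR,Portland
OR,Salem
OR,Tillamook
WA,Vancouver
;
run;

proc sql;
  create table ranked as
  select a.state,
         a.city,
         /* rank by city: 1 + number of rows with a smaller city name */
         (select count(*) from Places as b
            where b.city < a.city) + 1 as city_rnk,
         /* rank by state, then city */
         (select count(*) from Places as b
            where b.state < a.state
               or (b.state = a.state and b.city < a.city)) + 1 as st_city_rnk
  from Places as a;
quit;
```

Rows keep their original order here, which may or may not matter depending on what happens downstream.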
I have data that looks like this:
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
I need to add:
A new column for each sequence, with a column heading made of the prefix 'RT' plus the sequence with the letters following the numeric part dropped
Each new column filled with the letters that follow the numeric part of the sequence
I need to create this:
ID   Sequence              RTE6  RTK11  RTQ174  RTD177  RTV245
-----------------------------------------------------------------------
101  E6S,K11T,Q174K,D177E  S     T      K       E
102  K11T,V245EKQ                T                      EKQ
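I assume you want a SAS data set and not a report. ANYDIGIT with a negative start position searches from right to left, so it returns the position of the last digit; everything after that position is the trailing letter substring.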
data seq;
infile cards firstobs=3;
input id:$3. sequence :$50.;
cards;
ID Sequence
---------------------------------
101 E6S,K11T,Q174K,D177E
102 K11T,V245EKQ
;;;;
run;
proc print;
run;
data seq2V / View=seq2V;
set seq;
length w name sub $32 subl 8;
do i = 1 by 1;
w = scan(sequence,i,',');
if missing(w) then leave;
subl = anydigit(w,-99);
name = substrn(w,1,subl);
sub = substrn(w,subl+1);
output;
end;
run;
proc transpose data=seq2V out=seq3(drop=_name_) prefix=RT;
by id sequence;
var sub;
id name;
run;
proc print;
run;
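To see the ANYDIGIT step in isolation, here is a tiny sketch that only splits a few of the sample words and writes the pieces to the log. The variable p is an illustrative name for the position of the last digit; for E6S that position is 2, so name becomes E6 and sub becomes S.

```sas
data _null_;
  length w name sub $12;
  do w = 'E6S', 'K11T', 'V245EKQ';
    p    = anydigit(w, -99);   /* negative start: search right to left for a digit */
    name = substrn(w, 1, p);   /* leading letter plus the numeric part             */
    sub  = substrn(w, p + 1);  /* the letters after the last digit                 */
    put w= p= name= sub=;
  end;
run;
```

The start value only needs to be at least as large in magnitude as the string length; -99 is simply a convenient value for short strings like these.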
I had a similar problem a while ago. The code is adapted to your problem.
I found this solution to work faster than anything I tried with proc transpose.
Still, overall performance on huge datasets (especially with many different sequences) is not great, as we loop 2*2 over all strings and also over the final variables.
Can anyone offer a faster solution?
(Caution: MacroVar is limited to 65534 Characters.)
data var_name ;
set in_data;
length var string $30.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
var = substr(string,1,anydigit(string,-99));
output;
keep var;
end;
run;
proc sql noprint;
select distinct compress("RT"||var) into :var_list separated by ' '
from var_name;
quit;
%put &var_list.;
data out_data;
set in_data;
length string &var_list. $30. n 8. ;
array a_var [*] &var_list.;
do i = 1 to countw(Sequence, ',');
string = scan(Sequence,i,',');
do j = 1 to dim(a_var);
n = anydigit(string,-99) ;
if substr(vname(a_var[j]),3) eq substr(string,1,n) then a_var[j] = substr(string,n+1);
end;
end;
drop string i j n;
run;
It is a simple one but I'm struggling a bit.
What I have :
What I want :
I want to remove the v0 , v1 and etc.
I'm using this piece of code
data IndieDay20140704;
set IndieDay20140704;
do i=1 to 5;
VAR1=tranwrd(var1,"v&i","");
end;
run;
It is not working correctly as it is giving me this instead (see below) plus the error
WARNING: Apparent symbolic reference I not resolved.
Questions:
1) Do I need a macro?
2) Why the error?
Many thanks for your insights.
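There's an error because "v&i" inside double quotes is a macro variable reference: you're (unintentionally) using a macro variable i that you did not initialize, and it has nothing to do with the data step variable i.
I guess the idea of tranwrd is to remove the words in VAR2, VAR3, ... from VAR1.
The logical error is to do it for VAR1 itself as well.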
Check if this helps (using array):
data IndieDay20140704;
length VAR1 VAR2 VAR3 VAR4 VAR5 $10;
VAR1 = 'TEST IT';VAR5 = 'TEST';
output;
VAR1 = 'STEST IT';VAR5 = 'TEST';
output;
run;
data IndieDay20140704_modified / view= IndieDay20140704_modified;
set IndieDay20140704;
array vals VAR1 - VAR5;
do i=1 to dim(vals);
if i ne 1 then VAR1=tranwrd(var1,trim(vals(i)),"");
end;
drop i;
run;
Here I'm creating a SAS view on top of the table (it's not a good idea to overwrite the source).
Also I think you should trim() the values from VAR2,VAR3... depending on what you want to achieve and what's in the data.
EDIT:
Here is the version with the 'v0', 'v1'...'v5' strings:
data IndieDay20140704;
length VAR1$10;
VAR1 = 'TEST v0';
output;
VAR1 = 'TEST v11';
output;
VAR1 = 'TEST v1';
output;
run;
data IndieDay20140704_modified / view= IndieDay20140704_modified;
set IndieDay20140704;
org_var1 = var1;
do i=0 to 5;
var1 =tranwrd(var1, catt('v', put(i, 1. -L)),"");
end;
run;
catt('v', put(i, 1. -L)) concatenates the string 'v' and the result of put.
put(i, 1. -L) converts the numeric variable i to text using the plain numeric format w.d (1. here, which is enough for single-digit numbers); -L left-aligns the result.
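If the version tag is always a "v" followed by one or more digits, a regular expression can remove it in one call instead of looping over the possible numbers. This is a swapped-in technique, not what the code above does: it removes the whole of "v11" rather than just the "v1" part, and it would also hit any other "v" plus digits in the text, so treat it as a sketch under that assumption. The variable cleaned is an illustrative name.

```sas
data want;
  length var1 cleaned $20;
  do var1 = 'TEST v0', 'TEST v11', 'fic19v1.csv';
    /* s/v\d+// deletes "v" followed by digits; -1 means replace every match */
    cleaned = prxchange('s/v\d+//', -1, var1);
    output;
  end;
run;
```

No macro code is involved here, which also avoids the unresolved &i reference from the question.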
Here's one way, there are many others and this may not work if your data has a lot of variability.
data have;
length VAR1 $11;
VAR1 = 'fic19v0.csv';
output;
VAR1 = 'fic19v1.csv';
output;
run;
data want ;
set have;
original_var=var1;
var1=substr(var1, 1, index(var1, ".")-3)||".csv";
run;