I have this csv dataset named Movie:
ID,Underage,Name,Rating,Year, Rank on IMDb ,
M1021,,Elanor, Melanor,12,1879,5
M1203,Yes,IT,12,1999,1,
M0081,,Cars 2,13,1999,2,
M1371,No,Kiminonawa,12,2017,3,
M3416,,Living in the past, fading future,13,2018,12
I would like to import Movie into SAS such that "Elanor, Melanor" is the Name instead of 'Elanor' being under Name while 'Melanor' being in Rating.
I tried the follow code:
FILENAME XX '....Movie.csv';
data movieYY (drop=DLM1at field2);
infile XX dlm=',' firstobs=2 dsd;
format ID $5. Underage $3. Name $50. Year 4. Rating $3. 'Rank on IMDb'n 2.;
input #;
DLM1at = find(_INFILE_, ',');
length field2 $4;
field2 = substr(_INFILE_, DLM1at + 1, 4);
if lengthn(compress(field2, '1234567890')) ne 0 then do;
_INFILE_ = substr(_INFILE_, 1, dlm1at - 1) || ' ' ||
substr(_INFILE_, dlm1at + 1);
end;
input ID Underage Name Year Rating 'Rank on IMDb'n;
run;
May I know what should i do? I am still a beginner in SAS. Thank you!
Add quotes to the name of each movie, or use another delimiter. Any data within a delimited file that also has the same delimiter must be in quotes. For example:
data foo;
infile datalines dlm="," dsd;
length id 8. name $25.;
input id name$;
datalines;
1, "Smith, John"
2, "Cage, Nicolas"
;
run;
Related
I have a SAS dataset, let us say
it has 4 columns A,B,C,D and the values
A = x
B = x
C = x
**D = x,y**
Here column D has two values inside a single column while converting it into CSV format it generates a new column with the value Y. How to avoid this and to convert SAS dataset into CSV file?
* get some test records in a file;
Data _null_;
file 'c:\tmp\test.txt' lrecl=80;
put '1,22,Hans Olsen,Denmark,333,4';
put '1111,2,Turner, Alfred,England,3333,4';
put '1,222,Horst Mayer,Germany,3,4444';
run;
* Read the file as a delimited file;
data test; infile 'c:\tmp\test.txt' dsd dlm=',' missover;
length v1 v2 8 v3 v4 $40 v5 v6 8;
input
'V1'n : ?? BEST5.
'V2'n : ?? BEST5.
'V3'n : $CHAR40.
'V4'n : $CHAR40.
'V5'n : ?? BEST5.
'V6'n : ?? BEST5.;
run;
* Read the file and write another file.
* If 6 delimiters and not 5, change the third to #;
data test2;
infile 'c:\tmp\test.txt' lrecl=80 truncover;
file 'c:\tmp\test2.txt' lrecl=80;
length rec $80;
drop pos len;
input rec $char80.;
if count(rec,',') = 6 then do;
call scan(rec,4,pos,len,',');
substr(rec,pos-1,1) = '','';
end;
put rec;
run;
* Read the new file as a delimited file;
data test2; infile 'c:\tmp\test2.txt' dsd dlm=',' missover;
length v1 v2 8 v3 v4 $40 v5 v6 8;
input
'V1'n : ?? BEST5.
'V2'n : ?? BEST5.
'V3'n : $CHAR40.
'V4'n : $CHAR40.
'V5'n : ?? BEST5.
'V6'n : ?? BEST5.;
run;
In this code, it add '#' but I want ',' itself in the output.
Could anyone please guide me to do that?
Thanks in advance!!
It sounds like you are starting with an improperly created CSV file.
1,22,Hans Olsen,Denmark,333,4
1111,2,Turner, Alfred,England,3333,4
1,222,Horst Mayer,Germany,3,4444
That should have been made like this:
1,22,Hans Olsen,Denmark,333,4
1111,2,"Turner, Alfred",England,3333,4
1,222,Horst Mayer,Germany,3,4444
If you are positive that you know that the only field with embedded commas is the third then you can use a data step to read it in and generate a valid file.
data _null_;
infile bad dsd truncover ;
file good dsd ;
length v1-v6 dummy $200;
input v1-v2 #;
do i=1 to countw(_infile_,',','q')-5;
input dummy #;
v3=catx(', ',v3,dummy);
end;
input v4-v6 ;
put v1-v6 ;
run;
Once you have a properly formatted CSV file then it is easy to read.
data want;
infile good dsd truncover ;
length v1-v2 8 v3-v4 $40 v5-v6 8;
input v1-v6 ;
run;
But if the extra comma could be in any field then you will probably need to have a human fix those lines.
If your field value contains the field delimiter you will want to double quote the field value. Proc EXPORT will do such double quoting when the data base type is specified as CSV
Example:
data have;
A = 1;
B = 2;
C = 3;
D = 'x,y';
run;
filename csv temp;
proc export data=have outfile=csv dbms=csv;
run;
data _null_;
infile csv;
input;
put _infile_;
run;
The log will show the exported file contains double quoted values as needed in the csv file produced.
Log
A,B,C,D
1,2,3,"x,y"
Most of my data is read in in a fixed width format, such as fixedwidth.txt:
00012000ABC
0044500DEFG
345340000HI
00234000JKL
06453MNOPQR
Where the first 5 characters are colA and the next six are colB. The code to read this in looks something like:
infile "&path.fixedwidth.txt" lrecl = 397 missover;
input colA $5.
colB $6.
;
label colA = 'column A '
colB = 'column B '
;
run;
However some of my data is coming from elsewhere and is formatted as a csv without the leading zeroes, i.e. example.csv:
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
As the csv data is being added to the existing data read in from the fixed width file, I want to match the formatting exactly.
The code I've got so far for reading in example.csv is:
data work.example;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile "&path./example.csv" delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat colA $5.;
informat colB $6.;
format colA z5.; *;
format colB z6.; *;
input
colA $
colB $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
But the formats z5. & z6. only work on columns formatted as numeric so this isn't working and gives this output:
ColA colB
12 ABC
445 DEFG
34534 HI
234 JKL
6453 MNOPQR
When I want:
ColA colB
00012 000ABC
00445 00DEFG
34534 0000HI
00234 000JKL
06453 MNOPQR
With both columns formatted as characters.
Ideally I'd like to find a way to get the output I need using only formats & informats to keep the code easy to follow (I have a lot of columns to keep track of!).
Grateful for any suggestions!
You can use cats to force the csv columns to character, without knowing what types the csv import determined they were. Right justify the resultant to the expected or needed variable length and translate the filled in spaces to zeroes.
For example
data have;
length a 8 b $7; * dang csv data, someone entered 7 chars for colB;
a = 12; b = "MNQ"; output;
a = 123456; b = "ABCDEFG"; output;
run;
data want;
set have (rename=(a=csvA b=csvB));
length a $5 b $6;
* may transfer, truncate or convert, based on length and type of csv variables;
* substr used to prevent blank results when cats (number) is too long;
* instead, the number will be truncated;
a = substr(cats(csvA),1);
b = substr(cats(csvB),1);
a = translate(right(a),'0',' ');
b = translate(right(b),'0',' ');
run;
SUBSTR on the left.
data test;
infile cards firstobs=2 dsd;
length cola $5 colb $6;
cola = '00000';
colb = '000000';
input (a b)($);
substr(cola,vlength(cola)-length(a)+1)=a;
substr(colb,vlength(colb)-length(b)+1)=b;
cards;
colA,colB
12,ABC
445,DEFG
34534,HI
234,JKL
6453,MNOPQR
;;;;
run;
proc print;
run;
can you only use END Stamement in SAS in set statement? For example...why isn't this working?
filename FS '/folders/myfolders/list4.txt';
data steward;
infile FS dlm = ',' END = EOF;
input Name $ Age Gender $;
if EOF = 1;
run;
Most SAS data steps actually stop when the INPUT or SET statement reads past the end of the file.
I suspect that your input file is either empty or does not have enough data to satisfy your INPUT statement.
You don't need to check EOF or IF as the data step will terminate automatically once it reaches the last record.
Solution:
DATA WORK.input1;
LENGTH
name $ 5
age 8
gender $ 1 ;
FORMAT
name $CHAR5.
age BEST2.
gender $CHAR1. ;
INFORMAT
name $CHAR5.
age BEST2.
gender $CHAR1. ;
INFILE 'E:\saswork\Input.txt'
LRECL=256
FIRSTOBS=2 /*I am skipping first row, as it containts column names*/
ENCODING="WLATIN1"
DLM='2c'x /* this is "," delimiter; I am using windows*/
MISSOVER
DSD ;
INPUT
name : $CHAR5.
age : ?? BEST2.
gender : $CHAR1. ;
put _all_;
RUN;
/*Contents of the Input.txt*/
/*name, age, gender*/
/*jack,32,M*/
/*John,45,M*/
/*Sally,38,F*/
Output:
name=jack age=32 gender=M _ERROR_=0 _N_=1
name=John age=45 gender=M _ERROR_=0 _N_=2
name=Sally age=38 gender=F _ERROR_=0 _N_=3
I have to retrieve the X-Axis and Y-Axis pos from ADDITIONAL_DETAILS field which is more than 300 bytes in length.
Somewhere in this string, I am getting the location details as RETLOCID=2312.4892 like that.
I am trying to use PERL REGEX in SAS.
Problem: I am able to get the starting position into postn1 from call prxsubstr(MATCH_PATTERN1, ADDITIONAL_DETAILS, postn1,length1); but the length is always returned as 8 even though it is more than that.
TRANSACTION_ID = substrn(ADDITIONAL_DETAILS, postn1, length1); This is not giving me proper value when I am restricting length to 8. Any help is appreciated. Below is the code:
DATA WORK.LOCATION;
INFILE DATALINES;
INPUT ADDITIONAL_DETAILS $50.;
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
;
RUN;
data WORK.POSITION;
set WORK.POSITION;
if _N_ = 1 then do;
MATCH_PATTERN1 = PRXPARSE("/(RETLOCID=)/");
MATCH_PATTERN2 = PRXPARSE("/([0-9]{1,}\.[0-9]{1,})/");
end;
retain MATCH_PATTERN1 MATCH_PATTERN2;
call prxsubstr(MATCH_PATTERN1, ADDITIONAL_DETAILS, postn1,length1);
call prxsubstr(MATCH_PATTERN2, ADDITIONAL_DETAILS, postn2,length2);
if postn1 > 0 and not missing(ADDITIONAL_DETAILS) then
TRANSACTION_ID = substrn(ADDITIONAL_DETAILS, postn1 + 8, length1);
RUN;
data work.POSITION;
set work.POSITION;
drop MATCH_PATTERN1 postn1 length1;
run;
I need to pull 121.5435 and 7821.5487
Try this:
DATA WORK.LOCATION;
INPUT ADDITIONAL_DETAILS $50.;
string=prxchange('s/[a-z=_]+//i',-1,ADDITIONAL_DETAILS);
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
DISTR_QUOTE=66.92
gfdgf
;
run;
Or
DATA WORK.LOCATION;
INPUT ADDITIONAL_DETAILS $50.;
length string $20.;
if prxmatch('/\=/',ADDITIONAL_DETAILS)=0 then string='';
else string=prxchange('s/.*(?<=\=)([^a-z]+).*/$1/i',-1,ADDITIONAL_DETAILS);
datalines;
afdsf RFTXNID=121.5435 xx
fdsg RFTXNID=7821.5487 xx fdsg
gfdgf
DISTR_QUOTE=66.92
;
proc print;
run;
I am reading a .txt file into SAS, that uses "|" as the delimiter. The issue is there is one column that is using "|" as a word separator as well instead of acting like delimiter, this needs to be in one column.
For example the txt file looks like:
apple|fruit|Healthy|choices|of|food|12|2012|chart
needs to look like this in the SAS dataset:
apple | fruit | Healthy choices of Food | 12 | 2012 | chart
How do I eliminate "|" between "Healthy choices of Food"?
I think this will do what you want:
data tmp1;
length tmp $100;
input tmp $;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
apple|fruit|Healthy|choices|of|food|and|lots|of|other|stuff|12|2012|chart
;
run;
data tmp2;
set tmp1;
num_delims=length(tmp)-length(compress(tmp,"|"));
expected_delims=5;
extra_delims=num_delims-expected_delims;
length new_var $100;
i=1;
do while(scan(tmp,i,"|") ne "");
if i<=2 or (extra_delims+2)<i<=num_delims then new_var=trim(new_var)||scan(tmp,i,"|")||"|";
else new_var=trim(new_var)||scan(tmp,i,"|")||"#";
i+1;
end;
new_var=left(tranwrd(new_var,"#"," "));
run;
This isn't particularly elegant, but it will work:
data tmp;
input tmp $50.;
cards;
apple|fruit|Healthy|choices|of|food|12|2012|chart
;
run;
data tmp;
set tmp;
var1 = scan(tmp,1,'|');
var2 = scan(tmp,2,'|');
var4 = scan(tmp,-3,'|');
var5 = scan(tmp,-2,'|');
var6 = scan(tmp,-1,'|');
var3 = tranwrd(tmp,trim(var1)||"|"||trim(var2),"");
var3 = tranwrd(var3,trim(var4)||"|"||trim(var5)||"|"||trim(var6),"");
var3 = tranwrd(var3,"|"," ");
run;
Expanding a little on Itzy's answer, here is another possible solution:
data want;
/* Define variables */
attrib item length=$10 label='Item';
attrib class length=$10 label='Family';
attrib desc length=$80 label='Item Description';
attrib count length=8 label='Some number';
attrib year length=$4 label='Year';
attrib somevar length=$10 label='Some variable';
length countc $8; /* A temp variable */
infile 'c:\temp\delimited_temp.txt' lrecl=1000 truncover;
input;
item = scan(_infile_,1,'|','mo');
class = scan(_infile_,2,'|','mo');
countc = scan(_infile_,-3,'|','mo'); /* Temp var for numeric field */
count = inputn(countc,'8.'); /* Re-read the numeric field */
year = scan(_infile_,-2,'|','mo');
somevar = scan(_infile_,-1,'|','mo');
desc = tranwrd(
substr(_infile_
,length(item)+length(class)+3
,length(_infile_)
- ( length(item)+length(class)+length(countc)
+length(year)+length(somevar)+5))
,'|',' ');
drop countc;
run;
The key in this case it to read your file directly and handle the delimiters yourself. This can be tricky and requires that your data file is exactly as described. A much better solution would be to go back to whoever gave this this data and ask them to deliver it to you in a more appropriate form. Good luck!
Another possible workaround.
data tmp;
infile '/path/to/textfile';
input tmp :$100.;
array varlst (*) $30 v1-v6;
a=countw(tmp,'|');
do i=1 to dim(varlst);
if i<=2 then
varlst(i) = scan(tmp,i,'|');
else if i>=4 then
varlst(i) = scan(tmp,a-(dim(varlst)-i),'|');
else do j=3 to a-(dim(varlst)-i)-1;
varlst(i)=catx(' ', varlst(i),scan(tmp,j,'|'));
end;
end;
drop tmp a i j;
run;