I am trying to download a file from SAS and import it to Hadoop.
It's a huge dataset (6 GB).
I ran into a few issues in Hadoop, so to verify the values I exported the SAS dataset to a CSV file and then imported it back into SAS. The import shows problems with the dataset even in the same tool.
The column values are jumbled up.
A few columns have junk values and a few columns overlap.
How can I export the dataset in CSV format with the column values intact?
filename output 'AAA.csv' encoding="utf-8";
Proc export data= input_data
outfile= output
dbms = CSV;
run;
Just a guess, but try removing any end of line characters that might exist in your character strings.
For example you could use a simple data step view to convert the strings on the fly. Here is one that replaces any CR or LF character with a pipe character.
data for_export / view=for_export;
set input_data;
array _c _character_;
do over _c;
_c = translate(_c,'||','0D0A'x);
end;
run;
proc export data=for_export outfile=output dbms=CSV;
run;
You might also watch out for backslash characters. Some readers try to interpret those as an escape character.
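Before converting anything, it can help to confirm that embedded line breaks are really the cause. The step below is only a sketch: it reuses the input_data name from the question, and n_bad is a counter introduced here for illustration. It counts the character values that contain a CR or LF and writes the total to the log.

```sas
/* Count character values that contain a carriage return or line feed */
data _null_;
  set input_data end=eof;
  array _c _character_;
  do over _c;
    /* INDEXC returns a non-zero position if either character is found */
    if indexc(_c, '0D0A'x) then n_bad + 1;
  end;
  if eof then put 'Values containing CR or LF: ' n_bad;
run;
```

If the count is zero, the jumbled columns probably have another cause, such as unquoted delimiters or an encoding mismatch.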
I'm writing a SAS program to interact with an API. I'm trying to use SAS to capture a specific field from a text file generated by the API.
The generated text "resp" looks like this:
{"result":{"progressId":"ab12","percentComplete":0.0,"status":"inProgress"},"meta":{"requestId":"abcde123","httpStatus":"200 - OK"}}
The field I want to capture is "progressId". In this case, it would be "ab12". Since the length of progressId can change, what's the easiest way to capture this field?
My current approach is as follows:
/* The following section will import the text into a SAS table,
separated by colon. The third column would be "ab12","percentComplete"
*/
proc import out = resp_table
datafile= resp
dbms = dlm REPLACE;
delimiter = ':';
GETNAMES = NO;
run;
/* The following section will trim off the string ,"percentComplete" */
data resp_table;
set resp_table;
Progress_ID = SUBSTR(VAR3,2,LENGTH(VAR3)-20);
run;
Do you have an easier/ more concise solution?
Thanks!
Shawn
You can use the JSON library engine to read a JSON document and copy the contents to SAS datasets. Then work with the data items that the engine creates.
Example:
filename myjson "c:\temp\sandbox.json";
data _null_;
file myjson;
input;
put _infile_;
datalines;
{"result":{"progressId":"ab12","percentComplete":0.0,"status":"inProgress"},"meta":{"requestId":"abcde123","httpStatus":"200 - OK"}}
run;
libname jsondoc json "c:\temp\sandbox.json";
proc copy in=jsondoc out=work;
run;
proc print data=work.Alldata;
where P1='result' and P2='progressId';
run;
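If the JSON engine is not available, or the response is already held in a character variable, a regular expression is another option. This is a different technique from the one above, shown only as a sketch: the variable names resp, rx and progress_id and the lengths are assumptions, and the pattern assumes the value is a quoted string with no escaped quotes inside it.

```sas
data _null_;
  length resp $500 progress_id $50;
  resp = '{"result":{"progressId":"ab12","percentComplete":0.0,"status":"inProgress"},"meta":{"requestId":"abcde123","httpStatus":"200 - OK"}}';
  /* Capture everything between the quotes that follow "progressId": */
  rx = prxparse('/"progressId":"([^"]*)"/');
  if prxmatch(rx, resp) then progress_id = prxposn(rx, 1, resp);
  put progress_id=;
run;
```

Because the capture group stops at the closing quote, the length of the ID does not matter.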
My bigger problem now is that I am getting the error "this range is repeated or overlapped". To be specific, my label values repeat: the format has repeated labels, something like a=aa b=aa c=as. How do I resolve this error? When I use HLO='M' (the multilabel option), it gives double the data.
I am mapping like below.
Santhan=Santhan
Chintu=Santhan
Please suggest a solution.
To convert data to a FORMAT use the CNTLIN= option on PROC FORMAT. But first make sure the data describes a valid format. So read the data from the file.
data myfmt ;
infile 'myfile.txt' dsd truncover ;
length fmtname $32 start $100 label $200 ;
fmtname = '$MYFMT';
input start label ;
run;
Make sure to set the lengths of START and LABEL to be long enough for any actual values your source file might have. The decode variable has to be named LABEL, because that is the name PROC FORMAT looks for in a CNTLIN= dataset.
Then make sure it is sorted and you do not have duplicate codes (START values).
proc sort data=myfmt out=myfmt_clean nodupkey ;
by start;
run;
The SAS log will show if any observations were deleted because of duplicate START values.
If you do have duplicate values then examine the dataset or original text file to understand why and determine how you want to handle the duplicates. The PROC SORT step above will keep just one of the duplicates. You might just have exact duplicates, in which case keeping only one is fine. Or you might want to collapse the duplicate observations into a single observation and concatenate the multiple decodes into one long decode.
If you want you can add a record that will add the functionality of the OTHER keyword of the VALUE statement in PROC FORMAT. You can use that to set a default value, like 'Value not found', to decode any value you might encounter that was not in your original source file.
data myfmt_final;
set myfmt_clean end=eof;
output;
if eof then do;
start = ' ';
label = 'Value not found';
hlo = 'O' ;
output;
end;
run;
Then use PROC FORMAT to make the format from the cleaned up data file.
proc format cntlin = myfmt_final;
run;
To convert a FORMAT to a dataset use the CNTLOUT= option on PROC FORMAT.
For example if you had created this format previously.
proc format ;
value $myfmt 'ABC'='ABC' 'BCD'='BCD' 'BCD1'='BCD' 'BCD2'='BCD' ;
run;
then you can use another PROC FORMAT step to make a dataset. Use the SELECT statement if your format catalog has more than one format defined and you just want one (or some) of them.
proc format cntlout=myfmt ;
select $myfmt ;
run;
Then you can use that dataset to easily make a text file. For example a comma delimited file.
data _null_;
set myfmt ;
file 'myfmt.txt' dsd ;
put start label;
run;
The result would be a text file that looks like this:
ABC,ABC
BCD,BCD
BCD1,BCD
BCD2,BCD
You get this error because you have the same code that maps to two different categories. My guess is that the data was not imported correctly from your text file and some values ended up truncated, but without seeing the full process that is only an educated guess.
This will work fine:
proc format;
value $ test
'a'='aa' 'b'='aa' 'c'='as'
;
run;
This version will not work, because a is mapped to two different values, so SAS will not know which one to use.
proc format;
value $ badtest
'a'='aa'
'a' = 'ba'
'b' = 'aa'
'c' = 'as';
run;
This generates the error regarding overlaps in your data.
The way to fix this is to find the duplicates and determine which code they should actually map to. PROC SORT can be used to get your duplicate records.
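As a rough sketch of that check, assuming the mapping has been read into a dataset named myfmt with the code in a variable named START (both names are assumptions here, as are the output dataset names), you could sort by the code and keep every code that occurs more than once:

```sas
/* Sort by the code so BY-group processing can be used */
proc sort data=myfmt out=myfmt_sorted;
  by start;
run;

/* Keep only the codes that appear on more than one observation */
data dup_starts;
  set myfmt_sorted;
  by start;
  if not (first.start and last.start);
run;

proc print data=dup_starts;
run;
```

Every row printed belongs to a code that is mapped more than once, so you can see the competing labels side by side and decide which one to keep.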
I'm new to SAS and am having issues with using Linear Regression.
I loaded a CSV file and then in Tasks and Utilities > Tasks > Statistics > Linear Regression I selected WORK.BP (BP = filename) for my data. When I try to select my dependent variable SAS says "No columns are available."
The CSV file appears to have loaded correctly and has 2 columns, so I can't figure out what the issue is.
Thanks for the help.
This is the code I used for loading the file:
data BP;
infile '/folders/myfolders/BP.csv' dlm =',' firstobs=2;
input BP $Pressure$;
run;
And this is what the output looks like
Your code reads both BP and Pressure as character variables, because each name in the INPUT statement is followed by a $. A linear regression model needs numeric variables, which is why the task shows no available columns.
To fix this, I suggest using PROC IMPORT to import the .csv file, instead of a DATA step with an INPUT statement.
In your case, you should follow these steps:
Define the path where the .csv file is located:
%let path = the_folder_path_where_the_csv_file_is_located ;
Define the row on which your data starts (counting the header row with the variable names):
%let datarow = 2;
Import the .csv file, here named 'BP', as follows:
proc import datafile="&path./BP.csv"
out=BP
dbms=csv
replace;
delimiter=",";
datarow=&datarow.;
getnames=YES;
run;
I assumed that the output dataset should be called BP too (you'll find it in the WORK library) and that the delimiter is the comma.
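If you would rather keep a DATA step, a minimal sketch of the same fix is to drop the $ signs so both columns are read as numeric. This assumes both columns of BP.csv really contain only numbers and that the first row is a header.

```sas
data BP;
  /* DSD treats the comma as the delimiter; FIRSTOBS=2 skips the header row */
  infile '/folders/myfolders/BP.csv' dsd firstobs=2;
  input BP Pressure;   /* no $ after the names, so both are numeric */
run;
```

After this runs, both variables should show up as selectable columns in the Linear Regression task.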
Hope this helps!
I'm importing a csv to a sas dataset with this code:
PROC IMPORT
DATAFILE = '/folders/myshortcuts/SASsoftware_rialto_2015/providence_med_claims_15.csv'
OUT = medical
DBMS=DLM REPLACE;
DELIMITER='|';
getnames=yes;
run;
For the subsequent code it wants one of the fields called DIAGNOSIS_VERSION_CODE in this dataset to be a character type rather than numeric type which is the default. How can I change that default in the above code or convert the field in the dataset?
I tried this and it didn't work:
contents data=medical;
modify medical;
format DIAGNOSIS_VERSION_CODE $CHAR8.;
contents data=medical;
run;
You cannot use PROC DATASETS to change a variable's definition or values. You will need to create a new dataset. You can use the RENAME statement to make the new variable have the name of the old one.
data new_medical;
set medical ;
new_diagnosis_version_code = put(diagnosis_version_code,Z8.);
rename diagnosis_version_code=old_diagnosis_version_code
new_diagnosis_version_code=diagnosis_version_code
;
run;
To prevent this in the future you should write your own data step to read the data instead of asking PROC IMPORT to guess what data you have. Then you have control over how the variables are created.
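A sketch of what such a data step could look like is below. The real layout of the file is not shown in the question, so claim_id is a hypothetical column used only for illustration; in practice every column has to be listed in the INPUT statement in the order it appears in the file.

```sas
data medical;
  infile '/folders/myshortcuts/SASsoftware_rialto_2015/providence_med_claims_15.csv'
         dsd dlm='|' firstobs=2 truncover;
  /* Declaring the length with $ forces the variable to be character */
  length claim_id $20                 /* hypothetical column */
         diagnosis_version_code $8;
  input claim_id diagnosis_version_code;
run;
```

Because the type is set in the LENGTH statement, DIAGNOSIS_VERSION_CODE stays character no matter what values the file contains.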
I'm trying to use a double pipe delimiter "||" when I export a file from SAS to txt. Unfortunately, it only seems to correctly delimit the header row and uses the single version for the data.
The code is:
proc export data=notes3 outfile='/file_location/notes3.txt'
dbms = dlm;
delimiter = '||';
run;
Which results in:
ID||VAR1||VAR2
1|0|STRING1
2|1|STRING2
3|1|STRING3
If you want to use a two character delimiter, you need to use dlmstr instead of dlm in the file statement in data step file creation. You can't use proc export, unfortunately, as that doesn't support dlmstr.
You can create your own proc export fairly easily, by using dictionary.columns or sashelp.vcolumn to construct the put statement. Feel free to ask more specific questions on that side if you need help with it, but search around for data driven output and you'll most likely find what you need.
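A minimal hand-written version, assuming the dataset has just the three variables ID, VAR1 and VAR2 shown in the question's output, might look like this:

```sas
data _null_;
  set notes3;
  /* DLMSTR= lets list-style PUT use a multi-character delimiter */
  file '/file_location/notes3.txt' dlmstr='||';
  /* Write the header row once, before the first data row */
  if _n_ = 1 then put 'ID||VAR1||VAR2';
  put id var1 var2;
run;
```

For a wider dataset you would build the variable list and the header from dictionary.columns instead of typing them out.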
The reason proc export won't use a double pipe is that it generates a data step to do the export, which uses a file statement. This is a known limitation, quoting the help file:
Restriction: Even though a character string or character variable is
accepted, only the first character of the string or variable is used
as the output delimiter. This differs from INFILE DELIMITER=
processing.
The header row || works because SAS constructs it as a string constant rather than using a file statement.
So I don't think you can fix the proc export code, but here's a quick and dirty data step that will transform the output into the desired format, provided that your dataset has no missing values and doesn't contain any pipe characters:
/*Export as before to temporary file, using non-printing TAB character as delimiter*/
proc export
data=sashelp.class
outfile="%sysfunc(pathname(work))\temp.txt"
dbms = dlm;
delimiter = '09'x;
run;
/*Replace TAB with double pipe for all rows beyond the 1st*/
data _null_;
infile "%sysfunc(pathname(work))\temp.txt" lrecl = 32767;
file "%sysfunc(pathname(work))\class.txt";
input;
length text $32767;
text = _infile_;
if _n_ > 1 then text = tranwrd(text,'09'x,'||');
put text;
run;
/*View the resulting file in the log*/
data _null_;
infile "%sysfunc(pathname(work))\class.txt";
input;
put _infile_;
run;
As Joe suggested, you could alternatively write your own delimiter logic in a dynamically generated data step, e.g.
/*More efficient option - write your own delimiter logic in a data step*/
proc sql noprint;
select name into :VNAMES separated by ','
from sashelp.vcolumn
where libname = "SASHELP" and memname = "CLASS";
quit;
data _null_;
file "%sysfunc(pathname(work))\class.txt";
set sashelp.class;
length text $32767;
text = catx('||',&VNAMES);
put text;
run;