Reading a large CSV file that contains a column with large text values - SAS
I am having an issue reading a large data set into SAS. It contains a review_text column with large text values. The first column, review_id, shows the observation number rather than the actual ID. The other columns have wrong values, and data is being displaced into other variables.
DATA review;
INFORMAT review_id $50. ;
INFORMAT review_text $5000. ;
INFORMAT business_name $100. ;
INFORMAT business_id $100. ;
INFORMAT review_date mmddyy10. ;
INFORMAT city $25. ;
INFORMAT state $20. ;
INFORMAT address $250. ;
INFORMAT user_id $100. ;
INFORMAT user_name $100. ;
INFORMAT friends $500. ;
INFORMAT yelping_since mmddyy10. ;
INFORMAT categories $100. ;
INFILE 'C:\users\scott\desktop\yelp_food_reviews.csv' DELIMITER= ',' dsd LRECL=32767 FIRSTOBS=2;
INPUT review_id $ review_text $ business_name $ business_id $ review_date review_ratingbusiness_rating num_biz_reviews city $ state $ address $ postal_code 8. latitude 8.10 longitude 8.10 mon_hours $ tues_hours $ wed_hours $ thurs_hours $ fri_hours $ sat_hours $ sun_hours $ user_id $ user_name $ user_reviews_given 8. ave_rating_given 4.1 friends $ yelping_since categories $ is_open 1.;
run;
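One possible cause, offered as a sketch rather than a confirmed fix: an informat written directly after a variable name in the INPUT statement without the colon (:) modifier (for example postal_code 8. or is_open 1.) switches that variable to formatted input, which reads a fixed number of columns regardless of commas. That alone can shift every later field. The token review_ratingbusiness_rating also looks like two variable names run together, which would shift the columns by one. The example below uses inline rows and a reduced, hypothetical column list (the data set name review_demo and the sample values are assumptions, not taken from the actual file) to show the colon-modifier form.

```sas
/* Sketch only: sample rows and the reduced column list are made up */
data review_demo;
    infile datalines delimiter=',' dsd truncover;
    /* the : modifier applies the informat but still stops at the delimiter */
    input review_id   :$50.
          review_text :$5000.
          postal_code :8.
          is_open     :1.;
datalines;
abc123,"Great food, friendly staff",85001,1
def456,"Too salty",2139,0
;
run;

proc print data=review_demo;
run;
```

If the review text itself contains line breaks inside the quotes, the records would still be split across lines; that case needs separate handling and is not covered by this sketch.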
Related
change a length format of a column in sas
This is my code:
data want ;
input branch_id branch_name $ branch_specification $ bold_type $ bold_score $ ;
DATALINES ;
612 NATANYA masham_atirey masham 1.15
;
run ;
The output for branch_specification is masham_a. I would like to make the length longer.
You could add a length statement under your data statement.
length branch_specification $15.;
Just keep in mind that the length statement will put the variable at the front of your data set. You can change the order with a retain statement.
data want ;
retain branch_id branch_name branch_specification;
length branch_specification $15.;
input branch_id branch_name $ branch_specification $ bold_type $ bold_score $ ;
DATALINES ;
612 NATANYA masham_atirey masham 1.15
;
run ;
You can input branch_specification using the $15. informat while using list input for the rest, e.g.
data want ;
input branch_id branch_name $ branch_specification :$15. bold_type $ bold_score $ ;
DATALINES ;
612 NATANYA masham_atirey masham 1.15
;
run ;
This way you don't need a separate length statement, and the variable order is unchanged. Adding the : modifier prevents the input statement from reading past the first delimiter (space by default) into the next variable in cases where branch_specification is less than 15 characters long.
SAS Compress function Returns FALSE value
I'm having some trouble using the 'Compress' function in SAS. My aim is to remove any alphabetical characters from the 'Comments' field in this data step. However, when I execute the code below, the 'NewPrice' field shows FALSE rather than the expected value.
data WORK.basefile2;
length TempFileName Filenameused $300.;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile CSV2 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=3 Filename=TempFileName;
Filenameused= TRANWRD(substr(TempFileName, 48), ".xlsx.csv", "");
informat customer_id $8.;
informat Name $50.;
informat Reco_issue $50.;
informat Reco_action $50.;
informat ICD $50.;
informat Start_date $50.;
format customer_id $8.;
format Name $50.;
format Reco_issue $50.;
format Reco_action $50.;
format ICD $50.;
format Start_date $50.;
format Comments $255.;
format Description $255.;
NewPrice=COMPRESS(Comments, '', 'kd');
input customer_id $ Name $ 'Total Spend'n $ Template $ Product_id $ Description $ Start_date $ CD_Sales CD_Lines CD_Level $ CD_Price $ CD_uom $ CD_Discount $ Reco_issue $ Reco_action $ Reco_price $ Reco_discount $ Impact_£ Impact Agree $ Comments $ ICD $ Structure_Code $ Deal_type $ NewPrice;
run;
[Image: output when code is executed]
Sample data (comma delimited):
ASDFGH,TEST,"£31,333.00",15AH,156907,TEST,18/10/2016,"£4,003.10",222,5,£5.19,M,,Below Hard Floor,Change Rate,£6.63,,£0.48,21.72%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,475266,TEST,11/11/2016,£49.61,29,5,£2.52,EA,,At Hard Floor,Change Rate,£6.36,,£1.28,60.38%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,404740,TEST,21/09/2017,£38.69,1,5,£116.07,EA,,Below Hard Floor,Change Rate,£163.80,,£15.91,29.14%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476557,TEST,11/11/2016,£32.13,25,5,£1.32,EA,,Below Hard Floor,Change Rate,£2.76,,£0.48,52.17%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476553,TEST,11/11/2016,£29.17,11,5,£1.29,EA,,Below Hard Floor,Change Rate,£3.39,,£0.70,61.95%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476557,TEST,11/11/2016,£17.61,5,5,£3.96,EA,,Below Hard Floor,Change Rate,£9.69,,£1.91,59.13%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,475261,TEST,11/11/2016,£16.70,4,5,£10.92,EA,,Below Hard Floor,Change Rate,£26.67,,£5.25,59.06%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476546,TEST,11/11/2016,£15.73,10,5,£0.96,EA,,Below Hard Floor,Change Rate,£2.67,,£0.57,64.04%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,476549,TEST,11/11/2016,£5.84,3,5,£1.86,EA,,At Hard Floor,Change Rate,£6.00,,£1.38,69.00%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,477340,TEST,11/11/2016,£3.75,2,5,£4.11,EA,,Below Hard Floor,Change Rate,£11.40,,£2.43,63.95%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",,259738,TEST,13/01/2018,"£45,173.66",403,5,£10.35,EA,20,Below Hard Floor,Change Rate,£10.80,,£0.15,4.17%,N,New Prices Agreed £3.52
ASDFGH,TEST,"£31,333.00",,297622,TEST,13/01/2018,£736.60,5,5,£10.95,EA,20,Below Hard Floor,Change Rate,£11.46,,£0.17,4.45%,N,New Prices Agreed £3.75
ASDFGH,TEST,"£31,333.00",,105384,TEST,19/07/2017,£223.44,1,5,£11.25,BG,42.5,Below Hard Floor,Change Rate,£11.49,,£0.08,2.09%,N,New Prices Agreed £3.76
Any help would be greatly appreciated! Thanks, Henry
First off, I assume you mean that NEWPRICE=' ', not FALSE, since SAS doesn't have a "FALSE" value per se. ' ' would be treated as FALSE in a boolean expression, though.
COMPRESS is returning ' ' here because the value in COMMENTS consists entirely of letters. Your COMPRESS arguments ask it to keep digits and spaces ('' is still a single space, even if it doesn't look like it; leave the argument out entirely if you don't want spaces). Most records have no digits in the COMMENTS field, so only spaces remain, and SAS treats those as equivalent to missing.
For the other records, the problem is that you're keeping spaces, so the result won't convert neatly to a number. You'll want to use INPUT, and probably not keep spaces, to get the value you want. (You probably also want to keep the ., no?) Or make NEWPRICE a character field.
Finally, your line
NewPrice=COMPRESS(Comments, '', 'kd');
comes before the input statement, which is a problem: the value of Comments isn't defined yet when it runs.
This works, for example. Note that I don't understand why you have NewPrice listed in the input statement (along with some other fields that won't have values either)...
data WORK.basefile2;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile datalines delimiter = ',' MISSOVER DSD ;
informat customer_id $8.;
informat Name $50.;
informat Reco_issue $50.;
informat Reco_action $50.;
informat ICD $50.;
informat Start_date $50.;
format customer_id $8.;
format Name $50.;
format Reco_issue $50.;
format Reco_action $50.;
format ICD $50.;
format Start_date $50.;
format Comments $255.;
format Description $255.;
input customer_id $ Name $ 'Total Spend'n $ Template $ Product_id $ Description $ Start_date $ CD_Sales CD_Lines CD_Level $ CD_Price $ CD_uom $ CD_Discount $ Reco_issue $ Reco_action $ Reco_price $ Reco_discount $ Impact_Pound Impact Agree $ Comments $ ICD $ Structure_Code $ Deal_type $ ;
NewPrice=input(COMPRESS(Comments, '.', 'kd'),best32.);
datalines;
ASDFGH,TEST,"£31,333.00",15AH,156907,TEST,18/10/2016,"£4,003.10",222,5,£5.19,M,,Below Hard Floor,Change Rate,£6.63,,£0.48,21.72%,N,New Prices Agreed £3.76
ASDFGH,TEST,"£31,333.00",15AH,475266,TEST,11/11/2016,£49.61,29,5,£2.52,EA,,At Hard Floor,Change Rate,£6.36,,£1.28,60.38%,N,In negotiations with the customer for new prices
ASDFGH,TEST,"£31,333.00",15AH,404740,TEST,21/09/2017,£38.69,1,5,£116.07,EA,,Below Hard Floor,Change Rate,£163.80,,£15.91,29.14%,N,In negotiations with the customer for new prices
;;;;
run;
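To see the effect of the second and third COMPRESS arguments in isolation, here is a small sketch that applies both variants to literal strings modeled on the sample Comments values (with the £ sign left out to sidestep encoding differences). The data set and variable names (compress_demo, keep_space, keep_dot, price) are made up for illustration.

```sas
/* Sketch: compare keeping blanks+digits with keeping dot+digits */
data compress_demo;
    length comments $60 keep_space keep_dot $60;
    do comments = 'New Prices Agreed 3.76',
                  'In negotiations with the customer for new prices';
        /* k = keep the listed characters, d = add digits to the list */
        keep_space = compress(comments, ' ', 'kd');
        keep_dot   = compress(comments, '.', 'kd');
        /* ?? suppresses invalid-data notes if the text is not numeric */
        price      = input(keep_dot, ?? best32.);
        output;
    end;
run;

proc print data=compress_demo;
run;
```

For the first string, keep_dot should hold 3.76 and convert cleanly, while keep_space retains the blanks between the words along with the digits. For the second string nothing but blanks survives, so price comes out missing.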
SAS PROC Import vs DATA step with INFILE
My PROC IMPORT step is throwing an "import unsuccessful" error when I try to read a '~'-delimited file containing an address field. In the CSV file, the 5-byte zip code is automatically treated as a numeric field, and once in a while I get bad records with invalid zip codes such as VXR1#. When one is encountered, I get the "import unsuccessful" error and the SAS job fails. PROC IMPORT is automatically converted to a DATA step with an INFILE statement, so I tried a DATA step with INFILE, INFORMAT and FORMAT statements, and changed the format of ZIP to alphanumeric. But now I face a different issue: with the DATA, INFORMAT and FORMAT keywords, the lengths mismatch and the data is moved to different locations automatically. Could someone help me figure out a solution? The DATA step and the PROC IMPORT step I used are included below for reference:
data WORK.TRADER_STATS ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '/sascode/test/TRADER_STATS.csv' delimiter = '~' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat TRADER_id best32. ;
informat dealer_ids $60. ;
informat dealer_name $27. ;
informat dealer_city $15. ;
informat dealer_st $2. ;
informat dealer_zip $5. ;
informat SNO best32. ;
informat start_dt yymmdd10. ;
informat end_dt yymmdd10. ;
input TRADER_id dealer_ids $ dealer_name $ dealer_city $ dealer_st $ dealer_zip sno start_dt end_dt ;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
proc import file="/sascode/test/TRADER_STATS_BY_DAY.csv" out=WORK.TRADER_STATS_BY_DAY dbms=dlm replace;
delimiter='~';
run;
Try using the colon (:) modifier, which tells SAS to use the informat supplied but to stop reading the value for the variable when a delimiter is encountered. That should sort out your problem of data getting moved to different locations automatically.
data WORK.TRADER_STATS ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile '/sascode/test/TRADER_STATS.csv' delimiter = '~' MISSOVER DSD lrecl=32767 firstobs=2 ;
input TRADER_id : best32.
dealer_ids : $60.
dealer_name : $27.
dealer_city : $15.
dealer_st : $2.
dealer_zip : $5.
sno : best32.
start_dt : yymmdd10.
end_dt : yymmdd10.;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
SAS: Taking Date data in DD-MMM-YYYY format from a csv file in a date format in a permanent data set
I would like to import data from a CSV file into a permanent data set. The file has a date column with values in "dd-mmm-yyyy" format, such as "22-FEB-1990". I want this to be imported as a date inside the data set too. I have tried many formats and informats, but I am not getting anything in the column. Here is the code I wrote (although some things are commented out, I have tested all the permutations and combinations of formats and informats I could think of):
libname asgn1 "C:\Users\*****\abc";
data asgn1.Car_sales_1_1;
infile "C:\Users\********\Car_sales.csv" dsd dlm="," FIRSTOBS=2 ;
input Manufacturer $ Model $ Fuel_efficiency Latest_Launch;
* format Latest_Launch mmddyy10.;
* informat Latest_Launch mmddyy10.;
run;
Please help...
Change your informat to date11. (dd-mmm-yyyy). SAS Informats by Category > http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a001239776.htm
I tried the following code and I got just the result I wanted. Thanks #Chris J
libname asgn1 "C:\Users\*****\abc";
data asgn1.Car_sales_1_1;
infile "C:\Users\********\Car_sales.csv" dsd dlm="," FIRSTOBS=2 ;
input Manufacturer $ Model $ Fuel_efficiency Latest_Launch;
informat Latest_Launch date11.;
format Latest_Launch ddmmyy10.;
run;
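For anyone who wants to try the date11. informat without the original file, here is a minimal sketch with inline rows standing in for Car_sales.csv. The data set name car_sales_demo and the sample values are invented; the INFORMAT statement is placed ahead of INPUT here purely as a readability convention.

```sas
/* Sketch: inline rows stand in for Car_sales.csv; values are made up */
data car_sales_demo;
    infile datalines dsd dlm=',';
    informat Latest_Launch date11.;   /* reads dd-mmm-yyyy text */
    format   Latest_Launch date11.;   /* displays the date as dd-mmm-yyyy */
    input Manufacturer $ Model $ Fuel_efficiency Latest_Launch;
datalines;
Acura,Integra,28,22-FEB-1990
Audi,A4,27,03-JUN-2011
;
run;

proc print data=car_sales_demo;
run;
```

The stored value is an ordinary numeric SAS date, so swapping the FORMAT statement for ddmmyy10. or mmddyy10. changes only how it is displayed.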
format date columns when "?" appears in same column
I have several columns with dates. Some entries contain "?" and other entries contain dates in the MMDDYY10. format. I compare dates at a later point, and have code that works for that, but the missing entries and "?" cause errors to occur and observations to be created. Here is my import code:
data WORK.esn_service ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\Documents and Settings\richardg\Desktop\Sirius\esn_service.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat DEACTIVATION_DATE best10. ;
informat DEACTIVATION_REASON $35. ;
informat REACTIVATION_DATE best10. ;
format DEACTIVATION_DATE mmddyy10. ;
format DEACTIVATION_REASON $35. ;
format REACTIVATION_DATE mmddyy10. ;
input DEACTIVATION_DATE DEACTIVATION_REASON $ REACTIVATION_DATE ;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
The two date columns are causing the error. I need to compare dates later, so I can't just pick a random date to replace the problem cells.
You have an issue with how you are importing the data. Once a column is a character variable, it's stuck that way - you have to create a new column to change it. Either change how you import it to bring it in as numeric, or create a new column for each to force it to be numeric.
If your data has MMDDYY10 in it already (in the CSV file), then you need to use INFORMAT. INFORMAT controls how SAS reads in the data.
data WORK.esn_service ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'C:\Documents and Settings\richardg\Desktop\Sirius\esn_service.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat DEACTIVATION_DATE mmddyy10. ;
informat DEACTIVATION_REASON $35. ;
informat REACTIVATION_DATE mmddyy10. ;
format DEACTIVATION_DATE mmddyy10. ;
format DEACTIVATION_REASON $35. ;
format REACTIVATION_DATE mmddyy10. ;
input DEACTIVATION_DATE DEACTIVATION_REASON $ REACTIVATION_DATE ;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
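The code above switches the informat but does not deal with the "?" entries themselves. One way to handle them, sketched here under the assumption that a "?" should simply become a missing date, is to read each date column as text and convert it with the INPUT function and the ?? modifier, which suppresses the invalid-data note and keeps _ERROR_ from being set. The deact_raw and react_raw names, the data set name, and the inline rows are hypothetical.

```sas
/* Sketch: inline rows stand in for esn_service.csv; values are made up */
data esn_service_demo;
    infile datalines delimiter=',' dsd missover;
    length deact_raw react_raw $10 DEACTIVATION_REASON $35;
    format DEACTIVATION_DATE REACTIVATION_DATE mmddyy10.;
    input deact_raw $ DEACTIVATION_REASON $ react_raw $;
    /* "?" or blank becomes a missing date without raising an error */
    DEACTIVATION_DATE = input(deact_raw, ?? mmddyy10.);
    REACTIVATION_DATE = input(react_raw, ?? mmddyy10.);
    drop deact_raw react_raw;
datalines;
01/15/2012,Non payment,02/01/2012
?,Customer request,?
,,03/10/2012
;
run;

proc print data=esn_service_demo;
run;
```

Missing dates sort lower than any real date in SAS comparisons, so later date comparisons may want an explicit missing() check rather than relying on < or > alone.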