That's the dataset. I need a variable for ShipID, Received, Shipped, City, Zip Code. How would I go about doing that?
This is my first statistical programming language course and I am struggling. My professor hasn't been much of a help either.
ShipID Received Shipped Address .
X8742 2018/03/14 2018/03/17 Little River, KS, 67457
There's a ton more lines and I've been lost on it for an hour.
infile "/home/rossfosher0/SAS Homework/SAS Sessions/WarehouseA.txt" firstobs = 2;
input #2-7 ShipID $ #9-18 Received: YYYYMMDD8. #20-28 Shipped: YYYYMMDD8. #City $;
run;
I'm trying to set up a data set for this warehouse.
data mydata;
input @1 shipid $ @7 received yymmdd10. @18 shipped yymmdd10. @28 address $30.;
format received yymmdd10. shipped yymmdd10.;
datalines;
X8742 2018/03/14 2018/03/17 blue ridge, MA 02391
;
run;
Assuming that all rows have values for the first three variables you could just read those using list mode input. Then read the rest of the line as the address.
data want;
infile "..." firstobs=2 truncover;
input shipid $ received shipped address $50. ;
informat received shipped yymmdd.;
format received shipped yymmdd10.;
run;
If the data is really in fixed columns then you can use column locations in your INPUT statement, but that is not compatible with using informats. So either use formatted input for the two date fields or read them as strings.
input shipid $1-7 @8 Received yymmdd10. @19 Shipped yymmdd10. Address $ 30-79 ;
format Received Shipped yymmdd10.;
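The original question also asked for City and Zip Code as separate variables. A minimal sketch of one way to split the address, assuming the city is everything before the first comma and the zip code is the last word on the line (as in the sample row). The variable names city, state, and zip are illustrative and not taken from the answers above, and DATALINES stands in for the real file:

```sas
data want;
    infile datalines truncover;
    length city $30 state $2 zip $5;
    /* list-mode input for the first three fields, then the rest of the line */
    input shipid $ received :yymmdd10. shipped :yymmdd10. address $50.;
    format received shipped yymmdd10.;
    city  = scan(address, 1, ',');     /* text before the first comma */
    state = scan(address, -2, ', ');   /* second-to-last word */
    zip   = scan(address, -1, ', ');   /* last word */
    datalines;
X8742 2018/03/14 2018/03/17 Little River, KS, 67457
;
run;
```

Keeping zip as a character variable preserves leading zeros such as 02391.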
Tom and DCR are both right. I prefer an easier route using Proc import.
proc import datafile='c:\personal\My_file.csv'
out=SAS_data replace;
DELIMITER=";" ;
getnames=yes;
guessingrows= 32767;
run;
What this does is that it makes a guess based on the file read and auto creates the infile-statement. (I just copy it from log and make adjustments if something is read incorrectly.)
If you know the structure of the data, follow the other answers, but this is a more beginner-friendly approach (IMHO). For more, see the documentation.
I'm trying to load data from a CSV. I have a few formats: date, time, numeric, string. I don't have a problem converting the data to these formats, except for the time format.
The data looks like this:
Date,Time,Transaction,Item
2016-10-30,9:58:12,1,Bread
2016-10-30,10:05:36,2,Scandinavian
2016-10-30,10:08:00,3,Hot chocolate
My code:
data lab0.piekarnia;
INFILE 'path_to_csv' delimiter=',' firstobs=2;
format Date yymmdd10.;
format Time time8.;
INPUT
Date yymmdd10.
Time time8.
Transaction
Item $;
run;
What have I tried?
I tried manually converting the string '12:22:22'. This method gives good results, but I don't know how to apply it when loading the CSV.
data ds1;
j = input('12:22:22',HHMMSS8.);
format j time8.;
run;
data have;
INFILE "path_to_csv" truncover delimiter=',' firstobs=2 ;
format Date yymmdd10.;
format Time time8.;
INPUT date time transaction item $32.;
informat
Date yymmdd10.
Time time.;
/*Instead of the INPUT and INFORMAT statements you can use:
INPUT date:yymmdd10. time:time. transaction item $32.;
*/
run;
The first line has only 7 characters for the time value, but you told SAS to read exactly 8 characters, so it included the comma. When reading delimited data, like a CSV file, you need to use list mode input and not formatted mode. You do this either by removing the informat specification from the INPUT statement (and instead attaching an informat to the variable with an INFORMAT statement) or by prefixing the informat specification with the : (colon) modifier. Also, if you don't define the length for ITEM (or give SAS something else, like an informat, that it can use to guess a length), it will be created with length $8.
input date :yymmdd10. time :time8. transaction item :$40.;
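Putting that INPUT statement into a complete step might look like the sketch below. It assumes the sample rows from the question, uses DATALINES with DSD in place of the real CSV path, and writes to a WORK dataset named piekarnia (the question used lab0.piekarnia):

```sas
data piekarnia;
    /* DSD makes the comma the delimiter; TRUNCOVER guards against short lines */
    infile datalines dsd truncover;
    /* the colon modifier reads each field up to the next delimiter */
    input date :yymmdd10. time :time8. transaction item :$40.;
    format date yymmdd10. time time8.;
    datalines;
2016-10-30,9:58:12,1,Bread
2016-10-30,10:05:36,2,Scandinavian
2016-10-30,10:08:00,3,Hot chocolate
;
run;

proc print data=piekarnia;
run;
```

Because the colon modifier stops at the delimiter, the 7-character time 9:58:12 and the 8-character time 10:05:36 are both read without picking up the comma.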
I was working on a SAS problem where I need to append the data. The data run is successful but it creates duplicates every time I run the program.
Please check my code and screenshot of the table:
Question: Create a new file "Total_Sales" by appending data file "Hyundai" with the file first created in problem 3.
/*Problem 3*/:
data avik1.var1;
length uniqueid $50 Manufacturer $ 50 Model $20 Sales_in_thousands 8 _4_year_resale_value 8 Price_in_thousands 8;
retain uniqueid Manufacturer Model Latest_Launch Sales_in_thousands _4_year_resale_value Price_in_thousands;
set avik1.conc(drop= Vehicle_type Engine_size Horsepower Wheelbase Width Length Curb_weight Fuel_capacity Fuel_efficiency );
informat Latest_Launch date9.;
format Latest_Launch ddmmyy10.;
run;
proc print data = avik1.var1;
run;
/* Data To be Appended */
data avik1.hyundai;
length uniqueid $ 50 Manufacturer $ 50 Model $20 Sales_in_thousands 8 _4_year_resale_value 8;
informat Latest_Launch date7. ;
format Latest_Launch ddmmyy10.;
input Manufacturer $ Model $ Sales_in_thousands _4_year_resale_value Latest_Launch;
uniqueid=(Model||Manufacturer);
cards;
Hyundai Tuscon 16.919 16.36 2Feb12
Hyundai i45 39.384 19.875 3Jun11
Hyundai Verna 14.114 18.225 4Jan12
Hyundai Terracan 8.558 29.775 10Mar11
;
run;
Proc Print data = avik1.hyundai;
run;
Now I used the following code to append:
data avik1.total_sales;
set avik1.var1 avik1.hyundai;
proc append base=avik1.var1 new=avik1.hyundai force;
run;
proc print data= avik1.total_sales;
run;
The program runs but gets me duplicates which you can check in the image
I am new to SAS and would really appreciate your response and a solution to this problem. Also, please tell me why this is happening.
Thanks!
Did you run it twice? I'm guessing but that could be the reason you see duplicates. I'll try to explain.
In your append code here, you are creating the new dataset total_sales by combining var1 and hyundai:
data avik1.total_sales;
set avik1.var1 avik1.hyundai;
In the below code, you are not creating a new dataset, you are expanding var1 by adding the records from hyundai.
proc append base=avik1.var1 new=avik1.hyundai force;
run;
If you ran this proc append and then ran the first data step again, you will have duplicates of all hyundai records because you are taking the EXPANDED var1 and re-adding the hyundai records.
So the point is, to answer the original question, the proc append procedure is totally unnecessary. You achieved it with just the data step.
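A minimal sketch of the rerunnable version, assuming small stand-in tables in WORK. The rows in var1 are made-up placeholders, not data from the question, and only three of the columns are kept:

```sas
/* stand-in for avik1.var1 (placeholder rows) */
data var1;
    input Manufacturer :$50. Model :$20. Sales_in_thousands;
    datalines;
MakeA ModelA 1
MakeB ModelB 2
;
run;

/* stand-in for avik1.hyundai */
data hyundai;
    input Manufacturer :$50. Model :$20. Sales_in_thousands;
    datalines;
Hyundai Tuscon 16.919
Hyundai i45 39.384
;
run;

/* Rebuilds total_sales from scratch on every run and never modifies var1,
   so running the program twice does not duplicate the hyundai rows */
data total_sales;
    set var1 hyundai;
run;
```

If PROC APPEND is preferred, the base should be a fresh copy rather than var1 itself, for example a DATA step that copies var1 into total_sales followed by proc append base=total_sales data=hyundai force.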
I have a .csv file which has some flight information. Sample data is shown below.
date|sched_dep_time|dep_time|sched_arr_time|arr_time
1/1/2013|515|517|819|830
The 515 here actually means 5:15 hrs. How can I read this data into SAS correctly? If I use the time. format, it comes up with some strange timings. I have seen some code snippets that have to be written specifically to do these time conversions. But is there a more straightforward method available?
Use the informat HHMMSS, which will read it in correctly.
data have;
informat date ddmmyy10. sched_dep_time dep_time sched_arr_time arr_time hhmmss.;
format sched_dep_time dep_time sched_arr_time arr_time time.;
input date sched_dep_time dep_time sched_arr_time arr_time;
cards;
1/1/2013 515 517 819 830
;
run;
proc print data=have;run;
I didn't realize the HHMMSS. INFORMAT would work. Reeza's answer is best. If you want a custom function, here you go.
options cmplib=work.fns;
proc fcmp outlib=work.fns.time;
function to_time(x);
minutes = mod(x,100);
hour = (x-minutes)/100;
time = hms(hour,minutes,0);
return (time);
endsub;
run;
data test;
format in_val best.
out_time time.;
in_val = 512;
out_time = to_time(in_val);
put in_val out_time;
run;
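The arithmetic inside to_time can be traced in a plain DATA step without PROC FCMP. This is only an illustration of the same steps for the value 515 from the question, not a replacement for the function:

```sas
data _null_;
    x = 515;                          /* 5:15 stored as the number 515 */
    minutes = mod(x, 100);            /* 15 */
    hour    = (x - minutes) / 100;    /* 5 */
    t       = hms(hour, minutes, 0);  /* 5*3600 + 15*60 = 18900 seconds */
    put t= time5.;                    /* write the time value to the log */
run;
```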
So I imported a SAS dataset and specified the desired variables while correctly formatting them.
FILENAME currency '/folders/myfolders/SAS assignment/Assignment4/currency.txt';
data assn4.currency;
infile currency;
input
@1 currencynotes $3.
@6 purchasedate mmddyy10.
@19 purchasevalue 7.0000
@30 selldate mmddyy10.
@44 sellvalue 7.0000
@55 numberofnotespurchased;
I then added in a number of SAS variables based on the other variables
data assn4.currency;
set assn4.currency;
Timeheld = selldate-purchasedate;
run;
data assn4.currency;
set assn4.currency;
value_at_dollar_per_purchase = numberofnotespurchased/purchasevalue;
run;
data assn4.currency;
set assn4.currency;
value_at_dollar_per_sale = numberofnotespurchased/sellvalue;
run;
data assn4.currency;
set assn4.currency;
profit= value_at_dollar_per_sale-value_at_dollar_per_purchase;
run;
data assn4.currency;
set assn4.currency;
PPD = profit/Timeheld;
run;
I then wanted to format and print out the dataset along with these new variables. However, I do not know the spacing of these new variables, and the dataset created in my ASSN4 library has column numbers instead of the spacing information I used for the imported txt file.
data assn4.currency;
infile currency;
input
@1 currencynotes $3.
@6 purchasedate mmddyy10.
@19 purchasevalue 7.0000
@30 selldate mmddyy10.
@44 sellvalue 7.0000
@55 numberofnotespurchased
@65 Timeheld mmddyy10.
value_at_dollar_per_purchase 12.00000000
value_at_dollar_per_sale 12.00000000
profit 12.0000000000
PPD 12.0000000000
;
When I attempt to print out my dataset using
Proc Print data = assn4.currency;
run;
all these new variables had . denoting missing info, while the new dataset created that is in the library shows these values.
I'll try to keep my answer simple and short despite the fact that it seems you lack some basic SAS knowledge.
In a data step, you use infile to read from an external file. To read from a SAS data set, you use a set statement.
In the first step, you created a dataset called currency in a library called assn4 by reading from your text file. In the next few steps, you correctly add variables to that dataset, although all this could be done in one step.
However in the last step, you overwrite your dataset by reading again from your text file (with the infile statement). You then of course lose all the variables you had created.
This does what (I think) you are trying to achieve:
FILENAME currency '/folders/myfolders/SAS assignment/Assignment4/currency.txt';
data assn4.currency;
infile currency;
input
@1 currencynotes $3.
@6 purchasedate mmddyy10.
@19 purchasevalue 7.
@30 selldate mmddyy10.
@44 sellvalue 7.
@55 numberofnotespurchased
;
Timeheld = selldate-purchasedate;
value_at_dollar_per_purchase = numberofnotespurchased/purchasevalue;
value_at_dollar_per_sale = numberofnotespurchased/sellvalue;
profit= value_at_dollar_per_sale-value_at_dollar_per_purchase;
PPD = profit/Timeheld;
format
Timeheld mmddyy10.
value_at_dollar_per_purchase
value_at_dollar_per_sale
profit
PPD 12.
;
run;
Note that I changed your formats to what they are actually equivalent to. Adding a bunch of zeros after the dot in a format does absolutely nothing.
I am currently using SAS version 9 to try and read in a flat file in .txt format of a HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python web scraper built with Scrapy. The format of my .txt file is like this:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks
I would read it straight from the web. Something like this; it works about 50% but took a whole ten minutes to write, and I'm sure it could easily be improved.
The basic approach is to use @'string' to read in the text following a string. You might be better off reading this in as a bytestream, doing a regular expression match on <tr> ... </tr>, and then parsing that, rather than taking the more brute-force method here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input @'<td class="date">' date ddmmyy10.
@'class="team-link">' home_team $20.
@'class="result-1 rc">' score $10.
@'class="team-link">' away_team $20.
@'title="Minutes played in this match">' mins_played $10.
@'title="Rating in this match">' rating $6.
;
output;
end;
run;
As far as reading the scrapy output, you should change at least two things:
Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless delimiter is space.
Add a trailing "@@" to get SAS to hold the line pointer, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ @@;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ @@;
run;
I remove the dsd as that's incompatible with the ' delimiter; if DSD is actually needed then you can add it back, remove that delimiter, and read minutes in as char. I add ?? for rating as it sometimes is "None" so ?? ignores the warnings about that.
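To see the effect of the trailing @@ in isolation, here is a small sketch with made-up values, assuming two fields per observation and several observations packed onto one line. The dataset name pairs and the variables name and score are hypothetical:

```sas
data pairs;
    infile datalines dlm=',';
    /* @@ holds the current line across iterations of the DATA step,
       so each pass reads the next name/score pair from the same line */
    input name :$10. score @@;
    datalines;
Ann,1,Bob,2,Cy,3
Dee,4
;
run;

proc print data=pairs;
run;
```

Under these assumptions the step should produce one observation per name/score pair rather than one per line, which is the behavior needed for the single-line Scrapy output.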