Reading raw data file dlm - sas

I have a problem reading in a raw data file: some of the input values get cut off because of the delimiter. Since one of the titles has "\" in front of the real title, the Book_Title output is only "\". I was wondering if there is a way of ignoring those symbols.
Input:
0195153448;"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press"
085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
The code:
data rating.books;
infile "&path\BX-Books.csv" dlm=';' missover dsd firstobs=2;
input ISBN: $12.
Book_Title: $quote150.
Book_Author: $quote60.
Year_Of_Publication: $quote8.
Publisher: $quote60.;
run;
Output:
ISBN | Book-Title | Book-Author | Publisher | Publication-Year
0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002
085409878X | \ | being dust from the law courts,"| 1973 | Missing value
Desired output:
ISBN | Book-Title | Book-Author | Publisher | Publication-Year
0195153448 | Classical Mythology | Mark P. O. Morford | Oxford University Press | 2002
085409878X | Pie-powder being dust from the law courts |John Alderson Foote | EP Publishing | 1973

It does not look like your source data is following any known pattern.
If you read it without the DSD option then it will treat the second line as having 6 fields.
085409878X;"\"Pie-powder\"; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
v1=085409878X
v2="\"Pie-powder\"
v3=being dust from the law courts
v4=John Alderson Foote"
v5="1973"
v6="EP Publishing"
If you try to "fix" the escaped quotes
_infile_=tranwrd(_infile_,'\"','""');
then you will end up with only 4 fields.
085409878X;"""Pie-powder""; being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
v1=085409878X
v2="Pie-powder"; being dust from the law courts;John Alderson Foote
v3=1973
v4=EP Publishing
v5=
v6=
To get your desired output you could try removing the \"; and the "\" strings.
_infile_=tranwrd(_infile_,'\";',' ');
_infile_=tranwrd(_infile_,'"\"','');
Which does make it read as you want.
085409878X; Pie-powder being dust from the law courts;John Alderson Foote";"1973";"EP Publishing"
v1=085409878X
v2=Pie-powder being dust from the law courts
v3=John Alderson Foote"
v4=1973
v5=EP Publishing
v6=
Not sure if that will generalize to other lines with extra quotes or extra semi-colons.
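As a rough sketch, the two TRANWRD calls above could sit inside a complete DATA step like the one below. The `input @;` line loads the record into _INFILE_ and holds it, so the second INPUT reads the edited buffer. The output name `books`, the LENGTH values, and the clean-up loop are assumptions added for illustration. Because the line is read without DSD, surrounding quotes are not removed automatically (the loop strips them) and consecutive semicolons would collapse into one delimiter, so empty fields would shift.

```sas
data books;
  infile "&path\BX-Books.csv" dlm=';' truncover firstobs=2;
  length ISBN $12 Book_Title $150 Book_Author $60
         Year_Of_Publication $8 Publisher $60;

  /* Load the record into _INFILE_ and hold it for editing */
  input @;

  /* Remove the escaped-quote sequences that break the field layout */
  _infile_ = tranwrd(_infile_, '\";', ' ');
  _infile_ = tranwrd(_infile_, '"\"', ' ');

  /* Read the fields from the edited buffer */
  input ISBN Book_Title Book_Author Year_Of_Publication Publisher;

  /* Assumed clean-up: drop leftover quotes and tidy the blanks */
  array cols {*} Book_Title Book_Author Year_Of_Publication Publisher;
  do i = 1 to dim(cols);
    cols{i} = compbl(strip(compress(cols{i}, '"')));
  end;
  drop i;
run;
```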

You have to change your code a bit to read the Book_Title column as a plain $150. string, like this:
data work.books;
infile "h:\desktop\test.csv" dlm=';' missover dsd firstobs=1;
input ISBN: $12.
Book_Title: $150.
Book_Author: $quote60.
Year_Of_Publication: $quote8.
Publisher: $quote60.;
run;
Then you have to clean the special characters " and \ out of the column with this macro:
%macro cleaningColumn(col);
compress(strip(&col),'\"',' ')
%mend cleaningColumn;
You can include the macro function into a proc sql statement like this :
proc sql;
create table want as
select
ISBN,
%cleaningColumn(Book_Title) as Book_Title,
Book_Author,
Year_Of_Publication,
Publisher
from books;
quit;
The column Book_Title will be like this :
Classical Mythology
Pie-powder

Grouping successes in SAS on multiple rows by ID where at least 1 success counts as a success

I'm working with a dataset of call logs and I need to summarize how many subscribers have been successfully contacted. Each row is one call, and if at least one call for a subscriber is a success, I need to set a variable that outputs "successful contact" on each row that belongs to that subscriber, even if that row does not list a successful contact. The ideal outcome would be to output the number of successful contacts that subscriber has had in the dataset on each row belonging to the subscriber, regardless of the success or failure of that attempt.
Basically, it would solve my problem roughly to create this kind of output (success_contact would be the variable created):
Subscriber ID | Name | Contact Outcome (call) | Success_Contact
123456 | Bob | Unsuccessful | Successful
123456 | Bob | Successful | Successful
123456 | Bob | Successful | Successful
But it would be super awesome if I could do this:
Subscriber ID | Name | Contact Outcome (call) | Success_Contact
123456 | Bob | Unsuccessful | 2
123456 | Bob | Successful | 2
123456 | Bob | Successful | 2
985666 | Bill | Unsuccessful | 0
985666 | Bill | Unsuccessful | 0
I tried this with PROC SQL:
proc sql;
create table contact_success as
select count('Contact Outcome:'n) as no_success_outreach, 'Subscriber ID'n from work.min
where 'Contact Outcome:'n = 'Successful'
group by 'Subscriber ID';
;
quit;
But this just gave me the number of successful contacts in the whole dataset on each line.
How would I achieve my ideal outcome?
A simple way is to count the number of successful contacts for each person using PROC FREQ, and then merge the total number back in by ID.
data have;
length subscriber_id $20 name $20 contact_outcome $20;
input subscriber_id $ name $ contact_outcome $ ;
datalines;
123456 Bob Unsuccessful
123456 Bob Successful
123456 Bob Successful
985666 Bill Unsuccessful
985666 Bill Unsuccessful
;
proc freq data=have noprint;
where contact_outcome = 'Successful';
tables subscriber_id /missing out=counts;
run;
proc sort data=have;
by subscriber_id;
run;
data want (drop=count);
merge have (in=in1)
counts (in=in2 keep=subscriber_id count)
;
by subscriber_id;
success_contact = ifn(in2,count,0);
run;
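As an alternative sketch, PROC SQL can produce the same per-row count in a single step, because SAS remerges a summary statistic back onto the detail rows when non-grouped columns are selected (the log notes the remerge). It assumes the `have` dataset created above; the output name `want_sql` is made up for illustration. Row order within each subscriber is not guaranteed.

```sas
proc sql;
  create table want_sql as
  select subscriber_id,
         name,
         contact_outcome,
         /* The comparison evaluates to 1 or 0, so SUM counts the successes */
         sum(contact_outcome = 'Successful') as success_contact
  from have
  group by subscriber_id;
quit;
```

With the sample data this puts 2 on each of Bob's rows and 0 on each of Bill's rows.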

Update SQL with a JOIN condition based on a case when

I have a dataset (BASE) with the following structure: a column with an index for every record, a column with a classification type, the classification value, and a column I'd like to populate.
NAME |CLASSIFICATION|VALUE|STANDARD VALUE
FIDO |ALFABET |F |
ALFA |STANDARD |2 |
BETA |STANDARD |5 |
ETA |MIXED |B65 |
THETA|MIXED |A40 |
Not all records have the same classification; however, I have an additional table (TRANSCODE) to convert the different classification methods into the standard one (the STANDARD classification):
ALFABET|STANDARD|MIXED
A |1 |A1
B |5 |A30
C |3 |A40
D |5 |A31
E |8 |B65
F |6 |C54
My goal is to populate the fourth column with the corresponding value I can find in the second table (records that already use the standard classification will have the same value in both columns).
After that my data should be like the following:
NAME |CLASSIFICATION|VALUE|STANDARD VALUE
FIDO |ALFABET |F |6
ALFA |STANDARD |2 |2
BETA |STANDARD |5 |5
ETA |MIXED |B65 |8
THETA|MIXED |A40 |3
To do so, I'm trying a PROC SQL update with a join condition, but it doesn't seem to work:
proc sql;
update BASE
left join TRANSCODE
on BASE.VALUE= (
    case
        when BASE.CLASSIFICATION = 'ALFABET' then TRANSCODE.ALFABET
        when BASE.CLASSIFICATION = 'STANDARD' then TRANSCODE.STANDARD
        when BASE.CLASSIFICATION = 'MIXED' then TRANSCODE.MIXED
    end
)
set BASE.STANDARD_VALUE = TRANSCODE.STANDARD
;
quit;
Can someone help me?
Thanks a lot
The value selection for the standard value is a lookup query, so you can not join directly to transcode.
Try this UPDATE query that uses a different lookup selection for each classification:
data base;
infile cards missover;
input
NAME $ CLASSIFICATION $ VALUE $ STANDARD_VALUE $; datalines;
FIDO ALFABET F
ALFA STANDARD 2
BETA STANDARD 5
ETA MIXED B65
THETA MIXED A40
run;
data transcode;
input
ALFABET $ STANDARD $ MIXED $; datalines;
A 1 A1
B 5 A30
C 3 A40
D 5 A31
E 8 B65
F 6 C54
run;
proc sql;
update base
set standard_value =
case
when classification = 'ALFABET' then (select standard from transcode where alfabet=value)
when classification = 'MIXED' then (select standard from transcode where mixed=value)
when classification = 'STANDARD' then value
else 'NOTRANSCODE'
end;
quit;
%let syslast = base;
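If a new table is acceptable instead of an in-place update, a join with a classification-dependent ON condition is closer to what the question attempted. This is only a sketch: it assumes the `base` and `transcode` datasets created above, the output name `base_std` is made up, and it relies on each ALFABET and MIXED code appearing once in `transcode` (duplicate codes would duplicate rows).

```sas
proc sql;
  create table base_std as
  select b.name,
         b.classification,
         b.value,
         /* STANDARD rows keep their own value; the others take the lookup */
         case
           when b.classification = 'STANDARD' then b.value
           else t.standard
         end as standard_value
  from base as b
  left join transcode as t
    on (b.classification = 'ALFABET' and b.value = t.alfabet)
    or (b.classification = 'MIXED'   and b.value = t.mixed);
quit;
```

Rows with no match in `transcode` end up with a missing `standard_value` rather than a placeholder string.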

SAS Regression Output Data Structure

I am working on a research project that requires me to run a linear regression of the stock returns (of thousands of companies) against the market return for every single day between 1993 and 2014.
The data would be similar to (This is dummy data):
| Ticker | Time | Stock Return | Market Return |
|----------|----------|--------------|---------------|
| Facebook | 12:00:01 | 1% | 1.5% |
| Facebook | 12:00:02 | 1.5% | 2% |
| ... | | | |
| Apple | 12:00:01 | -0.5% | 1.5% |
| Apple | 12:00:03 | -0.3% | 2% |
The data volume is huge: around 1.5 GB of data for each day, and there are 21 years of it that I need to analyze and run regressions on.
Regression formula is something similar to
Stock_Return = beta * Market_Return + alpha
where beta and alpha are two coefficients we are estimating. The coefficients are different for every company and every day.
Now, my question is, how to output the beta & alpha for each company and for each day into a data structure?
I was reading the SAS regression documentation, but it seems that the output is text rather than a data structure.
The code from documentation:
proc reg;
model y=x;
run;
The output from the documentation:
There is no way that I can read over every beta for every company on every single day. There are tens of thousands of them.
Therefore, I was wondering if there is a way to output and extract the betas into a data structure?
I have a background in OOP languages (Python and Java), so SAS can be really confusing sometimes ...
SAS in many ways is very similar to an object oriented programming language, though of course having features of functional languages and 4GLs also.
In this case, there is an object: the output delivery system object (ODS). Every procedure in SAS 9 that produces printed output produces it via the output delivery system, and you can generally obtain that output via ODS OUTPUT if you know the name of the object.
You can use ODS TRACE to see the names of the output produced by a particular proc.
data stocks;
set sashelp.stocks;
run;
ods trace on;
proc reg data=stocks;
by stock;
model close=open;
run;
quit;
ods trace off;
Note the names in the log. Then whatever you want output-wise, you just wrap the proc with ODS OUTPUT statements.
So if I want parameter estimates, I can grab them:
ods output ParameterEstimates=stockParams;
proc reg data=stocks;
by stock;
model close=open;
run;
quit;
ods output close;
You can have as many ODS OUTPUT statements as you want, if you want multiple datasets output.
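If only the coefficients are needed, PROC REG's OUTEST= option is another route, sketched here on the same sashelp.stocks example. The dataset names `est` and `coefs` and the `alpha`/`beta` renames are my own choices for illustration. For the data in the question, the BY statement would list the ticker and day variables instead (names not given in the question), with the input sorted by them.

```sas
/* Sort so that BY-group processing is safe */
proc sort data=sashelp.stocks out=stocks;
  by stock;
run;

/* OUTEST= writes one row of estimates per BY group; NOPRINT suppresses the listing */
proc reg data=stocks outest=est noprint;
  by stock;
  model close=open;
run;
quit;

/* Keep the intercept and the slope on OPEN, renamed to alpha and beta */
data coefs;
  set est (keep=stock Intercept open
           rename=(Intercept=alpha open=beta));
run;
```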

Reconstitute .txt file of HTML table as Dataset in SAS

I am currently using SAS version 9 to try to read in a flat .txt file of an HTML table that I have taken from the following page (entitled Wayne Rooney's Match History):
http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney
I've got the data into a .txt file using a Python web scraper built with Scrapy. The format of my .txt file is as follows:
17-08-2013,1 : 4,Swansea,Manchester United,28',7.26,Assist Assist,26-08-2013,0 : 0,Manchester United,Chelsea,90',7.03,None,14-09-2013,2 : 0,Manchester United,Crystal Palace,90',8.44,Man of the Match Goal,17-09-2013,4 : 2,Manchester United,Bayer Leverkusen,84',9.18,Goal Goal Assist,22-09-2013,4 : 1,Manchester City,Manchester United,90',7.17,Goal Yellow Card,25-09-2013,1 : 0,Manchester United,Liverpool,90',None,Man of the Match Assist,28-09-2013,1 : 2,Manchester United,West Bromwich Albion,90'...
...and so on. What I want is a dataset that has the same format as the original table. I know my way round SAS fairly well, but tend not to use infile statements all that much. I have tried a few variations on a theme, but this syntax has got me the nearest to what I want so far:
filename myfile "C:\Python27\Football Data\test.txt";
data test;
length date $10.
score $6.
home_team $40.
away_team $40.
mins_played $3.
rating $4.
incidents $40.;
infile myfile DSD;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ ;
run;
This returns a dataset with only the first row of the table included. I have tried using fixed widths and pointers to set the dataset dimensions, but because the length of things like team names can change so much, this is causing the data to be reassembled from the flat file incorrectly.
I think I'm most of the way there, but can't quite crack the last bit. If anyone knows the exact syntax I need that would be great.
Thanks
I would read it straight from the web. Something like this; this works about 50% but took a whole ten minutes to write, and I'm sure it could be easily improved.
The basic approach is to use @'string' to read in the text following a string. You might be better off reading this in as a byte stream and doing a regular expression match on <tr> ... </tr> and then parsing that, rather than taking the more brute-force method here.
filename rooney url "http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney" lrecl=32767;
data rooney;
infile rooney scanover;
retain are_reading;
input @;
if find(_infile_,'<table id="player-fixture" class="grid fixture">')
then are_reading=1;
if find(_infile_,'</table>') then are_reading=0;
if are_reading then do;
input @'<td class="date">' date ddmmyy10.
@'class="team-link">' home_team $20.
@'class="result-1 rc">' score $10.
@'class="team-link">' away_team $20.
@'title="Minutes played in this match">' mins_played $10.
@'title="Rating in this match">' rating $6.
;
output;
end;
run;
As far as reading the scrapy output, you should change at least two things:
Add the delimiter. Not truly necessary, but I'd consider the code incorrect without it, unless the delimiter is a space.
Add a trailing "@@" to get SAS to hold the line pointer, since you don't have line feeds in your data.
data want;
infile myfile flowover dlm=',' dsd lrecl=32767;
length date $10
score $6
home_team $40
away_team $40
mins_played $3
rating $4
incidents $40;
input date $
score $
home_team $
away_team $
mins_played $
rating $
incidents $ @@;
run;
Flowover is default, but I like to include it to make it clear.
You also probably want to input the date as a date value (not char), so informat date ddmmyy10.;. The rating is also easily input as numeric if you want to, and both mins played and score could be input as numeric if you're doing analysis on those by adding ' and : to the delimiter list.
Finally, your . on length is incorrect; SAS is nice enough to ignore it, but . is only placed like so for formats.
Here's my final code:
data want;
infile "c:\temp\test2.txt" flowover dlm="',:" lrecl=32767;
informat date ddmmyy10.
score_1 score_2 2.
home_team $40.
away_team $40.
mins_played 3.
rating 4.2
incidents $40.;
input date
score_1
score_2
home_team $
away_team $
mins_played
rating ??
incidents $ @@;
run;
I remove the dsd as that's incompatible with the ' delimiter; if DSD is actually needed then you can add it back, remove that delimiter, and read minutes in as char. I add ?? for rating as it sometimes is "None" so ?? ignores the warnings about that.

How to use SAS to split a string into two variables

I have a dataset as below:
country
United States, Seattle
United Kingdom, London
How can I split country into a data in SAS like:
country city
United States Seattle
United Kingdom London
Use function SCAN() with comma as separator.
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
Another option, INFILE magic (google the term for papers on the topic); useful for parsing many variables from one string and/or dealing with quoted fields and such that would be more work with scan.
filename tempfile "c:\temp\test.txt";
data have;
input @1 country $50.;
datalines;
United States, Seattle
United Kingdom, London
;;;;
run;
data want;
set have;
infile tempfile dlm=',' dsd;
input @1 @@;
_infile_=country;
format newcountry city $50.;
input newcountry $ city $ @@;
run;
tempfile can be any file (or one you create on the fly with any character in it to avoid premature EOF).
Response to:
data test;
set test;
city=scan(country,2,',');
country=scan(country,1,',');
run;
What if I want to split the last comma in the string only, keeping 7410 City?
Example: "Junior 18, Plays Piano, 7410 City
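One possible way to split at the last comma only, as a sketch: locate the last comma with FIND and a negative start position (which searches right to left), then cut the string there with SUBSTR. The dataset name `split_last` and the variable names `full`, `first_part`, and `last_comma` are made up for illustration.

```sas
data split_last;
  length full first_part $60 city $30;
  full = 'Junior 18, Plays Piano, 7410 City';

  /* A negative start position makes FIND search from the right */
  last_comma = find(full, ',', -length(full));

  if last_comma > 0 then do;
    first_part = substr(full, 1, last_comma - 1);     /* everything before the last comma */
    city       = strip(substr(full, last_comma + 1)); /* everything after it, blanks trimmed */
  end;
  else first_part = full;                             /* no comma: keep the whole string */
run;
```

If only the last piece is needed, `strip(scan(full, -1, ','))` returns it directly, since a negative count makes SCAN count words from the right.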