Reading in messy csv data with data step - sas

timeOccur,timeReport,location,details,disposition
01/02/2021 11:20 am,,Sports Complex Road,Incident: PATROL ACTIVITY - Patrol Check<br /> CAD #: 21-01-02-000062,In Service
01/02/2021 11:20 am,,Corporation,Incident: ALARM - Burglary Alarm<br /> CAD #: 21-01-02-000063,Verified Entry/Field Interview
01/02/2021 10:20 am,,Grand Avenue Parking Structure,Incident: PATROL ACTIVITY - Patrol Check<br /> CAD #: 21-01-02-000061,In Service
12/30/2020 11:02 pm,12/30/2020 11:02 pm - 12/30/2020 11:50 pm,Canyon Circle Parking Structure,Incident: TRAFFIC COLLISION - Traffic Collision/Unknown Injury // ALCOHOL - Possession of false identification // TRAFFIC-ALCOHOL - DUI Alcohol // TRAFFIC - ALCOHOL - DUI >.08%<br /> Report #: 20074,Arrest
I need help reading in this data.
Data CP_Crime;
infile '/home/u60629206/SAS330_Handouts/Crime Scrape 01022021.csv' DLM = ',' DSD firstobs=2 MISSOVER;
input Time_Occur anydtdtm19. Time_Report anydtdtm19. Location :$60. Details :$150. Disposition $35.;
run;
I am unsure how to read Time_Report when it contains a range like that. I am not looking to format it, so whatever Base SAS returns is fine, as with Time_Occur.

Treat Time_Report as a character string, then convert each datetime within the string to a SAS datetime. Since the two datetimes are separated by a -, we can use the scan function to read each one and convert them to SAS datetimes.
Data CP_Crime;
infile '/home/u60629206/SAS330_Handouts/Crime Scrape 01022021.csv' firstobs=2 DLM = ',' DSD MISSOVER;
/* The : modifier on every informat makes each field stop at the delimiter. */
input Time_Occur :anydtdtm19. Time_Report :$50. Location :$60. Details :$150. Disposition :$35.;
/* STRIP removes the blanks left around the '-' before conversion. */
Time_Report_Start = input(strip(scan(time_report, 1, '-')), anydtdtm.);
Time_Report_End = input(strip(scan(time_report, 2, '-')), anydtdtm.);
format Time_Occur
Time_Report_Start
Time_Report_End mdyampm.
;
drop time_report;
run;
Output:
Time_Occur Time_Report_Start Time_Report_End Location ...
1/2/2021 11:20 AM . . Sports Complex Road ...
1/2/2021 11:20 AM . . Corporation ...
1/2/2021 10:20 AM . . Grand Avenue Parking Structure ...
12/30/2020 11:02 PM 12/30/2020 11:02 PM 12/30/2020 11:50 PM Canyon Circle Parking Structure ...
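To check the scan-and-convert step on its own, the sketch below applies it to one hard-coded value taken from the sample data. The variable names rpt_start, rpt_end and duration_min are made up for illustration. It assumes the two datetimes are always separated by a - and never contain a - themselves, which holds for mm/dd/yyyy values but not for ISO-style dates.

```sas
data _null_;
  length time_report $50;
  time_report = '12/30/2020 11:02 pm - 12/30/2020 11:50 pm';
  /* split on '-' and strip the blanks around each piece before converting */
  rpt_start = input(strip(scan(time_report, 1, '-')), anydtdtm19.);
  rpt_end   = input(strip(scan(time_report, 2, '-')), anydtdtm19.);
  /* SAS datetimes count seconds, so the difference divided by 60 is minutes */
  duration_min = (rpt_end - rpt_start) / 60;
  put rpt_start= datetime19. rpt_end= datetime19. duration_min=;
run;
```

A blank Time_Report should simply give missing values for both pieces, which is what the first three rows of the output above show.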

Related

Query on mixed input style SAS data. This code is not giving me proper output

DATA nationalparks;
INPUT @1 ParkName $ 1-21 @23 State $ Year @40 Acreage COMMA9.;
DATALINES;
Yellowstone ID/MT/WY 1872 4,065,493
Everglades FL 1934 1,398,800
Yosemite CA 1864 760,91
Great Smoky Mountains NC/TN 1926 520,269
;
RUN;
This SAS code is not showing proper result set.
Your program looks fine to me if the data really is in the columns your code is using.
Make sure you haven't indented the datalines. Also check that your editor hasn't replaced spaces with TAB characters. Make sure that STATE and YEAR always have a value and that the value does not include any spaces. You can use . to mark missing values, even for the character variable STATE. Alternatively, read STATE and YEAR using column input, and then blanks will be treated as missing. There is no need to use formatted input for the last variable: if you add the : modifier, SAS will use list mode and adjust the width used on the informat to match the width of the next value on the line. But again, if ACREAGE is missing then use . to mark that, or add an INFILE DATALINES TRUNCOVER; statement before the INPUT statement.
DATA nationalparks;
INPUT ParkName $ 1-21 State $ 23-30 Year 32-35 Acreage :COMMA.;
*---+----10---+----20---+----30---+----40---+----50;
DATALINES;
Yellowstone           ID/MT/WY 1872 4,065,493
Everglades            FL       1934 1,398,800
Yosemite              CA       1864 760,91
Great Smoky Mountains NC/TN    1926 520,269
;
RUN;
Results:
Obs ParkName              State    Year Acreage
1   Yellowstone           ID/MT/WY 1872 4065493
2   Everglades            FL       1934 1398800
3   Yosemite              CA       1864   76091
4   Great Smoky Mountains NC/TN    1926  520269
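The TRUNCOVER suggestion can be sketched as follows. The rows are hypothetical variations of the sample data: one has a blank State and one has no Acreage at all. Column input should turn the blank State into a missing value, and TRUNCOVER should keep the INPUT statement from moving on to the next line when Acreage is absent.

```sas
data nationalparks;
  /* TRUNCOVER: a short line leaves the remaining variables missing */
  infile datalines truncover;
  /* column input for the first three variables, list input for Acreage */
  input ParkName $ 1-21 State $ 23-30 Year 32-35 Acreage :comma12.;
datalines;
Yellowstone           ID/MT/WY 1872 4,065,493
Everglades                     1934 1,398,800
Yosemite              CA       1864
;
run;

proc print data=nationalparks;
run;
```

The data lines must start in column 1 and use spaces, not tabs, for the column positions to line up.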

SAS: reading dates with different formats

In SAS, how to read the following dates with different formats? (especially 01/05/2018 and 1/6/2018)
01/05/2018
1/6/2018
Jan 05 2018
Jan 6 2018
Any help is greatly appreciated. Thanks!
The ANYDTDTM informat will parse most varieties of human readable date, time or datetime representations into a SAS datetime value. The datepart function of that value will return the SAS date value thereof.
The ANYDTDTE informat will also parse a variety of date, time or datetime representations and return the date part implicitly. However it fails on some of your data items where ANYDTDTM does not.
data _null_;
input
@1 a_datetime_value anydtdtm.
@1 a_date_value anydtdte.
;
hot_date = datepart(a_datetime_value);
put
'_infile_ ' _infile_
/ 'anydtdtm. ' a_datetime_value datetime16.
/ 'datepart() ' hot_date yymmdd10.
/ 'anydtdte. ' a_date_value yymmdd10.
/;
datalines;
01/05/2018
1/6/2018
Jan 05 2018
Jan 6 2018
run;
==== LOG ====
_infile_ 01/05/2018
anydtdtm. 05JAN18:00:00:00
datepart() 2018-01-05
anydtdte. .
_infile_ 1/6/2018
anydtdtm. 06JAN18:00:00:00
datepart() 2018-01-06
anydtdte. 2018-01-06
_infile_ Jan 05 2018
anydtdtm. 05JAN18:00:00:00
datepart() 2018-01-05
anydtdte. .
_infile_ Jan 6 2018
anydtdtm. 06JAN18:00:00:00
datepart() 2018-01-06
anydtdte. .
Read the SAS documentation and conference papers for a greater exploration of the ANYDT** family of informats.
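The missing ANYDTDTE results in the log are consistent with a width problem: when no width is given, ANYDTDTE reads 9 characters while ANYDTDTM reads 19, so a 10-character value such as 01/05/2018 is cut short. The sketch below gives an explicit width instead. The dataset and variable names are illustrative, and whether every item parses can still depend on the SAS release and the DATESTYLE option.

```sas
data dates;
  infile datalines truncover;
  /* read the whole line as text, then convert with an explicit width */
  input raw $char32.;
  date_value = input(raw, anydtdte32.);
  format date_value yymmdd10.;
  put raw= date_value=;
datalines;
01/05/2018
1/6/2018
Jan 05 2018
Jan 6 2018
;
run;
```

Reading the line as text first also keeps the original string available for checking any value that still comes back missing.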

Filling the series in SAS

I have a dataset which looks something like this:
YYMM Sector
1701 Agriculture
1611 Retail
1501 CRE
There is another dataset which looks something like this:
Customer_ID YYMM
XXXX 1702
XXXX 1701
XXXX 1612
XXXX 1611
XXXX 1610
XXXX 1510
XXXX 1509
So basically I want to merge these two datasets on YYMM and bring in the sectors. But since the first dataset has only a few YYMM values, I want to copy each sector down until a new YYMM from the first dataset is encountered.
So the sector from 1701 to 1612 should be Agriculture, the sector from 1611 to 1502 should be Retail, and for any month before 1501 it has to be CRE.
Can you please tell me how to do it?
Here is a SQL based solution (similar to the one proposed by pinegulf).
Let us create test datasets:
data T01;
length Sector $20;
infile cards;
input YYMM_to Sector;
cards;
1701 Agriculture
1611 Retail
1501 CRE
;
run;
data T02;
length Customer_id $10;
infile cards;
input Customer_ID YYMM;
cards;
AXXX 1702
BXXX 1701
CXXX 1612
DXXX 1611
EXXX 1610
FXXX 1510
GXXX 1509
;
run;
We can add a "YYMM_from" column to T01:
proc sort data=T01;
by YYMM_to;
run;
data T01;
set T01;
by YYMM_to;
YYMM_from=lag(YYMM_to);
if _N_=1 then YYMM_from=0;
run;
proc print data=T01;
run;
We get:
Obs Sector YYMM_to YYMM_from
------------------------------------------
1 CRE 1501 0
2 Retail 1611 1501
3 Agriculture 1701 1611
Then comes the join:
proc sql;
create table T03 as
select a.*, b.Sector
from T02 a LEFT JOIN T01 b
on YYMM_from<a.YYMM<=YYMM_to;
quit;
proc print data=T03;
run;
We get:
Obs Customer_id YYMM Sector
-----------------------------------------
1 DXXX 1611 Retail
2 EXXX 1610 Retail
3 FXXX 1510 Retail
4 GXXX 1509 Retail
5 BXXX 1701 Agriculture
6 CXXX 1612 Agriculture
7 AXXX 1702
Here is a solution with proc format. Since your data is in yymm format, you can set the limits logically without converting the data, but I feel more comfortable with actual dates.
data Begin;
input Customer_ID $ YYMM $;
cards;
XXXX 1702
YYYY 1701
ZZZZ 1612
OOOO 1611
AAAA 1610
FFFF 1510
DDDD 1509
; run;
data with_date;
set begin;
date = mdy(substr(yymm,3,2), 1, substr(yymm,1,2) );
run;
proc format; /*Didn't check the bins too much. Adjust as needed.*/
value sector
low - '1jan2015'd ='lows'
'1jan2015'd < - '1nov2016'd = 'CRE'
'1nov2016'd < - '1jan2017'd = 'Retail'
'1jan2017'd < - high = 'Agriculture'
;
run;
data wanted;
set with_date;
format date sector.;
run;
For more on proc format see http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473474.htm
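If the goal is an actual Sector variable rather than a formatted date, the PUT function can apply the format. The sketch below is one way to do that, with bins chosen to match the boundaries used in the SQL answer (up to 1501 is CRE, 1502 to 1611 is Retail, 1612 to 1701 is Agriculture). The format name sectorfmt, the blank label for later months, and building the date with the YYMMDD6. informat are all assumptions made for illustration.

```sas
/* bins on the first day of each month; <- excludes the lower bound */
proc format;
  value sectorfmt
    low           - '01jan2015'd = 'CRE'
    '01jan2015'd <- '01nov2016'd = 'Retail'
    '01nov2016'd <- '01jan2017'd = 'Agriculture'
    other                        = ' ';
run;

data wanted;
  input Customer_ID $ YYMM $;
  length Sector $20;
  /* yymm plus '01' read as yymmdd gives the first day of the month */
  date = input(cats(YYMM, '01'), yymmdd6.);
  /* PUT returns the format label as a character value */
  Sector = put(date, sectorfmt.);
  format date date9.;
datalines;
XXXX 1702
YYYY 1701
ZZZZ 1612
OOOO 1611
AAAA 1610
FFFF 1510
DDDD 1509
;
run;
```

Two-digit years are resolved through the YEARCUTOFF option, so the bins should be checked if the data reaches back far enough for that to matter.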

How can I read in data with uneven spacings?

I have data which doesn't appear to have consistent spacings or positioning. It looks like:
1675 C Street , Suite 201
Anchorage AK 99501
61.205475 -149.886882
600 Azalea Road
Mobile AL 36609
30.656824 -88.148781
1601 Harbor Bay Parkway , Suite 150
Alameda CA 94502
37.726114 -122.240546
1900 Point West Way, Suite 270
Sacramento CA 95815
38.5994175 -121.4315844
3600 Wilshire Blvd., Suite 1500
Los Angeles CA 90010
34.06153 -118.303463
From this I'd like to extract the street address, city name, state, zip code, lat, and long. I thought the following code would work, but it produces very weird results.
data voa;
input Address $50.;
input City $ State $ Zip;
input Latitude Longitude;
datalines;
I think the issue comes from the fact that there isn't consistent spacing or positioning of the elements.
Your data will work fine using LIST input. You just need to add the & modifier to CITY, which lets a value contain single embedded blanks and ends it only at two or more consecutive blanks. CITY also needs to be a bit longer, $16 or so.
input City &$16. State $ Zip;
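A fuller sketch of that approach is shown below. It assumes that the real data has at least two blanks between the city and the state, which the & modifier needs to find the end of the city name; the two sample addresses are retyped here with that spacing.

```sas
data voa;
  infile datalines truncover;
  input Address $50.;
  /* & lets City contain single blanks; two blanks end the value */
  input City & $16. State $ Zip;
  input Latitude Longitude;
datalines;
600 Azalea Road
Mobile  AL 36609
30.656824 -88.148781
3600 Wilshire Blvd., Suite 1500
Los Angeles  CA 90010
34.06153 -118.303463
;
run;
```

If the city and state are separated by only one blank, the & modifier will read past the state, and the scan-based approach is the safer choice.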
In the absence of consistent delimiters or fixed width fields, this is easier to do using scan:
data want;
infile cards truncover;
length STATE $2 CITY $32;
input Address $50.;
input;
ZIP = input(scan(_INFILE_, -1),5.);
STATE = scan(_INFILE_, -2);
CITY = trim(substr(_INFILE_,1,index(_INFILE_,STATE) - 1));
input Latitude Longitude;
cards;
1675 C Street , Suite 201
Anchorage AK 99501
61.205475 -149.886882
600 Azalea Road
Mobile AL 36609
30.656824 -88.148781
1601 Harbor Bay Parkway , Suite 150
Alameda CA 94502
37.726114 -122.240546
1900 Point West Way, Suite 270
Sacramento CA 95815
38.5994175 -121.4315844
3600 Wilshire Blvd., Suite 1500
Los Angeles CA 90010
34.06153 -118.303463
;
run;

SAS Import from .CSV with line break

I have a .csv file with line breaks inside a field and want to import it into SAS, but I am having problems with values such as CUSTOMER that contain line breaks (wrapped text). Please help me overcome this problem. Some other variables have the same issue. If I import the file manually it works fine. Please find the example below; see SLN PJ0136 for the problem.
SLN MOD PM NE CUSTOMER
32121 GG 1 1 AVAILABLE UPON REQUEST
71403 EN 1 0 JET SUPPORT SERVICE INC.
305173 EN 1 1 UNKNOWN / COTTONWOOD, LLC / J SUPPORT SERVICE, INC.
PJ0136 PS 0 0 "UNKNOWN / GROUP B-50 INC AA
TC0004 anada CSC Europe
Inglewood Ava"
EB0162 RG 0 0 ATR
I used infile to import
DATA WORK.test1;
%let _EFIERR_ = 0;
INFILE 'C:\Users\26631.IELPWC\Downloads\test.csv'
delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
INFORMAT
SLN $CHAR6. MOD $CHAR2. PM BEST1. NE BEST1. CUSTOMER $CHAR82. ;
FORMAT
SLN $CHAR6. MOD $CHAR2. PM BEST1. NE BEST1. CUSTOMER $CHAR82. ;
INPUT
SLN $ MOD $ PM NE CUSTOMER $ ;
if _ERROR_ then call symputx('_EFIERR_',1);
RUN;
Please see the wrong output
32121 GG 1 1 AVAILABLE UPON REQUEST
71403 EN 1 0 JET SUPPORT SERVICE INC.
305173 EN 1 1 UNKNOWN / COTTONWOOD, LLC / J SUPPORT SERVICE, INC.
PJ0136 PS 0 0 "UNKNOWN / GROUP B-50 INC AA
TC0004 . .
24719 . .
" . .
EB0162 RG 0 0 ATR
Assuming that your input data is in the following format:
SLN,MOD,PM,NE,CUSTOMER
32121,GG,1,1,AVAILABLE UPON REQUEST
71403,EN,1,0,JET SUPPORT SERVICE INC.
305173,EN,1,1,"UNKNOWN / COTTONWOOD, LLC / J SUPPORT SERVICE, INC."
PJ0136,PS,0,0,"UNKNOWN / GROUP B-50 INC AA
TC0004 anada CSC Europe
Inglewood Ava"
EB0162,RG,0,0,ATR
The following SAS code will produce required output:
data TEST (drop=_TMP_:);
format SLN $6. MOD $2. PM 8. NE 8. CUSTOMER $82. _TMP_STR $100.;
infile 'input.csv' truncover firstobs=2 dlm=',' dsd lrecl=10000;
input SLN MOD PM NE _TMP_STR @;
_TMP_COUNT=0;
do until(mod(_TMP_COUNT, 2) = 0);
CUSTOMER=catx('0A'x, CUSTOMER, _TMP_STR);
_TMP_COUNT=_TMP_COUNT + countc(_TMP_STR, '"');
if mod(_TMP_COUNT, 2) then do;
input _TMP_STR;
end;
end;
CUSTOMER=dequote(CUSTOMER);
run;
Please note that the value of the CUSTOMER column where SLN='PJ0136' is multiline (Unix style line breaks). You can remove the line breaks by changing the separator passed to catx(...) accordingly.