I have the following problem, and I don't really know where to begin. I have a folder called "ALL", and inside that folder there are sub-folders whose names are the dates they were created, in the format DD-MM-YYYY. There is a folder for every day, i.e. no missing days. Inside each of those folders there are numerous .txt files. I would like to read one of these text files from each of the date folders. That file has a naming convention of "thedata_" followed by a random series of numbers.
So for example, if there are 3 date folders in the ALL folder, then I would like to read 3 separate "thedata_" text files into 1 final SAS file. Subsequently, each day a new folder is added, and I want to append the "thedata_" file from that folder to the existing SAS file rather than rerun the script from scratch.
Here's one solution. It uses SAS directory functions (dopen, dnum, dread) to build a dataset listing every matching file in every subfolder, so you do not need to turn on x commands. Each file path is also saved to a macro variable, so you can loop through them and read each file however you'd like. You can also modify this to work with the filevar option of the infile statement.
filename all "Directory/ALL";
data myfiles;
length folder_name
file_name
file
folder_path $5000.
;
/* Folder delimiter */
if("&sysscp." = "WIN") then SLASH = '\';
else SLASH = '/';
/* Open the ALL directory */
did = dopen("all");
/* If it was successful, continue */
if(did) then do;
/* Iterate through all subfolders in ALL */
do i = 1 to dnum(did);
/* Get the subfolder name and full path */
folder_name = dread(did, i);
folder_path = cats(pathname('all'), SLASH, folder_name);
/* Assign a filename statement to the subfolder */
rc = filename('sub', folder_path);
/* Give the sub-folder a directory ID */
did2 = dopen('sub');
/* Open the subfolder and read all the .txt files within it */
if(did2) then do;
do j = 1 to dnum(did2);
file_name = dread(did2, j);
file_ext = scan(file_name, -1, '.');
file = cats(folder_path, SLASH, file_name);
/* Save file name only if the expected value is found */
if(upcase(file_name) =: "THEDATA_" AND upcase(file_ext) = "TXT") then do;
nfiles+1;
call symputx(cats('file', nfiles), file); /* Save each file to a macro variable named file1, file2, etc. */
output;
end;
end;
end;
/* Close the subfolder and move on to the next one */
rc = dclose(did2);
end;
end;
rc = dclose(did);
/* Save the total number of files we found to a macro variable */
call symputx('nFiles', nFiles);
keep file file_name folder_name folder_path;
run;
/* Read all the files */
%macro readFiles;
%do i = 1 %to &nFiles.;
proc import
file = "&&file&i."
out = _thedata_&i.
dbms = csv
replace;
guessingrows=max;
run;
%end;
/* Put all the files together */
data thedata;
set _thedata_:;
run;
proc datasets lib=work nolist;
delete _thedata_:;
quit;
%mend;
%readFiles;
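The filevar variation mentioned above could look roughly like the sketch below. It is only an illustration: the demo setup writes two small hypothetical text files into the WORK directory and builds a stand-in for the myfiles dataset, and because the real layout of the "thedata_" files is not known, each record is read as one raw line (the 2000-character line length is an assumption).

```sas
/* Demo setup (hypothetical): a MYFILES-style list of two text files in WORK */
%let demo_dir = %sysfunc(pathname(work));
data myfiles;
    length file $5000;
    do n = 1 to 2;
        file = cats("&demo_dir", '/thedata_', n, '.txt');
        output;
    end;
    keep file;
run;

/* Write two lines into each demo file */
data _null_;
    set myfiles;
    target = file;
    file out filevar=target;
    put 'line A from file ' _n_;
    put 'line B from file ' _n_;
run;

/* Read every listed file in a single DATA step via FILEVAR= */
data thedata;
    set myfiles(rename=(file=path));
    length source_file $5000 line $2000;
    source_file = path;   /* the FILEVAR= variable itself is not written out */
    infile dummy filevar=path end=done truncover;
    do while (not done);
        input line $char2000.;
        output;
    end;
run;
```

With this pattern there is one DATA step instead of one PROC IMPORT per file, and source_file records which file each row came from. Replace the raw-line INPUT statement with one that matches the real file layout.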
I am trying to iterate over a list of different files and input them.
DO Year = 2000 to 2021 by 1;
filename fileftp ftp year+'.csv.gz' host='ftp.abcgov'
cd='/pub/' user='anonymous'
pass='XXXX' passive recfm=s debug;
INFILE fileftp NBYTE=n;
END;
How do I get it so that year is included in the file name?
Currently, when I try this (year+'.csv.gz') it is trying to recognize year as an option incorrectly.
You need to use the SAS macro facility for this. Since your file is gzip-compressed, you'll also need to decompress it before importing the data.
%macro importData;
%do year = 2000 %to 2021;
filename fileftp ftp "&year..csv.gz"
host = 'ftp.abcgov'
cd = '/pub/'
user = 'anonymous'
pass = 'XXXX'
recfm = s
passive
debug
;
filename download temp;
/* Download the file to a temporary local space */
%let rc = %sysfunc(fcopy(fileftp, download));
/* Unzip the file */
filename unzip zip "%sysfunc(pathname(download))" gzip;
/* Read the data and output it by year */
proc import
file = unzip
out = want&year.
dbms = csv
replace;
run;
%end;
%mend;
%importData;
If fcopy does not work for you, you can use a data step to write one file to another.
data _null_;
infile fileftp;
file download;
input;
put _INFILE_ ;
run;
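To tell whether fcopy actually worked, you can check its return code: it is 0 on success, and sysmsg() returns the text of the last error otherwise. The sketch below is a minimal illustration using two hypothetical temporary filerefs (src and dst) in place of the FTP and download filerefs; the macro name copy_checked is also made up for the example.

```sas
/* Hypothetical demo filerefs: a small source file and a target */
filename src temp;
filename dst temp;

data _null_;
    file src;
    put 'year,value';
    put '2000,1';
run;

%macro copy_checked(from, to);
    %local rc;
    %let rc = %sysfunc(fcopy(&from, &to));
    %if &rc = 0 %then %put NOTE: Copied &from to &to..;
    %else %put ERROR: FCOPY failed with rc=&rc: %qsysfunc(sysmsg());
%mend copy_checked;

%copy_checked(src, dst);
```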
Good Morning
I am trying to download a zip file from a website and save it to a specific location.
The location I want to use is
S:\Projects\
Method 1:
My first attempt is below.
DATA _null_ ;
x 'start https://yehonal.github.io/DownGit/#/home?url=https:%2F%2Fgithub.com%2FCSSEGISandData%2FCOVID-19%2Ftree%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_daily_reports';
RUN ;
With Method 1 I can download the file, but it is automatically saved to my Downloads folder.
Method 2:
Then I found this approach.
filename out "S:\Projects\csse_covid_19_daily_reports.zip";
proc http
url='https://yehonal.github.io/DownGit/#/home?url=https:%2F%2Fgithub.com%2FCSSEGISandData%2FCOVID-19%2Ftree%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_daily_reports'
method="get" out=out;
run;
But this code does not work; nothing is downloaded.
How can I download the file from the web and save it to a specific location?
I would recommend a macro in this case (or CALL EXECUTE on its own), but I prefer writing a macro and then calling it via CALL EXECUTE. This took about a minute to run on SAS OnDemand for Academics (a free cloud service).
*set start date for files;
%let start_date = 01-22-2020;
*macro to import data;
%macro importFullData(date);
*file name reference;
filename out "/home/fkhurshed/WANT/&date..csv";
*file to download;
%let download_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/&date..csv";
proc http url=&download_url
method="get" out=out;
run;
*You can add in data import/append steps here as well as necessary;
%mend;
%importFullData(&start_date.);
data importAll;
start_date=input("&start_date", mmddyy10.);
*runs up to previous day;
end_date=today() - 1;
do date=start_date to end_date;
formatted_date=put(date, mmddyyd10.);
str=catt('%importFullData(', formatted_date, ');');
call execute(str);
end;
run;
When viewed in a browser, the URL uses JavaScript running in the browser to construct a zip file that is then downloaded automatically. Proc HTTP does not run JavaScript, so it cannot download the ultimate response, which is the constructed zip file; thus you get the 404 message.
The list of the files in the repository can be obtained as JSON from the URL
https://api.github.com/repos/CSSEGISandData/COVID-19/contents/csse_covid_19_data/csse_covid_19_daily_reports
The listing data contains the download_url for each csv file.
A download_url will look like
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-22-2020.csv
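A rough sketch of reading that listing from SAS follows. It assumes a SAS release that includes the JSON libname engine, and it assumes the engine exposes the top-level array as a table named ROOT with name and download_url columns; the fileref listing, libref gh, and dataset csv_files are names made up for the example. GitHub also documents an upper limit on how many entries this endpoint returns for one directory, so the listing may be incomplete for a large folder.

```sas
/* Fetch the directory listing as JSON into a temporary file */
filename listing temp;

proc http
    url="https://api.github.com/repos/CSSEGISandData/COVID-19/contents/csse_covid_19_data/csse_covid_19_daily_reports"
    method="get"
    out=listing;
run;

/* Let the JSON engine turn the response into tables (assumed: ROOT) */
libname gh json fileref=listing;

/* Keep one row per csv file with its direct download address */
data csv_files;
    set gh.root;
    where lowcase(scan(name, -1, '.')) = 'csv';
    keep name download_url;
run;

libname gh clear;
filename listing clear;
```

Each download_url value in csv_files can then be passed to a PROC HTTP call like the one in the macro above.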
You can download individual files with SAS per @Reeza, or
use git commands or the SAS git* functions to download the repository
(AFAIK git archive, for downloading only a specific subfolder of a repository, is not surfaced by the GitHub server), or
use svn commands to download a specific folder from a git repository
(requires svn to be installed, https://subversion.apache.org/; I used SlikSvn).
Example:
Make some series plots of a response by date from stacked imported downloaded data.
options noxwait xsync xmin source;
* use svn to download all files in a subfolder of a git repository;
* local folder for storing data from
* COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University;
%let covid_data_root = c:\temp\csse;
%let rc = %sysfunc(dcreate(covid,&covid_data_root));
%let download_path = &covid_data_root\covid;
%let repo_subdir_url = https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports;
%let svn_url = %sysfunc(tranwrd(&repo_subdir_url, tree/master, trunk));
%let os_command = svn checkout &svn_url "&download_path";
/*
* uncomment this block to download the (data) files from the repository subfolder;
%sysexec %superq(os_command);
*/
* codegen and execute the PROC IMPORT steps needed to read each csv file downloaded;
libname covid "&covid_data_root.\covid";
filename csvlist pipe "dir /b ""&download_path""";
data _null_;
infile csvlist length=l;
length filename $200;
input filename $varying. l;
if lowcase(scan(filename,-1,'.')) = 'csv';
out = 'covid.day_'||translate(scan(filename,1,'.'),'_','-');
/*
* NOTE: Starting 08/11/2020 FIPS data first starts appearing after a few hundred rows.
* Thus the high GuessingRows
*/
template =
'PROC IMPORT file="#path#\#filename#" replace out=#out# dbms=csv; ' ||
'GuessingRows = 5000;' ||
'run;';
source_code = tranwrd (template, "#path#", "&download_path");
source_code = tranwrd (source_code, "#filename#", trim(filename));
source_code = tranwrd (source_code, "#out#", trim(out));
/* uncomment this line to import each data file */
*call execute ('%nrstr(' || trim (source_code) || ')');
run;
* memname is always uppercase;
proc contents noprint data=covid._all_ out=meta(where=(memname like 'DAY_%'));
run;
* compute variable lengths for LENGTH statement;
proc sql noprint;
select
catx(' ', name, case when type=2 then '$' else '' end, maxlen)
into
:lengths separated by ' '
from
( select name, min(type) as type, max(length) as maxlen, min(varnum) as minvarnum, max(varnum) as maxvarnum
from meta
group by name
)
order by minvarnum, maxvarnum
;
quit;
* stack all the individual daily data;
data covid.all_days;
attrib date length=8 format=mmddyy10.;
length &lengths;
set covid.day_: indsname=dsname;
date = input(substr(dsname,11),mmddyy10.);
format _character_; * reset length based formats;
informat _character_; * reset length based informats;
run ;
proc sort data=covid.all_days out=us_days;
where country_region = 'US';
by province_state admin2 date;
run;
ods html gpath='.' path='.' file='covid.html';
options nobyline;
proc sgplot data=us_days;
where province_state =: 'Cali';
*where also admin2=: 'O';
by province_state admin2;
title "#byval2 County, #byval1";
series x=date y=confirmed;
xaxis valuesformat=monname3.;
label province_state='State' admin2='County';
label confirmed='Confirmed (cumulative?)';
run;
ods html close;
options byline;
My coworker and I have three zip files, representing three iterations of a monthly download from CMS of the NPPES Data Dissemination (March, April, and May). We use the following code to extract what we need from the newest zip file and create a fairly compact dataset.
PROC IMPORT OUT=NPI_Layout
DATAFILE= "&dir./NPI File Layout.xlsx"
DBMS=XLSX REPLACE;
SHEET="Sheet1";
RUN;
options compress = yes;
data npi_layout;
set npi_layout;
length infmt fmt inpt $60. lbl $200.;
if type = 'NUMBER' then do;
infmt = 'informat '||compress(field)||' '||compress(length)||'.;';
fmt = 'format '||compress(field)||' '||compress(length)||'.;';
inpt = compress(field);
end;
else if type = 'VARCHAR' then do;
infmt = 'informat '||compress(field)||' $'||compress(length)||'.;';
fmt = 'format '||compress(field)||' $'||compress(length)||'.;';
inpt = compress(field)||' $';
end;
else if type = 'DATE' then do;
infmt = 'informat '||compress(field)||' mmddyy10.;';
fmt = 'format '||compress(field)||' date9.;';
inpt = compress(field);
end;
lbl = 'label '||compress(field)||" = '"||trim(label)||"';";
run;
proc sql noprint;
select infmt
,fmt
,inpt
,lbl
into :infmt1 -
,:fmt1 -
,:inpt1 -
,:lbl1 -
from npi_layout;
quit;
%macro loop;
%let infmt_stmnt = ;
%let fmt_stmnt = ;
%let inpt_stmnt = input;
%let lbl_stmnt = ;
%do i = 1 %to &sqlobs;
%let infmt_stmnt = &infmt_stmnt &&infmt&i;
%let fmt_stmnt = &fmt_stmnt &&fmt&i;
%let inpt_stmnt = &inpt_stmnt &&inpt&i;
%let lbl_stmnt = &lbl_stmnt &&lbl&i;
%end;
data npi.npi;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile inzip(npidata_pfile_20050523-20180513.csv)
delimiter = ',' MISSOVER DSD lrecl = 32767 firstobs = 2;* obs = 10000;
&infmt_stmnt;
&fmt_stmnt;
&inpt_stmnt;
&lbl_stmnt;
run;
%mend loop;
%loop;
When we run the above code on the file from March, we get a successful output. However, when we try to run it on the April and May downloads, we get the following error:
Error in Log
ERROR: Open failure for
*dir*/NPI/Downloads/NPPES_Data_Dissemination_May_2018.zip
during attempt to create a local file handle.
Google only returns a single result, which indicates that it's an error that pops up when a filename (or path, presumably) is wrong. We've double-checked the path and filename multiple times, and it's all correct (and, obviously, the code works on the March file). Additionally, if I change the code so it's trying to pull a non-existent .csv from the zip file, it gives me a different error about that file not existing within the zip, so it's clearly seeing the zip file in the first place. We're not really sure what's going on; any advice?
(The data is sourced from http://download.cms.gov/nppes/NPI_Files.html, if you want to check the file for yourself.)
Did you try adding quotes around the member name?
infile inzip("npidata_pfile_20050523-20180513.csv") ...
Saw the same ERROR message on Windows 10 64-bit with plenty of RAM and disk space.
The Windows internals used by the ZIP engine are likely dealing with streams, which involve file handles. So I suspect the ZIP engine is trying to allocate too much RAM, or too large an intermediary file, when extracting the 6GB "npidata_pfile_20050523-20180513.csv".
Submit the issue to SAS Support -- There might be some session settings that would let the engine work against the file. If not, you'll have to extract the file outside SAS.
How large were the April and May pfile sizes?
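As a quick diagnostic, you can check whether the ZIP engine can open an archive and enumerate its members with the directory functions. The sketch below is only an illustration under stated assumptions: it builds a tiny hypothetical archive (demo.zip, containing demo.csv) in the WORK directory so that it runs on its own; point the inzip fileref at the real download instead to test that file.

```sas
/* Build a tiny demo archive in WORK so the sketch is self-contained */
%let zip_path = %sysfunc(pathname(work))/demo.zip;
filename zout zip "&zip_path" member="demo.csv";

data _null_;
    file zout;
    put 'id,name';
    put '1,example';
run;
filename zout clear;

/* List the members of the archive */
filename inzip zip "&zip_path";

data zip_members(keep=memname);
    length memname $256;
    did = dopen('inzip');
    if did = 0 then do;
        msg = sysmsg();
        putlog 'ERROR: could not open archive: ' msg;
        stop;
    end;
    do i = 1 to dnum(did);
        memname = dread(did, i);
        output;
    end;
    rc = dclose(did);
run;

filename inzip clear;
```

If the listing succeeds but reading the large member still fails, that points toward the size of the member rather than the path.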
I could navigate to the library path in the file explorer, find my dataset.sas7bdat file, and look at its size.
But that's not practical. I'm not even sure where the WORK library is located, and even if I were, it's on a remote server, which makes it complicated to access.
Is it possible to print the size of a dataset in the log or a report?
I am using SAS EG 7.1 and SAS 9.3.
Motive: I want to reduce a dataset's size, and I would like to know how much I gained.
You can use the SQL dictionary view dictionary.tables :
/* Return libref, dataset name, # records, filesize & % compression */
proc sql ;
create table size1 as
select libname, memname, nlobs, filesize, pcompress
from dictionary.tables
where libname = 'WORK'
and memname = 'MYDATA'
;
quit ;
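To get the number straight into the log, for example before and after shrinking a dataset, the same query can feed a macro variable. This is a minimal sketch: the demo table work.mydata (copied from sashelp.class) and the macro name log_size are assumptions made up for the example.

```sas
/* Demo table so the sketch runs on its own (hypothetical name) */
data work.mydata;
    set sashelp.class;
run;

%macro log_size(lib, ds);
    %local size_bytes;
    %let size_bytes = ;
    proc sql noprint;
        select filesize format=20. into :size_bytes trimmed
        from dictionary.tables
        where libname = "%upcase(&lib)" and memname = "%upcase(&ds)";
    quit;
    %if %length(&size_bytes) = 0 %then %put WARNING: &lib..&ds not found.;
    %else %put NOTE: &lib..&ds uses &size_bytes bytes (%sysevalf(&size_bytes / 1048576) MB).;
%mend log_size;

%log_size(work, mydata);
```

Calling %log_size once before and once after the size reduction gives the two figures to compare.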
As long as your library uses the BASE engine, you can use the pathname() function to find it. After that you can use the sashelp views to get the filesize. You could also use os commands, but for that you need x command enabled.
The following demonstrates:
%let libds=WORK.SAMPLE;
/* create demo data */
data &libds;
retain string 'blaaaaaaaah';
do x=1 to 10000;
output;
end;
run;
/* extract library and dataset from libds variable */
%let lib=%scan(&libds,1,.);
%let ds =%scan(&libds,2,.);
/* set up filename to point directly at the dataset */
/* note - if you have indexes you also need to do */
/* this for the .sas7bndx extension */
filename fref "%sysfunc(pathname(&lib))/&ds..sas7bdat";
/* query dictionary view for file attributes */
data _null_;
set sashelp.vextfl(where=(fileref='FREF'));
filesize=filesize/(1024**2); /* size in MB */
putlog filesize= 'MB';
run;
/* clear fileref */
filename fref clear;
I would like to iterate through the files in a particular folder and extract substrings of their file names. Is there an easy way to do this?
lib dir '.../folder';
#iterate through all files of dir and extract first five letters of file name;
#open files and do some processesing, aka data steps proc steps;
Firstly, I'd point out that "dir" appears to be a misspelt libref, not a folder. If you are looking for files in a folder, you can use:
%macro get_filenames(location);
filename _dir_ "%bquote(&location.)";
data filenames(keep=fname);
handle=dopen( '_dir_' );
if handle > 0 then do;
count=dnum(handle);
do i=1 to count;
fname=subpad(dread(handle,i),1,5);/* extract first five letters */
output filenames;
end;
end;
rc=dclose(handle);
run;
filename _dir_ clear;
%mend;
%get_filenames(c:\temp\); /* no quotes here: the macro adds them */
If you are looking for datasets in a library, you can use:
proc sql;
create table datasets as
select substr(memname,1,5) as dataset
from dictionary.tables
where libname='LIB'; /* must be uppercase */
quit;
Either approach will produce a dataset of 'files' which can subsequently be 'stepped through'.
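One way to do the stepping through is to call a macro once per row with CALL EXECUTE. The sketch below is only an illustration: the filenames dataset here is a small hypothetical stand-in with made-up values, and process_one is a placeholder macro where the real data steps and proc steps would go. It assumes the extracted prefixes contain no commas or parentheses.

```sas
/* Demo stand-in for the FILENAMES dataset (hypothetical values) */
data filenames;
    length fname $5;
    input fname $;
    datalines;
sales
stock
;
run;

/* Placeholder for the real per-file processing */
%macro process_one(prefix);
    %put NOTE: processing files whose names start with &prefix..;
%mend process_one;

/* Queue one macro call per row; %nrstr delays macro execution */
/* until after this DATA step has finished                     */
data _null_;
    set filenames;
    call execute(cats('%nrstr(%process_one)(', fname, ')'));
run;
```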