Construct SAS dataset based on file containing metadata - sas

I have two text files, one containing raw data with no headers and another containing the associated column names and lengths. I'd like to use these two files to construct a single SAS dataset containing the data from one file with the column names and lengths from the other.
The file containing the data is a fixed-width text file. That is, each column of data is aligned to a particular column of the text file, padded with spaces to ensure alignment.
datafile.txt:
John   45   Has two kids
Marge  37   Likes books
Sally  29   Is an astronaut
Bill   60   Drinks coffee
The file containing the metadata is tab-delimited with two columns: one with the name of the column in the data file and one with the character length of that column. The names are listed in the order in which they appear in the data file.
metadata.txt:
Name 7
Age 5
Comments 15
My goal is to have a SAS dataset that looks like this:
Name   | Age  | Comments
-------+------+-----------------
John   | 45   | Has two kids
Marge  | 37   | Likes books
Sally  | 29   | Is an astronaut
Bill   | 60   | Drinks coffee
I want every column to be character with the length specified in the metadata file.
There has to be a better way than my naive approach, which is to construct a length statement and an input statement using the imported metadata, like so:
/* Import metadata */
data meta;
length colname $ 50 collen 8;
infile 'C:\metadata.txt' dsd dlm='09'x;
input colname $ collen;
run;
/* Construct LENGTH and INPUT statements */
data _null_;
length lenstmt inptstmt $ 1000;
retain lenstmt inptstmt '' colstart 1;
set meta end=eof;
call catx(' ', lenstmt, colname, '$', collen);
call catx(' ', inptstmt, cats('@', colstart), colname, '$ &');
colstart + collen;
if eof then do;
call symputx('lenstmt', lenstmt);
call symputx('inptstmt', inptstmt);
end;
run;
/* Import data file */
data datafile;
length &lenstmt;
infile 'C:\datafile.txt' dsd dlm='09'x;
input &inptstmt;
run;
This gets me what I need, but there has to be a cleaner way. One could run into trouble with this approach if insufficient space is allocated to the variables storing the length and input statements, or if the statement lengths exceed the maximum macro variable length.
Any ideas?

What you're doing is a fairly standard method of doing this. Yes, you could check things a bit more carefully; I would allocate $32767 for the two statements, for example, just to be cautious.
There are some ways you can improve this, though, that may take some of your worries away.
First off, a common solution is to build this at the row level (as you do) and then use proc sql to create the macro variable. This has a larger maximum length than the data step method: a single data step character variable tops out at 32,767 bytes (unless you use multiple variables), whereas a macro variable built by proc sql can hold roughly double that, about 64 KB.
proc sql;
select catx(' ',colname,'$',collen)
into :lenstmt separated by ' '
from meta; *and similar for inputstmt;
quit;
Second, you can get past the 64 KB limit by writing to a file instead of to a macro variable. Take your data step and, instead of accumulating the statements and then using call symputx, write each line out to a temp file (or two). Then %include those files in the input data step instead of referencing the macro variables. Yes, you can %include in the middle of a data step.
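A minimal sketch of the temp-file approach described above, under stated assumptions: the filerefs rawdata and stmts and the variable infmt are hypothetical names, and the two small input files are recreated inline so the example stands alone. It writes one formatted INPUT statement (so no separate LENGTH statement is needed) rather than the asker's exact statements.

```sas
/* Hypothetical filerefs; TEMP files are removed when the session ends */
filename rawdata temp;
filename stmts temp;

/* Stand-in for metadata.txt from the question */
data meta;
  length colname $50 collen 8;
  input colname $ collen;
  datalines;
Name 7
Age 5
Comments 15
;

/* Stand-in for the fixed-width datafile.txt */
data _null_;
  file rawdata;
  put 'John   45   Has two kids';
  put 'Marge  37   Likes books';
run;

/* Write one INPUT statement to the temp file, one line per column */
data _null_;
  set meta end=eof;
  file stmts;
  retain colstart 1;
  length infmt $20;
  infmt = cats('$', collen, '.');      /* e.g. $7. */
  if _n_ = 1 then put 'input';
  put '  @' colstart colname infmt;    /* e.g. @1 Name $7. */
  colstart + collen;
  if eof then put ';';
run;

/* %INCLUDE pulls the generated statement into the middle of the step */
data datafile;
  infile rawdata truncover;
  %include stmts;
run;
```

Because each column is written as its own line, the generated code is not constrained by the length of any single character or macro variable.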
There are other methods, but these two are the most common and should work for most use cases. Some other methods include call execute, run_macro, or using file open commands to work with the file directly. In general, those are either more complicated or less useful than the most common two, although certainly they are also acceptable solutions and not uncommon to see in practice.

call execute should be able to help.
data _null_;
retain start 0;
infile 'c:\metadata.txt' missover end=eof;
if _n_=1 then do;
start=1;
call execute('data final_output; infile "c:\datafile.txt" truncover; input ');
end;
input colname :$8.
collen :8.
;
call execute( '@'|| put(start,8. -l) || ' ' || colname || ' $'|| put(collen,8. -r) ||'. ' );
start=sum(start,collen);
if eof then do;
call execute(';run;');
end;
run;
proc contents data=final_output;run;

Related

How to create a SAS dataset for each individual trading day (TAQ) data and save them to a file

I have the daily trading data (TAQ data) for a month. I am trying to unzip each of them.
The folder's name is EQY_US_ALL_TRADE_202107.
It has several zipped (GZ files) files for each trading day named as
EQY_US_ALL_TRADE_202210701
EQY_US_ALL_TRADE_202210702
EQY_US_ALL_TRADE_202210703 ...
EQY_US_ALL_TRADE_202210729
I want to create a SAS dataset for each individual day and save them to a file. As far as I understand, I need a do loop that goes through a month of daily TAQ data, calculates the trade duration, and saves only the relevant data to a file so that each saved dataset is small; then I have to aggregate them all. To calculate trade duration, I am just taking the difference of the "DATETIME" variable (e.g. dif(datetime)).
Until now, I have been working with one working directory (D:\MainDataset) and doing all calculations in it, starting with unzipping the files. But this takes too much time and disk space. I need to create a separate dataset for each trading day and save each one to a file.
data "D:\MainDataset" (keep= filename time exchange symbol saleCondition tradeVolume tradePrice tradeStopStock
tradeCorrection sequenceNumber tradeId sourceOfTrade tradeReportingFacility
participantTimeStamp tradeReportingFacilityTimeStamp);
length folderef $8 time $15. exchange $1. symbol $17. saleCondition $4. tradeStopStock $1.
sourceOfTrade $1. tradeReportingFacility $1.
participantTimeStamp $15. tradeReportingFacilityTimestamp $15.;
rc=filename(folderef,"D:\EQY_US_ALL_TRADE_202107");
did = dopen(folderef);
putlog did=;
/* do k = 1 to dnum(did); Use this to run the loop over all files in the folder */
do k = 1 to 3;
filename = dread(did,k);
putlog filename=;
if scan(filename,-1,'.') ne 'gz' then continue;
fullname = pathname(folderef) || '\' || filename;
putlog fullname=;
do while(1);
infile archive zip filevar=fullname gzip dlm='|' firstobs=2 obs=5000000 dsd truncover eof=nextfile;
input time exchange symbol saleCondition tradeVolume tradePrice tradeStopStock
tradeCorrection sequenceNumber tradeId sourceOfTrade tradeReportingFacility
participantTimeStamp tradeReportingFacilityTimeStamp;
output;
end;
nextfile:
end;
stop;
run;
Proc contents data = "D:\MainDataset";
run;
proc print data ="D:\MainDataset" (obs = 110);
run;
First, create code to process one file, probably as a macro that takes as input the name of the file to read and the name of the dataset to create.
%macro taq(infile,dataset);
data &dataset;
infile "&infile" zip gzip dsd dlm='|' truncover firstobs=2;
....
run;
%mend taq;
Then generate a dataset with the names of the files to read and the dataset names you want to create from them. So perhaps something like this:
%let ym=202107;
%let folder=D:\EQY_US_ALL_TRADE_&ym;
data taq_files;
length dataset $32 filename $256 ;
keep dataset filename;
rc=filename('folder',"&folder");
did=dopen('folder');
do fnum=1 to dnum(did);
filename=catx('\',"&folder",dread(did,fnum));
dataset=scan(filename,-2,'./\');
if 'gz'=lowcase(scan(filename,-1,'.')) then output;
end;
did=dclose(did);
rc=filename('folder');
run;
Now that you have the list of files you can use it to call the macro once for each file.
data _null_;
set taq_files;
call execute(cats('%nrstr(%taq)(',filename,',',dataset,')'));
run;
The body of the macro can include the code to both read the values from the delimited files and calculate any new variables you want. There should not be any need to do that in multiple steps based on what you have shown so far.
Your logic for converting the timestamp strings into time values seems overly convoluted. Just use informats that match the style of the strings in the file. For example, if the strings start with 6 digits that represent HHMMSS, read them using the HHMMSS6. informat. If the filenames contain digit strings in the style YYYYMMDD, read those using the YYMMDD8. informat.
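A minimal sketch of the informat approach, assuming made-up sample strings (the values and variable names below are illustrative only, not taken from the TAQ files):

```sas
data _null_;
  time_str = '093000';                 /* hypothetical HHMMSS string   */
  date_str = '20210701';               /* hypothetical YYYYMMDD string */
  t = input(time_str, hhmmss6.);       /* SAS time: seconds since midnight */
  d = input(date_str, yymmdd8.);       /* SAS date: days since 01JAN1960   */
  dt = dhms(d, 0, 0, t);               /* combine into a datetime value    */
  put t= time8. d= date9. dt= datetime19.;
run;
```

The same informats can be used directly on the INPUT statement when reading the delimited file, so no character-to-numeric conversion step is needed afterwards.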
Note that a text file that is compressed to 2 GB will generate a dataset that is possibly 10 to 30 times that large. You might want to define the individual datasets as views instead, to avoid using that space, by changing the DATA statement:
data &dataset / view=&dataset ;

Manually Reading in Data in SAS from CSV

So I have a large dataset that is rather oddly formatted and I want to read it in based on the header. It only has unique columns for each unique participant and each participant participated in multiple rounds of the study. The data is from some experiments and is formatted as having variables for each participant (e.g. "participant.code") then some session variables which I can drop and then the actual variables from the experiment. These are formatted as "study.[round number].player.[variable]"
Rather than repeating the variable for every round, I want to just take out the round number as a separate variable and have an observation for every round for each participant.
I want to read these in differently depending on the variable and pick it out. I would rather not have to manually mess with the source file since the experiment is going to be run multiple times.
If someone could just point me towards some relevant material or whatnot that would be great.
Thank you!
Edit: example of some of the raw data:
participant.id_in_session,participant.code,participant.label,participant._is_bot,participant._index_in_pages,participant._max_page_index,participant._current_app_name,participant._current_page_name,participant.time_started_utc,participant.visited,participant.mturk_worker_id,participant.mturk_assignment_id,participant.payoff,session.code,session.label,session.mturk_HITId,session.mturk_HITGroupId,session.comment,session.is_demo,session.config.real_world_currency_per_point,session.config.participation_fee,session.config.name,session.config.treatment,study.1.player.id_in_group,study.1.player.role,study.1.player.payoff,study.1.player.Seatfinal,study.1.player.finalpay,study.1.player.payroundpay,study.1.player.QCorrect,study.1.player.treatment,study.1.player.Q1a,study.1.player.Q1b,study.1.player.Q1c,study.1.player.Q2a,study.1.player.Q3,study.1.player.Q4,study.1.player.Q5,study.1.player.Q6,study.1.player.Q7,study.1.player.Q80,study.1.player.Q81,study.1.player.Q82,study.1.player.offer,study.1.player.OfferNum,study.1.player.OfferTaken,study.1.player.BuyerNumber,study.1.player.Seatnum2,study.1.player.Seatnum,study.1.player.pay,study.1.player.isoffertaken,study.1.player.hastakenoffer,study.1.player.consent,study.1.player.offerPrice,study.1.player.oprice,study.1.player.guess_num_seller,study.1.player.BoughtPrice,study.1.player.reward,study.1.player.guess_num_buyer,study.1.group.id_in_subsession,study.1.subsession.round_number,study.1.subsession.offersrem,study.1.subsession.game_finished,study.1.subsession.numbuyers,study.1.subsession.bnum,study.1.subsession.payround,study.2.player.id_in_group,study.2.player.role,study.2.player.payoff,study.2.player.Seatfinal,study.2.player.finalpay,study.2.player.payroundpay,study.2.player.QCorrect,study.2.player.treatment,study.2.player.Q1a,study.2.player.Q1b,study.2.player.Q1c,study.2.player.Q2a,study.2.player.Q3,study.2.player.Q4,study.2.player.Q5,study.2.player.Q6,study.2.player.Q7,study.2.player.Q80,study.2.player.Q81,study.2.player.Q82,stu
dy.2.player.offer,study.2.player.OfferNum,study.2.player.OfferTaken,study.2.player.BuyerNumber,study.2.player.Seatnum2,study.2.player.Seatnum,study.2.player.pay,study.2.player.isoffertaken,study.2.player.hastakenoffer,study.2.player.consent,study.2.player.offerPrice,study.2.player.oprice,study.2.player.guess_num_seller,study.2.player.BoughtPrice,study.2.player.reward,study.2.player.guess_num_buyer,study.2.group.id_in_subsession,study.2.subsession.round_number,study.2.subsession.offersrem,study.2.subsession.game_finished,study.2.subsession.numbuyers,study.2.subsession.bnum,study.2.subsession.payround,study.3.player.id_in_group,study.3.player.role,study.3.player.payoff,study.3.player.Seatfinal,study.3.player.finalpay,study.3.player.payroundpay,study.3.player.QCorrect,study.3.player.treatment,study.3.player.Q1a,study.3.player.Q1b,study.3.player.Q1c,study.3.player.Q2a,study.3.player.Q3,study.3.player.Q4,study.3.player.Q5,study.3.player.Q6,study.3.player.Q7,study.3.player.Q80,study.3.player.Q81,study.3.player.Q82,study.3.player.offer,study.3.player.OfferNum,study.3.player.OfferTaken,study.3.player.BuyerNumber,study.3.player.Seatnum2,study.3.player.Seatnum,study.3.player.pay,study.3.player.isoffertaken,study.3.player.hastakenoffer,study.3.player.consent,study.3.player.offerPrice,study.3.player.oprice,study.3.player.guess_num_seller,study.3.player.BoughtPrice,study.3.player.reward,study.3.player.guess_num_buyer,study.3.group.id_in_subsession,study.3.subsession.round_number,study.3.subsession.offersrem,study.3.subsession.game_finished,study.3.subsession.numbuyers,study.3.subsession.bnum,study.3.subsession.payround,study.4.player.id_in_group,study.4.player.role,study.4.player.payoff,study.4.player.Seatfinal,study.4.player.finalpay,study.4.player.payroundpay,study.4.player.QCorrect,study.4.player.treatment,study.4.player.Q1a,study.4.player.Q1b,study.4.player.Q1c,study.4.player.Q2a,study.4.player.Q3,study.4.player.Q4,study.4.player.Q5,study.4.player.Q6,study.4.player.Q7,study.
4.player.Q80,study.4.player.Q81,study.4.player.Q82,study.4.player.offer,study.4.player.OfferNum,study.4.player.OfferTaken,study.4.player.BuyerNumber,study.4.player.Seatnum2,study.4.player.Seatnum,study.4.player.pay,study.4.player.isoffertaken,study.4.player.hastakenoffer,study.4.player.consent,study.4.player.offerPrice,study.4.player.oprice,study.4.player.guess_num_seller,study.4.player.BoughtPrice,study.4.player.reward,study.4.player.guess_num_buyer,study.4.group.id_in_subsession,study.4.subsession.round_number,study.4.subsession.offersrem,study.4.subsession.game_finished,study.4.subsession.numbuyers,study.4.subsession.bnum,study.4.subsession.payround,study.5.player.id_in_group,study.5.player.role,study.5.player.payoff,study.5.player.Seatfinal,study.5.player.finalpay,study.5.player.payroundpay,study.5.player.QCorrect,study.5.player.treatment,study.5.player.Q1a,study.5.player.Q1b,study.5.player.Q1c,study.5.player.Q2a,study.5.player.Q3,study.5.player.Q4,study.5.player.Q5,study.5.player.Q6,study.5.player.Q7,study.5.player.Q80,study.5.player.Q81,study.5.player.Q82,study.5.player.offer,study.5.player.OfferNum,study.5.player.OfferTaken,study.5.player.BuyerNumber,study.5.player.Seatnum2,study.5.player.Seatnum,study.5.player.pay,study.5.player.isoffertaken,study.5.player.hastakenoffer,study.5.player.consent,study.5.player.offerPrice,study.5.player.oprice,study.5.player.guess_num_seller,study.5.player.BoughtPrice,study.5.player.reward,study.5.player.guess_num_buyer,study.5.group.id_in_subsession,study.5.subsession.round_number,study.5.subsession.offersrem,study.5.subsession.game_finished,study.5.subsession.numbuyers,study.5.subsession.bnum,study.5.subsession.payround,study.6.player.id_in_group,study.6.player.role,study.6.player.payoff,study.6.player.Seatfinal,study.6.player.finalpay,study.6.player.payroundpay,study.6.player.QCorrect,study.6.player.treatment,study.6.player.Q1a,study.6.player.Q1b,study.6.player.Q1c,study.6.player.Q2a,study.6.player.Q3,study.6.player.Q4,study.
6.player.Q5,study.6.player.Q6,study.6.player.Q7,study.6.player.Q80,study.6.player.Q81,study.6.player.Q82,study.6.player.offer,study.6.player.OfferNum,study.6.player.OfferTaken,study.6.player.BuyerNumber,study.6.player.Seatnum2,study.6.player.Seatnum,study.6.player.pay,study.6.player.isoffertaken,study.6.player.hastakenoffer,study.6.player.consent,study.6.player.offerPrice,study.6.player.oprice,study.6.player.guess_num_seller,study.6.player.BoughtPrice,study.6.player.reward,study.6.player.guess_num_buyer,study.6.group.id_in_subsession,study.6.subsession.round_number,study.6.subsession.offersrem,study.6.subsession.game_finished,study.6.subsession.numbuyers,study.6.subsession.bnum,study.6.subsession.payround,study.7.player.id_in_group,study.7.player.role,study.7.player.payoff,study.7.player.Seatfinal,study.7.player.finalpay,study.7.player.payroundpay,study.7.player.QCorrect,study.7.player.treatment,study.7.player.Q1a,study.7.player.Q1b,study.7.player.Q1c,study.7.player.Q2a,study.7.player.Q3,study.7.player.Q4,study.7.player.Q5,study.7.player.Q6,study.7.player.Q7,study.7.player.Q80,study.7.player.Q81,study.7.player.Q82,study.7.player.offer,study.7.player.OfferNum,study.7.player.OfferTaken,study.7.player.BuyerNumber,study.7.player.Seatnum2,study.7.player.Seatnum,study.7.player.pay,study.7.player.isoffertaken,study.7.player.hastakenoffer,study.7.player.consent,study.7.player.offerPrice,study.7.player.oprice,study.7.player.guess_num_seller,study.7.player.BoughtPrice,study.7.player.reward,study.7.player.guess_num_buyer,study.7.group.id_in_subsession,study.7.subsession.round_number,study.7.subsession.offersrem,study.7.subsession.game_finished,study.7.subsession.numbuyers,study.7.subsession.bnum,study.7.subsession.payround,study.8.player.id_in_group,study.8.player.role,study.8.player.payoff,study.8.player.Seatfinal,study.8.player.finalpay,study.8.player.payroundpay,study.8.player.QCorrect,study.8.player.treatment,study.8.player.Q1a,study.8.player.Q1b,study.8.player.Q1c,study.8
.player.Q2a,study.8.player.Q3,study.8.player.Q4,study.8.player.Q5,study.8.player.Q6,study.8.player.Q7,study.8.player.Q80,study.8.player.Q81,study.8.player.Q82,study.8.player.offer,study.8.player.OfferNum,study.8.player.OfferTaken,study.8.player.BuyerNumber,study.8.player.Seatnum2,study.8.player.Seatnum,study.8.player.pay,study.8.player.isoffertaken,study.8.player.hastakenoffer,study.8.player.consent,study.8.player.offerPrice,study.8.player.oprice,study.8.player.guess_num_seller,study.8.player.BoughtPrice,study.8.player.reward,study.8.player.guess_num_buyer,study.8.group.id_in_subsession,study.8.subsession.round_number,study.8.subsession.offersrem,study.8.subsession.game_finished,study.8.subsession.numbuyers,study.8.subsession.bnum,study.8.subsession.payround,study.9.player.id_in_group,study.9.player.role,study.9.player.payoff,study.9.player.Seatfinal,study.9.player.finalpay,study.9.player.payroundpay,study.9.player.QCorrect,study.9.player.treatment,study.9.player.Q1a,study.9.player.Q1b,study.9.player.Q1c,study.9.player.Q2a,study.9.player.Q3,study.9.player.Q4,study.9.player.Q5,study.9.player.Q6,study.9.player.Q7,study.9.player.Q80,study.9.player.Q81,study.9.player.Q82,study.9.player.offer,study.9.player.OfferNum,study.9.player.OfferTaken,study.9.player.BuyerNumber,study.9.player.Seatnum2,study.9.player.Seatnum,study.9.player.pay,study.9.player.isoffertaken,study.9.player.hastakenoffer,study.9.player.consent,study.9.player.offerPrice,study.9.player.oprice,study.9.player.guess_num_seller,study.9.player.BoughtPrice,study.9.player.reward,study.9.player.guess_num_buyer,study.9.group.id_in_subsession,study.9.subsession.round_number,study.9.subsession.offersrem,study.9.subsession.game_finished,study.9.subsession.numbuyers,study.9.subsession.bnum,study.9.subsession.payround,study.10.player.id_in_group,study.10.player.role,study.10.player.payoff,study.10.player.Seatfinal,study.10.player.finalpay,study.10.player.payroundpay,study.10.player.QCorrect,study.10.player.treatment,st
udy.10.player.Q1a,study.10.player.Q1b,study.10.player.Q1c,study.10.player.Q2a,study.10.player.Q3,study.10.player.Q4,study.10.player.Q5,study.10.player.Q6,study.10.player.Q7,study.10.player.Q80,study.10.player.Q81,study.10.player.Q82,study.10.player.offer,study.10.player.OfferNum,study.10.player.OfferTaken,study.10.player.BuyerNumber,study.10.player.Seatnum2,study.10.player.Seatnum,study.10.player.pay,study.10.player.isoffertaken,study.10.player.hastakenoffer,study.10.player.consent,study.10.player.offerPrice,study.10.player.oprice,study.10.player.guess_num_seller,study.10.player.BoughtPrice,study.10.player.reward,study.10.player.guess_num_buyer,study.10.group.id_in_subsession,study.10.subsession.round_number,study.10.subsession.offersrem,study.10.subsession.game_finished,study.10.subsession.numbuyers,study.10.subsession.bnum,study.10.subsession.payround,study.11.player.id_in_group,study.11.player.role,study.11.player.payoff,study.11.player.Seatfinal,study.11.player.finalpay,study.11.player.payroundpay,study.11.player.QCorrect,study.11.player.treatment,study.11.player.Q1a,study.11.player.Q1b,study.11.player.Q1c,study.11.player.Q2a,study.11.player.Q3,study.11.player.Q4,study.11.player.Q5,study.11.player.Q6,study.11.player.Q7,study.11.player.Q80,study.11.player.Q81,study.11.player.Q82,study.11.player.offer,study.11.player.OfferNum,study.11.player.OfferTaken,study.11.player.BuyerNumber,study.11.player.Seatnum2,study.11.player.Seatnum,study.11.player.pay,study.11.player.isoffertaken,study.11.player.hastakenoffer,study.11.player.consent,study.11.player.offerPrice,study.11.player.oprice,study.11.player.guess_num_seller,study.11.player.BoughtPrice,study.11.player.reward,study.11.player.guess_num_buyer,study.11.group.id_in_subsession,study.11.subsession.round_number,study.11.subsession.offersrem,study.11.subsession.game_finished,study.11.subsession.numbuyers,study.11.subsession.bnum,study.11.subsession.payround,study.12.player.id_in_group,study.12.player.role,study.12.player
.payoff,study.12.player.Seatfinal,study.12.player.finalpay,study.12.player.payroundpay,study.12.player.QCorrect,study.12.player.treatment,study.12.player.Q1a,study.12.player.Q1b,study.12.player.Q1c,study.12.player.Q2a,study.12.player.Q3,study.12.player.Q4,study.12.player.Q5,study.12.player.Q6,study.12.player.Q7,study.12.player.Q80,study.12.player.Q81,study.12.player.Q82,study.12.player.offer,study.12.player.OfferNum,study.12.player.OfferTaken,study.12.player.BuyerNumber,study.12.player.Seatnum2,study.12.player.Seatnum,study.12.player.pay,study.12.player.isoffertaken,study.12.player.hastakenoffer,study.12.player.consent,study.12.player.offerPrice,study.12.player.oprice,study.12.player.guess_num_seller,study.12.player.BoughtPrice,study.12.player.reward,study.12.player.guess_num_buyer,study.12.group.id_in_subsession,study.12.subsession.round_number,study.12.subsession.offersrem,study.12.subsession.game_finished,study.12.subsession.numbuyers,study.12.subsession.bnum,study.12.subsession.payround,study.13.player.id_in_group,study.13.player.role,study.13.player.payoff,study.13.player.Seatfinal,study.13.player.finalpay,study.13.player.payroundpay,study.13.player.QCorrect,study.13.player.treatment,study.13.player.Q1a,study.13.player.Q1b,study.13.player.Q1c,study.13.player.Q2a,study.13.player.Q3,study.13.player.Q4,study.13.player.Q5,study.13.player.Q6,study.13.player.Q7,study.13.player.Q80,study.13.player.Q81,study.13.player.Q82,study.13.player.offer,study.13.player.OfferNum,study.13.player.OfferTaken,study.13.player.BuyerNumber,study.13.player.Seatnum2,study.13.player.Seatnum,study.13.player.pay,study.13.player.isoffertaken,study.13.player.hastakenoffer,study.13.player.consent,study.13.player.offerPrice,study.13.player.oprice,study.13.player.guess_num_seller,study.13.player.BoughtPrice,study.13.player.reward,study.13.player.guess_num_buyer,study.13.group.id_in_subsession,study.13.subsession.round_number,study.13.subsession.offersrem,study.13.subsession.game_finished,study.13
.subsession.numbuyers,study.13.subsession.bnum,study.13.subsession.payround
1,kppf7hjb,,0,221,221,study,FinalPay,2022-04-16 22:08:18.471115,1,,,0.0,lew8kph3,,,,,0,1.0,0.0,externality_control,0,2,Seller,0.0,1,0,0,10,0,125,125,50,100,50,0,0,0,1,1,,,1,3,,0,1,1,100,0,0,,50.0,,,,,,1,1,6,1,5,6,4,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,1,,0,,,100,0,0,,45.0,,,,,,1,2,6,1,5,6,13,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,0,0,,0,,,100,0,0,,,,,,,,1,3,5,1,5,6,6,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,6,,0,,,138,1,0,,38.0,,,,,,1,4,6,1,5,6,3,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,2,,0,,,135,1,0,,35.0,,,,,,1,5,6,1,5,6,11,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,0,0,,0,,,100,0,0,,,,,,,,1,6,5,1,5,6,6,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,6,,0,,,132,1,0,,32.0,,,,,,1,7,6,1,5,6,4,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,5,,0,,,150,1,0,,50.0,,,,,,1,8,6,1,5,6,9,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,2,,0,,,100,0,0,,49.0,,,,,,1,9,6,1,5,6,10,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,5,,0,,,100,0,0,,39.0,,,,,,1,10,6,1,5,6,3,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,1,,0,,,132,1,0,,32.0,,,,,,1,11,6,1,5,6,10,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,1,,0,,,130,1,0,,30.0,,,,,,1,12,6,1,5,6,8,2,Seller,0.0,1,192,132,10,0,,,,,,,,,,,,,1,2,,0,,,128,1,0,,28.0,,,,,,1,13,6,1,5,6,11
Your file is not really as complicated as it first seems. For example the bulk of the data is just 43 columns that repeat 13 times. The STUDY.1 columns, then STUDY.2 columns etc.
For this one just write a program to read it. There are 22 columns that are not "study" columns. Then 13 copies of the 43 study columns.
data want;
  infile csv dsd truncover firstobs=2;
  input var1 ..... var22 @;
  do study=1 to 13;
    input svar1 .... svar43 @;
    output;
  end;
run;
So you turn each line into 13 observations (study=1 to study=13).
To complete the sketch of a data step above, you just need to figure out what names you want to use for the 65 (22 + 43) variables other than STUDY, and, for each variable, what type it is (numeric or character) and, when character, what length it needs to store the longest possible value.
If you need to work with a lot of different variations of files in this style then it might be worth working on a program to analyze the headers and determine the role of the columns based on the pattern of the header name and perhaps generate the code to read the file.
You might start by building a dataset with just the header names.
data headers;
  infile csv dsd obs=1 ;
  length col 8 words 8 ;
  col+1;
  array header [4] $50 ;
  input header1 :$50. @@ ;
  words=countw(header1,'.');
  do _n_=words to 1 by -1;
    header[_n_] = scan(header1,_n_,'.');
  end;
run;
You can use that list of the headers to help you figure out what would be useful names for the variables.
If you want to let SAS guess how to define and name the variables, you could try splitting the CSV file into two separate CSV files: one with the first 22 columns and one with the other 43. So first split the headers (perhaps removing the STUDY.N. prefix while you are at it). Then split the data. Add a ROW number to make it easy to join them later.
filename single temp;
filename multiple temp;
data _null_;
  infile csv dsd obs=1 ;
  input header :$50. @@ ;
  file single dsd ;
  if _n_=1 then put 'ROW,' @;
  if _n_<= 22 then put header @;
  else do;
    file multiple dsd;
    if _n_=23 then put 'ROW,STUDY,' @;
    call scan(header,3,pos,len,'.');
    header = substr(header,pos);
    put header @;
  end;
  if _n_=22+43 then stop;
run;
data _null_;
  infile csv dsd firstobs=2 truncover ;
  row+1;
  length s1-s43 $200 ;
  input s1-s22 @;
  file single dsd mod;
  put row s1-s22 ;
  file multiple dsd mod;
  do study=1 to 13 ;
    input s1-s43 @ ;
    put row study s1-s43 ;
  end;
run;
Now you can use PROC IMPORT to GUESS how to read SINGLE and MULTIPLE and then you can join them back together.
proc import file=single dbms=csv out=single replace;
run;
proc import file=multiple dbms=csv out=multiple replace;
run;
data want;
merge single multiple;
by row;
run;

Insert (internally existing) column headers as first row to a table

Assume that we have a table INPUT_TABLE which has four columns name, lat, lon, and z, filled with many data sets. In the SAS Explorer it would e.g. look like this:
name lat lon z
1 Germany 49.420469 8.7269178 17
2 England 51.5540693 -0.8249039 16
...
I hand over a PREPROCESSED_TABLE based on this INPUT_TABLE to a macro %tabl:
data V42.PREPROCESSED_TABLE;
set V21.INPUT_TABLE;
drop NAME;
run;
%tabl(libin=V42, file=PREPROCESSED_TABLE);
The macro itself I am not allowed to modify.
Among other things, %tabl also writes a plain text file PREPROCESSED_TABLE.txt:
49.420469|8.7269178|17
51.5540693|-0.8249039|16
I would like to have the header names written out as well, e.g.:
lat|lon|z
49.420469|8.7269178|17
51.5540693|-0.8249039|16
My idea is to expand the PREPROCESSED_TABLE somewhere in the data step - could somebody help me with that, please? How can I read out the header names which are internally stored?
If the goal is to make a file with one line with the variable names then just write the file yourself. First get the names into a dataset (in order) and then write them. For example you could use PROC TRANSPOSE with OBS=0 dataset option to generate a file with one observation per variable.
proc transpose data=V42.PREPROCESSED_TABLE(obs=0) out=NAMES ;
var _all_ ;
run;
Which you can then use to write to a file.
data _null_;
set names ;
file 'preprocessed.txt' dsd dlm='|';
put _name_ @ ;
run;
If you also want to add the data to that same file just use a second data step. Make sure to use the MOD option on the FILE statement so that data lines are appended to the existing file.
data _null_;
set V42.PREPROCESSED_TABLE;
file 'preprocessed.txt' dsd dlm='|' mod;
put (_all_) (+0);
run;
If you need to call the existing macro for other reasons, you could simply ignore the file it creates. Or, if for some reason its content is different from a simple dump of the dataset, you could concatenate the file containing the headers with the file the macro generates. Say the macro generated 'PREPROCESSED_TABLE.txt' and your code generated the one-line file 'headers.txt'. Then this step will read both and write 'PREPROCESSED_TABLE_w_headers.txt':
data _null_;
file 'PREPROCESSED_TABLE_w_headers.txt';
if _n_=1 then do;
infile 'headers.txt';
input;
put _infile_;
end;
infile 'PREPROCESSED_TABLE.txt';
input;
put _infile_;
run;
Given Reeza's and Tom's hints, I figured out a workaround myself: we simply call our macro %tabl twice, once with a one-row table of column names and once with the data. This essentially corresponds to writing first the headers and then the data to the file (except that I have to worry about additional things added by %tabl further down the process chain).
The technical difficulty I had was how to extract this one-row table of column names from the metadata of the input table V21.INPUT_TABLE.
My team mate showed me how that is done. To make it testable for everybody, I will show this step for the test data table sashelp.class:
proc contents data=sashelp.class out=meta (keep=NAME VARNUM) noprint;
run;
proc sort data=meta out=meta2;
by VARNUM;
run;
proc transpose data=meta2 out=colheaders (drop=_NAME_ _LABEL_);
var name;
run;
As a result, we will have a table colheaders with exactly one line containing the table headers, sorted by VARNUM which is the order in which they appear in the original table:
COL1 COL2 COL3 COL4 COL5
1 NAME SEX AGE HEIGHT WEIGHT
Problem solved, at least theoretically.
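As a possible variation on the steps above, the same ordered list of names can be read from the DICTIONARY.COLUMNS table in PROC SQL. This is only a sketch, not the method the team used: the macro variable name colheaders and the pipe separator are illustrative assumptions, and it yields a single delimited string rather than a one-row table.

```sas
proc sql noprint;
  /* LIBNAME and MEMNAME are stored in upper case in the dictionary tables */
  select name
    into :colheaders separated by '|'
    from dictionary.columns
    where libname = 'SASHELP' and memname = 'CLASS'
    order by varnum;
quit;

/* Show the resulting header line in the log */
%put &=colheaders;
```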

SAS Export Issue as it is giving additional double quote

I am trying to export SAS data into CSV, sas dataset name is abc here and format is
LINE_NUMBER DESCRIPTION
524JG 24PC AMEFA VINTAGE CUTLERY SET "DUBARRY"
I am using following code.
filename exprt "C:/abc.csv" encoding="utf-8";
proc export data=abc
outfile=exprt
dbms=tab;
run;
output is
LINE_NUMBER DESCRIPTION
524JG "24PC AMEFA VINTAGE CUTLERY SET ""DUBARRY"""
So there are double quotes before and after the description, and an additional double quote appears before and after the word DUBARRY. I have no clue what's happening. Can someone help me resolve this and help me understand exactly what is happening here?
expected result:
LINE_NUMBER DESCRIPTION
524JG 24PC AMEFA VINTAGE CUTLERY SET "DUBARRY"
There is no need to use PROC EXPORT to create a delimited file. You can write it with a simple DATA step. If you want to create your example file then just do not use the DSD option on the FILE statement. But note that depending on the data you are writing that you could create a file that cannot be properly parsed because of extra un-protected delimiters. Also you will have trouble representing missing values.
Let's make a sample dataset we can use to test.
data have ;
input id value cvalue $ name $20. ;
cards;
1 123 A Normal
2 345 B Embedded|delimiter
3 678 C Embedded "quotes"
4 . D Missing value
5 901 . Missing cvalue
;
Essentially PROC EXPORT is writing the data using the DSD option. Like this:
data _null_;
set have ;
file 'myfile.txt' dsd dlm='09'x ;
put (_all_) (+0);
run;
Which will yield a file like this (with pipes replacing the tabs so you can see them).
1|123|A|Normal
2|345|B|"Embedded|delimiter"
3|678|C|"Embedded ""quotes"""
4||D|Missing value
5|901||Missing cvalue
If you just remove DSD option then you get a file like this instead.
1|123|A|Normal
2|345|B|Embedded|delimiter
3|678|C|Embedded "quotes"
4|.|D|Missing value
5|901| |Missing cvalue
Notice how the second line looks like it has 5 values instead of 4, making it impossible to know how to split it into 4 values. Also notice that the missing values are written as at least one character (a period or a blank) instead of an empty field.
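For the dataset in the question, the no-DSD approach might look like the sketch below. It assumes abc holds just the two character variables LINE_NUMBER and DESCRIPTION and reuses the fileref from the question; the header variable is a helper introduced here to write the column names.

```sas
/* Sketch: write ABC as a tab-delimited file without DSD quoting.        */
/* Assumes ABC has the two character variables shown in the question.    */
filename exprt "C:/abc.csv" encoding="utf-8";

data _null_;
  set abc;
  file exprt dlm='09'x;            /* tab delimiter, DSD deliberately omitted */
  if _n_ = 1 then do;
    /* HEADER is a hypothetical helper holding the tab-separated names */
    header = catx('09'x, 'LINE_NUMBER', 'DESCRIPTION');
    put header;
  end;
  put line_number description;     /* values are written without added quotes */
run;
```

The caveats above still apply: a description that contains a tab would break the layout, and a missing value is written as a blank rather than as an empty field.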
Another way would be to run a data step to convert the normal file that PROC EXPORT generates into the variant format that you want. This might also give you a place to add escape characters to protect special characters if your target format requires them.
data _null_;
infile normal dsd dlm='|' truncover ;
file abnormal dlm='|';
do i=1 to 4 ;
if i>1 then put '|' @;
input field :$32767. @;
field = tranwrd(field,'\','\\');
field = tranwrd(field,'|','\|');
len = lengthn(field);
put field $varying32767. len @;
end;
put;
run;
You could even make this datastep smart enough to count the number of fields on the first row and use that to control the loop so that you wouldn't have to hard code it.
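A rough sketch of that idea follows. It assumes the same normal and abnormal filerefs as above, that every row has as many fields as the first one, and that the first row contains no unbalanced quotes, since COUNTW with the q modifier relies on matched quotation marks. The variable nfields is introduced here for illustration.

```sas
/* Sketch: derive the field count from the first record instead of       */
/* hard-coding it. NFIELDS is a hypothetical helper variable.            */
data _null_;
  infile normal dsd dlm='|' truncover;
  file abnormal dlm='|';
  retain nfields;
  input @;                          /* load the record and hold it */
  /* q: ignore delimiters inside quotes; m: count empty fields too */
  if _n_ = 1 then nfields = countw(_infile_, '|', 'qm');
  do i = 1 to nfields;
    if i > 1 then put '|' @;
    input field :$32767. @;
    field = tranwrd(field, '\', '\\');
    field = tranwrd(field, '|', '\|');
    len = lengthn(field);
    put field $varying32767. len @;
  end;
  put;                              /* finish the output line */
run;
```

Counting once on the first row keeps the loop cheap; if rows can differ in width, the count would have to be recomputed for every record.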

How do I stop SAS from adding an extra empty byte to every string variable when I use PROC EXPORT?

When I export a dataset to Stata format using PROC EXPORT, SAS 9.4 automatically adds an extra (empty) byte to every observation of every string variable. For example, in this data set:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 ASD X
B 199912 ASD C
;
quit;
proc export data = test1
file = "test1.dta"
dbms = stata replace;
quit;
the variables cust_id, category, and status should be str1, str3, and str1 in the final Stata file, and thus take up 1 byte, 3 bytes, and 1 byte, respectively, for every observation. However, SAS automatically adds an extra empty byte to each observation, which expands them to the str2, str4, and str2 data types in the output Stata file.
This is extremely problematic because that's an extra byte added to every observation of every string variable. For large datasets (I have some with ~530 million observations and numerous string variables), this can add several gigabytes to the exported file.
Once the file is loaded into Stata, the compress command in Stata can automatically remove these empty bytes and shrink the file, but for large datasets, PROC EXPORT adds so many extra bytes to the file that I don't always have enough memory to load the dataset into Stata in the first place.
Is there a way to stop SAS from padding the string variables in the first place? When I export a file with a one character string variable (for example), I want that variable stored as a one character string variable in the output file.
This is how you can do it using existing functions.
filename FT41F001 temp;
data _null_;
file FT41F001;
set test1;
put 256*' ' @;
__s=1;
do while(1);
length __name $32.;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
substr(_FILE_,__s) = vvaluex(__name);
putlog _all_;
__s = sum(__s,vformatwx(__name));
end;
_file_ = trim(_file_);
put;
format month f6.;
run;
To avoid the use of _FILE_:
data _null_;
file FT41F001;
set test1;
__s=1;
do while(1);
length __name $32. __value $128 __w 8;
call vnext(__name);
if missing(__name) or __name eq: '__' then leave;
__value = vvaluex(__name);
__w = vformatwx(__name);
put __value $varying128. __w @;
end;
put;
format month f6.;
run;
If you are willing to accept a flat file answer, I've come up with a fairly simple way of generating one that I think has the properties you require:
data test1;
input cust_id $ 1
month 3-8
category $ 10-12
status $ 14-14
;
datalines;
A 200003 ABC C
A 200004 DEF C
A 200006 XYZ 3
B 199910 SD X
B 199912 D C
;
run;
data _null_;
file "/folders/myfolders/test.txt";
set test1;
put @;
_FILE_ = cat(of _all_);
put;
run;
/* Print contents of the file to the log (for debugging only)*/
data _null_;
infile "/folders/myfolders/test.txt";
input;
put _infile_;
run;
This should work as-is, provided that the total assigned length of all variables in your dataset is less than 32767 (the limit of the cat function in the data step environment; the lower 200-character limit doesn't apply, as that only comes into play when you use cat to create a variable that hasn't been assigned a length). Beyond that you may start to run into truncation issues. A workaround when that happens is to only cat together a limited number of variables at a time. That is a manual process, but much less laborious than writing out put statements based on the lengths of all the variables, and depending on your data it may never actually come up.
Alternatively, you could go down a more complex macro route, getting variable lengths from either the vlength function or dictionary.columns and using those plus the variable names to construct the required put statement(s).
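A minimal sketch of the dictionary.columns variant is shown below. The assumptions are mine: the dataset is WORK.TEST1, numeric variables are written with BEST12., variable names are ordinary V7 names, and the generated list fits in one macro variable. The name putlist is made up for illustration.

```sas
/* Sketch: build a formatted PUT list from the dataset's metadata.       */
/* Character variables get $CHARw. with their stored length; numeric     */
/* variables get BEST12. (an assumption, adjust as needed).              */
proc sql noprint;
  select case
           when type = 'char' then catx(' ', name, cats('$char', length, '.'))
           else catx(' ', name, 'best12.')
         end
    into :putlist separated by ' '
    from dictionary.columns
    where libname = 'WORK' and memname = 'TEST1'
    order by varnum;
quit;

/* Show the generated list in the log before using it */
%put NOTE: generated PUT list: &putlist;

data _null_;
  set test1;
  file "/folders/myfolders/test.txt";
  put &putlist;                     /* formatted output, no padding between fields */
run;
```

Because each variable is written with an explicit width, the 32767-character limit of cat no longer applies, at the cost of depending on the maximum macro variable length for very wide datasets.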