I have a large, oddly formatted dataset that I want to read in based on the header. It has one row per participant, but each participant took part in multiple rounds of the study, so every round gets its own set of columns. The data comes from some experiments: first there are variables for each participant (e.g. "participant.code"), then some session variables, which I can drop, and then the actual variables from the experiment. These are named "study.[round number].player.[variable]".
Rather than repeating each variable for every round, I want to pull the round number out into a separate variable and have one observation per round for each participant.
I want to read each column differently depending on what its header says it is. I would rather not edit the source file by hand, since the experiment is going to be run multiple times.
If someone could just point me towards some relevant material or whatnot that would be great.
Thank you!
Edit: example of some of the raw data:
1,kppf7hjb,,0,221,221,study,FinalPay,2022-04-16 22:08:18.471115,1,,,0.0,lew8kph3,,,,,0,1.0,0.0,externality_control,0,2,Seller,0.0,1,0,0,10,0,125,125,50,100,50,0,0,0,1,1,,,1,3,,0,1,1,100,0,0,,50.0,,,,,,1,1,6,1,5,6,4,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,1,,0,,,100,0,0,,45.0,,,,,,1,2,6,1,5,6,13,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,0,0,,0,,,100,0,0,,,,,,,,1,3,5,1,5,6,6,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,6,,0,,,138,1,0,,38.0,,,,,,1,4,6,1,5,6,3,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,2,,0,,,135,1,0,,35.0,,,,,,1,5,6,1,5,6,11,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,0,0,,0,,,100,0,0,,,,,,,,1,6,5,1,5,6,6,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,6,,0,,,132,1,0,,32.0,,,,,,1,7,6,1,5,6,4,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,5,,0,,,150,1,0,,50.0,,,,,,1,8,6,1,5,6,9,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,2,,0,,,100,0,0,,49.0,,,,,,1,9,6,1,5,6,10,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,5,,0,,,100,0,0,,39.0,,,,,,1,10,6,1,5,6,3,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,1,,0,,,132,1,0,,32.0,,,,,,1,11,6,1,5,6,10,2,Seller,0.0,,0,0,0,0,,,,,,,,,,,,,1,1,,0,,,130,1,0,,30.0,,,,,,1,12,6,1,5,6,8,2,Seller,0.0,1,192,132,10,0,,,,,,,,,,,,,1,2,,0,,,128,1,0,,28.0,,,,,,1,13,6,1,5,6,11
Your file is not really as complicated as it first seems. The bulk of the data is just 43 columns that repeat 13 times: the STUDY.1 columns, then the STUDY.2 columns, and so on.
For this one just write a program to read it. There are 22 columns that are not "study" columns. Then 13 copies of the 43 study columns.
data want;
  infile csv dsd truncover firstobs=2;
  /* trailing @ holds the line so the loop can keep reading from it */
  input var1 ..... var22 @;
  do study=1 to 13;
    input svar1 .... svar43 @ ;
    output;
  end;
run;
So you turn each line into 13 observations (study=1 to study=13).
To complete the sketch of a data step above you just need to figure out what names you want to use for the 65 (22 + 43) variables other than STUDY, and for each variable what type it is, numeric or character, and, when character, what length it needs to store the longest possible value.
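To make the pattern concrete, here is a toy version that runs on its own. It is only a sketch: the variable names (id, role, price, qty), the three-round layout, and the in-line data are all made up for illustration and are not taken from your file.

```sas
/* Toy example: 2 fixed columns followed by 3 repeats of 2 round columns. */
data toy;
  infile datalines dsd truncover;
  length id $8 role $8;
  /* read the per-participant columns and hold the line */
  input id role @;
  do round = 1 to 3;
    /* read the next block of round columns from the same line */
    input price qty @;
    /* write one observation per round */
    output;
  end;
datalines;
a1,Seller,50,1,45,2,,0
b2,Buyer,10,3,20,1,30,4
;

proc print data=toy;
run;
```

Each input line becomes three observations, one per value of ROUND, with the participant-level values repeated on each.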
If you need to work with a lot of different variations of files in this style then it might be worth working on a program to analyze the headers and determine the role of the columns based on the pattern of the header name and perhaps generate the code to read the file.
You might start by building a dataset with just the header names.
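data headers;
  infile csv dsd obs=1 ;
  length col 8 words 8 ;
  array header [4] $50 ;
  /* @@ holds the header line across iterations: one header name per observation */
  input header1 :$50. @@ ;
  col + 1;
  words = countw(header1,'.');
  /* go backwards so HEADER1 is overwritten last */
  do _n_=words to 1 by -1;
    header[_n_] = scan(header1,_n_,'.');
  end;
run;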
You can use that list of the headers to help you figure out what would be useful names for the variables.
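For instance, a quick frequency count can confirm the repeating structure. This is a minimal sketch that assumes a HEADERS dataset like the one described above, that the round columns start with the word study, and that the round number is the second dot-separated word. The case-insensitive comparison is an extra precaution I added, not something the file requires.

```sas
/* Count how many columns each round contributes.                 */
/* Assumes HEADER1 is the first word of the header name and       */
/* HEADER2 is the round number (stored as character).             */
proc freq data=headers;
  where lowcase(header1) = 'study';
  tables header2 / nocum nopercent;
run;
```

If every round shows the same count, the do-loop approach applies directly.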
If you want to let SAS guess how to define and name the variables, you could try splitting the CSV file into two separate CSV files: one with the first 22 columns and one with the other 43. So first split the headers (perhaps removing the STUDY.N. prefix while you are at it). Then split the data. Add a ROW number to make it easy to join them later.
filename single temp;
filename multiple temp;
data _null_;
  infile csv dsd obs=1 ;
  input header :$50. @@ ;
  file single dsd ;
  if _n_=1 then put 'ROW,' @;
  if _n_<= 22 then put header @;
  else do;
    file multiple dsd;
    if _n_=23 then put 'ROW,STUDY,' @ ;
    /* keep everything from the third word on, dropping the STUDY.N. prefix */
    call scan(header,3,pos,len,'.');
    header = substr(header,pos);
    put header @;
  end;
  /* only the STUDY.1 headers are needed for the repeating block */
  if _n_=22+43 then stop;
run;
data _null_;
  infile csv dsd firstobs=2 truncover ;
  length s1-s43 $200 ;
  /* number the input lines so the two files can be joined later */
  row + 1;
  input s1-s22 @;
  file single dsd mod;
  put row s1-s22 ;
  file multiple dsd mod;
  do study=1 to 13 ;
    input s1-s43 @ ;
    put row study s1-s43 ;
  end;
run;
Now you can use PROC IMPORT to GUESS how to read SINGLE and MULTIPLE and then you can join them back together.
proc import file=single dbms=csv out=single replace;
run;
proc import file=multiple dbms=csv out=multiple replace;
run;
data want;
  merge single multiple;
  by row;
run;