Importing multiple text (txt) files into SAS (files have variable attributes in first 2 rows) - sas

I am trying to import multiple text files into SAS. The peculiarity of the data is that the first row has the labels for some of the variables and the second row has text indicating type of some of the variables. The third row has the variable names.
I was intending to use a macro to read the files as the first 7 variables have the same names. I am not sure how to programmatically handle the variable attributes in the files. Please suggest how I could do this.
The code so far:
%macro text2sas(filenam=);
proc import datafile="../&filenam..txt"
out="&filenam"
dbms=dlm replace ;
delimiter = '09'x;
getnames=no;
datarow=1;
guessingrows=max;
run;
%mend text2sas;
%text2sas(filenam=convdat);
%text2sas(filenam=tratdat);
The data for convdat.txt looks like this:
"Dance retail:" "Dummy measurement completed successfully?" "Dramatic measurements?" "Maximal travel :" "Velocity time at start:" "Mean velocity at start:" "Maximal velocity at end:" "Velocity time iinterval:" "Mean velocity interval:" "Crain Dp:"
date string string number number number number number number number
RELAXT RAIN PLUCK RAPPLE VRAT GROSS PANGLE "Straint" "Etramp" "Crumpa" "Cafin" "Cafinat" "Cafinab" "Cafinavr" "Cafinap" "cafinal"
X5980B00099 "CF" G0001001 1234 "Vlapa1" 1 "Crt appoi" "10-May-2010" "1" "1" "" "" "" "" "" "" ""
X5980B00099 "CF" G0001002 1234 "Vlapa1" 1 "Crt appoi" "13-May-2010" "1" "1" "" "" "" "" "" "" ""
X5980B00099 "CF" G0001003 1234 "Vlapa1" 1 "Crt appoi" "19-may-2010" "1" "1" "" "" "" "" "" "" ""
X5980B00099 "CF" G0001004 1234 "Vlapa1" 1 "Crt appoi" "26-may-2010" "1" "1" "0.45" "0.55" "0.98" "0.76" "0.98" "0.12" "5.77"
Data for tratdat looks like this:
"Arbitrary carpets" "Household items" "Garage material" "Sundry data (everything else)" "Vehicle number" "Strains" "ITM" "Finals" "Dreadspan" "Printers" "Comment 1" "comment 2" "Grapple" "Drops" "Triangles"
boolean boolean boolean boolean boolean boolean boolean boolean boolean boolean string boolean boolean boolean boolean
RELAXT RAIN PLUCK RAPPLE VRAT GROSS PANGLE "Ant" "App" "Cro" "BRon" "Dramas" "Slacks" "CRAT" "Frob" "Rilo" "Ph7jj" "P10rt" "Irup" "GLk2" "Dap3" "Oreta"
X5980B00099 "GB" G0001001 1234 "Vlapa1" 1 "Pangolin train" "" "checked" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
X5980B00099 "GB" G0001002 1234 "Vlapa1" 1 "Pangolin train" "" "" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
X5980B00099 "GB" G0001003 1234 "Vlapa1" 1 "Pangolin train" "checked" "" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
X5980B00099 "GB" G0001004 1234 "Vlapa1" 1 "Pangolin train" "checked" "" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""

The final import will involve telling SAS to take the variable names from row 3 and start reading data at row 4, but as Reeza notes, you will lose the metadata in rows 1 and 2 if you simply skip ahead with datarow=4.
I recommend parsing the file in a preprocessing step, and converting that metadata into input statements. This may be complicated, but it shouldn't be too bad... it is however outside the scope of a StackOverflow answer.
You can look into my presentations Writing Code With Your Data and Documentation Driven Programming (co-author) to see what kind of things you can do as far as writing the input statements. You don't have exactly what either of these expect, but you can input those first few lines using data step input and then transpose that dataset to a more useful format.

Looks like the first three lines have the LABEL, TYPE and NAME for the columns. So read that first and use the information to generate code to read the actual lines of data.
Something like this:
data headers ;
length row col 8 type $32 value $200 ;
infile file2 dsd dlm='09'x truncover length=ll column=cc ;
do type='LABEL','TYPE','NAME';
row+1;
do col=1 by 1 until(cc>ll);
input value @ ;
if not missing(value) then output;
end;
input;
end;
stop;
run;
proc sort; by col row; run;
proc transpose data=headers out=meta(drop=_name_) ;
by col;
id type ;
var value;
run;
Which for that second file should get data like:
Obs  col  NAME    LABEL                          TYPE
  1    1  RELAXT
  2    2  RAIN
  3    3  PLUCK
  4    4  RAPPLE
  5    5  VRAT
  6    6  GROSS
  7    7  PANGLE
  8    8  Ant     Arbitrary carpets              boolean
  9    9  App     Household items                boolean
 10   10  Cro     Garage material                boolean
 11   11  BRon    Sundry data (everything else)  boolean
 12   12  Dramas  Vehicle number                 boolean
 13   13  Slacks  Strains                        boolean
 14   14  CRAT    ITM                            boolean
 15   15  Frob    Finals                         boolean
 16   16  Rilo    Dreadspan                      boolean
 17   17  Ph7jj   Printers                       boolean
 18   18  P10rt   Comment 1                      string
 19   19  Irup    comment 2                      boolean
 20   20  GLk2    Grapple                        boolean
 21   21  Dap3    Drops                          boolean
 22   22  Oreta   Triangles                      boolean
Which you might use to generate code like:
data want ;
infile file2 dsd dlm='09'x truncover firstobs=4 ;
input
RELAXT :$20.
RAIN :$5.
PLUCK :$20.
RAPPLE
VRAT :$20.
GROSS
PANGLE :$40.
Ant :$1.
App :$1.
Cro :$1.
BRon :$1.
Dramas :$1.
Slacks :$1.
CRAT :$1.
Frob :$1.
Rilo :$1.
Ph7jj :$1.
P10rt :$50.
Irup :$1.
GLk2 :$1.
Dap3 :$1.
Oreta :$1.
;
label
Ant ="Arbitrary carpets"
App ="Household items"
Cro ="Garage material"
BRon ="Sundry data (everything else)"
Dramas ="Vehicle number"
Slacks ="Strains"
CRAT ="ITM"
Frob ="Finals"
Rilo ="Dreadspan"
Ph7jj ="Printers"
P10rt ="Comment 1"
Irup ="comment 2"
GLk2 ="Grapple"
Dap3 ="Drops"
Oreta ="Triangles"
;
run;
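One way to produce that step automatically is to let a DATA _NULL_ step write it to a temporary file and then %INCLUDE it. The following is only a sketch: it assumes the META dataset built above and a FILENAME statement that already points the fileref file2 at the text file. The fileref code, the variable infmt, and the mapping from each TYPE value to an informat are illustrative guesses, not something the file format defines. Columns with no TYPE (the first seven) are read as character here, so adjust that if some of them should be numeric.

```sas
/* Sketch only. Assumes META (col, NAME, LABEL, TYPE) built as above and   */
/* the fileref FILE2. The informat chosen for each TYPE is a guess.        */
filename code temp;

data _null_;
  set meta end=eof;
  file code;
  length infmt $12;
  if _n_ = 1 then
    put "data want;" / "  infile file2 dsd dlm='09'x truncover firstobs=4;";
  select (lowcase(type));
    when ('number')  infmt = ':32.';
    when ('date')    infmt = ':date11.';
    when ('boolean') infmt = ':$7.';     /* long enough for "checked"     */
    otherwise        infmt = ':$200.';   /* strings and untyped columns   */
  end;
  /* One INPUT per column; the trailing @ holds the line for the next one. */
  put '  input ' name infmt '@;';
  if lowcase(type) = 'date' then put '  format ' name 'date9.;';
  if not missing(label) then put '  label ' name '=' label :$quote. ';';
  if eof then put 'run;';
run;

/* Run the generated step; SOURCE2 echoes the generated code to the log. */
%include code / source2;
```

Because nothing in this step is specific to one file, the header-reading step and this generator could be wrapped together in a macro that takes the file name as a parameter, which is what the question was aiming for.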

Related

Combine several rows into one observation

I have a dataset in Stata where one observation is spread out over multiple rows like the table below. The variables are string except for the id, and there exist some duplicate entries for some variables (like the last row in the table).
id  var1   var2   var3
1   name1
1          name2
1                 name3
2   name4
2          name5
3   name6
3                 name8
3                 name9
I want to take the first value and combine all variables into one row / observation. I think this is a really easy task, but somehow I can't figure it out.
id  var1   var2   var3
1   name1  name2  name3
2   name4  name5
3   name6         name8
This looks like a job for collapse.
collapse (firstnm) var*, by(id)
I am going to assume as implied in text that name9 is really the same as name8. That being so, here is one solution.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str5(var1 var2 var3)
1 "name1" "" ""
1 "" "name2" ""
1 "" "" "name3"
2 "name4" "" ""
2 "" "name5" ""
3 "name6" "" ""
3 "" "" "name8"
3 "" "" "name8"
end
forval j = 1/3 {
bysort id (var`j') : replace var`j' = var`j'[_N]
}
duplicates drop
+----------------------------+
| id var1 var2 var3 |
|----------------------------|
1. | 1 name1 name2 name3 |
2. | 2 name4 name5 |
3. | 3 name6 name8 |
+----------------------------+
EDIT In the event that what is wanted is the first non-missing string value, collapse remains the solution of choice, but here is a solution without it.
clear
input byte id str5(var1 var2 var3)
1 "name1" "" ""
1 "" "name2" ""
1 "" "" "name3"
2 "name4" "" ""
2 "" "name5" ""
3 "name6" "" ""
3 "" "" "name8"
3 "" "" "name9"
end
gen long obsno = _n
forval j = 1/3 {
bysort id : egen firstnm = min(cond(var`j' != "", obsno, .))
replace var`j' = var`j'[firstnm]
drop firstnm
}
drop obsno
duplicates drop
list

Populating a dataset depending on the values of a variable in another dataset

I have two data sets INPUT and OUTPUT.
data INPUT;
input
id 1-4
var1 $ 6-10
var2 $ 12-17
var3 $ 19-22
transformation $ 24-26
;
datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
run;
The OUTPUT dataset has a different structure. The variables do not have the same name.
data work.output;
attrib
variable_1 length=8 format=best12. label="Variable 1"
variable_2 length=$50 format=$50. label="Variable 2"
Variable_3 length=8 format=date9. label="Variable 3";
stop;
run;
OUTPUT will be filled with the values from input based on what is specified in column "transformation" in table INPUT: when "transformation" equals "1:1", I want to fill the OUTPUT ds with the values of the corresponding INPUT dataset. If this were a small excel, I would do copy & paste or a lookup.
For example, obs1 of dataset INPUT has transformation = 1:1, so I want to fill variable_1 of dataset OUTPUT with "apple", variable_2 with "banana" and variable_3 with "oats".
For the second observation of ds INPUT I want to multiply each variable with two and assign them to variable_1 - variable_3 respectively.
In my real dataset I have many more columns, so I need to automate this, probably via an index, since the variable names do not correspond.
You probably need to code each transformation rule separately.
This works for your example. But you did not include any date transformations so variable3 is not used.
data INPUT;
input
id 1-4
var1 $ 6-10
var2 $ 12-17
var3 $ 19-22
transformation $ 24-26
;
datalines;
1023 apple banana oats 1:1
1049 12 22 8 2x
1219 milk cream fish 1:1
;
proc transpose data=input prefix=value out=step1;
by id transformation;
var var1-var3 ;
run;
data output;
set step1;
length variable1 8 variable2 $50 variable3 8;
format variable3 date9.;
if transformation='1:1' then variable2=value1;
if transformation='2x' then variable1 = 2*input(value1,32.);
run;
Result
Obs id transformation _NAME_ value1 variable1 variable2 variable3
1 1023 1:1 var1 apple . apple .
2 1023 1:1 var2 banana . banana .
3 1023 1:1 var3 oats . oats .
4 1049 2x var1 12 24 .
5 1049 2x var2 22 44 .
6 1049 2x var3 8 16 .
7 1219 1:1 var1 milk . milk .
8 1219 1:1 var2 cream . cream .
9 1219 1:1 var3 fish . fish .

SAS flag each row that contains the max value

I tried searching but couldn't exactly find what I was looking for. I have a dataset with multiple rows per ID. I'd like to add a variable called maxdec and show a 1 for each row that has the max dec for each ID.
Sample Dataset:
ID DEC
123 1
123 2
123 2
123 2
456 2
456 3
456 3
Desired Output:
ID DEC MAXDEC
123 1 .
123 2 1
123 2 1
123 2 1
456 2 .
456 2 .
456 3 1
It is easier to define it with 1 or 0 instead of 1 or missing.
proc sql;
create table want as
select id,dec, dec=max(dec) as maxdec
from have
group by id
;
quit;
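If the 1-or-missing coding from the desired output is required after all, a CASE expression can take the place of the Boolean comparison. This is a minimal sketch that assumes the sample data from the question. Like the query above, it relies on PROC SQL remerging the group maximum back onto every row, which SAS reports with a note in the log.

```sas
/* Sample data from the question. */
data have;
  input id dec;
  datalines;
123 1
123 2
123 2
123 2
456 2
456 3
456 3
;

proc sql;
  create table want as
  select id, dec,
         /* 1 on rows holding the group maximum, missing otherwise */
         case when dec = max(dec) then 1 else . end as maxdec
  from have
  group by id
  order by id, dec;
quit;
```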
proc sort data=have;
by id;
proc summary data=have nway;
class id;
var dec;
output out=max_info max=max_value;
run;
data want;
merge have
max_info (keep=id max_value)
;
by id;
if dec=max_value then maxdec=1;
run;
The proc summary calculates the maximum value of DEC for each ID, and outputs as variable MAX_VALUE in dataset MAX_INFO. The subsequent data step assigns MAXDEC=1 if the current value of DEC is equal to MAX_VALUE for that ID.
Here is a DoW loop approach
data have;
input ID DEC;
datalines;
123 1
123 2
123 2
123 2
456 2
456 3
456 3
;
data want(drop = m);
do _N_ = 1 by 1 until (last.id);
set have;
by id;
m = max(m, dec);
end;
do _N_ = 1 to _N_;
set have;
maxdec = ifn(dec = m, 1, .);
output;
end;
run;

Choosing the row with the maximum character length in sas

I have the following dataset:
dataseta:
No. Name1 Name2 Sales Inv Comp
1 TC Tribal Council Inc 100 100 0
2. TC Tribal Council Limited INC 20 25 65
desired output:
datasetb:
No. Name1 Name2 Sales Inv Comp
1 TC Tribal Council Limited Inc 120 125 0
Basically, I need to choose the row with the maximum length of characters for the column name2.
I tried the following, but it didn't work
proc sql;
create table datasetb as select no,name1,name2,sum(sales),sum(inv),min(comp) from dataseta group by 1,2,3 having length(name2)=max(length(name2));quit;
If I do the following code, it only partially resolves it, and I get duplicate rows
proc sql;
create table datasetb as select no,name1,max(length(name2)),sum(sales),sum(inv),min(comp) from dataseta group by 1,2 having length(name2)=max(length(name2));quit;
You appear to be joining the results of two separate aggregate computations.
Presuming:
- no is unique, so it can serve as a tie-breaker, and
- the first (lowest no) longest name2 within each name1 is to be joined with the sales, inv and comp totals over name1.
The query has a lot going on:
- First longest name2 within name1. Nested subqueries are needed to determine the longest name2 and then, if there is more than one, select the first according to no.
- Totals over name1. The totals are computed in a subquery that is joined to the first result to deliver the desired result set.
Example (SQL)
data have;
length no 8 name1 $6 name2 $35 sales inv comp 8;
input
no name1& name2& sales inv comp; datalines;
1  TC  Tribal Council Inc  100 100 0   * name1=TC group
2  TC  Tribal Council Limited INC  20 25 65
3  TC  Tribal council co  0 0 0
4  TC  The Tribal council Assoctn  10 10 10
7  LS  Longshore association  10 10 0   * name1=LS group
8  LS  The Longshore Group, LLC  2 4 8
9  LS  The Longshore Group, llc  15 15 6
run;
proc sql;
create table want as
select
first_longest_name2.no,
first_longest_name2.name1,
first_longest_name2.name2,
name1_totals.sales,
name1_totals.inv,
name1_totals.comp
FROM
(
select
no, name1, name2
from
( select
no, name1, name2
from have
group by name1
having length(name2) = max(length(name2))
) longest_name2s
group by name1
having no = min(no)
) as
first_longest_name2
LEFT JOIN
(
select
name1,
sum(sales) as sales,
sum(inv) as inv,
sum(comp) as comp
from
have
group by name1
) as
name1_totals
ON
first_longest_name2.name1 = name1_totals.name1
;
quit;
Example (DATA Step)
Processing the data in a serial manner, when name1 groups are contiguous rows, can be accomplished using a DOW loop technique -- that is a loop with a SET statement within it.
data want2;
do until (last.name1);
set have;
by name1 notsorted;
if length(name2) > longest then do;
longest = length(name2);
no_at_longest = no;
name2_at_longest = name2;
end;
sales_sum = sum(sales_sum,sales);
inv_sum = sum(inv_sum,inv);
comp_sum = sum(comp_sum,comp);
end;
drop name2 no sales inv comp longest;
rename
no_at_longest = no
name2_at_longest = name2
sales_sum = sales
inv_sum = inv
comp_sum = comp
;
run;

Carrying string labels of string variable after reshape

I have a dataset in Stata that looks like this
entityID indicator indicatordescr indicatorvalue
1 gdp Gross Domestic 100
1 pop Population 15
1 area Area 50
2 gdp Gross Domestic 200
2 pop Population 10
2 area Area 300
and there is a one-to-one mapping between values of indicator and values of indicatordescr.
I want to reshape it to wide, i.e. to:
entityID gdp pop area
1 100 15 50
2 200 10 300
where I would like gdp variable label to be "Gross Domestic", pop label "Population" and area "Area".
Unfortunately, as I understand, it is not possible to assign the value of indicatordescr as a value label of indicator, so the reshape can't transform these value labels into variable labels.
I have looked at this : Bring value labels to variable labels when reshaping wide
and this : http://www.stata.com/support/faqs/data-management/apply-labels-after-reshape/
but did not understand how to apply those to my case.
NB: the variable labeling after reshape must be done programmatically, because indicator and indicatordescr have many values.
"String labels" here is informal; Stata does not support value labels for string variables. However, what is wanted here is that the distinct values of a string variable become variable labels on reshaping.
Various work-arounds exist. Here's one: put the information in the variable name and then take it out again.
clear
input entityID str4 indicator str14 indicatordescr indicatorvalue
1 gdp "Gross Domestic" 100
1 pop "Population" 15
1 area "Area" 50
2 gdp "Gross Domestic" 200
2 pop "Population" 10
2 area "Area" 300
end
gen what = indicator + "_" + subinstr(indicatordescr, " ", "_", .)
keep entityID what indicatorvalue
reshape wide indicatorvalue , i(entityID) j(what) string
foreach v of var indicator* {
local V : subinstr local v "_" " ", all
local new : word 1 of `V'
rename `v' `new'
local V = substr("`V'", strpos("`V'", " ") + 1, .)
label var `new' "`V'"
}
renpfix indicatorvalue
EDIT If the length of variable names bites, try another work-around:
clear
input entityID str4 indicator str14 indicatordescr indicatorvalue
1 gdp "Gross Domestic" 100
1 pop "Population" 15
1 area "Area" 50
2 gdp "Gross Domestic" 200
2 pop "Population" 10
2 area "Area" 300
end
mata : sdata = uniqrows(st_sdata(., "indicator indicatordescr"))
keep entityID indicator indicatorvalue
reshape wide indicatorvalue , i(entityID) j(indicator) string
renpfix indicatorvalue
mata : for(i = 1; i <= rows(sdata); i++) stata("label var " + sdata[i, 1] + " " + char(34) + sdata[i,2] + char(34))
LATER EDIT Although the above is called a work-around, it is a much better solution than the previous.