Convert one table into another format from SAS code - sas

I have a table in SAS which looks like this:
year Country Host Code Value
2010 India Pak 220 111
2010 India Aus 220 123
2010 India NZ 220 23
2010 India SA 240 43
2010 India WI 250 124
2010 India SRI 250 325
2010 India ZIM 280 235
I want to transform this table into the following form:
Country Code Pak_2010 Aus_2010 NZ_2010 SA_2010 WI_2010 SRI_2010 ZIM_2010
India 220 111 123 23 0 0 0 0
India 240 0 0 0 43 0 0 0
India 250 0 0 0 0 124 325 0
India 280 0 0 0 0 0 0 235
For one country and code, there will be one value.
Can anyone please suggest code for doing this transformation?

This is a classic PROC TRANSPOSE, separating your ID variables with a delimiter:
PROC TRANSPOSE
        DATA=yourInput
        OUT=yourOutput(drop=_name_)
        DELIMITER=_;
    BY Country Code;
    ID Host Year;
    VAR Value;
RUN;
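Note that the BY statement assumes the input is sorted by Country and Code; if it is not, sort it first (a minimal sketch, assuming your input really is called yourInput):
PROC SORT DATA=yourInput;
    BY Country Code;
RUN;
The transposed variables come out named Pak_2010, Aus_2010 and so on, and any Country/Code combination with no row for a given host is left missing rather than 0, so a follow-up data step would be needed if you really want zeros.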

DISCOUNT with multiple criteria in Power BI

I have two tables, Data and Report.
Data table: its columns are Item, Qty, and Order. The Item column holds both text and numeric values, and the Qty column is stored as text (see the note further down).
The Item value is repeated across orders, and the same item can carry two different Qty values depending on the Order column.
Report table: it has a unique Item column.
The Data and Report tables look like this:
Data
ITEM QTY ORDER
123 200 1
123 210 0
5678 220 1
5678 230 0
5555 240 1
6666 250 1
9876 260 1
2345 270 1
901 280 1
901 280 1
902 300 1
902 300 1
123456 200 1
123456 200 1
123456 210 1
123456 210 1
123456 0 1
567 200 1
567 210 1
567 210 1
567 0 1
453 5000 1
453 5000 1
453 5000 1
453 5000 1
112 5000 1
112 5000 1
112 5000 1
112 5000 1
116 5000 1
116 5001 1
116 0 1
116 0 1
116 5000 0
116 5001 0
116 0 0
116 0 0
Report
ITEM DESIRED RESULT (QTY)
123 200
5678 220
5555 240
6666 250
9876 260
2345 270
901 280
902 300
123456 MIXED
567 MIXED
4444 NA
12 NA
10 NA
453 5000
112 5000
116 MIXED
Desired Result
I would like to pull the Qty for Order "1" from the Data table into the Report table, matched by Item:
If the item is found in the Data table, return its Qty in the Report table. {Please refer to the Data and Report tables for items 123, 5678, etc.}
If an item is not found in the Data table, return "NA" in the Report table. {Please refer to the Data and Report tables for items 10, 12 and 4444.}
If the same item has two different Qty values, return the text "Mixed" in the Report table. {Please refer to the Data and Report tables for items 123456, 116 and 567.}
Currently I am using the following calculated column: CURRENT DAX FOR QTY = LOOKUPVALUE(DATA[QTY], DATA[ITEM], 'DESIRED RESULT'[ITEM], DATA[ORDER], 1, "NA")
It is almost working, but it gives the wrong result "NA" where there are two different Qty values for the same item across orders (0,1), (1) or (0) {please refer to the Data and Report tables for items 123456, 116 and 567}; the desired result for those three items is "Mixed".
Note: I converted the Qty column from number to text, otherwise it gives an error. Is there any alternative option to achieve my result?
Herewith attached the PBI file for your reference https://www.dropbox.com/s/hf40q27pvn3ij2g/DAX-LOOKUPVALUE%20FILTER%20BY.pbix?dl=0.
If I'm understanding correctly, this can be done with the method I suggested previously with the addition of a filter for DATA[ORDER] = 1.
IF (
    CALCULATE ( DISTINCTCOUNT ( DATA[QTY] ), DATA[ORDER] = 1 ) > 1,
    "MIXED",
    CALCULATE ( SELECTEDVALUE ( DATA[QTY], "NA" ), DATA[ORDER] = 1 )
)
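For what it's worth, this also covers the "NA" items because SELECTEDVALUE ( DATA[QTY], "NA" ) returns the alternate result whenever the ORDER = 1 rows for the current item do not contain exactly one distinct QTY, which includes items that are missing from the Data table entirely; the DISTINCTCOUNT test catches the "MIXED" case first.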

Looping in SAS to bring the latest value

I am trying to find days matching a given reference number of days, or else the number of days closest to the reference days.
I have coded up to this point; however, I am not sure how to go forward.
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 560 86 181
3 2016-02-12 560 82 263
3 2016-02-12 560 69 332
3 2016-02-12 560 77 409
So now I want to bring out the last value closest to the reference days,
and the next total_days should start from zero again to find the next window. How can I do this?
Here is the code that I wrote:
data want;
    do until (totaldays <= ref_days);
        set have;
        by ID ref_days notsorted;
        if first.id then totaldays=0;
        else totaldays+lags;
    end;
run;
Required Output:
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 300 86 181
3 2016-02-12 300 82 263
3 2016-02-12 300 69 .
3 2016-02-12 300 77 146
A while ago I did something similar to this via PROC SQL. It calculates all the distances and takes the closest one. It works with moderate-size datasets. Hopefully it is of some use.
proc sql;
    select * from
        (select *,
                abs(t1.link - t2.link) as dist  /* in your case these would be date variables */
         from test1 t1
         left join test2 t2
             on 1=1)
    group by system1
    having dist = min(dist);
quit;
There was some talk that the left join on 1=1 is a bit silly (a full outer join would suffice, or something), but this worked for the problem in question.
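Adapted to the dataset in this question, the same idea could look something like this (a sketch only: it assumes total_days has already been accumulated in have, and it does not deal with ties):
proc sql;
    create table closest as
    select *
    from have
    group by ID
    having abs(total_days - ref_days) = min(abs(total_days - ref_days));
quit;
For each ID this keeps the row(s) whose total_days is closest to ref_days; SAS will note in the log that it is remerging summary statistics back with the original data, which is exactly what we want here.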

SAS using Datalines - "observation read not used"

I am a complete newbie to SAS and all I know is basic SQL. I am currently taking a regression class and having trouble with the SAS code.
I am trying to input two columns of data where the x variable is State and the y variable is the number of accidents, for a simple regression.
I keep getting this:
ERROR: No valid observations are found.
Number of Observations Read 51
Number of Observations Used 0
Number of Observations with Missing Values 51
Is it because datalines only read numbers and not characters?
Here is the code as well as the datalines:
Data Firearm_Accidents_1999_to_2014;
ods graphics on;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
Connecticut 0
Delaware 0
District_of_Columbia 0
Florida 350
Georgia 413
Hawaii 0
Idaho 0
Illinois 287
Indiana 288
Iowa 0
Kansas 44
Kentucky 384
Louisiana 562
Maine 0
Maryland 21
Massachusetts 27
Michigan 168
Minnesota 0
Mississippi 332
Missouri 320
Montana 0
Nebraska 0
Nevada 0
New_Hampshire 0
New_Jersey 85
New_Mexico 49
New_York 218
North_Carolina 437
North_Dakota 0
Ohio 306
Oklahoma 227
Oregon 41
Pennsylvania 465
Rhode_Island 0
South_Carolina 324
South_Dakota 0
Tennessee 603
Texas 876
Utah 0
Vermont 0
Virginia 203
Washington 45
West_Virginia 136
Wisconsin 64
Wyoming 0
;
run; proc print;
proc reg data = Firearm_Accidents_1999_to_2014;
model State = Sum_OF_Deaths;
ods graphics off;
run; quit;
OK, there are a few different levels of issues here.
ODS GRAPHICS statements go before and after procs, not inside them.
When reading a character variable you need to tell SAS it is character, for example by using an informat.
This allows you to read in the data. However, your regression has several issues. For one, State is a character variable and you cannot do regression with a character variable. That issue is beyond this forum; review your regression basics and check what you're trying to do.
Data Firearm_Accidents_1999_to_2014;
informat state $32.;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
....
;
run;
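For the ODS GRAPHICS point, the placement looks like this (a sketch only; Some_Numeric_Predictor is a made-up stand-in, since, as noted above, the model as posted regresses on the character variable State and needs rethinking):
ods graphics on;

proc reg data = Firearm_Accidents_1999_to_2014;
    model Sum_OF_Deaths = Some_Numeric_Predictor; /* hypothetical numeric predictor */
run;
quit;

ods graphics off;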

Perl regex: More than one line to match and output in a given format: Completely revised

If one can understand how to store stuff in memory, know where it is, and how to get it out again in an orderly way, it will go a long way toward achieving results in Perl (and probably all programming languages).
I am not a programmer.
I am trying to extract data from 'older program' output and import it into a SQL database. The extraction is the thing.
My previous question was largely incorrect, as I found out when importing data into the table, as I did not have enough data from the 'old program' output file.
I would like to learn from my mistake and re-ask my previous question, hopefully correctly this time.
I have included my poor effort at extracting the data, exactly as it was last time. It doesn't come anywhere near getting the correct data out.
I believe this is quite a complex question but maybe it isn't.
It is certainly above my level of Perl at present, and maybe ever.
Answers to my incorrectly phrased question have been partially understood. Thank you very much for them.
If I could summarize it, my main problem with this task is dealing with questions of the type: 'If a line contains ..., get data from two lines up and insert it at the beginning of ...'. Seemingly impossible for me.
I tried regex searches spanning the end of line but was unable to get that to work.
I was also unable to arrange successive loops to insert data in lines as I wanted. If one loop worked, the next did not, and so on. I was prepared to work on successive files in a step-by-step process, but the 'two lines up' question stumped me completely.
I was able to extract other data from these output files relatively easily as they are very orderly files, but this particular question has me stumped.
My revised question is:
My input file consists of batches of data (roughly 50-70 lines long) in the following format:
1(P1) 3 P.ell 05/0120 W/P068819 0 12.0 98/99 380 380 C03 104 PROCESSED 21/02/16 TIME 22.16.52 KSINA=8
AGE SPH %THN %INC SV SI MAI20 HTPC VIPC AGE BA DBH HT SPH CIH% CIV% CVD BCON CMAI C0 C0CAL SI20
0 1100 .0 89.0%SPH 2 2 .00 .0 .0 20.00 1 .0 17.3 0 .0 .0 .0 .0 .0000 .000 0% .00
7 815 25.9 .0 2 2 9.90 75.5 47.2 20.00 1 26.6 17.3 330 .0 .0 .0 13.0 .2099 1.005 .000 17.30
13 550 32.5 .0
18 330 40.0 .0
45 0 100.0 .0
0SQ -4 -4 -4 = SI20 17 17 17 PLANTN---104 GREEN MEADOWS MODEL---P.ELLIOTTII MAC MAC SQ 10 SI20 22.90
HTPC 76 76 76 =MAI20 10 10 10 FROM HTPC HTPC 100 MAI20 20.71
VIPC 47 47 47 =MAI20 10 10 10 HTPC/VIPC REGRESSION---P.ELLIOTTII GENERAL 1/83 VIPC 100 MAI20 20.99
MAIDBH 0
INMAI==> 0
0INPUT FOR CALCULATING HTPC & VIPC = HT ---- ----
AGE DBH HT VTREE SPH BA TOTAL WS UTIL S A B C D TCAI CTCAI TMAI UCAI CUCAI UMAI SCAI CSCAI SMAI IAGE
1 .0 .2 .0000 979 0 0 0 0 0 0 0 0 0 .0 0 .0 .0 0 .0 .0 0 .0 1.0
2 .0 .9 .0000 979 0 0 0 0 0 0 0 0 0 .0 0 .0 .0 0 .0 .0 0 .0 2.0
3 3.9 2.0 .0007 979 1 1 1 0 0 0 0 0 0 .7 1 .2 .0 0 .0 .0 0 .0 3.0
4 7.1 3.4 .0041 979 4 4 3 1 1 0 0 0 0 3.4 4 1.0 .6 1 .2 .0 0 .0 4.0
5 9.4 4.6 .0102 979 7 10 5 5 5 0 0 0 0 5.9 10 2.0 4.1 5 .9 .0 0 .0 5.0
6 11.3 5.7 .0188 979 10 18 6 12 12 1 0 0 0 8.4 18 3.1 7.5 12 2.0 .0 0 .0 6.0
7 13.0 6.7 .0293 979 13 29 7 22 19 3 0 0 0 10.3 29 4.1 9.7 22 3.1 .0 0 .0 7.0
17%
THN 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0
REM 13.4 6.7 .0315 815 12 26 6 20 17 3 0 0 0
8 15.0 7.6 .0453 815 14 37 6 31 21 10 0 0 0 11.2 40 5.0 10.9 33 4.1 .0 0 .0 7.6
9 16.4 8.5 .0607 815 17 49 6 43 23 20 0 0 0 12.5 52 5.8 12.2 45 5.0 .2 0 .0 8.6
10 17.4 9.4 .0771 815 19 63 7 56 24 30 2 0 0 13.4 66 6.6 13.1 58 5.8 1.3 2 .2 9.6
11 18.3 10.3 .0941 815 21 77 7 70 24 41 5 0 0 13.9 80 7.3 13.6 72 6.5 3.0 5 .4 10.6
12 19.0 11.3 .1118 815 23 91 7 84 24 50 10 0 0 14.4 94 7.8 14.1 86 7.2 5.4 10 .8 11.6
13 19.6 12.2 .1299 815 25 106 8 98 24 56 18 0 0 14.7 109 8.4 14.4 100 7.7 8.0 18 1.4 12.6
33%
THN 17.5 12.2 .1044 265 6 28 2 25 8 15 3 0 0
REM 20.6 12.2 .1421 550 18 78 5 73 16 42 15 0 0
14 21.3 13.0 .1636 550 20 90 6 84 16 44 25 0 0 11.8 121 8.6 11.6 112 8.0 10.0 28 2.0 10.4
15 22.0 13.7 .1864 550 21 103 6 97 16 45 36 0 0 12.5 133 8.9 12.3 124 8.3 11.0 39 2.6 11.2
16 22.7 14.5 .2100 550 22 116 6 109 15 46 48 0 0 13.0 146 9.1 12.7 137 8.6 12.0 51 3.2 12.0
17 23.3 15.3 .2345 550 23 129 6 123 15 46 61 0 0 13.5 160 9.4 13.2 150 8.8 12.9 64 3.8 12.8
18 23.9 15.9 .2598 550 25 143 7 136 15 46 74 1 0 13.9 174 9.6 13.6 164 9.1 13.8 78 4.3 13.6
40%
THN 21.6 15.9 .2142 220 8 47 2 45 6 19 20 0 0
REM 25.3 15.9 .2901 330 17 96 4 92 9 28 54 1 0
19 26.0 16.6 .3203 330 17 106 4 101 9 27 63 3 0 10.0 184 9.7 9.8 174 9.1 10.5 88 4.6 11.0
20 26.6 17.3 .3519 330 18 116 5 112 9 27 71 5 0 10.4 194 9.7 10.2 184 9.2 10.6 99 4.9 11.7
21 27.2 18.0 .3849 330 19 127 5 122 9 27 80 8 0 10.9 205 9.8 10.7 194 9.3 11.1 110 5.2 12.4
22 27.9 18.7 .4192 330 20 138 5 133 8 26 87 11 0 11.3 216 9.8 11.1 206 9.3 11.5 121 5.5 13.2
23 28.4 19.3 .4546 330 21 150 5 145 8 26 94 16 0 11.7 228 9.9 11.4 217 9.4 11.8 133 5.8 14.0
24 29.0 20.0 .4914 330 22 162 5 157 8 26 101 22 0 12.2 240 10.0 11.9 229 9.5 12.3 145 6.1 14.9
25 29.6 20.6 .5292 330 23 175 6 169 8 25 106 29 0 12.5 253 10.1 12.2 241 9.6 12.6 158 6.3 15.7
26 30.2 21.2 .5682 330 24 188 6 182 8 25 112 37 0 12.9 265 10.2 12.6 254 9.8 13.0 171 6.6 16.5
27 30.7 21.8 .6083 330 25 201 6 194 8 25 115 46 0 13.2 279 10.3 13.0 267 9.9 13.3 184 6.8 17.3
28 31.3 22.4 .6492 330 25 214 7 208 8 24 119 56 1 13.5 292 10.4 13.2 280 10.0 13.6 198 7.1 18.2
29 31.9 23.0 .6908 330 26 228 7 221 8 24 122 65 2 13.7 306 10.5 13.5 293 10.1 13.8 212 7.3 19.0
30 32.4 23.5 .7332 330 27 242 7 235 8 24 123 77 3 14.0 320 10.7 13.7 307 10.2 14.0 226 7.5 19.8
31 33.0 23.9 .7766 330 28 256 7 249 8 24 125 88 5 14.3 334 10.8 14.0 321 10.4 14.3 240 7.7 20.4
32 33.6 24.4 .8202 330 29 271 8 263 8 23 126 99 7 14.4 349 10.9 14.1 335 10.5 14.4 255 8.0 21.0
Firstly, the two variables in the first line (the one starting 1(P1), in this case 'C03 104') need to be extracted from it and sent to OUTPUT. (Same as the previous question, but the output position changes.)
Secondly, all lines beginning with 'THN' need to be extracted as they are, except that the THN itself can be dropped.
If there are two, three, four or even five etc. 'THN' lines, they all need to be extracted from the batch and sent to OUTPUT. (Roughly the same as the previous question.)
Thirdly (although sequentially this is the second step), the last figure in the 'AGE' column of the main tabular data just before each 'THN' line needs to be attached to the extracted 'THN' line directly below it (in this case the figures 7, 13 and 18). These need to be added to their respective THN lines. See the expected output below, where the ages have been inserted after the two 'C03 104' variables in each line.
If there are no 'THN' lines in a given batch, the entire batch should be ignored, with no output, and the next batch (starting with a '1(P1)' again) considered.
The correct output expected from the above batch is:
C03 104 7 11.4 6.7 .0230 164 2 4 1 3 2 0 0 0 0
C03 104 13 17.5 12.2 .1044 265 6 28 2 25 8 15 3 0 0
C03 104 18 21.6 15.9 .2142 220 8 47 2 45 6 19 20 0 0
As will be seen from this, the two variables from the top line are inserted at the start of the output THN data line. The age figure read from the input batch is then inserted into its respective THN line and thereafter the rest of the THN line data is attached.
My effort some time ago but not updated is as follows:
while ( my $line = <INPUT> ) {
    if ( $line =~ /\s{6,11}(\w{1}\d{1}\w{0,5})\s{0,5}(\d{3})/ ) {
        my @c_no = "$1,$2\n";
        foreach (@c_no) {
            print OUTPUT $_;
        }
        if ( $line =~ /^(\s{1}THN)(\s{1,3}\d{0,2}.\d)(\s{1,3}\d{0,2}.\d)(\s{1,2}\d{0,1}.\d{4})(\s{1,2}\d{2,4})
                       (\s{2,3}\d{1,2})(\s{1,6}\d{1,4})(\s{1,2}\d{1,2})(\s{1,5}\d{1,4})(\s{1,4}\d{1,4})
                       (\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})|(^1(P1))/x ) {
            print OUTPUT "$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14\n";
        }
    }
}
Advice, guidance and help would be greatly appreciated.
This is much more simply done using split to separate each line into space-delimited fields, and quite straightforward if you maintain state variables for the two fields from the header row, and the age from any row whose first field is entirely digits. Then all that is necessary is to print these three values before the numbers on any line that starts with THN.
Note that it's simplest to pass the name of the input file as a parameter on the command line. Then all you have to do is read from <>; all the opening and error handling are already done for you.
The output format you've asked for is rather esoteric. I can't see any pattern to the column widths and I've had to write a custom printf format to recreate it. If you need something else then all the values in each output line are in the @data array, which you can use as you wish.
use strict;
use warnings 'all';

my ($c1, $c2, $age);

while ( <> ) {

    next unless /\S/;
    my @fields = split;

    if ( $fields[0] eq '1(P1)' ) {
        ($c1, $c2) = @fields[10,11];
    }
    elsif ( $fields[0] !~ /\D/ ) {
        $age = $fields[0];
    }
    elsif ( $fields[0] eq 'THN' ) {
        my @data = ( $c1, $c2, $age, @fields[1..13] );
        printf "%4s %5s %5d %5.1f%5.1f%7.4f%5d%4d%7d%3d%6d%5d%5d%5d%5d%5d\n", @data;
    }
}
output
C03 104 7 11.4 6.7 0.0230 164 2 4 1 3 2 0 0 0 0
C03 104 13 17.5 12.2 0.1044 265 6 28 2 25 8 15 3 0 0
C03 104 18 21.6 15.9 0.2142 220 8 47 2 45 6 19 20 0 0
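To run it, you would pass the old program's output file on the command line, for example (the script name here is made up): perl extract_thn.pl old_program_output.txt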
I copied and modified your example data, so this hasn't had a really good test. And I'm printing to STDOUT for testing purposes, but that should be easy to change.
The trick is to recognize that you've got line matching to do, which is great with regexes, and other processing, which is probably better with plain old code. So build a little loop, and process the lines with equal precedence (this is important for detecting errors in the file - don't try to nest things too much). Put in some state variables to help keep track of what comes next, and be sure to reset them appropriately.
Also, one thing I noticed in your example code is that you spent a lot of time getting the spacing and number-of-digits right for the fields. That was almost certainly wasted time in this context, since the key was the "THN" at the start of the line. One trick with processing text is to focus on the things you really need, and use .* for the other stuff. That way, line noise or a syntax error or some strange formatting glitch won't screw up your program. (Sometimes .* becomes [^"]* or whatever, but you take the point...)
use strict;
use warnings;

my ($line_prefix, $have_age_col, $age_col);

while (<>) {
    if (/^1\(P1\).*\s(?P<two_vars>\w+\s+\w+)\s+PROCESSED .* TIME .* KSINA=.*$/) {
        # Start new section
        $line_prefix = $+{two_vars};
        $have_age_col = 0;
        $age_col = undef;
    }
    if (/^AGE /) {
        $have_age_col = 1;
    }
    if ($have_age_col && /^\s{0,5}(\d+)/) {
        $age_col = sprintf "%5s", $1;    # right-align the age in a 5-character field
    }
    if (/^THN /) {
        die "THN encountered without header"
            unless $line_prefix;
        die "THN encountered without age column"
            unless $have_age_col and $age_col;
        s/^THN \s*//;
        s/\s+$//;
        my $output = "$line_prefix $age_col $_\n";
        print STDOUT $output;
    }
}

Reshape/pivot pandas dataframe

I have a dataframe with variables: id, 2001a, 2001b, 2002a, 2002b, 2003a, 2003b, etc.
I am trying to figure out a way to pivot the data so the variables are: id, year, a, b
The 16.2 documentation refers to some reshaping and pivoting, but that seemed to speak more towards hierarchical columns.
Any suggestions?
I am thinking about creating a hierarchical dataframe, but am not sure how to map the year in the original variable names to a created hierarchical column.
sample df:
id 2001a 2001b 2002a 2002b 2003a etc.
1 242 235 5735 23 1521
2 124 168 135 1361 1
3 436 754 1 24 5124
etc.
Here is a way to create hierarchical columns.
import pandas as pd

df = pd.DataFrame({'2001a': [242, 124, 236],
                   '2001b': [242, 124, 236],
                   '2002a': [242, 124, 236],
                   '2002b': [242, 124, 236],
                   '2003a': [242, 124, 236]})
df.columns = df.columns.str.split(r'(\d+)', expand=True)
df
2001 2002 2003
a b a b a
0 242 242 242 242 242
1 124 124 124 124 124
2 236 236 236 236 236
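To get from there to id, year, a, b you can stack the year level of the columns. A minimal sketch, assuming an id column as in the question (the two-level columns are built explicitly here rather than via str.split):

import pandas as pd

# the sample frame from the question, with an explicit id column
df = pd.DataFrame({'id': [1, 2, 3],
                   '2001a': [242, 124, 436], '2001b': [235, 168, 754],
                   '2002a': [5735, 135, 1],  '2002b': [23, 1361, 24],
                   '2003a': [1521, 1, 5124]})

df = df.set_index('id')
# '2001a' -> ('2001', 'a'): the first four characters are the year, the rest the variable
df.columns = pd.MultiIndex.from_tuples([(c[:4], c[4:]) for c in df.columns],
                                       names=['year', None])
# move the year level into the rows, leaving plain columns a and b
out = df.stack('year').reset_index()   # columns: id, year, a, b (b is NaN where it was missing)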