Reshape/pivot pandas dataframe - python-2.7

I have a dataframe with variables: id, 2001a, 2001b, 2002a, 2002b, 2003a, 2003b, etc.
I am trying to figure out a way to pivot the data so the variables are: id, year, a, b
The pandas 0.16.2 documentation covers reshaping and pivoting, but it seems to be aimed more at hierarchical columns.
Any suggestions?
I am thinking about creating a hierarchical dataframe, but I am not sure how to map the year in the original variable names to a newly created hierarchical column level.
sample df:
id 2001a 2001b 2002a 2002b 2003a etc.
1 242 235 5735 23 1521
2 124 168 135 1361 1
3 436 754 1 24 5124
etc.

Here is a way to create hierarchical columns.
import pandas as pd

df = pd.DataFrame({'2001a': [242,124,236],
                   '2001b': [242,124,236],
                   '2002a': [242,124,236],
                   '2002b': [242,124,236],
                   '2003a': [242,124,236]})
# Splitting on the captured digits yields ('', '2001', 'a'); drop the empty first level
df.columns = df.columns.str.split(r'(\d+)', expand=True).droplevel(0)
df
  2001      2002      2003
     a    b    a    b    a
0  242  242  242  242  242
1  124  124  124  124  124
2  236  236  236  236  236
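To get all the way from the wide layout to id, year, a, b, one possible approach (a sketch, not necessarily what the answer above had in mind) is to melt the frame, split each old column name into a year and a letter, and pivot the letters back into columns. The frame below reuses the first four value columns of the sample data in the question; the names long, col, value and var are placeholders introduced here, and the split assumes every column name is a four-digit year followed by a letter.

```python
import pandas as pd

# First four value columns of the sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3],
    "2001a": [242, 124, 436],
    "2001b": [235, 168, 754],
    "2002a": [5735, 135, 1],
    "2002b": [23, 1361, 24],
})

# Wide -> long: one row per (id, original column name)
long = df.melt(id_vars="id", var_name="col", value_name="value")

# Assumption: names are a 4-digit year followed by the variable letter
long["year"] = long["col"].str[:4].astype(int)
long["var"] = long["col"].str[4:]

# Long -> one row per (id, year), with a and b as columns
result = (
    long.pivot_table(index=["id", "year"], columns="var",
                     values="value", aggfunc="first")
        .reset_index()
)
result.columns.name = None  # drop the leftover "var" label on the columns
print(result)
```

If some year/letter combinations are missing (such as 2003b in the question), the corresponding cells would come out as NaN.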

How to shift side-by-side bars in Proc SGPLOT for Time Data? discreteoffset doesn't work with (type=time) option

I am trying to create a side-by-side, dual-axis bar chart in proc sgplot for data that is based on dates. I am stuck on one last thing: I cannot shift the bars with the discreteoffset option on vbar, because I am using type=time on the xaxis. If I comment that out, the bars are shifted, but then the xaxis tick values look clumsy. Is there any other option that can move the bars for date/time data? My SAS code follows.
data input;
input people visits outcome date date9.;
datalines;
41 448 210 1-Jan-18
43 499 207 1-Feb-18
45 544 221 1-Mar-18
49 564 239 1-Apr-18
39 575 236 1-May-18
37 549 210 1-Jun-18
51 602 263 1-Jul-18
32 586 208 1-Aug-18
52 557 225 1-Sep-18
41 534 227 1-Oct-18
48 499 217 1-Nov-18
44 514 235 1-Dec-18
31 582 281 1-Jan-19
33 545 269 1-Feb-19
38 574 259 1-Mar-19
29 564 247 1-Apr-19
29 642 274 1-May-19
28 556 216 1-Jun-19
20 531 187 1-Jul-19
31 604 226 1-Aug-19
19 513 186 1-Sep-19
24 483 185 1-Oct-19
28 401 156 1-Nov-19
18 450 158 1-Dec-19
21 418 178 1-Jan-20
28 396 149 1-Feb-20
43 488 177 1-Mar-20
33 539 205 1-Apr-20
57 631 244 1-May-20
54 695 291 1-Jun-20
58 732 309 1-Jul-20
62 681 301 1-Aug-20
42 654 291 1-Sep-20
57 749 365 1-Oct-20
60 627 249 1-Nov-20
56 623 244 1-Dec-20
54 712 298 1-Jan-21
62 655 262 1-Feb-21
;
run;
proc sgplot data=input;
format date monyy7.;
styleattrs datacolors=(Red DarkBlue) datacontrastcolors=(black black) datalinepatterns=(solid);
vbar date / response=visits discreteoffset=-0.17 barwidth=0.3;
vbar date / response=outcome discreteoffset=0.17 barwidth=0.3;
vline date / response=people y2axis lineattrs=(color=black thickness=3);
xaxis display=(nolabel) /*fitpolicy=rotate valuesrotate=vertical*/ type=time /*interval=month*/;
yaxis grid label='Label1' values=(0 to 800 by 100);
y2axis label='Label2' values=(0 to 70 by 10);
keylegend / title="";
run;
Output I am getting:
Output I want: (With shifted bars, but it is changing dates)
Appreciate any help!
Thank you.
Reshape the data with proc transpose so that the variables you want side by side become categorical, i.e. name/value pairs. The name can then be used in vbar as the group= variable together with groupdisplay=cluster.
Note: The xaxis type=time option appears to perform special checks based on the format of the vbar variable, and will render a pretty two-line axis label when that format is date9. I've never seen this discussed in the documentation.
Example:
Uses name= in the plotting statements so the keylegend can look prettier.
proc transpose data=input out=plot;
by date; /* the posted data has no rowid variable; it is already sorted by date */
copy people;
var visits outcome;
run;
proc sgplot data=plot;
vbar date / response=col1 group=_name_ groupdisplay=cluster name='relatedcounts';
vline date / response=people group=_name_ y2axis lineattrs=(color=black thickness=3) name='people';
xaxis
type = time
interval = month
;
format date date9.;
yaxis grid label='Related counts' values=(0 to 800 by 100);
y2axis label='# People' values=(0 to 70 by 10);
keylegend 'relatedcounts' / title="";
run;

Looping in SAS to bring the latest value

I am trying to find the days that match a given reference number of days, or otherwise the number of days closest to the reference days.
I have coded this far, but I am not sure how to go forward.
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 560 86 181
3 2016-02-12 560 82 263
3 2016-02-12 560 69 332
3 2016-02-12 560 77 409
So now I want to bring out the last value closest to the reference days,
and the next total_days should start from zero again to find the next window. How can I do this?
Here is the code that I wrote:
data want;
do until (totaldays <= ref_days);
set have;
by ID ref_days notsorted;
if first.id then totaldays=0;
else totaldays+lags;
end;
run;
Required Output:
ID Date ref_days lags total_days
1 2017-02-02 224 . 0
1 2017-02-02 224 84 84
1 2017-02-02 224 84 168
2 2015-01-21 213 300 388
3 2016-02-12 560 95 .
3 2016-02-12 300 86 181
3 2016-02-12 300 82 263
3 2016-02-12 300 69 .
3 2016-02-12 300 77 146
A while ago I did something similar to this with proc sql. It calculates all the distances and keeps the closest one. It works with moderately sized datasets. Hopefully it is of some use.
proc sql;
select * from
(
select *,
abs(t1.link-t2.link) as dist /*In your case these would be dateVars*/
from test1 t1
left join test2 t2
on 1=1) group by system1 having dist=min(dist);
;
quit;
There was some talk that the left join on 1=1 is a bit silly (a full outer join would suffice, or something like that). However, this worked for the problem in question.
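For comparison, the same "compute every distance, keep the closest" idea can be sketched in pandas. This is only an illustration under assumptions: the table names test1 and test2 and the columns link and system1 follow the query above, while the sample values and the system2 column are made up here. It also assumes pandas 1.2 or later for how="cross".

```python
import pandas as pd

# Hypothetical sample data; only the column names link/system1 come from the query
test1 = pd.DataFrame({"system1": ["A", "B"], "link": [10, 50]})
test2 = pd.DataFrame({"system2": ["X", "Y", "Z"], "link": [12, 45, 100]})

# Cross join: every row of test1 paired with every row of test2
pairs = test1.merge(test2, how="cross", suffixes=("_1", "_2"))

# Distance between the two link values for each pair
pairs["dist"] = (pairs["link_1"] - pairs["link_2"]).abs()

# Keep, per system1, the pair with the smallest distance
closest = pairs.loc[pairs.groupby("system1")["dist"].idxmin()].reset_index(drop=True)
print(closest)
```

One difference worth noting: idxmin keeps a single row per group when distances tie, whereas having dist=min(dist) would keep every tied row. Like the SQL version, the cross join grows with the product of the two table sizes, so it suits moderate data only.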

How to determine the number of filled drums, and the room left in each drum

Not quite a homework problem, but it may as well be:
You have a long list of positive integer values stored in column A. These are packets in unit U.
A Drum can fit up to 500 U, but you cannot break up packets.
How many drums are required for any given list of values in column A?
This does not have to be the most efficient answer, processing in row order is absolutely fine.
I think you should be able to solve this with a formula, but the closest I got was
=CEILING(SUM(A1:A1000)/500;1)
Of course, this breaks up packets.
Additionally, this problem requires me to be able to find the room left in each drum used, but emphasis for this question should remain on just the number required.
This cannot be done with a single simple formula. Each drum and packet needs to be counted. However, contrary to my comment, a spreadsheet works well for this particular problem, and there is no need for a macro.
First, set B2 to 500 for use in the other formulas. If column A is not yet filled, use the formula =RANDBETWEEN(1,B$2) to add some values.
Column C holds the main formula, which determines how full the current drum is after each packet. Set C2 to =A2. C3 is =IF(C2+A3>B$2,A3,C2+A3). Fill C3 down through the remaining rows.
For column D, set D2 to =IF(C2+A3>B$2,B$2-C2,"") and fill it down. This shows the room left whenever the next packet starts a new drum. The last row of column D has nothing below it to compare against, so it uses the shorter formula =B$2-C21; change 21 to whatever the last row is.
Finally, column E holds the answer, which is simply =COUNT(D2:D21).
Packets  Drum Size  How Full  Room left in each drum used  Number of filled drums
-------  ---------  --------  ---------------------------  ----------------------
206      500        206       294                          13
309                 309
68                  377
84                  461       39
305                 305       195
387                 387       113
118                 118
8                   126       374
479                 479       21
492                 492       8
120                 120
291                 411       89
262                 262
108                 370       130
440                 440       60
88                  88
100                 188
102                 290       210
478                 478       22
87                  87        413
For OpenOffice Calc, use semicolons ; instead of commas , in formulas.
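The same row-order logic can also be written as a short loop, which may make the column formulas easier to follow. This is a sketch of that logic only, not an optimal packing; the function name pack_in_order is made up for the illustration, and the packet list is the one from the table above.

```python
def pack_in_order(packets, capacity=500):
    """Fill drums in row order; return the room left in each drum used."""
    room_left = []
    current = 0  # how full the current drum is (column C)
    for packet in packets:
        if packet > capacity:
            raise ValueError("a packet cannot be larger than a drum")
        if current + packet > capacity:
            # Packet does not fit: close this drum (column D), start a new one
            room_left.append(capacity - current)
            current = packet
        else:
            current += packet
    if packets:
        # The last drum is never closed by a following packet
        room_left.append(capacity - current)
    return room_left


packets = [206, 309, 68, 84, 305, 387, 118, 8, 479, 492,
           120, 291, 262, 108, 440, 88, 100, 102, 478, 87]
room = pack_in_order(packets)
print(len(room))  # number of drums used (column E)
print(room)       # room left in each drum (column D)
```

Because packets are taken strictly in order and a drum is never reopened, the count can be higher than the true minimum, which the question says is acceptable.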

SAS using Datalines - "observation read not used"

I am a complete newb to SAS and all I know is basic SQL. I am currently taking a regression class and having trouble with my SAS code.
I am trying to input two columns of data for a simple regression, where the x variable is State and the y variable is the number of accidents.
I keep getting this:
ERROR: No valid observations are found.
Number of Observations Read 51
Number of Observations Used 0
Number of Observations with Missing Values 51
Is it because datalines only reads numbers and not characters?
Here is the code as well as the datalines:
Data Firearm_Accidents_1999_to_2014;
ods graphics on;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
Connecticut 0
Delaware 0
District_of_Columbia 0
Florida 350
Georgia 413
Hawaii 0
Idaho 0
Illinois 287
Indiana 288
Iowa 0
Kansas 44
Kentucky 384
Louisiana 562
Maine 0
Maryland 21
Massachusetts 27
Michigan 168
Minnesota 0
Mississippi 332
Missouri 320
Montana 0
Nebraska 0
Nevada 0
New_Hampshire 0
New_Jersey 85
New_Mexico 49
New_York 218
North_Carolina 437
North_Dakota 0
Ohio 306
Oklahoma 227
Oregon 41
Pennsylvania 465
Rhode_Island 0
South_Carolina 324
South_Dakota 0
Tennessee 603
Texas 876
Utah 0
Vermont 0
Virginia 203
Washington 45
West_Virginia 136
Wisconsin 64
Wyoming 0
;
run; proc print;
proc reg data = Firearm_Accidents_1999_to_2014;
model State = Sum_OF_Deaths;
ods graphics off;
run; quit;
OK, some different levels of issues here.
ODS GRAPHICS statements go before and after procs, not inside a data step.
When reading a character variable you need to tell SAS so, using an informat.
This allows you to read in the data. However, your regression has several issues. For one, State is a character variable and you can't do regression with a character variable. I think that issue is beyond this forum. Review your regression basics and check what you're trying to do.
Data Firearm_Accidents_1999_to_2014;
informat state $32.;
Input State Sum_OF_Deaths;
Datalines;
Alabama 526
Alaska 0
Arizona 150
Arkansas 246
California 834
Colorado 33
....
;
run;

Merge dataset without common variable (By)?

Currently I have two datasets with similar variable lists. Each dataset has a procedure variable. I want to compare the frequency of the procedure variable between the datasets. I created a flag in both datasets to identify the source dataset, and was going to merge them, but they don't have a common identifier. How do I merge the datasets without deleting any observations? This isn't just a simple MERGE without a BY statement, right?
Currently have:
Data.a Data.b
pproc proc1_numb
70 9
71 15
77 24
80 80
81 42
83 71
86 66
87 125
121 159
125 242
Want Output:
pproc freq
9 1
15 1
24 1
42 1
66 1
70 1
71 2
77 1
80 2
81 1
83 1
86 1
87 1
121 1
125 2
159 1
242 1
If I understand your question properly, you should just concatenate the two datasets into one and rename the variable. Then you can use PROC MEANS to get the frequencies. Something like this:
data all;
set a
b(rename=(proc1_numb=pproc));
run;
proc means nway data=all noprint;
class pproc;
output out=want(drop=_type_ rename=(_freq_=freq));
run;
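For readers who want to check the expected counts outside SAS, the same "stack the two tables, then count" idea can be sketched in pandas. The values are the ones listed in the question; the frame names a, b, all_rows and want simply mirror the SAS dataset names and are otherwise arbitrary.

```python
import pandas as pd

# Values copied from the two datasets in the question
a = pd.DataFrame({"pproc": [70, 71, 77, 80, 81, 83, 86, 87, 121, 125]})
b = pd.DataFrame({"proc1_numb": [9, 15, 24, 80, 42, 71, 66, 125, 159, 242]})

# Stack the rows, renaming b's variable so both share one column
all_rows = pd.concat(
    [a, b.rename(columns={"proc1_numb": "pproc"})],
    ignore_index=True,
)

# Count how many rows carry each procedure code
want = all_rows.groupby("pproc").size().reset_index(name="freq")
print(want)
```

No observations are dropped: the 20 input rows are all counted, spread over 17 distinct codes, with 71, 80 and 125 appearing twice.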