If index duplicated then add column value to sum - python-2.7

The pandas DataFrame has a datetime index with the price (Last) and the volume traded at that price:
                   Last  Volume
Date_Time
20160907 070000  1.1249      17
20160907 070001  1.1248      12
20160907 070001  1.1249      15
20160907 070002  1.1248      13
20160907 070002  1.1249      20
I want to create a column that keeps a running total (sum) of volume through the sequence whenever the price repeats. I am trying to create a column that would look like this:
  Last  Volume  VolumeCount
1.1249      17           17
1.1248      12           12
1.1249      15           32
1.1248      13           25
1.1249      20           52
I have been working on different functions and loops, but I can't seem to create a column that isn't just the total sum of the group. I would really appreciate any help or suggestions. Thank you.

Try:
DF['VolumeCount'] = DF.groupby('Last')['Volume'].cumsum()
I hope this helps.

You want to accumulate volume over contiguous runs of the same Last value.
Consider the df
                  Last  Volume
Date_Time
20160907-70000  1.1249      17
20160907-70001  1.1248      12
20160907-70001  1.1248      15
20160907-70002  1.1248      13
20160907-70002  1.1249      20
Then
df.Volume.groupby((df.Last != df.Last.shift()).cumsum()).cumsum()
Date_Time
20160907-70000 17
20160907-70001 12
20160907-70001 27
20160907-70002 40
20160907-70002 20
Name: Volume, dtype: int64
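
For completeness, here is a minimal runnable sketch of both answers on the question's data (the imports, the DataFrame construction and the second column name are added here for illustration only):

import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame(
    {'Last':   [1.1249, 1.1248, 1.1249, 1.1248, 1.1249],
     'Volume': [17, 12, 15, 13, 20]},
    index=pd.to_datetime(['2016-09-07 07:00:00', '2016-09-07 07:00:01',
                          '2016-09-07 07:00:01', '2016-09-07 07:00:02',
                          '2016-09-07 07:00:02']))
df.index.name = 'Date_Time'

# Running total per price over the whole frame (first answer)
df['VolumeCount'] = df.groupby('Last')['Volume'].cumsum()

# Running total that resets whenever the price changes from one row to the
# next (second answer); 'VolumeCountContig' is just an illustrative name
runs = (df['Last'] != df['Last'].shift()).cumsum()
df['VolumeCountContig'] = df.groupby(runs)['Volume'].cumsum()

print(df)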


POA "weird" outcome (IMHO)

I have gathered satellite data (every 5 minutes, from "Solcast") for GHI, DNI and DHI and I use pvlib to get the POA value.
The pvlib function I use:
import pandas as pd
from pvlib import irradiance

def get_irradiance(site_location, date, tilt, surface_azimuth, ghi, dni, dhi):
    # 5-minute index covering the whole day in the site's timezone
    times = pd.date_range(date, freq='5min', periods=12*24, tz=site_location.tz)
    solar_position = site_location.get_solarposition(times=times)
    # Transpose GHI/DNI/DHI onto the plane of array
    POA_irradiance = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        ghi=ghi,
        dni=dni,
        dhi=dhi,
        solar_zenith=solar_position['apparent_zenith'],
        solar_azimuth=solar_position['azimuth'])
    return pd.DataFrame({'GHI': ghi,
                         'DNI': dni,
                         'DHI': dhi,
                         'POA': POA_irradiance['poa_global']})
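A hedged usage sketch of this function for context (the site coordinates and the constant placeholder irradiance series below are illustrative assumptions, not the actual Solcast data):

import pandas as pd
from pvlib import location

# Assumed site somewhere in The Netherlands; tilt and azimuth come from the question
site = location.Location(52.1, 5.3, tz='Europe/Amsterdam')
times = pd.date_range('2022-06-13', freq='5min', periods=12*24, tz=site.tz)

# Placeholder series; in practice these are the Solcast GHI/DNI/DHI columns
# reindexed to `times`
ghi = pd.Series(500.0, index=times)
dni = pd.Series(400.0, index=times)
dhi = pd.Series(200.0, index=times)

poa_df = get_irradiance(site, '2022-06-13', tilt=12.5, surface_azimuth=180,
                        ghi=ghi, dni=dni, dhi=dhi)
print(poa_df.resample('1h').mean())  # hourly summary, like the tables below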
When I compare the GHI and POA values for 12 June 2022 and 13 June 2022, I see the POA value for 12 June is significantly lower than the GHI. The location is in The Netherlands; I use a tilt of 12.5 degrees and an azimuth of 180 degrees. Here is the outcome (per hour, from 6:00 - 20:00):
12 June 2022
GHI DNI DHI POA
6 86.750000 312.750000 40.500000 40.277034
7 224.583333 543.000000 69.750000 71.130218
8 366.833333 598.833333 113.833333 178.974322
9 406.083333 182.000000 304.000000 348.272844
10 532.166667 266.750000 346.666667 445.422584
11 725.666667 640.416667 226.500000 509.360716
12 688.500000 329.416667 409.583333 561.630762
13 701.333333 299.750000 439.333333 570.415438
14 725.416667 391.666667 387.750000 532.529676
15 753.916667 629.166667 244.333333 407.665794
16 656.750000 599.750000 215.333333 293.832376
17 381.833333 36.416667 359.416667 356.317883
18 411.750000 569.166667 144.750000 144.254438
19 269.750000 495.916667 102.500000 102.084439
20 134.583333 426.416667 51.583333 51.370738
And
13 June 2022
GHI DNI DHI POA
6 5.666667 0.000000 5.666667 5.616296
7 113.500000 7.750000 111.416667 111.948831
8 259.500000 106.833333 208.416667 256.410392
9 509.166667 637.750000 150.583333 514.516389
10 599.333333 518.666667 240.583333 619.050821
11 745.250000 704.500000 195.583333 788.773772
12 757.250000 549.666667 292.000000 798.739403
13 742.000000 464.583333 335.000000 778.857394
14 818.250000 667.750000 243.000000 869.972769
15 800.750000 776.833333 166.916667 852.559043
16 699.000000 733.666667 167.166667 730.484502
17 582.666667 729.166667 131.916667 593.802853
18 449.166667 756.583333 83.500000 434.958210
19 290.083333 652.666667 68.666667 254.048655
20 139.833333 466.916667 48.333333 97.272684
What can explain the significantly low POA compared to the GHI values on 12 June?
I see this with other days too: some days have a POA much closer to the GHI than other days. Maybe this is "normal behaviour" and I am not accounting for weather influences, which may be important...
I use the POA to do a PR (Performance Ratio) calculation, but I do not get "trusted" results..
Hope someone can shine a light on these values.
Kind regards,
Oscar
The Netherlands.
I'm really sorry; although the weather is unpredictable in the Netherlands, I made a very big booboo in using dd-mm-yyyy format instead of mm-dd-yyyy. Something I overlooked for a long time... (I had never used mm-dd-yyyy before, but that's a lame excuse...)
Really sorry, hope you did not think about it too long..
Thank you anyway for reacting!
I have good values now!
Oscar (shame..)
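
For anyone who hits the same mix-up: pandas parses ambiguous date strings month-first by default, so being explicit about the format avoids it. A small sketch with a hypothetical date string:

import pandas as pd

ambiguous = "12-06-2022"  # intended here as 12 June 2022

print(pd.to_datetime(ambiguous))                     # 2022-12-06: month-first by default
print(pd.to_datetime(ambiguous, format="%d-%m-%Y"))  # 2022-06-12: explicit day-month-year
print(pd.to_datetime(ambiguous, dayfirst=True))      # 2022-06-12 as well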

Django query to remove older values grouped by id?

I'm trying to remove records from a table that have a duplicate value for the same ID, dropping the row(s) with the older timestamp(s), so the result keeps only the newest row for each unique ID/value combination. Hopefully the samples below make sense.
sample data:
id value timestamp
10 10 9/4/20 17:00
11 17 9/4/20 17:00
21 50 9/4/20 17:00
10 10 9/4/20 16:00
10 10 9/4/20 15:00
10 11 9/4/20 14:00
11 41 9/4/20 16:00
11 41 9/4/20 15:00
21 50 9/4/20 16:00
So I'd like to remove any rows that have a duplicate value with the same id, keeping the newest timestamps, so the above data would become:
id value timestamp
10 10 9/4/20 17:00
11 17 9/4/20 17:00
21 50 9/4/20 17:00
10 11 9/4/20 14:00
11 41 9/4/20 16:00
EDIT:
query is just
SampleData.objects.all()
One approach could be using Subquery expressions as documented here.
Suppose your SampleData model looks like this:
class SampleData(models.Model):
    id2 = models.IntegerField()
    value = models.IntegerField()
    timestamp = models.DateTimeField()
(I replaced id by id2 to avoid conflicts with the model id).
Then you could delete your duplicates like this:
from django.db.models import OuterRef, Subquery, F

newest = SampleData.objects.filter(id2=OuterRef('id2'), value=OuterRef('value')).order_by('-timestamp')
SampleData.objects.annotate(newest_id=Subquery(newest.values('pk')[:1])).exclude(pk=F('newest_id')).delete()
Edit:
It seems as if MySQL has some issues handling deletions and subqueries, as documented in this SO post.
In this case a two-step approach should help: first get the ids of the objects to delete, then delete them:
newest = SampleData.objects.filter(id2=OuterRef('id2'), value=OuterRef('value')).order_by('-timestamp')
ids2delete = list(SampleData.objects.annotate(newest_id=Subquery(newest.values('pk')[:1])).exclude(pk=F('newest_id')).values_list('pk', flat=True))
SampleData.objects.filter(pk__in=ids2delete).delete()
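As an optional sanity check before deleting (same model and querysets as above, just without calling delete() right away):

from django.db.models import OuterRef, Subquery, F

newest = SampleData.objects.filter(
    id2=OuterRef('id2'), value=OuterRef('value')).order_by('-timestamp')
to_delete = SampleData.objects.annotate(
    newest_id=Subquery(newest.values('pk')[:1])).exclude(pk=F('newest_id'))

# Dry run: inspect the rows that would be removed, then delete
print(list(to_delete.values_list('id2', 'value', 'timestamp')))
# to_delete.delete()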

Combined Bar plots in python

I have a grouped dataframe which is this:
Speed (mph)
Label Hour
5 13 18.439730
14 24.959555
15 33.912493
7 13 23.397055
14 18.497228
15 33.493978
12 13 32.851146
14 33.187193
15 32.597150
14 13 14.491841
14 12.397724
15 19.581669
21 13 34.985289
14 34.817009
15 34.888187
26 13 35.813901
14 36.622450
15 36.540348
28 13 33.761174
14 33.951116
15 33.736014
29 13 34.545862
14 34.227974
15 34.435377
I am trying to plot bar plots where the bars are grouped by Label and Hour.
An example:
The above graph is just an example I found on the internet. I don't really need the lines and the numbers over the bars.
I tried plotting like this:
newdf.plot.bar()
plt.show()
which gives me -
Question
How can I plot the graph so that Label 5, Hours 13, 14, 15 are close together, then some space, then Label 7, Hours 13, 14, 15 close together, and so on?
It seems you need to unstack
df.unstack().plot.bar()
plt.show()
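A minimal self-contained sketch of that answer, using rounded values from the first few rows of the frame above:

import pandas as pd
import matplotlib.pyplot as plt

# MultiIndex (Label, Hour) -> Speed (mph), a rounded subset of the data above
index = pd.MultiIndex.from_product([[5, 7, 12], [13, 14, 15]],
                                   names=['Label', 'Hour'])
speeds = pd.Series([18.4, 25.0, 33.9, 23.4, 18.5, 33.5, 32.9, 33.2, 32.6],
                   index=index, name='Speed (mph)')

# unstack() moves Hour into the columns, so plot.bar() draws one cluster of
# bars per Label with one bar per Hour inside each cluster
speeds.unstack().plot.bar()
plt.ylabel('Speed (mph)')
plt.show()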

Select median value per time range

I need to select the median value for each id, in each age range. So in the following table, for id = 1, in the age_range of 6 months, I need to select the value from row 2. Basically, I need to create a column per id where only the median for each range is selected.
id wt age_range
1 22 6
1 23 6
1 24 6
2 25 12
2 24 12
2 44 18
If I understand correctly, you're looking to make a new column where for each id and age_range you have the median value for comparison. You could do this in Base SAS by using proc means to output the medians and then merging them back to the original dataset. However, proc sql will do this all in one step and lets you easily name your new column.
proc sql;
    create table want as
    select id, wt, age_range, median(wt) as median_wt
    from have
    group by id, age_range;
quit;
id wt age_range median_wt
1 24 6 23
1 22 6 23
1 23 6 23
2 24 12 24.5
2 25 12 24.5
2 44 18 44

GLPK Mathprog group of sets

I'm trying to code a model that can solve the Multiple Choice Knapsack Problem (MCKP) as described in "Knapsack Problems involving dimensions, demands and multiple choice constraints: generalization and transformations between formulations" (found here, see figures 8 and 9). You can find an example GMPL model of the basic knapsack problem here. For anyone looking for a quick explanation of the knapsack problem, read the following illustration:
You are an adventurer and have stumbled upon a treasure trove. There are hundreds of wonderful items 'i' that each have a weight 'w' and a profit 'p'. Say you have a knapsack with weight capacity 'c' and you want to make the most profit without overfilling your knapsack. What is the best combination of items such that you make the most profit?
In code:
maximize obj:
    sum{(i,w,p) in I} p*x[i];
Where 'I' is the basket of items and x[i] is the binary decision variable (0 = not chosen, 1 = chosen).
The problem that I am having trouble with is the addition of multiple groups. MCKP requires exactly one item to be selected from each group. So, for example, let's say we have three groups from which to choose. They could be represented as follows (ignore the actual values):
# Items: index, weight, profit
set ONE :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set TWO :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set THREE :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
I am confused about how I can iterate over each group and how I would define the variable x. I assume it would look something like:
var x{i,j} binary;
Where i is the index of the items in group j. This assumes I define a set of sets:
set Groups{ONE,TWO,THREE}
Then I'd iterate over the groups of items:
sum{j in Groups, (i,w,p) in Groups[j]} p*x[i,j];
But I am concerned because I believe GMPL does not support ordered sets. I have seen this related question where the answer suggests defining a set within a set. However, I am not sure how it would apply in this particular scenario.
My main question, to be clear: In GMPL, how can I iterate over sets of sets (in this case a set of groups where each group has a set of items)?
Unlike AMPL, GMPL doesn't support sets of sets. Here's how to do it in AMPL:
set Groups;
set Items{Groups} dimen 3;
# define x and additional constraints
# ...
maximize obj: sum{g in Groups, (i,w,p) in Items[g]} p*x[i];
data;
set Groups := ONE TWO THREE;
# Items: index, weight, profit
set Items[ONE] :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set Items[TWO] :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
# Items: index, weight, profit
set Items[THREE] :=
1 10 10
2 10 10
3 15 15
4 20 20
5 20 20
6 24 24
7 24 24
8 50 50;
If you have no more than 300 variables, you can use a free student version of AMPL and solvers (e.g. CPLEX or Gurobi).
Based on this gnu mailing list thread, I believe GMPL/MathProg has support for what you want to do. Here's their example:
set WORKERS;
param number_of_shifts, integer, >= 1;
set WORKER_CLIQUE{1..number_of_shifts}, within WORKERS;
data;
set WORKERS := Jack Kate Sawyer Sun Juliet Richard Desmond Hugo;
param number_of_shifts := 2;
set WORKER_CLIQUE[1] := Sawyer, Juliet;
set WORKER_CLIQUE[2] := Jack, Kate, Hugo;
In your example, I assume you'd use something like set Items{1..3}, within Groups; together with the data block from #vitaut's answer.