So I'm trying to figure out a good way of vectorizing a calculation and I'm a bit stuck.
| A | B (Calculation) | B (Value) |
|---|----------------------|-----------|
| 1 | | |
| 2 | | |
| 3 | | |
| 4 | =SUM(A1:A4)/4 | 2.5 |
| 5 | =(1/4)*A5 + (3/4)*B4 | 3.125 |
| 6 | =(1/4)*A6 + (3/4)*B5 | 3.84375 |
| 7 | =(1/4)*A7 + (3/4)*B6 | 4.6328125 |
I'm basically trying to replicate Wilder's Average True Range (without using TA-Lib). In the case of my simplified example, column A is the precomputed True Range.
Any ideas how to do this without looping? Breaking down the equation, it's effectively a weighted cumulative sum... but it's definitely not something that pandas' existing cumsum allows out of the box.
This is indeed an ewm problem. The issue is that the first 4 rows are crammed together into a single row... then ewm takes over.
import numpy as np

a = df.A.values
# the first 4 rows collapse into their mean; ewm's recursive mean takes over from there
d1 = pd.DataFrame(dict(A=np.append(a[:4].mean(), a[4:])), index=df.index[3:])
d1.ewm(adjust=False, alpha=.25).mean()
A
3 2.500000
4 3.125000
5 3.843750
6 4.632812
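For reference, the same idea wrapped in a small reusable function (a sketch; the function name and the inline example are mine, not part of the original answer):

import numpy as np
import pandas as pd

def wilder_smooth(x, n=4):
    # Seed with the mean of the first n values, then let ewm apply the
    # recursion y[t] = (1/n) * x[t] + (1 - 1/n) * y[t-1].
    seed = x.iloc[:n].mean()
    seeded = pd.Series(np.append(seed, x.iloc[n:].to_numpy()), index=x.index[n - 1:])
    return seeded.ewm(alpha=1.0 / n, adjust=False).mean()

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7]})
print(wilder_smooth(df['A']))  # 2.5, 3.125, 3.84375, 4.6328125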
Related
I have a hierarchy-based event stream, where each parent node (represented as level0/1) has multiple children (level0(0/1/2)) and sub-children (level00(0/1/2)). "level" is just a placeholder; each hierarchy level has its own unique name. The only rule is that a parent node's hierarchy string is always included in the child's hierarchy string. Assume that this event stream has 300k or more entries.
| index | hierarchystr |
| ----- | --------------------- |
| 0 | level0level00level000|
| 1 | level0level01 |
| 2 | level0level02level021|
| 3 | level0level02level021|
| 4 | level0level02level020|
| 5 | level0level02level021|
| 6 | level1level02level021|
| 7 | level1level02level021|
| 8 | level1level02level021|
| 9 | level2level02level021|
Now I want to do an inclusive group-by against a separate list: a row should be counted for an entry whenever that entry's string is contained in the row's hierarchystr value. Expected output below (beware: hstrs is in a different order on every run!):
#hstrs = ["level0", "level1", "level0level01", "level0level02", "level0level02level021"]
|index| hstr                | Count |
|-----|---------------------|-------|
|0 |level0 | 6 |
|1 |level1 | 3 |
|2 |level0level01 | 1 |
|3 |level0level02 | 4 |
|4 |level0level02level021| 3 |
I tried the following solutions, but all are slow as hell:
#V1
for hstr in hstrs:
    s = df[df.hierarchystr.str.contains(hstr)]
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)
#V2
for hstr in hstrs:
    s = df.hierarchystr.str.extract('(' + hstr + ')', expand=True)
    s2 = s.count()
    s3 = s2.values[0]
    if s3 > 200:
        beforeset.append(hstr)  # was list.append, which clobbers the builtin
#V3 - fastest, but also slow and not satisfying
containing = [item for hierarchystr in df.hierarchystr for item in hstrs if item in hierarchystr]
containing = Counter(containing)
df1 = pd.DataFrame([containing]).T
nodeNamesWithOver200 = df1[df1 > 200].dropna().index.values
I also tried versions that handle all variables at once with pat and extract, but then the size per group changes on every run, because the list hstrs is in a different order each run.
df.hierarchystr.extract[all](pat="|".join(hstrs))
Is there a regex and method that can do this task in one step, so that it is also applicable to huge data frames in reasonable time and does not depend on the order of the hstrs array?
You can try:
count = [df['hierarchystr'].str.startswith(hstr).sum() for hstr in hstrs]
out = pd.DataFrame({'hstr': hstrs, 'count': count})
print(out)
# Output
hstr count
0 level0 6
1 level1 3
2 level0level01 1
3 level0level02 4
4 level0level02level021 3
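If the loop over hstrs is still too slow on 300k+ rows, here is an order-independent sketch of mine (it assumes prefix matching is sufficient, as in the counts above, and that '\uffff' sorts above every character used in the strings): sort the column once, then count each prefix with two binary searches, since all rows sharing a prefix form one contiguous block in sorted order.

import numpy as np
import pandas as pd

vals = np.sort(df['hierarchystr'].to_numpy())
lo = np.searchsorted(vals, hstrs)                          # first row >= prefix
hi = np.searchsorted(vals, [h + '\uffff' for h in hstrs])  # first row past the prefix block
out = pd.DataFrame({'hstr': hstrs, 'count': hi - lo})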
I have simplified my problem. Let's suppose I have three tables: one containing data and specific codes that identify objects, let's say apples.
+-------------+------------+-----------+
| Data picked | Color code | Size code |
+-------------+------------+-----------+
| 1-8-2018 | 1 | 1 |
| 1-8-2018 | 1 | 3 |
| 1-8-2018 | 2 | 2 |
| 1-8-2018 | 2 | 3 |
| 1-8-2018 | 2 | 2 |
| 1-8-2018 | 3 | 3 |
| 1-8-2018 | 4 | 1 |
| 1-8-2018 | 4 | 1 |
| 1-8-2018 | 5 | 3 |
| 1-8-2018 | 6 | 1 |
| 1-8-2018 | 6 | 2 |
| 1-8-2018 | 6 | 2 |
+-------------+------------+-----------+
And I have two related helper tables to decode those codes (their relationships are inactive in the model due to ambiguity with other tables in the real case).
+-----------+--------+
| Size code | Size |
+-----------+--------+
| 1 | Small |
| 2 | Medium |
| 3 | Large |
+-----------+--------+
and
+------------+----------------+-------+
| Color code | Color specific | Color |
+------------+----------------+-------+
| 1 | Light green | Green |
| 2 | Green | Green |
| 3 | Semi green | Green |
| 4 | Red | Red |
| 5 | Dark | Red |
| 6 | Pink | Red |
+------------+----------------+-------+
Let's say that I want to create an extra column in the original table to determine which apples are class A and which are class B, given that medium green apples are class A and large red apples are class B; the others remain blank, as in the example below.
+-------------+------------+-----------+-------+
| Data picked | Color code | Size code | Class |
+-------------+------------+-----------+-------+
| 1-8-2018 | 1 | 1 | |
| 1-8-2018 | 1 | 3 | |
| 1-8-2018 | 2 | 2 | A |
| 1-8-2018 | 2 | 3 | |
| 1-8-2018 | 2 | 2 | A |
| 1-8-2018 | 3 | 3 | |
| 1-8-2018 | 4 | 1 | |
| 1-8-2018 | 4 | 1 | |
| 1-8-2018 | 5 | 3 | B |
| 1-8-2018 | 6 | 1 | |
| 1-8-2018 | 6 | 2 | |
| 1-8-2018 | 6 | 2 | |
+-------------+------------+-----------+-------+
What's the proper DAX to use, given that the relationships are initially inactive? Preferably solvable without creating any further additional columns in any table. I already tried code like:
CALCULATE (
"A" ;
FILTER ( 'Size Table' ; 'Size Table'[Size] = "Medium");
FILTER ( 'Color Table' ; 'Color Table'[Color] = "Green")
)
And many variations on the same principle.
Given that the relationships are inactive, I'd suggest using LOOKUPVALUE to match ID values on the other tables. You should be able to create a calculated column as follows:
Class =
VAR Size = LOOKUPVALUE('Size Table'[Size],
'Size Table'[Size code], 'Data Table'[Size code])
VAR Color = LOOKUPVALUE('Color Table'[Color],
'Color Table'[Color code], 'Data Table'[Color code])
RETURN SWITCH(TRUE(),
(Size = "Medium") && (Color = "Green"), "A",
(Size = "Large") && (Color = "Red"), "B", BLANK())
If your relationships are active, then you don't need the lookups:
Class = SWITCH(TRUE(),
(RELATED('Size Table'[Size]) = "Medium") &&
(RELATED('Color Table'[Color]) = "Green"),
"A",
(RELATED('Size Table'[Size]) = "Large") &&
(RELATED('Color Table'[Color]) = "Red"),
"B",
BLANK())
Or a bit more elegantly written (especially for more classes):
Class =
VAR SizeColor = RELATED('Size Table'[Size]) & " " & RELATED('Color Table'[Color])
RETURN SWITCH(TRUE(),
SizeColor = "Medium Green", "A",
SizeColor = "Large Red", "B",
BLANK())
I have a model for which I want to perform a group-by on two values and calculate the percentages of each value per outer grouping.
Currently I just make a query to get all the rows, put them into a pandas dataframe, and perform something similar to the answer here. Although this works, I'm sure it would be more efficient if the query returned the information I require directly.
I am currently running Django 2.0.5 with PostgreSQL 9.6.8 as the backend DB.
I think window functions could be the solution, as indicated here, but I cannot construct a successful combination of annotate and values to give me the desired output.
Another possible solution could be ROLLUP, introduced in PostgreSQL 9.5, if I can find a way to get the summary row as a set of extra columns for each row. But I also think it's not yet supported by Django.
Model:
class ModelA(models.Model):
    grouper1 = models.CharField(max_length=100)  # max_length is required for CharField
    grouper2 = models.CharField(max_length=100)
    metric1 = models.IntegerField()
All rows:
grouper1 | grouper2 | metric1
---------+----------+---------
A | C | 2
A | C | 2
A | C | 2
A | D | 4
A | D | 4
A | D | 4
B | C | 5
B | C | 5
B | C | 5
B | D | 6
B | D | 4
B | D | 5
Desired output:
grouper1 | grouper2 | sum(metric1) | Percentage
---------+----------+--------------+-----------
A        | C        | 6            | 33.3
A        | D        | 12           | 66.7
B | C | 15 | 50
B | D | 15 | 50
I got close to what I expected with
ModelA.objects.all(
).values(
    'grouper1',
    'grouper2'
).annotate(
    SumMetric1=Window(expression=Sum('metric1'), partition_by=[F('grouper1'), F('grouper2')]),
    GroupSumMetric1=Window(expression=Sum('metric1'), partition_by=[F('grouper1')])
)
However this returns a row for every original row in the database like so:
grouper1 | grouper2 | sum(metric1) | Percentage
---------+----------+--------------+-----------
A        | C        | 6            | 33.3
A        | C        | 6            | 33.3
A        | C        | 6            | 33.3
A        | D        | 12           | 66.7
A        | D        | 12           | 66.7
A        | D        | 12           | 66.7
B        | C        | 15           | 50
B        | C        | 15           | 50
B        | C        | 15           | 50
B        | D        | 15           | 50
B        | D        | 15           | 50
B        | D        | 15           | 50
In this situation .distinct() might help.
More information is here.
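A minimal sketch of how that could look (my code, not from the answer; it assumes window aliases can be reused in a follow-up annotate() and that DISTINCT over the selected values is acceptable on your Django/PostgreSQL versions):

from django.db.models import ExpressionWrapper, F, FloatField, Sum, Window

qs = (
    ModelA.objects
    .values('grouper1', 'grouper2')
    .annotate(
        SumMetric1=Window(expression=Sum('metric1'),
                          partition_by=[F('grouper1'), F('grouper2')]),
        GroupSumMetric1=Window(expression=Sum('metric1'),
                               partition_by=[F('grouper1')]),
    )
    .annotate(
        # percentage of the outer (grouper1) total contributed by each pair
        Percentage=ExpressionWrapper(
            F('SumMetric1') * 100.0 / F('GroupSumMetric1'),
            output_field=FloatField(),
        )
    )
    .distinct()  # collapse the per-source-row duplicates into one row per group
)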
I am a newbie to data science and I have downloaded code that is supposed to predict the viewers for the next week.
But in the following code I am not able to understand what the function does and how it predicts the values.
The data set has 7 values for each. Why is 9 passed in the parentheses?
regr1 = linear_model.LinearRegression()
regr1.fit(x1, y1)
predicted_value1 = regr1.predict(9)
What do these lines do?
Here is the full code:
import pandas as pd
from sklearn import linear_model

def get_data(file_name):
    data = pd.read_csv(file_name)
    flash_x_parameter = []
    flash_y_parameter = []
    arrow_x_parameter = []
    arrow_y_parameter = []
    for x1, y1, x2, y2 in zip(data['flash_episode_number'],
                              data['flash_us_viewers'],
                              data['arrow_episode_number'],
                              data['arrow_us_viewers']):
        flash_x_parameter.append([float(x1)])
        flash_y_parameter.append(float(y1))
        arrow_x_parameter.append([float(x2)])
        arrow_y_parameter.append(float(y2))
    return flash_x_parameter, flash_y_parameter, arrow_x_parameter, arrow_y_parameter

def more_viewers(x1, y1, x2, y2):
    regr1 = linear_model.LinearRegression()
    regr1.fit(x1, y1)
    predicted_value1 = regr1.predict(9)
    regr2 = linear_model.LinearRegression()
    regr2.fit(x2, y2)
    predicted_value2 = regr2.predict(9)
    print predicted_value1, "are the flash viewers"
    print predicted_value2, "are the arrow viewers"
    if predicted_value1 > predicted_value2:
        print "The Flash Tv Show will have more viewers for next week"
    else:
        print "Arrow Tv Show will have more viewers for next week"

x1, y1, x2, y2 = get_data('C:\\Users\\SHIVAPRASAD\\Desktop\\test.csv')
more_viewers(x1, y1, x2, y2)
No, your data is NOT the set of 7 values, it has 9 rows:
+----------------+-------------------+----------------+------------------+
| FLASH_EPISODE | FLASH_US_VIEWERS | ARROW_EPISODE | ARROW_US_VIEWERS |
+----------------+-------------------+----------------+------------------+
| 1 | 4.83 | 1 | 2.84 |
| 2 | 4.27 | 2 | 2.32 |
| 3 | 3.59 | 3 | 2.55 |
| 4 | 3.53 | 4 | 2.49 |
| 5 | 3.46 | 5 | 2.73 |
| 6 | 3.73 | 6 | 2.6 |
| 7 | 3.47 | 7 | 2.64 |
| 8 | 4.34 | 8 | 3.92 |
| 9 | 4.66 | 9 | 3.06 |
+----------------+-------------------+----------------+------------------+
(as your code is from Dataconomy Linear Regression Implementation in Python.)
So the value 9 in the command
predicted_value1 = regr1.predict(9)
is OK.
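One side note that doesn't affect the answer: recent scikit-learn versions require a 2D array of shape (n_samples, n_features) for predict, so on a current install the call would be written as:

predicted_value1 = regr1.predict([[9]])  # one sample with one feature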
I would like to check if a value has appeared in some previous row of the same column.
In the end I would like to have a cumulative count of the number of distinct observations.
Is there any solution other than concatenating all _n rows and using regular expressions? I'm getting there with concatenating the rows, but given the limit of 244 characters for string variables (in Stata <13), this is sometimes not applicable.
Here's what I'm doing right now:
gen tmp=x
replace tmp = tmp[_n-1]+ "," + tmp if _n > 1
gen cumu=0
replace cumu=1 if regexm(tmp[_n-1],x+"|"+x+",|"+","+x+",")==0
replace cumu= sum(cumu)
Example
+-----+
| x |
|-----|
1. | 12 |
2. | 32 |
3. | 12 |
4. | 43 |
5. | 43 |
6. | 3 |
7. | 4 |
8. | 3 |
9. | 3 |
10. | 3 |
+-----+
becomes
 +------------------------------------+
 | x  | tmp                           |
 |----|-------------------------------|
 1. | 12 | 12                         |
 2. | 32 | 12,32                      |
 3. | 12 | 12,32,12                   |
 4. | 43 | 12,32,12,43                |
 5. | 43 | 12,32,12,43,43             |
 6. | 3  | 12,32,12,43,43,3           |
 7. | 4  | 12,32,12,43,43,3,4         |
 8. | 3  | 12,32,12,43,43,3,4,3       |
 9. | 3  | 12,32,12,43,43,3,4,3,3     |
10. | 3  | 12,32,12,43,43,3,4,3,3,3   |
 +------------------------------------+
and finally
 +-------------+
 | x   | cumu  |
 |-----|-------|
 1. | 12 |  1  |
 2. | 32 |  2  |
 3. | 12 |  2  |
 4. | 43 |  3  |
 5. | 43 |  3  |
 6. | 3  |  4  |
 7. | 4  |  5  |
 8. | 3  |  5  |
 9. | 3  |  5  |
10. | 3  |  5  |
 +-------------+
Any ideas how to avoid the 'middle step'? (For me that becomes very important when x holds strings instead of numbers.)
Thanks!
Regular expressions are great, but here as often elsewhere simple calculations suffice. With your sample data
. input x
x
1. 12
2. 32
3. 12
4. 43
5. 43
6. 3
7. 4
8. 3
9. 3
10. 3
11. end
end of do-file
you can identify first occurrences of each distinct value:
. gen long order = _n
. bysort x (order) : gen first = _n == 1
. sort order
. l
+--------------------+
| x order first |
|--------------------|
1. | 12 1 1 |
2. | 32 2 1 |
3. | 12 3 0 |
4. | 43 4 1 |
5. | 43 5 0 |
|--------------------|
6. | 3 6 1 |
7. | 4 7 1 |
8. | 3 8 0 |
9. | 3 9 0 |
10. | 3 10 0 |
+--------------------+
The number of distinct values seen so far is then just a cumulative sum of first using sum(). This works with string variables too. In fact this problem is one of several discussed within
http://www.stata-journal.com/sjpdf.html?articlenum=dm0042
which is accessible to all as a .pdf. search distinct would have pointed you to this article.
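For completeness, the final step described above is a single line (my addition, reusing the cumu name from the question):

. gen cumu = sum(first)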
Becoming fluent with what you can do with by:, sort, _n and _N is an important skill in Stata. See also
http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
for another article accessible to all.