I have a table with the fields Amount, Condition1, Condition2.
Example:
Amount Condition1 Condition2
----------------------------------
123 Yes Yes
234 No Yes
900 Yes No
I want to calculate 20% of the amount based on a condition:
If both Condition1 and Condition2 are Yes, then calculate 20%, else 0.
My try: I tried a conditional custom column, but I was unable to add AND inside IF in the query editor.
You can write a calculated column in DAX like this:
= IF(AND(Table1[Condition1] = "Yes", Table1[Condition2] = "Yes"), 0.2 * Table1[Amount], 0)
Or you can use && instead of the AND function:
= IF(Table1[Condition1] = "Yes" && Table1[Condition2] = "Yes", 0.2 * Table1[Amount], 0)
Or an even shorter version using concatenation:
= IF(Table1[Condition1] & Table1[Condition2] = "YesYes", 0.2 * Table1[Amount], 0)
Try creating a new calculated column.
Use the DAX expression below:
new_column = IF(Table1[Condition1] = "Yes", IF(Table1[Condition2] = "Yes", Table1[Amount] * 0.2, 0), 0)
How can I assign a string or integer variable to a turtle, using the probabilities of the variables in a group/list? For example, there is a 0.4 probability that one specific variable is used from a specific group/list. The function randomly selects the variable based on its probability. I need to use the same method afterwards to choose a variable (string) from a list according to probability.
In Python it should be:
import random
def random_value(probability_list, values):
    r = random.random()
    index = 0
    while r >= 0 and index < len(probability_list):
        r -= probability_list[index]
        index += 1
    value = values[index - 1]
    value_index = index - 1
    return value, value_index
I tried it in NetLogo as below (I get an error that the index is -1), but is there a better way?
globals [random_nr probabilities some_list index]
to initialize-variables
  set some_list []
  set probabilities []
end
to random_pick
  set random_nr random-float 1
  set probabilities [0.1 0.2 0.4 0.3]
  set some_list ["String1" "String2" "String3" "String4"]
  set index 0
  while [(random_nr >= 0) and (length probabilities < index)] [
    set random_nr random_nr - item index probabilities
    set index index + 1
  ]
  set index index - 1
end
is there a better way?
Yes there is.
NetLogo 6.0 comes with the rnd extension bundled. (You can also download the extension separately for earlier versions of NetLogo.)
The rnd extension offers the rnd:weighted-one-of-list primitive, which does exactly what you're trying to do:
extensions [ rnd ]
to-report pick
  let probabilities [0.1 0.2 0.4 0.3]
  let some_list ["String1" "String2" "String3" "String4"]
  report first rnd:weighted-one-of-list (map list some_list probabilities) last
end
Let me unpack the last expression a bit:
The role of (map list some_list probabilities) is to "zip" the two lists together, in order to get a list of pairs of the form: [["String1" 0.1] ["String2" 0.2] ["String3" 0.4] ["String4" 0.3]].
That list of pairs is passed as the first argument to rnd:weighted-one-of-list. We pass last as the second argument of rnd:weighted-one-of-list to tell it that it should use the second item of each pair as the probability.
rnd:weighted-one-of-list then picks one of the pairs at random, and returns that whole pair. But since we're only interested in the first item of the pair, we use first to extract it.
To understand how that code works, it helps to understand how Anonymous procedures work. Note how we make use of the concise syntax for passing list to map and for passing last to rnd:weighted-one-of-list.
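For comparison with the Python function in the question, the same zip-then-pick idea can be sketched with the standard library's random.choices. This is only an illustration of the technique, not part of the NetLogo solution; the names weighted_pick and rng are hypothetical.

```python
import random

def weighted_pick(values, probabilities, rng=random):
    """Return one item of `values`, chosen with the given probabilities."""
    # "Zip" the two lists into (value, probability) pairs, as map list does in NetLogo.
    pairs = list(zip(values, probabilities))
    # Use the second element of each pair as its weight and draw a single pair.
    chosen = rng.choices(pairs, weights=[p for _, p in pairs], k=1)[0]
    # Only the value (the first element of the pair) is of interest.
    return chosen[0]

probabilities = [0.1, 0.2, 0.4, 0.3]
some_list = ["String1", "String2", "String3", "String4"]
print(weighted_pick(some_list, probabilities))
```

Passing a seeded random.Random instance as rng makes the draws reproducible, which can be handy when checking the observed frequencies.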
Don't overlook the rnd extension:
https://github.com/NetLogo/Rnd-Extension
But it is possible to do it essentially as you propose. I'll do that here, but it would be better to use explicit arguments.
to-report random-pick
  let _r random-float 1
  let _ps [0.1 0.2 0.4 0.3]
  let _lst ["String1" "String2" "String3" "String4"]
  let _i 0
  while [_r >= item _i _ps] [
    set _r (_r - item _i _ps)
    set _i (_i + 1)
  ]
  report item _i _lst
end
I have a data frame as follows:
ID Value
A 70
A 80
B 75
C 10
B 50
A 1000
C 60
B 2000
.. ..
I would like to group this data by ID, remove the outliers from the grouped data (the ones we see from the boxplot) and then calculate mean.
So far I have:
grouped = df.groupby('ID')
statBefore = pd.DataFrame({'mean': grouped['Value'].mean(), 'median': grouped['Value'].median(), 'std' : grouped['Value'].std()})
How can I find the outliers, remove them and get the statistics?
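I believe the method you're referring to is to remove values more than 1.5 times the interquartile range away from the median. (Boxplot whiskers are usually measured from the quartiles, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, rather than from the median; the code here measures from the median.) So first, calculate your initial statistics: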
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})
And then determine whether values in the original DF are outliers:
def is_outlier(row):
    iq_range = statBefore.loc[row.ID]['q3'] - statBefore.loc[row.ID]['q1']
    median = statBefore.loc[row.ID]['median']
    if row.Value > (median + (1.5 * iq_range)) or row.Value < (median - (1.5 * iq_range)):
        return True
    else:
        return False
# apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)
# filter to only non-outliers:
df_no_outliers = df[~(df.outlier)]
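Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
# Quartiles here are computed over the whole column, not per ID.
# The comparison already yields a boolean Series, so no .any(axis=1) is needed.
data = df[~((df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR)))]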
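The quartile-based rule can also be applied per ID without a row-wise apply. The following is a minimal sketch, assuming the usual boxplot fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) computed within each group. The function name remove_outliers_per_group and the sample values are made up for illustration; with only two or three values per ID, as in the question's excerpt, this rule will usually not flag anything.

```python
import pandas as pd

# Hypothetical sample data with enough values per ID for the IQR rule to bite.
df = pd.DataFrame({
    "ID": ["A"] * 6 + ["B"] * 6,
    "Value": [70, 72, 75, 78, 80, 1000, 50, 55, 60, 65, 75, 2000],
})

def remove_outliers_per_group(frame, key="ID", col="Value", k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] of their own group."""
    grouped = frame.groupby(key)[col]
    # transform broadcasts each group's quartile back onto that group's rows.
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    keep = frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep]

cleaned = remove_outliers_per_group(df)
stats = cleaned.groupby("ID")["Value"].agg(["mean", "median", "std"])
print(stats)
```

Using transform keeps the fences aligned with the original rows, so the filter is a single vectorized comparison rather than a Python-level loop over rows.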
Just filter with a fixed threshold (here 100) before grouping:
In [187]: df[df<100].groupby('ID').agg(['mean','median','std'])
Out[187]:
Value
mean median std
ID
A 75.0 75.0 7.071068
B 62.5 62.5 17.677670
C 35.0 35.0 35.355339
I would like to read a matrix file that looks like this:
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
I would like to fetch all the pairs that have a value of at least 0.8, e.g. sample1,sample3 0.8 and sample2,sample3 0.8, from a large file.
When I use csv.reader, each line turns into a list, and keeping track of row and column names makes the program fragile. I would like to know an elegant way of doing it, for example using numpy or pandas.
Desired output:
sample1,sample3 0.8
sample2,sample3 0.8
The diagonal values of 1 can be ignored, because the value between a sample and itself is always 1.
You can mask out everything except the values strictly above the diagonal with np.triu:
In [11]: df
Out[11]:
sample1 sample2 sample3
sample
sample1 1.0 0.7 0.8
sample2 0.7 1.0 0.8
sample3 0.8 0.8 1.0
In [12]: np.triu(df, 1)
Out[12]:
array([[ 0. , 0.7, 0.8],
[ 0. , 0. , 0.8],
[ 0. , 0. , 0. ]])
In [13]: np.triu(df, 1) >= 0.8
Out[13]:
array([[False, False, True],
[False, False, True],
[False, False, False]], dtype=bool)
Then to extract the index/columns where it's True I think you have to use np.where*:
In [14]: np.where(np.triu(df, 1) >= 0.8)
Out[14]: (array([0, 1]), array([2, 2]))
This gives you an array of row (index) positions followed by an array of column positions (this is the least efficient part of this numpy version):
In [16]: index, cols = np.where(np.triu(df, 1) >= 0.8)
In [17]: [(df.index[i], df.columns[j], df.iloc[i, j]) for i, j in zip(index, cols)]
Out[17]:
[('sample1', 'sample3', 0.80000000000000004),
('sample2', 'sample3', 0.80000000000000004)]
As desired.
*I may be forgetting an easier way to get this last chunk (Edit: the below pandas code does it, but I think there may be another way too.)
You can use the same trick in pandas but with stack to get the index/columns natively:
In [21]: (np.triu(df, 1) >= 0.8) * df
Out[21]:
sample1 sample2 sample3
sample
sample1 0 0 0.8
sample2 0 0 0.8
sample3 0 0 0.0
In [22]: res = ((np.triu(df, 1) >= 0.8) * df).stack()
In [23]: res
Out[23]:
sample
sample1 sample1 0.0
sample2 0.0
sample3 0.8
sample2 sample1 0.0
sample2 0.0
sample3 0.8
sample3 sample1 0.0
sample2 0.0
sample3 0.0
dtype: float64
In [24]: res[res!=0]
Out[24]:
sample
sample1 sample3 0.8
sample2 sample3 0.8
dtype: float64
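As one possible variation on the np.triu approach, np.triu_indices_from yields the positions above the diagonal directly, so no zero-masking or filtering of zeros is needed. This is a minimal sketch on the 3x3 example; the helper name pairs_at_least is hypothetical.

```python
import numpy as np
import pandas as pd

names = ["sample1", "sample2", "sample3"]
df = pd.DataFrame(
    [[1.0, 0.7, 0.8],
     [0.7, 1.0, 0.8],
     [0.8, 0.8, 1.0]],
    index=names, columns=names,
)

def pairs_at_least(frame, threshold):
    """Return (row label, column label, value) for entries above the diagonal >= threshold."""
    values = frame.to_numpy()
    # Row and column positions of the strict upper triangle (k=1 skips the diagonal).
    rows, cols = np.triu_indices_from(values, k=1)
    keep = values[rows, cols] >= threshold
    return [(frame.index[i], frame.columns[j], float(values[i, j]))
            for i, j in zip(rows[keep], cols[keep])]

pairs = pairs_at_least(df, 0.8)
for row_name, col_name, value in pairs:
    print(f"{row_name},{col_name} {value}")
```

Because only the upper triangle is visited, each pair is reported once and the diagonal of 1s never enters the comparison.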
If you want to use Pandas, the following answer will help. I am assuming you will figure out how to read your matrix files into Pandas by yourself. I am also assuming that your columns and rows are labelled correctly. What you will end up with after you read your data is a DataFrame which will look a lot like the matrix you have at the top of your question. I am assuming that all row names are the DataFrame index. I am taking that you have read the data into a variable called df as my starting point.
Pandas stores data column by column, so vectorized operations on a whole column are efficient. So we loop over the columns and filter the rows of each column in one vectorized step.
pairs = {}
for col in df.columns:
    pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)].index.tolist()
    # If row names are not an index, but a different column named 'names', run the following line instead of the line above
    # pairs[col] = df[(df[col] >= 0.8) & (df[col] < 1)]['names'].tolist()
Alternatively, you can use apply() to do this, because that too will loop over all columns. (Maybe in 0.17 it will release the GIL for faster results, I do not know because I have not tried it.)
pairs will now contain the column name as key and a list of the names of rows as values where the correlation is at least 0.8, but less than 1.
If you also want to extract correlation values from the DataFrame, select the column before converting, i.e. replace .index.tolist() by [col].to_dict(). Called on a column, .to_dict() will generate a dict such that index is key and value is value: {index -> value}. So, ultimately your pairs will look like {column -> {index -> value}}. It will also be guaranteed free of nan. Note that .to_dict() will only work if your index contains the row names that you want, else it will return the default index, which is just numbers.
Ps. If your file is huge, I would recommend reading it in chunks. In this case, the piece of code above will be repeated for each chunk. So it should be inside your loop that iterates over chunks. However, then you will have to be careful to append new data coming from the next chunk to pairs. The following links are for your reference:
1. Pandas I/O docs
2. Pandas read_csv() function
3. SO question on chunked read
You might also want to read reference 1 for other types of I/O supported by Pandas.
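The chunked variant described in the P.S. could look roughly like the sketch below. It assumes a whitespace-separated file whose first column holds the row names; an in-memory io.StringIO stands in for the real file, and chunksize=2 is chosen only to force several chunks on this tiny example.

```python
import io
import pandas as pd

# Stand-in for the real matrix file (hypothetical contents taken from the question).
text = """sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
"""

pairs = {}
reader = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0, chunksize=2)
for chunk in reader:
    for col in chunk.columns:
        # Rows of this chunk whose value in `col` is at least 0.8 but below 1.
        hits = chunk[(chunk[col] >= 0.8) & (chunk[col] < 1)][col].to_dict()
        # Merge with what earlier chunks already found for this column.
        pairs.setdefault(col, {}).update(hits)

print(pairs)
```

Since each chunk holds only some of the rows, the per-column results have to be merged across chunks, which is what setdefault followed by update does here.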
To read it in you need the skipinitialspace and index_col parameters:
a=pd.read_csv('yourfile.txt',sep=' ',skipinitialspace=True,index_col=0)
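To get the values pairwise (the trailing [:2] keeps only the first two matches here, because each pair is otherwise listed in both orders):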
[[x,y,round(a[x][y],3)] for x in a.index for y in a.columns if x!=y and a[x][y]>=0.8][:2]
Gives:
[['sample1', 'sample3', 0.8],
['sample2', 'sample3', 0.8]]
You can use scipy.sparse.coo_matrix, as it works with a "(row, col) data" format.
from scipy.sparse import coo_matrix
import numpy as np
M = np.matrix([[1.0, 0.7, 0.8], [0.7, 1.0, 0.8], [0.8, 0.8, 1.0]])
S = coo_matrix(M)
Here, S.row and S.col are arrays of row and column indices, S.data is the array of values at those indices. So you can filter by
idx = S.data >= 0.8
And for instance create a new matrix only with those elements:
S2 = coo_matrix((S.data[idx], (S.row[idx], S.col[idx])))
print(S2)
The output is
(0, 0) 1.0
(0, 2) 0.8
(1, 1) 1.0
(1, 2) 0.8
(2, 0) 0.8
(2, 1) 0.8
(2, 2) 1.0
Note (0,1) does not appear as the value is 0.7.
pandas' read_table can handle regular expressions in the sep parameter.
In [19]: !head file.txt
sample sample1 sample2 sample3
sample1 1 0.7 0.8
sample2 0.7 1 0.8
sample3 0.8 0.8 1
In [20]: df = pd.read_table('file.txt', sep=r'\s+')
In [21]: df
Out[21]:
sample sample1 sample2 sample3
0 sample1 1.0 0.7 0.8
1 sample2 0.7 1.0 0.8
2 sample3 0.8 0.8 1.0
From there, you can filter on all values >= 0.8.
In [23]: df[df >= 0.8]
Out[23]:
sample sample1 sample2 sample3
0 sample1 1.0 NaN 0.8
1 sample2 NaN 1.0 0.8
2 sample3 0.8 0.8 1.0
I'm trying to figure out how I would go about formatting a large number to the shorter version by appending 'k' or 'm' using Lua. Example:
17478 => 17.5k
2832 => 2.8k
1548034 => 1.55m
I would like to have the rounding in there as well as per the example. I'm not very good at Regex, so I'm not sure where I would begin. Any help would be appreciated. Thanks.
Pattern matching doesn't seem like the right direction for this problem.
Assuming 2 digits after decimal point are kept in the shorter version, try:
function foo(n)
    if n >= 10^6 then
        return string.format("%.2fm", n / 10^6)
    elseif n >= 10^3 then
        return string.format("%.2fk", n / 10^3)
    else
        return tostring(n)
    end
end
Test:
print(foo(17478))
print(foo(2832))
print(foo(1548034))
Output:
17.48k
2.83k
1.55m
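The question's examples use one decimal for thousands (17.5k) and two for millions (1.55m), whereas the function above keeps two digits everywhere. A minimal sketch of per-suffix precision follows, written in Python rather than Lua purely for illustration; the name short_number and the RULES table are hypothetical, and the digit counts are an assumption read off the question's examples.

```python
# (threshold, suffix, digits after the decimal point), largest threshold first.
RULES = (
    (10**6, "m", 2),
    (10**3, "k", 1),
)

def short_number(n):
    """Format n with a 'k' or 'm' suffix, or return it unchanged below 1000."""
    for threshold, suffix, digits in RULES:
        if n >= threshold:
            return f"{n / threshold:.{digits}f}{suffix}"
    return str(n)

for n in (17478, 2832, 1548034):
    print(n, "=>", short_number(n))
```

The same table-driven idea carries over to Lua by iterating over a list of {threshold, suffix, digits} entries and building the string.format pattern from the digit count.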
Here is a longer form, which uses the hint from Tom Blodget.
Maybe it's not the perfect form, but it's a little more specific.
For Lua 5.0, replace #steps with table.getn(steps).
function shortnumberstring(number)
    local steps = {
        {1,""},
        {1e3,"k"},
        {1e6,"m"},
        {1e9,"g"},
        {1e12,"t"},
    }
    for _,b in ipairs(steps) do
        if b[1] <= number+1 then
            steps.use = _
        end
    end
    local result = string.format("%.1f", number / steps[steps.use][1])
    if tonumber(result) >= 1e3 and steps.use < #steps then
        steps.use = steps.use + 1
        result = string.format("%.1f", tonumber(result) / 1e3)
    end
    --result = string.sub(result,0,string.sub(result,-1) == "0" and -3 or -1) -- Remove .0 (just if it is zero!)
    return result .. steps[steps.use][2]
end
print(shortnumberstring(100))
print(shortnumberstring(200))
print(shortnumberstring(999))
print(shortnumberstring(1234567))
print(shortnumberstring(999999))
print(shortnumberstring(9999999))
print(shortnumberstring(1345123))
Result:
> dofile"test.lua"
100.0
200.0
1.0k
1.2m
1.0m
10.0m
1.3m
>
And if you want to get rid of the "XX.0", uncomment the line before the return.
Then our result is:
> dofile"test.lua"
100
200
1k
1.2m
1m
10m
1.3m
>