Need help creating a formula in a Google spreadsheet - if-statement

I would like to create a formula that shows in one cell the winning/losing streak of a range (a run of positive or negative numbers), and the total value of that streak in a separate cell.
For example, I have a basic table where A1:A100 holds various positive and negative numbers. In B1 I want to show the longest positive streak of the range; say it runs from A25 to A45, then the value should be 21. In B2 I want to show the sum of A25:A45.
In C1 and C2, the same but for negative numbers.
EDIT:
I would also like to show the highest positive/negative sum of any streak in the range, not necessarily the longest one.
I hope this is clear enough.

The formulas below use D1:D4; point them at whichever cells you like. The trick for the streak length: map each row to 1 (matches the sign) or 0, JOIN everything into one string, SPLIT it on the zeros, and the longest remaining fragment is the longest streak.
D1 (longest positive streak):
=ARRAYFORMULA(MAX(LEN(SPLIT(JOIN(, IF(A:A>0, 1, 0)), 0))))
D2 (longest negative streak):
=ARRAYFORMULA(MAX(LEN(SPLIT(JOIN(, IF(A:A<0, 1, 0)), 0))))
D3 and D4 pair each streak's length with its ♦-delimited values, sort by streak length in descending order, take the values of the longest streak, and sum them.
D3 (sum of the longest positive streak):
=ARRAYFORMULA(SUM(SPLIT(INDEX(SORT(TRANSPOSE({LEN(SPLIT(JOIN(,
IF(A:A>0, 1, "♥")), "♥")); SPLIT(JOIN(,
IF(A:A>0, A:A&"♦", "♥")), "♥")}), 1, 0), 1, 2), "♦")))
D4 (sum of the longest negative streak):
=ARRAYFORMULA(SUM(SPLIT(INDEX(SORT(TRANSPOSE({LEN(SPLIT(JOIN(,
IF(A:A<0, 1, "♥")), "♥")); SPLIT(JOIN(,
IF(A:A<0, A:A&"♦", "♥")), "♥")}), 1, 0), 1, 2), "♦")))

Minimum number of iterations

We are given an array with numbers ranging from 1 to n (no duplicates), where n is the size of the array.
We are allowed to do the following operation:
arr[i] = arr[arr[i]-1], 0 <= i < n
One iteration is when we perform the above operation on the entire array, with every position updated simultaneously (using the values from before the iteration).
Our task is to find the iteration at which we first encounter a previously encountered sequence.
Constraints :
a) Array has no duplicates
b) 1 <= arr[i] <= n , 0 <= i < n
c) 1 <= n <= 10^6
Ex 1:
n = 5
arr[] = {5, 4, 2, 1, 3}
After 1st iteration array becomes : {3, 1, 4, 5, 2}
After 2nd iteration array becomes : {4, 3, 5, 2, 1}
After 3rd iteration array becomes : {2, 5, 1, 3, 4}
After 4th iteration array becomes : {5, 4, 2, 1, 3}
In the 4th iteration, the sequence obtained has already been seen before (it is the original array).
So the expected output is 4.
This question was asked in a job hiring test, so I don't have a link to it.
There were 2 sample test cases, of which I remember the one given above. I would really appreciate any help with this question.
P.S.
I was able to code the brute-force solution, where I stored every result in a Set and kept applying the operation to get the next permutation, but it got TLE.
First, note that an array of length n containing 1, 2, ..., n with no duplicates is a permutation.
Next, observe that arr[i] := arr[arr[i] - 1] is squaring the permutation.
That is, consider permutations as elements of the symmetric group S_n, where multiplication is composition of permutations.
Then the above operation is arr := arr * arr.
So, in terms of permutations and their composition, the question is as follows:
You are given a permutation p (= arr).
Consider permutations p, p^2, p^4, p^8, p^16, ...
What is the number of distinct elements among them?
Now, to solve it, consider the cycle notation of the permutation.
Every permutation is a product of disjoint cycles.
For example, 6 1 4 3 5 2 is the product of the following cycles: (1 6 2) (3 4) (5).
In other words, every application of this permutation:
moves elements at positions 1, 6, 2 along the cycle;
moves elements at positions 4, 3 along the cycle;
leaves element at position 5 in place.
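Here is a quick sketch of this cycle decomposition in Python (positions are 1-based):

def cycles(perm):
    # decompose a 1-based permutation into disjoint cycles of positions
    n, seen, result = len(perm), set(), []
    for start in range(1, n + 1):
        if start not in seen:
            cyc, i = [], start
            while i not in seen:
                seen.add(i)
                cyc.append(i)
                i = perm[i - 1]   # follow the permutation to the next position
            result.append(cyc)
    return result

print(cycles([6, 1, 4, 3, 5, 2]))  # [[1, 6, 2], [3, 4], [5]]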
So, when we consider p^k (take an identity permutation and apply the permutation p to it k times), we actually process three independent actions:
move elements at positions 1, 6, 2 along the cycle, k times;
move elements at positions 4, 3 along the cycle, k times;
leave element at position 5 in place, k times.
Now, take into account that, after d applications of a cycle of length d, it just returns all the respective elements to their initial places.
So, we can actually formulate p^k as:
move elements at positions 1, 6, 2 along the cycle, (k mod 3) times;
move elements at positions 4, 3 along the cycle, (k mod 2) times;
leave element at position 5 in place.
We can now prove (using the Chinese Remainder Theorem, or just general knowledge of group theory) that the permutations p, p^2, p^3, p^4, p^5, ... are all distinct up to p^m, where m is the least common multiple of all cycle lengths.
In our example with p = 6 1 4 3 5 2, we have p, p^2, p^3, p^4, p^5, and p^6 all distinct.
But p^6 is the identity permutation: moving six times along a cycle of length 2 or 3 results in the items at their initial places.
So p^7 is the same as p^1, p^8 is the same as p^2, and so on.
Our question however is harder: we want to know the number of distinct permutations not among p, p^2, p^3, p^4, p^5, ..., but among p, p^2, p^4, p^8, p^16, ...: p to the power of a power of two.
To do that, consider all cycle lengths c_1, c_2, ..., c_r in our permutation.
For each c_i, find the pre-period and period of 2^k mod c_i:
For example, c_1 = 3, and 2^k mod 3 (for k = 0, 1, 2, ...) looks like 1, 2, 1, 2, 1, 2, ..., which is (1, 2) with pre-period 0 and period 2.
As another example, c_2 = 2, and 2^k mod 2 looks like 1, 0, 0, 0, ..., which is 1, (0) with pre-period 1 and period 1.
In this problem, this part can be done naively, by just marking visited numbers mod c_i in some array.
By the Chinese Remainder Theorem again, after all pre-periods have passed, the period of the whole system of cycles will be the least common multiple of all individual periods.
What remains is to consider pre-periods.
These can be processed with your naive solution anyway, as the length of a pre-period here is at most log_2 n.
The answer is the least common multiple of all individual periods, calculated as above, plus the length of the longest pre-period.
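Putting it all together, here is a sketch in Python of the solution described above (find the cycle lengths, compute the pre-period and period of 2^k mod c for each length naively, then combine):

from math import gcd

def iterations_until_repeat(arr):
    # number of distinct permutations among p, p^2, p^4, p^8, ... for p = arr
    n = len(arr)
    seen = [False] * n
    cycle_lengths = set()
    for start in range(n):           # find the length of every cycle
        if not seen[start]:
            length, i = 0, start
            while not seen[i]:
                seen[i] = True
                i = arr[i] - 1
                length += 1
            cycle_lengths.add(length)
    max_pre, period_lcm = 0, 1
    for c in cycle_lengths:          # pre-period and period of 2^k mod c
        first_seen = {}              # value -> first k at which it appeared
        k, v = 0, 1 % c              # 2^0 mod c
        while v not in first_seen:
            first_seen[v] = k
            v = (v * 2) % c
            k += 1
        pre = first_seen[v]          # index where the repeating part starts
        per = k - first_seen[v]      # length of the repeating part
        max_pre = max(max_pre, pre)
        period_lcm = period_lcm * per // gcd(period_lcm, per)
    return max_pre + period_lcm

print(iterations_until_repeat([5, 4, 2, 1, 3]))  # 4, matching the example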

How the TF-IDF values are transformed

I am new to NLP; please clarify how the TF-IDF values are transformed by fit_transform.
The formula below for calculating the (smoothed) IDF is working fine:
IDF = ln((1 + total number of documents) / (1 + number of documents containing the term)) + 1
E.g. the IDF value for the term "this" in document 1 ("This is a string") is ln((1 + 4) / (1 + 2)) + 1 = 1.51082562, since "this" occurs in 2 of the 4 documents.
After applying fit_transform, the values for all the terms change. What is the formula/logic used for the transformation?
TFIDF = TF * IDF
E.g. the TF-IDF value for the term "this" in document 1 ("This is a string") is 0.61366674.
How is this value, 0.61366674, arrived at?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

d = pd.Series(['This is a string', 'This is another string',
               'TFIDF Computation Calculation', 'TFIDF is the product of TF and IDF'])
df = pd.DataFrame(d)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.idf_)
print (tfidf_vectorizer.idf_)
#output
#[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355 1.91629073
#1.91629073 1.51082562 1.91629073 1.51082562 1.91629073 1.51082562]
##-------------------------------------------------
##how the above values are getting transformed here
##-------------------------------------------------
print (tfidf.toarray())
#[[0. 0. 0. 0. 0. 0.49681612 0.
#0. 0.61366674 0. 0. 0. 0.61366674]
# [0. 0.61422608 0. 0. 0. 0.39205255
# 0. 0. 0.4842629 0. 0. 0. 0.4842629 ]
# [0. 0. 0.61761437 0.61761437 0. 0.
# 0. 0. 0. 0. 0.48693426 0. 0. ]
# [0.37718389 0. 0. 0. 0.37718389 0.24075159
# 0.37718389 0.37718389 0. 0.37718389 0.29737611 0.37718389 0. ]]
These are normed TF-IDF vectors, because by default norm='l2' according to the documentation. So in the output of tfidf.toarray(), each row of the array represents a document and each column represents a unique word, and the sum of squares of the vector elements of each document equals 1. You can check this by printing print([sum(word ** 2 for word in doc) for doc in tfidf.toarray()]).
norm : ‘l1’, ‘l2’ or None, optional (default=’l2’)
    Each output row will have unit norm, either:
    * ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
    * ‘l1’: Sum of absolute values of vector elements is 1.
    See preprocessing.normalize
print(tfidf)  # the same values you find in tfidf.toarray(), but more readable
output: (index of document / row, index of unique word / column) followed by the normed TF-IDF value
(0, 12) 0.6136667440107333 #1st word in 1st sentence: 'This'
(0, 5) 0.4968161174826459 #'is'
(0, 8) 0.6136667440107333 #'string', see that word 'a' is missing
(1, 12) 0.48426290003607125 #'This'
(1, 5) 0.3920525532545391 #'is'
(1, 8) 0.48426290003607125 #'string'
(1, 1) 0.6142260844216119 #'another'
(2, 10) 0.48693426407352264 #'TFIDF'
(2, 3) 0.6176143709756019 #'Computation'
(2, 2) 0.6176143709756019 #'Calculation'
(3, 5) 0.2407515909314943 #'is'
(3, 10) 0.2973761110467491 #'TFIDF'
(3, 11) 0.37718388973255157 #'the'
(3, 7) 0.37718388973255157 #'product'
(3, 6) 0.37718388973255157 #'of'
(3, 9) 0.37718388973255157 #'TF'
(3, 0) 0.37718388973255157 #'and'
(3, 4) 0.37718388973255157 #'IDF'
Because these are normed TF-IDF values, the sum of squares of the vector elements of each document is equal to 1. E.g. for the first document at index 0, sum([0.6136667440107333 ** 2, 0.4968161174826459 ** 2, 0.6136667440107333 ** 2]) evaluates to 1.0 (up to floating-point rounding).
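To see where 0.61366674 comes from: take the unnormed TF-IDF values of document 0 from the norm=None output below ('is', 'string', 'this') and divide the value for 'this' by the Euclidean norm of the row:

import math

row = [1.2231435513142097, 1.5108256237659907, 1.5108256237659907]  # 'is', 'string', 'this'
l2 = math.sqrt(sum(v ** 2 for v in row))   # Euclidean (L2) norm of the row
print(1.5108256237659907 / l2)             # -> 0.6136667440107333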
You can turn off this transformation by setting norm=None.
print(TfidfVectorizer(norm=None).fit_transform(df[0]))  # again the same values as in .toarray(), but more readable
output: (index of document / row, index of unique word / column) followed by the raw TF-IDF value
(0, 12) 1.5108256237659907 #1st word in 1st sentence: 'This'
(0, 5) 1.2231435513142097 #'is'
(0, 8) 1.5108256237659907 #'string', see that word 'a' is missing
(1, 12) 1.5108256237659907 #'This'
(1, 5) 1.2231435513142097 #'is'
(1, 8) 1.5108256237659907 #'string'
(1, 1) 1.916290731874155 #'another'
(2, 10) 1.5108256237659907 #'TFIDF'
(2, 3) 1.916290731874155 #'Computation'
(2, 2) 1.916290731874155 #'Calculation'
(3, 5) 1.2231435513142097 #'is'
(3, 10) 1.5108256237659907 #'TFIDF'
(3, 11) 1.916290731874155 #'the'
(3, 7) 1.916290731874155 #'product'
(3, 6) 1.916290731874155 #'of'
(3, 9) 1.916290731874155 #'TF'
(3, 0) 1.916290731874155 #'and'
(3, 4) 1.916290731874155 #'IDF'
Because every word appears just once in each document, the TF-IDF values are the IDF values of each word times 1:
tfidf_vectorizer = TfidfVectorizer(norm=None)
tfidf = tfidf_vectorizer.fit_transform(df[0])
print(tfidf_vectorizer.idf_)
output: Smoothed IDF-values
[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.22314355
1.91629073 1.91629073 1.51082562 1.91629073 1.51082562 1.91629073
1.51082562]
I hope the above is helpful to you.
The transformation itself is plain L2 normalization: each element of a row is divided by the Euclidean norm of that row (the square root of the sum of squares of its elements). Note that with the default norm='l2' the resulting values depend on how many words each document contains; if you don't want that, simply turn the normalization off with norm=None. You can reproduce the normed values from the unnormed ones like this:
tfidf_norm_calculated = [
    [word / sum(w ** 2 for w in doc) ** 0.5 for word in doc]
    for doc in TfidfVectorizer(norm=None).fit_transform(df[0]).toarray()]
print(tfidf_norm_calculated)
print('Sum of squares of vector elements is 1: ', [sum(word ** 2 for word in doc) for doc in tfidf_norm_calculated])
print('Compare to:', TfidfVectorizer().fit_transform(df[0]).toarray())

How can I find median with an even amount of numbers in a list?

This is what I have right now. It only finds the median for an odd amount of numbers.
def median(height):
    height.sort()
    x = len(height)
    x -= 1
    posn = x // 2
    return height[posn]
"The median is the numeric value separating the higher half of a sample data set from the lower half. The median of a data set can be found by arranging all the values from lowest to highest value and picking the one in the middle. If there is an odd number of data values then the median will be the value in the middle. If there is an even number of data values the median is the mean of the two data values in the middle." - Source
For the data set 1, 1, 2, 5, 6, 6, 9 the median is 5.
For the data set 1, 1, 2, 6, 6, 9 the median is 4. It is the mean of 2 and 6, or (2+6)/2 = 4.
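Following that definition, here is a version of your function that handles both cases (a sketch; it sorts a copy so the caller's list is not modified):

def median(height):
    ordered = sorted(height)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # odd count: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the two middle values

print(median([1, 1, 2, 5, 6, 6, 9]))  # 5
print(median([1, 1, 2, 6, 6, 9]))     # 4.0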

Equal-depth binning - is it just grouping data into k groups?

A small confusion about equal-depth (equal-frequency) binning.
Equal-depth binning divides the range into N intervals, each containing approximately the same number of samples.
Let's take a small portion of the iris data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
If I need to bin the 1st column, what will the results be?
Is it just grouping the data, or does it involve some calculation like equal-width binning does?
What happens if the number of elements to be binned is odd? How will I bin equally?
Like @Anony-Mousse mentions, it is not always possible to get exactly the same number of samples into each bin; approximately equal is what is desired.
I will walk you through the case where unique(N)/bins > 1, where N represents the values in the array to be binned. Assume
N = [1, 1, 1, 1, 1, 1,
     2, 3, 4, 5,
     6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
bins = 4
Here, length(N) = 20 and length(unique(N)) = 6, making unique(N)/bins = 1.5. That means every bin should hold approximately 1.5 unique values. So you put 1 into bin1, carrying the 0.5 residue over to the next bin, which can then hold 1.5 + 0.5 = 2 values, so 2 and 3 go into bin2. Extrapolating this logic, the final split is [1], [2,3], [4], [5,6] (where, of course, 1 repeats 6 times and 6 repeats 10 times).
I would not want ties to sit in separate bins; that is usually the point of having bins (grouping values close to one another).
For cases with unique(N)/bins < 1, the same logic can be applied; a sketch of this procedure follows. Hope this answers your question.
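Here is a small Python sketch of the residue-carrying split described above (equal_depth_bins is a made-up helper, not a library function):

def equal_depth_bins(values, n_bins):
    # distribute the unique values so each bin holds ~len(unique)/n_bins of them
    uniq = sorted(set(values))
    per_bin = len(uniq) / n_bins
    bins, start, carry = [], 0, 0.0
    for _ in range(n_bins):
        take = per_bin + carry
        count = int(take)        # whole values that fit into this bin
        carry = take - count     # residue carried over to the next bin
        bins.append(uniq[start:start + count])
        start += count
    return bins

print(equal_depth_bins([1]*6 + [2, 3, 4, 5] + [6]*10, 4))  # [[1], [2, 3], [4], [5, 6]]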
Sometimes you cannot make bins of exactly the same size.
For example, if your data is
1,1,1,2,99
and you want 4 bins, then the most intuitive result should be
[1,1,1], [2], [], [99]
Most tools will produce one of these answers:
[1,1,1], [], [2], [99]
[1,1], [1], [2], [99]
[1], [1], [1], [2,99]
None of them has exactly 1.25 elements in every bin. The last two solutions are closest, but also the least intuitive. That is why one only demands "approximately the same number": sometimes there is no good solution that achieves this frequency exactly.
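For what it's worth, pandas' qcut does equal-frequency binning from quantiles, and you can see both behaviors with it (a sketch using the iris column and the tie-heavy example from above):

import pandas as pd

# first column of the iris snippet: 5 values into 2 bins of roughly equal count
print(pd.qcut(pd.Series([5.1, 4.9, 4.7, 4.6, 5.0]), q=2))

# with heavy ties, duplicate quantile edges have to be dropped,
# so fewer than the requested 4 bins come back
print(pd.qcut(pd.Series([1, 1, 1, 2, 99]), q=4, duplicates='drop'))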

PYTHON: sum of two numbers

I'm trying to find, for the numbers from 1 to 50, which of them are the sum of two numbers from another list. The other list is 1, 2, 4, 6, 18, 26.
I'm basically trying to write a "for x in range(1, 50):" type program that lists all the numbers from 1 to 50 and next to each one says "TRUE" if it is the sum of any two of the numbers in that list (e.g. 1 + 1, 1 + 4, 1 + 26, 4 + 18, 18 + 26, etc.).
Any ideas??
Thank you!!
Matt
Iterate over all possible pairs of numbers and store each sum:
numbers = [1, 2, 4, 6, 18, 26]
sums = []
for n1 in numbers:
    for n2 in numbers:
        sums.append(n1 + n2)   # add them together and store the result in `sums`
And then check whether each number in your range is in the list of sums:
for n in range(1, 51):
    if n in sums:
        print(n, 'TRUE')       # `n` is the sum of two numbers from your list
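The same idea in one pass with a set, which makes the membership test fast (note that a pair may reuse the same number twice, matching your 1 + 1 example):

numbers = [1, 2, 4, 6, 18, 26]
sums = {a + b for a in numbers for b in numbers}  # all pairwise sums
for n in range(1, 51):
    print(n, n in sums)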
def solveMeFirst(a, b):
    # Hint: Type return a+b below
    return a + b

num1 = int(input())
num2 = int(input())
res = solveMeFirst(num1, num2)
print(res)