pyplot - yticks data representation help - need to convert to KB/MB - python-2.7

I am trying to plot a graph of throughput numbers.
My data is: x axis = time in epoch seconds, y axis = throughput in bytes.
I have y-ticks as
print loc, labels
[       0.  5000000. 10000000. 15000000. 20000000. 25000000. 30000000. 35000000.]
<a list of 8 Text yticklabel objects>
I want to show this data in KB or MB. How can I go about it?
I am lost and stuck. Currently the y axis runs from 0 to 3.5 with a 1e7 offset label, which by itself does not read naturally as throughput.
So the y ticks are 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5 with a 1e7 multiplier.
Appreciate the help!
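One way to do this is with a tick formatter. A minimal sketch, assuming matplotlib's FuncFormatter from matplotlib.ticker; the plotted series below is a placeholder, not the asker's real data:
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# placeholder data: epoch times vs throughput in bytes
times = [1500000000, 1500000300, 1500000600, 1500000900]
throughput = [5000000, 15000000, 35000000, 20000000]

def bytes_to_mb(y, pos):
    # turn a raw byte value into a '12.5 MB' style tick label
    return '%.1f MB' % (y / 1e6)

fig, ax = plt.subplots()
ax.plot(times, throughput)
ax.yaxis.set_major_formatter(FuncFormatter(bytes_to_mb))
plt.show()
The same idea works for KB by dividing by 1e3 instead; the tick locations stay wherever matplotlib put them, only the labels change.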

Related

Jaccard similarity in python

I am trying to find the Jaccard similarity between two documents. However, I am having a hard time understanding how the function sklearn.metrics.jaccard_similarity_score() works behind the scenes. As I understand it, Jaccard similarity = intersection of the terms in the docs / union of the terms in the docs.
Consider the example below:
My DTM for the two documents is:
array([[1, 1, 1, 1, 2, 0, 1, 0],
[2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
The above function gives me the Jaccard similarity score:
print(sklearn.metrics.jaccard_similarity_score(tf_matrix[0,:],tf_matrix[1,:]))
0.25
I am trying to compute the score on my own as:
intersection of terms in both docs = 4
total terms in doc 1 = 6
total terms in doc 2 = 6
Jaccard = 4/(6+6-4) = 0.5
Can someone please help me understand if there is something obvious I am missing here?
As stated here:
In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.
Therefore in your example it is calculating the proportion of matching elements. That's why you're getting 0.25 as the result.
According to me:
element-wise intersection of terms in both docs = 2, comparing position by position according to their respective index, since the score treats each position as a prediction to get right.
The plain set intersection = 4, if you ignore the positions.
# so,
jaccard_score = 2/(6+6-4) = 0.25
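To make the contrast concrete, here is a small sketch on the question's matrix. It computes the element-wise "proportion of matching positions" by hand (which is what jaccard_similarity_score effectively returned here; newer scikit-learn releases replace it with jaccard_score) alongside the set-style Jaccard the asker had in mind; only numpy is assumed:
import numpy as np

tf_matrix = np.array([[1, 1, 1, 1, 2, 0, 1, 0],
                      [2, 1, 1, 0, 1, 1, 0, 1]])
a, b = tf_matrix[0], tf_matrix[1]

# fraction of positions where the two rows agree exactly
matching = np.mean(a == b)            # 2 matches out of 8 -> 0.25

# set-style Jaccard over the terms that are present (non-zero)
both = np.sum((a > 0) & (b > 0))      # 4
either = np.sum((a > 0) | (b > 0))    # 8
set_jaccard = both / float(either)    # 0.5

print(matching, set_jaccard)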

Superhuman Level - Pandas DataFrame Reshaping because of Duplicates

Do you like puzzles that only superhumans can solve? This is the final test to prove such an ability.
A single company might get different rounds of funding (seed, a) from multiple banks, possibly at different times.
Let's look at the data, then the story, to get a better picture.
import pandas as pd
data = {'id': [1, 2, 2, 3, 4], 'company': ['alpha', 'beta', 'beta', 'alpha', 'alpha'],
        'bank': ['z', 'x', 'y', 'z', 'j'],
        'rd': ['seed', 'seed', 'seed', 'a', 'a'], 'funding': [100, 200, 200, 300, 50],
        'date': ['2006-12-01', '2004-09-01', '2004-09-01', '2007-05-01', '2007-09-01']}
df = pd.DataFrame(data, columns=['id', 'company', 'rd', 'bank', 'funding', 'date'])
df
Yields:
   id company    rd bank  funding        date
0   1   alpha  seed    z      100  2006-12-01
1   2    beta  seed    x      200  2004-09-01
2   2    beta  seed    y      200  2004-09-01
3   3   alpha     a    z      300  2007-05-01
4   4   alpha     a    j       50  2007-09-01
Desired Output:
  company bank_seed  funding_seed   date_seed  bank_a  funding_a      date_a
0   alpha         z           100  2006-12-01  [z,j]         350  2007-09-01
1    beta     [x,y]           200  2004-09-01    None       None        None
As you can see, I am not a superhuman but shall try to explain my thought process.
Let's look at company alpha
Company alpha first got seed money for $100 from bank z in late 2006. A few months later, their investors were very happy with their progress so bank z gave them money ($300 more!). However, Company alpha needed a little more cash but had to go to some random Swiss bank j to stay alive. Bank j reluctantly gave $50 more. Yay! They now have $350 from their updated 'a' round ending in September 2007.
Company beta is pretty new. They got funding totaling $200 from two different banks. But wait... there's nothing in here about their round 'a'. That's okay we'll put None for now and check back with them later.
The issue is that Company alpha sucks and got money from the Swiss...
This is my non-working code; it had worked on a subset of my data, but it won't work here.
import itertools
unique_company = df.company.unique()
df_indexed = df.set_index(['company', 'rd'])
index = pd.MultiIndex.from_tuples(list(itertools.product(unique_company, list(df.rd.unique()))))
reindexed = df_indexed.reindex(index, fill_value=0)
reindexed = reindexed.unstack().applymap(lambda cell: 0 if '1970-01-01' in str(cell) else cell)
working_df = pd.DataFrame(
    reindexed.iloc[:, reindexed.columns.get_level_values(0).isin(['company', 'funding'])].to_records())
If you know how to solve part of this problem, go ahead and put it below. Thank you in advance for taking the time to look at this! :)
Lastly, if you want to see how my code does work, then do this, but you lose so much valuable info...
df = df.drop_duplicates(subset='id')
df = df.drop_duplicates(subset='rd')
Take a pre-processing step to spread out the funding across records with the same 'id' and 'date'
df.funding /= df.groupby(['id', 'date']).funding.transform('count')
Then process
d1 = df.groupby(['company', 'rd']).agg(
    dict(bank=lambda x: tuple(x), funding='sum', date='last')
).unstack().sort_index(axis=1, level=1)
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
d1
           bank  funding        date    bank  funding        date
rd            a        a           a    seed     seed        seed
company
alpha    (z, j)    350.0  2007-09-01    (z,)    100.0  2006-12-01
beta       None      NaN         NaT  (x, y)    200.0  2004-09-01
Groupby, aggregate and unstack will get you close to what you want:
out = df.groupby(['company', 'rd']).agg({'bank': lambda x: ','.join(x), 'funding': 'sum', 'date': 'max'}).unstack().reset_index()
out.columns = ['_'.join(col).strip() for col in out.columns.values]
You get:
  company_ bank_a bank_seed  funding_a  funding_seed      date_a   date_seed
0    alpha    z,j         z      350.0         100.0  2007-09-01  2006-12-01
1     beta   None       x,y        NaN         400.0        None  2004-09-01
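For completeness, a hedged sketch that combines the two answers above: spread the duplicated funding first (so beta's 200 is not double-counted), then aggregate per round and flatten the columns. The name out is just illustrative, and the columns will still need reordering to match the desired output exactly:
# spread funding across records that share the same id and date
df.funding /= df.groupby(['id', 'date']).funding.transform('count')

# one row per company, with the per-round fields side by side
out = df.groupby(['company', 'rd']).agg(
    {'bank': lambda x: list(x), 'funding': 'sum', 'date': 'max'}
).unstack()
out.columns = ['{}_{}'.format(field, rd) for field, rd in out.columns]
out = out.reset_index()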

How can I remove indices of non-max values that correspond to duplicate values of separate list from both lists?

I have two lists, the first of which represents times of observation and the second of which represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time given a rolling window of varying length. For example's sake, here are the two lists.
# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]
# actual dataset is of size ~ 11,000
The missing times (e.g. 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations at the floored time. Since my window will be rolling over the time_count (e.g. max value in the first 2 hours, the next 2 hours, the 2 hours after that; max value in the first 4 hours, the next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set everything up properly beforehand, which entails finding the maximum value given duplicate times. To solve this problem, I tried the code just below.
def list_duplicates(data_list):
    seen = set()
    seen_add = seen.add
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)
# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]
# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4
When checking for the index of the duplicates, it appears that this code will only return the index of the first occurrence of each duplicate value. I also tried an OrderedDict from the collections module for reasons of efficiency/speed, but dictionaries have a similar problem: given duplicate keys with non-duplicate observation values, the first instance of the duplicate key and its corresponding observation value is kept while all others are dropped from the dict. Per this SO post, my second attempt is just below.
for dup in dups:
    indexes = [i for i, x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6] # indices correspond to duplicate time 10.0 but not duplicate time 8.0
I should be getting [2,3] for time in time_count = 8.0 and [4,5,6] for time in time_count = 10.0. From the duplicate time_counts, 475.2 is the max linspeed that corresponds to duplicate time_count 8.0 and 400.9 is the max linspeed that corresponds to duplicate time_count 10.0, meaning that the other linspeeds at leftover indices of duplicate time_counts would be removed.
I'm not sure what else I can try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I made numpy a tag because I think there is a way to do this via numpy that I haven't figured out yet.)
Without going into the details of how to implement an efficient rolling-window-maximum filter, reducing the duplicate values can be seen as a grouping problem, which the numpy_indexed package (disclaimer: I am its author) provides efficient and simple solutions to:
import numpy_indexed as npi
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear and performance is O(N log N) in general; but since time_count appears to be sorted already, performance should be linear here.
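Run on the question's lists, a self-contained version of the above should look roughly like this (the commented results are what the grouping ought to produce, not a verified transcript):
import numpy_indexed as npi

linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]

unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
# unique_time  -> [  4.,   6.,   8.,  10.,  14.,  16.]
# unique_speed -> [280. , 275. , 475.2, 400.9, 323.8, 289.7]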
OK, if you want to do this with numpy, it's best to turn both of your lists into arrays:
import numpy as np

l = np.array(linspeed)
tc = np.array(time_count)
Now, finding unique times is just an np.unique call:
u, i, c = np.unique(tc, return_inverse = True, return_counts = True)
u
Out[]: array([ 4., 6., 8., 10., 14., 16.])
i
Out[]: array([0, 1, 2, 2, 3, 3, 3, 4, 5], dtype=int32)
c
Out[]: array([1, 1, 2, 3, 1, 1])
Now you can either build your maximums with a for loop
m = np.array([np.max(l[i == j]) if c[j] > 1 else l[i == j][0] for j in range(u.size)])
m
Out[]: array([ 280. ,  275. ,  475.2,  400.9,  323.8,  289.7])
Or try some 2d method. This could be faster, but it would need to be optimized. This is just the basic idea.
np.max(np.where(i[None, :] == np.arange(u.size)[:, None], linspeed, 0), axis=1)
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Now your m and u vectors are the same length and include the output you want.
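If what you ultimately need are the original indices of the kept (maximum) observations, not just their values, a plain-Python sketch along these lines should also work; the names idx_by_time and keep are just illustrative:
from collections import defaultdict

# collect every index at which each time occurs
idx_by_time = defaultdict(list)
for idx, t in enumerate(time_count):
    idx_by_time[t].append(idx)

# for each time, keep only the index of the largest observed value
keep = sorted(max(indices, key=lambda k: linspeed[k]) for indices in idx_by_time.values())

kept_times = [time_count[k] for k in keep]
kept_speeds = [linspeed[k] for k in keep]
# kept_speeds -> [280.0, 275.0, 475.2, 400.9, 323.8, 289.7]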

Plot non present numbers with pandas

I have a large pandas Series which contains unique numbers from 0 to 1,000,000. The series is not complete; it lacks some numbers in this range. I want to get a rough idea of which numbers are missing, so I'm thinking I should plot the data as a line with gaps showing the missing data.
How would I accomplish that? This does not work:
nums = pd.Series(myNumbers)
nums.plot()
The following provides a list of the missing numbers in the Series nums. You can then plot them as needed. For your purposes, adjust max to 1E6.
import pandas as pd

max = 10  # highest number to look for in the Series
nums = pd.Series([1, 2, 3, 4, 5, 6, 9])
missing = [n for n in xrange(int(max + 1)) if n not in nums.values]
print missing
# prints: [0, 7, 8, 10]
I think there are two concerns with the plotting approach you wrote. First, there are one million numbers. Second, the x-axis of the plot will be the indexes of the series (starting at 0, going sequentially); the y-axis will be the numbers that you care about (nums.values in the code here). Therefore, you are looking for missing y-axis values.
I think it depends on what you mean by missing. If those are NaNs, then you can do something like
import numpy
len(nums[nums.apply(numpy.isnan)])
If you are looking for numbers that are not present between 0 and 1M in the series, then do something like
a = set(xrange(int(1e6)))
b = set(nums.values)
print len(a - b)  # or plot it as a scatter.
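To actually see where the gaps are, a hedged sketch using matplotlib (the 1e6 range and the marker choice are arbitrary):
import matplotlib.pyplot as plt

present = set(nums.values)
missing = sorted(set(xrange(int(1e6))) - present)

# draw each missing number at y=0 so clusters of gaps show up as dense bands
plt.scatter(missing, [0] * len(missing), marker='|')
plt.xlabel('number')
plt.yticks([])
plt.title('missing values in nums')
plt.show()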

rrd4j archive type

I can't manage to create an archive with the correct type.
What am I missing?
My example is very similar to the official example on https://code.google.com/p/rrd4j/wiki/Tutorial
RRD creation:
rrdDef.setStartTime(L - 300);
rrdDef.addDatasource("speed", DsType.GAUGE, 600, Double.NaN, Double.NaN);
rrdDef.addArchive(ConsolFun.MAX, 0.5, 1, 24);
rrdDef.addArchive(ConsolFun.MAX, 0.5, 6, 10);
I add some values (1, 2, 3 for each step):
long x = L;
while (x <= L + 4200) {
    Sample sample = rrdDb.createSample();
    sample.setAndUpdate((x + 11) + ":1");
    sample.setAndUpdate((x + 12) + ":2");
    sample.setAndUpdate((x + 14) + ":3");
    x += 300;
}
And then I fetch it:
FetchRequest fetchRequest = rrdDb.createFetchRequest(ConsolFun.MAX, (L - 600), L + 4500);
FetchData fetchData = fetchRequest.fetchData();
String s = fetchData.dump();
I get this result (hoping to find the maximum):
920804100: NaN
920804400: NaN
920804700: +1.0000000000E00
920805000: +1.0166666667E00
920805300: +1.0166666667E00
...
920808600: +1.0166666667E00
920808900: +1.0166666667E00
920809200: NaN
I would like to see the maximum value here. I tried it with TOTAL as well, and I get the same result.
What do I have to change so that I get the greatest value sent in one step, or the sum of the values sent in one step?
Thanks
MAX is not the maximum input value but the maximum consolidated data point. What you're telling rrd, given your example, is:
At one point in time I'm going 1MPH
One second later I'm going 2MPH
Two seconds later I'm going 3MPH
rrd now has 3 data points covering 3 seconds of a 300-second interval. What should rrd store? 1, 2, or 3? None of the above: it has to normalize the data in some way to say that between X and X+STEP the rate is Y.
To complicate matters, it's not certain that your 3 data points land in the same 300-second interval. Your first 2 data points could be in one interval and the 3MPH could be in the next one. This is because the first data point stored is not exactly at start+step; i.e. if you start at 14090812456 it might be something like 14090812700 even though your step is 300.
The only way to store exact input values with GAUGE is to push updates at the exact step times at which rrd stores the data points: I'm going 1MPH at x, 2MPH at x+300, 3MPH at x+600, where x starts at the first data point.
Here is a bash example showing this working with your rrd settings. I'm using a constant start time, with x starting at what I know is rrd's first data point.
L=1409080000
rrdtool create max.rrd --start=$L DS:speed:GAUGE:600:U:U RRA:MAX:0.5:1:24 RRA:MAX:0.5:6:10
x=$(($L+200))
while [ $x -lt $(($L+3000)) ]; do
    rrdtool update max.rrd "$(($x)):1"
    rrdtool update max.rrd "$(($x+300)):2"
    rrdtool update max.rrd "$(($x+600)):3"
    x=$(($x+900))
done
rrdtool fetch max.rrd MAX -r 600 -s 1409080000
speed
1409080200: 1.0000000000e+00
1409080500: 2.0000000000e+00
1409080800: 3.0000000000e+00
1409081100: 1.0000000000e+00
1409081400: 2.0000000000e+00
1409081700: 3.0000000000e+00
1409082000: 1.0000000000e+00
Not really that useful, but if you increase the resolution to, say, 1200 seconds, you start getting the max over larger time intervals:
rrdtool fetch max.rrd MAX -r 1200 -s 1409080000
speed
1409081400: 3.0000000000e+00
1409083200: 3.0000000000e+00
1409085000: nan
1409086800: nan
1409088600: nan