I have a large pandas Series containing unique numbers from 0 to 1,000,000. The Series is not complete; it lacks some numbers in this range. I want to get a rough idea of which numbers are missing, so I'm thinking I should plot the data as a line, with gaps showing the missing data.
How would I accomplish that? This does not work:
nums = pd.Series(myNumbers)
nums.plot()
The following provides a list of the missing numbers in the Series nums. You can then plot them as needed. For your purposes, adjust max_num to 1E6.
import pandas as pd

max_num = 10  # highest number to look for in the Series
nums = pd.Series([1, 2, 3, 4, 5, 6, 9])
present = set(nums.values)  # set membership tests are O(1)
missing = [n for n in range(max_num + 1) if n not in present]
print(missing)
# prints: [0, 7, 8, 10]
I think there are two concerns with the plotting call you wrote. First, there are a million numbers, so a plain line plot will be hard to read. Second, the x-axis of the plot will be the index of the Series (starting at 0 and increasing sequentially), while the y-axis will be the numbers you care about (nums.values in the code here). Therefore, you are looking for missing y-axis values.
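To make this concrete, here is a minimal sketch (my own, using the small example Series from above) of one way to get the "line with gaps" the question asks for: reindex the values over the full range, so absent numbers become NaN, which matplotlib renders as gaps in the line.

import pandas as pd
import matplotlib.pyplot as plt

nums = pd.Series([1, 2, 3, 4, 5, 6, 9])
# index the values by themselves, then reindex over the full range;
# absent numbers become NaN and show up as gaps in the plotted line
full = pd.Series(nums.values, index=nums.values).reindex(range(11))
full.plot()
plt.show()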
I think it depends on what you mean by missing. If those are NaNs, then you can count them with something like:
import numpy as np
len(nums[nums.apply(np.isnan)])  # count of NaN entries
If you are looking for numbers between 0 and 1M that are not present in the Series, then do something like:
a = set(range(int(1e6)))
b = set(nums.values)
print(len(a - b))  # or plot the difference as a scatter
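A minimal sketch of the plotting step suggested above (assuming matplotlib; eventplot is just one reasonable choice):

import matplotlib.pyplot as plt

missing = sorted(a - b)
# one vertical tick per missing number; clusters of ticks reveal large gaps
plt.eventplot(missing)
plt.yticks([])
plt.xlabel('missing number')
plt.show()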
I have the following spreadsheet:
https://docs.google.com/spreadsheets/d/1Ib2Do3htfRg3NAuI-HyRA3MBM1XwUviFcAxlvF7q1J0/edit?usp=sharing
I have created 2 sparklines: 1 works, 1 doesn't. The one that does not work references the second column as the x-axis to calculate the slope. The slope is needed to give the graph some nice trending color.
My question is: how can I convert the second column into a serial [1, 2, 3, 4, 5], so that when it is used as the x-axis, the slope is calculated correctly? Of course, this conversion needs to happen within the formula itself. Thanks for any help.
Try the following; ROW(B2:B)-1 produces the serial sequence 1, 2, 3, ... that SLOPE then uses as its x-values:
=ARRAYFORMULA(SPARKLINE(C2:C, {
"charttype", "line";
"color", IF(SLOPE(C2:C, ROW(B2:B)-1)>0, "lime", "red");
"linewidth", 2}))
I have a dataset similar to the following table:
The prediction target is going to be the 'score' column. I'm wondering how I can divide the test set into subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0, 1, 2, 3, 4):
    y_new = y_test[(y_test >= i) & (y_test <= i + 1)]
    y_new_pred = model.predict(X_test)
    print(metrics.r2_score(y_new, y_new_pred))
However, my code did not work; this is the traceback that I get:
ValueError: Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but it looks like the r^2 for the full score range (0-5) is 0.67, while for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are all significantly lower than that of the full range. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.
When you are computing the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically, you are trying to compute
metrics.r2_score([1, 3], [1, 2, 3, 4, 5])
which raises an error:
ValueError: Found input variables with inconsistent numbers of samples: [2, 5]
Hence, my suggested solution would be:
model.fit(X_train, y_train)
# compute the prediction only once
y_pred = model.predict(X_test)
for i in (0, 1, 2, 3, 4):
    # compute the condition for the subset here
    subset = (y_test >= i) & (y_test <= i + 1)
    print(metrics.r2_score(y_test[subset], y_pred[subset]))
I am trying to find the Jaccard similarity between two documents. However, I am having a hard time understanding how the function sklearn.metrics.jaccard_similarity_score() works behind the scenes. As per my understanding, Jaccard sim = (intersection of the terms in the docs) / (union of the terms in the docs).
Consider the example below.
My DTM for the two documents is:
array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)
The function above gives me the Jaccard similarity score:
print(sklearn.metrics.jaccard_similarity_score(tf_matrix[0,:],tf_matrix[1,:]))
0.25
I am trying to find the score on my own as:
intersection of terms in both the docs = 4
total terms in doc 1 = 6
total terms in doc 2 = 6
Jaccard = 4 / (6 + 6 - 4) = 0.5
Can someone please help me understand if there is something obvious I am missing here?
As stated here:
In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.
Therefore, in your example it is calculating the proportion of matching elements; that's why you're getting 0.25 as the result.
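A quick sketch to verify this reading, using the two DTM rows from the question (the 0.25 is just the proportion of positions where the rows match exactly):

import numpy as np

row0 = np.array([1, 1, 1, 1, 2, 0, 1, 0])
row1 = np.array([2, 1, 1, 0, 1, 1, 0, 1])

# only positions 1 and 2 match exactly: 2 matches out of 8 positions
print((row0 == row1).mean())  # 0.25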
According to me:
intersection of terms in both the docs = 2. This is the element-wise intersection at each respective index, since the model needs to predict the correct value position by position.
The plain set intersection, ignoring index order, = 4.
# so,
jaccard_score = 2 / (6 + 6 - 4) = 0.25
I have two lists: the first represents times of observation, and the second represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time given rolling windows of various lengths. For example's sake, here are the two lists.
# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]
# actual dataset is of size ~ 11,000
The missing times (e.g. 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations within the same floored time. Since my window will roll over time_count (e.g. max value in the first 2 hours, the next 2 hours, the 2 hours after that; max value in the first 4 hours, the next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set everything up properly first, which entails finding the maximum value among duplicate times. To solve this problem, I tried the code just below.
def list_duplicates(data_list):
    seen = set()
    seen_add = seen.add
    # an element ends up in seen_twice if it was already in seen when reached
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)
# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]
# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4
When checking for the index of the duplicates, it appears that this code only returns the index of the first occurrence of the duplicate value. I also tried using OrderedDict from the collections module for reasons of efficiency/speed, but dictionaries have a similar problem: given duplicate keys with non-duplicate observation values, the first instance of the duplicate key and its corresponding observation value is kept while all others are dropped from the dict. Per this SO post, my second attempt is just below.
for dup in dups:
    indexes = [i for i, x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6] # indices correspond to duplicate time 10.0 but not duplicate time 8.0
I should be getting [2, 3] for the time_count value 8.0 and [4, 5, 6] for the time_count value 10.0. Among the duplicate time_counts, 475.2 is the max linspeed corresponding to duplicate time 8.0 and 400.9 is the max linspeed corresponding to duplicate time 10.0, meaning that the other linspeeds at the leftover indices of the duplicate time_counts would be removed.
I'm not sure what else I can try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I made numpy a tag because I think there is a way to do this via numpy that I haven't figured out yet.)
Without going into the details of how to implement an efficient rolling-window-maximum filter: reducing the duplicate values can be seen as a grouping problem, for which the numpy_indexed package (disclaimer: I am its author) provides efficient and simple solutions:
import numpy_indexed as npi
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear, and performance is O(N log N) in general; but since time_count appears to be sorted already, performance should be linear.
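For instance, with the lists from the question (a sketch assuming the numpy_indexed package is installed):

import numpy_indexed as npi

linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]

# one entry per unique time, keeping the max observation for that time
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)
print(unique_time)   # -> [4., 6., 8., 10., 14., 16.]
print(unique_speed)  # -> [280., 275., 475.2, 400.9, 323.8, 289.7]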
OK, if you want to do this with numpy, it's best to turn both of your lists into arrays:
import numpy as np

l = np.array(linspeed)
tc = np.array(time_count)
Now, finding unique times is just an np.unique call:
u, i, c = np.unique(tc, return_inverse = True, return_counts = True)
u
Out[]: array([ 4., 6., 8., 10., 14., 16.])
i
Out[]: array([0, 1, 2, 2, 3, 3, 3, 4, 5], dtype=int32)
c
Out[]: array([1, 1, 2, 3, 1, 1])
Now you can either build your maximums with a for loop
m = np.array([np.max(l[i == j]) for j in range(u.size)])
m
Out[]: array([ 280. ,  275. ,  475.2,  400.9,  323.8,  289.7])
Or try a 2D method. This could be faster, but it would need to be optimized; this is just the basic idea:
np.max(np.where(i[None, :] == np.arange(u.size)[:, None], linspeed, 0),axis = 1)
Out[]: array([ 280. , 275. , 475.2, 400.9, 323.8, 289.7])
Now your m and u vectors are the same length and include the output you want.
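As a possible follow-up (my own sketch, not part of the answer above): the question says missing times should count as observed zeros before the window reshaping, so the per-time maxima can be placed on a dense integer time grid first.

# place the maxima on a dense grid over integer times 0..16;
# unobserved times stay 0.0, as the question specifies
dense = np.zeros(int(u.max()) + 1)
dense[u.astype(int)] = m
# e.g. 2-hour windows over times 0..15 (truncated for divisibility):
windows = dense[:16].reshape(-1, 2).max(axis=1)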
I need to draw a bar graph for the values:
male=('2', '1', '2', '6', '6', '1') # list may increase
time=('Tue_Aug_13_04:37:40_2013', 'Mon_Jul__1_02:33:11_2013','Tue_Aug_13_04:37:40_2013', 'Thu_Jul__4_01:53:32_2013', 'Mon_Jul__1_10:05:55_2013','Mon_Jul__1_04:15:25_2013')# list may increase
female=(16, 11, 16, 12, 12, 11) # list may increase
Male in green colour, female in red colour, as in the image attached below:
The code which I tried:
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse, Polygon
fig = plt.figure()
ax1 = fig.add_subplot(131)
ax1.bar(male, color='red', edgecolor='black')
ax1.bar(bottom=range(female), color='blue', edgecolor='black')
ax1.set_xticks(time)
plt.show()
What modifications do I need to make in order to draw the bar graph as shown in the image attached for my values?
1.) I strongly suggest that you familiarize yourself with the Python syntax:
What's the difference between lists enclosed by square brackets and parentheses?
What's the difference between '2' and 2?
2.) Make use of the matplotlib documentation to figure out the correct syntax for the plot commands you are using.
3.) In this particular case: To get you going, change your data to:
male=[2, 1, 2, 6, 6, 1] # list may increase
time=['Tue_Aug_13_04:37:40_2013', 'Mon_Jul__1_02:33:11_2013','Tue_Aug_13_04:37:40_2013', 'Thu_Jul__4_01:53:32_2013', 'Mon_Jul__1_10:05:55_2013','Mon_Jul__1_04:15:25_2013']# list may increase
female=[16, 11, 16, 12, 12, 11] # list may increase
Please examine carefully what has changed.
4.) The bar command you are trying to call does not have enough input arguments. With the changed data from above, try this:
ax1.bar(range(len(time)),male,width=0.5, color='red', edgecolor='black')
ax1.bar(range(len(time)),female,width=0.5,bottom=male,color='blue', edgecolor='black')
What has changed?
you need the following inputs: left, height, width=0.8
you had only one of those
due to the fact that your dates are given as strings, you need a generic counter for the x-axis, hence the range(len(time)) to provide as many ticks as there are entries in time.
now, you specify the height according to the values in male and female - none of which should be strings!
define a width
in your case, you want the bars to be stacked - therefore, specify the first set of values as bottom for the second
5.) Because time is made up of strings, you cannot use it directly for the ticks. Instead, try:
ax1.set_xticklabels(time,rotation=90)
Here, you use the strings from time as tick-labels. The rotation=90 is a nice feature so that the long strings do not overlap.
6.) If the labels are cut off by the plot window, try this:
plt.tight_layout()
plt.show()
This should get you back on track.
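Putting the pieces together, a minimal self-contained sketch using the cleaned-up data from 3.) (the explicit set_xticks call is my addition, so the labels line up on current matplotlib versions):

import matplotlib.pyplot as plt

male = [2, 1, 2, 6, 6, 1]
female = [16, 11, 16, 12, 12, 11]
time = ['Tue_Aug_13_04:37:40_2013', 'Mon_Jul__1_02:33:11_2013',
        'Tue_Aug_13_04:37:40_2013', 'Thu_Jul__4_01:53:32_2013',
        'Mon_Jul__1_10:05:55_2013', 'Mon_Jul__1_04:15:25_2013']

x = range(len(time))
fig, ax1 = plt.subplots()
# stacked bars: the female values sit on top of the male values
ax1.bar(x, male, width=0.5, color='red', edgecolor='black')
ax1.bar(x, female, width=0.5, bottom=male, color='blue', edgecolor='black')
ax1.set_xticks(list(x))
ax1.set_xticklabels(time, rotation=90)
plt.tight_layout()
plt.show()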
Good key words for a web search include:
matplotlib stacked bar
matplotlib tick labels rotation
matplotlib ticks date