rrdtool's fetch doesn't show a single point - python-2.7

I used the rrdtool Python extension to save data in an RRD:
## Creating db.
rrdtool.create(rrd_file,
               '--step', '2',
               'DS:%s:GAUGE:4:U:U' % DSNAME,
               'RRA:AVERAGE:0,5:1:288',
               )
value = 23
for i in range(4):
    rrdtool.update('/home/way/workspace/RrdDaemon/test', "%s:%s" % (datetime_2_sec(str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))), str(value)))
    sleep(2)
The loop ran 4 times, so I expected to get 4 points, but I get only 3:
1460382646: -nan
1460382648: 2,3000000000e+01
1460382650: 2,3000000000e+01
1460382652: 2,3000000000e+01
1460382654: -nan
I tried changing the heartbeat, step and xff, but nothing helps. Now I try with a single iteration:
for i in range(1):
    rrdtool.update('/home/way/workspace/RrdDaemon/test', "%s:%s" % (datetime_2_sec(str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))), str(value)))
Timestamp : 1460385371
Result:
1460385368: -nan
1460385370: -nan
1460385372: -nan
sudo rrdtool info test:
filename = "test"
rrd_version = "0003"
step = 2
last_update = 1460385371
header_size = 584
ds[ds0].index = 0
ds[ds0].type = "GAUGE"
ds[ds0].minimal_heartbeat = 3
ds[ds0].min = NaN
ds[ds0].max = NaN
ds[ds0].last_ds = "23"
ds[ds0].value = NaN
ds[ds0].unknown_sec = 1
rra[0].cf = "AVERAGE"
rra[0].rows = 288
rra[0].cur_row = 65
rra[0].pdp_per_row = 1
rra[0].xff = 0,0000000000e+00
rra[0].cdp_prep[0].value = NaN
rra[0].cdp_prep[0].unknown_datapoints = 0
Am I doing something wrong, or is this simply how rrd works?
Thank you.

The problem you are running into is that you are crossing from an unknown state into a known state. The minimal_heartbeat defines the maximum interval permissible between two updates, for rrdtool to consider the time in between the updates to contain valid data.
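This also means that the first update after a period of unknown data only serves to mark the time at which the data becomes known again. Only the next update (arriving within the interval defined by minimal_heartbeat) actually stores a value. That is why four updates give you three known points and a single update gives you none.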
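The effect can be sketched with a deliberately simplified model. This is an illustration only, not rrdtool's actual consolidation algorithm: an interval between two consecutive updates counts as known only if the gap does not exceed the heartbeat, so the first update never produces a point by itself. The function name known_intervals and the timestamps are hypothetical.

```python
def known_intervals(timestamps, heartbeat):
    """Return (previous, current) pairs whose gap is within the heartbeat.

    Simplified model: the first update has no predecessor, so it only
    marks the moment at which data becomes known.
    """
    pairs = zip(timestamps, timestamps[1:])
    return [(prev, cur) for prev, cur in pairs if cur - prev <= heartbeat]


# Four updates 2 seconds apart with a heartbeat of 4 seconds.
updates = [1460382647, 1460382649, 1460382651, 1460382653]
print(len(known_intervals(updates, heartbeat=4)))       # 3
print(len(known_intervals([1460385371], heartbeat=4)))  # 0
```

Under this model, N regularly spaced updates give N-1 known intervals, which matches the three values seen in the question.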

Related

Parsing periods in a column dataframe

I have a csv with one of the columns that contains periods:
timespan (string): PnYnMnD, where P is a literal value that starts the expression, nY is the number of years followed by a literal Y, nM is the number of months followed by a literal M, nD is the number of days followed by a literal D, where any of these numbers and corresponding designators may be absent if they are equal to 0, and a minus sign may appear before the P to specify a negative duration.
I want to return a data frame that contains all the data in the csv with parsed timespan column.
So far I have code that parses periods:
import re
timespan_regex = re.compile(r'P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')
def parse_timespan(timespan):
    # check if the input is a valid timespan
    if not timespan or 'P' not in timespan:
        return None
    # check if timespan is negative and skip the leading '-' so matching starts at the 'P' literal
    curr_idx = 0
    is_negative = timespan.startswith('-')
    if is_negative:
        curr_idx = 1
    # extract years, months and days with the regex
    match = timespan_regex.match(timespan[curr_idx:])
    years = int(match.group(1) or 0)
    months = int(match.group(2) or 0)
    days = int(match.group(3) or 0)
    timespan_days = years * 365 + months * 30 + days
    return timespan_days if not is_negative else -timespan_days
print(parse_timespan(''))
print(parse_timespan('P2Y11M20D'))
print(parse_timespan('-P2Y11M20D'))
print(parse_timespan('P2Y'))
print(parse_timespan('P0Y'))
print(parse_timespan('P2Y4M'))
print(parse_timespan('P16D'))
Output:
None
1080
-1080
730
0
850
16
How do I apply this code to the whole csv column while running the function processing csv?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv(f_path, names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
                          parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:]  # to remove the first row
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    my_ocan['timespan'] = parse_timespan(my_ocan['timespan'])  # I tried like this, but sure it is not working :)
    return my_ocan
Thank you and have a lovely day :)
Like Python's built-in map, pandas also has a map method on Series (see the pandas documentation for Series.map). Since your function already takes a single value and returns a value, you just need this:
# Pass each value in the "timespan" column to parse_timespan and store the returned value in that row
my_ocan['timespan'] = my_ocan['timespan'].map(parse_timespan)
And here is a generic demo:
import pandas as pd
def demo_func(x):
    # Takes an int or string, prefixes it with 'A' and returns a string.
    return "A" + str(x)
df = pd.DataFrame({"Column_1": [1, 2, 3, 4], "Column_2": [10, 9, 8, 7]})
print(df)
df['Column_1'] = df['Column_1'].map(demo_func)
print("After mapping:\n{}".format(df))
Output:
   Column_1  Column_2
0         1        10
1         2         9
2         3         8
3         4         7
After mapping:
  Column_1  Column_2
0       A1        10
1       A2         9
2       A3         8
3       A4         7
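One thing to watch when mapping over a column read from a CSV: empty cells arrive as float NaN rather than strings, and the question's function would raise a TypeError on them. Below is a minimal sketch of a more defensive variant, assuming the same PnYnMnD format; the name parse_timespan_safe is hypothetical and not part of the original code.

```python
import re

# Same pattern as in the question, with an optional leading minus sign.
TIMESPAN_RE = re.compile(r'(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?')


def parse_timespan_safe(timespan):
    """Return the timespan in days, or None for anything unparseable."""
    if not isinstance(timespan, str):
        return None  # e.g. float NaN from an empty CSV cell
    match = TIMESPAN_RE.fullmatch(timespan.strip())
    if match is None:
        return None
    sign, years, months, days = match.groups()
    total = int(years or 0) * 365 + int(months or 0) * 30 + int(days or 0)
    return -total if sign else total


values = ['P2Y11M20D', '-P2Y11M20D', 'P16D', float('nan'), 'n/a']
print([parse_timespan_safe(v) for v in values])
# [1080, -1080, 16, None, None]
```

It can be passed to map in the same way, for example my_ocan['timespan'].map(parse_timespan_safe).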

load multiple csv files into Dataframe: columns names issue

I have multiple csv files with the same format (14 rows, 4 columns).
I tried to load all of them into a single DataFrame and use each file's name to rename the values of the first column (1-14):
1 500 0 0
2 350 0 1
3 500 1 0
.............
13 600 0 0
14 800 0 0
I tried the following code but I am not getting what I am expecting:
filenames = os.listdir('Threshold/')
Y = pd.DataFrame()  # empty df
# file names are in the following format "subx_ICA_thre.csv"
# need to get x (subject number to be used later for renaming column values)
Sub_list = []
for filename in filenames:
    s = int(''.join(filter(str.isdigit, filename)))
    Sub_list.append(int(s))
S_Sub_list = sorted(Sub_list)
for x in S_Sub_list:  # get the file according to the subject number
    temp = pd.read_csv('sub' + str(x) + '_ICA_thre.csv')
    df = pd.concat([Y, temp])  # concat the obtained frame with the empty frame
    df.columns = ['id', 'data', 'isEB', 'isEM']
    # replace the column values using subject id
    for sub in range(1, 15):
        df['id'].replace(sub, 'sub' + str(x) + '_ICA_' + str(sub), inplace=True)
    print(df)
output:
             id  data  isEB  isEM
0    sub1_ICA_2   200     0     0
1    sub1_ICA_3   275     0     0
2    sub1_ICA_4   500     1     0
................................
11  sub1_ICA_13   275     0     0
12  sub1_ICA_14   300     0     0
             id  data  isEB  isEM
0    sub2_ICA_2   275     0     0
1    sub2_ICA_3   500     0     0
2    sub2_ICA_4   400     0     0
.................................
11  sub2_ICA_13   300     0     0
12  sub2_ICA_14   450     0     0
First, it seems that the code makes a separate DataFrame per file instead of a single one. Second, the first row is removed (sub1_ICA_1 is missing; maybe it was consumed as the column names).
I couldn't find the problem in the loop that I am using.
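I think you need to create a list of DataFrames first, then concat them with the keys parameter so that each file gets its own value in the first level of a MultiIndex, then modify the id column, and finally remove the MultiIndex with reset_index.
The names parameter was also added to read_csv to set custom column names, so the first data row is no longer consumed as the header.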
Y = []
for x in S_Sub_list:
    n = ['id', 'data', 'isEB', 'isEM']
    temp = pd.read_csv('sub' + str(x) + '_ICA_thre.csv', names=n)
    Y.append(temp)
# list comprehension alternative
# n = ['id', 'data', 'isEB', 'isEM']
# Y = [pd.read_csv('sub' + str(x) + '_ICA_thre.csv', names=n) for x in S_Sub_list]
df = pd.concat(Y, keys=range(1, len(S_Sub_list) + 1))
df['id'] = 'sub' + df.index.get_level_values(0).astype(str) + '_ICA_' + df['id'].astype(str)
df = df.reset_index(drop=True)
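The same idea can be tried without any files on disk. The sketch below is only an illustration: small in-memory frames stand in for the CSV files, and the subject numbers are used directly as concat keys (via a dict) instead of a range, which may be preferable if the subject numbers are not consecutive. The level names "subject" and "row" are made up for the example.

```python
import pandas as pd

cols = ["id", "data", "isEB", "isEM"]
# Stand-ins for the per-subject CSV files, keyed by subject number.
frames = {
    1: pd.DataFrame([[1, 500, 0, 0], [2, 350, 0, 1]], columns=cols),
    2: pd.DataFrame([[1, 275, 0, 0], [2, 500, 1, 0]], columns=cols),
}

# The dict keys become the outer index level, which is then moved to a column.
df = pd.concat(frames, names=["subject", "row"]).reset_index(level="subject")
df["id"] = "sub" + df["subject"].astype(str) + "_ICA_" + df["id"].astype(str)
df = df.drop(columns="subject").reset_index(drop=True)
print(df)
```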

percentage bins based on predefined buckets

I have a series of numbers and I would like to know the % of numbers falling in every bucket of a dataframe.
df['cuts'] has 10, 20 and 50 as values. Specifically, I would like to know what % of the series is in the [0-10], (10-20] and (20-50] bins, and this should be appended to the df dataframe.
I wrote the following code. I definitely feel that it could be improved. Any help is appreciated.
bin_cuts = [-1] + list(df['cuts'].values)
out = pd.cut(series, bins = bin_cuts)
df_pct_bins = pd.value_counts(out, normalize= True).reset_index()
df_pct_bins = pd.concat([df_pct_bins['index'].str.split(', ', expand = True), df_pct_bins['cuts']], axis = 1)
df_pct_bins[1] = df_pct_bins[1].str[:-1].astype(str)
df['cuts'] = df['cuts'].astype(str)
df_pct_bins = pd.merge(df, df_pct_bins, left_on= 'cuts', right_on= 1)
Consider the sample data df and s
df = pd.DataFrame(dict(cuts=[10, 20, 50]))
s = pd.Series(np.random.randint(50, size=1000))
Option 1
np.searchsorted
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.value_counts(
            c[np.searchsorted(c, s)],
            normalize=True
        )))
   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
Option 2
pd.cut
c = df.cuts.values
df.assign(
    pct=df.cuts.map(
        pd.cut(
            s,
            np.append(-np.inf, c),
            labels=c
        ).value_counts(normalize=True)
    ))
   cuts    pct
0    10  0.216
1    20  0.206
2    50  0.578
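Because the sample above is random, the bin boundaries are hard to check by eye. Here is a small deterministic sketch of the pd.cut approach, assuming the cuts are sorted in ascending order; the sample values are made up so that each boundary (10, 20, 50) is exercised.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cuts": [10, 20, 50]})
s = pd.Series([0, 5, 10, 11, 20, 21, 30, 50])

# Right-closed bins: (-inf, 10], (10, 20], (20, 50]
edges = np.append(-np.inf, df["cuts"].to_numpy())
binned = pd.cut(s, bins=edges, labels=df["cuts"].to_numpy())

# sort_index() puts the shares in category order, i.e. the order of the cuts.
df["pct"] = binned.value_counts(normalize=True).sort_index().to_numpy()
print(df)
```

Here 0, 5 and 10 land in the first bin, 11 and 20 in the second, and 21, 30 and 50 in the third, so the shares are 3/8, 2/8 and 3/8.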

Subtract value in one data frame from the next value in a second data frame

I have a data frame that is composed of several datasets (about 146 and counting). Two of my columns are labeled "start_time" and "stop_time", which represent the start and stop of a response (i.e., the total duration of the response).
I need to get the "inter-response time", that is, the stop_time of one response subtracted from the start_time of the next response. Basically if:
start_time = [1,4,7]
stop_time = [2,5,8]
I need:
start_time[1] - stop_time[0]
start_time[2] - stop_time[1]
in order to get:
iri = [2,2]
My code looks like this:
iri_t = []
def grps():
    for grp in lset2_name_grps.groups:
        beg_eng_t = pd.DataFrame([lset2_name_grps.stop_time, lset2_name_grps.start_time], columns=['end_t','beg_t'])
        end_t = [i for i in lset2_name_grps.stop_time]
        beg_t = [i for i in lset2_name_grps.start_time]
        beg_t = np.insert(beg_t, len(beg_t),0)
        end_t = np.insert(end_t, 0,0)
        iri_t.append(np.subtract(end_t, beg_t))
        # for i,j in zip(end_t, beg_t):
        #     iri_t.append(np.subtract(i,j))
        # lset2_name_grps['iri'] = iri_t
grps()
Essentially, it doesn't do anything close to what I'm trying to accomplish, and the only output I get is either "Not Implemented" or an error.
How about something like this:
import pandas as pd
starts = pd.Series([1, 4, 7])
stops = pd.Series([2, 5, 8])
iri_t = [0]  # the first response has no predecessor
for i in range(1, len(starts)):
    iri_t.append(starts[i] - stops[i-1])
times_df = pd.concat([starts, stops, pd.Series(iri_t)], axis=1)
This creates the following data frame:
   0  1  2
0  1  2  0
1  4  5  2
2  7  8  2
I think what you're asking (correct me if I'm wrong) is best accomplished by putting the two columns in a single dataframe, using shift to offset one of your columns, then doing an ordinary subtraction.
df = pd.DataFrame({'start_time':[1,4,7], 'stop_time':[2,5,8]})
df.start_time - df.stop_time.shift()
Out[5]:
0    NaN
1    2.0
2    2.0
dtype: float64
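Since the question mentions several datasets in one frame, the shift can also be applied per group, so that the first response of each dataset does not borrow the stop time of the previous dataset. This is a minimal sketch; the grouping column "name" and the sample values are assumptions, as the question does not show the real column.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "a", "a", "b", "b"],       # hypothetical dataset label
    "start_time": [1, 4, 7, 10, 15],
    "stop_time": [2, 5, 8, 12, 18],
})

# Previous stop_time within each dataset; NaN for each dataset's first row.
prev_stop = df.groupby("name")["stop_time"].shift()
df["iri"] = df["start_time"] - prev_stop
print(df)
```

The first row of every group gets NaN because it has no preceding response.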

How to remove rows from multiindex dataframe with string indices

I have a dataframe with a MultiIndex, from which I want to delete rows according to some index-based pattern. For example, I would like to remove frames 1-4 where the annotator is "Peter Test xx" and the label is "empty" in the dataframe below:
print df
                                 boundingbox x1  boundingbox y1  \
frame annotator     label
0     Peter Test xx empty                   NaN             NaN
1     Peter Test xx empty                   NaN             NaN
2     Peter Test xx empty                   NaN             NaN
3     Peter Test xx empty                   NaN             NaN
      Petaa         yea                     NaN             NaN
4     Peter Test xx empty                   NaN             NaN
5     P             empty frame             494              64
      Peter Test xx empty                   NaN             NaN
6     P             empty frame             494              64
      Peter Test xx empty                   NaN             NaN
7     P             empty frame             494              64
      Peter Test xx empty                   NaN             NaN
8     P             empty frame             494              64
      Peter Test xx empty                   NaN             NaN
I can select rows by doing something like
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
return df.loc[tuple(indexer),:]
If I want to delete these rows, ideally I would like to do something like
del df.loc[tuple(indexer),:]
But this does not work (why?). All solutions I found online were based on int based indices. But if I am working with strings as indices, I cannot simply slice or such things.
Something I tried as well was:
def filterFunc(x, frames, annotator, label):
    if x[0] in frames\
            and x[1] == annotator\
            and x[2] == label:
        return 1
    else:
        return 0
mask = df.index.map(lambda x: filterFunc(x, frames, annotator, label))
return df[~mask,:]
Which gives me:
TypeError: unhashable type: 'numpy.ndarray'
Any advice?
Trying to solve another problem I figured out that one can use the index of a selected part of a dataframe in drop:
indexer = [slice(None)]*len(df.index.names)
indexer[df.index.names.index('frame')] = range(1,4)
indexer[df.index.names.index('annotator')] = ['Peter Test xx']
indexer[df.index.names.index('label')] = ['empty']
selection = df.loc[tuple(indexer),:]
df.drop(selection.index)
Is that how it is supposed to be done?
You have to use loc, iloc or ix when doing more complicated slicing (note that ix has since been removed from pandas, so use loc or iloc in current versions):
df[msk] # works
df.iloc[msk, ] # works
df.iloc[msk, :] # works
but
df[msk, ]
TypeError: unhashable type: 'numpy.ndarray'
See the different choices for indexing in the pandas docs.
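Another way to drop the rows is to build a boolean mask directly from the index levels and keep its complement. The sketch below assumes the level names from the question (frame, annotator, label) and uses a tiny made-up frame with a hypothetical column "x1".

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        (0, "Peter Test xx", "empty"),
        (1, "Peter Test xx", "empty"),
        (3, "Petaa", "yea"),
        (5, "P", "empty frame"),
        (5, "Peter Test xx", "empty"),
    ],
    names=["frame", "annotator", "label"],
)
df = pd.DataFrame({"x1": range(5)}, index=index)

# One boolean array per index level, combined element-wise.
frame = df.index.get_level_values("frame")
annotator = df.index.get_level_values("annotator")
label = df.index.get_level_values("label")
mask = frame.isin([1, 2, 3, 4]) & (annotator == "Peter Test xx") & (label == "empty")

kept = df[~mask]  # rows matching all three conditions are dropped
print(kept)
```

Only the row for frame 1 matches all three conditions here, so the other four rows are kept.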