Updating rrdtool database - python-2.7

My first post here so I hope I have not been too verbose.
I found I was losing datapoints due to only having 10 rows in my rrdtool config and wanted to update from a backup source file with older data.
After fixing the row count, the config was created with:
rrdtool create dailySolax.rrd \
--start 1451606400 \
--step 21600 \
DS:toGrid:GAUGE:172800:0:100000 \
DS:fromGrid:GAUGE:172800:0:100000 \
DS:totalEnerg:GAUGE:172800:0:100000 \
DS:BattNow:GAUGE:1200:0:300 \
RRA:LAST:0.5:1d:1010 \
RRA:MAX:0.5:1d:1010 \
RRA:MAX:0.5:1M:1010
and the update line in python is
newline = ToGrid + ':' + FromGrid + ':' + TotalEnergy + ':' + battNow
UpdateE = 'N:' + newline
print UpdateE
try:
    rrdtool.update(
        "%s/dailySolax.rrd" % (os.path.dirname(os.path.abspath(__file__))),
        UpdateE)
This all worked fine for inputting the original data (from a crontabbed website scrape) but as I said I lost data and wanted to add back the earlier datapoints.
From my backup source I had a plain text file with lines looking like
1509386401:10876.9:3446.22:18489.2:19.0
1509408001:10879.76:3446.99:18495.7:100.0
where the first field is the timestamp. I then used this code to read in the lines for the updates:
with open("rrdRecovery.txt","r") as fp:
for line in fp:
print line
## newline = ToGrid + ':' + FromGrid + ':' + TotalEnergy + ':' + battNow
UpdateE = line
try:
rrdtool.updatev(
"%s/dailySolax.rrd" % (os.path.dirname(os.path.abspath(__file__))),
UpdateE)
When it did not work correctly with a copy of the current version of the database I tried again on an empty database created using the same config.
In each case the update results only in the timestamp data in the database and no data from the other fields.
Python is not complaining and I expected
1509386401:10876.9:3446.22:18489.2:19.0
would update the same as does
N:10876.9:3446.22:18489.2:19.0
The dump shows the lastupdate data for all fields, but then this for the RRA data:
<!-- 2017-10-31 11:00:00 AEDT / 1509408000 --> <row><v>NaN</v><v>NaN</v><v>NaN</v><v>NaN</v></row>
Not sure if I have a python issue - more likely a rrdtool understanding problem. Thanks for any pointers.

The problem you have is that RRDTool timestamps must be increasing. This means that, if you increase the length of your RRAs (back into the past), you cannot put data directly into these points - only add new data onto the end as time increases. Also, when you create a new RRD, the 'last update' time defaults to NOW.
If you have a log of your previous timestamps, then you should be able to add this history, as long as you don't do any 'now' updates before you finish doing so.
First, create the RRD, with a 'start' time earlier than the first historical update.
Then, process all of the historical updates in chronological order, with the appropriate timestamps.
Finally, you can start doing your regular 'now' updates.
I suspect what has happened is that either your regular cronjob was adding in new data before you had run all of your historical data input, or else you created the RRD with a start time after your historical timestamps.
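For illustration, a minimal sketch of that order of operations in Python 2.7, reusing the create options and the recovery file from the question; the --start value here is only an assumed example and simply has to be earlier than the oldest line in rrdRecovery.txt:
import os
import rrdtool

rrd_path = "%s/dailySolax.rrd" % os.path.dirname(os.path.abspath(__file__))

# 1. Create the RRD with a start time before the first historical sample
rrdtool.create(
    rrd_path,
    '--start', '1509300000',  # assumed example; must predate the oldest backup line
    '--step', '21600',
    'DS:toGrid:GAUGE:172800:0:100000',
    'DS:fromGrid:GAUGE:172800:0:100000',
    'DS:totalEnerg:GAUGE:172800:0:100000',
    'DS:BattNow:GAUGE:1200:0:300',
    'RRA:LAST:0.5:1d:1010',
    'RRA:MAX:0.5:1d:1010',
    'RRA:MAX:0.5:1M:1010')

# 2. Feed the historical samples in strictly increasing timestamp order
with open("rrdRecovery.txt", "r") as fp:
    for line in fp:
        sample = line.strip()  # drop the trailing newline before handing it to rrdtool
        if sample:
            rrdtool.update(rrd_path, sample)

# 3. Only after the backfill is finished, resume the regular 'N:...' cron updates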

Related

My fuzzy logic code is taking too long and not giving result

I have a task to remove duplicate codes from master data of approximately 80K data points.
The duplicate identification can only be done using the 'description' field.
I have tried fuzzy logic - with both the fuzzywuzzy and thefuzz libraries. It's been more than 18 hours and the code is still running.
I have upgraded my instance size on the cloud platform, but that is also not helping.
Below is the code. I am looking forward to ways of getting the result quickly.
for index, row in df.iterrows():
    value = row[column_name]
    if any(fuzz.token_set_ratio(value, x) > 90 for x in unique_values):
        removed_duplicates = removed_duplicates.append(row)
        df = df.drop(index)
    else:
        # if it is not, add the value to the unique value list
        unique_values.append(value)

# save the modified dataframe to a new excel file
df.to_excel("file_without_duplicates_after_fuzzyV23.xlsx", index=False)
removed_duplicates.to_excel("removed_duplicatesV23.xlsx", index=False)
What I have tried so far: fuzzywuzzy, thefuzz, upgrading the instance on the cloud platform, and patience.
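No answer is shown in this excerpt, but as a rough sketch of one common speed-up (an assumption on my part, not something from the post): rapidfuzz scores strings in C and is a near drop-in replacement for fuzzywuzzy/thefuzz, and collecting rows in plain lists avoids the cost of DataFrame.append() and df.drop() inside the loop. df and column_name are the question's own names; the output filenames are placeholders.
import pandas as pd
from rapidfuzz import fuzz, process

unique_values = []
kept_rows, removed_rows = [], []

for _, row in df.iterrows():
    value = row[column_name]
    # one C-level scan over the uniques with an early cutoff, instead of a Python-level any(...)
    match = process.extractOne(value, unique_values,
                               scorer=fuzz.token_set_ratio, score_cutoff=90)
    if match is not None:
        removed_rows.append(row)      # near-duplicate of a description already kept
    else:
        unique_values.append(value)
        kept_rows.append(row)

pd.DataFrame(kept_rows).to_excel("file_without_duplicates.xlsx", index=False)
pd.DataFrame(removed_rows).to_excel("removed_duplicates.xlsx", index=False)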

Filtering prefetch_related() to optimize data set

I'm hitting a performance issue on my overview page and am looking for some help.
Below is my current query which is working just fine: the page gets rendered in less than 2 seconds with n=12 queries in total. Great so far.
person_occurrences = Occurrence.objects \
    .filter(date__year=first_day_in_month.year, date__month=first_day_in_month.month,
            real_time_in__isnull=False, real_time_out__isnull=False) \
    .distinct('person')

present_persons = Person.objects.filter(occurrence__in=person_occurrences).order_by('last_name') \
    .prefetch_related('occurrence_set') \
    .prefetch_related('fooindex_set') \
    .prefetch_related('servicevoucher_set') \
    .prefetch_related('stay_set__hospitalization_set') \
    .prefetch_related('invoice_set')
However I need to access one more nested model under occurrence_set:
.prefetch_related('occurrence_set__meal_noon_options') \
.prefetch_related('occurrence_set__meal_noon_partner_options')
When I add these two, my page takes over 25 seconds to load with n=14 queries in total.
When debugging (django-debug-toolbar), I noticed the executed query for the latter is:
SELECT ••• FROM "stays_mealoption" INNER JOIN
"edc_occurrence_meal_noon_options" ON ("stays_mealoption"."id" =
"edc_occurrence_real_meal_noon_options"."mealoption_id") WHERE
"edc_occurrence_real_meal_noon_options"."occurrence_id" IN (32930,
32931, 32932, 32933, 32934, 32935, 32936, 32937, 32938, 32939, 32940,
32941, 32942, 32943, 32944, 32945, 32946, 32947, 32948, 32949, 32950,
32951, 32952, 32953, 32954, 32955, 32956, 32957, 32958, 32959, 32960,
32961, 32962, 32963, 32964, 32965, 32966, 32967, 32968, 32969, 32970,
32971, 32972, 32973, 32974, 32975, 32976, 32977, 32978, 32979, 32980,
32981, 32982, 32983, 32984, 32985, 32986, 32987, 32988, 32989, 32990,
32991, etc.
The occurrence_ids are apparently all IDs that exist in my DB, currently over 10,000. The returned data set is therefore huge and is slowing down the page.
I tried to limit the occurrences by filtering on year/month in my 2nd query, i.e. Person.objects.filter(occurrence__in=person_occurrences, occurrence__date__month=first_day_in_month.month, occurrence__date__year=first_day_in_month.year).order_by('last_name')\ <snip>, but that didn't work: prefetch_related() still fetches all occurrence objects.
I also looked into Django 1.7+'s Prefetch() objects:
.prefetch_related(Prefetch('occurrence_set__meal_noon_options',
                           queryset=Occurrence.objects.filter(date__month=first_day_in_month.month, date__year=first_day_in_month.year))) \
.prefetch_related(Prefetch('occurrence_set__meal_noon_partner_options',
                           queryset=Occurrence.objects.filter(date__month=first_day_in_month.month, date__year=first_day_in_month.year)))
... but there I get complaints that the queryset should not be of Occurrence but should be of MealOption.
Any idea on how to properly optimize?
Using Postgresql and Django 1.10.
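Not an authoritative fix, but a minimal sketch of what that error message points at, assuming the model and field names above and that the page only needs the displayed month's occurrences: put the month filter on a Prefetch for occurrence_set itself (whose queryset is of Occurrence), and let the meal-option lookups ride on those filtered rows. The Prefetch has to come before the nested string lookups.
from django.db.models import Prefetch

month_occurrences = Occurrence.objects.filter(
    date__year=first_day_in_month.year,
    date__month=first_day_in_month.month,
)

present_persons = Person.objects.filter(occurrence__in=person_occurrences) \
    .order_by('last_name') \
    .prefetch_related(
        Prefetch('occurrence_set', queryset=month_occurrences),  # only this month's occurrences
        'occurrence_set__meal_noon_options',
        'occurrence_set__meal_noon_partner_options',
        'fooindex_set',
        'servicevoucher_set',
        'stay_set__hospitalization_set',
        'invoice_set',
    )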

RRD DB fake value generator

I want to generate fake values in an RRD database for a period of 1 month, with 5 seconds as the data-collection frequency. Is there any tool that would fill an RRD database with fake data for a given time duration?
I Googled a lot but did not find any such tool.
Please help.
I would recommend the following one-liner:
perl -e 'my $start = time - 30 * 24 * 3600; print join " ","update","my.rrd",(map { ($start+$_*5).":".rand} 0..(30*24*3600/5))' | rrdtool -
This assumes you have an RRD file called my.rrd and that it contains just one data source expecting GAUGE-type data.
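If you would rather stay in Python (as in the rest of this page), a rough equivalent of the same idea, assuming my.rrd already exists with a single GAUGE data source that accepts 5-second samples; the chunking just passes many updates per rrdtool call, similar to piping a long command into "rrdtool -":
import time
import random
import rrdtool

step = 5                                   # one fake sample every 5 seconds
start = int(time.time()) - 30 * 24 * 3600  # one month ago

samples = ["%d:%f" % (start + i * step, random.random())
           for i in range(30 * 24 * 3600 // step)]

# feed the fake samples in chunks of 1000 timestamp:value pairs per update call
for i in range(0, len(samples), 1000):
    rrdtool.update("my.rrd", *samples[i:i + 1000])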

Missing Tweets from Twitter API (using Tweepy)?

I have been collecting tweets for the past week to gather the past-7-days tweets related to "lung cancer". Yesterday I figured I needed to start collecting more fields, so I added some fields and started re-collecting the same period of tweets related to "lung cancer" from last week. The problem is that the first time, I collected ~2000 tweets related to lung cancer on 18 Sept 2014, but last night it only gave ~300 tweets; when I looked at the times of the tweets in this new set, it was only collecting tweets from roughly 23:29 to 23:59 on 18 Sept 2014. A large chunk of data is obviously missing. I don't think it's something in my code (below); I have tested various ways, including deleting most of the fields to be collected, and the time range of the data is still cut off prematurely.
Is this a known issue with the Twitter API (when collecting the last 7 days' data)? If so, it would be pretty horrible for anyone trying to do serious research. Or is it something in my code that caused this (note: it runs perfectly fine for other previous/subsequent dates)?
import tweepy
import time
import csv

ckey = ""
csecret = ""
atoken = ""
asecret = ""

OAUTH_KEYS = {'consumer_key': ckey, 'consumer_secret': csecret,
              'access_token_key': atoken, 'access_token_secret': asecret}
auth = tweepy.OAuthHandler(OAUTH_KEYS['consumer_key'], OAUTH_KEYS['consumer_secret'])
api = tweepy.API(auth)

# Stream the first "xxx" tweets related to "car", then filter out the ones without geo-enabled
# Reference of search (q) operator: https://dev.twitter.com/rest/public/search

# Common parameters: changeable only here
startSince = '2014-09-18'
endUntil = '2014-09-20'
suffix = '_18SEP2014.csv'

############################
### Lung cancer starts #####
searchTerms2 = '"lung cancer" OR "lung cancers" OR "lungcancer" OR "lungcancers" OR \
    "lung tumor" OR "lungtumor" OR "lung tumors" OR "lungtumors" OR "lung neoplasm"'

# Items from 0 to 500,000 (which *should* cover all tweets)
# Increase by 4,000 for each cycle (because 5000-6000 is over the Twitter rate limit)
# Then wait for 20 min before the next request (because the Twitter rate-limit window is 15 min)
counter2 = 0
for tweet in tweepy.Cursor(api.search, q=searchTerms2,
                           since=startSince, until=endUntil).items(999999999):  # changeable here
    try:
        '''
        print "Name:", tweet.author.name.encode('utf8')
        print "Screen-name:", tweet.author.screen_name.encode('utf8')
        print "Tweet created:", tweet.created_at'''
        placeHolder = []
        placeHolder.append(tweet.author.name.encode('utf8'))
        placeHolder.append(tweet.author.screen_name.encode('utf8'))
        placeHolder.append(tweet.created_at)
        prefix = 'TweetData_lungCancer'
        wholeFileName = prefix + suffix
        with open(wholeFileName, "ab") as f:  # changeable here
            writeFile = csv.writer(f)
            writeFile.writerow(placeHolder)
        counter2 += 1
        if counter2 == 4000:
            time.sleep(60 * 20)  # wait for 20 min every time 4,000 tweets are extracted
            counter2 = 0
        continue
    except tweepy.TweepError:
        time.sleep(60 * 20)
        continue
    except IOError:
        time.sleep(60 * 2.5)
        continue
    except StopIteration:
        break
Update:
I have since tried running the same Python scripts on a different computer (which is faster and more powerful than my home laptop), and that run produced the expected number of tweets. I don't know why this happens, as my home laptop works fine for many other programs, but I think we can rest this case and rule out potential issues with the scripts or the Twitter API.
If you want to collect more data, I would highly recommend the streaming API that Tweepy has to offer. It has a much higher rate limit; in fact, I was able to collect 500,000 tweets in just one day.
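For reference, a minimal sketch of the streaming approach with the Tweepy version used above (3.x); the listener class and track terms here are just illustrative, and auth is the same handler built in the question:
import tweepy

class LungCancerListener(tweepy.StreamListener):
    def on_status(self, status):
        # called for every tweet pushed by the stream
        print status.created_at, status.author.screen_name

    def on_error(self, status_code):
        # returning False on a 420 (rate-limited) disconnects instead of retrying
        return status_code != 420

stream = tweepy.Stream(auth, LungCancerListener())
stream.filter(track=['lung cancer', 'lung tumor'])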
Also, your rate-limit handling is not very robust: you don't know for sure that Twitter will allow you to access 4,000 tweets. From experience, I found that the more often you hit the rate limit, the fewer tweets you are allowed and the longer you have to wait.
I would recommend using:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
so that your application will not exceed the rate limit, alternatively you should check what you have used with:
print (api.rate_limit_status())
and then you can just sleep the thread like you have done.
Also, your end date is incorrect. The end date should be '2014-09-21', one day later than whatever today's date is.

Xively read data in Python

I have written a Python 2.7 script to retrieve all my historical data from Xively.
Originally I wrote it in C#, and it works perfectly.
I am limiting the request to 6-hour blocks, to retrieve all stored data.
My version in Python is as follows:
requestString = 'http://api.xively.com/v2/feeds/41189/datastreams/0001.csv?key=YcfzZVxtXxxxxxxxxxxORnVu_dMQ&start=' + requestDate + '&duration=6hours&interval=0&per_page=1000'
response = urllib2.urlopen(requestString).read()
The request date is in the correct format; I compared the full C# requestString version and the Python one.
Using the above request, I only get 101 lines of data, which equates to a few minutes of results.
My suspicion is that it is the .read() function: it returns about 34k characters, which is far less than the C# version. I tried adding 100000 as an argument to read(), but there was no change in the result.
I'll leave another solution written in Python 2.7 here too.
In my case, I got the data in 30-minute blocks, because many sensors sent values every minute and the Xively API limits a request to half an hour of data at that send frequency.
It's a general module:
for day in datespan(start_datetime, end_datetime, deltatime):  # loop from start_datetime to the end, increasing by deltatime
    while True:  # keep trying until the data is retrieved correctly
        try:
            response = urllib2.urlopen('https://api.xively.com/v2/feeds/' + str(feed) + '.csv?key=' + apikey_xively +
                                       '&start=' + day.strftime("%Y-%m-%dT%H:%M:%SZ") + '&interval=' + str(interval) +
                                       '&duration=' + duration)  # get data
            break
        except:
            time.sleep(0.3)
            raise  # try again
    cr = csv.reader(response)  # return data in columns
    print '.'
    for row in cr:
        if row[0] in id:  # choose desired data
            f.write(row[0] + "," + row[1] + "," + row[2] + "\n")  # write "id,timestamp,value"
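The snippet above relies on a datespan() helper that is not shown in this excerpt; a minimal sketch of what it is assumed to do (yield successive window start times from start_datetime to end_datetime in deltatime steps):
from datetime import timedelta

def datespan(start, end, delta=timedelta(minutes=30)):
    # yield the start of each request window between start and end
    current = start
    while current < end:
        yield current
        current += delta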
You can find the full script here: https://github.com/CarlosRufo/scripts/blob/master/python/retrievalDataXively.py
Hope it helps; delighted to answer any questions :)