RethinkDB map-reduce: has_fields not working as expected - mapreduce

I'm trying to find the percent of records (grouped by company) that do not have phone numbers. I can do this with the following two queries:
r.table('users') \
.merge(lambda u: {'groups': r.table('groups').get_all(r.args(u['group_ids'])).coerce_to('array')}) \
.filter(lambda u: u.has_fields('phone')) \
.group(lambda u: u['groups'][0]['company']).count().run()
and to get the count of all records:
r.table('users') \
.merge(lambda u: {'groups': r.table('groups').get_all(r.args(u['group_ids'])).coerce_to('array')}) \
.group(lambda u: u['groups'][0]['company']).count().run()
However, I'd like to use map-reduce to do this in a single query and possibly be more efficient. Here is my query, but it doesn't work because both of the numbers (phone and count) are the same:
r.table('users') \
.merge(lambda u: {'groups': r.table('groups').get_all(r.args(u['group_ids'])).coerce_to('array')}) \
.group(lambda u: u['groups'][0]['company']) \
.map(lambda u: { 'phone': 1 if u.has_fields('phone') else 0, 'count': 1 }) \
.reduce(lambda a, b: {'phone': a['phone'] + b['phone'], 'count': a['count'] + b['count'] }).run()
So my question is, why doesn't has_fields() work in the map command, but does in the filter command?

The problem is that you're using Python's native if/else conditional expression. Python doesn't expose a way for libraries to hook into that operator, so the driver can't see the whole if/else expression; it is evaluated client-side before the query is ever sent to the server. If you use r.branch instead (r.branch(u.has_fields('phone'), 1, 0)) it should work.
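With r.branch, the conditional is evaluated on the server for each document. The per-group arithmetic that the corrected map/reduce performs can be sketched in plain Python (the sample documents and company names below are hypothetical, standing in for the joined user/group records):

```python
from functools import reduce
from itertools import groupby

# Hypothetical documents standing in for the joined user/group records.
users = [
    {'company': 'acme', 'phone': '555-0100'},
    {'company': 'acme'},                        # no phone field
    {'company': 'globex', 'phone': '555-0199'},
]

def map_fn(u):
    # Analogue of r.branch(u.has_fields('phone'), 1, 0), evaluated per document.
    return {'phone': 1 if 'phone' in u else 0, 'count': 1}

def reduce_fn(a, b):
    # Analogue of the ReQL reduce step: element-wise sum of the counters.
    return {'phone': a['phone'] + b['phone'], 'count': a['count'] + b['count']}

result = {}
for company, docs in groupby(sorted(users, key=lambda u: u['company']),
                             key=lambda u: u['company']):
    result[company] = reduce(reduce_fn, [map_fn(d) for d in docs])

print(result)
# {'acme': {'phone': 1, 'count': 2}, 'globex': {'phone': 1, 'count': 1}}
```

Because the branch now runs server-side per document, 'phone' and 'count' diverge whenever a record lacks the field, which is exactly what the original query failed to do.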

Related

Why doesn't the foreach sink call the function (show_data_function) in Spark Structured Streaming?

I want to see the data in the Spark streaming DataFrame, and later apply business operations to that data.
So far I have tried converting the streaming DataFrame to an RDD. Once it is an RDD, I want to apply a function that transforms the data and also creates a new column with the schema (for a specific message).
dsraw = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_kafka_server) \
.option("subscribe", topic) \
.load() \
.selectExpr("CAST(value AS STRING)")
print "type (df_stream)", type(dsraw)
print "schema (dsraw)", dsraw.printSchema()
def show_data_fun(dsraw, epoch_id):
    dsraw.show()
    row_rdd = dsraw.rdd.map(lambda row: literal_eval(row['value']))
    json_data = row_rdd.collect()
    print "From rdd : ", type(json_data)
    print "From rdd : ", json_data[0]
    print "show_data_function_call"
jsonDataQuery = dsraw \
.writeStream \
.foreach(show_data_fun)\
.queryName("df_value")\
.trigger(continuous='1 second')\
.start()
I want to print the first JSON message that is in the stream.

How could I print the time of a max value in an rrdtool graph?

I've searched many times without success.
I have an rrdtool database from which I usually print the MAX, MIN, and AVERAGE values.
Now I would like to print the time of the max value stored in the rrd database.
Here is the definition of my rrd (CPU monitoring):
rrdtool create CPU.rrd --start $Date \
DS:CPU_ALL:GAUGE:600:U:U \
DS:User:GAUGE:600:U:U \
DS:Sys:GAUGE:600:U:U \
DS:Wait:GAUGE:600:U:U \
DS:Idle:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:20000 \
RRA:AVERAGE:0.5:1:20000 \
RRA:AVERAGE:0.5:1:20000 \
RRA:AVERAGE:0.5:1:20000 \
RRA:AVERAGE:0.5:1:20000
Here is my graph script :
rrdtool graph CPUUsed.png --start -1w \
DEF:CPUTOTAL=CPU.rrd:CPU_ALL:AVERAGE AREA:CPUTOTAL#FF0000:"CPU Used" LINE2:CPUTOTAL#FF0000 \
--vertical-label "CPU" \
--title "CPU " \
--width 530 \
--height 380 \
GPRINT:CPUTOTAL:MAX:"MAX\:%6.2lf %s" \
GPRINT:CPUTOTAL:MIN:"MIN\:%6.2lf %s" \
GPRINT:CPUTOTAL:AVERAGE:"MOY\:%6.2lf %s" \
GPRINT:CPUTOTAL:LAST:"LAST\:%6.2lf %s"
How could I generate this graph adding the time of the max CPU value ?
OK, you have a couple of problems here.
Firstly, your RRD is misconfigured. You have 5 identical RRAs defined, which does not make sense: a single RRA holds values at the specified resolution for all the defined DSs. However, you may want additional RRAs at lower resolution (higher consolidation) to speed up graphs covering a month or a year, and you may also want MIN or MAX type RRAs so that your MIN and MAX figures are more accurate.
For example, this set defines both MAX and MIN RRAs as well as the average, and will also have 4 rollups that roughly correspond to daily, weekly, monthly and yearly graphs.
RRA:AVERAGE:0.5:1:20000 \
RRA:AVERAGE:0.5:6:2000 \
RRA:AVERAGE:0.5:24:2000 \
RRA:AVERAGE:0.5:288:2000 \
RRA:MAX:0.5:1:20000 \
RRA:MAX:0.5:6:2000 \
RRA:MAX:0.5:24:2000 \
RRA:MAX:0.5:288:2000 \
RRA:MIN:0.5:1:20000 \
RRA:MIN:0.5:6:2000 \
RRA:MIN:0.5:24:2000 \
RRA:MIN:0.5:288:2000
Secondly, when you want to print a single figure in a GPRINT line, you need to use a VDEF to convert your time-series data (from the DEF or CDEF) into a single value, using a consolidation function.
For example, this set of commands uses the MAX and MIN type DEFs defined previously, then calculates summaries over them using VDEFs. Of course, you could just use CPUTOTAL instead of defining CPUTOTALMAX and CPUTOTALMIN (saving yourself the additional RRAs), but as you move to the lower-granularity RRAs, accuracy will fall. If you don't have lower-granularity RRAs you will stay accurate, but you will use a lot of additional CPU at graph time and graph creation will be slower; RRAs at different resolutions speed up graph creation.
DEF:CPUTOTAL=CPU.rrd:CPU_ALL:AVERAGE \
DEF:CPUTOTALMAX=CPU.rrd:CPU_ALL:MAX \
DEF:CPUTOTALMIN=CPU.rrd:CPU_ALL:MIN \
VDEF:overallmax=CPUTOTALMAX,MAXIMUM \
VDEF:overallmin=CPUTOTALMIN,MINIMUM \
VDEF:overallavg=CPUTOTAL,AVG \
VDEF:overalllst=CPUTOTAL,LAST \
AREA:CPUTOTAL#FF0000:"CPU Used" \
LINE2:CPUTOTAL#FF0000 \
GPRINT:overallmax:"MAX\:%6.2lf %s" \
GPRINT:overallmin:"MIN\:%6.2lf %s" \
GPRINT:overallavg:"MOY\:%6.2lf %s" \
GPRINT:overalllst:"LAST\:%6.2lf %s" \
GPRINT:overallmax:"Max was at %c":strftime
The last line will print the time of the maximum rather than the value. When a VDEF calculates a MAX or MIN, it actually returns two components: the value, and the point in time. Usually you use the value, but by appending :strftime to the GPRINT directive you can use the time component instead.
I suggest you spend a bit more time working through the tutorials and examples on the RRDTool Website, which should help you gain a better understanding of how RRDTool works.

Django queryset: how to retrieve related fields with annotate()

I have this table:
id   supply   event_date   value_average
----------------------------------------
1    a        01-01-2018   5
2    b        02-01-2018   6
3    a        02-01-2018   7
4    b        03-01-2018   8
5    c        03-01-2018   9
I am trying to get the latest value for each supply based on the event_date column. I can get the latest event, but I have not found a way to return the value_average as well.
values_average = Purchase.objects \
.values('supply') \
.annotate(last=Max('event_date')) \
.order_by()
current return:
a 02-01-2018
b 03-01-2018
c 03-01-2018
expected return:
a 02-01-2018 7
b 03-01-2018 8
c 03-01-2018 9
I found a way to do that by following this answer:
Django: select values with max timestamps or join to the same table
values_average = Purchase.objects \
.filter(farm=farm, supply__in=queryset) \
.order_by('supply', '-event_date') \
.distinct('supply')
It will only work with PostgreSQL. The final result will be a normal queryset containing the latest events. Just take care if your model has a Meta ordering, as it can interfere with distinct().
Django docs on this:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct
I think you just have to add the value_average attribute to the values you want to return:
values_average= Purchase.objects.values('supply','value_average').annotate(last=Max('event_date'))
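To make the intent concrete, the "latest value_average per supply" logic that the distinct('supply') query expresses can be sketched in plain Python. The rows below mirror the question's table, but with ISO-formatted dates (an assumption made here so string comparison orders them correctly):

```python
# Hypothetical in-memory rows mirroring the Purchase table from the question.
rows = [
    {'id': 1, 'supply': 'a', 'event_date': '2018-01-01', 'value_average': 5},
    {'id': 2, 'supply': 'b', 'event_date': '2018-01-02', 'value_average': 6},
    {'id': 3, 'supply': 'a', 'event_date': '2018-01-02', 'value_average': 7},
    {'id': 4, 'supply': 'b', 'event_date': '2018-01-03', 'value_average': 8},
    {'id': 5, 'supply': 'c', 'event_date': '2018-01-03', 'value_average': 9},
]

# Keep only the row with the greatest event_date for each supply,
# which is what ORDER BY supply, event_date DESC + DISTINCT ON (supply) does.
latest = {}
for row in rows:
    cur = latest.get(row['supply'])
    if cur is None or row['event_date'] > cur['event_date']:
        latest[row['supply']] = row

for supply in sorted(latest):
    r = latest[supply]
    print(supply, r['event_date'], r['value_average'])
# a 2018-01-02 7
# b 2018-01-03 8
# c 2018-01-03 9
```

The whole row survives, so value_average comes along with the latest event_date, which is exactly what values('supply').annotate(Max('event_date')) cannot give you on its own.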

Retraining Inception and specifying label_count = 2 but receiving three scores instead of two

I have modified the flower retraining code to have label_count =2 as shown here:
gcloud beta ml jobs submit training "$JOB_ID" \
--module-name trainer.task \
--package-path trainer \
--staging-bucket "$BUCKET" \
--region us-central1 \
-- \
--output_path "${GCS_PATH}/training" \
--eval_data_paths "${GCS_PATH}/preproc/eval*" \
--train_data_paths "${GCS_PATH}/preproc/train*" \
--label_count 2 \
--max_steps 4000
And I have modified dict.txt to have only two labels.
But the retrained model outputs three scores instead of two as expected. The unexpected third score is always very small as shown in this example:
KEY PREDICTION SCORES
Key123 0 [0.7956143617630005, 0.2043769806623459, 8.625334885437042e-06]
Why are there three scores and is there a change one can make so the model outputs only two scores?
Note: I have read the answers from Slaven Bilac and JoshGC to the question “cloudml retraining inception - received a label value outside the valid range” but these answers do not address my question above.
It's the "label" we apply to images that had no label in the training set. The behavior is discussed in this comment in model.py line 221
# Some images may have no labels. For those, we assume a default
# label. So the number of labels is label_count+1 for the default
# label.
I agree it's not very intuitive behavior, but it makes the code a little more robust against datasets that are not fully cleaned up. Hope this helps.
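If the extra score is unwanted downstream, one workaround is to drop the default-label score and renormalize the rest. This sketch assumes the default label occupies the last slot, which the tiny third score in the example suggests, but you should verify that against your own labels file:

```python
# Assumption: the default label is the LAST score (verify for your model).
def drop_default_label(scores):
    real = scores[:-1]                  # keep only the label_count real scores
    total = sum(real)
    return [s / total for s in real]    # renormalize so they sum to 1

# The scores from the example prediction in the question.
scores = [0.7956143617630005, 0.2043769806623459, 8.625334885437042e-06]
print(drop_default_label(scores))       # two scores, summing to 1
```

This only massages the output; the model itself still has label_count + 1 outputs, as the comment in model.py explains.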

Convert Chile Map with Insets

I have used the Natural Earth states/provinces data set to generate a map of Chile using this command:
python converter.py \
--width 900 \
--country_name_index 12 \
--country_code_index 31 \
--where "iso_a2 = 'CL'" \
--projection mill \
--name "cl" \
--language en \
ne_10m_admin_1_states_provinces_shp.shp output/jquery-jvectormap-cl-mill-en.js
It generates a map like this (minus the red circles).
The three circled islands are all labeled Valparaíso, which corresponds to the province circled on the main landmass.
Looking at the documentation provided on how to do insets (which uses Alaska and Hawaii as examples), I attempted to move these islands closer, so that my map was more centered.
python converter.py \
--width 900 \
--country_name_index 12 \
--country_code_index 31 \
--where "iso_a2 = 'CL'" \
--projection mill \
--name "cl" \
--language en \
--insets [{"codes": ["CL-VS"], "width": 200, "left": 10, "top": 370}]' \
ne_10m_admin_1_states_provinces_shp.shp output/jquery-jvectormap-cl-mill-en.js
Unfortunately, this fails with
converter.py: error: unrecognized arguments: 200, left: 10, top: 370},]' ne_10m_admin_1_states_provinces_shp.shp output/jquery-jvectormap-cl-mill-en.js
My questions:
How do I resolve the errors in that error message? The parameters are mentioned in both the documentation and in the code so I am unsure what should be used instead.
How can I move the three circled islands to be insets without affecting the mainland Valparaíso?
Your insets argument is failing because it isn't quoted properly. You can use the following:
--insets "[{\"codes\": [\"CL-VS\"], \"width\": 200, \"left\": 10, \"top\": 370}]"
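The underlying failure is ordinary shell word-splitting: without surrounding quotes, the spaces inside the JSON break it into many separate arguments before converter.py ever sees it. A quick check with Python's shlex module (which mimics POSIX shell tokenization) illustrates the difference:

```python
import shlex

# The argument as passed in the failing command (no outer quotes).
unquoted = '--insets [{"codes": ["CL-VS"], "width": 200, "left": 10, "top": 370}]'
# The argument with outer double quotes and inner quotes escaped.
quoted = ('--insets "[{\\"codes\\": [\\"CL-VS\\"], '
          '\\"width\\": 200, \\"left\\": 10, \\"top\\": 370}]"')

print(len(shlex.split(unquoted)))  # many tokens: the JSON is split on spaces
print(len(shlex.split(quoted)))    # 2 tokens: "--insets" plus one JSON string
```

With proper quoting, argparse receives a single string for --insets, which converter.py can then parse as JSON.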