How to inspect the structure of a Google AutoML Tables model? - google-cloud-ml

I have trained a model using Google AutoML Tables. I would like to inspect the general structure of the trained model (which algorithms were used, what preprocessing was applied, if any, etc.).
In the "Viewing model architecture with Cloud Logging" section of the docs here, I see:
If more than one model was used to create the final model, the
hyperparameters for each model are returned as an entry in the
modelParameters array, indexed by position (0, 1, 2, and so on)
My modelParameters array is shown below (with the first and last elements expanded). I'm only doing the AutoML Tables quickstart, which uses the Bank Marketing open-source dataset, so I'm surprised that it would return such a complex model (25 stacked/ensembled models?). I would have thought that model 0 alone (a single gradient-boosted decision tree with 300 trees and a max depth of 15) would be sufficient.
Also, '25' is a suspiciously round number. Are we sure the docs are correct and this list isn't actually the best 25 models, stack-ranked by accuracy score? Is there a better way to understand the end-to-end model (including preprocessing) that Google AutoML Tables is producing?
modelParameters: [
0: {
hyperparameters: {
Center Bias: "False"
Max tree depth: 15
Model type: "GBDT"
Number of trees: 300
Tree L1 regularization: 0
Tree L2 regularization: 0.10000000149011612
Tree complexity: 3
}
}
1: {…}
2: {…}
3: {…}
4: {…}
5: {…}
6: {…}
7: {…}
8: {…}
9: {…}
10: {…}
11: {…}
12: {…}
13: {…}
14: {…}
15: {…}
16: {…}
17: {…}
18: {…}
19: {…}
20: {…}
21: {…}
22: {…}
23: {…}
24: {
hyperparameters: {
Center Bias: "False"
Max tree depth: 9
Model type: "GBDT"
Number of trees: 500
Tree L1 regularization: 0
Tree L2 regularization: 0
Tree complexity: 0.10000000149011612
}
}
]

The docs and your original interpretation of the results are correct. In this case, AutoML Tables created an ensemble of 25 models.
The console also provides the full list of individual models tried during the search process if you click the "Trials" link. That list should be much larger than 25.
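If you would rather pull that model-structure log entry programmatically than browse it in the console, a rough sketch along the following lines should work with the google-cloud-logging client. This is just my own illustration, not something from the AutoML docs: the project ID and log filter are placeholders you would need to adapt to whatever the "Viewing model architecture with Cloud Logging" section specifies for your model.
from google.cloud import logging

# Sketch only: "my-project" and the filter string below are assumed
# placeholders, not values from the question. Adjust the filter to match
# the log entry that AutoML Tables writes for your model.
client = logging.Client(project="my-project")
log_filter = 'resource.type="cloudml_job"'  # assumed filter; adapt as needed

for entry in client.list_entries(filter_=log_filter):
    # The structured payload is where fields such as modelParameters appear.
    print(entry.payload)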

Related

Perl regexp: how to start matching text in part of a file after I have found a "flag" string in the context?

Apologies for the way I describe my question; maybe it will be much clearer if I give an example, as below.
Consider the case where I have a file that is separated into different sections, each beginning with a certain string that acts as a flag, e.g. "From Clock" is the flag I mentioned:
example_file:
From Clock: fdbk_bufg_cell_in_net
To Clock: fdbk_bufg_cell_in_net
Setup : NA Failing Endpoints, Worst Slack NA , Total Violation NA
Hold : NA Failing Endpoints, Worst Slack NA , Total Violation NA
PW : 0 Failing Endpoints, Worst Slack 5.501ns, Total Violation 0.000ns
Pulse Width Checks
Clock Name: fdbk_bufg_cell_in_net
Waveform(ns): { 0.000 3.500 }
Period(ns): 7.000
Sources: { my_atspeed_mmcm/CLKFBOUT }
Check Type Corner Lib Pin Reference Pin Required(ns) Actual(ns) Slack(ns) Location Pin
Min Period n/a BUFGCE/I n/a 1.499 7.000 5.501 BUFGCE_X0Y35 fdbk_bufg_cell/I
From Clock: mmcm_clkout
To Clock: mmcm_clkout
Setup : NA Failing Endpoints, Worst Slack NA , Total Violation NA
Hold : 0 Failing Endpoints, Worst Slack 0.123ns, Total Violation 0.000ns
PW : 625 Failing Endpoints, Worst Slack -0.195ns, Total Violation -121.875ns
Now I only want to match the "PW :" line in the "From Clock: mmcm_clkout" section. How do I do so?
You can try something along these lines:
my $match = 0;
while (<>)
{
    if (/^From Clock: (\w+)/)
    {
        $match = ($1 eq "mmcm_clkout");
    }
    elsif ($match && /^PW\s*:/)    # note: the report has a space before the colon ("PW :")
    {
        print;   # or do whatever you want with the line
    }
}

Map-Reduce Logs on Hive-Tez

I want to understand the Map-Reduce logs after running a query on Hive-Tez. What do the lines after INFO: convey?
Here is a sample I have attached:
INFO : Session is already open
INFO : Dag name: SELECT a.Model...)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_14708112341234_1234)
INFO : Map 1: -/- Map 3: -/- Map 4: -/- Map 7: -/- Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
INFO : Map 1: -/- Map 3: 0/118 Map 4: 0/118 Map 7: 0/1 Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
INFO : Map 1: 0/118 Map 3: 0/118 Map 4: 0/118 Map 7: 0/1 Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
INFO : Map 1: 0/118 Map 3: 0/118 Map 4: 0(+5)/118 Map 7: 0/1 Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
INFO : Map 1: 0/118 Map 3: 0(+5)/118 Map 4: 0(+7)/118 Map 7: 0(+1)/1 Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
INFO : Map 1: 0/118 Map 3: 0(+15)/118 Map 4: 0(+18)/118 Map 7: 0(+1)/1 Reducer 2: 0/15 Reducer 5: 0/26 Reducer 6: 0/13
The log you posted is a DAG execution log. The DAG consists of the mapper vertices Map 1, Map 3, Map 4, Map 7 and the reducer vertices Reducer 2, Reducer 5, Reducer 6.
Map 1: -/- means that the vertex is not initialized and the number of mappers has not been calculated yet.
Map 4: 0(+7)/118 means that there are 118 mappers in total, 7 of them are running in parallel, 0 have completed yet, and 118-7=111 are pending.
Reducer 2: 0/15 means that there are 15 reducers in total, 0 of them are running, and 0 have completed (all 15 reducers are pending).
Negative figures (there are none in your example) indicate the number of failed or killed mappers or reducers.
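To make those counters concrete, here is a small self-contained sketch (my own illustration, not part of the original answer) that pulls the completed/running/total fields out of one of the progress lines above. The regex and names are assumptions based on the "completed(+running)/total" format just described; failed/killed (negative) counts are not handled.
import re

# Assumed format: "<Vertex N>: completed(+running)/total", or "-/-" while the
# vertex is not yet initialized. Negative (failed/killed) counts are ignored.
VERTEX_RE = re.compile(r'(\w+ \d+): (?:(-)/-|(\d+)(?:\(\+(\d+)\))?/(\d+))')

def parse_tez_progress(line):
    """Return {vertex: (completed, running, total)}; None marks an uninitialized vertex."""
    progress = {}
    for name, uninit, done, running, total in VERTEX_RE.findall(line):
        if uninit:
            progress[name] = None  # e.g. "Map 1: -/-"
        else:
            progress[name] = (int(done), int(running or 0), int(total))
    return progress

line = "INFO : Map 1: 0/118 Map 3: 0(+5)/118 Map 4: 0(+7)/118 Map 7: 0(+1)/1 Reducer 2: 0/15"
print(parse_tez_progress(line))
# {'Map 1': (0, 0, 118), 'Map 3': (0, 5, 118), 'Map 4': (0, 7, 118), 'Map 7': (0, 1, 1), 'Reducer 2': (0, 0, 15)}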
Qubole has an explanation of the Tez log pane: https://docs.qubole.com/en/latest/user-guide/hive/using-hive-on-tez/hive-tez-tuning.html#understanding-log-pane

Sentence detection and extraction into same data frame

I have the following data frame:
reviews <- data.frame(value = c("Product was received in excellent condition. Made with high quality materials. Very Good product",
"Inexpensive. An improvement over integrated graphics.",
"I love that product so excite. I will order again if I need more .",
"Excellent card, great graphics."),
user = c(1,2,3,4),
Review_Id = c("101968","101968","210546","112546"),
stringsAsFactors = FALSE)
and I need the following desired output:
user review_Id sentence
1 101968 Made with high quality materials.
1 101968 Very Good product
2 101968 Inexpensive.
2 101968 An improvement over integrated graphics.
3 210546 I love that product so excite.
3 210546 I will order again if I need more .
4 112546 Excellent card, great graphics.
I was wondering about something like this: sent_detect(reviews$value)
But how could I combine that function with the rest to get the desired output?
If your data really are so tidy, you can just use cSplit from my "splitstackshape" package.
library(splitstackshape)
cSplit(reviews, "value", ".", direction = "long")
# value user Review_Id
# 1: Product was received in excellent condition 1 101968
# 2: Made with high quality materials 1 101968
# 3: Very Good product 1 101968
# 4: Inexpensive 2 101968
# 5: An improvement over integrated graphics 2 101968
# 6: I love that product so excite 3 210546
# 7: I will order again if I need more 3 210546
# 8: Excellent card, great graphics 4 112546

for information retrieval course using python, accessing given tf-idf weight

I am working on a Python program for an information retrieval course. This is what I am trying to achieve with my code: return a dict mapping doc_id to length, computed as sqrt(sum(w_i**2)), where w_i is the tf-idf weight for each term in the document.
E.g., in the sample index below, document 0 has two terms 'a' (with
tf-idf weight 3) and 'b' (with tf-idf weight 4). Its length is
therefore 5 = sqrt(9 + 16).
>>> lengths = Index().compute_doc_lengths({'a': [[0, 3]], 'b': [[0,4]]})
>>> lengths[0]
5.0
The code I have is this:
templist = []
for iter in index.values():
    templist.append(iter)
d = defaultdict(list)
for i, l in templist[1]:
    d[i].append(l)
lent = defaultdict()
for m in d:
    lo = math.sqrt(sum(lent[m]**2))
return lo
So, if I'm understanding you correctly, we have to transform the input dictionary:
ind = {'a':[ [1,3] ], 'b': [ [1,4 ] ] }
To the output dictionary:
{1:5}
where the 5 is calculated as the Euclidean length of the value portion of the input dictionary (the vector [3,4] in this case). Correct?
Given that information, the answer becomes a bit more straightforward:
from math import sqrt

def calculate_length(ind):
    # First, let's transform the dictionary into a list of [doc_id, tf_idf] pairs: [[doc_id_1, tf_idf_1], ...]
    data = [entry[0] for entry in ind.itervalues()]  # use ind.values() in Python 3.x
    # Next, let's split that list into two: one of doc_ids, one of tf-idf weights
    doc_ids, tf_idfs = zip(*data)
    # We can just assume that all the doc_ids are the same; you could check that here if you wanted
    doc_id = doc_ids[0]
    # Next, we calculate the length as per our formula
    length = sqrt(sum(w ** 2 for w in tf_idfs))
    # Finally, we return the output dictionary
    return {doc_id: length}
Example:
>>> calculate_length({'a': [[1, 3]], 'b': [[1, 4]]})
{1: 5.0}
There are a couple of places here where you could optimize this to remove the intermediary lists (this method can be reduced to two lines of work and a return), but I'll leave that to you to find out, since this is a homework assignment. I also hope you take the time to actually understand what this code does, rather than just copying it wholesale.
Also note that this answer makes the very large assumption that all doc_id values are the same and that there will only ever be a single doc_id, tf_idf pair at each key in the dictionary! If that's not true, then your transform becomes more complicated. But you did not provide sample input nor a textual explanation indicating that's the case (though, based on the data structure, I'd think it quite likely).
Update
In fact, it's really bothering me because I definitely think that's the case. Here is a version that solves the more complex case:
from math import sqrt
from itertools import chain
from collections import defaultdict

def calculate_length(ind):
    # We want to transform the input into a dict of {doc_id: [tf_idf_a, ...]}
    # First we turn it into a generator of [doc_id, tf_idf] pairs
    tf_gen = chain.from_iterable(ind.itervalues())  # use ind.values() in Python 3.x
    # which we then use to build our transformed dictionary
    tf_dict = defaultdict(list)
    for doc_id, tf_idf in tf_gen:
        tf_dict[doc_id].append(tf_idf)
    # Now we proceed mostly as before, but we can do it in one line
    return dict((doc_id, sqrt(sum(w ** 2 for w in tf_idfs)))
                for doc_id, tf_idfs in tf_dict.iteritems())  # use .items() in Python 3.x
Example use:
>>> calculate_length({'a':[ [1,3] ], 'b': [ [1,4 ] ] })
{1: 5.0}
>>> calculate_length({'a':[ [1,3],[2,3] ], 'b': [ [1,4 ], [2,1] ] })
{1: 5.0, 2: 3.1622776601683795}

Trouble with append-spit

I'm attempting to use clojure.contrib.io's (1.2) append-spit to append to a file (go figure).
If I create a text file on my desktop, as a test, and attempt to append to it in a fresh REPL, this is what I get:
user> (append-spit "/Users/ihodes/Desktop/test.txt" "frank")
Backtrace:
0: clojure.contrib.io$assert_not_appending.invoke(io.clj:115)
1: clojure.contrib.io$outputstream__GT_writer.invoke(io.clj:266)
2: clojure.contrib.io$eval1604$fn__1616$G__1593__1621.invoke(io.clj:121)
3: clojure.contrib.io$fn__1660.invoke(io.clj:185)
4: clojure.contrib.io$eval1604$fn__1616$G__1593__1621.invoke(io.clj:121)
5: clojure.contrib.io$append_writer.invoke(io.clj:294)
6: clojure.contrib.io$append_spit.invoke(io.clj:342)
7: user$eval1974.invoke(NO_SOURCE_FILE:1)
8: clojure.lang.Compiler.eval(Compiler.java:5424)
9: clojure.lang.Compiler.eval(Compiler.java:5391)
10: clojure.core$eval.invoke(core.clj:2382)
11: swank.commands.basic$eval_region.invoke(basic.clj:47)
12: swank.commands.basic$eval_region.invoke(basic.clj:37)
13: swank.commands.basic$eval807$listener_eval__808.invoke(basic.clj:71)
14: clojure.lang.Var.invoke(Var.java:365)
15: user$eval1972.invoke(NO_SOURCE_FILE)
16: clojure.lang.Compiler.eval(Compiler.java:5424)
17: clojure.lang.Compiler.eval(Compiler.java:5391)
18: clojure.core$eval.invoke(core.clj:2382)
19: swank.core$eval_in_emacs_package.invoke(core.clj:94)
20: swank.core$eval_for_emacs.invoke(core.clj:241)
21: clojure.lang.Var.invoke(Var.java:373)
22: clojure.lang.AFn.applyToHelper(AFn.java:169)
23: clojure.lang.Var.applyTo(Var.java:482)
24: clojure.core$apply.invoke(core.clj:540)
25: swank.core$eval_from_control.invoke(core.clj:101)
26: swank.core$eval_loop.invoke(core.clj:106)
27: swank.core$spawn_repl_thread$fn__489$fn__490.invoke(core.clj:311)
28: clojure.lang.AFn.applyToHelper(AFn.java:159)
29: clojure.lang.AFn.applyTo(AFn.java:151)
30: clojure.core$apply.invoke(core.clj:540)
31: swank.core$spawn_repl_thread$fn__489.doInvoke(core.clj:308)
32: clojure.lang.RestFn.invoke(RestFn.java:398)
33: clojure.lang.AFn.run(AFn.java:24)
34: java.lang.Thread.run(Thread.java:637)
Which clearly isn't what I wanted.
I was wondering if anyone else has had these problems, or if I'm doing something incorrectly? The file I'm appending to is not open (at least by me). I'm at a loss.
Thanks so much!
I notice that the relevant functions are marked as deprecated in 1.2, but I'm also under the impression that, as written, they've got some bugs in need of ironing out.
First, a non-deprecated way to do what you were trying to do (which works fine for me):
(require '[clojure.java.io :as io])
(with-open [w (io/writer (io/file "/path/to/file")
:append true)]
(spit w "Foo foo foo.\n"))
(Skipping io/file and simply passing the string to io/writer would work too -- I prefer to use the wrapper partly as a matter of personal taste and partly so that c.j.io doesn't try to treat the string as a URL (only to back out via an exception and go for a file in this case), which is its first choice of interpretation.)
As for why I think clojure.contrib.io might be suffering from a bug:
(require '[clojure.contrib.io :as cio])
(with-bindings {#'cio/assert-not-appending (constantly true)}
(cio/append-spit "/home/windfall/scratch/SO/clj/append-test.txt" "Quux quux quux?\n"))
This does not complain, but neither does it append to the file -- the current contents get replaced instead. I'm not yet sure what exactly the problem is, but switching to clojure.java.io should avoid it. (Clearly this needs further investigation -- deprecated code still shouldn't be buggy -- I'll try to figure it out.)