Supplying a default value for left outer joins - clojure

I was wondering what the best way is to specify a default value when doing an outer join in Cascalog for a field that could be null.
(def example-query
  (<- [?id ?fname ?lname !days-active]
      (users :> ?id ?fname ?lname)
      (active :> ?fname ?lname !days-active)))
In this example users and active are previously defined queries, and I just want to correlate active user information (?fname ?lname !days-active) with regular user information (?id ?fname ?lname).
When the join finds no corresponding value for !days-active, it should output 0 instead of nil,
i.e.
392393 john smith 3
003030 jane doe 0
instead of
392393 john smith 3
003030 jane doe null
Updated Example
(<- [!!database-id ?feature !!user-clicks !!engaged-users ?application-id ?active-users]
    (app-id-db-id-feature-clicks-engaged :> ?application-id !!database-id ?feature !!user-clicks !!engaged-users)
    (user-info :> ?application-id ?feature ?active-users))
Example output would look roughly like this:
4234 search null null 222 5000
3232 profile 500 400 331 6000
With the filtering I'm interested in, could I make the !!engaged-users and !!user-clicks fields come out as 0 instead of null? Would using multiple or predicates work?

I think what you want to do is add an or predicate:
(def example-query
  (<- [?id ?fname ?lname ?active-days]
      (users :> ?id ?fname ?lname)
      (active :> ?fname ?lname !days-active)
      ;; ?active-days is !days-active, or 0 when !days-active is null
      (or !days-active 0 :> ?active-days)))
By the way, that's not an outer join; the query simply doesn't filter out null values in the !days-active position.
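The defaulting behaviour of or can be checked in plain Clojure, outside Cascalog. This is only a sketch: the rows vector and the default-days helper are hypothetical stand-ins for the joined tuples and are not part of the Cascalog API.

```clojure
;; Hypothetical joined tuples: [id fname lname days-active],
;; with nil where the join found no matching value.
(def rows
  [["392393" "john" "smith" 3]
   ["003030" "jane" "doe" nil]])

(defn default-days
  "Replaces a nil days-active (the last field) with 0."
  [[id fname lname days-active]]
  [id fname lname (or days-active 0)])

(def defaulted (mapv default-days rows))
```

Note that or also replaces false, not just nil, which is harmless for numeric fields. For the updated example, the same idea would presumably mean one or per nullable field.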

Related

How to return the last 10 entries using clojurewerkz?

I want to fetch 10 latest entries from elasticsearch database.
For this, I am using:
(require '[clojurewerkz.elastisch.rest.document :as esd]
         '[clojurewerkz.elastisch.query :as q])
(esd/search es-conn
index_name
mapping
:query (q/prefix :column value)
:from 0 :size 10)
This fetches only the oldest 10 entries.
I want to know how to fetch the latest entries, i.e. which parameter should be passed.
You need to have a _timestamp field on your documents:
https://www.elastic.co/search?q=timestamp
Then you can sort on that and limit the size to 10.

Filling in data frame column using regular expressions (?)

Ok, so I have a data frame of web forum comments. Each row has a cell containing an ID which is part of the link to that comment's parent comment. The rows contain the full permalink to the comment, of which the ID is the varying part.
I'd like to add a column that shows the user name attached to that parent comment. I'm assuming I'll need to use some regular expression function, which I find mystifying at this point.
In workflow terms, I need to find the row whose URL contains the parent comment ID, grab the user name from that row. Here's a toy example:
toy <- rbind(c("yes?", "john", "www.website.com/4908", "3214", NA), c("don't think so", "mary", "www.website.com/3958", "4908", NA))
toy <- as.data.frame(toy)
colnames(toy) <- c("comment", "user", "URL", "parent", "parent_user")
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 <NA>
which needs to become:
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 john
Some values in this column will be NA, since they're top level comments. So something like,
dataframe$parent_user <- dataframe['the row where parent ID i is found in the URL column', 'the user name column in that row']
Thanks!!
Another option, using the basename function from base R, which "removes all of the path up to and including the last path separator (if any)"
toy$user[match(toy$parent, basename(as.character(toy$URL)))]
#[1] <NA> john
#Levels: john mary
Here is a vectorized option with stri_extract and match
library(stringi)
toy$parent_user <- toy$user[match(toy$parent,stri_extract(toy$URL,
regex=paste(toy$parent, collapse="|")))]
toy
# comment user URL parent parent_user
#1 yes? john www.website.com/4908 3214 <NA>
#2 don't think so mary www.website.com/3958 4908 john
Or, as @jazzurro mentioned, a faster option would be to use stri_extract with data.table and fmatch
library(data.table)
library(fastmatch)
setDT(toy)[, parent_user := user[fmatch(parent,
stri_extract_last_regex(str=URL, pattern = "\\d+"))]]
Or a base R option would be
with(toy, user[match(parent, sub("\\D+", "", URL))])
#[1] <NA> john
#Levels: john mary
nchar('with(toy, user[match(parent, sub("\\D+", "", URL))])')
#[1] 51
nchar('toy$user[match(toy$parent, basename(as.character(toy$URL)))]')
#[1] 60
Perhaps not the prettiest way to do it, but an option:
toy$parent_user <- sapply(toy$parent,
function(x){p <- toy[x == sub('[^0-9]*', '', toy$URL), 'user'];
ifelse(length(p) > 0, as.character(p), NA)})
toy
# comment user URL parent parent_user
# 1 yes? john www.website.com/4908 3214 <NA>
# 2 don't think so mary www.website.com/3958 4908 john
The second line is really just to deal with cases lacking matches.

Clojure clj-time : parse local string

I'm having trouble with the interop between java.util.Date and clj-time.
I start with raw data that is an instance of java.util.Date, let's say:
(def date (new java.util.Date))
I want to turn it into a clj-time object, so I do:
(def st-date (.toString date))
Output :
"Mon Mar 21 16:39:23 CET 2016"
I define a formatter
(def date-formatter (tif/formatter "EEE MMM dd HH:mm:ss zzz yyyy"))
I think that covers everything.
So I try
(tif/parse date-formatter st-date)
and get an exception telling me the format is invalid.
I tried
(tif/unparse date-formatter (tic/now))
and got
"lun. mars 21 15:50:29 UTC 2016"
which has the same layout as the Java string, but in French (my language) and in UTC.
Wrapping the code for test
(defn today-date-to-clj []
  (let [st-date (.toString (new java.util.Date))
        date-formatter (tif/formatter "EEE MMM dd HH:mm:ss zzz yyyy")]
    (tif/parse date-formatter st-date)))
It seems that the formatter does not work on the string because the two are not in the same locale. Am I right? How do I change it?
Thanks for the help !
EDIT
Someone gave me a far better answer, but for the curious, this almost worked (it fails on "CET 2016" when parsing, but works for unparse):
(def uni-formatter (tif/with-locale (tif/with-zone date-formatter (DateTimeZone/forID "Europe/Paris")) java.util.Locale/US))
Instead of using String as an intermediate date representation you should use a direct conversion:
(clj-time.coerce/from-date (java.util.Date.))
Take a closer look at clj-time's coerce functions.
You can pass your java.util.Date object to from-date or to-date-time to get an org.joda.time.DateTime and then pass that to your custom formatter:
(require '[clj-time
[coerce :as c]
[format :as f]])
(->> (java.util.Date.)
(c/to-date-time)
(f/unparse date-formatter))
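If a string round trip cannot be avoided, the locale mismatch from the question can be reproduced and sidestepped with plain JDK classes. The sketch below deliberately uses java.text.SimpleDateFormat rather than clj-time; us-formatter and round-trip are hypothetical helper names. It pins the formatter to Locale/US and UTC so that formatting and parsing agree regardless of the machine's default locale.

```clojure
(import '(java.text SimpleDateFormat)
        '(java.util Date Locale TimeZone))

(defn us-formatter
  "SimpleDateFormat for the Date.toString layout, pinned to Locale/US and UTC."
  []
  (doto (SimpleDateFormat. "EEE MMM dd HH:mm:ss zzz yyyy" Locale/US)
    (.setTimeZone (TimeZone/getTimeZone "UTC"))))

(defn round-trip
  "Formats d as text and parses it back with the same locale-pinned formatter."
  [^Date d]
  (let [fmt (us-formatter)]
    (.parse fmt (.format fmt d))))

;; The pattern has no milliseconds, so use a whole-second instant
;; (2016-03-21 15:50:29 UTC as epoch milliseconds).
(def sample (Date. 1458575429000))
(def parsed (round-trip sample))
```

Because the pattern drops milliseconds, only whole-second instants survive the round trip unchanged, which is one more reason to prefer the direct conversion.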

Clojure - group by two values, count on specific key word

I have an XML file which I have read in and indexed in my editor. Each record has one of three statuses (GOOD, AVERAGE, POOR) and a date in DD/MM/YYYY format.
What I have to do is first group the records by month, which means stripping the month out of the date (this part works), and then loop through again and count the GOOD statuses per month.
I assume I have to group by both month and status and then count the occurrences of GOOD, but I'm getting stuck somewhere. What I have so far is below; hopefully it just needs a slight tweak!
...data fed in from xml...
data (doall (get-records (xml/parse is)))
loopA (for [flop data]
{
:STATUS(:QUALITY_STATUS flop)
:MONTH(get (split (:DATE flop) #"/") 1)
})
x(group-by #(select-keys %[:MONTH :STATUS]) loopA)
loopB (str-join \newline (for [floop x]
{
:MONTH_STAMP (second floop)
:Q_STATUS (if (= :STATUS "GOOD") (second floop))
:GOOD_QUAL (count :Q_STATUS)
}))
Ideally what i want to end up with is three columns of data like the below:
Month -- Quality -- Count
01 ------- GOOD ----- 1
02 ------- GOOD ----- 5
...
12 ------- GOOD ----- 3
Thanks in advance! :)
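One possible approach, sketched under assumptions: the records below are hypothetical sample data shaped like the maps that get-records is assumed to return (with :QUALITY_STATUS and :DATE keys), and month-of and good-counts are illustrative names. Rather than grouping by month and status and then counting, it keeps only the GOOD records and lets frequencies count them per month.

```clojure
(require '[clojure.string :as str])

;; Hypothetical records standing in for the parsed XML data.
(def records
  [{:QUALITY_STATUS "GOOD"    :DATE "03/01/2012"}
   {:QUALITY_STATUS "POOR"    :DATE "15/01/2012"}
   {:QUALITY_STATUS "GOOD"    :DATE "02/02/2012"}
   {:QUALITY_STATUS "GOOD"    :DATE "20/02/2012"}
   {:QUALITY_STATUS "AVERAGE" :DATE "21/02/2012"}])

(defn month-of
  "Extracts the MM part of a DD/MM/YYYY string."
  [date]
  (second (str/split date #"/")))

(defn good-counts
  "Returns a sorted map of month -> number of GOOD records."
  [recs]
  (->> recs
       (filter #(= "GOOD" (:QUALITY_STATUS %)))
       (map #(month-of (:DATE %)))
       frequencies
       (into (sorted-map))))

;; Print one "month status count" row per month that has GOOD records.
(doseq [[month n] (good-counts records)]
  (println month "GOOD" n))
```

Months with no GOOD records do not appear in the result at all; if a row with a count of 0 is needed for them, the map would have to be merged with a map of all twelve months set to 0.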

Why doesn't model-query return query results? (clj-plaza)

I am using clj-plaza (0.0.5-SNAPSHOT) to query a Sesame/Jena Model. The function model-query does not appear to execute the query. It returns the internal representation of a clj-plaza query instead.
(init-jena-framework)
(def *m* (build-model))
(with-model *m*
(model-add-triples
(model-to-triples
(document-to-model "http://www.rdfdata.org/dat/rdfdata.rdf"
:rdf))))
(def all-subjects-query
  (defquery
    (query-set-vars [:?subject])
    (query-set-pattern (make-pattern [[:?subject :?p :?o]]))
    (query-set-type :select)))
;; As expected
(model-query-triples *m* all-subjects-query)
=> clojure.lang.LazySeq@2e1e8502
;; Does not execute query (?)
(model-query *m* all-subjects-query)
=> {:kind :select, :pattern [[:?object :?p :?o]], :vars [:?object]}
The official tutorial claims model-query returns a list of bindings from the query:
({:?object "http://randomurl.com/asdf"}
{:?object "http://asdf.com/qwer"})
This is a bug.
Here is a fix. Until it is merged back and updated on clojars, feel free to use my fork.
A workaround is to use (query model query) instead of model-query, after importing the corresponding Jena or Sesame implementation.
For Sesame:
(use 'plaza.rdf.implementations.sesame)
(init-sesame-framework)
(def *m* (build-model))
(with-model *m*
(model-add-triples
(model-to-triples
(document-to-model "http://www.rdfdata.org/dat/rdfdata.rdf"
:rdf))))
(def all-subjects-query
  (defquery
    (query-set-vars [:?subject])
    (query-set-pattern (make-pattern [[:?subject :?p :?o]]))
    (query-set-type :select)))
(query *m* all-subjects-query)
=> [{:?s #<SesameResource http://www.rdfdata.org/dat/rdfdata.rdf>}
{:?s #<SesameResource http://www.rdfdata.org/dat/rdfdata.rdf>}
{:?s #<SesameResource http://www.rdfdata.org/dat/rdfdata.rdf>}
{:?s #<SesameResource http://www.rdfdata.org/dat/rdfdata.rdf>}
{:?s #<SesameResource http://rdfweb.org/topic/FOAFBulletinBoard>}
{:?s #<SesameResource http://rdfweb.org/topic/FOAFBulletinBoard>} ...