How to access by() output? - list

I have a large data.frame containing different forest sites, tree species and their dimensions. For some trees I have both height and dbh data, for others I only have dbh. I need to calculate the missing heights for further evaluation. Height is site- and species-specific, which is why I used the by() function on a with_height subset:
tmp <- with(with_height,
            by(with_height, with_height[, 1:2],  # with_height[, 1:2] are site and species
               function(x) lm(height ~ log(dbh), data = x)))
This works and creates a large list (1144 unnamed elements, 9.8 Mb).
How do I access this list? I need either the lm() objects or their coefficients for each site-species combination that actually occurs (without the NULL elements produced when a species does not occur at a site).
I found that
tmp[[1]]$coefficients
returns
(Intercept)    log(dbh)
  -16.36298    11.18222
But how do I know which site-species combination this element belongs to? And is there a way to do this for all site-species combinations that actually occur simultaneously?
I have already spent hours on this question and would be very thankful for any help and advice!
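A minimal sketch of one way to get there, assuming only the by() call above (base R; the column names site and species are assumptions): the by() result keeps the grouping levels in its dimnames, so each element index can be mapped back to a site/species pair.
dimnames(tmp)                         # the site and species levels used by by()
combos <- expand.grid(dimnames(tmp))  # row k corresponds to tmp[[k]]
combos[1, ]                           # the site/species pair behind tmp[[1]]

keep  <- !sapply(tmp, is.null)        # combinations that actually occurred
coefs <- t(sapply(which(keep), function(i) coef(tmp[[i]])))
rownames(coefs) <- do.call(paste, c(combos[keep, ], sep = "."))
coefs                                 # one row of coefficients per fitted model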


Vector embeddings to mimic a ranking algorithm

Consider a search system where the user submits a query and retrieves products based on some ranking algorithm. Assume that these products are ordered according to their quality: p_0, p_1, …, p_10 and so on.
I would like to generate vector embeddings that mimic this ranking algorithm. The closest product vector to a query vector should ideally be p_0, the next one should be p_1 and so on.
I have tried building word2vec embeddings for products by feeding products that appeared in the same search session as sentences. Then I calculated a weighted average of the product vectors to obtain a query vector, weighting it so that it lies closer to the top result. Although the closest result is usually the best result for a given query, the subsequent results include some products that would never appear as a top result.
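For concreteness, here is a minimal base-R sketch of the weighted-average-then-cosine step described above (all vectors and weights are made-up placeholders, not my real data):
# hypothetical embeddings: one row per product, heavier weights for better-ranked products
set.seed(1)
product_vecs <- matrix(rnorm(5 * 50), nrow = 5,
                       dimnames = list(paste0("p_", 0:4), NULL))
weights   <- c(5, 3, 2, 1, 1)
query_vec <- colSums(product_vecs * weights) / sum(weights)

cosine  <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
ranking <- sort(apply(product_vecs, 1, cosine, b = query_vec), decreasing = TRUE)
ranking   # ideally p_0 first, p_1 second, ...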
Is there a trick by which word2vec can learn the ranking algorithm, or any other techniques I could try? I have looked into multi-dimensional scaling with non-metric distances, but it did not seem scalable to me for more than 100Ks of products.
There's no one trick – just iteratively improving your representations, & training set, & ranking methods to better meet your goals.
Word2vec-based representations can often help, but are still fairly simple & centered on individual words – whose senses may vary based on context & position in ways that a simple weighted-average-of-tokens fails to capture.
You may want to represent 'products' by more than just a string-of-word-tokens – to include other properties, as well. These could be scalar values like prices or various other kinds of ratings/properties, or extra synthetic labels, such as the result of other salient groupings (whether hand-edited or learned).
And even if just working with natural-language product descriptions – like product names, or descriptions, or reviews – there are other more-sophisticated text-representations that can be trained or used – such as sentence/document embeddings using deeper-networks than plain word2vec.
Most generically, if you have a bunch of quantitative representations of candidate results, and a query, and want to use some initial examples of "good" results to bootstrap more generalizable rules for scoring top results, you are attempting a "learning-to-rank" process:
https://en.wikipedia.org/wiki/Learning_to_rank
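As a toy illustration only, the simplest 'pointwise' flavour of learning-to-rank amounts to fitting a model on per query-product features against graded relevance labels, then sorting candidates by predicted score (every feature, label and size below is hypothetical):
set.seed(2)
train <- data.frame(text_sim   = runif(200),   # e.g. embedding similarity to the query
                    price_rank = runif(200),
                    rating     = runif(200),
                    relevance  = sample(0:3, 200, replace = TRUE))
model <- lm(relevance ~ text_sim + price_rank + rating, data = train)

# at query time: score the candidates and sort by predicted relevance
candidates <- data.frame(text_sim = runif(10), price_rank = runif(10), rating = runif(10))
order(predict(model, candidates), decreasing = TRUE)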
To suggest more specific steps would require a more specific description of inputs/outputs/goals, & what's been tried, and how what's been tried has failed.
For example, are your queries always just textual product names? In such a case, maybe plain keyword search is the central technology required – with things like word-vector modelling just a tweak for handling some tough cases, like expanding the results, or adding more contrast to the rankings, when results are too few or too many.
Or, can you detect key gaps in the modeling related to exactly those cases where "results include some results that would [ideally] never appear as a top result"? If certain things like rare (poorly-modeled) words, or important qualities not yet captured in the model, seem to be to blame for such cases, that will guide the potential set of corrective changes.

Elasticsearch scoring on multiple indexes: dfs_query_then_fetch returns the same scores as query_then_fetch

I have multiple indices in Elasticsearch (and the corresponding documents in Django created using django-elasticsearch-dsl). All of the indices have these settings:
settings = {'number_of_shards': 1,
'number_of_replicas': 0}
Now, I am trying to perform a search across all 10 indices. In order to retrieve consistent scoring between the results from different indices, I am using dfs_query_then_fetch:
search = Search(index=['mov*'])
search = search.params(search_type='dfs_query_then_fetch')
objects = search.query("multi_match", query='Tom & Jerry', fields=['title', 'actors'])
I get bad results due to inconsistent scoring. A book called 'A story of Jerry and his friend Tom' from one index can be ranked higher than the cartoon 'Tom & Jerry' from another index. The reason seems to be that dfs_query_then_fetch has no effect: when I remove it or substitute the simple query_then_fetch, I get exactly the same results with identical scoring.
I have tested it on URI requests as well, and I always get the same scores for both search types.
What could be the reason for this?
UPDATE: The results are actually not identical, but only very slightly different, e.g. a score of 50.1 with dfs and 50.0 without dfs, while the same model within one index has a score of 80.0.
If the number of shards is 1, then dfs_query_then_fetch and query_then_fetch will return the same result. The DFS phase queries all shards and then scores results based on the statistics it gathered, but in this case there is only one shard.
Regarding the scoring, you might want to have a look at your actors field too. Also, do let us know what the analyzer and tokenizer are, if you have used custom ones.

How can one create a single table from multiple datasets?

I'm trying to create a descriptive table by treatment group. For my analysis, I have 3 different partitions of the data (because I'm running 3 separate analyses) from a complete data set, but I only have one statistic from each subset that I am trying to describe, so I think it would look better in one complete table. In the end, I'd like output that can be converted to LaTeX (as I'm using bookdown).
I've been using the compareGroups package to easily create each table individually. I know that there is an rbind function that allows stacking tables, but it won't let me combine them because the n of each separate data frame is different (due to missingness). For instance, I'm studying marriage in one of my analyses and divorce in another (a separate analysis), so the n's of these two data frames differ, but the definition of the treatment group is the same.
Ideally, I'd have two columns, one for the treatment group and one for the control group. There would be two rows: one with the age at first marriage, and a second with the length of that first marriage, together with the respective n's of the cells.
library(compareGroups)
library(magrittr)  # for the %>% pipe

d1 <- compareGroups(treat ~ time1mar,
                    data = nlsy.mar,
                    simplify = TRUE,
                    na.action = na.omit) %>%
  createTable(., type = 1, show.p.overall = FALSE)

d2 <- compareGroups(treat ~ time1div,
                    data = nlsy.div,
                    simplify = TRUE,
                    na.action = na.omit) %>%
  createTable(., type = 1, show.p.overall = FALSE)
d.tot <- rbind(`First Age at Marriage` = d1, `Length of First Marriage` = d2)
This is the error that I get:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 6626, 5057
Any suggestions?
The problem might be that you're using na.omit, which deletes the cases/rows with NAs from both of your datasets. Probably a different number of cases gets removed from each data set. But actually, different numbers of rows should only be a problem with cbind. However, you might try changing the na.action option.
I'm just guessing. As joshpk said, without sample data it is difficult to reproduce your problem.
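If rbind on the createTable objects keeps failing, a rough base-R fallback is to build the two summary rows yourself and bind those instead (a sketch only, assuming your nlsy.mar/nlsy.div data frames with a common treat column):
# mean (sd) and n per treatment group for one variable, NAs dropped for that variable only
summarise_row <- function(df, var) {
  df <- df[!is.na(df[[var]]), ]
  tapply(df[[var]], df$treat, function(x)
    sprintf("%.2f (%.2f), n=%d", mean(x), sd(x), length(x)))
}

tab <- rbind(`First Age at Marriage`    = summarise_row(nlsy.mar, "time1mar"),
             `Length of First Marriage` = summarise_row(nlsy.div, "time1div"))
# knitr::kable(tab, format = "latex") would then render it for bookdown.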

Fast R sliding window function using a RANGE rather than a PHYSICAL partition

I am trying to solve a problem: run a statistic (count; sum; mean) over an irregular time series data set, where the window for each row spans a given date range (preferably within a grouping column).
I have found that Oracle SQL supports this through:
COUNT(*) OVER (
ORDER BY payment_date
RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)
And in R I have built functions that use lists to collect vectors of values for each row, but this is expensive and slow. The best solution I have found is from user mgahan, in his package boRingTrees:
R: fast sliding window with given coordinates
library("devtools")
install_github("mgahan/boRingTrees")
library("boRingTrees")
set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Target <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Target)
data[,Roll:=rollingByCalcs(data=data,bylist="Cust_ID",dates="Trans_Dates",
target="Target",lower=0,upper=31,incbounds=T,stat=sum,na.rm=T,cores=1)]
However, when I run this against larger data sets, it also runs quite slowly.
What I have tried:
Using lists in loops to return window partitions, but this is very slow.
Importing users' functions, such as boRingTrees, which encapsulate the problem well, but are also slow.
What I have learnt:
There is good support in R for physical partitions (up one row, group into days/weeks, etc.) through zoo and rollapply, but limited support for ranged partitions (all rows within a given number of hours of a timestamp).
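For example, the physical (row-count) case is easy with zoo; this is just an illustration of the distinction, not a solution to the ranged case:
library(zoo)
x <- c(2, 4, 1, 7, 3, 5)
rollapply(x, width = 3, FUN = sum, align = "right", fill = NA)
# each result uses exactly 3 physical rows, no matter how far apart their timestamps are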
What I think I need:
I have come to the conclusion that I need a C-level function to run a sliding window over a range of dates more quickly. I have started playing with C++ in R, and these two Rcpp efforts come close (in technique) to what I think I need:
R: Rolling window function with adjustable window and step-size for irregularly spaced observations
R: fast sliding window with given coordinates
I hope this summary is a useful collation of information for people trying to solve similar problems (I found searching on this topic difficult: sparse information and very different ways of describing similar things). Hopefully someone can assist me in building a faster C++ solution I can run in R (inline or .cpp). Here is a sample data set (again, courtesy of mgahan):
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
32,34,66,97,151,188,211,213,241,274,294,321,
33,35,67,98,152,189,212,214,242,275,295,322),origin="2010-01-01")
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
Val <- rpois(36,3)
require("data.table")
data <- data.table(Trans_Dates,Cust_ID,Val)
e.g.:
data[, RowRollCount31 := rollingByCalcs(data = data, bylist = "Cust_ID", dates = "Trans_Dates",
                                        target = "Val", lower = 0, upper = 31, incbounds = TRUE,
                                        stat = length, na.rm = TRUE)]
Ideally, the solution would use the 'interval' option as in the Oracle example (i.e. windows within 'x' hours of each row), and also the 'group by'/'bylist' and 'stat' options that mgahan cleverly catered for.
Further reading / a good explanation of the problem:
https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/
Many thanks in advance!
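One direction that may be worth noting (a hedged sketch against the sample data above, not a benchmarked answer): data.table's non-equi joins, available since version 1.9.8, can express the ranged window directly and avoid row-wise list collection.
library(data.table)
# build one [lower, upper] window per row, then join each window back onto the data
windows <- data[, .(Cust_ID, lower = Trans_Dates - 31, upper = Trans_Dates)]
roll <- data[windows,
             on = .(Cust_ID, Trans_Dates >= lower, Trans_Dates <= upper),
             .(Roll31 = sum(Val)),   # or length(Val), mean(Val), ...
             by = .EACHI]
data[, Roll31 := roll$Roll31]        # .EACHI keeps the row order of 'windows'
Whether this is fast enough on the real data would need testing.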

highlight buildings based on value and show in browser

I want to build a website with a map based on OpenStreetMap that colors buildings based on their potential average annual yield of solar power. I have the energy data for individual houses.
My question now is: can I assign each house (identified by street name and number) a value, and have the house colored based on this value in the browser?
I have little to no experience with OpenStreetMap and would be happy about hints in the right direction.
So you need an OSM dataset and filter it for building=* ways to get the building outlines (e.g. with osmosis). Then do a second run to filter for addr:* tags on nodes and merge them with the building outlines from step 1. Be aware of conflicts, and that one building can have multiple addresses. Now you have a dataset with normalized addresses and need to create a lookup structure, such as a hashmap, to map your solar data onto buildings: addr:street x addr:housenumber -> building id
(very raw idea on how to do it)
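To illustrate the browser side, a minimal R sketch (assuming the outlines have already been exported to a hypothetical buildings.geojson carrying an address and a joined yield column; the sf and leaflet packages are one convenient way to put a colored layer on an OSM base map):
library(sf)
library(leaflet)

buildings <- st_read("buildings.geojson")   # hypothetical: polygons with 'address' and 'yield'
pal <- colorNumeric("YlOrRd", domain = buildings$yield)

leaflet(buildings) %>%
  addTiles() %>%                             # standard OpenStreetMap base layer
  addPolygons(fillColor = ~pal(yield), fillOpacity = 0.8,
              weight = 1, color = "#444444",
              label = ~paste0(address, ": ", round(yield), " kWh/year")) %>%
  addLegend(pal = pal, values = ~yield, title = "Solar yield")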
IMHO, mixing external data sources with the copyleft Open Database License means that you would need to relicense your dataset under the ODbL as well.
Also keep in mind that not every address is currently in OSM, and the existing ones can be wrong!