Hugging Face classifier speed (list comprehension vs. list input)

Hope you are doing well!
I am trying to use the transformers library from Hugging Face, and I want to optimize for speed.
Screenshot of the speed test using a list comprehension:
Each text in the list is around 500 characters.
I don't understand why this
[classifier(review, labels, multi_label=True) for review in reviews]
is faster than
classifier(reviews, labels, multi_label=True)
Do you have any idea why this happens?
Thank you!
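For reference, here is a minimal sketch of how the two calling patterns could be timed side by side (the pipeline task, placeholder texts and labels are assumptions for illustration, not taken from the screenshot):

import time
from transformers import pipeline

# Hedged sketch: time the two calling patterns from the question side by side.
classifier = pipeline("zero-shot-classification")
reviews = ["a review of roughly 500 characters ..."] * 20   # placeholder data
labels = ["positive", "negative", "neutral"]                # placeholder labels

t0 = time.perf_counter()
per_item = [classifier(review, labels, multi_label=True) for review in reviews]
t1 = time.perf_counter()
whole_list = classifier(reviews, labels, multi_label=True)
t2 = time.perf_counter()

print(f"list comprehension: {t1 - t0:.2f} s")
print(f"single list call:   {t2 - t1:.2f} s")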

Related

How to access by() output?

I have a large data.frame containing different forest sites, tree species and their dimensions. For some trees I have height and dbh data, for some I only have dbh. I need to calculate missing heights for additional evaluation. Height is site and species specific which is why I used the by() function on a with_height subset:
tmp <- with(with_height,
            by(with_height, with_height[, 1:2],   # columns 1:2 are site and species
               function(x) lm(height ~ log(dbh), data = x)))
This works out and creates a large list (1144 unnamed elements, 9.8 MB).
How do I access this list? I need either the lm() or the coefficients for each real combination of site and species (without NULL/ZERO responses if a species did not occur).
I found that
tmp[[1]]$coefficients
returns
(Intercept)    log(dbh)
  -16.36298    11.18222
But how do I know which site-species combination this belongs to? And is there a way to do this for all real site-species combinations simultaneously?
I have already spent hours on this question and would be very thankful for any help and advice!

Memory Error when using combination and intersection in python

I am writing code for large-graph sampling and have run into some memory problems.
possible_edges = set(itertools.combinations(list(sampled_nodes), 2))
sampled_graph = list(possible_edges.intersection(ori_edges))
The code is supposed to find all combinations of the nodes in sampled_nodes, which gives every possible edge those nodes could form, and then take the intersection with ori_edges to find which edges actually exist.
The problem is that when the graph is enormous, itertools.combinations causes a memory error.
I have thought about writing a for loop to compute the intersection iteratively, but that takes too much time.
Any help would be appreciated. Thank you!
I found one solution: instead of taking the intersection of the two collections, I chose not to create all the combinations for possible_edges at all.
I pick the edges in ori_edges and check whether both of their nodes exist in sampled_nodes.
sampled_graph = list(set(
    (e[0], e[1])
    for e in ori_edges
    if int(e[0]) in result_nodes and int(e[1]) in result_nodes
))
This code won't create the huge list and the speed is acceptable.
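For completeness, a slightly expanded, self-contained sketch of the same idea (the toy data below is made up; the key points are keeping the node collection as a set for O(1) membership tests and never materialising the O(n^2) candidate pairs):

# Sketch with made-up toy data: filter existing edges by endpoint membership
# instead of enumerating every candidate pair with itertools.combinations.
sampled_nodes = {1, 2, 3, 5, 8}
ori_edges = [(1, 2), (2, 4), (3, 5), (5, 8), (8, 9)]

node_set = set(sampled_nodes)          # a set gives O(1) membership tests
sampled_graph = [(u, v) for (u, v) in ori_edges
                 if u in node_set and v in node_set]

print(sampled_graph)                   # [(1, 2), (3, 5), (5, 8)]

Memory use is proportional to the number of surviving edges rather than to the square of the number of sampled nodes.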

Fast R sliding window function using a RANGE rather than a PHYSICAL partition

I am trying to solve a problem: run a statistic (count, sum, mean) over an irregular time-series data set, where the window for each row is defined by a date range (preferably within a grouping column).
I have found that ORACLE SQL supports this through:
COUNT(*) OVER (
    ORDER BY payment_date
    RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW
)
And in R I have built functions that use lists to collect vectors of values for each row, but this is expensive and slow. The best solution I have found is by user mgahan, in his package boRingTrees:
R: fast sliding window with given coordinates
library("devtools")
install_github("mgahan/boRingTrees")
library("boRingTrees")
set.seed(1)
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
                         32,34,66,97,151,188,211,213,241,274,294,321,
                         33,35,67,98,152,189,212,214,242,275,295,322), origin = "2010-01-01")
Cust_ID <- c(rep(1,12), rep(2,12), rep(3,12))
Target <- rpois(36, 3)
require("data.table")
data <- data.table(Trans_Dates, Cust_ID, Target)
data[, Roll := rollingByCalcs(data = data, bylist = "Cust_ID", dates = "Trans_Dates",
                              target = "Target", lower = 0, upper = 31, incbounds = TRUE,
                              stat = sum, na.rm = TRUE, cores = 1)]
However, when I run this against larger data sets, it also runs quite slowly.
What I have tried:
Using lists in loops to return window partitions, but this is very slow.
Importing other users' functions, such as boRingTrees, which encapsulate the problem well but are also slow.
What I have learnt:
There is good support in R for physical partitions (up one row, grouping into days/weeks, etc.) through zoo and rollapply, but limited support for ranged partitions (all rows within a given number of hours of a timestamp).
What I think I need:
I have come to the conclusion that I need a C++ function to run a sliding window over a date range more quickly. I have started playing with C++ in R, and these two Rcpp efforts come close (in technique) to what I think I need:
R: Rolling window function with adjustable window and step-size for irregularly spaced observations
R: fast sliding window with given coordinates
I hope this summary is a useful collation of information for people trying to solve similar problems (I found searching on this topic difficult: the information is sparse and there are very different ways of describing similar things). Hopefully someone can assist me in building a faster C++ solution that I can run in R (inline or via a .cpp file). Here is a sample data set (again, courtesy of mgahan):
Trans_Dates <- as.Date(c(31,33,65,96,150,187,210,212,240,273,293,320,
                         32,34,66,97,151,188,211,213,241,274,294,321,
                         33,35,67,98,152,189,212,214,242,275,295,322), origin = "2010-01-01")
Cust_ID <- c(rep(1,12), rep(2,12), rep(3,12))
Val <- rpois(36, 3)
require("data.table")
data <- data.table(Trans_Dates, Cust_ID, Val)
e.g.:
data[, RowRollCount31 := rollingByCalcs(data = data, bylist = "Cust_ID", dates = "Trans_Dates",
                                        target = "Val", lower = 0, upper = 31, incbounds = TRUE,
                                        stat = length, na.rm = TRUE)]
Ideally, the solution would support the 'interval' option as in the Oracle example (i.e. windows within 'x' hours of each row), as well as the 'group by'/'bylist' and 'stat' options that mgahan cleverly catered for.
Further reading / a good explanation of the problem:
https://blog.jooq.org/2016/10/31/a-little-known-sql-feature-use-logical-windowing-to-aggregate-sliding-ranges/
Many thanks in advance!
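(For comparison only, and not an R answer: the same RANGE semantics can also be expressed in Python via pandas' time-offset rolling windows. A minimal sketch recreating the sample data above; boundary inclusion may differ slightly from incbounds=T.)

# Python/pandas sketch of the same RANGE-style window, for comparison only.
# Sums Target over the previous 31 days (including the current row) per Cust_ID.
import numpy as np
import pandas as pd

offsets = [31,33,65,96,150,187,210,212,240,273,293,320,
           32,34,66,97,151,188,211,213,241,274,294,321,
           33,35,67,98,152,189,212,214,242,275,295,322]
df = pd.DataFrame({
    "Trans_Dates": pd.Timestamp("2010-01-01") + pd.to_timedelta(offsets, unit="D"),
    "Cust_ID": [1]*12 + [2]*12 + [3]*12,
    "Target": np.random.default_rng(1).poisson(3, 36),
}).sort_values(["Cust_ID", "Trans_Dates"])

# Time-offset rolling window per customer: the analogue of
# RANGE BETWEEN INTERVAL '31' DAY PRECEDING AND CURRENT ROW.
df["Roll"] = (df.set_index("Trans_Dates")
                .groupby("Cust_ID")["Target"]
                .rolling("31D")
                .sum()
                .to_numpy())
print(df)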

SegNet results on the training set (tested via test_segmentation.py)

I ran SegNet on my own dataset (following the SegNet tutorial) and I see great results via test_segmentation.py.
My problem is that I want to see the real net outputs, not test_segmentation's own colourisation (by class).
For example, if I have trained the net with 2 classes, then after training I want to see not only 2 colours (as we see with the classes) but the real net segmentation values ([0.22, 0.19, 0.3, ...]), lighter and darker as the net sees it.
I hope I explained myself well. Thanks for helping.
You could use a Python script to achieve what you want; take a look at this script.
The command out = out['argmax'] extracts the raw output, so you can get a segmentation map with 'lighter and darker' values as you wanted.
When you say the 'real' net colour segmentation, I will assume you mean the probability maps. Effectively, the last layer has one map for every class; and if you check the predict function in inference.py, they take the argmax, that is, the channel (which represents the class) with the highest probability. If you want to get these maps, you just have to take the data without computing the argmax, something like:
predicted = net.blobs['prob'].data
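As a rough illustration of what could be done with those maps (this assumes net is the already-loaded test net from inference.py and that the softmax output blob is named 'prob'; adapt to your setup):

# Sketch: save each class's probability map as a grayscale image instead of
# the argmax colourisation. Assumes a forward pass has already been run.
import matplotlib.pyplot as plt

prob = net.blobs['prob'].data[0]          # shape: (num_classes, H, W)
for c in range(prob.shape[0]):
    # probabilities lie in [0, 1]; fixing vmin/vmax keeps the shading comparable
    plt.imsave('class_%d_prob.png' % c, prob[c], cmap='gray', vmin=0.0, vmax=1.0)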
I solved it. The solution is to set cmin and cmax to the 0-1 range in the scipy saving call, for example: scipy.misc.toimage(output, cmin=0.0, cmax=1.0).save('/path/.../image.png')

TF-IDF vectorizer doesn't work better than CountVectorizer (scikit-learn)

I am working on a multilabel text classification problem with 10 labels.
The dataset is small, roughly 7,000 items and roughly 7,500 labels in total. I am using Python with scikit-learn, and something strange came up in the results. As a baseline I started with CountVectorizer, and was actually planning to move to TfidfVectorizer, which I thought would work better. But it doesn't: with CountVectorizer I get an F1 score about 0.1 higher (0.76 vs 0.65).
I cannot wrap my head around why this could be the case.
There are 10 categories and one is called miscellaneous. This one in particular performs much worse with TF-IDF.
Does anyone know when TF-IDF could perform worse than plain counts?
The question is, why not? They are simply two different solutions.
What is your dataset, how many words does it contain, how are the items labelled, how do you extract your features?
CountVectorizer simply counts the words; if that does a good job, so be it.
There is no reason why IDF would add information for a classification task. It performs well for search and ranking, but classification needs to capture similarity, not singularity.
IDF is meant to spot the singularity of one sample versus the rest of the corpus, whereas what you are looking for is the singularity of one sample versus the other clusters. IDF smooths out the intra-cluster TF similarity.
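One practical way to settle it is to A/B the two vectorizers with everything else held fixed; below is a minimal sketch of that comparison (the toy corpus, labels and classifier are placeholders for the real 7,000-item multilabel data):

# Sketch: compare CountVectorizer and TfidfVectorizer under identical conditions.
# The tiny corpus and binary labels stand in for the real multilabel dataset.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

docs = ["great product fast shipping",   "terrible quality broke quickly",
        "fast delivery great value",     "broke after one week terrible",
        "great service would buy again", "quality is terrible avoid",
        "value for money great buy",     "avoid this it broke immediately"]
y = [1, 0, 1, 0, 1, 0, 1, 0]   # placeholder labels

for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, docs, y, cv=4, scoring="f1_macro")
    print(type(vectorizer).__name__, round(scores.mean(), 3))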