Code Example:
library(mlr3)
# task is a Task; rsmp_scheme is a Resampling, e.g. rsmp("cv", folds = 5)
# BLOCKING by "userID"
task$col_roles$group = "userID"
# Remove "userID" from features
task$col_roles$feature = setdiff(task$col_roles$feature, "userID")
# STRATIFICATION (by Target Variable!)
task$col_roles$stratum = "answer_code"
# Instantiate the resampling on the task:
rsmp_scheme$instantiate(task)
Problem:
When trying to combine the resampling procedures 1) "stratified resampling" (stratified by the target) and 2) "block resampling" (grouped by userID) (see the mlr3 gallery example: https://mlr3gallery.mlr-org.com/posts/2020-03-30-stratification-blocking/), the following error occurs:
Error: Cannot combine stratification with grouping
Background Information:
My data set includes several repeated measurements per user (with a different number of repeated measurements available per person), so blocking or grouping by userID would be appropriate.
In addition, the distribution of the target variable is very imbalanced, which is why stratification by the target would also be appropriate.
Question: How can I implement both resampling methods in mlr3?
Thanks for your help! :-)
I have awsService.log logs being sent to CloudWatch and I want to create a metric filter to extract the error value.
Example:
06/13/2020 07:35:33 : 578 : 3 : error occurs
05/13/2020 07:35:33 : 3 : 3 : error occurs
The error value I would like to extract is: 3
I tried many regex expressions like * : * : 3 : but it does not work.
Any help would be appreciated.
Unfortunately, no complex pattern matching (such as regex) is currently supported by metric filters.
According to the documentation, you have three choices:
Matching on an exact string (e.g. ": 3 :")
Using JSON metric filters (not applicable here, since your logs are not JSON)
Filtering on the fields of a space-delimited event ([date, time, separator1, int1, separator2, int2=3, ...])
Regarding extracting the error value: metric filters count each time a matching event occurs; they don't extract values from the event itself.
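If it helps, here is a minimal boto3 sketch of creating such a space-delimited filter, reusing the pattern from the last option above; the log group, filter, and metric names are placeholders for your setup:
import boto3

logs = boto3.client("logs")
logs.put_metric_filter(
    logGroupName="awsService",    # placeholder: your log group
    filterName="error-value-3",   # placeholder
    filterPattern="[date, time, sep1, int1, sep2, int2=3, ...]",
    metricTransformations=[{
        "metricName": "ErrorValue3Count",  # placeholder
        "metricNamespace": "AwsService",   # placeholder
        "metricValue": "1",                # counts occurrences, per the note above
    }],
)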
I have a vector of nominal values and I need to know the probability of occurrence of each of the nominal values. Basically, I need those to obtain the min, max, mean, and std of the probability of observing the nominal values, and to get the class entropy value.
For example, let's assume there is a data set in which the target is 0, 1, or 2. In the training data set, we can count the number of records whose target is 1 and call it n_1; similarly, we can define n_0 and n_2. Then the probability of observing class 1 in the training data set is simply p_1 = n_1/(n_0 + n_1 + n_2). Once p_0, p_1, and p_2 are obtained, one can get the min, max, mean, and std of these probabilities.
It is easy to get this in Python with pandas, but I want to avoid reading the data set twice. I was wondering if there is a CAS action in SAS that can provide it. Note that I use the Python API of SAS through swat, so I need the API in Python.
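For concreteness, the pandas version alluded to would be something like the following minimal sketch (df and the column name target are placeholders):
import numpy as np
import pandas as pd

df = pd.read_csv("./data/hmeq.csv")  # this in-memory read is what I want to avoid
probs = df["target"].value_counts(normalize=True)  # placeholder column name
prob_min, prob_max = probs.min(), probs.max()
prob_mean, prob_std = probs.mean(), probs.std()
entropy = -np.sum(probs * np.log2(probs))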
I found the following solution and it works fine. It uses s.dataPreprocess.highcardinality to get the number of classes and then uses s.dataPreprocess.binning to obtain the number of observations within each class. Then, there is just some straightforward calculation.
import numpy as np
import pandas as pd
import swat

# create a CAS session (server and port are placeholders for your deployment)
s = swat.CAS(server, port)
# load the dataPreprocess action set before calling its actions
s.loadactionset(actionset="dataPreprocess")
# upload the table
tbl_name = 'hmeq'
s.upload("./data/hmeq.csv", casout=dict(name=tbl_name, replace=True))
target_var = 'BAD'  # the target column of the hmeq data set
# call highcardinality to get the number of classes
cardinality_result = s.dataPreprocess.highcardinality(table=tbl_name, vars=[target_var])
cardinality_result_df = pd.DataFrame(cardinality_result["HighCardinalityDetails"])
number_of_classes = int(cardinality_result_df["CardinalityEstimate"].iloc[0])
# call the binning action to get the count of observations in each class
result_binning = s.dataPreprocess.binning(table=tbl_name, vars=[target_var], nBinsArray=[number_of_classes])
result_binning_df = pd.DataFrame(result_binning["BinDetails"])
# turn the per-bin counts into probabilities and summarize them
probs = result_binning_df["NInBin"]/result_binning_df["NInBin"].sum()
prob_min = probs.min()
prob_max = probs.max()
prob_mean = probs.mean()
prob_std = probs.std()
entropy = -np.sum(probs*np.log2(probs))
My data is zero-inflated, so I'm running a zero-inflated model using glmmadmb:
Model3z <- glmmadmb(Count3 ~ Light3 + (1|Site3), zeroInflation = T, family= "poisson", data = dframe3)
However, when I try to do pairwise comparisons of the different light types in this model with pwcs3 <- lsmeans(Model3z, "Light3"), I get the error message:
Error in ref_grid(object, ...) :
Can't handle an object of class “glmmadmb”
Use help("models", package = "emmeans") for information on supported models.
When I go to the emmeans package website, it says that glmmadmb is no longer supported.
I've switched to pscl and the zeroinfl function, but am unsure how to restructure my code to fit the pscl format. Typing in P <- zeroinfl(Count3 ~ Light3 + (1|Site3), family = poisson, data = dframe3) gives the error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(1, Site3) : ‘|’ not meaningful for factors
Is there another way of using glmmadmb with lsmeans? If not, does anyone know what zero-inflated model code in pscl is supposed to look like? Thanks.
I am trying to integrate InfluxDB with my application and process the output. I am importing the InfluxDBClient package to connect to the InfluxDB instance running on my local machine, and using query(), which returns data as an influxdb.resultset.ResultSet.
However, I want to be able to pick specific elements from the ResultSet for my computations. I tried different functions like keys(), items(), and values() from the influxdb-python manual here, but to no avail:
http://influxdb-python.readthedocs.io/en/latest/api-documentation.html
This is the sample output of the query():
Result: ResultSet({'(u'cpu', None)': [{u'usage_guest_nice': 0, u'usage_user': 0.90783871790308868, u'usage_nice': 0, u'usage_steal': 0, u'usage_iowait': 0.056348610076366427, u'host': u'xxx.xxx.hostname.com', u'usage_guest': 0, u'usage_idle': 98.184322579062794, u'usage_softirq': 0.0062609566755314457, u'time': u'2016-06-26T16:25:00Z', u'usage_irq': 0, u'cpu': u'cpu-total', u'usage_system': 0.84522915123660536}]})
I am also finding it hard to get the data in JSON format using raw, mentioned in the link above. Any pointers on processing this output would be great.
items() returns tuples in the format ((u'cpu', None), <generator>), where the generator can be looped over to get the actual data in dictionary format. Took some time for me to figure out, but it was fun!!
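A minimal sketch of that loop (assuming rs holds the ResultSet shown above; usage_user is one of the fields in the sample output):
# each item is a ((measurement, tags), generator-of-points) pair
for (measurement, tags), points in rs.items():
    for point in points:
        print(point['time'], point['usage_user'])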
According to the docs, you can use the get_points() function to retrieve results from an InfluxDB ResultSet. The function allows you to filter by measurement, by tag, by both measurement AND tag, or to simply get all the results without any filtering.
Getting all points
Using rs.get_points() will return a generator for all the points in the ResultSet.
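For example, following the same pattern as the snippets below:
rs = cli.query("SELECT * from cpu")
all_points = list(rs.get_points())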
Filtering by measurement
Using rs.get_points('cpu') will return a generator for all the points that are in a series with the measurement name cpu, regardless of the tags.
rs = cli.query("SELECT * from cpu")
cpu_points = list(rs.get_points(measurement='cpu'))
Filtering by tags
Using rs.get_points(tags={'host_name': 'influxdb.com'}) will return a generator for all the points that are tagged with the specified tags, regardless of the measurement name.
rs = cli.query("SELECT * from cpu")
cpu_influxdb_com_points = list(rs.get_points(tags={"host_name": "influxdb.com"}))
Filtering by measurement and tags
Using a measurement name and tags will return a generator for all the points that are in a series with the specified measurement name AND whose tags match the given tags.
rs = cli.query("SELECT * from cpu")
points = list(rs.get_points(measurement='cpu', tags={'host_name': 'influxdb.com'}))
I am using the following function to calculate the test statistic (a Wilcoxon rank-sum statistic) for data in a data frame (x):
wilcox.test.all.genes <- function(x, s1, s2) {
  # split the row into the two sample groups
  x1 <- as.numeric(x[s1])
  x2 <- as.numeric(x[s2])
  # two-sided Wilcoxon rank-sum test with continuity correction
  wilcox.out <- wilcox.test(x1, x2, exact=F, alternative="two.sided", correct=T)
  return(as.numeric(wilcox.out$statistic))
}
I need to write a for loop that will iterate a specific number of times. For each iteration, the columns need to be shuffled, the above function applied, and the maximum test-statistic value saved to a list.
I know that I can use the sample() function to shuffle the columns of the data frame, and the max() function to identify the maximum test-statistic value, but I can't figure out how to put them together into working code.
You are trying to generate empirical p-values, corrected for the multiple comparisons you are making across the multiple columns of your data. First, let's simulate an example data set:
# Simulate data
n.row = 100
n.col = 10
set.seed(12345)
group = factor(sample(2, n.row, replace=T))
data = data.frame(matrix(rnorm(n.row*n.col), nrow=n.row))
Next, calculate the Wilcoxon test statistic for each column, replicating this many times while permuting the class membership of the observations. This gives us an empirical null distribution of the test statistic.
# Re-calculate columnwise test statistics many times while permuting class labels
perms = replicate(500, apply(data[sample(nrow(data)), ], 2, function(x) wilcox.test(x[group==1], x[group==2], exact=F, alternative="two.sided", correct=T)$stat))
Calculate the null distribution of the maximum test statistic by collapsing across the multiple comparisons.
# For each permuted replication, calculate the max test statistic across the multiple comparisons
perms.max = apply(perms, 2, max)
By simply sorting the results, we can now determine the p=0.05 critical value.
# Identify critical value
crit = sort(perms.max)[round((1-0.05)*length(perms.max))]
We can also plot our distribution along with the critical value.
# Plot
dev.new(width=4, height=4)
hist(perms.max)
abline(v=crit, col='red')
Finally, comparing a real test statistic to this distribution gives you an empirical p-value, corrected for multiple comparisons by controlling the family-wise error rate at p<0.05. For example, let's pretend a real test statistic was 1600. We could then calculate the p-value like:
> length(which(perms.max>1600))/length(perms.max)
[1] 0.074