Passing values from a table to a CTE (common-table-expression)

I have a table of slot numbers, i.e. warehouse slot numbers.
0110H
0310H
0311H
0312H
0313H
0314H
The table is called WarehouseLocationLimits and the column name is "F1". The 3rd and 4th digits of the above slot numbers represent the maximum number of bays in an aisle of the warehouse.
I have the following code:
DECLARE @I INT;
SET @I = 1;
WITH CTS(BAY) AS (
    SELECT @I
    UNION ALL
    SELECT BAY + 1 FROM CTS
    WHERE BAY < 5
)
SELECT F1, LEFT(F1, 2) AISLE, CAST(SUBSTRING(F1, 3, 2) AS INTEGER) BAYMAX, SUBSTRING(F1, 5, 1) LEVEL, BAY
FROM WarehouseLocationLimits WL, CTS
WHERE F1 IS NOT NULL
ORDER BY F1, BAY
Which generates something like the following:
0110H 01 10 H 1
0110H 01 10 H 2
0110H 01 10 H 3
0110H 01 10 H 4
0110H 01 10 H 5
0310H 03 10 H 1
0310H 03 10 H 2
0310H 03 10 H 3
0310H 03 10 H 4
0310H 03 10 H 5
Note: for each “slot” the CTE is generating 5 lines because of the literal ‘5’ in the WHERE clause of the CTE. I need to pass the value CAST(SUBSTRING(F1, 3, 2) AS INTEGER) of each slot to the CTE. How can I do that?
Thanks in advance for your help
Clyde

Ok. I got it. The solution is the following:
WITH CTS(SLOTNO, AISLE, BAY, LEVEL) AS (
    SELECT F1, LEFT(F1, 2), SUBSTRING(F1, 3, 2), ASCII(SUBSTRING(F1, 5, 1))
    FROM WarehouseLocationLimits WL1
    WHERE F1 IS NOT NULL
    UNION ALL
    SELECT WL2.F1, LEFT(WL2.F1, 2), BAY, LEVEL - 1
    FROM WarehouseLocationLimits WL2, CTS
    WHERE WL2.F1 = CTS.SLOTNO AND LEVEL > ASCII('A')
)
SELECT *, CHAR(LEVEL)
FROM CTS
ORDER BY SLOTNO, LEVEL DESC
I didn't understand that the first query (i.e., the one atop the "UNION ALL") is a "seed" query that runs only once for each row of the table. Having understood that, it all fell into place for me.
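For reference, the bay question as originally posed can be handled the same way: seed the recursive CTE from the table itself, carry each row's own BAYMAX along, and use it as the recursion bound instead of the literal 5. A sketch (untested, T-SQL syntax assumed):

WITH CTS(F1, AISLE, BAYMAX, LEVEL, BAY) AS (
    SELECT F1, LEFT(F1, 2), CAST(SUBSTRING(F1, 3, 2) AS INTEGER), SUBSTRING(F1, 5, 1), 1
    FROM WarehouseLocationLimits
    WHERE F1 IS NOT NULL
    UNION ALL
    SELECT F1, AISLE, BAYMAX, LEVEL, BAY + 1
    FROM CTS
    WHERE BAY < BAYMAX    -- each slot recurses up to its own maximum
)
SELECT F1, AISLE, BAYMAX, LEVEL, BAY
FROM CTS
ORDER BY F1, BAY

Because the anchor query emits one row per slot, the recursion runs independently for each slot and stops at that slot's own BAYMAX, so 0110H produces bays 1 through 10 rather than a fixed 1 through 5.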


How To Interpret Least Square Means and Standard Error

I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: How was 82.46721 calculated as the lsmeans for the first type?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
library(lsmeans)

data = data.frame(
  type = c("A", "A", "A", "B", "B", "B"),
  hours = c(60, 72, 61, 54, 68, 66),
  # pain = c(85, 95, 69, 73, 29, 30)
  pain = c(85, 95, 69, 85, 95, 69)
)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
  type hours pain
1    A    60   85
2    A    72   95
3    A    61   69
4    B    54   85
5    B    68   95
6    B    66   69
> lsmeans(model, c("type", "hours"))
 type hours   lsmean      SE df lower.CL upper.CL
 A     63.5 82.46721 8.24003  3 56.24376 108.6907
 B     63.5 83.53279 8.24003  3 57.30933 109.7562
Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor.
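To see where the identical SE and df values come from, you can ask predict() for standard errors too. A minimal sketch, reusing model and newdat from above:

p <- predict(model, newdata = newdat, se.fit = TRUE)
p$fit     # 82.46721 and 83.53279 -- the lsmean column
p$se.fit  # 8.24003 for both types, matching the SE column
p$df      # 3: the residual df, 6 observations minus 3 estimated coefficients

The lsmeans are just model predictions at hours = 63.5 (the mean of hours), the SE is the standard error of that prediction, and the df is the model's residual degrees of freedom, which is why it is the same for both rows.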

python pandas: set a column value based on whether another column's value is in a list

I have a dataframe like the following:
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 NaN rv
..................................................................
and there is a list named lst_n as follows:
lst_n = ['rv', 'ag', 'rg', ...........]
I want to set the value of the class column of this dataframe to 'class_a' if the value of n is in lst_n. For example, in the fifth row, n is rv, which is in the list (lst_n), so the value of class should be set to 'class_a'.
My code is the following, but it raises an error:
lst_n = ['rv', 'ag', 'rg', ...........]
df.loc[df.n is in lst_n, 'class'] = 'class_a'
The error is:
df.loc[df.n is in lst_n, 'class'] = 'class_a'
^
SyntaxError: invalid syntax
thanks!
You need isin to build the boolean mask:
lst_n = ['rv', 'ag', 'rg']
df.loc[df['n'].isin(lst_n), 'class'] = 'class_a'
print (df)
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 class_a rv
Another solution with Series.mask:
df['class'] = df['class'].mask(df.n.isin(lst_n), 'class_a')
print (df)
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 class_a rv
If you need a bit more performance, you can use np.where:
df['class'] = np.where(df.n.isin(lst_n), 'class_a', df['class'])
df
Out[942]:
f1 f2 class n
0 weekly_return 0.155796 ab weekly
1 monthly_return 0.153907 ab monthly
2 volume_ratio 0.123844 NaN volume
3 margin_selling_balance 0.115411 ad margin
4 margin_debt_balance 0.107883 ae margin
5 rv_ratio 0.077373 class_a rv
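The performance claim is easy to check for yourself. A quick sketch with synthetic data (the column values and sizes here are made up for the benchmark, not taken from the question):

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.random.choice(['rv', 'ag', 'weekly', 'margin'], 100000),
                   'class': 'ab'})
lst_n = ['rv', 'ag', 'rg']

# time 100 evaluations of each approach
print(timeit.timeit(lambda: df['class'].mask(df['n'].isin(lst_n), 'class_a'), number=100))
print(timeit.timeit(lambda: np.where(df['n'].isin(lst_n), 'class_a', df['class']), number=100))

np.where returns a plain NumPy array of the selected values, so it avoids some of the index-alignment overhead that Series.mask carries.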

Find sum of the column values based on some other column

I have an input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of Column1, I need to find out the sum of column7
Similarly, for every unique value of Column2, I need to find out the sum of column7
Output for Column 1 should look like:
j,686
u,308
t,98
c,83
Output for Column 2 should look like:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?
This could be done using Python's Counter and csv library as follows:
from collections import Counter
import csv

c1 = Counter()
c2 = Counter()

with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7

print("Column 1")
for value, count in c1.items():
    print('{},{}'.format(value, count))

print("\nColumn 2")
for value, count in c2.items():
    print('{},{}'.format(value, count))
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting items automatically. c1 holds all of the column 1 entries and c2 holds all of the column 2 entries. Note, Python numbers lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are held as strings. It is then added to the counter using either the column 1 or 2 value as the key. The result is two dictionaries holding the totaled counts for each key.
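As a quick aside (not part of the answer's code), the reason Counter needs no initialization is that missing keys act as zero:

from collections import Counter

c = Counter()
c['j'] += 343   # no KeyError: a missing key starts at 0
c['j'] += 343
print(c['j'])   # 686
print(c['x'])   # 0 -- looking up a missing key also returns 0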
You can use pandas:
import pandas as pd

df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns, like c1, c2, ..., c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now, group by column c1 (or c2) and sum column c7:
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64
SO isn't supposed to be a code-writing service, but I had a few minutes. :) Without pandas you can do it with the csv module:
import csv

def sum_to(results, key, add_value):
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)

column1_results = {}
column2_results = {}

with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])

print(column1_results)
print(column2_results)
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine got using your spec: t appears twice with 540 in column 7, so its total is 1080, not 98 (98 and 83 look like sums over column 8 instead). I'd double-check that.

Which pattern occurs the most in a matrix - R (UPDATE)

UPDATE 2
I've added some code (and an explanation) that I wrote myself at the end of this question. It is, however, a suboptimal solution (both in coding efficiency and in the resulting output), but it manages to make a selection of items that adhere to the constraints. If you have any ideas on how to improve it (again, both in efficiency and in output), please let me know.
1. Updated Post
Please look below for the initial question and sample code. Thanks to alexis_laz's answer, the problem was solved for a small number of items. However, when the number of items becomes too large, the combn function in R can no longer handle it, failing with the invalid 'ncol' value (too large or NA) error. Since my dataset does indeed have a lot of items, I was wondering whether replacing some of his code (shown after this) with C++ would provide a solution, and if so, what code I should use for this. Thanks!
This is the code as provided by alexis_laz:
ff = function(x, No_items, No_persons)
{
  do.call(rbind,
          lapply(No_items:ncol(x),
                 function(n) {
                   col_combs = combn(seq_len(ncol(x)), n, simplify = F)
                   persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
                   keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
                   data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
                              items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
                 }))
}
2. Initial Post
Currently I'm working on a set of data coming from adaptive measurement, which means that not all persons have made the same items. For my analysis, however, I need a dataset that contains only items that have been made by all persons (or a subset of these persons).
I have a matrix object in R with rows = persons (100,000) and columns = items (220), with a 1 in a cell if the person has made the item and a 0 if not.
How can I use R to determine which combination of at least 15 items was made by the highest number of persons?
Hopefully the question is clear (if not please ask me for more details and I will gladly provide those).
Tnx in advance.
Joost
Edit:
Below is a sample matrix with the items (A:E) as columns and persons (1:5) as rows.
mat <- matrix(c(1,1,1,0,0,1,1,0,1,1,1,1,1,0,1,0,1,1,0,0,1,1,1,1,0),5,5,byrow=T)
colnames(mat) <- c("A","B","C","D","E")
rownames(mat) <- 1:5
> mat
  A B C D E
1 1 1 1 0 0
2 1 1 0 1 1
3 1 1 1 0 1
4 0 1 1 0 0
5 1 1 1 1 0
mat[1,1] = 1 means that person 1 has given a response to item 1.
Now (in this example) I'm interested in finding out which set of at least 3 items is made by at least 3 people. So here I can just go through all possible combinations of 3, 4 and 5 items to check how many people have a 1 in the matrix for each item in a combination.
This will result in me choosing the item combination A, B and C, since it is the only combination of items that has been made by 3 people (namely persons 1, 3 and 5).
Now for my real dataset I want to do this but then for a combination of at least 10 items that a group of at least 75 people all responded to. And since I have a lot of data preferably not by hand as in the example data.
I'm thus looking for a function/code in R that lets me set the minimum number of items and persons, and then gives me all combinations of items and persons that adhere to these constraints or exceed them.
Thus for the example matrix it would be something like;
f <- function(data,no.items,no.persons){
#code
}
> f(mat,3,3)
no.item no.pers items persons
1 3 3 A, B, C 1, 3, 5
Or, in the case of at least 2 items that are made by at least 3 persons:
> f(mat,2,3)
no.item no.pers items persons
1 2 4 A, B 1, 2, 3, 5
2 2 3 A, C 1, 3, 5
3 2 4 B, C 1, 3, 4, 5
4 3 3 A, B, C 1, 3, 5
Hopefully this clears up what my question actually is about. Tnx for the quick replies that I already received!
3. Written Code
Below is the code I've written today. It takes each item once as a starting point and then looks for the item that was answered most often by the people who also responded to the start item. It then takes these two items and looks for a third, and repeats this until the number of people who responded to all selected questions drops below the given limit. One drawback of the code is run time, which grows roughly exponentially with the number of items. The second drawback is that it still does not evaluate all possible combinations of items: the start item and the subsequently chosen item may have many respondents in common, but if the chosen item has almost nothing in common with the other (not yet chosen) items, the sample might shrink very fast, whereas choosing an item with somewhat fewer respondents in common with the start item but more connections to other items might yield a much bigger final collection of selected items. So again, suggestions and improvements in both directions are welcome!
set.seed(512)
mat <- matrix(rbinom(1000000, 1, .6), 10000, 100)
colnames(mat) <- 1:100

fff <- function(data, persons, items){
  xx <- list()
  for(j in 1:ncol(data)){
    d <- matrix(c(j, length(which(data[,j] == 1))), 1, 2)
    colnames(d) <- c("item", "n")
    t <- persons + 1
    a <- j
    while(t >= persons){
      b <- numeric(0)
      for(i in 1:ncol(data)){
        z <- c(a, i)
        if(i %in% a){
          b[i] <- 0
        } else {
          b[i] <- length(which(rowSums(data[,z]) == length(z)))
        }
      }
      c <- c(which.max(b), max(b))
      d <- rbind(d, c)
      a <- c(a, c[1])
      t <- max(b)
    }
    print(j)
    xx[[j]] <- d
  }
  x <- y <- z <- numeric(0)
  zz <- matrix(c(0, 0, rep(NA, ncol(data))), length(xx), ncol(data) + 2, byrow = T)
  colnames(zz) <- c("n.pers", "n.item", rep("I", ncol(data)))
  for(i in 1:length(xx)){
    zz[i,1] <- xx[[i]][nrow(xx[[i]])-1, 2]
    zz[i,2] <- length(unname(xx[[i]][1:nrow(xx[[i]])-1, 1]))
    zz[i,3:(zz[i,2]+2)] <- unname(xx[[i]][1:nrow(xx[[i]])-1, 1])
  }
  zz <- zz[, colSums(is.na(zz)) < nrow(zz)]
  zz <- zz[which((rowSums(zz, na.rm = T) / rowMeans(zz, na.rm = T)) - 2 >= items), ]
  zz <- as.data.frame(zz)
  return(zz)
}

fff(mat, 110, 8)
> head(zz)
n.pers n.item I I I I I I I I I I
1 156 9 1 41 13 80 58 15 91 12 39 NA
2 160 9 2 27 59 13 81 16 15 6 92 NA
3 158 9 3 59 83 32 25 80 14 41 16 NA
4 160 9 4 24 27 71 32 10 63 42 51 NA
5 114 10 5 59 66 27 47 13 44 63 30 52
6 158 9 6 13 56 61 12 59 8 45 81 NA
#col 1 = number of persons in sample
#col 2 = number of items in sample
#col 3:12 = which items create this sample (NA if n.item is less than 10)
To follow up on my comment, something like:
set.seed(1618)
mat <- matrix(rbinom(1000, 1, .6), 100, 10)
colnames(mat) <- sample(LETTERS, 10)
rownames(mat) <- sprintf('person%s', 1:100)
mat1 <- mat[rowSums(mat) > 5, ]
head(mat1)
# A S X D R E Z K P C
# person1 1 1 1 0 1 1 1 1 1 1
# person3 1 0 1 1 0 1 0 0 1 1
# person4 1 0 1 1 1 1 1 0 1 1
# person5 1 1 1 1 1 0 1 1 0 0
# person6 1 1 1 1 0 1 0 1 1 0
# person7 0 1 1 1 1 1 1 1 0 0
table(rowSums(mat1))
# 6 7 8 9
# 24 23 21 5
tab <- table(sapply(1:nrow(mat1), function(x)
  paste(names(mat1[x, ][mat1[x, ] == 1]), collapse = ',')))
data.frame(tab[tab > 1])
# tab.tab...1.
# A,S,X,D,R,E,P,C 2
# A,S,X,D,R,E,Z,P,C 2
# A,S,X,R,E,Z,K,C 3
# A,S,X,R,E,Z,P,C 2
# A,S,X,Z,K,P,C 2
Here is another idea that matches your output:
ff = function(x, No_items, No_persons)
{
  do.call(rbind,
          lapply(No_items:ncol(x),
                 function(n) {
                   col_combs = combn(seq_len(ncol(x)), n, simplify = F)
                   persons = lapply(col_combs, function(j) rownames(x)[rowSums(x[, j, drop = F]) == n])
                   keep = unlist(lapply(persons, function(z) length(z) >= No_persons))
                   data.frame(persons = unlist(lapply(persons[keep], paste, collapse = ", ")),
                              items = unlist(lapply(col_combs[keep], function(z) paste(colnames(x)[z], collapse = ", "))))
                 }))
}
ff(mat, 3, 3)
# persons items
#1 1, 3, 5 A, B, C
ff(mat, 2, 3)
# persons items
#1 1, 2, 3, 5 A, B
#2 1, 3, 5 A, C
#3 1, 3, 4, 5 B, C
#4 1, 3, 5 A, B, C

R functions: how to pass parameters that mix characters and regular expressions

My data is as follows:
> df2
id calmonth product
1 101 01 apple
2 102 01 apple&nokia&htc
3 103 01 htc
4 104 01 apple&htc
5 104 02 nokia
Now I want to calculate the number of ids whose products contain both 'apple' and 'htc' when calmonth = '01'. What I need is not only 'apple' and 'htc'; I also need 'apple' and 'nokia', etc.
So I want to achieve this with a function like this:
xandy = function(a, b) data.frame(product = paste(a, b, sep = '&'),
                                  csum = length(grep('a.*b', x = df2$product)))
Also, I make a parameter list like this:
para = c('apple', 'htc', 'nokia')
But the problem comes when I pass parameters like
xandy(para[1], para[2])
The result is as follows:
product csum
1 apple&htc 0
My expected result would be:
product csum calmonth
1 apple&htc 2 01
2 apple&htc 0 02
So what is wrong with the parameter passing?
And how can I add calmonth into the function xandy correctly?
FYI: this question stems from another question of mine:
What's the R statement responding to SQL's 'in' statement
EDIT AFTER COMMENT
My expected result will be:
product csum calmonth
1 apple&htc 2 01
2 apple&htc 0 02
My answer is another way to tackle your problem.
library(stringr)
library(plyr)   # for ddply, used in get_data below
The function contains will split up the elements of a string vector according to the split character and evaluate whether all target words are contained.
contains <- function(x, target, split = "&") {
  l <- str_split(x, split)
  sapply(l, function(x, y) all(y %in% x), y = target)
}
contains(df2$product, c("apple", "htc"))
[1] FALSE TRUE FALSE TRUE FALSE
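Since contains() takes a vector of target words, the same helper handles more than two products at once, e.g. all three:

contains(df2$product, c("apple", "nokia", "htc"))
[1] FALSE TRUE FALSE FALSE FALSE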
The rest is just subsetting and summarizing:
get_data <- function(a, b) {
  e <- subset(df2, contains(product, c(a, b)))
  e$product2 <- paste(a, b, sep = "&")
  ddply(e, .(calmonth, product2), summarise, csum = length(id))
}
Using the data below, order no longer plays a role (see the comment below).
get_data("apple", "htc")
calmonth product2 csum
1 1 apple&htc 1
2 2 apple&htc 2
get_data("htc", "apple")
calmonth product2 csum
1 1 htc&apple 1
2 2 htc&apple 2
I know this is not a direct answer to your question but I find this approach quite clean.
EDIT AFTER COMMENT
The reason you get csum = 0 is simply that you are searching for the wrong regex pattern: 'a.*b' matches a literal "a", anything, then a literal "b", not apple ... htc. You need to construct the pattern from the arguments, i.e. paste0(a, ".*", b).
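To see the difference concretely (a standalone illustration, separate from the solution below):

a <- "apple"; b <- "htc"
grepl("a.*b", "apple&nokia&htc")              # FALSE: a literal "a", anything, then a literal "b"
grepl(paste0(a, ".*", b), "apple&nokia&htc")  # TRUE: the pattern is now "apple.*htc"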
Here is a complete solution. I would not call it beautiful code, but anyway (note that I changed the data to show that it generalizes across months).
library(plyr)

df2 <- read.table(text = "
id calmonth product
101 01 apple
102 01 apple&nokia&htc
103 01 htc
104 02 apple&htc
104 02 apple&htc", header = T)

xandy <- function(a, b) {
  pattern <- paste0(a, ".*", b)
  d1 <- df2[grep(pattern, df2$product), ]
  d1$product <- paste0(a, "&", b)
  ddply(d1, .(calmonth), summarise,
        csum = length(calmonth),
        product = unique(product))
}
xandy("apple", "htc")
calmonth csum product
1 1 1 apple&htc
2 2 2 apple&htc