How do I get the word-embedding matrix from ft_word2vec (sparklyr package)?

I have another question in the word2vec universe.
I am using the sparklyr package and calling its ft_word2vec() function, and I have trouble understanding the output:
For every set of sentences/paragraphs I provide to ft_word2vec(), I always get the same number of vectors back, even when there are more sentences/paragraphs than distinct words. To me that looks like I am getting paragraph vectors rather than word vectors. Maybe a code example helps to illustrate my problem?
# add your spark_connection here as 'spark_connection = '
# create example data frame
FK_data <- data.frame(sentences = c("This is my first sentence",
                                    "It is followed by the second sentence",
                                    "At the end there is the last sentence"))
# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)
# prepare data for ft_word2vec (sentences have to be tokenized,
# i.e. a list of words instead of one string in each row)
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")
# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test
# given a training data set (FK_train) with a column "tokens"
# (each row holds a list of strings)
mymodel <- ft_word2vec(
  FK_train,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))
# I tried to get the data from spark with:
myemb <- mymodel %>% sparklyr::collect()
Has somebody had similar experiences? Can someone explain what exactly the ft_word2vec() function returns? Do you have an example of how to get the word-embedding vectors with this function? Or does the returned column indeed contain the paragraph vectors?

My colleague found a solution! Once you know how to do it, the instructions really begin to make sense!
# add your spark_connection here as 'spark_connection = '
# create example data frame
FK_data <- data.frame(sentences = c("This is my first sentence",
                                    "It is followed by the second sentence",
                                    "At the end there is the last sentence"))
# move the data to spark
sc_FK_data <- copy_to(spark_connection, FK_data, name = "FK_data", overwrite = TRUE)
# prepare data for ft_word2vec (sentences have to be tokenized,
# i.e. a list of words instead of one string in each row)
sc_FK_data <- ft_tokenizer(sc_FK_data, input_col = "sentences", output_col = "tokens")
# split data into training and test sets
partitions <- sc_FK_data %>%
  sdf_random_split(training = 0.7, test = 0.3, seed = 123456)
FK_train <- partitions$training
FK_test <- partitions$test
# CHANGES FOLLOW HERE:
# We have to pass the spark connection instead of the data. For me this was
# the confusing part, since I thought no data -> no model.
# Maybe we can think of this step as an initialization.
mymodel <- ft_word2vec(
  spark_connection,
  input_col = "tokens",
  output_col = "word2vec",
  vector_size = 15,
  min_count = 1,
  max_sentence_length = 4444,
  num_partitions = 1,
  step_size = 0.1,
  max_iter = 10,
  seed = 123456,
  uid = random_string("word2vec_"))
# now that we have our model initialized, we fit it to learn the word embeddings
w2v_model <- ml_fit(mymodel, FK_train)
# now we can collect the embedding vectors
emb <- w2v_model$vectors %>% collect()
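That also answers the original question: when ft_word2vec() is applied to a tibble directly, the output column holds one averaged vector per row, i.e. paragraph vectors, while the word-embedding matrix lives in the fitted model's $vectors table. A short sketch of both, using the w2v_model and FK_train objects from above:
# word-embedding matrix: one row per vocabulary word
word_vectors <- w2v_model$vectors %>% collect()
# paragraph/document vectors: ml_transform() averages the word vectors of
# each row's tokens, which is why the number of vectors always matched the
# number of sentences/paragraphs
doc_vectors <- w2v_model %>%
  ml_transform(FK_train) %>%
  collect()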

How do I fit a pymc3 model when each person has multiple data points?

I'm trying to practice using pymc3 on the kinds of data I come across in my research, but I'm having trouble thinking through how to fit the model when each person gives me multiple data points and each person comes from a different group (so I'm trying a hierarchical model).
Here's the practice scenario I'm using: suppose we have 2 groups of people, N = 30 in each group. All 60 people go through a 10-question survey, where each person can respond ("1") or not respond ("0") to each question. So, for each person, I have an array of length 10 filled with 1's and 0's.
To model these data, I assume each person has some latent trait "theta", and each item has a "discrimination" a and a "difficulty" b (this is just a basic item response model), and the probability of responding ("1") is given by (1 + exp(-a(theta - b)))^(-1), i.e. the logistic function applied to a(theta - b).
Here is how I tried to fit it using pymc3:
import numpy as np
import pymc3 as pm

N = 30  # people per group

traces = {}
for grp in range(2):
    group = prac_data["Group ID"] == grp
    data = prac_data[group]["Response"]
    with pm.Model() as irt:
        # Priors
        a_tmp = pm.Normal('a_tmp', mu=0, sd=1, shape=10)
        a = pm.Deterministic('a', np.exp(a_tmp))
        # We do this transformation since we must have a >= 0
        b = pm.Normal('b', mu=0, sd=1, shape=10)
        # Now for the hyperpriors on the groups:
        theta_mu = pm.Normal('theta_mu', mu=0, sd=1)
        theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0)
        theta = pm.Normal('theta', mu=theta_mu,
                          sd=theta_sigma, shape=N)
        p = getProbs(a, b, theta, N)  # a = discrimination, b = difficulty
        y = pm.Bernoulli('y', p=p, observed=data)
        traces[grp] = pm.sample(1000)
The function "getProbs" is supposed to give me an array of probabilities for the Bernoulli random variable, as the probability of responding 1 changes across trials/survey questions for each person. But this method gives me an error because it says to "specify one of p or logit_p", but I thought I did with the function?
Here's the code for "getProbs" in case it's helpful:
def getProbs(Disc, Diff, THETA, Nprt):
    # Get a large array of probabilities for the Bernoulli random variable
    n = len(Disc)
    m = Nprt
    probs = np.array([])
    for th in range(m):
        for t in range(n):
            p = item(Disc[t], Diff[t], THETA[th])
            probs = np.append(probs, p)
    return probs
I added the Nprt parameter because if I tried to get the length of THETA, it would give me an error since it is a FreeRV object. I know I can try and vectorize the "item" function, which is just the logistic function I put above, instead of doing it this way, but that also got me an error when I tried to run it.
I think I can do something with pm.Data to fix this, but the documentation isn't exactly clear to me.
Basically, I'm used to building models in JAGS, where you loop through each data point, but pymc3 doesn't seem to work like that. I'm confused about how to build/index my random variables in the model to make sure that the probabilities change how I'd like them to from trial-to-trial, and to make sure that the parameters I'm estimating correspond to the right person in the right group.
Thanks in advance for any help. I'm pretty new to pymc3 and trying to get the hang of it, and wanted to try something different from JAGS.
EDIT: I was able to solve this by first building the array I needed by looping through the trials, then transforming the array using:
p = theano.tensor.stack(p, axis = 0)
I then put this new variable in the "logit_p" argument of the Bernoulli instance and it worked! Here's the updated full model (below, I imported theano.tensor as T):
import numpy as np
import pymc3 as pm
import arviz as az
import theano.tensor as T

N = 30  # people per group

group = group.astype('int')
data = prac_data["Response"]
with pm.Model() as irt:
    # Priors
    # Item parameters:
    a = pm.Gamma('a', alpha=1, beta=1, shape=10)  # Discrimination
    b = pm.Normal('b', mu=0, sd=1, shape=10)      # Difficulty
    # Now for the hyperpriors on the groups: shape = 2 as there are 2 groups
    theta_mu = pm.Normal('theta_mu', mu=0, sd=1, shape=2)
    theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0, shape=2)
    # Individual-level person parameters:
    # group is a 2*N array that lets the model know which
    # theta_mu to use for each theta to estimate
    theta = pm.Normal('theta', mu=theta_mu[group],
                      sd=theta_sigma[group], shape=2*N)
    # Here, we're building an array of the logits we need for each trial;
    # for the logistic model stated above, logit p = a*(theta - b):
    p = np.array([])
    for n in range(2*N):
        for t in range(10):
            x = a[t]*(theta[n] - b[t])
            p = np.append(p, x)
    # Here, we turn p into a tensor object to put as an argument to the
    # Bernoulli random variable
    p = T.stack(p, axis=0)
    y = pm.Bernoulli('y', logit_p=p, observed=data)
    # On my computer, this took about 5 minutes to run.
    traces = pm.sample(1000, cores=1)
print(az.summary(traces))  # Summary of parameter distributions
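As an aside, the per-trial double loop with np.append can be replaced by one broadcast expression, which keeps everything in a single tensor and samples faster. A minimal sketch under the same model; person_idx and item_idx are index arrays introduced here (not in the original) that map each of the 2*N*10 observations to its person and item, and the observed responses are assumed to be flattened in that same order:
import numpy as np
import pymc3 as pm

# hypothetical index arrays: observation k belongs to person person_idx[k]
# and was a response to item item_idx[k]
person_idx = np.repeat(np.arange(2*N), 10)
item_idx = np.tile(np.arange(10), 2*N)

with pm.Model() as irt_vec:
    a = pm.Gamma('a', alpha=1, beta=1, shape=10)   # Discrimination
    b = pm.Normal('b', mu=0, sd=1, shape=10)       # Difficulty
    theta_mu = pm.Normal('theta_mu', mu=0, sd=1, shape=2)
    theta_sigma = pm.Uniform('theta_sigma', lower=0, upper=2, shape=2)
    theta = pm.Normal('theta', mu=theta_mu[group],
                      sd=theta_sigma[group], shape=2*N)
    # logit p = a*(theta - b), computed for all observations at once
    logit_p = a[item_idx] * (theta[person_idx] - b[item_idx])
    y = pm.Bernoulli('y', logit_p=logit_p, observed=data)
    trace = pm.sample(1000, cores=1)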

Shiny implementation

Hello, my question is about user interactivity with my code. I developed this simple loop for time series using the great forecastHybrid package; here is the Rmd for it.
The unfamiliar language is Portuguese. Sorry, I was too lazy to translate the whole thing; it shouldn't matter anyway.
```{r First time click play, include=FALSE}
setwd("~/R")
install.packages("forecastHybrid")
library(forecastHybrid)
```
```{r Inputs}
# A dataset variable in the global environment
Data <- SemZero
# File name for saving the regression without dummies
ComoSalvar <- "Exemplo2.csv"
# File name for saving the regression with dummies
ComoSalvarReg <- "ExemploparaDummies.csv"
# Where to save them
OndeSalvar <- "~/R"
# Month, year, and day variables for the ts
Mes <- 9
Ano <- 2012
Dia <- 1
# Frequency
Freq <- 12
# Forecast period
Forecast <- 12
# Confidence intervals
IC <- c(0)
# Variables in the Data dataset that will be used for the lapply regression;
# usually I would read an excel file with headers.
VStart <- 1
VFinish <- 2
# Simple regressor dataset; the regressors can be a matrix as well as a single
# column, since data.frame() combines everything, e.g.
# data.frame(OddMonths, Christmas, WasTrumpPresident, RainInThatSeason)
Regressores <- data.frame(0)
```
```{r Logic for rolling test and data simplification}
setwd(OndeSalvar)
if (IC[1] > 0) pi <- TRUE else pi <- FALSE
if (nrow(Data) >= Freq*3 + Forecast*2) {
  Weights <- "cv.error"
  Multiplicador <- floor((nrow(Data) - Forecast*2)/Freq)
} else if (nrow(Data) <= Freq*3 + Forecast*2) {
  Weights <- "cv.error"
  Multiplicador <- 3
} else if (nrow(Data) >= Freq*2.5 + Forecast*2) {
  Weights <- "cv.error"
  Multiplicador <- 2.5
} else {
  Weights <- "equal"
}
```
```{r The bulk of the regression process}
if (nrow(Regressores) < nrow(Data)) {
  my_forecast1 <- function(x) {
    print(x)
    print(summary(x))
    names(x)
    x[is.na(x)] <- 0
    if (sum(abs(x)) < Freq) {
      Model <- "aenst"
    } else if (mean(x[1:Freq]) == 0) Model <- "aenst" else Model <- "aenstf"
    x <- ts(x, start = c(Ano, Mes, Dia), frequency = Freq)
    hm <- hybridModel(x, models = Model, lambda = NULL,
      a.args = list(trace = FALSE, test = "kpss", ic = "aicc",
                    max.P = 2, max.p = 9, max.q = 9, max.Q = 2,
                    max.d = 2, max.D = 2, start.p = 9, start.P = 2,
                    start.Q = 2, start.q = 9,
                    allowdrift = TRUE, allowmean = TRUE
                    # If you have time, delete the # below for a
                    # higher-quality arima model.
                    #, stepwise = FALSE, parallel = TRUE, num.cores = NULL
                    ),
      e.args = list(ic = "aicc"),
      n.args = list(repeats = nrow(Data)),
      s.args = NULL,
      t.args = NULL,
      weights = Weights,
      errorMethod = "RMSE", cvHorizon = Forecast,
      windowSize = frequency(x)*Multiplicador,
      horizonAverage = FALSE,
      verbose = TRUE)
    lapply(seq_along(x), function(i) paste(names(x)[[i]], x[[i]]))
    fcast1 <- forecast(hm, h = Forecast, level = IC, PI = pi)
    return(fcast1)
  }
  Listas <- lapply(Data[, VStart:VFinish], my_forecast1)
  if (pi == FALSE) ListaResultado <- as.data.frame(lapply(Listas, '[[', 'mean')) else
    ListaResultado <- Listas
  write.csv(ListaResultado, file = ComoSalvar)
} else {
  my_forecastreg <- function(x) {
    print(x)
    print(summary(x))
    names(x)
    x[is.na(x)] <- 0
    x <- ts(x, start = c(Ano, Mes, Dia), frequency = Freq)
    hmreg <- hybridModel(x, models = "ans",
      a.args = list(xreg = Regressores[1:nrow(Data), ], trace = TRUE,
                    test = "kpss", ic = "aicc", max.P = 2, max.p = 9,
                    max.q = 9, max.Q = 2, max.d = 2, max.D = 2,
                    start.p = 9, start.P = 2, start.Q = 2, start.q = 9,
                    allowdrift = TRUE, allowmean = TRUE
                    # If you have time, delete the # below for a
                    # higher-quality arima model.
                    #, stepwise = FALSE, parallel = TRUE, num.cores = NULL
                    ),
      n.args = list(xreg = Regressores[1:nrow(Data), ],
                    repeats = nrow(Regressores)),
      s.args = list(xreg = Regressores[1:nrow(Data), ], method = "arima"))
    fcast2 <- forecast(hmreg, h = Forecast, level = IC, PI = pi,
                       xreg = Regressores[nrow(Data):(nrow(Data)+Forecast-1), ])
    return(fcast2)
  }
  Listas2 <- lapply(Data[, VStart:VFinish], my_forecastreg)
  if (pi == FALSE) ListaResultado2 <- as.data.frame(lapply(Listas2, '[[', 'mean')) else
    ListaResultado2 <- Listas2
  write.csv(ListaResultado2, file = ComoSalvarReg)
}
```
I want to develop something to get the user input and run the regressions. My end user doesn't usually like how ugly an R Markdown file looks, so I was looking into using Shiny, but I don't know a few details:
Who runs these regressions if I upload the whole thing successfully to Shiny? My computer, the server, the user? I have no idea.
Can the user input go into the user's own global environment, so that the whole thing could be kept as a strictly offline process (using Shiny as a beautification app that substitutes for this input chunk)?
Can someone please give an example of a Shiny app that does something similar? (A sketch follows below.)
Can the user read .xlsm files into the Shiny server, or use his global environment to define the data for the Shiny app to use as input?
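On the example question, here is a minimal sketch of the usual fileInput/eventReactive pattern; the input names and the CSV assumption are placeholders, not from the Rmd above. It also illustrates the first question: the server function runs wherever the app is hosted (your machine with runApp(), or the Shiny server once deployed), never inside the visitor's own R session, so a deployed app cannot reach into the user's local global environment.
library(shiny)

ui <- fluidPage(
  fileInput("datafile", "Upload data (csv)"),
  numericInput("freq", "Frequency", value = 12),
  numericInput("horizon", "Forecast period", value = 12),
  actionButton("go", "Run forecasts"),
  tableOutput("result")
)

server <- function(input, output) {
  forecasts <- eventReactive(input$go, {
    req(input$datafile)
    Data <- read.csv(input$datafile$datapath)
    # ... run the lapply/hybridModel loop from the Rmd here ...
    Data  # placeholder: return the forecast table instead
  })
  output$result <- renderTable(forecasts())
}

shinyApp(ui, server)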
Also, is the thief package possible to implement on this lapply function as a way to increase forecast quality? I would of course drop the stlm and theta options from the model, as they behave rather poorly in a wide range of simulations I performed with toy sets: stlm crashes on cross-validation with few observations, and the theta model just doesn't work.
Can someone teach me how, on an error inside the function, to ignore that variable and just keep applying the function to the next one, or to switch to a less problematic model? My solution was to try to catch the cases where the model would crash and drop the theta model before it happens, but that is just an ugly hack around the underlying problem (a tryCatch() sketch is shown below).
Also, if you see something ugly in the code itself, feel free to criticize.
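On the error-handling question: wrapping the call inside lapply() in tryCatch() lets a failing series return NULL while the loop keeps going (wrapping the function object itself in try() does not catch anything per call). A sketch using the objects defined in the Rmd:
Listas <- lapply(Data[, VStart:VFinish], function(x) {
  tryCatch(
    my_forecast1(x),
    error = function(e) {
      message("Skipping series: ", conditionMessage(e))
      NULL  # failed series are dropped; the loop continues
    }
  )
})
# drop the failures before collecting the forecast means
Listas <- Filter(Negate(is.null), Listas)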

Python - reading a text file delimited by semicolons, plotting a chart using openpyxl

I have copied the text file into an excel sheet, separating cells on the ; delimiter.
I need to plot a chart using that same file, which I achieved. But since all the copied values are of type str, my chart gives me wrong points.
Please suggest how to overcome this; the plot should be made of int values.
from datetime import date
from openpyxl import Workbook, load_workbook
from openpyxl.chart import (
    LineChart,
    Reference,
    Series,
)
from openpyxl.chart.axis import DateAxis

excelfile = r"C:\Users\lenovo\Desktop\how\openpychart.xlsx"
wb = Workbook()
ws = wb.active
f = open(r"C:\Users\lenovo\Desktop\sample.txt")
data = []
num = f.readlines()
for line in num:
    line = line.split(";")
    ws.append(line)
f.close()
wb.save(excelfile)
wb.close()
wb = load_workbook(excelfile, data_only=True)
ws = wb.active
c1 = LineChart()
c1.title = "Line Chart"
##c1.style = 13
c1.y_axis.title = 'Size'
c1.x_axis.title = 'Test Number'
data = Reference(ws, min_col=6, min_row=2, max_col=6, max_row=31)
series = Series(data, title='4th average')
c1.append(series)
data = Reference(ws, min_col=7, min_row=2, max_col=7, max_row=31)
series = Series(data, title='Defined Capacity')
c1.append(series)
##c1.add_data(data, titles_from_data=True)
# Style the lines
s1 = c1.series[0]
s1.marker.symbol = "triangle"
s1.marker.graphicalProperties.solidFill = "FF0000"  # Marker filling
s1.marker.graphicalProperties.line.solidFill = "FF0000"  # Marker outline
s1.graphicalProperties.line.noFill = True
s2 = c1.series[1]
s2.graphicalProperties.line.solidFill = "00AAAA"
s2.graphicalProperties.line.dashStyle = "sysDot"
s2.graphicalProperties.line.width = 100050  # width in EMUs
ws.add_chart(c1, "A10")
wb.save(excelfile)
wb.close()
I modified the code below in the for loop and it worked.
f = open("C:\Users\lenovo\Desktop\sample.txt")
data = []
num = f.readlines()
for line in num:
line = line.split(";")
new_line=[]
for x in line:
if x.isdigit():
x=int(x)
new_line.append(x)
else:
new_line.append(x)
ws.append(new_line)
f.close()
wb.save(excelfile)
wb.close()
For each list and each value: check whether it is a digit; if yes, convert it to an integer and store it in another list.
Using x = map(int, x) didn't work, since I have character values too.
I felt the above is much easier than using map(int, ...) with try and except.
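For completeness, here is a sketch of the try/except variant mentioned above; to_int is a helper name introduced here. Unlike isdigit(), it also converts negative numbers:
def to_int(x):
    # fall back to the original string when it is not an integer
    try:
        return int(x)
    except ValueError:
        return x

for line in num:
    ws.append([to_int(x) for x in line.split(";")])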
Thanks
Basha

python: Finding min values of subsets of a list

I have a list that looks something like this
(the columns are essentially acct, subacct, value):
1,1,3
1,2,-4
1,3,1
2,1,1
3,1,2
3,2,4
4,1,1
4,2,-1
I want to update the list to look like this
(the columns are now acct, subacct, value, and the min of the value for each account):
1,1,3,-4
1,2,-4,-4
1,3,1,-4
2,1,1,1
3,1,2,2
3,2,4,2
4,1,1,-1
4,2,-1,-1
The fourth value is derived by taking the min(value) for each account. So, for account 1, the min is -4, so col4 would be -4 for the three records tied to account 1.
For account 2, there is only one value.
For account 3, the min of 2 and 4 is 2, so the value for col 4 is 2 where account = 3.
I need to preserve col3, as I will need to use the value in column 3 for other calculations later. I also need to create this additional column for output later.
I have tried the following:
with open(file_name, 'rU') as f: #opens PW file
data = zip(*csv.reader(f, delimiter = '\t'))
# data = list(list(rec) for rec in csv.reader(f, delimiter='\t'))
#reads csv into a list of lists
#print the first row
uniqAcct = []
data[0] not in used and (uniqAcct.append(data[0]) or True)
But short of looping through and matching on each unique account and then going back through and adding a new column, I am stuck. I think there must be a pythonic way of doing this, but I cannot figure it out. Any help would be greatly appreciated!
I cannot use numpy, pandas, etc as they cannot be installed on this server yet. I need to use just basic python2
So the problem here is your data structure; it's not trivial to index.
Ideally you'd change it to something readable and keep it in those containers. However, if you insist on changing it back into tuples, I'd go with this construction:
# dummy values
data = [
    (1, 1, 3),
    (1, 2, -4),
    (1, 3, 1),
    (2, 1, 1),
    (3, 1, 2),
    (3, 2, 4),
    (4, 1, 1),
    (4, 2, -1),
]

class Account:
    def __init__(self, acct):
        self.acct = acct
        self.subaccts = {}  # maps sub account id to its value

    def as_tuples(self):
        min_value = min(val for val in self.subaccts.values())
        for subacct, val in self.subaccts.items():
            yield (self.acct, subacct, val, min_value)

def accounts_as_tuples(accounts):
    return [summary for acct_obj in accounts.values() for summary in acct_obj.as_tuples()]

accounts = {}
for acct, subacct, val in data:
    if acct not in accounts:
        accounts[acct] = Account(acct)
    accounts[acct].subaccts[subacct] = val

print(accounts_as_tuples(accounts))
But ideally, I'd keep it in the Account objects and just add a method that extracts the minimal value of the account when it's needed.
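That method could be as small as this (a sketch; min_value is a name introduced here, placed inside the Account class above):
def min_value(self):
    # smallest value across this account's sub accounts
    return min(self.subaccts.values())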
Here is another way using your initial approach.
Modify the way you import your data, so you can easily handle it in python.
import csv

mylist = []
with open(file_name, 'rU') as f:  # opens PW file
    data = csv.reader(f, delimiter='\t')
    for row in data:
        splitted = row[0].split(',')
        # this is in case you need integers
        splitted = [int(i) for i in splitted]
        mylist += [splitted]
Then, add the fourth column:
updated = []
for acc in set(zip(*mylist)[0]):
    acclist = [x for x in mylist if x[0] == acc]
    m = min(x[2] for x in acclist)  # min over the value column only
    for l in acclist:
        l.append(m)
    updated += acclist
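For comparison, a plain two-pass version with a dict (still Python 2 friendly, no imports) applied to the mylist built above; mins is a name introduced here:
# first pass: smallest value per account (the third column holds the value)
mins = {}
for acct, subacct, val in mylist:
    mins[acct] = min(val, mins.get(acct, val))
# second pass: append each account's min to its row
for row in mylist:
    row.append(mins[row[0]])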

Printing Results from Loops

I currently have a piece of code that works in two segments. The first segment opens an existing text file from a specific path on my local drive and arranges it, based on certain indices, into a list of sub-lists. In the second segment I take the sub-lists I have created and group them on a similar index to simplify them (starting at def merge_subs). I get no error, but I also receive no result when I try to print the variable answer. Am I not correctly looping over the original list of sub-lists? Ultimately I would like to have a variable containing the final product of these loops, so that I can write its contents to a new text file. Here is the code I am working with:
from itertools import groupby, chain
from operator import itemgetter

with open("somepathname") as g:
    # reads text from lines and turns them into a list of sub-lists
    lines = g.readlines()
L = []
for line in lines:
    matrix = line.split()
    JD = matrix[2]
    minTime = matrix[5]
    maxTime = matrix[7]
    # append each row; assigning L = [JD, minTime, maxTime] here would
    # overwrite the list on every pass and keep only the last line
    L.append([JD, minTime, maxTime])

def merge_subs(L):
    dates = {}
    for sub in L:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        answer.append([date] + dates[date])
    return answer  # without a return, the caller has nothing to print
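Since merge_subs only produces answer when it is called, a short usage sketch (the output path is a placeholder):
answer = merge_subs(L)
print(answer)
# write the merged rows to a new text file
with open("merged_output.txt", "w") as out:
    for row in answer:
        out.write(" ".join(row) + "\n")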
New code:
def openfile(self):
    filename = askopenfilename(parent=root)
    self.lines = open(filename)

def simplify(self):
    g = self.lines.readlines()
    self.newLists = []
    for line in g:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        self.newLists.append([JD, minTime, maxTime])
    print(self.newLists)
    dates = {}
    for sub in self.newLists:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        answer.append([date] + dates[date])
    print(answer)  # list.append() returns None, so print the list, not the append