Storing variables as local files? - python-2.7

I am currently setting up a CNN, and most of my debugging time consists of waiting for the preprocessing step to finish. My input/output data is stored in variables. Is it possible in Python (as in R) to store these variables as local files, so that instead of spending time generating and structuring my input/output data on every run, I can just load a file?
My attempt at using pickle:
import pickle

if feature == 0:
    print("Save!")
    # One file per variable; `with` closes each file so the data is flushed to disk.
    with open("input_train_0.pckl", "wb") as f:
        pickle.dump(data_train_input, f)
    with open("input_test_0.pckl", "wb") as f:
        pickle.dump(data_test_input, f)
    with open("output_train_0.pckl", "wb") as f:
        pickle.dump(data_train_output, f)
    with open("output_test_0.pckl", "wb") as f:
        pickle.dump(data_test_output, f)
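For reference, here is a minimal sketch of the full round trip (save once, load on later runs) using only the standard library. The helper names save_variable, load_variable and load_or_compute are made up for illustration, and the small list stands in for the real preprocessed data.

```python
import os
import pickle
import tempfile


def save_variable(obj, path):
    """Serialize obj to path; the with-block closes the file afterwards."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_variable(path):
    """Read back an object previously written by save_variable."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_or_compute(path, compute):
    """Return the cached object at path, computing and saving it first if missing."""
    if os.path.exists(path):
        return load_variable(path)
    obj = compute()
    save_variable(obj, path)
    return obj


# Demo in a temporary directory; the file name mirrors the one in the question.
cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "input_train_0.pckl")

data = load_or_compute(cache_path, lambda: [[1, 2], [3, 4]])  # computed, then saved
again = load_or_compute(cache_path, lambda: None)             # loaded from disk
```

With this pattern the expensive preprocessing only runs when the cache file does not exist yet; deleting the file forces a recompute.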

Related

HDFS File replace while other applications are accessing data

I have been working on an approach to refresh HDFS files while other consumers/applications are accessing the data. I have an HDFS directory containing files that users can access, and I need to replace them with the latest incoming data every day. My refresh process takes only a few seconds or milliseconds, but jobs that are already reading this data for analytics are still affected by it. My approach is that, instead of writing the Spark job's output directly to the location users read from, I first write the data to a temporary location and then swap it in through the HDFS file API. This has not solved my problem. Please suggest a solution or workaround for replacing HDFS files with no impact on downstream consumers.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// outputPath: base output directory of the job (defined elsewhere)
val conf: Configuration = new Configuration()
val fs: FileSystem = FileSystem.get(conf)
val currentDate = java.time.LocalDate.now
val destPath = outputPath + "/data"
val archivePath = outputPath + "/archive/" + currentDate
val dataTempPath = new Path(destPath + "_temp")
val dataPath = new Path(destPath)
// Drop the live data, then move the freshly written temp data into its place.
if (fs.exists(dataPath)) {
  fs.delete(dataPath, true)
}
if (fs.exists(dataTempPath)) {
  fs.rename(dataTempPath, dataPath)
}
// Same delete-then-rename swap for today's archive copy.
val archiveTempData = new Path(archivePath + "_temp")
val archive = new Path(archivePath)
if (fs.exists(archive)) {
  fs.delete(archive, true)
}
if (fs.exists(archiveTempData)) {
  fs.rename(archiveTempData, archive)
}
Simpler approach
Use two HDFS locations cyclically per source or target, with table definitions t1_x and t2_x pointing at them respectively, and use a view view_x to switch between t1_x and t2_x on each load.
Queries should always go through view_x.
Clean up the HDFS location that is no longer in use before the next cycle.
The key is to leave both the new and the old data in place for a while.
Comment to make
The only drawback arises if a set of queries needs to run against the old version of the data. If the changed data is purely additive, there is no issue; if it can overwrite existing data, there is.
More complicated approach
In the overwrite case, if it really is a problem for you, you need the more cumbersome solution outlined here:
Version the data (via partitioning) with some value.
Keep a control table holding current_version, read this value, and use it in all related queries until you can switch to the new current_version.
Then do your maintenance.
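The versioning idea can be sketched roughly as follows. This is only an illustration on the local filesystem, not HDFS code: the version=... directory layout, the CURRENT pointer file (standing in for the control table) and all function names are assumptions made for the example.

```python
import os
import tempfile


def publish(base_dir, version, rows):
    """Write a new version directory, then repoint CURRENT at it."""
    version_dir = os.path.join(base_dir, "version=%s" % version)
    os.makedirs(version_dir)
    with open(os.path.join(version_dir, "part-0.txt"), "w") as f:
        f.write("\n".join(rows))
    # Write the pointer to a temp file first, then swap it in with one rename
    # (os.replace is atomic on POSIX filesystems).
    tmp_pointer = os.path.join(base_dir, "CURRENT.tmp")
    with open(tmp_pointer, "w") as f:
        f.write(version)
    os.replace(tmp_pointer, os.path.join(base_dir, "CURRENT"))


def current_version(base_dir):
    """Read the version that new queries should use."""
    with open(os.path.join(base_dir, "CURRENT")) as f:
        return f.read().strip()


def read_rows(base_dir, version):
    """Read the data of one specific version."""
    path = os.path.join(base_dir, "version=%s" % version, "part-0.txt")
    with open(path) as f:
        return f.read().split("\n")


base = tempfile.mkdtemp()
publish(base, "2024-01-01", ["a", "b"])
pinned = current_version(base)       # a long-running job pins this version
publish(base, "2024-01-02", ["c"])   # the daily refresh happens meanwhile
old_rows = read_rows(base, pinned)   # the pinned job still sees its data
new_rows = read_rows(base, current_version(base))
```

Because old versions are never overwritten, a job that read current_version before the refresh keeps working; old version directories are removed later, during maintenance.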

How to feed in and retrieve state of LSTM in tensorflow C/ C++

I'd like to build and train a multi-layer LSTM model (state_is_tuple=True) in Python, and then load and use it in C++. But I'm having a hard time figuring out how to feed and fetch states in C++, mainly because I don't have string names that I can reference.
E.g. I put the initial state in a named scope such as
with tf.name_scope('rnn_input_state'):
    self.initial_state = cell.zero_state(args.batch_size, tf.float32)
and this appears in the graph as below, but how can I feed to these in C++?
Also, how can I fetch the current state in C++? I tried the graph construction code below in python but I'm not sure if it's the right thing to do, because last_state should be a tuple of tensors, not a single tensor (though I can see that the last_state node in tensorboard is 2x2x50x128, which sounds like it just concatenated the states as I have 2 layers, 128 rnn size, 50 mini batch size, and lstm cell - with 2 state vectors).
with tf.name_scope('outputs'):
    outputs, last_state = legacy_seq2seq.rnn_decoder(inputs, self.initial_state, cell, loop_function=loop if infer else None)
    output = tf.reshape(tf.concat(outputs, 1), [-1, args.rnn_size], name='output')
and this is what it looks like in tensorboard
Should I concat and split the state tensors so there is only ever one state tensor going in and out? Or is there a better way?
P.S. Ideally the solution won't involve hard-coding the number of layers (or rnn size). So I can just have four strings input_node_name, output_node_name, input_state_name, output_state_name, and the rest is derived from there.
I managed to do this by manually concatenating the state into a single tensor. I'm not sure if this is wise, since this is how TensorFlow used to handle states before deprecating that in favor of tuple states. Instead of setting state_is_tuple=False and risking my code becoming obsolete soon, I've added extra ops to manually stack and unstack the states to and from a single tensor. That said, it works fine in both Python and C++.
The key code is:
# setting up
zero_state = cell.zero_state(batch_size, tf.float32)
# tf.identity packs the tuple state into one named tensor that can be fed by name
state_in = tf.identity(zero_state, name='state_in')
# based on https://medium.com/#erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40#.zhg4zwteg
state_per_layer_list = tf.unstack(state_in, axis=0)
state_in_tuple = tuple(
    # TODO make this not hard-coded to LSTM
    [tf.contrib.rnn.LSTMStateTuple(state_per_layer_list[idx][0], state_per_layer_list[idx][1])
     for idx in range(num_layers)]
)
outputs, state_out_tuple = legacy_seq2seq.rnn_decoder(inputs, state_in_tuple, cell, loop_function=loop if infer else None)
state_out = tf.identity(state_out_tuple, name='state_out')

# running (training or inference)
state = sess.run('state_in:0')  # zero state
for x in batches:  # `batches` is a placeholder name for your own input iterator
    feed = {'data_in:0': x, 'state_in:0': state}
    y, state = sess.run(['data_out:0', 'state_out:0'], feed)
Here is the full code if anyone needs it
https://github.com/memo/char-rnn-tensorflow

Python multiprocessing with arrays and multiple arguments

So I am trying to read in a bunch of very large data files and each one takes quite some time to load. I am trying to figure out how to load them in the quickest way and without running into memory problems. Once the data files are loaded into the array the correct way I do not need to write to them, but just need to read. I've been trying to parallelize this for some time, but can't figure it out.
Let's say I have 400 time files. Each of these files is tab separated and has 30 variables each with 40,000 data points. I would like to create a 400x30x40000 array so that I can access the points easily.
Each data file is set up so that the first 40k points are for variable 1, the second 40k are for variable 2, and so on.
I have written a function that loads in a time file correctly and stores it in my array correctly. What I'm having trouble with is parallelizing it. This does work if I put it in a for loop and iterate over i.
import h5py
import pandas as pd

h5file = h5py.File('data.h5','a')
data = h5file.create_dataset("default",(len(files),len(header),numPts))
# is shape 400x30x40000

def loadTimes(files,i,header,numPts,data):
    # files has 400 elements
    # header has 30 elements
    # numPts is an integer
    allData = pd.read_csv(files[i],delimiter="\t",skiprows=2,header=None).T
    for j in range(0,len(header)):
        data[i,j,:] = allData[0][j*numPts:(j+1)*numPts]
    del allData
files is the list of time files loaded by subprocess.check_output (has about 400 elements), header is the list of variables, loaded from another file (has 30 elements in it). numPts is the number of points per variable (so around 40k).
I've tried using pool.map to load the data but found it didn't like multiple arguments. I also tried using partial, zip, and the lambda function, but none of those seem to like my arrays.
I am not wedded to this method. If there is a better way to do it, I would greatly appreciate hearing about it. Loading all this data one file at a time will simply take too long: my calculations show that it would take ~3 hrs on my computer using one core, and it would use up A LOT of my memory. I have access to another machine with a lot more cores, which is where I will actually be doing this, and I'd like to utilize them properly.
So how I solved this was using the h5 file format. What I did was write the loops so that they only had the iter
def LoadTimeFiles(i):
    from pandas import read_csv
    import h5py as h5
    dataFile = h5.File('data.h5', 'r+')
    rFile = dataFile['files'][i]  # path of the i-th time file, stored in the h5 file
    data = dataFile['data']
    lheader = len(data[0, :, 0])
    numPts = len(data[0, 0, :])
    allData = read_csv(rFile, delimiter="\t", skiprows=2, header=None, low_memory=False).T
    for j in range(0, lheader):
        data[i, j, :] = allData[0][j*numPts:(j+1)*numPts]
    del allData
    dataFile.close()

def LoadTimeFilesParallel(np):
    from multiprocessing import Pool
    import h5py as h5
    files = h5.File('data.h5', 'r')
    numFiles = len(files['data'][:, 0, 0])
    files.close()
    pool = Pool(np)  # np is the number of worker processes
    pool.map(LoadTimeFiles, range(numFiles))
    pool.close()
    pool.join()

if __name__ == '__main__':
    from multiprocessing import freeze_support
    freeze_support()  # only has an effect in frozen Windows executables
    np = 5
    LoadTimeFilesParallel(np)
Since I was storing the data in the h5 format anyway, I thought I'd be tricky and open the file inside each call (I can see no time delays in reading the h5 files). I added the option low_memory=False to the read_csv command because it made it go faster. The j loop was really fast, so I didn't need to speed it up.
Now each LoadTimeFiles call takes about 20-30 secs, and we run 5 at once without order mattering. My RAM usage never goes above 3.5 GB (total system usage) and drops back to under 1 GB after the runs.
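On the original problem of pool.map not accepting multiple arguments: one common workaround is functools.partial, which fixes the extra arguments so that the pool only has to supply the index. The sketch below is a toy version under stated assumptions: split_variables and load_one are made-up names, and in-memory lists stand in for the parsed time files so that it runs without pandas or h5py.

```python
from functools import partial
from multiprocessing import Pool


def split_variables(flat_values, num_vars, num_pts):
    """Split one file's flat column into num_vars rows of num_pts points each."""
    return [flat_values[j * num_pts:(j + 1) * num_pts] for j in range(num_vars)]


def load_one(i, columns, num_vars, num_pts):
    # In real use this would parse file i (e.g. with pandas.read_csv);
    # here columns[i] stands in for the already-parsed flat column.
    return split_variables(columns[i], num_vars, num_pts)


if __name__ == "__main__":
    # Three fake "files", each holding 2 variables x 3 points.
    columns = [list(range(k, k + 6)) for k in (0, 100, 200)]
    # partial binds everything except the index i, so pool.map can call worker(i).
    worker = partial(load_one, columns=columns, num_vars=2, num_pts=3)
    with Pool(2) as pool:
        result = pool.map(worker, range(len(columns)))
    print(result[1])
```

Everything bound with partial is pickled and sent to the worker processes, so it is usually better to bind small things such as file names rather than large arrays.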

custom kernel in e1071

I am currently trying to tune the svm function in the e1071 package for R. My input is genomic data (that is, each attribute takes a value in the set {-1, 0, 1}), and none of the four kernels currently offered in the package is really suitable for this kind of data. I would like to use Hamming distance as my kernel instead.
The svm function, it seems, is written in C++. I have downloaded the source via
download.packages(pkgs = "e1071",
                  destdir = ".",
                  type = "source")
and found the svm.cpp file containing the code for the function and the corresponding kernel portion, where I could potentially add my own custom kernel. Has anyone tried doing this? Is it possible? Once I have finished modifying svm.cpp (provided I figure out how), how do I make the package "see" the modified file?
You can modify one of the existing kernels.
I changed the return statement of the radial kernel to make my changes.
You can try the same approach.
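To make concrete what such a modified radial kernel might compute, here is a small sketch of the arithmetic only, written in Python rather than in the package's C++ and not taken from e1071: the squared Euclidean distance in the radial kernel is swapped for the Hamming distance. The function names and the gamma value are assumptions for illustration.

```python
import math


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def hamming_rbf_kernel(rows, gamma=0.1):
    """Kernel matrix K[i][j] = exp(-gamma * hamming_distance(rows[i], rows[j]))."""
    return [[math.exp(-gamma * hamming_distance(a, b)) for b in rows] for a in rows]


# Toy genotype data with attributes in {-1, 0, 1}, as in the question.
genotypes = [
    [-1, 0, 1, 1],
    [-1, 0, 1, 0],
    [1, 1, -1, 0],
]
K = hamming_rbf_kernel(genotypes)
```

Identical samples get a kernel value of 1, and the value decays with the number of mismatching attributes.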

How to change configuration of network during simulation in OMNeT++?

I want to modify some parameters from an element's .ini file in OMNeT++, say a node's transmission rate, during the simulation run, e.g. when a node receives some control message.
I found information saying that it is possible to loop over configuration values with something like some_variable = ${several values}, but there are no conditional clauses in .ini files and no way to pass data from C++ functions to those files (as far as I know).
I use INET, but maybe users of other models have already dealt with this problem.
In fact you can use the built-in constraint expression in the INI file. This will allow you to create runs for the given configuration while respecting the specified constraint (condition).
However, this constraint only applies to parameters that are specified in the .ini file, i.e. it won't help you if the variable you are trying to change is computed dynamically as part of the code.
Below, I give you a rather complicated "code-snippet" from the .ini file which uses many of the built-in functions that you have mentioned (variable iteration, conditionals etc.)
# Parameter assignment using iteration loops and constraints #
# first define the static values on which the others depend #
scenario.node[*].application.ADVlowerBound = ${t0= 0.1}s
scenario.node[*].application.aggToServerUpperBound = ${t3= 0.9}s
#
## assign values to "dependent" parameters using variable names and loop iterations #
scenario.node[*].application.ADVupperBound = ${t1= ${t0}..${t3} step 0.1}s # ADVupperBound == t1; t1 will take values starting from t0 to t3 (0.1 - 0.9) iterating 0.1
scenario.node[*].application.CMtoCHupperBound = ${t2= ${t0}..${t3} step 0.1}s
#
## connect "dependent" parameters to their "copies" -- this part of the snippet is only variable assignment.
scenario.node[*].application.CMtoCHlowerBound = ${t11 = ${t1}}s
scenario.node[*].application.joinToServerLowerBound = ${t12 = ${t1}}s
#
scenario.node[*].application.aggToServerLowerBound = ${t21 = ${t2}}s
scenario.node[*].application.joinToServerUpperBound = ${t22 = ${t2}}s
#
constraint = ($t0) < ($t1) && ($t1) < ($t2) && ($t2) < ($t3)
# END END END #
The code above creates all the possible combinations of time values for t0 to t3, where they can take values between 0.1 and 0.9.
t0 and t3 are the beginning and the end points, respectively. t1 and t2 take values based on them.
t1 will take values between t0 and t3 each time being incremented by 0.1 (see the syntax above). The same is true for t2 too.
However, I want t0 to always be smaller than t1, t1 smaller than t2, and t2 smaller than t3. I specify these conditions in the constraint section.
I am sure a thorough read through this section of the manual will help you find the solution.
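To see which combinations the constraint keeps, the same filtering logic can be mirrored in a few lines of Python. This only illustrates the logic and is not how OMNeT++ enumerates runs internally; times are kept as integer tenths of a second to avoid floating-point comparison issues.

```python
from itertools import product

T0, T3 = 1, 9              # t0 = 0.1 s and t3 = 0.9 s, in tenths of a second
steps = range(T0, T3 + 1)  # mirrors ${t0}..${t3} step 0.1

# Every (t1, t2) pair is a candidate; the condition mirrors the constraint line.
runs = [
    (T0 / 10.0, t1 / 10.0, t2 / 10.0, T3 / 10.0)
    for t1, t2 in product(steps, steps)
    if T0 < t1 < t2 < T3
]

for t0, t1, t2, t3 in runs[:3]:
    print(t0, t1, t2, t3)
```

In this model, the filter keeps 21 of the 81 candidate (t1, t2) pairs.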
If you want to change some value during the simulation you can just do that in your C++ code. Something like:
void handleMessage(cMessage *msg) {
    if (msg->getKind() == yourKind) { // replace yourKind with the one you are using for these messages
        transmission_rate = new_value;
    }
}
What you are referring to as some_variable = ${several values} can be used to perform multiple runs with different parameters. For example, one run with a rate of 1s, one with 2s and one with 10s. That would then be:
transmission_rate = ${1, 2, 10}s
For more detailed information on how to use such values (for example, to do loops), see the relevant section in the OMNeT++ User Manual.
While you can certainly manually change volatile parameters, OMNeT++ (as far as I am aware) offers no integrated support for automatic changing of parameters at runtime.
You can, however, write some model code that changes volatile parameters programmatically.