So I am trying to read in a bunch of very large data files and each one takes quite some time to load. I am trying to figure out how to load them in the quickest way and without running into memory problems. Once the data files are loaded into the array the correct way I do not need to write to them, but just need to read. I've been trying to parallelize this for some time, but can't figure it out.
Let's say I have 400 time files. Each of these files is tab separated and has 30 variables each with 40,000 data points. I would like to create a 400x30x40000 array so that I can access the points easily.
The data file is set up so that the first 40k points are for variable 1, the second 40k are for variable 2, and so on.
I have written a function that loads in a time file correctly and stores it in my array correctly. What I'm having trouble with is parallelizing it. This does work if I put it in a for loop and iterate over i.
import h5py
import pandas as pd
h5file = h5py.File('data.h5','a')
data = h5file.create_dataset("default",(len(files),len(header),numPts))
# is shape 400x30x40000
def loadTimes(files,i,header,numPts,data):
    # files has 400 elements
    # header has 30 elements
    # numPts is an integer
    allData = pd.read_csv(files[i],delimiter="\t",skiprows=2,header=None).T
    for j in range(0,len(header)):
        data[i,j,:] = allData[0][j*numPts:(j+1)*numPts]
    del allData
files is the list of time files loaded by subprocess.check_output (has about 400 elements), header is the list of variables, loaded from another file (has 30 elements in it). numPts is the number of points per variable (so around 40k).
I've tried using pool.map to load the data but found it didn't like multiple arguments. I also tried using partial, zip, and the lambda function, but none of those seem to like my arrays.
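For reference, one common way around the single-argument limit of pool.map is functools.partial, which freezes the extra arguments and leaves a callable that takes only the file index. The following is a minimal sketch under stated assumptions, not the code from this post: parse_time_file and load_all are hypothetical names, texts is an in-memory stand-in for the real files, and the worker returns its data instead of writing into a shared array.

```python
from functools import partial
from multiprocessing import Pool


def parse_time_file(i, texts, num_vars, num_pts):
    """Split the i-th flat column of numbers into num_vars blocks of num_pts.

    texts is a hypothetical stand-in for the file contents; a real version
    would read files[i] from disk (for example with pandas.read_csv).
    """
    values = [float(tok) for tok in texts[i].split()]
    if len(values) != num_vars * num_pts:
        raise ValueError("file %d has %d values" % (i, len(values)))
    # Block j holds variable j: the first num_pts values are variable 0, etc.
    blocks = [values[j * num_pts:(j + 1) * num_pts] for j in range(num_vars)]
    return i, blocks


def load_all(texts, num_vars, num_pts, processes=2):
    # partial() binds the extra arguments, so pool.map only supplies the index.
    task = partial(parse_time_file, texts=texts,
                   num_vars=num_vars, num_pts=num_pts)
    with Pool(processes) as pool:
        # Each worker returns (index, blocks); collect them into a dict.
        return dict(pool.map(task, range(len(texts))))
```

load_all should be called from under an if __name__ == '__main__': guard. Everything bound by partial is pickled and sent to the workers, so in practice it is cheaper to bind file paths and small integers than large arrays.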
I am not set in stone about this method. If there is a better way to do it I will greatly appreciate it. It will just take too long to load in all this data one at a time. My calculations show that it would take ~3hrs to load on my computer using one core. And I will use up A LOT of my memory. I have access to another machine with a lot more cores, which is actually where I will be doing this, and I'd like to utilize them properly.
So how I solved this was by using the h5 file format. What I did was write the loops so that they only had the iteration index as an argument.
def LoadTimeFiles(i):
    from pandas import read_csv
    import h5py as h5
    # Each worker opens the shared HDF5 file itself, so only the index i is passed in.
    dataFile = h5.File('data.h5','r+')
    rFile = dataFile['files'][i]
    data = dataFile['data']
    lheader = data.shape[1]  # number of variables
    numPts = data.shape[2]   # points per variable
    allData = read_csv(rFile,delimiter="\t",skiprows=2,header=None,low_memory=False).T
    for j in range(0,lheader):
        data[i,j,:] = allData[0][j*numPts:(j+1)*numPts]
    del allData
    dataFile.close()
def LoadTimeFilesParallel(np):
    from multiprocessing import Pool
    import h5py as h5
    files = h5.File('data.h5','r')
    numFiles = files['data'].shape[0]
    files.close()
    pool = Pool(np)  # np is the number of worker processes
    pool.map(LoadTimeFiles,range(numFiles))
    pool.close()
    pool.join()
if __name__ == '__main__':
    from multiprocessing import freeze_support
    freeze_support()  # must be called; only has an effect in frozen Windows executables
    np = 5
    LoadTimeFilesParallel(np)
So since I was storing the data in the h5 format anyway I thought I'd be tricky and load it up in each loop (I can see no time delays in reading the h5 files). I added the option low_memory=False to the read_csv command because it made it go faster. The j loop was really fast so I didn't need to speed it up.
Now each LoadTimeFiles call takes about 20-30 seconds, and we run 5 at once without order mattering. My RAM usage never goes above 3.5 GB (total system usage) and drops back to under 1 GB after the runs.
I use C++ code to read pictures from a WMTS server using GDAL.
First I initialize GDAL once:
...
OGRRegisterAll();
etc.
But a new connection is opened every time I want to read a new image (different URLs):
gdalDataset = GDALOpen(my_url, GA_ReadOnly);
URL example: https://sampleserver6.arcgisonline.com/arcgis/rest/services/Toronto/ImageServer/tile/12/1495/1145
Unfortunately I didn't find a way to read multiple images over the same connection.
Is there such option in GDAL or in WMTS?
Are there other ways to improve timing (I read thousands of images)?
While GDAL can read PNG files, it doesn't add much since those lack any geographical metadata.
You probably want to interact with the WMS server instead, not the images directly. You can for example run gdalinfo on the main url to see the subdatasets:
gdalinfo https://sampleserver6.arcgisonline.com/arcgis/services/Toronto/ImageServer/WMSServer?request=GetCapabilities&service=WMS
The first layer seems to have an issue, I'm not sure, but the other ones seem to behave fine.
I hope you don't mind me using some Python code, but the C++ API should be similar. Or you could try using the command-line utilities first (gdal_translate) to get familiar with the service.
See the WMS driver for more information and examples:
https://gdal.org/drivers/raster/wms.html
You can for example retrieve a subset and store it with:
from osgeo import gdal
url = r"WMS:https://sampleserver6.arcgisonline.com:443/arcgis/services/Toronto/ImageServer/WMSServer?SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&LAYERS=Toronto%3ANone&SRS=EPSG:4326&BBOX=-79.454856,43.582524,-79.312167,43.711781"
bbox = [-79.35, 43.64, -79.32, 43.61]
filename = r'D:\Temp\toronto_subset.tif'
ds = gdal.Translate(filename, url, xRes=0.0001, yRes=0.0001, projWin=bbox)
ds = None
Which looks like:
import numpy as np
import matplotlib.pyplot as plt
ds = gdal.OpenEx(filename)
img = ds.ReadAsArray()
ds = None
mpl_extent = [bbox[i] for i in [0,2,3,1]]
fig, ax = plt.subplots(figsize=(5,5), facecolor="w")
ax.imshow(np.moveaxis(img, 0, -1), extent=mpl_extent)
Note that the data in native resolution for this type of service is often ridiculously large, so usually you want to specify a subset and/or a limited resolution as the output.
UPDATE: I had each node write to a separate file, and when the separate files were concatenated together the result was correct. I also updated the code to attempt a channel flush and file sync after each write of a single record, but there are still issues between nodes 0 and 1, now. If I make Node 0 sleep for a few seconds before it starts its iteration of the coforall loop, the records come out correct. If not, the last few hundred bytes of Node 0's records seem to be reliably overwritten with NULL bytes, up to the start of Node 1's records. The issues between Node 1 and Node 2, and Node 2 and Node 3, seem to not show up anymore.
Additionally, if I suppress either Node 0 or Node 1 from writing, I see the fully-formed records from the un-suppressed node written correctly to the file. In the case that Node 1 is suppressed, I see 9,997 100B records (or 999,700 correct bytes) followed by NULL bytes in the file where Node 1's suppressed records would go. In the case that Node 0 is suppressed, I see exactly 999,700 NULL bytes in the file, after which Node 1's records begin.
Original Post:
I'm trying to troubleshoot an issue with parallel writes from different nodes to a shared NFS-backed file on disk. At the moment, I suspect that something is wrong with the way writes to the disk happen on the NFS server.
I'm working on adapting MPI+C code that uses pwrite to write to coordinated chunks of a file. If I try to have the equivalent locales in Chapel write to the file inside of a coforall loop, I end up with the bits of the file around the node boundaries messed up - usually the final few hundred bytes of each node's data are garbled. However, if I have just one locale iterate through the data on all locales and write it, the data comes out correctly. That is, I use the same data structures to calculate the offsets, but only Locale 0 seeks to that offset and performs the writes.
I've verified that the offsets into the file that each locale runs do not overlap, and I'm using a single channel per task, defined from within the on loc do block, so that tasks don't share a single channel.
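The kind of offset check described here can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Chapel code: byte_ranges and ranges_overlap are hypothetical helper names, and the per-locale record counts passed in are made up apart from the 100-byte record width.

```python
from itertools import accumulate

RECORD_BYTES = 100  # record width on disk, as in the 100-byte lines written here


def byte_ranges(records_per_locale, record_bytes=RECORD_BYTES):
    """Return one (start, end) byte range per locale, end exclusive."""
    # Exclusive prefix sum: locale k starts after all records of locales < k.
    starts = [0] + list(accumulate(records_per_locale))[:-1]
    return [(s * record_bytes, (s + n) * record_bytes)
            for s, n in zip(starts, records_per_locale)]


def ranges_overlap(ranges):
    """True if any two half-open byte ranges intersect."""
    ordered = sorted(ranges)
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(ordered, ordered[1:]))
```

For example, ranges_overlap(byte_ranges([9997, 10003])) is False, because the second range starts exactly where the first one ends.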
Are there known issues with writing to a file from different locales? A lot of the documentation makes it seem like this is known to be safe, but an unsubstantiated guess seems to indicate that there are issues with caching of file contents; when examining the incorrect data, the bits that are incorrect seem to be the original data from the file in that location at the beginning of the program.
I've included the relevant routine below, in case you easily spot something I missed. To make this serial, I convert the coforall loc in Locales and on loc do block into a for j in 0..numLocales-1 loop, and replace here.id with j. Please let me know what else would help get to the bottom of this. Thanks!
proc write_share_of_data(data_filename: string, ref record_blocks) throws {
  coforall loc in Locales {
    on loc do {
      var data_file: file = open(data_filename, iomode.cwr);
      var data_writer = data_file.writer(kind=ionative, locking=false);
      var line: [1..100] uint(8);
      const indices = record_blocks[here.id].D;
      var local_record_offset = + reduce record_blocks[0..here.id-1].D.size;
      writeln("Loc ", here.id, ": record offset is ", local_record_offset);
      var local_bytes_offset = terarec.terarec_width_disk * local_record_offset;
      data_writer.seek(start=local_bytes_offset);
      for i in indices {
        var write_rec: terarec_t = record_blocks[here.id].records[i];
        line[1..10] = write_rec.key;
        line[11..98] = write_rec.value;
        line[99] = 13; // \r
        line[100] = 10; // \n
        data_writer.write(line);
        lines_written += 1;
      }
      data_file.fsync();
      data_writer.close();
      data_file.close();
    }
  }
  return;
}
Adding an answer here that solved my particular problem, though it doesn't explain the behavior seen. I ended up changing the outer loop from coforall loc in Locales to for loc in Locales. This isn't too big of an issue since it's all writing to one file anyway - I doubt that multiple locales can actually make much headway in all attempting to write concurrently to a single file on an NFS server. As a result, the change still allows nodes to write the data they have locally to NFS, rather than forcing Node 0 to collect and then write the data on behalf of all locales. This amounts to only adding idle time to the write operation commensurate with the time it takes Locale 0 to start the remote task on other nodes when the previous node has finished writing, which for the application at hand is not a concern.
Have you tried specifying start/end in file.writer instead of using seek? Does that change anything? What about specifying the end offset for the channel.seek call? Does it matter if the file is created and has the appropriate size before you start?
Other than that, I wonder if this issue would appear for both NFS and Lustre. If it appears for both it might well be a Chapel bug. It sounds from your description that the C program was using this pattern, which points to it being a bug. But, have you run C code doing this on your setup? If it being a Chapel bug seems most likely after further investigation, we would appreciate a bug report issue with a reproducer.
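If it helps to have a reproducer outside Chapel, the pwrite pattern can be sketched with Python's os.pwrite (Unix only). This is a rough sketch under assumptions: write_chunk, write_disjoint and check_roundtrip are hypothetical names, and the writers here are threads in one process on a local temporary file, which is not the same as separate nodes on an NFS mount. To probe NFS, one would run one process per node against a path on the shared mount, each writing only its own chunk, and compare the file afterwards.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_chunk(fd, offset, payload):
    """Write payload at an absolute offset; os.pwrite ignores the file position."""
    written = 0
    while written < len(payload):  # pwrite may write fewer bytes than asked
        written += os.pwrite(fd, payload[written:], offset + written)
    return written


def write_disjoint(path, chunks):
    """Write each chunk to its own region of path from a separate thread."""
    offsets, position = [], 0
    for chunk in chunks:
        offsets.append(position)
        position += len(chunk)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
            futures = [pool.submit(write_chunk, fd, off, chunk)
                       for off, chunk in zip(offsets, chunks)]
            for future in futures:
                future.result()  # re-raise any write error
        os.fsync(fd)
    finally:
        os.close(fd)


def check_roundtrip(chunks):
    """Write chunks to a temporary file and compare with the expected bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.dat")
        write_disjoint(path, chunks)
        with open(path, "rb") as handle:
            return handle.read() == b"".join(chunks)
```

On a local file system check_roundtrip is expected to return True; a False result on the NFS variant but not locally would point at the file system rather than the language runtime.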
I know that NFS does not always do what one would like, in terms of data consistency. It's my understanding that it has "close to open" semantics but it's unclear to me what that means in the context of opening a file and writing to a particular region within it, in parallel from different locales.
From Why NFS Sucks by Olaf Kirch:
An NFS client is permitted to cache changes locally and send them to
the server whenever it sees fit. This sort of lazy write-back greatly
helps write performance, but the flip side is that everyone else will
be blissfully unaware of these change before they hit the server. To
make things just a little harder, there is also no requirement for a
client to transmit its cached write in any particular fashion, so
dirty pages can (and often will be) written out in random order.
I read two implications from this paragraph that are relevant to your situation here:
1. The writes you do on different locales can be observed by the NFS server in an arbitrary order. (However, as I understand it, the data should be sent to the server by the time your fsync call returns.)
2. These writes are done at an OS page granularity (usually 4k). (Note that this is more a hypothesis I am making than it is a fact. It should be tested or further investigated.)
It would be interesting to check if 2. is a plausible explanation for the behavior you are seeing. For example, you could explore having each locale operate on a multiple of 4096 records (or potentially try writing records of 4096 bytes each) and see if that changes the behavior. If 2 is indeed the explanation, it should be possible to create a C program that demonstrates the behavior as well.
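To make the page-granularity hypothesis concrete, here is a small Python sketch of the arithmetic. It assumes a 4096-byte page and the 100-byte records from the question; shared_page and aligned_record_count are hypothetical helper names. It only shows which bytes would share a page across a writer boundary; it does not establish that this is what NFS is doing.

```python
from math import gcd

PAGE_BYTES = 4096   # assumed OS page size
RECORD_BYTES = 100  # record width from the question


def shared_page(boundary_byte, page_bytes=PAGE_BYTES):
    """Describe the page that straddles a writer boundary, or None if aligned.

    Returns (page_index, bytes_before, bytes_after): how many bytes of that
    page belong to the writer before the boundary and to the one after it.
    """
    bytes_before = boundary_byte % page_bytes
    if bytes_before == 0:
        return None  # boundary sits on a page edge, so no page is shared
    return boundary_byte // page_bytes, bytes_before, page_bytes - bytes_before


def aligned_record_count(records, record_bytes=RECORD_BYTES,
                         page_bytes=PAGE_BYTES):
    """Round records down so records * record_bytes is a multiple of page_bytes."""
    step = page_bytes // gcd(page_bytes, record_bytes)
    return (records // step) * step
```

With the figures from the update (9,997 records of 100 bytes, so a boundary at byte 999,700), shared_page(999700) gives (244, 276, 3820): the last 276 bytes of Node 0's data would sit in the same page as the start of Node 1's. That is at least the same order of magnitude as the "last few hundred bytes" reported, though it is not proof. aligned_record_count(9997) gives 9216, a record count whose byte size ends exactly on a page edge.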
I am new to Python and the multiprocessing module. I created a far simplified version of what I am trying to accomplish to distill my problem. The issue is that the variables don't seem to update when accessed outside of the function where they are appended (that is, outside the worker processes).
After researching I thought it might have something to do with queues? However, I believe queues to be more about sharing memory between the processes which I don't believe is required in my situation, since each list could be appended independently.
from multiprocessing import Pool
def build(array):
    array.append("hello")
    return array
if __name__ == '__main__':
    x=["yo","sup"]
    y=["blah", "blah"]
    z=["apple","banana"]
    w=["cats", "dogs"]
    p=Pool(4)
    p.map(build,[x,y,z,w])
    p.close()
    p.join()
    print(x, y, z, w)
When I run the code above, it simply prints x, y, z, w as they were input, without "hello" appended to each list, and I cannot figure out why. I know that if I put the print statement at the end of the function build, it will output the appended lists. I also realize that I could do the following:
results = p.map(build,[x,y,z,w])
print(results)
However, in my actual project I need to use x, y, z, w later and would prefer not to index results to get the list I am looking for. Is there any way to have the changes made to the lists stick, so to speak, outside of the worker processes?
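Each process has its own memory, so your lists are copied (pickled) into the pool workers' memory and are changed only there. The lists in the parent process are never touched; the modified copies only come back as the return value of p.map.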
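A minimal sketch of what that means in practice. simulate_worker_call and build_all are hypothetical names added for illustration; the first mimics the pickle round trip a pool argument goes through, the second returns the modified copies so they can be assigned back to the original names.

```python
import pickle
from multiprocessing import Pool


def build(array):
    array.append("hello")
    return array


def simulate_worker_call(func, arg):
    """Mimic what Pool does to an argument: pickle it, unpickle a copy, call func."""
    worker_copy = pickle.loads(pickle.dumps(arg))
    return func(worker_copy)  # only the copy is modified


def build_all(lists, processes=4):
    """Run build in worker processes and return the modified copies, in order."""
    with Pool(processes) as pool:
        return pool.map(build, lists)
```

Because pool.map preserves input order, the results can be unpacked straight back into the same names, with no indexing: x, y, z, w = build_all([x, y, z, w]) (called from under the if __name__ == '__main__': guard).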
I am currently setting up a CNN, and most of my debugging time consists of waiting for the preprocessing step to finish. My input/output data is stored in variables. Is it possible in Python (as in R) to store these variables as local files, so that instead of spending time generating and structuring my input/output data on every run I can just load a file?
My attempt in trying to use pickle:
import pickle
if feature == 0:
    print("Save!")
    # One pickle file per array; "with" makes sure each file is flushed and closed.
    with open("input_train_0.pckl",'wb') as input_train:
        pickle.dump(data_train_input,input_train)
    with open("input_test_0.pckl",'wb') as input_test:
        pickle.dump(data_test_input,input_test)
    with open("output_train_0.pckl",'wb') as output_train:
        pickle.dump(data_train_output,output_train)
    with open("output_test_0.pckl",'wb') as output_test:
        pickle.dump(data_test_output,output_test)
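The missing half is loading the data back on later runs. Here is a minimal sketch of a save/load cache built on pickle; save_arrays, load_arrays and load_or_build are hypothetical helper names, and the single-file layout (one dict holding all arrays) is an assumption rather than what the attempt above does.

```python
import os
import pickle


def save_arrays(path, **arrays):
    """Store several named objects together in one pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(arrays, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_arrays(path):
    """Return the dict of named objects written by save_arrays."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_or_build(path, build):
    """Load cached data if path exists, otherwise call build() and cache it."""
    if os.path.exists(path):
        return load_arrays(path)
    arrays = build()  # build must return a dict of name -> object
    save_arrays(path, **arrays)
    return arrays
```

The slow preprocessing then goes inside the build function, which only runs when the cache file does not exist yet; deleting the file forces a rebuild.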
I'd like to build and train a multi-layer LSTM model (state_is_tuple=True) in Python, and then load and use it in C++. But I'm having a hard time figuring out how to feed and fetch states in C++, mainly because I don't have string names which I can reference.
E.g. I put the initial state in a named scope such as
with tf.name_scope('rnn_input_state'):
    self.initial_state = cell.zero_state(args.batch_size, tf.float32)
and this appears in the graph as below, but how can I feed to these in C++?
Also, how can I fetch the current state in C++? I tried the graph construction code below in python but I'm not sure if it's the right thing to do, because last_state should be a tuple of tensors, not a single tensor (though I can see that the last_state node in tensorboard is 2x2x50x128, which sounds like it just concatenated the states as I have 2 layers, 128 rnn size, 50 mini batch size, and lstm cell - with 2 state vectors).
with tf.name_scope('outputs'):
    outputs, last_state = legacy_seq2seq.rnn_decoder(inputs, self.initial_state, cell, loop_function=loop if infer else None)
    output = tf.reshape(tf.concat(outputs, 1), [-1, args.rnn_size], name='output')
and this is what it looks like in tensorboard
Should I concat and split the state tensors so there is only ever one state tensor going in and out? Or is there a better way?
P.S. Ideally the solution won't involve hard-coding the number of layers (or rnn size). So I can just have four strings input_node_name, output_node_name, input_state_name, output_state_name, and the rest is derived from there.
I managed to do this by manually concatenating the state into a single tensor. I'm not sure if this is wise, since this is how TensorFlow used to handle states, and it is now deprecating that in favour of tuple states. Instead of setting state_is_tuple=False and risking my code becoming obsolete soon, I've added extra ops to manually stack and unstack the states to and from a single tensor. That said, it works fine in both Python and C++.
The key code is:
# setting up
zero_state = cell.zero_state(batch_size, tf.float32)
state_in = tf.identity(zero_state, name='state_in')
# based on https://medium.com/#erikhallstrm/using-the-tensorflow-multilayered-lstm-api-f6e7da7bbe40#.zhg4zwteg
state_per_layer_list = tf.unstack(state_in, axis=0)
state_in_tuple = tuple(
    # TODO make this not hard-coded to LSTM
    [tf.contrib.rnn.LSTMStateTuple(state_per_layer_list[idx][0], state_per_layer_list[idx][1])
     for idx in range(num_layers)]
)
outputs, state_out_tuple = legacy_seq2seq.rnn_decoder(inputs, state_in_tuple, cell, loop_function=loop if infer else None)
state_out = tf.identity(state_out_tuple, name='state_out')
# running (training or inference)
state = sess.run('state_in:0') # zero state
for x in batches:  # batches: placeholder for however the input batches are produced
    feed = {'data_in:0': x, 'state_in:0': state}
    [y, state] = sess.run(['data_out:0', 'state_out:0'], feed)
Here is the full code if anyone needs it
https://github.com/memo/char-rnn-tensorflow