syn1neg & syn0 created as output - word2vec

as an output for creating a Word2Vec model on ~1GB of corpus I got 3 files as an output:
word2vec_model
word2vec_model.syn1neg.npy
word2vec_model.wv.syn0.npy
I used to have only the first file on (when training a smaller corpus).
how should I treat the last 2 files when loading the model?
Should I load only the first one and run queries on it as usual?

When internal arrays of a gensim model outgrow a certain threshold, they'll be save()d as separate files, for both efficiency and to avoid limitation of plain-pickle()ing.
You should keep these files alongside the main file – for example moving them along with the main file. But you only need to load() the main filename – the name you originally provided to save(). It will then find the subsidiary files automatically.

Related

How can I explicitly specify the size of the files to be split or the number of files?

Situation:
If only specify the partition clause, it will be divided into multiple files. The size of one file is less than 1MB (~ 40 files).
What I am thinking of:
I want to explicitly specify the size of the files to be split or the number of files when registering data with CTAS or INSERT INTO.
I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Problem:
Using bucketing method (like said in above article ) can help me specify the number of file or file size. However, it also said that "Note: The INSERT INTO statement isn't supported on bucketed tables". I would like to register data daily with Athena's INSERT INTO in the data mart.
what is the best way to build a partitioned data mart without compromising search efficiency? Is it best to register the data with Glue and save it as one file?

save Large RasterBrick to file for later use

I have a Large RasterBrick, created through compiling a large number of .nc files and then manipulating in a few ways (cropping, collapsing, naming layers). I want to save this brick to a file on my laptop, so that I can access it without having to import all data and manipulate anew.
How do I do this? I think it should involve writeRaster, but I'm not sure how to specify the options.
My RasterBrick is 18 by 25, with 14975 layers, each named with the relevant date.
I tried this code from Save multi layer RasterBrick to harddisk:
outfile <- writeRaster(windstack_mn, filename='dailywindgrid.tif', format="GTiff", overwrite=TRUE,options=c("INTERLEAVE=BAND","COMPRESS=LZW"))
However, this code produce a tif file that holds a single 18 by 25 layer. I think it saved only the 1st layer of my RasterBrick, because if I bring in the saved .tif file and plot it, it looks identical to plotting the 1st layer of the original RasterBrick.
Did you look at outfile? Can you show it to us?
You should show what you do to "bring in the saved .tif". I am guessing that you do
raster('dailywindgrid.tif')
whereas you should be doing
brick('dailywindgrid.tif')
The comment/answer fr/ Robert solves my issue, with the one addition that one needs to specify the raster format. So I am now saving the file with this code:
writeRaster(StackName, filename='FileNAme.grd', format="raster", overwrite=TRUE,options=c("INTERLEAVE=BAND","COMPRESS=LZW"))
And that .grd file can later be opened using this code:
ImportName <- brick("FileNAme.grd")

Caffe GoogleNet classification.cpp gives random outputs

I used Caffe GoogleNet model to train my own data (10k images, 2 classes). I stop it at 400000th iteration with an accuracy of ~80%.
If I run the below command:
./build/examples/cpp_classification/classification.bin
models/bvlc_googlenet/deploy.prototxt
models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
data/ilsvrc12/imagenet_mean.binaryproto
data/ilsvrc12/synset_words.txt
1.png
it gives me a different -- apparently random -- result each time (i.e. if I run it n times, then I get n different results). Why? Does my training fail? Does it still use the old data from the reference model?
I don't think it is a problem with the training. Even if the training data wasn't, it should give the same (possibly wrong) output every time. If you are getting random results, it indicates that the weights are not being loaded properly.
When you load a .caffemodel against a .prototxt, caffe will load the weights of all the layers in the prototxt whose names match with the ones in the caffemodel. For the other layers, it will do a random initialisation (gaussian xavier, etc according to the specification in the prototxt).
So the best thing for you to do now is to check if the model was trained using the same prototxt you are using now.
I see that you are using GoogleNet prototxt and reference_caffenet caffemodel. Is this intentional?
When you want to deploy the fine-tuned model, you should check two main things:
Inputs:
Input image uses a BGR channel instead of RGB (e.g. opencv)
Mean file: Is same as mean file when training?
Prototxt:
When fine-tuning the model, you will change some layers' name in the original prototxt, and you should check whether the same layer name used?
And there are some
Fine-tune tricks and CS231n_transfer_learning which are very useful for fine-tuning.

Searching for means to get smaller rdf (n3) dataset

I have downloaded yago.n3 dataset
However for testing I wish to work on a smaller version of the dataset (as the dataset is 2 GB) and even though i make a small change it takes me a lot of time to debug.
Therefore, I tried to copy a small portion of the data and create a separate file, however this did not work and threw lexical errors.
I saw the earlier posts, however the earlier post is about big datasets, whereas I am searching for smaller ones.
Is there any means by which I may obtain a smaller amount of the same dataset?
If you have an RDF parser at hand to read your yago.n3 file, you can parse it and write on a separate file as many RDF triples as you want/need for your smaller dataset to run your experiments with.
If you find some data in N-Triples format (i.e. one RDF triple per line) you can just take as many line as you want and make your dataset as small as you want: head -n 10 filename.nt would give you a tiny dataset of 10 triples.

C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One of the ways to do this may be to open 2 file streams with fstream (1 in and 1 out, or just 1 in/out stream), but then I run into the difficulty that it's difficult to find and search the file position because it seems to depend on absolute position from the start of the file rather than line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.
The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level. Those semantics have to be added as an additional layer on top of the OS provided facilities. Although I have never used it, I believe that VMS has a record oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: At the highest level, its just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
A [distinctly-no-c++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
Might even be able to do:
cat <file> | sort -k 2,2 > <file>
I haven't tried that route, though.
* http://www.ss64.com/bash/sort.html - sort man page
One way to do this is not to keep the file sorted, but to use a separate index, using berkley db (BerkleyDB). Each record in the db has the sort keys, and the offset into the main file. The advantage to this is that you can have multiple ways of sorting, without duplicating the text file. You can also change lines without rewriting the file by appending the changed line at the end, and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
Try a modifed Bucket Sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to enhance I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file/io you need. Or not.
Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert contents into a middle of the file (i.e., without overwriting what was previously there); I'm not aware of production-level filesystems that support it.
I think the question is more about implementation rather than specific algorithms, specifically, handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficent way to sort the data.
Here's how I'd do it:
Parse the source file and extract the following information: sort key, offset of line in file, length of line. This information is written to another file. This produces a dataset of fixed size elements that is easy to index, call it the index file.
Use a modified merge sort. Recursively divide the index file until the number of elements to sort has reached some minimum amount - true merge sort recurses to 1 or 0 elements, I suggest stopping at 1024 or something, this will need fine tuning. Load the block of data from the index file into memory and perform a quicksort on it and then write the data back to disk.
Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
Read the sorted index file and pull out from the source file the line of data and write to the result file.
It certainly won't be quick with all that file reading and writing, but is should be quite efficient - the real killer is the random seeking of the source file in the final step. Up to that point, the disk access is usually linear and should therefore be reasonably efficient.