Generate dictionary file from Stata data - stata

I know that I can create a dta file if I have a dat file and a dictionary (dct) file. However, I want to know whether the reverse is also possible. In particular, if I have a dta file, is it possible to generate a dct file along with a dat file? (Stata has an export command that allows export as an ASCII file, but I haven't found a way to generate a dct file.) StatTransfer does generate dct and dat files, but I was wondering if it is possible without using StatTransfer.

Yes. outfile will create dictionaries as well as export data in ASCII (text) form.
If you want dictionaries and dictionaries alone, you would need to delete the data part.
If you really want two separate files, you would need to split each file produced by outfile.
Either is programmable in Stata, or you could just use your favourite text editor or scripting language.
Dictionaries are in some ways a very good idea, but they are not as important to Stata as they were in early versions.
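If you take the scripting route, here is a minimal Python sketch of the splitting step. It assumes the combined file produced by outfile ends its dictionary block at the first line consisting of a lone `}`, with the raw data following; check your file's actual layout first, and the file names are placeholders:

```python
# Split a combined dictionary+data file into separate .dct and .dat files.
# Assumes the dictionary block ends at the first line that is just "}".
def split_outfile(path, dct_path, dat_path):
    with open(path) as f:
        lines = f.readlines()
    # index of the line that closes the dictionary block
    end = next(i for i, line in enumerate(lines) if line.strip() == "}")
    with open(dct_path, "w") as f:
        f.writelines(lines[:end + 1])   # dictionary, including the "}"
    with open(dat_path, "w") as f:
        f.writelines(lines[end + 1:])   # everything after it is data
```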

Related

How can I manipulate CSVs from within C++

I am trying to create a program that can write out to a CSV (comma separated) file. Is there a way to manipulate, say, the column width, or whether a cell is left or right justified, internally from my code, so that when I open the file in Excel it looks better than a bunch of strings crammed into tiny cells? My goal is for the user to do as little thinking as possible. If they open the file and have to resize everything just to see it, that seems a little crummy.
CSV is a plain text file format. It doesn't support any visual formatting. For that, you need to write the data to another file format such as .xlsx or .ods.
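To illustrate the point, here is a minimal sketch using Python's standard csv module (the file name and data are arbitrary): the file ends up containing only values and delimiters, with no place to record widths or justification.

```python
import csv

# Only the values and delimiters are stored in the file; column widths
# and justification are chosen by whatever program opens it.
rows = [["name", "score"], ["Alice", 91], ["Bob", 7]]
with open("report.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```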

Armadillo: Save multiple datasets in one hdf5 file

I am trying to save multiple datasets into a single hdf5 file using armadillo's new feature to give custom names to datasets (using armadillo version 8.100.1).
However, only the last saved dataset will end up in the file. Is there any way to append to an existing hdf5 file with armadillo instead of replacing it?
Here is my example code:
#define ARMA_USE_HDF5
#include <armadillo>

int main() {
    arma::mat A(2, 2, arma::fill::randu);
    arma::mat B(3, 3, arma::fill::eye);
    A.save(arma::hdf5_name("multi-hdf5.mat", "dataset1"), arma::hdf5_binary);
    B.save(arma::hdf5_name("multi-hdf5.mat", "dataset2"), arma::hdf5_binary);
    return 0;
}
The hdf5 file is read out using the h5dump utility.
Unfortunately, I don't think you can do that. I'm an HDF5 developer, not an armadillo developer, but I took a peek at their source for you.
The save functions look like they are designed to dump a single matrix to a single file. In the function save_hdf5_binary() (diskio_meat.hpp:1255 for one version) they call H5Fcreate() with the H5F_ACC_TRUNC flag, which will clobber any existing file. There's no 'open if file exists' or clobber/non-clobber option. The only H5Fopen() calls are in the hdf5_binary_load() functions and those don't keep the file open for later writing.
This clobbering is what is happening in your case, btw. A.save() creates a file containing dataset1, then B.save() clobbers that file with a new file containing dataset2.
Also, for what it's worth, 'appending to an HDF5 file' is not really the right way to think about that. HDF5 files are not byte/character streams like a text file. Appending to a dataset, yes. Files, no. Think of it like a relational database: You might append data to a table, but you probably wouldn't say that you were appending data to the database.
Recent versions of Armadillo cover this case. Pass hdf5_opts::append in the save call; to add a matrix A to an existing file you can write
A.save(arma::hdf5_name(filename, dataset, arma::hdf5_opts::append));

How to split a gz csv file by its data field efficiently

I have a very large gzipped CSV file. I would like to split it into two gz files based on a string pattern in a particular column. I know it is possible to loop through the content and create two files, but is there a more efficient way to do it in Python?
In addition, the original file has a one-line header. I would like to either have the header in each of the two result files or remove it altogether.
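A minimal sketch of the loop-based approach described above, using only the standard library (gzip and csv); the column name and pattern below are placeholders. It streams the input one row at a time, so the whole file is never held in memory, and it copies the header to both outputs:

```python
import csv
import gzip

def split_gz_csv(src, out_a, out_b, column, pattern):
    """Stream src (a gzipped CSV) once, writing rows whose `column`
    contains `pattern` to out_a and all other rows to out_b.
    The header line is copied to both output files."""
    with gzip.open(src, "rt", newline="") as fin, \
         gzip.open(out_a, "wt", newline="") as fa, \
         gzip.open(out_b, "wt", newline="") as fb:
        reader = csv.reader(fin)
        writer_a, writer_b = csv.writer(fa), csv.writer(fb)
        header = next(reader)
        idx = header.index(column)       # locate the column by name
        writer_a.writerow(header)
        writer_b.writerow(header)
        for row in reader:
            (writer_a if pattern in row[idx] else writer_b).writerow(row)
```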

Scikit Learn - Working with datasets

I have been reading through some Stack Overflow questions and could not find what I was looking for; at least, I didn't think so from the posts I read.
I have some Training data set up like described here
So, I am using sklearn.datasets.load_files to read them in, since it seemed a perfect match for this setup.
But my files are already TSV bag-of-words files (i.e. each line is a word and its frequency count, separated by a tab).
To be honest, I am not sure how to proceed. The data pulled in by load_files is set up as a list where each element is the contents of each file, including the new line characters. I am not even 100% sure how the Bunch data type is tracking which files belong to which classifier folder.
I have worked with scikit-learn and TSVs before, but it was a single TSV file that held all the data, so I used pandas to read it in and then numpy.array to fetch what I needed from it. That is one of the things I attempted here, but I am not sure how to do it with multiple files where the classifier is the folder name; in the single TSV file I worked with before, each line of training data stood on its own.
Some help on getting the data to a format that is useable for training classifiers would be appreciated.
You could loop over the files and read them to build a list of dictionaries, where each dictionary maps a document's features (words) to their frequencies. Assume the file 1.txt:
import codecs
from sklearn.feature_extraction import DictVectorizer

corpus = []
# loop over the files here and repeat the following for each one
f = codecs.open("1.txt", encoding='utf8').read().splitlines()
# each line is "word<TAB>count"; convert the count to a number
corpus.append({line.split("\t")[0]: int(line.split("\t")[1]) for line in f})
# exit the loop here

vec = DictVectorizer()
X = vec.fit_transform(corpus)
You can find more details in the scikit-learn documentation for DictVectorizer.

Library for data storage and analysis

So, I have this program that collects a bunch of interesting data. I want to have a library that I can use to sort this data into columns and rows (or similar), save it to a file, and then use some other program (like OpenOffice Spreadsheet, or MATLAB since I own it, or maybe some other spreadsheet/database grapher that I don't know of) to analyse and graph the data however I want. I prefer this library to be open source, but it's not really a requirement.
OK, my mistake: you wanted a writer. Writing a CSV is simple, and apparently reading one into MATLAB is simple too:
http://www.mathworks.com.au/help/techdoc/ref/csvread.html
A CSV has a simple structure: rows are separated by newlines, and columns within a row are separated by commas.
0,10,15,12
4,7,0,3
So all you really need to do is grab your data, separate it into rows, then write out one line per row with the columns separated by commas.
If you need a code example I can edit again but this shouldn't be too difficult.
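As a minimal sketch of that recipe (shown in Python, since the asker's language is not stated; `data` stands in for whatever the program collects):

```python
# One line per row, commas between columns: that is all a CSV is.
data = [[0, 10, 15, 12], [4, 7, 0, 3]]

with open("results.csv", "w") as f:
    for row in data:
        f.write(",".join(str(value) for value in row) + "\n")
```

The resulting file can then be opened directly in a spreadsheet program or read into MATLAB with csvread.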