I have a CSV file with multi-index columns.
Its size is (8, 8415).
This CSV file was made from a pandas multi-index DataFrame (Python).
Its columns are [codes X financial items].
codes are
financial items are
How can I read this CSV file so that the years (2014, 2015, ...) become the index and codes X financial items become the multi-level columns?
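For reference, pandas itself can usually read such a file back directly. A minimal sketch, assuming the first two rows of the CSV hold the two column levels (codes, then financial items) and the first column holds the years — the miniature CSV below is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical miniature of the file: two header rows for the column levels,
# years in the first column.
csv_text = (
    ",A005930,A005930,A000660\n"
    ",sales,profit,sales\n"
    "2014,100,10,50\n"
    "2015,110,12,55\n"
)

# header=[0, 1] rebuilds the multi-index columns;
# index_col=0 makes the years the index.
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
```

After this, `df.loc[2014, ("A005930", "sales")]` addresses a single cell by year, code and item.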
It is unclear what kind of output you want. There are not many C++ libraries that imitate pandas. A messy, convoluted and inelegant way of doing it is to declare a structure and then put it into a list. Something like:
struct dataframe {
    double data;
    int year;
    int code;
    std::string item; // or a fixed-size char buffer if you want to stay C-style
};
Make a list of these structures, either with a custom class or the standard std::list container.
If you can provide a more detailed explanation of what kind of data structure you want in the program, or what you want to do with it, I will try to provide a better solution.
I am new to Python; I've been using Matlab for a long time. Most of the features that Python offers outperform those of Matlab, but I still miss some of the features of Matlab structures!
Is there a similar way of grouping independent pandas dataframes into a single object? This would be of my convenience since sometimes I have to read data of different locations and I would like to obtain as many independent dataframes as locations, and if possible into a single object.
Thanks!
I am not sure that I fully understand your question, but this is where I think you are going.
You can use many of the different python data structures to organize pandas dataframes into a single group (List, Dictionary, Tuple). The list is most common, but a dictionary would also work well if you need to call them by name later on rather than position.
Note: this example uses CSV files, but the input could be any IO source that pandas supports (CSV, Excel, txt, or even a call to a database).
import pandas as pd
files = ['File1.csv', 'File2.csv', 'File3.csv']
frames = [pd.read_csv(file) for file in files]
single_df = pd.concat(frames)
You can use each frame independently by calling it from the list. The following would return the File1.csv dataframe
frames[0]
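If you would rather call the frames by name than by position, a dictionary works the same way. A minimal sketch (the file names, and the in-memory stand-in frames used here so the example is self-contained, are hypothetical):

```python
import pandas as pd

# With real files you would key each frame by its file name, e.g.:
# frames = {f.rsplit('.', 1)[0]: pd.read_csv(f) for f in files}

# In-memory stand-ins to illustrate the idea:
frames = {
    'File1': pd.DataFrame({'x': [1, 2]}),
    'File2': pd.DataFrame({'x': [3, 4]}),
}

# pd.concat on a dict uses the keys as an extra outer index level,
# so each source frame stays identifiable in the combined result.
single_df = pd.concat(frames)
```

`frames['File1']` then returns that DataFrame directly, and `single_df.loc['File2']` recovers it from the combined object.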
I wrote this list comprehension to export pandas Data Frames to CSV files (each data frame is written to a different file):
[v.to_csv(str(k)+'.csv') for k,v in df_dict.items()]
The pandas DataFrames are the values of a dictionary whose keys become part of the CSV file names. So in the code above, v are the DataFrames and k are the strings they are mapped to.
A colleague said that using list comprehensions is not a good idea for writing to output files. Why would that be? Moreover, he said that using a for loop for this would be more reliable. If true, why is that so?
A colleague said that using list comprehensions is not a good idea for writing to output files. Why would that be?
List comprehensions are usually more performant and readable than for loops when you are building a list (i.e., than using append inside a for loop to accumulate the result).
In other cases, like yours, a for loop is preferred when you want the "side effect" of an iteration.
Moreover, he said that using a for loop for this would be more reliable. If true, why is that so?
A for loop is more readable and relevant for this use case, IMHO, and should therefore be preferred:
for k,v in df_dict.items():
v.to_csv(str(k)+'.csv')
I have some data to feed to a C/C++ program, and I could easily convert it to CSV format. However, I would need a couple of extensions to the CSV standard, or at least to the parts of it I know about.
The data is heterogeneous: there are different parameters of different sizes. They can be single values, vectors, or multidimensional arrays. My ideal format would look like this:
--+ Size1
2
--+ Size2
4
--+Table1
1;2;3;4
5;6;7;8
--+Table2
1;2
"--+" is some sort of separator. I have two 1-valued parameters named symbolically Size1 and Size2 and two other multidimensional parameters Table1 and Table2. In this case the dimensions of Table1 and Table2 are given by the other two parameters.
Also rows and columns could be named, i.e. there could be a table like
--+Table3
A;B
X;1;2
Y;4;5
where element ("A","X") is 1, ("B","X") is 2, and so forth.
In other words, it is like a series of appended CSV files with names for tables, rows and columns.
The parsers should be able to exploit the structure of the file allowing me to write code like this:
parse(my_parser,"Size1",&foo->S1); // read the Size1 value and store it in foo->S1
parse(my_parser,"Size2",&foo->S2); // read the Size2 value and store it in foo->S2
foo->T2=malloc(sizeof(int)*(foo->S1)); // Table2 has Size1 elements
parse(my_parser,"Table2",foo->T2); // read Table2 into the freshly allocated buffer
If it was able to store rows and columns name it would be a bonus.
I don't think it would take much to write such a library, but I have more important things to do ATM.
Is there an already defined format like this one? With open-source libraries for C++? Do you have other suggestions for my problem?
Thanks in advance.
A.
I would use JSON, which Boost will readily handle. A scalar is a simple case of an array:
[ 2 ]
The array is easy
[ 1, 2]
Multidimensional
[ [1,2,3,4], [5,6,7,8] ]
It's been a while since I've done this sort of thing, so I'm not sure exactly how the code will break down for you, but by expanding on this you could definitely add row/column names. The code will be very clean, perhaps not quite as brainless as in Python, but it should be simple.
Here's a link for the JSON format: http://json.org
Here's a stackoverflow link for reading JSON with boost: Reading json file with boost
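If the side producing the data happens to be Python (an assumption; the question does not say how the data is generated), emitting this JSON is a one-liner with the standard library. A minimal sketch using the parameter names from the question:

```python
import json

# The question's example data: two scalar sizes and two tables.
data = {
    "Size1": 2,
    "Size2": 4,
    "Table1": [[1, 2, 3, 4], [5, 6, 7, 8]],
    "Table2": [1, 2],
}

# One self-describing document the C++ side can parse with Boost.
text = json.dumps(data)
```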
A good option could be YAML.
It's a well-known, human-friendly data serialization standard for programming languages.
It fits your needs quite well: YAML syntax is designed to map easily onto the data types common to most high-level languages: vectors, associative arrays and scalars:
Size1: 123
---
Table1: [[1.0,2.0,3.0,4.0], [5.0,6.0,7.0,8.0]]
There are good libraries for C, C++ and many other languages.
To get a feel for how it can be used see the C++ tutorial.
For interoperability you could also consider the way OpenCV uses YAML format:
%YAML:1.0
frameCount: 5
calibrationDate: "Fri Jun 17 14:09:29 2011\n"
cameraMatrix: !!opencv-matrix
rows: 3
cols: 3
dt: d
data: [ 1000., 0., 320., 0., 1000., 240., 0., 0., 1. ]
Since JSON and YAML have many similarities, you could also take a look at: What is the difference between YAML and JSON? When to prefer one over the other
Thanks everyone for the suggestions.
The data is primarily numeric, with many dimensions, and given its size it could be slow to parse with those text formats. For now, I found that the quickest and cleanest way is to use a database.
I still think it may be overkill, but there are no clearly better alternatives at the moment, IMHO.
I am trying to write an HDF5 file from C++. The file basically contains a large timeseries matrix in the following format
TimeStamp Property1 Property2
I have managed to write the data successfully, I created a dset and used the H5Dwrite function.
Now my question is how do I create a file header, in other words, if I want to write the following array to the file...
['TimeStamp', 'Property1', 'Property2']
...and tag it to the columns for ease of later use (I am planning to analyze the matrix in Python). How do I do that?
I tried to use H5Dwrite to write a string array but failed; I guess it wants a consistent datatype, and mine is all floats. Then I read about metadata, but I am a bit lost as to how to use it. Any help would be much appreciated.
A related side question is can the first row of a matrix be a string and the others rows contain doubles?
Clean solution(s)
If you store your data as a 1D array of a compound datatype with members TimeStamp, Property1, Property2, etc. then the field names will be stored as metadata and it should be easy to read in Python.
I think there is another clean option but I will just mention it since I never used it myself: HDF5's Table Interface. Read the docs to see if you would prefer to use that.
Direct answers to your question
Now the dirty options: you could add string attributes to your existing dataset. There are multiple ways to do that. You could have a single string attribute with all the field names separated by semicolons, or one attribute per column. I don't recommend it since that would be terribly non-standard.
A related side question is can the first row of a matrix be a string and the others rows contain doubles?
No.
Example using a compound datatype
Assuming you have a struct defined like this:
struct Point { double timestamp, property1, property2; };
and a vector of Points:
std::vector<Point> points;
as well as a dataset dset and appropriate memory and file dataspaces, then you can create a compound datatype like this:
H5::CompType type(sizeof(Point));
type.insertMember("TimeStamp", HOFFSET(Point, timestamp), H5::PredType::NATIVE_DOUBLE);
type.insertMember("Property1", HOFFSET(Point, property1), H5::PredType::NATIVE_DOUBLE);
type.insertMember("Property2", HOFFSET(Point, property2), H5::PredType::NATIVE_DOUBLE);
and write data to file like this:
dset.write(&points[0], type, mem_space, file_space);
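On the Python side, h5py exposes such a compound dataset as a NumPy structured array, so the field names travel with the data. A sketch of what the analysis side sees, using a plain NumPy array as a stand-in for the h5py dataset (the values are made up):

```python
import numpy as np

# The compound datatype above maps to this structured dtype in Python.
dtype = np.dtype([('TimeStamp', 'f8'),
                  ('Property1', 'f8'),
                  ('Property2', 'f8')])

# Stand-in for what h5py would hand back from the file.
data = np.array([(1.0, 10.0, 100.0),
                 (2.0, 20.0, 200.0)], dtype=dtype)

# Columns are addressed by name; with h5py it would look like
# f['dset']['TimeStamp'] for a dataset named 'dset'.
timestamps = data['TimeStamp']
```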
So, I have this program that collects a bunch of interesting data. I want a library that I can use to sort this data into columns and rows (or similar) and save it to a file, so that I can then use some other program (like OpenOffice Spreadsheet, or MATLAB since I own it, or maybe some other spreadsheet/database grapher that I don't know of) to analyse and graph the data however I want. I would prefer this library to be open source, but it's not really a requirement.
Ok, so my mistake, you wanted a writer. Writing a CSV is simple, and apparently reading one into MATLAB is simple too.
http://www.mathworks.com.au/help/techdoc/ref/csvread.html
A CSV has a simple structure: rows are separated by newlines, and columns within a row are separated by commas.
0,10,15,12
4,7,0,3
So all you really need to do is take your data, split it into rows, then write each row out as a line with the columns separated by commas.
If you need a code example I can edit again but this shouldn't be too difficult.
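For illustration, here is a minimal writer sketch using Python's standard csv module (the question does not name a language, so this is just one option; the row values and column names are made up):

```python
import csv
import io

# Hypothetical collected data: two rows of four numeric columns.
rows = [
    [0, 10, 15, 12],
    [4, 7, 0, 3],
]

# Writing to an in-memory buffer; with a real file you would use
# open('data.csv', 'w', newline='') instead of io.StringIO().
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['c1', 'c2', 'c3', 'c4'])  # optional header row
writer.writerows(rows)                     # one line per row, comma-separated
csv_text = buf.getvalue()
```

The resulting file opens directly in OpenOffice, and MATLAB's csvread can load the numeric part.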