parquet-cpp writer fails to generate a readable Parquet file when there are too many binary columns - c++

I am trying to use the writer example in parquet-cpp to convert a CSV file with about 36 columns. They are all string columns, so I set the column type to variable-length byte array. I set the row group size to 1024.
It writes the schema out successfully, and I can read the metadata/header using parquet-tools, but reading the data part always fails.
Depending on the source data, I get the following errors:
can not read class parquet.format.PageHeader: don't know what type: 14
Can not read value at 0 in block -1 in file
Can anyone shed light on how to use parquet-cpp correctly for this case?


How to load multiple files into one file using Informatica

I am new to Informatica. I have created a mapping that uses expression and sorter transformations to load multiple files into one single file, which has 2 columns:
1 data
2 seq number
All 10 files have random sequence numbers, for example:
file1
erfef 3
abcdn 1
file 2
wewewr 4
wderfv 5
and so on, up to 10 files.
The expression transformation code is:
INTEGER(LTRIM(RTRIM(seq_num)),TRUE)
What I want is to load the files into one big file and sort it by the sequence number.
I got data in the output file, but the rows are not in the correct sequence-number order.
How do I get the data into the final target in the correct sequence order?
I am doing exactly what is mentioned in the solution below but still get the wrong output, like:
erfef 3
abcdn 1
wewewr 4
wderfv 5
whereas it should be like below:
abcdn 1
erfef 3
wewewr 4
wderfv 5
Thanks in advance!
Use an indirect file load with a list of files to load all the files together. Then use a sorter on col2 to order the data. Finally, use a target file to store the data.
The whole mapping should look like this:
SQ --> EXP--> SRT(key = col2) --> Target
A few things to note:
In the session, set the source to indirect file and give a list file name, for example filelist1.txt.
Use ls -1 file* >filelist1.txt in a pre-session command task to create a file list with all the required files.
Expression transformation: convert col2 to INTEGER if it arrives as a string in the SQ.
Sorter transformation: use col2 as the key column.
Using an indirect file source is one way.
Another way is to use a command as the source and specify a command that outputs the data from all the files, such as cat file*.csv.
Just change the Input Type to Command and provide the command. All of this can be set by editing the session -> Mapping tab -> Source -> Properties.

Format array content into a series of strings and output to .csv in C++?

I have been trying to figure out what the best way to do this would be, but haven't quite found an answer yet. I have a float array full of data collected from an inertial sensor and I would like to put it into the right format and output it to a CSV file. I'm using an mbed microcontroller with a local file system to store the file. It's the part about getting the format right that is confusing me at the minute.
I'd like my gyroscope/accelerometer values to be displayed in rows such as:
gx1, gy1, gz1, ax1, ay1, az1
gx2, gy2, gz2, ax2, ay2, az2
gx3, gy3, gz3, ax3, ay3, az3
I think these values first need to be converted to char before being written to the file, so I will need to do that and store them in a new array of type char. That's where I get confused: I don't just want to copy the data into this new array all at once (I was thinking of using a for loop and sprintf()); I also want it to be formatted as shown above, with the right breaks between rows.
The function that writes the content of the array to the file takes the array, the size of its element type, the number of elements, and the file object.
fwrite(converted_array, sizeof(char), sizeof(converted_array), FileObject);
What would be the best way to make sure that the array content is formatted like I want it to be?
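One possible approach, sketched below, is to build the CSV text one value at a time with snprintf() and insert a newline after every sixth value. The function name format_csv, the "%.3f" precision, and the assumption that the float array is laid out as gx, gy, gz, ax, ay, az per sample are all illustrative choices, not something stated in the question.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: turns a flat float array into CSV text.
// Assumes the data is stored sample by sample, `columns` values per row
// (gx, gy, gz, ax, ay, az in the question's layout).
std::string format_csv(const float* data, std::size_t count, std::size_t columns = 6) {
    std::string out;
    if (columns == 0) {
        return out;  // nothing sensible to format
    }
    char field[32];  // large enough for one formatted value at this precision
    for (std::size_t i = 0; i < count; ++i) {
        // Convert one value to text; three decimals is an arbitrary choice.
        std::snprintf(field, sizeof field, "%.3f", static_cast<double>(data[i]));
        out += field;
        if ((i + 1) % columns == 0) {
            out += '\n';   // end of a row
        } else if (i + 1 < count) {
            out += ", ";   // separator inside a row
        }
    }
    return out;
}
```

The resulting string can then be handed to the existing call, for example fwrite(text.data(), sizeof(char), text.size(), FileObject), so the number of bytes written comes from the text length rather than from sizeof on an array. If memory on the microcontroller is tight, writing each value directly with fprintf() in the same loop avoids the intermediate buffer.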

AWS Redshift: How to store text field with size greater than 100K

I have a text field in a Parquet file with a maximum length of 141598. I am loading the Parquet file into Redshift and got an error while loading, because the maximum a VARCHAR can store is 65535.
Is there any other data type I can use, or another alternative to follow?
Error while loading:
S3 Query Exception (Fetch). Task failed due to an internal error. The length of the data column friends is longer than the length defined in the table. Table: 65535, Data: 141598
No, the maximum length of a VARCHAR data type is 65535 bytes and that is the longest data type that Redshift is capable of storing. Note that length is in bytes, not characters, so the actual number of characters stored depends on their byte length.
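To illustrate the bytes-versus-characters point, here is a small sketch that checks a UTF-8 string against the 65535-byte limit before loading. The helper names fits_in_varchar and count_code_points are made up for this example, and the code assumes the text is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// The limit is in bytes, so std::string::size() is the value to compare.
bool fits_in_varchar(const std::string& utf8, std::size_t max_bytes = 65535) {
    return utf8.size() <= max_bytes;
}
```

A string of 5 characters that includes one two-byte character occupies 6 bytes, so a value can exceed the limit even when its character count is below 65535.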
If the data is already in parquet format then possibly you don't need to load this data into a Redshift table at all, instead you could create a Spectrum external table over it. The external table definition will only support a VARCHAR definition of 65535, the same as a normal table, and any query against the column will silently truncate additional characters beyond that length - however the original data will be preserved in the parquet file and potentially accessible by other means if needed.

Fixed Length Flat file Parsing

I have flat-file tables, say student.tbl and employee.tbl. Both are fixed-length files. Each has a supporting file that lists, for every field, the field code, field description, field offset, and field size.
For example:
ename string 0 10
eage number 10 2
ecity string 12 10
I wrote code to fetch data from the flat files using the STL in C++. I am using vectors to hold the data.
My simple algorithm to load data from a fixed-length file:
1) Read the supporting file.
2) Load the supporting file data into a 2D vector of strings, say column_records.
3) Read the table file.
4) Get the first line from the table file, say the data line.
5) Get the first column's information from the supporting table, which is the first row of column_records.
6) Chop the data line based on that column record.
7) Push the chopped data into a one-dimensional vector, say record_vector.
8) Repeat step 5 until the last column's information has been processed.
9) Push record_vector into a 2D vector, say Table_Vector.
10) Repeat step 4 until the last line of the fixed-length file has been reached.
It works fine. But my problem is in step 5.
For every fixed-length record, I feel there are repeated iterations.
I know that the column descriptions read for the first fixed-length record could be retained for the other records. But I am repeating the iteration N*M times, and I want it to be 1*M.
I know that I can store my column descriptions in a struct array. But I have many types of tables, say students.tbl and employee.tbl, and they have different sets of columns. So I think it would be a bad idea to have N struct declarations to load N supporting tables.
I want to use the same routine to fetch data from both tables, or from N tables. My supporting table format will not change; it is tab-delimited. This is my case.
How do I fetch data from a table with 1*M iterations?
I hope I can use an enumeration to do this, but I don't know how. Will using an enumeration or a macro solve this issue?
I hope my working source code is not needed for this question. If you think it is needed to answer the question, I will update the question with it. I have a medium level of English, so sorry for any inconvenience.
Thank You.
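One way to avoid a struct per table, sketched here under some assumptions, is a single generic Column struct filled once from the supporting file, then reused for every data line. The names Column, load_layout, and split_record are hypothetical, and the sketch assumes no field in the supporting file contains embedded whitespace. Each record still has to be cut once per column, but the layout strings are parsed only once instead of once per record.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One generic description that works for any table's supporting file.
struct Column {
    std::string name;
    std::string type;
    std::size_t offset;
    std::size_t size;
};

// Parses the supporting file once: "name type offset size" per line,
// separated by tabs or spaces.
std::vector<Column> load_layout(std::istream& in) {
    std::vector<Column> layout;
    Column c;
    while (in >> c.name >> c.type >> c.offset >> c.size) {
        layout.push_back(c);
    }
    return layout;
}

// Cuts one fixed-length line into fields using the numeric layout.
// Columns that start past the end of a short line come back empty.
std::vector<std::string> split_record(const std::string& line,
                                      const std::vector<Column>& layout) {
    std::vector<std::string> fields;
    fields.reserve(layout.size());
    for (const Column& c : layout) {
        fields.push_back(c.offset < line.size() ? line.substr(c.offset, c.size)
                                                : std::string());
    }
    return fields;
}
```

The same two functions can be called for student.tbl and employee.tbl alike: load the matching supporting file with load_layout once, then call split_record for each line read with std::getline.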

tar.Z file format, structure, header

I am trying to figure out the file layout of a tar.Z file (a so-called .taz file, a compressed tar file). This file can be produced with the tar -Z option or with the Unix compress utility (the results are the same).
I tried to search for documentation about this file structure, but found none. I know that it is an LZW-compressed file and starts with the magic number "1F 9D", but that's all I can figure out.
Could someone please tell me more about the file header, or anything else?
I am not interested in how to uncompress this file, or which Linux command can process it. What I want to know is the internal file structure/header/format/layout.
Thank you in advance.
A .Z file is compressed using compress and can be uncompressed with uncompress (or on some machines this is called uncompress.real). This .Z file can hold any data. .tar.Z or .taz is just a .tar file that is compressed with compress.
The first 2 bytes (MAGIC_1 and MAGIC_2) are used to check if the .Z file really is a .Z file, and not something else with accidentally the same extension. These bytes are hardcoded in the sources.
The third byte is a settings byte and holds 2 values:
The most significant bit is the block mode.
The lowest 5 bits indicate the maximum code size in bits, which determines the maximum size of the code table (the code table is used for LZW compression).
From the original code: BLOCK_MODE=0x80; byte3=(BIT|BLOCK_MODE); and BIT is in an if/else block where it is 12..16.
If block mode is turned on, an entry is added to the code table at position 256 (remember that 0..255 are filled with the values 0..255), and this entry holds the CLEAR code. Whenever the CLEAR code is read from the file's data stream, the code table has to be reset to its initial state (so it has only 0..256 in it).
The maximum code size indicates how many bits a code in the table can have. When the maximum is hit, no more entries are added to the code table. So if the maximum code size is 0b00001100, codes are at most 12 bits wide, which gives a maximum of 2^12=4096 entries.
The highest value used by compress is 16 bits. That means that 2 bits in this settings byte are unused.
After these 3 bytes the raw LZW data starts. Because the LZW table starts at 9 bits, the 4th byte will be the same as the first byte of the input (in case of a .tar.Z file, or taz file, this byte will be the first byte of the uncompressed .tar file).
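The three header bytes described above can be decoded with a few bit masks. This is a minimal sketch, not code taken from compress itself; the struct ZHeader and the function parse_z_header are names invented for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Decoded view of the 3-byte .Z header.
struct ZHeader {
    bool valid;         // true if the magic bytes 1F 9D were found
    bool block_mode;    // most significant bit of the third byte
    unsigned max_bits;  // lowest 5 bits of the third byte
};

ZHeader parse_z_header(const std::uint8_t* bytes, std::size_t length) {
    ZHeader h{false, false, 0};
    // A .Z stream needs at least the two magic bytes and the settings byte.
    if (length < 3 || bytes[0] != 0x1F || bytes[1] != 0x9D) {
        return h;
    }
    h.valid = true;
    h.block_mode = (bytes[2] & 0x80) != 0;  // BLOCK_MODE flag
    h.max_bits = bytes[2] & 0x1F;           // maximum code size in bits
    return h;
}
```

For a third byte of 0x90, this reports block mode on and a maximum code size of 16 bits. The LZW data itself starts at the fourth byte.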
A tar.Z file is just a compressed tar file, so you will only find the 1F 9D magic number telling you to uncompress it.
When uncompressed you can read the tar file header:
http://www.fileformat.info/format/tar/corion.htm
Q: this file can be produced with tar -Z option or using unix compress utility(result are same)
A: Yes. "tar -cvf myfile.tar myfiles; compress myfile.tar" is equivalent to using "-Z". An even better choice is often "j" (using bzip2 instead of compress).
Q: What is the layout of a tar file?
A: There are many references, and much freely available source. For example:
http://en.wikipedia.org/wiki/Tar_%28file_format%29
Q: What is the format of a Unix compressed file?
A: Again: many references; easy to find sample source code:
http://en.wikipedia.org/wiki/Compress
For a compressed tar file (such as .taz or .tgz) you'll need both formats: you must first uncompress it, then untar it. The "tar" utility will do both for you, automagically :)