Read Partial Parquet file

Read Partial Parquet file - c++

I have a Parquet file and I don't want to read the whole file into memory. I want to read the metadata and then read the rest of the file on demand. That is, for example, I want to read the second page of the first column in the third-row group. How would I do that using Apache Parquet cpp library? I have the offset of the part that I want to read from the metadata and can read it directly from the disk. Is there any way to pass that buffer to Apache Parquet library to uncompress, decode and iterate through the values? How about the same thing for column chunk or row groups? Basically, I want to read the file partially and then pass it to the parquet APIs to process it as opposes to give the file handler to the API and let it go through the file. Is it possible?

Behind the scences this is what the Apache Parquet C++ library actually does. When you pass in a file handle, it will only read the parts it needs to. As it requires the file footer (the main metadata) to know where to find the segments of data, this will always be read. The data segments will only be read once you request them.
No need to write special code for this, the library already has it built-in. Thus, if you want to know in fine detail on how this is working, you only need to read the source of the library: https://github.com/apache/arrow/tree/master/cpp/src/parquet

Related

How to write file-wide metadata into parquetfiles with apache parquet in C++

I use apache parquet to create Parquet tables with process information of a machine and I need to store file wide metadata (Machine ID and Machine Name).
It is stated that parquet files are capable of storing file wide metadata, however i couldn't find anything in the documentation about it.
There is another stackoverflow post that tells how it is done with pyarrow. As far as the post is telling, i need some kind of key value pair (maybe map<string, string>) and add it to the schema somehow.
I Found a class inside the parquet source code that is called parquet::FileMetaData that may be used for this purpose, however there is nothing in the docs about it.
Is it possible to store file-wide metadata with c++ ?
Currently i am using the stream_reader_writer example for writing parquet files

You can pass the file level metadata when calling parquet::ParquetFileWriter::Open, see the source code here

Access random line in large file on Google Cloud Storage

I'm trying to read a random line out of a large file stored in a public cloud storage bucket.
My understanding is that I can't do this with gsutil and have looked into FUSE but am not sure it will fill my use case:
https://cloud.google.com/storage/docs/gcs-fuse
There are many files, which are ~50GB each -- for a total of several terabytes. If possible I would like to avoid downloading these files. They are all plain text files -- you can see them here:
https://console.cloud.google.com/storage/browser/genomics-public-data/linkage-disequilibrium/1000-genomes-phase-3/ldCutoff0.4_window1MB
It would be great if I could simply get a filesystem handle using FUSE so I could place the data directly into other scripts -- but I am okay with having to re-write them to read line by line if that is what is necessary. The key thing is -- under no circumstances should the interface download the entire file.

The Range header allows you to download specific byte offsets from within a file using the XML API.
There is no direct way to retrieve a specific line, as GCS doesn't know where in the file any given line begins/ends. Tools to find a specific line generally read a whole file in order to count line-breaks to find the desired line.
If the file has line-numbers in it then you could do a binary search to look for the desired line. You would requesting small chunks, check the line number, and then try a different location based on that until you find the desired line.
if the file doesn't have line-numbers, you could do pre-processing to make it possible. Before the initial file upload, you could scan the file and record the byte location of each Nth line. Then to get the desired line, you look up the byte location in that index and can make a range request for the relevant section.

Reading specific elements from a CSV file in C++

I'm trying to create a reference program which I think will use an excel spreadsheet to hold information for reading only. I want the user to be able to select a topic from an option list and have the information in the appropriate cell be fed back to them. The program is being written in C++. My question is, how do I access specific cells from a spreadsheet from my program? I've researched it a little and I've seen that I want to save my file as a csv and use fscanf to read the contents, but I'm at a loss as to how I would do this part. I googled it and found this thread:
http://www.daniweb.com/software-development/cpp/threads/204808/parsing-a-csv-file-separated-by-semicolons
but I think it reads in all of the data from the CSV? From what I can tell anyways. And I only want to pull specific elements. Is that possible?

If you only want specific elements, you would still have to parse all contents of the file until you reach those elements. You don't have to store values you don't need, but you do need to parse them to advance in the file.

Are you invoking the program from Excel? If you are, a little VBA goes a long way. You could always only export the cells of interest ready for your C++ program to read in.
Otherwise, other answers are correct. However, you don't need to load the entire file into memory at once. You can use std::fstream to open the file and read in each line of the file, parsing in the required information for each line.

Read csv file from website into c++

I'd like to read the contents of a .csv file from a website, into a c++ program. Specifically, it is financial data of the form from google finance.
http://www.google.com/finance/historical?cid=22144&startdate=Nov+1%2C+2011&enddate=Nov+14%2C+2011
(If you append "&output=csv" to the above link it will download the data as a csv file)
I know that I can use something like libcurl to download the file and then read it in from there, but I wanted to read it directly into the program without having to write it to a file first.
Can I get some suggestions on the best way to do this? I was thinking boost.asio but I have no experience with it (or network programming in general).

If you are trying to download it from a web resource you will need to implement at least some part of the HTTP protocol. libcurl will do this for you.
You don't need to save it as a file. This example will show you how to download and store it in a memory buffer.

Creating metadata for binary file

I have a binary file I'm creating in C++, I'm tasked to create a metadata format to describe the data that it can be read in Java using the metadata.
One record in the data file has Time, then 64 bytes of data, then a CRC, then a new line delimiter. How should the metadata look to describe what is in the 64 bytes? I've never created a metadata file before.

Probably you want to generate a file which describes how many entries there are in the data file, and maybe the time range. Depending on what kind of data you have, the metadata might contain either a per-record entry (RawData, ImageData, etc.) or one global entry (data stored as float.)
It totally depends on what the Java-code is supposed to do, and what use-cases you have. If you want to know whether to open the file at all depending on date, that should be part of the metadata, etc.

I think that maybe you have the design backwards.
First, think about the end.
What result do you want to see? A Java program will create some kind of .csv file?
What kind(s) of file(s)?
What information will be needed to do this?
Then design the metadata to provide the information that is needed to perform the necessary tasks (and any extra tasks you anticipate).
Try to make the metadata extensible so that adding extra metadata in the future will not break the programs that you are writing now. e.g. if the Java program finds metadata it doesn't understand, it just skips it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Read Partial Parquet file - c++

Related

How to write file-wide metadata into parquetfiles with apache parquet in C++

Access random line in large file on Google Cloud Storage

Reading specific elements from a CSV file in C++

Read csv file from website into c++

Creating metadata for binary file

Categories

Resources