How can I read a compressed RDF zip using clj-plaza? - clojure

I have recently got hold of Freebase's RDF dump. It is a compressed zip file of around 25GB, but the uncompressed version can go up to 250GB.
I have it all set up on an EC2 instance.
There is a note that reads:
If you're writing your own code to parse the RDF dumps it's often more efficient to read directly from the GZip file rather than extracting the data first and then processing the uncompressed data.
I just started looking into clj-plaza to query RDF, and now I am wondering: how do I read this data without unzipping the file?

Something like this, having referred plaza.rdf.core, should do the trick; it streams the data instead of decompressing the whole file to disk first:

(use 'plaza.rdf.core) ;; provides document-to-model

;; GZIPInputStream decompresses the dump on the fly as it is read.
(with-open [stream (java.util.zip.GZIPInputStream.
                     (clojure.java.io/input-stream
                       (clojure.java.io/file "my-file.zip")))]
  (document-to-model stream :ntriple))

Related

Read Partial Parquet file

I have a Parquet file and I don't want to read the whole file into memory. I want to read the metadata and then read the rest of the file on demand. That is, for example, I want to read the second page of the first column in the third row group. How would I do that using the Apache Parquet C++ library?
I have the offset of the part that I want to read from the metadata and can read it directly from the disk. Is there any way to pass that buffer to the Apache Parquet library to uncompress, decode and iterate through the values? How about the same thing for a column chunk or a row group?
Basically, I want to read the file partially and then pass it to the Parquet APIs to process it, as opposed to giving the file handle to the API and letting it go through the file. Is it possible?
Behind the scenes this is what the Apache Parquet C++ library actually does. When you pass in a file handle, it will only read the parts it needs to. As it requires the file footer (the main metadata) to know where to find the segments of data, this will always be read. The data segments are only read once you request them.
There is no need to write special code for this; the library already has it built in. If you want to know in fine detail how this works, you only need to read the source of the library: https://github.com/apache/arrow/tree/master/cpp/src/parquet
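To make that concrete, here is a rough sketch of how the Parquet C++ reader exposes on-demand access to a particular row group and column. The file name, the indices, and the assumption that the column is a required INT64 column are made up for illustration, and header and class names may differ slightly between Arrow releases.

#include <parquet/api/reader.h>
#include <iostream>
#include <memory>

int main() {
    // Opening the file only reads the footer metadata.
    std::unique_ptr<parquet::ParquetFileReader> reader =
        parquet::ParquetFileReader::OpenFile("data.parquet");  // hypothetical file

    // Third row group (index 2), first column (index 0): the matching
    // column chunk is read from disk only when values are pulled from it.
    std::shared_ptr<parquet::RowGroupReader> row_group = reader->RowGroup(2);
    std::shared_ptr<parquet::ColumnReader> column = row_group->Column(0);

    // Assuming a required INT64 column; otherwise definition/repetition
    // level buffers must be passed to ReadBatch instead of nullptr.
    auto* int_reader = static_cast<parquet::Int64Reader*>(column.get());
    int64_t values[1024];
    while (int_reader->HasNext()) {
        int64_t values_read = 0;
        int_reader->ReadBatch(1024, nullptr, nullptr, values, &values_read);
        std::cout << "read " << values_read << " values" << std::endl;
    }
    return 0;
}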

zlib gzopen() returns a compressed file stream. Does it decompress the file?

zlib gzopen() returns a compressed file stream.
gzFile data_file;
data_file = gzopen(filename.c_str(), "r");
where data_file is the compressed file stream.
I may have missed this in the zlib docs, but does this decompress the opened file? Or is gzopen() able to directly parse the gzipped file without needing to decompress it?
Thanks in advance!
gzopen doesn't decompress the data; gzread decompresses it as you read.
I haven't found a simple statement of this fact in the zlib docs, but if you want to "prove" it to yourself, create a large (several GB) compressed file and then measure how quickly gzopen returns. It "obviously" doesn't take the time required to decompress the whole file. If you look into how gzip compression is defined, you'll see that it's designed to be written and read as a stream; that is to say, you don't need to decompress the whole file at once.
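As a small sketch of that streaming behaviour (the file name and buffer size are arbitrary), gzread decompresses only as much as each call asks for:

#include <zlib.h>
#include <cstdio>

int main() {
    gzFile data_file = gzopen("big_file.gz", "rb");  // returns almost immediately
    if (data_file == nullptr) return 1;

    char buf[16384];
    int n;
    // Each gzread call inflates just enough of the stream to fill (up to) buf.
    while ((n = gzread(data_file, buf, sizeof buf)) > 0) {
        std::fwrite(buf, 1, n, stdout);  // process n decompressed bytes here
    }
    gzclose(data_file);
    return 0;
}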

Chunk download with OneDrive Rest API

This is the first time I am writing on StackOverflow. My question is the following.
I am trying to write a OneDrive C++ API based on the cpprest SDK (Casablanca) project:
https://casablanca.codeplex.com/
In particular, I am currently implementing read operations on OneDrive files.
Actually, I have been able to download a whole file with the following code:
http_client api(U("https://apis.live.net/v5.0/"), m_http_config);
api.request(methods::GET, file_id + L"/content").then([=](http_response response) {
    return response.body();
}).then([=](istream is) {
    streambuf<uint8_t> rwbuf = file_buffer<uint8_t>::open(L"test.txt").get();
    is.read_to_end(rwbuf).get();
    rwbuf.close();
}).wait();
This code is basically downloading the whole file to the computer (file_id is the id of the file I am trying to download). Of course, I can extract an input stream from the file and use it to read it.
However, this could give me issues if the file is big. What I had in mind was to download a part of the file while the caller is reading it (and cache that part in case the caller comes back to it).
My question, then, is:
Is it possible, using the OneDrive REST API + cpprest, to download only a part of a file stored on OneDrive? I have found that uploading files in chunks is apparently not possible (Chunked upload (resumable upload) for OneDrive?). Is this true for downloads as well?
Thank you in advance for your time.
Best regards,
Giuseppe
OneDrive supports byte-range reads, so you should be able to request chunks of whatever size you want by adding a Range header.
For example,
GET /v5.0/<fileid>/content
Range: bytes=0-1023
This will fetch the first KB of the file.
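Adapting the code from the question, a sketch of what that ranged request might look like with cpprest; the 0-1023 range and the use of extract_vector are just for illustration:

http_client api(U("https://apis.live.net/v5.0/"), m_http_config);

http_request req(methods::GET);
req.set_request_uri(file_id + U("/content"));
// Ask OneDrive for the first kilobyte of the file only.
req.headers().add(U("Range"), U("bytes=0-1023"));

api.request(req).then([=](http_response response) {
    // A ranged request should come back as 206 Partial Content.
    return response.extract_vector();
}).then([=](std::vector<unsigned char> chunk) {
    // chunk holds at most 1024 bytes; cache it or hand it to the caller,
    // then issue the next request with Range: bytes=1024-2047, and so on.
}).wait();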

Memory consumption during parsing of YAML with yaml-cpp

I am developing a Qt application for an embedded system with limited memory. I need to receive a few megabytes of JSON data and parse it as fast as possible and without using too much memory.
I was thinking about using streams:
JSON source (HTTP client) ---> ZIP decompressor ---> YAML parser ---> objects mapped to database
Data will arrive from the network much more slowly than I can parse it.
How much memory does yaml-cpp need to parse 1MB of data?
I would like the already-parsed raw data from the decompressor, and the internal memory used for that data by the YAML parser, to be released as soon as the corresponding object mapped to the database is created. Is that possible?
Does yaml-cpp support asynchronous parsing, so that as soon as a JSON object is parsed I can store it in the database without waiting for the full content from the HTTP source?
Since you have memory constraints and your data is already in JSON, you should use a low-memory JSON parser instead of a YAML parser. Try jsoncpp, although I'm not sure what its support for streaming is (since JSON doesn't have the concept of documents).
yaml-cpp is designed for streaming, so it won't block if there are documents to parse but the stream is still open; however, there is an outstanding issue in yaml-cpp where it reads more than a single document at a time, so it really isn't designed for extremely low memory usage.
As for how much memory it takes to parse 1 MB of data, it is probably on the order of 3 MB (the raw input stream, plus the parsed stream, plus the resulting data structure), but it may vary dramatically depending on what kind of data you're parsing.
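As an illustration of document-at-a-time parsing, here is a sketch using the older 0.3-style yaml-cpp Parser interface (the exact API differs in newer releases, where YAML::LoadAll reads every document at once); the file name is made up:

#include <yaml-cpp/yaml.h>
#include <fstream>

int main() {
    std::ifstream input("data.yaml");  // could be any std::istream, e.g. the decompressor's output
    YAML::Parser parser(input);

    YAML::Node doc;
    // Each iteration parses one document from the stream, so earlier documents
    // can be mapped to the database and released before the next one is read.
    while (parser.GetNextDocument(doc)) {
        // map doc to database objects here
    }
    return 0;
}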

How to compress ascii text without overhead

I want to compress a small text (400 bytes) and decompress it on the other side. If I do it with a standard compressor like rar or zip, it writes metadata along with the compressed data and the result is bigger than the original file.
Is there a way to compress the file without this metadata and open it on the other side with parameters known ahead of time?
You can do raw deflate compression with zlib. That avoids even the six-byte header and trailer of the zlib format.
However you will find that you still won't get much compression, if any at all, with just 400 bytes of input. Compression algorithms need much more history than that to get rolling, in order to build statistics and find redundancy in the data.
You should consider either a dictionary approach, where you build a dictionary of representative strings to provide the compressor something to work with, or you can consider a sequence of these 400-byte strings to be a single stream that is decompressed as a stream on the other end.
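A sketch of that raw-deflate idea with zlib, assuming both sides agree on the parameters out of band; the negative windowBits suppresses the zlib/gzip wrapper, and the optional preset dictionary stands in for the representative strings mentioned above:

#include <zlib.h>
#include <cstring>

// Compress src into dst as a raw deflate stream (no header or trailer),
// optionally primed with a preset dictionary shared with the receiver.
int raw_deflate(const unsigned char* src, size_t src_len,
                unsigned char* dst, size_t dst_cap, size_t* dst_len,
                const unsigned char* dict, unsigned dict_len) {
    z_stream strm;
    std::memset(&strm, 0, sizeof strm);
    // windowBits = -15 selects raw deflate; 8 is the default memLevel.
    int ret = deflateInit2(&strm, Z_BEST_COMPRESSION, Z_DEFLATED,
                           -15, 8, Z_DEFAULT_STRATEGY);
    if (ret != Z_OK) return ret;
    if (dict != nullptr)
        deflateSetDictionary(&strm, dict, dict_len);

    strm.next_in   = const_cast<unsigned char*>(src);
    strm.avail_in  = static_cast<uInt>(src_len);
    strm.next_out  = dst;
    strm.avail_out = static_cast<uInt>(dst_cap);
    ret = deflate(&strm, Z_FINISH);  // single-shot compress of the 400-byte message
    *dst_len = dst_cap - strm.avail_out;
    deflateEnd(&strm);
    return ret == Z_STREAM_END ? Z_OK : Z_BUF_ERROR;
}

The other side would call inflateInit2(&strm, -15) for raw inflate and, if a dictionary was used, inflateSetDictionary() with the same bytes right after initialization.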
You can have a look at compression using Huffman codes. As an example, look here and here.