How can I randomly access text from bzip2 files? - compression

I've been using James Taylor's bzip-seek2 script for random access to the content of .bz2 files. Wikipedia publishes all of its articles as a single XML dump compressed with bzip2.
I was trying to create a memory map for the bzip2-compressed Wikipedia articles, but I keep getting a "data error".
The script has no documentation. Is there any way to create the memory map for this file?
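For context, random access into a .bz2 file usually starts from an index of where each compressed block begins. The following is a minimal sketch of that indexing step only, assuming the standard bzip2 block-header magic 0x314159265359; the function name find_bzip2_block_offsets is made up for illustration and is not part of bzip-seek2. Blocks are not byte aligned, so the scan moves one bit at a time, and the magic can also appear by chance inside compressed data, so every hit is only a candidate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the bit offsets (counted from the start of `data`) at which the
// 48-bit bzip2 block-header magic 0x314159265359 begins.
// Hypothetical helper for illustration; hits can be false positives.
std::vector<std::uint64_t> find_bzip2_block_offsets(const std::vector<std::uint8_t>& data) {
    const std::uint64_t kBlockMagic = 0x314159265359ULL;
    const std::uint64_t kMask48 = (1ULL << 48) - 1;

    std::vector<std::uint64_t> offsets;
    std::uint64_t window = 0;     // the most recent 48 bits of the stream
    std::uint64_t bits_seen = 0;  // how many bits have been consumed so far

    for (std::uint8_t byte : data) {
        for (int bit = 7; bit >= 0; --bit) {  // bzip2 packs bits most significant first
            window = ((window << 1) | ((byte >> bit) & 1u)) & kMask48;
            ++bits_seen;
            if (bits_seen >= 48 && window == kBlockMagic) {
                offsets.push_back(bits_seen - 48);  // bit position where the magic starts
            }
        }
    }
    return offsets;
}
```

Each candidate offset would still have to be verified by actually decompressing the block that starts there, since a block is only valid if its CRC checks out.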

Related

How to access zip central directory with c++

I have a task to create a C++ console application that can access various data in a zip file (editing the comment, reading the names of the files in the archive, reading data in the header or central directory, etc.). I am allowed to use only basic libraries.
I did some googling and found the zip file structure, then I wrote some simple code to check whether I could read any data with ifstream. It returned various characters (which, after further googling, seem to represent hexadecimal values in UTF-8 encoding). That's where my fairly limited knowledge, and what I can find on Google, ends.
How do I properly read the various information stored in the zip file?
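As a rough illustration of what reading the central directory involves, here is a minimal sketch using only the standard library. It assumes the whole archive has already been read into a byte vector (for example with ifstream opened in binary mode), that the archive is not ZIP64, and that it is not split across disks. The names ZipListing and list_zip are invented for this example. The "strange characters" are simply binary fields; they have to be decoded as little-endian integers rather than printed as text.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// ZIP stores multi-byte integers least significant byte first.
static std::uint16_t read_u16(const std::vector<std::uint8_t>& b, std::size_t pos) {
    return static_cast<std::uint16_t>(b[pos] | (b[pos + 1] << 8));
}
static std::uint32_t read_u32(const std::vector<std::uint8_t>& b, std::size_t pos) {
    return static_cast<std::uint32_t>(b[pos]) |
           (static_cast<std::uint32_t>(b[pos + 1]) << 8) |
           (static_cast<std::uint32_t>(b[pos + 2]) << 16) |
           (static_cast<std::uint32_t>(b[pos + 3]) << 24);
}

struct ZipListing {  // hypothetical result type for this sketch
    std::string archive_comment;
    std::vector<std::string> file_names;
};

// Fills `out` from the central directory; returns false if the data looks malformed.
bool list_zip(const std::vector<std::uint8_t>& zip, ZipListing& out) {
    const std::size_t kEocdSize = 22;      // fixed part of the end-of-central-directory record
    const std::size_t kCdHeaderSize = 46;  // fixed part of a central directory file header
    if (zip.size() < kEocdSize) return false;

    // The end-of-central-directory record sits at the end of the file, followed
    // only by the archive comment, so search backwards for its signature.
    std::size_t eocd = zip.size() - kEocdSize;
    while (read_u32(zip, eocd) != 0x06054b50u) {
        if (eocd == 0) return false;
        --eocd;
    }

    const std::uint16_t total_entries = read_u16(zip, eocd + 10);
    const std::uint32_t cd_offset = read_u32(zip, eocd + 16);
    const std::uint16_t comment_len = read_u16(zip, eocd + 20);
    if (eocd + kEocdSize + comment_len > zip.size()) return false;

    ZipListing result;
    result.archive_comment.assign(zip.begin() + eocd + kEocdSize,
                                  zip.begin() + eocd + kEocdSize + comment_len);

    // Walk the central directory: one variable-length header per archived file.
    std::size_t pos = cd_offset;
    for (std::uint16_t i = 0; i < total_entries; ++i) {
        if (pos + kCdHeaderSize > zip.size() || read_u32(zip, pos) != 0x02014b50u) return false;
        const std::uint16_t name_len = read_u16(zip, pos + 28);
        const std::uint16_t extra_len = read_u16(zip, pos + 30);
        const std::uint16_t file_comment_len = read_u16(zip, pos + 32);
        if (pos + kCdHeaderSize + name_len > zip.size()) return false;
        result.file_names.emplace_back(zip.begin() + pos + kCdHeaderSize,
                                       zip.begin() + pos + kCdHeaderSize + name_len);
        pos += kCdHeaderSize + name_len + extra_len + file_comment_len;
    }
    out = result;
    return true;
}
```

Editing the archive comment would mean rewriting the 2-byte length at offset 20 of that final record and replacing the bytes that follow it.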

Read Partial Parquet file

I have a Parquet file and I don't want to read the whole file into memory. I want to read the metadata and then read the rest of the file on demand. For example, I want to read the second page of the first column in the third row group. How would I do that using the Apache Parquet C++ library? I have the offset of the part that I want to read from the metadata and can read it directly from disk. Is there any way to pass that buffer to the Apache Parquet library to decompress, decode and iterate through the values? What about doing the same for a column chunk or a row group? Basically, I want to read the file partially and then pass the pieces to the Parquet APIs for processing, as opposed to giving the file handle to the API and letting it go through the file. Is that possible?
Behind the scenes, this is what the Apache Parquet C++ library actually does. When you pass in a file handle, it only reads the parts it needs. Because it requires the file footer (the main metadata) to know where to find the data segments, the footer is always read. The data segments are only read once you request them.
There is no need to write special code for this; the library already has it built in. If you want to know in fine detail how this works, you only need to read the source of the library: https://github.com/apache/arrow/tree/master/cpp/src/parquet
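To make the "footer first" step concrete, here is a small sketch of how the footer can be located from just the last 8 bytes of a file. It relies on the Parquet file layout (metadata, then a 4-byte little-endian metadata length, then the magic "PAR1") and deliberately does not call into the library; FooterLocation and locate_parquet_footer are names invented for this example, and files with an encrypted footer (which end in a different magic) are not handled.

```cpp
#include <array>
#include <cstdint>

struct FooterLocation {    // hypothetical result type for this sketch
    std::uint64_t offset;  // byte offset of the serialized file metadata
    std::uint32_t length;  // size of the serialized file metadata in bytes
};

// `tail` holds the last 8 bytes of the file and `file_size` is its total size.
// Returns false if the trailing magic is missing or the length is implausible.
bool locate_parquet_footer(const std::array<std::uint8_t, 8>& tail,
                           std::uint64_t file_size,
                           FooterLocation& out) {
    // The file must end with the 4-byte magic "PAR1".
    if (tail[4] != 'P' || tail[5] != 'A' || tail[6] != 'R' || tail[7] != '1') return false;

    // The 4 bytes before the magic are the metadata length, little-endian.
    const std::uint32_t length = static_cast<std::uint32_t>(tail[0]) |
                                 (static_cast<std::uint32_t>(tail[1]) << 8) |
                                 (static_cast<std::uint32_t>(tail[2]) << 16) |
                                 (static_cast<std::uint32_t>(tail[3]) << 24);

    // Metadata + the 8 trailing bytes + the 4-byte leading magic must fit in the file.
    if (static_cast<std::uint64_t>(length) + 12 > file_size) return false;

    out.offset = file_size - 8 - length;
    out.length = length;
    return true;
}
```

Only the bytes in that range need to be read to learn where every row group and column chunk lives, which is why the rest of the file can stay on disk until it is requested.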

Text file recovery on corrupted file system of Flashdrive

I am able to read the raw data of a USB drive whose file system is corrupted.
Is there any simple way to recover only the text and docx files from this raw data? (Programming language: C++)
It might be possible to do it, but it won't be simple.
First of all, you will need to parse the file system (I assume it's FAT32 from the tags). In particular, you will need to parse the File Allocation Table (if it is corrupted and a mirror copy of the FAT was enabled on your drive, you can try that copy instead). Depending on the extent of the corruption, it might be possible to extract some files. Read this article for more info about the FAT32 structure, and you can use this Microsoft specification as a stricter guide. A good way to understand the file system is to create a small USB or logical drive containing a sample file and parse it manually with a hex editor (the free wxHexEditor or the proprietary WinHex, for example).
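As a starting point for the parsing described above, here is a minimal sketch that decodes the FAT32 boot sector fields needed to find the FAT copies and the data region. It assumes the first 512 bytes of the volume are already in a byte vector and that the boot sector itself survived the corruption; Fat32Layout and parse_fat32_boot_sector are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// FAT on-disk integers are little-endian.
static std::uint32_t read_u16(const std::vector<std::uint8_t>& b, std::size_t pos) {
    return static_cast<std::uint32_t>(b[pos]) | (static_cast<std::uint32_t>(b[pos + 1]) << 8);
}
static std::uint32_t read_u32(const std::vector<std::uint8_t>& b, std::size_t pos) {
    return read_u16(b, pos) | (read_u16(b, pos + 2) << 16);
}

struct Fat32Layout {  // hypothetical type for this sketch
    std::uint32_t bytes_per_sector = 0;
    std::uint32_t sectors_per_cluster = 0;
    std::uint32_t reserved_sectors = 0;
    std::uint32_t fat_count = 0;
    std::uint32_t sectors_per_fat = 0;
    std::uint32_t root_cluster = 0;

    // Byte offset of FAT copy `copy` (0 is the primary, 1 the mirror).
    std::uint64_t fat_offset(std::uint32_t copy) const {
        return (static_cast<std::uint64_t>(reserved_sectors) +
                static_cast<std::uint64_t>(copy) * sectors_per_fat) * bytes_per_sector;
    }
    // Byte offset of the data region, which starts right after the last FAT copy.
    std::uint64_t data_offset() const {
        return fat_offset(fat_count);
    }
    // Byte offset of a data cluster; cluster numbering starts at 2.
    std::uint64_t cluster_offset(std::uint32_t cluster) const {
        return data_offset() +
               static_cast<std::uint64_t>(cluster - 2) * sectors_per_cluster * bytes_per_sector;
    }
};

bool parse_fat32_boot_sector(const std::vector<std::uint8_t>& sector, Fat32Layout& out) {
    if (sector.size() < 512) return false;
    if (sector[510] != 0x55 || sector[511] != 0xAA) return false;  // boot sector signature

    Fat32Layout layout;
    layout.bytes_per_sector = read_u16(sector, 11);
    layout.sectors_per_cluster = sector[13];
    layout.reserved_sectors = read_u16(sector, 14);
    layout.fat_count = sector[16];
    layout.sectors_per_fat = read_u32(sector, 36);  // FAT32-specific field
    layout.root_cluster = read_u32(sector, 44);

    if (layout.bytes_per_sector == 0 || layout.sectors_per_cluster == 0 ||
        layout.fat_count == 0 || layout.sectors_per_fat == 0) {
        return false;  // values that cannot describe a usable FAT32 volume
    }
    out = layout;
    return true;
}
```

With these offsets, fat_offset(1) points at the mirror copy mentioned above, and cluster_offset(root_cluster) is where the root directory entries begin.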
You can try searching for sequences of ASCII characters in your raw image, but you will then need to sort through them manually.
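A minimal sketch of that ASCII search, similar in spirit to the Unix strings tool: it assumes the raw image (or a chunk of it) is in a byte vector and reports every run of printable characters of at least min_len bytes together with its offset. TextRun and find_ascii_runs are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct TextRun {         // hypothetical result type for this sketch
    std::size_t offset;  // where the run starts in the image
    std::string text;    // the printable bytes themselves
};

// Collects runs of printable ASCII (plus tab, CR and LF) of at least `min_len` bytes.
std::vector<TextRun> find_ascii_runs(const std::vector<std::uint8_t>& image, std::size_t min_len) {
    std::vector<TextRun> runs;
    std::string current;
    std::size_t start = 0;

    // Iterate one step past the end so a run that reaches the end is flushed too.
    for (std::size_t i = 0; i <= image.size(); ++i) {
        bool printable = false;
        if (i < image.size()) {
            const std::uint8_t c = image[i];
            printable = (c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\n' || c == '\r';
        }
        if (printable) {
            if (current.empty()) start = i;
            current.push_back(static_cast<char>(image[i]));
        } else {
            if (current.size() >= min_len) runs.push_back({start, current});
            current.clear();
        }
    }
    return runs;
}
```

A larger min_len cuts down on noise from binary data that happens to look like text, at the cost of missing very short files.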
As for docx, the format is internally a collection of XML files and resources compressed into a zip archive, so restoring it from a raw image would be a far too complicated task.
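To show why docx is harder, here is a sketch of the first step only: scanning the raw image for ZIP local file headers (signature "PK\x03\x04") and reading the entry name stored in each one. Entries named like word/document.xml hint at a docx, but rebuilding a usable file is only realistic if the archive was stored contiguously and not fragmented, which is an assumption this sketch cannot check. ZipEntryHit and find_zip_local_headers are names invented for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct ZipEntryHit {     // hypothetical result type for this sketch
    std::size_t offset;  // where the local file header starts in the image
    std::string name;    // file name recorded in that header
};

// Finds candidate ZIP local file headers in a raw image.
std::vector<ZipEntryHit> find_zip_local_headers(const std::vector<std::uint8_t>& image) {
    const std::size_t kHeaderSize = 30;  // fixed part of a local file header
    std::vector<ZipEntryHit> hits;

    for (std::size_t i = 0; i + kHeaderSize <= image.size(); ++i) {
        // Signature 0x04034b50, stored little-endian as 'P' 'K' 0x03 0x04.
        if (image[i] != 0x50 || image[i + 1] != 0x4B || image[i + 2] != 0x03 || image[i + 3] != 0x04) {
            continue;
        }
        // The 2-byte file name length is at offset 26; the name follows the fixed header.
        const std::size_t name_len = static_cast<std::size_t>(image[i + 26]) |
                                     (static_cast<std::size_t>(image[i + 27]) << 8);
        if (name_len == 0 || i + kHeaderSize + name_len > image.size()) continue;

        hits.push_back({i, std::string(image.begin() + i + kHeaderSize,
                                       image.begin() + i + kHeaderSize + name_len)});
    }
    return hits;
}
```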

How to translate PDF file?

I have a PDF file with tables, images and so on. I want to translate text of this PDF file into another language and create a PDF file that is similar to the first file but contains translated text (it should have images, tables, ... like first file).
How can I write a program in C++ that does this work?
I have a program that extracts the text from a PDF file and translates it, but I can't create an output PDF file with the tables and images in the right positions. How can I create a PDF file that has the same layout as the original?
Your program should read the PDF into an in-memory structure (such as a tree of objects), translate the text leaves in memory, and then dump the structure back to PDF.
To do so, you need a PDF parsing library that lets you manipulate the object representation.
I am not a C++ dev, so I don't know the C++ library universe; but from a quick search on google, it looks like PoDoFo can do this job.
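The read, translate, dump idea can be sketched on a toy tree. This is not PoDoFo's object model: Node, NodeKind and translate_text_leaves are invented for illustration, and a real PDF library will expose text in a different way (typically inside page content streams). The point is only that the walk touches text leaves and leaves everything else, including layout and images, untouched.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for a parsed document; all names here are hypothetical.
enum class NodeKind { Container, Text, Image };

struct Node {
    NodeKind kind;
    std::string text;            // payload; only translated when kind == NodeKind::Text
    std::vector<Node> children;  // nested objects (pages, tables, cells, ...)
};

// Walks the tree depth-first and rewrites only the text leaves in place.
void translate_text_leaves(Node& node,
                           const std::function<std::string(const std::string&)>& translate) {
    if (node.kind == NodeKind::Text) {
        node.text = translate(node.text);
    }
    for (Node& child : node.children) {
        translate_text_leaves(child, translate);
    }
}
```

Because positions and images live in nodes that are never modified, writing the tree back out keeps the original layout, although translated text of a different length may still overflow its original box.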

Read csv file from website into c++

I'd like to read the contents of a .csv file from a website into a C++ program. Specifically, it is financial data of the kind provided by Google Finance:
http://www.google.com/finance/historical?cid=22144&startdate=Nov+1%2C+2011&enddate=Nov+14%2C+2011
(If you append "&output=csv" to the above link it will download the data as a csv file)
I know that I can use something like libcurl to download the file and then read it in from there, but I wanted to read it directly into the program without having to write it to a file first.
Can I get some suggestions on the best way to do this? I was thinking boost.asio but I have no experience with it (or network programming in general).
If you are trying to download it from a web resource you will need to implement at least some part of the HTTP protocol. libcurl will do this for you.
You don't need to save it as a file. This example will show you how to download and store it in a memory buffer.
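A minimal sketch of the in-memory approach: libcurl hands each received chunk to a write callback, and the callback appends it to a std::string instead of a file. To keep the block compilable without libcurl, the actual libcurl calls appear only in a comment; append_to_string and parse_csv are names invented for this example, and the CSV splitter is deliberately naive (no quoted fields).

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Callback with the signature libcurl expects for CURLOPT_WRITEFUNCTION.
// `userdata` is whatever was registered via CURLOPT_WRITEDATA, here a std::string*.
std::size_t append_to_string(char* ptr, std::size_t size, std::size_t nmemb, void* userdata) {
    std::string* buffer = static_cast<std::string*>(userdata);
    buffer->append(ptr, size * nmemb);
    return size * nmemb;  // returning fewer bytes than received signals an error to libcurl
}

// With libcurl available, the wiring would look roughly like this:
//   CURL* curl = curl_easy_init();
//   std::string body;
//   curl_easy_setopt(curl, CURLOPT_URL, url);
//   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_to_string);
//   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
//   CURLcode rc = curl_easy_perform(curl);
//   curl_easy_cleanup(curl);

// Naive CSV split: one row per line, fields separated by commas, no quoting support.
std::vector<std::vector<std::string>> parse_csv(const std::string& text) {
    std::vector<std::vector<std::string>> rows;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF endings
        if (line.empty()) continue;
        std::vector<std::string> fields;
        std::istringstream cells(line);
        std::string cell;
        while (std::getline(cells, cell, ',')) fields.push_back(cell);
        rows.push_back(fields);
    }
    return rows;
}
```

Once curl_easy_perform returns successfully, body holds the whole response and can be passed straight to parse_csv, so nothing is written to disk.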