How to use the information provided in the wiki download's index file?

I am trying to do some research about Chinese people using wiki data. Rather than using DBpedia (its info about Chinese people is a bit limited compared to zh.wikipedia.org), I found that I can download dumps directly from zhwiki: http://download.wikipedia.com/zhwiki/20150301/.
I see there is an index file, and in it I can see rows such as:
966576:291:人物
which I assume is a lookup key? Can someone tell me how to use this lookup key to search the main file or database?

There are two files:
zhwiki-20150301-pages-articles-multistream.xml.bz2 (1.1 GB) - the dump itself; it has multiple bz2 streams, 100 pages per stream
zhwiki-20150301-pages-articles-multistream-index.txt.bz2 (18.8 MB) - the index file
The index file has lines of the form:
offset1:pageId1:title1
offset1:pageId2:title2
..
offset2:pageId101:title101
and so on.
The offset is the starting byte offset of a bz2 stream inside the multistream file. To extract a page, read the bytes from its stream's offset (offset1) up to the next distinct offset in the index (offset2), pass them to a bz2 decoder, and it will give you the XML dump of the 100 pages in that stream.
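A minimal sketch of that lookup in Python (the file name is from the dump listing above; bz2 plus seek/read is all that is needed):

import bz2

def read_stream(dump_path, offset, next_offset=None):
    # offset and next_offset are two consecutive *distinct* offsets taken
    # from the index file; pass next_offset=None for the last stream.
    with open(dump_path, "rb") as f:
        f.seek(offset)
        data = f.read(next_offset - offset) if next_offset else f.read()
    # one stream decompresses to the XML of up to 100 <page> elements
    return bz2.decompress(data).decode("utf-8")

# usage (966576 is the offset from the question; next_offset would be the
# next distinct offset that appears in the index file):
#   xml = read_stream("zhwiki-20150301-pages-articles-multistream.xml.bz2",
#                     966576, next_offset)

Note that the decompressed text is a fragment of the full XML dump, not a complete document, so parse it as a fragment.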

Related

Byte offset notation for a 900 MB XML file

I am building a search engine in C++ over a ~900 MB XML file that contains pages from Wikibooks. My objective is to parse the document using rapidXML so that the user can enter one word in the search bar and receive the actual XML documents that contain that word.
I need to figure out how to store an index for each token (i.e., each word within each document) so that when the user wants to see the pages where a certain word occurs, I can jump to those specific pages.
I have been told to use a "file I/O offset" approach (store where in the file a word is so that you can jump to it), and I am having a hard time understanding what to do.
Questions:
Do I use seekg and tellg from the istream library (to find the byte location at which each document page is stored)? And if so, how?
How do I return the actual documents (those containing occurrences of the searched word) to the user?
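A minimal sketch of the offset idea, in Python for brevity (tell() and seek() are the direct analogues of tellg and seekg on a std::istream; the file name is illustrative):

offsets = []
with open("wikibooks.xml", "rb") as f:
    for line in iter(f.readline, b""):
        if b"<page>" in line:
            # record the byte offset at which this page starts
            offsets.append(f.tell() - len(line))

# Later, jump straight to the N-th page without rescanning the file:
with open("wikibooks.xml", "rb") as f:
    f.seek(offsets[41])                      # e.g. the 42nd page
    page_lines = []
    for line in iter(f.readline, b""):
        page_lines.append(line)
        if b"</page>" in line:
            break
page_xml = b"".join(page_lines).decode("utf-8")

In C++, the same loop would record is.tellg() before each getline and later call is.seekg(offset) to jump back; the recovered page_xml is what you would hand back to the user.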

pyfits: read compressed fits file

How does one open a compressed fits file with pyfits?
The code below reads in the primary HDU, which should be an image, but the result is a NoneType object.
# read in file
file_input_fit = "myfile.fits.fz"
hdulist = pyfits.open(file_input_fit)
img = hdulist[0].data
Using the keyword disable_image_compression=True in pyfits.open() appears ineffective.
If the .data attribute on the primary HDU is None that means the primary HDU contains no data. You can confirm this by checking the file info:
hdulist.info()
Chances are you're trying to read a multi-extension FITS file, and the data you're looking for is in another castle, I mean, HDU. disable_image_compression=True wouldn't help since that disables support for compressed images :)
ETA: In fact, a tile-compressed FITS image can never be in the primary HDU, since it's stored internally as a binary table, which can only be an extension HDU.
This would be better as a comment, but I don't have the reputation to comment, so I'm forced to write an answer. The answer is the same -- namely, that the compressed data is stored in the second HDU. The following just shows what this looks like on a compressed image I have here (after using the exact lines of the OP to open the file):
>>> hdulist.info()
Filename: /tmp/test.fits.fz
No. Name Type Cards Dimensions Format
0 PRIMARY PrimaryHDU 6 ()
1 CompImageHDU 9 (24576, 6160) float32
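Given a layout like that, the fix is simply to index the extension HDU instead of the primary one (pyfits decompresses the tile-compressed image transparently when .data is accessed):

img = hdulist[1].data   # the CompImageHDU holds the image, not the primary HDU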

writing to gz using fstream

How can I write output to a compressed file (gz, bz2, ...) using fstream? It seems the Boost library can do that, but I am looking for a non-Boost solution. I have only seen examples of reading from a compressed file.
To write compressed data to a file, you run your uncompressed data through a compression library such as zlib (for DEFLATE, the compression algorithm used in .zip and .gz files) or XZ Utils (for LZMA, the compression algorithm used in 7-Zip and .xz files), then write the result as usual using ofstream or fwrite.
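For comparison, the whole compress-then-write step is a few lines with Python's gzip module (zlib exposes the same convenience to C and C++ via gzopen/gzwrite/gzclose):

import gzip

# gzip.open wraps the DEFLATE compression and the .gz framing in one call
with gzip.open("output.gz", "wb") as f:
    f.write(b"uncompressed payload goes here")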
The two major pieces to implement are the encoding/compression and framing/encapsulation/file format.
From Wikipedia, the DEFLATE algorithm:
Stream format
A DEFLATE stream consists of a series of blocks. Each block is preceded by a 3-bit header:
1 bit: last-block-in-stream marker: 1: this is the last block in the stream; 0: there are more blocks to process after this one.
2 bits: encoding method used for this block type: 00: a stored/raw/literal section, between 0 and 65,535 bytes in length; 01: a static Huffman compressed block, using a pre-agreed Huffman tree; 10: a compressed block, complete with the Huffman table supplied; 11: reserved, don't use.
Most blocks will end up being encoded using method 10, the dynamic Huffman encoding, which produces an optimised Huffman tree customised for each block of data individually. Instructions to generate the necessary Huffman tree immediately follow the block header.
Compression is achieved through two steps: the matching and replacement of duplicate strings with pointers, and replacing symbols with new, weighted symbols based on frequency of use.
From Wikipedia, the gzip file format:
"gzip" is often also used to refer to the gzip file format, which is:
a 10-byte header, containing a magic number, a version number and a timestamp
optional extra headers, such as the original file name
a body, containing a DEFLATE-compressed payload
an 8-byte footer, containing a CRC-32 checksum and the length of the original uncompressed data
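To make that framing concrete, here is a sketch that writes the container by hand with Python's zlib: the 10-byte header, a raw DEFLATE body, then the 8-byte footer:

import struct, time, zlib

def write_gzip(path, data):
    with open(path, "wb") as f:
        # 10-byte header: magic 1f 8b, method 8 (DEFLATE), no flags,
        # 4-byte modification time, extra flags 0, OS 0xff (unknown)
        f.write(b"\x1f\x8b\x08\x00"
                + struct.pack("<I", int(time.time()))
                + b"\x00\xff")
        # body: a raw DEFLATE stream (wbits=-15 suppresses zlib's own framing)
        comp = zlib.compressobj(9, zlib.DEFLATED, -15)
        f.write(comp.compress(data) + comp.flush())
        # 8-byte footer: CRC-32 then the uncompressed length, little-endian
        f.write(struct.pack("<II",
                            zlib.crc32(data) & 0xffffffff,
                            len(data) & 0xffffffff))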

tar.Z file format, structure, header

I am trying to figure out the file layout of a tar.Z file (a so-called .taz file: a compressed tar file).
This file can be produced with tar's -Z option or with the Unix compress utility (the results are the same).
I have tried to google for documentation about this file structure, but there is none.
I know that it is an LZW-compressed file and starts with the magic number "1F 9D", but that's all I could figure out.
Could someone please tell me more details about the file header, or anything else?
I am not interested in how to uncompress this file, or in which Linux command can process it.
What I want to know is its internal file structure/header/format/layout.
Thank you in advance
A .Z file is compressed using compress and can be uncompressed with uncompress (or on some machines this is called uncompress.real). This .Z file can hold any data. .tar.Z or .taz is just a .tar file that is compressed with compress.
The first 2 bytes (MAGIC_1 and MAGIC_2) are used to check whether the .Z file really is a .Z file, and not something else that happens to have the same extension. These bytes are hardcoded in the sources.
The third byte is a settings byte and holds 2 values:
The most significant bit is the block mode.
The last 5 bits indicate the maximum size of the code table (the code table is used for lzw compression).
From the original code: BLOCK_MODE=0x80; byte3=(BIT|BLOCK_MODE); and BIT is in an if/else block where it is 12..16.
If block mode is turned on, an entity will be added to the code table at place 256 (remember, 0..255 are filled with the values 0..255), and this will contain the CLEAR sign. So whenever the CLEAR sign is read from the file's data stream, the code table has to be reverted to its initial state (so it contains only entries 0..256).
The maximum code size indicates how many bits wide a code can be. When the maximum is hit, no more entities are added to the code table. So if the maximum code size is 0b01100 (12), codes can be at most 12 bits wide, giving a maximum of 2^12 = 4096 entities in the table.
The highest value compress ever uses is 16 bits. That means 2 bits of this settings byte are unused.
After these 3 bytes, the raw LZW data starts. Because the LZW codes start out 9 bits wide, the 4th byte will be the same as the first byte of the uncompressed input (in the case of a .tar.Z or .taz file, this byte will be the first byte of the uncompressed .tar file).
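Putting those three header bytes together, a minimal Python sketch of the checks described above:

def read_z_header(path):
    with open(path, "rb") as f:
        if f.read(2) != b"\x1f\x9d":       # MAGIC_1, MAGIC_2
            raise ValueError("not a compress(1) .Z file")
        settings = f.read(1)[0]            # the third byte
    block_mode = bool(settings & 0x80)     # most significant bit
    max_bits = settings & 0x1f             # lowest 5 bits: max LZW code width
    return block_mode, max_bits            # typically (True, 16)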
A tar.Z file is just a compressed tar file, so you will only find the 1F 9D magic number telling you to uncompress it.
When uncompressed you can read the tar file header:
http://www.fileformat.info/format/tar/corion.htm
Q: This file can be produced with tar's -Z option or with the Unix compress utility (the results are the same).
A: Yes. "tar -cvf myfile.tar myfiles; compress myfile.tar" is equivalent to using "-Z". An even better choice is often "-j" (bzip2 compression, instead of compress).
Q: What is the layout of a tar file?
A: There are many references, and much freely available source. For example:
http://en.wikipedia.org/wiki/Tar_%28file_format%29
Q: What is the format of a Unix compressed file?
A: Again: many references; easy to find sample source code:
http://en.wikipedia.org/wiki/Compress
For a .tgz (gzip-compressed tar file) you'll need both formats: you must first uncompress it, then untar it. The tar utility will do both for you, automagically :)
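As an illustration of that two-step unpacking, Python's tarfile module performs both steps for a .tgz in one call (file names are illustrative):

import tarfile

# "r:gz" tells tarfile to gunzip on the fly while reading the archive
with tarfile.open("archive.tgz", "r:gz") as tf:
    tf.extractall("extracted/")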

How to extract the album cover from an mp3 file without downloading the whole file

I'm using TagLib for extraction, but I need to know how many bytes I should download from the mp3 file in order for TagLib to be able to extract the tag.
I've looked into the mp3 specs, but I didn't find anything relevant.
In 99% of cases, if you pull down just the first 10 bytes, you'll have the ID3v2 header, of which the last 4 bytes encode the size of the ID3v2 tag, which contains the cover art.
The ID3v2 size is a "sync-safe integer", but TagLib has a function to decode that to a normal integer:
TagLib::ID3v2::SynchData::toUInt(const ByteVector &data)
So, basically, the algorithm would be (a code sketch follows the list):
Grab the first 10 bytes
Sanity-check that those bytes start with "ID3"
Read the last 4 bytes of those 10 and pass them through the function above to get the ID3v2 tag length
Grab that much additional data from the stream
Pass that block of data to TagLib
Extract the cover art
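A sketch of those steps in Python; the fetch_range helper and the URL are assumptions (any client that can issue HTTP Range requests will do), and the bit arithmetic on bytes 6-9 is exactly what TagLib::ID3v2::SynchData::toUInt computes:

import urllib.request

def fetch_range(url, start, length):
    # hypothetical helper: requires a server that honours Range requests
    req = urllib.request.Request(
        url, headers={"Range": "bytes=%d-%d" % (start, start + length - 1)})
    return urllib.request.urlopen(req).read()

url = "http://example.com/song.mp3"        # placeholder URL
header = fetch_range(url, 0, 10)
assert header[:3] == b"ID3"                # sanity check

# bytes 6-9 hold the tag size as a sync-safe integer (7 bits per byte)
size = ((header[6] & 0x7f) << 21) | ((header[7] & 0x7f) << 14) \
     | ((header[8] & 0x7f) << 7) | (header[9] & 0x7f)

tag_data = header + fetch_range(url, 10, size)
# tag_data now holds the complete ID3v2 tag; hand it to TagLib (or any
# ID3 parser) and pull the cover art out of the APIC frame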
The mp3 specification itself doesn't really cover metadata like the song name or album art. That is part of ID3: ID3v1 tags are placed at the end of the file, while ID3v2 tags (which hold the cover art) normally sit at the beginning.