Access random line in large file on Google Cloud Storage - google-cloud-platform

I'm trying to read a random line out of a large file stored in a public cloud storage bucket.
My understanding is that I can't do this with gsutil, and I have looked into FUSE but am not sure it fits my use case:
https://cloud.google.com/storage/docs/gcs-fuse
There are many files, which are ~50GB each -- for a total of several terabytes. If possible I would like to avoid downloading these files. They are all plain text files -- you can see them here:
https://console.cloud.google.com/storage/browser/genomics-public-data/linkage-disequilibrium/1000-genomes-phase-3/ldCutoff0.4_window1MB
It would be great if I could simply get a filesystem handle using FUSE so I could place the data directly into other scripts -- but I am okay with having to re-write them to read line by line if that is what is necessary. The key thing is -- under no circumstances should the interface download the entire file.

The Range header allows you to download specific byte offsets from within a file using the XML API.
There is no direct way to retrieve a specific line, as GCS doesn't know where in the file any given line begins/ends. Tools to find a specific line generally read a whole file in order to count line-breaks to find the desired line.
If the file has line numbers in it, you could do a binary search for the desired line: request a small chunk, check the line number it contains, and then try a different location based on that until you find the desired line.
If the file doesn't have line numbers, you could do pre-processing to make this possible. Before the initial upload, scan the file and record the byte offset of every Nth line. To get a desired line later, look up the nearest offset in that index and make a range request for the relevant section.
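As a rough illustration of the pre-built index approach, here is a sketch using the google-cloud-storage Python client; its download_as_bytes method accepts start/end byte offsets, which becomes a ranged read under the hood. The object name, index spacing and index contents below are placeholder assumptions, not real details of the bucket above.

from google.cloud import storage

client = storage.Client.create_anonymous_client()      # the bucket is public
bucket = client.bucket("genomics-public-data")
blob = bucket.blob("path/to/one-large-file.txt")        # placeholder object name

def read_line(blob, line_number, index, chunk_size=256 * 1024):
    """Fetch one line using a precomputed index mapping every Nth line
    number to its starting byte offset (built in a single pass before or
    during upload)."""
    # Closest indexed line at or before the one we want.
    indexed_line = max(n for n in index if n <= line_number)
    start = index[indexed_line]
    data = blob.download_as_bytes(start=start, end=start + chunk_size - 1)
    lines = data.split(b"\n")
    # Assumes the target line falls within chunk_size bytes of the indexed
    # offset; a fuller implementation would keep requesting further ranges.
    return lines[line_number - indexed_line].decode()

# index = {0: 0, 1000: 52_337_118}     # hypothetical offsets for every 1000th line
# print(read_line(blob, 1234, index))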

Related

How to identify whether a file which has no extension is DICOM or not

I have a few files in my GCP bucket folder, like below:
image1.dicom
image2.dicom
image3
file1
file4.dicom
Now, I want to check whether the files that have no extension (i.e. image3 and file1) are DICOM or not.
I use the pydicom reader to read DICOM files and get the data.
dicom_dataset = pydicom.dcmread("dicom_image_file_path")
Please suggest whether there is a simple way to validate that the above two files are DICOM or not.
You can use the pydicom.misc.is_dicom function, or do:
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

try:
    ds = dcmread(filename)
except InvalidDicomError:
    pass  # not a valid DICOM file
Darcy has the best general answer if you're looking to check file types prior to processing them. Apart from checking whether the file is a DICOM file, it will also make sure the file doesn't have any DICOM problems itself.
However, another way to quickly check, which may or may not be better depending on your use case, is to simply check the file's signature (or magic number, as they're sometimes known).
See https://en.wikipedia.org/wiki/List_of_file_signatures
Basically, if the 4 bytes at position 128 in the file are "DICM", then it should be a DICOM file.
If you only want to check "is it DICOM?" for a set of files, and do it very quickly, this might be another approach.
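To make the signature check concrete, here is a minimal Python sketch (the 128-byte preamble and the "DICM" marker come from the DICOM standard; note that some older files were written without the preamble and would be missed by this test):

def looks_like_dicom(path):
    """Return True if the file has the standard 128-byte preamble followed
    by the ASCII marker 'DICM'."""
    with open(path, "rb") as f:
        f.seek(128)
        return f.read(4) == b"DICM"

for name in ("image3", "file1"):    # the extension-less files from the question
    print(name, looks_like_dicom(name))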

Read Partial Parquet file

I have a Parquet file and I don't want to read the whole file into memory. I want to read the metadata and then read the rest of the file on demand. For example, I want to read the second page of the first column in the third row group. How would I do that using the Apache Parquet C++ library? I have the offset of the part that I want to read from the metadata and can read it directly from disk. Is there any way to pass that buffer to the Apache Parquet library to uncompress, decode and iterate through the values? How about the same thing for a column chunk or a row group? Basically, I want to read the file partially and then pass it to the Parquet APIs to process, as opposed to giving the file handle to the API and letting it go through the file. Is that possible?
Behind the scenes, this is what the Apache Parquet C++ library actually does. When you pass in a file handle, it will only read the parts it needs to. As it requires the file footer (the main metadata) to know where to find the segments of data, the footer will always be read. The data segments are only read once you request them.
There is no need to write special code for this; the library already has it built in. If you want to know in fine detail how this works, you only need to read the source of the library: https://github.com/apache/arrow/tree/master/cpp/src/parquet
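For illustration, here is a minimal sketch using the Python bindings (pyarrow), which wrap the same Parquet C++ code; the file name is a placeholder, and the row-group/column indices simply mirror the "first column in the third row group" example from the question.

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")          # placeholder file name

# Only the footer is read at this point; no data pages are touched.
print(pf.metadata)
print(pf.metadata.row_group(2).column(0))    # column chunk metadata, incl. offsets

# Reading one row group, restricted to one column, pulls just that column
# chunk's pages from disk.
first_column = pf.schema_arrow.names[0]
table = pf.read_row_group(2, columns=[first_column])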

Clusters the file is occupying [duplicate]

I need to get any information about where the file is physically located on an NTFS disk: absolute offset, cluster ID... anything.
I need to scan the disk twice: once to get the allocated files, and a second time opening the partition directly in raw mode to try to find the rest of the data (from deleted files). I need a way to tell that data I find in the raw scan is the same data I've already handled as a file. Since I'm scanning the disk in raw mode, the offset of the data I find can presumably be converted to an offset within a file (given information about the disk geometry). Is there any way to do this? Other solutions are accepted as well.
Right now I'm playing with FSCTL_GET_NTFS_FILE_RECORD, but I can't make it work at the moment and I'm not really sure it will help.
UPDATE
I found the following function
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364952(v=vs.85).aspx
It returns a structure that contains the nFileIndexHigh and nFileIndexLow members.
The documentation says:
The identifier that is stored in the nFileIndexHigh and nFileIndexLow members is called the file ID. Support for file IDs is file system-specific. File IDs are not guaranteed to be unique over time, because file systems are free to reuse them. In some cases, the file ID for a file can change over time.
I don't really understand what this is. I can't connect it to the physical location of the file. Is it possible to later extract this file ID from the MFT?
UPDATE
Found this:
This identifier and the volume serial number uniquely identify a file. This number can change when the system is restarted or when the file is opened.
This doesn't satisfy my requirements, because I'm going to open the file, and the fact that the ID might change doesn't make me happy.
Any ideas?
Use the Defragmentation IOCTLs. For example, FSCTL_GET_RETRIEVAL_POINTERS will tell you the extents which contain file data.
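Here is a hedged ctypes sketch of that IOCTL from Python, in case it helps to see the moving parts; the file path is a placeholder, error handling is minimal, and the fixed-size extent array is an assumption (a real tool would retry with a bigger buffer when DeviceIoControl reports ERROR_MORE_DATA).

import ctypes
from ctypes import wintypes

FSCTL_GET_RETRIEVAL_POINTERS = 0x00090073
GENERIC_READ = 0x80000000
FILE_SHARE_READ_WRITE = 0x00000003
OPEN_EXISTING = 3

class STARTING_VCN_INPUT_BUFFER(ctypes.Structure):
    _fields_ = [("StartingVcn", ctypes.c_longlong)]

class EXTENT(ctypes.Structure):
    _fields_ = [("NextVcn", ctypes.c_longlong),
                ("Lcn", ctypes.c_longlong)]

class RETRIEVAL_POINTERS_BUFFER(ctypes.Structure):
    _fields_ = [("ExtentCount", wintypes.DWORD),
                ("StartingVcn", ctypes.c_longlong),
                ("Extents", EXTENT * 64)]       # assume at most 64 extents

kernel32 = ctypes.windll.kernel32
kernel32.CreateFileW.restype = wintypes.HANDLE  # avoid handle truncation on x64

handle = kernel32.CreateFileW(r"C:\path\to\file.bin",   # placeholder path
                              wintypes.DWORD(GENERIC_READ),
                              wintypes.DWORD(FILE_SHARE_READ_WRITE),
                              None, OPEN_EXISTING, 0, None)

in_buf = STARTING_VCN_INPUT_BUFFER(0)    # start from the first virtual cluster
out_buf = RETRIEVAL_POINTERS_BUFFER()
returned = wintypes.DWORD(0)

kernel32.DeviceIoControl(wintypes.HANDLE(handle), FSCTL_GET_RETRIEVAL_POINTERS,
                         ctypes.byref(in_buf), ctypes.sizeof(in_buf),
                         ctypes.byref(out_buf), ctypes.sizeof(out_buf),
                         ctypes.byref(returned), None)
kernel32.CloseHandle(wintypes.HANDLE(handle))

# Each extent maps a run of virtual clusters to a starting logical cluster
# number (LCN) on the volume; LCN * cluster size (from GetDiskFreeSpaceW)
# gives the physical byte offset within the volume.
vcn = out_buf.StartingVcn
for i in range(out_buf.ExtentCount):
    ext = out_buf.Extents[i]
    print("VCN %d..%d -> LCN %d" % (vcn, ext.NextVcn - 1, ext.Lcn))
    vcn = ext.NextVcn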

Best way to parse a complex log file?

I need to parse a log file that consists of many text "screenshots" of a real-time OS's stdout.
In particular, every section of my log_file.txt is a text version of what would appear on screen. This machine has no monitor, so the stdout is written to a downloadable log_file.txt.
The aim is to create a .csv from this file for data-mining purposes, but I'm still wondering what the best way to process the file would be.
I would like the first CSV line to hold the description (string) of each value, and from the second line onward the respective values (int).
I was thinking about a parser generator (JavaCC, ANTLR, etc.), but before starting with one I would like to get some opinions.
Thank you.
P.S.
I put a short version of my log at the following link: pastebin.com/r9t3PEgb
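For what it's worth, a rough sketch of the simpler, non-parser-generator route: assuming each screen dump contains lines of the form "Some description: 123" and dumps are separated by non-matching lines (a purely hypothetical format, since the real layout is only visible in the linked pastebin), the standard re and csv modules may already be enough.

import csv
import re

FIELD = re.compile(r"^\s*(?P<name>[^:]+):\s*(?P<value>-?\d+)\s*$")

def log_to_csv(log_path, csv_path):
    records = []                 # one dict per screen dump
    current = {}
    with open(log_path) as log:
        for line in log:
            m = FIELD.match(line)
            if m:
                current[m.group("name").strip()] = int(m.group("value"))
            elif current:        # a non-matching line ends the current dump
                records.append(current)
                current = {}
    if current:
        records.append(current)

    fieldnames = sorted({key for rec in records for key in rec})
    with open(csv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()         # first line: the value descriptions
        writer.writerows(records)    # following lines: the values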

How to store a file once in a zip file instead of duplicating it in 50 folders

I have a directory structure that I need to write into a zip file that contains a single file that is duplicated in 50 sub directories. When users download the zip file, the duplicated file needs to appear in every directory. Is there a way to store the file once in a zip file, yet have it downloaded into the subdirectories when it is extracted? I cannot use shortcuts.
It would seem like Zip would be smart enough to recognize that I have 50 duplicate files and automatically store the file once... It would be silly to make this file 50 times larger than necessary!
It is possible within the ZIP specification to have multiple entries in the central directory point to the same local header offset. The ZIP application would have to precalculate the CRC of the file it was going to add and find a matching entry in the central directory of the existing ZIP file. A query for the CRC lookup against a ZIP file that contains a huge number of entries would be an expensive operation. It would also be costly to precalculate the CRC on huge files (CRC calculations are usually done during the compression routine).
I have not heard of a specific ZIP application that makes this optimization. However, it does look like StuffIt X format supports duplicate file optimization:
The StuffIt X format supports "Duplicate Detection". When adding files to an archive, StuffIt detects if there are duplicate items (even if they have the different file names), and only compresses the duplicates once, no matter how many copies there are. When expanded, StuffIt recreates all the duplicates from that one instance. Depending on the data being compressed, it can offer significant reductions in size and compression time.
I just wanted to clarify that the StuffIt solution only removes duplicate files when compressing to its own proprietary format, not to ZIP.
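As a small illustration of the CRC-lookup idea, the sketch below uses Python's zipfile module to read the central directory of an existing archive and group entries whose CRC and uncompressed size match (candidate duplicates). The archive name is a placeholder; note that zipfile can only report what an archive contains, it cannot write two central-directory entries that share one local header.

import zipfile
from collections import defaultdict

def candidate_duplicates(archive_path):
    groups = defaultdict(list)
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():                 # central directory entries
            groups[(info.CRC, info.file_size)].append(info.filename)
    return {key: names for key, names in groups.items() if len(names) > 1}

print(candidate_duplicates("bundle.zip"))          # placeholder archive name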