Will renaming a file change its CRC? I've checked with plain-text files and it didn't. Does this apply to all files of all formats?
CRC is calculated on the contents of the file. The file name is just an entry in the file system that allows access to the file. It's not part of the file itself, so it's not part of the CRC.
CRC is normally calculated on file content, but there's no prison sentence prescribed for writing a CRC utility that includes the filename. Check your particular utility's documentation, or I'd say it's safe to trust the results implied by your experiment.
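If you want to convince yourself for any file, you can compute the CRC-32 over the contents directly; here is a minimal sketch in Python (the file name is just a placeholder):

import zlib

# Compute CRC-32 over the file's bytes; the name passed in plays no part.
def crc32_of(path):
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

# Renaming the file changes neither its bytes nor this value.
print(hex(crc32_of("example.txt")))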
I have a few files in my GCP bucket folder, like below:
image1.dicom
image2.dicom
image3
file1
file4.dicom
Now I want to check whether the files that have no extension, i.e. image3 and file1, are DICOM or not.
I use the pydicom reader to read DICOM files and get the data.
dicom_dataset = pydicom.dcmread("dicom_image_file_path")
Please suggest whether there is a way to validate that the above two files are DICOM or not, in one sentence.
You can use the pydicom.misc.is_dicom function or do:
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

try:
    ds = dcmread(filename)
except InvalidDicomError:
    # Not a valid DICOM file; skip it.
    pass
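For the first option, a minimal sketch (filename is a placeholder):

import pydicom
from pydicom.misc import is_dicom

filename = "image3"               # e.g. one of the files without an extension
if is_dicom(filename):
    dicom_dataset = pydicom.dcmread(filename)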
Darcy has the best general answer if you're looking to check file types prior to processing them. Apart from checking whether the file is a DICOM file, it will also make sure the file doesn't have any DICOM problems itself.
However, another way to quickly check, which may or may not be better depending on your use case, is to simply check the file's signature (or magic number, as they're sometimes known).
See https://en.wikipedia.org/wiki/List_of_file_signatures
Basically, if the four bytes starting at offset 128 in the file are "DICM", then it should be a DICOM file.
If you only want to check "is it DICOM?" for a set of files, and do it very quickly, this might be another approach.
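A minimal sketch of that check (plain Python, no pydicom; the function name is mine, and it assumes the files are available locally):

def looks_like_dicom(path):
    # DICOM part 10 files start with a 128-byte preamble followed by b"DICM".
    with open(path, "rb") as f:
        f.seek(128)
        return f.read(4) == b"DICM"

print([name for name in ("image3", "file1") if looks_like_dicom(name)])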
I'm trying to read a random line out of a large file stored in a public cloud storage bucket.
My understanding is that I can't do this with gsutil, and I have looked into FUSE but am not sure it will fit my use case:
https://cloud.google.com/storage/docs/gcs-fuse
There are many files, which are ~50GB each -- for a total of several terabytes. If possible I would like to avoid downloading these files. They are all plain text files -- you can see them here:
https://console.cloud.google.com/storage/browser/genomics-public-data/linkage-disequilibrium/1000-genomes-phase-3/ldCutoff0.4_window1MB
It would be great if I could simply get a filesystem handle using FUSE so I could place the data directly into other scripts -- but I am okay with having to re-write them to read line by line if that is what is necessary. The key thing is -- under no circumstances should the interface download the entire file.
The Range header allows you to download specific byte offsets from within a file using the XML API.
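For a public object, that is just an HTTP GET with a Range header; a minimal sketch (the object URL below is illustrative, not one of the real files):

import requests

# Illustrative public object URL; any public GCS object works the same way.
url = ("https://storage.googleapis.com/genomics-public-data/"
       "linkage-disequilibrium/example-object.txt")

# Fetch only bytes 1000-1999 instead of the whole multi-GB file.
resp = requests.get(url, headers={"Range": "bytes=1000-1999"})
resp.raise_for_status()
print(resp.text)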
There is no direct way to retrieve a specific line, as GCS doesn't know where in the file any given line begins/ends. Tools to find a specific line generally read a whole file in order to count line-breaks to find the desired line.
If the file has line numbers in it, then you could do a binary search to look for the desired line. You would request small chunks, check the line number, and then try a different location based on that until you find the desired line.
If the file doesn't have line numbers, you could do pre-processing to make this possible. Before the initial file upload, you could scan the file and record the byte location of every Nth line. Then, to get the desired line, you look up the byte location in that index and make a range request for the relevant section, as sketched below.
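A minimal sketch of that pre-processing (the function name and the every-1000-lines granularity are my choices):

# One-time pre-processing before upload: record the byte offset at
# which every Nth line starts, so a later range request can begin
# near any target line.
def build_line_index(path, every=1000):
    index = {1: 0}                # line number -> byte offset of its start
    offset = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            offset += len(line)
            if lineno % every == 0:
                index[lineno + 1] = offset
    return index

# To fetch, say, line 123456: range-request from index[123001] (the
# nearest indexed line at or before it) and skip 455 lines locally.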
I am very new to C++ and I wanted to write a program that would read and extract data from files of different formats (for example, .dat). I just want to read and extract the data from them. Some people mention file headers, structures, and bodies; what are they, actually?
Basically, you need a different strategy (code) for each file format.
A file with extension .txt usually contains ASCII data and is simple to read.
A file with extension .doc contains binary data for MS Word and is virtually impossible to read with anything other than MS Word.
All other file formats are somewhere in between these extremes.
The file extension will give you a hint about the file's contents. Often people use the extension as a synonym for the actual file format, so we say "I have a .WAV file" when we actually mean "I have a binary file in RIFF/WAVE format with a .wav extension".
Some file formats (like .WAV, .MP3, .TIFF, and so on) contain a (well-documented) header which describes the file's structure in the first few bytes.
So "header" means: the first few bytes of a file, which describe the contents/structure/layout of the file. For example, in the first few bytes of a .WAV file you'll find the number of channels, the sampling rate, etc., which explains how the rest of the file needs to be read in, interpreted, and sent to an audio device.
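As an illustration, here is a sketch in Python (for brevity) that reads those fields from a canonical WAV header; it assumes the "fmt " chunk directly follows the RIFF header, which holds for most simple WAV files but is not guaranteed in general:

import struct

# Read just the first 28 bytes: the RIFF header plus the start of the
# "fmt " chunk, which holds the audio parameters.
with open("sound.wav", "rb") as f:
    header = f.read(28)

riff, _, wave = struct.unpack("<4sI4s", header[:12])
assert riff == b"RIFF" and wave == b"WAVE"

fmt_id, fmt_size, audio_fmt, channels, sample_rate = struct.unpack(
    "<4sIHHI", header[12:28])
print(channels, "channel(s) at", sample_rate, "Hz")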
Some other popular extensions (like .dat, .bin, .hex) say little more than "this is binary data in an unspecified format/structure". So you need (a lot of) additional information to read these files in a meaningful way.
Wikipedia article about file extensions
Wikipedia article about file formats
For each type of file there will be a specification defining the format. There may well be headers (information about the data stored in the file) and data structures (ways of organising the actual data in the file); other formats may just be plain text files where a newline character separates lines.
To write code to interpret a file, for instance a .jpg, you would need to get the file format specification for JPEG, read it, and then implement it in your code. You would do this for each file format you needed to read in your program.
The structure and content of common files like images, videos, sound, CAD data, text processing... is extremely complex. Mastering them would take you more than a lifetime.
Files often begin with a signature, i.e. a small number of bytes that is deemed to be unique and can be used to check the file type. But there is no standardization at all. For instance, an MS bitmap image begins with the letters "BM", while XML content begins with a string like "<?xml version="1.0" encoding="UTF-8"?>".
A header is an initial section of the file that gives information about the data itself, such as data type and size, allowing a reader to interpret the subsequent data correctly. For instance, the TIFF image format has a complex header that can contain dozens of "tags" before the bitmap data.
Here is an example: a minimal sketch in Python that guesses a file type from its first bytes (the signature table is deliberately tiny and illustrative).
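# Deliberately tiny, illustrative signature table.
SIGNATURES = {
    b"BM": "BMP image",
    b"\x89PNG": "PNG image",
    b"<?xml": "XML document",
}

def guess_type(path):
    with open(path, "rb") as f:
        start = f.read(8)
    for magic, name in SIGNATURES.items():
        if start.startswith(magic):
            return name
    return "unknown"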
I want to split a big file into smaller ones without copying parts of the file, and without using file streams or functions which use them (if that is possible).
Imagine we have a big file which consists of 3 files:
[[File1bytes][File2bytes][File3bytes]]
In my opinion we can do this with these steps:
Use the SetEndOfFile function to truncate the bytes of the last file ([File3bytes] in our example)
Somehow force our file system to recognize those truncated bytes ([File3bytes]) as a real file (maybe by adding some info to the MFT table, or doing something with NTFS if that is possible, or using some function or method which can do all of the above for us).
Any suggestions?
How about creating a file system nested over the existing file system where the very large file actually resides, and defining some IOCTL commands for splitting? Check this link:
How can I write my own 'filesystem' within Windows?
I have a text file (>50k lines) of ASCII numbers, with string identifiers, that can be thought of as a collection of data vectors. Based on user input, the application only needs one of these data vectors at runtime.
As far as I can see, I have 3 options for getting the information from this text file:
Keep it as a text file, extract the required vector at run-time. I believe the downside is that you can't have a relative path in the code, so the user would have to point to the file's correct location (?). Or alternatively, get the configure script to inject the absolute path as a macro.
Convert it to a static unsigned char array using xxd (as explained here) and then include the resulting file. The downside is that a 5MB file turns into a 25MB include file. Am I correct in thinking that this 25MB is loaded into memory for the duration of the runtime?
Convert it to an object and link using objcopy as explained here. This seems to keep the file size about the same -- are there other trade-offs?
Is there a standard/recommended method for doing this? I can use C or C++ if that makes a difference.
Thanks.
(Running on linux with gcc)
I would go with number 1 and pass the filepath into the program as an argument. There's nothing wrong with doing that, and it is simple and straightforward.
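A sketch of that approach (in Python for brevity; the same scan ports directly to C or C++), assuming each line starts with the vector's string identifier followed by its numbers:

import sys

def load_vector(path, wanted_id):
    # Scan line by line; only the requested vector is kept in memory.
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == wanted_id:
                return [float(v) for v in fields[1:]]
    raise KeyError(wanted_id)

vector = load_vector(sys.argv[1], sys.argv[2])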
You should have a look at the answers here:
Directory of running program
The top-voted answer gives you a clue how to handle your data file. But instead of the home folder, I would suggest saving it under /usr/share, as explained in the link.
I'd prefer to use zlib (and both ways are possible: a side file, or an include with compressed data).