How to identify whether a file with no extension is DICOM or not - google-cloud-platform

I have a few files in my GCP bucket folder, like below:
image1.dicom
image2.dicom
image3
file1
file4.dicom
Now, I want to check whether the files that have no extension, i.e. image3 and file1, are DICOM or not.
I use pydicom to read DICOM files and get the data.
dicom_dataset = pydicom.dcmread("dicom_image_file_path")
Please suggest, briefly, whether there is a way to validate that the above two files are DICOM or not.

You can use the pydicom.misc.is_dicom function or do:
from pydicom import dcmread
from pydicom.errors import InvalidDicomError

try:
    ds = dcmread(filename)
except InvalidDicomError:
    # Not a valid DICOM file
    pass

Darcy has the best general answer if you're looking to check file types prior to processing them. Apart from checking whether the file is a DICOM file, it will also make sure the file doesn't have any DICOM problems itself.
However, another way to quickly check, which may or may not be better depending on your use case, is to simply check the file's signature (or magic number, as they're sometimes known).
See https://en.wikipedia.org/wiki/List_of_file_signatures
Basically, if the 4 bytes at offset 128 in the file (positions 128 to 132) are "DICM", then it should be a DICOM file.
If you only want to check 'is it DICOM?' for a set of files, and do it very quickly, this might be another approach.
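For example, a minimal Python sketch of that check (the function name and the sample path are just illustrative):

def looks_like_dicom(path):
    # A standard DICOM (Part 10) file begins with a 128-byte preamble
    # followed by the 4-byte magic value "DICM".
    with open(path, "rb") as f:
        f.seek(128)
        return f.read(4) == b"DICM"

print(looks_like_dicom("image3"))  # True if the DICM marker is present

Note that some older, non-conformant DICOM files lack the preamble and the "DICM" marker, so a failed check doesn't absolutely rule DICOM out; for those you would have to fall back on trying dcmread with force=True.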

Related

How to unzip carved files from disk in raw format

Sorry for making such an unclear title
I have a disk image, disk.raw, from which I carved deleted files using Sleuth Kit and its command blkls disk.raw 1-8000 > carved, which wrote the data from unallocated blocks 1 to 8000 (where I know my deleted files are) into a file.
So my output is a file containing some data with many empty spaces in between. For example, if I open it in a text editor, I get text like this:
1 4 µ½;ÓóÆJv4éA°¿S*îÔy÷è„¡d:ÄÕԈȤÒX2ÛK]8øâ†+[ÛÖ7jiº;Îàdƒ”ÜRÒ€
[... many more lines of similarly unreadable binary data ...]
I know that this bunch of data represents a compressed file. Is there a way for me to decode it and read the files inside? Is there a tool that does that, given this input?
I'm really new to this and have basic knowledge :)
Thank you kindly in advance,
Sometimes carved files lose their header format; repairing it might make the file accessible again. As mentioned by Mark Setchell, using a hex editor is better than a plain text editor. Also make sure to look for the correct header and save the file again in the correct format. Hopefully this is helpful.
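If you want to locate the compressed data yourself, a rough Python sketch along these lines scans the carved blob for well-known magic numbers so you know where to start cutting (the signature table is just a small illustrative subset, and "carved" is the output file from the question):

SIGNATURES = {
    b"\x1f\x8b\x08": "gzip stream",
    b"PK\x03\x04": "zip entry",
    b"\x89PNG\r\n\x1a\n": "PNG image",
}

def find_signatures(path):
    # Report the offset of every known magic number found in the file.
    data = open(path, "rb").read()
    hits = []
    for sig, name in SIGNATURES.items():
        pos = data.find(sig)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(sig, pos + 1)
    return sorted(hits)

for offset, kind in find_signatures("carved"):
    print(f"possible {kind} at offset {offset}")

Once you know the offsets, you can slice the matching region out into its own file and try opening it with the corresponding tool (gunzip, unzip, an image viewer, and so on).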

Include static data/text file

I have a text file (>50k lines) of ASCII numbers, with string identifiers, that can be thought of as a collection of data vectors. Based on user input, the application only needs one of these data vectors at runtime.
As far as I can see, I have 3 options for getting the information from this text file:
Keep it as a text file, extract the required vector at run-time. I believe the downside is that you can't have a relative path in the code, so the user would have to point to the file's correct location (?). Or alternatively, get the configure script to inject the absolute path as a macro.
Convert it to a static unsigned char array using xxd (as explained here) and then include the resulting file. The downside is that a 5 MB file turns into a 25 MB include file. Am I correct in thinking that this 25 MB is loaded into memory for the duration of the run?
Convert it to an object and link using objcopy as explained here. This seems to keep the file size about the same -- are there other trade-offs?
Is there a standard/recommended method for doing this? I can use C or C++ if that makes a difference.
Thanks.
(Running on linux with gcc)
I would go with number 1 and pass the file path into the program as an argument. There's nothing wrong with doing that, and it is simple and straightforward.
You should have a look at the answers here:
Directory of running program
The top-voted answer gives you a clue how to handle your data file. But instead of the home folder, I would suggest saving it under /usr/share, as explained in the link.
I'd prefer to use zlib (and both ways are possible: a side file, or an include with the compressed data).

Appropriate file upload validation

Background
In a targeted issue-tracking application (in Django), users are able to add file attachments to internal messages. Files are mainly different image formats, office documents and spreadsheets (Microsoft or OpenOffice), PDFs and PSDs.
A custom file field type (a type extending FileField) currently validates that files don't exceed a given size and that the file's content_type is in the application's MIME type whitelist. But as the user base is very varied (multinational and multi-platform), we frequently have to adjust our whitelist, because users on old or brand-new application versions submit different MIME types (even though the files are valid and are opened correctly by other users within the business).
Note: Files are not 'executed' by apache, they are just stored (with unix permissions 600) and can be downloaded by users.
Question
What are the pro's and con's for the different types of validation?
A few options:
MIME type whitelist or blacklist
File extension whitelist or blacklist
Django file upload input validation and security even suggests "you have to actually read the file to be sure it's a JPEG, not an .EXE" (is that even viable when numerous types of files are to be accepted?)
Is there a 'right' way to validate file uploads?
Edit
Let me clarify. I understand that actually opening the entire file in the program it is meant for, to make sure it works and isn't broken, would be the only way to fully confirm that the file is what it says it is and that it isn't corrupted.
But the files in question are like email attachments. We can't possibly verify that every PSD is a valid and working Photoshop image, and the same goes for JPG or any other type. Even if a file is what it says it is, we couldn't guarantee that it's a fully functional file.
So what I was hoping to get at is: is file magic absolutely crucial? What protection does it really add? And again, does a MIME type whitelist actually add any protection that a file extension whitelist doesn't? If a file has an extension of CSV, JPG, GIF, DOC or PSD, is it really viable to check that it is what it says it is, even though the application itself doesn't depend on the file?
Is it dangerous to use a simple file extension whitelist that excludes the obvious offenders (EXE, BAT, etc.) and thereby, I think, disallows files that are dangerous to users?
The best way to validate that a file is what it says it is, is by using magic.
Er, that is, file magic. Files can be identified by the first few bytes of their content. It's generally more accurate than extensions or MIME types, since you're judging what a file is by what it contains rather than by what the browser or user claims it to be.
There's an article on FileMagic on the Python wiki
You might also look into using the python-magic package
Note that you don't need to get the entire file before using magic to determine what it is. You can read the first chunk of the file and send those bytes to be identified by file magic.
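For instance, a rough sketch using the python-magic package (assuming that package is installed; the helper name and the chunk size are just illustrative):

import magic

def sniff_mime(path, chunk_size=2048):
    # Identify the MIME type from the first bytes of the file,
    # instead of trusting the extension or the browser-supplied content type.
    with open(path, "rb") as f:
        header = f.read(chunk_size)
    return magic.from_buffer(header, mime=True)

print(sniff_mime("upload.jpg"))  # e.g. "image/jpeg"

You could then compare the sniffed MIME type against your whitelist instead of (or in addition to) the content_type the browser sent.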
Clarification
Just to point out that using magic to identify a file really just means reading the first small chunk of the file. It's definitely more overhead than just checking the extension, but not too much work. All that file magic does is check that the file "looks" like the file you want. It's like checking the file extension, only you're looking at the first few bytes of the content instead of the last few characters of the filename. It's harder to spoof than just changing the filename. I'd recommend against a MIME type whitelist. A file extension whitelist should work fine for your needs; just make sure that you include all possible extensions. Otherwise a perfectly valid file might be rejected just because it ends with .jpeg instead of .jpg.

C++ Importing and Renaming/Resaving an Image

Greetings all,
I am currently a rising Sophomore (CS major), and this summer, I'm trying to teach myself C++ (my school codes mainly in Java).
I have read many guides on C++ and gotten to the part with ofstream, saving and editing .txt files.
Now, I am interested in simply importing an image (jpeg, bitmap, not really important) and renaming the aforementioned image.
I have googled and asked around, but to no avail.
Is this process possible without downloading external libraries? (I downloaded CImg.)
Any hints or tips on how to expedite my goal would be much appreciated.
Renaming an image is typically about the same as renaming any other file.
If you want to do more than that, you can also change the data in the Title field of the IPTC metadata. This does not require JPEG decoding, or anything like that -- you need to know the file format well enough to be able to find the IPTC metadata, and study the IPTC format well enough to find the Title field, but that's about all. Exactly how you'll get to the IPTC metadata will vary -- navigating a TIFF (for one example) takes a fair amount of code all by itself.
When you say "renaming the aforementioned image," do you mean changing metadata in the image file, or just changing the file name? If you are referring to metadata, then you need to either understand the file format or use a library that understands the file format. It's going to be different for each type of image file. If you basically just want to copy a file, you can either stream the contents from one file stream to another, or use a file system API.
#include <fstream>

std::ifstream infs("input.txt", std::ios::binary);
std::ofstream outfs("output.txt", std::ios::binary);
outfs << infs.rdbuf();  // copy the entire stream in one go
An example of a file system API is CopyFile on Win32.
It's possible without libraries - you just need the image specs and C; the question is why?
Targa or BMP are probably the easiest: it's just a header and the image data as a binary block of values.
GIF, JPEG and PNG are more complex - the data is compressed.

How to check if file is/isn't an image without loading full file? Is there an image header-reading library?

edit:
Sorry, I guess my question was vague. I'd like to have a way to check if a file is not an image without wasting time loading the whole image, because then I can do the rest of the loading later. I don't want to just check the file extension.
The application just views the images. By 'checking the validity', I meant 'detecting and skipping the non-image files' also in the directory. If the pixel data is corrupt, I'd like to still treat it as an image.
I assign page numbers and pair up these images. Some images are the single left or right page. Some images are wide and are the "spread" of the left and right pages. For example, pagesAt(3) and pagesAt(4) could return the same std::pair of images or a std::pair of the same wide image.
Sometimes, there is an odd number of 'thin' images, and the first image is to be displayed on its own, similar to a wide image. An example would be a single cover page.
Not knowing which files in the directory are non-images means I can't confidently assign those page numbers and pair up the files for displaying. Also, the user may decide to jump to page X, and when I later discover and remove a non-image file and reassign page numbers accordingly, page X could appear to be a different image.
original:
In case it matters, I'm using c++ and QImage from the Qt library.
I'm iterating through a directory and using the QImage constructor on the paths to the images. This is, of course, pretty slow and makes the application feel unresponsive. However, it does allow me to detect invalid image files and ignore them early on.
I could just save only the paths to the images while going through the directory and actually load them only when they're needed, but then I wouldn't know if the image is invalid or not.
I'm considering doing a combination of these two. i.e. While iterating through the directory, reading only the headers of the images to check validity and then load image data when needed.
So,
Will loading just the image headers be much faster than loading the whole images? Or does doing a bit of I/O to read the header mean I might as well finish loading the image in full? Later on, I'll be uncompressing images from archives as well, so this also applies to uncompressing just the header vs. uncompressing the whole file.
Also, I don't know how to load/read just the image headers. Is there a library that can read just the headers of images? Otherwise, I'd have to open each file as a stream and code image header readers for all the filetypes on my own.
The Unix file tool (which has been around since almost forever) does exactly this. It is a simple tool that uses a database of known file headers and binary signatures to identify the type of the file (and potentially extract some simple information).
The database is a simple text file (which gets compiled for efficiency) that describes a plethora of binary file formats, using a simple structured format (documented in man magic). The source is in /usr/share/file/magic (in Ubuntu). For example, the entry for the PNG file format looks like this:
0 string \x89PNG\x0d\x0a\x1a\x0a PNG image
!:mime image/png
>16 belong x \b, %ld x
>20 belong x %ld,
>24 byte x %d-bit
>25 byte 0 grayscale,
>25 byte 2 \b/color RGB,
>25 byte 3 colormap,
>25 byte 4 gray+alpha,
>25 byte 6 \b/color RGBA,
>28 byte 0 non-interlaced
>28 byte 1 interlaced
You could extract the signatures for just the image file types, and build your own "sniffer", or even use the parser from the file tool (which seems to be BSD-licensed).
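As a very rough illustration of such a sniffer (sketched in Python for brevity; the same table lookup is just as easy in C++), all you need is a small map from signature bytes to format names:

IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"BM": "bmp",
}

def sniff_image_type(path):
    # Read just enough bytes to compare against the known signatures.
    with open(path, "rb") as f:
        head = f.read(16)
    for sig, fmt in IMAGE_SIGNATURES.items():
        if head.startswith(sig):
            return fmt
    return None  # not a recognised image type

This only tells you that the file starts like an image of that type; it says nothing about whether the rest of the pixel data is intact, which matches your requirement of still treating files with corrupt pixel data as images.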
Just to add my 2 cents: you can use QImageReader to get information about image files without actually loading the files.
For example, with the format() method you can check a file's image format.
From the official Qt doc ( http://qt-project.org/doc/qt-4.8/qimagereader.html#format ):
Returns the format QImageReader uses for reading images. You can call this function after assigning a device to the reader to determine the format of the device. For example:
QImageReader reader("image.png");
// reader.format() == "png"
If the reader cannot read any image from the device (e.g., there is no image there, or the image has already been read), or if the format is unsupported, this function returns an empty QByteArray().
I don't know the answer about loading just the header, and it likely depends on the image type you are trying to load. You might consider using QtConcurrent to go through the images while allowing the rest of the program to continue, if that's possible. In this case, you would probably initially represent all of the entries as being in an unknown state, and then change each one to image or not-an-image when the verification is done.
If you're talking about image files in general, and not just a specific format, I'd be willing to bet there are cases where the image header is valid but the image data isn't. You haven't said anything about your application; is there no way you could add a background thread that keeps a few images in RAM and swaps them in and out depending on what the user may load next? E.g. a slide show app would load one or two images ahead of and behind the current one. Or maybe display a question mark next to the image name until the background thread can verify the validity of the data.
While opening and reading the header of a file on a local filesystem should not be too expensive, it can be expensive if the file is on a remote (networked) file system. Even worse, if you are accessing files saved with hierarchical storage management, reading the file can be very expensive.
If this app is just for you, then you can decide not to worry about those issues. But if you are distributing your app to the public, reading the file before you absolutely have to will cause problems for some users.
Raymond Chen wrote an article about this for his blog The Old New Thing.