Opening a compressed DICOM (MONOCHROME2) file in Python - compression

I tried to open a DICOM file which is compressed (JPEG 2000 Image Compression).
I used different free software tools as well as Python libraries (pydicom, VTK, OpenCV) to open it, but they were unsuccessful.
In pydicom, decompress() adapts the transfer syntax of the data set, but not the Photometric Interpretation; a minimal sketch of that attempt is shown below.
It seems GDCM can also handle compressed versions, but there is no "one-click" way to install GDCM on Windows. [Solved!]
I would appreciate any suggestions on this.
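Roughly what the pydicom attempt looks like (a sketch only; the filename is a placeholder):
import pydicom

ds = pydicom.dcmread("scan.dcm")  # placeholder filename
ds.decompress()       # updates the dataset's transfer syntax
arr = ds.pixel_array  # still fails / decodes wrongly for this file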
DICOM Tags:
(0008, 0016) SOP Class UID UI:
(0008, 0018) SOP Instance UID UI: 1.2.276.0.75.2.2.42.114374073191699.20160725100733359.3844234830
(0008, 0020) Study Date DA: ''
(0008, 0021) Series Date DA: ''
(0008, 0022) Acquisition Date DA: ''
(0008, 002a) Acquisition DateTime DT: ''
(0008, 0030) Study Time TM: '100304'
(0008, 0031) Series Time TM: '100733'
(0008, 0032) Acquisition Time TM: '100733'
(0008, 0050) Accession Number SH: ''
(0008, 0060) Modality CS: ''
(0008, 0070) Manufacturer LO: ''
(0008, 0080) Institution Name LO: ''
(0008, 1010) Station Name SH: ''
(0008, 1030) Study Description LO: ''
(0008, 103e) Series Description LO: 'Model Acquistion'
(0008, 1070) Operators' Name PN: ''
(0008, 1072) Operator Identification Sequence 1 item(s) ----
(0008, 0080) Institution Name LO: ''
---------
(0008, 1090) Manufacturer's Model Name LO: '4000'
(0010, 0010) Patient's Name PN: ''
(0010, 0020) Patient ID LO: ''
(0010, 0021) Issuer of Patient ID LO: ''
(0010, 0030) Patient's Birth Date DA: ''
(0010, 0040) Patient's Sex CS: ''
(0010, 4000) Patient Comments LT: ''
(0018, 0088) Spacing Between Slices DS: "0.0"
(0018, 1000) Device Serial Number LO: '11307'
(0018, 1020) Software Versions LO: '6.5.0.772'
(0018, 1030) Protocol Name LO: ''
(0020, 000d) Study Instance UID UI: 1.2.276.0.75.2.2.42.114374073191699.20160725100304296.3844877810
(0020, 000e) Series Instance UID UI: 1.2.276.0.75.2.2.42.114374073191699.20160725100733328.3844960960
(0020, 0010) Study ID SH: '2016072510030428'
(0020, 0011) Series Number IS: "-2147483648"
(0020, 0012) Acquisition Number IS: "0"
(0020, 0013) Instance Number IS: "1"
(0020, 0052) Frame of Reference UID UI: 1.2.276.0.75.2.2.42.114374073191699.20160725100733328.3844960960
(0020, 0060) Laterality CS: 'OS'
(0020, 0200) Synchronization Frame of Reference UI: 1.2.276.0.75.2.2.42.114374073191699.20140417175345187.1980128920
(0020, 4000) Image Comments LT: ''
(0028, 0002) Samples per Pixel US: 1
(0028, 0004) Photometric Interpretation CS: 'MONOCHROME2'
(0028, 0008) Number of Frames IS: "2"
(0028, 0010) Rows US: 1024
(0028, 0011) Columns US: 1024
(0028, 0030) Pixel Spacing DS: '0.005865103,0.001955034'
(0028, 0100) Bits Allocated US: 8
(0028, 0101) Bits Stored US: 8
(0028, 0102) High Bit US: 7
(0028, 0103) Pixel Representation US: 1
(0028, 1050) Window Center DS: "29.0"
(0028, 1051) Window Width DS: "210.0"
(0028, 2110) Lossy Image Compression CS: '01'
(0028, 2112) Lossy Image Compression Ratio DS: "10.0"
(0032, 1060) Requested Procedure Description LO: ''
(0040, 0008) Scheduled Protocol Code Sequence 1 item(s) ----
(0008, 0100) Code Value SH: 'SD-E1'
(0008, 0102) Coding Scheme Designator SH: '99CZM'
(0008, 0103) Coding Scheme Version SH: '1.0'
(0008, 0104) Code Meaning LO: 'ALL SCANS'
(0008, 010d) Context Group Extension Creator UID UI: CZM
(0061, 0111) Private tag data UL: 1
(0061, 0113) Private tag data LO: '1'
(0061, 0115) Private tag data LO: ''
(0061, 0117) Private tag data LO: 'SD-E1.xml'
(0061, 0119) Private tag data LO: 'False'
(0061, 011b) Private tag data LO: ''
(0061, 011d) Private tag data LO: 'True'
---------
(0040, 0244) Performed Procedure Step Start Date DA: ''
(0040, 0245) Performed Procedure Step Start Time TM: '100733'
(0040, 0260) Performed Protocol Code Sequence 1 item(s) ----
(0008, 0100) Code Value SH: 'SD-S2'
(0008, 0102) Coding Scheme Designator SH: ''
(0008, 0103) Coding Scheme Version SH: '1.0'
(0008, 0104) Code Meaning LO: 'Macular Cube 512x128'
(0008, 010d) Context Group Extension Creator UID UI: CZM
(0061, 0111) Private tag data UL: 2
(0061, 0113) Private tag data LO: '1'
(0061, 0115) Private tag data LO: ''
(0061, 0117) Private tag data LO: 'SD-S2.xml'
(0061, 0119) Private tag data LO: 'False'
(0061, 011b) Private tag data LO: 'No Hires in center.No parameters can be modified.'
(0061, 011d) Private tag data LO: 'False'
---------
(0040, 1001) Requested Procedure ID SH: ''
(0057, 0001) Private Creator UI: 1.2.276.0.75.2.2.42.7
(0057, 1003) Private tag data UL: 1
(0057, 1015) Private tag data LO: 'CZMI'
(0057, 1021) Private tag data LO: ''
(0057, 1023) Private tag data LO: '116525681374'
(0059, 1000) Private tag data LO: 'DATAFILES/E039/3RT8Q85TM1Y7VT7X31WNLNBV99GA43BOJ27IOLS4X2ZU.EX.DCM'
(0059, 1005) Private tag data SL: 0
(0059, 3500) Private tag data SL: 1
(0063, 1000) Private tag data FL: 0.0
(0063, 1005) Private tag data FL: 6.0
(0063, 1010) Private tag data FL: 2.0
(0063, 1015) Private tag data FL: -64.0
(0063, 1020) Private tag data UL: 141
(0063, 1025) Private tag data FL: 0.0
(0063, 1026) Private tag data FL: 0.0
(0063, 1030) Private tag data FL: 1.0
(0063, 1032) Private tag data FL: 1.0
(0063, 1035) Private tag data SL: 113
(0063, 1047) Private tag data FL: 297.0
(0063, 1048) Private tag data FL: 872.0
(0063, 1049) Private tag data FL: 1222.0
(0071, 1070) Private tag data FL: -292.0
(0071, 1095) Private tag data FL: 0.0
(0071, 1100) Private tag data FL: 0.0
(0071, 1105) Private tag data SL: 0
(0073, 1085) Private tag data FL: 1.0
(0073, 1090) Private tag data SL: 0
(0073, 1095) Private tag data SL: 0
(0073, 1100) Private tag data SL: 0
(0073, 1105) Private tag data FL: 0.0
(0073, 1110) Private tag data FL: 0.0
(0073, 1125) Private tag data SL: Array of 128 elements
(0073, 1135) Private tag data FL: 0.6000000238418579
(0073, 1200) Private tag data SL: Array of 128 elements
(0075, 1015) Private tag data 0 item(s) ----
(0075, 1020) Private tag data SL: 0
(0075, 1021) Private tag data SL: 0
(0075, 1035) Private tag data FL: 3.5299999713897705
(0075, 1065) Private tag data FL: 0.0
(0075, 1070) Private tag data FL: 0.0
(0075, 1075) Private tag data FL: 0.0
(0075, 1080) Private tag data FL: 0.0
(0075, 1085) Private tag data SL: 0
(0075, 1210) Private tag data UL: 0
(0075, 1215) Private tag data FL: -inf
(0075, 1220) Private tag data FL: -inf
(7fe0, 0010) Pixel Data OB: Array of 211106 elements

I've determined how to unscramble these obfuscated CZM DICOM datasets. Essentially CZM transposes several regions of the JPEG 2000 codestream and XORs every 7th byte of the frame with 0x5A, which is pretty nasty.
The following function should reverse this, producing a normal JPEG 2000 data stream that can then be written to file or opened with Pillow:
import math


def unscramble_czm(frame: bytes) -> bytearray:
    """Return an unscrambled image frame.

    Parameters
    ----------
    frame : bytes
        The scrambled CZM JPEG 2000 data frame as found in the DICOM dataset.

    Returns
    -------
    bytearray
        The unscrambled JPEG 2000 data.
    """
    # Fix the 0x5A XORing
    frame = bytearray(frame)
    for ii in range(0, len(frame), 7):
        frame[ii] = frame[ii] ^ 0x5A

    # Offset to the start of the JP2 header - empirically determined
    jp2_offset = math.floor(len(frame) / 5 * 3)

    # Double check that our empirically determined jp2_offset is correct
    offset = frame.find(b"\x00\x00\x00\x0C")
    if offset == -1:
        raise ValueError("No JP2 header found in the scrambled pixel data")
    if jp2_offset != offset:
        raise ValueError(
            f"JP2 header found at offset {offset} rather than the expected "
            f"{jp2_offset}"
        )

    # Reassemble the transposed regions in their original order
    d = bytearray()
    d.extend(frame[jp2_offset:jp2_offset + 253])
    d.extend(frame[993:1016])
    d.extend(frame[276:763])
    d.extend(frame[23:276])
    d.extend(frame[1016:jp2_offset])
    d.extend(frame[:23])
    d.extend(frame[763:993])
    d.extend(frame[jp2_offset + 253:])
    assert len(d) == len(frame)
    return d
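For example, with pydicom you can pull each encapsulated frame out of the Pixel Data, unscramble it, and hand the result to Pillow (a minimal sketch; the file and output names are placeholders, and Pillow needs JPEG 2000 support):
from io import BytesIO

from PIL import Image
from pydicom import dcmread
from pydicom.encaps import generate_pixel_data_frame

ds = dcmread("scrambled.dcm")  # placeholder filename
n_frames = int(ds.NumberOfFrames)
for idx, scrambled in enumerate(generate_pixel_data_frame(ds.PixelData, n_frames)):
    jp2 = unscramble_czm(scrambled)
    Image.open(BytesIO(bytes(jp2))).save(f"frame_{idx}.png")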

Related

How to create a boolean calculated field in Amazon QuickSight?

Let's assume I have access to this data in QuickSight:
Id  Amount  Date
1   10      15-01-2019
2   0       16-01-2020
3   100     21-12-2019
4   34      15-01-2020
5   5       20-02-2020
1   50      13-09-2020
4   0       01-01-2020
I would like to create a boolean calculated field, named "Amount_in_2020", whose value is True when the Id has a strictly positive total Amount in 2020, and False otherwise.
With Python I would have done the following:
import numpy as np
import pandas as pd

# Sample data
df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 1, 4],
                   'Amount': [10, 0, 100, 34, 5, 50, 0],
                   'Date': ['15-01-2019', '16-01-2020', '21-12-2019',
                            '15-01-2020', '20-02-2020', '13-09-2020',
                            '01-01-2020']})
# Dates are day-first, so parse them explicitly
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

# Group by Id to get each Id's total Amount over 2020
df_gb = df[df['Date'].dt.year == 2020].groupby('Id')[['Amount']].sum()

# Creation of the wanted column
df['Amount_in_2020'] = np.where(df['Id'].isin(df_gb[df_gb['Amount'] > 0].index), True, False)
But I can't find a way to create such a calculated field in QuickSight. Could you please help me?
Expected output:
Id  Amount  Date        Amount_in_2020
1   10      2019-01-15  True
2   0       2020-01-16  False
3   100     2019-12-21  False
4   34      2020-01-15  True
5   5       2020-02-20  True
1   50      2020-09-13  True
4   0       2020-01-01  True
Finally found:
ifelse(sumOver(max(ifelse(extract("YYYY",{Date})=2020,{Amount},0)), [{Id}])>0,1,0)
The inner ifelse keeps only the 2020 amounts, sumOver(..., [{Id}]) totals them per Id, and the outer ifelse returns 1 (True) when that total is strictly positive.

Querying Historical Data to get Month End Data

We have a history table that keeps all instances of a record and flags which one is current and when it changed. Here is a cut-down version of it:
CREATE TABLE *schema*.hist_temp
(
record_id VARCHAR
,record_created_date DATE
,current_flag BOOLEAN
,value int
);
INSERT INTO hist_temp VALUES ('Record A','2018-06-01',1,1000);
INSERT INTO hist_temp VALUES ('Record A','2018-04-12',0,900);
INSERT INTO hist_temp VALUES ('Record A','2018-03-13',0,800);
INSERT INTO hist_temp VALUES ('Record A','2018-01-13',0,700);
So what we have is Record A, which has been updated three times; the latest record is flagged with a 1, but we want to see all four instances of the history.
Then we have a dates table which holds, among other things, month end dates:
SELECT
calendar_date
,trunc(month_start) as month_start
FROM common.calendar
WHERE
calendar_year = '2018'
and calendar_date < trunc(sysdate)
ORDER BY 1 desc
Sample data:
calendar_date month_start
2018-06-03 2018-06-01
2018-06-02 2018-06-01
2018-06-01 2018-06-01
2018-05-31 2018-05-01
2018-05-30 2018-05-01
2018-05-29 2018-05-01
2018-05-28 2018-05-01
2018-05-27 2018-05-01
2018-05-26 2018-05-01
2018-05-25 2018-05-01
etc
Required results:
I would like to be able to display the following: the month start/end position for Record A for 2018.
record_id, month_start, value
Record A, '2018-06-01', 1000
Record A, '2018-05-01', 900
Record A, '2018-04-01', 800
Record A, '2018-03-01', 700
Record A, '2018-02-01', 700
I am trying to write this query. I have something, but I know it is wrong because the values are aggregated incorrectly. Can someone help me work out how to get the correct values?
Try:
SELECT
record_id,
date_trunc('month', record_created_date)::date AS month_start,
value
FROM hist_temp
Output:
Record A 2018-06-01 1000
Record A 2018-04-01 900
Record A 2018-01-01 700
Record A 2018-03-01 800
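For what it's worth, the required result is an "as of" lookup: for each month start, take the latest record created on or before that date, whereas the posted query only surfaces the months in which a record was actually created. A minimal pandas sketch of the as-of idea (hypothetical stand-in data, not the Redshift query itself):
import pandas as pd

# Hypothetical stand-ins for hist_temp and the calendar month starts
hist = pd.DataFrame({
    "record_id": ["Record A"] * 4,
    "record_created_date": pd.to_datetime(
        ["2018-01-13", "2018-03-13", "2018-04-12", "2018-06-01"]),
    "value": [700, 800, 900, 1000],
}).sort_values("record_created_date")

months = pd.DataFrame({
    "month_start": pd.to_datetime(
        ["2018-02-01", "2018-03-01", "2018-04-01",
         "2018-05-01", "2018-06-01"]),
})

# For each month start, pick the latest record created on or before it
asof = pd.merge_asof(months, hist,
                     left_on="month_start",
                     right_on="record_created_date",
                     direction="backward")
print(asof[["record_id", "month_start", "value"]])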

Applying cutoff to data set with IDs

I am using SAS and managed to run proc logistic, which gives me a table like so.
Classification Table
Prob    ---Correct---      ---Incorrect---    ----------Percentages----------
Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  FALSE POS  FALSE NEG  J
0       33     0           328    0           9.1      100          0            90.9       .          99
0.02    33     62          266    0           26.3     100          18.9         89         0          117.9
0.04    31     162         166    2           53.5     93.9         49.4         84.3       1.2        142.3
0.06    26     209         119    7           65.1     78.8         63.7         82.1       3.2        141.5
How do I get the IDs of the rows of data in lib.POST_201505_PRED below that have a predicted probability of at least 0.6?
proc logistic data=lib.POST_201503 outmodel=lib.POST_201503_MODEL descending;
model BUYER =
age
tenure
usage
payment
loyalty_card
/outroc=lib.POST_201503_ROC;
Score data=lib.POST_201505 out=lib.POST_201505_PRED outroc=lib.POST_201505_ROC;
run;
I've been reading the documentation and searching online but haven't found anything on it. I must be searching for the wrong keywords, as I presume this is a frequently used process.
You just need an id statement to tell SAS which variable identifies your observations:
proc logistic data=lib.POST_201503 outmodel=lib.POST_201503_MODEL descending;
id ID;
model BUYER = age tenure usage payment loyalty_card
/outroc=lib.POST_201503_ROC;
Score data=lib.POST_201505
out=lib.POST_201505_PRED
outroc=lib.POST_201505_ROC;
run;
Now your output contains all you need.
For instance, to print the IDs that were assigned a probability of at least 0.6 of being a BUYER:
proc print data=lib.POST_201505_PRED (where=(P_1 GE 0.6));
var ID P_1;
run;
You'll find these id yourKey; statements throughout the statistical procedures in SAS. For instance:
proc univariate data=psydata.stroop;
id Subject;
var ReadTime;
run;
** will report the most extreme values of ReadTime labelled with the corresponding Subject;
Turns out I just had to include the IDs in lib.POST_201505.

Pandas multilevel concat/group/chunking

I'm trying to groupby a large data set using chunking.
What works:
chunks = pd.read_stata('data.dta', chunksize = 50000, columns = ['year', 'race', 'app'])
pieces = [chunk.groupby(['race'])['app'].agg(['sum']) for chunk in chunks]
agg = pd.concat(pieces).groupby(level = 0).sum()
What doesn't work (error: Categorical objects has no attribute flags):
chunks = pd.read_stata('data.dta', chunksize = 50000, columns = ['year', 'race', 'app'])
pieces = [chunk.groupby(['year', 'race'])['app'].agg(['sum']) for chunk in chunks]
agg = pd.concat(pieces).groupby(level = ['year', 'race']).sum()
Thoughts on what I'm missing when adding in year?
pieces:
2013 Asian 9325
Black 2655
AmInd 118
Hisp 6371
White 16825
Other 2446
Unknown 3502
Foreign 7280
Name: app, dtype: float64, year race
2013 Asian 8884
Black 2969
AmInd 72
Hisp 3760
White 18926
Other 1843
Unknown 3262
Foreign 8183
Name: app, dtype: float64, year race
2013 Asian 6429
Black 2176
AmInd 89
Hisp 3804
White 13903
Other 1752
Unknown 2760
Foreign 6825
2014 Asian 1522
Black 738
AmInd 23
Hisp 1133
White 4243
Other 437
Unknown 316
Foreign 1997
Name: app, dtype: float64, year race
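For what it's worth, a two-level version of the same pattern can be sketched like this, assuming the categorical race column is cast to plain strings before grouping (the error above suggests the categorical dtype is what breaks the concat in this pandas version):
import pandas as pd

chunks = pd.read_stata('data.dta', chunksize=50000,
                       columns=['year', 'race', 'app'])
pieces = []
for chunk in chunks:
    # Assumption: casting the Categorical column to str sidesteps the
    # "Categorical ... flags" error when the grouped pieces are concatenated
    chunk['race'] = chunk['race'].astype(str)
    pieces.append(chunk.groupby(['year', 'race'])['app'].sum())

agg = pd.concat(pieces).groupby(level=['year', 'race']).sum()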

Regex and file processing

This question relates to R but really isn't language specific per se. I have a bunch of CSV files with this general format: "sitename_03082015.csv". The files have 5 columns and various numbers of rows:
Host MaximumIn MaximumOut AverageIn AverageOut
device1 30.63 Kbps 0 bps 24.60 Kbps 0 bps
device2 1.13 Mbps 24.89 Kbps 21.76 Kbps 461 bps
device5 698.44 Kbps 37.71 Kbps 17.49 Kbps 3.37 Kbps
I ultimately want to read in all the files and merge them, which I can do, but during the merge I want to take the site name and date from each file name and add them to each related line, so the output looks like this:
Host MaximumIn MaximumOut AverageIn AverageOut Site Name Date
device1 30.63 Kbps 0 bps 24.60 Kbps 0 bps SiteA 3/7/15
device12 1.13 Mbps 24.89 Kbps 21.76 Kbps 461 bps SiteA 3/8/15
device1 698.44 Kbps 37.71 Kbps 17.49 Kbps 3.37 Kbps SiteB 3/7/15
device2 39.08 Kbps 1.14 Mbps 10.88 Kbps 27.06 Kbps SiteB 3/8/15
device3 123.43 Kbps 176.86 Kbps 8.62 Kbps 3.78 Kbps SiteB 3/9/15
With my R code I can do the following:
#Get list of file names
filenames<- list.files(pattern = ".csv$")
#This extracts everything up to the underscore to get site name
siteName <- str_extract(string = filenames, "[^_]*")
# Extract date from file names
date <- str_extract(string = filenames, "\\d{8}" )
With the R code below I can merge all the files, but without the added site name and date columns that I want:
myDF<-do.call("rbind", lapply(filenames, read.table, header=TRUE, sep=","))
I just can't get my head around how to use the extracted site and date values to add and populate the columns and create my ideal data frame, which is the second table above.
The solution that best worked for me was posted below :)
The way that immediately comes to mind is to cbind the additional information while reading each file, and rbind afterwards. Something similar to this:
myDF <- do.call("rbind",
  lapply(filenames,
         function(x) cbind(read.table(x, header = TRUE, sep = ","),
                           "Site Name" = str_extract(string = x, "[^_]*"),
                           "Date" = as.Date(str_extract(string = x, "\\d{8}"), "%m%d%Y"))))
I have done something similar which can be applied here. You can add more file names separated by commas. The site can be extracted similarly. Let me know if you need more help.
## Assuming your csv files are saved in location C:/
library(stringr)

## List all filenames
fileNames <- c("hist_03082015.csv", "hist_03092015.csv")

## Create an empty dataframe to save all output to
final_df <- NULL
for (i in fileNames) {
  ## Read CSV
  df <- read.csv(paste("C:/", i, sep = ""), header = TRUE,
                 sep = ",", colClasses = 'character')
  ## Extract date from filename into a column
  df$Date <- gsub("\\D", "", i)
  ## Convert string to date
  df$Date <- as.Date(paste(str_sub(df$Date, 1, 2),
                           str_sub(df$Date, 3, -5),
                           str_sub(df$Date, 5, -1), sep = "-"), "%d-%m-%Y")
  ## Save all data into 1 dataframe
  final_df <- rbind(final_df, df)
  print(summary(final_df))
}