How to decompress data stored in an HDF5 file compressed with zstd?

I am facing problems with zstd decompression. I have HDF5-format files that were compressed in the following way:
import h5py as h5
import hdf5plugin
import numpy as np
import sys

filefrom = sys.argv[1]
h5path = sys.argv[2]

f = h5.File(filefrom, 'r')
data = f[h5path]
shape_data = data.shape[1:]
num = data.shape[0]
initShape = (1,) + shape_data
maxShape = (num,) + shape_data

f_zstd = h5.File(filefrom.split('.')[0] + '_zstd.h5', 'w')
d_zstd = f_zstd.create_dataset(h5path, initShape, maxshape=maxShape, dtype=np.int32, chunks=initShape, **hdf5plugin.Zstd())
d_zstd[0,] = data[0,]
for i in range(num):
    d_zstd.resize((i+1,) + shape_data)
    d_zstd[i,] = data[i,]
f_zstd.close()
f.close()
It compresses without any errors, but when I then try to look into the data with h5ls or h5dump, it reports that the data can't be printed, and every other way of looking inside the file, such as reading the compressed data with h5py in Python 3 (3.6), is equally unsuccessful. I also tried h5repack (h5repack -i compressed_file.h5 -o out_file.h5 --filter=var:NONE) and the following piece of code:
import zstandard
import pathlib
import os

def decompress_zstandard_to_folder(input_file):
    input_file = pathlib.Path(input_file)
    destination_dir = os.path.dirname(input_file)
    with open(input_file, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        output_path = pathlib.Path(destination_dir) / input_file.stem
        with open(output_path, 'wb') as destination:
            decomp.copy_stream(compressed, destination)
Nothing succeeded. With h5repack, no warnings or errors appeared; with the last piece of code I got zstd.ZstdError: zstd decompressor error: Unknown frame descriptor, which as I understand it means the compressed data doesn't have the appropriate headers.
I use Python 3.6.7 and HDF5 1.10.5. I'm a bit confused and don't have any idea how to overcome this issue.
I will be happy for any ideas/advice!

I wrote a simple test to validate zstd compression behavior with a simple dataset (a NumPy array of int32). I can open the HDF5 file with h5py and read the dataset. (Note: I could not open it with HDFView, and h5repack only reports shape and type attributes, not the data.)
I suspect an undetected error in another part of your code. Have you tested your code logic without zstd compression? If not, I suggest you start there.
Code to Write example file:
import h5py as h5
import hdf5plugin
import numpy as np

data = np.arange(1_000).reshape(100, 10)
with h5.File('test_zstd.h5', 'w') as f_zstd:
    d_zstd = f_zstd.create_dataset('zstd_data', data=data, **hdf5plugin.Zstd())
Code to Read example file:
import h5py as h5
import hdf5plugin  ## Note: plugin required to read

with h5.File('test_zstd.h5', 'r') as f_zstd:
    d_zstd = f_zstd['zstd_data']
    print(d_zstd.shape, d_zstd.dtype)
    print(d_zstd[0, :])
    print(d_zstd[-1, :])
Output from above:
(100, 10) int32
[0 1 2 3 4 5 6 7 8 9]
[990 991 992 993 994 995 996 997 998 999]
More on HDF5 and compression:
To use HDF5 utilities (like h5repack) to read a compressed file, the HDF5 installation needs the appropriate compression filter. Some are standard; many (including Zstandard) require you to install a third-party filter. Links to available plugins are here: HDF5 Registered Filter Plugins
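Incidentally, this is also why the zstandard.ZstdDecompressor approach in the question fails with Unknown frame descriptor: an HDF5 file is not one continuous zstd stream. The filter compresses each chunk separately, so only the raw bytes of an individual chunk form a valid zstd frame. Below is a minimal sketch of decompressing one chunk by hand, assuming h5py >= 2.10 (for read_direct_chunk) and the test file written above:
import h5py as h5
import numpy as np
import zstandard

with h5.File('test_zstd.h5', 'r') as f:
    dset = f['zstd_data']
    # Raw, still-compressed bytes of the chunk whose corner is at (0, 0);
    # read_direct_chunk bypasses the filter pipeline entirely.
    filter_mask, raw_chunk = dset.id.read_direct_chunk((0, 0))
    # Upper bound on the decompressed size: one full chunk
    nbytes = int(np.prod(dset.chunks)) * dset.dtype.itemsize
    payload = zstandard.ZstdDecompressor().decompress(raw_chunk, max_output_size=nbytes)
    chunk = np.frombuffer(payload, dtype=dset.dtype).reshape(dset.chunks)
    print(chunk[0, :])  # should match the first row of the dataset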
You can verify the compression filter with h5dump by adding the -pH flag, like this:
E:\SO_68526704>h5dump -pH test_zstd.h5
HDF5 "test_zstd.h5" {
GROUP "/" {
   DATASET "zstd_data" {
      DATATYPE H5T_STD_I32LE
      DATASPACE SIMPLE { ( 100, 10 ) / ( 100, 10 ) }
      STORAGE_LAYOUT {
         CHUNKED ( 100, 10 )
         SIZE 1905 (2.100:1 COMPRESSION)
      }
      FILTERS {
         USER_DEFINED_FILTER {
            FILTER_ID 32015
            COMMENT Zstandard compression: http://www.zstd.net
         }
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_ALLOC
         VALUE H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_INCR
      }
   }
}
}
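If h5dump or h5repack instead complains that the filter is not available, the plugins bundled with hdf5plugin can be handed to the command-line tools through the HDF5_PLUGIN_PATH environment variable. A minimal sketch, assuming hdf5plugin exposes its plugin directory as PLUGINS_PATH (per its documentation; verify against your installed version):
import hdf5plugin

# Directory holding the compiled filter libraries shipped with the wheel
# (PLUGINS_PATH is an assumption based on hdf5plugin's docs).
print(hdf5plugin.PLUGINS_PATH)
# Export the printed path as HDF5_PLUGIN_PATH in your shell before
# running h5dump/h5repack so they can load the zstd filter.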

Related

Reading multiple files in a directory with pyyaml

I'm trying to read all the YAML files in a directory, but I am having trouble, partly because I am using Python 2.7 (and I cannot change to 3) and all of my files are UTF-8 (and I also need them to stay that way).
import os
import yaml
import codecs

def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data

def yaml_dump(filepath, data):
    with open(filepath, 'w') as file_descriptor:
        yaml.dump(data, file_descriptor)

if __name__ == "__main__":
    filepath = os.listdir(os.getcwd())
    data = yaml_reader(filepath)
    print data
When I run this code, python gives me the message:
TypeError: coercing to Unicode: need string or buffer, list found.
I want this program to show the content of the files. Can anyone help me?
I guess the issue is with filepath.
os.listdir(os.getcwd()) returns the list of all the files in the directory, so you are passing a list to codecs.open() instead of a filename.
There are multiple problems with your code, apart from the fact that it is invalid Python as formatted here. Your reader explicitly decodes the file:
def yaml_reader(filepath):
    with codecs.open(filepath, "r", encoding='utf-8') as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
However, it is not necessary to do the decoding yourself; PyYAML is perfectly capable of processing UTF-8:
def yaml_reader(filepath):
    with open(filepath, "rb") as file_descriptor:
        data = yaml.load_all(file_descriptor)
    return data
I hope you realise that you're loading multiple documents and will always get a list as a result in data, even if your file contains only one document.
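For example, a small self-contained illustration (the two-document stream here is made up):
import yaml

stream = u"a: 1\n---\nb: 2\n"
# load_all iterates over every document in the stream, even if there
# is only one; materialise it with list() before the stream closes.
print(list(yaml.safe_load_all(stream)))  # [{'a': 1}, {'b': 2}]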
Then the line:
filepath = os.listdir(os.getcwd())
gives you a list of files, so you need to do:
filepath = os.listdir(os.getcwd())[0]
or decide in some other way which of the files you want to open. If you want to combine all the files (assuming they are YAML) into one big YAML file, you need to do:
if __name__ == "__main__":
    data = []
    for filepath in os.listdir(os.getcwd()):
        data.extend(yaml_reader(filepath))
    print data
And your dump routine would need to change to:
def yaml_dump(filepath, data):
    with open(filepath, 'wb') as file_descriptor:
        yaml.dump(data, file_descriptor, allow_unicode=True, encoding='utf-8')
However, this all brings you to the biggest problem: you are using PyYAML, which will mangle your YAML, dropping flow style, comments, anchor names, special ints/floats, quotes around scalars, etc. Apart from that, PyYAML has not been updated to support YAML 1.2 documents (which has been the standard since 2009). I recommend you switch to ruamel.yaml (disclaimer: I am the author of that package), which supports YAML 1.2 and leaves comments etc. in place.
And even if you are bound to Python 2, you should use Python 3 style syntax, e.g. for print, which you can get with from __future__ imports.
So I recommend you do:
pip install pathlib2 ruamel.yaml
and then use:
from __future__ import absolute_import, unicode_literals, print_function
from pathlib import Path
from ruamel.yaml import YAML

if __name__ == "__main__":
    data = []
    yaml = YAML()
    yaml.preserve_quotes = True
    for filepath in Path('.').glob('*.yaml'):
        data.extend(yaml.load_all(filepath))
    print(data)
    yaml.dump(data, Path('your_output.yaml'))

Python-Zstandard: how to resolve error:global name 'zstd' is not defined?

I tried to run the script in a Linux terminal using Python 2.7, and I'm getting the following error:
Traceback (most recent call last):
  File "main1.py", line 50, in <module>
    main()
  File "main1.py", line 25, in main
    ver = zstd.__version__;
NameError: global name 'zstd' is not defined
Please check the code and suggest changes, or any other way to run the script successfully.
# import zstd library for zstandard simple api access and Compress
import csv
import os
import time
import sys
import io
import random
import zlib
#import zstandard as zstd
from zstd import *

def main():
    #set path of INPUT FILE here
    path = 'sai.txt'
    fh_input = open(path, "rb")
    #determine size of input file
    sizeinfo_if = os.stat('sai.txt')
    print('Size of input file is :', sizeinfo_if.st_size, 'Bytes')
    #make sure zstd is installed and running
    ver = zstd.__version__;
    print('Zstd version : ', ver)
    #File for already existing (or newly created) output file where the compressed data needs to be written
    fh_output = open('output.txt', 'wb')
    #Zstd compressor object creation
    cctx = zstd.ZstdCompressor()
    initial_timestamp = time.time()
    #Get byte sized chunks from in File, compress and write to out file
    with open('sai.txt', 'rb') as fh_input:
        byte_chunk = fh_input.read(1)
        while byte_chunk:
            #bdata = bytes(str.encode(byte_chunk))
            compressed = cctx.compress(byte_chunk)
            with cctx.write_to(fh_output) as compressor:
                compressor.write(byte_chunk)
            byte_chunk = fh_input.read(1)
    end_timestamp = time.time()
    print('Time taken to compress:', end_timestamp - initial_timestamp)
    sizeinfo_of = os.stat('output.txt')
    print('Size of output File is:', sizeinfo_of.st_size, 'Bytes')

main()
It worked for me importing it the following way (Python 3.8):
import zstd
print('Zstd version : ', zstd.version())
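If you instead want to stay with the python-zstandard package the rest of the script targets (ZstdCompressor and friends), a hedged sketch of the import fix is below; copy_stream is python-zstandard's documented streaming helper, but verify the names against your installed version:
import zstandard as zstd  # binds the name 'zstd' that the script expects

print('Zstd version :', zstd.__version__)

cctx = zstd.ZstdCompressor()
with open('sai.txt', 'rb') as fh_input, open('output.txt', 'wb') as fh_output:
    # Compress the whole stream in one pass instead of byte-by-byte
    cctx.copy_stream(fh_input, fh_output)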

Os.walk - WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect:

New to Python and looking for some help with a problem I am having with os.walk. I have had a solid look around and cannot find the right solution to my problem.
What the code does:
Scans a user-selected HD or folder and returns all the filenames, subdirs and sizes. This is then manipulated in pandas (not in the code below) and exported to an Excel spreadsheet in the formatting I desired.
However, in the first part of the code, running under Python 2.7, I am currently experiencing the error below:
WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'E:\03. Work\Bre\Files\folder2\icons greyscale flatten\._Icon_18?10 Stainless Steel.psd'
I have explored using a raw string (r'...') but to no avail; perhaps I am writing it wrong.
I will note that I never get this in 3.5 or on cleanly labelled folders. Due to pandas and PyInstaller problems with 3.5, I am hoping to stick with 2.7 until the error with 3.5 is resolved.
import pandas as pd
import xlsxwriter
import os
from io import StringIO

#Lists for Pandas Dataframes
fpath = []
fname = []
fext = []
sizec = []

# START # Select file directory to scan
filed = raw_input("\nSelect a directory to scan: ")

#Scan the Hard-Drive and add to lists for Pandas DataFrames
print "\nGetting details..."
for root, dirs, files in os.walk(filed):
    for filename in files:
        f = os.path.abspath(root)              #File path
        fpath.append(f)
        fname.append(filename)                 #File name
        s = os.path.splitext(filename)[1]      #File extension
        s = str(s)
        fext.append(s)
        p = os.path.join(root, filename)       #File size
        si = os.stat(p).st_size
        sizec.append(si)
print "\nDone!"
Any help would be greatly appreciated :)
In order to traverse filenames with unicode characters, you need to give os.walk a unicode path name.
Your path contains a unicode character, which is being displayed as ? in the exception.
If you pass in the unicode path, like this: os.walk(unicode(filed)), you should not get that exception.
As noted in Convert python filenames to unicode sometimes you'll get a bytestring if the path is "undecodable" by Python 2.
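A sketch of that fix applied to the question's loop (Python 2; note that unicode(filed) assumes the input decodes as ASCII, otherwise pass the console encoding explicitly):
import os

filed = raw_input("\nSelect a directory to scan: ")
# A unicode path makes os.walk yield unicode names, so filenames with
# non-ASCII characters survive intact.
for root, dirs, files in os.walk(unicode(filed)):
    for filename in files:
        p = os.path.join(root, filename)
        print p, os.stat(p).st_size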

converting tiff to jpeg in python

Can anyone help me read a .tiff image and convert it into JPEG format?
from PIL import Image
im = Image.open('test.tiff')
im.save('test.jpeg')
The above code did not work.
I have successfully solved the issue. Below is the code that reads all the TIFF files in a folder and converts them to JPEG automatically.
import os
from PIL import Image

yourpath = os.getcwd()
for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
        print(os.path.join(root, name))
        if os.path.splitext(os.path.join(root, name))[1].lower() == ".tiff":
            if os.path.isfile(os.path.splitext(os.path.join(root, name))[0] + ".jpg"):
                print "A jpeg file already exists for %s" % name
            # If a jpeg is *NOT* present, create one from the tiff.
            else:
                outfile = os.path.splitext(os.path.join(root, name))[0] + ".jpg"
                try:
                    im = Image.open(os.path.join(root, name))
                    print "Generating jpeg for %s" % name
                    im.thumbnail(im.size)
                    im.save(outfile, "JPEG", quality=100)
                except Exception, e:
                    print e
import os, sys
from PIL import Image
I tried to save directly to JPEG, but the error indicated that the mode was P and incompatible with the JPEG format, so you have to convert it to RGB mode first, as follows:
for infile in os.listdir("./"):
    print "file : " + infile
    if infile[-3:] == "tif" or infile[-3:] == "bmp":
        # print "is tif or bmp"
        outfile = infile[:-3] + "jpeg"
        im = Image.open(infile)
        print "new filename : " + outfile
        out = im.convert("RGB")
        out.save(outfile, "JPEG", quality=90)
This can be solved with the help of OpenCV. It worked for me.
OpenCV version == 4.3.0
import cv2, os

base_path = "data/images/"
new_path = "data/ims/"
for infile in os.listdir(base_path):
    print("file : " + infile)
    read = cv2.imread(base_path + infile)
    outfile = infile.split('.')[0] + '.jpg'
    # JPEG quality ranges from 0 to 100; OpenCV clamps higher values
    cv2.imwrite(new_path + outfile, read, [int(cv2.IMWRITE_JPEG_QUALITY), 100])
I believe none of the answers so far is complete.
The TIFF image format is a container for various formats. It can contain BMP, uncompressed TIFF, LZW compression, Zip compression and some others, among them JPG.
Image.open (from PIL) opens these files but can't always do anything with them. At least you can find out that it is a TIFF file (from its contents, not only its name). Then one can use pytiff.Tiff (from the pytiff package). For some reason, when the TIFF has JPG compression (probably some others too), it cannot read the correct information.
Something is rotten in the state of Denmark (c)
P.S. One can convert the file with the help of Paint (in old Windows, Paint Brush; something is rotten in that state too) or Photoshop, any version. Then it can be opened from Python. I'm looking for a simple exe which can do the conversion, to call it from Python. Probably Bulk Image Converter will do.
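To make the "container" point concrete: Pillow exposes the TIFF tag directory on opened TIFF images, so you can check the compression scheme before deciding how to convert. A small sketch (tag 259 is the standard TIFF Compression tag; tag_v2 is Pillow's TIFF-only tag directory):
from PIL import Image

im = Image.open('test.tiff')
# TIFF tag 259 holds the compression scheme:
# 1 = uncompressed, 5 = LZW, 7 = JPEG, 8 = Deflate/Zip
print(im.tag_v2.get(259))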
I liked the solution suggested in this answer: https://stackoverflow.com/a/28872806/12808155
But checking for tiff by extension is, in my opinion, not entirely correct, since there may be situations when the .tif extension does not reflect the actual file format: for example, when indexing, macOS creates hidden files (._DSC_123.tif).
For a more universal solution, I suggest using the python-magic library (https://pypi.org/project/python-magic).
The code for checking for tiff format may look like this:
import magic

def check_is_tif(filepath: str) -> bool:
    allowed_types = [
        'image/tiff',
        'image/tif'
    ]
    if magic.from_file(filepath, mime=True) not in allowed_types:
        return False
    return True
The complete code may look like this:
import argparse
import os

import magic
from PIL import Image
from tqdm import tqdm

def check_is_tif(filepath: str) -> bool:
    allowed_types = [
        'image/tiff',
        'image/tif'
    ]
    if magic.from_file(filepath, mime=True) not in allowed_types:
        return False
    return True

def count_total(path: str) -> int:
    print('Please wait till total files are counted...')
    result = 0
    for root, _, files in os.walk(path):
        for name in files:
            if check_is_tif(os.path.join(root, name)) is True:
                result += 1
    return result

def convert(path) -> None:
    progress = tqdm(total=count_total(path))
    for root, _, files in os.walk(path):
        for name in files:
            if check_is_tif(os.path.join(root, name)) is True:
                file_path = os.path.join(root, name)
                outfile = os.path.splitext(file_path)[0] + ".jpg"
                try:
                    im = Image.open(file_path)
                    im.thumbnail(im.size)
                    im.save(outfile, "JPEG", quality=80)
                    os.unlink(file_path)
                except Exception as e:
                    print(e)
                progress.update()

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Recursive TIFF to JPEG converter')
    parser.add_argument('path', type=str, help='Path to directory with TIFF files')
    args = parser.parse_args()
    convert(args.path)

Reading zipped ESRI BIL files with Python

I have precipitation data from the PRISM Climate Group, which is now offered in .bil format (ESRI BIL, I think), and I'd like to be able to read these datasets with Python.
I've installed the spectral package, but the open_image() method returns an error:
def ReadBilFile(bil):
    import spectral as sp
    b = sp.open_image(bil)

ReadBilFile(r'G:\truncated\ppt\1950\PRISM_ppt_stable_4kmM2_1950_bil.bil')

IOError: Unable to determine file type or type not supported.
The documentation for spectral clearly says that it supports BIL files; can anyone shed any light on what's happening here? I am also open to using GDAL, which supposedly supports the similar/equivalent ESRI EHdr format, but I can't find any good code snippets to get started.
It's now 2017 and there is a slightly better option. The package rasterio supports bil files.
>>> import rasterio
>>> tmean = rasterio.open('PRISM_tmean_stable_4kmD1_20060101_bil.bil')
>>> tmean.affine
Affine(0.041666666667, 0.0, -125.0208333333335,
       0.0, -0.041666666667, 49.9375000000025)
>>> tmean.crs
CRS({'init': 'epsg:4269'})
>>> tmean.width
1405
>>> tmean.height
621
>>> tmean.read().shape
(1, 621, 1405)
Ok, I'm sorry to post a question and then answer it myself so quickly, but I found a nice set of course slides from Utah State University that has a lecture on opening raster image data with GDAL. For the record, here is the code I used to open the PRISM Climate Group datasets (which are in the EHdr format).
import gdal

def ReadBilFile(bil):
    gdal.GetDriverByName('EHdr').Register()
    img = gdal.Open(bil)
    band = img.GetRasterBand(1)
    data = band.ReadAsArray()
    return data

if __name__ == '__main__':
    a = ReadBilFile(r'G:\truncated\ppt\1950\PRISM_ppt_stable_4kmM2_1950_bil.bil')
    print a[44, 565]
EDIT 5/27/2014
I've built upon my answer above and wanted to share it here since the documentation seems to be lacking. I now have a class with one main method that reads the BIL file as an array and returns some key attributes.
import gdal
import gdalconst
import numpy as np  # needed for the masked-array handling below

class BilFile(object):

    def __init__(self, bil_file):
        self.bil_file = bil_file
        self.hdr_file = bil_file.split('.')[0] + '.hdr'

    def get_array(self, mask=None):
        self.nodatavalue, self.data = None, None
        gdal.GetDriverByName('EHdr').Register()
        img = gdal.Open(self.bil_file, gdalconst.GA_ReadOnly)
        band = img.GetRasterBand(1)
        self.nodatavalue = band.GetNoDataValue()
        self.ncol = img.RasterXSize
        self.nrow = img.RasterYSize
        geotransform = img.GetGeoTransform()
        self.originX = geotransform[0]
        self.originY = geotransform[3]
        self.pixelWidth = geotransform[1]
        self.pixelHeight = geotransform[5]
        self.data = band.ReadAsArray()
        self.data = np.ma.masked_where(self.data == self.nodatavalue, self.data)
        if mask is not None:
            self.data = np.ma.masked_where(mask == True, self.data)
        return self.nodatavalue, self.data
I call this class from the following function, where I use GDAL's /vsizip/ virtual file system to read the BIL file directly from a zip file.
import numpy as np
import pandas as pd
import prism

mm_to_in = 1.0 / 25.4  # millimetres to inches (value assumed; not defined in this excerpt)

def getPrecipData(years=None):
    grid_pnts = prism.getGridPointsFromTxt()
    flrd_pnts = np.array(pd.read_csv(r'D:\truncated\PrismGridPointsFlrd.csv').grid_code)
    mask = prism.makeGridMask(grid_pnts, grid_codes=flrd_pnts)
    for year in years:
        bil = r'/vsizip/G:\truncated\PRISM_ppt_stable_4kmM2_{0}_all_bil.zip\PRISM_ppt_stable_4kmM2_{0}_bil.bil'.format(year)
        b = prism.BilFile(bil)
        nodatavalue, data = b.get_array(mask=mask)
        data *= mm_to_in
        b.write_to_csv(data, 'PrismPrecip_{}.txt'.format(year))
    return

# Get datasets
years = range(1950, 2011, 5)
getPrecipData(years=years)
You've already figured out a good solution to reading the file so this answer is just in regard to shedding light on the error you encountered.
The problem is that the spectral package does not support the Esri multiband raster format. BIL (Band Interleaved by Line) is not a specific file format; rather, it is a data interleave scheme (like BIP and BSQ), which can be used in many file formats. The spectral package does support BIL for the file formats that it recognizes (e.g., ENVI, Erdas LAN) but the Esri raster header is not one of those.