There aren't many examples of zstd compression. I am using zstandard 0.8.1 and trying to compress 2 bytes at a time. I came across https://anaconda.org/rolando/zstandard, which mentions using write_to(fh), but I'm not sure how to use it. Below is my partial code, which tries to read a chunk of bytes from a file and then compress each chunk:
import zstandard as zstd

cctx = zstd.ZstdCompressor(level=4)
with open(path, 'rb') as fh:
    while True:
        bin_data = fh.read(2)  # read 2 bytes
        if not bin_data:
            break
        compressed = cctx.compress(bin_data)
fh.close()
with open(path, 'rb') as fh:
    with open(outpath, 'wb') as outfile:
        outfile.write(compressed)
...
But how should I use write_to()?
I think I found the right way to use the zstd 0.8.1 module for streaming chunks of bytes:
import zstandard as zstd

# Compress: stream chunks into the file through write_to()
with open(filename, 'wb') as fh:
    cctx = zstd.ZstdCompressor(level=4)
    with cctx.write_to(fh) as compressor:
        compressor.write(b'data1')
        compressor.write(b'data2')

# Decompress: read_from() yields decompressed chunks
with open(filename, 'rb') as fh:
    dctx = zstd.ZstdDecompressor()
    for chunk in dctx.read_from(fh, read_size=128, write_size=128):
        pass  # do something with each chunk
I'm using Django to generate a personalized file, but doing so writes a file to disk, which is quite wasteful in terms of space. This is how I do it right now:
with open(filename, 'wb') as f:
    pdf.write(f)  # pdf is an object from the PyPDF2 library

with open(filename, 'rb') as f:
    return send_file(data=f, filename=filename)  # send_file builds an HttpResponse that downloads the file data
So in the code above, a file is generated on disk. The easy fix would be to delete the file after downloading it, but I remember that in Java you can use a stream object to handle this case. Is it possible to do the same in Python?
EDIT:
def send_file(data, filename, mimetype=None, force_download=False):
    disposition = 'attachment' if force_download else 'inline'
    filename = os.path.basename(filename)
    response = HttpResponse(data, content_type=mimetype or 'application/octet-stream')
    response['Content-Disposition'] = '%s; filename="%s"' % (disposition, filename)
    return response
Without knowing the exact details of the pdf.write and send_file functions, I expect in both cases they will take an object that conforms to the BinaryIO interface. So, you could try using a BytesIO to store the content in an in-memory buffer, rather than writing out to a file:
import io

with io.BytesIO() as buf:
    pdf.write(buf)
    buf.seek(0)
    send_file(data=buf, filename=filename)
Depending on the exact nature of the above-mentioned functions, YMMV.
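As a side note on the buf.seek(0) call above: after writing, the buffer's cursor sits at the end, so reading without rewinding yields nothing. A generic sketch (the fake payload here stands in for whatever pdf.write() produces):

```python
import io

buf = io.BytesIO()
buf.write(b"%PDF-1.4 fake content")  # stand-in for pdf.write(buf)

# The cursor is now at the end of the buffer, so read() returns nothing.
assert buf.read() == b""

buf.seek(0)  # rewind before handing the buffer to the response
assert buf.read() == b"%PDF-1.4 fake content"
```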
I want to open a PDF from a Django view, but my PDF has no text layer and Python returns a blank PDF.
Each page is a scanned image: link
from django.http import HttpResponse

def views_pdf(request, path):
    with open(path) as pdf:
        response = HttpResponse(pdf.read(), content_type='application/pdf')
        response['Content-Disposition'] = 'inline;elec'
        return response
Exception Type: UnicodeDecodeError
Exception Value: 'charmap' codec can't decode byte 0x9d in position 373: character maps to <undefined>
Unicode error hint
The string that could not be encoded/decoded was: � ��`����
How do I tell Python that this is not text but a picture?
By default, Python 3 opens files in text mode, that is, it tries to interpret the contents of a file as text. This is what causes the exception that you see.
Since a PDF file is (generally) a binary file, try opening the file in binary mode. In that case, read() will return a bytes object.
Here's an example (in IPython). First, opening as text:
In [1]: with open('2377_001.pdf') as pdf:
   ...:     data = pdf.read()
   ...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-d807b6ccea6e> in <module>()
1 with open('2377_001.pdf') as pdf:
----> 2 data = pdf.read()
3
/usr/local/lib/python3.6/codecs.py in decode(self, input, final)
319 # decode input (taking the buffer into account)
320 data = self.buffer + input
--> 321 (result, consumed) = self._buffer_decode(data, self.errors, final)
322 # keep undecoded input until the next call
323 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Next, reading the same file in binary mode:
In [2]: with open('2377_001.pdf', 'rb') as pdf:
   ...:     data = pdf.read()
   ...:
In [3]: type(data)
Out[3]: bytes
In [4]: len(data)
Out[4]: 45659
In [5]: data[:10]
Out[5]: b'%PDF-1.4\n%'
That solves the first part, how to read the data.
The second part is how to pass it to a HttpResponse. According to the Django documentation:
"Typical usage is to pass the contents of the page, as a string, to the HttpResponse constructor"
So passing bytes might or might not work (I don't have Django installed to test). The Django book says:
"content should be an iterator or a string."
I found the following gist to write binary data:
from django.http import HttpResponse

def django_file_download_view(request):
    filepath = '/path/to/file.xlsx'
    with open(filepath, 'rb') as fp:  # small fix to read as binary
        data = fp.read()
    filename = 'some-filename.xlsx'
    response = HttpResponse(mimetype="application/ms-excel")
    response['Content-Disposition'] = 'attachment; filename=%s' % filename  # force browser to download file
    response.write(data)
    return response
The problem is probably that the file you are trying to use isn't in the encoding Python expects. You can easily find the encoding of your PDF in most PDF viewers, like Adobe Acrobat (under Properties). Once you've found out what encoding it's using, you can pass it to Python like so:
Replace

with open(path) as pdf:

with:

with open(path, encoding="whatever encoding your pdf is in") as pdf:

Try Latin-1 encoding; this often works.
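A note on why Latin-1 "often works": it maps every possible byte value 0x00-0xFF to a code point, so decoding never raises an error. For a binary PDF the result still isn't meaningful text, but it does round-trip losslessly. A quick sketch:

```python
# latin-1 assigns a character to every byte value, so decoding cannot fail
blob = bytes(range(256))
text = blob.decode("latin-1")

# and encoding back recovers the exact original bytes
assert text.encode("latin-1") == blob
```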
I use Python 2.7.8 and the requests library to download tar.gz archives from the USGS.gov site.
Data example:
http://dds.cr.usgs.gov/ltaauth//data/standard_l1t/etm/29/30/2016/LE70290302016178EDC00.tar.gz?id=48aq2ki3sr01iq18pdo8jdmi47&iid=LE70290302016178EDC00&did=252710635&ver=production
Sometimes my connection is interrupted and not all of the files extract properly from the archive (but the file is not completely corrupted). I use the following code (a part of it) to download data:
import os
import traceback

import requests

def download_file(url, file_path):
    # NOTE the stream=True parameter
    r = requests.get(url, timeout=120, stream=True)
    with open(file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
    return file_path

try:
    download_file(URL, scene_path)
except Exception:
    traceback.print_exc()
    if os.path.isfile(scene_path):
        os.remove(scene_path)
        print u'<= DEL'
How can I check whether a *.tar.gz archive is corrupted after download?
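One way to answer this (a sketch of my own, not from the post; the helper name targz_is_ok is hypothetical) is to read every member to the end: tarfile validates the headers while gzip verifies its CRC and end-of-stream marker, so truncated or corrupt archives raise an exception:

```python
import tarfile
import zlib

def targz_is_ok(path):
    """Return True if every member of the .tar.gz archive can be read to EOF."""
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                if member.isfile():
                    f = tar.extractfile(member)
                    while f.read(65536):  # reading to EOF forces gzip's integrity checks
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
```

A truncated download then shows up as targz_is_ok(scene_path) returning False, so the file can be deleted and re-fetched.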
So I have this simple python function:
def ReadFile(FilePath):
    with open(FilePath, 'r') as f:
        FileContent = f.readlines()
    return FileContent
This function is generic and is used to open all sorts of files. However, when the opened file is binary, the function does not behave as expected. Changing the open() call to:

with open(FilePath, 'rb') as f:

solves the issue for binary files (and seems to remain valid for text files as well).
Question:
Is it safe and recommended to always use rb mode for reading a file?
If not, what are the cases where it is harmful?
If not, how do you know which mode to use if you don't know what type of file you're working with?
Update
FilePath = r'f1.txt'

def ReadFileT(FilePath):
    with open(FilePath, 'r') as f:
        FileContent = f.readlines()
    return FileContent

def ReadFileB(FilePath):
    with open(FilePath, 'rb') as f:
        FileContent = f.readlines()
    return FileContent

with open("Read_r_Write_w", 'w') as f:
    f.writelines(ReadFileT(FilePath))

with open("Read_r_Write_wb", 'wb') as f:
    f.writelines(ReadFileT(FilePath))

with open("Read_b_Write_w", 'w') as f:
    f.writelines(ReadFileB(FilePath))

with open("Read_b_Write_wb", 'wb') as f:
    f.writelines(ReadFileB(FilePath))
where f1.txt is:
line1
line3
Files Read_b_Write_wb, Read_r_Write_wb & Read_r_Write_w are equal to the source f1.txt.
File Read_b_Write_w is:

line1

line3
In the Python 2.7 Tutorial:
https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
On Windows, 'b' appended to the mode opens the file in binary mode, so
there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows
makes a distinction between text and binary files; the end-of-line
characters in text files are automatically altered slightly when data
is read or written. This behind-the-scenes modification to file data
is fine for ASCII text files, but it’ll corrupt binary data like that
in JPEG or EXE files. Be very careful to use binary mode when reading
and writing such files. On Unix, it doesn’t hurt to append a 'b' to
the mode, so you can use it platform-independently for all binary
files.
My takeaway from that is that using 'rb' seems to be the best practice, and it looks like you ran into the problem they warn about: opening a binary file with 'r' on Windows.
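To make the line-ending translation concrete, here's a small sketch in Python 3, where text mode applies universal-newline translation on every platform (in Python 2 the translation happens only on Windows):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "newlines.txt")

# Write raw bytes containing Windows-style '\r\n' line endings.
with open(path, "wb") as f:
    f.write(b"line1\r\nline3\r\n")

# Binary mode returns the bytes exactly as stored on disk.
with open(path, "rb") as f:
    raw = f.read()    # b'line1\r\nline3\r\n'

# Text mode translates '\r\n' to '\n' while reading.
with open(path, "r") as f:
    text = f.read()   # 'line1\nline3\n'
```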
I'm trying to read a gzip file from a URL without saving a temporary file in Python 2.7. However, for some reason I get a truncated text file. I have spent quite some time searching the net for solutions without success. There is no truncation if I save the "raw" data back into a gzip file (see sample code below). What am I doing wrong?
My example code:
import urllib2
import zlib
from StringIO import StringIO
url = "ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/clinvar_00-latest.vcf.gz"
# Create an opener
opener = urllib2.build_opener()
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
# Fetch the gzip file
respond = opener.open(request)
compressedData = respond.read()
respond.close()
opener.close()
# Extract data and save to a text file
compressedDataBuf = StringIO(compressedData)
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
buffer = compressedDataBuf.read(1024)
saveFile = open('/tmp/test.txt', "wb")
while buffer:
    saveFile.write(d.decompress(buffer))
    buffer = compressedDataBuf.read(1024)
saveFile.close()
# Save "raw" data to new gzip file.
saveFile = open('/tmp/test.gz', "wb")
saveFile.write(compressedData)
saveFile.close()
Because that gzip file consists of many concatenated gzip streams, as permitted by RFC 1952. The gzip utility automatically decompresses all of the streams, but a single zlib decompressobj stops at the end of the first one.
You need to detect the end of each gzip stream and restart the decompression with the subsequent compressed data. Look at unused_data in the Python documentation.
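A minimal sketch of that restart loop (the helper name gunzip_all is mine, not from the answer): after each stream ends, the leftover bytes show up in unused_data, and a fresh decompressobj picks them up.

```python
import zlib

def gunzip_all(data):
    """Decompress a blob that may contain several concatenated gzip streams."""
    out = b""
    while data:
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+MAX_WBITS: expect a gzip header
        out += d.decompress(data)
        out += d.flush()
        data = d.unused_data  # bytes belonging to the *next* gzip stream, if any
    return out
```

For example, gzip.compress(b'a') + gzip.compress(b'b') decompresses to b'ab' in one call, where a single decompressobj would have stopped after b'a'.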