I have to pass a dynamic value in a REST API payload, and that value needs to be converted from text to a gzip-compressed string using Python 2.7 - python-2.7

In the payload, one value (shown below) needs to be converted to a gzip-compressed string in Python for the REST API; we are using Python 2.7.
Original Value:-
<ATTACHMENT_ID>ff38fac4-d962-49b1-843f-34d352b6ff49</ATTACHMENT_ID>0cc5e563-42e5-4e37-81ce-f25cc8756346<SF_OBJECT_ID>7d80f6ad-803b-475c-86b7-f10e28986df7</SF_OBJECT_ID><SF_OBJECT_TYPE>Agreement</SF_OBJECT_TYPE><SF_OBJECT_VERSION>1.0</SF_OBJECT_VERSION><SF_TEMPLATE_VERSION></SF_TEMPLATE_VERSION><SF_BUSINESS_OBJECT_CONTEXT>Agreement</SF_BUSINESS_OBJECT_CONTEXT>FalseFalse
To be converted to this gzip-compressed (base64-encoded) value:
H4sIAAAAAAAACn2Sb2+CMBDGP1HHvwIlaUgQ62SZaEZntlcG2qsjEUoAZ/btByZOdNne3T33e55ek6NR08+12LS6gbYvoQtpxHkUL1cs5btkHirlEJULjGTg2QgHhYUIdhRysHRcu/CUwgE1bj10SDxWUPdDbQrhgus5CNvgIgyOj4glACnbFYL4wwR71JgY6AraPSS10iHdQtuVug6pca0m42yxW8+eWHx+1JfEVF4uETGdAmHfFYh4hY+UZYJNAuJJ5VPjxjIJ4O8bFkb7FmBcY8qdJxNyy16yZJ2G1oM5xS7ySHK22jxHnF3FEfytDuLsNUtSlmWXmHidcvbG71b5i6IcquaQ9zB+xpg2y7zLqrztkx6qcJEfOqDGjUZf4LOE02Orj00iB/ddvywlZB/6dHbEuu6HXWb6WMu8/foJ/I8ZjuL+tL4Bxmc3aG0CAAA=
I am able to do this online using this tool: https://www.multiutil.com/text-to-gzip-compress/
I explored many options in Python 2.7 but was not able to get the exact format shown above after encoding. I need the exact string above after encoding.
I tried the code below, but it does not give me the required value after encoding:
import zlib
import base64

# 'compress' holds the original text; note zlib.compress produces a zlib-wrapped stream, not a gzip stream
code = base64.b64encode(zlib.compress(compress, 9))
code = code.decode('utf-8')

The "original value" you provided does not match the decompression of the gzip data you provided. The latter is:
<AptDocProperties><ATTACHMENT_ID>ff38fac4-d962-49b1-843f-34d352b6ff49</ATTACHMENT_ID><DocumentID>0cc5e563-42e5-4e37-81ce-f25cc8756346</DocumentID><MergeInfo><Version></Version></MergeInfo><SF_OBJECT_ID>7d80f6ad-803b-475c-86b7-f10e28986df7</SF_OBJECT_ID><SF_OBJECT_TYPE>Agreement</SF_OBJECT_TYPE><SF_OBJECT_VERSION>1.0</SF_OBJECT_VERSION><SF_TEMPLATE_VERSION></SF_TEMPLATE_VERSION><SF_BUSINESS_OBJECT_CONTEXT>Agreement</SF_BUSINESS_OBJECT_CONTEXT><TemplateID></TemplateID><HasSmartItem>False</HasSmartItem><ReviewGroupId></ReviewGroupId><HideShowSmartContentBoundary>False</HideShowSmartContentBoundary></AptDocProperties>
There are some similar substrings, but otherwise they are entirely different.
So maybe you should start by figuring out why that is.
Even once you get your data to match, you should not expect to be able to reconstruct the exact same compressed gzip stream. All that matters, and all you need to do, is to make a gzip stream that decompresses to exactly what you started with.
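For reference, here is a minimal Python 2.7 sketch of that: wrap the text in a real gzip stream with gzip.GzipFile over a cStringIO buffer, then base64-encode the result. The names gzip_b64 and original_xml are placeholders, and the output will not match the online tool byte for byte (timestamp, OS byte and compression settings differ), but it will decompress back to the same text, which is all the receiving side needs.
import base64
import gzip
from cStringIO import StringIO

def gzip_b64(text):
    buf = StringIO()
    # mtime=0 zeroes the timestamp field in the gzip header so repeated runs produce the same bytes
    gz = gzip.GzipFile(fileobj=buf, mode='wb', mtime=0)
    gz.write(text)
    gz.close()
    return base64.b64encode(buf.getvalue())

payload_value = gzip_b64(original_xml)  # original_xml holds the <AptDocProperties> string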

Related

LZ4 giving different compressed data for different languages

I am using lz4 to compress data that is consumed by different applications written in Java, Go & Python.
Thankfully, libraries are available for those languages.
The source data is compressed using Golang, and the other languages decompress and use it.
However, the issue is that all of them give a different compressed-data base64 string for the same input string.
So the receiving applications are unable to get the expected data.
e.g Swift.lz4_compress("ABC") != Java.lz4_compress("ABC")
Is this an expected behaviour?
Thanks.

Use LZMA to encode a stream of information

The professor gave me a research paper that shows a way to efficiently compress some kind of data.
It's not worth explaining the full algorithm, since the question is not about that; I will just introduce a little example that should allow you to understand what the real question is about.
Our compression algorithm has its own dictionary, which is a table (no matter how it is calculated, just assume that both compressor and decompressor have it); each table row holds a string.
To compress a message, the compressor starts from the beginning and searches for a match in the dictionary; if one is found it sends a MATCH message with the row id, and if nothing is found it sends a SET message with the string to set.
Note that a MATCH does not really have to be a complete match; it can be followed by several MISMATCH messages, each containing the offset of the wrong byte and the correct byte.
So for example the compressor might want to encode:
Now, in the paper they say that they entropy-encode this "stream" of data using LZMA, and they assume it is a trivial thing to do, without giving further details.
I've searched online but I didn't come up with anything. Do you have any idea on how this last step could be done? Do you have any reference?
There is a stream compression algorithm with a preset dictionary using LZMA as part of this open-source project: Zip-Ada. The preset dictionary is called "training data" there.
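If it helps, here is a rough sketch of the "entropy-code the token stream with LZMA" step using Python's lzma module (Python 3 standard library; this is not from the Zip-Ada answer, and the MATCH/SET/MISMATCH byte layout below is only an illustrative encoding, not the paper's actual format):
import lzma
import struct

MATCH, SET, MISMATCH = 0, 1, 2

def serialize(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == MATCH:                    # (MATCH, row_id)
            out += struct.pack('<BI', MATCH, tok[1])
        elif tok[0] == SET:                    # (SET, raw_bytes)
            out += struct.pack('<BI', SET, len(tok[1])) + tok[1]
        else:                                  # (MISMATCH, byte_offset, correct_byte)
            out += struct.pack('<BIB', MISMATCH, tok[1], tok[2])
    return bytes(out)

tokens = [(MATCH, 42), (MISMATCH, 3, 0x7f), (SET, b'new dictionary row')]
compressed = lzma.compress(serialize(tokens))  # the LZMA pass is the entropy coding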

How to compress ascii text without overhead

I want to compress small text (400 bytes) and decompress it on the other side. If I do it with a standard compressor like rar or zip, it writes metadata along with the compressed file, and the result is bigger than the file itself.
Is there a way to compress the file without this metadata and open it on the other side with parameters known ahead of time?
You can do raw deflate compression with zlib. That avoids even the six-byte header and trailer of the zlib format.
However you will find that you still won't get much compression, if any at all, with just 400 bytes of input. Compression algorithms need much more history than that to get rolling, in order to build statistics and find redundancy in the data.
You should consider either a dictionary approach, where you build a dictionary of representative strings to provide the compressor something to work with, or you can consider a sequence of these 400-byte strings to be a single stream that is decompressed as a stream on the other end.
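As a rough illustration of both points, here is a sketch of raw deflate with a preset dictionary using zlib (the zdict argument needs Python 3.3+; the dictionary contents and the names compress_raw/decompress_raw are only placeholders):
import zlib

# strings that tend to appear in the 400-byte messages (placeholder content)
PRESET_DICT = b'common field names, phrases and values seen in typical messages'

def compress_raw(data):
    # wbits=-15 selects raw deflate: no zlib or gzip header/trailer at all
    co = zlib.compressobj(9, zlib.DEFLATED, -15, 9, zlib.Z_DEFAULT_STRATEGY, PRESET_DICT)
    return co.compress(data) + co.flush()

def decompress_raw(blob):
    do = zlib.decompressobj(-15, PRESET_DICT)
    return do.decompress(blob)

assert decompress_raw(compress_raw(b'hello hello hello')) == b'hello hello hello'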
You can have a look at compression using Huffman codes. As an example, look here and here.

How to concat two or more gzip files/streams

I want to concatenate two or more gzip streams without recompressing them.
I mean I have A compressed to A.gz and B to B.gz, and I want to combine them into a single gzip (A+B).gz without compressing again, using C or C++.
Several notes:
Even though you can just concatenate two files and gunzip will know how to deal with them, most programs are not able to deal with the two chunks.
I once saw an example of code that does this just by decompressing the files and then manipulating the originals; this is significantly faster than normal re-compression, but it still requires O(n) CPU operations.
Unfortunately I can't find that example now (concatenation using decompression only); if someone can point to it I would be grateful.
Note: it is not a duplicate of this, because the proposed solution does not fit my needs.
Clarification edit:
I want to concatenate several compressed HTML pieces and send them to the browser as one page, in response to the request header "Accept-Encoding: gzip", with the response header "Content-Encoding: gzip".
If the streams are concatenated as simply as cat a.gz b.gz >ab.gz, the Gecko (Firefox) and KHTML web engines get only the first part (a); IE6 does not display anything, and Google Chrome displays the first part (a) correctly and the second part (b) as garbage (it does not decompress it at all).
Only Opera handles this well.
So I need to create a single gzip stream from several chunks and send it without re-compressing.
Update: I found gzjoin.c in the zlib examples; it does this using only decompression. The problem is that decompression is still slower than a simple memcpy.
It is still 4 times faster than the fastest gzip compression, but that is not enough.
What I need is to find out what data I have to save along with each gzip file so that the decompression procedure does not have to run, and how to compute that data during compression.
Look at RFC 1951 and RFC 1952.
The format is simply a series of members, each composed of three parts: a header, the data, and a trailer. The data part is itself a sequence of chunks, each chunk having its own header and data part.
To simulate the effect of gzipping the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-chunk flag, for instance) and the trailer correctly, and copy the data parts.
There is one problem: the trailer has a CRC-32 of the uncompressed data, and I'm not sure whether that is easy to compute when you know the CRCs of the parts.
Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC32 without decompressing the data, there are other things which need the decompression.
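For orientation only, here is a small sketch (not part of either answer) that unpacks the fixed pieces of a single-member gzip file per RFC 1952; the actual surgery on the deflate chunks and the last-chunk flag is the part gzjoin.c handles and is not shown here. The function name inspect_gzip_member is a placeholder.
import struct

def inspect_gzip_member(blob):
    # 10-byte header: magic, compression method, flags, mtime, extra flags, OS
    magic, method, flags, mtime, xfl, os_code = struct.unpack('<HBBIBB', blob[:10])
    assert magic == 0x8b1f and method == 8      # 1f 8b magic, deflate
    # 8-byte trailer: CRC-32 of the uncompressed data, then its length mod 2**32
    crc32, isize = struct.unpack('<II', blob[-8:])
    return {'flags': flags, 'mtime': mtime, 'crc32': crc32, 'uncompressed_size': isize}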
The gzip manual says that two gzip files can be concatenated as you attempted.
http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
So it appears that the other tools may be broken, as seen in this bug report.
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263
Apart from filing a bug report with each one of the browser makers, and hoping they comply, perhaps your program can cache the most common concatenations of the required data.
As others have mentioned you may be able to perform surgery:
http://www.gzip.org/zlib/rfc-gzip.html
This requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can easily be calculated by adding the lengths of the individual sub-files.
At the bottom of the last link, there is code for calculating a running CRC-32, named update_crc.
Calculating the CRC on the uncompressed files each time your process runs is probably cheaper than the gzip algorithm itself.
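In Python terms, the running-CRC idea looks roughly like this (zlib.crc32 takes the previous value as its second argument; the masking just forces an unsigned result on Python 2, and crc_and_length is a placeholder name):
import zlib

def crc_and_length(pieces):
    crc, total = 0, 0
    for piece in pieces:
        crc = zlib.crc32(piece, crc) & 0xffffffff  # running CRC-32 over the uncompressed pieces
        total += len(piece)                         # the ISIZE trailer field is this length mod 2**32
    return crc, total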
It seems that the original compression of the individual files is done by you. It also seems that the desired result (concatenation of several pieces) is small enough to be sent to a web browser in one page. In that case your efficiency concerns seem to be unwarranted.
Please note that (1) the gzjoin.c approach is very likely the best answer you could get to your question as stated, and (2) it is complicated microsurgery performed by one of the gzip originators and may not have been subjected to extensive stress testing.
Please consider a boring, understandable, reliable approach: store the original pieces UNcompressed, then select the required pieces, concatenate them, and compress the result. Note that the compression ratio may be better than that obtained by gluing together small compressed pieces.
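A sketch of that approach, assuming the pieces are available uncompressed (the names gzip_page and pieces are placeholders): write the selected pieces through one gzip.GzipFile into an in-memory buffer and send the result with "Content-Encoding: gzip".
import gzip
import io

def gzip_page(pieces):
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode='wb')
    for piece in pieces:        # each piece is an uncompressed HTML fragment
        gz.write(piece)
    gz.close()
    return buf.getvalue()       # one well-formed gzip stream for the whole page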
If tarring them is not out of the question (since the linked cat solution isn't viable for you):
tar cf A_B.gz.tar A.gz B.gz
Then, to get them back:
tar xf A_B.gz.tar

How can I download a utf-8-encoded web page with libcurl, preserving the encoding?

I'm trying to get libcurl to download a web page that is encoded in UTF-8, which is working fine, except that it converts it to ASCII and mangles some of the characters. Is there an easy way to get it to keep the content in UTF-8?
libcurl doesn't translate or convert the data at all, so there's actually nothing in particular you need to do. Just get it.
Check the CURL options for conversion. They might have been defined at compilation time.