LZ4 giving different compressed data for different languages

I am using lz4 to compress data that is then consumed by applications written in Java, Go & Python.
Thankfully, libraries are available for all of those languages.
The source data is compressed with Go, and the other languages decompress and use it.
However, the issue is that each of them produces a different base64 string of compressed data for the same input string.
So the receiving applications are unable to get the expected data.
e.g. Swift.lz4_compress("ABC") != Java.lz4_compress("ABC")
Is this expected behaviour?
Thanks.
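This behaviour is not specific to any one binding: LZ4 standardizes what a decompressor must accept, not the exact bytes a compressor emits, and different libraries may also wrap the data in different containers (raw block vs. the LZ4 frame format) or use different settings. A minimal sketch of the effect, assuming the third-party Python lz4 package rather than any of the libraries named in the question:

    import lz4.block
    import lz4.frame

    data = b"The same input string"

    framed   = lz4.frame.compress(data)                         # LZ4 frame format
    framed_c = lz4.frame.compress(data, content_checksum=True)  # same format, extra checksum
    raw      = lz4.block.compress(data)                         # raw block with 4-byte size prefix

    # The compressed bytes legitimately differ between the three variants...
    print(framed == framed_c, framed == raw)          # False False

    # ...but every variant round-trips back to the identical input.
    print(lz4.frame.decompress(framed) == data)       # True
    print(lz4.frame.decompress(framed_c) == data)     # True
    print(lz4.block.decompress(raw) == data)          # True

So rather than comparing base64 strings, the things to check are that the producer and every consumer agree on the container format (frame vs. raw block) and that a round trip through decompression reproduces the original data.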

Related

Trying to identify exact Lempel-Ziv variant of compression algorithm in firmware

I'm currently reverse engineering a firmware image that seems to be compressed, but I'm really having a hard time identifying which algorithm it is using.
I have the original uncompressed data dumped from the flash chip; below is some of the human-readable data, uncompressed vs (supposedly) compressed:
You can get the binary portion here, should it help: Link
From what I can tell, it might be using a Lempel-Ziv variant of compression algorithm such as LZO, LZF or LZ4.
gzip and zlib can be ruled out because they would leave very little to no human-readable data after compression.
I did try to compress the dumped data with the Lempel-Ziv variants mentioned above using their respective Linux CLI tools, but none of them produced exactly the same output as the "compressed data".
Another idea I have for now is to try to decompress the data with each algorithm and see what it gives. But this is very difficult due to the lack of headers in the compressed firmware. (Binwalk and signsrch both detected nothing.)
Any suggestion on how I can proceed?
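One low-effort first step, sketched below in Python (the file name is a placeholder and the signature list is deliberately short), is to scan the dump for the magic bytes of the framed variants; a miss only tells you that, if an LZ family codec is in use, it is probably a headerless raw block, for which trial decompression (e.g. with the python-lz4, python-lzo or python-lzf bindings) is the fallback:

    # Scan a firmware dump for magic bytes of framed compression formats.
    # Raw LZ4/LZO/LZF blocks carry no header at all, so finding nothing here
    # does not rule them out.
    SIGNATURES = {
        b"\x04\x22\x4d\x18": "LZ4 frame",
        b"\x02\x21\x4c\x18": "LZ4 legacy frame",
        b"\x89LZO\x00\r\n\x1a\n": "lzop",
        b"\x1f\x8b": "gzip",            # only 2 bytes: expect false positives
        b"\xfd7zXZ\x00": "xz",
        b"\x28\xb5\x2f\xfd": "zstd",
    }

    with open("firmware.bin", "rb") as f:    # placeholder file name
        blob = f.read()

    for magic, name in SIGNATURES.items():
        offset = blob.find(magic)
        while offset != -1:
            print(f"possible {name} stream at offset {offset:#x}")
            offset = blob.find(magic, offset + 1)

If nothing framed turns up, the brute-force route you describe is the way to go: slice the blob at plausible boundaries and feed each slice to the raw-block decompressors, treating any exception as "not this algorithm at this offset".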

Implementing Data Frames in OCaml

I have been learning OCaml on my own and I've been really impressed with the language. I wanted to develop a small machine learning library for practice, but I've run into a problem.
In Python one can use Pandas to load data files and then pass them to a library like Scikit-Learn very easily. I would like to emulate the same process in OCaml. However, there doesn't seem to be any data frames library for OCaml. I've checked 'ocaml-csv', but it doesn't really seem to do what I want. I also looked into 'Frames' from Haskell, but it relies on Template Haskell; I believe a simpler way to do the same thing should be possible, since Pandas can load a data file into memory without compile-time metaprogramming.
Does anyone know how data frames are implemented in Pandas or R? A quick Google search doesn't seem to return useful links.
Is it possible to use a parser generator such as Menhir to parse CSV files? Also, I'm unsure how static typing works with data frames.
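On the "how are data frames implemented" part: conceptually, a Pandas or R data frame is a collection of equal-length, homogeneously typed columns keyed by name, plus a row index; column types are discovered when the file is read, so no compile-time metaprogramming is needed. A rough sketch of that layout, in Python for familiarity (the class and method names are invented, and a real implementation would back each column with a packed array rather than a list):

    # A toy column-oriented "data frame": each column is a homogeneous sequence,
    # all columns share the same length, and a row is just a view across columns.
    class TinyFrame:
        def __init__(self, columns):
            self.columns = columns                        # name -> list of values
            lengths = {len(col) for col in columns.values()}
            assert len(lengths) == 1, "all columns must have the same length"
            self.nrows = lengths.pop()

        def column(self, name):
            return self.columns[name]

        def row(self, i):
            return {name: col[i] for name, col in self.columns.items()}

        @classmethod
        def from_csv_rows(cls, header, rows, parsers):
            # `parsers` maps column name -> conversion function (e.g. float),
            # which is exactly where a statically typed language has to decide
            # how column types are represented.
            data = {name: [] for name in header}
            for row in rows:
                for name, cell in zip(header, row):
                    data[name].append(parsers[name](cell))
            return cls(data)

    frame = TinyFrame.from_csv_rows(
        ["name", "height"],
        [["ada", "1.70"], ["grace", "1.65"]],
        {"name": str, "height": float},
    )
    print(frame.column("height"), frame.row(0))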
Would you have a reference describing the format of data frames? It may not be so hard to add to ocaml-csv if CSV is the underlying representation. The best way forward is to open an issue with a request and the needed information.

Memory consumption during parsing of YAML with yaml-cpp

I am developing a Qt application for an embedded system with limited memory. I need to receive a few megabytes of JSON data and parse it as fast as possible and without using too much memory.
I was thinking about using streams:
JSON source (HTTP client) ---> ZIP decompressor ---> YAML parser ---> objects mapped to database
Data will arrive from network much slower than I can parse it.
How much memory does yaml-cpp need to parse 1 MB of data?
I would like the raw data already consumed from the decompressor, and the internal memory the YAML parser uses for it, to be released as soon as the corresponding object is mapped to the database. Is that possible?
Does yaml-cpp support asynchronous parsing, so that as soon as a JSON object is parsed I can store it in the database without waiting for the full content from the HTTP source?
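On the decompression stage of that pipeline: if the payload is gzip/deflate-compressed rather than a .zip archive (an assumption), it can be consumed incrementally so that only one network chunk is held at a time; zlib offers the same streaming inflate interface from C++. A Python sketch of the idea (the names http_chunks and parser are hypothetical):

    import zlib

    def iter_decompressed(chunks):
        # Yield decompressed data piece by piece from an iterable of compressed
        # byte blocks, e.g. as they arrive from the HTTP client.
        # wbits=MAX_WBITS | 32 auto-detects a zlib or gzip header.
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 32)
        for chunk in chunks:
            piece = d.decompress(chunk)
            if piece:
                yield piece          # hand to the parser, then let it be freed
        tail = d.flush()
        if tail:
            yield tail

    # Usage sketch: feed each decompressed piece straight to an incremental
    # parser and map completed objects to the database as they appear.
    # for piece in iter_decompressed(http_chunks):
    #     parser.feed(piece)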
Since you have memory constraints and your data is already in JSON, you should use a low-memory JSON parser instead of a YAML parser. Try jsoncpp, although I'm not sure what its support for streaming is (since JSON doesn't have the concept of documents).
yaml-cpp is designed for streaming, so it won't block if there are documents to parse but the stream is still open; however, there is an outstanding issue in yaml-cpp where it reads more than a single document at a time, so it really isn't designed for extremely low memory usage.
As for how much memory it takes to parse 1 MB of data, it is probably on the order of 3 MB (the raw input stream, plus the parsed stream, plus the resulting data structure), but it may vary dramatically depending on what kind of data you're parsing.

How to compress ascii text without overhead

I want to compress a small text (400 bytes) and decompress it on the other side. If I do it with a standard compressor like rar or zip, it writes metadata along with the compressed file, and the result is bigger than the file itself.
Is there a way to compress the file without this metadata and open it on the other side with parameters known ahead of time?
You can do raw deflate compression with zlib. That avoids even the six-byte header and trailer of the zlib format.
However you will find that you still won't get much compression, if any at all, with just 400 bytes of input. Compression algorithms need much more history than that to get rolling, in order to build statistics and find redundancy in the data.
You should consider either a dictionary approach, where you build a dictionary of representative strings to provide the compressor something to work with, or you can consider a sequence of these 400-byte strings to be a single stream that is decompressed as a stream on the other end.
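A minimal sketch of both suggestions using Python's zlib module (the sample message and the dictionary contents are invented for illustration): a negative wbits value selects raw deflate with no header or checksum, and the preset dictionary must be byte-identical on both sides:

    import zlib

    # Strings expected to appear in the 400-byte messages; both sides must
    # agree on these exact bytes ahead of time.
    PRESET_DICT = b'"temperature":"humidity":"status":"ok","error":'

    def pack(payload: bytes) -> bytes:
        # wbits=-15 -> raw deflate: no zlib header, no trailing checksum.
        c = zlib.compressobj(level=9, wbits=-15, zdict=PRESET_DICT)
        return c.compress(payload) + c.flush()

    def unpack(blob: bytes) -> bytes:
        d = zlib.decompressobj(wbits=-15, zdict=PRESET_DICT)
        return d.decompress(blob) + d.flush()

    msg = b'{"temperature": 21.5, "humidity": 40, "status": "ok"}'
    packed = pack(msg)
    print(len(msg), "->", len(packed), "bytes")
    assert unpack(packed) == msg

In practice the dictionary would be built from representative real messages rather than guessed, since its quality largely determines how much those 400-byte payloads shrink.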
You can have a look at compression using Huffman codes. As an example, look here and here.
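For completeness, a small sketch of building a Huffman code table with plain Python and the standard-library heapq module (the sample message is invented). What makes this attractive for tiny payloads is that the table can be fixed and agreed on ahead of time, so nothing but the coded bits needs to be transmitted:

    import heapq
    from collections import Counter

    def huffman_table(text):
        # Build a Huffman code table (symbol -> bit string) for `text`.
        # Assumes at least two distinct symbols in the input.
        freq = Counter(text)
        # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
        heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, i2, right = heapq.heappop(heap)
            # Merge the two cheapest subtrees, extending their codes by one bit.
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (f1 + f2, i2, merged))
        return heap[0][2]

    message = "a small ascii status message with repeated words words words"
    table = huffman_table(message)
    encoded = "".join(table[ch] for ch in message)
    print(len(message) * 8, "bits raw ->", len(encoded), "bits Huffman-coded")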

What compression/archive formats support inter-file compression?

This question on archiving PDFs got me wondering: if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.
Several formats do inter-file compression.
The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
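A minimal illustration of the .tar.gz route with Python's standard tarfile module (file names are placeholders): because every member is written into one gzip stream, redundancy shared between the files can actually be exploited.

    import tarfile

    # All members go through a single gzip stream, so matches can span file
    # boundaries ("solid"-style compression), unlike zipping each file alone.
    members = ["letter1.pdf", "letter2.pdf", "letter3.pdf"]   # placeholder names

    with tarfile.open("letters.tar.gz", "w:gz", compresslevel=9) as archive:
        for name in members:
            archive.add(name)

As the answers below point out, gzip's roughly 32 KB window limits how far apart two files can be and still share matches, which is where the delta and Bentley/McIlroy-style approaches come in for large templates.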
Take a look at google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general the algorithms you are looking for are ones based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
They are used by Google internally inside their Bigtable implementation to store compressed web pages, for much the same reason you are seeking them.
Since dictionary-based LZ compression (which pretty much all of these formats use) builds its table of repeated strings as it goes along, such a scheme as you desire would force you to decompress the entire archive at once.
If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.