Using apache-arrow with chunked files

I would like to use the apache-arrow library for internal development that deals with object storage. The idea is to do some processing on these files (objects), but each file is stored in equally sized chunks.
The idea is to stream each chunk and apply the processing.
For example, if I have a CSV file with 4 columns, it will be split into equally sized chunks that are not aligned with line boundaries, so do I need to combine each chunk with the next one for processing?
Example :
file
col1,col2,col3,col4
a1,b1,c1,d1
...
an,bn,cn,dn
split into 6 chunks of equal size:
chunk1:
a1,b1,c1,d1
...
ak,bk
chunk2:
ck,dk
...
am,bm,cm,dm
and so on
Any idea how to achieve this using the C++ Arrow library?
Thanks

Related

AWS Athena - how to process huge results file

I am looking for a way to process a ~4 GB file which is the result of an Athena query, and I am trying to find out:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also does not look possible to split it with Lambda - the file is too large, and it looks like s3.open(input_file, 'r') does not work in Lambda :(
Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) to send them to an external source (POST requests).
You can use CTAS with Athena and its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3, based on your desired transformation logic and output format.
It is usually advised to store the newly created table in a compressed format such as Parquet; however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. The meaningful partitioning is based on your use case and the way you want to split your data. The most common way is to use time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is recurring and needs to be automated, and you want to stay "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation you can request a range that is about what you think you can fit in memory, process it, and then request the next. When you reach the last line break in a chunk, request the next chunk and prepend the partial trailing line to it. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure you should be fine.
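A minimal sketch of that read-and-carry loop in C++: fetch_range() is a placeholder for whatever issues the ranged GET (AWS SDK, libcurl, ...), and process_line() stands for your per-line processing; both names are made up for the sketch.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Placeholder: performs a GET with "Range: bytes=start-end" (inclusive) and
// returns the raw bytes of that range.
std::string fetch_range(const std::string& key, uint64_t start, uint64_t end);

// Placeholder: handles one complete CSV line.
void process_line(const std::string& line);

// Stream an object of known size in fixed-size ranges, carrying the partial
// trailing line of each chunk over to the next request.
void stream_csv(const std::string& key, uint64_t object_size, uint64_t chunk_size) {
    std::string carry;  // incomplete line left over from the previous chunk
    for (uint64_t offset = 0; offset < object_size; offset += chunk_size) {
        uint64_t last = std::min(offset + chunk_size, object_size) - 1;
        std::string chunk = carry + fetch_range(key, offset, last);
        carry.clear();
        size_t line_start = 0;
        while (true) {
            size_t nl = chunk.find('\n', line_start);
            if (nl == std::string::npos) {
                carry = chunk.substr(line_start);  // keep the partial line
                break;
            }
            process_line(chunk.substr(line_start, nl - line_start));
            line_start = nl + 1;
        }
    }
    if (!carry.empty()) process_line(carry);  // file may not end with a newline
}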
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though, you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line you can pad each chunk with that amount (both at the beginning and end). When you read a chunk you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding – with special handling of the first and last chunk, obviously.
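One consistent way to do that trimming in C++ (all names here are made up for the sketch): the convention applied at every boundary is that the line straddling it belongs to the chunk on the left, so neighbouring workers never handle the same line twice.
#include <cstddef>
#include <cstdint>
#include <string>

// buf holds this worker's nominal byte range [range_start, range_end) plus up
// to one maximum line length of padding on each side (none at the ends of the
// file); buf_start is the absolute file offset of buf[0]. Returns only the
// lines this worker should parse.
std::string owned_lines(const std::string& buf, uint64_t buf_start,
                        uint64_t range_start, uint64_t range_end,
                        uint64_t file_size) {
    // Begin after the newline that ends the line containing the byte just
    // before our nominal range: that straddling line belongs to our left
    // neighbour, not to us.
    size_t begin = 0;
    if (range_start > 0) {
        size_t nl = buf.find('\n', static_cast<size_t>(range_start - 1 - buf_start));
        begin = (nl == std::string::npos) ? buf.size() : nl + 1;
    }
    // End after the newline that ends the line containing our last nominal
    // byte: that straddling line is ours, and the trailing padding guarantees
    // its end is present in buf.
    size_t end = buf.size();
    if (range_end < file_size) {
        size_t nl = buf.find('\n', static_cast<size_t>(range_end - 1 - buf_start));
        if (nl != std::string::npos) end = nl + 1;
    }
    return buf.substr(begin, end - begin);
}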

Compress .npy data to save space on disk

I have stored a huge dataset on my disk. Since the dataset is about 1.5 TB, I divided it into 32 samples to be able to use numpy.save('data_1.npy') in Python 2.7. Here is a sample of 9 sub-datasets; each one is about 30 GB.
The shape of each .npy file is (number_of_examples,224,224,19) and values are float.
data_1.npy
data_2.npy
data_3.npy
data_4.npy
data_5.npy
data_6.npy
data_7.npy
data_8.npy
data_9.npy
Using np.save('*.npy'), my dataset occupies 1.5 TB on my disk.
1) Is there an efficient way to compress my dataset in order to gain some free disk space?
2) Is there an efficient way of saving the files that takes up less space than np.save()?
Thank you
You might want to check out xz compression mentioned in this answer. I've found it to be the best compression method while saving hundreds of thousands of .npy files adding up to a few hundred GB. The shell command for a directory called dataset containing your .npy files would be:
tar -cvJf dataset.tar.xz dataset/
This is just to save disk space while storing and moving the dataset; it needs to be decompressed before loading into python.

Can I perform gzseek to update a file compressed using gzwrite (C++)?

I have a file written using gzwrite. Now I want to edit this file and insert some data in the middle by seeking. Is this possible with gzseek/gzwrite in C++?
No, it isn't possible. You have to create a new file by successively writing the pieces.
So it is not much different from inserting data in the middle of an uncompressed file, except for one thing: with the uncompressed file, you could leave a hole of the right size (a series of spaces, for example) and later on overwrite that with the data to be inserted, but of course that is not possible with the compressed file because you cannot predict its compressed length.
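To illustrate what "successively writing the pieces" looks like with the zlib gz* API, here is a rough C++ sketch: decompress-copy the original up to the insertion point, write the new data, then copy the rest. The name gz_insert, the buffer size, and the minimal error handling are only for illustration.
#include <zlib.h>
#include <cstddef>
#include <vector>

// Copies src (written with gzwrite) into a new compressed file dst, inserting
// `extra` after the first `insert_at` uncompressed bytes.
bool gz_insert(const char* src, const char* dst,
               const std::vector<char>& extra, size_t insert_at) {
    gzFile in = gzopen(src, "rb");
    gzFile out = gzopen(dst, "wb");
    if (!in || !out) {
        if (in) gzclose(in);
        if (out) gzclose(out);
        return false;
    }
    std::vector<char> buf(1 << 16);
    size_t copied = 0;
    bool inserted = false;
    int n;
    while ((n = gzread(in, buf.data(), static_cast<unsigned>(buf.size()))) > 0) {
        int offset = 0;
        // Once the insertion point falls inside this block, write the head of
        // the block, then the new data, then fall through to write the rest.
        if (!inserted && copied + static_cast<size_t>(n) >= insert_at) {
            int head = static_cast<int>(insert_at - copied);
            if (head > 0) gzwrite(out, buf.data(), static_cast<unsigned>(head));
            gzwrite(out, extra.data(), static_cast<unsigned>(extra.size()));
            inserted = true;
            offset = head;
        }
        if (n - offset > 0)
            gzwrite(out, buf.data() + offset, static_cast<unsigned>(n - offset));
        copied += static_cast<size_t>(n);
    }
    if (!inserted)  // insertion point was at or past the end of the file
        gzwrite(out, extra.data(), static_cast<unsigned>(extra.size()));
    gzclose(in);
    gzclose(out);
    return n == 0;  // gzread returns 0 at end of stream, negative on error
}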

Fast encode bitmap buffer to PNG using libpng

My objective is to convert a 32-bit bitmap (BGRA) buffer into a PNG image in real time using C/C++. To achieve it, I used the libpng library to convert the bitmap buffer and then write it into a PNG file. However, it takes a huge amount of time (~5 secs) to execute on the target ARM board (quad-core processor) in a single thread. On profiling, I found that the libpng compression process (the deflate algorithm) takes more than 90% of the time. So I tried to reduce it via parallelization in some way. The end goal here is to get it done in less than 0.5 secs.
Now, since a PNG can have multiple IDAT chunks, I thought of writing the PNG with multiple IDATs in parallel. To write a custom PNG file with multiple IDATs, the following methodology is adopted:
1. Write PNG IHDR chunk
2. Write IDAT chunks in parallel
i. Split input buffer in 4 parts.
ii. compress each part in parallel using zlib "compress" function.
iii. compute CRC of chunk { "IDAT"+zlib compressed data }.
iv. create IDAT chunk i.e. { "IDAT"+zlib compressed data+ CRC}.
v. Write length of IDAT chunk created.
vi. Write complete chunk in sequence.
3. write IEND chunk
Now the problem is that the PNG file created by this method is invalid or corrupted. Can somebody point out:
What am I doing wrong?
Is there any fast implementation of zlib compress or multi-threaded png creation, preferably in C/C++?
Any other alternate way to achieve target goal?
Note: The PNG specification is followed in creating chunks
Update:
This method works for creating the IDAT in parallel:
1. add one filter byte before each row of input image.
2. split image in four equal parts <-- may not be required; passing pointers into the buffer and their offsets is enough
3. Compress Image Parts in parallel
(A)for first image part
--deflateInit(zstrm, Z_BEST_SPEED)
--deflate(zstrm, Z_FULL_FLUSH)
--deflateEnd(zstrm)
--store compressed buffer and its length
--store adler32 for current chunk, {a1=zstrm->adler} <--adler is of uncompressed data
(B)for second and third image part
--deflateInit(zstrm, Z_BEST_SPEED)
--deflate(zstrm, Z_FULL_FLUSH)
--deflateEnd(zstrm)
--store compressed buffer and its length
--strip first 2-bytes, reduce length by 2
--store adler32 for current chunk zstrm->adler,{a2,a3 similar to A} <--adler is of uncompressed data
(C) for last image part
--deflateInit(zstrm, Z_BEST_SPEED)
--deflate(zstrm, Z_FINISH)
--deflateEnd(zstrm)
--store compressed buffer and its length
--strip first 2-bytes and last 4-bytes of buffer, reduce length by 6
--here the last 4 bytes should be equal to zstrm->adler, {a4=zstrm->adler} <--adler is of uncompressed data
4. adler32_combine() all four parts i.e. a1,a2,a3 & a4 <--last arg is length of uncompressed data used to calculate adler32 of 2nd arg
5. store total length of compressed buffers <--used when computing the CRC of the complete IDAT, and (plus 4 bytes for the adler32) written as the chunk length before the IDAT in the file
6. Append "IDAT" to Final chunk
7. Append all four compressed parts in sequence to Final chunk
8. Append adler32 checksum computed in step 4 to Final chunk
9. Append CRC of Final chunk i.e.{"IDAT"+data+adler}
To be written to the PNG file in this manner: [PNG_HEADER][PNG_DATA][PNG_END]
where [PNG_DATA] -> Length(4 bytes) + {"IDAT"(4 bytes) + data + adler(4 bytes)} + CRC(4 bytes); a small framing helper is sketched below.
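As a reference for steps 5-9, here is a minimal sketch of such a framing helper (the name write_chunk is made up): the data passed in would be the assembled zlib stream from steps 6-8, the length field counts only that data, and the CRC covers the type plus the data but never the length. The same helper works for IEND with a zero-length payload.
#include <zlib.h>    // for crc32()
#include <cstdint>
#include <cstdio>

// Writes one PNG chunk: 4-byte big-endian length of the payload, the 4-byte
// chunk type ("IDAT", "IEND", ...), the payload, then a CRC-32 computed over
// the type and the payload. The length field is not covered by the CRC.
void write_chunk(std::FILE* f, const char type[4],
                 const unsigned char* data, uint32_t len) {
    unsigned char be[4];
    auto put_be32 = [&](uint32_t v) {
        be[0] = static_cast<unsigned char>(v >> 24);
        be[1] = static_cast<unsigned char>(v >> 16);
        be[2] = static_cast<unsigned char>(v >> 8);
        be[3] = static_cast<unsigned char>(v);
        std::fwrite(be, 1, 4, f);
    };
    put_be32(len);                                  // payload length only
    std::fwrite(type, 1, 4, f);
    if (len) std::fwrite(data, 1, len, f);
    uLong crc = crc32(0L, Z_NULL, 0);
    crc = crc32(crc, reinterpret_cast<const Bytef*>(type), 4);
    if (len) crc = crc32(crc, data, len);
    put_be32(static_cast<uint32_t>(crc));           // CRC over type + payload
}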
Even when there are multiple IDAT chunks in a PNG datastream, they still contain a single zlib compressed datastream. The first two bytes of the first IDAT are the zlib header, and the final four bytes of the final IDAT are the zlib "adler32" checksum of the uncompressed data, i.e. it is computed before compression and does not cover the 2-byte header.
There is a parallel gzip (pigz) under development at zlib.net/pigz. It will generate zlib datastreams instead of gzip datastreams when invoked as "pigz -z".
For that you won't need to split up your input file because the parallel compression happens internally to pigz.
In your step ii, you need to use deflate(), not compress(). Use Z_FULL_FLUSH on the first three parts, and Z_FINISH on the last part. Then you can concatenate them to a single stream, after pulling off the two-byte header from the last three (keep the header on the first one), and pulling the four-byte check values off of the last one. For all of them, you can get the check value from strm->adler. Save those.
Use adler32_combine() to combine the four check values you saved into a single check value for the complete input. You can then tack that on to the end of the stream.
And there you have it.
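A rough C++ sketch of that recipe follows (the names deflate_part and build_zlib_stream are made up, return values go unchecked, and in a real encoder each deflate_part call would run on its own thread over one filter-byte-prefixed slice of rows; only the stitching is sequential here):
#include <zlib.h>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Compress one slice of the already-filtered image data as its own deflate
// stream: Z_FULL_FLUSH for all but the last part, Z_FINISH for the last one.
// Returns the raw deflate() output; *adler_out receives strm.adler, i.e. the
// adler32 of this slice's uncompressed bytes.
static std::string deflate_part(const unsigned char* data, size_t len,
                                int flush, uLong* adler_out) {
    z_stream strm;
    std::memset(&strm, 0, sizeof(strm));
    deflateInit(&strm, Z_BEST_SPEED);
    std::string out(deflateBound(&strm, len) + 16, '\0');  // +16 slack for the flush
    strm.next_in = const_cast<Bytef*>(data);
    strm.avail_in = static_cast<uInt>(len);
    strm.next_out = reinterpret_cast<Bytef*>(&out[0]);
    strm.avail_out = static_cast<uInt>(out.size());
    deflate(&strm, flush);                  // Z_FULL_FLUSH or Z_FINISH
    out.resize(strm.total_out);
    *adler_out = strm.adler;
    deflateEnd(&strm);
    return out;
}

// Stitch the independently compressed parts into one valid zlib stream:
// keep the 2-byte header only on the first part, drop the 4-byte adler32
// that Z_FINISH appended to the last part, then append the combined adler32
// of the whole uncompressed image. The result is the IDAT payload.
std::string build_zlib_stream(
        const std::vector<std::pair<const unsigned char*, size_t>>& parts) {
    std::string stream;
    uLong adler = adler32(0L, Z_NULL, 0);   // adler32 of zero bytes
    for (size_t i = 0; i < parts.size(); ++i) {
        int flush = (i + 1 == parts.size()) ? Z_FINISH : Z_FULL_FLUSH;
        uLong part_adler = 0;
        std::string out = deflate_part(parts[i].first, parts[i].second,
                                       flush, &part_adler);
        if (i > 0) out.erase(0, 2);                          // strip zlib header
        if (flush == Z_FINISH) out.resize(out.size() - 4);   // strip trailing adler32
        stream += out;
        adler = adler32_combine(adler, part_adler,
                                static_cast<z_off_t>(parts[i].second));
    }
    // Big-endian adler32 of the complete uncompressed data closes the stream.
    stream.push_back(static_cast<char>((adler >> 24) & 0xff));
    stream.push_back(static_cast<char>((adler >> 16) & 0xff));
    stream.push_back(static_cast<char>((adler >> 8) & 0xff));
    stream.push_back(static_cast<char>(adler & 0xff));
    return stream;
}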

Hadoop Distributed File System

I would like to modify the way an input file is split into blocks and stored in the Hadoop Distributed File System (for example, it splits the file based on block size, but my application requires the file to be split based on its content).
So I would like to know exactly which class splits the file into blocks based on Hadoop's block size property.
Blocks are the abstractions for HDFS and InputSplits are the abstractions for MapReduce. By default, one HDFS block corresponds to one InputSplit which can be modified.
HDFS by default divides the file into exact 64 MB blocks, which might also split across record boundaries. It's up to the InputFormat to create InputSplits from the blocks of data in the case of a file input format. Each InputSplit will be processed by a separate mapper.
example it splits the file based on block size but my application requires to split the file based on the file content
Think in terms of InputSplits and create a new InputFormat as per the application requirement. Here are some tutorials (1, 2 and 3) on creating a new InputFormat.