Is it possible to store processed files where they were originally stored, using Google-provided utility templates? - google-cloud-platform

One of the Google Dataflow utility templates lets us compress files in GCS (Bulk Compress Cloud Storage files).
While the input parameter can take multiple patterns covering different folders (e.g: inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv,), is it actually possible to store the 'compressed'/processed files in the same folders where the originals were stored?

If you have a look at the documentation:
The extensions appended will be one of: .bzip2, .deflate, .gz.
Therefore, the new compressed files won't match the provided pattern (*.csv), so you can store them in the same folder without conflict.
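A minimal sketch of why the outputs do not collide with the inputs. It assumes the template simply appends the extension to the original file name, and it approximates the GCS pattern with Python's fnmatch, whose semantics may differ from GCS globbing in details. The helper name is hypothetical.

```python
from fnmatch import fnmatchcase

# Extensions listed in the documentation quoted above.
COMPRESSED_EXTENSIONS = (".bzip2", ".deflate", ".gz")


def matches_input_pattern(name: str, pattern: str = "*.csv") -> bool:
    """Return True if the object name would be picked up by the input pattern."""
    return fnmatchcase(name, pattern)


# The original file matches; each assumed output name does not.
for ext in COMPRESSED_EXTENSIONS:
    print("data.csv" + ext, matches_input_pattern("data.csv" + ext))
```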
In addition, this is a batch process. If you look more closely at the Dataflow IO component, in particular at how it reads from GCS with a pattern, the list of files to compress is read at the beginning of the job and does not change while the job runs.
Therefore, if new files matching the pattern arrive during a job, the current job won't take them into account. You will have to run another job to process them.
Finally, one last point: the existing uncompressed files aren't replaced by the compressed ones. That means you will have each file twice: a compressed and an uncompressed version. To save space (and money), I recommend deleting one of the two versions.
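If you decide to delete the uncompressed copies, one cautious approach is to remove an original only once its compressed counterpart is present. The sketch below works on a plain list of object names (however you obtain it) and again assumes the output name is the original name plus the extension. The function name is hypothetical.

```python
def redundant_originals(names, extensions=(".bzip2", ".deflate", ".gz")):
    """Return the names that also exist in compressed form.

    Only these are candidates for deletion; files without a compressed
    counterpart are left alone.
    """
    present = set(names)
    return sorted(
        name for name in present
        if any(name + ext in present for ext in extensions)
    )


listing = ["in/a.csv", "in/a.csv.gz", "in/b.csv"]
print(redundant_originals(listing))
```

Here only in/a.csv is reported, because in/b.csv has no compressed version yet.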

Related

Compression container-format with arbitrary file operations

Is there a mature compression format that allows arbitrary file operations on the items inside, such as delete/insert/update, without requiring the full archive to be recreated?
I'm aware of Sqlar, based on the SQLite file format, which naturally supports this, since these operations just delete, insert, or update records containing blobs. But it is more of an experimental project created with other goals in mind, and it is not widely adopted.
UPDATE: to be more precise about what I have in mind, this is more like a file system inside the archive, where inserted files might occupy different "sectors" inside the container, depending on previous delete and update operations. But the "chain" of each file is compressed as it is added, so it effectively occupies less space than the original file.
The .zip format. You may need to copy the zip file contents to do a delete, but you don't need to recreate the archive.
Update:
The .zip format can, in principle, support the deletion and addition of entries without copying the entire zip file, as well as the re-use of the space from deleted entries. The central directory at the end can be updated and cheaply rewritten. I have heard of it being done. You would have to deal with fragmentation, as with any file system. I am not aware of an open-source library that supports using a zip file as a file system. The .zip format does not support breaking an entry into sectors that could be scattered across the zip file, as file systems do. A single entry has to be contiguous in a zip file.
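As a small illustration of the "addition without recreating the archive" part, Python's standard zipfile module can open an existing archive in append mode. This sketch only covers adding entries; it does not delete entries or reuse freed space, and the helper names are hypothetical.

```python
import zipfile


def append_entry(zip_path: str, name: str, data: bytes) -> None:
    """Add one entry to an existing archive without rewriting earlier entries."""
    with zipfile.ZipFile(zip_path, mode="a", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, data)


def entry_offsets(zip_path: str) -> dict:
    """Map each entry name to the offset of its local header in the file."""
    with zipfile.ZipFile(zip_path) as archive:
        return {info.filename: info.header_offset for info in archive.infolist()}
```

Comparing entry_offsets before and after an append shows that the existing entries stay where they were; only the new entry and the central directory at the end are written.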

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory. So I have created a bunch of small files in S3 and am writing a script that reads these files and merges them. I am using AWS Data Wrangler (awswrangler) to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. I also can't remove chunked=True, because otherwise my program fails with OOM.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post it's not possible in a local folder, let alone S3. On top of that, Parquet is compressed, and I don't think it would be easy to add a row to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a computer with enough RAM.
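The "loop over the small files and create bigger ones" idea can be sketched as a grouping step that keeps each batch under a size budget. This is only the planning part, under the assumption that you already have (path, size in bytes) pairs from a listing. The function name and the budget are hypothetical, and on-disk Parquet size is only a rough proxy for in-memory size.

```python
def group_into_batches(files, max_batch_bytes):
    """Group (path, size) pairs into consecutive batches under a size budget.

    A single file larger than the budget still gets its own batch.
    """
    batches, current, current_bytes = [], [], 0
    for path, size in files:
        # Start a new batch when adding this file would exceed the budget.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


files = [("a.parquet", 40), ("b.parquet", 40), ("c.parquet", 40), ("d.parquet", 100)]
print(group_into_batches(files, max_batch_bytes=100))
```

Each batch can then be read, concatenated, and written back as one larger file, so the number of files shrinks without any single step needing the whole dataset in memory.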

Load multiple files, check file name, archive a file

In Data Fusion pipeline:
How do I read all the file names from a bucket, load some based on the file name, and archive the others?
Is it possible to run a gsutil script from the Data Fusion pipeline?
Sometimes more complex logic needs to be put in place to decide which files should be loaded. I need to go through all the files in a location and then load only those with the current date or later. The date is in the file name as a suffix, e.g. customer_accounts_2021_06_15.csv
Depending on where you are planning on writing the files to, you may be able to use the GCS Source plugin with the logicalStartTime Macro in the Regex Path Filter field in order to filter on only files after a certain date. However, this may cause all your file data to be condensed down to record formats. If you want to retain each specific file in their original formats, you may want to consider writing your own custom plugin.
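Independently of Data Fusion, the file-name rule from the question can be sketched in plain Python, for example as the core of a custom plugin or a pre-processing script. The function names are hypothetical, and the three-way split (load, archive, unmatched) is an assumption about how files without a date suffix should be treated.

```python
import re
from datetime import date

# Matches the _YYYY_MM_DD.csv suffix used in the question's example.
DATE_SUFFIX = re.compile(r"_(\d{4})_(\d{2})_(\d{2})\.csv$")


def file_date(name):
    """Return the date encoded in the file name, or None if there is none."""
    match = DATE_SUFFIX.search(name)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None


def split_by_date(names, cutoff):
    """Split names into (to_load, to_archive, unmatched) relative to cutoff."""
    to_load, to_archive, unmatched = [], [], []
    for name in names:
        parsed = file_date(name)
        if parsed is None:
            unmatched.append(name)
        elif parsed >= cutoff:
            to_load.append(name)
        else:
            to_archive.append(name)
    return to_load, to_archive, unmatched
```

In practice the cutoff would be date.today() or the pipeline's logical start date.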

Read Parquet Files from HDFS cluster

I'm looking for advice on how to read Parquet files from an HDFS cluster using Apache NiFi. In the cluster, there are multiple files under a single directory, and I want to read them all in one flow. Does NiFi provide a built-in component to read the files in an HDFS directory (Parquet in this case)?
Example: 3 files present in the directory:
hdfs://app/data/customer/file1.parquet
hdfs://app/data/customer/file2.parquet
hdfs://app/data/customer/file3.parquet
Thanks!
You can use the FetchParquet processor in combination with the ListHDFS/GetHDFS etc. processors.
This processor was added starting with NiFi 1.2, and Jira NiFi-3724 addresses this improvement.
ListHDFS // stores state and runs incrementally.
GetHDFS // doesn't store state; gets all the files from the configured directory (set the Keep Source File property to true in case you don't want to delete the source file).
You can also use other means (UpdateAttribute etc.) to add the fully qualified filename as an attribute on the flowfile, then feed the connection to the FetchParquet processor, which fetches those Parquet files.
The FetchParquet processor reads the Parquet files and writes them in the format of the RecordWriter you specify.
Flow:
ListHDFS/GetHDFS -> FetchParquet -> other processors
If your requirement is to read the files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. You can use either of the two approaches:
A combination of ListHDFS and FetchHDFS
GetHDFS
The difference between the two approaches is that GetHDFS lists the full contents of the configured directories on every run, so it will produce duplicates. The former approach, however, keeps track of state, so only new additions and/or modifications are returned in subsequent runs.
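The difference can be pictured with a toy model. This is not NiFi code and not how the processors are implemented internally. It is only a simplified illustration of a stateful listing (remembering the newest modification time seen so far) versus a stateless one, with hypothetical names throughout.

```python
class IncrementalLister:
    """Toy stand-in for a stateful listing: remembers the newest timestamp seen."""

    def __init__(self):
        self._last_seen = -1

    def list_new(self, entries):
        """entries maps path -> modification time; return only unseen paths."""
        new_paths = sorted(
            path for path, mtime in entries.items() if mtime > self._last_seen
        )
        if entries:
            self._last_seen = max(self._last_seen, max(entries.values()))
        return new_paths


def list_all(entries):
    """Toy stand-in for a stateless listing: returns everything on every run."""
    return sorted(entries)
```

Running both against the same directory twice shows the stateless version returning every file again, while the stateful one returns only what was added or modified since the previous run.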

AWS S3: distributed concatenation of tens of millions of json files in s3 bucket

I have an s3 bucket with tens of millions of relatively small json files, each less than 10 K.
To analyze them, I would like to merge them into a small number of files, each having one json per line (or some other separator), and several thousands of such lines.
This would allow me to more easily (and performantly) use all kind of big data tools out there.
Now, it is clear to me this cannot be done with one command or function call, but rather a distributed solution is needed, because of the amount of files involved.
The question is whether there is something ready-made and packaged, or whether I must roll my own solution.
I don't know of anything out there that can do this out of the box, but you can pretty easily do it yourself. The solution also depends a lot on how fast you need to get this done.
2 suggestions:
1) List all the files, split the list, download sections, merge and reupload.
2) List all the files, then go through them one at a time, read/download each one and write it to a Kinesis stream. Configure Kinesis to dump the files to S3 via Kinesis Firehose.
In both scenarios the tricky bit is going to be handling failures and ensuring you don't get the data multiple times.
For completeness, if the files were larger (>5MB) you could also leverage http://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html which would allow you to merge files in S3 directly without having to download them.
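For the first suggestion, one way to make retries safe is to give every batch a deterministic output name, so that a rerun overwrites a merged file instead of producing a duplicate. The sketch below covers only the planning step. The merged/part-... naming and the function names are hypothetical, and it assumes the key listing is the same on every run.

```python
def plan_batches(keys, batch_size):
    """Yield (output_name, batch_keys) pairs with deterministic output names."""
    ordered = sorted(keys)  # stable order so batch contents do not shift between runs
    for index, start in enumerate(range(0, len(ordered), batch_size)):
        yield f"merged/part-{index:06d}.jsonl", ordered[start:start + batch_size]


def pending_batches(keys, batch_size, completed):
    """Return only the batches whose output has not been written yet."""
    return [
        (name, batch)
        for name, batch in plan_batches(keys, batch_size)
        if name not in completed
    ]
```

Each worker can then take one pending batch, download its keys, merge them, and upload the result under the planned name.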
Assuming each json file is one line only, then I would do:
cat * >> bigfile
This will concat all files in a directory into the new file bigfile.
You can now read bigfile one line at a time, json decode the line and do something interesting with it.
If your json files are formatted for readability, then you will first need to combine all the lines in the file into one line.
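A minimal sketch of that last step, using only the standard json module: parse each document and re-serialize it without newlines, then write one document per line. The function names are hypothetical. Doing the merge this way also guarantees a newline between documents, which plain concatenation does not if a file lacks a trailing newline.

```python
import json


def to_json_line(text: str) -> str:
    """Collapse a (possibly pretty-printed) JSON document onto a single line."""
    return json.dumps(json.loads(text), separators=(",", ":"))


def merge_documents(documents) -> str:
    """Join JSON documents into JSON Lines: one compact document per line."""
    return "".join(to_json_line(doc) + "\n" for doc in documents)
```

The merged output can then be read back one line at a time with json.loads.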