Load multiple files, check file name, archive a file - google-cloud-platform

In a Data Fusion pipeline:
How do I read all the file names from a bucket, load some based on file name, and archive the others?
Is it possible to run a gsutil script from the Data Fusion pipeline?
Sometimes more complex logic needs to be put in place to decide which files should be loaded. I need to go through all the files in a location and then load only those whose date is the current date or later. The date is in the file name as a suffix, e.g. customer_accounts_2021_06_15.csv

Depending on where you are planning to write the files, you may be able to use the GCS Source plugin with the logicalStartTime macro in the Regex Path Filter field in order to filter on only files after a certain date. However, this may cause all of your file data to be condensed down into record formats. If you want to retain each file in its original format, you may want to consider writing your own custom plugin.
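As an illustration only (the field name and macro are as described above, but the exact date pattern below is an assumption based on the file name in the question), a Regex Path Filter driven by the logicalStartTime macro could look something like this:
.*customer_accounts_${logicalStartTime(yyyy_MM_dd)}\.csv
Note that this only matches files stamped with the run's logical date; handling "current date or later", or archiving the files that were not loaded, would still require extra logic such as a custom plugin or a separate job that moves the already-processed objects.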

Related

How to write file-wide metadata into parquetfiles with apache parquet in C++

I use Apache Parquet to create Parquet tables with process information from a machine, and I need to store file-wide metadata (machine ID and machine name).
It is stated that Parquet files are capable of storing file-wide metadata; however, I couldn't find anything about it in the documentation.
There is another Stack Overflow post that explains how it is done with pyarrow. As far as I can tell from that post, I need some kind of key-value pair (maybe map<string, string>) and have to add it to the schema somehow.
I found a class in the Parquet source code called parquet::FileMetaData that may be used for this purpose; however, there is nothing in the docs about it.
Is it possible to store file-wide metadata with C++?
Currently I am using the stream_reader_writer example for writing Parquet files.
You can pass the file-level metadata when calling parquet::ParquetFileWriter::Open; see the source code here.
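As a minimal sketch (assuming a recent Apache Arrow / Parquet C++ release; the file name, the one-column schema and the metadata values are placeholders), the key point is the fourth argument of ParquetFileWriter::Open:

#include <arrow/io/file.h>
#include <arrow/util/key_value_metadata.h>
#include <parquet/file_writer.h>
#include <parquet/schema.h>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // Placeholder one-column schema; substitute your real schema here.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make(
      "value", parquet::Repetition::REQUIRED, parquet::Type::INT32,
      parquet::ConvertedType::INT_32));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  // File-wide key/value metadata, e.g. machine ID and machine name.
  auto file_metadata = ::arrow::key_value_metadata(
      {"MachineID", "MachineName"}, {"42", "press-line-3"});

  // The metadata is passed as the fourth argument of Open().
  auto out_file =
      ::arrow::io::FileOutputStream::Open("machine_data.parquet").ValueOrDie();
  std::unique_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(out_file, schema,
                                       parquet::default_writer_properties(),
                                       file_metadata);

  // ... write row groups / columns here, or wrap the writer in a
  // parquet::StreamWriter as in the stream_reader_writer example ...
  writer->Close();
  return 0;
}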

Is it possible to store processed files in the location where they were stored initially, using Google-provided utility templates?

One of the Google Dataflow utility templates allows us to compress files in GCS (Bulk Compress Cloud Storage Files).
While it is possible to have multiple inputs for the parameter, consisting of different folders (e.g. inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv,), is it actually possible to store the 'compressed'/processed files in the same folder where they were stored initially?
If you have a look at the documentation:
The extensions appended will be one of: .bzip2, .deflate, .gz.
Therefore, the new compressed files won't match the provided pattern (*.csv), and thus you can store them in the same folder without conflict.
In addition, this is a batch process. If you look deeper into the Dataflow IO component, especially the part that reads from GCS with a pattern, you will see that the list of files to compress is read at the beginning of the job and thus doesn't evolve during the job.
Therefore, if new files that match the pattern arrive during a job, they won't be taken into account by the current job. You will have to run another job to pick up these new files.
Finally, one last thing: the existing uncompressed files aren't replaced by the compressed ones. That means you will end up with each file twice: a compressed and an uncompressed version. To save space (and money), I recommend deleting one of the two versions.

Dynamic file names for exported files in the Google SDK

I am trying to use Google's cloud command-line interface (SDK) on my desktop to extract a file from Google BigQuery and place it in a Google Storage bucket. I have managed to do this initial part, but now I want to give the file a dynamic date name, as this process will be repeated in order to create a history of these files. The idea would be to have (filename).20.01.2020 or something like that, so that we could have an organised history of these exports.
Here is what I currently have:
bq extract mp-uid-all-touchpoints:83778322.prod_placementTouchpoints gs://touchpointsrecord/placements/%date%
What this does is correctly gather the current date and try to pass it as the file name. The problem is that '/' is treated as a folder separator in object names, so when the date is passed in the format 'dd/mm/yyyy', it ends up creating a new folder for the day, the month and then the year.
I need it to be just one file, not multiple nested folders.
I hope someone can help solve this.
Putting / in the name will produce a directory structure, so it becomes difficult to read/write objects with / in the name.
I suggest trying another date format such as DDMMYYYY, DD-MM-YYYY, or DD.MM.YYYY.
For example, for 20012020:
export CURR_DATE=$(date '+%d%m%Y')
bq extract mp-uid-all-touchpoints:83778322.prod_placementTouchpoints gs://touchpointsrecord/placements/$CURR_DATE
Hope this helps.
You can use this format to create the file:
gs://touchpointsrecord/placements-21-01-2020/
gs://touchpointsrecord/placements-$(date '+%d-%m-%Y')

Read Partial Parquet file

I have a Parquet file and I don't want to read the whole file into memory. I want to read the metadata and then read the rest of the file on demand; that is, for example, I want to read the second page of the first column in the third row group. How would I do that using the Apache Parquet C++ library? I have the offset of the part that I want to read from the metadata and can read it directly from disk. Is there any way to pass that buffer to the Apache Parquet library to uncompress, decode and iterate through the values? How about the same thing for a column chunk or a row group? Basically, I want to read the file partially and then pass it to the Parquet APIs to process it, as opposed to giving the file handle to the API and letting it go through the file. Is that possible?
Behind the scenes this is what the Apache Parquet C++ library actually does. When you pass in a file handle, it will only read the parts it needs to. As it requires the file footer (the main metadata) to know where to find the segments of data, this will always be read. The data segments will only be read once you request them.
There is no need to write special code for this; the library already has it built in. If you want to know in fine detail how this works, you only need to read the source of the library: https://github.com/apache/arrow/tree/master/cpp/src/parquet
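As a rough sketch of that access pattern (file name and column type are placeholders; the typed reader has to match your actual schema, and note that the public API exposes row groups and column chunks, while individual pages are handled internally):

#include <iostream>
#include <memory>
#include <parquet/api/reader.h>

int main() {
  // Opening the file reads only the footer (schema plus the offsets of
  // every row group and column chunk).
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("machine_data.parquet");

  std::shared_ptr<parquet::FileMetaData> metadata = reader->metadata();
  std::cout << "row groups: " << metadata->num_row_groups()
            << ", columns: " << metadata->num_columns() << std::endl;

  // No data has been read yet. Asking for the first column of the third
  // row group triggers I/O for just that column chunk.
  std::shared_ptr<parquet::RowGroupReader> row_group = reader->RowGroup(2);
  std::shared_ptr<parquet::ColumnReader> col = row_group->Column(0);

  // Assumes a required INT32 column; adjust the typed reader to your schema.
  auto int_reader = std::static_pointer_cast<parquet::Int32Reader>(col);
  int32_t value;
  int64_t values_read = 0;
  while (int_reader->HasNext()) {
    int_reader->ReadBatch(1, nullptr, nullptr, &value, &values_read);
    // ... process value ...
  }
  return 0;
}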

Reading Input Data from GCS

What is the suggested way of loading data from GCS? The sample code shows copying the data from GCS to the /tmp/ directory. If this is the suggested approach, how much data may be copied to /tmp/?
While you have that option, you shouldn't need to copy the data over to local disk. You should be able to reference training and evaluation data directly from GCS by referencing your files/objects using their GCS URI, e.g. gs://bucket/path/to/file. You can use these paths wherever you'd normally use local file system paths in TensorFlow APIs that accept file paths. TensorFlow supports the ability to read data from (and write to) GCS.
You should also be able to use a prefix to reference a set of matching files, rather than referencing each file individually.
Follow-up note: you'll want to check out https://cloud.google.com/ml/docs/how-tos/using-external-buckets in case you need to set the appropriate ACLs on your data so that it is accessible for training.
Hope that helps.