Reading Input Data from GCS - google-cloud-ml

What is the suggested way of loading data from GCS? The sample code shows copying the data from GCS to the /tmp/ directory. If this is the suggested approach, how much data may be copied to /tmp/?

While you have that option, you shouldn't need to copy the data to local disk. You should be able to reference training and evaluation data directly from GCS by using the GCS URI of each file/object, e.g. gs://bucket/path/to/file. You can use these paths wherever you'd normally use local file system paths in TensorFlow APIs that accept file paths. TensorFlow supports both reading data from and writing data to GCS.
You should also be able to use a prefix to reference a set of matching files, rather than referencing each file individually.
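As a rough illustration of the two points above, here is a minimal sketch. It assumes TensorFlow 2.x (where tf.io.gfile.glob and tf.data.TFRecordDataset are available) and assumes the training data is stored as TFRecord files; the helper names and the train-*.tfrecord naming are hypothetical and only reuse the bucket/path placeholders from the example URI.

```python
def gcs_uri(bucket, *parts):
    """Build a gs:// URI from a bucket name and path components."""
    cleaned = [bucket.strip("/")] + [p.strip("/") for p in parts]
    return "gs://" + "/".join(cleaned)


def make_dataset(file_pattern):
    """Create a dataset from every GCS object matching the pattern.

    Assumption: the files are TFRecords; other formats need another reader.
    """
    import tensorflow as tf  # imported lazily; assumes TensorFlow 2.x is installed

    filenames = tf.io.gfile.glob(file_pattern)  # expands the wildcard on GCS
    return tf.data.TFRecordDataset(filenames)


# Hypothetical pattern matching a set of training shards under one prefix.
train_pattern = gcs_uri("bucket", "path/to", "train-*.tfrecord")
print(train_pattern)
```

Calling make_dataset(train_pattern) would then read the matching objects straight from GCS, without a copy to /tmp/, provided the job has access to the bucket.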
Follow-up note: you'll want to check out https://cloud.google.com/ml/docs/how-tos/using-external-buckets in case you need to set ACLs on your data so that it is accessible to training.
Hope that helps.

Related

How to write file-wide metadata into Parquet files with Apache Parquet in C++

I use Apache Parquet to create Parquet tables with process information of a machine, and I need to store file-wide metadata (Machine ID and Machine Name).
It is stated that Parquet files are capable of storing file-wide metadata; however, I couldn't find anything in the documentation about it.
There is another Stack Overflow post that explains how it is done with pyarrow. As far as I can tell from that post, I need some kind of key-value pair (maybe map<string, string>) and have to add it to the schema somehow.
I found a class in the Parquet source code called parquet::FileMetaData that may be usable for this purpose; however, there is nothing in the docs about it.
Is it possible to store file-wide metadata with C++?
Currently I am using the stream_reader_writer example for writing Parquet files.
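You can pass the file-level metadata when calling parquet::ParquetFileWriter::Open, which accepts an optional key_value_metadata argument (a shared pointer to a KeyValueMetadata object); see the source code here.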

awswrangler write parquet dataframes to a single file

I am creating a very big file that cannot fit in memory directly. So I have created a bunch of small files in S3 and am writing a script that can read these files and merge them. I am using AWS Data Wrangler (awswrangler) to do this.
My code is as follows:
try:
    dfs = wr.s3.read_parquet(path=input_folder, path_suffix=['.parquet'], chunked=True, use_threads=True)
    for df in dfs:
        path = wr.s3.to_parquet(df=df, dataset=True, path=target_path, mode="append")
        logger.info(path)
except Exception as e:
    logger.error(e, exc_info=True)
    logger.info(e)
The problem is that wr.s3.to_parquet creates a lot of files instead of writing to one file. I also can't remove chunked=True, because otherwise my program fails with OOM.
How do I make this write a single file in S3?
AWS Data Wrangler is writing multiple files because you have specified dataset=True. Removing this flag or switching it to False should do the trick, as long as you are specifying a full path.
I don't believe this is possible. @Abdel Jaidi's suggestion won't work, as mode="append" requires dataset to be True or it will throw an error. I believe that in this case, append has more to do with "appending" the data in Athena or Glue by adding new files to the same folder.
I also don't think this is even possible for Parquet in general. As per this SO post, it's not possible in a local folder, let alone S3. On top of that, Parquet is compressed, and I don't think it would be easy to add a row to a compressed file without loading it all into memory.
I think the only solution is to get a beefy EC2 instance that can handle this.
I'm facing a similar issue and I think I'm going to just loop over all the small files and create bigger ones. For example, you could append several dataframes together and then rewrite those, but you won't be able to get back to one Parquet file unless you get a computer with enough RAM.
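One way to sketch the "loop over the small files and create bigger ones" idea is to first group the small files into batches that fit a size budget, and then merge one batch at a time. The helper below is a hypothetical illustration in plain Python; the file names, sizes, and budget are made up, and on-disk sizes are only a rough proxy for memory use because Parquet is compressed.

```python
def batch_by_size(files, max_batch_bytes):
    """Group (key, size) pairs into batches whose total size stays within the budget.

    A single file larger than the budget ends up in a batch of its own.
    """
    batches, current, current_bytes = [], [], 0
    for key, size in files:
        # Start a new batch when adding this file would exceed the budget.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# Illustrative (made-up) object names and sizes in bytes.
small_files = [
    ("part-0.parquet", 40),
    ("part-1.parquet", 50),
    ("part-2.parquet", 30),
    ("part-3.parquet", 90),
]
batches = batch_by_size(small_files, max_batch_bytes=100)
print(batches)
```

Each batch could then be read with wr.s3.read_parquet(path=batch) and written back with a single wr.s3.to_parquet call to a full object path, which yields one larger file per batch rather than one file overall.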

Load multiple files, check file name, archive a file

In a Data Fusion pipeline:
How do I read all the file names from a bucket, load some of them based on the file name, and archive the others?
Is it possible to run a gsutil script from the Data Fusion pipeline?
Sometimes more complex logic needs to be put in place to decide which files should be loaded. I need to go through all the files in a location and then load only those with the current date or later. The date is a suffix of the file name, e.g. customer_accounts_2021_06_15.csv
Depending on where you are planning on writing the files to, you may be able to use the GCS Source plugin with the logicalStartTime macro in the Regex Path Filter field to select only files after a certain date. However, this may cause all your file data to be converted into records. If you want to retain each file in its original format, you may want to consider writing your own custom plugin.
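The selection rule from the question (load files dated today or later, archive the rest) can be sketched outside Data Fusion as plain Python. This is only an illustration of the logic a custom plugin or a pre-processing script might apply; the function names are hypothetical, and it assumes the date suffix always has the _YYYY_MM_DD.csv form shown in the question. Files without such a suffix are left out of both lists.

```python
import re
from datetime import date

# Matches a trailing _YYYY_MM_DD.csv suffix, as in customer_accounts_2021_06_15.csv
DATE_SUFFIX = re.compile(r"_(\d{4})_(\d{2})_(\d{2})\.csv$")


def file_date(name):
    """Return the date encoded in the file name, or None if there is none."""
    match = DATE_SUFFIX.search(name)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13
        return None


def split_by_date(names, cutoff):
    """Split names into (to_load, to_archive) relative to the cutoff date."""
    to_load, to_archive = [], []
    for name in names:
        parsed = file_date(name)
        if parsed is None:
            continue  # no recognizable date suffix: leave the file untouched
        (to_load if parsed >= cutoff else to_archive).append(name)
    return to_load, to_archive


names = [
    "customer_accounts_2021_06_14.csv",
    "customer_accounts_2021_06_15.csv",
    "customer_accounts_2021_06_16.csv",
    "notes.txt",
]
to_load, to_archive = split_by_date(names, cutoff=date(2021, 6, 15))
print(to_load, to_archive)
```

In a real run the cutoff would be date.today() (or the pipeline's logical start time) instead of a fixed date.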

gsutil rsync between gzip/non-gzip local/cloud locations

Can gsutil's rsync use the gzip'd size for change detection?
Here's the situation:
Uploaded non-gzip'd static site content to a bucket using cp -Z so it's compressed at rest in the cloud.
Modify HTML files locally.
Need to rsync only the locally modified files.
So the upshot is that the content is compressed in the cloud and uncompressed locally. Can rsync be used to figure out what's changed?
From what I've tried, I'm thinking no, because of the way rsync does its change detection:
If -c is used, compare checksums, but ONLY IF the file sizes are the same.
Otherwise use times.
And it doesn't look like -J/-j affects the file size comparison (the local uncompressed file size is compared against the compressed cloud version, which of course never matches), so -c won't kick in. Then the times won't match, and thus everything is uploaded again.
This seems like a fairly common use case. Is there a way of solving this?
Thank you,
Hans
To figure out how rsync identifies what has changed when using gsutil, please check the Change Detection Algorithm section of the documentation.
I am unsure how you want to compare gzip'd and non-gzip'd files, but maybe gsutil compose could be used as an intermediate step so that the files are compared before they are compressed.
Also take into account the 4th limitation listed for gsutil rsync:
The gsutil rsync command copies changed files in their entirety and does not employ the rsync delta-transfer algorithm to transfer portions of a changed file. This is because Cloud Storage objects are immutable and no facility exists to read partial object checksums or perform partial replacements.
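The size mismatch described in the question can be reproduced locally. The sketch below uses Python's gzip module as a stand-in for what cp -Z stores; it is not gsutil's implementation, and the HTML content is made up. It only shows why neither the size nor the checksum of the local file lines up with the compressed object.

```python
import gzip
import hashlib


def describe(data):
    """Return (size in bytes, MD5 hex digest) for a blob of bytes."""
    return len(data), hashlib.md5(data).hexdigest()


# Made-up static page content, as it exists on local disk.
local = b"<html><body>" + b"<p>hello</p>" * 200 + b"</body></html>"

# Stand-in for the object at rest after an upload with cp -Z (gzip-compressed).
stored = gzip.compress(local)

local_size, local_md5 = describe(local)
stored_size, stored_md5 = describe(stored)

sizes_match = local_size == stored_size
checksums_match = local_md5 == stored_md5
print(sizes_match, checksums_match)
```

Since the sizes differ, a checksum comparison that is gated on equal sizes never runs for these files, which matches the behaviour the question reports.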

Read Parquet Files from HDFS cluster

I'm looking for advice on how to read Parquet files from an HDFS cluster using Apache NiFi. In the cluster, there are multiple files under a single directory, and I want to read them all in one flow. Does NiFi provide a built-in component to read the files in an HDFS directory (Parquet in this case)?
Example: 3 files present in the directory:
hdfs://app/data/customer/file1.parquet
hdfs://app/data/customer/file2.parquet
hdfs://app/data/customer/file3.parquet
Thanks!
You can use the FetchParquet processor in combination with the ListHDFS/GetHDFS etc. processors.
This processor was added in NiFi 1.2; Jira NIFI-3724 covers this improvement.
ListHDFS: stores the state and runs incrementally.
GetHDFS: doesn't store the state and gets all the files from the configured directory (set the Keep Source File property to True in case you don't want the source file to be deleted).
You can also use other ways (UpdateAttribute etc.) to add the fully qualified filename as an attribute to the flowfile, then feed the connection to the FetchParquet processor, which then fetches those Parquet files.
The FetchParquet processor reads the Parquet files and writes them in the format specified by the configured RecordWriter.
Flow:
ListHDFS/GetHDFS -> FetchParquet -> other processors
If your requirement is to read the files from HDFS, you can use the HDFS processors available in the nifi-hadoop-bundle. You can use either of two approaches:
A combination of ListHDFS and FetchHDFS
GetHDFS
The difference between the two approaches is that GetHDFS will keep listing the contents of the configured directories on each run, so it will produce duplicates. The former approach, however, keeps track of the state, so only new additions and/or modifications are returned in each subsequent run.