How to use regular expressions in Google Dataflow streaming templates?

Using the Dataflow streaming templates, namely the Cloud Storage Text to BigQuery (Stream) template, it used to be possible to describe the "inputFilePattern" (i.e.: the Cloud Storage location of the text you'd like to process) as a regular expression. For example you could enter gs://my-bucket/my-files/file-to-upload* as the parameter and all the files starting with "file-to-upload" would then be streamed.
Unfortunately it now throws this error message: "Object not found."
Is there another way to upload all files from a google storage location with a similar naming convention to BigQuery?
Thanks in advance.

This looks like a bug in the UI. You can pass the file pattern when you submit the job via the command line (see the launch sketch after the snippet below). The source code takes the file pattern as input, so there should not be any problem with the actual job:
PCollectionTuple transformedOutput =
    pipeline
        // 1) Read from the text source continuously.
        .apply(
            "ReadFromSource",
            TextIO.read()
                .from(options.getInputFilePattern())
                .watchForNewFiles(DEFAULT_POLL_INTERVAL, Growth.never()))
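
As a sketch of that command-line/API route: the template can be launched programmatically with a wildcard inputFilePattern. This assumes the public Stream_GCS_Text_to_BigQuery template path and the parameter names from the template documentation; the project, bucket, table, and UDF paths are placeholders. The equivalent gcloud dataflow jobs run command accepts the same parameters.

# Launch the Cloud Storage Text to BigQuery (Stream) template with a wildcard
# inputFilePattern, bypassing the UI. Requires google-api-python-client and
# application-default credentials; the template path and parameter names are
# assumed from the public template docs.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
response = dataflow.projects().templates().launch(
    projectId="my-project",  # placeholder project ID
    gcsPath="gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery",
    body={
        "jobName": "text-to-bq-stream",
        "parameters": {
            "inputFilePattern": "gs://my-bucket/my-files/file-to-upload*",
            "JSONPath": "gs://my-bucket/schema.json",
            "javascriptTextTransformGcsPath": "gs://my-bucket/transform.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": "my-project:my_dataset.my_table",
            "bigQueryLoadingTemporaryDirectory": "gs://my-bucket/tmp",
        },
    },
).execute()
print(response["job"])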

Related

Error: Cannot find the jsonl: gs://{bucket_name}/Frist_test.jsonl in request

I am exploring using Google Cloud Platform Natural Language for entity extraction. I am working on just setting up a playground to get the hang of things and I can't seem to get past square one. I have created a new cloud store bucket to hold my project file.
I made a simple CSV file that points to a one-line JSONL file, but I am missing something in the address of the file stored in my cloud bucket.
My csv looks like this:
Train, gs://new_wc_training/Frist_test.jsonl
And my jsonl file looks like this:
{"text_snippet":{"content": "This is a first test of my json file."}}
When I import my csv file I get the error:
Error: Cannot find the jsonl: gs://new_wc_training/Frist_test.jsonl in request.
I am sure I am just missing something in the structure of the address to the jsonl file in the bucket, but I am at a loss as to finding it.
Thank you for looking over my issue and if there is any additional information needed do not hesitate in asking.
As per the discussion with @fred_rogers, the problem is that the JSONL filename declared inside the CSV file does not match the actual filename in the bucket.
The fix is to make the JSONL filename in the CSV file match the object name in the bucket exactly.
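
A quick way to catch this kind of mismatch is to list the bucket and compare object names character for character. Here is a minimal sketch using the google-cloud-storage client; the bucket and file names are taken from the question.

# List objects in the bucket and check whether the name referenced in the CSV
# exists exactly (GCS object names are case-sensitive, so "Frist_test.jsonl"
# and "First_test.jsonl" are different objects).
from google.cloud import storage

client = storage.Client()
expected = "Frist_test.jsonl"  # the name declared in the CSV
names = [blob.name for blob in client.list_blobs("new_wc_training")]

if expected in names:
    print("Found:", expected)
else:
    print("Not found. Objects actually in the bucket:")
    for name in names:
        print(" ", name)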

Beam/Dataflow ReadAllFromParquet doesn't read anything but my job still succeeds?

I have a Dataflow job which:
Reads a text file from GCS with other filenames in it
Passes the filenames to ReadAllFromParquet to read the .parquet files
Writes to BigQuery
Despite my job 'succeeding' it basically doesn't have an output collection past the ReadAllFromParquet step.
I successfully read the filenames into a list such as: ['gs://my_bucket/my_file1.snappy.parquet', 'gs://my_bucket/my_file2.snappy.parquet', 'gs://my_bucket/my_file3.snappy.parquet']
I am also confirming this list is correct and the GCS paths to the files are correct using a logger on the step before ReadAllFromParquet.
That's what my pipeline looks like (omitting the full code for brevity but I am confident that it normally works as I have the exact same pipeline for .csv using ReadAllFromText and it works fine):
with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
    try:
        final_data = (
            pipeline_2
            | 'Create empty PCollection' >> beam.Create([None])
            | 'Get accepted batch file: {}'.format(runtime_options.complete_batch) >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
            | 'Read all filenames into a list' >> beam.ParDo(FileIterator(runtime_options.files_bucket))
            | 'Read all files' >> beam.io.ReadAllFromParquet(columns=['locationItemId', 'deviceId', 'timestamp'])
            | 'Process all files' >> beam.ParDo(ProcessSch2())
            | 'Transform to rows' >> beam.ParDo(BlisDictSch2())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table=runtime_options.comp_table,
                schema=SCHEMA_2,
                project=pipeline_options_batch.view_as(GoogleCloudOptions).project,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,  # create the table if it does not exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND  # append to existing rows (partitioning)
            )
        )
    except Exception as exception:
        logging.error(exception)
        pass
That's what my job graph looks like after running it.
Does somebody have an idea what might be going wrong here and what's the best way to debug?
My ideas currently:
A bucket permissions issue. I noticed the bucket I am reading from is odd as earlier I couldn't download the files despite being a project Owner. The Owners of project only had 'Storage Legacy Bucket Owner'. I added 'Storage Admin' and it then worked fine when manually downloading files with my own account. As per the Dataflow documentation I have ensured that both the default compute service account as well as the dataflow one have 'Storage Admin' on this bucket. However, maybe that's all a red herring as ultimately if there was a permissions issue I should see this in the log and the job would fail?
ReadAllFromParquet expects the file patterns in a different format? I showed an example of the list I supply above (in my job graph I can see the input collection correctly shows "elements added = 48" for the 48 files in the list). I know this format works for ReadAllFromText, so I assumed they are equivalent and should work.
=========
EDIT:
Noticed something else potentially consequential. Comparing against my other job which uses ReadAllFromText and works fine I noticed a slight mismatch in the naming that is worrying.
This is the name of the output collection for my working job:
Read all files/ReadAllFiles/ReadRange.out0
And that's the name on my parquet job that doesn't actually read anything:
Read all files/Read all files/ReadRange.out0
The first part of the path is the name of my step for both jobs.
But I believe the second to be the ReadAllFiles class from apache_beam.io.filebasedsource (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py) which both ReadAllFromText and ReadAllFromParquet call.
Seems like a potential bug, but I don't seem to be able to trace it in the source code.
=============
EDIT 2
After some more digging it seems that ReadAllFromParquet just isn't functional yet. ReadFromParquet calls apache_beam.io.parquetio._ParquetSource, whereas ReadAllFromParquet simply calls apache_beam.io.filebasedsource._ReadRange.
I wonder if there's a way to turn this on if it's an experimental function?
You didn't mention which Beam SDK version you are using; try SDK 2.16 to test the latest changes.
The docs mark ReadAllFromParquet as experimental, as is ReadFromParquet; nonetheless, ReadFromParquet is reported as working in this thread: Apache-Beam: Read parquet files from nested HDFS directories. You might want to try using that function.
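
As a workaround sketch along those lines, the same files can be read with ReadFromParquet and a glob pattern instead of feeding a filename list into ReadAllFromParquet. The bucket path and column names are taken from the question; the downstream steps from the original pipeline would attach where the logging step is.

# Read all matching Parquet files with ReadFromParquet, which takes a file
# pattern directly, rather than a PCollection of filenames.
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions()  # pass the usual Dataflow options here
    with beam.Pipeline(options=options) as pipeline:
        _ = (
            pipeline
            | 'Read all files' >> beam.io.ReadFromParquet(
                'gs://my_bucket/*.snappy.parquet',
                columns=['locationItemId', 'deviceId', 'timestamp'])
            | 'Log records' >> beam.Map(lambda row: logging.info(row) or row)
        )

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()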

How to train SageMaker BlazingText model using multiple channels

I have two separate normalized text files that I want to train my BlazingText model on.
I am struggling to get this to work and the documentation is not helping.
Basically I need to figure out how to supply multiple files or S3 prefixes as "inputs" parameter to the sagemaker.estimator.Estimator.fit() method.
I first tried:
s3_train_data1 = 's3://{}/{}'.format(bucket, prefix1)
s3_train_data2 = 's3://{}/{}'.format(bucket, prefix2)
train_data1 = sagemaker.session.s3_input(s3_train_data1, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
train_data2 = sagemaker.session.s3_input(s3_train_data2, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
bt_model.fit(inputs={'train1': train_data1, 'train2': train_data2}, logs=True)
This doesn't work because the BlazingText algorithm specifically requires a channel named "train" in the inputs parameter.
So then I tried:
bt_model.fit(inputs={'train': train_data1, 'train': train_data2}, logs=True)
This trains the model only on the second dataset and ignores the first one completely, because the second 'train' entry silently overwrites the first (see the snippet below).
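
For what it's worth, that behaviour is plain Python dict semantics rather than anything SageMaker-specific: a dict literal with duplicate keys keeps only the last value, so only one channel ever reaches fit().

# Duplicate keys in a dict literal: the later value wins.
inputs = {'train': 'data1', 'train': 'data2'}
print(inputs)  # {'train': 'data2'}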
Now, finally, I tried using a manifest file, following the documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html
(see the manifest file format under the "S3Uri" section).
The documentation says the manifest file is JSON that looks like this example:
[
{"prefix": "s3://customer_bucket/some/prefix/"},
"relative/path/to/custdata-1",
"relative/path/custdata-2"
]
Well, I don't think this is valid JSON in the first place, but what do I know; I still gave it a try.
When I try this:
s3_train_data_manifest = 'https://s3.us-east-2.amazonaws.com/bucketpath/myfilename.manifest'
train_data_merged = sagemaker.session.s3_input(s3_train_data_manifest, distribution='FullyReplicated', content_type='text/plain', s3_data_type='ManifestFile')
data_channel_merged = {'train': train_data_merged}
bt_model.fit(inputs=data_channel_merged, logs=True)
I get an error saying:
ValueError: Error training blazingtext-2018-10-17-XX-XX-XX-XXX: Failed Reason: ClientError: Data download failed:Unable to parse manifest at s3://mybucketpath/myfilename.manifest - invalid format
I tried replacing the square brackets in my manifest file with curly braces, but that didn't help. I still feel the JSON format is missing something that the documentation fails to describe correctly.
You can certainly match multiple files with the same prefix, so your first attempt could have worked as long as you organize your files in your S3 bucket to suit. For example, the prefix s3://mybucket/foo/ will match the files s3://mybucket/foo/bar/data1.txt and s3://mybucket/foo/baz/data2.txt.
However, if there is a third file in your bucket called s3://mybucket/foo/qux/data3.txt that you don't want matched (while still matching the first two), there is no way to achieve that with a single prefix. In these cases a manifest would work. So, in the above example, the manifest would simply be:
[
{"prefix": "s3://mybucket/foo/"},
"bar/data1.txt",
"baz/data2.txt"
]
(And yes, this is valid JSON: it is an array whose first element is an object with an attribute called prefix, and all subsequent elements are strings.)
Please double check your manifest (you didn't actually post it so I can't do that for you) and make sure it conforms to the above syntax.
If you're still stuck, please open a thread on the AWS SageMaker forums (https://forums.aws.amazon.com/forum.jspa?forumID=285); after you do that we can set up a PM to try and get to the bottom of this (never post your AWS account ID in a public forum like Stack Overflow or even the AWS forums).
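
As a sketch of putting that advice together with the SDK version used in the question (the bucket name, object keys, and manifest location below are placeholders; the manifest format follows the example above):

# Build a manifest listing exactly the two training files, upload it to S3,
# and pass it to fit() as the single "train" channel.
import json

import boto3
import sagemaker

bucket = 'mybucket'  # placeholder bucket name
manifest = [
    {"prefix": "s3://{}/foo/".format(bucket)},
    "bar/data1.txt",  # placeholder relative keys
    "baz/data2.txt",
]
boto3.client('s3').put_object(
    Bucket=bucket,
    Key='manifests/train.manifest',
    Body=json.dumps(manifest),
)

train_data = sagemaker.session.s3_input(
    's3://{}/manifests/train.manifest'.format(bucket),
    distribution='FullyReplicated',
    content_type='text/plain',
    s3_data_type='ManifestFile',
)
bt_model.fit(inputs={'train': train_data}, logs=True)  # bt_model as defined in the question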

Can we write a ParDo function inside a ParDo function?

For example, I have a list of URLs as strings which are stored in Datastore.
So, I used the DatastoreIO function and read them into a PCollection. In ParDo’s DoFn, for each URL (which is a GCP cloud storage location of a file), I have to read the file present in that location and do further transformations.
So I want to know whether I can write a ParDo over a PCollection inside another ParDo function: a kind of parallel execution of each file's transformation, sending something like KV(key, PCollection) as the output of the first ParDo.
Sorry, if I haven't presented my scenario clearly. I'm a newbie to Apache Beam & Google Dataflow
What you want is TextIO#readAll().
// Read the URLs from Datastore (you'll need to extract the URL string from each
// entity), then let TextIO.readAll() read the contents of each referenced file.
PCollection<String> urls = pipeline.apply(DatastoreIO.read(...));
PCollection<String> lines = urls.apply(TextIO.readAll());

Cloud Storage Text to BigQuery using GCP Template

I'm trying to execute a pipeline using the GCP template available at:
https://cloud.google.com/dataflow/docs/templates/provided-templates#cloud-storage-text-to-bigquery
But I'm getting the error:
2018-03-30 (15:35:17) java.lang.IllegalArgumentException: Failed to match any files with the pattern: gs://.......
Can anyone share a working CSV file to be used as an input for running that pipeline?
The problem was between chair and keyboard: you just need to create a CSV file that matches the data structure defined in the JSON schema file and transformed by the JS file.
I see that this has been answered, but I was having a similar issue and this answer was only partial for me. As it turns out, the path pattern in the template (at the moment, at least) does not support some types of patterns.
For example, for multiple CSV files across multiple sub-directories in a given GCS path (this was my use-case):
gs://bucket-name/dir/
The pattern that will work is:
gs://bucket-name/dir/*/*.csv
These patterns, although they are valid via gsutil ls and return the correct files, will not work in the template:
gs://bucket-name/dir/*
gs://bucket-name/dir/*.csv
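
To sanity-check which objects a pattern should cover before launching the template, here is a rough sketch that lists the bucket and filters with a glob locally. fnmatch globbing only approximates the template's matching rules, and the bucket and prefix names are placeholders.

# List objects under a prefix and show which ones a glob pattern would match.
# Note: fnmatch lets '*' cross '/' boundaries, so this is only an approximation
# of GCS/Dataflow glob semantics; it is still handy for spotting missing files.
from fnmatch import fnmatch

from google.cloud import storage

bucket_name = "bucket-name"   # placeholder
pattern = "dir/*/*.csv"       # pattern relative to the bucket root

client = storage.Client()
for blob in client.list_blobs(bucket_name, prefix="dir/"):
    marker = "MATCH" if fnmatch(blob.name, pattern) else "     "
    print(marker, "gs://{}/{}".format(bucket_name, blob.name))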