S3 Batch operations fails when key has '++' symbol - amazon-web-services

I am using S3 Batch operations to copy some files between buckets in different regions.
Here is my manifest:
test-input-bucket,preview++.png
test-input-bucket,preview.png
preview.png copies just fine, but preview++.png doesn't. It gives this error in the report output:
test-input-bucket,preview++.png,,failed,200,PermanentFailure,PermanentFailure: 404: Not Found
The key definitely exists, so I tried to escape the +'s in the manifest like so:
test-input-bucket,preview\+\+.png
but no luck (same issue). Is there a way for me to fix this without renaming the file?

Per https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the keys in the manifest have to be URL-encoded, so this key would have to be written as preview%2B%2B.png.
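A minimal sketch of producing such manifest lines in Python, assuming the standard library's percent-encoding is what the manifest expects. The helper name manifest_line is hypothetical, and keeping "/" unescaped is an assumption rather than something stated in the thread.

```python
from urllib.parse import quote


def manifest_line(bucket, key):
    """Build one CSV manifest line with a percent-encoded object key."""
    # '+' is not in quote()'s safe set, so it is emitted as %2B.
    # Leaving '/' unescaped is an assumption about what the manifest accepts.
    return f"{bucket},{quote(key, safe='/')}"


for key in ["preview++.png", "preview.png"]:
    print(manifest_line("test-input-bucket", key))
```

Keys made only of letters, digits and `_.-~` pass through unchanged, which is consistent with preview.png having copied without any encoding.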

Related

How can I configure a snowpipe to grab the same filename from an S3 bucket when the file is refreshed and re-uploaded?

We have a csv file that is maintained by an analyst who manually updates it at irregular intervals and reuploads (by drag and drop) the same file to an S3 bucket. I have Snowpipe set up to ingest files from this S3 bucket, but it won't re-process the same filename even when the contents change. We don't want to rely on the analyst(s) remembering to manually rename the file each time they upload it, so are looking for an automated solution. I have pretty minimal input on how the analysts work with this file, I just need to ingest it for them. The options I'm considering are:
1) Somehow adding a timestamp or unique identifier to the filename on upload (not finding a way to do this easily in S3). I've also experimented with versioning in the S3 bucket but this doesn't seem to have any effect.
2) Somehow forcing the pipe to grab the file again even with the same name. I've read elsewhere that setting "Force=true" might do it, but that seems to be an invalid option for a pipe COPY INTO statement.
Here is the pipe configuration, I'm not sure if this will be helpful here:
CREATE OR REPLACE PIPE S3_INGEST_MANUAL_CSV AUTO_INGEST=TRUE AS
COPY INTO DB.SCHEMA.STAGE_TABLE
FROM (
    SELECT $1, $2, metadata$filename, metadata$file_row_number
    FROM @DB.SCHEMA.S3STAGE
)
FILE_FORMAT=(
    TYPE='csv'
    skip_header=1
)
ON_ERROR='SKIP_FILE_1%';
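Ignoring the fact that overwriting the same file rather than using a unique filename is really bad practice, you can use the FORCE copy option (FORCE = TRUE) to reload a file that has already been loaded. As the question notes, a pipe definition does not accept FORCE, so this applies to a COPY INTO statement run directly rather than through the pipe.
If the file hasn't changed and you run the load with this option, you'll potentially end up with duplicates in your target table.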
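The other option raised in the question, giving each upload a unique name, could be sketched as below. This is only an illustration: timestamped_key is a hypothetical helper, and where it would run (for example in whatever process copies the analyst's upload into the stage location) is an assumption, not something described in the thread.

```python
import posixpath
from datetime import datetime, timezone


def timestamped_key(key, when=None):
    """Insert a UTC timestamp before the file extension of an object key."""
    # `when` should be timezone-aware; it defaults to the current UTC time.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stem, ext = posixpath.splitext(key)
    return f"{stem}_{when.strftime('%Y%m%dT%H%M%SZ')}{ext}"


# Hypothetical key name, used only for illustration.
print(timestamped_key("uploads/manual.csv"))
```

Because every upload then lands under a new name, the pipe sees it as a new file and no FORCE is needed.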

Beam/Dataflow ReadAllFromParquet doesn't read anything but my job still succeeds?

I have a Dataflow job which:
Reads a text file from GCS with other filenames in it
Passes the filenames to ReadAllFromParquet to read the .parquet files
Writes to BigQuery
Despite my job 'succeeding' it basically doesn't have an output collection past the ReadAllFromParquet step.
I successfully read the files in a list such as:['gs://my_bucket/my_file1.snappy.parquet','gs://my_bucket/my_file2.snappy.parquet','gs://my_bucket/my_file3.snappy.parquet']
I am also confirming this list is correct and the GCS paths to the files are correct using a logger on the step before ReadAllFromParquet.
That's what my pipeline looks like (omitting the full code for brevity but I am confident that it normally works as I have the exact same pipeline for .csv using ReadAllFromText and it works fine):
with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
    try:
        final_data = (
            pipeline_2
            |'Create empty PCollection' >> beam.Create([None])
            |'Get accepted batch file: {}'.format(runtime_options.complete_batch) >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
            |'Read all filenames into a list'>> beam.ParDo(FileIterator(runtime_options.files_bucket))
            |'Read all files' >> beam.io.ReadAllFromParquet(columns=['locationItemId','deviceId','timestamp'])
            |'Process all files' >> beam.ParDo(ProcessSch2())
            |'Transform to rows' >> beam.ParDo(BlisDictSch2())
            |'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table = runtime_options.comp_table,
                schema = SCHEMA_2,
                project = pipeline_options_batch.view_as(GoogleCloudOptions).project, #options.display_data()['project'],
                create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED, #'CREATE_IF_NEEDED',#create if does not exist.
                write_disposition = beam.io.BigQueryDisposition.WRITE_APPEND #'WRITE_APPEND' #add to existing rows,partitoning
            )
        )
    except Exception as exception:
        logging.error(exception)
        pass
That's what my job diagram looks like after:
Does somebody have an idea what might be going wrong here and what's the best way to debug?
My ideas currently:
A bucket permissions issue. I noticed the bucket I am reading from is odd as earlier I couldn't download the files despite being a project Owner. The Owners of project only had 'Storage Legacy Bucket Owner'. I added 'Storage Admin' and it then worked fine when manually downloading files with my own account. As per the Dataflow documentation I have ensured that both the default compute service account as well as the dataflow one have 'Storage Admin' on this bucket. However, maybe that's all a red herring as ultimately if there was a permissions issue I should see this in the log and the job would fail?
ReadAllFromParquet expects the file patterns in a different format? I have shown an example of the list I supply above (in my diagram I can see the input collection correctly shows elements added = 48 for the 48 files in the list). I know this format works for ReadAllFromText so I assumed that they are equivalent and should work.
=========
EDIT:
Noticed something else potentially consequential. Comparing against my other job which uses ReadAllFromText and works fine I noticed a slight mismatch in the naming that is worrying.
This is the name of the output collection for my working job:
And that's the name on my parquet job that doesn't actually read anything:
Note specifically
Read all files/ReadAllFiles/ReadRange.out0
vs
Read all files/Read all files/ReadRange.out0
The first part of the path is the name of my step for both jobs.
But I believe the second to be the ReadAllFiles class from apache_beam.io.filebasedsource (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py) which both ReadAllFromText and ReadAllFromParquet call.
Seems like a potential bug but don't seem to be able to trace it in the source code.
=============
EDIT 2
After some more digging it seems that ReadAllFromParquet just isn't functional yet. ReadFromParquet calls apache_beam.io.parquetio._ParquetSource whereas ReadAllFromParquet simply calls
apache_beam.io.filebasedsource._ReadRange.
I wonder if there's a way to turn this on if it's an experimental function?
You didn't mention whether you are using the latest Beam SDK; try SDK 2.16 to test against the latest changes.
The docs state that ReadAllFromParquet is an experimental function, as is ReadFromParquet; nonetheless, ReadFromParquet is reported as working in the thread "Apache-Beam: Read parquet files from nested HDFS directories", so you might want to try using that function instead.

How to train SageMaker BlazingText model using multiple channels

I have two separate normalized text files that I want to train my BlazingText model on.
I am struggling to get this to work and the documentation is not helping.
Basically I need to figure out how to supply multiple files or S3 prefixes as "inputs" parameter to the sagemaker.estimator.Estimator.fit() method.
I first tried:
s3_train_data1 = 's3://{}/{}'.format(bucket, prefix1)
s3_train_data2 = 's3://{}/{}'.format(bucket, prefix2)
train_data1 = sagemaker.session.s3_input(s3_train_data1, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
train_data2 = sagemaker.session.s3_input(s3_train_data2, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
bt_model.fit(inputs={'train1': train_data1, 'train2': train_data2}, logs=True)
This doesn't work because SageMaker specifically looks for the key "train" in the inputs parameter.
So then I tried:
bt_model.fit(inputs={'train': train_data1, 'train': train_data2}, logs=True)
This trains the model only on the second dataset and ignores the first one completely.
Now finally I tried using a Manifest file using the documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html
(see manifest file format under "S3Uri" section)
the documentation says the manifest file format is a JSON that looks like this example:
[
{"prefix": "s3://customer_bucket/some/prefix/"},
"relative/path/to/custdata-1",
"relative/path/custdata-2"
]
Well, I don't think this is valid JSON in the first place, but what do I know; I still gave it a try.
When I try this:
s3_train_data_manifest = 'https://s3.us-east-2.amazonaws.com/bucketpath/myfilename.manifest'
train_data_merged = sagemaker.session.s3_input(s3_train_data_manifest, distribution='FullyReplicated', content_type='text/plain', s3_data_type='ManifestFile')
data_channel_merged = {'train': train_data_merged}
bt_model.fit(inputs=data_channel_merged, logs=True)
I get an error saying:
ValueError: Error training blazingtext-2018-10-17-XX-XX-XX-XXX: Failed Reason: ClientError: Data download failed:Unable to parse manifest at s3://mybucketpath/myfilename.manifest - invalid format
I tried replacing square brackets in my manifest file with curly braces ...but still I feel the JSON file format seems to be missing something that documentation fails to describe correctly?
You can certainly match multiple files with the same prefix, so your first attempt could have worked as long as you organize the files in your S3 bucket to suit. For example, the prefix s3://mybucket/foo/ will match the files s3://mybucket/foo/bar/data1.txt and s3://mybucket/foo/baz/data2.txt
However, if there is a third file in your bucket called s3://mybucket/foo/qux/data3.txt that you don't want matched (while still matching the first two), there is no way to achieve that with a single prefix. In these cases a manifest would work. So, in the above example, the manifest would simply be:
[
{"prefix": "s3://mybucket/foo/"},
"bar/data1.txt",
"baz/data2.txt"
]
(and yes, this is valid json - it is an array whose first element is an object with an attribute called prefix and all subsequent elements are strings).
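One way to avoid hand-editing mistakes is to generate the manifest and round-trip it through a JSON parser. The sketch below does that with Python's json module; build_manifest is a hypothetical helper name, and the paths are the ones from the example above.

```python
import json


def build_manifest(prefix, relative_paths):
    """Return manifest text: a JSON array of one prefix object, then path strings."""
    return json.dumps([{"prefix": prefix}, *relative_paths], indent=2)


manifest_text = build_manifest(
    "s3://mybucket/foo/", ["bar/data1.txt", "baz/data2.txt"]
)

# Parsing the text back confirms that it is well-formed JSON.
parsed = json.loads(manifest_text)
print(manifest_text)
```

If json.loads raises on a hand-written manifest, the file is malformed JSON, which would be worth ruling out before looking further at the "invalid format" error.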
Please double check your manifest (you didn't actually post it so I can't do that for you) and make sure it conforms to the above syntax.
If you're still stuck, please open a thread on the AWS SageMaker forums - https://forums.aws.amazon.com/forum.jspa?forumID=285 and after you do that we can set up a PM to try and get to the bottom of this (never post your AWS account id in a public forum like StackOverflow or even in the AWS forums).

Redshift COPY command raises error if S3 prefix does not exist

When I run this COPY command:
COPY to_my_table (field1, field2, etc)
FROM s3://my-service-f55b83j5vvkp/2018/09/03
CREDENTIALS 'aws_iam_role=...'
JSON 'auto' TIMEFORMAT 'auto';
I get this error:
The specified S3 prefix '2018/09/03' does not exist
Which makes sense, because my S3 bucket does not have any file in that specific prefix. However, this is part of a daily job to load data, where sometimes there's something to load, but some other times there's nothing to load.
I checked the COPY documentation and there doesn't seem to be any way to avoid the error and simply do nothing if there are no objects under that prefix. Maybe I am missing something?
I would like to share how we solved this problem in our case; it's a simple solution, but it may be helpful to others. Jon Scot suggested a good option in a comment, which I liked. Unfortunately, we couldn't use it in our case because the system adding files to S3 was not under our control. I'm not sure whether that is your case too.
I think you could solve your problem in multiple ways, but here are two options that I suggest.
1) As you are probably running a cron job to load data into Redshift, put a file existence check before executing the COPY command, like below.
path=s3://my-service-f55b83j5vvkp/2018/09/03
# Count the entries listed under the prefix.
count=$(s3cmd ls "$path" | wc -l)
if [[ $count -gt 0 ]]; then
    # Your Redshift COPY code goes here.
    :
else
    echo "Nothing to load"
fi
The advantage of this option is that you save some cost, though it may be completely negligible.
2) Put a dummy file without records under the prefix, so that the COPY runs but loads no data into Redshift.
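The existence check from option 1 could also be sketched in Python. The function below only assumes a client object exposing list_objects_v2 with the Bucket, Prefix and MaxKeys parameters and a KeyCount field in the response, as boto3's S3 client does. FakeS3Client is a hypothetical stand-in so the sketch runs without AWS access, and the object key in it is made up.

```python
def prefix_has_objects(s3_client, bucket, prefix):
    """Return True if at least one object key in `bucket` starts with `prefix`."""
    # MaxKeys=1 keeps the request small; one match is enough to decide.
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


class FakeS3Client:
    """Hypothetical in-memory stand-in for an S3 client (illustration only)."""

    def __init__(self, keys):
        self._keys = keys

    def list_objects_v2(self, Bucket, Prefix, MaxKeys=1000):
        matches = [k for k in self._keys if k.startswith(Prefix)][:MaxKeys]
        return {"KeyCount": len(matches), "Contents": [{"Key": k} for k in matches]}


# Made-up bucket contents: data for 2018/09/02 only.
client = FakeS3Client(["2018/09/02/events.json"])
if prefix_has_objects(client, "my-service-f55b83j5vvkp", "2018/09/03"):
    print("Run the COPY command")
else:
    print("Nothing to load")
```

In a real job the fake would presumably be replaced by boto3.client("s3"), with the COPY issued only when the function returns True.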

Can someone explain AWS GET?

For reference: GET Bucket (List Objects)
When I do a GET request on the root bucket it comes back with test/ and test/subdir/, both 0 bytes. That is correct: there should be 2 folders up there. When I upload a file to test/subdir/file, the root bucket has an item with key=test/subdir/file, while test/ and test/subdir/ are still 0 bytes. When I do a GET request on test/subdir/ it returns nothing.
What's going on here?
Note: I do not have access to the console.
Greg, this might sound confusing at first, but the truth is that there's no such thing as "a folder" in Amazon S3. I'll explain.
The data structure of S3 is like a flat list of objects -- not like a tree. When you think you have a "file" called puppy.jpg inside a "folder" called pics, what you actually have is an object whose key is pics/puppy.jpg. Note that the / character is not any more special than the . character, or the p characters.
You might be thinking, Bruno is nuts, I see folders in the AWS Management Console. True, you see the folders. But they are actually emulated by the GUI.
When you create a folder through the AWS Management Console, what it will actually do is create an object whose name is the full path of the "folder", with a trailing slash, and 0 bytes. Just like the test/ object (not "folder") and the test/subdir/ object (not "folder") you mention in your question.
To actually identify and draw "folders", the AWS Management Console (as well as many other S3 browsing tools) does some API magic with the parameters delimiter and prefix.
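To make the delimiter and prefix idea concrete, here is a minimal sketch that emulates the grouping over a flat list of keys in plain Python. It does not call S3; list_level is a hypothetical helper, and the keys are the ones from the question.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Split flat keys into objects at this level and emulated "folders"."""
    contents = []          # keys with no further delimiter after the prefix
    common_prefixes = []   # the emulated "folders" one level down
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            contents.append(key)
        else:
            folder = prefix + rest[: cut + len(delimiter)]
            if folder not in common_prefixes:
                common_prefixes.append(folder)
    return contents, common_prefixes


keys = ["test/", "test/subdir/", "test/subdir/file"]
root = list_level(keys)                        # view of the bucket root
inside_test = list_level(keys, prefix="test/")  # view "inside" test/
print(root)
print(inside_test)
```

At the root, all three keys collapse into the single "folder" test/. With prefix test/, the 0-byte test/ marker shows up as an ordinary object and test/subdir/ becomes the next "folder".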
Now, knowing the fact that there's no such thing as a folder, and that they are emulated through the use of those 0-byte, trailing-/ objects, it should be easy to understand why you see the test/ object as a 0-byte object... The same reasoning would explain why you see nothing when you do a GET on a "folder" -- you are actually downloading a 0-byte object!
Finally, as a conclusion, there's no easy way to obtain from S3 the size of "a folder" (they don't exist...). The only way would be for you to list all the objects with that prefix and add their sizes. Or keep an index of your objects ("files" and "folders") in some kind of database with more advanced querying capabilities.
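The "list everything with that prefix and add the sizes" approach can be sketched over an already-fetched listing of (key, size) pairs. prefix_size is a hypothetical helper, and the 1024-byte size for test/subdir/file is made up for illustration.

```python
def prefix_size(objects, prefix):
    """Sum the sizes of all (key, size) pairs whose key starts with `prefix`."""
    return sum(size for key, size in objects if key.startswith(prefix))


# Keys from the question; the file size is an invented example value.
objects = [("test/", 0), ("test/subdir/", 0), ("test/subdir/file", 1024)]

print(prefix_size(objects, "test/"))
print(prefix_size(objects, "test/subdir/"))
```

The 0-byte marker objects contribute nothing, so the "folder" size is just the total of the real objects that share the prefix.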