When I run this COPY command:
COPY to_my_table (field1, field2, etc)
FROM 's3://my-service-f55b83j5vvkp/2018/09/03'
CREDENTIALS 'aws_iam_role=...'
JSON 'auto' TIMEFORMAT 'auto';
I get this error:
The specified S3 prefix '2018/09/03' does not exist
Which makes sense, because my S3 bucket does not have any file in that specific prefix. However, this is part of a daily job to load data, where sometimes there's something to load, but some other times there's nothing to load.
I checked the COPY documentation and there doesn't seem to be any way to avoid the error and simply do nothing if there are no objects under that prefix. Maybe I am missing something?
I would like to describe how we solved this problem in our case. It is a simple solution, but it may be helpful to others. Jon Scot suggested a good option in a comment, which I liked. Unfortunately, we couldn't use it in our case because the system adding files to S3 was not under our control. I'm not sure whether that applies to you too.
I think you could solve your problem multiple ways, but here are two options that I suggest.
1) As you are probably running a cron job to load data into Redshift, put a file existence check before executing the COPY command, like below.
path=s3://my-service-f55b83j5vvkp/2018/09/03
# Count the entries that s3cmd lists under the prefix
count=$(s3cmd ls "$path" | wc -l)
if [[ $count -ge 1 ]]; then
    # Your Redshift COPY code goes here.
    :
else
    echo "Nothing to load"
fi
The advantage of this option is that you save some cost, though the saving may be completely negligible.
2) Put a dummy file without records under the prefix, so that the COPY succeeds but loads no data into Redshift.
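For what it's worth, the existence check from option 1 could also be done from Python instead of s3cmd. This is only a sketch: prefix_has_objects is a made-up helper name, and it assumes you hand it something shaped like boto3's list_objects_v2, which reports a KeyCount in its response.

```python
def prefix_has_objects(list_objects, bucket, prefix):
    """Return True if at least one object exists under the prefix.

    list_objects is assumed to behave like boto3's
    s3_client.list_objects_v2: it accepts Bucket/Prefix/MaxKeys keyword
    arguments and returns a dict that contains 'KeyCount'.
    """
    # One key is enough to know whether there is anything to load.
    response = list_objects(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


# Hypothetical usage with boto3 (not imported here):
#   s3 = boto3.client("s3")
#   if prefix_has_objects(s3.list_objects_v2, "my-service-f55b83j5vvkp", "2018/09/03"):
#       run_copy()  # hypothetical function that issues the COPY command
```

Passing the listing function in as an argument keeps the check easy to try out without touching S3.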
I am using S3 Batch operations to copy some files between buckets in different regions.
Here is my manifest:
test-input-bucket,preview++.png
test-input-bucket,preview.png
preview.png copies just fine, but preview++.png doesn't. It gives this error in the report output:
test-input-bucket,preview++.png,,failed,200,PermanentFailure,PermanentFailure: 404: Not Found
The key definitely exists, so I tried to escape the +'s in the manifest like so:
test-input-bucket,preview\+\+.png
but no luck (same issue). Is there a way for me to fix this without renaming the file?
Per https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the keys have to be URL-encoded, so the key would have to be encoded as preview%2B%2B.png.
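If the manifest is generated by a script, the encoding can be done with the standard library. A minimal sketch, assuming a two-column bucket,key CSV manifest like the one in the question; manifest_line is a hypothetical helper name, and leaving "/" unescaped is my assumption rather than something the linked page spells out.

```python
from urllib.parse import quote


def manifest_line(bucket, key):
    """Build one 'bucket,key' manifest row with a URL-encoded key."""
    # quote() percent-encodes '+' as %2B; '/' is kept as-is (assumption).
    return "{},{}".format(bucket, quote(key, safe="/"))


print(manifest_line("test-input-bucket", "preview++.png"))
# prints: test-input-bucket,preview%2B%2B.png
```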
We have a csv file that is maintained by an analyst who manually updates it at irregular intervals and reuploads (by drag and drop) the same file to an S3 bucket. I have Snowpipe set up to ingest files from this S3 bucket, but it won't re-process the same filename even when the contents change. We don't want to rely on the analyst(s) remembering to manually rename the file each time they upload it, so are looking for an automated solution. I have pretty minimal input on how the analysts work with this file, I just need to ingest it for them. The options I'm considering are:
1) Somehow adding a timestamp or unique identifier to the filename on upload (not finding a way to do this easily in S3). I've also experimented with versioning in the S3 bucket but this doesn't seem to have any effect.
2) Somehow forcing the pipe to grab the file again even with the same name. I've read elsewhere that setting "Force=true" might do it, but that seems to be an invalid option for a pipe COPY INTO statement.
Here is the pipe configuration, I'm not sure if this will be helpful here:
CREATE OR REPLACE PIPE S3_INGEST_MANUAL_CSV AUTO_INGEST=TRUE AS
COPY INTO DB.SCHEMA.STAGE_TABLE
FROM(
SELECT $1, $2, metadata$filename, metadata$file_row_number
FROM @DB.SCHEMA.S3STAGE
)
FILE_FORMAT=(
TYPE='csv'
skip_header=1
) ON_ERROR='SKIP_FILE_1%'
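Ignoring the fact that updating the same file rather than using a unique filename is really bad practice, you can use the FORCE option of COPY INTO to force reloading of the same file. As far as I know, FORCE is only accepted in a standalone COPY INTO statement and not in a pipe definition, so the reload would have to run outside Snowpipe.
If the file hasn't been changed and you run the process with this option, you'll potentially end up with duplicates in your target.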
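The other option from the question, giving each upload a unique name automatically, could be sketched along these lines. Everything here is an assumption rather than a tested setup: timestamped_key is a hypothetical helper, and the idea would be to call it from something that reacts to the upload (for example an S3-triggered function) and copies the object to the new key under a separate prefix that the pipe watches.

```python
import posixpath
from datetime import datetime, timezone


def timestamped_key(key, when=None):
    """Insert a UTC timestamp before the file extension of an S3 key."""
    when = when or datetime.now(timezone.utc)
    stem, ext = posixpath.splitext(key)
    return "{}_{}{}".format(stem, when.strftime("%Y%m%dT%H%M%SZ"), ext)


# Hypothetical usage with boto3 (not imported here):
#   s3.copy_object(
#       Bucket=bucket,
#       Key="ingest/" + timestamped_key("manual.csv"),
#       CopySource={"Bucket": bucket, "Key": "uploads/manual.csv"},
#   )
```

Copying into a different prefix than the one the analyst uploads to avoids the copy re-triggering itself.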
I have a Dataflow job which:
Reads a text file from GCS with other filenames in it
Passes the filenames to ReadAllFromParquet to read the .parquet files
Writes to BigQuery
Despite my job 'succeeding' it basically doesn't have an output collection past the ReadAllFromParquet step.
I successfully read the files in a list such as: ['gs://my_bucket/my_file1.snappy.parquet', 'gs://my_bucket/my_file2.snappy.parquet', 'gs://my_bucket/my_file3.snappy.parquet']
I am also confirming this list is correct and the GCS paths to the files are correct using a logger on the step before ReadAllFromParquet.
That's what my pipeline looks like (omitting the full code for brevity but I am confident that it normally works as I have the exact same pipeline for .csv using ReadAllFromText and it works fine):
with beam.Pipeline(options=pipeline_options_batch) as pipeline_2:
    try:
        final_data = (
            pipeline_2
            | 'Create empty PCollection' >> beam.Create([None])
            | 'Get accepted batch file: {}'.format(runtime_options.complete_batch) >> beam.ParDo(OutputValueProviderFn(runtime_options.complete_batch))
            | 'Read all filenames into a list' >> beam.ParDo(FileIterator(runtime_options.files_bucket))
            | 'Read all files' >> beam.io.ReadAllFromParquet(columns=['locationItemId', 'deviceId', 'timestamp'])
            | 'Process all files' >> beam.ParDo(ProcessSch2())
            | 'Transform to rows' >> beam.ParDo(BlisDictSch2())
            | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
                table=runtime_options.comp_table,
                schema=SCHEMA_2,
                project=pipeline_options_batch.view_as(GoogleCloudOptions).project,  # options.display_data()['project'],
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,  # create if does not exist
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND  # add to existing rows, partitioning
            )
        )
    except Exception as exception:
        logging.error(exception)
        pass
That's what my job diagram looks like after:
Does somebody have an idea what might be going wrong here and what's the best way to debug?
My ideas currently:
A bucket permissions issue. I noticed the bucket I am reading from is odd as earlier I couldn't download the files despite being a project Owner. The Owners of project only had 'Storage Legacy Bucket Owner'. I added 'Storage Admin' and it then worked fine when manually downloading files with my own account. As per the Dataflow documentation I have ensured that both the default compute service account as well as the dataflow one have 'Storage Admin' on this bucket. However, maybe that's all a red herring as ultimately if there was a permissions issue I should see this in the log and the job would fail?
ReadAllFromParquet expects the file patterns in a different format? I have shown an example of the list I supply above (in my diagram above I can see the input collection correctly shows elements added = 48 for the 48 files in the list). I know this format works for ReadAllFromText, so I assumed that the two are equivalent and it should work.
=========
EDIT:
Noticed something else potentially consequential. Comparing against my other job which uses ReadAllFromText and works fine I noticed a slight mismatch in the naming that is worrying.
This is the name of the output collection for my working job:
And that's the name on my parquet job that doesn't actually read anything:
Note specifically
Read all files/ReadAllFiles/ReadRange.out0
vs
Read all files/Read all files/ReadRange.out0
The first part of the path is the name of my step for both jobs.
But I believe the second to be the ReadAllFiles class from apache_beam.io.filebasedsource (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py) which both ReadAllFromText and ReadAllFromParquet call.
Seems like a potential bug, but I don't seem to be able to trace it in the source code.
=============
EDIT 2
After some more digging it seems that ReadAllFromParquet just isn't functional yet. ReadFromParquet calls apache_beam.io.parquetio._ParquetSource whereas ReadAllFromParquet simply calls apache_beam.io.filebasedsource._ReadRange.
I wonder if there's a way to turn this on if it's an experimental function?
You didn't mention whether you are using the latest Beam SDK; try SDK 2.16 to pick up the latest changes.
The documentation states that ReadAllFromParquet is an experimental function, as is ReadFromParquet; nonetheless, ReadFromParquet is reported as working in the thread "Apache-Beam: Read parquet files from nested HDFS directories", so you might want to try using that function instead.
I have two separate normalized text files that I want to train my BlazingText model on.
I am struggling to get this to work and the documentation is not helping.
Basically I need to figure out how to supply multiple files or S3 prefixes as "inputs" parameter to the sagemaker.estimator.Estimator.fit() method.
I first tried:
s3_train_data1 = 's3://{}/{}'.format(bucket, prefix1)
s3_train_data2 = 's3://{}/{}'.format(bucket, prefix2)
train_data1 = sagemaker.session.s3_input(s3_train_data1, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
train_data2 = sagemaker.session.s3_input(s3_train_data2, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
bt_model.fit(inputs={'train1': train_data1, 'train2': train_data2}, logs=True)
This doesn't work because SageMaker is looking for the key specifically to be "train" in the inputs parameter.
So then I tried:
bt_model.fit(inputs={'train': train_data1, 'train': train_data2}, logs=True)
This trains the model only on the second dataset and ignores the first one completely (a Python dict keeps only the last value for a repeated key).
Now finally I tried using a Manifest file using the documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/API_S3DataSource.html
(see manifest file format under "S3Uri" section)
the documentation says the manifest file format is a JSON that looks like this example:
[
{"prefix": "s3://customer_bucket/some/prefix/"},
"relative/path/to/custdata-1",
"relative/path/custdata-2"
]
Well, I don't think this is valid JSON in the first place, but what do I know; I still gave it a try.
When I try this:
s3_train_data_manifest = 'https://s3.us-east-2.amazonaws.com/bucketpath/myfilename.manifest'
train_data_merged = sagemaker.session.s3_input(s3_train_data_manifest, distribution='FullyReplicated', content_type='text/plain', s3_data_type='ManifestFile')
data_channel_merged = {'train': train_data_merged}
bt_model.fit(inputs=data_channel_merged, logs=True)
I get an error saying:
ValueError: Error training blazingtext-2018-10-17-XX-XX-XX-XXX: Failed Reason: ClientError: Data download failed:Unable to parse manifest at s3://mybucketpath/myfilename.manifest - invalid format
I tried replacing square brackets in my manifest file with curly braces ...but still I feel the JSON file format seems to be missing something that documentation fails to describe correctly?
You can certainly match multiple files with the same prefix, so your first attempt could have worked as long as you organize the files in your S3 bucket to suit. For example, the prefix s3://mybucket/foo/ will match the files s3://mybucket/foo/bar/data1.txt and s3://mybucket/foo/baz/data2.txt.
However, if there is a third file in your bucket called s3://mybucket/foo/qux/data3.txt that you don't want matched (while still matching the first two), there is no way to achieve that with a single prefix. In these cases a manifest would work. So, in the above example, the manifest would simply be:
[
{"prefix": "s3://mybucket/foo/"},
"bar/data1.txt",
"baz/data2.txt"
]
(And yes, this is valid JSON: it is an array whose first element is an object with an attribute called prefix, and all subsequent elements are strings.)
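One way to rule out hand-editing mistakes is to generate the manifest with a JSON library rather than typing it. A minimal sketch; build_manifest is a hypothetical helper name, and the layout simply mirrors the example above.

```python
import json


def build_manifest(prefix, relative_paths):
    """Serialize a manifest: a prefix object followed by relative paths."""
    entries = [{"prefix": prefix}] + list(relative_paths)
    return json.dumps(entries, indent=2)


manifest_text = build_manifest(
    "s3://mybucket/foo/", ["bar/data1.txt", "baz/data2.txt"]
)
# json.loads() raises ValueError on invalid JSON, so a successful
# round trip confirms the text parses.
parsed = json.loads(manifest_text)
print(manifest_text)
```

The resulting text can then be uploaded as the manifest object that the training channel points to.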
Please double check your manifest (you didn't actually post it so I can't do that for you) and make sure it conforms to the above syntax.
If you're still stuck, please open a thread on the AWS SageMaker forums (https://forums.aws.amazon.com/forum.jspa?forumID=285), and after you do that we can set up a PM to try and get to the bottom of this (never post your AWS account ID in a public forum like StackOverflow or even in the AWS forums).
I created a dataframe and selected some columns, say col1, col2 and col3, using df.select().
df1=df.select(col1,col2,col3)
I am writing this into a parquet file and saving it to s3.
df1.write.partitionBy("col1").format("parquet").save('s3a://myBucket/fol1/subfolder')
Currently there is no location like 's3a://myBucket/fol1/subfolder' in my S3. The only thing I have is 's3a://myBucket'. My question is: as there are no objects named fol1 and subfolder, will it create the objects itself and save the file, or will the code fail?
I think you're asking whether save('s3a://myBucket/fol1/subfolder') will create the fol1/subfolder structure in S3 and, if it doesn't, whether you need to create it yourself.
The bottom line is that you don't need to worry about creating the intermediate folder structure, because the Hadoop FS API creates it for you as needed.
@SteveLoughran's answer provides much more detail and deserves to be the accepted answer.
Although S3 is an object store, Spark, Hive &c all pretend it's a filesystem & use the Hadoop filesystem API.
Some early actions of a Spark save() are:
1) call FileSystem.exists(dest) & fail if there's something there (unless you have enabled appending to existing data)
2) call FileSystem.mkdirs(dest).
3) set up some _temporary dir underneath for the job, renaming things into place when the job is committed.
Action #2 triggers a scan for any entry in the path /a/b/c/dest being a file (which is a failure), then creates an empty directory marker object /a/b/c/dest/. That marker will be deleted as soon as a child directory (i.e. _temporary) is created.
At the end of the job then, there won't be any parent marker entries, but they go in there just to keep quiet all those bits of code which expect that after a mkdirs() call that the created directory exists.
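To make the marker behaviour concrete, here is a toy in-memory model of it. This is emphatically not the S3A implementation, just a sketch of the idea under simplifying assumptions: the store is a flat set of keys, and ToyObjectStore and its methods are invented names.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store: just a flat set of keys."""

    def __init__(self):
        self.keys = set()

    def _drop_parent_markers(self, path):
        # Parent "directories" are marker keys ending in "/".
        parents = path.split("/")[:-1]
        for depth in range(1, len(parents) + 1):
            self.keys.discard("/".join(parents[:depth]) + "/")

    def exists(self, path):
        # A path exists if it is a key or if any key lives under it.
        prefix = path.rstrip("/") + "/"
        return path in self.keys or any(k.startswith(prefix) for k in self.keys)

    def mkdirs(self, path):
        # Emulate a directory with an empty marker; parents become redundant.
        path = path.rstrip("/")
        self._drop_parent_markers(path)
        self.keys.add(path + "/")

    def put(self, key):
        # Writing a file also makes the parent markers redundant.
        self._drop_parent_markers(key)
        self.keys.add(key)


store = ToyObjectStore()
store.mkdirs("fol1/subfolder")             # marker "fol1/subfolder/" appears
store.mkdirs("fol1/subfolder/_temporary")  # parent marker is removed again
```

Even after the parent marker is gone, exists("fol1/subfolder") still answers True, because there are keys underneath it.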
Finally, be advised: the whole commit-by-rename mechanism is broken when it comes to S3 as it is (a) slow and (b) at risk of losing data due to directory listing consistency. You need a consistent listing layer (EMR: Consistent S3, Apache Hadoop: S3Guard, Databricks: something also DynamoDB based), and, for maximum performance atop Apache Hadoop 3.1, switch to a specific zero-rename S3A committer.