I'm an AWS newbie trying to use the Textract API, their OCR service.
As far as I understand, I need to upload files to an S3 bucket and then run Textract on them.
I have created the bucket and uploaded the file to it.
I have set up the permissions.
But when I run my code, it fails.
import boto3
import trp
# Document
s3BucketName = "textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420"
documentName = "ok0001_prioridade01_x45f3.pdf"
# Amazon Textract client
textract = boto3.client(
    'textract',
    region_name="us-east-1",
    aws_access_key_id="xxxxxx",
    aws_secret_access_key="xxxxxxxxx")
# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
Here is the error I get:
botocore.errorfactory.InvalidS3ObjectException: An error occurred (InvalidS3ObjectException) when calling the AnalyzeDocument operation: Unable to get object metadata from S3. Check object key, region and/or access permissions.
What am I missing? How could I solve that?
You are missing an S3 access policy. For a quick fix, attach the AmazonS3ReadOnlyAccess policy.
A good practice is to apply the principle of least privilege and grant further access only when it is needed. So I'd advise you to create a specific policy that allows access only to your S3 bucket textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420, and only in the us-east-1 region.
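As a rough illustration of such a bucket-specific policy, here is a minimal sketch that builds the policy document in Python. The function name read_only_bucket_policy is made up for this example, and the second statement (listing keys) is an assumption about what you might want; only s3:GetObject is needed to read the document. Restricting by region is left out of the sketch.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document granting read access to one bucket only."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read the objects (documents) stored in this bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
            {
                # Optional (assumption): list the keys of this bucket only
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
        ],
    }


policy = read_only_bucket_policy(
    "textract-console-us-east-1-057eddde-3f44-45c5-9208-fec27f9f6420"
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as a custom policy and attached to the user whose keys the script uses.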
Amazon Textract currently supports PNG, JPEG, and PDF formats. Looks like you are using PDF.
Once you have a valid format, you can use the Python S3 API to read the data of the object in the S3 bucket. Once you have read the object, you can pass the byte array to the analyze_document method. For a full example of how to use the AWS SDK for Python (Boto3) with Amazon Textract to detect text, form, and table elements in document images, see:
https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_wrapper.py
Try following that code example to see if your issue is resolved.
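The read-then-pass-bytes approach described above could look roughly like the sketch below. The function name analyze_tables_from_s3 is hypothetical; the two clients are passed in so the function can be tried with stand-ins, and it assumes the document is small enough to be sent inline.

```python
def analyze_tables_from_s3(s3_client, textract_client, bucket, key):
    """Read an S3 object into memory and send its bytes to Textract."""
    # Download the object; "Body" is a stream, so read() returns the raw bytes
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    document_bytes = obj["Body"].read()
    # Pass the bytes directly instead of an S3Object reference
    return textract_client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["TABLES"],
    )


# Usage (needs boto3 and valid credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   textract = boto3.client("textract", region_name="us-east-1")
#   response = analyze_tables_from_s3(s3, textract, s3BucketName, documentName)
```

If get_object itself fails, the problem is with the key, region, or S3 permissions rather than with Textract, which helps narrow down the original error.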
"Could you provide some clearance on the params to use"
I just ran the Java V2 example and it works perfectly. In this example, I am using a PNG file located in a specific Amazon S3 bucket.
Here are the parameters that you need:
Make sure when implementing this in Python, you set the same parameters.
How do I write a Lambda function in AWS (Python) to delete the contents of S3 buckets? Please share a template for this; I just want the code.
This will help you set up a Python-based Lambda, including the entry point for the handler:
https://stackify.com/aws-lambda-with-python-a-complete-getting-started-guide/
Once that is figured out, you need to create an s3 client using:
import boto3
client = boto3.client("s3")
Then you can follow the user guide to empty the bucket:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/empty-bucket.html
Also note that the Lambda runs using an assumed role, so please make sure the IAM role has the relevant permissions:
https://docs.aws.amazon.com/lambda/latest/dg/lambda-intro-execution-role.html
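Putting those pieces together, a handler might look like the sketch below. It assumes an unversioned bucket and that the bucket name arrives in the event under a "bucket" key; both are assumptions for illustration, as are the names empty_bucket and the optional s3_client argument (added so the handler can be exercised without AWS).

```python
def empty_bucket(s3_client, bucket_name):
    """Delete every object in an unversioned bucket and return the count."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        # A page holds at most 1,000 keys, which is also the delete_objects limit
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


def lambda_handler(event, context, s3_client=None):
    """Lambda entry point; reads the bucket name from event["bucket"] (assumed)."""
    if s3_client is None:
        import boto3  # provided by the Lambda Python runtime

        s3_client = boto3.client("s3")
    return {"deleted": empty_bucket(s3_client, event["bucket"])}
```

The execution role needs s3:ListBucket on the bucket and s3:DeleteObject on its objects. On a versioned bucket this would only add delete markers, so the user guide's steps for versioned buckets would still apply.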
I want to get the bucket policy for various buckets. I tried the following code snippet (picked from the boto3 documentation):
conn = boto3.resource('s3')
bucket_policy=conn.BucketPolicy('demo-bucket-py')
print(bucket_policy)
But here's the output I get:
s3.BucketPolicy(bucket_name='demo-bucket-py')
What should I fix here? Or is there another way to get the access policy for S3?
Try print(bucket_policy.policy). More information on that here.
This worked for me:
import boto3
# Create an S3 client
s3 = boto3.client('s3')
# Call to S3 to retrieve the policy for the given bucket
result = s3.get_bucket_policy(Bucket='my-bucket')
print(result)
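The result printed above is the whole response dictionary; the policy itself is a JSON string under the "Policy" key. A small sketch of unpacking it follows (the helper name get_policy_document is made up, and the client is passed in as a parameter):

```python
import json


def get_policy_document(s3_client, bucket_name):
    """Return the bucket policy as a dict instead of a raw JSON string."""
    response = s3_client.get_bucket_policy(Bucket=bucket_name)
    # The policy arrives JSON-encoded under the "Policy" key
    return json.loads(response["Policy"])


# Usage (needs boto3 and valid credentials):
#   import boto3
#   policy = get_policy_document(boto3.client("s3"), "my-bucket")
#   print(json.dumps(policy, indent=2))
```

If the bucket has no policy attached, get_bucket_policy raises a ClientError, so wrap the call in a try/except when looping over several buckets.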
To do this, you need to supply your keys, like this: s3=boto3.client("s3",aws_access_key_id=access_key_id,aws_secret_access_key=secret_key). But a much better way is to run the aws configure command and enter your credentials there; see the setup docs: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html. Once that is set up, you won't need to put your keys in your code again; boto3 and the AWS CLI will pick them up automatically behind the scenes.
You can even set up different profiles to work with different accounts.
I have been provided with the access and secret key for an Amazon S3 container. No more details were provided other than to drop some files into some specific folder.
I downloaded the AWS CLI and also the AWS SDK. So far, there seems to be no way for me to check the bucket name or list the folders where I'm supposed to drop my files. Every single command seems to require knowing a bucket name.
Trying to list with aws s3 ls gives me the error:
An error occurred (AccessDenied) when calling the ListBuckets operation: Access Denied
Is there a way to list the content of my current location (I'm guessing the credentials I was given are linked directly to a bucket?). I'd like to see at least the folders where I'm supposed to drop my files, but the SDK client for the console app I'm building seems to always require a bucket name.
Was I provided incomplete info or limited rights?
Do you know the bucket name or not? If you don't, and you don't have permission for ListAllMyBuckets and GetBucketLocation on * and ListBucket on the bucket in question, then you can't get the bucket name. That's how it is supposed to work. If you know the bucket, then you can run aws s3 ls s3://bucket-name/ to list the objects in the bucket.
Note that S3 buckets don't have the concept of a "folder". It's user interface "sugar" to make keys look like folders and files. Internally, there is just the key and the object.
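To make the "just keys" point concrete, here is a small sketch that derives "folders" from a flat list of keys, roughly mimicking what the Prefix and Delimiter parameters of a list call do. The keys and the function name list_level are invented for the example; no AWS call is made.

```python
def list_level(keys, prefix="", delimiter="/"):
    """Split flat keys into 'folders' and 'files' directly under a prefix."""
    folders, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the first delimiter behaves like a folder name
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        elif rest:
            files.append(key)
    return sorted(folders), files


# Hypothetical keys, stored by S3 as plain strings
keys = ["incoming/a.csv", "incoming/2024/b.csv", "readme.txt"]
print(list_level(keys))               # (['incoming/'], ['readme.txt'])
print(list_level(keys, "incoming/"))  # (['incoming/2024/'], ['incoming/a.csv'])
```

So "dropping a file into a folder" just means uploading an object whose key starts with that folder's prefix.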
Looks like it was just not possible without enhanced rights or the actual bucket name. I was able to get both from the client later on and complete the task. Thanks for the comments.
I uploaded a .flac file to an Amazon S3 bucket, but when I try to transcribe the audio using the Amazon Transcribe Golang SDK I get the error below. I tried making the .flac file in the S3 bucket public but still get the same error, so I don't think it's a permission issue. Is there anything I'm missing that prevents the Transcribe service from accessing the file in the S3 bucket? The API user that is uploading and transcribing has full access to the S3 and Transcribe services.
Example Go code:
jobInput := transcribe.StartTranscriptionJobInput{
	JobExecutionSettings: &transcribe.JobExecutionSettings{
		AllowDeferredExecution: aws.Bool(true),
		DataAccessRoleArn:      aws.String("my-arn"),
	},
	LanguageCode: aws.String("en-US"),
	Media: &transcribe.Media{
		MediaFileUri: aws.String("https://s3.us-east-1.amazonaws.com/{MyBucket}/{MyObjectKey}"),
	},
	Settings: &transcribe.Settings{
		MaxAlternatives:   aws.Int64(2),
		MaxSpeakerLabels:  aws.Int64(2),
		ShowAlternatives:  aws.Bool(true),
		ShowSpeakerLabels: aws.Bool(true),
	},
	TranscriptionJobName: aws.String("jobName"),
}
Amazon Transcribe response:
BadRequestException: The S3 URI that you provided can't be accessed. Make sure that you have read permission and try your request again.
My issue was that the audio file was being uploaded to S3 with an ACL specified. I removed that from the S3 upload code and I no longer get the error. Also, per the docs, if you have "transcribe" in your S3 bucket name, the Transcribe service will have permission to access it. I made that change as well, but you still need to ensure you aren't using an ACL.
I have a Spark job which needs to read data from S3 in another account (Data Account) and process it.
Once it's processed, it should write the result back to S3 in my account.
So I configured the access and secret key of the "Data Account" in my Spark session like below:
val hadoopConf=sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key","DataAccountKey")
hadoopConf.set("fs.s3a.secret.key","DataAccountSecretKey")
hadoopConf.set("fs.s3a.endpoint", "s3.ap-northeast-2.amazonaws.com")
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
val df = spark.read.json("s3a://DataAccountS/path")
/* Reading is success */
df.limit(3).write.json("s3a://myaccount/test/")
With this, reading is fine, but I get the error below when writing.
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 301, AWS Service: Amazon S3, AWS Request ID: A5E574113745D6A0, AWS Error Code: PermanentRedirect, AWS Error Message: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
But if I don't configure the Data Account details and try to write some dummy data to my S3 from Spark, it works.
So how should I configure things so that both reading from the other account's S3 and writing to my account's S3 work?
If your Spark classpath has the hadoop-2.7 JARs on it, you can use secrets-in-paths as the technique, with a URL like s3a://DataAccountKey:DataAccountSecretKey@DataAccount/path. Be aware that this will log the secrets everywhere.
Hadoop 2.8+ JARs will tell you off for logging your secrets everywhere, but add per-bucket binding:
spark.hadoop.fs.s3a.bucket.DataAccount.access.key DataAccountKey
spark.hadoop.fs.s3a.bucket.DataAccount.secret.key DataAccountSecretKey
spark.hadoop.fs.s3a.bucket.DataAccount.endpoint s3.ap-northeast-2.amazonaws.com
Then, for all interaction with that bucket, these per-bucket options will override the main settings.
Note: if you want to use this, don't think that dropping hadoop-aws-2.8.jar into your classpath will work; you'll only get classpath errors. All of the hadoop-* JARs need to go to 2.8, and the aws-sdk needs to be updated too.
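When several buckets need their own credentials, the per-bucket option names can be generated rather than typed by hand. The helper below is a hypothetical sketch (per_bucket_s3a_options is not part of Spark or Hadoop); it only builds the same three property names shown above.

```python
def per_bucket_s3a_options(bucket, access_key, secret_key, endpoint):
    """Build Spark config entries that apply to a single S3A bucket only."""
    base = f"spark.hadoop.fs.s3a.bucket.{bucket}."
    return {
        base + "access.key": access_key,
        base + "secret.key": secret_key,
        base + "endpoint": endpoint,
    }


# Placeholder values taken from the question; real keys should not be printed
options = per_bucket_s3a_options(
    "DataAccount",
    "DataAccountKey",
    "DataAccountSecretKey",
    "s3.ap-northeast-2.amazonaws.com",
)
for name, value in options.items():
    print(name, value)
```

In PySpark, each pair could then be passed to SparkSession.builder.config(name, value) before the session is created, leaving the default fs.s3a settings to cover your own account's bucket.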