I have created a Glue crawler that runs every 6 hours, using the "Crawl new folders only" option. Every time the crawler runs, it fails with an "Internal Service Exception" error.
What have I tried so far?
Created another crawler with the "Crawl all the folders" option, set to run every 6 hours. It works perfectly without any issue.
Created another crawler with the "Crawl new folders only" option, but set to "Run on demand". It works perfectly without any issue.
All three scenarios above point to the same S3 bucket with the same IAM policy. I also tried reducing the 6-hour schedule to 15 minutes / 1 hour, but no luck.
What am I missing? Any help is appreciated.
If you are setting the RecrawlPolicy to CRAWL_NEW_FOLDERS_ONLY, make sure the SchemaChangePolicy uses LOG for both UpdateBehavior and DeleteBehavior; otherwise you will get an error like: The SchemaChangePolicy for "Crawl new folders only" Amazon S3 target can have only LOG DeleteBehavior value and LOG UpdateBehavior value.
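As a rough sketch of how that combination looks when creating the crawler with boto3 (the crawler name, role, database, and S3 path below are placeholders):

import boto3

glue = boto3.client("glue")

# "Crawl new folders only" requires LOG for both schema-change behaviors,
# otherwise the create/update call is rejected.
glue.create_crawler(
    Name="my-scheduled-crawler",                                     # placeholder
    Role="arn:aws:iam::123456789012:role/MyGlueRole",                # placeholder
    DatabaseName="my_database",                                      # placeholder
    Targets={"S3Targets": [{"Path": "s3://my-bucket/my-prefix/"}]},  # placeholder
    Schedule="cron(0 */6 * * ? *)",                                  # every 6 hours
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
)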
I'm trying to install https://github.com/eth0izzle/bucket-stream
When I run
python3 bucket-stream.py
nothing happens at all.
Does anyone know what's happening?
I waited more than 10 minutes to see any buckets logged.
I am trying to set up an AWS Glue Crawler using a JDBC connection in order to populate my AWS Glue Data Catalog databases.
I already have a Connection which passes the test, but when I submit my crawler creation I get this error: "Expected string length >= 1, but found 0 for params.Targets.JdbcTargets[0].customJdbcDriverClassName", as you can see in the first screenshot.
The only clue I have for now is that there is no Class Name attached to my connection. However, I cannot edit it while editing the connection.
Does this ring a bell for anyone?
Thanks a lot
I've also had this issue, and I even tried using the AWS CLI to create/update my connection and manually input the required parameter.
It turns out this is an AWS UI issue caused by a recent update. According to this post you can create it using the Legacy console for now (in the sidebar, there is a Legacy section where you can find the Legacy pages). I just tried it on my end and it worked =)
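For anyone who still wants the API route, here is a rough sketch of attaching the driver class name to the connection with boto3 (connection name, URL, credentials, and driver details are all placeholders, and I'm assuming the JDBC_DRIVER_CLASS_NAME / JDBC_DRIVER_JAR_URI connection properties are what the console was failing to set):

import boto3

glue = boto3.client("glue")

# ConnectionInput replaces the whole connection, so copy over any existing
# PhysicalConnectionRequirements (VPC/subnet/security groups) as well.
glue.update_connection(
    Name="my-jdbc-connection",  # placeholder
    ConnectionInput={
        "Name": "my-jdbc-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://host:5432/db",         # placeholder
            "USERNAME": "user",                                              # placeholder
            "PASSWORD": "password",                                          # placeholder
            "JDBC_DRIVER_CLASS_NAME": "org.postgresql.Driver",               # placeholder
            "JDBC_DRIVER_JAR_URI": "s3://my-bucket/drivers/postgresql.jar",  # placeholder
        },
    },
)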
I have two service accounts with the exact same roles under the same project; one can run the Flex template without any issue, but the other fails and returns:
Timeout in polling result file: <LOGGING_BUCKET>. Service account: <SERVICE_ACCOUNT> Image URL: <IMAGE_URL> Troubleshooting guide at https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling
The SA that fails doesn't write logs to the GCS bucket, which makes it really difficult to debug. The graph doesn't get created and the job seems to get stuck at the queued stage. The roles of both SAs are:
BigQuery Admin
Bigtable User
Dataflow Developer
Editor
Storage Object Viewer
Sorry if this is obvious... but:
Have you checked the Google doc from the error? (https://cloud.google.com/dataflow/docs/guides/common-errors#timeout-polling).
Do both SAs have the same roles?
Let's say that SA1 can run Flex1 and SA2 can't run Flex2. Have you tried assigning SA1 to Flex2? (See the sketch below.)
What could be the difference between the two SAs?
If you create SA3 with the same roles as SA2 and assign it to Flex2, does it work?
Good luck
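If it helps with those last two tests, here is a rough sketch of launching a Flex template with an explicit service account, so you can swap SA1 or SA3 into the launch of Flex2. It assumes the google-api-python-client library, and the project, region, bucket, and SA address are all placeholders:

from googleapiclient.discovery import build  # pip install google-api-python-client

dataflow = build("dataflow", "v1b3")  # uses application-default credentials

response = dataflow.projects().locations().flexTemplates().launch(
    projectId="my-project",   # placeholder
    location="us-central1",   # placeholder
    body={
        "launchParameter": {
            "jobName": "flex2-with-sa1",
            "containerSpecGcsPath": "gs://my-bucket/templates/flex2.json",  # placeholder
            "parameters": {},
            "environment": {
                # Swap the service account here to see whether the failure follows the SA.
                "serviceAccountEmail": "sa1@my-project.iam.gserviceaccount.com",
            },
        }
    },
).execute()

print(response["job"]["id"])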
I have a Hive script I'm running in EMR that is creating a partitioned Parquet table in S3 from a ~40GB gzipped CSV file also stored in S3.
The script runs fine for about 4 hours but reaches a point (pretty sure when it is just about done creating the Parquet table) where it errors out. The logs show that the error is:
HiveException: Hive Runtime Error while processing row
caused by:
AmazonS3Exception: Bad Request
There really isn't any more useful information in the logs that I can see. It reads the CSV file from S3 fine and creates a couple of metadata files in S3 fine as well, so I've confirmed the instance has read/write permissions to the bucket.
I really can't think of anything else that could be going on, and I wish there were more information in the logs about what "Bad Request" Hive is making to S3. Does anyone have any ideas?
BadRequest is a fairly meaningless response from AWS, which it sends whenever there is some reason it doesn't like the caller; nobody really knows what's happening.
The troubleshooting docs for the ASF S3A connector list some causes, but they aren't complete, and they are based on guesswork about what made the message go away.
If you have the request ID of the call that failed, you can submit a support request for Amazon to see what they saw on their side.
If it makes you feel any better, I'm seeing it when I try to list exactly one directory in an object store, and I'm a co-author of the S3A connector. Like I said: "guesswork". Once you find out, add a comment here or, if it's not in the troubleshooting doc, submit a patch to Hadoop on the topic.
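One way to get a request ID to hand to support is to reproduce the failing call outside Hive, e.g. from the same EMR node with boto3; this is just a sketch, and the bucket and prefix are placeholders for whatever path Hive is writing to:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # Placeholder bucket/prefix: point this at the same location Hive writes to.
    s3.list_objects_v2(Bucket="my-bucket", Prefix="path/hive/is/writing/")
except ClientError as err:
    meta = err.response["ResponseMetadata"]
    print("HTTP status:", meta.get("HTTPStatusCode"))
    print("Request ID: ", meta.get("RequestId"))
    print("Extended ID:", meta.get("HostId"))  # the x-amz-id-2 header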
I'm executing a query in AWS Athena and writing the results to S3. It seems to take a long time (way too long, in fact) for the file to become available when I execute the query from a Lambda script.
I'm scanning 70MB of data, and the file returned is 12MB. I execute the query from a Lambda script like so:
import boto3

athena_client = boto3.client('athena')
athena_client.start_query_execution(
    QueryString=query_string,
    ResultConfiguration={
        'OutputLocation': 'location_on_s3',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'},
    }
)
If I run the query directly in Athena it takes 2.97 seconds. However, when I run it from the Lambda script, the file only seems to be available after about 2 minutes.
Does anyone know the write performance of AWS Athena to AWS S3? I would like to know if this is normal; the docs don't state how quickly the write occurs.
Every query in Athena writes to S3.
If you check the History tab on the Athena page in the console, you'll see a history of all the queries you've run (not just through the console, but generally). Each of those has a link to a download path.
If you click the Settings button, a dialog will open asking you to specify an output location. Check that location and you'll find all your query results there.
Why is this taking so much longer from your Lambda script? I'm guessing, but the only suggestion I have is that you may be querying across regions: if your data is in one region and your result location is in another, you might experience slowness due to the cross-region transfer. Even so, 12MB should be fast.
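Note also that start_query_execution is asynchronous: it returns immediately with a QueryExecutionId, and the result file is only complete once the query reaches a SUCCEEDED state. A minimal polling sketch (the query string and output location are placeholders):

import time
import boto3

athena_client = boto3.client('athena')

response = athena_client.start_query_execution(
    QueryString='SELECT ...',  # placeholder query
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'},  # placeholder
)
query_id = response['QueryExecutionId']

# Poll until Athena reports a terminal state; only then read the CSV from S3.
while True:
    execution = athena_client.get_query_execution(QueryExecutionId=query_id)
    state = execution['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

print(state, execution['QueryExecution']['ResultConfiguration']['OutputLocation'])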