How to access the Columnar URL Index using Amazon Athena

I am new to AWS and I'm following this tutorial to access the columnar URL index dataset in Common Crawl. I executed this query:
SELECT COUNT(*) AS count,
url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
AND subset = 'warc'
AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY count DESC
And I keep getting this error:
Error opening Hive split s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-05/subset=warc/part-00082-248eba37-08f7-4a53-a4b4-d990640e4be4.c000.gz.parquet (offset=0, length=33554432): com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ZSRS4FD2ZTNJY9PV; S3 Extended Request ID: IvDfkWdbDYXjjOPhmXSQD3iVkBiE2Kl1/K3xaFc1JulOhCIcDbWUhnbww7juthZIUm2hZ9ICiwg=; Proxy: null), S3 Extended Request ID: IvDfkWdbDYXjjOPhmXSQD3iVkBiE2Kl1/K3xaFc1JulOhCIcDbWUhnbww7juthZIUm2hZ9ICiwg=
What's the reason? And how do I resolve it?

You are hitting the S3 request rate limit because your query is trying to access too many Parquet files at the same time. Consider compacting the underlying files into fewer, larger ones.
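For data you control, one way to do that is an Athena CTAS statement that rewrites the relevant partition into a small number of bucketed Parquet files in your own bucket. A rough sketch, assuming a target database, table, and S3 location of your own (all placeholders here):
CREATE TABLE mydb.ccindex_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://my-own-bucket/ccindex-compacted/',
  bucketed_by = ARRAY['url_host_registered_domain'],
  bucket_count = 10
) AS
SELECT url, url_host_tld, url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc';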

Related

Can I export a `csv` file from a query in ClickHouse to S3? (just sharing)

I get a timeout error when I query ClickHouse over a large dataset, so I want to run the query, export the result as a CSV file, and upload it to S3.
Yes, you can do that.
First, make sure s3_create_new_file_on_insert = 1 is set in your current ClickHouse session, otherwise the insert may fail. You also need permission to execute the statement below.
SET s3_create_new_file_on_insert = 1
Example:
INSERT INTO FUNCTION s3('https://...naws.com/my.csv', 'KEY', 'SECRET')
SELECT user_id, name
FROM db.users
WHERE application_id = 2
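If ClickHouse does not infer the output format from the file extension, the format can also be passed as an extra argument to the s3 function; a variant of the same example (URL and credentials kept as the placeholders above):
INSERT INTO FUNCTION s3('https://...naws.com/my.csv', 'KEY', 'SECRET', 'CSVWithNames')
SELECT user_id, name
FROM db.users
WHERE application_id = 2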
More info
https://medium.com/datadenys/working-with-s3-files-directly-from-clickhouse-7db330af7875
https://clickhouse.com/docs/en/sql-reference/table-functions/s3/

Using '--filter-expression' on a DateTime column of DynamoDB

I want to filter data from DynamoDB using a column that stores datetime values (like '2021-06-01T06:00:00.255Z'). I tried the condition --filter-expression "Datecolumn BETWEEN 2021-06-01T12:00:00.000Z AND 2021-06-30T12:00:00.000Z" in an aws dynamodb scan query, and I'm getting the following error message:
Error Message:
An error occurred (ValidationException) when calling the Scan operation: Invalid FilterExpression: Syntax error; token: "2021", near: "BETWEEN 2021-"
Could someone please help me with the query?
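For reference, DynamoDB filter expressions cannot contain literal values; the datetimes have to be supplied as expression attribute values. A sketch of such a scan, assuming a table name of MyTable (the table name is illustrative):
aws dynamodb scan \
    --table-name MyTable \
    --filter-expression "Datecolumn BETWEEN :start AND :end" \
    --expression-attribute-values '{":start": {"S": "2021-06-01T12:00:00.000Z"}, ":end": {"S": "2021-06-30T12:00:00.000Z"}}'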

Permission bigquery.tables.updateData denied when querying INFORMATION_SCHEMA.COLUMNS

I'm querying BigQuery (via Databricks) with a service account that has the following roles:
BigQuery Data Viewer
BigQuery Job User
BigQuery Metadata Viewer
BigQuery Read Session User
The query is:
SELECT distinct(column_name) FROM `project.dataset.INFORMATION_SCHEMA.COLUMNS` where data_type = "TIMESTAMP" and is_partitioning_column = "YES"
I'm actually querying via Azure Databricks:
spark.read.format("bigquery")
.option("materializationDataset", dataset)
.option("parentProject", projectId)
.option("query", query)
.load()
.collect()
But I'm getting:
"code" : 403,
"errors" : [ {
"domain" : "global",
"message" : "Access Denied: Table project:dataset._sbc_f67ac00fbd5f453b90....: Permission bigquery.tables.updateData denied on table project:dataset._sbc_f67ac00fbd5f453b90.... (or it may not exist).",
"reason" : "accessDenied"
} ],
After adding BigQuery Data Editor, the query works.
Why do I need write permissions to view this metadata? Are there any lower permissions I can grant?
In the docs I see that only Data Viewer is required, so I'm not sure what I'm doing wrong.
BigQuery saves all query results to a temporary table if a specific table name is not specified.
From the documentation, the following permissions are required:
bigquery.tables.create permissions to create a new table
bigquery.tables.updateData to write data to a new table, overwrite a table, or append data to a table
bigquery.jobs.create to run a query job
Since the service account already has the BigQuery Job User role, it is able to run the query; it additionally needs the BigQuery Data Editor role for the bigquery.tables.create and bigquery.tables.updateData permissions, which are used when writing the query results to that temporary table.
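For completeness, a minimal sketch of granting that role to the service account at the project level with gcloud (the project ID and service-account email are placeholders):
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor"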

Unable to access DynamoDB from Lambda

I have created a Lambda Java project to get items from DynamoDB, but I am getting an error when accessing the table.
Code that I wrote:
dbClient = AmazonDynamoDBClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
DynamoDB dynamoDB = new DynamoDB(dbClient);
Table table = dynamoDB.getTable("TokenSystem");
Item item = table.getItem("TokenId", 123456);
return item.toJSON();
Error getting in lambda console:
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: The provided key element does not match the schema (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: 9OJVLPJHV011KLGKI20Q8QN2FNVV4KQNSO5AEMVJF66Q9ASUAAJG)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1658)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1322)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1072)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:745)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:719)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:701)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:669)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:651)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:515)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:3609)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:3578)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:3567)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeGetItem(AmazonDynamoDBClient.java:1869)
at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.getItem(AmazonDynamoDBClient.java:1840)
at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.doLoadItem(GetItemImpl.java:77)
at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.getItemOutcome(GetItemImpl.java:40)
at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.getItemOutcome(GetItemImpl.java:99)
at com.amazonaws.services.dynamodbv2.document.internal.GetItemImpl.getItem(GetItemImpl.java:111)
at com.amazonaws.services.dynamodbv2.document.Table.getItem(Table.java:624)
at com.amazonaws.lambda.service.TokenValidatorService.retrieveItemFromDB(TokenValidatorService.java:82)
at com.amazonaws.lambda.service.TokenValidatorService.checkToken(TokenValidatorService.java:42)
In DynamoDB I have a table with the SAME NAME and REGION. I am using the AWS package com.amazonaws.services.dynamodbv2 to do all the operations. Can anyone help me solve the issue?
The provided key element does not match the schema
Make sure you provide the right key name and value type: the attribute name and data type passed to getItem must exactly match the table's primary key definition.
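As an illustration, assuming the TokenSystem table's partition key is a String attribute named TokenId (the exact schema is an assumption here), the call would need to pass a String value rather than a number:
// Hypothetical fix: if the partition key is the String attribute "TokenId",
// pass a String value; passing the number 123456 does not match the schema.
Table table = dynamoDB.getTable("TokenSystem");
Item item = table.getItem("TokenId", "123456");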

Digital Ocean Spaces | Add expire date for files with s3cmd

I'm trying to add expiry days to a file and a bucket, but I have this problem:
sudo s3cmd expire s3://<my-bucket>/ --expiry-days=3 expiry-prefix=backup
ERROR: Error parsing xml: syntax error: line 1, column 0
ERROR: not found
ERROR: S3 error: 404 (Not Found)
and this
sudo s3cmd expire s3://<my-bucket>/<folder>/<file> --expiry-day=3
ERROR: Parameter problem: Expecting S3 URI with just the bucket name set instead of 's3:////'
How do I add expiry days in DO Spaces for a folder or file using s3cmd?
Consider configuring the bucket's Lifecycle Rules.
Lifecycle rules can be used to perform different actions on objects in a Space over the course of their "life." For example, a Space may be configured so that objects in it expire and are automatically deleted after a certain length of time.
In order to configure new lifecycle rules, send a PUT request to ${BUCKET}.${REGION}.digitaloceanspaces.com/?lifecycle
The body of the request should include an XML element named LifecycleConfiguration containing a list of Rule objects.
https://developers.digitalocean.com/documentation/spaces/#get-bucket-lifecycle
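As an illustration only (the rule ID, prefix, and number of days below are made up), such a request body could look like:
<LifecycleConfiguration>
  <Rule>
    <ID>expire-backups</ID>
    <Prefix>backup</Prefix>
    <Status>Enabled</Status>
    <Expiration>
      <Days>3</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>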
The expire option is not implemented on Digital Ocean Spaces
Thanks to Vitalii's answer for pointing to the API.
However, the API isn't really easy to use directly, so I've done it via a Node.js script.
First of all, generate your API keys here: https://cloud.digitalocean.com/account/api/tokens
Then put them in the ~/.aws/credentials file (according to the docs):
[default]
aws_access_key_id=your_access_key
aws_secret_access_key=your_secret_key
Now create an empty Node.js project, run npm install aws-sdk, and use the following script:
const aws = require('aws-sdk');
// Replace with your region endpoint, nyc1.digitaloceanspaces.com for example
const spacesEndpoint = new aws.Endpoint('fra1.digitaloceanspaces.com');
// Replace with your bucket name
const bucketName = 'myHeckingBucket';
const s3 = new aws.S3({endpoint: spacesEndpoint});
s3.putBucketLifecycleConfiguration({
  Bucket: bucketName,
  LifecycleConfiguration: {
    Rules: [{
      ID: "autodelete_rule",
      Expiration: {Days: 30},
      Status: "Enabled",
      Prefix: '/', // Unlike AWS, in DO this parameter is required
    }]
  }
}, function (error, data) {
  if (error)
    console.error(error);
  else
    console.log("Successfully modified bucket lifecycle!");
});
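Save it as, say, lifecycle.js and run it with node lifecycle.js; the AWS SDK reads the credentials from ~/.aws/credentials automatically.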