I am trying to use DynamoDB operation BatchWriteItem, wherein I want to insert multiple records into one table.
This table has one partition key and one sort key.
I am using AWS lambda and Go language.
I get the elements to be inserted into a slice.
I am following this procedure.
I create a PutRequest structure and add AttributeValues for the first record from the list.
I create a WriteRequest from this PutRequest.
I add this WriteRequest to an array of WriteRequests.
I create a BatchWriteItemInput, which consists of RequestItems, which is basically a map from the table name to the array of WriteRequests.
After that I call BatchWriteItem, which results in an error:
Provided list of item keys contains duplicates.
Any pointers on why this could be happening?
You've provided two or more items with identical primary keys (which in your case means identical partition and sort keys).
Per the BatchWriteItem docs, you cannot perform multiple operations on the same item in the same BatchWriteItem request.
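For illustration (the question uses Go, but the same idea applies in any SDK), here is a minimal Python sketch that de-duplicates items on the primary key before batching; the table name and key attribute names ("pk", "sk") are placeholders, not taken from the question:

import boto3

def batch_write_deduplicated(items, table_name="my-table"):
    # Placeholder table and key names, for illustration only.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table(table_name)

    # Later duplicates overwrite earlier ones, so each (partition key, sort key)
    # combination appears only once in the request.
    unique = {(item["pk"], item["sk"]): item for item in items}

    # batch_writer splits the writes into BatchWriteItem calls of at most 25 items
    # and retries unprocessed items for you.
    with table.batch_writer() as batch:
        for item in unique.values():
            batch.put_item(Item=item)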
Consideration: this answer works for Python.
As @Benoit has remarked, the boto3 documentation states:
If you want to bypass the no-duplication limitation of a single batch write request (which surfaces as botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the BatchWriteItem operation: Provided list of item keys contains duplicates.),
you could specify overwrite_by_pkeys=['partition_key', 'sort_key'] on the batch writer to "de-duplicate request items in buffer if match new request item on specified primary keys", according to the documentation and the source code. That is, if the combination of partition key and sort key already exists in the buffer, it will drop that request and replace it with the new one.
Example
Suppose there is a pandas DataFrame that you want to write to a DynamoDB table; the following function could be helpful.
import json
import datetime as dt
import boto3
import pandas as pd
from typing import Optional
def write_dynamoDB(df: 'pandas.core.frame.DataFrame', tbl: str,
                   partition_key: Optional[str] = None,
                   sort_key: Optional[str] = None):
    '''
    Function to write a pandas DataFrame to a DynamoDB table through the
    batch write operation. In case there are any float values it handles
    them by converting the data to a JSON format.

    Arguments:
    * df: pandas DataFrame to write to the DynamoDB table.
    * tbl: DynamoDB table name.
    * partition_key (Optional): DynamoDB table partition key.
    * sort_key (Optional): DynamoDB table sort key.
    '''

    # Initialize AWS resource
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table(tbl)

    # Check if overwrite keys were provided (drop any that are None)
    overwrite_keys = [k for k in (partition_key, sort_key) if k] or None

    # Check if there are floats (convert them to Decimals instead)
    if any(v == 'float64' for v in df.dtypes.values):
        from decimal import Decimal

        # Save decimals with JSON
        df_json = json.loads(
            json.dumps(df.to_dict(orient='records'),
                       default=date_converter,
                       allow_nan=True),
            parse_float=Decimal
        )

        # Batch write
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            for element in df_json:
                batch.put_item(Item=element)

    else:  # If there are no floats in the data
        # Batch writing
        with table.batch_writer(overwrite_by_pkeys=overwrite_keys) as batch:
            columns = df.columns
            for row in df.itertuples():
                batch.put_item(
                    Item={col: row[idx + 1] for idx, col in enumerate(columns)}
                )


def date_converter(obj):
    # Convert datetime/date objects so json.dumps can serialize them
    if isinstance(obj, dt.datetime):
        return obj.__str__()
    elif isinstance(obj, dt.date):
        return obj.isoformat()
    raise TypeError("Type not serializable: {}".format(type(obj)))
You can then call it as write_dynamoDB(dataframe, 'my_table', 'the_partition_key', 'the_sort_key').
Use batch_writer instead of batch_write_item:
import boto3

dynamodb = boto3.resource("dynamodb", region_name='eu-west-1')
my_table = dynamodb.Table('mirrorfm_yt_tracks')

with my_table.batch_writer(overwrite_by_pkeys=["user_id", "game_id"]) as batch:
    for item in items:
        batch.put_item(
            Item={
                'user_id': item['user_id'],
                'game_id': item['game_id'],
                'score': item['score']
            }
        )
If you don't have a sort key, overwrite_by_pkeys can be None.
This is essentially the same answer as @MiguelTrejo's (thanks! +1), but simplified.
Related
I have a DynamoDB table on which a GSI is defined with a partition key and a sort key.
Let's say the partition key is name and the sort key is ssn for the GSI.
I have to fetch based upon a name and ssn; below is the query I am using, and it works fine.
table.query(IndexName='lookup-by-name',
            KeyConditionExpression=Key('name').eq(name) & Key('ssn').eq(ssn))
Now, I have to query based upon a name and a list of ssns.
For example:
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']
name = 'Alex'
Query all records whose name is 'Alex' and whose ssn is present in the ssns list.
How do I implement something like this?
While the native DynamoDB SDK cannot do this in a single query, you can achieve it using PartiQL, which provides a SQL-like interface for interacting with DynamoDB.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-gettingstarted.html
import boto3

client = boto3.client('dynamodb', region_name="eu-west-1")

name = 'Alex'
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']

response = client.execute_statement(
    Statement="Select * from \"MyTableTest\".\"lookup-by-name\" where \"name\" = '%s' AND \"ssn\" IN %s" % (name, ssns)
)

print(response['Items'])
It also requires you to use the lower-level client instead of the Table-level resource that you are using above.
You would have to do multiple queries.
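For instance, a rough sketch (reusing the index and attribute names from the question; the table name here is a placeholder) that runs one query per ssn and merges the results:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table name; the index and attribute names come from the question.
table = boto3.resource('dynamodb').Table('my_table')

name = 'Alex'
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']

items = []
for ssn in ssns:
    # One query per ssn against the GSI, collecting all matching items.
    response = table.query(
        IndexName='lookup-by-name',
        KeyConditionExpression=Key('name').eq(name) & Key('ssn').eq(ssn)
    )
    items.extend(response['Items'])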
I ended up using just the name as the key condition and then filtering out the ssn in Python code.
The below worked for me, as the number of records was not large.
ssns = ['ssn1', 'ssn2', 'ssn3', 'ssn4']
response = table.query(IndexName='lookup-by-name',
                       KeyConditionExpression=Key('name').eq(name))
data = response['Items']
data = list(filter(lambda record: record['ssn'] in ssns, data))
return data
I am trying to perform a batch write item operation for a DynamoDB table using the boto3 Python library. The table has both a hash and a range key. When I performed the same operation on another table with only a hash key it worked well. I am wondering how to supply both the hash and range key when performing the batch write item operation.
import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource('dynamodb', 'us-east-2')
table = dynamodb.Table('edc_test')

scan = table.scan(
    #ProjectionExpression='#k',
    ProjectionExpression='resource_id',
    #ProjectionExpression='version_id',
    FilterExpression=Attr('Health.New version - Veracity unavailable').eq("A new dataset is available but IDQ rules are not generated yet")
)
items = scan['Items']
print('length', str(len(items)))
print(items)

def lambda_handler(event, context):
    with table.batch_writer() as batch:
        for each in scan['Items']:
            batch.delete_item(Key=each)
scan = table.scan(
    ProjectionExpression='version_id,resource_id',
    FilterExpression=Attr('Health.New version - Veracity unavailable').eq("A new dataset is available but IDQ rules are not generated yet")
    #ExpressionAttributeNames={
    #    '#k': 'name'
    #}
)
items = scan['Items']
print('length', str(len(items)))
print(items)

#response = table.table.delete_item(Key={resource_id:1})
with table.batch_writer() as batch:
    #for each in scan['Items']:
    #    batch.delete_item(Key=each)
    for each in scan['Items']:
        #batch.delete_item(Key={'version_id': each['version_id']})
        batch.delete_item(Key={'resource_id': each['resource_id'], 'version_id': each['version_id']})
Including the sort key in the scan projection expression, and including it in the key of the batch delete item, made it work.
I am running the following Cloud Function. It runs successfully and indicates data was loaded to the table, but when I query BigQuery no data has been added. I am getting no errors and no indication that it isn't working.
from google.cloud import bigquery
import pandas as pd

def download_data(event, context):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')

    # Create an empty list
    Row_list = []

    # Iterate over each row
    for index, rows in df.iterrows():
        # Create list for the current row
        my_list = [rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths]
        #print(my_list)

    # append the list to the final list
    Row_list.append(my_list)

    ## Get Biq Query Set up
    client = bigquery.Client()
    table_id = "<project_name>.raw.daily_load"
    table = client.get_table(table_id)

    print(client)
    print(table_id)
    print(table)

    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
Attempted so far:
Check data was being pulled -> PASSED, I printed out Row_list and the data is there.
Run locally from my machine -> PASSED, data appeared when I ran it from a Python terminal.
Print out the table details -> PASSED, see the attached screenshot; it all appears in the logs.
Confirm it is able to find the table -> PASSED, I changed the name of the table to one that didn't exist and it failed.
Not sure what is next; any advice would be greatly appreciated.
Maybe this post in the Google Cloud documentation could help:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv#loading_csv_data_into_a_table
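If you go that loading route, a minimal sketch along the lines of that page, assuming the CSV has first been copied to a Cloud Storage bucket (the project, dataset, table, and bucket names below are placeholders):

from google.cloud import bigquery

# Placeholders for illustration; adjust to your project, dataset, and bucket.
table_id = "my-project.raw.daily_load"
uri = "gs://my-bucket/full_data.csv"

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete

print("Loaded {} rows.".format(client.get_table(table_id).num_rows))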
You can directly stream the data from the website to BigQuery using Cloud Functions, but the data should be clean and conform to BigQuery standards, else the insertion will fail. One more point to note is that the DataFrame columns must match the table columns for the data to be successfully inserted. I tested this out and saw insertion errors returned by the client when the column names didn't match.
Writing the function
I have created a simple Cloud Function using the documentation and pandas example. The dependencies that need to be included are google-cloud-bigquery and pandas.
main.py
from google.cloud import bigquery
import pandas as pd

def hello_gcs(event, context):
    df = pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv')
    df = df.set_axis(["Month", "Year_1", "Year_2", "Year_3"], axis=1)  # => Rename the columns if necessary

    table_id = "project.dataset.airtravel"

    ## Get BigQuery set up
    client = bigquery.Client()
    table = client.get_table(table_id)

    errors = client.insert_rows_from_dataframe(table, df)  # Make an API request.
    # insert_rows_from_dataframe returns one error list per inserted chunk
    if not any(errors):
        print("Data Loaded")
        return "Success"
    else:
        print(errors)
        return "Failed"
requirements.txt
# Function dependencies, for example:
# package>=version
google-cloud-bigquery
pandas
Now you can directly deploy the function.
Output
(Screenshot of the resulting output table omitted.)
This assumes that the App Engine default service account has the default Editor role assigned and that you have a very simple schema for the BigQuery table, for example:
Field name     Type      Mode
date           STRING    NULLABLE
location       STRING    NULLABLE
new_cases      INTEGER   NULLABLE
new_deaths     INTEGER   NULLABLE
total_cases    INTEGER   NULLABLE
total_deaths   INTEGER   NULLABLE
The following modification of your code should work for an HTTP-triggered function. Notice that you were not including the Row_list.append(my_list) call inside the for loop to populate your list with the elements, and that, according to the samples in the documentation, you should be using a list of tuples:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "[PROJECT-ID].[DATASET].[TABLE]"

def download_data(request):
    df = pd.read_csv('https://covid.ourworldindata.org/data/ecdc/full_data.csv')

    # Create an empty list
    Row_list = []

    # Iterate over each row
    for index, rows in df.iterrows():
        # Create list for the current row
        my_list = (rows.date, rows.location, rows.new_cases, rows.new_deaths, rows.total_cases, rows.total_deaths)
        # append the list to the final list
        Row_list.append(my_list)

    ## Get BigQuery set up
    table = client.get_table(table_id)

    errors = client.insert_rows(table, Row_list)  # Make an API request.
    if errors == []:
        print("New rows have been added.")
With the very simple requirements.txt file:
# Function dependencies, for example:
# package>=version
pandas
google-cloud-bigquery
I am new to the AWS world and I need to find the data count from a DynamoDB table.
My table structure is like this:
It has 2 attributes (columns in MySQL terms), say A and B.
A - stores the user IDs (primary partition key).
B - stores the user profiles associated with a UserID.
Suppose A contains the user ID 3435 and it has 3 profiles ({"21btet3","3sd4","adf11"}).
My requirement is to get the count 3 in the output as JSON in the format:
How do I set the parameters to scan for this?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations in terms of querying the data. In your case you have to scan the entire table, like below:
def ScanDynamoData(lastEvaluatedKey):
    # Add your region and table name
    table = boto3.resource("dynamodb", "eu-west-1").Table('TableName')
    if lastEvaluatedKey:
        return table.scan(
            ExclusiveStartKey=lastEvaluatedKey
        )
    else:
        return table.scan()
And call this method in a loop until LastEvaluatedKey is no longer returned (to scan all the records), like:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# The response contains the scanned items; you can count user IDs and profiles here

while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add the counts here as well
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to the table's DynamoDB stream and update the count as and when new records are inserted into DynamoDB. This will make sure that
you are paying less, and
you will not have to do a table scan to calculate this number (see the sketch below).
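For illustration, a minimal sketch of such a stream handler, assuming the stream delivers INSERT events and the running total is kept in a separate counter item (the table and key names here are made up):

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table holding a single counter item; adjust names to your setup.
counter_table = dynamodb.Table("user_profile_counts")

def lambda_handler(event, context):
    # Count only newly inserted items from this DynamoDB stream batch.
    inserted = sum(1 for record in event["Records"] if record["eventName"] == "INSERT")
    if inserted:
        # Atomically increment the running total.
        counter_table.update_item(
            Key={"id": "total_user_ids"},
            UpdateExpression="ADD #c :n",
            ExpressionAttributeNames={"#c": "count"},
            ExpressionAttributeValues={":n": inserted},
        )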
I have an S3 bucket which is constantly being filled with new data. I am using Athena and Glue to query that data. The problem is that if Glue doesn't know a new partition has been created, it doesn't know it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in the Glue table's metadata. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following.
Step 1 - Fetch the table information and parse from it the information required to register the partitions.
# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)

# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All dictionaries have the same structure, but their partition-specific values (year, month, day, hour) change.
def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # Getting end date till which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))

    # date_range() is a helper (not shown here) that yields each date between start_date and end_date
    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for each hour of the day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure that hour is in two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over it (a sketch of a chunking helper follows the link below).
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
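break_list_into_chunks() is not shown above; a minimal sketch of such a helper (keeping the hypothetical name from the snippet) could be:

def break_list_into_chunks(full_list, chunk_size):
    # Yield successive slices of at most chunk_size elements.
    for start in range(0, len(full_list), chunk_size):
        yield full_list[start:start + chunk_size]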
You can configure your Glue crawler to be triggered every 5 minutes.
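For example, a sketch (boto3 shown here, though the same call exists in the Java SDK; the crawler name is a placeholder) that sets a 5-minute cron schedule on an existing crawler:

import boto3

glue = boto3.client("glue")

# Hypothetical crawler name; Glue schedules use cron expressions.
glue.update_crawler(
    Name="my-crawler",
    Schedule="cron(0/5 * * * ? *)",  # run every 5 minutes
)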
You can create a Lambda function that will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function could call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example above.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
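As a rough illustration of that idea, here is a sketch of such a Lambda function; the year/month/day/hour key layout, table name, and output location are assumptions for illustration:

import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Triggered by s3:ObjectCreated:Put; register the partition the new object lands in.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]          # e.g. "2016/05/14/09/file.json"
        year, month, day, hour = key.split("/")[:4]  # assumes this prefix layout

        athena.start_query_execution(
            QueryString=(
                "ALTER TABLE mytable ADD IF NOT EXISTS "
                "PARTITION (year='{}', month='{}', day='{}', hour='{}')"
                .format(year, month, day, hour)
            ),
            ResultConfiguration={"OutputLocation": "s3://some-bucket/_athena_results"},
        )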
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you minimize crawling all the data again and again. From what I read, you can define incremental crawls while setting up your crawler or by editing an existing one. One thing to note, however, is that incremental crawls require the schema of the new data to be more or less the same as the existing schema.
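For instance, a minimal sketch, with placeholder crawler, role, database, and path names, that creates a crawler which only crawls new folders:

import boto3

glue = boto3.client("glue")

# All names below are placeholders for illustration.
glue.create_crawler(
    Name="incremental-crawler",
    Role="MyGlueServiceRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/data/"}]},
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)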