I want to transfer data (21M rows) from a MySQL database to DynamoDB. I am using the boto Python API and Django 1.3.1 to export the data from MySQL and transfer it to DynamoDB. Below is the code:
conn = boto.connect_dynamodb()

start_date = datetime.date(2012, 3, 1)
end_date = datetime.date(2012, 3, 31)
episode_report = TableName.objects.filter(viewdt__range=(start_date, end_date))

# Paginate 21 million rows in chunks of 1000 each
p = Paginator(episode_report, 1000)

table = conn.get_table('ep_march')

for page in range(1, p.num_pages + 1):
    for items in p.page(page).object_list:
        item_data = {
            'id': int(items.id),
            'user_id': format_user(items.user),   # Foreign key to User table
            'episode_id': int(items.episode.id),  # Foreign key to Episode table
            'series_id': int(items.series.id),    # Foreign key to Series table
            'viewdt': str(items.viewdt),
        }
        item = table.new_item(
            # Our hash key is 'id'
            hash_key=int(items.id),
            # Our range key is 'viewdt'
            range_key=str(items.viewdt),
            # This has the rest of the attributes
            attrs=item_data
        )
        item.put()
The issue is that the process has been running for more than 12 hours and has only transferred 3M rows so far. Any ideas on how to speed up the process?
I could create more threads to parallelize the transfer and see if that helps.
Thanks.
First, what is the provisioned throughput of your DynamoDB table? That will ultimately control how many writes/second you can make. Adjust accordingly.
Second, get some sort of concurrency going. You could use threads (make sure each thread has its own connection object, because httplib.py is not thread-safe), or you could use gevent or multiprocessing or whatever mechanism you like, but concurrency is key.
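For illustration, here is a rough sketch of that idea using the current boto3 API and a thread pool (this is my assumption of how you might structure it, not the legacy boto Layer2 calls from the question; the 'ep_march' table, key names, and the format_user helper are carried over from the question's code):

from concurrent.futures import ThreadPoolExecutor
import boto3

def write_chunk(rows):
    # Each worker creates its own resource, since connections are not thread-safe.
    table = boto3.resource('dynamodb').Table('ep_march')
    # batch_writer() buffers items into BatchWriteItem calls and retries unprocessed items.
    with table.batch_writer() as batch:
        for r in rows:
            batch.put_item(Item={
                'id': int(r.id),
                'viewdt': str(r.viewdt),
                'user_id': format_user(r.user),   # same helper as in the question
                'episode_id': int(r.episode.id),
                'series_id': int(r.series.id),
            })

# Reuse the paginator from the question to feed 1000-row chunks to the pool.
chunks = (p.page(n).object_list for n in range(1, p.num_pages + 1))
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(write_chunk, chunks)

Batching and parallelism only help up to the table's provisioned write throughput, so raise that first and then tune the number of workers.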
Amazon's solution for bulk data transfers into and out of DynamoDB is to use Elastic MapReduce. Here are the docs: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
Related
I've got an AWS Glue job which reads data from 22 MySQL tables, transforms the data using SQL queries, and fills 8 MySQL tables in a different schema.
These are fairly simple queries - a few joins that execute in at most a few seconds.
All tables combined hold around 1.5 million records. I run this job incrementally every 4 hours - the number of records inserted each time is between 100 and 5000. Here is an example of one part of the script, containing the query:
# SOURCE TABLES
item_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="db-" + args['ENVIRONMENT'],
    table_name="item",
    transformation_ctx="chopdb2item_node1647201451763",
)
inventory_node1647206574114 = glueContext.create_dynamic_frame.from_catalog(
    database="bi-db-" + args['BIDB_ENVIRONMENT'],
    table_name="inventory"
)
# [...OTHER SOURCE TABLE DECLARATIONS HERE...]
# SQL
SqlQuery12 = """
select distinct r.id receipt_id,
                i.total value_total,
                i.qty quantity,
                d2.id receipt_upload_date_id,
                i.text item_name,
                i.brand item_brand,
                i.gtin item_gtin,
                i.ioi,
                cp.client_promotion_id,
                inv.inventory_id
from receipt r
join item i on i.receipt_id = r.id
join verdict v on r.verdict_id = v.id and v.awardable = 1
join account a on v.account_id = a.id
join offer o on v.offer_id = o.id
join date_dimension d2
    on d2.occurence_date = DATE(r.upload_time)
left join client_promotion cp
    on cp.client_key = a.client_key and
       cp.promotion_key = o.promotion_key
left join inventory inv on inv.inventory_gtin = i.gtin
"""
extractitemfacttabledata_node1647205714873 = sparkSqlQuery(
    glueContext,
    query=SqlQuery12,
    mapping={
        "item": item_node1647201451763,
        "receipt": receipt_node1647477767036,
        "verdict": verdict_without_bookmark,
        "account": account_without_bookmark,
        "offer": offer_without_bookmark,
        "date_dimension": date_dimension_node1649721691167,
        "client_promotion": client_promotion_node1647198052897,
        "inventory": inventory_node1647206574114,
    },
    transformation_ctx="extractitemfacttabledata_node1647205714873",
)
# WRITING BACK TO MYSQL DATABASE
itemfacttableinwarehouse_node1647210647655 = glueContext.write_from_options(
    frame_or_dfc=extractitemfacttabledata_node1647205714873,
    connection_type="mysql",
    connection_options={
        "url": f"{args['DB_HOST_CM']}:{args['DB_PORT_CM']}/chopbidb?serverTimezone=UTC&useSSL=false",
        "user": args['DB_USERNAME_CM'],
        "password": args['DB_PASSWORD_CM'],
        "dbtable": "bidb.item",  # THIS IS A DIFFERENT SCHEMA THAN THE SOURCE ITEM TABLE
        "bulksize": 1
    }
)
The structure of the file is:
ALL source table declarations
ALL transformations (SQL queries)
ALL writes (inserting back into the MySQL database)
My problem is that, from time to time, the job hangs and runs indefinitely. I set the timeout to 8 hours, so it stays in progress for 8 hours and is then cancelled.
I installed the Apache Spark History UI and tried to analyze the logs.
The results are as follows:
90% of the time it is the query quoted above that hangs, but sometimes it succeeds and the next one fails. It looks like it might fail when trying to insert the data.
Usually it happens when there is very little load on the system (in the middle of the night).
What I tried:
In the first version of the script I inserted the data using glueContext.write_dynamic_frame.from_catalog. I thought the issue might be with bulk inserts causing deadlocks in the database, but changing it to write_from_options with bulksize = 1 did not help.
Increasing the resources in RDS -> did not help.
Moving the insert into this particular table further down in the script usually also resulted in it failing.
I am faced with the following problem, and I am a newbie to cloud computing and databases. I want to set up a simple dashboard for an application. Basically, I want to replicate this site, which shows data about air pollution: https://airtube.info/
What I think I need to do is the following:
Download data from the API: https://github.com/opendata-stuttgart/meta/wiki/EN-APIs - in particular I have this endpoint in mind: "https://data.sensor.community/static/v2/data.1h.json - average of all measurements per sensor of the last hour." (Technology: Python bot)
Set up a bot to transform the data a little to tailor it to our needs. (Technology: Python)
Upload the data to a database. (Technology: Google BigQuery or AWS)
Connect the database to a visualization tool so everyone can see it on our webpage. (Technology: Probably Dash in Python)
My questions are the following:
1. Do you agree with my thought process, or would you change some elements to make it more efficient?
2. What do you think about running a Python script to transform the data? Is there a simpler approach?
3. Which technology would you suggest for setting up the database?
Thank you for the comments!
Best regards,
Bartek
If you want to do some analysis on your data, I recommend uploading it to BigQuery; once that is done, you can create new queries there and get the results you want to analyze. I was checking the dataset "data.1h.json", and I would create a table in BigQuery using a schema like this one:
CREATE TABLE dataset.pollution
(
  id NUMERIC,
  sampling_rate STRING,
  timestamp TIMESTAMP,
  location STRUCT<
    id NUMERIC,
    latitude FLOAT64,
    longitude FLOAT64,
    altitude FLOAT64,
    country STRING,
    exact_location INT64,
    indoor INT64
  >,
  sensor STRUCT<
    id NUMERIC,
    pin STRING,
    sensor_type STRUCT<
      id INT64,
      name STRING,
      manufacturer STRING
    >
  >,
  sensordatavalues ARRAY<STRUCT<
    id NUMERIC,
    value FLOAT64,
    value_type STRING
  >>
)
OK, we have now created our table, so we need to insert all the data from the JSON file into it. To do that, and since you want to use Python, I would use the BigQuery Python client library [1] to read the data from a bucket in Google Cloud Storage [2], where the file has to be stored, transform the data, and upload it to the BigQuery table.
The code would be something like this:
from google.cloud import storage
from google.cloud import bigquery
import json

client = bigquery.Client()
table_id = "project.dataset.pollution"

# Instantiate a Google Cloud Storage client and specify the required bucket and file
storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket')
blob = bucket.blob('folder/data.1h.json')

table = client.get_table(table_id)

# Download the contents of the blob as a string and then parse it using json.loads()
data = json.loads(blob.download_as_string(client=None))

# Partition the request in order to avoid reaching quotas
partition = len(data) / 4
cont = 0
data_aux = []
for part in data:
    if cont >= partition:
        errors = client.insert_rows(table, data_aux)  # Make an API request.
        if errors == []:
            print("New rows have been added.")
        else:
            print(errors)
        cont = 0
        data_aux = []
    # Avoid empty values (clean data)
    if part['location']['altitude'] == "":
        part['location']['altitude'] = 0
    if part['location']['latitude'] == "":
        part['location']['latitude'] = 0
    if part['location']['longitude'] == "":
        part['location']['longitude'] = 0
    data_aux.append(part)
    cont += 1
# Insert whatever is left over after the last full partition
if data_aux:
    errors = client.insert_rows(table, data_aux)
    if errors:
        print(errors)
As you can see above, I had to partition the requests in order to avoid reaching a quota on the request size. Here you can see the quotas to stay under [3].
Also, some data in the location field seems to have empty values, so it is necessary to handle them to avoid errors.
And once you have your data stored in BigQuery, in order to create a new dashboard I would use the Data Studio tool [4] to visualize your BigQuery data and create queries over the columns you want to display.
[1] https://cloud.google.com/bigquery/docs/reference/libraries#using_the_client_library
[2] https://cloud.google.com/storage
[3] https://cloud.google.com/bigquery/quotas
[4] https://cloud.google.com/bigquery/docs/visualize-data-studio
I am new to the AWS world and I need to find the data count from a DynamoDB table.
My table structure is like this:
It has 2 attributes (columns in MySQL), say A and B.
A - stores the user IDs (primary partition key).
B - stores the user profiles, i.e. the profiles associated with a user ID.
Suppose A contains a user ID 3435 and it has 3 profiles ({"21btet3", "3sd4", "adf11"}).
My requirement is to get the count 3 in the output as JSON in the format:
How do I set the parameters for this scan query?
Can anyone please help?
DynamoDB is NoSQL, so there are some limitations in terms of querying the data. In your case you have to scan the entire table, like below:
import boto3

def ScanDynamoData(lastEvalutedKey):
    table = boto3.resource("dynamodb", "eu-west-1").Table('TableName')  # Add your region and table name
    if lastEvalutedKey:
        return table.scan(
            ExclusiveStartKey=lastEvalutedKey
        )
    else:
        return table.scan()
Then call this method in a loop until LastEvaluatedKey is no longer returned (to scan all the records), like:
response = ScanDynamoData(None)
totalUserIds = response["Count"]
# The response also contains the scanned items, so you can count user IDs and profiles here
while "LastEvaluatedKey" in response:
    response = ScanDynamoData(response["LastEvaluatedKey"])
    totalUserIds += response["Count"]
    # Add those counts here as well
You should not do a full table scan on a regular basis.
If your requirement is to get this count frequently, you should subscribe a Lambda function to DynamoDB Streams and update the count as and when new records are inserted into DynamoDB (a rough sketch follows below). This will make sure that:
you are paying less
you will not have to do a table scan to calculate this number.
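As an illustration of that approach, here is a minimal sketch of a stream-triggered Lambda that maintains a counter item; the 'counters' table, the counter key, and the event filtering are my assumptions for this example, not details from the question:

import boto3

# Hypothetical table that holds a single counter item; adjust to your own layout.
counters = boto3.resource('dynamodb').Table('counters')

def lambda_handler(event, context):
    # DynamoDB Streams delivers records in batches; count inserts and deletes in this batch.
    inserted = sum(1 for r in event['Records'] if r['eventName'] == 'INSERT')
    removed = sum(1 for r in event['Records'] if r['eventName'] == 'REMOVE')
    if inserted or removed:
        # Atomically adjust the stored count.
        counters.update_item(
            Key={'name': 'user_profile_count'},
            UpdateExpression='ADD #c :delta',
            ExpressionAttributeNames={'#c': 'count'},
            ExpressionAttributeValues={':delta': inserted - removed}
        )

Reading the count is then a single GetItem instead of a full table scan.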
I have an inventory bucket, and inside the bucket I have 6 folders.
In Athena I have 6 tables, one for each of the 6 folders.
Now I have to update the partitions as and when a file is dropped into any one of the 6 folders.
How do I run multiple SQL statements (6 of them) in one Lambda triggered by an S3 event?
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'

    client = boto3.client('athena')
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query execution parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    query_context = {'Database': 'some_database'}  # renamed so it doesn't shadow the Lambda 'context' argument

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
The database is the same; however, I have 6 different tables, and I have to update all 6 of them.
First, I would check the key of the dropped file and only update the table that points to the prefix where the file was dropped. E.g. if your folders and tables are prefix0, prefix1, prefix2, etc., and the dropped file has the key prefix1/some-file, you update only the table with the location prefix1. There is no need to update the other tables; their data hasn't changed.
However, I would suggest not using MSCK REPAIR TABLE for this. That command is terrible in almost every possible way. It's wildly inefficient and its performance becomes worse and worse as you add more objects to your table's prefix. It doesn't look like you wait for it to complete in your Lambda, so at least you're not paying for its inefficiency, but there are much better ways to add partitions.
You can use the Glue APIs directly (under the hood, Athena tables are tables in the Glue catalog), but that is actually a bit complicated to show, since you need to specify a lot of metadata (a downside of the Glue APIs).
I would suggest that instead of the MSCK REPAIR TABLE … call you do ALTER TABLE ADD PARTITION …:
Change the line
sql = 'MSCK REPAIR TABLE some_database.some_table'
to
sql = 'ALTER TABLE some_database.some_table ADD IF NOT EXISTS PARTITION (…) LOCATION \'s3://…\''
The parts where it says … you will have to extract from the object's key. If your keys look like s3://some-bucket/pk0=foo/pk1=bar/object.gz and your table has the partition keys pk0 and pk1, the SQL would look like this:
ALTER TABLE some_database.some_table
ADD IF NOT EXISTS
PARTITION (pk0 = 'foo', pk1 = 'bar') LOCATION 's3://some-bucket/pk0=foo/pk1=bar/'
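A rough sketch of how the Lambda above could derive that statement from the S3 event follows; the pk0=/pk1= key layout is the assumption from the example, and the database, table, and result-bucket names are placeholders:

from urllib.parse import unquote_plus
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = unquote_plus(record['s3']['object']['key'])

    # Keep only the partition-style path segments, e.g. ['pk0=foo', 'pk1=bar']
    parts = [p for p in key.split('/')[:-1] if '=' in p]
    partition_spec = ', '.join(
        "{} = '{}'".format(*p.split('=', 1)) for p in parts
    )
    location = "s3://{}/{}/".format(bucket, '/'.join(parts))

    sql = (
        "ALTER TABLE some_database.some_table "
        "ADD IF NOT EXISTS PARTITION ({}) LOCATION '{}'".format(partition_spec, location)
    )
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': 'some_database'},
        ResultConfiguration={'OutputLocation': 's3://some-bucket/_athena_results/'}
    )

In your six-table setup you would first map the leading path segment of the key to the matching table name and only issue the statement for that table.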
I have an S3 bucket which is constantly being filled with new data, and I am using Athena and Glue to query that data. The thing is, if Glue doesn't know that a new partition has been created, it doesn't know that it needs to search there. Making an API call to run the Glue crawler each time I need a new partition is too expensive, so the best solution is to tell Glue that a new partition has been added, i.e. to create the new partition in its properties table. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?
You may want to use the batch_create_partition() Glue API to register new partitions. It doesn't require any expensive operation like MSCK REPAIR TABLE or re-crawling.
I had a similar use case, for which I wrote a Python script that does the following:
Step 1 - Fetch the table information and parse from it the details required to register the partitions.
# Fetching table information from glue catalog
logger.info("Fetching table info for {}.{}".format(l_database, l_table))
try:
    response = l_client.get_table(
        CatalogId=l_catalog_id,
        DatabaseName=l_database,
        Name=l_table
    )
except Exception as error:
    logger.error("Exception while fetching table info for {}.{} - {}"
                 .format(l_database, l_table, error))
    sys.exit(-1)
# Parsing table info required to create partitions from table
input_format = response['Table']['StorageDescriptor']['InputFormat']
output_format = response['Table']['StorageDescriptor']['OutputFormat']
table_location = response['Table']['StorageDescriptor']['Location']
serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
partition_keys = response['Table']['PartitionKeys']
Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. All dictionaries have the same structure, but their partition-specific values (year, month, day, hour) change.
def generate_partition_input_list(start_date, num_of_days, table_location,
                                  input_format, output_format, serde_info):
    input_list = []  # Initializing empty list
    today = datetime.utcnow().date()
    if start_date > today:  # To handle scenarios if any future partitions are created manually
        start_date = today
    end_date = today + timedelta(days=num_of_days)  # Getting end date till which partitions need to be created
    logger.info("Partitions to be created from {} to {}".format(start_date, end_date))

    for input_date in date_range(start_date, end_date):
        # Formatting partition values by padding required zeroes and converting into string
        year = str(input_date)[0:4].zfill(4)
        month = str(input_date)[5:7].zfill(2)
        day = str(input_date)[8:10].zfill(2)
        for hour in range(24):  # Looping over 24 hours to generate partition input for each hour of the day
            hour = str('{:02d}'.format(hour))  # Padding zero to make sure that hour is in two digits
            part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
            input_dict = {
                'Values': [
                    year, month, day, hour
                ],
                'StorageDescriptor': {
                    'Location': part_location,
                    'InputFormat': input_format,
                    'OutputFormat': output_format,
                    'SerdeInfo': serde_info
                }
            }
            input_list.append(input_dict.copy())
    return input_list
Step 3 - Call the batch_create_partition() API
for each_input in break_list_into_chunks(partition_input_list, 100):
    create_partition_response = client.batch_create_partition(
        CatalogId=catalog_id,
        DatabaseName=l_database,
        TableName=l_table,
        PartitionInputList=each_input
    )
There is a limit of 100 partitions in a single API call, so if you are creating more than 100 partitions you will need to break your list into chunks and iterate over them.
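The snippets above call date_range and break_list_into_chunks helpers that are not shown; a minimal sketch of what they might look like (my assumption, not the original script) is:

from datetime import timedelta

def date_range(start_date, end_date):
    # Yield every date from start_date up to, but not including, end_date.
    for offset in range((end_date - start_date).days):
        yield start_date + timedelta(days=offset)

def break_list_into_chunks(items, chunk_size):
    # Yield successive chunk_size-sized slices of items.
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]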
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition
You can configure your Glue crawler to be triggered every 5 minutes.
You can create a Lambda function which will either run on a schedule or be triggered by an event from your bucket (e.g. a putObject event), and that function can call Athena to discover partitions:
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
Alternatively, use Athena to add partitions manually. You can also run SQL queries via the API, as in my Lambda example.
Example from Athena manual:
ALTER TABLE orders ADD
PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
This question is old but I wanted to put it out there that someone could have s3:ObjectCreated:Put notifications trigger a Lambda function which registers new partitions when data arrives on S3. I would even expand this function to handle deprecations based on object deletes and so on. Here's a blog post by AWS which details S3 event notifications: https://aws.amazon.com/blogs/aws/s3-event-notification/
AWS Glue recently added a RecrawlPolicy that only crawls the new folders/partitions that you add to your S3 bucket.
https://docs.aws.amazon.com/glue/latest/dg/incremental-crawls.html
This should help you avoid crawling all the data again and again. From what I read, you can define incremental crawls while setting up your crawler or when editing an existing one (a rough sketch follows below). One thing to note, however, is that incremental crawls require the schema of the new data to be more or less the same as the existing schema.
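For reference, a minimal sketch of what setting this up with boto3 might look like; the crawler name, IAM role, database, and S3 path are placeholders, and the RecrawlPolicy value shown is the one that restricts crawling to new folders:

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='incremental-crawler',                              # placeholder name
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',   # placeholder IAM role
    DatabaseName='some_database',
    Targets={'S3Targets': [{'Path': 's3://some-bucket/some-prefix/'}]},
    # Only crawl folders that were added since the last crawl
    RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'}
)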