I want to bulk load data to mulitple tables using a single mapreduce job.Since the data volumes is high ,It would be time consuming to iterate through dataset twice and load using multiple jobs.Is there any way to do this ? Thanks in advance.
I am using Hbase. But i didnt need bulk load yet. But I came across this article which might help you.
http://hbase.apache.org/book/arch.bulk.load.html
The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API.
Related
I have a trigger in Oracle. Can anyone please help me with how it can be replicated to Redshift? DynamoDB managed stream kind of functionality will also work.
Redshift does not support triggers because it's a data warehousing system which is designed to be able to import large amounts of data in a limited time. So, if every row insert would be able to fire a trigger the performance of batch inserts would suffer. This is probably why Redshift developers didn't bother to support this and I agree with them. The trigger type of behavior should be a part of business application logic that runs in OLTP environment and not the data warehousing logic. If you want to run some code in DW after inserting or updating data you have to do it as another step of your data pipeline.
Would like some suggestions on loading data to Redshift.
Currently we have an EMR cluster where RAW data is ingested regularly. We have a transformation job which runs daily and creates final modeled object. However, we are following truncate and load strategy in EMR . Due to business reasons there is no way to figure out which data has changed.
We are planning to store of this modeled object in Redshift.
Now my question is If we follow the same truncate and load strategy in
RedShift also, will that work?
I was able to find only articles which say use copy if you want to perform bulk copy, and then use insert command for small updates. But nothing on can and should we be using RedShift where the data is getting overwritten daily.
Under 3 nodes using redshift we plan on doing 50-100 inserts every 10 seconds. Within that 10 second window we also will try to do the equivalent of a redshift upsert as documented here https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html on about 50 to 100 rows as well.
I'm basically unaware if a 10 second window is realistic or a 10 minute window... etc is good for this kind of load. Should this be a daily batch? Should I try to re-architect to get rid of upserts?
My question is essentially can redshift handle this load? I feel the upsert is happening too many times. We are using structured streaming in spark to handle all of this. If yes what type of nodes should we be using? Has anyone who has done this have a ballpark estimate? If no, what are alternative architectures?
Essentially what we're trying to do is load entity data to be joined with the events in redshift. But we want the analytics to be as near real time as possible so we want load as fast as we can.
There's probably no exact answer for this, so any explanation that can get help me perform estimations on requirements based on load will be helpful.
I do not think you will achieve the performance you seek.
Running large numbers of INSERT statements is not an optimal way to load data into Amazon Redshift.
The best way is via running COPY from data stored in Amazon S3. This loads data in parallel across all nodes.
Unless you have a very real need to get data immediately into Redshift, it would be better to batch the data in S3 over a period of time (the larger the batch, the better), then load via COPY. This will also work well with the Staging Table approach to performing UPSERTS.
The best way to discover whether Redshift will handle a particular load is to try it! Spin up another cluster and try the various methods, measuring the performance each time.
I would recommend using Kinesis Firehose to insert data to Redshift. It will optimize for time / load and insert accordingly.
We tried inserting manually in batches, does not seems to be the cleaner way of handling it when an optimized cloud service exist for the same.
https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-redshift-firehose-stream.html
It does collect them in batches, compress and load them to Redshift.
Upsert Process:
If you want an upsert this is how I would have them done in a scalable way,
DynamoDB Table (Update) --> DynamoDB Streams --> Lambda --> Firehose --> Redshift
Have a scheduled job to cleanup any duplicate records based on created_timestamp.
Hope it helps.
I have been looking at options to load (basically empty and restore) Parquet file from S3 to DynamoDB. Parquet file itself is created via spark job that runs on EMR cluster. Here are few things to keep in mind,
I cannot use AWS Data pipeline
File is going to contain millions of rows (say 10 million), so would need an efficient solution. I believe boto API (even with batch write) might not be that efficient ?
Are there any other alternatives ?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries to dynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse in different rows, it some bit of scala to take a row, build an entry for dynamo and PUT that should be enough.
BTW: Use DynamoDB on demand here, as it handles peak loads well without you having to commit to some SLA.
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create desired dataframe
Use .withColumn to create new column and use psf.collect_list to convert to desired collection/json format, in the new column in the
same dataframe.
Drop all un-necessary (tabular) columns and keep only the JSON format Dataframe columns in Spark.
Load the JSON data into DynamoDB as explained in the answer.
My personal suggestion: whatever you do, do NOT use RDD. RDD interface even in Scala is 2-3 times slower than Dataframe API of any language.
Dataframe API's performance is programming language agnostic, as long as you dont use UDF.
I have a map reduce job which will do a bulk load in MapR table one at time. If i have to load another Mapr DB table then i will have to write another job for doing bulk load.Is there any way to do bulk load in single map reduce job?
Thanks in advance.