Does HBase need MapReduce/YARN, or does it only need HDFS?
For basic usage of HBase, like creating a table, inserting data, and scanning/getting data, I don't see any reason to use MapReduce/YARN.
Please help me out with this, thank you.
Yes, you are right. HBase needs only HDFS for its storage; it doesn't need MapReduce/YARN.
If you don't need to parallelize processing, you don't need YARN, only HDFS; but if you run bulk loads (e.g. a MapReduce-based bulk load), YARN is necessary.
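As an illustration of the question's point, here is a minimal sketch of basic table operations against HBase from Python using the happybase client; it assumes an HBase Thrift server is reachable, and the table and column-family names are made up. Nothing in it touches MapReduce or YARN: the requests go to the region servers, which persist to HDFS.

    # Minimal sketch: basic HBase operations (create/put/get/scan) that rely only
    # on HBase over HDFS, not on MapReduce/YARN. Assumes a Thrift server on
    # localhost:9090; table and column-family names are hypothetical.
    import happybase

    connection = happybase.Connection('localhost', port=9090)

    # Create a table with one column family (skip if it already exists).
    if b'users' not in connection.tables():
        connection.create_table('users', {'info': dict()})

    table = connection.table('users')

    # Insert (put) a row.
    table.put(b'row-1', {b'info:name': b'alice', b'info:age': b'30'})

    # Get a single row back.
    print(table.row(b'row-1'))

    # Scan a range of rows.
    for key, data in table.scan(row_prefix=b'row-'):
        print(key, data)

    connection.close()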
I was asked for a solution to migrate data from Redshift to an RDS instance.
The migration will consist of only a few tables, and I'm trying to migrate them to a PostgreSQL instance.
Any clue where could I begin?
You have multiple options. Which one suits you best depends on several factors: the size of the data, whether it is a one-time or a recurring process, and so on.
A typical process looks like this:
1. Select and export the data.
2. Stage the exported data somewhere the import can be run from (EC2 / any machine / S3).
3. Load the data.
Here are the options for each step (a rough sketch of the whole flow follows below):
For step 1, you have multiple options, but the best is probably to export the data straight to S3 with Redshift's UNLOAD command.
For step 2, you could stage the data locally, on EC2, on S3, etc. I think S3 is the best option, and it is already covered by the export in step 1.
For step 3, it depends on the RDS engine, but most databases support importing data from CSV, so use that (for PostgreSQL, COPY ... FROM, or \copy in psql).
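As a rough sketch of steps 1-3 (not a drop-in script): the cluster endpoint, table, bucket, and IAM role below are hypothetical, credentials and networking are omitted, and both database connections use psycopg2.

    # Rough sketch of the export/stage/load flow described above.
    # Step 1: UNLOAD from Redshift to S3. Step 2: pull the file down to the
    # machine doing the import. Step 3: COPY it into the PostgreSQL RDS instance.
    import psycopg2

    # Step 1: export the table from Redshift to S3 as CSV.
    redshift = psycopg2.connect(host='my-cluster.xxxx.redshift.amazonaws.com',
                                port=5439, dbname='dev', user='admin', password='...')
    with redshift, redshift.cursor() as cur:
        cur.execute("""
            UNLOAD ('SELECT * FROM my_schema.my_table')
            TO 's3://my-bucket/exports/my_table_'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
            CSV PARALLEL OFF;
        """)

    # Step 2: fetch the exported file(s), e.g.
    #   aws s3 cp s3://my-bucket/exports/my_table_000 ./my_table.csv

    # Step 3: load the CSV into PostgreSQL.
    postgres = psycopg2.connect(host='my-instance.xxxx.rds.amazonaws.com',
                                port=5432, dbname='target', user='admin', password='...')
    with postgres, postgres.cursor() as cur, open('my_table.csv') as f:
        cur.copy_expert("COPY my_schema.my_table FROM STDIN WITH (FORMAT csv)", f)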
If it's Redshift to MySQL, you may want to refer to the steps I suggested in one of my earlier answers.
I hope this helps. Feel free to ask if you have specific follow-up questions.
Just a brief question on early thoughts about the best methods for staging tables in Redshift to support a continual update/insert/delete (upsert) process.
Thanks
It's good practice to use COMPUPDATE OFF STATUPDATE OFF while loading data into a staging table in Redshift. This makes the load faster because Redshift skips automatic compression analysis and statistics updates for the table. If you want to know why, you can read more here.
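For illustration, a minimal sketch of such a staging load via psycopg2; the cluster endpoint, table, bucket, and IAM role names are hypothetical.

    # Sketch: load a Redshift staging table with compression analysis and
    # statistics updates switched off (hypothetical names throughout).
    import psycopg2

    conn = psycopg2.connect(host='my-cluster.xxxx.redshift.amazonaws.com',
                            port=5439, dbname='dev', user='admin', password='...')
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY staging.my_table
            FROM 's3://my-bucket/incoming/my_table_'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
            FORMAT AS CSV
            COMPUPDATE OFF
            STATUPDATE OFF;
        """)
        # A typical next step for the update/insert/delete pattern is to run the
        # merge (DELETE from the target using the staging table, then INSERT
        # from it) inside the same transaction.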
I have been looking at options to load (basically empty and restore) a Parquet file from S3 into DynamoDB. The Parquet file itself is created by a Spark job that runs on an EMR cluster. Here are a few things to keep in mind:
I cannot use AWS Data Pipeline.
The file is going to contain millions of rows (say 10 million), so I need an efficient solution. I believe the boto API (even with batch write) might not be efficient enough?
Are there any other alternatives ?
Can you just refer to the Parquet files in a Spark RDD and have the workers put the entries into DynamoDB? Ignoring the challenge of caching the DynamoDB client in each worker for reuse across rows, a small bit of Scala that takes a row, builds an entry for DynamoDB, and PUTs it should be enough.
BTW: use DynamoDB on-demand capacity here, as it handles peak loads well without you having to commit to provisioned throughput.
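A rough sketch of that idea in PySpark rather than Scala (the table name, region, and S3 path are assumptions): each partition creates one boto3 DynamoDB resource and writes its rows through a batch writer, and the on-demand table absorbs the write spikes.

    # Sketch: read the Parquet export from S3 and have each Spark partition
    # write its rows to DynamoDB through a single reused boto3 client.
    # Table name, region, and path are hypothetical; retries/error handling omitted.
    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-to-dynamodb").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/exports/my_table/")

    def write_partition(rows):
        # One client per partition, reused for every row in that partition.
        table = boto3.resource("dynamodb", region_name="us-east-1").Table("my_table")
        with table.batch_writer() as batch:
            for row in rows:
                # boto3 expects Decimal rather than float for numeric attributes,
                # so a conversion step may be needed depending on the schema.
                item = {k: v for k, v in row.asDict().items() if v is not None}
                batch.put_item(Item=item)

    df.foreachPartition(write_partition)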
Look at the answer below:
https://stackoverflow.com/a/59519234/4253760
To explain the process:
Create the desired dataframe.
Use .withColumn to create a new column, and use psf.collect_list to convert the data into the desired collection/JSON format in that new column of the same dataframe.
Drop all unnecessary (tabular) columns and keep only the JSON-formatted column in the Spark dataframe.
Load the JSON data into DynamoDB as explained in the linked answer (a sketch of these steps follows below).
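I have not copied the linked answer's code here; the following is a minimal sketch of the steps above under assumed column names (customer_id, order_id, amount), using psf.collect_list and to_json to collapse the tabular columns into a single JSON-formatted column.

    # Sketch of the dataframe-only approach: group the rows, collect them into a
    # list column with psf.collect_list, render it as JSON, and keep only that
    # column. Column names and the input path are hypothetical.
    import pyspark.sql.functions as psf
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("to-json-column").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/exports/my_table/")

    json_df = (
        df.groupBy("customer_id")
          .agg(psf.collect_list(psf.struct("order_id", "amount")).alias("orders"))
          .withColumn("payload", psf.to_json(psf.struct("customer_id", "orders")))
          .select("payload")   # drop the tabular columns, keep only the JSON column
    )
    json_df.show(truncate=False)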
My personal suggestion: whatever you do, do NOT use the RDD API. The RDD interface, even in Scala, is 2-3 times slower than the Dataframe API in any language.
The Dataframe API's performance is language-agnostic, as long as you don't use UDFs.
I have a CSV table in S3 with hundreds of attributes/features, and I don't want to create a table in Redshift with all of these attributes before importing the data. Is there any way to select only the columns I need while copying data from S3 into Redshift?
You cannot achieve this using just a COPY command, but it is doable with a Python script. Please go through this:
Read specific columns from a csv file with csv module?
There are a couple of options listed in the AWS forums for this problem; take a look at https://forums.aws.amazon.com/message.jspa?messageID=432590 to see if they work for you.
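For example, a small script along the lines of the linked csv-module question can project out just the columns you need into a trimmed file, which you then upload back to S3 and COPY into the narrower Redshift table. The file and column names below are made up.

    # Sketch: keep only the needed columns from a wide CSV before loading it
    # into Redshift (file names and column names are hypothetical).
    import csv

    wanted = ["id", "name", "price"]   # the columns the Redshift table defines

    with open("wide_input.csv", newline="") as src, \
         open("trimmed_output.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=wanted)
        writer.writeheader()
        for row in reader:
            writer.writerow({col: row[col] for col in wanted})

    # Upload trimmed_output.csv to S3 (e.g. aws s3 cp) and COPY it into the
    # Redshift table, using FORMAT AS CSV and IGNOREHEADER 1.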
Is there a way to store the results from Pig directly to a table on Redshift?
Yes, but you probably won't like it: it's not efficient.
Download the Redshift JDBC driver (http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html).
Use PiggyBank's DBStorage; there is a MySQL example here.
A better way is to prepare a CSV and then import it (a sketch follows below).
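A sketch of that "prepare a CSV, then import it" route; the Pig output location, delimiter, table, and IAM role are assumptions. The Pig script STOREs the relation as delimited text on S3, and a COPY then pulls that prefix into Redshift.

    # Sketch: assume the Pig script ended with something like
    #   STORE results INTO 's3://my-bucket/pig-output/results' USING PigStorage(',');
    # and then COPY that S3 prefix into Redshift (names and role are hypothetical).
    import psycopg2

    conn = psycopg2.connect(host='my-cluster.xxxx.redshift.amazonaws.com',
                            port=5439, dbname='dev', user='admin', password='...')
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY analytics.results
            FROM 's3://my-bucket/pig-output/results/part-'
            IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
            DELIMITER ',';
        """)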