I am looking for a way to query AWS DynamoDB data with SQL syntax using Amazon EMR.
I have my DynamoDB table set up and ready. How can I import/query the data using Hue? The table in DynamoDB is around 8 GB in size.
Please follow the steps below:
Hive to query non-live DynamoDB data:
1) Export data from DynamoDB to Hive (see the export sketch after this list)
Refer to the section "Exporting Data from DynamoDB" in the EMR Hive Commands link below
2) Use Amazon EMR to query the data stored in DynamoDB
Refer to the section "Querying Data in DynamoDB" in the EMR Hive Commands link below
3) Use Hue to run the queries (i.e. run the Hive queries from the Hue workbench)
Links: EMR Hive Commands, Hue Supported
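A minimal sketch of the export step, assuming hypothetical names throughout (DynamoDB table dynamodbtable1, bucket s3://mybucket/): map the DynamoDB table in Hive, then snapshot it into an S3-backed table so later queries do not consume DynamoDB capacity.
-- Map the live DynamoDB table in Hive (table/column names are assumptions)
CREATE EXTERNAL TABLE ddb_source (col1 string, col2 bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
               "dynamodb.column.mapping" = "col1:name,col2:year");
-- Copy the data to S3 so Hue/Hive queries run against the snapshot
CREATE EXTERNAL TABLE s3_copy (col1 string, col2 bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/dynamodb-export/';
INSERT OVERWRITE TABLE s3_copy SELECT * FROM ddb_source;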
Hive to query live DynamoDB data:
1) Create a Hive table that maps to the DynamoDB table
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/EMR_Interactive_Hive.html
2) Once you create the Hive table, queries against it will read from the live DynamoDB table
Disadvantage: it consumes DynamoDB read or write capacity units on every execution. In other words, each query execution costs you.
Sample code:
CREATE EXTERNAL TABLE hivetable1 (col1 string, col2 bigint, col3 array<string>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1",
               "dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays");
Related
I created a table in a Glue database using a crawler job, and the table was created successfully.
However, when I try to select data from that table in the Athena query editor, it gives me the error below:
Query:
select * from DB1.data_tbl;
Output:
Hive File Not Found: Partition location does not exist
I haven't found where the partition location is defined.
Please assist.
Athena, by default, can only read data in S3. It will not read your PostgreSQL databases. To connect to anything other than S3, you have to set up and use Amazon Athena Federated Query.
Alternatively, set up a Glue Job to copy all the data from your PostgreSQL database into S3, and then use Athena to query the data from S3.
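For reference, a hedged sketch of what a federated query looks like once a data source connector is deployed; the catalog name postgres_catalog and the schema/table names are assumptions:
-- Query PostgreSQL through a deployed Athena data source connector
SELECT *
FROM "postgres_catalog"."public"."my_table"
LIMIT 10;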
Given an S3 bucket partitioned by date in this way:
year
|___month
    |___day
        |___file_*.parquet
I am trying to create a table in Amazon Redshift Spectrum with this command:
create external table spectrum.visits(
ip varchar(100),
user_agent varchar(2000),
url varchar(10000),
referer varchar(10000),
session_id char(32),
store_id int,
category_id int,
page_id int,
product_id int,
customer_id int,
hour int
)
partitioned by (year char(4), month varchar(2), day varchar(2))
stored as parquet
location 's3://visits/visits-parquet/';
Although no error message is thrown, the queries always return no results, i.e., they come back empty. The bucket is not empty. Does anyone know what I am doing wrong?
When an External Table is created in Amazon Redshift Spectrum, it does not scan for existing partitions. Therefore, Redshift is not aware that they exist.
You will need to execute an ALTER TABLE ... ADD PARTITION command for each existing partition.
(Amazon Athena has a MSCK REPAIR TABLE option, but Redshift Spectrum does not.)
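A minimal sketch of adding one partition, assuming the directory tree shown above (the date values and the exact S3 path are assumptions):
-- Register one existing partition with Redshift Spectrum
ALTER TABLE spectrum.visits
ADD PARTITION (year='2019', month='01', day='15')
LOCATION 's3://visits/visits-parquet/2019/01/15/';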
As I can't comment on other people's solutions, I need to add another one.
I would like to point out that if your Spectrum table comes from the AWS Glue Data Catalog, you don't need to add partitions to the tables manually; you can have a crawler update the partitions in the Data Catalog, and the changes will be reflected in Spectrum.
One can create an external table in Athena and run MSCK REPAIR TABLE on it. Make sure you add "/" at the end of the location. Then create an external schema in Redshift. This solved my problem of the results showing up blank. Alternatively, you can run a Glue crawler on the Athena database; that will generate the partitions automatically.
When I create a table in Athena with CTAS syntax (example below), the table is registered in Glue in such a way that when I read it on an EMR cluster with (py)spark, every partition is read twice, yet when I read it with Athena it is fine. When I create a table through Spark with the write.saveAsTable syntax, it is registered in Glue properly and reads correctly with both Spark and Athena.
I didn't find anything about this in the Spark/Athena/Glue documentation. After some trial and error I found that there is a Glue table property that is set by Spark but not by Athena: spark.sql.sources.provider='parquet'. When I set this manually on tables created via Athena, Spark reads them properly. But this feels like an ugly workaround, and I would like to understand what is happening in the background. I didn't find anything about this table property either.
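For what it's worth, a hedged sketch of the manual workaround described above, using Athena's ALTER TABLE SET TBLPROPERTIES DDL (the database and table names are placeholders):
-- Mark the table the way Spark's saveAsTable would
ALTER TABLE my_database.my_table
SET TBLPROPERTIES ('spark.sql.sources.provider' = 'parquet');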
Athena create table syntax:
CREATE TABLE {database}.{table}
WITH (format = 'Parquet',
parquet_compression = 'SNAPPY',
external_location = '{s3path}')
AS SELECT
I am new to AWS and trying to figure out how to populate a table within an external schema residing in Amazon Redshift. I used AWS Glue to create a table from a .csv file that sits in an S3 bucket. I can query the newly created table via Amazon Athena.
Here is where I am stuck: my task is to take the data and populate a table living in a Redshift external schema. I tried creating a Job within Glue, but had no luck.
Am I supposed to first create an empty destination table that mirrors the table I can query using Athena?
Thank you in advance to anyone who might be able to assist!
Redshift Spectrum and Athena both use the Glue data catalog for external tables. When you create a new Redshift external schema that points at your existing Glue catalog, the tables it contains will immediately exist in Redshift.
-- Create the Redshift Spectrum schema
CREATE EXTERNAL SCHEMA IF NOT EXISTS my_redshift_schema
FROM DATA CATALOG DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam:::role/MyIAMRole'
;
-- Review the schema info
SELECT *
FROM svv_external_schemas
WHERE schemaname = 'my_redshift_schema'
;
-- Review the tables in the schema
SELECT *
FROM svv_external_tables
WHERE schemaname = 'my_redshift_schema'
;
-- Confirm that the table returns data
SELECT *
FROM my_redshift_schema.my_external_table LIMIT 10
;
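If the goal is to populate a regular (local) Redshift table from the external one, a hedged sketch (the local table name is an assumption):
-- Materialize the external table's data as a local Redshift table
CREATE TABLE my_local_table AS
SELECT * FROM my_redshift_schema.my_external_table;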
SELECT * FROM "sampledb"."parquetcheck" limit 10;
I am trying to use a Parquet file in S3; I created a table for it in AWS Athena and the table was created perfectly.
However, when I run the select query, it says "Zero Records Returned."
My Parquet file in S3 does have data.
I have created the partitions too, and IAM has full access to Athena.
If your specified column names are correct, then you may need to load the partitions using: MSCK REPAIR TABLE EnterYourTableName;. This will add the new partitions to the Glue Catalog.
If any of the above fails, you can create a temporary Glue Crawler to crawl your table, then validate the metadata in Athena by clicking the three dots next to the table name and selecting Generate Create Table DDL. You can then compare any differences in the DDL.
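For reference, the same DDL can also be pulled with a query (a hedged sketch; the table name is a placeholder), which makes it easy to compare against a Spark- or Glue-generated table:
-- Equivalent to "Generate Create Table DDL" in the Athena console
SHOW CREATE TABLE my_database.my_table;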