I am new to AWS EMR and have created a Hive-HBase table using the following code:
CREATE EXTERNAL TABLE IF NOT EXISTS airflow.card_transactions(
  card_id bigint, member_id bigint, amount float, postcode int,
  pos_id bigint, transaction_dt timestamp, status string)
row format delimited fields terminated by ','
stored as textfile
location '/user/hadoop/projectFD_pipeline/card_transactions';
CREATE TABLE IF NOT EXISTS airflow.card_transactions_bucketed(
  cardid_txnts string, card_id bigint, member_id bigint, amount float, postcode int,
  pos_id bigint, transaction_dt timestamp, status string)
clustered by (card_id) into 8 buckets
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with SERDEPROPERTIES ('hbase.columns.mapping'=':key,trans_data:card_id,trans_data:member_id,trans_data:amount,trans_data:postcode,trans_data:pos_id,trans_data:transaction_dt,trans_data:status')
TBLPROPERTIES('hbase.table.name'='card_transactions');
When I tried to insert values into this table:
INSERT OVERWRITE TABLE airflow.card_transactions_bucketed
select concat_ws('~', cast(card_id as string), cast(transaction_dt as string)) as cardid_txnts,
       card_id, member_id, amount, postcode, pos_id, transaction_dt, status
from airflow.card_transactions;
it started failing with this error:
ERROR [25bd1caa-ccc6-4773-a13a-55082909aa47 main([])]: exec.Task (TezTask.java:execute(231)) - Failed to execute tez graph.
org.apache.hadoop.hbase.TableNotFoundException: Can't write, table does not exist:card_transactions
    at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:185) ~[hbase-server-1.4.13.jar:1.4.13]
    at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat.checkOutputSpecs(HiveHBaseTableOutputFormat.java:86) ~[hive-hbase-handler-2.3.9-amzn-2.jar:2.3.9-amzn-2]
    at org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat.checkOutputSpecs(HivePassThroughOutputFormat.java:46) ~[hive-exec-2.3.9-amzn-2.jar:2.3.9-amzn-2]
The table 'airflow.card_transactions_bucketed' is created and available in Hive, but the HBase table referenced by 'hbase.table.name'='card_transactions' is not. I don't see any errors in hive.log.
I expected the HBase table to be created as well.
So it looks like, unlike in Cloudera, on AWS EMR the HBase table needs to be created manually. The query above does not create the HBase table; it only integrates the Hive table with an HBase table that already exists in the cluster. The HBase table ('card_transactions', with the 'trans_data' column family used in the mapping) has to be created first, e.g. from the HBase shell.
After creating it manually, I was able to insert data through the integrated Hive table, and the data showed up when queried in HBase.
Is it possible to have a Glue job re-classify a JSON table as Parquet instead of needing another crawler to crawl the Parquet files?
Current setup:
JSON files in partitioned S3 bucket are crawled once a day
Glue Job creates Parquet files in specified folder
Run ANOTHER crawler to RECREATE the same table that was made in step 1
I have to believe that there is a way to convert the table classification without another crawler (but I've been burned by AWS before). Any help is much appreciated!
For convenience considerations - 2 crawlers is the way to go.
For cost considerations - a hacky solution would be:
Get the JSON table's CREATE TABLE DDL from Athena using the SHOW CREATE TABLE <json_table>; command.
In the CREATE TABLE DDL, replace the table name and switch the SerDe (and input/output formats) from JSON to Parquet. You don't need the other table properties from the original CREATE TABLE DDL except LOCATION, which should point at the folder where the Glue job writes the Parquet files.
Execute the new CREATE TABLE DDL in Athena.
For example:
SHOW CREATE TABLE json_table;
Original DDL:
CREATE EXTERNAL TABLE `json_table`(
`id` int COMMENT,
`name` string COMMENT)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
...
LOCATION
's3://bucket_name/table_data'
...
New DDL:
CREATE EXTERNAL TABLE `parquet_table`(
  `id` int COMMENT,
  `name` string COMMENT)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/table_data'
You can also do the same thing with the Glue API methods: get_table() > replace > create_table().
Note - if you want to run this periodically, you would need to wrap it in a script and schedule it with another scheduler (crontab, etc.) after the first crawler runs.
I am trying to insert a data set into Redshift with values such as:
"2015-04-12T00:00:00.000+05:30"
"2015-04-18T00:00:00.000+05:30"
"2015-05-09T00:00:00.000+05:30"
"2015-05-24T00:00:00.000+05:30"
"2015-07-19T00:00:00.000+05:30"
"2015-08-02T00:00:00.000+05:30"
"2015-09-05T00:00:00.000+05:30"
The crawler I ran over the S3 data is unable to identify the columns or the datatype of the values. I have been tweaking the table settings to get the job to push the data into Redshift, but to no avail. Here is what I have tried so far:
Manually added the column to the table definition in the Glue Catalog. There is only one column, the one mentioned above.
Changed the Serde serialization lib from LazySimpleSerde to org.apache.hadoop.hive.serde2.lazy.OpenCSVSerDe
Added the following Serde parameters - quoteChar ", line.delim \n, field.delim \n
I have already tried different combinations of the line.delim and field.delim properties: including one, omitting the other, and using both at the same time.
Changed the classification from UNKNOWN to text in the table properties.
Changed the recordCount property to 469 to match the raw data row counts.
The job runs are always successful. After a job run, when I do a select * from table_name, I always get the correct count of rows in the Redshift table as per the raw data, but all the rows are NULL. How do I populate the rows in Redshift?
The table properties have been uploaded to an image album here: Imgur Album
I was unable to push the data into Redshift using Glue, so I turned to Redshift's COPY command. Here is the command that I executed, in case anyone else needs it or faces the same situation:
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
FIXEDWIDTH 'Column_Name:31'
region 'us-east-1';
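As an aside, if the goal is to load these values as actual timestamps rather than 31-character strings, something like the following might work instead. This is only a sketch: it assumes the target column is defined as TIMESTAMPTZ, and it reuses the same placeholder table name, S3 path, and IAM role as above.
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
csv                -- strips the surrounding double quotes
timeformat 'auto'  -- lets COPY parse the ISO 8601 value, including the +05:30 offset
region 'us-east-1';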
I'm just starting to experiment with AWS Glue and I've successfully been able to pull data from my Aurora MySQL environment into my PostgreSQL DB. When the crawler creates the data catalog entry for the table I'm experimenting with, all the columns are out of order, and when the job then creates the destination table, the columns are again out of order, I'm assuming because it's created based on what the crawler generated. How can I make the table structure in the catalog match what's in the source DB?
You can simply open the table created by the crawler, click on "edit schema", and then click on the number at the start of each row to change it; those numbers are the column order.
We created the schema as follows:
create external schema spectrum
from data catalog
database 'test'
iam_role 'arn:aws:iam::20XXXXXXXXXXX:role/athenaaccess'
create external database if not exists;
and table as follows:
create external table spectrum.Customer(
Subr_Id integer,
SUB_CURRENTSTATUS varchar(100),
AIN integer,
ACCOUNT_CREATED timestamp,
Subr_Name varchar(100),
LAST_DEACTIVATED timestamp)
partitioned by (LAST_ACTIVATION timestamp)
row format delimited
fields terminated by ','
stored as textfile
location 's3://cequity-redshiftspectrum-test/'
table properties ('numRows'='1000');
The access rights are as follows:
Roles for athenaQuickSight access, full Athena access, and S3 full access are attached to the Redshift cluster.
However, when we query as below, we get 0 records. Please help.
select count(*) from spectrum.Customer;
If your query returns zero rows from a partitioned external table, check whether a partition has been added to this external table. Redshift Spectrum only scans files in an Amazon S3 location that has been explicitly added using ALTER TABLE … ADD PARTITION. Query the SVV_EXTERNAL_PARTITIONS view to find existing partitions, and run ALTER TABLE … ADD PARTITION for each missing partition, as sketched below.
Reference
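For example, against the Customer table above (a sketch only: the partition value and the S3 prefix are assumptions and need to match how the data is actually laid out in the bucket):
-- list the partitions Redshift Spectrum already knows about
select schemaname, tablename, values, location
from svv_external_partitions
where tablename = 'customer';

-- register a missing partition explicitly (example value and path)
alter table spectrum.Customer
add if not exists partition (LAST_ACTIVATION='2015-01-01 00:00:00')
location 's3://cequity-redshiftspectrum-test/last_activation=2015-01-01/';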
I had the same issue, and doing the above resolved it.
P.S. The explicit ALTER TABLE … ADD PARTITION step can also be automated.
I want to run SQL queries on S3 files/buckets through Hive. I have no idea how to do the setup. I'd appreciate your help.
You first create an EXTERNAL TABLE that defines the data format and points to a location in Amazon S3:
CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';
You can then read from the table using normal SELECT commands, for example:
SELECT b_col FROM s3_export
Alternatively, you can use Amazon Athena to run Hive-like queries against data in Amazon S3 without even requiring a Hadoop cluster. (It is actually based on Presto syntax, which is very similar to Hive.)
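For example, the same Hive-style DDL from above can be pasted into the Athena query editor and the data then queried with plain SQL (the bucket path is the same placeholder as before):
-- Athena accepts Hive-style external table definitions
CREATE EXTERNAL TABLE s3_export(a_col string, b_col bigint, c_col array<string>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://bucketname/path/subpath/';

-- and standard SQL queries against the S3 data
SELECT a_col, b_col FROM s3_export LIMIT 10;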