Sqoop MDSYS.SDO_GEOMETRY columns in GCP Dataproc

I am trying to Sqoop an MDSYS.SDO_GEOMETRY column into a GCS bucket using Dataproc, but Sqoop is ignoring the MDSYS.SDO_GEOMETRY column during selection. I am not sure what the issue is. Also, do I need to convert it using --map-column-java?
I tried converting it to String with --map-column-java, but it says the column is not found, which means Sqoop is dropping that column from the selection.
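For context, the kind of import described above would look roughly like this (connection details, table, and column names are placeholders, not from the original post):
sqoop import \
  --connect jdbc:oracle:thin:@//db-host:1521/ORCL \
  --username my_user \
  --password-file /user/me/.oracle_pw \
  --table GEO_TABLE \
  --map-column-java GEOM_COL=String \
  --target-dir gs://my-bucket/geo_table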

Related

Hbase Tables not created in EMR cluster using Hive-Hbase Integration

I am new to AWS EMR and have created a Hive-HBase table using the following code:
CREATE EXTERNAL TABLE IF NOT EXISTS airflow.card_transactions(card_id bigint,member_id bigint,amount float,postcode int,pos_id bigint,transaction_dt timestamp,status string) row format delimited fields terminated by ',' stored as textfile location '/user/hadoop/projectFD_pipeline/card_transactions'
CREATE TABLE IF NOT EXISTS airflow.card_transactions_bucketed(cardid_txnts string,card_id bigint,member_id bigint,amount float,postcode int,pos_id bigint,transaction_dt timestamp,status string) clustered by (card_id) into 8 buckets STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with SERDEPROPERTIES ('hbase.columns.mapping'=':key,trans_data:card_id,trans_data:member_id,trans_data:amount,trans_data:postcode,trans_data:pos_id,trans_data:transaction_dt,trans_data:status') TBLPROPERTIES('hbase.table.name'='card_transactions')
When I tried to insert values into this table:
INSERT OVERWRITE TABLE airflow.card_transactions_bucketed select concat_ws('~',cast(card_id as string),cast(transaction_dt as string)) as cardid_txnts,card_id,member_id,amount,postcode,pos_id,transaction_dt,status from airflow.card_transactions
it started failing with this error:
ERROR [25bd1caa-ccc6-4773-a13a-55082909aa47 main([])]: exec.Task (TezTask.java:execute(231)) - Failed to execute tez graph. org.apache.hadoop.hbase.TableNotFoundException: Can't write, table does not exist:card_transactions at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:185) ~[hbase-server-1.4.13.jar:1.4.13] at org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat.checkOutputSpecs(HiveHBaseTableOutputFormat.java:86) ~[hive-hbase-handler-2.3.9-amzn-2.jar:2.3.9-amzn-2] at org.apache.hadoop.hive.ql.io.HivePassThroughOutputFormat.checkOutputSpecs(HivePassThroughOutputFormat.java:46) ~[hive-exec-2.3.9-amzn-2.jar:2.3.9-amzn-2]
The table airflow.card_transactions_bucketed is created and available in Hive, but the HBase table card_transactions (referenced via 'hbase.table.name') is not. I don't see any errors in hive.log.
I am expecting the HBase table to be created as well.
So it looks like, unlike in Cloudera, on AWS EMR the HBase table needs to be created manually. The query above does not create the HBase table; it integrates the Hive table with an HBase table that already exists in the cluster.
After creating the HBase table manually, I was able to insert data through the integrated Hive table, and the data showed up when queried in HBase.
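For reference, pre-creating the target HBase table with the table name and column family used in the mapping above would look something like this from the HBase shell (an illustrative sketch; the shell command itself is not from the original post):
hbase shell
create 'card_transactions', 'trans_data'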

Spark SQL query to get the last updated timestamp of an Athena table stored as CSV in AWS S3

Is it possible to get the last-updated timestamp of an Athena table stored as CSV in an S3 location using a Spark SQL query?
If yes, can someone please provide more information on it?
There are multiple ways to do this.
Use the Athena JDBC driver and do a Spark read where the format is jdbc. In this read you provide your "select max(timestamp) from table" query. Then, as the next step, just save the resulting Spark DataFrame to S3.
You can skip the JDBC read altogether and just use boto3 to run the above query. It would be a combination of start_query_execution and get_query_results (see the sketch below). You can then save the result to S3 as well.
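A rough sketch of the boto3 route; the database name, timestamp column, and result bucket are placeholder assumptions, not from the original question:
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; Athena writes results to the given S3 output location.
resp = athena.start_query_execution(
    QueryString="SELECT max(last_updated) FROM my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows[1]["Data"][0]["VarCharValue"])  # rows[0] is the header row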

Spanner to CSV DataFlow

I am trying to copy a table from Spanner to BigQuery. I have created two Dataflow jobs: one copies from Spanner to a text file, and the other imports the text file into BigQuery.
The table has a column whose value is a JSON string. The issue appears when the Dataflow job that imports from the text file into BigQuery runs. The job throws the error below:
INVALD JSON: :1:38 Expected eof but found, "602...
Is there any way I can exclude this column while copying, or any way I can copy the JSON object as it is? I tried excluding this column in the schema file, but it did not help.
Thank you!
Looking at https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-spanner-to-cloud-storage-text, there are neither options on the BigQuery import job that would allow skipping columns, nor Cloud Spanner options that would skip a column when extracting.
I think your best shot is to write a custom processor that will drop the column, similar to Cleaning data in CSV files using Dataflow; a rough sketch follows.
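A minimal sketch of such a processor, assuming the Spanner export is CSV text in GCS and the JSON column sits at a known index (the bucket paths and the column index are placeholders, not from the original post):
import csv
import io
import apache_beam as beam

JSON_COL_INDEX = 3  # assumed position of the JSON column in the exported CSV

def drop_column(line):
    # Parse one CSV line (handles quoted fields), remove the JSON field, re-emit as CSV.
    fields = next(csv.reader(io.StringIO(line)))
    del fields[JSON_COL_INDEX]
    out = io.StringIO()
    csv.writer(out).writerow(fields)
    return out.getvalue().rstrip("\r\n")

with beam.Pipeline() as p:
    (p
     | beam.io.ReadFromText("gs://my-bucket/spanner-export/*.csv")  # placeholder path
     | beam.Map(drop_column)
     | beam.io.WriteToText("gs://my-bucket/bq-import/cleaned"))     # placeholder path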
It's more complicated, but you can also try Dataprep: https://cloud.google.com/dataprep/docs/html/Drop-Transform_57344635. It should be possible to run Dataprep jobs as a Dataflow template.

Problems while uploading quoted data to Redshift from S3 using AWS GLUE. How do I insert the data?

I am trying to insert a data set into Redshift with values such as:
"2015-04-12T00:00:00.000+05:30"
"2015-04-18T00:00:00.000+05:30"
"2015-05-09T00:00:00.000+05:30"
"2015-05-24T00:00:00.000+05:30"
"2015-07-19T00:00:00.000+05:30"
"2015-08-02T00:00:00.000+05:30"
"2015-09-05T00:00:00.000+05:30"
The crawler that I ran over the S3 data is unable to identify the columns or the data types of the values. I have been tweaking the table settings to get the job to push the data into Redshift, but to no avail. Here is what I have tried so far:
Manually added the column in the table definition in Glue Catalog. There is only 1 column which is mentioned above.
Changed the Serde serialization lib from LazySimpleSerde to org.apache.hadoop.hive.serde2.lazy.OpenCSVSerDe
Added the following Serde parameters - quoteChar ", line.delim \n, field.delim \n
I have already tried different combinations of the line.delim and field.delim properties: including one and omitting the other, as well as using both at the same time.
Changed the classification from UNKNOWN to text in the table properties.
Changed the recordCount property to 469 to match the raw data row counts.
The job runs are always successful. After a run, select * from table_name returns the correct number of rows in the Redshift table, matching the raw data, but all of the values are NULL. How do I populate the rows in Redshift?
The table properties have been uploaded in an image album here: Imgur Album
I was unable to push the data into Redshift using Glue, so I turned to Redshift's COPY command. Here is the command I executed, in case anyone else needs it or faces the same situation:
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
FIXEDWIDTH 'Column_Name:31'
region 'us-east-1';
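Since the source values are wrapped in double quotes, another option that may work (an untested sketch reusing the placeholders above, not something I ran) is COPY's CSV format, which strips the surrounding quote characters during the load:
copy schema_Name.Table_Name
from 's3://Path/To/S3/Data'
iam_role 'arn:aws:iam::Redshift_Role'
format as csv
quote as '"'
timeformat 'auto'
region 'us-east-1';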

AWS Athena - Rename column name

I am trying to change a column name in an AWS Athena table, from old_name to new_name.
Normal DDL commands do not affect the table (they cannot be executed).
Is it possible to change a column name without deleting and re-creating the table from scratch?
I was mistaken; Athena uses Hive DDL syntax, so the correct command is:
ALTER TABLE %%table-name%% CHANGE %%old-column-name%% %%new-column-name%% <string>;
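For example, with a hypothetical table name and the string column from the question:
ALTER TABLE my_table CHANGE old_name new_name string;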
I based my answer on a Hive-related question.
You can find more about supported and unsupported DDL statements here.