How to define filegroup for SQL server in AWS? - amazon-web-services

I've been tasked with moving an on-prem SQL Server database to AWS RDS for SQL Server and, from there, migrating the data to PostgreSQL.
While preparing, I found the term "FILEGROUP" in the DB creation script; each filegroup is given a path on the local server, like below.
CREATE DATABASE [PROD_MIGTN]
CONTAINMENT = NONE
ON PRIMARY
( NAME = N'SDL_DEV', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\PROD_MIGTN.mdf' , SIZE = 503269504KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FG_EDU_GROUP]
( NAME = N'INV_EDU_GROUP', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\INV_EDU_GROUP.ndf' , SIZE = 13393920KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB ),
FILEGROUP [FG_PAYMNT_HISTORY]
( NAME = N'EXT_PAYMNT_HISTORY', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\INV_PAYMNT_HISTORY.ndf' , SIZE = 16516736KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
LOG ON
( NAME = N'PROD_MIGTN_DEV_log', FILENAME = N'D:\SQL_DB\MSSQL12.SDL_PRODDB\MSSQL\DATA\PROD_MIGTN_1.LDF' , SIZE = 133711872KB , MAXSIZE = 2048GB , FILEGROWTH = 51200KB )
I also noticed that the filegroups were tied to a partition scheme, like below:
CREATE PARTITION SCHEME [PS_SALE] AS PARTITION [PF_WHOLE_SALE] TO ([FG_EDU_GROUP])
GO
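For context, [PS_SALE] references a partition function, [PF_WHOLE_SALE], whose definition is not shown in the script. Since the scheme maps to the single filegroup [FG_EDU_GROUP], the function presumably has no boundary values; a hedged sketch of what it might look like (the parameter type matches [SALE_ID]; the RANGE direction is an assumption):
CREATE PARTITION FUNCTION [PF_WHOLE_SALE] (varchar(50))
AS RANGE RIGHT FOR VALUES ()  -- no boundaries, so all rows fall into a single partition
GO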
and the partition scheme, in turn, is tied to a table:
CREATE TABLE [dbo].[T_SALE](
[SALE_ID] [varchar](50) NOT NULL,
[REF_NO] [varchar](50) NOT NULL,
[COMPY_ID] [varchar](50) NOT NULL,
[STATUS_CODE] [nvarchar](128) NOT NULL,
CONSTRAINT [PK_T_SALE_SALE_ID] PRIMARY KEY CLUSTERED
(
[SALE_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PS_SALE]([SALE_ID])
) ON [PS_SALE](SALE_ID)
GO
I want to know:
What is the significance of the filegroups, given that I'm only going to migrate the data to PostgreSQL and then archive the database?
Will removing the filegroups and their associated partition scheme corrupt or otherwise affect the data?
If avoiding them both would create issues, how do I define a filegroup in the cloud database, i.e. how do I give a file path? Do I need to allocate an S3 bucket or some other kind of storage separately?
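For reference, if the partitioning turns out to be unnecessary for a one-way migration, the same table could in principle be created on the default filegroup instead of the partition scheme. A hedged sketch (not a recommendation, and untested against the original data):
CREATE TABLE [dbo].[T_SALE](
[SALE_ID] [varchar](50) NOT NULL,
[REF_NO] [varchar](50) NOT NULL,
[COMPY_ID] [varchar](50) NOT NULL,
[STATUS_CODE] [nvarchar](128) NOT NULL,
CONSTRAINT [PK_T_SALE_SALE_ID] PRIMARY KEY CLUSTERED ([SALE_ID] ASC)
) ON [PRIMARY]  -- default filegroup, no partition scheme
GO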
Thanks

Related

Django Postgres migration: Fastest way to backfill a column in a table with 100 Million rows

I have a table in Postgres Thing that has 100 Million rows.
I have a column that was populated over time that stores some keys. The keys were prefixed before storing. Let's call it prefixed_keys.
My task is to use the values of this column to populate another column with the same values but with the prefixes trimmed off. Let's call it simple_keys.
I tried the following migration:
from django.db import migrations
import time


def backfill_simple_keys(apps, schema_editor):
    Thing = apps.get_model('thing', 'Thing')
    batch_size = 100000
    number_of_batches_completed = 0
    while Thing.objects.filter(simple_key__isnull=True).exists():
        things = Thing.objects.filter(simple_key__isnull=True)[:batch_size]
        for tng in things:
            prefixed_key = tng.prefixed_key
            if prefixed_key.startswith("prefix_A"):
                simple_key = prefixed_key[len("prefix_A"):]
            elif prefixed_key.startswith("prefix_BBB"):
                simple_key = prefixed_key[len("prefix_BBB"):]
            tng.simple_key = simple_key
        Thing.objects.bulk_update(
            things,
            ['simple_key'],
            batch_size=batch_size
        )
        number_of_batches_completed += 1
        print("Number of batches updated: ", number_of_batches_completed)
        sleep_seconds = 3
        time.sleep(sleep_seconds)


class Migration(migrations.Migration):
    dependencies = [
        ('thing', '0030_add_index_to_simple_key'),
    ]
    operations = [
        migrations.RunPython(
            backfill_simple_keys,
        ),
    ]
Each batch took about 7 minutes to complete, which means the whole backfill would take days!
It also increased the latency of the DB, which is being used in production.
Since you're going to go through every record in that table anyway it makes sense to traverse it in one go using a server-side cursor.
Calling
Thing.objects.filter(simple_key__isnull=True)[:batch_size]
is going to be expensive, especially as the index starts to grow.
Also, the call above retrieves ALL fields from the table even though you are only going to use 2-3 of them.
import time

import psycopg2
from psycopg2.extras import RealDictCursor, execute_values

update_query = """UPDATE table SET simple_key = data.key
FROM (VALUES %s) AS data (id, key) WHERE table.id = data.id"""

conn = psycopg2.connect(DSN, cursor_factory=RealDictCursor)  # DSN defined elsewhere
cursor = conn.cursor(name="key_server_side_crs")  # having a name makes it a SSC
update_cursor = conn.cursor()  # regular cursor
cursor.itersize = 5000  # how many records to retrieve at a time
cursor.execute("SELECT id, prefixed_key, simple_key FROM table")
count = 0
batch = []
for row in cursor:
    if not row["simple_key"]:
        simple_key = calculate_simple_key(row["prefixed_key"])
        batch.append((row["id"], simple_key))
    if len(batch) >= 1000:  # how many records to update at once
        execute_values(update_cursor, update_query, batch, page_size=1000)
        batch = []
        time.sleep(0.1)  # allow the DB to "breathe"
    count += 1
    if count % 100000 == 0:  # print progress every 100K rows
        print("processed %d rows" % count)
if batch:  # flush the final partial batch
    execute_values(update_cursor, update_query, batch, page_size=1000)
conn.commit()  # single commit at the end keeps the named (server-side) cursor alive while iterating
The above is NOT tested, so it's advisable to create a copy of a few million rows of the table and test against that first.
You can also experiment with various batch sizes (both for retrieval and update).
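For example, one way to create such a test copy on the Postgres side (the table name and row count here are illustrative):
CREATE TABLE thing_test AS
SELECT * FROM thing LIMIT 5000000;  -- grab a few million rows to test against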

How to add a partition boundary only when not exists in SQL Data Warehouse?

I am using Azure SQL Data Warehouse Gen 1, and I created a partitioned table like this:
CREATE TABLE [dbo].[StatsPerBin1](
[Bin1] [varchar](100) NOT NULL,
[TimeWindow] [datetime] NOT NULL,
[Count] [int] NOT NULL,
[Timestamp] [datetime] NOT NULL)
WITH
(
DISTRIBUTION = HASH ( [Bin1] ),
CLUSTERED INDEX([Bin1]),
PARTITION
(
[TimeWindow] RANGE RIGHT FOR VALUES ()
)
)
How can I split a partition only when the boundary does not already exist?
My first thought was that if I could get the partition boundaries by table name, I could write an IF statement to decide whether to add a boundary.
But I cannot find a way to associate a table with its corresponding partition values. The partition values of all partitions can be retrieved with
SELECT * FROM sys.partition_range_values
but that view only contains function_id as an identifier, and I don't know how to join the other tables so that I can get partition boundaries by table name.
Have you tried joining sys.partition_range_values with the sys.partition_functions view?
Granted, we cannot create partition functions ourselves in SQL DW, but the view still seems to be supported.
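A minimal sketch of that join (hedged; the column names follow the standard SQL Server catalog views):
SELECT pf.[name] AS function_name,
rv.[boundary_id],
rv.[value] AS boundary_value
FROM sys.partition_functions pf
JOIN sys.partition_range_values rv
ON rv.[function_id] = pf.[function_id];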
I know this is an out-of-date question, but I was having the same problem. Here is a query I ended up with that can get you started. It is modified slightly from a query in the SQL Server documentation:
SELECT s.[name] AS [schema_name]
, t.[name] AS [table_name]
, p.[partition_number] AS [partition_number]
, rv.[value] AS [partition_boundary_value]
, p.[data_compression_desc] AS [partition_compression_desc]
FROM sys.schemas s
JOIN sys.tables t ON t.[schema_id] = s.[schema_id]
JOIN sys.partitions p ON p.[object_id] = t.[object_id]
JOIN sys.indexes i ON i.[object_id] = p.[object_id]
AND i.[index_id] = p.[index_id]
JOIN sys.data_spaces ds ON ds.[data_space_id] = i.[data_space_id]
LEFT JOIN sys.partition_schemes ps ON ps.[data_space_id] = ds.[data_space_id]
LEFT JOIN sys.partition_functions pf ON pf.[function_id] = ps.[function_id]
LEFT JOIN sys.partition_range_values rv ON rv.[function_id] = pf.[function_id]
AND rv.[boundary_id] = p.[partition_number]
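Combining that query with an existence check gives one way to add a boundary only when it is missing. A hedged, untested sketch (the boundary date is illustrative; SQL DW exposes splitting via ALTER TABLE ... SPLIT RANGE):
IF NOT EXISTS (
SELECT 1
FROM sys.schemas s
JOIN sys.tables t ON t.[schema_id] = s.[schema_id]
JOIN sys.partitions p ON p.[object_id] = t.[object_id]
JOIN sys.indexes i ON i.[object_id] = p.[object_id] AND i.[index_id] = p.[index_id]
JOIN sys.data_spaces ds ON ds.[data_space_id] = i.[data_space_id]
JOIN sys.partition_schemes ps ON ps.[data_space_id] = ds.[data_space_id]
JOIN sys.partition_functions pf ON pf.[function_id] = ps.[function_id]
JOIN sys.partition_range_values rv ON rv.[function_id] = pf.[function_id]
WHERE s.[name] = 'dbo'
AND t.[name] = 'StatsPerBin1'
AND rv.[value] = CONVERT(datetime, '2018-01-01')
)
ALTER TABLE [dbo].[StatsPerBin1] SPLIT RANGE ('2018-01-01');  -- add the missing boundary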

Azure Data Warehouse External Table

How can I read only a specific file from an external table that points to a folder in ADLS containing thousands of files?
You can't do that with external tables / PolyBase once the external table has been created, but what you can do is create your own external table, specifying the filename in the definition. E.g. if your table definition is like this (where a filename is not specified):
CREATE EXTERNAL TABLE ext.LINEITEM (
L_ORDERKEY BIGINT NOT NULL,
...
)
WITH (
LOCATION = 'input/lineitem/',
DATA_SOURCE = AzureDataLakeStore,
FILE_FORMAT = TextFileFormat
);
You could copy it and create your own table, e.g.:
CREATE EXTERNAL TABLE ext.LINEITEM_42 (
L_ORDERKEY BIGINT NOT NULL,
...
)
WITH (
LOCATION = 'input/lineitem/lineitem42.txt',
DATA_SOURCE = AzureDataLakeStore,
FILE_FORMAT = TextFileFormat
);
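Once created, the single-file table can be queried like any other external table; a quick hedged usage sketch:
SELECT TOP 10 * FROM ext.LINEITEM_42;  -- rows come only from lineitem42.txt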
The difference is in the LOCATION: it now points to a single file instead of a folder. Another alternative would be to use one of the languages/platforms that can easily access the data lake, e.g. U-SQL or Databricks, to write a query against the lake. E.g. a little U-SQL:
@input =
EXTRACT
l_orderkey int
...
FROM "/input/lineitem/lineitem42.txt"
USING Extractors.Csv(skipFirstNRows : 1 );
A little Scala:
val lineitem42 = "/mnt/lineitem/lineitem42.txt"
var df42 = spark.read
.option("sep", "|") // Use pipe separator
.csv(lineitem42)

CFQuery - Update a table by comparing it to another table [duplicate]

I have a database with account numbers and card numbers. I match these to a file to update any card numbers to the account number so that I am only working with account numbers.
I created a view linking the table to the account/card database to return the Table ID and the related account number, and now I need to update those records where the ID matches the Account Number.
This is the Sales_Import table, where the account number field needs to be updated:
LeadID | AccountNumber
147    | 5807811235
150    | 5807811326
185    | 7006100100007267039
And this is the RetrieveAccountNumber table, where I need to update from:
LeadID | AccountNumber
147    | 7006100100007266957
150    | 7006100100007267039
I tried the below, but no luck so far:
UPDATE [Sales_Lead].[dbo].[Sales_Import]
SET [AccountNumber] = (SELECT RetrieveAccountNumber.AccountNumber
FROM RetrieveAccountNumber
WHERE [Sales_Lead].[dbo].[Sales_Import].LeadID =
RetrieveAccountNumber.LeadID)
It updates the card numbers to account numbers, but where a lead has no match in RetrieveAccountNumber the correlated subquery returns nothing, so the existing account numbers get replaced by NULL.
I believe an UPDATE FROM with a JOIN will help:
MS SQL
UPDATE
SI
SET
SI.AccountNumber = RAN.AccountNumber
FROM
Sales_Import SI
INNER JOIN
RetrieveAccountNumber RAN
ON
SI.LeadID = RAN.LeadID;
MySQL and MariaDB
UPDATE
Sales_Import SI,
RetrieveAccountNumber RAN
SET
SI.AccountNumber = RAN.AccountNumber
WHERE
SI.LeadID = RAN.LeadID;
The simple way to copy content from one table to another is as follows:
UPDATE table2
SET table2.col1 = table1.col1,
table2.col2 = table1.col2,
...
FROM table1, table2
WHERE table1.memberid = table2.memberid
You can also add a condition to copy only particular rows, as sketched below.
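For instance, to copy only particular rows you might add an extra predicate (the status column here is hypothetical):
UPDATE table2
SET table2.col1 = table1.col1
FROM table1, table2
WHERE table1.memberid = table2.memberid
AND table1.status = 'active'  -- hypothetical filter condition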
For SQL Server 2008+, using MERGE rather than the proprietary UPDATE ... FROM syntax has some appeal.
As well as being standard SQL and thus more portable, it will also raise an error if there are multiple joined rows on the source side (and thus multiple possible different values to use in the update), rather than leaving the final result nondeterministic.
MERGE INTO Sales_Import
USING RetrieveAccountNumber
ON Sales_Import.LeadID = RetrieveAccountNumber.LeadID
WHEN MATCHED THEN
UPDATE
SET AccountNumber = RetrieveAccountNumber.AccountNumber;
Unfortunately, the choice of which to use may not come down purely to preferred style. The implementation of MERGE in SQL Server has been afflicted with various bugs; Aaron Bertrand has compiled a list of the reported ones here.
Generic answer for future developers.
SQL Server
UPDATE
t1
SET
t1.column = t2.column
FROM
Table1 t1
INNER JOIN Table2 t2
ON t1.id = t2.id;
SQL Server, older comma-join style (note: Oracle does not support UPDATE ... FROM; for Oracle, use MERGE or a correlated subquery as shown elsewhere in this thread)
UPDATE
t1
SET
t1.column = t2.column
FROM
Table1 t1,
Table2 t2
WHERE
t1.ID = t2.ID;
MySQL
UPDATE
Table1 t1,
Table2 t2
SET
t1.column = t2.column
WHERE
t1.ID = t2.ID;
For PostgreSQL:
UPDATE Sales_Import SI
SET AccountNumber = RAN.AccountNumber
FROM RetrieveAccountNumber RAN
WHERE RAN.LeadID = SI.LeadID;
It seems you are using MS SQL; if I remember correctly, it is done like this:
UPDATE [Sales_Lead].[dbo].[Sales_Import] SET [AccountNumber] =
RetrieveAccountNumber.AccountNumber
FROM RetrieveAccountNumber
WHERE [Sales_Lead].[dbo].[Sales_Import].LeadID = RetrieveAccountNumber.LeadID
I had the same problem with foo.new being set to null for rows of foo that had no matching key in bar. I did something like this in Oracle:
update foo
set foo.new = (select bar.new
from bar
where foo.key = bar.key)
where exists (select 1
from bar
where foo.key = bar.key)
Here's what worked for me in SQL Server:
UPDATE [AspNetUsers] SET
[AspNetUsers].[OrganizationId] = [UserProfile].[OrganizationId],
[AspNetUsers].[Name] = [UserProfile].[Name]
FROM [AspNetUsers], [UserProfile]
WHERE [AspNetUsers].[Id] = [UserProfile].[Id];
For MySQL this works fine:
UPDATE
Sales_Import SI,RetrieveAccountNumber RAN
SET
SI.AccountNumber = RAN.AccountNumber
WHERE
SI.LeadID = RAN.LeadID
Thanks for the responses. I found a solution, though:
UPDATE Sales_Import
SET AccountNumber = (SELECT RetrieveAccountNumber.AccountNumber
FROM RetrieveAccountNumber
WHERE Sales_Import.leadid = RetrieveAccountNumber.LeadID)
WHERE Sales_Import.leadid = (SELECT RetrieveAccountNumber.LeadID
FROM RetrieveAccountNumber
WHERE Sales_Import.leadid = RetrieveAccountNumber.LeadID)
In case the tables are in different databases (MS SQL):
update c1
set c1.CiudadDistrito=c2.CiudadDistrito
FROM database1..Ciudad c1
inner join
database2..Ciudad c2 on c2.CiudadID=c1.CiudadID
Use the following query (MySQL syntax) to update Table1 from Table2 based on matching IDs:
UPDATE Sales_Import, RetrieveAccountNumber
SET Sales_Import.AccountNumber = RetrieveAccountNumber.AccountNumber
where Sales_Import.LeadID = RetrieveAccountNumber.LeadID;
This is the easiest way to tackle this problem.
MS SQL
UPDATE c4 SET Price=cp.Price*p.FactorRate FROM TableNamea_A c4
inner join TableNamea_B p on c4.Calcid=p.calcid
inner join TableNamea_A cp on c4.Calcid=cp.calcid
WHERE c4.Name='MyName';
Oracle 11g
MERGE INTO TableNamea_A u
using
(
SELECT c4.TableNamea_A_ID,(cp.Price*p.FactorRate) as CalcTot
FROM TableNamea_A c4
inner join TableNamea_B p on c4.Calcid=p.calcid
inner join TableNamea_A cp on c4.Calcid=cp.calcid
WHERE p.Name='MyName'
) rt
on (u.TableNamea_A_ID=rt.TableNamea_A_ID)
WHEN MATCHED THEN
UPDATE SET Price=rt.CalcTot;
Update from one table to another where the IDs match (MySQL):
UPDATE
TABLE1 t1,
TABLE2 t2
SET
t1.column_name = t2.column_name
WHERE
t1.id = t2.id;
The SQL below, which someone suggested, does NOT work in SQL Server. This syntax reminds me of my old school class:
UPDATE table2
SET table2.col1 = table1.col1,
table2.col2 = table1.col2,
...
FROM table1, table2
WHERE table1.memberid = table2.memberid
All the other queries using NOT IN or NOT EXISTS are not recommended. NULLs show up because the OP compares the entire dataset with a smaller subset; of course there will be matching problems. This must be fixed by writing proper SQL with a correct JOIN instead of dodging the problem with NOT IN. You might run into other problems by using NOT IN or NOT EXISTS in this case.
My vote goes to the top answer, which is the conventional way of updating a table based on another table by joining in SQL Server. As I said, you cannot use two tables in the same UPDATE statement in SQL Server unless you join them first.
This is the easiest and best solution I have seen for MySQL and MariaDB:
UPDATE table2, table1 SET table2.by_department = table1.department WHERE table1.id = table2.by_id
Note: if you encounter the following error, depending on your MySQL/MariaDB version: "Error Code: 1175. You are using safe update mode and you tried to update a table without a WHERE that uses a KEY column. To disable safe mode, toggle the option in Preferences",
then run the code like this:
SET SQL_SAFE_UPDATES=0;
UPDATE table2, table1 SET table2.by_department = table1.department WHERE table1.id = table2.by_id
This works with PostgreSQL:
UPDATE application
SET omts_received_date = (
SELECT
date_created
FROM
application_history
WHERE
application.id = application_history.application_id
AND application_history.application_status_id = 8
);
Update within the same table:
DECLARE @TB1 TABLE
(
No Int
,Name NVarchar(50)
,linkNo int
)
DECLARE @TB2 TABLE
(
No Int
,Name NVarchar(50)
,linkNo int
)
INSERT INTO @TB1 VALUES(1,'changed person data', 0);
INSERT INTO @TB1 VALUES(2,'old linked data of person', 1);
INSERT INTO @TB2 SELECT * FROM @TB1 WHERE linkNo = 0
SELECT * FROM @TB1
SELECT * FROM @TB2
UPDATE T1
SET T1.Name = T2.Name
FROM @TB1 T1
INNER JOIN @TB2 T2 ON T2.No = T1.linkNo
SELECT * FROM @TB1
I thought this simple example might make it easier to grasp:
DECLARE @TB1 TABLE
(
No Int
,Name NVarchar(50)
)
DECLARE @TB2 TABLE
(
No Int
,Name NVarchar(50)
)
INSERT INTO @TB1 VALUES(1,'asdf');
INSERT INTO @TB1 VALUES(2,'awerq');
INSERT INTO @TB2 VALUES(1,';oiup');
INSERT INTO @TB2 VALUES(2,'lkjhj');
SELECT * FROM @TB1
UPDATE T SET T.Name = S.Name
FROM @TB1 T
INNER JOIN @TB2 S
ON S.No = T.No
SELECT * FROM @TB1
Try this:
UPDATE
Table_A
SET
Table_A.AccountNumber = Table_B.AccountNumber
FROM
dbo.Sales_Import AS Table_A
INNER JOIN dbo.RetrieveAccountNumber AS Table_B
ON Table_A.LeadID = Table_B.LeadID
MySQL (this is my preferred way to restore a specific column's reasonId values from a backup table, based on primary key id equivalence):
UPDATE `site` AS destination
INNER JOIN `site_copy` AS backupOnTuesday
ON backupOnTuesday.`id` = destination.`id`
SET destination.`reasonId` = backupOnTuesday.`reasonId`
This will allow you to update a table based on the column value not being found in another table.
UPDATE table1 SET table1.column = 'some_new_val' WHERE table1.id IN (
SELECT *
FROM (
SELECT table1.id
FROM table1
LEFT JOIN table2 ON ( table2.column = table1.column )
WHERE table1.column = 'some_expected_val'
AND table2.column IS NULL
) AS Xalias
)
This will update a table based on the column value being found in both tables.
UPDATE table1 SET table1.column = 'some_new_val' WHERE table1.id IN (
SELECT *
FROM (
SELECT table1.id
FROM table1
JOIN table2 ON ( table2.column = table1.column )
WHERE table1.column = 'some_expected_val'
) AS Xalias
)
Summarizing the other answers, there are four variants of how to update the target table using data from another table only when a match exists:
Query and sub-query:
update si
set si.AccountNumber = (
select ran.AccountNumber
from RetrieveAccountNumber ran
where si.LeadID = ran.LeadID
)
from Sales_Import si
where exists (select * from RetrieveAccountNumber ran where ran.LeadID = si.LeadID)
Inner join:
update si
set si.AccountNumber = ran.AccountNumber
from Sales_Import si inner join RetrieveAccountNumber ran on si.LeadID = ran.LeadID
Cross join:
update si
set si.AccountNumber = ran.AccountNumber
from Sales_Import si, RetrieveAccountNumber ran
where si.LeadID = ran.LeadID
Merge:
merge into Sales_Import si
using RetrieveAccountNumber ran on si.LeadID = ran.LeadID
when matched then update set si.accountnumber = ran.accountnumber;
All variants are more or less trivial and understandable; personally, I prefer the "inner join" option, but any of them can be used, and the developer has to select the best option for his or her needs. From a performance perspective, the variants with joins are preferable.
Oracle 11g
merge into Sales_Import
using RetrieveAccountNumber
on (Sales_Import.LeadId = RetrieveAccountNumber.LeadId)
when matched then update set Sales_Import.AccountNumber = RetrieveAccountNumber.AccountNumber;
For Oracle SQL, try using an alias:
UPDATE Sales_Lead.dbo.Sales_Import SI
SET SI.AccountNumber = (SELECT RAN.AccountNumber FROM RetrieveAccountNumber RAN WHERE RAN.LeadID = SI.LeadID);
I'd like to add one extra thing: don't update a value with the same value; it generates extra logging and unnecessary overhead.
See the example below: it will only perform the update on 2 records, despite linking on 3.
DROP TABLE #TMP1
DROP TABLE #TMP2
CREATE TABLE #TMP1(LeadID Int,AccountNumber NVarchar(50))
CREATE TABLE #TMP2(LeadID Int,AccountNumber NVarchar(50))
INSERT INTO #TMP1 VALUES
(147,'5807811235')
,(150,'5807811326')
,(185,'7006100100007267039');
INSERT INTO #TMP2 VALUES
(147,'7006100100007266957')
,(150,'7006100100007267039')
,(185,'7006100100007267039');
UPDATE A
SET A.AccountNumber = B.AccountNumber
FROM
#TMP1 A
INNER JOIN #TMP2 B
ON
A.LeadID = B.LeadID
WHERE
A.AccountNumber <> B.AccountNumber --DON'T OVERWRITE A VALUE WITH THE SAME VALUE
SELECT * FROM #TMP1
For Oracle, use:
UPDATE suppliers
SET supplier_name = (SELECT customers.customer_name
FROM customers
WHERE customers.customer_id = suppliers.supplier_id)
WHERE EXISTS (SELECT customers.customer_name
FROM customers
WHERE customers.customer_id = suppliers.supplier_id);
update table1 dpm set col1 = dpu.col1 from table2 dpu where dpm.parameter_master_id = dpu.parameter_master_id;
If the above answers don't work for you, try this (MySQL syntax):
Update Sales_Import A left join RetrieveAccountNumber B on A.LeadID = B.LeadID
Set A.AccountNumber = B.AccountNumber
where A.LeadID = B.LeadID

WSO2 - Table created using Analytic Script Invisible in Gadget Generation Tool

My use case: push data from a stream configured in the ESB to BAM and create a report using the "Gadget Generation Tool".
Publishing the stream from ESB to BAM after adding an agent to the proxy service worked fine.
From the stream, I created a table using the Analytics->Add screen, and the table seems to persist, as I am able to run a select and see results from the same screen.
Now I am trying to generate a dashboard using the Gadget Generation Tool, but the table is not available there; the JDBC connection works fine, yet the table is nowhere to be found:
Script for the analytic table, run from the Analytics->Add screen:
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(creditkey STRING, creditFlag STRING, version STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" ,
"cassandra.ks.username" = "admin" ,
"cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" ,
"cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
I tried looking for the table in the following databases:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Have not done any custom db configurations.
Did you try jdbc:h2:repository/database/samples/WSO2CARBON_DB;AUTO_SERVER=TRUE? Also, what you have pasted is the Cassandra storage definition, probably used for reading the input, not for persisting the output. If you post the full Hive query, that will help in figuring out the problem.
Why did I not see the table in the Gadget Generation tool?
The table I created using the Hive script is a Cassandra distributed database table, whereas the references I used in the Gadget Generation tool while looking for the table pointed to h2 RDBMS database tables.
Below are the references to the h2 RDBMS database that comes out of the box with WSO2:
jdbc:h2:repository/database/WSO2CARBON_DB;AUTO_SERVER=TRUE
jdbc:h2:repository/database/metastore_db;AUTO_SERVER=TRUE
jdbc:h2:repository/database/samples/BAM_STATS_DB;AUTO_SERVER=TRUE
Resolution: how do you get tables listed in the Gadget Generation tool?
To get the tables listed in the Gadget Generation tool, you have to use Hive scripts to complete the following 3 steps:
1. Create a Hive table reference for the Cassandra data stream to which data is pushed (from the ESB, in my case):
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLE(
payload_creditkey STRING, payload_creditFlag STRING, payload_version STRING) STORED BY
'org.apache.hadoop.hive.cassandra.CassandraStorageHandler' WITH SERDEPROPERTIES ( "cassandra.host" = "127.0.0.1" ,
"cassandra.port" = "9163" , "cassandra.ks.name" = "EVENT_KS" , "cassandra.ks.username" = "admin" , "cassandra.ks.password" = "admin" ,
"cassandra.cf.name" = "firstStream" , "cassandra.columns.mapping" = ":key,payload_k1-constant, Version" );
2. Using a Hive script, create an H2 RDBMS table reference into which the data from the Cassandra stream will be copied:
CREATE EXTERNAL TABLE IF NOT EXISTS CREDITTABLEh2summary(
creditFlg STRING,
verSion STRING
)
STORED BY
'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
TBLPROPERTIES (
'mapred.jdbc.driver.class' = 'org.h2.Driver' ,
'mapred.jdbc.url' = 'jdbc:h2:C:/wso2bam-2.2.0/repository/samples/database/BAM_STATS_DB' ,
'mapred.jdbc.username' = 'wso2carbon' ,
'mapred.jdbc.password' = 'wso2carbon' ,
'hive.jdbc.update.on.duplicate' = 'true' ,
'hive.jdbc.primary.key.fields' = 'creditFlg' ,
'hive.jdbc.table.create.query' = 'CREATE TABLE CREDITTABLE_newh2(creditFlg VARCHAR(100), version VARCHAR(100))' );
3. Write a Hive query that copies the data from Cassandra to H2 (RDBMS):
insert overwrite table CREDITTABLEh2summary select a.payload_creditFlag,a.payload_version from CREDITTABLE a;
After doing this, I was able to see the table in the Gadget Generation tool; however, I also had to change the reference to the H2 database to an absolute path in the JDBC URL value that I passed.
Observation:
I was wondering whether the Gadget Generation tool could point directly to the Cassandra stream, without having to copy the tables to an RDBMS database.