VORA Tables in Zeppelin and Spark shell - vora

We created a test table from the Spark shell and another from Zeppelin. But when we run show tables, only a single table is visible in each environment: the table created via the Spark shell is not displayed by the show tables command in Zeppelin.
What is the difference between these two tables? Can anybody please explain?

The show tables command only shows the tables defined in the current session.
A table is created in the current session and also in a (persistent) catalog in Zookeeper. You can show all tables that Vora saved in Zookeeper via this command:
SHOW DATASOURCETABLES
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
You can also register all or single tables in the current session via this command:
REGISTER ALL TABLES
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
REGISTER TABLE <tablename>
USING com.sap.spark.vora
OPTIONS(zkurls "<zookeeper_server>:2181")
So if you want to access the table that you created in the Spark shell from Zeppelin, and vice versa, you need to register it first.
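If each environment should register the catalog tables on startup, the statement text can be generated rather than retyped. The following is a minimal sketch, not part of the Vora API: register_statement and register_in_session are hypothetical helper names, and sql_context is assumed to be whatever Vora-enabled SQL context your shell or notebook provides, with a sql() method that accepts the statements shown above.

```python
from typing import Optional

# Data source name taken from the statements above.
VORA_SOURCE = "com.sap.spark.vora"


def register_statement(zk_urls: str, table: Optional[str] = None) -> str:
    """Build a REGISTER statement for one table, or for all tables if table is None."""
    target = f"TABLE {table}" if table else "ALL TABLES"
    return (
        f"REGISTER {target}\n"
        f"USING {VORA_SOURCE}\n"
        f'OPTIONS(zkurls "{zk_urls}")'
    )


def register_in_session(sql_context, zk_urls: str, table: Optional[str] = None) -> str:
    """Run the REGISTER statement in the given session and return its text.

    Assumption: sql_context exposes a sql(statement) method.
    """
    statement = register_statement(zk_urls, table)
    sql_context.sql(statement)
    return statement
```

For example, print(register_statement("<zookeeper_server>:2181")) prints the REGISTER ALL TABLES statement shown above, ready to paste into either environment.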
You can use the following commands if you need to clear the Zookeeper catalog. Be aware that the tables then need to be recreated:
import com.sap.spark.vora.client._
ClusterUtils.clearZooKeeperCatalog("<zookeeper_server>:2181")
This (and more) information can be found in the Vora Installation and Developer Guide.

Related

How can I reuse spark SQL view/table across multiple AWS EMR steps?

I am submitting multiple steps (concurrency 1) to an AWS EMR cluster, one after another, with the command 'spark-submit --deploy-mode client --master yarn <>'.
In the first step I read from S3 and create a dataframe from the data. I register this dataframe as a Spark SQL table/view using createGlobalTempView.
In the second step I try to access the table/view in my Spark SQL query (I tried the global_temp... prefix as well), but I get a table/view not found exception.
What am I missing? Shouldn't a view created with createGlobalTempView be accessible across multiple sessions? Or are sessions and steps different things? How can I achieve this?
One step in EMR is like a single Spark application, and the lifetime of a view created using createGlobalTempView is tied to that Spark application.
The same can be seen in the pyspark documentation.
And yes, sessions and steps are different things.
A step in EMR submits a job that creates a Spark application, and once the step finishes executing, all the temp views created within it are gone.
You can verify this by adding this line of code to the script for each step:
print(spark.sparkContext.getApplicationId())
Here spark is the SparkSession.
The two steps that you executed will have different application IDs.
There can be multiple SparkSessions associated with a Spark application, but only one SparkContext per application.
To achieve this, you can persist the dataframe as a table in Hive, for example:
df.write.mode('overwrite').saveAsTable("table_name")
and then use this table as input data in the next step:
df = spark.sql("select * from table_name")
Alternatively, if you don't want to create an intermediate table, include all of your transformation SQL in the same script so that it runs in a single step.
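The hand-off between two steps can be sketched as two small functions, one per step script. This is an illustration under assumptions rather than a tested EMR job: write_step, read_step and build_session are hypothetical names, the input is assumed to be Parquet, and the cluster is assumed to have a metastore (Hive or a compatible catalog) that outlives each Spark application, which is what makes the table visible to the next step.

```python
def build_session(app_name: str):
    """Create a SparkSession with Hive support so saveAsTable uses the shared metastore."""
    # Imported lazily so this module can be loaded on a machine without pyspark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).enableHiveSupport().getOrCreate()


def write_step(spark, source_path: str, table_name: str) -> None:
    """Step 1: read the input and persist it as a table in the metastore."""
    df = spark.read.parquet(source_path)  # assumption: the S3 input is Parquet
    df.write.mode("overwrite").saveAsTable(table_name)


def read_step(spark, table_name: str):
    """Step 2: load the table that step 1 persisted and return it as a dataframe."""
    return spark.table(table_name)
```

The first step's script would call write_step(build_session("step-1"), "<s3_input_path>", "staging_table"), and the second step's script would call read_step(build_session("step-2"), "staging_table"). The path and table name here are placeholders. Unlike a global temp view, the table remains after the first application exits.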

BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point error from Multiple database table plugin

I'm trying to ingest data from different tables within the same database into BigQuery tables, using the Data Fusion Multiple Database Tables plugin as the source and the BigQuery Multi Table sink. I wrote 3 different custom SQL statements and added them in the plugin under "Data Section Mode" > "Custom SQL Statements".
The problem is that when I preview, or deploy and run, the pipeline I get the error "BigQuery Multi Table has no outputs. Please check that the sink calls addOutput at some point."
What I tried in order to figure out this problem:
Ran the custom SQL statements directly on the database; they worked properly.
Created separate pipelines for each custom SQL statement, each ingesting one table from SQL Server into a BigQuery table sink; they worked properly.
Tried a different Data Section Mode in the Multiple Database Tables plugin, Table Allow List; it works, but it just inserts all data with no option to transform or filter any column. I did this to check that the plugin can reach the database and read data, and it can.
Data Pipeline - Multiple Database Tables Plugin Config - 1
Data Pipeline - Multiple Database Tables Plugin Config - 2
In conclusion, I would like to ingest data from multiple tables in one database within a single data pipeline. If possible, I would like to do it by writing a custom SQL statement for each table.
Open to any advice and things to try.
Thank you.

How can I access the metadata DB of a GCP Composer Airflow server?

I have created a Composer environment in a GCP project. I want to access the Airflow metadata DB, which runs in the background on Cloud SQL.
How can I access it?
I also want to create a table inside that metadata DB to store some data queried by one of my Airflow DAGs. Is it OK to create a table inside the metadata DB, or is it only for the Airflow server's use?
You can access the Airflow internal DB via the UI using Data Profiling -> Ad Hoc Query.
There you can see all the tables with a SQL query like:
SHOW tables;
I wouldn't recommend creating a new table or manually inserting rows into existing tables, though.
You should also be able to access this DB from your DAGs' operators and sensors by using the airflow_db connection.

AWS DMS - Migrate - only schema

We have noticed that if a table is empty in SQL Server, it is not migrated by DMS. Only after a record is inserted does it start to show up.
Is there a way to get only the schema from DMS?
Thanks
You can use the AWS Schema Conversion Tool to move the schema and other DB objects. It's a free tool from AWS and can be installed on an on-premises server or on EC2. It produces a useful report before you actually migrate the DB schema and other DB objects. The report shows how many tables, stored procedures, functions, etc. can be migrated directly, and also shows possible solutions.

WSO2IS 5.10.0 - What SQL file(s) to create USERSTORE_DB

I'm installing WSO2IS 5.10.0 and I am creating five PostgreSQL databases per the column titled Recommended Database Structure in this document:
https://is.docs.wso2.com/en/next/setup/setting-up-separate-databases-for-clustering/
Actually it's six databases if you count the CARBON_DB. The five PostgreSQL databases are named as follows: SHARED_DB, USERSTORE_DB, IDENTITY_DB, CONSENT_MGT_DB and BPS_DB. I already have them configured in the deployment.toml file. I've created the databases in PostgreSQL and I have to manually execute the SQL files against each database in order to create the schema for each database. Based on the document in the link, I have figured out which SQL files to execute for four of the databases. However, I have no idea what SQL files I need to execute to create the USERSTORE_DB schema. It's got to be one of the files under the dbscripts directory but I just don't know which one(s). Can anybody help me on this one?
The CARBON_DB contains product-specific data, and by default it is stored in the embedded H2 database. There is no requirement to point that DB to PostgreSQL. Hence you only need to worry about these databases: SHARED_DB, USERSTORE_DB, IDENTITY_DB, CONSENT_MGT_DB and BPS_DB.
As for your next question, you can find the DB scripts related to USER_DB (USERSTORE_DB) in the /dbscripts/postgresql.sql file. This file has tables whose names start with UM_. These are the user management tables. You can use the SQL for those tables to create the tables in USERSTORE_DB.
Refer to the following doc for more information:
[1] https://is.docs.wso2.com/en/5.10.0/administer/user-management-related-tables/
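If you want only the UM_ statements out of the script, rather than copying them by hand, a small filter can pull them out. The sketch below is a rough illustration, not a WSO2 tool: um_statements is a hypothetical helper, and it naively splits on semicolons, so it assumes that no statement in the file contains a semicolon inside its body and that "--" appears only as a comment marker. Review the output before running it.

```python
import re


def um_statements(sql_text: str) -> list:
    """Return the statements in sql_text that mention an object named UM_*."""
    # Drop "--" line comments so a commented-out UM_ name does not match.
    without_comments = re.sub(r"--[^\n]*", "", sql_text)
    # Naive split: assumes semicolons only appear as statement terminators.
    statements = [s.strip() for s in without_comments.split(";")]
    # \b keeps names such as IDN_UM_X from matching; only names starting with UM_ count.
    return [s + ";" for s in statements if s and re.search(r"\bUM_\w+", s)]
```

A possible usage is to read the postgresql.sql file, write "\n\n".join(um_statements(text)) to a new file, and execute that file against USERSTORE_DB. Because the filter keeps any statement that mentions a UM_ name, it also picks up sequences and indexes for those tables, not just the CREATE TABLE statements.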