How to use parquet files as a source for a table in Apache Calcite? - apache-calcite

I want to read from a Parquet file as from a table using Apache Calcite. There is a bunch of adapters listed in the docs, but no explicit one for Parquet. On the other hand, there is an adapter for Spark, which can deal with Parquet perfectly well. But for some reason I can't find any example of how to use this Spark adapter. Even after reading its code I can't say I understand how I need to define Spark's schemas; there is no factory for it. I've tried the following code without really understanding how it should work, and it obviously doesn't work:
Class.forName("org.apache.calcite.jdbc.Driver");
Properties info = new Properties();
info.setProperty("lex", "JAVA");
info.setProperty("spark", "true");
Connection connection = DriverManager.getConnection("jdbc:calcite:", info);
CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);
SchemaPlus rootSchema = calciteConnection.getRootSchema();

SparkSession spark = SparkSession.builder()
        .appName("test")
        .master("local[1]")
        .getOrCreate();

StopWatch w = StopWatch.createStarted();
Dataset<Row> ds = spark.read().parquet("/tmp/test.parquet");
ds.select("issue_desc", "valid_from_dttm").show(15);
ds.printSchema();
ds.createTempView("sparkTable");
System.out.println(w.getTime(TimeUnit.MILLISECONDS));

FrameworkConfig calciteConfig = Frameworks.newConfigBuilder()
        .parserConfig(SqlParser.Config.DEFAULT)
        .defaultSchema(rootSchema)
        .programs()
        .traitDefs(ConventionTraitDef.INSTANCE, RelDistributionTraitDef.INSTANCE)
        .build();
RelBuilder builder = RelBuilder.create(calciteConfig);
RelRunner relRunner = calciteConnection.unwrap(RelRunner.class);
RelNode test1 = builder
        .scan("sparkTable")
        .build();
executeNode(relRunner, test1);
It simply fails with the exception:
Exception in thread "main" org.apache.calcite.runtime.CalciteException: Table 'sparkTable' not found
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.calcite.runtime.Resources$ExInstWithCause.ex(Resources.java:506)
at org.apache.calcite.runtime.Resources$ExInst.ex(Resources.java:600)
at org.apache.calcite.tools.RelBuilder.scan(RelBuilder.java:1238)
at org.apache.calcite.tools.RelBuilder.scan(RelBuilder.java:1265)
at ru.tinkoff.dwh.hercule.demo.SparkTest.main(SparkTest.java:64)
Could somebody please explain how to use it, share an example, or explain why I can't use the Spark adapter like this?
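For context on why the scan fails: RelBuilder.scan resolves table names against the Calcite schema passed as defaultSchema, while ds.createTempView only registers the view in Spark's own session catalog, which this Calcite connection never sees. Below is a minimal sketch, not using Spark or Parquet, of where a table has to be registered for builder.scan(...) to find it; the Hr and Employee classes and all names are made up for illustration:
import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

public class CalciteScanSketch {

    // Hypothetical POJO-backed schema: every public array field becomes a table.
    public static class Hr {
        public final Employee[] emps = {
            new Employee(1, "Alice"), new Employee(2, "Bob")
        };
    }

    public static class Employee {
        public final int id;
        public final String name;
        public Employee(int id, String name) { this.id = id; this.name = name; }
    }

    public static void main(String[] args) {
        // The table has to be registered in the schema Calcite sees...
        SchemaPlus rootSchema = Frameworks.createRootSchema(true);
        rootSchema.add("hr", new ReflectiveSchema(new Hr()));

        FrameworkConfig config = Frameworks.newConfigBuilder()
                .defaultSchema(rootSchema)
                .build();

        // ...so that RelBuilder.scan can resolve it by name.
        RelBuilder builder = RelBuilder.create(config);
        RelNode scan = builder.scan("hr", "emps").build();
        System.out.println(RelOptUtil.toString(scan));
    }
}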

Related

Can we change the location from US to another region while reading data from BigQuery using the BigQuery Java library?

I am trying to read data from BigQuery using the BigQuery Java library.
My dataset is not in the US location, so when I give my dataset name to the library, it throws an error saying the dataset was not found in the US location, because it searches the US location by default.
I have also tried setting the location using setLocation("asia-southeast1"), but it still searches in the US location.
This is my code snippet:
val bigquery: BigQuery = BigQueryOptions.newBuilder().setLocation("asia-southeast1").build().getService
val query = "SELECT TO_JSON_STRING(t, true) AS json_row FROM " + dbName + "." + tableName + " AS t"
logger.info("Query is " + query)
val queryResult: QueryJobConfiguration = QueryJobConfiguration.newBuilder(query).build
val result: TableResult = bigquery.query(queryResult)
I am writing the code in Scala, but it uses the same libraries as Java, and since Java is more popular I am asking this for Java.
Please help me understand how I can change the location from US to southeast.
Can I change something inside QueryJobConfiguration? I have searched a lot but I am unable to find anything.
My only requirement is that I want the final result as a TableResult.
This is the exception being thrown
com.google.cloud.bigquery.BigQueryException: Not found: Dataset XXXXXXXX was not found in location US
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:106)
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:584)
at com.google.cloud.bigquery.BigQueryImpl$34.call(BigQueryImpl.java:1203)
at com.google.cloud.bigquery.BigQueryImpl$34.call(BigQueryImpl.java:1198)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50)
at com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1197)
at com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1181)
at com.google.cloud.bigquery.Job$1.call(Job.java:329)
at com.google.cloud.bigquery.Job$1.call(Job.java:326)
at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105)
at com.google.cloud.RetryHelper.run(RetryHelper.java:76)
at com.google.cloud.RetryHelper.poll(RetryHelper.java:64)
at com.google.cloud.bigquery.Job.waitForQueryResults(Job.java:325)
at com.google.cloud.bigquery.Job.getQueryResults(Job.java:291)
at com.google.cloud.bigquery.BigQueryImpl.query(BigQueryImpl.java:1168)
...
Thanks in advance.
You shouldn't actually need to specify the location, because BigQuery will infer it from the dataset referenced in your query. From the documentation:
When loading data, querying data, or exporting data, BigQuery determines the location to run the job based on the datasets referenced in the request. For example, if a query references a table in a dataset stored in the asia-northeast1 region, the query job will run in that region.
I just tested using the Java SDK on a dataset/table I created in asia-southeast1, and it worked without needing to explicitly specify the location.
If it's still not working for you by default (check that the table you're referencing actually exists), then you can specify the location by setting it in the JobId and passing that to the overloaded query method:
String query = "SELECT * FROM `grey-sort-challenge.asia_southeast1.a_table`;";
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(query)
        .setUseLegacySql(Boolean.FALSE)
        .build();
JobId id = JobId.newBuilder()
        .setLocation("asia-southeast1")
        .setRandomJob()
        .build();
try {
    for (FieldValueList row : BIGQUERY.query(queryConfig, id).iterateAll()) {
        for (FieldValue val : row) {
            System.out.printf("%s,", val.toString());
        }
        System.out.printf("\n");
    }
} catch (InterruptedException e) {
    e.printStackTrace();
}
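Since the question specifically wants the final result as a TableResult, here is a minimal sketch of the same JobId-with-location approach that keeps the result as a TableResult instead of iterating rows inline (the table name is a placeholder, and BIGQUERY is assumed to be an already-built client as in the snippet above):
QueryJobConfiguration queryConfig = QueryJobConfiguration.newBuilder(
        "SELECT * FROM `my-project.asia_southeast1_dataset.a_table`")   // placeholder table
        .setUseLegacySql(false)
        .build();
// Pin the job to the dataset's region via the JobId.
JobId jobId = JobId.newBuilder()
        .setLocation("asia-southeast1")
        .setRandomJob()
        .build();
try {
    // query(...) with an explicit JobId returns the completed query's TableResult.
    TableResult result = BIGQUERY.query(queryConfig, jobId);
    result.iterateAll().forEach(System.out::println);
} catch (InterruptedException e) {
    e.printStackTrace();
}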

Using Spanner within Apache Beam Dataflow

I am trying to add a Spanner connection within an Apache Beam ParDo (DoFn). I need to look up some rows as part of the ParDo. Dataflow creates a number of workers (usually 4 max), and I use the startBundle and finishBundle methods to open and close the Spanner connection for the worker's lifetime. Then within the processElement method I perform the lookup for each item, passing the DatabaseClient and using a singleUseReadOnlyTransaction.
I should add this is running as a Dataflow job under GCP.
Some code to illustrate this:
private static CustomDoFn<String, TransactionImport> processRow = new CustomDoFn<String, TransactionImport>() {
    private static final long serialVersionUID = 1L;
    private Spanner spanner = null;
    private DatabaseClient dbClient = null;

    @StartBundle
    public void startBundle(StartBundleContext c) {
        TransactionFileOptions options = c.getPipelineOptions().as(TransactionFileOptions.class);
        com.google.cloud.spanner.SpannerOptions spannerOptions =
                com.google.cloud.spanner.SpannerOptions.newBuilder().build();
        spanner = spannerOptions.getService();

        String spannerProjectID = options.getSpannerProjectId();
        String spannerInstanceID = options.getSpannerInstanceId();
        String spannerDatabaseID = options.getSpannerDatabaseId();

        DatabaseId db = DatabaseId.of(spannerProjectID, spannerInstanceID, spannerDatabaseID);
        dbClient = spanner.getDatabaseClient(db);
    }

    @FinishBundle
    public void finishBundle(FinishBundleContext c) {
        spanner.close();
    }

    @ProcessElement
    public void processElement(DoFn<String, TransactionImport>.ProcessContext c) throws Exception {
        TransactionImport txImport = new TransactionImport(); // 'import' is a reserved word in Java
        Statement statement = Statement.newBuilder("SELECT * FROM Table1 WHERE Name = @Name")
                .bind("Name").to(text)
                .build();
        ResultSet resultSet = dbClient.singleUseReadOnlyTransaction().executeQuery(statement);
        // set some value on txImport depending on the retrieved value
        c.output(txImport);
    }
};
This always results in the Dataflow job not completing, and when I check the log I see:
Processing stuck in step Process Rows for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:924)
at com.google.common.util.concurrent.Uninterruptibles.takeUninterruptibly(Uninterruptibles.java:233)
at com.google.cloud.spanner.SessionPool$Waiter.take(SessionPool.java:411)
at com.google.cloud.spanner.SessionPool$Waiter.access$3300(SessionPool.java:399)
at com.google.cloud.spanner.SessionPool.getReadSession(SessionPool.java:754)
at com.google.cloud.spanner.DatabaseClientImpl.singleUseReadOnlyTransaction(DatabaseClientImpl.java:52)
at com.mycompany.pt.SpannerDataAccess.getBinDetails(SpannerDataAccess.java:197)
at com.mycompany.pt.transactionFiles.TransactionFileDataflow$1.processLine(TransactionFileDataflow.java:411)
at com.mycompany.pt.transactionFiles.TransactionFileDataflow$1.processElement(TransactionFileDataflow.java:336)
at com.mycompany.pt.transactionFiles.TransactionFileDataflow$1$DoFnInvoker.invokeProcessElement(Unknown Source)
Does anyone have any experience using Spanner like this within a ParDo?
I'm not a Spanner expert, but maybe I can help:
You should use @Setup/@Teardown to connect to and disconnect from Spanner (see the sketch below); @{Start,Finish}Bundle gets called multiple times over the lifetime of a worker. See here for more details: https://beam.apache.org/documentation/execution-model/#bundling-and-persistence
Does your processElement method ever emit an element using c.output(...)? If not, Beam will think your pipeline is stuck.
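To make the first point concrete, here is a minimal sketch (not the original poster's code; TransactionFileOptions is dropped, the project/instance/database IDs are placeholders, and the per-element lookup is left schematic) of moving the Spanner client into @Setup/@Teardown so it lives for the whole worker rather than per bundle:
static class LookupFn extends DoFn<String, TransactionImport> {
    private transient Spanner spanner;
    private transient DatabaseClient dbClient;

    @Setup
    public void setup() {
        // Create the client once per DoFn instance (worker), not per bundle.
        SpannerOptions options = SpannerOptions.newBuilder().build();
        spanner = options.getService();
        dbClient = spanner.getDatabaseClient(
                DatabaseId.of("my-project", "my-instance", "my-database")); // placeholders
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        Statement statement = Statement.newBuilder("SELECT * FROM Table1 WHERE Name = @Name")
                .bind("Name").to(c.element())
                .build();
        try (ResultSet rs = dbClient.singleUseReadOnlyTransaction().executeQuery(statement)) {
            TransactionImport result = new TransactionImport();
            // ... populate result from rs ...
            c.output(result); // always emit, otherwise the pipeline looks stuck
        }
    }

    @Teardown
    public void teardown() {
        if (spanner != null) {
            spanner.close();
        }
    }
}
In real code the project, instance and database IDs would be passed into the DoFn's constructor, since pipeline options are not available inside @Setup; the point here is only that the expensive client is created once per worker and closed in @Teardown.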

Search Informatica for text in SQL override

Is there a way to search all the mappings, sessions, etc. in Informatica for a text string contained within a SQL override?
For example, suppose I know a certain stored procedure (SP_FOO) is being called somewhere in an INFA process, but I don't know where exactly. Somewhere I think there is a Post SQL on a source or target calling it. Could I search all the sessions for Post SQL containing SP_FOO ? (Similar to what I could do with grep with source code.)
You can use repository queries to query the repository tables (if you have enough access) to get data related to all the mappings, transformations, sessions, etc.
The link below covers almost every kind of repository query; your answer can be found there:
https://uisapp2.iu.edu/confluence-prd/display/EDW/Querying+PowerCenter+data
select * --distinct sbj.SUBJECT_AREA, m.PARENT_MAPPING_NAME
from REP_SUBJECT sbj, REP_ALL_MAPPINGS m, REP_WIDGET_INST w, REP_WIDGET_ATTR wa
where sbj.SUBJECT_ID = m.SUBJECT_ID
  AND m.MAPPING_ID = w.MAPPING_ID
  AND w.WIDGET_ID = wa.WIDGET_ID
  and sbj.SUBJECT_AREA in ('TLR','PPM_PNLST_WEB','PPM_CURRENCY','OLA','ODS','MMS','IT_METRIC','E_CONSENT','EDW','EDD','EDC','ABS')
  and (UPPER(ATTR_VALUE) like '%PSA_CONTACT_EVENT%'
  -- or UPPER(ATTR_VALUE) like '%PSA_MEMBER_CHARACTERISTIC%'
  -- or UPPER(ATTR_VALUE) like '%PSA_REPORTING_HH_CHRSTC%'
  -- or UPPER(ATTR_VALUE) like '%PSA_REPORTING_MEMBER_CHRSTC%'
  )
--and m.PARENT_MAPPING_NAME like '%ARM%'
order by 1
Please let me know if you have any issues.
Another less scientific way to do this is to export the workflow(s) as XML and use a text editor to search through them for the stored procedure name.
If you have read access to the schema where the informatica repository resides, try this.
SELECT DISTINCT f.subj_name folder, e.mapping_name, object_type_name,
b.instance_name, a.attr_value
FROM opb_widget_attr a,
opb_widget_inst b,
opb_object_type c,
opb_attr d,
opb_mapping e,
opb_subject f
WHERE a.widget_id = b.widget_id
AND b.widget_type = c.object_type_id
AND ( object_type_name = 'Source Qualifier'
OR object_type_name LIKE '%Lookup%'
)
AND a.widget_id = b.widget_id
AND a.attr_id = d.attr_id
AND c.object_type_id = d.object_type_id
AND attr_name IN ('Sql Query')--, 'Lookup Sql Override')
AND b.mapping_id = e.mapping_id
AND e.subject_id = f.subj_id
AND a.attr_value is not null
--AND UPPER (a.attr_value) LIKE UPPER ('%currency%')
Yes. There is a small Java-based tool called Informatica Meta Query.
Using that tool, you can search for any information that is present in the Informatica metadata tables.
If you cannot find that tool, you can write queries directly against the Informatica metadata tables to get the required information.
Adding a few more lines to the solutions provided by Data Origin and Sandeep.
It is highly advised not to query the repository tables directly. Rather, create synonyms or views and then query those objects to avoid any damage to the repository tables.
In our dev/prod environments application programmers are not granted any direct access to the repository tables.
As querying the Informatica database isn't the best idea, I would suggest you export all the workflows in your folder to XML using Repository Manager. From Repository Manager you can select all of them at once and export them in one go. Then write a Java program to search for the pattern in the XMLs you have.
I have written a sample program here; please modify it as per your requirement:
Make a spec file with the exported workflow XML file names (specFileName).
public static void main(String[] args)
{
    String specFileName = "<YourSpecFile>";   // spec file listing the exported workflow XMLs
    String textToSearch = "<YourString>";     // the string you are looking for, e.g. SP_FOO
    try {
        File inFile = new File(specFileName);
        BufferedReader reader = new BufferedReader(new FileReader(inFile));
        String currentLine;
        while ((currentLine = reader.readLine()) != null)
        {
            // trim the newline before comparing with the search string
            String trimmedLine = currentLine.trim();
            if (trimmedLine.contains(textToSearch))
            {
                System.out.println(specFileName);   // spec file name
            }
        }
        reader.close();
    }
    catch (IOException ex)
    {
        System.out.println("Error reading file '" + specFileName + "'");
    }
}

Zend Framework - Doctrine2 - Repository Query Caching

I'm looking at Caching and how to use it in Doctrine.
I've got the following in my Zend Framework Bootstrap.php:
// Build Configuration
$orm_config = new \Doctrine\ORM\Configuration();
// Caching
$cacheOptions = $options['cache']['backendOptions'];
$cache = new \Doctrine\Common\Cache\MemcacheCache();
$memcache = new Memcache;
$memcache->connect($cacheOptions['servers']['host'], $cacheOptions['servers']['port']);
$cache->setMemcache($memcache);
$orm_config->setMetadataCacheImpl($cache);
$orm_config->setQueryCacheImpl($cache);
$orm_config->setResultCacheImpl($cache);
I'm running a very simple query on my DB using:
self::_instance()->_em->getRepository('UserManagement\Users')->find('1');
And I'm not sure if I'm using caching properly, because with it on (as per the above config) the query seems to take twice as long to execute as with it disabled. Is this right?
Thanks in advance,
Steve
I seem to have sorted this myself. Basically, from what I understand, a repository query like:
self::_instance()->_em->getRepository('UserManagement\Users')->find('1');
will not cache the results. If the same query is executed again later in the same script, Doctrine will not hit the database again and will reuse the result it has in memory, but this isn't the same as real caching, which in my case means Memcache.
The only way to achieve this is to override the Doctrine EntityRepository find() method in a custom repository with something like:
public function find($id)
{
    // Build the query via the Entity Manager's query builder
    $qb = $this->getEntityManager()->createQueryBuilder();
    $qb->select('u')
       ->from('UserManagement\Users', 'u')
       ->where('u.id = :id')
       ->setParameter('id', $id);

    $query = $qb->getQuery();
    $query->useResultCache(TRUE);
    $result = $query->getSingleResult();

    return $result;
}
Notably, the most important line above is $query->useResultCache(TRUE); this tells Doctrine to cache the result set.
Hope this helps.

Why does WebSharingAppDemo-CEProviderEndToEnd sample still need a client db connection after scope creation to perform sync

I'm researching a way to build an n-tiered sync solution. From the WebSharingAppDemo-CEProviderEndToEnd sample it seems almost feasible; however, for some reason the app will only sync if the client has a live SQL DB connection. Can someone explain what I'm missing and how to sync without exposing SQL to the internet?
The problem I'm experiencing is that when I provide a relational sync provider that has an open SQL connection from the client, it works fine, but when I provide a relational sync provider that has a closed but configured connection string, as in the example, I get an error from the WCF service stating that the server did not receive the batch file. So what am I doing wrong?
SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder();
builder.DataSource = hostName;
builder.IntegratedSecurity = true;
builder.InitialCatalog = "mydbname";
builder.ConnectTimeout = 1;
provider.Connection = new SqlConnection(builder.ToString());
// provider.Connection.Open();   // un-commenting this causes the code to work
// create a new scope description and add the appropriate tables to this scope
DbSyncScopeDescription scopeDesc = new DbSyncScopeDescription(SyncUtils.ScopeName);
//class to be used to provision the scope defined above
SqlSyncScopeProvisioning serverConfig = new SqlSyncScopeProvisioning();
....
The error I get occurs in this part of the WCF code:
public SyncSessionStatistics ApplyChanges(ConflictResolutionPolicy resolutionPolicy, ChangeBatch sourceChanges, object changeData)
{
    Log("ProcessChangeBatch: {0}", this.peerProvider.Connection.ConnectionString);
    DbSyncContext dataRetriever = changeData as DbSyncContext;

    if (dataRetriever != null && dataRetriever.IsDataBatched)
    {
        string remotePeerId = dataRetriever.MadeWithKnowledge.ReplicaId.ToString();

        // Data is batched. The client should have uploaded this file to us prior to calling ApplyChanges.
        // So look for it.
        // The Id would be the DbSyncContext.BatchFileName, which is just the batch file name without the complete path
        string localBatchFileName = null;
        if (!this.batchIdToFileMapper.TryGetValue(dataRetriever.BatchFileName, out localBatchFileName))
        {
            // Service has not received this file. Throw exception
            throw new FaultException<WebSyncFaultException>(new WebSyncFaultException("No batch file uploaded for id " + dataRetriever.BatchFileName, null));
        }
        dataRetriever.BatchFileName = localBatchFileName;
    }
Any ideas?
For the "batch file not available" issue, remove the IsOneWay=true setting from IRelationalSyncContract.UploadBatchFile. When the batch file is big, ApplyChanges can be called before the previous UploadBatchFile call has fully completed.
// Replace
[OperationContract(IsOneWay = true)]
// with
[OperationContract]
void UploadBatchFile(string batchFileid, byte[] batchFile, string remotePeer1);
I suppose it's simply a bare-bones example. It demonstrates some of the techniques but assumes you will arrange them in the proper order yourself.
http://msdn.microsoft.com/en-us/library/cc807255.aspx