Pentaho Kettle dataflow

Source: Machine1
Destination: Machine2
Pentaho kettle running on Machine3
A transformation developed and executed on Machine3 hits a database on Machine1, selects data, and inserts it into a table in another database on Machine2.
Does the data flow through Machine3, or is a direct channel established between Machine1 and Machine2?

Pentaho will not bridge connections between Machine1 and Machine2. The data flows through Machine3 only: rows read from Machine1 are processed on Machine3 and then written from Machine3 to Machine2. No direct channel between Machine1 and Machine2 is established.
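Conceptually, the transformation behaves like the following JDBC sketch running on Machine3 (purely illustrative: Kettle's Table input and Table output steps do this internally, and the driver, URLs, credentials, and table names below are hypothetical):
import java.sql.*;

// Runs on Machine3: every row travels Machine1 -> Machine3 -> Machine2.
public class CopyThroughMachine3 {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://machine1/sourcedb", "user", "pass");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://machine2/targetdb", "user", "pass");
             Statement select = src.createStatement();
             ResultSet rs = select.executeQuery("SELECT id, name FROM source_table");
             PreparedStatement insert = dst.prepareStatement("INSERT INTO target_table (id, name) VALUES (?, ?)")) {
            while (rs.next()) {
                insert.setInt(1, rs.getInt(1));       // the row is now in Machine3's memory
                insert.setString(2, rs.getString(2));
                insert.addBatch();
            }
            insert.executeBatch();                    // pushed from Machine3 to Machine2
        }
    }
}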

Related

Data Fusion pipelines fail without executing

I have more than 50 Data Fusion pipelines running concurrently in an Enterprise instance of Data Fusion.
About 4 of them fail at random on each concurrent run; the logs show only the provisioning followed by the deprovisioning of the Dataproc cluster, as in this log:
2021-04-29 12:52:49,936 - INFO [provisioning-service-4:i.c.c.r.s.p.d.DataprocProvisioner#203] - Creating Dataproc cluster cdap-fm-smartd-cc94285f-a8e9-11eb-9891-6ea1fb306892 in project project-test, in region europe-west2, with image 1.3, with system labels {goog-datafusion-version=6_1, cdap-version=6_1_4-1598048594947, goog-datafusion-edition=enterprise}
2021-04-29 12:56:08,527 - DEBUG [provisioning-service-1:i.c.c.i.p.t.ProvisioningTask#116] - Completed PROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
2021-04-29 13:04:01,678 - DEBUG [provisioning-service-7:i.c.c.i.p.t.ProvisioningTask#116] - Completed DEPROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
When a failed pipeline is restarted, it completes successfully.
All the pipelines are started and monitored via Composer, using an async start and a custom wait SensorOperator.
There is no quota-exceeded warning.
Additional info:
Data Fusion 6.1.4
with a Dataproc ephemeral cluster with 1 master and 2 workers, image version 1.3.89
EDIT
The appfabric logs related to each failed pipeline are:
WARN [program.status:i.c.c.i.a.r.d.DistributedProgramRuntimeService#172] - Twill RunId does not exist for the program program:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow, runId f34a6fb4-acb2-11eb-bbb2-26edc49aada0
WARN [pool-11-thread-1:i.c.c.i.a.s.RunRecordCorrectorService#141] - Fixed RunRecord for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.fdc22f56-acb2-11eb-bbcf-26edc49aada0 in STARTING state because it is actually not running
Further research connected the problem, somehow, to an inconsistent state in the CDAP run records when many concurrent requests are made via the REST API.
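Since a manual restart always succeeds, one pragmatic workaround is to retry a failed run automatically. Below is a minimal sketch of that idea using the CDAP program-lifecycle REST endpoint; the instance URL, app name, and auth token are hypothetical placeholders, and in our setup the retry would really live in the Composer DAG rather than in standalone code:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestartPipeline {
    public static void main(String[] args) throws Exception {
        // Hypothetical Data Fusion/CDAP endpoint and pipeline name.
        String url = "https://datafusion.example.com/api/v3/namespaces/default/apps/my-pipeline"
                + "/workflows/DataPipelineWorkflow/start";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest start = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Bearer placeholder-token")   // placeholder credentials
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        // Retry a few times, pausing between attempts so we do not add to the burst of concurrent requests.
        for (int attempt = 1; attempt <= 3; attempt++) {
            HttpResponse<String> response = client.send(start, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                break;
            }
            Thread.sleep(30_000);
        }
    }
}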

Spring Session using Lettuce connecting to AWS ElastiCache Redis

I am using Spring Session to externalize our session to Redis (AWS ElastiCache). Lettuce is being used as our client to Redis.
My AWS Redis configuration is the following:
Redis Cluster enabled
Two shards (i.e. two masters)
One slave per master
My Lettuce configuration is the following:
<!-- Lettuce Configuration -->
<bean class="org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory">
    <constructor-arg ref="redisClusterConfiguration"/>
</bean>

<!-- Redis Cluster Configuration -->
<bean id="redisClusterConfiguration" class="org.springframework.data.redis.connection.RedisClusterConfiguration">
    <constructor-arg>
        <list>
            <value><!-- AMAZON SINGLE ENDPOINT HERE --></value>
        </list>
    </constructor-arg>
</bean>
The issue appears when we trigger a failover of a master node. These are the events logged during a test failover:
myserver-0001-001 | cache-cluster | Monday, July 6, 2020 at 8:25:32 PM UTC+3 | Finished recovery for cache nodes 0001
myserver-0001-001 | cache-cluster | Monday, July 6, 2020 at 8:20:38 PM UTC+3 | Recovering cache nodes 0001
myserver | replication-group | Monday, July 6, 2020 at 8:19:14 PM UTC+3 | Failover to replica node myserver-0001-002 completed
myserver | replication-group | Monday, July 6, 2020 at 8:17:59 PM UTC+3 | Test Failover API called for node group 0001
AWS customer support claims that, as long as the Redis client is Redis Cluster aware, it should be able to connect to the newly promoted master as soon as the "Failover to replica node myserver-0001-002 completed" event is fired (i.e. 1m 15s after triggering the failover). Our client seems to reconnect only after the "Finished recovery for cache nodes 0001" event is fired (i.e. 7m 32s later). Meanwhile we get errors like the following:
org.springframework.data.redis.RedisSystemException: Error in execution; nested exception is io.lettuce.core.RedisCommandExecutionException: CLUSTERDOWN The cluster is down
While the failover is taking place, the following information can be seen from the redis-cli.
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379#1122 master - 0 1594114396000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379#1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114396872 2 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379#1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379#1122 myself,master - 0 1594114395000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379#1122 master - 0 1594114461262 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379#1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114460000 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379#1122 master - 0 1594114460256 0 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379#1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379#1122 myself,master - 0 1594114458000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379#1122 master - 0 1594114509000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379#1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114510552 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379#1122 slave ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 0 1594114510000 9 connected
c18fee0e47800d792676c7d14782d81d7d1684e8 172.31.10.64:6379#1122 master,fail - 1594114079948 1594114077000 8 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379#1122 myself,master - 0 1594114508000 2 connected 8192-16383
endpoint:6379> cluster nodes
ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 172.31.11.216:6379#1122 master - 0 1594114548000 9 connected 0-8191
f8ff7a20f4c493b63ba65f107f575631faa4eb1b 172.31.11.52:6379#1122 slave 4ab10ca0a6a932179432769fcba7fab0faba01f7 0 1594114548783 2 connected
6a7339ae4df3c78e31c9cc8fd8cec4803eed5fc1 172.31.10.64:6379#1122 slave ffe51ecc6a8c1f32ab3774eb8f159bd64392dc14 0 1594114547776 9 connected
4ab10ca0a6a932179432769fcba7fab0faba01f7 172.31.10.84:6379#1122 myself,master - 0 1594114547000 2 connected 8192-16383
As far as I understand, Lettuce as used by Spring Session is Redis Cluster aware, hence the RedisClusterConfiguration class in the XML configuration. Checking the documentation, some similar questions here on SO, and Lettuce's GitHub issues did not make it clear to me how Lettuce works in Redis Cluster mode, specifically with AWS's trickery of hiding the node IPs behind a single endpoint.
Shouldn't my configuration be enough for Lettuce to connect to the newly promoted master? Do I need to enable a different mode in Lettuce for it to receive the notification from Redis and switch to the new master (e.g. topology refresh)?
Also, how does Lettuce handle the single endpoint from AWS? Does it resolve the IPs and then use them? Are they cached?
If I want reads to happen from all four nodes, is my configuration enough? In a Redis Cluster (i.e. even outside of AWS's context), when a slave is promoted to master, does the client poll for that information or does the cluster somehow push it to the client?
Any resources (even Lettuce source files) you have that could clarify the above as well as the different modes in the context of Lettuce, Redis, and AWS would be more than welcome.
As you can see, I am still a bit confused about this.
Thanks a lot in advance for any help and information provided.
UPDATE
Debugging was enabled and breakpoints were used to intercept bean creation and configure topology refreshing that way. It turns out that the ClusterTopologyRefreshTask can be enabled through the constructor of ClusterClientOptions:
protected ClusterClientOptions(Builder builder) {
    super(builder);
    this.validateClusterNodeMembership = builder.validateClusterNodeMembership;
    this.maxRedirects = builder.maxRedirects;
    ClusterTopologyRefreshOptions refreshOptions = builder.topologyRefreshOptions;
    if (refreshOptions == null) {
        refreshOptions = ClusterTopologyRefreshOptions.builder() //
                .enablePeriodicRefresh(DEFAULT_REFRESH_CLUSTER_VIEW) // Breakpoint here and enter to enable refreshing
                .refreshPeriod(DEFAULT_REFRESH_PERIOD_DURATION) // Breakpoint here and enter to set the refresh interval
                .closeStaleConnections(builder.closeStaleConnections) //
                .build();
    }
    this.topologyRefreshOptions = refreshOptions;
}
It seems to refresh OK, but the problem now is how to configure this when Lettuce is used through Spring Session rather than as a plain Redis client.
As I was going through my questions I realized I hadn't answered this one! So here it is, in case someone has the same issue.
What I ended up doing is creating a configuration bean for Redis instead of using XML. The code is as follows:
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

import org.springframework.context.annotation.Bean;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;
import org.springframework.session.data.redis.config.ConfigureRedisAction;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

@EnableRedisHttpSession
public class RedisConfig {
    private static final List<String> clusterNodes = Arrays.asList(System.getProperty("redis.endpoint"));

    // ElastiCache does not allow the CONFIG command, so skip Spring Session's keyspace-notification setup.
    @Bean
    public static ConfigureRedisAction configureRedisAction() {
        return ConfigureRedisAction.NO_OP;
    }

    @Bean(destroyMethod = "destroy")
    public LettuceConnectionFactory lettuceConnectionFactory() {
        RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
        return new LettuceConnectionFactory(redisClusterConfiguration, getLettuceClientConfiguration());
    }

    private LettuceClientConfiguration getLettuceClientConfiguration() {
        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder().enablePeriodicRefresh(Duration.ofSeconds(30)).build();
        ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder().topologyRefreshOptions(topologyRefreshOptions).build();
        return LettucePoolingClientConfiguration.builder().clientOptions(clusterClientOptions).build();
    }
}
Then, instead of registering my ContextLoaderListener through XML, I used an initializer, like so:
import org.springframework.session.web.context.AbstractHttpSessionApplicationInitializer;

public class Initializer extends AbstractHttpSessionApplicationInitializer {
    public Initializer() {
        super(RedisConfig.class);
    }
}
This seems to set up refreshing OK, but I don't know if it is the proper way to do it! If anyone has an idea for a more proper solution, please feel free to comment here.
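If periodic refresh alone turns out not to be responsive enough, Lettuce also offers adaptive refresh triggers (refreshing on MOVED/ASK redirects, persistent reconnects, and similar events). A minimal variant of the getLettuceClientConfiguration() method above that enables them alongside the periodic refresh; whether it is needed on ElastiCache is something to verify in your own environment:
private LettuceClientConfiguration getLettuceClientConfiguration() {
    // Periodic refresh every 30s plus adaptive triggers (MOVED/ASK redirects, persistent reconnects, etc.).
    ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
            .enablePeriodicRefresh(Duration.ofSeconds(30))
            .enableAllAdaptiveRefreshTriggers()
            .build();
    ClusterClientOptions clusterClientOptions = ClusterClientOptions.builder()
            .topologyRefreshOptions(topologyRefreshOptions)
            .build();
    return LettucePoolingClientConfiguration.builder().clientOptions(clusterClientOptions).build();
}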

How to modify blocktime of a running private ethereum network

I have a private Ethereum network running on geth 1.8 using PoA consensus. It consists of two nodes: one sealer node and one bootnode/RPC API node. When I created the genesis file I set the block time to 3s, but it generates too much data this way and I want to set it to ~10s. How can I do this without losing previous transactions and data?
Once the network has started, the block time is fixed forever in PoA consensus. There is no command-line option for it. In the genesis of Clique (geth's implementation of PoA), the block time appears as "period": 3 (3 seconds) in:
"clique": {
"period": 3,
"epoch": 30000
}
I think you are aware of this already, so unless the protocol itself is changed to handle existing blockchain data when the block time changes, there is no other option at the moment.
You will have to change the genesis block with the new period and restart the network. There is no way to do this once the network has started.
Note: of course, the network will lose all its data.
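For reference, the clique section of the new genesis file would differ only in the period value, e.g. (a sketch; every other genesis field stays as in your original file):
"clique": {
"period": 10,
"epoch": 30000
}
You would then typically wipe the old data directory and run geth init again with the new genesis, which is exactly why the existing chain data cannot be carried over.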

HdfsRpcException: Failed to invoke RPC call "getFsStats" on server

I've installed a single-node Hadoop cluster on an EC2 instance. I then stored some test data on HDFS, and I'm trying to load that HDFS data into SAP Vora. I'm using SAP Vora 2.0 for this project.
To create the table and load the data into Vora, this is the query I'm running:
drop table if exists dims;
CREATE TABLE dims(teamid int, team string)
USING com.sap.spark.engines.relational
OPTIONS (
hdfsnamenode "namenode.example.com:50070",
files "/path/to/file.csv",
storagebackend "hdfs");
When I run the above query, I get this error message:
com.sap.vora.jdbc.VoraException: HL(9): Runtime error.
(could not handle api call, failure reason : execution of scheduler plan failed:
found error: :-1, CException, Code: 10021 : Runtime category : an std::exception wrapped.
Next level: v2 HDFS Plugin: Exception at opening
hdfs://namenode.example.com:50070/path/to/file.csv:
HdfsRpcException: Failed to invoke RPC call "getFsStats" on server
"namenode.example.com:50070" for node id 20
with error code 0, status ERROR_STATUS
Hadoop and Vora are running on different nodes.
You should specify the HDFS NameNode RPC port, which is typically 8020; 50070 is the port of the web UI. See e.g. the question "Default Namenode port of HDFS is 50070. But I have come across at some places 8020 or 9000".
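With that change, the table definition would look like this (a sketch reusing the statement above; 8020 is only the common default, so check the fs.defaultFS or dfs.namenode.rpc-address setting of your cluster):
drop table if exists dims;
CREATE TABLE dims(teamid int, team string)
USING com.sap.spark.engines.relational
OPTIONS (
hdfsnamenode "namenode.example.com:8020",
files "/path/to/file.csv",
storagebackend "hdfs");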

How to properly Load balance between two Spark Controllers

We are attempting to load balance between two Spark Controllers that connect to Vora...
We are able to connect, and the query gets sent to the controller.
The problem occurs when the result is supposed to be passed back to HANA: the process hangs and never finishes.
The last lines in the logs state:
17/02/14 14:24:12 INFO CommandRouter$$anon$1: Created broadcast 7 from executeSelectTask at CommandRouter.scala:650
17/02/14 14:24:12 INFO CommandRouter$$anon$1: Starting job: executeSelectTask at CommandRouter.scala:650
17/02/14 14:24:12 INFO CommandRouter$$anon$1: Created broadcast 8 from broadcast at DAGScheduler.scala:1008
17/02/14 14:24:14 INFO CommandRouter$$anon$1: Created broadcast 9 from broadcast at DAGScheduler.scala:1008
Is there something specific to configure to allow load balancing between the two controllers?
The process hangs forever because the nodes where the Spark executor jobs run do not know the hostname of the HANA host and therefore can never return the result set. The HANA hostname must be added to each node's /etc/hosts file.
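For example, on every node that runs Spark executors you would add a line like the following to /etc/hosts (the IP address and hostname are placeholders for your actual HANA host):
10.0.0.15   hanahost.example.com   hanahost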