Simple query takes minutes to execute on a killed/inactive session - c++

I'm trying to add simple failover functionality to my application that talks to an Oracle 11 database. To test that my session is up, I issue a simple query (select 1 from dual).
Now when I try to simulate a network outage by killing my Oracle session with "alter system kill session 'sid,serial';" and then execute this test query, it takes up to 5 minutes for the application to process it and return an error from the Execute method (I'm using the OCI API, C++):
Tue Feb 21 21:22:47 HKT 2012: Checking connection with test query...
Tue Feb 21 21:28:13 HKT 2012: Warning - OCI_SUCCESS_WITH_INFO: 3113: ORA-03113: end-of-file on communication channel
Tue Feb 21 21:28:13 HKT 2012: Test connection has failed, attempting to re-establish connection...
If I kill the session with the 'immediate' keyword at the end of the statement, the test query returns an error instantly.
Question 1: why does it take 5 minutes to execute my query? Are there any Oracle/PMON logs that can shed some light on what is happening during this delay?
Question 2: is 'alter system kill session' a good choice for simulating a network failure? How close are the outcomes of this command to a real-world network failure between the application and the Oracle DB?
Update:
Oracle version:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

There is a good chance that the program is waiting for rollback to complete.
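Whatever the server side is doing during that delay (rollback, a dead TCP peer that keeps retransmitting), the client can still put its own deadline on the health check so the application never blocks for minutes. A minimal sketch of that watchdog pattern, written in Python with cx_Oracle purely for illustration (the asker's code is OCI/C++, and the connection object here is assumed to already exist):

import threading
import cx_Oracle  # assumption: illustration only; the original application uses OCI from C++

def session_is_alive(conn, timeout_seconds=5.0):
    """Run the probe query in a worker thread and give up after a deadline."""
    result = {"ok": False}

    def probe():
        try:
            cur = conn.cursor()
            cur.execute("select 1 from dual")
            cur.fetchone()
            result["ok"] = True
        except cx_Oracle.DatabaseError:
            result["ok"] = False

    t = threading.Thread(target=probe)
    t.daemon = True
    t.start()
    t.join(timeout_seconds)
    # If the probe is still blocked (e.g. waiting on a dead connection),
    # treat the session as gone and let the caller reconnect.
    return result["ok"] and not t.is_alive()

Newer Oracle clients (18c and later) also expose a per-call timeout on the connection, which achieves the same effect without a watchdog thread.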

Related

Obfuscation process using Informatica

How do I check the status of an obfuscation process on Informatica? I started the process about 8 hours ago, and because the idle-time limit was exceeded my VM was logged off, shutting down all the applications.
You can still check the session logs from the integration server. They'll be in the installation directory under SessLogs.
The session logs will be in a readable format only if you have selected the "Write Backward Compatible Session Log File" option.
You can check the session log on the server in \server\infa_shared\SessLogs. Session logs are saved on the server with a time-stamp. To read the content of a log you can also open it through the Workflow Monitor: right-click on the session and select "Get Session Log".

ALTER DATABASE - Cannot process request. Not enough resources to process request.

I am working to automate some of my performance tests on Azure SQL Data Warehouse. I had been scaling up/down the databases using the Azure portal. I read in https://msdn.microsoft.com/en-us/library/mt204042.aspx that it is possible to use T-SQL to accomplish this via
ALTER DATABASE ...
My first attempt using T-SQL failed:
RunScript:INFO: Mon Feb 6 20:11:06 UTC 2017 : Connecting to host "logicalserver.database.windows.net" database "master" as "myuser"
RunScript:INFO: stdout from sqlcmd will follow...
ALTER DATABASE my_db MODIFY ( SERVICE_OBJECTIVE = 'DW1000' ) ;
Msg 49918, Level 16, State 1, Server logicalserver, Line 1
Cannot process request. Not enough resources to process request. Please retry you request later.
RunScript:INFO: Mon Feb 6 20:11:17 UTC 2017 : Return Code = "1" from host "logicalserver.database.windows.net" database "master" as "myuser"
RunScript:INFO: stdout from sqlcmd has ended ^.
I immediately went to the Azure portal, requested a scale, and it worked (taking 10 minutes to complete).
Is there any explanation?
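One practical mitigation while waiting for an explanation: the error text itself says to retry later, so an automation script can simply retry on Msg 49918. A rough sketch, assuming pyodbc with a SQL Server ODBC driver and the same placeholder server/database/user as above (ALTER DATABASE has to run with autocommit because it cannot be part of a transaction):

import time
import pyodbc  # assumption: pyodbc plus an ODBC driver for SQL Server are installed

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=logicalserver.database.windows.net;DATABASE=master;"
    "UID=myuser;PWD=<password>",
    autocommit=True,   # ALTER DATABASE cannot run inside a transaction
)

sql = "ALTER DATABASE my_db MODIFY ( SERVICE_OBJECTIVE = 'DW1000' );"
for attempt in range(10):
    try:
        conn.execute(sql)
        break
    except pyodbc.Error as e:
        if "49918" in str(e):   # Not enough resources to process request
            time.sleep(60)      # back off and retry
        else:
            raise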

Alternative to KILL 'SID' on Azure SQL Data Warehouse

If I submit a series of SQL statements (each terminated with GO in sqlcmd) that I want to make a reasonable attempt to run on Azure SQL Data Warehouse, I've found how to make sqlcmd ignore errors. But I've seen that if I abort a statement in that sequence with:
KILL "SIDxxxxxxx";
The whole session ends:
Msg 111202, Level 16, State 1, Server adws_database, Line 1
111202;Query QIDyyyyyyyyyy has been cancelled.
Is there a way to cancel a query without ending the whole session in Azure SQL Data Warehouse? Similar to how Postgres's
pg_cancel_backend()
works?
In Postgres,
pg_terminate_backend(<pid>)
seems to work similarly to the ADW
KILL 'SIDxxxx'
command.
Yes, a client can cancel a running request without aborting the whole session. In SSMS this is what the red square does during query execution.
Sqlcmd doesn't expose any way to cancel a running request, though. Other client interfaces do; for example, with the .NET SqlClient you can use SqlCommand.Cancel().
David
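The same cancel-the-request-but-keep-the-session behavior is reachable from other clients too; for example, pyodbc exposes Cursor.cancel(), which may be called from another thread while a query is running. A small sketch with a placeholder DSN and a hypothetical long-running query, just to illustrate the pattern sqlcmd lacks:

import threading
import pyodbc

conn = pyodbc.connect("DSN=adws;UID=myuser;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Cancel the in-flight request after 30 seconds; the session itself stays open.
timer = threading.Timer(30.0, cur.cancel)
timer.start()
try:
    cur.execute("SELECT COUNT(*) FROM big_fact_table")  # hypothetical long query
    print(cur.fetchone())
except pyodbc.Error:
    print("query was cancelled")
finally:
    timer.cancel()

# The connection is still usable for the next statement in the sequence.
print(cur.execute("SELECT 1").fetchone())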

Cassandra Python driver OperationTimedOut issue

I have a Python script that interacts with Cassandra using the DataStax Python driver.
It has been running since March 14th, 2016 and had no problems until today.
2016-06-02 13:53:38,362 ERROR ('Unable to connect to any servers', {'172.16.47.155': OperationTimedOut('errors=Timed out creating connection (5 seconds), last_host=None',)})
2016-06-02 13:54:18,362 ERROR ('Unable to connect to any servers', {'172.16.47.155': OperationTimedOut('errors=Timed out creating connection (5 seconds), last_host=None',)})
Below is the function used to create a session; the session is shut down (session.shutdown()) every time a query is done. (We get fewer than 100 queries per day from the subscriber side, so I chose to build the connection, run the query and close it, instead of keeping the connection alive.)
The session is not shared between threads or processes. If I invoke the function below in a Python console it connects to the DB properly, but the running script cannot connect to the DB anymore.
Can anyone help or shed some light on this issue? Thanks.
import cassandra.cluster
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement
from cassandra.policies import WriteType

import config  # module holding CLUSTER_HOST1/2 and KEY_SPACE


def get_cassandra_session(stat=None):
    """Creates a cluster and gets a session based on the key space."""
    # be aware that a session cannot be shared between threads/processes
    # or it will raise an OperationTimedOut exception
    if config.CLUSTER_HOST2:
        cluster = cassandra.cluster.Cluster([config.CLUSTER_HOST1, config.CLUSTER_HOST2])
    else:
        # if only one address is available, we have to use an older protocol version
        cluster = cassandra.cluster.Cluster([config.CLUSTER_HOST1], protocol_version=2)
    if stat and type(stat) == BatchStatement:
        retry_policy = cassandra.cluster.RetryPolicy()
        retry_policy.on_write_timeout(BatchStatement, ConsistencyLevel, WriteType.BATCH_LOG,
                                      ConsistencyLevel.ONE, ConsistencyLevel.ONE, retry_num=0)
        cluster.default_retry_policy = retry_policy
    session = cluster.connect(config.KEY_SPACE)
    session.default_timeout = 30.0  # applies to execute() requests, not to connecting
    return session
Specs:
python 2.7
Cassandra 2.1.11
A quote from the DataStax docs:
The operation took longer than the specified (client-side) timeout to complete. This is not an error generated by Cassandra, only the driver.
The problem is that I didn't touch the driver. I set the default timeout to 30.0 seconds, so why did it time out in 5 seconds (as the log says)?
The default connect timeout is five seconds. In this case you would need to set Cluster.connect_timeout. The Session default_timeout applies to execution requests.
It's still a bit surprising when any TCP connection takes more than five seconds to establish. One other thing to check would be monkey patching. Did something in the application change patching for Gevent or Eventlet? That could cause a change in default behavior for the driver.
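In other words, the two timeouts live in different places. A minimal sketch of setting both when building the session (reusing the config names from the question):

from cassandra.cluster import Cluster

cluster = Cluster(
    [config.CLUSTER_HOST1, config.CLUSTER_HOST2],
    connect_timeout=30.0,       # time allowed to establish each connection (default is 5s)
)
session = cluster.connect(config.KEY_SPACE)
session.default_timeout = 30.0  # applies to execute() requests, not to connecting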
I've learned that the gevent module interferes with the cassandra-driver
cassandra-driver (3.10)
gevent (1.1.1)
Uninstalling gevent solved the problem for me
pip uninstall gevent
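If the application actually needs gevent, an alternative to uninstalling it is to tell the driver explicitly which event loop to use; the driver ships a gevent-based connection class. A sketch, assuming monkey patching happens before the driver opens any connections:

from gevent import monkey
monkey.patch_all()  # must run before the driver creates sockets

from cassandra.cluster import Cluster
from cassandra.io.geventreactor import GeventConnection

cluster = Cluster(
    [config.CLUSTER_HOST1],
    connection_class=GeventConnection,  # use the gevent event loop instead of the default reactor
)
session = cluster.connect(config.KEY_SPACE)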

Amazon SWF: at least one worker has to be running, why?

I've just started using the AWS Ruby SDK to manage a simple workflow. One behavior I noticed right away is that at least one relevant worker and one relevant decider must be running prior to submitting a new workflow execution.
If I submit a new workflow execution before starting my worker and decider, then the tasks are never picked up, even when I'm still well within time-out limits. Why is this? Based on the description of how the HTTP long polling works, I would expect either app to receive the relevant tasks when the call to poll() is reached.
I encounter other deadlocking situations after a job fails (e.g. due to a worker or decider bug, or due to being terminated). Sometimes, re-running or even just starting an entirely new workflow execution will result in a deadlocked workflow execution. The initial decision tasks are shown in the workflow execution history in the AWS console, but the decider never receives them. Admittedly, I'm having trouble confirming/reducing this issue to a test case, but I suspect it is related to the above issue. This happens roughly 10 to 20% of the time; the rest of the time, everything works.
Some other things to mention: I'm using a single task list for two separate activity tasks that run in sequence. Both the worker and the decider are polling the same task list.
Here is my worker:
require 'yaml'
require 'aws'
config_file_path = File.join(File.dirname(File.expand_path(__FILE__)), 'config.yaml')
config = YAML::load_file(config_file_path)
swf = AWS::SimpleWorkflow.new(config)
domain = swf.domains['test-domain']
puts("waiting for an activity")
domain.activity_tasks.poll('hello-tasklist') do |activity_task|
puts activity_task.activity_type.name
activity_task.complete! :result => name
puts("waiting for an activity")
end
EDIT
Another user on the AWS forums commented:
I think the cause is SWF not immediately recognizing a long-poll connection shutdown. When you kill a worker, its connection can be considered open by the service for some time, so the service can still dispatch a task to it. To you it looks like the new worker never gets it. The way to verify this is to check the workflow history: you'll see an activity task started event whose identity field contains the host and pid of the dead worker. Eventually such a task is going to time out and can be retried by the decider.
Note that this condition is common during unit tests that frequently terminate connections and is not really a problem for production applications. The common workaround is to use a different task list for each unit test.
This seems to be a pretty reasonable explanation. I'm going to try to confirm this.
You've raised two issues: one regarding starting an execution with no active deciders, and the other regarding actors crashing in the middle of a task. Let me address them in order.
I have carried out an experiment based on your observations, and indeed, when a new workflow execution starts and no deciders are polling, SWF still records that a new decision task was started. The following is my event log from the AWS console. Note what happens:
Fri Feb 22 22:15:38 GMT+000 2013 1 WorkflowExecutionStarted
Fri Feb 22 22:15:38 GMT+000 2013 2 DecisionTaskScheduled
Fri Feb 22 22:15:38 GMT+000 2013 3 DecisionTaskStarted
Fri Feb 22 22:20:39 GMT+000 2013 4 DecisionTaskTimedOut
Fri Feb 22 22:20:39 GMT+000 2013 5 DecisionTaskScheduled
Fri Feb 22 22:22:26 GMT+000 2013 6 DecisionTaskStarted
Fri Feb 22 22:22:27 GMT+000 2013 7 DecisionTaskCompleted
Fri Feb 22 22:22:27 GMT+000 2013 8 ActivityTaskScheduled
Fri Feb 22 22:22:29 GMT+000 2013 9 ActivityTaskStarted
Fri Feb 22 22:22:30 GMT+000 2013 10 ActivityTaskCompleted
...
The first decision task was immediately scheduled (which is expected) and started right away (i.e. allegedly dispatched to a decider, even though no decider was running). I started a decider in the meantime, but the workflow didn't move until the timeout of the original decision task, 5 minutes later. I can't think of a scenario where this would be the desired behavior. Two possible defenses against that: have deciders running before starting a new execution or set an acceptably low timeout on a decision task (these tasks should be immediate anyway).
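The second defense (a low decision-task timeout) can be set once when the workflow type is registered. For example, with the low-level API - shown here via boto3 rather than the Ruby SDK used above, and with a hypothetical workflow type name:

import boto3

swf = boto3.client("swf", region_name="us-east-1")

swf.register_workflow_type(
    domain="test-domain",
    name="hello-workflow",    # hypothetical workflow type
    version="1.0",
    # If a decision task is "started" but nobody is actually working on it,
    # give up after 60 seconds instead of the 5 minutes seen in the log above.
    defaultTaskStartToCloseTimeout="60",
    defaultExecutionStartToCloseTimeout="3600",
    defaultTaskList={"name": "hello-tasklist"},
    defaultChildPolicy="TERMINATE",
)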
The issue of crashing actor (either decider or worker) is one that I'm familiar with. A short background note first:
Both activity and decision tasks are recorded by the service in 3 stages:
Scheduled = ready to be picked up by an actor.
Started = already picked up by an actor.
Completed/Failed or Timed out = the actor either completed or failed the task, or did not finish it within the deadline.
Once an actor has picked up a task and crashed, it is obviously not going to report anything back to the service (unless it is able to recover and still remembers the task token of the dispatched task - but most crashing actors wouldn't be that smart). The next decision task will be scheduled only upon time-out of the recently dispatched task, which is why all actors seem to be blocked for the duration of a task timeout. This is actually the desired behavior: the service can't know whether the task is being worked on as long as the worker is still within its deadline. There is a simple way to deal with this: fit your actors with a try-catch block and fail the task when an unexpected crash happens. I would discourage using a separate task list for each integ test. Instead, I'd recommend failing the task in the teardown() block. SWF allows you to specify a reason for failing a task, which is one way of logging failures and viewing them later through the AWS console.
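A minimal sketch of that try/catch-and-fail pattern for an activity worker, again with boto3's low-level SWF client for illustration (do_work is a hypothetical placeholder for the real activity implementation; the domain and task list names are reused from the question):

import traceback
import boto3

swf = boto3.client("swf", region_name="us-east-1")

while True:
    task = swf.poll_for_activity_task(
        domain="test-domain",
        taskList={"name": "hello-tasklist"},
        identity="worker-1",
    )
    if not task.get("taskToken"):
        continue  # the long poll timed out with no work; poll again
    try:
        result = do_work(task)  # hypothetical activity implementation
        swf.respond_activity_task_completed(taskToken=task["taskToken"], result=result)
    except Exception:
        # Fail fast instead of letting the task sit "started" until it times out.
        swf.respond_activity_task_failed(
            taskToken=task["taskToken"],
            reason="worker crashed",
            details=traceback.format_exc()[:32768],
        )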