Splunk - search for concurrent runs of processes - concurrency

I want to check whether there are multiple instances of a job/process running.
For example, my Splunk search:
index=abc <jobname> | stats earliest(_time) AS earliest_time, latest(_time) AS latest_time count by source | convert ctime(earliest_time), ctime(latest_time) | sort - count
returns:
source  earliest_time        latest_time          count
logA    06/06/2020 15:24:09  06/06/2020 15:24:59  1
logB    06/06/2020 15:24:24  06/06/2020 15:25:12  2
In the above, since logB indicates a job running before logA's completion time, it is an indication of a concurrent run of the process. I would like to generate a list of all such jobs if possible; any help is appreciated.
Thank you.

There is a built-in Splunk command for this, concurrency. This command requires an event start time and a duration, which we can calculate as the difference between the earliest and latest times. It will create a new field called concurrency, which represents the total number of events in progress at the time each particular event started, including the event itself.
index=abc <jobname> | stats earliest(_time) as et latest(_time) as lt count by source | eval duration=lt-et | concurrency start=et duration=duration | where concurrency>1
Docs for concurrency can be found at https://docs.splunk.com/Documentation/Splunk/8.0.4/SearchReference/Concurrency

Related

Informatica Looping

I am looking for information on looping in Informatica. Specifically, I need to check if a source table has been loaded; if it has, move to the next step; if not, wait X minutes and check the status table again. I would prefer to be pointed to somewhere I can learn this on my own, but I need to confirm it is even possible, as I have not found anything in my Google searches.
You can use a simple shell script to implement this wait-and-watch capability.
#!/bin/sh
# Call it as script_name.sh
# It waits 10 minutes between checks and gives up after 2 hours in total;
# change the values below if you want to.
# The source is assumed to be Oracle; change it as per your source.
interval=600
loop_count=12
counter=0
while true
do
    counter=`expr $counter + 1`
    db_value=`sqlplus -s user/pass@local_SID <<EOF
set heading off
set feedback off
SELECT count(*) FROM my_source_table;
exit
EOF`
    if [ $db_value -gt 0 ]; then
        echo "Data Found."
        exit 0
    else
        if [ $counter -eq $loop_count ]; then
            echo "No data found in source after 2 hours."
            exit 1
        else
            sleep $interval
        fi
    fi
done
Add this shell script (in a Command task) to the beginning of the workflow.
Then use an Informatica link condition: if the status is 0, proceed; otherwise send an email saying the wait time is over.
This will send a mail if the wait time is over and there is still no data in the source.
In general, looping is not supported in Informatica PowerCenter.
One way is to use scripts, as discussed by Koushik.
Another way to do it is to have a continuously running workflow with a timer. This is configurable on the Scheduler tab of your workflow.
Such a configuration makes the workflow start again right after it succeeds, over and over again.
The workflow would look like:
Start -> s_check_source -> Decision -> timer
                               |-> s_do_other_stuff -> timer
This way it will check the source; if it has not been loaded, it triggers the timer, then succeeds and gets triggered again.
If the source turns out to be loaded, it will trigger the other session and complete; you would probably need another timer here to wait until the next day, or until whenever you'd like the workflow to be triggered again.

ray.tune.run stuck indefinitely before starting training

I use ray.tune.run to train a customized Gym model with RLlib, after calling ray.init(num_cpus=2).
I do not need to search over hyperparameters, so I give the parameters in the model config specific values. But tune.run still seems to be stuck in parameter searching indefinitely, repeatedly printing the following status in the terminal:
== Status ==
Current time: 2021-12-15 10:14:37 (running for 00:22:40.96)
Memory usage on this node: 17.1/62.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 1.0/2 CPUs, 1.0/1 GPUs, 0.0/36.44 GiB heap, 0.0/18.22 GiB objects
Result logdir: /
Number of trials: 1/1 (1 RUNNING)
+---------------------------+----------+-----------------------+
| Trial name | status | loc |
|---------------------------+----------+-----------------------|
| PPO_sim-v0_a3b49_00000 | RUNNING | 192.168.11.86:1869965 |
+---------------------------+----------+-----------------------+
I tried using ray.init() instead, and training can start, but it returns nan reward for all iterations.
Then, without changing anything, I re-ran ray.tune.run, and it is again stuck in searching with the above status information.
I have no idea why this happens, even though I don't specify any hyperparameter search.
Can anyone help me with this problem? Any help is appreciated.

Dataflow reading from Kafka without data loss?

We're currently big users of Dataflow batch jobs and want to start using Dataflow streaming if it can be done reliably.
Here is a common scenario: we have a very large Kafka topic that we need to do some basic ETL or aggregation on, and a non-idempotent upstream queue. Here is an example of our Kafka data:
ID  | msg | timestamp (mm,ss)
----|-----|------------------
1   | A   | 01:00
2   | B   | 01:01
3   | D   | 06:00
4   | E   | 06:01
4.3 | F   | 06:01
... | ... | ...  (millions more)
4.5 | ZZ  | 19:58
Oops, the data changes from integers to decimals at some point, which will eventually cause some elements to fail, forcing us to kill the pipeline, possibly modify the downstream service, and possibly make minor code changes to the Dataflow pipeline.
In Spark Structured Streaming, because of the ability to use external checkpoints, we would be able to restart a streaming job and resume processing the queue where the previous job left off (successfully processing), for exactly-once processing. In a vanilla or Spring Boot Java application we could loop with a Kafka consumer and commit offsets only after writing results to our 'sink'.
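For concreteness, that vanilla-Java pattern would look roughly like the sketch below (broker, group, and topic names are placeholders, and writeToSink is a stand-in for our real downstream write): auto-commit is disabled and offsets are committed only after the batch has been written.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitLoop {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
    props.put("group.id", "etl-consumer");             // placeholder group
    props.put("enable.auto.commit", "false");          // commit manually, only after the sink write
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder topic
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          writeToSink(record.value());  // stand-in for the real ETL + downstream write
        }
        // Only after the whole batch has been written do we commit offsets, so a
        // crash before this point replays the batch instead of dropping it.
        consumer.commitSync();
      }
    }
  }

  private static void writeToSink(String value) {
    // placeholder for the non-idempotent downstream write
  }
}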
My overall question is: can we achieve similar functionality in Dataflow? I'll list some of my assumptions and concerns:
It seems here in KafkaIO that there is no relationship between the offset-commit PCollection and the user's one; does that mean they can drift apart?
It seems here in KafkaOffsetCommit that a five-minute window is taken and the highest offset is emitted, but this is not wall time, it is Kafka record time. Going back to our sample data, it looks to me like the entire queue's offsets would be committed (in chunks of five minutes) as fast as possible! This means that if we have only finished processing up to record F in the first five minutes, we may have committed almost the entire queue's offsets?
Now, since in our scenario the pipeline started failing around record F, it seems our only choices are to start from the beginning or to lose data?

I believe this might be overcome with a lot of custom code (a custom DoFn to ensure the Kafka consumer never commits) and some custom code for our upstream sink that would eventually commit offsets. Is there a better way to do this, and/or are some of my assumptions wrong about how offset management is handled in Dataflow?


Thank you for the detailed question!
In Beam (hence Dataflow), all of the outputs for a "bundle" are committed together, along with all state updates, checkpoints, etc, so there is no drift between different output PCollections. In this specific case, the offsets are extracted directly from the elements to be output so they correspond precisely. The outputs and offsets are both durably committed to Dataflow's internal storage before the offset is committed back to Kafka.
You are correct that the offsets from the elements already processed are grouped into 5 minute event time windows (Kafka record time) and the maximum offset is taken. While 5 minutes is an arbitrary duration, the offsets correspond to elements that have been successfully pulled off the queue.
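For reference, a minimal sketch of how this is typically wired up in Beam Java (broker, topic, and group names are placeholders, and the exact builder methods can vary slightly by Beam version): commitOffsetsInFinalize() asks KafkaIO to commit offsets back to Kafka only after the corresponding bundle has been finalized.

import java.util.Map;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaToDataflow {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    PCollection<KV<Long, String>> messages = pipeline.apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("broker:9092")                 // placeholder broker
            .withTopic("my-topic")                               // placeholder topic
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // A consumer group is needed so Kafka has somewhere to store the offsets.
            .withConsumerConfigUpdates(Map.of("group.id", "dataflow-etl"))
            // Commit offsets only after the elements' bundle has been durably
            // committed, instead of relying on the consumer's auto-commit.
            .commitOffsetsInFinalize()
            .withoutMetadata());

    // ... ETL / aggregation transforms and the sink write go here ...

    pipeline.run();
  }
}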

Long lived state with Google Dataflow

Just trying to get my head around the programming model here. Scenario is I'm using Pub/Sub + Dataflow to instrument analytics for a web forum. I have a stream of data coming from Pub/Sub that looks like:
ID | TS | EventType
1 | 1 | Create
1 | 2 | Comment
2 | 2 | Create
1 | 4 | Comment
And I want to end up with a stream coming from Dataflow that looks like:
ID | TS | num_comments
1 | 1 | 0
1 | 2 | 1
2 | 2 | 0
1 | 4 | 2
I want the job that does this rollup to run as a stream process, with new counts being populated as new events come in. My question is, where is the idiomatic place for the job to store the state for the current topic ID and comment counts, assuming that topics can live for years? Current ideas are:
Write a 'current' entry for the topic ID to Bigtable, and in a DoFn query what the current comment count is for the topic ID coming in. Even as I write this I'm not a fan.
Use side inputs somehow? It seems like maybe this is the answer, but if so, I'm not totally understanding how.
Set up a streaming job with a global window, with a trigger that goes off every time it gets a record, and rely on Dataflow to keep the entire pane history somewhere. (unbounded storage requirement?)
EDIT: Just to clarify, I wouldn't have any trouble implementing any of these three strategies, or a million other ways of doing it; I'm more interested in what the best way of doing it with Dataflow is. What will be most resilient to failure, to having to re-process history for a backfill, etc.?
EDIT 2: There is currently a bug in the Dataflow service where updates fail when adding inputs to a Flatten transform, which means you'll need to discard and rebuild any state accrued in the job if your change to the job includes adding something to a Flatten operation.
You should be able to use triggers and a combine to accomplish this.
PCollection<ID> comments = /* IDs from the source */;
PCollection<KV<ID, Long>> commentCounts = comments
    // Produce speculative results by triggering as data comes in.
    // Note that this won't trigger after *every* element, but it will
    // trigger relatively quickly (as the system divides incoming data
    // into work units). You could also throttle this with something
    // like:
    //   AfterProcessingTime.pastFirstElementInPane()
    //       .plusDelayOf(Duration.standardMinutes(5))
    // which will produce output every 5 minutes.
    .apply(Window.<ID>triggering(
            Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    // Count the occurrences of each ID.
    .apply(Count.perElement());

// Produce an output String -- in your use case you'd want to produce
// a row and write it to the appropriate sink.
commentCounts.apply(ParDo.of(new DoFn<KV<ID, Long>, String>() {
  @Override
  public void processElement(ProcessContext c) {
    KV<ID, Long> element = c.element();
    // The pane info includes details about the pane of the window being
    // processed, including a strictly increasing index of the number of
    // panes that have been produced for this key.
    PaneInfo pane = c.pane();
    c.output(element.getKey() + " | " + pane.getIndex() + " | " + element.getValue());
  }
}));
Depending on your data, you could also read whole comments from the source, extract the ID, and then use Count.perKey() to get the counts for each ID. If you want a more complicated combination, you could look at defining a custom CombineFn and using Combine.perKey.
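For illustration, a rough sketch of what such a custom CombineFn could look like (the class name and the KV<topic ID, event type> input shape are hypothetical, not something from the original pipeline): it counts only the "Comment" events per topic, and would be wired up with the same triggering as the snippet above.

static class CountCommentsFn extends Combine.CombineFn<String, Long, Long> {
  @Override
  public Long createAccumulator() {
    return 0L;
  }

  @Override
  public Long addInput(Long accumulator, String eventType) {
    // Only "Comment" events contribute to the count; "Create" events are ignored.
    return "Comment".equals(eventType) ? accumulator + 1 : accumulator;
  }

  @Override
  public Long mergeAccumulators(Iterable<Long> accumulators) {
    long total = 0;
    for (Long accumulator : accumulators) {
      total += accumulator;
    }
    return total;
  }

  @Override
  public Long extractOutput(Long accumulator) {
    return accumulator;
  }
}

// Usage, given a PCollection<KV<String, String>> of topic ID -> event type:
PCollection<KV<String, Long>> commentCountsPerTopic =
    events.apply(Combine.perKey(new CountCommentsFn()));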
Since BigQuery does not support overwriting rows, one way to go about this is to write the events to BigQuery, and query the data using COUNT:
SELECT ID, COUNT(num_comments) from Table GROUP BY ID;
You can also do per-window aggregations of num_comments within Dataflow before writing the entries to BigQuery; the query above will continue to work.

Redis SortedSet: Does the ZUNIONSTORE command block other concurrent commands?

I want to create a temporary sorted set based on the original one from a timer, maybe with an interval of 4 hours. I'm using the spring-data-redis API to do this.
ZUNIONSTORE tmp 2 A B AGGREGATE MAX
When the ZUNIONSTORE command is executing, will it block other commands such as ZADD, ZREM, ZRANGE, or ZINCRBY on sorted set A or B? I don't know whether this will cause concurrency problems; please give me some suggestions.
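For context, the scheduled spring-data-redis call in question might look roughly like this sketch (assuming a Spring Data Redis version that exposes the Aggregate overload of unionAndStore and that @EnableScheduling is configured somewhere; the class name is hypothetical, and the key names match the command above):

import java.util.Collections;

import org.springframework.data.redis.connection.RedisZSetCommands.Aggregate;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class TmpZSetRefresher {

  private final StringRedisTemplate redisTemplate;

  public TmpZSetRefresher(StringRedisTemplate redisTemplate) {
    this.redisTemplate = redisTemplate;
  }

  // Every 4 hours, run the equivalent of: ZUNIONSTORE tmp 2 A B AGGREGATE MAX
  @Scheduled(fixedRate = 4 * 60 * 60 * 1000)
  public void rebuildTmp() {
    redisTemplate.opsForZSet()
        .unionAndStore("A", Collections.singleton("B"), "tmp", Aggregate.MAX);
  }
}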