Quartz: wait for a set of jobs to finish (map-reduce)

We have a Quartz job that does a lot of calculations and is taking a while to complete. In order to speed it up we want to split the primary job so that it starts multiple smaller jobs that do the calculations and return the result. After all the small jobs complete, we need a final job that will pull the subtotals together.
Currently the idea is that each small job will write to a store, and when creating the final job we pass all the small job names to it via its JobDataMap. The final job will look for those jobs in the scheduler and reschedule itself if any are still found; otherwise it will run the totals.
Is there a better way to accomplish this in Quartz?
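For reference, the check-and-reschedule approach described above might look roughly like the sketch below in Quartz 2.x. This is only an illustration: the JobDataMap key, the 30-second retry delay, and the assumption that each small job is removed from the scheduler when it finishes are all hypothetical.

```java
import java.util.Date;

import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.JobKey;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

// Sketch of a final "totals" job that reschedules itself while any small job
// still exists in the scheduler. "smallJobNames" and the 30s delay are assumptions.
public class TotalsJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        try {
            Scheduler scheduler = context.getScheduler();
            String[] smallJobNames =
                    context.getMergedJobDataMap().getString("smallJobNames").split(",");

            for (String name : smallJobNames) {
                if (scheduler.checkExists(JobKey.jobKey(name))) {
                    // A small job is still registered: try again in 30 seconds.
                    Trigger retry = TriggerBuilder.newTrigger()
                            .forJob(context.getJobDetail())
                            .startAt(new Date(System.currentTimeMillis() + 30_000))
                            .build();
                    scheduler.scheduleJob(retry);
                    return;
                }
            }
            // All small jobs are gone: read their results from the store and total them here.
        } catch (SchedulerException e) {
            throw new JobExecutionException(e);
        }
    }
}
```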

This isn't necessarily answering the question, but I'm afraid I don't think Quartz is the tool for the job here. It's a scheduler, not a mechanism for load balancing. You could look at using Quartz in combination with NServiceBus or MassTransit. The job could fire multiple messages for the small jobs, maybe even using the same message type and a Distributor, and use a Saga to pull everything back together.


How to read/write a value from/in the Context Object in CDK Step Functions

I am writing, in AWS CDK, a Step Function which runs two tasks in parallel. I would like to access, from one of the tasks, a value of the second task running in parallel (for example, I would like to know in task 1 the time task 2 started, or maybe the id of task 2).
Here is a screenshot of the state machine definition in Step Functions.
In the example in the screenshot, I would like to use the Id of GlueStartRunJob (1) in the other GlueStartRunJob.
I was thinking about using the Context Object for that purpose. Nevertheless, I am not sure if this is the right approach...
The Context Object is read-only and allows a given state to access contextual information about itself, not about other states elsewhere in the workflow.
I'm not 100% clear what you are aiming to accomplish here, but I can see a couple of possible approaches.
First, you might just want to order these Glue Jobs to run sequentially so the output from the first can be used in the second.
Second, if you need the workflow to take action after the Glue Jobs have started but before they have completed, you'd need to take an approach that does not use the .sync integration pattern. With this integration pattern, Step Functions puts a synchronous facade over an asynchronous interaction, taking care of the steps to track completion and return you the results. You could instead use the default RequestResponse pattern to start the jobs in your parallel state, then do whatever you needed to after. You'd then need to include your own polling logic if you wanted the workflow to wait for completion of the jobs and return data on them or take action on completion. You can see an example of such polling for Glue Crawlers in this blog post (for which you can find sample code here).
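If it helps, a rough CDK v2 sketch in Java of starting two Glue jobs with the default RequestResponse pattern inside a Parallel state might look like the following. The job names and construct IDs are assumptions, and you would still add your own polling states afterwards if you need to wait on the runs.

```java
import software.amazon.awscdk.services.stepfunctions.IntegrationPattern;
import software.amazon.awscdk.services.stepfunctions.Parallel;
import software.amazon.awscdk.services.stepfunctions.tasks.GlueStartJobRun;
import software.constructs.Construct;

// Sketch: start two Glue jobs in parallel without waiting for them to finish.
// "job-a", "job-b" and the construct IDs are placeholder names.
public class StartGlueJobs {

    public static Parallel build(Construct scope) {
        GlueStartJobRun startJobA = GlueStartJobRun.Builder.create(scope, "GlueStartJobRunA")
                .glueJobName("job-a")
                .integrationPattern(IntegrationPattern.REQUEST_RESPONSE) // start only, no .sync
                .build();

        GlueStartJobRun startJobB = GlueStartJobRun.Builder.create(scope, "GlueStartJobRunB")
                .glueJobName("job-b")
                .integrationPattern(IntegrationPattern.REQUEST_RESPONSE)
                .build();

        // Both starts return immediately; chain your own polling states after this
        // Parallel state if the workflow needs to wait for the runs to complete.
        return new Parallel(scope, "StartGlueJobs").branch(startJobA).branch(startJobB);
    }
}
```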

Jenkins builds and queue management

I'm trying to improve our queue manager, and what I'd like to do is this:
There are two types of triggers that can start a job (in this case regular and upstream). If there are ever a regular build and an upstream build in the queue, the upstream build should always execute and we cancel the regular build. And if there are ever multiple queued instances with the same trigger (for the same job), we always take the first one and cancel the rest; we don't want duplicate builds in the queue.
These are triggers for the same job; this has nothing to do with concurrency across other jobs!
How can I achieve this? Using Groovy, how can I get a list of triggers for the job and apply the logic I described above? Is there a plugin that'll solve my problem?
I'm new to Groovy and Jenkins, so maybe I'm trying to reinvent the wheel here.
It might not do exactly what you want, but take a look at the Accelerated Build Now plugin in combination with the Priority Sorter plugin.
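If you do end up scripting it yourself, a rough sketch of the de-duplication and upstream-wins logic from the question, written against the Jenkins queue API (the same calls can be used from a system Groovy script), might look like this. The job name is an assumption, and the queue ordering used to decide which item counts as the "first" may need adjusting.

```java
import java.util.HashSet;
import java.util.Set;

import hudson.model.Cause;
import hudson.model.Queue;
import jenkins.model.Jenkins;

// Sketch: for one job ("my-app" is a placeholder), cancel queued regular builds
// when an upstream-triggered build is queued, and cancel duplicate items of the
// same trigger type. Note that Queue.getItems() is not strictly in submission
// order, so "first" may need a different definition in practice.
public class QueueCleaner {

    public static void clean() {
        Queue queue = Jenkins.get().getQueue();
        Queue.Item[] items = queue.getItems();

        boolean upstreamQueued = false;
        for (Queue.Item item : items) {
            if ("my-app".equals(item.task.getName()) && isUpstream(item)) {
                upstreamQueued = true;
            }
        }

        Set<String> seenTriggerTypes = new HashSet<>();
        for (Queue.Item item : items) {
            if (!"my-app".equals(item.task.getName())) {
                continue; // only manage this one job's queue entries
            }
            String triggerType = isUpstream(item) ? "upstream" : "regular";

            // Cancel regular builds if an upstream build is waiting, and cancel
            // every item after the first one of each trigger type.
            if ((upstreamQueued && "regular".equals(triggerType))
                    || !seenTriggerTypes.add(triggerType)) {
                queue.cancel(item);
            }
        }
    }

    private static boolean isUpstream(Queue.Item item) {
        for (Cause cause : item.getCauses()) {
            if (cause instanceof Cause.UpstreamCause) {
                return true;
            }
        }
        return false;
    }
}
```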

How to see the full build queue in Jenkins

Our Jenkins instance has a job for our main application. It builds all git branches in the one job, and so can sometimes get pretty far behind. However, the Build Queue on the lefthand side only ever shows the next job, not all the others. Is there a way to see all the queued executions of a single job? Ideally it'd even show the branch as well.
I'm aware of solutions like creating a new job for each branch, but this really clutters up the already horrible interface, and I'd rather avoid that.
For a single job with the same parameters, Jenkins doesn't place a build in the queue if one is already contained in the queue. You can use a simple trick: add an unused parameter and set some random value for it every time you run the job. Now you can have multiple queued builds for the same job.
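As a small illustration of that trick, triggering the job programmatically with a random value for the unused parameter could look roughly like this (the job name and parameter name are placeholders, and the job must already define the parameter):

```java
import java.util.UUID;

import hudson.model.AbstractProject;
import hudson.model.Cause;
import hudson.model.ParametersAction;
import hudson.model.StringParameterValue;
import jenkins.model.Jenkins;

// Sketch: queue a build of "my-app" with a random value for an otherwise unused
// "NONCE" parameter, so identical requests are no longer collapsed in the queue.
public class QueueWithNonce {

    public static void queueBuild() {
        AbstractProject<?, ?> project =
                Jenkins.get().getItemByFullName("my-app", AbstractProject.class);
        project.scheduleBuild2(0, new Cause.UserIdCause(),
                new ParametersAction(
                        new StringParameterValue("NONCE", UUID.randomUUID().toString())));
    }
}
```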

How do you model a business workflow in ColdFusion?

Since there's no complete BPM framework/solution in ColdFusion as of yet, how would you model a workflow in a ColdFusion app so that it is easily extensible and maintainable?
A business workflow is more than a flowchart that maps nicely into a programming language. For example:
How do you model a task X that is followed by multiple tasks Y0, Y1, Y2 happening in parallel, where Y0 is a human process (it needs to wait for input), Y1 is a web service that might go wrong and might need automatic retries, and Y2 is an automated process, followed by a task Z that should only be carried out once all the Y's are complete?
My thoughts...
Seems like I need to do a whole lot of storing / managing / keeping track of states, and frequent checking with cfschedule.
cfthread isn't going to help much since some tasks can take days (e.g. waiting for a user's confirmation).
I can already imagine the flow being spread across multiple UDFs, the DB, and CFCs.
Is there any open-source workflow engine in another language that we could maybe port over to CF?
Thank you for your brain power. :)
Study the jBPM Process Definition Language (jPDL) specification; JBoss has an execution engine for it. Using this Java-based engine may be your easiest solution, and it solves many of the problems you've outlined.
If you intend to write your own, you will probably end up modelling states and transitions: vertices and edges in a directed graph. These, as Ciaran Archer wrote, are the components of a state machine. The best persistence approach IMO is to capture versions of whatever data is being sent through the workflow via serialization, along with the current state and a history of transitions between states and changes to that data. The mechanism probably also needs a way to keep track of who or what has responsibility for taking the next action against the workflow.
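As a bare-bones illustration of that states-and-transitions model (not a full engine), the sketch below keeps a directed graph of allowed transitions plus a history of who moved the workflow and when. The state names are invented, and in practice each transition and a serialized copy of the workflow data would be persisted to the database.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a workflow instance as a state machine: a directed graph of allowed
// transitions plus a transition history. Persistence of the current state, the
// history, and the serialized workflow data is left out.
public class WorkflowInstance {

    enum State { DRAFT, AWAITING_APPROVAL, CALLING_SERVICE, COMPLETE, FAILED }

    record Transition(State from, State to, String actor, Instant at) {}

    // Edges of the directed graph: which states each state may move to.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(Map.of(
            State.DRAFT, EnumSet.of(State.AWAITING_APPROVAL),
            State.AWAITING_APPROVAL, EnumSet.of(State.CALLING_SERVICE, State.FAILED),
            // CALLING_SERVICE may loop to itself to model an automatic retry.
            State.CALLING_SERVICE, EnumSet.of(State.CALLING_SERVICE, State.COMPLETE, State.FAILED),
            State.COMPLETE, EnumSet.noneOf(State.class),
            State.FAILED, EnumSet.noneOf(State.class)));

    private State current = State.DRAFT;
    private final List<Transition> history = new ArrayList<>();

    public void moveTo(State next, String actor) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(current + " -> " + next + " is not an allowed transition");
        }
        history.add(new Transition(current, next, actor, Instant.now()));
        current = next; // in a real system, persist the new state and the history row here
    }
}
```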
Based on your question, one thing to consider is whether you really need to represent parallel tasks in your solution, or whether it might instead be possible to enqueue a set of messages and then specify a wait state for all of them to complete. Representing actual parallelism implies you are moving data simultaneously through several different processes; in that case, when the processes join again you need an algorithm to resolve the deltas, which is very much a non-trivial task.
In the context of ColdFusion and what you're trying to accomplish, a scheduled task may be necessary if the system you're writing needs to poll other systems. Consider WDDX as a serialization format; JSON, while seductively simple, has (as I recall) some edge cases around numbers and dates that can cause you grief.
Finally see my answer to this question for some additional thoughts.
Off the top of my head I'm thinking about the State design pattern with state persisted to a database. Check out the Head First Design Patterns' Gumball Machine example.
Generally this will work if you have something (like a client / order / etc.) going through a number of changes of state.
Different things will happen to your object depending on what state you are in, and that might mean sitting in a database table waiting for a flag to be updated by a user manually.
In terms of other languages I know Grails has a workflow module available. I don't know if you would be better off porting to CF or jumping ship to Grails (right tool for the job and all that).
It's just a thought, hope it helps.

Is MapReduce right for me?

I am working on a project that deals with analyzing a very large amount of data, so I discovered MapReduce fairly recently, and before I dive any further into it, I would like to make sure my expectations are correct.
The interaction with the data will happen from a web interface, so response time is critical here; I am thinking of a 10-15 second limit. Assuming my data will be loaded into a distributed file system before I perform any analysis on it, what kind of performance can I expect?
Let's say I need to filter a simple 5GB XML file that is well-formed, has a fairly flat data structure and 10,000,000 records in it. And let's say the output will result in 100,000 records. Is 10 seconds possible?
If it is, what kind of hardware am I looking at?
If not, why not?
I put the example down, but now wish that I didn't. 5GB was just a sample I was talking about, and in reality I would be dealing with a lot of data. 5GB might be the data for one hour of the day, and I might want to identify all the records that meet certain criteria.
A database is really not an option for me. What I wanted to find out is the fastest performance I can expect from using MapReduce. Is it always in minutes or hours? Is it never seconds?
MapReduce is good for scaling the processing of large datasets, but it is not intended to be responsive. In the Hadoop implementation, for instance, the overhead of startup usually takes a couple of minutes alone. The idea here is to take a processing job that would take days and bring it down to the order of hours, or hours to minutes, etc. But you would not start a new job in response to a web request and expect it to finish in time to respond.
To touch on why this is the case, consider the way MapReduce works (general, high-level overview):
1. A bunch of nodes receive portions of the input data (called splits) and do some processing (the map step).
2. The intermediate data (output from the last step) is repartitioned such that data with like keys ends up together. This usually requires some data transfer between nodes.
3. The reduce nodes (which are not necessarily distinct from the mapper nodes - a single machine can do multiple jobs in succession) perform the reduce step.
4. Result data is collected and merged to produce the final output set.
While Hadoop, et al try to keep data locality as high as possible, there is still a fair amount of shuffling around that occurs during processing. This alone should preclude you from backing a responsive web interface with a distributed MapReduce implementation.
Edit: as Jan Jongboom pointed out, MapReduce is very good for preprocessing data such that web queries can be fast BECAUSE they don't need to engage in processing. Consider the famous example of creating an inverted index from a large set of webpages.
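To make the filtering use case from the question concrete, a minimal map-only Hadoop job might look like the sketch below (it assumes the input has already been flattened to one record per line, and the filter condition and paths are placeholders). Even a trivial job like this still pays the cluster start-up overhead described above.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a map-only Hadoop job that keeps records matching a placeholder condition.
// Assumes the input has been flattened to one record per line.
public class FilterRecords {

    public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.toString().contains("criteria")) {   // placeholder filter condition
                context.write(line, NullWritable.get());  // emit matching records unchanged
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-records");
        job.setJarByClass(FilterRecords.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0);                 // map-only: no shuffle or reduce step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```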
A distributed implementation of MapReduce, such as Hadoop, is not a good fit for processing a 5GB XML file.
Hadoop works best on large amounts of data. Although 5GB is a fairly big XML file, it can easily be processed on a single machine.
Input files to Hadoop jobs need to be splittable so that different parts of the file can be processed on different machines. Unless your XML is trivially flat, the splitting of the file will be non-deterministic, so you'll need a pre-processing step to format the file for splitting.
If you had many 5GB files, then you could use Hadoop to distribute the splitting. You could also use it to merge results across files and store the results in a format for fast querying by your web interface, as other answers have mentioned.
MapReduce is a generic term. You probably mean to ask whether a fully featured MapReduce framework with job control, such as Hadoop, is right for you. The answer still depends on the framework, but usually the job control, network, data replication, and fault tolerance features of a MapReduce framework make it suitable for tasks that take minutes, hours, or longer, and that's probably the short and correct answer for you.
The MapReduce paradigm might be useful to you if your tasks can be split among independent mappers and combined with one or more reducers, and the language, framework, and infrastructure that you have available let you take advantage of that.
There isn't necessarily a distinction between MapReduce and a database. A declarative language such as SQL is a good way to abstract parallelism, as are queryable MapReduce frameworks such as HBase. This article discusses MapReduce implementations of a k-means algorithm, and ends with a pure SQL example (which assumes that the server can parallelize it).
Ideally, a developer doesn't need to know too much about the plumbing at all. Erlang examples like to show off how the functional language features handle process control.
Also, keep in mind that there are lightweight ways to play with MapReduce, such as bashreduce.
I recently worked on a system that processes roughly 120GB/hour with 30 days of history. We ended up using Netezza for organizational reasons, but I think Hadoop may be an appropriate solution depending on the details of your data and queries.
Note that XML is very verbose. One of your main costs will be reading from and writing to disk. If you can, choose a more compact format.
The number of nodes in your cluster will depend on the type and number of disks and CPUs. You can assume, for a rough calculation, that you will be limited by disk speed. If your 7200rpm disk can scan at 50MB/s and you want to scan 500GB in 10s, then you need about 1,000 nodes (500GB in 10s is 50GB/s of aggregate throughput, which at 50MB/s per disk works out to 1,000 disks).
You may want to play with Amazon's EC2, where you can stand up a Hadoop cluster and pay by the minute, or you can run a MapReduce job on their infrastructure.
It sounds like what you might want is a good old-fashioned database. It's not quite as trendy as map/reduce, but often sufficient for small jobs like this. Depending on how flexible your filtering needs to be, you could either just import your 5GB file into a SQL database, or implement your own indexing scheme by storing records in different files, storing everything in memory in a giant hashtable, or whatever is appropriate for your needs.