What is a regular expression that satisfies all valid options for a JOB card in JCL?

I'm working on a program that will need to remove a JOB card from a JCL member. I'm having a lot of trouble building something that satisfies all possible options and configurations.
Below is a good guide on the JOB statement:
Some issues though:
There may be multiple job cards in a member
There may be comments in the job card
There may be characters in columns 73-80
There may be a SYSAFF, SET or similar statement directly following the JOB statement that should be retained but may begin with slashes and spaces just like a job card
Any help would be appreciated. Currently I have the following regular expression:
Ultimately I only need to change the JOB name to fit the restriction of the FTP JES reader. Under JESINTERFACELEVEL 1, which our site uses, the job name must be the submitting USERID plus exactly one character. Changing only the job name would also be acceptable.

With the information from your comment on Joe's answer, your task becomes easier.
//JJJJJAAA JOB other-stuff
If the second word is JOB and the first two characters of the first word are // and the third character is not *, then you have a JOB card. Remove the first word, replacing it with //JJJJJx, where x is your additional single character. JJJJJ represents the user-id.
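The rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `JOB_CARD` and `rename_job` are hypothetical, each line is handled on its own, and nothing is done to keep the operands in their original columns if the new name has a different length than the old one.

```python
import re

# A JOB card: "//", then a name whose first character is not "*" or a blank,
# then blanks, then JOB as the second word. Group 1 is the old job name.
JOB_CARD = re.compile(r"^//([^*\s]\S{0,7})\s+JOB(?=\s|$)")


def rename_job(line: str, userid: str, suffix: str) -> str:
    """Return the line with its job name replaced; non-JOB lines come back unchanged."""
    match = JOB_CARD.match(line)
    if match is None:
        return line
    new_name = userid + suffix
    if len(suffix) != 1 or len(new_name) > 8:
        raise ValueError("job name must be the user-id plus one character, at most 8 long")
    # Keep everything after the old name (blanks, JOB, operands) exactly as it was.
    return "//" + new_name + line[match.end(1):]
```

For example, `rename_job("//MYUSERA JOB (ACCT),'NAME',CLASS=A", "MYUSER", "X")` gives `//MYUSERX JOB (ACCT),'NAME',CLASS=A`, while comment lines (`//*`) and statements such as `// SET A=B` are left alone.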
This does assume that the user-id of the existing JOBs will be the same as the user-id of the new JOBs, in which case the replacement JOB name is not going to cause the extension of the JOB card.
If this is not the case, that is, if the job name on some or all of the original JOB cards is shorter than the new one (whether or not it is a user-id), then I'd recommend splitting the JOB card after the first comma (if present).
In the unlikely event that you have very long accounting information and nothing else, this may cause a JCL error when the above is true. If so, fix the accounting information or get around the user-id limit. This is an unlikely situation :-)
If there is no accounting information but there is a long comment, this may cause a JCL error by accidentally hitting column 72 with data (so it will think the next line is a continuation). In the unlikely event of that happening, fix it.
Neither of these two is worth coding for. They are worth checking for, though the simplest way to do that is to watch and pick them up if they fall over.
You do have one more thing to watch for, and this is whether any of your steps use DD * or DD DATA. If they do, then you have to discover if any use DLM=. If they do, you will have to switch off the search for the JOB card when encountering DLM=, and switch it on again when you reach the delimiter value starting in column one.
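A minimal sketch of that on/off switch, in Python. The function name `scan_outside_instream` and the `DLM` pattern are hypothetical, and the pattern is a simplification: it accepts a quoted or unquoted delimiter value and does not handle continued DD statements.

```python
import re
from typing import Iterable, Iterator, Tuple

# DLM=xx or DLM='xx' on a JCL statement; the value is the custom delimiter.
DLM = re.compile(r"\bDLM=(?:'([^']+)'|([^,\s]+))")


def scan_outside_instream(lines: Iterable[str]) -> Iterator[Tuple[str, bool]]:
    """Yield (line, searchable); searchable is False inside DLM-delimited data."""
    delimiter = None  # set while we are inside in-stream data with a custom delimiter
    for line in lines:
        if delimiter is not None:
            # Inside in-stream data: these lines are never JCL, even if they start with //.
            if line.startswith(delimiter):
                delimiter = None  # the delimiter in column one ends the data
            yield line, False
            continue
        yield line, True
        is_jcl = line.startswith("//") and not line.startswith("//*")
        found = DLM.search(line) if is_jcl else None
        if found:
            delimiter = found.group(1) or found.group(2)
```

Only lines flagged as searchable would then be passed to the JOB card check.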
Your single character may cause you problems. You will have a limited number of jobnames possible per userid. Unless allowed, JOBs with the same name will not run at the same time.

You will need to account for the two positional parameters -- 142 bytes of accounting information and up to 20 characters for the programmer's name. Also, you will have to account for the optional keyword parameters:
Dealing with JES statements like SYSAFF and other JCL statements like SET makes it very complicated.
You might want to approach it in steps -- a regex to handle the "//" followed by up to 69 bytes, continued when the statement ends with a comma, except for comments, which start with "//*".
It might help to know what you are trying to accomplish. You can ask JES to process the JCL for you and there are ways you can inspect the parsed JCL via macros, exits and control blocks.

In most cases it's the first card anyway. Or at least the first non-comment card.


Map-Reduce with a wait

The concept of map-reduce is very familiar. It seems like a great fit for a problem I'm trying to solve, but either it's missing something or I don't understand the concept well enough.
I have a stream of items, structured as follows:
{
  "jobId": 777,
  "numberOfParts": 5,
  "data": "some data..."
}
I want to do a map-reduce on many such items.
My mapping operation is straightforward - take the jobId.
My reduce operation is irrelevant for this phase, but all we know is that it takes multiple strings (the "some data..." part) and somehow reduces them to a single object.
The only problem is - I need all five parts of this job to complete before I can reduce all the strings into a single object. Every item has a "numberOfParts" property which indicates the number of items I must have before I apply the reduce operation. The items are not ordered, therefore I don't have a "partId" field.
Long story short - I need to apply some kind of a waiting mechanism that waits for all parts of the job to complete before initiating the reduce operation, and I need this waiting mechanism to rely on a value that exists within the payload (therefore solutions like kafka wouldn't work).
Is there a way to do that, hopefully using a single tool/framework?
I only want to write the map/reduce part and the "waiting" logic, the rest I believe should come out of the box.
**** EDIT ****
I'm currently in the design phase of the project and therefore not using any framework (such as spark, hadoop, etc...)
I asked this because I wanted to find out the best way to tackle this problem.
"Waiting" is not the correct approach.
Assuming your jobId is the key, and each item's data is one part of the job, then you need multiple reducers: one that gathers all parts of the same job, then another that processes every job whose collection of parts has reached numberOfParts, while ignoring the others.
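As a rough single-process illustration of that idea (not tied to any framework), the gathering step and the "enough parts" check can be sketched in Python. The function name `reduce_complete_jobs` and the `reduce_parts` callback are hypothetical, and in a real distributed setup the two steps would be separate stages keyed by jobId.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def reduce_complete_jobs(
    items: Iterable[dict],
    reduce_parts: Callable[[List[str]], object],
) -> Iterator[Tuple[int, object]]:
    """Group items by jobId and reduce a job once all of its parts have arrived."""
    pending: Dict[int, List[str]] = defaultdict(list)
    for item in items:
        job_id = item["jobId"]
        pending[job_id].append(item["data"])
        # numberOfParts travels in the payload, so each item tells us when to stop gathering.
        if len(pending[job_id]) >= item["numberOfParts"]:
            yield job_id, reduce_parts(pending.pop(job_id))
```

Jobs that never receive all of their parts simply stay in `pending` and are never reduced, which matches the "ignoring others" part of the answer.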

One-line regex independent of the number of items

Can I have a one-line regex that matches the values between pipes ("|"), independent of the number of items between the pipes? E.g. I have the following regex:
which works only if I have 12 items. How can I make the same work for e.g. 6 items as well?
This is the pattern I've used in the past for that purpose. It matches one or more groups that do not contain the pipe delimiter.
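The pattern itself is not shown here. As an assumption about what such a pattern could look like, the sketch below uses `[^|]+` with Python's `re` module, where `findall` returns every match as a list regardless of how many items there are. The names `ITEM` and `split_items` are hypothetical.

```python
import re
from typing import List

# One or more characters that are not the pipe delimiter.
ITEM = re.compile(r"[^|]+")


def split_items(value: str) -> List[str]:
    """Return every pipe-delimited item, however many there are."""
    return ITEM.findall(value)
```

The same call handles 6 items or 12 items, because the number of matches is not fixed by the pattern.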
For Adobe Classification Rule Builder (CRB), there is no way to write a regex that will match an arbitrary number of your pattern and push them to $n capture group. Most regex engines do not allow for this, though some languages offer certain ways to more or less effectively do this as returned arrays or whatever. But CRB doesn't offer that sort of thing.
But it's mostly pointless to want this anyway, since nothing upstream or downstream dynamically accommodates this sort of thing.
For example, there's no way in the CRB interface to dynamically populate the output value with an arbitrary $1$2$3[$n..] value, nor is there a way to dynamically generate an arbitrary number of rules in the rule set.
In addition, Adobe Analytics (AA) does not offer arbitrary on-the-fly classification column generation anyways (unless you want to write a script using the Classification API, but you can't say the same for CRBs anyways).
For example if you have
And you want to classify this into 2 classification columns/reports, you have to go and create them in the classification interface. And then let's say your next value sent in is:
Well AA does not automatically create a new classification level for you; you have to go in and add a 3rd, and so on.
So overall, even though it is not possible to return an arbitrary number of captured groups $n in a CRB, there isn't really a reason you need to.
Perhaps it would help if you explain what you are actually trying to do overall? For example, what report(s) do you expect to see?
One common reason I see this sort of "wish" come up for is when someone wants to track stuff like header or breadcrumb navigation links that have an arbitrary depth to them. So they push e.g. a breadcrumb
Home > Electronics > Computers > Monitors > LED Monitors
...or whatever to an eVar (but pipe delimited, based on your question), and then they want to break this up into classified columns.
And the problem is, it could be an arbitrary length. But as mentioned, setting up classifications and rules for them doesn't really accommodate this sort of thing.
Usually the best practice for a scenario like this is to look at the raw data and see how many levels cover the bulk of your values. For example, if your raw eVar report has values with up to 5 or 6 levels, but most values have between 1 and 3 levels, then you should create 4 classification columns. The first 3 classifications represent the first 3 levels, and the 4th one will have everything else.
So going back to the example value:
Home|Electronics|Computers|Monitors|LED Monitors
You can have:
Level1 => Home
Level2 => Electronics
Level3 => Computers
Level4+ => Monitors|LED Monitors
Then you set up a CRB with 4 rules, one for each of the levels. And you'd use the same regex in all 4 rule rows:
Which will return the following captured groups to use in the CRB outputs:
$1 => Home
$2 => Electronics
$3 => Computers
$4 => Monitors|LED Monitors
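The regex used in the rule rows is not shown here. As an assumption, one pattern that yields the four groups listed above is sketched below and checked with Python's `re` module rather than in CRB itself, so the exact syntax accepted by CRB may differ. The names `LEVELS` and `classify` are hypothetical.

```python
import re
from typing import Optional, Tuple

# Levels 1-3 are single pipe-free items; group 4 takes everything that is left,
# including any further pipes.
LEVELS = re.compile(r"^([^|]+)(?:\|([^|]+))?(?:\|([^|]+))?(?:\|(.+))?$")


def classify(value: str) -> Optional[Tuple[Optional[str], ...]]:
    """Return the four captured levels, with None for levels that are absent."""
    match = LEVELS.match(value)
    return match.groups() if match else None
```

With the example value, `classify("Home|Electronics|Computers|Monitors|LED Monitors")` returns `("Home", "Electronics", "Computers", "Monitors|LED Monitors")`, and a shorter value such as `"Home|Electronics"` leaves the last two groups empty.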
Yeah, this isn't the same as having a classification column for every possible length, but it is more practical, because when it comes to analytics, you shouldn't really try to be too granular about things in the first place.
But if you absolutely need to have something for every possible amount of delimited values, you will need to find out what the max possible is and make that many, hard coded.
Or, instead of classifications, consider one of the following alternatives:
Use a list prop
Use a list variable (e.g. list1)
Use a Merchandising eVar (product variable syntax)
This isn't exactly the same thing, and they each have their caveats, but you didn't provide details for what you are ultimately trying to get out of the reports, so this may or may not be something you can work with.
Well anyways, hopefully some of this is food for thought for you.

Storm and stop words

I am new to the Storm framework (https://storm.incubator.apache.org/about/integrates.html).
I tested my code locally and I think it will perform better if I remove stop words, but I searched online and couldn't find any example of removing stop words in Storm.
If the size of the stop words list is small enough to fit in memory, the most straightforward approach would be to simply filter the tuples with an implementation of Storm's Filter that knows that list. This Filter could possibly poll the DB every so often to get the latest list of stop words if this list evolves over time.
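The logic of such a filter, independent of Storm's own API, can be sketched in Python. This is not Storm code: the class `StopWordFilter`, its `load_stop_words` callback (standing in for the DB lookup) and the `refresh_seconds` parameter are all hypothetical names for illustration.

```python
import time
from typing import Callable, Set


class StopWordFilter:
    """Keeps the stop-word list in memory and reloads it every refresh_seconds."""

    def __init__(self, load_stop_words: Callable[[], Set[str]], refresh_seconds: float = 300.0):
        self._load = load_stop_words
        self._refresh_seconds = refresh_seconds
        self._stop_words = load_stop_words()
        self._loaded_at = time.monotonic()

    def is_keep(self, word: str) -> bool:
        """Return True if the word should pass through (it is not a stop word)."""
        if time.monotonic() - self._loaded_at >= self._refresh_seconds:
            # Poll the source again so the list can evolve over time.
            self._stop_words = self._load()
            self._loaded_at = time.monotonic()
        return word.lower() not in self._stop_words
```

In a topology, the equivalent check would sit inside the filter's per-tuple method, with the word taken from the tuple.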
If the size of the stop words list is bigger, then you can use a QueryFunction, called from your topology with the stateQuery function, which would:
receive a batch of tuples to check (say 10000 at a time)
build a single query from their content and look up corresponding stop words in persistence
attach a boolean to each tuple specifying what to do with each one
+ add a Filter right after that to filter based on that boolean.
And if you feel adventurous:
Another and faster approach would be to use a bloom filter approximation. I heard that Algebird is meant to provide this kind of functionality and targets both Scalding and Storm (how cool is that?), but I don't know how stable it is nor do I have any experience in practically plugging it into Storm (maybe Sunday if it's rainy...).
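To make the idea concrete, here is a small hand-rolled Bloom filter in Python. It is not Algebird and not a Storm integration, only a sketch of the data structure; the class name, `size_bits` and `num_hashes` are hypothetical choices. A Bloom filter never misses a word that was added, but it can occasionally report a word that was never added, so a few non-stop-words may be dropped.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word: str) -> Iterator[int]:
        # Derive num_hashes bit positions by hashing the word with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))
```

The stop words would be added once with `add`, and each incoming word checked with `might_contain` before deciding whether to drop it.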
Also, Cascading (which is not directly related to Storm but has a very similar set of primitive abstractions on top of map reduce) suggests in this tutorial a method based on left joins. Such joins exist in Storm and the right branch could possibly be fed with a FixedBatchSpout emitting all stop words every time, or even a custom spout that reads the latest version of the list of stop words from persistence every time, so maybe that would work too? Maybe? This also assumes the size of the stop words list is relatively small though.

repeated Regexp in BigQuery

We have a debate regarding the best way to use a regular expression in a CASE clause.
We need a CASE operation on an extracted value.
This can be expressed in several ways.
The question is: which one will be more efficient? Does BQ process the regex several times if it appears in several locations?
I have adapted my code to run on the Wikipedia data sample.
-- 1st: extract the version once in a subquery, then branch on the result
Select case when PS_Version='1' then '1st'
when PS_Version='2' then '2nd'
when PS_Version='3' then '3rd'
else 'other' end as PS_VersionOrder
FROM
(SELECT regexp_extract(title,r'PlayStation (\d+)') as PS_Version
FROM [publicdata:samples.wikipedia] A
where title like '%PlayStation%'
limit 100)

-- 2nd: repeat the regex in every branch, no subquery
Select case when regexp_extract(title,r'PlayStation (\d+)')='1' then '1st'
when regexp_extract(title,r'PlayStation (\d+)')='2' then '2nd'
when regexp_extract(title,r'PlayStation (\d+)')='3' then '3rd'
else 'other' end as PS_VersionOrder
FROM [publicdata:samples.wikipedia] A
where title like '%PlayStation%'
limit 100
The regex people claim the 1st will be more efficient. The DB people prefer the 2nd one as it does not involve subqueries...
I'd agree with what Alex said, but I'll add that the first query is also going to be better from an execution standpoint. BigQuery does subqueries very efficiently, but may not do the common subexpression elimination in the case clause (it might, however, but you shouldn't rely on it).
IMO, I would choose the 1st.
Why ?
1. Maintenance
Although the 2nd doesn't contain subqueries, it duplicates the regex. If you decide to change this regex later, it makes the maintenance more difficult.
2. Readability
The 2nd is less readable. You must read long, redundant case statements before you understand the code.
3. User experience
The 2nd and the 1st may differ in performance. You should measure the time needed to run the two queries. Then check whether the difference, if there is one, has a noticeable impact on your final user experience.
If the 2nd beats the 1st with 100 ms for instance, a human won't notice it.
If the query is involved in a nightly batch, use the 1st approach.

Underlying mechanism in firing SQL Queries in Oracle

When we fire a SQL query like
What exactly happens internally? Is there any parser at work? Is it in C/C++ ?
Can anybody please explain?
Thanks in advance to all.
Short answer is yes, of course there is a parser module inside Oracle that interprets the statement text. My understanding is that the bulk of Oracle's source code is in C.
For general reference:
Any SQL statement potentially goes through three steps when Oracle is asked to execute it. Often, control is returned to the client between each of these steps, although the details can depend on the specific client being used and the manner in which calls are made.
(1) Parse -- I believe the first action is actually to check whether Oracle has a cached copy of the exact statement text. If so, it can save the work of parsing your statement again. If not, it must of course parse the text, then determine an execution plan that Oracle thinks is optimal for the statement. So conceptually at least there are two entities at work in this phase -- the parser and the optimizer.
(2) Execute -- For a SELECT statement this step would generally run just enough of the execution plan to be ready to return some rows to the client. Depending on the details of the plan, that might mean running the whole thing, or might mean doing just a very small fraction of the work. For any other kind of statement, the execute phase is when all of the work is actually done.
(3) Fetch -- This is when rows are actually returned to the client. Generally the client has a predetermined fetch array size which sets the maximum number of rows that will be returned by a single fetch call. So there may be many fetches made for a single statement. Of course if the statement is one that cannot return rows, then there is no fetch step necessary.
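The execute and fetch steps, and the effect of the fetch array size, can be seen from the client side with any Python DB-API cursor. The sketch below uses the standard-library `sqlite3` module purely as a stand-in so that it runs anywhere; it is not Oracle, and it shows only the client's view, not Oracle's internals. The function name `fetch_in_batches` is hypothetical.

```python
import sqlite3
from typing import List


def fetch_in_batches(connection: sqlite3.Connection, sql: str, batch_size: int) -> List[list]:
    """Execute a query and fetch its rows batch_size rows at a time."""
    cursor = connection.cursor()
    cursor.arraysize = batch_size  # maximum number of rows returned per fetch call
    cursor.execute(sql)            # the statement is parsed and execution begins here
    batches = []
    while True:
        rows = cursor.fetchmany()  # fetchmany() defaults to cursor.arraysize rows
        if not rows:
            break
        batches.append(rows)
    return batches
```

A query returning 5 rows with a batch size of 2 therefore needs three fetch calls that return rows (2, 2 and 1), plus a final empty one that signals the end.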
I think internally Oracle would have its own parser, which parses the query and tries to compile it. I think it's not related to C or C++.
But need to confirm.
-Justin Samuel.