extract alpha numeric id - regex

I am looking to use a regular expression to extract an ID and a description from input in the following format:
TTTT.1.A8This is important
AA.1.2.2ANothing is sometimes important
AAC.1A.2Everything sometimes is not important
Expected result:
ID description
TTTT.1.A8 This is important
AA.1.2.2A Nothing is sometimes important
AAC.1A.2 Everything sometimes is not important
I tried to achieve it as below:
import re

img1 = re.compile(r"\w+\.\d+")
control_ids = {}
for i in input:
    if re.search(img1, i.text):
        req_id = str.strip(re.search(img1, i.text).group(0))
        req_text = str.strip(re.split(img1, i.text)[1])
        control_ids[req_id] = req_text

Find an uppercase letter followed by a lowercase letter:
/([A-Z][a-z])/g
Then replace each match with a space followed by the match itself: the replacement string is " $1" in JavaScript, or " \g<1>" in Python.
const str =`TTTT.1.A8This is important
AA.1.2.2ANothing is sometimes important
AAC.1A.2Everything sometimes is not important`;
const rgx = new RegExp(/([A-Z][a-z])/, 'g');
const output = str.replace(rgx, ` $1`);
console.log(output);
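Since the question is in Python, the same idea can be sketched there. This is a minimal sketch, not the answer's own code: it assumes every description starts with an uppercase letter followed by a lowercase one and that no such pair occurs inside the ID. The names SPLIT_RE and split_id_description are made up for illustration.

```python
import re

# Lazily capture everything up to the first uppercase letter that is
# followed by a lowercase letter; the rest of the line is the description.
SPLIT_RE = re.compile(r"(.*?)([A-Z][a-z].*)")

def split_id_description(line):
    """Return (id, description), or None if the line does not fit the pattern."""
    m = SPLIT_RE.fullmatch(line.strip())
    if m is None:
        return None
    return m.group(1), m.group(2)

lines = [
    "TTTT.1.A8This is important",
    "AA.1.2.2ANothing is sometimes important",
    "AAC.1A.2Everything sometimes is not important",
]

control_ids = {}
for line in lines:
    parsed = split_id_description(line)
    if parsed is not None:
        req_id, req_text = parsed
        control_ids[req_id] = req_text

for req_id, req_text in control_ids.items():
    print(req_id, req_text)
```

Capturing the two parts directly avoids inserting a space and splitting again afterwards.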

How to extract part of a string with slash constraints?

Hello I have some strings named like this:
BURGERDAY / PPA / This is a burger fest
I've tried using regex to get it but I can't seem to get it right.
The output should be just the final part, This is a burger fest (without the leading whitespace).
Here, we can capture the desired output after the last slash followed by one or more whitespace characters:
.+\/\s+(.+)
where (.+) collects what we wish to return.
const regex = /.+\/\s+(.+)/gm;
const str = `BURGERDAY / PPA / This is a burger fest`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
Advice
Based on revo's advice, we can also use this expression, which is much better:
\/ +([^\/]*)$
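According to Bohemian's advice, escaping the forward slash may not be required, depending on the language. In JavaScript, for example, the unescaped form works when the pattern is passed as a string to the RegExp constructor (inside a /.../ literal the slash still has to be written as \/):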
.+/\s+(.+)
Also, we assume that the target content itself contains no forward slash; otherwise, the constraints would have to change to fit the other possible inputs.
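As a rough illustration of the end-anchored expression in a language where the slash needs no escaping, here is a minimal Python sketch. The helper name last_segment is hypothetical, and the same assumption applies: the wanted text contains no forward slash.

```python
import re

def last_segment(text):
    """Return the text after the final '/', without the spaces that follow it."""
    # '/ +' matches a slash and the spaces after it; '[^/]*$' then requires
    # that no further slash occurs before the end of the string.
    m = re.search(r"/ +([^/]*)$", text)
    return m.group(1) if m else None

print(last_segment("BURGERDAY / PPA / This is a burger fest"))
```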
Note: This is a Python-ish answer (my mistake). I'll leave it here for its value, as the approach could apply to many languages.
Another approach is to split it and then rejoin it.
data = 'BURGERDAY / PPA / This is a burger fest'
Here it is in four steps:
parts = data.split('/') # break into a list by '/'
parts = parts[2:] # get a new list excluding the first 2 elements
output = '/'.join(parts) # join them back together with a '/'
output = output.strip() # strip spaces from each side of the output
And in one concise line:
output = str.join('/', data.split('/')[2:]).strip()
Note: I feel that str.join(..., ...) is more readable than '...'.join(...) in some contexts. It is the identical call though.

How to specify regex for route on content processor in NiFi?

In NiFi, I am routing flowfiles based on their content using the RouteOnContent processor. How can I route by specifying a regex?
My input content is:
{
"testreg":{
"test1":"test2",
"test3":"test4"
}
}
I wanted to test whether the whole word testreg is present in the flowfile content.
So I tried the following:
testreg
(testreg)
.*testreg.*
(.*testreg.*)
But none of them matches the content. What is the correct regex to use in NiFi?
Edit: It would make sense to also check that the pattern we are looking for is surrounded by quotes and followed by a colon, since testreg could simply occur somewhere else too. With the greedy expressions below, we would then capture the last occurrence, which is not OK. So, eventually, this:
[\s\S]*?(?<=")(testreg)(?=":)[\s\S]*?
would be the ideal answer that we are looking for.
Perhaps what we need here is an expression that matches across newlines, since . does not match a newline by default. I'm not sure what the desired output would be, but we can start by testing a few options, such as these expressions:
[\s\S]*(testreg)[\s\S]*
[\w\W]*(testreg)[\w\W]*
[\d\D]*(testreg)[\d\D]*
([\s\S].*?)(testreg)?
Demo
This demo shows that we can capture and return our desired testreg:
const regex = /[\s\S]*(testreg)[\s\S]*/gm;
const str = `{
"testreg":{
"test1":"test2",
"test3":"test4"
}
}`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
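The difference between the attempts in the question and the [\s\S] variants comes down to whether the pattern can cross line breaks. Below is a minimal sketch in Python, used only to illustrate the regex behaviour. NiFi itself uses Java regular expressions, and the sketch assumes the processor requires the whole content to match, which depends on its configuration.

```python
import re

content = '{\n"testreg":{\n"test1":"test2",\n"test3":"test4"\n}\n}'

# '.' stops at newlines, so this cannot cover the whole multi-line content.
dot_match = re.fullmatch(r".*testreg.*", content)

# [\s\S] matches any character, newlines included.
any_char_match = re.fullmatch(r"[\s\S]*(testreg)[\s\S]*", content)

# Stricter variant: testreg must sit between a quote and '":'.
key_match = re.fullmatch(r'[\s\S]*?(?<=")(testreg)(?=":)[\s\S]*?', content)

print(dot_match is None)        # the '.'-based pattern fails
print(any_char_match.group(1))  # captured word
print(key_match.group(1))       # captured word, as a JSON key
```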

Find and replace between second and third slash

I have URLs in the following format:
/category1/1rwr23/item
/category2/3werwe4/item
/category3/123wewe23/item
/category4/132werw3/item
/category5/12werw33/item
I would like to replace the segment between the second and third slash with {id} for further processing:
/category1/{id}/item
How do I replace that segment with {id}? I have spent the last 4 hours on this without reaching a proper conclusion.
Assuming you'll be running the regex in JavaScript, your regex will be:
/^(\/.*?\/)([^/]+)/gm
and the replacement string should look like $1whatever
var str = "your url strings ..."
var replStr = 'replacement';
var re = /^(\/.*?\/)([^/]+)/gm;
var result = str.replace(re, '$1'+replStr);
console.log(result);
Based on your input, it should print:
/category1/replacement/item
/category2/replacement/item
/category3/replacement/item
/category4/replacement/item
/category5/replacement/item
We divide it into 3 groups:
1. the part before the replacement
2. the part to be replaced
3. the part after the replacement
yourString.replace(/([^/]*\/[^/]+\/)([^/]+)(\/[^/]+)/g, '$1' + replacement + '$3');
Here is the demo: https://jsfiddle.net/9sL1qj87/
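For comparison, the first approach (keep the leading segment, replace the next one) might look like this in Python. This is only a sketch: the sample URLs are taken from the question, and the names urls, pattern and result are arbitrary.

```python
import re

urls = """/category1/1rwr23/item
/category2/3werwe4/item
/category3/123wewe23/item"""

# Group 1 keeps the first segment including both slashes; the following
# [^/]+ is the segment between the second and third slash.
pattern = re.compile(r"^(/.*?/)[^/]+", flags=re.MULTILINE)

# \g<1> re-inserts group 1; '{id}' is plain text in a Python replacement string.
result = pattern.sub(r"\g<1>{id}", urls)
print(result)
```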

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curncy
where the root is MCD, the month letter and year are Z3, and the 'yellow key' is Curncy.
Note that the root can be of variable length: 2-4 letters, or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letters listed below in expr and can have one- or two-digit years.
Finally, the yellow key can be one of several security type strings, but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do that?
Assuming there is no leading or trailing whitespace and only uppercase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got the root in the first group, the letter-year in the second, and the yellow key in the third.
I don't know Matlab, nor whether it supports Perl Compatible Regular Expressions. If it fails, try e.g. a literal space instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
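To check how the three groups behave, the anchored expression can be tried in Python, whose syntax for these constructs is the same. A small sketch follows; the names TICKER_RE and parse_ticker are illustrative and not part of any Bloomberg or Matlab API.

```python
import re

TICKER_RE = re.compile(
    r"^([A-Z]{2,4}|[A-Z]\s)"          # root: 2-4 letters, or 1 letter + whitespace
    r"([FGHJKMNQUVXZ]\d{1,2}) "       # month letter and 1-2 digit year, then a space
    r"(Curncy|Equity|Index|Comdty)$"  # yellow key
)

def parse_ticker(ticker):
    """Return (root, letter_year, yellow_key), or None if the ticker does not match."""
    m = TICKER_RE.match(ticker)
    return m.groups() if m else None

print(parse_ticker("MCDZ3 Curncy"))
print(parse_ticker("S H4 Comdty"))
```

Because the space sits outside the second group, the letter-year comes back without it.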
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space (see note 1).
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more generic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
Note 1: This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of leading and/or trailing spaces, there is a very simple command for that:
monthyear = strtrim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach: it locates the digits of the year and takes them together with the letter immediately before them:
s = 'MCDZ3 Curncy'
p = regexp(s,'\d')
s(min(p)-1:max(p))

Insert a space between two consecutive uppercase letters

I need to know how to insert a space between consecutive uppercase letters.
I have a large list of customers with first name, middle name and last name. GaryACloud should be split as Gary A Cloud. I used (.)([A-Z]) and replaced it with \1 \2, but I have no clue what that means, so if anyone can explain, I will be really grateful. The above gave me only partial output: I got Gary ACloud. How can I put a space before every uppercase letter? An explanation of the solution would also be very helpful.
You can match:
"([A-Z])(?=[A-Z])"
And replace with:
"\1 "
var input = "CategoryName";
var result = Regex.Replace(input, "([a-z])([A-Z])", @"$1 $2"); //Category Name
UPDATE (this will treat a sequence of capital letters as one word)
var input = "SimpleHTTPRequest";
var result = Regex.Replace(input, "([a-z]|[A-Z]{2,})([A-Z])", @"$1 $2");
//Simple HTTP Request
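Since the question also asks what \1 \2 means: \1 and \2 refer back to whatever the first and second pair of parentheses captured. The partial result comes from the match consuming both characters, which the following Python sketch tries to show. The variable names are arbitrary, and the last pattern is a variation on the lookahead answer rather than a quote of it.

```python
import re

name = "GaryACloud"

# (.)([A-Z]) consumes two characters per match. After "yA" is replaced,
# the "A" is already used up and cannot start a new match with "C".
consuming = re.sub(r"(.)([A-Z])", r"\1 \2", name)

# The lookahead (?=[A-Z]) only checks the next character without consuming it,
# but with ([A-Z]) in front it fires only between two uppercase letters.
two_caps = re.sub(r"([A-Z])(?=[A-Z])", r"\1 ", name)

# Allowing any character before the lookahead puts a space before every
# uppercase letter except the first one.
every_cap = re.sub(r"(.)(?=[A-Z])", r"\1 ", name)

print(consuming)
print(two_caps)
print(every_cap)
```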