Extract a substring between a specific pattern - regex

I have to extract some substrings from a plain-text document that contains XML-like markup, for example:
lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf
Can I extract this pattern in a single command?
In a case like this, I tried using a Matcher and its group method to extract the single match.
I don't want to do something like
String pattern = /<AAA>(.*)<\/AAA>/;
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher("lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf");
if (m.find()) {
    // group(1) is the text captured between the tags; group(0) would be the whole match
    System.out.println("Found value: " + m.group(1));
}
There must be a more elegant way.
Edit:
Thank you tim_yates, I was looking for something like that.
Could you explain a little why you use [0][1] on the result of
def extract = (input =~ '<AAA>(.+?)</AAA>')[0][1]
Answer by tim_yates:
=~ returns a Matcher, so [0] gets the first match. Because the pattern contains a group, that match is a list of 2 elements: the first, [0], is the entire matched text (<AAA>myString</AAA>), and the second, [1], is the group you defined in your expression.
Thank you so much for your help, and thanks to all the readers.
Power of a community !!!

Can't you just do:
def input = 'lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf'
def extract = (input =~ '<AAA>(.+?)</AAA>')[0][1]
assert extract == 'myString'
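For readers who want to see the two elements behind [0][1] side by side, here is a minimal Python sketch of the same idea. It is an analogue, not Groovy: Python's `re` module exposes the whole match as `group(0)` and the first capture as `group(1)`, and the names `whole` and `inner` are only illustrative.

```python
import re

text = "lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf"

# Lazy capture between the opening and closing tag, as in the Groovy one-liner.
m = re.search(r"<AAA>(.+?)</AAA>", text)

whole = m.group(0)  # the entire matched text, tags included
inner = m.group(1)  # only the captured group

print(whole)  # <AAA>myString</AAA>
print(inner)  # myString
```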

This is the shortest (not the best) way I can think of without external libs:
String str = "lsdkfjsdklfj sdklfsdklfjsd <AAA>myString</AAA>sdfsdfsdfsdf";
System.out.println(str.substring(str.indexOf(">") + 1, str.lastIndexOf("<")));
Or using StringUtils (which is a million times better than my previous suggestion with substring):
StringUtils.substringBetween(str, "<AAA>", "</AAA>");
Still I'd go with matcher() like you proposed among all these.
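To illustrate why searching for the exact delimiters is safer than the first ">" and the last "<", here is a small Python sketch. `substring_between` is a hypothetical helper written for this example; it only approximates what a "substring between" utility does and is not the StringUtils implementation.

```python
def substring_between(text, open_tag, close_tag):
    """Return the text between the first open_tag and the next close_tag, or None."""
    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end]


sample = "a <AAA>x</AAA> b <BBB>y</BBB>"

# Delimiter-aware extraction returns only the wanted value.
print(substring_between(sample, "<AAA>", "</AAA>"))  # x

# The first-">" / last-"<" trick spans both tags once a second tag appears.
naive = sample[sample.find(">") + 1:sample.rfind("<")]
print(naive)  # x</AAA> b <BBB>y
```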

Related

The regex in string.find of Lua

I use string.find(str, pattern) in Lua to fetch some keywords.
local RICH_TAGS = {
    "texture",
    "img",
}
--\[((img)|(texture))=
local START_OF_PATTER = "\\[("
for index = 1, #RICH_TAGS - 1 do
    START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index] .. ")|"
end
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[#RICH_TAGS] .. "))"
function RichTextDecoder.decodeRich(str)
    local result = {}
    print(str, START_OF_PATTER)
    dump({string.find(str, START_OF_PATTER)})
end
Output:
hello[img=123] \[((texture)|(img))
dump from: [string "utils/RichTextDecoder.lua"]:21: in function 'decodeRich'
"<var>" = {
}
The output means:
str = hello[img=123]
START_OF_PATTER = \[((texture)|(img))
This regex works well in some online regex tools, but it finds nothing in Lua.
Is there anything wrong in my code?
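Lua's standard string library does not support regular expressions; it uses its own, simpler pattern syntax. Use Lua patterns to match strings.
See How to write this regular expression in Lua?
Try dump({str:find("%[")}). In Lua patterns the escape character is %, not a backslash.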
Also note that this loop:
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
on its own skips the last element of RICH_TAGS. The concatenation right after the loop appends it, so the final pattern does contain every tag.
Edit:
But what I want is to fetch several specific words. For example, the pattern should match any one of "[img=", "[texture=" or "[font=". With the regex string I wrote in my question, a regex engine can do the work. But in Lua, the way to do the job is to write code like string.find(str, "%[img="), string.find(str, "%[texture=") and string.find(str, "%[font="). I wonder whether there is a way to do the job with a single pattern string. I tried a pattern string like "%[%a*=", but obviously it matches many more strings than I need.
Lua patterns have no alternation operator (|), so you cannot match several specific words with a single pattern unless they appear in the string in a fixed order. The only thing you could do is put all the characters that make up those words into a character class, but then you risk matching any word that can be built from those letters.
Usually you would match each word with a separate pattern, or match any word and check whether the match is one of your words, for example with a lookup table.
So basically you do what a regex library would do in a few lines of Lua.
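The "match any word, then check a lookup table" approach can be sketched as follows. This is a Python illustration of the technique rather than Lua code. The set `RICH_TAGS` mirrors the table from the question, with "font" added from the asker's edit, and `find_rich_tag` is a hypothetical helper name. The generic pattern deliberately avoids alternation, so the same structure carries over to a Lua pattern such as "%[(%a+)=".

```python
import re

# Lookup table of accepted tag names (assumed from the question and its edit).
RICH_TAGS = {"texture", "img", "font"}

# Generic pattern: "[", a run of letters, "=". No alternation is used.
TAG_START = re.compile(r"\[([A-Za-z]+)=")


def find_rich_tag(text):
    """Return (start, end, tag) for the first accepted tag, or None."""
    for m in TAG_START.finditer(text):
        if m.group(1) in RICH_TAGS:
            return m.start(), m.end(), m.group(1)
    return None


print(find_rich_tag("hello[img=123]"))    # (5, 10, 'img')
print(find_rich_tag("x[foo=1][font=2]"))  # (8, 14, 'font')
```

Adding a new tag then only means adding a name to the lookup table; the pattern itself stays unchanged.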

Scala regex find/replace with additional formatting

I'm trying to replace parts of a string that contains what should be dates, but which are possibly in an impermissible format. Specifically, all of the dates are in the form "mm/dd/YYYY" and they need to be in the form "YYYY-mm-dd". One caveat is that the original dates may not exactly be in the mm/dd/YYYY format; some are like "5/6/2015". For example, if
val x = "where date >= '05/06/2017'"
then
x.replaceAll("'([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})'", "'$3-$1-$2'")
performs the desired replacement (returns "where date >= '2017-05-06'"), but for
val y = "where date >= '5/6/2017'"
this does not return the desired replacement (it returns "where date >= '2017-5-6'", which for me is an invalid representation). With the Joda-Time wrapper nscala-time, I've tried capturing the dates and then reformatting them:
import com.github.nscala_time.time.Imports._
import org.joda.time.DateTime
val f = DateTimeFormat.forPattern("yyyy-MM-dd")
y.replaceAll("'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'",
"'"+f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime("$1"))+"'")
But this fails with a java.lang.IllegalArgumentException: Invalid format: "$1". I've also tried using the f interpolator and padding with 0s, but it doesn't seem to like that either.
Are you not able to do additional processing on the captured groups ($1, etc.) inside the replaceAll? If not, how else can I achieve the desired result?
Backreferences like $1 are only interpreted inside the replacement string, by the regex engine. In your code, "$1" is passed to parseDateTime as a literal string before replaceAll ever runs, so it is no longer a backreference.
You may use a "callback" with replaceAllIn to get the actual match object and access its groups for further manipulation:
val pattern = "'([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'".r
val result = pattern.replaceAllIn(y, m => "'" + f.print(DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(m.group(1))) + "'")
Regex.replaceAllIn is overloaded and can take a Match => String.
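The same callback idea exists in most regex libraries. As a rough Python analogue (not the Scala API), `re.sub` accepts a function that receives each match object and returns the replacement text. The names `DATE_IN_QUOTES` and `to_iso` are made up for this sketch, and the standard library's `datetime` stands in for Joda-Time.

```python
import re
from datetime import datetime

# A quoted date such as '5/6/2017' or '05/06/2017'.
DATE_IN_QUOTES = re.compile(r"'(\d{1,2}/\d{1,2}/\d{4})'")


def to_iso(match):
    """Callback: parse the captured mm/dd/YYYY date and re-emit it as YYYY-mm-dd."""
    parsed = datetime.strptime(match.group(1), "%m/%d/%Y")
    return "'" + parsed.strftime("%Y-%m-%d") + "'"


y = "where date >= '5/6/2017'"
result = DATE_IN_QUOTES.sub(to_iso, y)
print(result)  # where date >= '2017-05-06'
```

Because the callback runs once per match, the captured text is a real string by the time it is parsed, which is exactly what the "$1" literal could not provide.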

Replace multiple words in pig

I am new to Pig. In the script that I am writing I want to perform an operation similar to this:
foreach X GENERATE REPLACE(word,'.*abc.*','abc') OR REPLACE(word,'.*def.*','def').
If the first pattern matches then abc is replaced else if second pattern is matched then def is replaced. But I suppose the syntax is incorrect. Can someone help me with the syntax?
There are a few ways to do this. Since REPLACE returns the input string unchanged when the regex doesn't match, nesting the two calls is pretty compact:
Y = FOREACH X GENERATE REPLACE(REPLACE(word, '.*abc.*', 'abc'), '.*def.*', 'def');
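The reason the nesting behaves like an if / else-if can be checked outside Pig. The sketch below uses Python's `re.sub` purely as a stand-in for REPLACE, and `normalize` is a hypothetical name. It relies only on the property stated above: a replacement whose pattern does not match returns its input unchanged.

```python
import re


def normalize(word):
    """Apply the 'abc' rule first, then the 'def' rule to whatever comes out."""
    inner = re.sub(r".*abc.*", "abc", word)
    return re.sub(r".*def.*", "def", inner)


print(normalize("xxabcxx"))  # abc
print(normalize("1def2"))    # def
print(normalize("abc def"))  # abc (the first rule wins, the second no longer matches)
print(normalize("zzz"))      # zzz (neither rule matches, the input is returned as-is)
```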

Regular expression any character with dynamic size

I want to use a regular expression that does the following (I extracted the part I'm having trouble with in order to simplify):
any character for the first 1 to 5 characters, then an underscore, then some digits, then an underscore, then some digits or dots.
With a restriction on the underscore, it gives something like this:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But I want to allow "_" within the first 1-5 characters as long as the rest of the regular expression still matches, for example if I had something like:
to_to_123_12.56
I think this is linked to greedy matching in the regex engine; nevertheless, I tried some lazy quantifiers as explained here, but without success.
Any idea?
I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with a dot (.).
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re

strs = [
    "a_12_1",
    "abc_12_134",
    "abcd_123_1.",
    "abcde_12_1",
    "a_123_123.456.7890.",
    "a_12_1",
    "ab_de_12_1",
]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for s in strs:
    m = re.match(myre, s)
    if m:
        print("Yes:", end=" ")
        # "ALL" means the match covers the entire string
        if m.group(0) == s:
            print("ALL", end=" ")
    else:
        print("No:", end=" ")
    print(s)
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_123_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1
^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.
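A quick way to confirm both claims (the dot version matches, and laziness makes no difference here) is to run the three variants against the example from the question. This is a small Python check; the names `strict`, `loose` and `lazy` are just labels for this sketch.

```python
import re

strict = re.compile(r"^([^_]{1,5})_(\d{2,3})_([\d.]*)$")  # original: no "_" in the prefix
loose = re.compile(r"^(.{1,5})_(\d{2,3})_([\d.]*)$")      # greedy dot
lazy = re.compile(r"^(.{1,5}?)_(\d{2,3})_([\d.]*)$")      # lazy dot

sample = "to_to_123_12.56"

print(strict.match(sample))          # None
print(loose.match(sample).groups())  # ('to_to', '123', '12.56')
print(lazy.match(sample).groups())   # ('to_to', '123', '12.56')
```

Because the pattern is anchored at both ends, only one split of the string satisfies it, so the engine backtracks to the same answer whichever direction the quantifier starts from.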
While answering the comment (writing the lazy expression), I saw that I had made a mistake: if I simply use the following classical regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.

RegEx : Replace parts of dynamic strings

I have a string
IsNull(VSK1_DVal.RuntimeSUM,0),
I need to remove IsNull part, so the result would be
VSK1_DVal.RuntimeSUM,
I'm absolutely new to RegEx, and it wouldn't be a problem if not for one thing:
VSK1 is a dynamic part; it can be any combination of A-Z and 0-9, of any length. How do I replace such strings with RegEx? I use MSSQL 2k5; I think it uses the general set of RegEx rules.
EDIT: I forgot to say that I'm doing the replacement in the SSMS query window's Replace box (^H), not building a RegEx query.
br
marius
Here's a regex that should work:
[^(]+\(([^,]+),[^)]\)
Then use $1 capture group to extract the part that you need.
I did a sanity check in ruby:
orig = "IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /[^(]*\(([^,]+),[^)]\)/
result = orig.sub(regex){$1} # result => VSK1_DVal.RuntimeSUM,
It gets trickier if you have a prefix that you want to retain. Like if you have this:
"somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
In this case, you need some way to identify the start of the pattern. Maybe you can use '=' to identify the start of the pattern? If so, this should work:
orig = "somestuff = IsNull(VSK1_DVal.RuntimeSUM,0),"
regex = /=\s*\w+\(([^,]+),[^)]\)/
# The match consumes the "=", so put it back in the replacement
result = orig.sub(regex){"= #{$1}"} # result => somestuff = VSK1_DVal.RuntimeSUM,
But then the case where you don't have an equals sign will fail. Maybe you can use 'IsNull' to identify the start of the pattern? If so, try this (note the '/i' representing case insensitive matching):
orig = "somestuff = isnull(VSK1_DVal.RuntimeSUM,0),"
regex = /IsNull\(([^,]+),[^)]\)/i
result = orig.sub(regex){$1} # result => somestuff = VSK1_DVal.RuntimeSUM,
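For comparison, the last variant (anchoring on the function name, case-insensitively) can be sketched in Python as below. Two things here are assumptions rather than part of the answer above: the helper name `strip_isnull`, and the use of `[^)]*` instead of `[^)]` so that a multi-character default such as 123465 is also accepted.

```python
import re

# IsNull( <first argument> , <anything up to the closing parenthesis> )
ISNULL_CALL = re.compile(r"IsNull\(([^,]+),[^)]*\)", re.IGNORECASE)


def strip_isnull(sql):
    """Replace every IsNull(expr, default) call with just expr."""
    return ISNULL_CALL.sub(r"\1", sql)


print(strip_isnull("IsNull(VSK1_DVal.RuntimeSUM,0),"))
# VSK1_DVal.RuntimeSUM,
print(strip_isnull("somestuff = isnull(VSK1_DVal.RuntimeSUM,123465),"))
# somestuff = VSK1_DVal.RuntimeSUM,
```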
/IsNULL\(([A-Za-z0-9_.]+),0\)/
Then pick group match number 1.
Here's a very useful site: http://www.regexlib.com/RETester.aspx
They have a tester and a cheatsheet that are very useful for quick testing of this sort.
I tested the solution by Dave and it works fine except it also removes the trailing comma you wanted retained. Minor thing to fix.
Try this:
IsNULL\((.*,)0\)
You say in your question:
"I use MSSQL 2k5, i think it uses general set of RegEx rules."
This is not true unless you enable CLR and compile and install an assembly. You can instead use the native pattern-matching syntax of LIKE together with string functions, as below.
WITH T(C) AS
(
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,0),' UNION ALL
SELECT 'IsNull(VSK1_DVal.RuntimeSUM,123465),' UNION ALL
SELECT 'No Match'
)
SELECT SUBSTRING(C,8,1+LEN(C)-8-CHARINDEX(',',REVERSE(C),2))
FROM T
WHERE C LIKE 'IsNull(%,_%),'