Example Text:
[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]
Current RegEx:
[ABC.*?(?:value='(.*?)')+.*?]]]
What I want to achieve:
There's an extremely long text (HTTP Response) with data I want to grab. A single dataset contains multiple lines. On every line the data I want to collect is located inside the "value:''" tag. On each line there are multiple of those value tags. Is it somehow possible to use (optimize) the above regex to get the data of all value tags with just a single capturing group in the regex pattern?
To clarify what I want: alternatively I would have to use the following pattern:
[ABC.*?value='(.*?)'.*?value='(.*?)'.*?value='(.*?)'.*?value='(.*?)'.*?]]]
Using Perl, you can easily get at all matches of a regular expression, and most other regular expression libraries have similar capabilities. As you want to match a header, doing a repeated match with an anchor (\G) is the easiest:
use strict;
#use Regexp::Debugger;
my $data = "[ABC[[value='123'SomeTextHere[]]][value='5463',SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]";
my @matches = $data =~ /(?:^\[ABC|\G).*?\bvalue='([^']*)'/g;
print "[$_]" for @matches;
__END__
[123][5463][Text]
Most likely you will need to add the "global" flag to whatever regex library you are using for matching.
Personally, I would split this up into a two-step process. First, extract the string between [ABC[[ and ]]], and then extract all value='...' parts from that string. Also, most likely, you can parse the string [ABC[[...]]] in a sane way, counting opening and closing brackets. Or maybe that string is even JSON and you can just use a proper parser there?
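A minimal sketch of the two-step idea, written in Python rather than Perl. One assumption to flag loudly: because the sample contains "]]]" in the middle of a dataset, this sketch treats a dataset as everything between one "[ABC" and the next instead of looking for a closing "]]]". The function name values_per_dataset is made up for illustration.

```python
import re

# Sample text from the question (the trailing "[ABC [...]" is the elided next dataset).
data = (
    "[ABC[[value='123'SomeTextHere[]]][value='5463',"
    "SomedifferentTextwithdifferentlength]][[value='Text';]]]]][ABC [...]"
)

def values_per_dataset(text):
    # Step 1 (assumption): a dataset is everything between one "[ABC" and the next.
    datasets = text.split("[ABC")[1:]  # drop whatever precedes the first marker
    # Step 2: collect every value='...' inside each dataset.
    return [re.findall(r"\bvalue='([^']*)'", chunk) for chunk in datasets]

print(values_per_dataset(data))  # [['123', '5463', 'Text'], []]
```

The second, empty list comes from the elided "[ABC [...]" at the end of the sample, which contains no value tags.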
I would like to use a regular expression to replace a certain pattern in Kettle. For example, given AAAA >5< BBBB, I want to replace this with AAAA 555 BBBB. I know how to find the pattern, but I am not sure how to replace it with the new string. The one constraint is that I have to match >< together as a pair, not > or < separately, because there is another pattern, <5>.
You can use the "Replace in String" step in a transformation.
Set use RegEx to "Y", type your regex in the Search box, with capturing groups if necessary, and put the replacement string in the replacement box, referring to capture groups as $1, $2, ...
It'll replace all occurrences of the regex in the original string.
If the Out Stream field is omitted, it'll overwrite the In stream field.
If you want the pattern >\d< replaced by a triple of the found digit, you can use Replace-In-String in regex mode:
Search: (.*)(>(\d)<)(.*)
Replace: $1$3$3$3$4
If you want all such patterns treated the same:
Search: (>(\d)<)
Replace: $2$2$2
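To try the second search/replace pair outside Kettle, here is a small Python sketch. It is only an illustration, not the Kettle step itself: the helper name triple_digits is made up, Python writes backreferences as \1 where the step uses $1, and since the outer group is dropped the digit is group 1 here rather than group 2.

```python
import re

def triple_digits(text):
    # ">(\d)<" matches one digit wrapped as >d<; the other markup, <d>, is left alone.
    return re.sub(r">(\d)<", r"\1\1\1", text)

print(triple_digits("AAAA >5< BBBB"))      # AAAA 555 BBBB
print(triple_digits("AAAA <5> BBBB"))      # AAAA <5> BBBB
print(triple_digits(">1< and >2< both"))   # 111 and 222 both
```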
EDIT due to your improved requirement
Since you intend to convert your "simple" markup to a more HTML-like markup, you had better use a User-Defined-Java-Expression. Also, you must avoid reintroducing simple markup when replacing repeatedly.
I have the following string:
{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44","Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}
I need to get the version: 0.3.44
The pattern will always be "Image":"XXX:YYY/ZZZ:VVV"
Any help would be greatly appreciated
This regular expression will reliably match the given "pattern" in any string and capture the group designated VVV:
/"Image":"[^:"]+:[^"\/]+\/[^":]+:([^"]+)"/
where the pattern is understood to express the following characteristics of the input to be matched:
there is no whitespace around the colon between "Image" and the following " (though that could be matched with a small tweak);
the XXX and ZZZ substrings do not contain any colons;
the YYY substring does not contain a forward slash (/); and
none of the substrings XXX, YYY, ZZZ, or VVV contains a literal double-quote ("), whether or not escaped.
Inasmuch as those constraints are stronger than JSON or YAML requires to express the given data, you'd probably be better off using a bona fide JSON / YAML parser. A real parser will cope with semantically equivalent inputs that do not satisfy the constraints, and it will recognize invalid (in the JSON or YAML sense) inputs that contain the pattern. If none of that is a concern to you, however, then the regex will do the job.
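For anyone who wants to try the regex quickly, here is a small Python sketch applying it to the string from the question. The names IMAGE_VERSION and image_version are made up for illustration. The second call shows the first constraint in action: whitespace after the colon makes the match fail.

```python
import re

# Same pattern as above; "/" needs no escaping in a Python string.
IMAGE_VERSION = re.compile(r'"Image":"[^:"]+:[^"/]+/[^":]+:([^"]+)"')

raw = (
    '{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,'
    '"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44",'
    '"Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}'
)

def image_version(text):
    match = IMAGE_VERSION.search(text)
    # Return None rather than raising when the pattern is absent.
    return match.group(1) if match else None

print(image_version(raw))                       # 0.3.44
print(image_version('{"Image": "a:b/c:d"}'))    # None
```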
This is extremely hard to do correctly using just a regular expression, and it would require a regex engine capable of recursion. Rather than writing a 50-line regexp, I'd simply use an existing JSON parser.
$ perl -MJSON::PP -E'
use open ":std", ":encoding(UTF-8)";
my $json = <>;
my $data = JSON::PP->new->decode($json);
my $image = $data->{options}{create}{Image};
my %versions = map { split(/:/, $_, 2) } split(/\//, $image);
say $versions{"image-name"};
' my.json
0.3.44
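The same parser-based approach, sketched in Python with the standard json module in case Perl is not at hand. The function name image_versions is made up, and the sketch assumes the Image value always looks like "XXX:YYY/ZZZ:VVV", as stated in the question.

```python
import json

raw = (
    '{"name":"db-mysql","authenticate":false,"remove":false,"skip":false,'
    '"options":{"create":{"Image":"mysql:5.6/image-name:0.3.44",'
    '"Env":{"MYSQL_ROOT_PASSWORD":"dummy_pass","MYSQL_DATABASE":"dev_db"}}}}'
)

def image_versions(raw_json):
    # Let the JSON parser deal with quoting and whitespace, then split the value.
    image = json.loads(raw_json)["options"]["create"]["Image"]
    # "mysql:5.6/image-name:0.3.44" -> {"mysql": "5.6", "image-name": "0.3.44"}
    return dict(part.split(":", 1) for part in image.split("/"))

print(image_versions(raw)["image-name"])  # 0.3.44
```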
I'm still learning regex, and have a long ways to go so would appreciate help from any of you with more regex experience. I'm working on a perl script to parse multiple log files, and parse for certain values. In this case, I'm trying to get a list of user names.
Here's what my log file looks like:
[date timestamp]UserName = Joe_Smith
[date timestamp]IP Address = 10.10.10.10
..
Just testing, I've been able to pull it out using \UserName\s\=\s\w+, however I just want the actual UserName value, and not include the 'UserName =' part. Ideally if I can get this to work, I should be able to apply the same logic for pulling out the IP Address etc, but just hoping to get list of Usernames for the moment.
Also, the usernames are always in the format above of Firstname_Lastname, so I believe \w+ should always get everything I need.
Appreciate any help!
You should capture the part of the matched string that you are interested in using parentheses in the regular expression.
If the match succeeds, then the captures are available in the built-in variables $1, $2, etc., numbered in the order that their opening parentheses appear in the regular expression.
In this case you need only a single capture so you need look only at $1.
Beware that you should always check that a regex match succeeded before using the values in the capture variables, as they retain the values from the last successful match and a failed match doesn't reset them.
use strict;
use warnings;
my $str = '[date timestamp]UserName = Joe_Smith';
if ($str =~ /UserName = (\w+)/) {
    print $1, "\n";
}
output
Joe_Smith
Another way to do it:
my ($username) = $str =~ /UserName\s\=\s(\w+)/
or warn "no username parsed from '$str'\n";
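Since the goal is a list of user names from several log lines, here is the same capture-and-check idea as a small Python sketch. The second log line comes from the question; the function name usernames is made up for illustration.

```python
import re

log_lines = [
    "[date timestamp]UserName = Joe_Smith",
    "[date timestamp]IP Address = 10.10.10.10",
]

def usernames(lines):
    found = []
    for line in lines:
        match = re.search(r"UserName = (\w+)", line)
        if match:  # only read the group when the match succeeded
            found.append(match.group(1))
    return found

print(usernames(log_lines))  # ['Joe_Smith']
```

Swapping the literal "UserName" for "IP Address" and \w+ for a pattern that allows dots would pull out the address the same way.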
You should change the regex to UserName\s=\s(\w+)$ and after a successful match the part in parentheses will be available in the variable $1. My Perl is a bit rusty, so if it doesn't work right, look at http://www.troubleshooters.com/codecorn/littperl/perlreg.htm#StringSelections
The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
    $values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
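For comparison, the same four changes in a Python sketch. The last sample line, with the longer number 846777, is made up to show the variable-length case.

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
bdfg34f;846777;72720;7;6637912;05072009;7;
"""

# ^ with re.MULTILINE anchors at each line start; \d{4,} allows 4 or more digits;
# the parentheses capture the value that the lookbehind used to isolate.
values = re.findall(r"^bdfg34f;\d{4,};(\d{0,9})", text, flags=re.MULTILINE)

print(values)  # ['72719', '72720']
```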
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
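As a quick check of the field pattern, here is a Python sketch. The names FIELDS and third_field are made up for illustration. Note that this version does not care what the first two fields contain, so it returns the third field of every line, not only the bdfg34f one.

```python
import re

# Two "not-semicolons, then a semicolon" fields, then capture the third one.
FIELDS = re.compile(r"[^;]+;[^;]+;([^;]+);")

def third_field(line):
    match = FIELDS.match(line)  # match() anchors at the start of the line
    return match.group(1) if match else None

print(third_field("bdfg34f;8467;72719;7;6637912;05072009;7;"))    # 72719
print(third_field("bdfg34f;846777;72719;7;6637912;05072009;7;"))  # 72719
```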
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cut -d';' -f3 file
To match when the first number has at least 4 digits (these three patterns need an engine that supports variable-length lookbehind, such as .NET):
(?<=bdfg34f;\d{4,};)\d{0,9}
To match when the first number has one or more digits:
(?<=bdfg34f;\d+;)\d{0,9}
Or to match only when the first number has between 4 and 6 digits:
(?<=bdfg34f;\d{4,6};)\d{0,9}
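Whether these lookbehinds are accepted depends on the engine. As one data point, Python's built-in re module only allows fixed-width lookbehind, so a quick way to find out is simply to try compiling each pattern. The helper name compiles is made up for this sketch.

```python
import re

def compiles(pattern):
    # True if this engine accepts the pattern, False if it rejects it.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=bdfg34f;\d{4};)\d{0,9}"))    # True: fixed width
print(compiles(r"(?<=bdfg34f;\d{4,};)\d{0,9}"))   # False in Python's re
print(compiles(r"(?<=bdfg34f;\d+;)\d{0,9}"))      # False in Python's re
print(compiles(r"(?<=bdfg34f;\d{4,6};)\d{0,9}"))  # False in Python's re
```

When the engine rejects them, matching the prefix normally and using a capture group gets the same value.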
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in PHP, since I have no idea what language you're using)
foreach (explode("\n", $string) as $line) {
    $bits = explode(";", $line);
    if (isset($bits[2])) {
        echo $bits[2], "\n"; // third column (array indexes start at 0)
    }
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
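In the same no-regex spirit, here is a sketch in Python that uses the standard csv module with ';' as the delimiter, roughly the counterpart of the fgetcsv suggestion. The function name third_column is made up for illustration.

```python
import csv
import io

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
"""

def third_column(data):
    reader = csv.reader(io.StringIO(data), delimiter=";")
    # Index 2 is the third column; skip rows that are too short (e.g. blank lines).
    return [row[2] for row in reader if len(row) > 2]

print(third_column(text))  # ['72159', '72719', '72119']
```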
How can I create a regular expression that will grab delimited text from a string? For example, given a string like
text ###token1### text text ###token2### text text
I want a regex that will pull out ###token1###. Yes, I do want the delimiter as well. By adding another group, I can get both:
(###(.+?)###)
/###(.+?)###/
if you want the ###'s then you need
/(###.+?###)/
the ? means non greedy, if you didn't have the ?, then it would grab too much.
e.g. '###token1### text text ###token2###' would all get grabbed.
My initial answer had a * instead of a +. * means 0 or more. + means 1 or more. * was wrong because that would allow ###### as a valid thing to find.
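A quick way to see the difference between the greedy and non-greedy forms, and between * and +, is to run them side by side. This is a Python sketch; the regexes themselves are the ones discussed above.

```python
import re

text = "text ###token1### text text ###token2### text text"

print(re.findall(r"###(.+?)###", text))   # ['token1', 'token2']
print(re.findall(r"(###.+?###)", text))   # ['###token1###', '###token2###']

# Without the ?, .+ is greedy and runs to the last ### in the string.
print(re.findall(r"###(.+)###", text))    # ['token1### text text ###token2']

# * accepts an empty token between the delimiters, + does not.
print(re.findall(r"###(.*?)###", "######"))  # ['']
print(re.findall(r"###(.+?)###", "######"))  # []
```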
For playing around with regular expressions, I highly recommend http://www.weitz.de/regex-coach/ for Windows. You can type in the string you want and your regular expression and see what it's actually doing.
Your selected text will be stored in \1 or $1 depending on where you are using your regular expression.
In Perl, you actually want something like this:
$text = 'text ###token1### text text ###token2### text text';
while($text =~ m/###(.+?)###/g) {
print $1, "\n";
}
Which will give you each token in turn within the while loop. The (.+?) ensures that you get the shortest bit between the delimiters, preventing it from thinking the token is 'token1### text text ###token2'.
Or, if you just want to save them, not loop immediately:
@tokens = $text =~ m/###(.+?)###/g;
Assuming you want to match ###token2### as well...
/###.+###/
Use () and \x. A naive example that assumes the text within the tokens is always delimited by #:
text (#+.+#+) text text (#+.+#+) text text
The stuff in the () can then be grabbed by using \1 and \2 in the replacement expression (\1 for the first set, \2 for the second), assuming you're doing a search/replace in an editor. For example, the replacement expression could be:
token1: \1, token2: \2
For the above example, that should produce:
token1: ###token1###, token2: ###token2###
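The same search/replace can be tried in a program instead of an editor. A minimal Python sketch, where re.sub accepts the \1 and \2 backreferences in the replacement string:

```python
import re

text = "text ###token1### text text ###token2### text text"
pattern = r"text (#+.+#+) text text (#+.+#+) text text"

# \1 and \2 refer to the first and second parenthesised groups.
result = re.sub(pattern, r"token1: \1, token2: \2", text)

print(result)  # token1: ###token1###, token2: ###token2###
```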
If you're using a regexp library in a program, you'd presumably call a function to get at the contents of the first and second tokens, which you've indicated with the ()s around them.
When you use delimiters like this, you basically match the opening delimiter, then anything that is not the closing delimiter, then the closing delimiter. One caution: in cases like the example above, [^#] alone does not work as the check that the end delimiter is absent, since a single # inside the token would cause the regex to fail (i.e. "###foo#bar###"). For the case above, the regex would be the following, assuming empty tokens are allowed (if not, change * to +):
###([^#]|#[^#]|##[^#])*###