Get first instance and Get last instance of string - regex

I am trying to match the first instance of the value of Timestamp in one expression and the last instance of the value of Timestamp in another expression:
{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}
Does anyone know how to do that?

Take advantage of regexp's greediness: the * operator will take as many matches as it can find. So the approach here is to match the explicit pattern at the beginning and end of the regexp with a .* in the middle. The .* will slurp up as many characters as it can subject to the rest of the regexp also matching.
/(${pattern}).*(${pattern})/
Here, ${} represents interpolation. The syntax will vary with your language; in Ruby it would be #{}. I have chosen to capture the entire pattern; you can instead put the () capture around the timestamp value, but I find this easier to read and maintain. This regexp will match two instances of $pattern with as much stuff in between as it can fit, thus guaranteeing that you have the first and last.
If you want to be more strict, you could enforce the pattern in the middle as well, *'ing the full pattern rather than just .:
/${pattern},\s*(?:${pattern},\s*)*${pattern}/
Ask in the comments if you don't understand any piece of this regexp.
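One pattern we can use is /\{[^}]+\'Timestamp\'[^}]+\}/. Note that each [^}]+ needs at least one character, so this pattern assumes that Timestamp is not the very first key after the opening brace (in the sample it is the LAST key); if this is not always true, relax the first [^}]+ to [^}]*.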
So the total pattern for the first example will be:
str =~ /(${pattern}).*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}).*({[^}]+'Timestamp'[^}]+})/
Then, $1 and $2 are the first and last hashes that match the Timestamp key. Again, this matches the entire pattern rather than only the timestamp value itself, but it should be straightforward from there to extract the actual timestamp value.
For the second, more strict example, and the reason I did not want to capture the timestamp value inside the pattern itself, we have:
str =~ /(${pattern}),\s*(?:${pattern},\s*)*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}), *(?:{[^}]+'Timestamp'[^}]+}, *)*({[^}]+'Timestamp'[^}]+})/
We still have the correct results in $1 and $2 because we explicitly chose NOT to put a capturing group inside the pattern.
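For readers who want to try both variants, here is a minimal Python sketch of the same idea. It is an illustration under assumptions, not the answerer's code: the names HASH, LOOSE, STRICT, TIMESTAMP and first_and_last_timestamp are made up for this example, and the extra TIMESTAMP regex is just one way to pull the value out of the captured hash.

```python
import re

# Sample data from the question.
TEXT = (
    "{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}"
)

# One hash containing a Timestamp key (no capture group inside, as in the answer).
HASH = r"\{[^}]+'Timestamp'[^}]+\}"

# Loose version: a greedy .* sits between the first and the last hash.
LOOSE = re.compile(rf"({HASH}).*({HASH})", re.DOTALL)
# Strict version: only further hashes are allowed in the middle.
STRICT = re.compile(rf"({HASH}),\s*(?:{HASH},\s*)*({HASH})")

# Hypothetical helper regex: extracts the quoted value from one captured hash.
TIMESTAMP = re.compile(r"'Timestamp':\s*'([^']*)'")


def first_and_last_timestamp(text, regex=LOOSE):
    """Return (first, last) timestamp values, or None if fewer than two hashes match."""
    match = regex.search(text)
    if match is None:
        return None
    first = TIMESTAMP.search(match.group(1)).group(1)
    last = TIMESTAMP.search(match.group(2)).group(1)
    return first, last


print(first_and_last_timestamp(TEXT))          # ('00:10:00', '00:37:00')
print(first_and_last_timestamp(TEXT, STRICT))  # ('00:10:00', '00:37:00')
```

Note that both regexes need at least two hashes; with a single hash the function returns None, so that case would need separate handling.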

Related

adding regex pattern in string in perl

I want to compare a string from one file with a string from another. The second file may contain some elements (tags), and such an element can occur anywhere and any number of times.
Note: these tags need to be retained in the final output.
For example:
I want to compare the word 'scripting'. The <match> tag indicates the word to be matched in $str2.
$str1 = "perl is an <match>scripting</match> language";
$str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age";
Output required :
perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
I am adding a pattern after each character:
$str1 =~ s{(.)}
{
'$&(?:(?:<?...?>|\n)+)?'
}esgi;
This works for a few cases, but for others it keeps running indefinitely. Please suggest a fix.
(?:(?:<?...?>|\n)+)? is the same as (?:<?...?>|\n)*. Also, you don't want to add the pattern after each character, just in between the characters of the matched part of $str1: no pattern before the first character and none after the last. Otherwise the match will extend around tags adjacent to the word, and you want the <match> tags to sit directly at the front and back of the word. My guess is that if you are running that first replace command over all of $str1, you may end up with quite a large string. Also see my answer to a related question.
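A rough Python sketch of the "pattern only between the characters" idea follows. It assumes the literal placeholder <?..?> from the question stands in for the real tags, and the names TAG and wrap_match are hypothetical.

```python
import re

# Assumption: the literal text "<?..?>" from the question stands in for a real tag.
# Zero or more tags or newlines may sit between two characters of the word.
TAG = r"(?:<\?\.\.\?>|\n)*"


def wrap_match(word, text):
    """Wrap the first occurrence of word in <match> tags, keeping embedded tags."""
    # Join the escaped characters with the tag pattern: nothing is added
    # before the first character or after the last one.
    pattern = TAG.join(re.escape(ch) for ch in word)
    return re.sub(pattern, lambda m: f"<match>{m.group(0)}</match>", text, count=1)


str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age"
print(wrap_match("scripting", str2))
# perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
```

Because the pattern is built only from the word being searched for, it stays short even when the surrounding string is long.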

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable-length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length, so you can't use a lookbehind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
    $values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
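The same four changes carry over to other regex flavours. As a minimal sketch in Python (the names FIELD_AFTER_ID and values_after_id are made up for this example, and the sample text uses the 846777 variant from the question):

```python
import re

TEXT = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;846777;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Anchor at line start (MULTILINE), allow four or more digits in the
# second field, and capture the third field instead of using a lookbehind.
FIELD_AFTER_ID = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)


def values_after_id(text):
    """Return the captured third field of every line that starts with bdfg34f."""
    return FIELD_AFTER_ID.findall(text)


print(values_after_id(TEXT))  # ['72719']
```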
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
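As one possible tweak, here is how that field pattern might look in Python; FIELDS and third_field are hypothetical names used only for this sketch.

```python
import re

# Two "not a semicolon, then a semicolon" fields, then capture the third.
FIELDS = re.compile(r"[^;]+;[^;]+;([^;]+);")


def third_field(line):
    """Return the third semicolon-separated field, or None if the line is too short."""
    match = FIELDS.match(line)  # match() anchors at the start of the line
    return match.group(1) if match else None


print(third_field("bdfg34f;84677;72719;7;6637912;05072009;7;"))  # 72719
```

This version does not check the first field, so it returns the third field of any line.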
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
To match the number when the preceding field has a minimum of 4 digits (this needs an engine that supports variable-length lookbehind, such as .NET):
(?<=bdfg34f;\d{4,};)\d{0,9}
To match it when the preceding field has 1 or more digits:
(?<=bdfg34f;\d+;)\d{0,9}
Or to match it only if the preceding field has between 4 and 6 digits:
(?<=bdfg34f;\d{4,6};)\d{0,9}
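Whether these lookbehinds are accepted depends on the engine. As one data point, Python's standard re module rejects variable-width lookbehind at compile time, which the following sketch checks (the helper name compiles is made up):

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(compiles(r"(?<=bdfg34f;\d{4};)\d{0,9}"))   # True: fixed-width lookbehind
print(compiles(r"(?<=bdfg34f;\d{4,};)\d{0,9}"))  # False: variable-width lookbehind
```

In such an engine, matching the prefix normally and using a capture group is the usual fallback.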
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', e.g. in PHP (I don't know which language you're using):
foreach (explode("\n", $string) as $line) {
    $bits = explode(";", $line);
    echo $bits[2]; // third column (arrays are zero-indexed)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
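A minimal Python sketch of the split-based approach, for comparison. The function name third_column and the filter on the first field are assumptions added here to mirror the question's goal; the PHP loop above prints the column for every line.

```python
TEXT = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""


def third_column(text, first_field="bdfg34f"):
    """Return the third column of every line whose first column equals first_field."""
    results = []
    for line in text.splitlines():
        parts = line.split(";")
        # Index 2 is the third column; skip lines that are too short.
        if len(parts) > 2 and parts[0] == first_field:
            results.append(parts[2])
    return results


print(third_column(TEXT))  # ['72719']
```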

How can I match the shortest possible sequence of characters that satisfy my regex pattern?

I've a string "ajjjjjjjjjaab"
I want a pattern which will match the last "ab" and not the whole string or even "aab".
/a.*?b/ # returns two groups
or
/a.??b/ # matches last aab
Neither works.
A simple way around your problem is to match:
.*(a.*b)
With the first .* being greedy, it matches as much as it can. Then you get a captured group with the match you really need, ($1). Note that this assumes you're matching the last occurrence of the pattern. You may want .*(a.*?b) if you have multiple bs near the end of the string, and you want the first one after the last a.
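A quick way to see the difference between the two variants, sketched in Python (the second test string is made up to show the multiple-b case):

```python
import re

# A greedy leading .* pushes the captured group as far right as possible.
print(re.match(r".*(a.*b)", "ajjjjjjjjjaab").group(1))  # ab

# With several b's after the last a, the greedy inner .* runs to the last b,
# while the lazy inner .*? stops at the first b after the last a.
print(re.match(r".*(a.*b)", "xaybzb").group(1))   # aybzb
print(re.match(r".*(a.*?b)", "xaybzb").group(1))  # ayb
```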
One of:
/a[^a]*b/
/a[^ab]*b/
If a and b are actually more complex patterns, one can use the following:
/a(?:(?!a).)*b/s
/a(?:(?!a|b).)*b/s
If a and b represent long/complex patterns, one can avoid repeating them using variables like in any other code.
my $re1 = qr/a/;
my $re2 = qr/b/;
/$re1(?:(?!$re1|$re2).)*$re2/s
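The variable-based form translates fairly directly to other languages. A sketch in Python, where shortest_between is a hypothetical helper and start and end are assumed to be valid regex fragments:

```python
import re


def shortest_between(start, end, text):
    """Return the leftmost start...end match that contains no inner start or end."""
    # Each fragment is wrapped in (?:...) so that alternations inside it stay contained.
    tempered = rf"(?:{start})(?:(?!(?:{start})|(?:{end})).)*(?:{end})"
    match = re.search(tempered, text, re.DOTALL)
    return match.group(0) if match else None


print(shortest_between("a", "b", "ajjjjjjjjjaab"))       # ab
print(shortest_between("<b>", "</b>", "<b>x<b>y</b>"))   # <b>y</b>
```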
One can also use subpatterns.
/
(?&A) (?:(?!(?&A)|(?&B)).)* (?&B)
(?(DEFINE)
(?<A> a )
(?<B> b )
)
/xs
The pattern matching in Perl is Left Most, Longest* by default. Using ??, *?, or +? will change that portion to Left Most, Shortest, but Left Most still takes precedence.
There is a way to get Perl to match Right Most, which might get you your desired effect, but it will also confuse the hell out of the next person to read your code, so use it with care.
The basic idea is to reverse everything related to the pattern match, so right becomes left.
my $subject = 'ajjjjjjjjjaab';
my $rev_sub = reverse $subject; # reverse the string being matched.
my $result;
if ($rev_sub =~ /(b.*?a)/) { # reverse the pattern to match.
    $result = reverse $1; # reverse the results of the match.
}
print $result;
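The same reversal trick can be sketched in Python, where slicing with [::-1] reverses a string. The function name rightmost_shortest is made up, and the pattern is hard-coded for the a...b example:

```python
import re


def rightmost_shortest(text):
    """Find the rightmost, shortest a...b by matching b...a on the reversed text."""
    match = re.search(r"b.*?a", text[::-1])  # reversed pattern on reversed string
    return match.group(0)[::-1] if match else None  # reverse the result back


print(rightmost_shortest("ajjjjjjjjjaab"))  # ab
```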
The solutions provided by ikegami and Kobi both find similar results for your example. Depending on your real patterns and strings, you might see very different performance from each method. Always benchmark against your real needs.
*Longest only for the immediate token being matched, excluding alternations which are tried in order left to right, etc.
OK, but then just use /ab/ for matching and you've got it. Or /a{1}b/. Or?

Extracting some data items in a string using regular expression

<![Apple]!>some garbage text may be here<![Banana]!>some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>
In the above string, I would like a regex that matches every <![FruitName]!>. Between these tags there may be some garbage text. My first attempt is this:
<!\[[^\]!>]+\]!>
It works, but as you can see I've used this part:
[^\]!>]+
This kills some innocents. If the fruit name contains any one of these characters: ] ! > It'd be discarded and we love eating fruit so much that this should not happen.
How do we construct a regex that disallows exactly this string ]!> in the FruitName while all these can still be obtained?
The above example is just made up by me, I just want to know what the regex would look like if it has to be done in regex.
The simplest way would be <!\[.+?]!> - just don't care about what is matched between the two delimiters at all. Only make sure that it always matches the closing delimiter at the earliest possible opportunity - therefore the ? to make the quantifier lazy.
(Also, no need to escape the ])
About the specification that the sequence ]!> should be "disallowed" within the fruit name - well that's implicit since it is the closing delimiter.
To match a fruit name, you could use:
<!\[(.*?)]!>
After the opening <![, this matches the least amount of text that's followed by ]!>. By using .*? instead of .*, the least possible amount of text is matched.
Here's a full regex to match each fruit with the following text:
<!\[(.*?)]!>(.*?)(?=(<!\[)|$)
This uses positive lookahead (?=xxx) to match the beginning of the next tag or end-of-string. Positive lookahead matches but does not consume, so the next fruit can be matched by another application of the same regex.
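Both regexes can be tried out with a short Python sketch. The names FRUIT and FRUIT_WITH_TEXT are made up, and the lookahead is written without the inner capture group so that findall returns simple pairs:

```python
import re

TEXT = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")

# Lazy match up to the first closing delimiter.
FRUIT = re.compile(r"<!\[(.*?)]!>")
# Fruit plus the text that follows it, up to the next tag or end of string.
FRUIT_WITH_TEXT = re.compile(r"<!\[(.*?)]!>(.*?)(?=<!\[|$)", re.DOTALL)

print(FRUIT.findall(TEXT))
# ['Apple', 'Banana', 'Orange', 'Pear', 'Pineapple']

# Names containing ], ! or > survive, as long as they do not contain "]!>" itself.
print(FRUIT.findall("<![Gr]a!pe]!>x<![Kiwi]!>"))
# ['Gr]a!pe', 'Kiwi']

print(FRUIT_WITH_TEXT.findall(TEXT)[:3])
# [('Apple', 'some garbage text may be here'), ('Banana', 'some garbage text may be here'), ('Orange', '')]
```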
Depending on what language you are using, you can use the string methods your language provides and do simple splitting (plus a simple regex that is easier to understand). Split your string using "!>" as the separator. Go through each field and check for <!. If found, remove all characters from the front up to and including <!. This will give you all the fruits. I use gawk to demonstrate, but the algorithm can be implemented in your language.
e.g. gawk:
# set field separator as !>
awk -F'!>' '
{
    # for each field
    for (i = 1; i <= NF; i++) {
        # check if there is <!
        if ($i ~ /<!/) {
            # if <! is found, strip everything from the front up to and including <!
            gsub(/.*<!/, "", $i)
            # print result
            print $i
        }
    }
}
' file
output
# ./run.sh
[Apple]
[Banana]
[Orange]
[Pear]
[Pineapple]
No complicated regex needed.
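The same split-and-trim algorithm, sketched in Python with plain string methods (fruit_tags is a hypothetical name):

```python
TEXT = ("<![Apple]!>some garbage text may be here<![Banana]!>"
        "some garbage text may be here<![Orange]!><![Pear]!><![Pineapple]!>")


def fruit_tags(text):
    """Split on '!>' and keep what follows the last '<!' in each field."""
    tags = []
    for field in text.split("!>"):
        start = field.rfind("<!")  # last "<!" in the field, like the greedy .*<! in awk
        if start != -1:
            tags.append(field[start + 2:])
    return tags


print(fruit_tags(TEXT))
# ['[Apple]', '[Banana]', '[Orange]', '[Pear]', '[Pineapple]']
```

Like the awk version, this assumes that neither "!>" nor "<!" occurs inside a fruit name.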

Need regexp to find substring between two tokens

I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
Any and all help is greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
    System.out.println(m.group(1));
}
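For comparison, the same find-and-read-the-group idea can be sketched in Python, using the signature from the question's pseudo-code. The function below is an illustration, not part of any library; it escapes both tokens so that they are treated as literal text.

```python
import re


def get_inner_string(text, start_token, end_token):
    """Return the text between start_token and the next end_token, or None."""
    # Lazy (.*?) stops at the first end_token after start_token.
    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


my_string = "A=abc;B=def_3%^123+-;C=123;"
print(get_inner_string(my_string, "B=", ";"))  # def_3%^123+-
```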
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.