I am looking for a regex that captures everything between the first " and the last " of a string that may contain further ".
$a='"xyz"kljhkljh"lkjhlkj"';
#b=$a=~ m/^"(.*)"$/m;
seems not to work?
There is no \n at the line end.
The reason that yours is not working is that you are trying to restrict the first quotation mark to occurring at the beginning of the string or immediately after a newline anywhere therein and the last quotation mark to occurring either at the end of the string or immediately before a newline anywhere therein.
That is not what your data contains. Don’t make this harder than it need be.
If you want everything between the first double quote and the last one, including others, then you want
($content) = $string =~ /"(.*)"/sx;
If you want lots of them, and no double quotes inside, you want:
(@contents) = $string =~ /"([^"]*)"/gx;
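For readers outside Perl, here is a rough Python sketch of the same two patterns. It is only an illustration: the variable names are made up for this example, and re.DOTALL stands in for Perl's /s flag.

```python
import re

text = '"xyz"kljhkljh"lkjhlkj"'

# Greedy .* runs from the first double quote to the last one.
outer = re.search(r'"(.*)"', text, re.DOTALL)
content = outer.group(1) if outer else None

# A negated class stops at the next quote, so findall returns
# each quoted piece that has no double quote inside it.
pieces = re.findall(r'"([^"]*)"', text)

print(content)
print(pieces)
```

The first pattern keeps the inner quotes; the second skips the unquoted text between quoted pieces.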
In your second comment to tchrist's answer you say that the first and last quotes should be at the beginning and end of the string? If that's the case, you don't even need a regular expression at all, just take the entire string minus the first and last characters:
substr($a, 1, -1)
For some reason I can't add a comment, so I'm creating an answer to answer bootware's comment on tchrist's answer. The difference between ($content)=$string=~/"(.*)"/sx and $content=$string=~/"(.*)"/sx is that the former matches in list context and the latter in scalar context. In scalar context the result is simply a 1 or 0 indicating whether the string matched the regex. In list context, a list is returned containing the substring that matched each parenthesized portion of the regex, in order from left to right. In this case there was one set of parentheses in the regex, so the list returned had one element, the portion of the string that was inside the quotes.
Bonus: You can refer to the substrings matched in each set of parentheses using $1, $2, ...
Related
I am working with government measures and am required to parse a string that contains variable information based on delimiters that come from issuing bodies associated with the FDA.
I am trying to retrieve the delimiter and the value after the delimiter. I have searched for hours for a regex solution that retrieves both the delimiter and the value that follows it and, though there seem to be posts that handle this, the code in those posts hasn't worked.
One of the major issues in this task is that the delimiters often have repeated characters. For instance: delimiters are used such as "=", "=,", "/=". In this case I would need to tell the difference between "=" and "=,".
Is there a regex that would handle all of this?
Here is an example of the string :
=/A9999XYZ=>100T0479&,1Blah
Notice the delimiters are:
"=/"
"=>"
"&,1"
Any help would be appreciated.
You can use a regex like this
(=/|=>|&,1)|(\w+)
The idea is that the first group contains the delimiters and the second group the content. I assume the content consists of word characters (letters, digits and underscore). You then have to grab the content of every capturing group.
You need to capture both the delimiter and the value as group 1 and 2 respectively.
If your values are all alphanumeric, use this:
(&,1|\W+)(\w+)
If your values can contain non-alphanumeric characters, it gets complicated:
(=/|=>|=,|=|&,1)((?:.(?!=/|=>|=,|=|&,1))+.)
List the delimiters longest first, e.g. "=," before "=", otherwise the alternation, which tries its alternatives left to right, will match "=" and the comma will become part of the value.
This uses a negative look ahead to stop matching past the next delimiter.
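A minimal Python sketch of the same idea, assuming the delimiter set is known up front. The DELIMS list and the variable names are made up for this example, and the lookahead is placed before each character (a slight variation on the regex above) so that one-character values are also accepted.

```python
import re

# Hypothetical delimiter list; sorted longest first so "=," wins over "=".
DELIMS = ["=/", "=>", "=,", "=", "&,1"]
alt = "|".join(re.escape(d) for d in sorted(DELIMS, key=len, reverse=True))

# Group 1: a delimiter. Group 2: every following character that does not
# start another delimiter (checked by the negative lookahead).
pattern = re.compile(rf"({alt})((?:(?!{alt}).)+)")

pairs = pattern.findall("=/A9999XYZ=>100T0479&,1Blah")
for delim, value in pairs:
    print(delim, value)
```

Sorting by length is a convenience here; with a fixed list you could just write the alternation by hand in the right order.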
I am trying to match the first instance of the value of Timestamp in one expression and the last instance of the value of Timestamp in another expression:
{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}
Does anyone know how to do that?
Take advantage of regexp's greediness: the * operator will take as many matches as it can find. So the approach here is to match the explicit pattern at the beginning and end of the regexp with a .* in the middle. The .* will slurp up as many characters as it can subject to the rest of the regexp also matching.
/(${pattern}).*(${pattern})/
Here, ${} represents interpolation. The syntax will vary with your language; in Ruby it would be #{}. I have chosen to capture the entire pattern; you can instead put the () capture around the timestamp value, but I find this easier to read and maintain. This regexp will match two instances of $pattern with as much stuff in between as it can fit, thus guaranteeing that you have the first and last.
If you want to be more strict, you could enforce the pattern in the middle as well, *'ing the full pattern rather than just .:
/${pattern},\s*(?:${pattern},\s*)*${pattern}/
Ask in the comments if you don't understand any piece of this regexp.
One pattern we can use is /\{[^}]+\'Timestamp\'[^}]+\}/. Note that this pattern assumes that Timestamp is the LAST key; if this is not always true you need to add a bit more to this pattern.
So the total pattern for the first example will be:
str =~ /(${pattern}).*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}).*({[^}]+'Timestamp'[^}]+})/
Then, $1 and $2 are the first and last hashes that match the Timestamp key. Again, this matches the entire pattern rather than only the timestamp value itself, but it should be straightforward from there to extract the actual timestamp value.
For the second, more strict example, and the reason I did not want to capture the timestamp value inside the pattern itself, we have:
str =~ /(${pattern}),\s*(?:${pattern},\s*)*(${pattern})/
Or, without interpolation:
str =~ /({[^}]+'Timestamp'[^}]+}), *(?:{[^}]+'Timestamp'[^}]+}, *)*({[^}]+'Timestamp'[^}]+})/
We still have the correct results in $1 and $2 because we explicitly chose NOT to put a capturing group inside the pattern.
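If only the timestamp values are needed, the greedy first-and-last trick can be sketched in Python as follows. This is an illustration under assumptions: the names data, stamp, first and last are invented here, and the capture is put around the value rather than the whole hash.

```python
import re

data = (
    "{'Latitude': 50.00001,'Longitude': 2.00002,'Timestamp': '00:10:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:20:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:25:00'},"
    "{'Latitude': 50.0,'Longitude': 2.0,'Timestamp': '00:37:00'}"
)

# One Timestamp entry, capturing only the quoted value.
stamp = r"'Timestamp':\s*'([^']*)'"

# Leftmost match gives the first entry; the greedy .* then pushes the
# second copy of the pattern as far right as it can go.
m = re.search(stamp + r".*" + stamp, data, re.DOTALL)
first, last = m.groups() if m else (None, None)

print(first, last)
```

Note that this needs at least two Timestamp entries to match at all; with a single entry you would have to fall back to a plain search for one.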
I have a string which can contain any number of the delimiter §\n. I would like to remove all delimiters from a string, except the last occurrence which should be left as-is. The last delimiter can be in three states: \n, §\n or §§\n. There will never be any characters after the last variable delimiter.
Here are 3 examples with the different state delimiters:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
I would like to remove all delimiters except the last occurrence.
So the result of gsub for the three examples above should be:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
Using regular expressions, one could use §\\n(?=.), which matches properly for all three cases using positive lookahead, as there will never be any characters after the last variable delimiter.
I know I could check if the string has the delimiter at the end, and then after a substitution using the Lua pattern §\n I could add the delimiter back onto the string. That is however a very inelegant solution to a problem which should be possible to solve using a Lua pattern alone.
So how could this be done using a Lua pattern?
str:gsub( '§\\n(.)', '%1' ) should do what you want. This deletes the delimiter when it is followed by another character, and puts that character back into the string.
Test code
local str = {
'abc§\\ndef§\\nghi\\n',
'abc§\\ndef§\\nghi§\\n',
'abc§\\ndef§\\nghi§§\\n',
}
for i = 1, #str do
print( ( str[ i ]:gsub( '§\\n(.)', '%1' ) ) )
end
yields
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
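For comparison outside Lua, the lookahead regex from the question can be tried in Python. This is just a sketch: as in the Lua test code, \n here is a literal backslash followed by n (hence the raw strings), and the names samples and cleaned are made up.

```python
import re

# Raw strings: each \n below is a backslash plus the letter n.
samples = [
    r"abc§\ndef§\nghi\n",
    r"abc§\ndef§\nghi§\n",
    r"abc§\ndef§\nghi§§\n",
]

# Remove a delimiter only when at least one character follows it.
pattern = re.compile(r"§\\n(?=.)")
cleaned = [pattern.sub("", s) for s in samples]

for line in cleaned:
    print(line)
```

Because the lookahead consumes nothing, no capture group or back-reference is needed in the replacement.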
EDIT: This answer doesn't work specifically for Lua, but if you have a similar problem and are not constrained to Lua you might be able to use it.
So if I understand correctly, you want a regex replace to make the first example look like the second. This:
/(.*?)§\\n(?=.*\\n)/g
will eliminate the non-last delimiters when replaced with
$1
in PCRE, at least. I'm not sure what flavor Lua follows, but you can see the example in action here.
REGEX:
/(.*?)§\\n(?=.*\\n)/g
TEST STRING:
abc§\ndef§\nghi\n
abc§\ndef§\nghi§\n
abc§\ndef§\nghi§§\n
SUBSTITUTION:
$1
RESULT:
abcdefghi\n
abcdefghi§\n
abcdefghi§§\n
I want to compare a string from one file to a string from another. The second file may contain some extra elements, and such an element can occur anywhere and any number of times.
Note: these tags need to be retained in the final output.
For example:
I want to match the word 'scripting'. The <match> tag indicates the word to be matched in $str2.
$str1 = "perl is an <match>scripting</match> language";
$str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age";
Output required :
perl is an <match>s<?..?>cr<?..?>ipti<?..?>ng</match> langu<?..?>age
I am adding pattern after each character:
$str1 =~ s{(.)}
{
'$&(?:(?:<?...?>|\n)+)?'
}esgi;
This works for a few cases, but for others it just keeps running. Please suggest a fix.
(?:(?:<?...?>|\n)+)? is the same as (?:<?...?>|\n)*. Also, you don't want to add the pattern after each character, only between the characters of the matched part of $str1: no pattern before the first character and none after the last. Otherwise the match will extend around tags that precede or follow the word, whereas you want the <match> tags to sit right at the front and back of the word. My guess is that if you are running that first replace command over all of $str1, you may end up with quite a large string. Also see my answer to a related question here.
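One way to picture the "pattern only between the characters" advice is this Python sketch. It makes assumptions the thread does not confirm: tags look like <?...?> and contain no ">", and the helper name wrap_match is made up.

```python
import re

# Optional run of tags or newlines; assumed tag shape is <?...?>.
TAG = r"(?:<\?[^>]*\?>|\n)*"

def wrap_match(word, text):
    """Wrap <match> tags around word, tolerating tags between its letters."""
    # Join the escaped letters with TAG, so nothing optional sits
    # before the first letter or after the last one.
    pattern = TAG.join(re.escape(ch) for ch in word)
    return re.sub(pattern, lambda m: "<match>" + m.group(0) + "</match>", text)

str2 = "perl is an s<?..?>cr<?..?>ipti<?..?>ng langu<?..?>age"
result = wrap_match("scripting", str2)
print(result)
```

Keeping the tag pattern as a single starred group with a bounded character class avoids the nested optional quantifiers that tend to cause runaway backtracking.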
I suspect this has already been answered somewhere, but I can't find it, so...
I need to extract a string from between two tokens in a larger string, in which the second token will probably appear again meaning... (pseudo code...)
myString = "A=abc;B=def_3%^123+-;C=123;" ;
myB = getInnerString(myString, "B=", ";" ) ;
method getInnerString(inStr, startToken, endToken){
return inStr.replace( EXPRESSION, "$1");
}
so, when I run this using expression ".+B=(.+);.+"
I get "def_3%^123+-;C=123;" presumably because it just looks for the LAST instance of ';' in the string, rather than stopping at the first one it comes to.
I've tried using (?=) in search of that first ';' but it gives me the same result.
I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.
any and all help greatly appreciated.
Similar question on SO:
Regex: To pull out a sub-string between two tags in a string
Regex to replace all \n in a String, but no those inside [code] [/code] tag
Replace patterns that are inside delimiters using a regular expression call
RegEx matching HTML tags and extracting text
You're using a greedy pattern by not specifying the ? in it. Try this:
".+B=(.+?);.+"
Try this:
B=([^;]+);
This matches everything between B= and ; unless it is a ;. So it matches everything between B= and the first ; thereafter.
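To see the difference between the greedy, lazy and negated-class versions side by side, here is a small Python sketch. The helper get_inner_string mirrors the pseudocode in the question but is otherwise an invented name, and it uses a search instead of a replace.

```python
import re

s = "A=abc;B=def_3%^123+-;C=123;"

# Greedy: .+ backtracks only as far as the LAST semicolon.
greedy = re.search(r"B=(.+);", s).group(1)
# Lazy: .+? stops at the first semicolon it can.
lazy = re.search(r"B=(.+?);", s).group(1)
# Negated class: can never run past a semicolon.
negated = re.search(r"B=([^;]+);", s).group(1)

def get_inner_string(text, start_token, end_token):
    """Return the text between start_token and the next end_token, or None."""
    pattern = re.escape(start_token) + r"(.*?)" + re.escape(end_token)
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else None

print(greedy, lazy, negated, get_inner_string(s, "B=", ";"))
```

re.escape is there so that tokens containing regex metacharacters are treated literally.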
(This is a continuation of the conversation from the comments to Evan's answer.)
Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):
String s = "A=abc;B=def_3%^123+-;C=123;";
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
System.out.println(m.group(1));
}
Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:
print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;
...so we content ourselves with hacks like this:
System.out.println("A=abc;B=def_3%^123+-;C=123;"
.replaceFirst(".+B=(.*?);.+", "$1"));
Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.