How to make this Regex match all occurrences of markdown links?

I am trying to implement a RegEx that will get all the occurrences of markdown links in this format [link_description](link_destination).
However, I have some requirements:
link destination MUST HAVE a space
link destination MUST NOT start with <
I got to this RegEx:
REGEX = /
(?<description>\[.*?\])
\(
(?<destination>
(?!<) # Do not start with a less-than symbol
.*\s.* # Have at least one empty space
)
\)
/x.freeze
It works great when there is only one occurrence, such as:
'[Contact us](mailto:foo#foo space)'.scan(REGEX)
=> [["[Contact us]", "mailto:foo#foo space"]]
However, current output for multiple occurrences:
"[Contact us](mailto:foo#foo space>) [Contact us](mailto:foo#foo space>)"
=> [["[Contact us]", "mailto:foo#foo space>) [Contact us](mailto:foo#foo space>"]]
Expected output:
"[Contact us](mailto:foo#foo space>) [Contact us](mailto:foo#foo space>)"
=> [["[Contact us]", "mailto:foo#foo space>"], ["[Contact us]", "mailto:foo#foo space>"]]
I tried changing it and added a [^)] to the end of the second capture, but still failing:
REGEX = /
(?<description>\[.*?\])
\(
(?<destination>
(?!<) # Do not start with a less-than symbol
.*\s.*
[^)]
)
\)
/x.freeze
What am I doing wrong?

The issue is that the greedy .* in the destination group runs all the way to the last ) in the input string, so both links are swallowed into a single match. Adding [^)] at the end does not help, because it only constrains the final character of the destination. To fix this, either make the quantifiers non-greedy (.*?\s.*?) so the match stops at the first closing parenthesis, or exclude ) from the destination altogether ([^)]*\s[^)]*).
Either change should give you the expected output for multiple occurrences.
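As a minimal sketch of that fix, here is the negated-character-class variant. The constant name LINK_REGEX and the tightened description group are my own choices for illustration, not something from the question:

```ruby
# Sketch: markdown links whose destination contains whitespace and does
# not start with a less-than sign. LINK_REGEX is an illustrative name.
LINK_REGEX = /
  (?<description>\[[^\]]*\])   # bracketed description
  \(
  (?<destination>
    (?!<)                      # must not start with a less-than sign
    [^)\s]*\s[^)]*             # needs one whitespace, never crosses a closing paren
  )
  \)
/x.freeze

input = '[Contact us](mailto:foo#foo space>) [Contact us](mailto:foo#foo space>)'
matches = input.scan(LINK_REGEX)
p matches
```

Excluding ) rather than relying on lazy quantifiers also keeps a link without a space from dragging the following link into its match.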

Related

Using RegEx to grab a field in brackets

I have multiple square-bracketed values in a Splunk log file. I am attempting to find a particular field named UserDataGuid and then gather the data in the bracket after it. My only option seems to be regular expressions, in a dialect that looks similar to Perl to me, yet my attempts do not work. What am I doing wrong here?
| rex "\]\s(?<UserDataGuid>.*?)\s*$"
// this trial looks more promising but grabs the last bracket :( and doesn't name the field, to be used in a subSearch.
| rex "(?i)UserDataGuid\s*\[([^\}]*)\]"
the data looks like this
[21] INFO UserDataGuid [fas08f0da-faf6-4308-aad6-hfld5643gs] [(null)] [(null)] [(null)]
and I want only the guid
fas08f0da-faf6-4308-aad6-hfld5643gs
and I would love for it to be a field I could reuse like fields are used in splunk.
It looks like you want
(?<=UserDataGuid\s\[)([^\]]*)
I'd try the following regex:
(?<=UserDataGuid \[).*?(?=\])
This will capture fas08f0da-faf6-4308-aad6-hfld5643gs.
With
\]\s(?<UserDataGuid>.*?)\s*$
you say: match a ] (\]), followed by exactly one whitespace character (\s), followed by a group named UserDataGuid ((?<UserDataGuid> ... )) that contains any characters except newline, zero or more times, as few as possible (.*?), followed by zero or more whitespace characters (\s*), followed by the end of the string ($).
I don't think you want (?<UserDataGuid> ... ) here.
You want to match the literal text UserDataGuid in the input, not to give the name UserDataGuid to the group that matches .*?.
In
(?i)UserDataGuid\s*\[([^\}]*)\]
change the } to a ], and your GUID is captured in group #1.
But you don't need to match UserDataGuid\s*\[ itself.
You could use a lookbehind instead:
(?<=UserDataGuid \[)([^\]]*)
Then you match only the GUID, and it is also available in group #1.
You can remove the parentheses of group #1, because the GUID is already the full match:
(?<=UserDataGuid \[)[^\]]*
https://regex101.com/r/sI3kW4/1
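To see both approaches side by side, here is a small Ruby sketch run against the sample log line. The variable names are mine. The named-group form is the shape a field-extracting command would typically need, but I have not run it against Splunk itself, so treat that part as an assumption:

```ruby
line = '[21] INFO UserDataGuid [fas08f0da-faf6-4308-aad6-hfld5643gs] [(null)] [(null)] [(null)]'

# Lookbehind version: the full match is the GUID itself.
guid_from_lookbehind = line[/(?<=UserDataGuid \[)[^\]]*/]

# Named-group version: match the literal label, then name the bracket contents.
m = line.match(/UserDataGuid\s*\[(?<UserDataGuid>[^\]]*)\]/)
guid_from_named_group = m && m[:UserDataGuid]

puts guid_from_lookbehind
puts guid_from_named_group
```

Both forms stop at the first ] after the label, so the later [(null)] brackets are never included.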

Replacing a single term in a regex pattern

I am using regexp_filter in Sphinx to replace terms
In most cases I can do so e.g. misspellings are easy:
regexp_filter = Backround => Background
Even swapping using capturing group notation:
regexp_filter = (Left)(Right) => \2\1
However, I am having more trouble when using a pattern match to find a given word I want to replace:
regexp_filter = (PatternWord1|PatternWord2)\W+(?:\w+\W+){1,6}?(SearchTerm)\b => NewSearchTerm
Where NewSearchTerm would be the term I want to replace just \2 with (leaving \1 and the rest of the pattern alone).
So if I had the text 'Pizza and Taco Parlor' then:
regexp_filter = (Pizza)\W+(?:\w+\W+){1,6}?(Parlor)\b => Store
Would convert to 'Pizza and Taco Store'
I know in this case the SearchTerm is \2, but I am not sure how to convert it. I know I could append to it, e.g. \2s to make it plural, but how can I actually replace it, given that it is just one capturing group of several and I only want to replace that group?
So, if I understand the question, you have strings that match the following criteria:
Begin with PatternWord1 or PatternWord2
Followed by one or more non-word characters (\W+)
Followed by one to six words, each with trailing non-word characters, as few as needed ((?:\w+\W+){1,6}?)
Followed by "SearchTerm" at a word boundary
Let's use this as a baseline:
PatternWord1 Hello SearchTerm
And you only want to replace SearchTerm in the string.
So you need another capturing group around everything you want to keep:
regexp_filter = ((PatternWord1|PatternWord2)\W+(?:\w+\W+){1,6}?)(SearchTerm)\b => \1World
Your capturing groups would match:
PatternWord1 Hello (including the trailing space)
PatternWord1
SearchTerm
Your result would be:
PatternWord1 Hello World
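The same idea can be checked outside Sphinx. This is only a Ruby sketch of the technique (wrap what you keep in group 1, emit \1 followed by the new term), using the Pizza example from the question. Whether regexp_filter treats the pattern identically is an assumption:

```ruby
# Group 1 wraps everything to keep; group 3 is the term being replaced.
pattern = /((Pizza)\W+(?:\w+\W+){1,6}?)(Parlor)\b/

result = 'Pizza and Taco Parlor'.sub(pattern, '\1Store')
puts result

# With no word between the two terms the pattern does not match,
# so the text is left unchanged.
unchanged = 'Pizza Parlor'.sub(pattern, '\1Store')
puts unchanged
```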

Search with regular expression in Sublime Text 2

I want to create a rule to remove array( and ) from this text:
"price"=> array(129),
to get:
"price"=> 129,
I tried this expression without success:
(?<="price"=>\s*)array\((?=\d*)\)(?=,)
Then I decided to make the replacement in 2 steps. Firstly, I removed array(:
(?<="price"=>\s\s\s\s\s)array\(
And got:
"price"=> 129),
So I had to remove only a closing parenthesis ). I tried without success:
(?<="price"=>\s*\d*)\)(?=,)
This works, but only for a known number of whitespaces and digits:
(?<="price"=>\s\s\s\s\s\d\d\d)\)(?=,)
Try this for the find:
("price"=>\s+)array\((\d+)\)
and this for the replace:
\1\2
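A quick way to sanity-check this find/replace pair is to run it through another engine. The Ruby sketch below is an illustration, not Sublime Text itself. The likely reason the look-behind attempts in the question failed is that many regex engines only accept look-behind of a fixed length, which rules out \s* and \d* inside (?<= ... ); capture groups sidestep that restriction:

```ruby
text = '"price"=> array(129),'

# Group 1 keeps the key and the spacing, group 2 keeps the digits.
find = /("price"=>\s+)array\((\d+)\)/
result = text.gsub(find, '\1\2')
puts result
```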
you can match whole line with this
\"price"[^a)]+(array\()\d+(\),)
it contains one group for "array(" and another for "),"
Try this:
(?:(?<=\"price\"=>\s*)array\((?=\d+\)))|(?<=\"price\"=>\s*array\(\d+)\)
The regex consists of two main parts (the pipe in the middle is the alternation symbol, which means that if the first part doesn't match, the engine tries the second part).
The first part checks that array( is preceded by "price"=> ... and followed by digits and a ), using the look-behind (?<= ... ) and look-ahead (?= ... ) constructs respectively.
(?:(?<=\"price\"=>\s*)array\((?=\d+\)))
Then we have a pipe (explained above)..
|
The second part checks that ) is preceded by everything we've matched before ("price"=> array(129), also using the look-behind construct (?<= ... ):
(?<=\"price\"=>\s*array\(\d+)\)
Thus for the string "price"=> array(129), the result should be two matches: array( and ).
Please let me know if this works for you.

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grab" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must start with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language, so I'll just use Perl. Basically, instead of matching everything, I matched exactly what I thought you needed. Please correct me if I am wrong.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Output:
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
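If the parameters may appear in any order, one regex-only alternative is to capture each value from its own optional lookahead instead of repeating the alternation per position. The sketch below is my own variation, not the pattern above. The names ROUTE and extract_params are illustrative, and it is written for Ruby's engine:

```ruby
# Sketch: order-independent capture of firstparam and secondparam.
ROUTE = %r{
  \A/parent/child/?
  (?:
    \?
    # Each lookahead skips whole "name=value&" segments, then captures its value.
    (?=(?:(?:[^&]*&)*?firstparam=([^&]*))?)
    (?=(?:(?:[^&]*&)*?secondparam=([^&]*))?)
    .*
  )?
  \z
}x

# Returns [firstparam, secondparam] (nil when absent), or nil if the route does not match.
def extract_params(route)
  m = ROUTE.match(route)
  m && [m[1], m[2]]
end

p extract_params('/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789')
p extract_params('/parent/child/firstparam=abc123&secondparam=def456')
```

Because the query part must begin with a literal ?, the input without a question mark is rejected, while the bare /parent/child and /parent/child/ forms still match with both captures empty.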
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my @items = split '&', $query; # Each item is something like param=123
foreach my $item (@items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
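For comparison, the same split-into-a-hash approach ports to Ruby roughly as follows. The method name parse_route is hypothetical, and unlike the Perl example I made the trailing slash and the query string optional so that the bare routes from the question also match:

```ruby
# Sketch: use the regex only to validate the route, then split the query string.
def parse_route(route)
  m = route.match(%r{\A/parent/child/?(?:\?(.*))?\z})
  return nil unless m                      # not our route (e.g. missing "?")

  (m[1] || '').split('&').each_with_object({}) do |item, params|
    next if item.empty?                    # tolerate a stray "&"
    key, value = item.split('=', 2)        # each item looks like param=123
    params[key] = value
  end
end

p parse_route('/parent/child/?first=123&second=345&third=678')
p parse_route('/parent/child/firstparam=abc123&secondparam=def456')
```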
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explanation:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only the query string, the pattern reduces to:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, I check whether the line contains a ? at all.
Then, I remove the first part of the line (everything up to and including the ?).
Next, I split the rest by &, and split each resulting item by =.
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my @params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (@params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}

Regex with lookahead

I can't seem to make this regex work.
The input is as follows. It's really all on one row, but I have inserted line breaks after each \r\n so that it's easier to read, so no checks for space characters are needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and detect the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Can this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
/
\G
(
(?:
[0-9]{2}-[0-9]{2}\r\n
){2}
(?:
(?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
)*
)
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
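The split-on-lookahead idea carries over to other languages. Here is a Ruby sketch, assuming the input contains real CR LF characters rather than the literal backslash sequences used in the Perl string above; the variable names are mine:

```ruby
input = "01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n" \
        "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n"

# Split before every pair of consecutive dates. The lookahead consumes
# nothing, so the dates stay at the start of the record they begin.
records = input.split(/(?=(?:\d{2}-\d{2}\r\n){2})/)

# Each record can then be broken into its columns.
rows = records.map { |record| record.split("\r\n") }
rows.each { |row| puts row.join("\t") }
```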