What would be the regex emulating GitHub's autolinked references?
It takes Markdown on input and outputs enriched Markdown where strings like #123 are converted to [#123](https://github.com/owner/repo/issues/123).
These are some examples of the transformations that I'd like the regex to do:
Input:
1. #123
2. https://github.com/owner/repo/issues/123
3. https://github.com/shoptet/sofa/pull/456
4. owner/repo#123
5. https://github.com/owner/repo/issues/123#issuecomment-123456789
Output:
1. [#123](https://github.com/owner/repo/issues/123)
2. [#123](https://github.com/owner/repo/issues/123)
3. [#123](https://github.com/owner/repo/pull/456)
4. [owner/repo#123](https://github.com/owner/repo/issues/123)
5. [#123 (comment)](https://github.com/owner/repo/issues/123#issuecomment-123456789)
I'd prefer one giant regex if possible (I know it's not going to be nice but would allow me to process Markdown in a couple of my favorite editors directly).
If you don't mind changing the format a little (using [#123-comment] instead of [#123 (comment)] for comments), you may use this:
(?:(owner/repo)?#(\d+)\b|https?://github\.com/([^/]+/[^/]+/(?:issues|pull))/(\d+)(#issue(comment)(-)\d+)?)
Replace by: [\1#\2\4\7\6](https://github.com/owner/repo/issues/\2\4\5)
You have a demo here.
I'd still prefer a (complex) regex but if anyone is looking for the same post-processing like me, this package can solve it in a Node.js script:
https://github.com/remarkjs/remark-github
Two cases
1. Key<A, M> desc = newKey();
2. Property<B, N> type = newKey("type", B.bar);
The RegExp and replace
find: (?:Key|Property)<(.*), (.*)> (.*) = newKey\((.*)\);
rep.: Foo<C$1, $2> $3 = pl.nP("$3", $2.class); // ($4)
The Result
1. Foo<CA, M> desc = pl.nP("desc", M.class); //
2. Foo<CB, N> type = pl.nP("type", N.class); // ("type", B.bar)
The Problem:
Now I want to avoid the empty comment at the line 1.
Is there a way to write the $4 and the stuff around it only if $4
isn't empty?
You could remove empty comments afterwards with another regular expression.
EDIT
Another solution would be to deal with the special case separately (... = newKey\(\)).
Perhaps you could automate this process with a simple script, if the tedium of repetitive typing becomes too great(eg. when dealing with multiple conditionals).
As far as I know, there isn't any 'intelligence' built into the replace field in Sublime Text; all you can do is to assemble the captured pieces to your liking.
While skimming through a few Google search results yesterday, I found an article about conditional patterns in Perl, but nothing pertaining to the problem at hand.
For the sake of full disclosure, I should say that I am in no sense an expert in the field, so I could be wrong. I do however have some experience with the Python API for
Sublime Text. It might be possible to implement this functionality yourself, if it doesn't already exist within the plethora of extensions available.
I'm sorry if this sounds like a very long-winded 'uh uh', but I'll be on the lookout for a general solution.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I'm trying to search for the word Gadaffi, which can be spelled in many different ways. What's the best regular expression to search for this?
This is a list of 30 variants:
Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi
My best attempt so far is:
\b[KG]h?add?af?fi$\b
But I still seem to be missing some variants. Any suggestions?
Easy... (Qadaffi|Khadafy|Qadafi|...)... it's self-documented, maintainable, and assuming your regexp engine actually compiles regular expressions (rather than interpreting them), it will compile to the same DFA that a more obfuscated solution would.
Writing compact regular expressions is like using short variable names to speed up a program. It only helps if your compiler is brain-dead.
\b[KGQ]h?add?h?af?fi\b
Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).
Btw, why is there a $ at the end of the regex?
Btw, nice article on the topic:
Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.
EDIT
To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D
\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
One interesting thing to note from your list of potential spellings is that there's only 3 Soundex values for the contained list (if you ignore the outlier 'Kazzafi')
G310, K310, Q310
Now, there are false positives in there ('Godby' also is G310), but by combining the limited metaphone hits as well, you can eliminate them.
<?
$soundexMatch = array('G310','K310','Q310');
$metaphoneMatch = array('KTF','KTHF','FTF','KHTF','K0F');
$text = "This is a big glob of text about Mr. Gaddafi. Even using compound-Khadafy terms in here, then we might find Mr Qudhafi to be matched fairly well. For example even with apostrophes sprinkled randomly like in Kad'afi, you won't find false positives matched like godfrey, or godby, or even kabbadi";
$wordArray = preg_split('/[\s,.;-]+/',$text);
foreach ($wordArray as $item){
$rate = in_array(soundex($item),$soundexMatch) + in_array(metaphone($item),$metaphoneMatch);
if ($rate > 1){
$matches[] = $item;
}
}
$pattern = implode("|",$matches);
$text = preg_replace("/($pattern)/","<b>$1</b>",$text);
echo $text;
?>
A few tweaks, and lets say some cyrillic transliteration, and you'll have a fairly robust solution.
Using CPAN module Regexp::Assemble:
#!/usr/bin/env perl
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
$ra->add($_) for qw(Gadaffi Gadafi Gadafy Gaddafi Gaddafy
Gaddhafi Gadhafi Gathafi Ghadaffi Ghadafi
Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi
Kaddafi Kadhafi Kazzafi Khadaffy Khadafy
Khaddafi Qadafi Qaddafi Qadhafi Qadhdhafi
Qadthafi Qathafi Quathafi Qudhafi Kad'afi);
say $ra->re;
This produces the following regular expression:
(?-xism:(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi))
I think you're over complicating things here. The correct regex is as simple as:
\u0627\u0644\u0642\u0630\u0627\u0641\u064a
It matches the concatenation of the seven Arabic Unicode code points that forms the word القذافي (i.e. Gadaffi).
If you want to avoid matching things that no-one has used (ie avoid tending towards ".+") your best approach would be to create a regular expression that's just all the alternatives (eg. (Qadafi|Kadafi|...)) then compile that to a DFA, and then convert the DFA back into a regular expression. Assuming a moderately sensible implementation that would give you a "compressed" regular expression that's guaranteed not to contain unexpected variants.
If you've got a concrete listing of all 30 possibilities, just concatenate them all together with a bunch of "ors". Then you can be sure that it only matches the exact things you've listed, and no more. Your RE engine will probably be able to optimize in further, and, well, with 30 choices even if it doesn't it's still not a big deal. Trying to fiddle around with manually turning it into a "clever" RE can't possibly turn out better and may turn out worse.
(G|Gh|K|Kh|Q|Qh|Q|Qu)(a|au|e|u)(dh|zz|th|d|dd)(dh|th|a|ha|)(\x27|)(a|)(ff|f)(i|y)
Certainly not the most optimized version, split on syllables to maximize matches while trying to make sure we don't get false positives.
Well since you are matching small words why don't you try a similarity search engine with the Levenshtein distance? You can allow at most k insertions or deletions. This way you can change the distance function to other things that work better for your specific problem. There are many functions available in the simMetrics library.
A possible alternative is the online tool for generate regular expressions from examples http://regex.inginf.units.it.
Give it a chance!
Why not do a mixed approach? Something between a list of all possibilities and a complicated Regex that matches far too much.
Regex is about pattern matching and I can't see a pattern for all variants in the list. Trying to do so, will also find things like "Gazzafy" or "Quud'haffi" which are most probably not a used variant and definitly not on the list.
But I can see patterns for some of the variants, and so I ended up with this:
\b(?:Gheddafi|Gathafi|Kazzafi|Kad'afi|Qadhdhafi|Qadthafi|Qudhafi|Qu?athafi|[KG]h?add?h?aff?[iy]|Qad[dh]?afi)\b
At the beginning I list the ones where I can't see a pattern, then followed by some variants where there are patterns.
See it here on www.rubular.com
I know this is an old question, but...
Neither of these two regexes is the prettiest, but they are optimized and both match ALL the variations in the original post.
"Little Beauty" #1
(?:G(?:a(?:d(?:d(?:af[iy]|hafi)|af(?:f?i|y)|hafi)|thafi)|h(?:ad(?:daf[iy]|af?fi)|eddafi))|K(?:a(?:d(?:['dh]a|af?)|zza)fi|had(?:af?fy|dafi))|Q(?:a(?:d(?:(?:(?:hd)?|t)h|d)?|th)|u(?:at|d)h)afi)
"Little Beauty" #2
(?:(?:Gh|[GK])adaff|(?:(?:Gh|[GKQ])ad|(?:Ghe|(?:[GK]h|[GKQ])a)dd|(?:Gadd|(?:[GKQ]a|Q(?:adh|u))d|(?:Qad|(?:Qu|[GQ])a)t)h|Ka(?:zz|d'))af)i|(?:Khadaff|(?:(?:Kh|G)ad|Gh?add)af)y
Rest in Peace, Muammar.
Just an addendum: you should add "Gheddafi" as alternate spelling. So the RE should be
\b[KG]h?[ae]dd?af?fi$\b
[GQK][ahu]+[dtez]+\'?[adhz]+f{1,2}(i|y)
In parts:
[GQK]
[ahu]+
[dtez]+
\'?
[adhz]+
f{1,2}(i|y)
Note: Just wanted to give a shot at this.
What else starts with Q, G, or K, has a d, z or t in the middle, and ends in "fi" the people actually search for?
/\b[GQK].+[dzt].+fi\b/i
Done.
>>> print re.search(a, "Gadasadasfiasdas") != None
False
>>> print re.search(a, "Gadasadasfi") != None
True
>>> print re.search(a, "Qa'dafi") != None
True
Interesting that I'm getting downvoted. Can someone leave some false positives in the comments?
In the Visual Studio 2010 "Productivity Power Tools" plugin (which is great), you can configure file tabs to be color coded based on regular expressions.
I have a RegEx to differentiate the tab color of Interface files (IMyInterface.cs) from regular .cs files:
[I]{1}[A-Z]{1}.*\.cs$
Unfortunately this also color codes any file that starts with a capital "I" (Information.cs, for example).
How could this RegEx be modified to only include files where the first letter is "I" and the second letter is not lowercase?
Your regexp should work as it is. It is possible that it is executed in ignore case mode. Try to disable that mode inside your regexp with (?-i):
(?-i)[I]{1}[A-Z]{1}.*\.cs$
Try this
"(?-i)^I[A-Z].*\.cs$"
Sets case insensitve off first.
Regular Expression Options
Filenames in Windows are not case-sensitive, so obviously Power Tools will be using case-insensitive matching.
How about this:
^I([A-Z][A-Za-z0-9]*){1}\.cs$
so
IMyInterface.cs // matches, MyInterface
IB.cs // B
IBa.cs // Ba
IC1.cs // C1
I.cs // don't
Information.cs // don't
Prooflink
I based mine off the default patterns placed in there and used ^I[A-Z].*\.cs[ ]*(\[read only\])?$ - I think that there is a precedence question, though, so that if you leave the default .cs pattern matcher in there and add yours to the end, you might have yours hidden, because it matched the general one first.
And you can't re-order or delete them, so it's a little fiddly to get the ordering working well ...
FWIW, I don't think the case-sensitivity question ((?-i) makes any difference.
I am making a large catalogue of all of the possible OS names that can be supported by my particular version of VMWare. Originally I was writing them all as they stood in the VMX files but then I found a website that had them all listed, the problem is they are not properly cased to provide a "perfect" match, would this be the perfect time to use the regex attribute for case insensitivity?
Also as a side question, would it be possibly extract the list of OSs from the website?. They look to be in a HTML formated chart. It would save me a lot of time having to type them all out.
I looked at HTML::Table extract, and I don't really understand how to use it. As far the table is concerned I was able to find the section in the websites code and I copied to a new html file so I can have it on my desktop.
This is odd, I am probably missing something. But I am not able to match with case insensitivity. When end my regex with /xmi I get this output;
Use of uninitialized value $guest_os in concatenation (.) or string at discovery4.pl line 146.
Which I have discovered mean that there is no match to associate to the scalar I am trying to print.
Anyhow I know I am having a problem with it not wanting to match with no case because if I modify winnetstandard to winNetStandard it works and says,;
Windows Server 2003, Standard Edition. Which is what it should say.
HTML::TableExtract can be helpful. As far as matching goes, I'm not sure what it is that you are trying to match; if you are just comparing two names, uc($foo) eq uc($bar) makes more sense. But if you have a regex and want the whole match to be case insensitive, /i will do that.
Ah, so you want to get the supported os names and assemble them into a regex and match using it? Then, given #osnames, you might want something like this:
my $osnames = join('|', map quotemeta, sort { length($b) <=> length($a) } #osnames);
my $regex = qr/guestOS\s*=\s*"(?i:$osnames)"/;
The ?i: limits the scope of case insensitivity to just the OS names; only if you want guestOS to also be case insensitive would you use /i (and (?:$osnames)).
This would be the right time to use the /i attribute, as changing the case can't really harm anything. What I would do to get the list of Operating Systems would be to copy the html of the sections where the list is, use regex on the list so that it outputs in the format you need it to, and then use the outputted text.