Regex skipping first match? - regex

I'm trying to match every plain "Twitter-like hashtag" in a text and turn each one into a hyperlink.
I've had some success, but strangely enough my regular expression /(^|\s)#(\w*[a-zA-Z_]+\w*)/ skips the first hashtag when the string starts with the # sign (the rest are matched properly). If I write anything else before it, the first hashtag is matched properly. It fails only when a # sign appears to be the first character of the string.
Do you know why that could be?
function make_hashes_into_twitter_hashtag_urls($content){
    $content = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '<span class="color_my_hash">\1#</span>\2 ', $content);
    // A filter callback must return the modified content rather than echo it
    return $content;
}
add_filter('the_content','make_hashes_into_twitter_hashtag_urls');
Thanks,

The the_content(); template tag outputs HTML and text, and HTML and regular expressions don't play nicely together.
I don't know exactly why something like <p># (which is what the_content(); was outputting) made my regex fail.
I'd love to know why, and what my regex should look like to avoid this trap (even if, as stated, mixing regex and HTML is not recommended).
My question is mostly answered, but any additional details on how to hack or work around this situation would be hugely appreciated!
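A likely explanation for the behaviour described above: in <p>#tag the # is preceded by >, which is neither the start of the string nor whitespace, so (^|\s) cannot match at that position. Here is a minimal sketch in Python; the sample strings, the helper name find_tags and the extra > alternative are assumptions for illustration, and this is a workaround rather than a robust way to process HTML:

```python
import re

# The pattern from the question: a hashtag must follow start-of-string or whitespace.
ORIGINAL = re.compile(r'(^|\s)#(\w*[a-zA-Z_]+\w*)')

# Hypothetical relaxed variant: also accept a hashtag right after a closing '>'.
RELAXED = re.compile(r'(^|\s|>)#(\w*[a-zA-Z_]+\w*)')


def find_tags(pattern, text):
    """Return the hashtag names (without '#') that the pattern finds in text."""
    return [m.group(2) for m in pattern.finditer(text)]


plain = "#first and #second"
html = "<p>#first and #second</p>"

print(find_tags(ORIGINAL, plain))  # both tags are found in plain text
print(find_tags(ORIGINAL, html))   # the tag right after <p> is missed
print(find_tags(RELAXED, html))    # the relaxed variant finds both
```

The relaxed pattern would also match a # that directly follows any other >, so it is only as reliable as the markup it runs on.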

Related

Regex to match everything except a pattern

Regex noob here struggling with this, which I know will be easy for some of you regex gods out there!
Given the following:
title: Some title
date: 2022-08-15
tags: <value to extract>
identifier: 1234567
---------------------------
Some text
some more text
I would like a regex to match everything except the value of tags (i.e. the "<value to extract>" text).
For context, this is supposed to run on emacs (in case it matters).
EDIT: Just to clarify, as per @phils' question, all I care about is extracting the tags value. However, this is done via a package setting that asks for a regex string, and I don't have much control over how it gets used. It seems to expect a regex that strips what I don't need from the string rather than one that matches what I do want, which is slightly annoying. Also, since it seems to match everything with \\(.\\), I'm guessing it's using the global flag?
Please let me know if any of this isn't clear.
Emacs regular expressions can't trivially express "not foo" for arbitrary values of foo. (The likes of PCRE have non-regular extensions for zero-width negative look-ahead/behind assertions, but in Emacs that sort of functionality is generally done with the support of Lisp code [1].)
You can still do it purely with regexp matching, but it's simply very cumbersome. An Emacs regexp which matches any line which does not begin with tags: is:
^\(?:$\|[^t]\|t[^a]\|ta[^g]\|tag[^s]\|tags[^:]\).*
or if you need to enter it in the elisp double-quoted read syntax for strings:
"^\\(?:$\\|[^t]\\|t[^a]\\|ta[^g]\\|tag[^s]\\|tags[^:]\\).*"
[1] In Lisp code you would instead simply check each line to see whether it does start with tags: and, if so, skip it (which is why Emacs generally gets away without the feature you're looking for, but of course that doesn't help you here).
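To illustrate the idea outside Emacs, here is a rough Python sketch that translates the alternation above into Python's regex syntax and compares it with a negative look-ahead of the kind PCRE-style engines offer. The sample text is the one from the question; the variable names are made up for this example, and Python's re module is used only as a stand-in for a look-ahead-capable engine:

```python
import re

text = "\n".join([
    "title: Some title",
    "date: 2022-08-15",
    "tags: <value to extract>",
    "identifier: 1234567",
    "---------------------------",
    "Some text",
    "some more text",
])

# The cumbersome alternation, rewritten in Python syntax:
# a line is accepted as soon as it diverges from the prefix "tags:".
ALTERNATION = re.compile(r'^(?:$|[^t]|t[^a]|ta[^g]|tag[^s]|tags[^:]).*', re.MULTILINE)

# The same intent expressed with a negative look-ahead.
LOOKAHEAD = re.compile(r'^(?!tags:).*', re.MULTILINE)

kept_by_alternation = ALTERNATION.findall(text)
kept_by_lookahead = LOOKAHEAD.findall(text)

for line in kept_by_alternation:
    print(line)  # every line except the one starting with "tags:"
```

On this sample both patterns keep the same six lines; the alternation form can behave differently on very short lines (for example a line consisting only of "t"), which is part of why it is described as cumbersome.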
After playing around with it for a bit and taking inspiration from @phils' answer, I've come up with the following:
"^\\(?:\\(#\\+\\)?\\(?:filetags:\s+\\|tags:\s+\\|title:.*\\|identifier:.*\\|date:.*\\)\\|.*\\)"
I've also added an extra \\(#\\+\\)? to account for org meta keys which would usually have the format #+key: value.

Removing empty bbcode tags using regex

Using regex I'm trying to remove empty bbcode tags. By empty I mean nothing in between them:
[tag][/tag]
If there is something between them then it should be kept.
I've searched a lot and played around with a regex tester but haven't come up with anything that works right.
Edit: I realize now why I was having a hard time with this. In addition to the example above, I also have ones like:
[url=http://www.somedomain.com/][/url]
I'm trying to cleanup bbcode when a form is submitted so it's not stored since it's unneeded.
In JavaScript, you could do:
str.replace(/\[([^\[\]]*)\]\[\/\1\]/g, '');
The operative aspect of regex in this case is the use of internal backrefs; I'm not sure, off the top of my head, whether this is universally supported, but .NET, in any case, supports them (its regex engine is its own implementation rather than PCRE, though the syntax is largely Perl-compatible).
The pattern, then, is [, a word, ][/, the same word, ]. If we assume the word has simply the quality of "does not contain ]", then an appropriate regex to match an empty tag is \[([^\]]+)\]\[/\1\], escaped as necessary in context.
For the second case, if we assume the form [tag=arg][/tag], and that tag and arg each don't contain any ']' (not a reasonable assumption! but dealing with it is left as an exercise for the reader -- and I'm quite sure most bbcode implementations don't actually deal with that problem, either), one could use a regex \[([^\]=]+)(=[^\]]*)?\]\[/\1\].
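As a rough sketch of how the last regex could be applied, here it is in Python, under the same assumption that neither the tag nor its argument contains ']'. The function name strip_empty_tags and the repeat-until-stable loop (which also clears tags that only become empty once an inner empty pair is removed) are additions for illustration, not part of the answers above:

```python
import re

# [tag] or [tag=arg], immediately followed by the matching [/tag].
EMPTY_TAG = re.compile(r'\[([^\]=]+)(=[^\]]*)?\]\[/\1\]')


def strip_empty_tags(text):
    """Remove empty bbcode pairs, repeating until nothing more changes."""
    previous = None
    while previous != text:
        previous = text
        text = EMPTY_TAG.sub('', text)
    return text


sample = '[b][/b]keep [i]text[/i][url=http://www.somedomain.com/][/url]'
print(strip_empty_tags(sample))             # only the non-empty [i] pair remains
print(strip_empty_tags('[b][i][/i][/b]x'))  # nested empty pairs are removed too
```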

Detect URL in a string without any whitespace regexp

So I know that catching any URL is a very difficult task, and that's not what I'm trying to do. I want to find a regex that will catch URLs in the form of
http://something.xx.yy
http://www.something.xxx
www.something.xx.yy
in a string that will contain lots of other text and no whitespace, so for example
hellopleasevisitwww.something.xxthankyou
I've tried my best to detect something like that by myself, but it's been pretty fruitless. Any help would be great. Below are some of the expressions I tried to modify in order to have these requirements met
.*\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|].*
\\b\\w*\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]\\w*\\b
\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
Thanks for your time
If it really can be as simple as you're saying...
(http://(www\\.)?|www\\.)[^.]+\\.(\\w{3}|\\w{2}\\.\\w{2})
The expressions you tried all have \\b, which is a word boundary, and unfortunately your string has no word boundary in front of the URL.
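A small Python sketch of the point about \b, using the example string from the question (the pattern is written with single backslashes because Python raw strings are used here; the variable names are invented for this example):

```python
import re

text = "hellopleasevisitwww.something.xxthankyou"

# With \b in front, nothing matches: "t" and "w" are both word characters,
# so there is no word boundary before "www".
with_boundary = re.compile(r'\b(http://|www[.])')
print(with_boundary.search(text))  # None

# The pattern from the answer, without any word boundary.
answer = re.compile(r'(http://(www\.)?|www\.)[^.]+\.(\w{3}|\w{2}\.\w{2})')
match = answer.search(text)
print(match.group(0))  # www.something.xxt
```

Note that the match runs one character into the following text ("xxt" rather than "xx"): without whitespace there is no way for the regex to know where the top-level domain ends, so the length has to be guessed.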

Regular expression with negative look aheads

I am trying to construct a regular expression to remove links from content unless the link meets 1 of 2 conditions.
<a.*?href=[""'](http[s]?:\/\/(.*?)\.link\.com)?\/(?!m\/).*?<\/a>
This will match any link to link.com that does not have m/ at the end of the domain section. I want to change this slightly so it doesn't match URLs that link to PDF files, regardless of whether the URL contains m/. I came up with:
<a.*?href=["'](http[s]?:\/\/(.*?)\.brodies\.com)?\/(?!m\/).*?\.(?!pdf)["'].*?<\/a>
Which is ooh so very close except now it will only match if the URL has a "." at the end - I can see why it's doing it. I can't seem to make the "." optional as this causes the non greedy pattern prior to the "." to keep going until it hits the ["']
Any help would be good to help solve this.
Thanks
Paul
You probably want to use (?<!\.pdf)["'] instead of \.(?!pdf)["'].
But note that this expression has several issues; the best way to solve them is to use a proper HTML parser.
First, see the question "RegEx match open tags except XHTML self-contained tags".
That said (since it probably will not deter), here is a slightly-better-constrained version of what you're trying to do, with the caveat that this is still not good enough!
<a[^>]+?href\s*=\s*["'](https?:\/\/[^"']*?\.link\.com)?\/(?!m\/)[^"']*?\.(?!pdf)[^"']*?["'][^>]*?>.*?<\/a>
You can see a running example of this regex at: http://rubular.com/r/obkKrKpB8B.
Your problem was actually just that you were looking for a quote character immediately after the dot, here: \.(?!pdf)["'].
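For illustration, here is a hedged Python sketch that combines the look-behind suggested above with the more constrained character classes. The function name strip_links, the sample HTML and the exact combination of the two suggestions are assumptions of this example, and it shares the general fragility of matching HTML with regex:

```python
import re

# Links to link.com (or site-relative links) are matched unless the path
# starts with m/ or the href value ends in .pdf.
LINK = re.compile(
    r'''<a[^>]+?href\s*=\s*["']'''
    r'''(https?://[^"']*?\.link\.com)?'''
    r'''/(?!m/)[^"']*(?<!\.pdf)["'][^>]*>.*?</a>'''
)


def strip_links(html):
    """Remove every anchor element that the LINK pattern matches."""
    return LINK.sub('', html)


sample = (
    'A <a href="http://www.link.com/page">x</a> '
    'B <a href="http://www.link.com/m/page">y</a> '
    'C <a href="http://www.link.com/doc.pdf">z</a>'
)
print(strip_links(sample))  # only the first link is removed
```

The look-behind (?<!\.pdf) is checked right before the closing quote, so no literal dot is required anywhere else in the URL.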

Adding http:// to all links without a protocol

I use VB.NET and would like to add http:// to all links that don't already start with http://, https://, ftp:// and so on.
"I want to add http here Google,
but not here Google."
It was easy when I just had the links, but I can't find a good solution for an entire string containing multiple links. I guess RegEx is the way to go, but I wouldn't even know where to start.
I can find the RegEx myself, it's the parsing and prepending I'm having problems with. Could anyone give me an example with Regex.Replace() in C# or VB.NET?
Any help appreciated!
Quote RFC 1738:
"Scheme names consist of a sequence of characters. The lower case letters "a"--"z", digits, and the characters plus ("+"), period ("."), and hyphen ("-") are allowed. For resiliency, programs interpreting URLs should treat upper case letters as equivalent to lower case in scheme names (e.g., allow "HTTP" as well as "http")."
Excellent! A regex to match:
/^[a-zA-Z0-9+.-]+:\/\//
If that matches your href string, continue on. If not, prepend "http://". Remaining sanity checks are yours unless you ask for specific details. Do note the other commenters' thoughts about relative links.
EDIT: I'm starting to suspect that you've asked the wrong question... that you perhaps don't have anything that splits the text up into the individual tokens you need to handle it. See the question "Looking for C# HTML parser".
EDIT: As a blind try at ignoring all and just attacking the text, using case insensitive matching,
/(<a +href *= *")(.*?)(" *>)/
If the second back-reference matches /^[a-zA-Z0-9+.-]+:\/\//, do nothing. If it does not match, replace it with
$1 + "http://" + $2 + $3
This isn't C# syntax, but it should translate across without too much effort.
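The two-step idea above (find each href, then test its value against the scheme regex) could be sketched roughly like this in Python; the same structure should carry over to Regex.Replace() with a match evaluator. The function name add_default_scheme, its default parameter and the sample HTML are assumptions made for this example:

```python
import re

# Scheme check based on the RFC 1738 character set quoted above.
SCHEME = re.compile(r'^[a-zA-Z0-9+.-]+://')

# The "blind try" pattern: opening part, href value, closing part.
HREF = re.compile(r'(<a +href *= *")(.*?)(" *>)', re.IGNORECASE)


def add_default_scheme(html, default="http://"):
    """Prepend a default scheme to href values that do not already have one."""
    def fix(match):
        opening, url, closing = match.groups()
        if SCHEME.match(url):
            return match.group(0)  # already has a scheme, leave untouched
        return opening + default + url + closing
    return HREF.sub(fix, html)


sample = (
    'I want to add http here <a href="www.google.com">Google</a>, '
    'but not here <a href="https://www.google.com">Google</a>.'
)
print(add_default_scheme(sample))
```

As noted above, relative links would also get the prefix, and so would values such as mailto: addresses that have a scheme but no "://".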
In PHP (should translate somewhat easily)
$text = preg_replace('/href="(?![a-zA-Z0-9+.-]+:\/\/)([^"]*)"/', 'href="http://$1"', $text);
C#
result = Regex.Replace(input, "(href=\")(?!(?:https?|ftp)://)", "${1}http://", RegexOptions.IgnoreCase);
If you aren't concerned with potentially messing up local links, and you can always guarantee that the strings will be fully qualified domain names, then you can simply check how the string starts with the StartsWith method:
Dim myUrl As String = "someUrlString"
' Compare case-insensitively so the URL itself keeps its original casing
If Not myUrl.StartsWith("http://", StringComparison.OrdinalIgnoreCase) AndAlso
   Not myUrl.StartsWith("https://", StringComparison.OrdinalIgnoreCase) AndAlso
   Not myUrl.StartsWith("ftp://", StringComparison.OrdinalIgnoreCase) Then
    ' Prepend the default protocol
    myUrl = "http://" & myUrl
End If
Keep in mind that this leaves a lot of holes: it does not check which protocol should be added, or whether the URL is relative.
Edit: I chose specifically not to offer a RegEx solution since this is a simple check and RegEx is a little heavy for it (IMO).