preg_replace words not inside a url - regex

I am using preg_replace to replace a list of words in a text that may contain some urls.
The problem is that I don't want to replace these words if they're part of a url.
These examples should be ignored:
foo.com
foo.com/foo
foo.com/foo/foo
For a basic example (written in php), I tried to ignore strings containing .com and optional slashes and chars, using a negative look ahead assertion, but with no success:
preg_replace("/(\b)foo(\b)/", "$1bar$2(?!(\w+\.\w+)*(\.com)([\.\/]\w+)*)", $text);
This call only ignores the word directly before .com.
Any help would be really appreciated.

In cases like these, it's much easier to think of the problem inverted. You want to match words that are not in a URL. Instead, think of it as matching the URL and the words. So your expression would look like this: url_match_here|(?:my|words|here). The regex engine tries the URL alternative first at each position, so a URL is consumed whole before the word alternative gets a chance. Thus, you never have to worry about matching the words inside a URL. If you want to maintain the text structure, you can use preg_replace with the expression (url_match_here)|(?:my|words|here) and replace with \1 to preserve the URL and the text. Note that \1 is empty when a word matched, so this replacement removes the words; to substitute another word instead, use preg_replace_callback and return the URL unchanged when group 1 matched.
I hope this helps.
Good luck.
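A minimal sketch of the "match the URL too" idea, written in Python rather than PHP for illustration. The URL pattern here is a deliberately simple assumption (optional scheme, dotted host, optional path), and the names URL, WORDS and replace_outside_urls are hypothetical; a real URL matcher would need to be stricter.

```python
import re

# Hypothetical, simplified URL pattern: optional scheme, dotted host, optional path.
URL = r"(?:https?://)?(?:[\w-]+\.)+[a-z]{2,}(?:/\S*)?"
# Words to replace; only "foo" here, as in the question.
WORDS = r"\b(?:foo)\b"
# The URL alternative comes first and is captured in group 1.
PATTERN = re.compile(rf"({URL})|{WORDS}", re.IGNORECASE)


def replace_outside_urls(text, replacement="bar"):
    def swap(match):
        # Group 1 is set only when the URL branch matched: keep the URL as-is.
        if match.group(1) is not None:
            return match.group(1)
        # Otherwise the word branch matched: substitute the replacement.
        return replacement

    return PATTERN.sub(swap, text)


result = replace_outside_urls("foo at foo.com/foo/foo and foo")
print(result)
```

The callback plays the role that preg_replace_callback would play in PHP: it decides per match whether to return the URL untouched or the replacement word.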

Related

Regex - Find the Shortest Match Possible

The Problem
Given the following:
\plain\f2 This is the first part of the note. This is the second part of the note. This is the \plain\f2\fs24\cf6{\txfielddef{\*\txfieldstart\txfieldtype1\txfieldflags144\txfielddataval44334\txfielddata 35003800380039000000}{\*\txfielddatadef\txfielddatatype1\txfielddata 340034003300330034000000}{\*\txfieldtext 20{\*\txfieldend}}{\field{\*\fldinst{ HYPERLINK "44334" }}{\fldrslt{20}}}}\plain\f2\fs24 part of the note.
I'd like to produce this:
\plain\f2 This is the first part of the note. This is the second part of the note. This is the third part of the note.
What I've Tried
The example input/output is a very simplified version of the data I need to parse and it would be nice to have a way to parse the data programmatically. I have a PHP application and I've been trying to use regex to match the segments that are important and then filter out the parts of the string that aren't required. Here's what I've come up with so far:
/\\plain.*?\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*? /gm
regex101: https://regex101.com/r/ILLZU6/2
It almost matches what I want, but it grabs the longest possible match instead of the shortest. I want it to match only one \\plain before the \\field{.... Maybe after the \\plain, I could match anything except for a space? How would I go about doing that?
I'm no regex expert, but my use-case really calls for it. (Otherwise, I'd just write code to handle everything.) Any help would be much appreciated!
(?:(?!\\plain).)* will match any string unless it contains a match for \\plain. Here's the regex implementing this:
/\\plain(?:(?!\\plain).)*\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*? /gm
regex101: https://regex101.com/r/ILLZU6/5
Also, you can replace the space at the end with (?: |$) if you want to allow the end of the text to trigger it as well as a space:
/\\plain(?:(?!\\plain).)*\\field{\\\*\\fldinst{ HYPERLINK "(.*?)" }}{\\fldrslt{(.*?)}}}}\\plain.*?(?: |$)/gm
regex101: https://regex101.com/r/ILLZU6/4
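As a rough check of the (?:(?!\\plain).)* idea, here is a sketch in Python (not the asker's PHP) that runs both patterns against the sample input from the question. The braces are escaped for Python's re module, and replacing the match with the link text from group 2 is an assumption about the desired clean-up, not something the answer specifies.

```python
import re

# Sample input copied from the question.
text = (
    r'\plain\f2 This is the first part of the note. This is the second part of the note. This is the '
    r'\plain\f2\fs24\cf6{\txfielddef{\*\txfieldstart\txfieldtype1\txfieldflags144\txfielddataval44334'
    r'\txfielddata 35003800380039000000}{\*\txfielddatadef\txfielddatatype1\txfielddata 340034003300330034000000}'
    r'{\*\txfieldtext 20{\*\txfieldend}}{\field{\*\fldinst{ HYPERLINK "44334" }}{\fldrslt{20}}}}'
    r'\plain\f2\fs24 part of the note.'
)

# Shared tail of both patterns, with braces escaped.
TAIL = r'\\field\{\\\*\\fldinst\{ HYPERLINK "(.*?)" \}\}\{\\fldrslt\{(.*?)\}\}\}\}\\plain.*? '

# Original attempt: lazy .*? still starts at the first \plain.
lazy = re.compile(r'\\plain.*?' + TAIL)
# Answer's version: each consumed character must not begin another \plain.
tempered = re.compile(r'\\plain(?:(?!\\plain).)*' + TAIL)

lazy_match = lazy.search(text)
tempered_match = tempered.search(text)

# Assumed clean-up: keep only the visible link text (group 2) plus a space.
cleaned = tempered.sub(r'\2 ', text)
print(cleaned)
```

The lazy version starts its match at the very first \plain because a lazy quantifier only limits how far a match extends to the right, not where it starts; the lookahead version cannot cross a second \plain, so it starts at the one closest to \field.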

Match a url that does not contain certain word

I need some help with a regular expression that does not match URLs like this one:
/Common/Download.php?file=/path/to/file.pdf
and instead matches static URLs like this:
/path/to/file.pdf
I have read many posts (also on this site), but nothing seems to work as expected.
Thanks for your help.
Lorenzo.
UPDATE
Sorry if this post was incomplete. Here is more information so that I can get better help.
The regular expression must work with Apache's mod_rewrite module (and, if possible, with the rewrite module of IIS, which as far as I know is compatible with Apache's, though that may not be its right name). It must redirect the matching static URLs (only the second type from my post) to a specific page.
Thanks again.
Lorenzo.
Without knowing more about your programming language and regex parser, I'm keeping my regex really generic, but something like this should get you close:
^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$
This matches a string starting with a slash, one or more directories separated by slashes, and ending with a filename with a three or four character file extension.
This means /path/to/some/really/buried/file.html would match too.
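A quick way to sanity-check this generic pattern is to run it against the URLs from the question. The sketch below uses Python's re module purely as a test bench (an assumption; the asker's target is mod_rewrite), and the candidate list is made up from the examples in this thread.

```python
import re

# The generic pattern from the answer above.
STATIC_FILE = re.compile(r"^/([A-Za-z0-9]+/)+[A-Za-z0-9]+\.[A-Za-z0-9]{3,4}$")

candidates = [
    "/path/to/file.pdf",
    "/path/to/some/really/buried/file.html",
    "/Common/Download.php?file=/path/to/file.pdf",
    "/file.pdf",  # no directory component
]

# Map each candidate to True/False depending on whether the pattern matches.
results = {url: bool(STATIC_FILE.match(url)) for url in candidates}
for url, matched in results.items():
    print(matched, url)
```

One caveat when moving this to mod_rewrite: a RewriteRule pattern is tested against the URL path only, without the query string, so /Common/Download.php?file=... would be seen as /Common/Download.php and would match this pattern unless it is excluded separately.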
Using an interactive regular expression evaluator is a great way to rapidly write and debug regular expressions, especially if you are new to them. I really like The Regex Coach for this.
Another option is to put the pattern for a forward slash followed by lowercase characters in a non-capturing group, repeat that group, and then match the file extension .pdf:
^(?:/[a-z]+){3}\.pdf$
Explanation
From the beginning of the string ^
Open a non-capturing group (?:
Match a forward slash and one or more lowercase characters /[a-z]+
Close the non-capturing group and repeat it 3 times ){3}
Match a dot \. and pdf
The end of the string $
Or repeat the group 2 times and for the filename use \w+
^(?:/[a-z]+){2}/\w+\.pdf$
If you want to match your example static url and maybe longer or shorter paths like /path/file.pdf or /dir/path/to/file.pdf you could for example use:
^(?:/\w+)+\.\w+$
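The three variants above can be compared side by side. This is a small Python sketch for experimentation only; the dictionary keys and the sample paths are hypothetical labels chosen for the demo.

```python
import re

patterns = {
    "three_segments": re.compile(r"^(?:/[a-z]+){3}\.pdf$"),
    "two_dirs_then_file": re.compile(r"^(?:/[a-z]+){2}/\w+\.pdf$"),
    "any_depth": re.compile(r"^(?:/\w+)+\.\w+$"),
}

paths = [
    "/path/to/file.pdf",
    "/path/file.pdf",
    "/dir/path/to/file.pdf",
    "/Common/Download.php?file=/path/to/file.pdf",
]

# For each pattern, collect the paths it accepts.
accepted = {
    name: [p for p in paths if regex.match(p)]
    for name, regex in patterns.items()
}
for name, hits in accepted.items():
    print(name, hits)
```

The fixed-count variants accept only the exact depth of the example URL, while the last pattern accepts any depth; none of them accepts the Download.php URL as long as the query string is part of the tested string.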

Regex captures all occurrences but the last of certain characters

I want to exclude common punctuation from my URL Regex detector when my clients type a sentence with a URL in it. A common scenario would be the URL example.com?q=this (which obviously needs to include the ?) versus a sentence saying
What do you think of example.com?
This expression suits my needs just fine:
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?
However it includes all punctuation at the end, so I am iterating through each match to find and use this captured group to exclude said punctuation:
(.*?)[?,!.;:]+$
However, I'm not sure how to leverage the "end of string" technique when scanning the entire block of text which may have multiple URLs. Was hoping there'd be a way to capture the right blocks from the get-go without the extra work.
Just require non-whitespace after the punctuation instead of making it optional.
(?:https?\:\/\/)?(?:\w+\.)+\w{2,}(?:[?#\/]\S+)?
You will of course lose a valid trailing character in some URLs (example.com/ becomes example.com), but as far as I know the two are equivalent.
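To see the difference between \S* and \S+ in the trailing group, here is a short Python sketch using the two expressions from this thread. The sample sentence is made up for the demo.

```python
import re

# Original expression: the trailing group accepts a bare "?", "#" or "/".
optional_tail = re.compile(r"(?:https?://)?(?:\w+\.)+\w{2,}(?:[?#/]\S*)?")
# Suggested fix: at least one non-whitespace character must follow.
required_tail = re.compile(r"(?:https?://)?(?:\w+\.)+\w{2,}(?:[?#/]\S+)?")

sentence = "What do you think of example.com? Try example.com?q=this too."

before = optional_tail.findall(sentence)
after = required_tail.findall(sentence)
print(before)
print(after)
```

This only handles the delimiter characters listed in the group (?, # and /); punctuation that trails a longer path or query, such as a final period after example.com/page, would still be captured by \S+.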

Regex: Search for verb roots

I've seen the results for classifying verbs by their endings. But I want to use Regular Expressions to find verb roots for regular verbs in Spanish.
I'm using this fancy site: http://regexpal.com/
Which I suspect may not be compatible with my end use, but will be a great starting point.
From what I have seen, the caret should identify all strings after it based on your supplied string-pattern.
So, to me:
^gust
Should find "gusta", "gustan", "gustamos", "gustas","gustar".
I know that I'm way off, but looking at many of the pages and tutorials and examples, I don't see anything that looks similar to what I want to do.
A regex match returns only the part of the text that the pattern matched. If you have the word "gustan" and match it against ^gust as you suggested, the matcher returns "gust", which is not what you want (you want the whole word).
So instead of matching ^gust, try ^gust\w*$, which matches a line that starts with "gust" followed by zero or more word characters.
^(gust[a-zA-Z]*)$
^ denotes the start of the line
[a-zA-Z] letters only
* means zero or more
() is called a capture group
$ is the end of the line
If you want to edit with different words you could do this...
^((?:gust|otherwords)[a-zA-Z]*)$
All you have to change is |otherwords; add one alternative there for each additional root you want to match.
Read more about regex and use debuggex.com to experiment.
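The patterns from both answers can be tried out quickly in Python (an assumption; the asker has not named an end-use language). The word list, the second root "habl" and the sample sentence are made up for illustration, and the \b variant for running text is an addition not found in the answers.

```python
import re

words = ["gusta", "gustan", "gustamos", "gustas", "gustar", "disgusta", "hablar"]

# The bare root matches only the prefix, not the whole word.
prefix_only = re.match(r"^gust", "gustan").group(0)

# Root plus any trailing word characters, anchored to the whole string.
whole_word = re.compile(r"^gust\w*$")
gust_forms = [w for w in words if whole_word.match(w)]

# Several roots at once; "habl" is a hypothetical second root.
multi_root = re.compile(r"^((?:gust|habl)[a-zA-Z]*)$")
multi_forms = [w for w in words if multi_root.match(w)]

# In running text there are no line anchors to rely on, so use a word boundary.
in_text = re.findall(r"\bgust\w*", "Me gusta leer y nos gustan los libros")

print(prefix_only, gust_forms, multi_forms, in_text)
```

Note that [a-zA-Z] and, in some regex flavors, \w do not cover accented letters, which matters for Spanish forms such as "gustó".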

Parse with Regex without trailing characters

How can I parse the text below so that I get just
To: User <test#test.com>
and
To: <test#test.com>
When I try to parse the text below with
/To:.*<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>/mi
It grabs
Message-ID <CC2E81A5.6B9%test#test.com>,
which I don't want in my answer.
I have tried using $ and \z, and neither works. What am I doing wrong?
Information to parse
To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>
To:
<test#test.com>
This is my parsing information in Rubular http://rubular.com/r/DQMQC4TQLV
Since you haven't specified exactly what your tool/language is, assumptions must be made.
In general, regex quantifiers are greedy: they match the longest possible string. Your pattern has .* right after To:, which means it matches the longest possible string that ENDS WITH the remainder of your pattern <[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>, so the match runs past the first address and takes in the Message-ID text as well.
Both Apalala's and nhahtdh's comments give you something to try. Avoid the all-inclusive .* at the start and use something that's a bit more specific: match leading spaces, or match anything EXCEPT the first part of what you're really interested in.
You need to make the wildcard match non-greedy by adding a question mark after it:
To:.*?<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>
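A small Python sketch comparing the greedy and non-greedy versions on the sample input from the question. The question's /mi flags come from Rubular (Ruby), where /m lets the dot match newlines, so re.DOTALL is used here as the assumed equivalent; the variable names are hypothetical.

```python
import re

# Sample input from the question; "#" appears in place of "@" exactly as posted.
text = (
    "To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>\n"
    "To:\n"
    "<test#test.com>"
)

ADDRESS = r"<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>"
# IGNORECASE mirrors /i; DOTALL mirrors Ruby's /m (dot matches newline).
FLAGS = re.IGNORECASE | re.DOTALL

# Greedy: .* runs to the end and backs up only to the last address that fits.
greedy = re.search(r"To:.*" + ADDRESS, text, FLAGS)
# Non-greedy: .*? stops at the first address that fits after each "To:".
lazy = re.findall(r"To:.*?" + ADDRESS, text, FLAGS)

print(repr(greedy.group(0)))
print(lazy)
```

With these flags the greedy pattern swallows the whole input in a single match, while the non-greedy one returns the two To: entries separately.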