Search and modify with one regex - regex

Say we have an HTML page containing links:
a href="katalog/koshelki-i-klatchi/muzhskaya-sumka-planshet-polo-optom1"
a href="katalog/koshelki/kozhanaya-sumka-jeep-optom1"
I need to search with a regex only once (in a single search query), and I want the output to be:
koshelki-i-klatchi/muzhskaya-sumka-planshet-polo-optom1
koshelki/kozhanaya-sumka-jeep-optom1
What would the regular expression for this task look like?

Do you want something like this?
http:\/\/[A-Za-z0-9\.]*(\/[A-Za-z0-9]*)?\/[A-Za-z0-9]+[0-9]{1}
Test it here: https://regex101.com/r/cnxvR0/1
It matches anything starting with http://, followed by any number of letters, digits or dots, optionally followed by one more path segment (a forward slash and letters or digits), then a forward slash and one or more letters or digits, and it has to end with a single digit.
I'm sure this will not cover all of your cases, but you have to be more specific. How many digits are there at the end? Is it always only one? Does the URL have to end with a digit, or is that optional? How many nested directories can there be (my regex allows only one)?
Let me know if the regex above does what you need, or answer the questions above in the comment section and I'll edit my answer accordingly.
OK, so after you edited your original question:
(?<=href=")(?:[\w-]+\/?)*
Try it here: https://regex101.com/r/q0tf5l/2
Let me know if this is what you wanted. You can iterate through all of the matches and print them out, or do whatever else you need with them.
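As a rough sketch of iterating over the matches, here is the lookbehind pattern run through Python's re module. The sample markup (with angle brackets restored) is an assumption, and the second pattern is a hypothetical variant, not part of the answer above: it moves the leading katalog/ into the lookbehind so that the output matches what the question asked for.

```python
import re

# Assumed sample input: the two links from the question, as anchor tags.
html = (
    '<a href="katalog/koshelki-i-klatchi/muzhskaya-sumka-planshet-polo-optom1">\n'
    '<a href="katalog/koshelki/kozhanaya-sumka-jeep-optom1">'
)

# Pattern from the answer: everything after href=" that consists of
# word characters, hyphens and slashes. The match still starts with "katalog/".
full_paths = [m.group(0) for m in re.finditer(r'(?<=href=")(?:[\w-]+/?)*', html)]

# Hypothetical variant: also require "katalog/" inside the lookbehind,
# so the match starts after it and runs up to the closing quote.
trimmed_paths = re.findall(r'(?<=href="katalog/)[^"]+', html)

for path in trimmed_paths:
    print(path)
```

Python only accepts fixed-width lookbehinds, which both of these are; other engines may differ.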

Related

Custom email validation regex pattern not working properly

So I've got /.+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(.{1})\w{2,}/ pattern I want to use for email validation on client-side, which doesn't work as expected.
I know that my pattern is simple and doesn't cover every standard possibility, but it's part of my regex training.
Local part of address should be valid only when it has at least one digit [0-9] or letter [a-zA-Z] and can be mixed with comma or plus sign or underscore (or all at once) and then # sign, then domain part, but no IP address literals, only domain names with at least one letter or digit, followed by one dot and at least two letters or two digits.
In test string form it doesn't validate a#b.com and does validate baz_bar.test+private#e-mail-testing-service..com, which is wrong - it should be vice versa - validate a#b.com and not validate baz_bar.test+private#e-mail-testing-service..com
What specific error do I have there, and where?
I can't locate it, sorry.
You need to change your regex
From: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]\#[\w+-?]+(\.{1})\w{2,}
To: .+[^\x20-\x2A\x2C\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\xFF]?\#[\w+-]+(\.{1})\w{2,}
Notice that I added a ? before the # sign and removed the ? from the first "group" after the # sign. Adding that ? tells your regex that the whole "group" is not mandatory.
See it working here: https://regex101.com/r/iX5zB5/2
You're requiring the local part (before #) to be at least two characters with the .+ followed by the character class [^...]. It's looking for any character followed by another character not in the list of exclusions you specify. That explains why "a#b.com" doesn't match.
The second problem is partly caused by the character class range +-? which includes the . character. I think you wanted [-\w+?]+. (Do you really want question marks?) And then later I think you wanted to look for a literal . character but it really ends up matching the first character that didn't match the previous block.
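The range issue can be checked directly. The sketch below, using Python's re module, lists the characters covered by +-? and compares the original domain class with the suggested one on the domain from the question; the variable names are made up for illustration.

```python
import re

# Inside a character class, "+-?" is a range from "+" (0x2B) to "?" (0x3F),
# which happens to include ".", "/", the digits and several other symbols.
range_chars = [chr(c) for c in range(ord('+'), ord('?') + 1)]
print(''.join(range_chars))

original_class = re.compile(r'[\w+-?]+')   # "-" forms a range here
fixed_class = re.compile(r'[-\w+?]+')      # leading "-" is a literal hyphen

domain = 'e-mail-testing-service..com'
original_match = original_class.match(domain).group(0)
fixed_match = fixed_class.match(domain).group(0)

print(original_match)  # the dots are swallowed by the class
print(fixed_match)     # stops at the first dot
```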
Between the regex provided and the explanatory text I'm not sure what rules you intend to implement though. And since this is an exercise it's probably better to just give hints anyway.
You will also want to use the ^ and $ anchors to make sure the entire string matches.

Regex to "ignore" not "exclude"

I'm totally lost. I need a regular expression that
can detect any of the 4 starting urls like below
^(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)$
It should also detect:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
And, importantly, it should ignore, but NOT exclude, the following exact string (whether or not it's present in the page):
http://www.w3.org
This is complicated for me, because I still need to include it in the regex line even if it's ignored; otherwise it will match and be found by
(.*http://.*|.*http%3A%2F%2F.*|.*https://.*|.*https%3A%2F%2F.*)
And my aim is to find/match any URL besides
http://www.w3.org
whether or not it's in the page.
So if there's only this in the page:
http://www.w3.org
and no other URL, then it shouldn't match.
Thanks Tyler, but my regex knowledge is almost zero; I only know what commands do when I right-click on them to choose actions, like in Regulazy or RegExr ((
So I updated my command according to the URL I provided to you:
href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom
& it works:
https?(://|%3A%2F%2F)(?!www.w3.org)(.*)
But because of my lack of knowledge, I don't understand how to do what is described below:
"What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs"
I tried to add this, but it doesn't work:
(www.)
All I'm missing now is detection of URLs starting with www
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
OK so try this:
/\bhttps?(://|%3A%2F%2F)(?!www\.w3\.org)(.*)\b/g
Test here: http://regexr.com?38jp5
That test link uses JavaScript-style regex, but it should work elsewhere.
The important part is the second half: a negative lookahead, which checks that what follows is not the exact text www.w3.org.
I compressed what you had: mine matches http, then an optional s, then either :// or %3A%2F%2F.
I wrapped the whole thing in word boundaries; you could change that to quotes or whatever you need. The global flag lets you match multiple items.
In regards to OP's questions:
D%22
could appear before http or https
this one is missing & should match:
href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom
If this matters, just remove the word boundary \b before and after the regex, so the http can match anywhere.
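To illustrate both the negative lookahead and the effect of the word boundaries, here is a small sketch with Python's re module (an assumption; the original test link uses JavaScript, and the syntax used here is shared by both). The sample URLs other than the two quoted in this thread are made up.

```python
import re

# Pattern from the answer, with and without the surrounding \b.
bounded = re.compile(r'\bhttps?(://|%3A%2F%2F)(?!www\.w3\.org)(.*)\b')
unbounded = re.compile(r'https?(://|%3A%2F%2F)(?!www\.w3\.org)(.*)')

samples = [
    'http://www.w3.org',          # rejected by the negative lookahead
    'https://example.com/page',   # hypothetical URL, should match
    'http%3A%2F%2Fexample.com',   # hypothetical encoded URL, should match
]
matched = [s for s in samples if bounded.search(s)]

# "D%22" directly before "http" means there is no word boundary there
# ("2" and "h" are both word characters), so only the unbounded form matches.
encoded = 'href%3D%22http%3A%2F%2Fwww%2Edommermuth%2D1%2Ecom'
bounded_hit = bounded.search(encoded)
unbounded_hit = unbounded.search(encoded)

print(matched)
print(bounded_hit is None, unbounded_hit.group(0))
```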
The regex command should detect: (any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything)
This regex would fail to match a link like http://google.com - looking for www is really not a good way to check for a link on its own. What you could do is make the http part optional, or must match http or www or both. This type of regex came up in another question I answered recently - Multiple preg_replace RegEx for different URLs
Edit #2:
(any punctuation or space or backspace)(3 times the letter w in upper or lower case)(one dot)(anything till it reaches a space or the end of a line)
As I mention above, what you are describing will not match a url like http://google.com - but if that is what you want, use this:
(\W|^)[wW]{3}\.\S+
Instead of that, what I think you want is this, which is a combination of my first answer, and the link to a different post above.
((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*
You'll want to use this regex with the Global and Insensitive flags.
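A quick way to try the combined pattern is sketched below in Python, where finditer stands in for the Global flag and re.IGNORECASE for the Insensitive flag. The sample sentence is invented for illustration.

```python
import re

# Combined pattern from the answer: http(s) prefix, www prefix, or both,
# followed by a negative lookahead that rejects w3.org.
combined = re.compile(
    r'((https?(://|%3A%2F%2F))(www\.)|(https?(://|%3A%2F%2F))|(www\.))'
    r'(?!(www\.)?w3\.org)([^</\?\s]+)[^<\s]*',
    re.IGNORECASE,
)

# Hypothetical page text containing the ignored URL plus two others.
text = 'see http://www.w3.org and www.example.com/path plus https://google.com'
found = [m.group(0) for m in combined.finditer(text)]
print(found)

# A page that contains only the ignored URL produces no match at all.
only_w3 = combined.search('http://www.w3.org')
print(only_w3)
```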

regex negative lookbehind - pcre

I'm trying to write a rule to match on a top-level domain followed by five digits. My problem arises because my existing PCRE matches what I have described, but much later in the URL than where I want it to. I want it to match on the first occurrence of a TLD, not anywhere else. The easy way to check for this is to match on the TLD only when it has not been preceded at some point by the "/" character. I tried using a negative lookbehind, but that doesn't work because it only looks back one single character.
e.g.: How it is currently working
domain.net/stuff/stuff=www.google.com/12345
matches .com/12345 even though I do not want this match because it is not the first TLD in the URL
e.g.: How I want it to work
domain.net/12345/stuff=www.google.com/12345
matches on .net/12345 and ignores the later match on .com/12345
My current expression
(\.[a-z]{2,4})/\d{5}
EDIT: rewrote it so perhaps the problem is clearer in case anyone in the future has this same issue.
You're pretty close :)
You just need to be sure that before matching what you're looking for (i.e: (\.[a-z]{2,4})/\d{5}), you haven't met any / since the beginning of the line.
I would suggest simply prepending ^[^\/]* to your current regex (and moving the \. out of the capture group).
Thus, the resulting regex would be:
^[^\/]*\.([a-z]{2,4})/\d{5}
How does it work?
^ asserts that this is the beginning of the tested String
[^\/]* accepts any sequence of characters that doesn't contain /
\.([a-z]{2,4})/\d{5} is the pattern you want to match (a . followed by 2 to 4 lowercase letters, then a / and 5 digits).
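The breakdown above can be tried on the two URLs from the question. This is a minimal sketch using Python's re module rather than PCRE itself, which is an assumption, although the syntax used here is the same in both.

```python
import re

# Original expression from the question: matches a TLD anywhere in the URL.
unanchored = re.compile(r'(\.[a-z]{2,4})/\d{5}')
# Anchored version: no "/" may appear before the TLD.
first_tld = re.compile(r'^[^/]*\.([a-z]{2,4})/\d{5}')

unwanted = 'domain.net/stuff/stuff=www.google.com/12345'
wanted = 'domain.net/12345/stuff=www.google.com/12345'

# The unanchored pattern finds the later .com/12345 in the first URL...
late_match = unanchored.search(unwanted).group(0)
# ...while the anchored one refuses to match it at all.
anchored_unwanted = first_tld.search(unwanted)
# In the second URL the anchored pattern stops at the first TLD.
anchored_wanted = first_tld.search(wanted)

print(late_match)
print(anchored_unwanted)
print(anchored_wanted.group(0), anchored_wanted.group(1))
```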
Here is a permalink to a working example on regex101.
Cheers!
You can use this regex:
'|^(\w+://)?([\w-]+\.)+\w+/\d{5}|'
Online Demo: http://regex101.com/

Regex - How to search for singular or plural version of word [duplicate]

I'm trying to do what should be a simple Regular Expression, where all I want to do is match the singular portion of a word whether or not it has an s on the end. So if I have the following words
test
tests
EDIT: Further examples. I need this to be possible for many words, not just those two:
movie
movies
page
pages
time
times
For all of them I need to get the word without the s on the end but I can't find a regular expression that will always grab the first bit without the s on the end and work for both cases.
I've tried the following:
([a-zA-Z]+)([s\b]{0,}) - This returns the full word as the first match in both cases
([a-zA-Z]+?)([s\b]{0,}) - This returns 3 different matching groups for both words
([a-zA-Z]+)([s]?) - This returns the full word as the first match in both cases
([a-zA-Z]+)(s\b) - This works for tests but doesn't match test at all
([a-zA-Z]+)(s\b)? - This returns the full word as the first match in both cases
I've been using http://gskinner.com/RegExr/ for trying out the different regexes.
EDIT: This is for a Sublime Text snippet. For those who don't know, a snippet in Sublime Text is a shortcut: I can type, say, the name of my database, hit "run snippet", and it will turn it into something like:
$movies = $this->ci->db->get_where("movies", "");
if ($movies->num_rows()) {
    foreach ($movies->result() AS $movie) {
    }
}
All I need is to turn "movies" into "movie" and auto-insert it into the foreach loop.
That means I can't just do a find and replace on the text, and I only need to take 60 - 70 words into account (it's only running against my own tables, not every word in the English language).
Thanks!
- Tim
Ok I've found a solution:
([a-zA-Z]+?)(s\b|\b)
It works as desired; you can then simply use the first capture group as the unpluralized version of the word.
Thanks @Jahroy for helping me find it. I added this as an answer for future surfers who just want a solution, but please check out Jahroy's comment for more in-depth information.
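For reference, here is a small check of that pattern against the words from the question, sketched with Python's re module (Sublime Text snippets use their own regex engine, so treat this only as an approximation). The lazy +? is what lets the optional s fall into the second group.

```python
import re

# Lazy word body, then either a trailing "s" at a word boundary or just the boundary.
singular = re.compile(r'([a-zA-Z]+?)(s\b|\b)')

words = ['test', 'tests', 'movie', 'movies', 'page', 'pages', 'time', 'times']
# Group 1 holds the word without its final "s".
stems = [singular.match(word).group(1) for word in words]
print(stems)
```

Note that any word ending in s loses that letter, so a singular name such as "status" would come out as "statu"; that is acceptable only for a known list of table names like the one described in the question.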
For simple plurals, use this:
test(?=s| |$)
For more complex plurals, you're in trouble using regex. For example, this regex
part(y|i)(?=es | )
will return "party" or "parti", but I'm not sure what you would do with that.
Here's how you can do it with vi or sed:
s/\([A-Za-z]*\)[sS]$/\1/
That replaces a run of letters ending in s or S with everything but the last letter.
NOTE:
The escape chars (backslashes before the parens) might be different in different contexts.
ALSO:
The \1 (which refers to the first captured group) may also vary depending on context.
ALSO:
This will only work if your word is the only word on the line.
If your table name is one of many words on the line, you could probably replace the $ (which stands for the end of the line) with a wildcard that represents whitespace or a word boundary (these differ based on context).

Mod Rewrite RegEx To Match Only If Previous Subset Matched

I am trying to make what I think is a simple regex for use with mod_rewrite.
I've tried various expressions, many of which I thought were promising, but all of which ultimately failed for one reason or another. They all also seem to fail once I add start/end string delimiters.
For example, ^user/(\d{1,10})(?=/)$ was one I tried, but among other things, it seems to group the trailing slash, and I only want to group the digits. I think I need to use a positive lookbehind, but I'm having difficulty because it's looking behind at a group.
What I am trying to match is strings that 1) begin with "user/" and 2) possibly end with (\d{1,10})/ (1 to 10 digits followed by a single slash)
Should Match:
user/
user/123/
user/1234567890/
Should not match:
user
user//
user/-4/
user/35.5/
user/123
user/123//
user/123/5/
user/12345678901/
Edit: Sorry about the formatting; I do not understand how to format anything via this markdown. Those examples are preceded by 4 spaces which I thought should make a code block, but obviously I thought wrong.
^user/(?:([0-9]{1,10})/)?$ should work just fine.
This should work: ^user(?=/)(/\d{1,10})?/$
Edit: if you want to group only the digits, use ^user(?=/)(?:/(\d{1,10}))?/$
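The two digit-grouping patterns suggested above can be checked against the examples from the question. The sketch below uses Python's re module as a stand-in for mod_rewrite's PCRE engine, which is an assumption; the constructs involved behave the same way in both.

```python
import re

# The two patterns that capture only the digits.
patterns = [
    re.compile(r'^user/(?:([0-9]{1,10})/)?$'),
    re.compile(r'^user(?=/)(?:/(\d{1,10}))?/$'),
]

should_match = ['user/', 'user/123/', 'user/1234567890/']
should_not_match = [
    'user', 'user//', 'user/-4/', 'user/35.5/',
    'user/123', 'user/123//', 'user/123/5/', 'user/12345678901/',
]

for pattern in patterns:
    accepted = [s for s in should_match if pattern.search(s)]
    rejected = [s for s in should_not_match if not pattern.search(s)]
    print(pattern.pattern, len(accepted), len(rejected))

# Group 1 holds the digits, or None when the id part is absent.
captured = [patterns[0].search(s).group(1) for s in should_match]
print(captured)
```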