Regex to match text from multiple links - regex

How to extract links which contain a certain word?
For e.g.:
https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text
How to search "word" from below regex?
((https:).*?(###))
The result should be like this
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
https://www.test.com/word/3/text/text

Let's try to build such a regex. First we need to find the beginning of the URL:
https?:\/\/
We add ? after the s so that plain http URLs are matched as well.
Then we need to find any text except ###, so we need to add:
(?:(?!###).)*
which means - any amount of characters not starting a ### sequence.
Also we need to add word itself and previous sub-expression again, since word can be surrounded by any text:
word(?:(?!###).)*
Note that this sub-expression already consumes everything up to (but not including) ###, so no character is skipped. To make the end of the match explicit, we can still require one more character that is followed by ### or by the end of the string:
.(?=###|$)
Keep in mind that this extra piece requires at least one character after word. The final expression will look like:
/(https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$))/g
But I believe it's better to just split the text by ### and then check each part for the needed word with String.prototype.includes.
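For comparison, here is a minimal Python sketch of both ideas from this answer: the tempered pattern and the plain split-and-check. The names `URL_WITH_WORD`, `by_regex` and `by_split` are made up for this illustration, and Python stands in for JavaScript only to show the logic.

```python
import re

text = ("https://www.test.com/text/1###https://www.test.com/text/word/2###"
        "https://www.test.com/text/text/word/3###https://www.test.com/3/text###"
        "https://www.test.com/word/3/text/text")

# Each repeated character is only accepted if it does not start a "###" separator.
URL_WITH_WORD = re.compile(r"https?://(?:(?!###).)*word(?:(?!###).)*.(?=###|$)")

def by_regex(s):
    """Return every URL containing 'word', using the tempered pattern."""
    return URL_WITH_WORD.findall(s)

def by_split(s):
    """Return every URL containing 'word', using split plus a substring test."""
    return [part for part in s.split("###") if "word" in part]

print(by_regex(text))
print(by_split(text))
```

Both functions should return the same three URLs for the sample string; the split version is easier to read and to change.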

If the word has to be a part of the pathname, you might use filter in combination with URL and check if the parts of the pathname contain word.
let str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
let filteredUrls = str.split("###")
    .filter(s =>
        new URL(s).pathname
            .split('/')
            .includes('word')
    );
console.log(filteredUrls);
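The same path-segment check can be sketched in Python with the standard `urllib.parse` module. The helper name `has_path_segment` is an assumption made for this example, not something from the original answer.

```python
from urllib.parse import urlsplit

text = ("https://www.test.com/text/1###https://www.test.com/text/word/2###"
        "https://www.test.com/text/text/word/3###https://www.test.com/3/text###"
        "https://www.test.com/word/3/text/text")

def has_path_segment(url, segment="word"):
    """True if `segment` is one complete part of the URL path."""
    return segment in urlsplit(url).path.split("/")

filtered = [u for u in text.split("###") if has_path_segment(u)]
print(filtered)
```

Because whole path segments are compared, a URL such as https://www.test.com/password/1 would not be picked up by accident.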
If you want to use regex only and possessive quantifiers are supported (the javascript tag has since been removed from the question), you might use:
https?://[^#w]*(?:#(?!##)|w(?!ord)|[^#w]*)++word.*?(?=###|$)

Previous answer
You are most likely looking for this regular expression:
https://www\.test\.com/(text/)*word/\d+(/text)*
Here is how you can use it in a JavaScript context (every slash / is escaped with a backslash: \/):
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/https:\/\/www\.test\.com\/(text\/)*word\/\d+(\/text)*/g);
console.log(urls);
In the array you get exactly the elements you wanted.
Update, after the question was edited and the author added a comment
If you need to take only the words from your example string, then you have to use a slightly more complex regular expression:
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g);
console.log(urls);
Explanation
Here is the regular expression /(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g, delimited by /.../ and with the g flag, which makes the search return every occurrence rather than only the first.
The regular expression has two alternatives ...|...
The first one (?<=\/)\w+(?=\/\d+\/\w) captures cases when the searched word is directly behind the slash (?<=\/) and before more words behind the number (?=\/\d+\/\w).
https://www.test.com/word/3/text/text
The second alternative (?<=(\w\/\w+\/))\w+(?=\/\d) captures cases where the word is preceded by other words following the domain (?<=(\w\/\w+\/)) (in fact two slashes separated by alphanumeric characters) and the searched word is immediately before the slash followed by the number (?=\/\d).
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
All slashes must be escaped: \/.
The construction (?<=...) means lookbehind in regular expressions and (?=...) means lookahead in regular expressions.
Note 1. The above example currently only works well in a Chrome browser, because:
(...) now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Note 2. Even where lookbehind is supported, most regular expression engines require its content to have a fixed length. The example above does not respect that restriction, yet it is still valid for the engine used in Google Chrome's JavaScript implementation, the JGsoft engine and the .NET framework RegEx classes.
Note 3. The lookbehind syntax, or its poorer \K replacement, is widely supported by the regular expression engines used in a large group of programming languages.

You may first split by ### then check whether /word/ exists in each element:
var s = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var result = [];
s.split(/###/).forEach(function(el) {
    if (el.includes('/word/'))
        result.push(el);
})
// or else by using filter
// result = s.split(/###/).filter(el => el.includes('/word/'))
console.log(result);

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
Also, can I add some prefix or suffix words (or a space) that must be present wherever this logic is checked? In the above case, if the statement is an update, the regex should pick test.table, but if the statement is like select test.table, the regex should not pick up this combination. The same applies to the suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) - this only selected the word between two dots, not the whole combination.
.(?<=.)(.?)(?=.)(.?). - this selected everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
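As a quick check, both patterns can be tried in Python against the examples from the question. This is only a sketch: the names `DOTTED` and `AFTER_KEYWORD` are invented here, and `re.IGNORECASE` is used in place of the inline (?i) flag. Python accepts this lookbehind because both keywords have the same length.

```python
import re

# Any run of non-whitespace that contains a dot with text on both sides.
DOTTED = re.compile(r"[^\s]+\.[^\s]+")

# Same pattern, but only directly after the keyword INTO or FROM plus one whitespace.
AFTER_KEYWORD = re.compile(r"(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+", re.IGNORECASE)

sentence = "I am testing something like test.test.test in sentence."
sql = """INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )"""

print(DOTTED.findall(sentence))
print(AFTER_KEYWORD.findall(sql))
```

The trailing "sentence." is not matched by the first pattern because nothing follows its dot, and the aliased columns such as as_t.old are skipped by the second pattern because they are not preceded by one of the keywords.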
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w]*[.][\w]*)+/gm;
in Java:
final String regex = "([\\w]*[.][\\w]*)+";
in Python:
regex = r"([\w]*[.][\w]*)+"
in PHP:
$re = '/([\w]*[.][\w]*)+/m';
Note that this solution is written for your use case (SQL strings): it will also catch something like '.word' or 'word..word', which I assume does not occur in your input.

Regex match last substring among same substrings in the string

For example we have a string:
asd/asd/asd/asd/1#s_
I need to match this part: /asd/1#s_ or asd/1#s_
How is it possible to do with plain regex?
I've tried negative lookahead like this
But it didn't work
\/(?:.(?!\/))?(asd)(\/(([\W\d\w]){1,})|)$
it matches this '/asd/asd/asd/asd/asd/asd/1#s_'
from this 'prefix/asd/asd/asd/asd/asd/asd/1#s_'
and I need to match '/asd/1#s_' without all preceding /asd/'s
Match should work with plain regex
Without any helper functions of any programming language
https://regexr.com/
I use this site to check if regex matches or not
here's the possible strings:
prefix/asd/asd/asd/1#s
prefix/asd/asd/asd/1s#
prefix/asd/asd/asd/s1#
prefix/asd/asd/asd/s#1
prefix/asd/asd/asd/#1s
prefix/asd/asd/asd/#s1
and asd part could be replaced with any word like
prefix/a1sd/a1sd/a1sd/1#s
prefix/a1sd/a1sd/a1sd/1s#
...
So I need to match last repeating part with everything to the right
And everything to the right could be character, not character, digit, in any order
A more complicated string example:
prefix/a1sd/a1sd/a1sd/1s#/ds/dsse/a1sd/22$$#!/123/321/asd
this should match that part:
/a1sd/22$$#!/123/321/asd
Try this one. This works in python.
import re
reg = re.compile(r"\/[a-z]{1,}\/\d+[#a-z_]{1,}")
s = "asd/asd/asd/asd/1#s_"
print(reg.findall(s))
# ['/asd/1#s_']
Update:
Since the question lacks clarity, this only works with the given order and hence, I suppose any other combination simply fails.
Edits:
New Regex
reg = r"\/\w+(\/\w*\d+\W*)*(\/\d+\w*\W*)*(\/\d+\W*\w*)*(\/\w*\W*\d+)*(\/\W*\d+\w*)*(\/\W*\w*\d+)*$"

parsing url for specific param value

I'm looking to use a regular expression to parse a URL to get a specific section of the URL, and nothing if I cannot find the pattern.
A url example is
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5#c452fds-634d-f424fds-cdsa&bf_action=jildape
I wish to get the bolded text in it.
Currently I'm using the regex "d=([^#]*)", but the problem is that I'm also running across URLs of this pattern:
/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape
and I'm getting the bold section of it.
I would prefer that this URL produce no match, because it doesn't contain the #.
Regexes are not a magic tool that you should always use just because the problem involves a string. In this case, your language probably has a tool to break apart URLs for you. In PHP, this is parse_url(). In Perl, it's the URI::URL module.
You should almost always prefer an existing, well-tested solution to a common problem like this rather than writing your own.
So you want to match the value of the id parameter, but only if it has a trailing section containing a '#' symbol (without matching the '#' or what's after it)?
Not knowing the specifics of what style of regexes you're using, how about something like:
id=([^#&]*)#
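A small Python sketch shows how this pattern behaves on the two URLs from the question. The helper name `extract_id` is hypothetical, and returning `None` for "no match" is a choice made for this example.

```python
import re

# Capture everything after "id=" that is neither '#' nor '&', and require a '#' right after it.
ID_BEFORE_HASH = re.compile(r"id=([^#&]*)#")

def extract_id(url):
    """Return the id value if it is directly followed by '#', else None."""
    match = ID_BEFORE_HASH.search(url)
    return match.group(1) if match else None

with_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels"
             "/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5"
             "#c452fds-634d-f424fds-cdsa&bf_action=jildape")
without_hash = ("/te/file/value/jifle?uil=testing-cdas-feaw:jilk:&jklfe=https://value-value.jifels"
                "/temp.html/topic?id=e997aad4-92e0-j30e-a3c8-jfkaliejs5&bf_action=jildape")

print(extract_id(with_hash))
print(extract_id(without_hash))
```

Excluding '&' from the character class is what stops the second URL from matching: the scan halts at the '&' and never reaches a '#'.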
regex = "id=([\\w-]+?)#"
This will grab everything in the character class [a-zA-Z_0-9-] between 'id=' and '#', assuming everything between 'id=' and '#' is in that character class (i.e. if an '&' is in there, the regex will fail).
id=
-Self explanatory, this looks for the exact match of 'id='
([\\w-]+?)
-This defines a character class, quantifies it and captures the result. The \\w is an escaped \w. '\w' is a predefined character class in Java that is equal to [a-zA-Z_0-9]. I added '-' to this class because of the assumed pattern from your examples.
-The +? is a reluctant quantifier that looks for the shortest possible match of the regex. It sits inside the parentheses so that the group captures the whole value rather than only its last character.
#
-The end of the regex, the last character we are looking for to match the pattern.
If you are looking to grab every character between 'id=' and the first '#' following it, the following will work and it uses the same logic as above, but replaces the character class [\\w-] with ., which matches anything.
regex = "id=(.+?)#"

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
    $values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
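The same capture-group idea can be sketched in Python. The pattern is the one from the PHP snippet above; the variable names are chosen only for this illustration.

```python
import re

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

# Match the prefix normally and capture only the wanted field.
# re.MULTILINE lets ^ match at the start of every line.
FIELD = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)

print(FIELD.findall(text))
# The second field may grow to five or six digits without breaking the match.
print(FIELD.findall(text.replace("8467;", "846777;")))
```

Since the prefix is matched rather than looked behind, its length is free to vary, which is exactly what a fixed-length lookbehind cannot offer.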
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', i.e. (in php, I have no idea what you're doing)
foreach (explode("\n", $string) as $line) {
    $bits = explode(";", $line);
    echo $bits[2]; // third column (arrays are zero-indexed)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
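For illustration, the split-based approach could look like this in Python. The function name `third_field` and the idea of filtering on the first field are assumptions added for this sketch; the original answer simply prints the column for every line.

```python
text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

def third_field(data, key="bdfg34f"):
    """Return the third ';'-separated field of every line whose first field equals `key`."""
    values = []
    for line in data.splitlines():
        fields = line.split(";")
        if len(fields) > 2 and fields[0] == key:
            values.append(fields[2])  # index 2 is the third column
    return values

print(third_field(text))
```

No regular expression is involved, so the length of the second field does not matter at all.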

Regex with exception of particular words

I have problem with regex.
I need to make regex with an exception of a set of specified words, for example: apple, orange, juice.
and given these words, it will match everything except those words above.
applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)
apple (should not match)
orange (should not match)
juice (should not match)
If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):
rx = /^(?!apple$|orange$|juice$)/
I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.
So - lets build a set of characters that count as a "boundary":
/[^-a-z0-9A-Z_]/ // Will match any character that is <NOT> - _ or
// between a-z 0-9 A-Z
/(?:^|[^-a-z0-9A-Z_])/ // Matches the beginning of the string, or one of those
// non-word characters.
/(?:[^-a-z0-9A-Z_]|$)/ // Matches a non-word or the end of string
/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/
// This should >match< apple/orange/juice ONLY when not preceded/followed by another
// 'word' character. Just negate the result of the test to obtain your desired
// result.
In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...
If you are only testing 'single word' tests you can go with a much simpler:
/^(apple|orange|juice)$/ // and take the negation of this...
This gets some of the way there:
((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)
\A(?!apple\Z|juice\Z|orange\Z).*\Z
will match an entire string unless it only consists of one of the forbidden words.
Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines
^(?!apple$|juice$|orange$).*$
will also work.
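A short Python sketch of the negative-lookahead version, checked against the sample words from the question. The names `NOT_ONLY_FORBIDDEN` and `is_allowed` are made up for this example.

```python
import re

# Reject the string only if it consists of exactly one forbidden word.
NOT_ONLY_FORBIDDEN = re.compile(r"^(?!apple$|juice$|orange$).*$")

def is_allowed(s):
    """True unless `s` is exactly 'apple', 'orange' or 'juice'."""
    return NOT_ONLY_FORBIDDEN.match(s) is not None

samples = ["applejuice", "yummy-apple-juice", "orange-juice", "apple", "orange", "juice"]
for word in samples:
    print(word, is_allowed(word))
```

The lookahead runs once at the start of the string, so compound words that merely contain a forbidden word still pass.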
Here's some easy copy-paste code that works for more than just exact-words exceptions.
Copy/Paste Code:
In the following regex, ONLY replace the all-caps sections with your regex.
Python regex (atomic groups require Python 3.11 or later)
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript
Impossible as of 6/17/2020, and probably won't be possible in the near future.
Full Examples
REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)
Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)
How does it work?
This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.
The (?> is the start of an atomic group, which means it's not allowed to backtrack: if that group matches once but later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case.)
The (?<exceptions_group_1> creates a named capture group. It's just easier than using numbers. Note that the atomic group first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.
The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because it's looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.
Because the conditional fails, the overall match fails, and because the group is atomic it's not allowed to backtrack (to try to look for the normal pattern), since it already matched the exception.
This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.
Exact answer to the original question in Ruby
/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Unlike other methods, this one can be modified to reject any pattern, such as any word containing the sub-string "apple", "orange", or "juice".
/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
Something like (PHP)
$input = "The orange apple gave juice";
if (preg_match("your regex for validating", $input) && !preg_match("/apple|orange|juice/", $input))
{
    // it's ok;
}
else
{
    // throw validation error
}