Regex AND operator - regex

Based on this answer
Regular Expressions: Is there an AND operator?
I tried the following on http://regexpal.com/ but was unable to get it to work. What am missing? Does javascript not support it?
Regex: (?=foo)(?=baz)
String: foo,bar,baz

It is impossible for both (?=foo) and (?=baz) to match at the same time. It would require the next character to be both f and b simultaneously which is impossible.
Perhaps you want this instead:
(?=.*foo)(?=.*baz)
This says that foo must appear anywhere and baz must appear anywhere, not necessarily in that order and possibly overlapping (although overlapping is not possible in this specific case because the letters themselves don't overlap).

Example of a Boolean (AND) plus Wildcard search, which I'm using inside a javascript Autocomplete plugin:
String to match: "my word"
String to search: "I'm searching for my funny words inside this text"
You need the following regex: /^(?=.*my)(?=.*word).*$/im
Explaining:
^ assert position at start of a line
?= Positive Lookahead
.* matches any character (except newline)
() Groups
$ assert position at end of a line
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
Test the Regex here: https://regex101.com/r/iS5jJ3/1
So, you can create a javascript function that:
Replace regex reserved characters to avoid errors
Split your string at spaces
Encapsulate your words inside regex groups
Create a regex pattern
Execute the regex match
Example:
function fullTextCompare(myWords, toMatch){
//Replace regex reserved characters
myWords=myWords.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
//Split your string at spaces
arrWords = myWords.split(" ");
//Encapsulate your words inside regex groups
arrWords = arrWords.map(function( n ) {
return ["(?=.*"+n+")"];
});
//Create a regex pattern
sRegex = new RegExp("^"+arrWords.join("")+".*$","im");
//Execute the regex match
return(toMatch.match(sRegex)===null?false:true);
}
//Using it:
console.log(
fullTextCompare("my word","I'm searching for my funny words inside this text")
);
//Wildcards:
console.log(
fullTextCompare("y wo","I'm searching for my funny words inside this text")
);

Maybe you are looking for something like this. If you want to select the complete line when it contains both "foo" and "baz" at the same time, this RegEx will comply that:
.*(foo)+.*(baz)+|.*(baz)+.*(foo)+.*

Maybe just an OR operator | could be enough for your problem:
String: foo,bar,baz
Regex: (foo)|(baz)
Result: ["foo", "baz"]

Related

Regex to match(extract) string between dot(.)

I want to select some string combination (with dots(.)) from a very long string (sql). The full string could be a single line or multiple line with new line separator, and this combination could be in start (at first line) or a next line (new line) or at both place.
I need help in writing a regex for it.
Examples:
String s = I am testing something like test.test.test in sentence.
Expected output: test.test.test
Example2 (real usecase):
UPDATE test.table
SET access = 01
WHERE access IN (
SELECT name FROM project.dataset.tablename WHERE name = 'test' GROUP BY 1 )
Expected output: test.table and project.dataset.tablename
, can I also add some prefix or suffix words or space which should be present where ever this logic gets checked. In above case if its update regex should pick test.table, but if the statement is like select test.table regex should not pick it up this combinations and same applies for suffix.
Example3: This is to illustrate the above theory.
INS INTO test.table
SEL 'abcscsc', wu_id.Item_Nbr ,1
FROM test.table as_t
WHERE as_t.old <> 0 AND as_t.date = 11
AND (as_t.numb IN ('11') )
Expected Output: test.table, test.table (Key words are INTO and FROM)
Things Not Needed in selection:as_t.numb, as_t.old, as_t.date
If I get the regex I can use in program to extract this word.
Note: Before and after string words to the combination could be anything like update, select { or(, so we have to find the occurrence of words which are joined together with .(dot) and all the number of such occurrence.
I tried something like this:
(?<=.)(.?)(?=.)(.?) -: This only selected the word between two .dot and not all.
.(?<=.)(.?)(?=.)(.?). - This everything before and after.
To solve your initial problem, we can just use some negation. Here's the pattern I came up with:
[^\s]+\.[^\s]+
[^ ... ] Means to make a character class including everything except for what's between the brackets. In this case, I put \s in there, which matches any whitespace. So [^\s] matches anything that isn't whitespace.
+ Is a quantifier. It means to find as many of the preceding construct as you can without breaking the match. This would happily match everything that's not whitespace, but I follow it with a \., which matches a literal .. The \ is necessary because . means to match any character in regex, so we need to escape it so it only has its literal meaning. This means there has to be a . in this group of non-whitespace characters.
I end the pattern with another [^\s]+, which matches everything after the . until the next whitespace.
Now, to solve your secondary problem, you want to make this match only work if it is preceded by a given keyword. Luckily, regex has a construct almost specifically for this case. It's called a lookbehind. The syntax is (?<= ... ) where the ... is the pattern you want to look for. Using your example, this will only match after the keywords INTO and FROM:
(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
Here (?:INTO|FROM) means to match either the text INTO or the text FROM. I then specify that it should be followed by a whitespace character with \s. One possible problem here is that it will only match if the keywords are written in all upper case. You can change this behavior by specifying the case insensitive flag i to your regex parser. If your regex parser doesn't have a way to specify flags, you can usually still specify it inline by putting (?i) in front of the pattern, like so:
(?i)(?<=(?:INTO|FROM)\s)[^\s]+\.[^\s]+
If you are new to regex, I highly recommend using the www.regex101.com website to generate regex and learn how it works. Don't forget to check out the code generator part for getting the regex code based on the programming language you are using, that's a cool feature.
For your question, you need a regex that understands any word character \w that matches between 0 and unlimited times, followed by a dot, followed by another series of word character that repeats between 0 and unlimited times.
So here is my solution to your question:
Your regex in JavaScript:
const regex = /([\w][.][\w])+/gm;
in Java:
final String regex = "([\w][.][\w])+";
in Python:
regex = r"([\w][.][\w])+"
in PHP:
$re = '/([\w][.][\w])+/m';
Note that: this solution is written for your use case (to be used for SQL strings), because now if you have something like '.word' or 'word..word', it will still catch it which I assume you don't have a string like that.
See this screenshot for more details

Regular expression to find specific text within a string enclosed in two strings, but not the entire string

I have this type of text:
string1_dog_bit_johny_bit_string2
string1_cat_bit_johny_bit_string2
string1_crocodile_bit_johny_bit_string2
string3_crocodile_bit_johny_bit_string4
string4_crocodile_bit_johny_bit_string5
I want to find all occurrences of “bit” that occur only between string1 and string2. How do I do this with regex?
I found the question Regex Match all characters between two strings, but the regex there matches the entire string between string1 and string2, whereas I want to match just parts of that string.
I am doing a global replacement in Notepad++. I just need regex, code will not work.
Thank you in advance.
Roman
If I understand correctly here a code to do what you want
var intput = new List<string>
{
"string1_dog_bit_johny_bit_string2",
"string1_cat_bit_johny_bit_string2",
"string1_crocodile_bit_johny_bit_string2",
"string3_crocodile_bit_johny_bit_string4",
"string4_crocodile_bit_johny_bit_string5"
};
Regex regex = new Regex(#"(?<bitGroup>bit)");
var allMatches = new List<string>();
foreach (var str in intput)
{
if (str.StartsWith("string1") && str.EndsWith("string2"))
{
var matchCollection = regex.Matches(str);
allMatches.AddRange(matchCollection.Cast<Match>().Select(match => match.Groups["bitGroup"].Value));
}
}
Console.WriteLine("All matches {0}", allMatches.Count);
This regex will do the job:
^string1_(?:.*(bit))+.*_string2$
^ means the start of the text (or line if you use the m option like so: /<regex>/m )
$ means the end of the text
. means any character
* means the previous character/expression is repeated 0 or more times
(?:<stuff>) means a non-capturing group (<stuff> won't be captured as a result of the matching)
You could use ^string1_(.*(bit).*)*_string2$ if you don't care about performance or don't have large/many strings to check. The outer parenthesis allow multiple occurences of "bit".
If you provide us with the language you want to use, we could give more specific solutions.
edit: As you added that you're trying a replacement in Notepad++ I propose the following:
Use (?<=string1_)(.*)bit(.*)(?=_string2) as regex and $1xyz$2 as replacement pattern (replace xyz with your string). Then perform an "replace all" operation until N++ doesn't find any more matches. The problem here is that this regex will only match 1 bit per line per iteration - and therefore needs to be applied repeatedly.
Btw. even if a regexp matches the whole line, you can still only replace parts of it using capturing groups.
You can use the regex:
(?:string1|\G)(?:(?!string2).)*?\Kbit
regex101 demo. Tried it on notepad++ as well and it's working.
There're description in the demo site, but if you want more explanations, let me know and I'll elaborate!

Regex to match alphanumerics, URL operators except forward slashes

I've been trying for the past couple of hours to get this regex right but unfortunately, I still can't get it. Tried searching through existing threads too but no dice. :(
I'd like a regex to match the following possible strings:
userprofile?id=123
profile
search?type=player&gender=male
someotherpage.htm
but not
userprofile/
helloworld/123
Basically, I'd like the regex to match alphanumerics, URL operators such as ?, = and & but not forward slashes. (i.e. As long as the string contains a forward slash, the regex should just return 0 matches.)
I've tried the following regexes but none seem to work:
([0-9a-z?=.]+)
(^[^\/]*$[0-9a-z?=.]+)
([0-9a-z?=.][^\/]+)
([0-9a-z?=.][\/$]+)
Any help will be greatly appreciated. Thank you so much!
The reason they all match is that your regexp matches part of the string and you've not told it that it needs to match the entire string. You need to make sure that it doesn't allow any other characters anywhere in the string, e.g.
^[0-9a-z&?=.]+$
Here's a small perl script to test it:
#!/usr/bin/perl
my #testlines = (
"userprofile?id=123",
"userprofile",
"userprofile?type=player&gender=male",
"userprofile.htm",
"userprofile/",
"userprofile/123",
);
foreach my $testline(#testlines) {
if ($testline =~ /^[0-9a-z&?=.]+$/) {
print "$testline matches\n";
} else {
print "$testline doesn't match - bad regexp, no cookie\n";
}
}
This should do the trick:
/\w+(\.htm|\?\w+=\w*(&\w+=\w*)*)?$/i
To break this down:
\w+ // Match [a-z0-9_] (1 or more), to specify resource
( // Alternation group (i.e., a OR b)
\.htm // Match ".htm"
| // OR
\? // Match "?"
\w+=\w* // Match first term of query string (e.g., something=foo)
(&\w+=\w*)* // Match remaining terms of query string (zero or more)
)
? // Make alternation group optional
$ // Anchor to end of string
The i flag is for case-insensitivity.

Regular Expression to find a string included between two characters while EXCLUDING the delimiters

I need to extract from a string a set of characters which are included between two delimiters, without returning the delimiters themselves.
A simple example should be helpful:
Target: extract the substring between square brackets, without returning the brackets themselves.
Base string: This is a test string [more or less]
If I use the following reg. ex.
\[.*?\]
The match is [more or less]. I need to get only more or less (without the brackets).
Is it possible to do it?
Easy done:
(?<=\[)(.*?)(?=\])
Technically that's using lookaheads and lookbehinds. See Lookahead and Lookbehind Zero-Width Assertions. The pattern consists of:
is preceded by a [ that is not captured (lookbehind);
a non-greedy captured group. It's non-greedy to stop at the first ]; and
is followed by a ] that is not captured (lookahead).
Alternatively you can just capture what's between the square brackets:
\[(.*?)\]
and return the first captured group instead of the entire match.
If you are using JavaScript, the solution provided by cletus, (?<=\[)(.*?)(?=\]) won't work because JavaScript doesn't support the lookbehind operator.
Edit: actually, now (ES2018) it's possible to use the lookbehind operator. Just add / to define the regex string, like this:
var regex = /(?<=\[)(.*?)(?=\])/;
Old answer:
Solution:
var regex = /\[(.*?)\]/;
var strToMatch = "This is a test string [more or less]";
var matched = regex.exec(strToMatch);
It will return:
["[more or less]", "more or less"]
So, what you need is the second value. Use:
var matched = regex.exec(strToMatch)[1];
To return:
"more or less"
Here's a general example with obvious delimiters (X and Y):
(?<=X)(.*?)(?=Y)
Here it's used to find the string between X and Y. Rubular example here, or see image:
You just need to 'capture' the bit between the brackets.
\[(.*?)\]
To capture you put it inside parentheses. You do not say which language this is using. In Perl for example, you would access this using the $1 variable.
my $string ='This is the match [more or less]';
$string =~ /\[(.*?)\]/;
print "match:$1\n";
Other languages will have different mechanisms. C#, for example, uses the Match collection class, I believe.
[^\[] Match any character that is not [.
+ Match 1 or more of the anything that is not [. Creates groups of these matches.
(?=\]) Positive lookahead ]. Matches a group ending with ] without including it in the result.
Done.
[^\[]+(?=\])
Proof.
http://regexr.com/3gobr
Similar to the solution proposed by null. But the additional \] is not required. As an additional note, it appears \ is not required to escape the [ after the ^. For readability, I would leave it in.
Does not work in the situation in which the delimiters are identical. "more or less" for example.
Most updated solution
If you are using Javascript, the best solution that I came up with is using match instead of exec method.
Then, iterate matches and remove the delimiters with the result of the first group using $1
const text = "This is a test string [more or less], [more] and [less]";
const regex = /\[(.*?)\]/gi;
const resultMatchGroup = text.match(regex); // [ '[more or less]', '[more]', '[less]' ]
const desiredRes = resultMatchGroup.map(match => match.replace(regex, "$1"))
console.log("desiredRes", desiredRes); // [ 'more or less', 'more', 'less' ]
As you can see, this is useful for multiple delimiters in the text as well
PHP:
$string ='This is the match [more or less]';
preg_match('#\[(.*)\]#', $string, $match);
var_dump($match[1]);
This one specifically works for javascript's regular expression parser /[^[\]]+(?=])/g
just run this in the console
var regex = /[^[\]]+(?=])/g;
var str = "This is a test string [more or less]";
var match = regex.exec(str);
match;
To remove also the [] use:
\[.+\]
I had the same problem using regex with bash scripting.
I used a 2-step solution using pipes with grep -o applying
'\[(.*?)\]'
first, then
'\b.*\b'
Obviously not as efficient at the other answers, but an alternative.
I wanted to find a string between / and #, but # is sometimes optional. Here is the regex I use:
(?<=\/)([^#]+)(?=#*)
Here is how I got without '[' and ']' in C#:
var text = "This is a test string [more or less]";
// Getting only string between '[' and ']'
Regex regex = new Regex(#"\[(.+?)\]");
var matchGroups = regex.Matches(text);
for (int i = 0; i < matchGroups.Count; i++)
{
Console.WriteLine(matchGroups[i].Groups[1]);
}
The output is:
more or less
If you need extract the text without the brackets, you can use bash awk
echo " [hola mundo] " | awk -F'[][]' '{print $2}'
result:
hola mundo

Regular Expression: how can I impose a perfect string matching?

Currently I am using this one ( edit: I missed to explain that I use this one for excluding exactly these words :p ):
String REGEXP = "^[^(REG_)?].*";
but matches (exluding) also ERG, EGR, GRE, etc... above
P.S.
I removed super because it is another keyword that I must filter, figure an array list composed with more of the following three words to be used as model:
REG_info1, info2, SUPER_info3, etc...
I need three filter matching one model at time, my question focus only on the second filter parsing keywords based on model "info2".
Just type it literally:
REG
This will only match REG.
So:
String REGEXP = "^(REG_|SUPER_)?.*";
Edit   After you clarified that you want to match every word that does not begin with REG_ or SUPER_, you could try this:
\b(?!REG_|SUPER_)\w+
The \b is a word boundary and the expression (?!expr) is a look-ahead assertion.
As everyone have already replied, if you want to match a line starting with REG, you use the regexp "^REG", if you want to match any line that starts REG or SUPER, you use "^(REG|SUPER)" and regular expression negation is, in general, a tricky problem.
To match all lines NOT starting with 'REG' you need to match "^[^R]|R[^E]|RE[^G]" and a regular expression to match all lines not starting with REG or SUPER can be constructed in a similar fashion (start by grouping the "not REG" in parentheses, then construct the "not SUPER" patterns as "[^S]|S[^U]|[SU[^P]...", group this and use alternation for both groups).
How about
\mREG\M
// \mREG\M
//
// Options: ^ and $ match at line breaks
//
// Assert position at the beginning of a word «\m»
// Match the characters “REG” literally «REG»
// Assert position at the end of a word «\M»
The [] indicate character classes. This is not what you want. You can just use "REG" to match REG. (You can use REG|SUPER for REG or SUPER)
REGEXP = "^(REG_|SUPER_)"
would match anything that haves REG_ or SUPER_ at the beginning of a string. You don't need more after the group "(..|..)"