How do I exclude a directory using regular expressions? - regex

I asked a question a little while ago about using regular expressions to extract a match from a URL in a particular directory.
eg: www.domain.com/shop/widgets/match/
The solution given was ^/shop.*/([^/]+)/?$
This would return "match"
However, my file structure has changed and I now need an expression that instead returns "match" in any directory excluding "pages" and "system"
Basically I need an expression that will return "match" for the following:
www.domain.com/shop/widgets/match/
www.domain.com/match/
But not:
www.domain.com/pages/widgets/match/
www.domain.com/pages/
www.domain.com/system/widgets/match/
www.domain.com/system/
I've been struggling for days without any luck.
Thanks

This is just an alternative to Graham's great answer above. The code is in C# (but for the regex part, that doesn't matter):
void MatchDemo()
{
var reg = new Regex("( " +
" (\\w+[.]) " +
" | " +
" (\\w+[/])+ " +
") " +
"(shop[/]|\\w+[/]) " + //the URL-string must contain the sequence "shop"
"(match) " ,
RegexOptions.IgnorePatternWhitespace);
var url = @"www.domain.com/shop/widgets/match/";
var retVal = reg.Match(url).Groups[5]; //do we have anything in the fifth parentheses?
Console.WriteLine(retVal);
Console.ReadLine();
}
/Hans

BRE and ERE do not provide a way to negate a portion of the RE, except within a bracket expression. That is, you can write [^a-z], but you can't express "not (abc|def)". If your regex dialect is ERE, then you must use two regexps. If you're using PCRE (for example through PHP's preg_* functions), you can use a negative lookahead.
For example, here's some PHP:
#!/usr/local/bin/php
<?php
$re = '/^www\.example\.com\/(?!(system|pages)\/)([^\/]+\/)*([^\/]+)\/$/';
$test = array(
'www.example.com/foo/bar/baz/match/',
'www.example.com/shop/widgets/match/',
'www.example.com/match/',
'www.example.com/pages/widgets/match/',
'www.example.com/pages/',
'www.example.com/system/widgets/match/',
'www.example.com/system/',
);
foreach ($test as $one) {
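// Only read $matches[3] when the pattern matched, to avoid an undefined-index warning.
$found = preg_match($re, $one, $matches);
printf(">> %-50s\t%s\n", $one, $found ? $matches[3] : '');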
}
And the output:
[ghoti@pc ~]$ ./phptest
>> www.example.com/foo/bar/baz/match/ match
>> www.example.com/shop/widgets/match/ match
>> www.example.com/match/ match
>> www.example.com/pages/widgets/match/
>> www.example.com/pages/
>> www.example.com/system/widgets/match/
>> www.example.com/system/
Is that what you're looking for?
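For dialects without lookahead, the two-regexp approach mentioned above might look like the following minimal Python sketch. The names EXCLUDED, LAST_SEGMENT and last_segment are made up for illustration, and both patterns deliberately use only constructs that ERE also offers: the first rejects the excluded top-level directories, the second extracts the final path segment.

```python
import re

# Pattern 1: does the URL start in one of the excluded directories?
EXCLUDED = re.compile(r"^www\.example\.com/(system|pages)/")
# Pattern 2: capture the last path segment (group 2) of any remaining URL.
LAST_SEGMENT = re.compile(r"^www\.example\.com/([^/]+/)*([^/]+)/$")


def last_segment(url):
    """Return the final path segment, or None if the URL is excluded or malformed."""
    if EXCLUDED.search(url):
        return None
    match = LAST_SEGMENT.match(url)
    return match.group(2) if match else None


if __name__ == "__main__":
    for url in ("www.example.com/shop/widgets/match/",
                "www.example.com/match/",
                "www.example.com/pages/widgets/match/",
                "www.example.com/system/"):
        print(url, "->", last_segment(url))
```

Splitting the work this way costs one extra test per URL, but each pattern stays short and easy to read.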

Related

Another regex expression

I need a regular expression for the next rules:
should not start or end with a space
should contain just letters (lower / upper), digits, #, single quotes, hyphens and spaces (spaces just inside, but not at the beginning and the end, as I already said)
should contain at least one letter (lower or upper).
Thank you
I think
^[^ ](?=.*[a-zA-Z]+)[a-zA-Z0-9#'\- ]*[^ ]$
should help you.
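Note that [^ ] in the pattern above accepts any non-space character in the first and last position (so "?ab?" would pass), and the pattern needs at least two characters. A stricter variant, written here as a Python sketch, might look like this. The name is_valid is made up for illustration, and the sketch assumes that "letters" means ASCII letters only.

```python
import re

# Lookahead: at least one ASCII letter somewhere in the string.
# First and last characters come from the allowed set without the space;
# spaces are only allowed in between.
VALID = re.compile(r"(?=.*[A-Za-z])[A-Za-z0-9#'-](?:[A-Za-z0-9#' -]*[A-Za-z0-9#'-])?")


def is_valid(text):
    """Return True if text satisfies the three rules from the question."""
    return VALID.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("a1 - 'super' 1", " a1", "a1 ", "a1?", "1 - '123'", "a"):
        print(repr(sample), is_valid(sample))
```

Using fullmatch instead of ^...$ avoids the corner case where $ also matches just before a trailing newline.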
"Does it really matter guys?"
With regard to the dialect of regex: yes, it does matter. Different languages use different dialects. One example off the top of my head is that PHP's regex library supports lookbehinds, whereas JavaScript's did not until ES2018. This is why it is important to state the language you're using. Also, for future reference, it helps those answering your question if you provide sample input and the matches you expect from it.
Given the information you provided, I think you should validate the input with a combination of RegEx and plain JavaScript checks. Take a look at this example:
window.onload = function() {
  var valid = "a1 - 'super' 1";
  var invalid1 = " a1 - 'super' 1"; // leading whitespace
  var invalid2 = "a1 - 'super' 1 "; // trailing whitespace
  var invalid3 = "a1 - 'super' 1?"; // invalid (?) char
  var invalid4 = "1 - '123'"; // no letters
  console.log(valid + ": " + validation(valid));
  console.log(invalid1 + ": " + validation(invalid1));
  console.log(invalid2 + ": " + validation(invalid2));
  console.log(invalid3 + ": " + validation(invalid3));
  console.log(invalid4 + ": " + validation(invalid4));
}
function validation(input) {
  // Matches any character outside the allowed set: letters, digits, #, single quote, hyphen, space.
  var disallowedChars = /[^a-zA-Z\d#' -]/;
  var containsLetter = /[a-zA-Z]/;
  // No leading/trailing whitespace, no disallowed characters, at least one letter.
  return input.trim().length == input.length && !disallowedChars.test(input) && containsLetter.test(input);
}

PowerShell regular expression on logfile is capturing too much

I am trying to extract some text from a logfile, and I'm having problems.
Example text I am working on is:
ahksjhadjsadhsah
sakdsjakdjks
ksajdksaj
REF=35464
sadsad
213213
213
2
13
I need to extract the value "35464" (the REF number). I have limited knowledge of regular expressions, but thought 'REF=([0-9]+)' would do this.
Now I'm not sure how best to read this file, so I've tried a couple of ways:
select-string -path e:\powershell\log.txt -pattern 'REF=([0-9]+)' | % { $_.Matches } | % { $_.Value }
Which gives me "REF=35464" - which I don't understand (why REF= is included), because I thought the 'capture' was only the parts in ()'s?
I also tried:
$data=Get-Content e:\powershell\log.txt
$data -match 'REF=([0-9]+)'
$Matches
But $Matches is empty.
I also tried a similar method to the above, but line by line, for example:
foreach ($line in $data)
{
$line -match 'REF=([0-9]+)'
}
I either get no matches or the full match (including the REF= part). I've also tried groups (that is, '(REF=)([0-9]+)'), and I can't get what I need.
How should I be reading the file? What is wrong with my regular expression?
I just need this extracted value as a usable variable.
It may be the way you are trying to access the capture group
I put this quick static class together to illustrate how to get the match you are looking for.
Note: I am using the @ symbol on the regex and your input string to make them verbatim literals.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
namespace SkunkWorks.RegexPractice
{
public static class RegexPractice2
{
public static string input = @"ahksjhadjsadhsah
sakdsjakdjks
ksajdksaj
REF=35464
sadsad
213213
213
2
13";
static string pat = @"REF=([0-9]+)";
public static void Do()
{
Regex r = new Regex(pat, RegexOptions.IgnoreCase);
Match m = r.Match(input);
int matchCount = 0;
while (m.Success)
{
Console.WriteLine("Match" + (++matchCount));
for (int i = 1; i < m.Groups.Count; i++) // group 0 is the whole match; the pattern has one capture group
{
Group g = m.Groups[i];
Console.WriteLine("Group" + i + "='" + g + "'");
CaptureCollection cc = g.Captures;
for (int j = 0; j < cc.Count; j++)
{
Capture c = cc[j];
System.Console.WriteLine("Capture" + j + "='" + c + "', Position=" + c.Index);
}
}
m = m.NextMatch();
}
}
}
}
What I usually do when I need to extract a substring from an array of strings is to use the automatic variable $Matches that is generated from using the -match operator in a Where statement. Like this:
$Data | Where{$_ -match "REF=([0-9]+)"} | ForEach{$Matches[1]}
Here the $Matches variable is a hashtable that -match fills in when it tests a single string. Entry [0] is the full text matched by the pattern (REF=35464), and entry [1] is the text captured by the first set of parentheses, which is why I specify [1]. Now, about the RegEx you're matching on: it is technically fine, because [0-9]+ is greedy and captures every consecutive digit after REF=. It says nothing about what follows the digits, though. If you want to be sure the number runs to the end of the line, add the end-of-line anchor $ to your match: REF=([0-9]+)$. We can't really tell if there's any whitespace after the numbers, so you might want to allow for that too using the \s notation, which matches any whitespace character (spaces, tabs, whatever), followed by an asterisk, which means zero or more. Then it becomes REF=([0-9]+)\s*$, which gets you exactly what you were looking for. Lastly, I would use \d instead of [0-9] because it is shorter and simpler; note that in .NET \d also matches decimal digits from other Unicode scripts, which rarely matters for a logfile. So, we have:
$Data | Where{$_ -match "REF=(\d+)\s*$"} | ForEach{$Matches[1]}
And that is broken down step by step and explained here: https://regex101.com/r/dG7jC7/1
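The distinction between the whole match and the capture group is the same in most regex engines: group 0 is everything the pattern matched, group 1 is only what the first pair of parentheses captured. As an illustration outside PowerShell, here is a minimal Python sketch; LOG_TEXT and extract_ref are made-up names and the sample text is taken from the question.

```python
import re

LOG_TEXT = """ahksjhadjsadhsah
sakdsjakdjks
ksajdksaj
REF=35464
sadsad
213213
"""

# MULTILINE makes $ match at the end of every line, not only at the end of the text.
REF_PATTERN = re.compile(r"REF=(\d+)\s*$", re.MULTILINE)


def extract_ref(text):
    """Return the digits after REF= as a string, or None if no line matches."""
    match = REF_PATTERN.search(text)
    if match is None:
        return None
    # match.group(0) would be the whole match ("REF=35464"); group(1) is the capture.
    return match.group(1)


if __name__ == "__main__":
    print(extract_ref(LOG_TEXT))
```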

Regular expression for removing white spaces but not those inside ""

I have the following input string:
key1 = "test string1" ; key2 = "test string 2"
I need to convert it to the following without tokenizing
key1="test string1";key2="test string 2"
You'd be far better off NOT using a regular expression.
What you should be doing is parsing the string. The problem you've described is a mini-language, since each point in that string has a state (eg "in a quoted string", "in the key part", "assignment").
For example, what happens when you decide you want to escape characters?
key1="this is a \"quoted\" string"
Move along the string character by character, maintaining and changing state as you go. Depending on the state, you can either emit or omit the character you've just read.
As a bonus, you'll get the ability to detect syntax errors.
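A minimal sketch of that character-by-character approach, in Python. The function name strip_unquoted_spaces is made up for illustration, and the sketch assumes double quotes delimit strings and a backslash escapes the next character inside them.

```python
def strip_unquoted_spaces(text):
    """Remove whitespace outside double-quoted strings; keep quoted text untouched."""
    out = []
    in_quotes = False   # state: are we inside a quoted string?
    escaped = False     # state: was the previous character a backslash inside quotes?
    for ch in text:
        if in_quotes:
            out.append(ch)          # everything inside quotes is emitted as-is
            if escaped:
                escaped = False     # this character was escaped; it cannot close the string
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
            out.append(ch)
        elif not ch.isspace():
            out.append(ch)          # outside quotes: emit everything except whitespace
    if in_quotes:
        raise ValueError("unterminated quoted string")
    return "".join(out)


if __name__ == "__main__":
    print(strip_unquoted_spaces('key1 = "test string1" ; key2 = "test string 2"'))
```

The ValueError at the end is the syntax-error detection mentioned above: a regex substitution would silently pass an unterminated string through.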
Using ERE, i.e. extended regular expressions (which are more clear than basic RE in such cases), assuming no quote escaping and having global flag (to replace all occurrences) you can do it this way:
s/ *([^ "]*) *("[^"]*")?/\1\2/g
sed:
$ echo 'key1 = "test string1" ; key2 = "test string 2"' | sed -r 's/ *([^ "]*) *("[^"]*")?/\1\2/g'
C# code:
using System.Text.RegularExpressions;
Regex regex = new Regex(" *([^ \"]*) *(\"[^\"]*\")?");
String input = "key1 = \"test string1\" ; key2 = \"test string 2\"";
String output = regex.Replace(input, "$1$2");
Console.WriteLine(output);
Output:
key1="test string1";key2="test string 2"
Escape-aware version
On second thought, leaving out an escape-aware version of the regexp may mislead readers, so here it is:
s/ *([^ "]*) *("([^\\"]|\\.)*")?/\1\2/g
which in C# looks like:
Regex regex = new Regex(" *([^ \"]*) *(\"(?:[^\\\\\"]|\\\\.)*\")?");
String output = regex.Replace(input, "$1$2");
Please do not go blind from those backslashes!
Example
Input: key1 = "test \\ " " string1" ; key2 = "test \" string 2"
Output: key1="test \\ "" string1";key2="test \" string 2"
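To check the escape-aware pattern against the example above without a C# project, the same substitution can be sketched in Python. The names PATTERN and compact are made up for illustration; the sketch relies on Python 3.5 or later, where an optional group that did not participate in the match is replaced by an empty string.

```python
import re

# Same escape-aware pattern as above, with the inner group made non-capturing.
PATTERN = re.compile(r' *([^ "]*) *("(?:[^\\"]|\\.)*")?')


def compact(text):
    """Drop spaces outside quoted strings, honouring backslash escapes inside them."""
    return PATTERN.sub(r"\1\2", text)


if __name__ == "__main__":
    print(compact(r'key1 = "test \\ " " string1" ; key2 = "test \" string 2"'))
```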

Matching math expression with regular expression?

For example, these are valid math expressions:
a * b + c
-a * (b / 1.50)
(apple + (-0.5)) * (boy - 1)
And these are invalid math expressions:
--a *+ b # 1.5.0 // two consecutive signs, two consecutive operators, invalid operator, invalid number
-a * b + 1) // unmatched parentheses
a) * (b + c) / (d // unmatched parentheses
I have no problem with matching float numbers, but have difficulty with parentheses matching. Any idea? If there is better solution than regular expression, I'll accept as well. But regex is preferred.
========
Edit:
I want to make some comments on my choice of the “accepted answer”, hoping that people who have the same question and find this thread will not be misled.
There are several answers I consider “accepted”, but I have no idea which one is the best. So I chose the accepted answer (almost) randomly. I recommend reading Guillaume Malartre’s answer as well as the accepted answer. All of them give practical solutions to my question. For a somewhat rigorous/theoretical answer, please read David Thornley’s comments under the accepted answer. As he mentioned, Perl’s extensions to regular expressions (which originated from regular languages) make them “irregular”. (I mentioned no language in my question, so most answerers assumed the Perl implementation of regular expressions, probably the most popular implementation. So did I when I posted my question.)
Please correct me if I said something wrong above.
Use a pushdown automaton for matching parentheses http://en.wikipedia.org/wiki/Pushdown_automaton (or just a stack ;-) )
Details for the stack solution:
while (chr available)
    chr = next character
    if chr == '(' then
        push '('
    else if chr == ')' then
        if stack.elements == 0 then
            print('too many or misplaced )')
            exit
        else
            pop // from stack
        end if
    end if
end while
if (stack.elements != 0)
    print('too many or misplaced (')
Even simpler: just keep a counter instead of a stack.
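The counter variant could be sketched in Python like this. The function name parens_balanced is made up for illustration, and the sketch checks only the parentheses, not the rest of the expression.

```python
def parens_balanced(expr):
    """Return True if every ')' closes an earlier '(' and no '(' is left open."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' appeared before its '('
                return False
    return depth == 0          # anything above zero means an unclosed '('


if __name__ == "__main__":
    for sample in ("(apple + (-0.5)) * (boy - 1)", "-a * b + 1)", "a) * (b + c) / (d"):
        print(sample, "->", parens_balanced(sample))
```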
Regular expressions can only be used to recognize regular languages. The language of mathematical expressions is not regular; you'll need to implement an actual parser (e.g. LR) in order to do this.
Matching parens with a regex is quite possible.
Here is a Perl script that will parse arbitrarily deep matching parens. While it will throw out the non-matching parens outside, I did not design it specifically to validate parens. It will parse arbitrarily deep parens so long as they are balanced. This will get you started, however.
The key is recursion both in the regex and the use of it. Play with it, and I am sure that you can get this to also flag non-matching parens. I think if you capture what this regex throws away and count parens (i.e. test for odd parens in the non-match text), you have invalid, unbalanced parens.
#!/usr/bin/perl
$re = qr /
( # start capture buffer 1
\( # match an opening paren
( # capture buffer 2
(?: # match one of:
(?> # don't backtrack over the inside of this group
[^()]+ # one or more
) # end non backtracking group
| # ... or ...
(?1) # recurse to opening 1 and try it again
)* # 0 or more times.
) # end of buffer 2
\) # match a closing paren
) # end capture buffer one
/x;
sub strip {
my ($str) = @_;
while ($str=~/$re/g) {
$match=$1; $striped=$2;
print "$match\n";
strip($striped) if $striped=~/\(/;
return $striped;
}
}
while(<DATA>) {
print "start pattern: $_";
while (/$re/g) {
strip($1) ;
}
}
__DATA__
"(apple + (-0.5)) * (boy - 1)"
"((((one)two)three)four)x(one(two(three(four))))"
"a) * (b + c) / (d"
"-a * (b / 1.50)"
Output:
start pattern: "(apple + (-0.5)) * (boy - 1)"
(apple + (-0.5))
(-0.5)
(boy - 1)
start pattern: "((((one)two)three)four)x(one(two(three(four))))"
((((one)two)three)four)
(((one)two)three)
((one)two)
(one)
(one(two(three(four))))
(two(three(four)))
(three(four))
(four)
start pattern: "a) * (b + c) / (d"
(b + c)
start pattern: "-a * (b / 1.50)"
(b / 1.50)
I believe you will be better off implementing a real parser to accomplish what you're after.
A parser for simple mathematical expressions is "Parsing 101", and there are several examples to be found online.
Some examples include:
ANTLR: Expression Evaluator Sample (ANTLR grammars can target several languages)
pyparsing: http://pyparsing.wikispaces.com/file/view/fourFn.py (pyparsing is a Python library)
Lex & Yacc: http://epaperpress.com/lexandyacc/ (contains a PDF tutorial and sample code for a calculator)
Note that the grammar you will need for validating expressions is simpler than the examples above, since the examples also implement evaluation of the expression.
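To give an idea of how small a validation-only parser can be, here is a recursive-descent sketch in Python. It is not taken from any of the linked examples; the names tokenize and is_valid_expression are made up, and the grammar is an assumption based on the question: a sign is allowed only at the start of an expression or right after an opening parenthesis, operands are numbers, identifiers or parenthesised expressions, and the operators are + - * /.

```python
import re

# An operand (number or identifier) or a single-character operator/parenthesis.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?|[A-Za-z_]\w*)|([-+*/()]))")


def tokenize(text):
    """Return a list of tokens ('n' for any operand), or None on an unknown character."""
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            return None
        tokens.append("n" if match.group(1) is not None else match.group(2))
        pos = match.end()
    return tokens


def is_valid_expression(text):
    """Validate: expr := [sign] operand (op operand)* ; operand := n | '(' expr ')'."""
    tokens = tokenize(text)
    if tokens is None:
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_operand():
        nonlocal pos
        if peek() == "n":
            pos += 1
            return True
        if peek() == "(":
            pos += 1
            if not parse_expr() or peek() != ")":
                return False
            pos += 1
            return True
        return False

    def parse_expr():
        nonlocal pos
        if peek() in ("+", "-"):          # optional leading sign
            pos += 1
        if not parse_operand():
            return False
        while peek() in ("+", "-", "*", "/"):
            pos += 1
            if not parse_operand():
                return False
        return True

    # Valid only if one expression consumes every token.
    return parse_expr() and pos == len(tokens)
```

Calling is_valid_expression("-a * (b / 1.50)") returns True, while "a) * (b + c) / (d" is rejected because tokens are left over after the first operand.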
You can't use regex to do things like balance parentheses.
This is tricky with one single regular expression, but quite easy using a mixed regexp/procedural approach. The idea is to construct a regexp for a simple expression (without parentheses) and then repeatedly replace ( simple-expression ) with some atomic string (e.g. an identifier). If the final reduced expression matches the same 'simple' pattern, the original expression is considered valid.
Illustration (in php).
function check_syntax($str) {
// define the grammar
$number = "\d+(\.\d+)?";
$ident = "[a-z]\w*";
$atom = "[+-]?($number|$ident)";
$op = "[+*/-]";
$sexpr = "$atom($op$atom)*"; // simple expression
// step1. remove whitespace
$str = preg_replace('~\s+~', '', $str);
// step2. repeatedly replace parenthetic expressions with 'x'
$par = "~\($sexpr\)~";
while(preg_match($par, $str))
$str = preg_replace($par, 'x', $str);
// step3. no more parens, the string must be simple expression
return preg_match("~^$sexpr$~", $str);
}
$tests = array(
"a * b + c",
"-a * (b / 1.50)",
"(apple + (-0.5)) * (boy - 1)",
"--a *+ b # 1.5.0",
"-a * b + 1)",
"a) * (b + c) / (d",
);
foreach($tests as $t)
echo $t, "=", check_syntax($t) ? "ok" : "nope", "\n";
The above only validates the syntax, but the same technique can be also used to construct a real parser.
For parenthesis matching, and implementing other expression validation rules, it is probably easiest to write your own little parser. Regular expressions are no good in this kind of situation.
OK, here's my version of parenthesis finding in ActionScript 3. This approach makes it easy to analyse the part before the parentheses, the part inside them, and the part after them. If some parentheses remain at the end, you can raise a warning or refuse to pass the expression to a final eval function.
package {
import flash.display.Sprite;
import mx.utils.StringUtil;
public class Stackoverflow_As3RegexpExample extends Sprite
{
private var tokenChain:String = "2+(3-4*(4/6))-9(82+-21)"
//Constructor
public function Stackoverflow_As3RegexpExample() {
// remove the "\" that just escape the following "\" if you want to test outside of flash compiler.
var getGroup:RegExp = new RegExp("((?:[^\\(\\)]+)?) (?:\\() ( (?:[^\\(\\)]+)? ) (?:\\)) ((?:[^\\(\\)]+)?)", "ix") //removed g flag
while (true) {
tokenChain = replace(tokenChain,getGroup)
if (tokenChain.search(getGroup) == -1) break;
}
trace("cummulativeEvaluable="+cummulativeEvaluable)
}
private var cummulativeEvaluable:Array = new Array()
protected function analyseGrammar(matchedSubstring:String, capturedMatch1:String, capturedMatch2:String, capturedMatch3:String, index:int, str:String):String {
trace("\nanalyseGrammar str:\t\t\t\t'"+str+"'")
trace("analyseGrammar matchedSubstring:'"+matchedSubstring+"'")
trace("analyseGrammar capturedMatchs:\t'"+capturedMatch1+"' '("+capturedMatch2+")' '"+capturedMatch3+"'")
trace("analyseGrammar index:\t\t\t'"+index+"'")
var blank:String = buildBlank(matchedSubstring.length)
cummulativeEvaluable.push(StringUtil.trim(matchedSubstring))
// I could do so much right here!
return str.substr(0,index)+blank+str.substr(index+matchedSubstring.length,str.length-1)
}
private function replace(str:String,regExp:RegExp):String {
var result:Object = regExp.exec(str)
if (result)
return analyseGrammar.apply(null,objectToArray(result))
return str
}
private function objectToArray(value:Object):Array {
var array:Array = new Array()
var i:int = 0
while (true) {
if (value.hasOwnProperty(i.toString())) {
array.push(value[i])
} else {
break;
}
i++
}
array.push(value.index)
array.push(value.input)
return array
}
protected function buildBlank(length:uint):String {
var blank:String = ""
while (blank.length != length)
blank = blank+" "
return blank
}
}
}
It should trace this:
analyseGrammar str: '2+(3-4*(4/6))-9(82+-21)'
analyseGrammar matchedSubstring:'3-4*(4/6)'
analyseGrammar capturedMatchs: '3-4*' '(4/6)' ''
analyseGrammar index: '3'
analyseGrammar str: '2+( )-9(82+-21)'
analyseGrammar matchedSubstring:'2+( )-9'
analyseGrammar capturedMatchs: '2+' '( )' '-9'
analyseGrammar index: '0'
analyseGrammar str: ' (82+-21)'
analyseGrammar matchedSubstring:' (82+-21)'
analyseGrammar capturedMatchs: ' ' '(82+-21)' ''
analyseGrammar index: '0'
cummulativeEvaluable=3-4*(4/6),2+( )-9,(82+-21)

Regular expression to match word pairs joined with colons

I don't know regular expressions at all. Can anybody help me with one very simple regular expression for
extracting 'word:word' from a sentence, e.g. "Java Tutorial Format:Pdf With Location:Tokyo Javascript"?
A small modification:
the first 'word' comes from a list, but the second can be anything: "word1 in [ABC, FGR, HTY]".
Guys, the situation demands a little more modification.
The matching form can be "word11:word12 word13 .. " up to the next "word21: ... " .
Things are becoming complex with sec.....I have to learn regex :(
Thanks in advance.
You can use the regex:
\w+:\w+
Explanation:
\w - single char which is either a letter(uppercase or lowercase), digit or a _.
\w+ - one or more of above char..basically a word
so \w+:\w+
would match a pair of words separated by a colon.
Try \b(\S+?):(\S+?)\b. Group 1 will capture "Format" and group 2, "Pdf".
A working example:
<html>
<head>
<script type="text/javascript">
function test() {
var re = /\b(\S+?):(\S+?)\b/g; // without 'g' matches only the first
var text = "Java Tutorial Format:Pdf With Location:Tokyo Javascript";
var match = null;
while ( (match = re.exec(text)) != null) {
alert(match[1] + " -- " + match[2]);
}
}
</script>
</head>
<body onload="test();">
</body>
</html>
A good reference for regexes is https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/RegExp
Use this snippet :
$str=" this is pavun:kumar hello world bk:systesm" ;
if ( preg_match_all ( '/(\w+\:\w+)/',$str ,$val ) )
{
print_r ( $val ) ;
}
else
{
print "Not matched \n";
}
Continuing Jaú's function with your additional requirement:
function test() {
var words = ['Format', 'Location', 'Size'],
text = "Java Tutorial Format:Pdf With Location:Tokyo Language:Javascript",
match = null;
var re = new RegExp( '(' + words.join('|') + '):(\\w+)', 'g');
while ( (match = re.exec(text)) != null) {
alert(match[1] + " = " + match[2]);
}
}
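The last modification in the question (a value that runs until the next "key:") is not covered by the function above. One way to approach it is a lazy match followed by a lookahead, sketched here in Python. The name extract_pairs is made up, and the sketch assumes that only words from the list count as keys, so any other word, even one containing a colon, becomes part of the preceding value.

```python
import re


def extract_pairs(text, keys):
    """Return (key, value) tuples; a value runs until the next listed key or the end."""
    key_alt = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(
        # key, colon, then as little text as possible ...
        rf"\b({key_alt}):(.*?)"
        # ... until whitespace plus another listed key, or the end of the text.
        rf"(?=\s+(?:{key_alt}):|\s*$)"
    )
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "Java Tutorial Format:Pdf file With Location:Tokyo Japan"
    print(extract_pairs(sample, ["Format", "Location", "Size"]))
```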
I am currently solving this problem in my Node.js app and found that the following seems suitable for colon-paired words:
([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))
It also matches quoted values, like a:"b" c:'d e' f:g
Example code in ES6:
const regex = /([\w]+:)("(([^"])*)"|'(([^'])*)'|(([^\s])*))/g;
const str = `category:"live casino" gsp:S1aik-UBnl aa:"b" c:'d e' f:g`;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
Example code in PHP:
$re = '/([\w]+:)("(([^"])*)"|\'(([^\'])*)\'|(([^\s])*))/';
$str = 'category:"live casino" gsp:S1aik-UBnl aa:"b" c:\'d e\' f:g';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
You can check/test your regex expressions using this online tool: https://regex101.com
Here's the non-regex way: in your favourite language, split on whitespace, go through the elements, check each one for ":", and print it if found. E.g. in Python:
>>> s = "Java Tutorial Format:Pdf With Location:Tokyo Javascript"
>>> for i in s.split():
...     if ":" in i:
...         print(i)
...
Format:Pdf
Location:Tokyo
You can do further checks to make sure it's really "someword:someword" by splitting again on ":" and checking that the resulting list has 2 elements, e.g.
>>> for i in s.split():
...     if ":" in i:
...         a = i.split(":")
...         if len(a) == 2:
...             print(i)
...
Format:Pdf
Location:Tokyo
([^:]+):(.+)
Meaning: (everything except : one or more times), :, (any character one or more times)
You'll find good manuals on the net... Maybe it's time for you to learn...