X-Path + RegEx matching pattern - regex

Given the following,
<Line>
<Supplier>Fuel Surcharge - 36</Supplier>
<Supplier>Fuel Surcharge - 35</Supplier>
<Supplier>46081 46150 46250 46280 46286</Supplier>
<Supplier>Fuel Surcharge - 35451</Supplier>
<Supplier>46081</Supplier>
</Line>
The idea is to return "true" when a node contains one or more 5-digit numbers.
This is what I have done so far,
matches(./Supplier, "[^(\d{5}\s*)+]");
The regex is meant to match values made of 5-digit numbers, with or without spaces, however many times they repeat.
The result I am getting is true for every node, so something is wrong somewhere. Can you assist me with this?
Thanks.

There are three problems with your expression:
There is no semicolon at the end of an XPath expression (syntax error).
Your regex is a single negated character class: it matches any one character that is not a parenthesis, a digit, a curly bracket, whitespace, a star, or a plus.
fn:matches(xs:string?, xs:string) requires two strings as parameters; you're passing a sequence of strings for the first one.
To call a function for each node in an axis step, add it as another step (XPath 2.0 and above only). You can use the dot . (context) in the arguments.
Try something like
./Supplier/matches(., "^(\d{5}\s*)+$")
which will yield true for the third and fifth row. If the value only has to contain (rather than consist entirely of) the repeating pattern of five-digit numbers and spaces, remove the ^ and $ from the regular expression.
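As a rough cross-check outside XPath, the per-node test can be emulated in Python. This is only a sketch: it uses xml.etree.ElementTree and Python's re module rather than an XPath 2.0 engine, and the names XML, FIVE_DIGIT_GROUPS and supplier_matches are made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<Line>
<Supplier>Fuel Surcharge - 36</Supplier>
<Supplier>Fuel Surcharge - 35</Supplier>
<Supplier>46081 46150 46250 46280 46286</Supplier>
<Supplier>Fuel Surcharge - 35451</Supplier>
<Supplier>46081</Supplier>
</Line>"""

# The whole value must consist of 5-digit groups with optional whitespace.
FIVE_DIGIT_GROUPS = re.compile(r"^(\d{5}\s*)+$")


def supplier_matches(xml_text):
    """Return one boolean per Supplier node, like ./Supplier/matches(., ...)."""
    root = ET.fromstring(xml_text)
    return [
        bool(FIVE_DIGIT_GROUPS.search(node.text or ""))
        for node in root.findall("./Supplier")
    ]


print(supplier_matches(XML))  # [False, False, True, False, True]
```

The regex is applied to each node's text separately, which mirrors the point about not passing a sequence as the first argument.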

Try this one: matches(./Supplier, '(.*)((\d{5}(\s*))+)(.*)')

Related

Regex substitution does not replace match character for character

I am trying to use regex to dynamically capture all numbers in a string such as 1234-12-1234 or 1234-123-1234 without knowing how many characters each segment will contain. I have been able to capture them using a positive lookahead via the following expression: [0-9]*(?=-). However, when I try to replace the digits with X's such that each digit before the last dash becomes an X, the regex does not replace digits 1:1. Instead, each section returns exactly two X's. How can I get the regex to return the following:
1234-123-1234 -> XXXX-XXX-1234
1234-12-1234 -> XXXX-XX-1234
instead of the current
1234-123-1234 -> XX-XX-1234
?
The problem is that with the * placed directly after the digit class, a whole run of digits is matched at once and replaced with a single X. The pattern then also matches the empty string just before the dash (zero digits), which is replaced with another X. So each run of digits, whatever its length, ends up as two X's.
Use this instead:
[0-9](?=.*-)
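A small sketch of the suggested pattern in Python's re module (the function name mask_before_last_dash is invented for this example). Each digit is matched on its own, and the lookahead only requires that a dash still follows somewhere later in the string:

```python
import re


def mask_before_last_dash(text):
    # Replace every single digit that still has a dash somewhere after it.
    return re.sub(r"[0-9](?=.*-)", "X", text)


print(mask_before_last_dash("1234-123-1234"))  # XXXX-XXX-1234
print(mask_before_last_dash("1234-12-1234"))   # XXXX-XX-1234
```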

Regex - Match n last numbers without the last one

With a regex replace, I am trying to format a number like this:
The leading digit should be separated by a +, and so should the last digit. The trickier part is that any 1s in the middle part that are adjacent to a + should be removed, without touching the first and the last digit, e.g.,
011023040 -> 0+02304+0
111023920443 -> 1+02392044+3
13242311 -> 1+32423+1
I almost achieved this with the following regex:
'^([0-9]{1})([1]+)?([0-9]*)([0-9]{1})$'
And replace this with
'\1+\3+\4'
However, I have a problem with the last example, as this returns:
1+324231+1
However, the 1 before the second + should be removed.
Can anyone help me with this problem?
You have to use a non-greedy quantifier (the *? in the second group):
^([0-9])1*([0-9]*?)1*([0-9])$
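The non-greedy pattern can be tried against the three examples from the question. This is a sketch in Python; the replacement string \1+\2+\3 is an assumption, since that pattern has only three capturing groups and the answer does not spell the replacement out. The names PATTERN and format_number are hypothetical.

```python
import re

# Group 1: first digit, group 2: middle part (lazy), group 3: last digit.
# The two 1* parts swallow the 1s next to the first and last digit.
PATTERN = re.compile(r"^([0-9])1*([0-9]*?)1*([0-9])$")


def format_number(number):
    # Assumed replacement for the three-group pattern.
    return PATTERN.sub(r"\1+\2+\3", number)


print(format_number("011023040"))     # 0+02304+0
print(format_number("111023920443"))  # 1+02392044+3
print(format_number("13242311"))      # 1+32423+1
```

Because the middle group is lazy, it stops as early as possible and leaves any trailing 1s to the second 1*.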
I managed to group the numbers in the following way
^(\d)(1*)(\d+)(\d)$
by using multiline and global flags.
The replacement should look like \1+\3+\4

Need regex expression with multiple conditions

I need regex with following conditions
It should accept maximum of 5 digits then upto 3 decimal places
it can be negative
it can be zero
it can be only numbers (max. upto 5 digit place)
it can be null
I have tried the following, but it does not fulfil all the conditions:
@"^([\-\+]?)\d{0,5}(.[0-9]{1,3})?)$"
E.g. the value can range from -99999.999 to 99999.999
Use this regex:
^[-+]?\d{0,5}(\.[0-9]{1,3})?$
I only made two changes here. First, you don't need to escape any characters inside a character class normally, except for opening and closing brackets, or possibly backslash itself. Hence, we can use [-+] to capture an initial plus or minus. Second, you need to escape the dot in your regex, to tell the engine that you want to match a literal dot.
However, I would probably phrase this regex as follows:
^[-+]?\d{1,5}(\.[0-9]{1,3})?$
This will match one to five digits, optionally followed by a decimal point and one to three digits.
Note that we want to capture things like:
0.123
But not
.123
i.e. we don't want to capture a leading decimal point should it not be prefixed by at least one number.
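To see the difference the {1,5} makes, here is a minimal sketch in Python rather than C#. The re.ASCII flag is an addition of this sketch so that \d only matches 0-9, and DECIMAL and is_valid are names chosen for the example.

```python
import re

# One to five digits, then an optional group of a dot and one to three digits.
DECIMAL = re.compile(r"^[-+]?\d{1,5}(\.[0-9]{1,3})?$", re.ASCII)


def is_valid(value):
    return DECIMAL.search(value) is not None


for sample in ["0.123", ".123", "-99999.999", "123456", "1.1234", ""]:
    print(repr(sample), is_valid(sample))
```

With this version the empty string and ".123" are rejected, because at least one digit is required before the optional decimal part.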
I assume you're doing this in C# given the notation. Here's a little code you can use to test your expression, with two corrections:
You have to escape the dot, otherwise it means "any character". So, \. instead of .
There was an extraneous close parenthesis that prevented the expression from compiling
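C#:
using System;
using System.Text.RegularExpressions;
// Corrected pattern: the dot is escaped and the extraneous ")" is removed.
var expr = @"^([\-\+]?)\d{0,5}(\.[0-9]{1,3})?$";
var re = new Regex(expr);
string[] samples = {
    "",
    "0",
    "1.1",
    "1.12",
    "1.123",
    "12.3",
    "12.34",
    "12.345",
    "123.4",
    "12345.123",
    ".1",
    ".1234"
};
foreach (var s in samples) {
    // PASS means the sample matches the pattern.
    Console.WriteLine("Testing [{0}]: {1}", s, re.IsMatch(s) ? "PASS" : "FAIL");
}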
Results:
Testing []: PASS
Testing [0]: PASS
Testing [1.1]: PASS
Testing [1.12]: PASS
Testing [1.123]: PASS
Testing [12.3]: PASS
Testing [12.34]: PASS
Testing [12.345]: PASS
Testing [123.4]: PASS
Testing [12345.123]: PASS
Testing [.1]: PASS
Testing [.1234]: FAIL
It should accept maximum of 5 digits
[0-9]{1,5}
then upto 3 decimal places
[0-9]{1,5}(\.[0-9]{1,3})?
it can be negative
[-]?[0-9]{1,5}(\.[0-9]{1,3})?
it can be zero
Already covered.
it can be only numbers (max. upto 5 digit place)
Already covered. 'Up to 5 digit place' contradicts your first rule, which allows 5.3.
it can be null
Not covered. I strongly suggest you remove this requirement. Even if you mean 'empty', as I sincerely hope you do, you should detect that case separately and beforehand, as you will certainly have to handle it differently.
Your regular expression contains ^ and $. I don't know why. There is nothing about start of line or end of line in the rules you specified. It also allows a leading +, which again isn't specified in your rules.
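One way to follow the advice about treating the empty case separately, sketched in Python. The function name classify and its three return labels are invented for this illustration, and re.fullmatch is used here on the assumption that the whole input must be a number:

```python
import re

# Pattern built up step by step above: optional minus, 1-5 digits,
# optional decimal part with 1-3 digits.
NUMBER = re.compile(r"-?[0-9]{1,5}(\.[0-9]{1,3})?")


def classify(value):
    # Detect "null"/empty beforehand instead of encoding it in the regex.
    if value is None or value == "":
        return "empty"
    return "valid" if NUMBER.fullmatch(value) else "invalid"


for sample in [None, "", "0", "-12.5", "+1", "123456"]:
    print(repr(sample), classify(sample))
```

Keeping the empty check outside the pattern lets the caller handle a missing value differently from a malformed one.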

Regular Expression for the Pattern?

I'm required to write a regular expression that has the following rules:
Digits between 1 to 4
hyphen (only one and can occur at any position)
Length of Text must be less than or equal to 6 (including the potential hyphen)
May end with a letter or a number, but not a hyphen.
Some valid examples are:
1-3411
12-413
123-2A
11-1
These examples are invalid:
12--11 ( since it contains two hyphens)
1-2345 ( since it contains number 5)
11-2311 ( since length is more than 6)
The RegEx that I wrote is:
^[1-4]-[1-4]{4}|^[1-4]{2}-[1-4]{3}|^[1-4]{3}-[1-4]{2}|^[1-4]{4}-[1-4]
However, this does not seem to be working, and it doesn't handle the case of a single letter being present at the end.
Can someone please help me determine a way of handling this?
If a letter occurs in the last position, the character before it must be a digit, not a hyphen,
i.e. 11-a (must fail)
11-1a (must pass)
^(?!(?:[^-\n]*-){2})(?:[1-4-]{1,5}[1-4]|[1-4-]{1,5}[a-zA-Z])$
You can handle that using a lookahead. See demo.
https://regex101.com/r/tS1hW2/16
If you have such a complex requirement, it is often easier to use lookarounds to form an AND pattern that checks every condition at the same time. Sometimes you need to split ONE condition into two:
Base match: 6 or fewer characters: ^.{1,6}$
(AND) Only 1-4, hyphen and letters: ^[1-4a-z\-]+$ (not accurate, requires next line)
(AND) No letter in the first 1...5 characters: ^[1-4\-]{1,5}[1-4a-z]$
(AND) No double hyphen and no hyphen at the end: ^[^-]*-[^-]+$
Putting all together leads to:
(?=^[1-4\-]{1,5}[1-4a-z]$)(?=^[^-]*-[^-]*$)(?=^[1-4a-z\-]+$)^.{1,6}$
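The combined lookahead pattern can be checked against the examples listed in the question. A sketch in Python follows; the re.IGNORECASE flag is an assumption added here because the pattern only lists a-z while the sample 123-2A ends in an uppercase letter, and the names COMBINED and is_valid are made up.

```python
import re

# Each lookahead checks one condition against the whole string;
# the final ^.{1,6}$ enforces the length limit.
COMBINED = re.compile(
    r"(?=^[1-4\-]{1,5}[1-4a-z]$)(?=^[^-]*-[^-]*$)(?=^[1-4a-z\-]+$)^.{1,6}$",
    re.IGNORECASE,  # assumption: letters may be upper or lower case
)


def is_valid(text):
    return COMBINED.search(text) is not None


for sample in ["1-3411", "12-413", "123-2A", "11-1", "12--11", "1-2345", "11-2311"]:
    print(sample, is_valid(sample))
```

The first four samples are accepted and the last three are rejected, matching the valid and invalid lists in the question.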

R digit-expression and unlist doesn't work

So I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.
I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))
When I run this command, "yend_clean" is simply set to "character (empty)".
If I remove the "4$", I get all of the dates split into atoms, so that the list that originally looked like this "1992", "2003" now looks like this "1", "9" etc.
So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.
Was hoping someone in here could point me in the right direction.
This is a regular expression question. Your regular expression is wrong. Use:
unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))
or equivalently
sub("^(\\d{4}).*", "\\1", "2003-")
or, if all you really want is to remove the "-"
sub("-", "", "2003-")
Repetition in regular expressions is controlled by a quantifier such as {4}. You were missing the braces. Additionally, $ matches the end of the string, so your expression translates as:
match any single digit, followed by a 4, followed by the end of the string
When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).
The pattern I propose says instead:
match the beginning of the string (^), followed by a digit repeated four times.
The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
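The same three readings of the pattern can be reproduced outside R. The sketch below uses Python's re module purely as an illustration; [0-9] stands in for [[:digit:]], and the variable names are invented.

```python
import re

value = "2003-"

# "[[:digit:]]4$": one digit, then a literal 4, then end of string.
misread = re.findall(r"[0-9]4$", value)

# "^[[:digit:]]{4}": four digits at the start of the string.
first_four = re.findall(r"^[0-9]{4}", value)

# Keep only the parenthesised part of the match.
kept = re.sub(r"^(\d{4}).*", r"\1", value)

# Remove the first "-" only, like sub() in R.
no_dash = re.sub("-", "", value, count=1)

print(misread, first_four, kept, no_dash)  # [] ['2003'] 2003 2003
```

The empty list for the first pattern corresponds to the "character (empty)" result described in the question.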
If you mean the book Automated Data Collection with R, the code could be like this:
yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]{4}[-]$"))
yend_clean <- unlist(str_extract_all(yend_clean, "^[[:digit:]]{4}"))
This assumes that you have a string, "1993–2007, 2010-", and you want to get the last given year, which is "2010". The first line, which means four digits followed by a dash at the end of the string, returns "2010-", and the second line returns "2010".
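The two-step extraction can be traced with the sample string. This sketch again uses Python's re module as a stand-in for str_extract_all, and the names text, open_ended and years are hypothetical.

```python
import re

text = "1993–2007, 2010-"

# Step 1: four digits followed by a dash at the very end of the string.
open_ended = re.findall(r"[0-9]{4}-$", text)

# Step 2: keep only the four leading digits of each hit.
years = [re.match(r"[0-9]{4}", item).group() for item in open_ended]

print(open_ended, years)  # ['2010-'] ['2010']
```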