Tcl regular expressions - regex

set d(aa1) 1
set d(aa2) 1
set d(aa3) 1
set d(aa4) 1
set d(aa5) 1
set d(aa6) 1
set d(aa7) 1
set d(aa8) 1
set d(aa9) 1
set d(aa10) 1
set d(aa11) 1
set regexp "a*\[1-9\]"
set res [array names d -glob $regexp]
puts "res = $res"
In this case, the result is:
res = aa11 aa6 aa2 aa7 aa3 aa8 aa4 aa9 aa5 aa1
But when I change the regexp from a*\[1-9\] to a*\[1-10\], the result becomes:
res = aa11 aa10 aa1

You have an error in your character class.
[1-10] does not mean a digit from 1 to 10
It means 1-1, which is a character ranging from 1 to 1 (i.e., simply a 1), or a 0. This explains your output.
to express a digit from 1 to 10, use this: (?:10?|[2-9]) (as one of several ways to do it.
therefore your regex becomes a*(?:10?|[2-9])
note that if your engine does not allow non-capturing group, you need to remove the ?:, for: a*(?:10?|[2-9])

You need to be sure what you're trying to match because glob style matching and regexp style matching are different in many aspects.
From the docs, glob has the following:
* matches any sequence of characters in string, including a null string.
? matches any single character in string.
[chars] matches any character in the set given by chars. If a sequence of the form x-y appears in chars, then any character between x and y, inclusive, will match. When used with -nocase, the end points of the range are converted to lower case first. Whereas {[A-z]} matches _ when matching case-sensitively (since _ falls between the Z and a), with -nocase this is considered like {[A-Za-z]} (and probably what was meant in the first place).
\x matches the single character x. This provides a way of avoiding the special interpretation of the characters *?[]\ in pattern.
Since you are using glob style matching, your current expression (a*\[1-9\]) matches an a, followed by any characters and any one of 1 through 9 (meaning it would also match something like abcjdne1).
If you want to match at least one a followed by numbers from 1 through 10, you will need something like this, using the -regexp mode:
set regexp {a+(?:[1-9]|10)}
set res [array names d -regexp $regexp]
Now, this regexp is I believe the more natural one for a beginner ((?:[1-9]|10) meaning either 1 through 9, or 10, but you can use the form that zx81 suggested with (?:10?|[2-9]) meaning 1, with an optional 0 for 10, or 2 through 9).
+ means that a must appear at least once for the array name to match.
If you now need to match the full names, you will need to use anchors:
^a+(?:[1-9]|10)$
Note: You cannot use glob matching if you want to match at least one a followed by digits, and alternation (the pipe used |) and quantifiers (? or + or *) the way they behave in regexp are not supported by glob matching.
One last thing, use braces to avoid escaping your pattern (unless you have a variable or running a function in your pattern and can't do otherwise).

Related

Regex split string by two consecutive pipe ||

I want to split below string by two pipe(|| ) regex .
Input String
value1=data1||value2=da|ta2||value3=test&user01|
Expected Output
value1=data1
value2=da|ta2
value3=test&user01|
I tried ([^||]+) but its consider single pipe | also to split .
Try out my example - Regex
value2 has single pipe it should not be considered as matching.
I am using lua script like
for pair in string.gmatch(params, "([^||]+)") do
print(pair)
end
You can explicitly find each ||.
$ cat foo.lua
s = 'value1=data1||value2=da|ta2||value3=test&user01|'
offset = 1
for idx in string.gmatch(s, '()||') do
print(string.sub(s, offset, idx - 1) )
offset = idx + 2
end
-- Deal with the part after the right-most `||`.
-- Must +1 or it'll fail to handle s like "a=b||".
if offset <= #s + 1 then
print(string.sub(s, offset) )
end
$ lua foo.lua
value1=data1
value2=da|ta2
value3=test&user01|
Regarding ()|| see Lua's doc about Patterns (Lua does not have regex support) —
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture, and therefore has number 1; the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
the easiest way is to replace the sequence of 2 characters || with any other character (e.g. ;) that will not be used in the data, and only then use it as a separator:
local params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in string.gmatch(params:gsub('||',';'), "([^;]+)") do
print(pair)
end
if all characters are possible, then any non-printable characters can be used, according to their codes: string.char("10") == "\10" == "\n"
even with code 1: "\1"
string.gmatch( params:gsub('||','\1'), "([^\1]+)" )

Regex Erasing all except numbers with limited digits

What I want to do is erase everything except \d{4,7} only by replacing.
Any ideas to get this?
ex)
G-A15239L → 15239
(G-A and L should be selected and replaced by empty strings)
now200316stillcovid19asdf → 200316
(now and stillcovid19asdf should be selected and replaced by empty strings)
Also, replacing text is not limited as empty string.
substitutions such as $1 are possible too.
Using Regex in 'Kustom' apps. (including KLCK, KLWP, KWGT)
I don't know which engine it's using because there are no information about it
You may use
(\d{4,7})?.?
Or
(\d{4,7})|.
and replace with $1. See the regex demo.
Details
(\d{4,7})? - an optional (due to ? at the end - if it is missing, then the group is obligatory) capturing group matching 1 or 0 occurrences of 4 to 7 digits
| - or
.? - any one char other than line break chars, 1 or 0 times when ? is right after it.
So, any match of 4 to 7 digits is kept (since $1 refers to the Group 1 value) and if there is a char after it, it is removed.
It looks as if the regex is Java based since all non-matching groups are replaced with null:
So, the only possible solution is to use a second pass to post-process the results, just replace null with some kind of a delimiter, a newline for example.
Search: .*?(\d{4,7})[^\d]+|.*
Replace: $1
in for instance Notepad++ 6.0 or better (which comes with built-in PCRE support) works with your examples:
jalsdkfilwsehf
now200316stillcovid19asdf
G-A15239L
becomes:
200316
15239

Regular expression for checking every substring

I had this question in my end semester exam, unfortunately I couldn't solve it and tried it for several days and no luck.
Condition- For every substring of length 4 in a string of length n, write a RE to force the rule- there must be exactly three 1's.
My solution kind of looked like
(1+11+111+€)(0111)*(0+€).
But this is obviously wrong, string 11011 is a valid solution too.
Update- my new solution is (1+11+111+€)(0111)*(0+01+011+€).
Update- the plus operator is actually 'OR'
Update- € is empty string
Update - the string length has no requirements. A string of length 5 will have 2 substrings of length 4, the first 4 chars and the last 4 chars
I think the professor is looking for the realization that in order for the condition to hold, a 1 may never be bracketed by zeros on both sides.
(0?11)*
To complete the picture, we also need to include the case for n=1 where either a 0 or a 1 is allowed (I presume).
^[01]$|(0?11)*
I'm using traditional regex here, where [...] denotes a character class, | is "or", parentheses do grouping, and * specifies zero or more repetitions (Kleene star).
Edit: Consider the following regular expression where € refers to the empty string and assuming the alphabet consists of only {0,1,€}:
€ + (0111)* + (1110)* + (1101)* + (1011)*
Python 2
import re
# regular expression
# '^' in the start of the expression means from the very start of the string
# "[^1.]*" means match any character unless it is '1'
# '$' in the end of the expression means till the very end of the string
# Summary:
# from the start of the string till the end of it check if we have
# '1' exactly 3 times,
# and before or after them you might find any type of characters
# and these characters must not be equal '1' ...
expression = '^[^1.]*1[^1.]*1[^1.]*1[^1.]*$'
# testing strings
tests = ["01110", "11100", "00111", "010101", "00100100111", "00100"]
prog = re.compile(expression)
for test in tests:
print 'matched' if prog.match(test) else 'not matched'

Regular Expression for the Pattern?

I'm required to write a regular expression that has the following rules:
Digits between 1 to 4
hyphen (only one and can occur at any position)
Length of Text must be less than or equal to 6 (including the potential hyphen)
May end with a letter or a number, but not a hyphen.
Some valid examples are:
1-3411
12-413
123-2A
11-1
These examples are invalid:
12--11 ( since it contains two hyphens)
1-2345 ( since it contains number 5)
11-2311 ( since length is more than 6)
The RegEx that I wrote is:
^[1-4]-[1-4]{4}|^[1-4]{2}-[1-4]{3}|^[1-4]{3}-[1-4]{2}|^[1-4]{4}-[1-4]
However, this does not seem to be working, and it doesn't handle the case of a single character being is present in the end.
Can some some please help me determine a way of handling this?
<>
is character occurs in last position then before character we must have a digit not hypen .
i.e 11-a ( must fail)
11-1a (must pass)
^(?!(?:[^-\n]*-){2})(?:[1-4-]{1,5}[1-4]|[1-4-]{1,5}[a-zA-Z])$
You can handle that using a lookahead.See demo.
https://regex101.com/r/tS1hW2/16
If you have such a complex requirement, it is always easy to use lookarrounds to form an and-pattern matching each condition at the same time. Sometimes you need to split up ONE condition into two:
Base-Match: 6 or less digits: ^.{1,6}$
(AND) Only 1-4 and hyphen and letter: ^[1-4a-z\-]+$ (not accurate, requires next line)
(AND) First 1...5 elements NO Letter: ^[1-4\-]{1,5}[1-4a-z]$
(AND) No double hypen and not at the end: ^[^-]*-[^-]+$
Putting all together leads to:
(?=^[1-4\-]{1,5}[1-4a-z]$)(?=^[^-]*-[^-]*$)(?=^[1-4a-z\-]+$)^.{1,6}$
Debuggex Demo

Regex for Regex validation decimal[19,3]

I want to validate a decimal number (decimal[19,3]). I used this
#"[\d]{1,16}|[\d]{1,16}[\.]\d{1,3}"
but it didn't work.
Below are valid values:
1234567890123456.123
1234567890123456.12
1234567890123456.1
1234567890123456
1234567
0.0
.1
Simplification:
The \d doesn't have to be in []. Use [] only when you want to check whether a character is one of multiple characters or character classes.
. doesn't need to be escaped inside [] - [\.] appears to just allow ., but allowing \ to appear in the string in the place of the . may be a language dependent possibility (?). Or you can just take it out of the [] and keep it escaped.
So we get to:
\d{1,16}|\d{1,16}\.\d{1,3}
(which can be shortened using the optional / "once or not at all" quantifier (?)
to \d{1,16}(\.\d{1,3})?)
Corrections:
You probably want to make the second \d{1,16} optional, or equivalently simply make it \d{0,16}, so something like .1 is allowed:
\d{1,16}|\d{0,16}\.\d{1,3}
If something like 1. should also be allowed, you'll need to add an optional . to the first part:
\d{1,16}\.?|\d{0,16}\.\d{1,3}
Edit: I was under the impression [\d] matches \ or d, but it actually matches the character class \d (corrected above).
This would match your 3 scenarios
^(\d{1,16}|(\d{0,16}\.)?\d{1,3})$
first part: a 0 to 16 digit number
second: a 0 to 16 digit number with 1 to 3 decimals
third: nothing before a dot and then 1 to 3 decimals
the ^ and $ are anchorpoints that match start of line and end of line, so if you need to search for numbers inside lines of text, your should remove those.
Testdata:
Usage in C#
string resultString = null;
try {
resultString = Regex.Match(subjectString, #"\d{1,16}\.?|\d{0,16}\.\d{1,3}").Value;
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
Slight optimization
A bit more complicated regex, but a bit more correct would be to have the ?: notation in the "inner" group, if you are not using it, to make that a non-capture group, like this:
^(\d{1,16}|(?:\d{0,16}\.)?\d{1,3})$
Following Regex will help you out -
#"^(\d{1,16}(\.\d{1,3})?|\.\d{1,3})$"
Try something like that
(\d{0,16}\.\d{0,3})|(\d{0,16})
It work with all your examples.
edit. new version ;)
You can try:
^\d{0,16}(?:\.|$)(?:\d{0,3}|)$
match 0 to 16 digits
then match a dot or end of string
and then match 3 more digits