Regex match repetition of a character OR single character - regex

I need to escape more than one # in a line or any commas, dots.
Example:
Not valid: test##test test,test test#te,st test#t.e,st
Valid: test#test test#te#st
Next pattern does exactly what I want (it checks whether a line contains ## or , or . so the result is true/false):
/(#)\1+|[,.]/
but I don't like | sign here.
How can I fix it to use [ ] only? Or is there another way to do this?

I don't like | sign here. How can I fix it to use [ ] only?
You can't. Using | is the only way if you want to have different patterns for # and the other characters.
You can simplify your expression a bit. For example:
##+|[,.]

Related

HTML5 - Password Regex

I try to make a valid html5 pattern for a Password. It should be at least 9 characters long and contain at least one Uppercase, one lowercase, one digit and one specialcharacter of this list
()[]{}?!$%&/=*+~,.;:<>-_
I made this regex but it doesn't work... anyone can fix this?
pattern="^(?=.*\d)(?=.*[a-z])(?=.*[()[]{}?!$%&/=*+~,.;:<>-_])(?=.*[A-Z]).{9,}?$"
pattern="(?=.*\d)(?=.*[a-z])(?=.*[()\[\]{}?!$%&/=*+~,.;:<>_-])(?=.*[A-Z]).{9,}"
There are several errors:
^(?=.*\d)(?=.*[a-z])(?=.*[()[]{}?!$%&/=*+~,.;:<>-_])(?=.*[A-Z]).{9,}?$
# ] needs to be escaped ----^^ ^ ^
# otherwise it will close the character class | |
# [ too but for no logical reason | |
# the - is used to define a character range ----+ |
# the range >-_ gives >?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_ |
# there's no reason to make this quantifier non-greedy -------------+
In addition, anchors ^ and $ are implicit, you don't have to put them.
Note that using ranges, you can also write the pattern like that:
pattern="(?=.*\d)(?=.*[a-z])(?=.*[!$%&(-/:-?_{}~])(?=.*[A-Z]).{9,}"

Extracting plus (+), minus (-), and period (.) characters and all numbers [0-9] from a character string in R

I have many character strings that look like this, with a large variance of numbers:
tester1 <- "{\"fullgame\":\"-303\"}"
tester2 <- "{\"fullgame\":\"+7.5\"}"
I would like to extract plus(+), minus(-), and period(.) characters, along with all numbers[0-9] from my strings. I would also like to preserve the current order of each of these elements as they appear in the string.
I want the resulting strings to be:
formatted1 = "-303"
formatted2 = "+7.5"
I know that functions like gsub, strsplit, and regex would be ideal for this application, but for the life of me, I can't figure out the Perl syntax =(.
Any help would be greatly appreciated! Thank you guys!
Your strings look like json a lot. See the wiki page for more info. You shouldn't try to parse them with regex; rely instead on a specific json parser. There are many in R. Extracting the desired quantity is as easy as:
require(jsonlite) #other libraries: rjson, RJSONIO
fromJSON(tester1)
#$fullgame
#[1] "+7.5"
I don't know about "Perl" but this regex pattern pulls your numbers out:
> gsub( "([^-+]+)([+-]{0,1}[0-9.]+)(.+)", "\\2", c(tester1,tester2) )
[1] "-303" "+7.5"
Breaking apart the pattern which has three capture sections:
([^-+]+) : uses the negation operator in a character class to match any sequence that is
not a plus or minus sign
([+-]{0,1}[0-9.]+) : the second capture class allows (but does not require) a single
+/- sign, followed by any number of digits or decimal point/period
(.+) : is the third capture class ... anything else that is trailing
This one is a bit more specific about what form the numbers and decimal points can assume by adding optional single decimal point and subsequent digits:
gsub( "([^-+]+)([+-]{0,1}[0-9]+[.]{0,1}[0-9]*)(.+)", "\\2", c(tester1,tester2) )
I'm pretty sure that there are earlier postings that covered extraction of signed decimal numbers.
You can gsub out anything that doesn't match the type of characters you're looking for:
gsub('[^+-.0-9]', '', tester1)
# [1] "-303"
gsub('[^+-.0-9]', '', tester2)
# [1] "+7.5"
[^ ... ] defines a set of characters ... where the ^ tells it to match anything except for them. gsub then replaces all those characters with nothing, leaving you with what you want.

Why doesn't this regex pattern work?

I'm trying to select commas without numbers of 4 digits or the word "id" before, I tried with this:
( ? < ! [ \ d { 5 } | id ] ) ,
The problem
for example, if input string is "1999," that comma is not selected, I don't understand why.
Try this pattern:
(?<!\d{5}|id),
Your pattern, (?<![\d{5}|id]), is looking for a comma that is not after a digit, {, }, |, i, or d - They should not be in a charterer class: []. If anything, (?<![\d]{5}|id), will also work, but is redundant.
First of all, unless you're using the /x flag, each space will attempt to match a space. So take those out.
Second, you're using [...] presumably to group an alternation (|) but square brackets actually indicate a character class, i.e. [\d{5}|id] is equivalent to [id5{}|] and matches any one of those characters, but not more. What you mean is this:
(?<!\d{5}|id),
The final problem might be that many implementations of regex (you haven't specified which you're using) don't support variable-width lookbehind assertions. So, you may need to do something like:
(?<!\d{5}|...id),

Regular expression help - comma delimited string

I don't write many regular expressions so I'm going to need some help on the one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 1 or more spaces after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x [ "123, $a67, GGG, 767", "12333, 78787&*, GH778" ]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
... print re.match( r, s )
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : capture one or more (but not zero) of the listed ranges, and space.
(?:[...]+,)* : In non-capturing parenthesis, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Capturing zero times allows for no comma.
[...]+ : capture at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
The match of a word boundary make sure that a comma is necessary if more arguments are stacked in string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set max word size to 45, as longest word in english is 45 characters, can be changed as per requirement

Replace patterns that are inside delimiters using a regular expression call

I need to clip out all the occurances of the pattern '--' that are inside single quotes in long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language | RegEx |
+-------------+------------------------------------------+
| JavaScript | input.replace(/someregex/g, "") |
| PHP | preg_replace('/someregex/', "", input) |
| Python | re.sub(r'someregex', "", input) |
| Ruby | input.gsub(/someregex/, "") |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace('/--(?=[^\']*'([^']|'[^']*')*$)/', "", input)
Python: re.sub(r'--(?=[^\']*'([^']|'[^']*')*$)', "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print re.sub(p, r'\1-', txt)
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
( # Group 1
(?:^[^']*')? # Start of string, up till the first single quote
[^']*? # Inside the single quotes, as few characters as possible
(?:
'[^']*' # No double dashes inside theses single quotes, jump to the next.
[^']*?
)*? # as few as possible
)
(-{2,}) # The dashes themselves (Group 2)
If there where different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
return "'".join(
part.replace("--", "") if (ix&1) else part
for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdotdot.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'.
Ain't no school like old school.