This is my string
'SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;'
I need output like 'SEPA1,SEPA2,SEPA3' using a regular expression.
SELECT REGEXP_REPLACE ('SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;',
'([^,]+)(\1)+', '')
FROM dual;
This query is not working: it leaves the input string unchanged. Also, I am looking for a regular expression solution (in particular, no use of a CONNECT BY LEVEL query to split the string into pieces).
MT0 has already provided the correct solution (most likely, but see the discussion of commas - perhaps escaped - within token values). Let me explain here what is happening in your attempt - you may find this helpful.
[^,]+ in the search pattern means one or more non-comma characters. This part is probably OK, but it raises two questions.
Can the input string contain substrings like 'SEPA6,;'? This would be how a "row in a table" (presented as a single string, where "rows" are terminated by semicolon and within each row, values are separated by comma), where the "date" is null. So - the question is, can there be null dates in your string, which would be represented by ,; with nothing between the comma and the semicolon? If that is possible, you would need to change the + quantifier to *, to allow zero or more non-comma characters before the semicolon.
Can there ever be a comma, a few characters, another comma, a few more characters, and then a semicolon? Presumably not in the "date" portion of each token; but where you show SEPA1 etc., whatever they mean, could there be a comma in the name (probably escaped, something like SE","TG)? In that case, you really want something like what you did, with the negated character class. The Answer posted by MT0 will delete everything from the FIRST comma (even if it's in the middle of the "name") to the semicolon.
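To make the first question concrete, here is a small sketch of how the + and * quantifiers behave on an empty date. It uses Python's re module as a stand-in for Oracle's REGEXP_REPLACE, and the input containing 'SEPA6,;' is a hypothetical example, not data from the question.

```python
import re

# Hypothetical input: the second "row" has an empty (null) date.
s = 'SEPA1,30-NOV-17;SEPA6,;'

# With +, at least one non-comma must precede the semicolon,
# so the lone ';' after 'SEPA6,' is not matched and stays in the output.
with_plus = re.sub(r'[^,]+;', '', s)

# With *, zero non-commas are allowed, so the lone ';' is removed too.
with_star = re.sub(r'[^,]*;', '', s)

print(with_plus)  # SEPA1,SEPA6,;
print(with_star)  # SEPA1,SEPA6,
```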
Then, in your attempt you use a backreference, (\1), in the search pattern. There is no reason for that; you want to match non-commas followed by a semicolon, so that's what you must write in the search pattern. Nothing in your input repeats the substring of non-commas found by the first part of the pattern, which is why the pattern never matches and the string comes back unchanged.
Replacing something with null is the default for regexp_replace, so you may - optionally - leave out the last argument - the '' in your attempt.
So, your solution can be rewritten like so:
... regexp_replace(input_string, '[^,]*;')
(I left out the last argument, which was '' in your attempt - that is the default third argument anyway; but you may prefer to show it for clarity. OK either way.)
This will leave a comma at the END of the output string. I asked you a question in the Comments - it is not clear why you are changing from a terminator (the semicolon in the inputs) to a separator (the comma in the output); normally the delimiter should be of the same kind, either terminator in both input and output, or separator in both. (It is also odd that you are changing from semicolon to comma as the primary delimiter, but you must have your reasons.) In any case, that's why MT0 needed to wrap the return string from the regexp replace operation within a call to trim(), to remove the trailing comma.
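The behaviour described above can be checked with a minimal sketch. It uses Python's re module rather than Oracle, on the assumption that this simple pattern behaves the same way in both engines; str.rstrip stands in for TRIM(TRAILING ',' FROM ...).

```python
import re

# Sample input from the question.
s = 'SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;'

# Delete each run of non-commas that ends in a semicolon (the "date" part).
stripped = re.sub(r'[^,]*;', '', s)
print(stripped)  # SEPA1,SEPA2,SEPA3,

# Remove the trailing comma left behind by the last token.
result = stripped.rstrip(',')
print(result)    # SEPA1,SEPA2,SEPA3
```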
A note about efficiency:
If you can have commas (perhaps escaped) within the "values" in your input string, the solution will have to be more complicated to handle all the possibilities. If commas are not possible in the "date" portion that you must eliminate from your input, but they are possible elsewhere, then the solution you were trying (which I fixed for you a little earlier in this Answer) will produce the required result; MT0's Answer will not, since it will start at the first comma after a semicolon, regardless of where it is.
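A short sketch of that difference, again in Python's re as a stand-in for Oracle. The input, where the first name contains an unescaped comma ('SE,TG'), is a hypothetical example chosen only to show where the two patterns diverge.

```python
import re

# Hypothetical input: the first "name" itself contains a comma.
s = 'SE,TG,30-NOV-17;SEPA2,30-NOV-17;'

# Negated character class: only the text after the LAST comma of each row is removed.
negated = re.sub(r'[^,]*;', '', s).rstrip(',')

# Lazy pattern: everything from the FIRST comma of each row is removed.
lazy = re.sub(r',.*?;', ',', s).rstrip(',')

print(negated)  # SE,TG,SEPA2
print(lazy)     # SE,SEPA2
```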
However, if there are no commas anywhere except as true delimiters, then MT0's solution will be correct, and much faster than replacing [^,]*;. Regular expressions are (very) slow by nature, and writing them efficiently is exceptionally important. The difference between the solutions seems minor, so let's see what it is.
When you search for '[^,]*;' the regexp engine will try to find a match from the first character. It's not a comma... it reads the second character in, the third, ... and then it finds a comma before it finds a semicolon. So the pattern can't be matched. Then the engine tries to find a match from the second character, which also fails when the first comma is encountered. Etc. This will take a lot of time.
If you search for ',.*?;', the engine starts at the first character in the input string. That character is not a comma, so there is no match there and the engine can move on to the second character immediately. The second character is not a comma either, so it cannot match the first character of the pattern, and the engine moves on again. These conclusions are drawn much faster, so the actual matches are found much faster too. MT0's solution differs from yours by a leading comma - that helps the regexp engine a lot.
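If you want to measure this rather than take it on faith, a rough timing harness could look like the sketch below. It runs in Python, whose regex engine is not Oracle's, so the numbers only illustrate how to compare the two patterns and say nothing definite about Oracle. The long input string and the repetition counts are arbitrary assumptions.

```python
import re
import timeit

# Hypothetical long input: the same "row" repeated many times.
s = 'SEPA1,30-NOV-17;' * 2000

negated = re.compile(r'[^,]*;')   # pattern without a leading literal
anchored = re.compile(r',.*?;')   # pattern that starts with a literal comma

# Both approaches should produce the same text on comma-free values.
out_negated = negated.sub('', s)
out_anchored = anchored.sub(',', s)

t_negated = timeit.timeit(lambda: negated.sub('', s), number=20)
t_anchored = timeit.timeit(lambda: anchored.sub(',', s), number=20)

print(f'[^,]*;  {t_negated:.4f} s')
print(f',.*?;   {t_anchored:.4f} s')
```

To check the claim for your own case, run the equivalent comparison in the database itself, on data of realistic size.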
Use the regular expression ,.*?; to find each comma and then the minimum amount of characters until the next semi-colon to match the portion of the string you want to replace:
SELECT TRIM(
         TRAILING ',' FROM
         REGEXP_REPLACE(
           'SEPA1,30-NOV-17;SEPA2,30-NOV-17;SEPA3,30-NOV-17;',
           ',.*?;',
           ','
         )
       ) AS sepas
FROM   DUAL
Output:
SEPAS
-----------------
SEPA1,SEPA2,SEPA3
I want to capture the third comma in strings like:
98,52,"110,18479456000019"
I thought of something like a character except:
[^"0123456789]
But, result was the capture of all commas.
After that, I've tried some regex about nth capture - seems to be a solution -, but none works.
How do I solve this problem?
There are several ways to capture the third ,. This RegEx is one way to do so:
([\d,])\x22\d+(,)\d+\x22
where your desired , is in the second group (,), just to be simple, and you can call it using $2.
I have added additional boundaries to this RegEx for safety, which you can remove:
\x22 is just ", which you can replace, if you wish:
([\d,])"\d+(,)\d+"
You can also use a backslash (\) to escape a character where necessary.
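As a quick check of the expression above, here is a sketch in Python (the question does not say which language is in use, so this is an assumption). In Python the second group is read with m.group(2) and its position with m.start(2), rather than $2. Replacing the comma with a dot is only a hypothetical use of the captured position.

```python
import re

s = '98,52,"110,18479456000019"'

# Group 2 is the comma that sits between the digits inside the quotes.
m = re.search(r'([\d,])"\d+(,)\d+"', s)

third_comma = m.group(2)
position = m.start(2)
print(third_comma, position)  # , 10

# Hypothetical use: replace only that comma, leaving the first two alone.
fixed = s[:position] + '.' + s[position + 1:]
print(fixed)  # 98,52,"110.18479456000019"
```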
If your input would be a bit more complex, maybe such as this:
you might create a middle boundary before the third , and add all possible chars in the middle boundary ([\d\w\"]+), such as this RegEx:
(\d+,){2}[\d\w\"]+(,)
and capture the third , using $2. This time you can also relax your expression from the right side, and it would still work.
You might also add a start ^ in the regex:
^(\d+,){2}[\d\w\"]+(,)
as an additional left boundary which means your input must start with this expression.
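A sketch of the relaxed, anchored variant in Python (an assumed language). The second input, with letters inside the quotes, is a hypothetical example made up here to show a case the stricter expression would reject.

```python
import re

# Two leading numeric fields, then any run of word characters or quotes,
# then the comma we want (group 2).
relaxed = re.compile(r'^(\d+,){2}[\d\w"]+(,)')
strict = re.compile(r'([\d,])"\d+(,)\d+"')

original = '98,52,"110,18479456000019"'
mixed = '98,52,"ab110,184"'  # hypothetical, letters after the opening quote

pos_original = relaxed.search(original).start(2)
pos_mixed = relaxed.search(mixed).start(2)
strict_on_mixed = strict.search(mixed)

print(pos_original)     # 10
print(pos_mixed)        # 12
print(strict_on_mixed)  # None
```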
I need to write a regular expression for form validation that allows spaces within a string, but doesn't allow only white space.
For example - 'Chicago Heights, IL' would be valid, but if a user just hit the space bar any number of times and hit enter the form would not validate. Preceding the validation, I've tried running an if (foo != null) then run the regex, but hitting the space bar still registers characters, so that wasn't working. Here is what I'm using right now which allows the spaces:
^[-a-zA-Z0-9_:,.' ']{1,100}$
It's very simple: .*\S.*
This requires one non-space character, at any place. The regular expression syntax is for Perl 5 compatible regular expressions; if you use another language, the syntax may differ a bit.
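A minimal sketch of this check in Python, whose re syntax is close to Perl's for this pattern. The helper name is_valid is made up for the example.

```python
import re

HAS_NON_SPACE = re.compile(r'.*\S.*')

def is_valid(text: str) -> bool:
    """Return True if the whole input contains at least one non-space character."""
    return HAS_NON_SPACE.fullmatch(text) is not None

print(is_valid('Chicago Heights, IL'))  # True
print(is_valid('     '))                # False
print(is_valid(''))                     # False
```

Note that this only enforces the "not blank" rule; it places no restriction on which characters are allowed.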
The following will answer your question as written, but see my additional note afterward:
^(?!\s*$)[-a-zA-Z0-9_:,.' ']{1,100}$
Explanation: The (?!\s*$) is a negative lookahead. It means: "The following characters cannot match the subpattern \s*$." When you take the subpattern into account, it means: "The following characters can neither be an empty string, nor a string of whitespace all the way to the end. Therefore, there must be at least one non-whitespace character after this point in the string." Once you have that rule out of the way, you're free to allow spaces in your character class.
Extra note: I don't think your ' ' is doing what you intend. It looks like you were trying to represent a space character, but regex interprets ' as a literal apostrophe. Inside a character class, ' ' would mean "match any character that is either ', a space character, or '" (notice that the second ' character is redundant). I suspect what you want is more like this:
^(?!\s*$)[-a-zA-Z0-9_:,.\s]{1,100}$
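Here is a sketch of the combined pattern in Python (an assumed language; the lookahead syntax is the same in most Perl-style engines). The helper name is_valid_city and the test strings are made up for the example. One thing the sketch makes visible is that this version of the character class no longer contains the apostrophe, so add it back if names like O'Hare should pass.

```python
import re

PATTERN = re.compile(r"^(?!\s*$)[-a-zA-Z0-9_:,.\s]{1,100}$")

def is_valid_city(text: str) -> bool:
    """Allowed characters only, 1 to 100 long, and not blank."""
    return PATTERN.search(text) is not None

print(is_valid_city('Chicago Heights, IL'))  # True
print(is_valid_city('     '))                # False: rejected by the lookahead
print(is_valid_city('a' * 101))              # False: longer than 100 characters
print(is_valid_city("O'Hare"))               # False: apostrophe is not in the class
```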
You could use simple:
^(?=.*\S).+$
if your regex engine supports positive lookaheads. This expression requires at least one non-space character.
See it on rubular.
If you want to validate against an allowed character set only, I tried USERNAME_REGEX = /^(?:\s*[.\-_]*[a-zA-Z0-9]{1,}[.\-_]*\s*)$/;
The string may have any number of spaces at the beginning or end, but must contain at least one alphanumeric character. (As written, the pattern does not accept spaces between alphanumeric characters, because it allows only a single alphanumeric run.)
Optional ., _ and - characters are also allowed around that run, but the string must still have one alphanumeric character.
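The behaviour of this pattern is easy to probe with a few inputs. The sketch below ports the JavaScript literal to Python's re, on the assumption that the two engines treat this pattern the same way; the sample usernames are made up.

```python
import re

USERNAME_REGEX = re.compile(r'^(?:\s*[.\-_]*[a-zA-Z0-9]{1,}[.\-_]*\s*)$')

def matches(text: str) -> bool:
    return USERNAME_REGEX.search(text) is not None

print(matches('  john_  '))  # True: spaces at both ends, one alphanumeric run
print(matches('   '))        # False: no alphanumeric character
print(matches('_.-'))        # False: punctuation only
print(matches('john doe'))   # False: a space between two alphanumeric runs
```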
Try this regular expression:
^[^\s]+(\s.*)?$
It means one or more characters that are not space, then, optionally, a space followed by anything.
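A small sketch of what this pattern accepts, in Python (an assumed language). It also shows a side effect worth knowing: because the string must start with a non-space character, input with leading spaces is rejected even when it contains text.

```python
import re

PATTERN = re.compile(r'^[^\s]+(\s.*)?$')

def accepts(text: str) -> bool:
    return PATTERN.search(text) is not None

print(accepts('Chicago Heights, IL'))  # True
print(accepts('Chicago '))             # True: trailing space is fine
print(accepts('     '))                # False: only spaces
print(accepts(' Chicago'))             # False: leading space
```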
Just use \s* in the regular expression to allow for zero or more blank spaces between two words.
For example, "Mozilla/ 4.75" and "Mozilla/4.75" both can be matched by the following regular expression:
[A-Z][a-z]*/\s*[0-9]\.[0-9]{1,2}
Adding \s* matches zero or more whitespace characters between the two words.
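The same example as a runnable sketch in Python (an assumed language). The third input, without a slash, is added here only to show a string that does not match.

```python
import re

UA = re.compile(r'[A-Z][a-z]*/\s*[0-9]\.[0-9]{1,2}')

with_space = UA.search('Mozilla/ 4.75')
without_space = UA.search('Mozilla/4.75')
no_slash = UA.search('Mozilla 4.75')

print(with_space.group(0))     # Mozilla/ 4.75
print(without_space.group(0))  # Mozilla/4.75
print(no_slash)                # None
```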
I need some help with a regex. I have a query that should be split at every OR operator. But if the OR is inside quotes, it should not be split.
Example:
This is the query:
"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"
Expression 1: I need everything between the OR-operators or start/end of line... (This is not working)
(^|OR).*?(OR|$)
Expression 2: ...except for the ORs between quotes:
"(.*?)"
The result should be:
"test1"
"test2.1 OR test2.2"
test3
test4:"test4.1 OR test4.2"
How can I make the first expression work and how can I combine these both expressions?
Thank you for help!
It's unclear what the grammar of your expression is, so I made a number of assumptions and came up with this regex to match the tokens between OR:
\G(\w+(?::"[^"]*")?|"[^"]*")(?:(\s+OR\s+)|\s*$)
Demo at regex101
I assume that between OR, it can be an identifier \w+, an identifier with some string \w+:"[^"]*", or a string literal "[^"]*".
Feel free to substitute your own definition of string literal - I'm using the simplest (and broken) specification "[^"]*" as example.
In every match, the regex starts from where the last match left off (or the beginning of the string) and matches one token (as described above), followed by OR or the end of the input string.
The capturing group (\s+OR\s+) is deliberate - you will need it to check whether the last match actually terminates at the end of the string, or whether the input is malformed.
Caveat
Do note that while my solution produces the expected result for this case, without a full specification of the grammar of the expression, it's not possible to cater for all possible cases you may want to handle.
(?:^|OR(?=(?:[^"]*"[^"]*")*+[^"]*$))([\s\S]*?)(?=OR(?=(?:[^"]*"[^"]*")*+[^"]*$)|$)
You can use this and capture the groups. See demo.
https://regex101.com/r/xC4rJ3/12
Try to match everything in quotes or not-OR with:
(?:"[^"]+"|\b(?:(?!\bOR\b)[^"])+)+
This regex works for your example (though it may be subject to improvement with a more detailed specification):
(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*
(?<!\S) ensures the match starts at the beginning of the string or after a whitespace character.
(?!OR\s) prevents it from matching OR
[^\s"]*(?:"[^"]*"[^\s"]*)* matches a contiguous series of, in any order:
sequences of non-whitespace, non-quote characters, or
a pair of quotes enclosing anything except quotes.
However, I notice that all the tokens in your example consist of:
a non-quote, non-whitespace sequence (NQ),
a quoted sequence (Q), or
an NQ followed immediately by a Q.
If you expect all tokens to match that pattern, you can change the regex to this:
(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)
According to Regex101, it's slightly more efficient (but probably not enough to matter).
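Both expressions from this answer can be tried on the query from the question. The sketch below uses Python's re.findall, on the assumption that your engine handles the lookbehind and lookahead the same way; the variable names are made up.

```python
import re

query = '"test1" OR "test2.1 OR test2.2" OR test3 OR test4:"test4.1 OR test4.2"'

# General form: any mix of unquoted runs and quoted sections.
general = re.compile(r'(?<!\S)(?!OR\s)[^\s"]*(?:"[^"]*"[^\s"]*)*')

# Stricter form: NQ, Q, or NQ immediately followed by Q.
strict = re.compile(r'(?<!\S)(?!OR\s)(?:[^\s"]*"[^"]*"|[^\s"]+)')

tokens_general = general.findall(query)
tokens_strict = strict.findall(query)

for token in tokens_general:
    print(token)
# "test1"
# "test2.1 OR test2.2"
# test3
# test4:"test4.1 OR test4.2"
```

The ORs inside the quotes are never seen as separators, because each match consumes the whole quoted section before the search resumes.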
Shortened URL with my current regex in regexpal:
http://bit.ly/1jbOFGd
I have a line of key=value pairs, space delimited. Some values contain spaces and punctuation so I do a positive lookahead to check for the existence of another key.
I want to tokenize the key and value, which I later convert to a dict in python.
My guess is that I can speed this up by getting rid of .*? but how? In python I convert 10,000 of these lines in 4.3 seconds. I'd like to double or triple that speed by making this regex match more efficient.
Update:
(?<=\s|\A)([^\s=]+)=(.*?)(?=(?:\s[^\s=]+=|$))
I would think this one is more efficient than yours (even though it still uses .*? for the value, its lookahead is nowhere near as complex and doesn't use a lazy modifier), but I'll need you to test. This does the same as my original expression, but handles values differently. It uses a lazy .*? match followed by a lookahead that is either a space, followed by a key, followed by a = OR the end of the string. Notice I always define a key as [^\s=]+, so keys cannot contain an equal sign or whitespace (being this specific helps us avoid lazy matches).
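Since the question mentions Python, here is a sketch of how the updated expression could be used to build the dict. Python's standard re module only supports fixed-width lookbehind, so the sketch swaps (?<=\s|\A) for (?<!\S), which is assumed to be equivalent here (start of string, or preceded by whitespace). The sample line is made up, as the original input is not shown.

```python
import re

# Key: no whitespace or '='. Value: lazy, up to the next " key=" or end of line.
PAIR = re.compile(r'(?<!\S)([^\s=]+)=(.*?)(?=\s[^\s=]+=|$)')

# Hypothetical input line; values may contain spaces and punctuation.
line = 'name=John Smith city=Chicago Heights, IL zip=60411'

# findall returns (key, value) tuples, which dict() accepts directly.
pairs = dict(PAIR.findall(line))
print(pairs)
# {'name': 'John Smith', 'city': 'Chicago Heights, IL', 'zip': '60411'}
```

Compiling the pattern once, outside the loop over the 10,000 lines, avoids repeated pattern lookups, although whether that makes a measurable difference is something to test on your data.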
Source
Original:
Are there some rules I am missing that you need by doing something this simple?
(?<=\s|\A)([^=]+)=([\S]+)
This starts with a lookbehind of either a space character (\s) or the beginning of the string (\A). Then we match everything except =, followed by a =, and match everything except whitespace (\s).
"Lookbehind" (related to 'lookahead' and 'lookaround') is the key 'regular expression' concept to read up on here - it lets you match and skip individual components of the string.
Good examples here: http://www.rexegg.com/regex-lookarounds.html.