Find the match extract next n chars but exclude a match itself - regex

I am not regex-savvy, so my question may seem simple. How do you extract hours and minutes from a string like this:
2013-12-03T10:45:33-07:00
So I just want to get 10:45 from the above string and ignore the rest.
I tried /[0-2][0-9]:[0-5][0-9]/ but that gives me: 10:45 as well as 07:00
Also tried /[T][0-2][0-9]:[0-5][0-9]/ , but this gives me T10:45
I tried excluding 'T' by using a ^ anchor [^T][ ][ ]:[ ][ ] but this gave me -07:00 !
I thought about searching for the first occurrence of ':' but I don't know how to extract 2 digits before and after ':' and include the ':' itself.
Any help with a comment would be greatly appreciated.

You can use a positive lookbehind for this:
/(?<=T)\d{2}:\d{2}/
What this essentially means is that we're matching two digits, followed by a colon, followed by two digits, but they MUST have a "T" in front. The "T" itself is not added to the match, because lookaheads/lookbehinds only assert a condition and do not consume characters.
[^T] means "any character that isn't T", which is why it didn't work.
JS regex did not support lookbehinds at the time of writing (they were added in ES2018), but you can simply create a capturing group using /T(\d{2}:\d{2})/ and then read element [1] of the match:
var timeString = '2013-12-03T10:45:33-07:00';
var time = timeString.match(/T(\d{2}:\d{2})/)[1];
console.log(time); //10:45
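For comparison, here is a minimal Python sketch of both ideas from this answer. It assumes Python's re module, which accepts a fixed-width lookbehind such as (?<=T); the function names are made up for illustration.

```python
import re

STAMP = "2013-12-03T10:45:33-07:00"

def time_with_lookbehind(stamp):
    # (?<=T) asserts that a "T" sits right before the match without consuming it.
    match = re.search(r"(?<=T)\d{2}:\d{2}", stamp)
    return match.group(0) if match else None

def time_with_group(stamp):
    # Match the "T" as well, but return only capture group 1.
    match = re.search(r"T(\d{2}:\d{2})", stamp)
    return match.group(1) if match else None

print(time_with_lookbehind(STAMP))
print(time_with_group(STAMP))
```

Both functions return None when the string contains no "T" followed by hh:mm, which avoids an exception on unexpected input.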

A simple way to extract the values without heavy regex knowledge would be to do something like this (Perl-style):
my $foo = "2013-12-03T10:45:33-07:00";
my ($hours, $minutes, @junk) = split /:/, $foo;
$hours =~ s/.*(\d\d)$/$1/;
so now you have
$hours and $minutes available for use
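The same split-first idea can be sketched in Python. This is only an illustration under the assumption that the input always has the shape shown in the question; the function name is hypothetical.

```python
def hours_and_minutes(stamp):
    # Splitting on ":" gives ["2013-12-03T10", "45", "33-07", "00"].
    date_and_hours, minutes, *_junk = stamp.split(":")
    # The hours are the last two characters of the first piece.
    return date_and_hours[-2:], minutes

print(hours_and_minutes("2013-12-03T10:45:33-07:00"))
```

No regex is involved at all here; slicing the last two characters plays the role of the substitution in the snippet above.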

Related

Find number of Instances for few words in string while ignoring other few words using regex

Hi, I am using regex in Matlab.
I need to count the hits for a few words while ignoring a few other words.
What I have tried so far:
String = 'Sunday:Monday:Tuesday:Wednesday:Thursday:Friday:Saturday:Sun:Mon:Tue:Wed:,Thu:,Fri:,Sat:';
Output = regexp( String,'^(?!.*(,Sun:|,Sunday:)).*(Sun:|Sunday:)' )
The output of the above regexp comes out as true, but I need it to be 2, since there are 2 hits: one for Sun: and one for Sunday:.
In the next scenario:
String = 'Sunday:Monday:Tuesday:Wednesday:Thursday:Friday:Saturday:Sun:Mon:Tue:Wed:,Thu:,Fri:,Sat:';
Output = regexp( String,'^(?!.*(,Fri:|,Friday:)).*(Fri:|Friday:)' )
The output of the above regexp comes out as false, but I need it to be 1, since there is 1 hit, for Friday:.
I also tried:
regexp( String,'^(?!.*(,Sun:|,Sunday:)).*(Sun:|Sunday:)' ,'match')
But it gives the whole string as output.
I am confused about how to get the number of hits while ignoring other words. Help would be appreciated. As far as I know, regexp works in Matlab the same as elsewhere.
You can use
(?<!,)Fri(?:day)?:
It matches
(?<!,) - a location not immediately preceded with ,
Fri - Fri
(?:day)? - an optional day string
: - a colon.
If you allow some redundancy, you may build the pattern like this:
(?<!,)(Fri:|Sunday:)
It will match Fri: or Sunday: not immediately preceded with a comma.
Unless you really need to use regexp, something like this will be easier to maintain:
Output = sum(ismember(strsplit(String,':'),{'Sunday','Sun'}))
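To make the counting step concrete, here is a small Python sketch of the negative-lookbehind pattern from the first answer. It is an illustration, not Matlab code: the helper name count_hits is hypothetical, and in Matlab the analogous step would presumably be counting the matches returned by regexp with the 'match' option.

```python
import re

DAYS = ("Sunday:Monday:Tuesday:Wednesday:Thursday:Friday:Saturday:"
        "Sun:Mon:Tue:Wed:,Thu:,Fri:,Sat:")

def count_hits(text, short_name):
    # (?<!,) rejects any hit that is immediately preceded by a comma.
    # The optional "day" suffix only fits names like Sun/Mon/Fri; other days
    # (Tuesday, Thursday, ...) would need their own suffix.
    pattern = r"(?<!,)" + re.escape(short_name) + r"(?:day)?:"
    return len(re.findall(pattern, text))

print(count_hits(DAYS, "Sun"))
print(count_hits(DAYS, "Fri"))
```

Because the pattern contains no capturing group, re.findall returns one entry per full match, so the length of the list is the number of hits.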

Complex regex get closest

I would like some help finishing my complex regex.
I have spent some time on it and still can't figure out how to achieve what I want.
This is the text I want to parse:
Do [|83]([]?([]?([]?([]?([]?([]?([]?:))))):)([]?([]?:([]?:)):)([]?[]? :):)([]?([]?[]:):)([]?([]?[]:):)
Bo [|18] pz ([]?:)\n la :\n[pl]
Co [|76] pp ([]?:)
For readability I put each text on its own line, but please consider that in the real input they are not separated by newlines.
This is my regex so far :
(\[\|(\d*)])+(?!\\\n).*([%\sa-zA-Z]*)(\((\[[^\[\]()?:]*])+\s*\?([^()]*):([^()]*)\))
I'm reading every combination of [|NUMBER] () one by one. The processing I apply to "()" depends on the related NUMBER.
When I'm parsing the first time, I'm getting this which is fine :
Then, I replace the whole value after my process :
Now, I do have :
Do [|83] blabla done Bo [|18] pz ([]?:)\n la :\n[pl] Co [|76] pp ([]?:)
When I parse them once more, I got :
The number I get is not the right one. My question is: how can I get the number closest to the "()" I'm parsing afterwards?
Thank you for any tips.
You might shorten the pattern a bit and exclude both the square brackets and the parentheses in the character class that follows the digits and ]:
\[\|\d+][^][()]*\([^()]*\)
The pattern matches:
\[\|\d+] Match [| 1+ digits and ]
[^][()]* Match 0+ times any char other than [ ] ( )
\([^()]*\) Match (, then 0+ times any char other than ( ) then match )
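A minimal Python sketch of this pattern, run against the already-processed text from the question. Two things here are assumptions for illustration: the brackets inside the character class are escaped (which should be equivalent to [^][()]), and a capture group is added around the digits so the number can be read out.

```python
import re

# The "\n" sequences are kept as literal backslash + n, as in the question.
TEXT = r"Do [|83] blabla done Bo [|18] pz ([]?:)\n la :\n[pl] Co [|76] pp ([]?:)"

# [|NUMBER], then anything except brackets/parentheses, then one (...) group.
PATTERN = re.compile(r"\[\|(\d+)\][^\[\]()]*\([^()]*\)")

pairs = [(m.group(1), m.group(0)) for m in PATTERN.finditer(TEXT)]
for number, whole in pairs:
    print(number, "->", whole)
```

Since the character class cannot cross a [ or ], the match can never start at [|83] and then run on to the parentheses that belong to [|18]; each "()" is paired with the closest preceding number.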

Regex: find all IN clause with number of arguments greater or equal to

As input I've got a plain SQL query, something like:
select * from (
select * from Table where id in (1,2,3,4,5,6,642,7,8,9)
or another_id in (1,2,3,4,5,6, 34 ,7 , 8,9))
where yet_another_id in (1,2)
I want to find all IN clause statements where the amount of arguments passed in is greater than XXX.
So far I've come up with this solution.
^.*\s*+(?:in)+\s*+(\((?:\s*+\d+\s*+\,?+){XXX,}+\){1}).*$
where XXX is the number of arguments.
Obviously, the first part:
^.*
eats all IN clause statements except the last one. How can I fix that? Any suggestions how can I improve the regex?
Try this here
\bin\b\s*(?:\((?:\s*\d+\s*\,?){5,}\))
So I removed some stuff from your expression and fixed an obvious error (\(?: where you escaped the wrong bracket.
The \b is a word boundary.
This is working now for me here on Regexr
You seem to be massively overcomplicating this with + characters all over the place. Depending on the flavor, \s*+ is either a possessive version of \s* (Java, PCRE) or a quantifier applied to a quantifier, which many engines reject; a plain \s* is sufficient. Then (?:in)+ means you want to match in or ininininininininin, which doesn't seem right. Likewise \,?+ is just an optional comma, so ,? will do.
Also watch the parentheses: ?: only has its special meaning directly after an unescaped (. A sequence such as \(?: matches an optional ( followed by a non-optional :, and since you don't have any colons in the input, that could never match.
Try something like this:
>>> import re
>>> text = '''select * from (
select * from Table where id in (1,2,3,4,5,6,642,7,8,9)
or another_id in (1,2,3,4,5,6, 34 ,7 , 8,9))
where yet_another_id in (1,2)'''
>>> re.findall(r"(?:in)\s*(\((?:[^),]+\,?){10,}\))", text)
['(1,2,3,4,5,6,642,7,8,9)', '(1,2,3,4,5,6, 34 ,7 , 8,9)']
You may or may not need the extra ^.*? and .*$ around the regex depending on how you are using this.
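One caveat with counting via {10,}: a group like (?:[^),]+,?) can also be satisfied by splitting a single multi-digit number into several repetitions, so the count is not exact. A hedged alternative is to let the regex find each IN list and count the arguments in code. The sketch below assumes flat lists without nested parentheses; the names IN_LIST and long_in_lists are hypothetical.

```python
import re

SQL = """select * from (
select * from Table where id in (1,2,3,4,5,6,642,7,8,9)
or another_id in (1,2,3,4,5,6, 34 ,7 , 8,9))
where yet_another_id in (1,2)"""

# Capture everything between the parentheses that follow the keyword "in".
IN_LIST = re.compile(r"\bin\b\s*\(([^()]*)\)", re.IGNORECASE)

def long_in_lists(sql, min_args):
    found = []
    for match in IN_LIST.finditer(sql):
        # Count the comma-separated, non-empty arguments.
        args = [a.strip() for a in match.group(1).split(",") if a.strip()]
        if len(args) >= min_args:
            found.append("(" + match.group(1) + ")")
    return found

print(long_in_lists(SQL, 5))
```

With the threshold as a plain function argument, there is no need to rebuild the pattern for every value of XXX.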

Regular expression help - comma delimited string

I don't write many regular expressions, so I'm going to need some help on this one.
I need a regular expression that can validate that a string is an alphanumeric comma delimited string.
Examples:
123, 4A67, GGG, 767 would be valid.
12333, 78787&*, GH778 would be invalid
fghkjhfdg8797< would be invalid
This is what I have so far, but isn't quite right: ^(?=.*[a-zA-Z0-9][,]).*$
Any suggestions?
Sounds like you need an expression like this:
^[0-9a-zA-Z]+(,[0-9a-zA-Z]+)*$
Posix allows for the more self-descriptive version:
^[[:alnum:]]+(,[[:alnum:]]+)*$
^[[:alnum:]]+([[:space:]]*,[[:space:]]*[[:alnum:]]+)*$ // allow whitespace
If you're willing to admit underscores, too, search for entire words (\w+):
^\w+(,\w+)*$
^\w+(\s*,\s*\w+)*$ // allow whitespaces around the comma
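As a quick check of the whitespace-tolerant variant, here is a small Python sketch run against the examples from the question. It uses re.fullmatch instead of explicit ^ and $ anchors; the names ITEM, VALID_LIST and is_valid are made up for illustration.

```python
import re

ITEM = r"[0-9A-Za-z]+"
# One item, then any number of ", item" groups with optional whitespace.
VALID_LIST = re.compile(ITEM + r"(?:\s*,\s*" + ITEM + r")*")

def is_valid(text):
    # fullmatch requires the whole string to fit the pattern.
    return VALID_LIST.fullmatch(text) is not None

for sample in ["123, 4A67, GGG, 767", "12333, 78787&*, GH778", "fghkjhfdg8797<"]:
    print(repr(sample), is_valid(sample))
```

Building the pattern from a single ITEM string keeps the two occurrences of the item expression in sync.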
Try this pattern: ^([a-zA-Z0-9]+,?\s*)+$
I tested it with your cases, as well as just a single number "123". I don't know if you will always have a comma or not.
The [a-zA-Z0-9]+ means match 1 or more of these symbols
The ,? means match 0 or 1 commas (basically, the comma is optional)
The \s* handles 0 or more whitespace characters after the comma
and finally the outer + says match 1 or more of the pattern.
This will also match
123 123 abc (no commas) which might be a problem
This will also match 123, (ends with a comma) which might be a problem.
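The two caveats just mentioned are easy to confirm with a short Python sketch (the variable names are only illustrative):

```python
import re

LOOSE = re.compile(r"^([a-zA-Z0-9]+,?\s*)+$")

samples = [
    "123, 4A67, GGG, 767",    # intended: valid
    "123",                    # single item
    "123 123 abc",            # no commas at all
    "123,",                   # trailing comma
    "12333, 78787&*, GH778",  # illegal characters
]
# Record whether the loose pattern accepts each sample.
results = {s: bool(LOOSE.match(s)) for s in samples}
for sample, accepted in results.items():
    print(repr(sample), accepted)
```

Only the last sample is rejected, so this pattern fits only if missing or trailing commas are acceptable.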
Try the following expression:
/^([a-z0-9\s]+,)*([a-z0-9\s]+){1}$/i
This will work for:
test
test, test
test123,Test 123,test
I would strongly suggest trimming the whitespaces at the beginning and end of each item in the comma-separated list.
You seem to be lacking repetition. How about:
^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$
I'm not sure how you'd express that in VB.Net, but in Python:
>>> import re
>>> x = ["123, 4A67, GGG, 767", "12333, 78787&*, GH778"]
>>> r = '^(?:[a-zA-Z0-9 ]+,)*[a-zA-Z0-9 ]+$'
>>> for s in x:
...     print(re.match(r, s))
...
<_sre.SRE_Match object at 0xb75c8218>
None
>>>
You can use shortcuts instead of listing the [a-zA-Z0-9 ] part, but this is probably easier to understand.
Analyzing the highlights:
[a-zA-Z0-9 ]+ : match one or more (but not zero) characters from the listed ranges, or space.
(?:[...]+,)* : In non-capturing parentheses, match one or more of the characters, plus a comma at the end. Match such sequences zero or more times. Matching zero times allows for no comma.
[...]+ : match at least one of these. This does not include a comma. This is to ensure that it does not accept a trailing comma. If a trailing comma is acceptable, then the expression is easier: ^[a-zA-Z0-9 ,]+
Yes, when you want to catch comma separated things where a comma at the end is not legal, and the things match to $LONGSTUFF, you have to repeat $LONGSTUFF:
$LONGSTUFF(,$LONGSTUFF)*
If $LONGSTUFF is really long and contains comma repeated items itself etc., it might be a good idea to not build the regexp by hand and instead rely on a computer for doing that for you, even if it's just through string concatenation. For example, I just wanted to build a regular expression to validate the CPUID parameter of a XEN configuration file, of the ['1:a=b,c=d','2:e=f,g=h'] type. I... believe this mostly fits the bill: (whitespace notwithstanding!)
xend_fudge_item_re = r"""
e[a-d]x= #register of the call return value to fudge
(
0x[0-9A-F]+ | #either hardcode the reply
[10xks]{32} #or edit the bitfield directly
)
"""
xend_string_item_re = r"""
(0x)?[0-9A-F]+: #leafnum (the contents of EAX before the call)
%s #one fudge
(,%s)* #repeated multiple times
""" % (xend_fudge_item_re, xend_fudge_item_re)
xend_syntax = re.compile(r"""
\[ #a list of
'%s' #string elements
(,'%s')* #repeated multiple times
\]
$ #and nothing else
""" % (xend_string_item_re, xend_string_item_re), re.VERBOSE | re.MULTILINE)
Try ^(?!,)((, *)?([a-zA-Z0-9])\b)*$
Step by step description:
Don't match a beginning comma (good for the upcoming "loop").
Match optional comma and spaces.
Match characters you like.
Matching a word boundary makes sure that a comma is required if more arguments are stacked in the string.
Please use - ^((([a-zA-Z0-9\s]){1,45},)+([a-zA-Z0-9\s]){1,45})$
Here, I have set the max word size to 45, as the longest word in English is 45 characters; this can be changed as required.

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
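The same pattern can be tried in Python. In this sketch the source text is assumed to contain a real backslash before the inner quote, which is why the Python literal doubles it:

```python
import re

# Text content:  function(){ return " It's big \"problem "; }
source = ' function(){ return " It\'s big \\"problem "; }'

# A quote, then any run of "ordinary char" or "backslash + any char", then a quote.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

match = QUOTED.search(source)
print(match.group(0))
```

The \\. alternative consumes a backslash together with whatever follows it, so an escaped quote can never be mistaken for the closing one.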
This one comes from nanorc.sample available in many linux distros. It is used for syntax highlighting of C style strings
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
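The patterns above use the "unrolled" shape normal* (special normal*)* rather than an alternation inside a repeated group. A short Python sketch, with made-up names and sample text, suggests the two forms find the same strings:

```python
import re

# Alternation form: (ordinary char | escape pair)*
ALTERNATION = re.compile(r'"(?:[^"\\]|\\.)*"')
# Unrolled form: ordinary* (escape pair, ordinary*)*
UNROLLED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

text = r'a "plain" b "with \"escapes\" inside" c "" d'

alternation_hits = ALTERNATION.findall(text)
unrolled_hits = UNROLLED.findall(text)
print(unrolled_hits)
```

Whether the unrolled form is noticeably faster or safer depends on the regex engine; the sketch only checks that the results agree on this input.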
Most of the solutions provided here use alternative repetition paths i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some pattern compilers implement this using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
Every quoted string starts with Char: " ;
It may contain any number of any characters: .*? {Lazy match}; ending with non escape character [^\\];
Statement (2) is Lazy(!) optional because string can be empty(""). So: (.*?[^\\])??
Finally, every quoted string ends with Char("), but it can be preceded by any number of escape-sign pairs (\\\\)+; and it is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because the string can be empty or without ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # Any number of escape-character pairs
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
) #
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
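For readers who prefer Python, the reverse-match-reverse idea can be sketched as follows. This is a port under the assumption that Python's re module treats the lookaheads and the \1 backreference the same way as the JavaScript version; the names REVERSED_QUOTED and quoted_strings are hypothetical.

```python
import re

# The same three parts as above: closing quote, body, opening quote.
REVERSED_QUOTED = re.compile(
    r"""(['"])(?!(?:\\{2})*\\(?!\\))"""             # closing quote, not escaped
    r"""((?:\1(?=(?:\\{2})*\\(?!\\))|(?!\1).)*)"""  # escaped quotes or other chars
    r"""\1(?!(?:\\{2})*\\(?!\\))"""                 # opening quote, not escaped
)

def quoted_strings(text):
    # Reverse the input, match, then reverse each match and the match order.
    reversed_text = text[::-1]
    hits = [m.group(0)[::-1] for m in REVERSED_QUOTED.finditer(reversed_text)]
    hits.reverse()
    return hits

sample = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"'
print(quoted_strings(sample))
```

In the reversed text the backslash comes after the quote it escapes, which is why a plain lookahead is enough to tell escaped quotes from real ones.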
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
It uses the backreference (\1) to match exactly what is in the first group (" or ').
http://www.regular-expressions.info/backref.html
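One caveat, stated with some caution: in Python's re module (and, as far as I know, most other flavors) a backreference is not recognized inside a character class, so [^\1] does not mean "anything but the opening quote". A negative lookahead gives that meaning instead. The sketch below is a variation on the pattern above, not the original answer's regex:

```python
import re

# (?!\1). means: any character that is not the quote captured in group 1.
ANY_QUOTE = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")

# Text content: x = "it's \"ok\"" + 'single "inner"'
text = 'x = "it\'s \\"ok\\"" + \'single "inner"\''

found = [m.group(0) for m in ANY_QUOTE.finditer(text)]
print(found)
```

The other kind of quote is allowed inside a string, and an escaped quote of the same kind is skipped by the \\. alternative.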
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
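A Python sketch of the same two-step idea, for illustration only (the function name mask_quoted and the sample line are made up). Like the Java version, it assumes that single quotes are harmless in the input and that there are no escaped backslashes directly before a closing quote.

```python
import re

def mask_quoted(line):
    # Step 1: turn escaped double quotes into single quotes.
    line = line.replace('\\"', "'")
    # Step 2: collapse every remaining double-quoted string to "x".
    return re.sub(r'"[^"]*"', '"x"', line)

# Text content: call("a \"b\" c", 1, "d")
example = 'call("a \\"b\\" c", 1, "d")'
print(mask_quoted(example))
```

After step 1 no double quote inside a string is escaped any more, so the very simple pattern in step 2 is sufficient.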
If your IDE is IntelliJ Idea, you can forget all these headaches and store your regex into a String variable and as you copy-paste it inside the double-quote it will automatically change to a regex acceptable format.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"