RegEx Multiple Delimiters - regex

I have a text file with this format:
('1', '2', '3', '4', '5');
('a', 'b', 'c', 'd', 'e');
etc...
From each line I want the third and the fourth entry inside the quotes ('').
My text file has 125k lines, so it is fairly big.
Thank you

^.*?,.*?,(.*?),(.*?),.*
will capture the third and fourth fields in \1 and \2 (assuming no commas appear between the quotes that should not be treated as delimiters, or anything like that).
When run on your example, replacing with \1,\2, the end result is:
'3', '4'
'c', 'd'
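For anyone doing this from a script rather than an editor's search-and-replace, the same pattern can be sketched in Python. This is only an illustration: the function name third_and_fourth is hypothetical, and stripping the surrounding whitespace and quotes is an added assumption about the desired output.

```python
import re

# Same pattern as above: skip two comma-separated fields lazily,
# then capture the third and fourth.
FIELDS = re.compile(r"^.*?,.*?,(.*?),(.*?),.*")

def third_and_fourth(line):
    """Return the third and fourth fields of one line, or None if it does not match."""
    m = FIELDS.match(line)
    if m is None:
        return None
    # Assumption: drop the surrounding spaces and single quotes.
    return tuple(g.strip().strip("'") for g in m.groups())

for line in ["('1', '2', '3', '4', '5');", "('a', 'b', 'c', 'd', 'e');"]:
    print(third_and_fourth(line))
```

For a 125k-line file the function can be applied line by line while iterating over the open file, so the whole file never has to be held in memory.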

Related

Split mathematical expression string with Regex in typescript

I have a mathematical expression (formula) in string format. I'm trying to split that string into a single array containing all the operators, digits and words. I'm doing this by passing a regex to the split() function (I'm new to regex, so I put the pattern together myself). With this expression I get an array separated by operators, digits and words, but somehow I also get an extra blank element in the array alongside each element. See below for exactly what I mean.
My mathematical expression (formula):
1+2-(0.67*2)/2%2=O_AnnualSalary
Regex that I'm using to split it into an array:
this.createdFormula.split(/([?=+-*/%,()])/)
The array I expect to get:
['1', '+', '2', '-', '(', '0', '.', '6', '7', '*', '2', ')', '/', '2', '%', '2', '=', 'O_AnnualSalary']
This is what I'm getting:
['', '1', '', '+', '', '2', '', '-', '', '(', '', '0', '', '.', '', '6', '', '7', '', '*', '', '2', '', ')', '', '/', '', '2', '', '%', '', '2', '', '=', 'O_AnnualSalary']
So far I've tried these expressions, taken from many posts on SO:
this.createdFormula.split(/([?=+-\\*\\/%,()])/)
this.createdFormula.split(/([?=\\\W++-\\*\\/%,()])/)
this.createdFormula.split(/([?=//\s++-\\*\\/%,()])/)
this.createdFormula.split(/([?=+-\\*\\/%,()])(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)/)
this.createdFormula.split(/([?=+-\\*\\/%0-9,()])/)
Can anyone help me to fix this expression to get the desired result? If you need any more information please feel free to ask.
Any help is really appreciated.
Thanks
Assuming you have string match() available, we can use:
var input = "1+2-(0.67*2)/2%2=O_AnnualSalary,HwTotalDays";
var parts = input.match(/(?:[0-9.,%()=/*+-]|\w+)/g);
console.log(parts);
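For comparison, here is a rough Python equivalent, where re.findall plays the role of match() with the g flag. It is a sketch only. The second call illustrates where the blank elements come from: splitting on a capturing group also returns the (empty) text between adjacent delimiters. The character class used there is an assumption in the spirit of the question's attempts, not the asker's exact pattern.

```python
import re

formula = "1+2-(0.67*2)/2%2=O_AnnualSalary"

# One digit/operator/punctuation character, or a whole run of word characters.
TOKEN = re.compile(r"[0-9.,%()=/*+-]|\w+")
tokens = TOKEN.findall(formula)
print(tokens)

# Splitting on a capturing group keeps the delimiters, but when every
# character is a delimiter the text between them is the empty string.
with_blanks = re.split(r"([?=+\-*/%0-9,().])", "1+2")
print(with_blanks)
```

Matching the tokens directly avoids the blank entries, so no filtering step is needed afterwards.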

Regexp to possessively match zero or one characters at end of string [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Regex with prefix and optional suffix
(2 answers)
Closed 2 years ago.
(Note: I found a reasonable solution using String.split() instead of Regexp.match(), but I'm still interested in the theoretical regexp question.)
Given a string that may or may not end with the letter a, and may have any number of letters a in other positions, is there a regexp that lets me capture the trailing a if present as one group, and all previous characters as another? E.g.:
Input   | Group 1 | Group 2
'a'     | ''*     | 'a'
'b'     | 'b'     | ''
'ba'    | 'b'     | 'a'
'baaa'  | 'baa'   | 'a'
'baaab' | 'baaab' | ''
* nil instead of the empty string would also be acceptable
Some things I've tried that haven't worked:
The naive approach: /^(.*)(a?)$/
The same, but with a numeric repetition limit: /^(.*)(a{0,1})$/
The same, but with an atomic group: /^(.*)((?>a?))$/
The same, but with negative lookahead in the first group: /^(.*(?!=a))(a?)$/
All of these fail to capture the trailing a if present:
input  | expected   | actual
'a'    | '', 'a'    | 'a', ''
'ba'   | 'b', 'a'   | 'ba', ''
'baaa' | 'baa', 'a' | 'baaa', ''
The closest I've been able to come is to use | to split between the cases with and without a trailing a. This comes close, but at the expense of producing twice as many capture groups, such that I'll need to do some additional checking to decide whether to use the left or right pair of groups:
/^(?:(.*)(a)$|(.*[^a])()$)/
input   | expected    | actual
'a'     | '', 'a'     | '', 'a', nil, nil
'b'     | 'b', ''     | nil, nil, 'b', ''
'ba'    | 'b', 'a'    | 'b', 'a', nil, nil
'baaa'  | 'baa', 'a'  | 'baa', 'a', nil, nil
'baaab' | 'baaab', '' | nil, nil, 'baaab', ''
The solution I've found is to throw out Regexp.match entirely and just use String.split. This comes close enough for my purposes:
input.split(/(a?)$/)
input   | expected    | actual
'a'     | '', 'a'     | '', 'a'
'b'     | 'b', ''     | 'b' (close enough)
'ba'    | 'b', 'a'    | 'b', 'a'
'baaa'  | 'baa', 'a'  | 'baa', 'a'
'baaab' | 'baaab', '' | 'baaab' (close enough)
This works, but I'd still like to know if there's a way to do it as a straight regexp match.
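One approach that appears to satisfy the table is to make the first group lazy. With a greedy .* the first group swallows the whole string, and since a? is content to match nothing, the engine never has a reason to backtrack. A lazy .*? grows one character at a time, so the optional a gets the first chance at the final character. The sketch below is in Python rather than the question's Ruby; the name split_trailing_a and the use of re.DOTALL (so that the pattern matches any input) are assumptions made for illustration.

```python
import re

# Lazy first group: take as little as possible, so a trailing "a"
# is left over for the optional second group.
PATTERN = re.compile(r"(.*?)(a?)", re.DOTALL)

def split_trailing_a(text):
    """Return (everything before the trailing 'a', the trailing 'a' or '')."""
    m = PATTERN.fullmatch(text)  # fullmatch anchors both ends, like ^...$
    return m.group(1), m.group(2)

for s in ["a", "b", "ba", "baaa", "baaab"]:
    print(repr(s), split_trailing_a(s))
```

The equivalent anchored pattern would be /^(.*?)(a?)$/, which needs only two capture groups instead of the four used by the alternation.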

python re.split() empty string

The following example is taken from the python re documents
re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
'\b' matches the empty string at the beginning or end of a word, which means that running this code produces an error:
(jupyter notebook python 3.6)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-128-f4d2d57a2022> in <module>
1 reg = re.compile(r"\b")
----> 2 re.split(reg, "Words, word, word.")
/usr/lib/python3.6/re.py in split(pattern, string, maxsplit, flags)
210 and the remainder of the string is returned as the final element
211 of the list."""
--> 212 return _compile(pattern, flags).split(string, maxsplit)
213
214 def findall(pattern, string, flags=0):
ValueError: split() requires a non-empty pattern match.
Since \b only matches empty strings, split() never gets the non-empty pattern match it requires. I have seen various questions related to split() and empty strings. For some of them I can see why you might want to do this in practice, for example the question here. Answers vary from "just can't do it" to (older ones) "it's a bug".
My question is this:
Since this is still an example on the Python web page, should this be possible? Is it something that is possible in the bleeding-edge release?
The question in the link above involved
re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar'). It was asked in 2015, and there was no way to accomplish the requirements with just re.split(). Is this still the case?
In Python 3.7 re, you can split with zero-length matches:
Changed in version 3.7: Added support of splitting on a pattern that could match an empty string.
Also, note that
Empty matches for the pattern split the string only when not adjacent to a previous empty match.
>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...')
['', '', 'w', 'o', 'r', 'd', 's', '', '']
>>> re.split(r'(\W*)', '...words...')
['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Also, with
re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
I get the result ['foobar', 'barbaz', 'bar'] in Python 3.7.
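On Python versions before 3.7, one possible workaround (a sketch, not part of the quoted documentation) is to locate the zero-width matches with re.finditer and slice the string yourself. split_at is a hypothetical helper name, and it assumes the pattern is meant to match only empty strings, such as a lookaround or \b.

```python
import re

def split_at(pattern, text):
    """Cut text at the position of every zero-width match of pattern."""
    cuts = [m.start() for m in re.finditer(pattern, text) if m.start() == m.end()]
    pieces, last = [], 0
    for cut in cuts:
        if cut > last:  # skip a cut at position 0 and any repeated cuts
            pieces.append(text[last:cut])
            last = cut
    pieces.append(text[last:])
    return pieces

print(split_at(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(split_at(r'\b', 'Words, words, words.'))
```

Unlike re.split in 3.7, this helper does not emit a leading empty string when the pattern matches at position 0.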

Why are these regular expressions not working?

I want a regex to match complex mathematical expressions.
However, I will ask about an easier regex first, covering only the simplest case.
Example input:
1+2+3+4
I want to separate each char:
[('1', '+', '2', '+', '3', '+', '4')]
With a restriction: there has to be at least one operation (i.e. 1+2).
My regex: ([0-9]+)([+])([0-9]+)(([+])([0-9]+))*
or (\d+)(\+)(\d+)((\+)(\d+))*
Output for re.findall('(\d+)(\+)(\d+)((\+)(\d+))*',"1+2+3+4"):
[('1', '+', '2', '+4', '+', '4')]
Why is this not working? Is Python the problem?
You could go the test route:
check that the input is valid using re.match,
then just get the results with re.findall.
Python code
import re

text = "1+2+3+4"
# Validate the whole string first: at least one "number+number" operation.
if re.match(r"^\d+\+\d+(?:\+\d+)*$", text):
    print("Matched")
    # Then collect the individual tokens.
    print(re.findall(r"\+|\d+", text))
else:
    print("Not valid")
Output
Matched
['1', '+', '2', '+', '3', '+', '4']
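You need ([0-9]+)([+])([0-9]+)(?:([+])([0-9]+))*
You get the '+4' because the outer group around the last two expressions, (([+])([0-9]+)), is itself a capturing group, so its text is included in the output.
The ?: tells Python not to return the string matched by that group. Note that a repeated group still keeps only its last repetition, so this returns [('1', '+', '2', '+', '4')] rather than every token.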
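To see the two behaviours side by side, here is a small sketch comparing the capturing and non-capturing versions with a plain token search. The variable names are just for illustration.

```python
import re

text = "1+2+3+4"

# Outer group captures too, so its last repetition "+4" shows up in the tuple.
capturing = re.findall(r"(\d+)(\+)(\d+)((\+)(\d+))*", text)

# (?:...) groups without capturing; the inner groups still keep only
# the values from their last repetition.
non_capturing = re.findall(r"(\d+)(\+)(\d+)(?:(\+)(\d+))*", text)

# Matching one token at a time returns every number and operator.
tokens = re.findall(r"\d+|\+", text)

print(capturing)
print(non_capturing)
print(tokens)
```

This is why validating with one pattern and tokenizing with another tends to be simpler than trying to get every token out of a single repeated group.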

Why does this regular expression not capture arithmetic operators?

I'm trying to capture tokens from a pseudo-programming-language script, but operators such as +-*/ are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this code the regular expression has to highlight:
valid variable names,
strings enclosed in quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
The result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc....
The problem is that the operators are not selected if I use this regular expression...
We don't know your exact requirements, but it seems that your regex is capturing only a few characters other than \n, \r, etc.
Try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information you wrote in your question, I adjusted the regex to this new one, and tried it with python. I don't know vbscript.
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.
This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only white space must be left over, so I have to find a way to also capture everything else that is not white space and is not any of the other options in my regular expression.
Characters like "$%µ°" are ignored even when I put "|." after my regular expression :(
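A likely cause of both problems is that the number alternative (\d*\.?\d* in the question, (\d*\.)?\d* in the follow-up) can match the empty string. Alternatives are tried left to right, so at any position where nothing earlier matches, the empty number "wins" and every alternative after it, including a trailing |., is never reached. Below is a sketch in Python (not VBScript) with a number pattern that always requires a digit and a \S catch-all at the end. The names TOKEN and tokenize are hypothetical, and the exact operator set is an assumption.

```python
import re

# The empty-match problem in miniature: the number branch matches "" at "+".
empty = re.match(r"\d*\.?\d*|\+", "+").group(0)

TOKEN = re.compile(
    r'''
    "(?:[^"\r\n]|"")*"      # string literal, "" is an escaped quote
  | [a-z_]\w*               # identifier or keyword
  | \d+\.?\d*|\.\d+         # number: 10, 10., 3.55, .5 (never empty)
  | [-+*/&|!()=,]           # single-character operators
  | \S                      # catch-all: any other non-space character
    ''',
    re.VERBOSE | re.IGNORECASE,
)

def tokenize(source):
    """Return every token in source; only white space is left over."""
    return [m.group(0) for m in TOKEN.finditer(source)]

print(repr(empty))
print(tokenize('test_123 = 3.55 + i- -10 * .5'))
print(tokenize('msg "this is a ""string"" with quotes in it..."'))
```

Because no branch can match the empty string, the order of the alternatives only matters where two branches could match the same text, for example keeping the catch-all last.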