Why does this regular expression not capture arithmetic operators? - regex

I'm trying to capture tokens from a script written in a pseudo-programming language, but operators such as + - * / are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this code the regular expression has to match:
valid variable names,
strings enclosed in quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
The result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc.
The problem is that the operators are not matched when I use this regular expression.

We don't know your exact requirements, but your regex seems to capture only some of the tokens you want.
Try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information in your question, I adjusted the regex and tested it with Python (I don't know VBScript).
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.
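A likely reason the operators are skipped by the pattern in the question (this is an addition, not something the answer above states): the number branch \d*\.?\d* is made entirely of optional parts, so it can match the empty string. Alternatives are tried left to right, so at a '+' the number branch succeeds with an empty match before the \+ branch is ever tried. The sketch below shows this with Python's re module, on the assumption that the asker's engine orders alternatives the same way; the names loose and strict are only illustrative.

```python
import re

# Number branch from the question: every part is optional, so it can match "".
loose = re.compile(r'\d*\.?\d*|\+')
# Hypothetical stricter variant: at least one digit is always required.
strict = re.compile(r'\d+\.?\d*|\.\d+|\+')

print(repr(loose.match('+').group()))   # '' (empty match, the \+ branch is never tried)
print(repr(strict.match('+').group()))  # '+'
print(strict.match('.5').group())       # .5
```

Moving the operator branches in front of the number branch would also work, but a number pattern that cannot match the empty string is the more robust fix.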

This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only whitespace may be left over, so I also need a way to capture anything else that is not whitespace when it matches none of the other alternatives in my regular expression.
Characters like "$%µ°" are ignored even when I put "|." at the end of my regular expression :(
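One plausible explanation for the ignored characters: the last branch, (\d*\.)?\d*, can match the empty string, so at a character such as $ it succeeds with an empty match and the trailing |. is never reached. Below is a minimal Python sketch of a tokenizer whose number branches always consume at least one character, with \S as the final catch-all. The branch order and the names TOKEN and tokenize are assumptions made for illustration, not the asker's code, and the sketch has not been tried in VBScript.

```python
import re

# Hypothetical tokenizer sketch; branch order and names are illustrative.
TOKEN = re.compile(
    r'''
      "(?:[^"\r\n]|"")*"     # string literal; "" is an escaped quote
    | [a-z_]\w*              # identifier or keyword
    | \d+\.?\d*              # 1  10.  3.55 (at least one leading digit)
    | \.\d+                  # .5
    | \S                     # any other single non-whitespace character
    ''',
    re.IGNORECASE | re.VERBOSE,
)

def tokenize(script):
    """Return every non-whitespace token of script, in order."""
    return TOKEN.findall(script)

print(tokenize('test_123 = 3.55 + i- -10 * .5'))
# ['test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5']
print(tokenize('msg "a ""string""" $%'))
```

Because no branch can match the empty string, the only characters left unmatched are whitespace.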

Related

Splitting/Tokenizing a sentence into string words with special conditions

I am trying to implement a tokenizer to split string of words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings,
and split on whitespace, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.
I don't plan on using the nltk package, and I have looked at re.split and re.findall, but I'm stuck in both cases:
re.split = I don't know how to split off punctuation that is attached to a word, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter(punctuation) while keeping it in the final results? One way of doing that would be this:
import re
import string
sent = "I have a dog!'-4#"
punc_Str = str(string.punctuation)
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (case is ignored because of re.I),
[.!?] - any single char from your list (note that inside brackets neither a dot nor a '?' needs to be escaped),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']

Why are these regular expressions not working?

I want a regex to match complex mathematical expressions.
However I will ask for an easier regex because it will be the simplest case.
Example input:
1+2+3+4
I want to separate each char:
[('1', '+', '2', '+', '3', '+', '4')]
With a restriction: there has to be at least one operation (i.e. 1+2).
My regex: ([0-9]+)([+])([0-9]+)(([+])([0-9]+))*
or (\d+)(\+)(\d+)((\+)(\d+))*
Output for re.findall('(\d+)(\+)(\d+)((\+)(\d+))*', "1+2+3+4"):
[('1', '+', '2', '+4', '+', '4')]
Why is this not working? Is Python the problem?
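You could go the test route: check that the input is valid with re.match, then extract the tokens with re.findall.
Python code
import re

expr = "1+2+3+4"
# Validate the whole expression first: at least one '+' operation is required.
if re.match(r"^\d+\+\d+(?:\+\d+)*$", expr):
    print("Matched")
    # Then pull out the numbers and operators as flat tokens.
    print(re.findall(r"\+|\d+", expr))
else:
    print("Not valid")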
Output
Matched
['1', '+', '2', '+', '3', '+', '4']
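You need ([0-9]+)([+])([0-9]+)(?:([+])([0-9]+))*
You get the '+4' because of the outer group around the last two expressions, (([+])([0-9]+)).
The ?: tells Python not to include the text of that group in the output.
Note that the two inner groups still keep only their last repetition, so the middle '+' and '3' do not appear in the result.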
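To make the behaviour concrete: a capturing group inside a repetition is overwritten on every iteration, so only the last repetition is reported. A short Python sketch (the variable names are illustrative):

```python
import re

expr = "1+2+3+4"

# The repeated group matches '+3' and then '+4'; only the last one is kept.
print(re.findall(r'(\d+)(\+)(\d+)((\+)(\d+))*', expr))
# [('1', '+', '2', '+4', '+', '4')]

# Making the outer group non-capturing removes '+4', but '+3' is still lost.
print(re.findall(r'(\d+)(\+)(\d+)(?:(\+)(\d+))*', expr))
# [('1', '+', '2', '+', '4')]

# A flat pattern with no groups returns every token.
print(re.findall(r'\d+|\+', expr))
# ['1', '+', '2', '+', '3', '+', '4']
```

So Python is not the problem; a single match simply cannot return a variable number of groups, which is why validating first and tokenizing with a flat pattern is the usual workaround.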

Keeping special marks when splitting text into tokens using regex

I have this text 'I love this but I have a! question to?' and currently using
token_pattern = re.compile(r"(?u)\b\w+\b")
token_pattern.findall(text)
When using this regex I'm getting
['I','love', 'this', 'but', 'I', 'have', 'a', 'question', 'to']
I'm not the one who wrote this regex and I know nothing about regex (I tried to understand it from examples but gave up). Now I need to change it so that it keeps the question and exclamation marks and returns them as separate tokens, giving this list:
['I','love', 'this', 'but', 'I', 'have', 'a', '!', 'question', 'to', '?']
Any suggestions on how I can do that?
Try this:
token_pattern = re.compile(r"(?u)[^\w ]|\b\w+\b")
token_pattern.findall(text)
It also matches every character that is neither a word character nor a space as its own single-character token.
If you really only need question and exclamation marks, you can change the regex to
token_pattern = re.compile(r"(?u)[!?]|\b\w+\b")
token_pattern.findall(text)

Regular Expression in Perl

I need to extract the 4th field value (128) from the following line using regular expression.
( '29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', '128', '17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')
Please tell me the way to take the 4th value.
Thanks in advance.
Use Text::CSV from CPAN:
use Text::CSV;
my $input = "( '29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', '128', '17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')";
my $csv = Text::CSV->new({
    quote_char       => "'",
    always_quote     => 1,
    allow_whitespace => 1,
});
$csv->parse($input);
my @columns = $csv->fields();
print $columns[3], "\n"; # 128
The brute force way:
/'[^']*',\s*'[^']*',\s*'[^']*',\s*'([^']*)'/
This is a quote, followed by any number of non-quotes, then another quote, a comma, and some optional whitespace. All that is repeated four times with () around the fourth value to capture it. This may not work if the values are allowed to have quotes in them.
As Cameron pointed out, you can avoid the repetition using:
/(?:'[^']*',\s*){3}'([^']*)'/
The ?: tells the regexp parser not to capture the stuff inside the parentheses.
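For illustration, here is the non-capturing version transcribed into Python's re module (the question is about Perl, so this is only a sketch of how the pattern behaves; the variable names are made up):

```python
import re

line = ("( '29/11/2010 09:38:05', '41297', '29/11/2010 09:40:30', '128', "
        "'17', 'SUCCESS', '30', 'e', '9843171457', '1', '-1')")

# Skip three quoted fields with a non-capturing group, then capture the fourth.
m = re.search(r"(?:'[^']*',\s*){3}'([^']*)'", line)
fourth = m.group(1) if m else None
print(fourth)  # 128
```

As with the Perl version, this assumes that no field contains a quote character.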
Might be easier to split the list up using split with the comma as the delimiter, and then take the fourth element. Of course, if you can have commas inside the values, that may not work.
It's just Perl's split function:
my $str = "('29/11/2010 09:38:05','41297','29/11/2010 09:40:30','128','17','SUCCESS','30','e', '9843171457','1','-1')";
my @vars = split(/',\s*'/, $str);  # \s* tolerates a space after the comma
print "$vars[3]\n";

Regex pattern to match where my code breaks

I have the following values that I want to place into a mysql db
The pattern should look as follows; I need a regex to make sure that it always does:
('', '', '', '', '')
In some rare executions of my code, however, I get the following output, where one of the apostrophes disappears. It disappears every now and then on the 4th value, as in the code below where I placed the *
('1', '2576', '1', '*, 'y')
Any ideas to solve this would be welcome!
This should be able to match one of the times the code breaks
string.replace(/, \',/ig, ', \'\',');
how would I do it if it is like this
('1', '2576', '1', 'where I have text here and it breaks at the end*, 'y')
I am using javascript and asp
I think the solution would be something like this
string.replace(/, \'[a-zA-Z0-9],/ig, ', \'\','); but not exactly sure how to write it
This is almost the solution that I am looking for...
string.replace(/[a-zA-Z0-9], \'/ig, '\', \'');
This code, however, replaces the last letter of the text along with the , ' so if the text inside the string is 'approved, ' the result is 'approve', ' with the letter d cut off.
I know there is a way to reference the matched letter so that it is not removed, but I'm not sure how to do it.
Is this what you're looking for? It matches when all but the last field is missing the '
\('.*?'\)
Your regular expression would be something like this:
^\('.*?',\ '.*?',\ '.*?',\ '.*?',\ '.*?'\)$
You could check whether your string matches in ASP.NET with code similar to this:
Match m = Regex.Match(inputString, @"^\('.*?',\ '.*?',\ '.*?',\ '.*?',\ '.*?'\)$");
if (!m.Success)
{
    // some fix logic here
}
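The "reference" the question asks about is a capture group with a backreference: wrap the last letter in parentheses and put it back in the replacement. Below is a minimal sketch in Python. The asker uses JavaScript, where the same reference is written $1 in the replacement string. ROW and repair are hypothetical names, and the sketch assumes that field values never legitimately contain a quote.

```python
import re

# Exactly five single-quoted fields separated by ", ".
ROW = re.compile(r"^\('[^']*'(?:, '[^']*'){4}\)$")

def repair(row):
    """Re-insert a closing quote that was dropped before ", '" (hypothetical helper)."""
    # Group 1 keeps the character in front of the comma; \1 puts it back.
    return re.sub(r"([a-zA-Z0-9]), '", r"\1', '", row)

broken = "('1', '2576', '1', 'approved, 'y')"
fixed = repair(broken)
print(fixed)                    # ('1', '2576', '1', 'approved', 'y')
print(bool(ROW.match(broken)))  # False
print(bool(ROW.match(fixed)))   # True
```

This only covers the case where the field contains text. An empty field with a missing quote, as in the first example, has no letter in front of the comma and would need a separate rule.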