I'm not sure what's wrong with my regular expression or why it's chopping off the first character. The regex correctly identifies what I want to split on, but why is the first character missing from each element of the resulting list?
>>> f = "value: http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:user-services-http/ssoeproxy/logout value: http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:user-services-http-two/ssoeproxy/logout value: user-services-http #458930 value: user-services-http-two #458930"
>>> re.split(r'[a-z0-9]([-a-z0-9]*[a-z0-9])?', f)
['', 'alue', ': ', 'ttp', '://', 'c2-xxx-xxx-xxx-xxx', '.', 'ompute-1', '.', 'mazonaws', '.', 'om', ':', 'ser-services-http', '/', 'soeproxy', '/', 'ogout', ' ', 'alue', ': ', 'ttp', '://', 'c2-xxx-xxx-xxx-xxx', '.', 'ompute-1', '.', 'mazonaws', '.', 'om', ':', 'ser-services-http-two', '/', 'soeproxy', '/', 'ogout', ' ', 'alue', ': ', 'ser-services-http', ' #', '58930', ' ', 'alue', ': ', 'ser-services-http-two', ' #', '58930', '']
A more detailed explanation of your problem: re.split() always splits on the whole match, but when the pattern contains a capture group, the text of that group is also returned in the result list. In this case you're capturing everything but the first letter, because [a-z0-9] is outside your parentheses. The first character is therefore consumed as part of the separator and never returned. Move your parentheses to include this part (and make the inner group non-capturing with (?:...)) and you're good to go.
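The difference is easy to see side by side. Here is a minimal sketch on a shortened version of the input string; the names `text`, `broken`, `fixed` and `tokens` are only illustrative:

```python
import re

text = "value: user-services-http #458930"

# Original pattern: the first character is matched (and so removed as part
# of the separator) but sits outside the capture group, so it is lost.
broken = re.split(r'[a-z0-9]([-a-z0-9]*[a-z0-9])?', text)
# ['', 'alue', ': ', 'ser-services-http', ' #', '58930', '']

# Fixed pattern: one outer group captures the whole token; the inner
# optional group is made non-capturing with (?:...).
fixed = re.split(r'([a-z0-9](?:[-a-z0-9]*[a-z0-9])?)', text)
# ['', 'value', ': ', 'user-services-http', ' #', '458930', '']

# The captured tokens sit at the odd indexes of the split result.
tokens = fixed[1::2]
print(tokens)
```

If only the tokens are needed and not the text between them, re.findall with the same pattern would be another option.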
This is the list I am getting:
['', '', ' NRGD\n ', '\n MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN\n ', ' $102.24\n ', ' 5012.00%\n \n2070.00', '\n ']
I want to "clean it up" and return:
['NRGD', 'MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN', '$102.24', '5012.00%', '2070.00']
Basically, I want to remove every item that consists only of spaces or \n. For the items with actual text, I want to strip the spaces and \n so that only the text remains.
We can use a list comprehension here:
inp = ['', '', ' NRGD\n ', '\n MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN\n ', ' $102.24\n ', ' 5012.00%\n \n2070.00', '\n ']
output = [x.strip() for x in inp if x.strip()]
print(output)
This prints:
['NRGD', 'MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN',
'$102.24', '5012.00%\n \n2070.00']
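The above logic retains any list element which, after stripping leading and trailing whitespace, is not an empty string, and stores it with that whitespace trimmed. Note that strip() only trims the ends, so '5012.00%\n \n2070.00' keeps its internal newlines and stays a single element.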
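To get exactly the list from the question, where 5012.00% and 2070.00 are separate items, each element would also have to be split on its internal line breaks. A minimal sketch, assuming every line break inside an element marks an item boundary (the name `cleaned` is just illustrative):

```python
inp = ['', '', ' NRGD\n ',
       '\n MicroSectors U.S. Big Oil Index -3X Inverse Leveraged ETN\n ',
       ' $102.24\n ', ' 5012.00%\n \n2070.00', '\n ']

# Split every element into lines first, then strip each line and keep
# only the ones that still contain text.
cleaned = [line.strip()
           for item in inp
           for line in item.splitlines()
           if line.strip()]

print(cleaned)
```

If a line break inside an element should not always start a new item, the plain strip() version is the safer choice.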
I am trying to implement a tokenizer to split a string of words.
The special conditions I have are: split the punctuation marks . , ! ? into separate strings,
and split on spaces while keeping any other run of characters together, i.e. I have a dog!'-4# -> 'I', 'have', 'a', 'dog', '!', "'-4#"
Something like this.....
I don't plan on using the nltk package, and I have looked at re.split and re.findall, but I'm stuck in both cases:
re.split = I don't know how to split out words with punctuation next to them, such as 'Dog,'
re.findall = Sure, it returns all the matched strings, but what about the unmatched ones?
If you have any suggestions, I'd be very happy to try them.
Are you trying to split on a delimiter (punctuation) while keeping it in the final result? One way of doing that is to put the delimiters in a capture group:
import re
sent = "I have a dog!'-4#"
# The capture group makes re.split keep each delimiter in the result.
print(re.split(r"([.,;:!^ ])", sent))
This is the result I get.
['I', ' ', 'have', ' ', 'a', ' ', 'dog', '!', "'-4#"]
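The result above still contains the spaces themselves. If those are not wanted, one possible follow-up is to filter them out after splitting. This is only a sketch: the punctuation set `.,!?` is taken from the question rather than from the pattern above, and the name `tokens` is illustrative:

```python
import re

sent = "I have a dog!'-4#"

# Split on the listed punctuation marks or on runs of whitespace, keeping
# the separators thanks to the capture group...
parts = re.split(r"([.,!?]|\s+)", sent)

# ...then drop empty strings and whitespace-only pieces.
tokens = [p for p in parts if p and not p.isspace()]

print(tokens)
```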
Try:
re.findall(r'[a-z]+|[.!?]|(?:(?![.!?])\S)+', txt, re.I)
Alternatives in the regex:
[a-z]+ - a non-empty sequence of letters (case is ignored because of re.I),
[.!?] - any (single) char from your list (note that between brackets
neither a dot nor a '?' needs to be escaped),
(?:(?![.!?])\S)+ - a non-empty sequence of non-whitespace characters,
other than those in your list.
E.g. for text containing I have a dog!'-4#?. the result is:
['I', 'have', 'a', 'dog', '!', "'-4#", '?', '.']
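Wrapped into a small helper, the same idea could look like the sketch below. Two things here are assumptions and not part of the answer above: the comma is added to the punctuation set because the question lists it, and the names `TOKEN_RE` and `tokenize` are made up for the example:

```python
import re

# letters | one listed punctuation mark | run of other non-whitespace chars
TOKEN_RE = re.compile(r"[a-z]+|[.,!?]|(?:(?![.,!?])\S)+", re.I)

def tokenize(text):
    """Return the tokens of text; whitespace is skipped, nothing else is."""
    return TOKEN_RE.findall(text)

print(tokenize("I have a dog!'-4#?."))
print(tokenize("Dog, cat"))
```

Because the three alternatives together cover every non-whitespace character, findall leaves nothing but whitespace unmatched, which addresses the concern about "unmatched" strings.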
Raw query:
select firstfield, secondfield, phone_number, thirdfield
from table
having CONCAT(firstfield, ' ', secondfield, ' ', thirdfield, ' ', fourthfield) regexp 'value'
and CONCAT(firstfield, ' ', secondfield, ' ', thirdfield, ' ', fourthfield) regexp 'value2'
and CONCAT(firstfield, ' ', secondfield, ' ', thirdfield, ' ', fourthfield) regexp 'value3'
and CONCAT(firstfield, ' ', secondfield, ' ', thirdfield, ' ', fourthfield) regexp 'value4'
Querybuilder
$qb->select(
'firstfield',
'secondfield',
'thirdfield',
'fourthfield',
)->from(Table, 'u');
$queryHaving = "CONCAT(firstfield, ' ', secondfield, ' ', thirdfield, ' ', fourthfield) regexp 'value'";
$qb->andhaving($queryHaving);
$queryHaving = "CONCAT(firstfield, ' ', secondfield, ' ', thirdfield, ' ', fourthfield) regexp 'value2'";
$qb->andhaving($queryHaving);
Problem:
How can I build the CONCAT ... REGEXP condition when REGEXP is not a function? I tried using the literal() function, but it fails with an error saying it is not possible to assign into.
The query seems to work for me in MySQL with either of these two forms:
select *
from test
having concat(field1, field2) regexp '^[FB].*' and
concat(field1, field2) regexp 'o$';
select *
from test
where concat(field1, field2) regexp '^[FB].*' and
concat(field1, field2) regexp 'o$';
See demo here
I'm just thinking that the problem could be with CHAR columns.
So, for example, one column would hold FOO<space><space> as a CHAR(5) instead of FOO as a VARCHAR(5). When concatenating, you would then get something like FOO<space><space>BAR<space><space>, and the regex would fail.
However, with SQLFiddle that doesn't seem to be the case. It does not seem to add spaces. See here.
Anyway, it may be worth trying in your app: are you using CHARs or VARCHARs? Could you try adding trims on the columns, like this:
select *,concat(trim(field1), trim(field2))
from test
having concat(trim(field1), trim(field2)) regexp '^[FB].*' and
concat(trim(field1), trim(field2)) regexp 'o$';
select *,concat(trim(field1), trim(field2))
from test
where concat(trim(field1), trim(field2)) regexp '^[FB].*' and
concat(trim(field1), trim(field2)) regexp 'o$';
Demo here.
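The padding hypothesis is easy to reproduce outside the database. The following is only a Python analogy using the re module, not MySQL itself, and the column values are made up for the illustration:

```python
import re

# Hypothetical column values, padded to width 5 the way the answer
# suspects a CHAR(5) column might be.
field1 = "Fo".ljust(5)   # 'Fo   '
field2 = "o".ljust(5)    # 'o    '

padded = field1 + field2                   # trailing spaces survive
trimmed = field1.strip() + field2.strip()  # 'Foo'

# 'o$' needs the string to end in 'o', so trailing padding breaks it.
padded_matches = re.search(r"o$", padded) is not None
trimmed_matches = re.search(r"o$", trimmed) is not None

print(padded_matches, trimmed_matches)
```

If the real data behaves like `padded`, trimming each column before concatenating, as in the queries above, would be the corresponding fix.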
I use the following regex to split sentences into words:
"('?\w[\w']*(?:-\w+)*'?)"
For example:
import re
re.split(r"('?\w[\w']*(?:-\w+)*'?)", "'cos I like ice-cream!")
gives:
['', "'cos", ' ', 'I', ' ', 'like', ' ', 'ice-cream', '!']
However, formatting tags sometimes appear in my text and my regex obviously can't process them as I would like:
re.split(r"('?\w[\w']*(?:-\w+)*'?)", "'cos I <i>like</i> ice-cream!")
gives:
['', "'cos", ' ', 'I', ' <', 'i', '>', 'like', '</', 'i', '> ', 'ice-cream', '!']
while I would like:
['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
How would you go about solving this?
You could use a word boundary regex, specifying exclusions of matches using negative lookbehind and lookahead assertions:
^|(?<!['<\/-])\b(?![>-])
Regex demo.
Unfortunately, Python's re.split did not support splitting on zero-width matches before Python 3.7, so on older versions you have to use a workaround:
import re
a = re.sub(r"^|(?<!['<\/-])\b(?![>-])", "|", "'cos I <i>like</i> ice-cream!").split('|')
print(a)
# ['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
Python demo.
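On Python 3.7 or later, re.split accepts patterns that match the empty string, so the placeholder character should no longer be needed. A minimal sketch, assuming such an interpreter (the variable names are illustrative):

```python
import re

text = "'cos I <i>like</i> ice-cream!"

# Same idea as above: split at the start of the string and at every word
# boundary that is not next to ', <, /, - (before) or >, - (after).
boundary = re.compile(r"^|(?<!['</-])\b(?![>-])")

# Python 3.7+: split directly on the zero-width matches.
direct = boundary.split(text)

# Workaround from above, for comparison. It relies on '|' never
# occurring in the text itself.
via_sub = boundary.sub("|", text).split("|")

print(direct)
```

Splitting directly also avoids the risk of the input already containing the placeholder character.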
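# I added a negative lookahead to your pattern so that a word followed directly by > (a tag name) is not matched.
# Note: this only handles single-letter tags such as <i>; for <em> the pattern backtracks and still matches 'e'.
import re
print(re.split(r"('?\w[\w']*(?:-\w+)*'?(?!>))", "'cos I <i>like</i> ice-cream!"))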
[Output]
['', "'cos", ' ', 'I', ' <i>', 'like', '</i> ', 'ice-cream', '!']
I'm trying to capture tokens from a pseudo-programming-language script, but operators such as +, -, * and / are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this piece of code, the regular expression has to match:
valid variable names,
strings enclosed in quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
The result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc....
The problem is that the operators are not matched when I use this regular expression...
We don't know your requirements, but it seems that your regex only captures a few things (non-\n, non-\r characters, etc.).
Try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information you added to your question, I adjusted the regex to this new one and tried it with Python. I don't know VBScript.
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.
This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only whitespace should be left over, so I have to find a way to also capture everything else that is not whitespace when it matches none of the other options in my regular expression.
Characters like "$%µ°" are ignored even when I put "|." after my regular expression :(
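A likely cause, both here and in the original pattern from the question, is that the number alternative ((\d*\.)?\d* or \d*\.?\d*) can match the empty string. Alternatives are tried from left to right, so at a character such as $ the number alternative succeeds with an empty match before a trailing |. is ever tried; in many engines the scan then simply moves on one character. The sketch below is written in Python rather than the original environment, and shows one way around it: the number alternatives require at least one digit, and a final \S catches everything else. The names `TOKEN_RE` and `tokenize` are made up for the example:

```python
import re

TOKEN_RE = re.compile(
    r'''
      "(?:[^"\r\n]|"")*"    # string literal; a doubled quote is an escaped quote
    | [a-z_]\w*             # identifier or keyword
    | \d+\.?\d*             # number with leading digits: 1, 3.55, 10.
    | \.\d+                 # number without a leading digit: .5
    | [-+*/&|!()=,]         # single-character operator
    | \S                    # catch-all: any other non-whitespace character
    ''',
    re.IGNORECASE | re.VERBOSE,
)

def tokenize(source):
    """Return every non-whitespace token of source, in order."""
    return TOKEN_RE.findall(source)

print(tokenize('test_123 = 3.55 + i- -10 * .5'))
print(tokenize('msg "this is a ""string"" with quotes in it..."'))
print(tokenize('x = 5 $ %'))
```

Since no alternative can match an empty string and \S accepts any remaining non-whitespace character, only whitespace is left unmatched.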