Regular expression, not outputting values correctly - regex

num = re.findall(r'[-+]?\d*\.*\d+', str(table))
Hi all, I have this regular expression and it prints the values I want. However, numbers that contain a comma are split into separate matches.
For example:
['7', '336.82', '-3.89', '-0.05', '7', '351.60', '7', '322.86', '7', '340.71']
is what it prints.
But I want it to print:
['7,336.82', '-3.89', '-0.05', '7,351.60', '7,322.86', '7,340.71']
Please could someone help?
Thanks in advance.

It looks like you want to capture numbers that contain commas as thousands separators. You can use:
r'[-+]?(?:\d+[\d,]*)?\.?\d+'
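As a quick check, here is a minimal Python sketch that applies this pattern to a string shaped like the asker's expected output. The `table_text` sample is made up for illustration; the real `str(table)` will contain other text around the numbers.

```python
import re

# Hypothetical sample standing in for str(table).
table_text = "7,336.82 -3.89 -0.05 7,351.60 7,322.86 7,340.71"

# Optional sign, optional integer part that may contain commas,
# optional decimal point, then at least one digit.
pattern = r'[-+]?(?:\d+[\d,]*)?\.?\d+'

nums = re.findall(pattern, table_text)
print(nums)
# ['7,336.82', '-3.89', '-0.05', '7,351.60', '7,322.86', '7,340.71']
```

Because `[\d,]*` accepts any mix of digits and commas, a run such as 1,,2 is also matched as a single number, so this pattern extracts rather than validates.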

If each comma must be followed by exactly three digits, use a stricter pattern:
[-+]?\d{1,3}(\,\d{3})*(\.\d+)?
With this pattern, the input 1,000,00.0 is matched as two separate numbers: 1,000 and 00.0.
Demo: https://regex101.com/r/8nYbaQ/2
If a number such as 01,123 should be rejected because of its leading zero:
(\+?[1-9]|\-\d)\d{0,2}(\,\d{3})*(\.\d+)?
Demo: https://regex101.com/r/8nYbaQ/3
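One practical detail when using the stricter pattern from Python: `re.findall` returns the captured groups rather than the whole match when a pattern contains capturing groups. The sketch below rewrites the groups of the first strict pattern as non-capturing `(?:...)`; the sample strings are illustrative only.

```python
import re

# The three-digit-group pattern, with its groups made non-capturing so that
# re.findall returns whole matches instead of tuples of groups.
strict = r'[-+]?\d{1,3}(?:,\d{3})*(?:\.\d+)?'

first = re.findall(strict, "7,336.82 -3.89 -0.05")
print(first)
# ['7,336.82', '-3.89', '-0.05']

second = re.findall(strict, "1,000,00.0")
print(second)
# ['1,000', '00.0']
```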

How to simplify postgres regexp_replace

Is there a way to simplify this query using only one regexp_replace?
select regexp_replace(regexp_replace('BL 081', '([^0-9.])', '', 'g'), '(^0+)', '', 'g')
The result should be 81.
I'm trying to remove all non-numeric characters and any leading zeros from the result.
You can do this by capturing the digits you want (not including any leading zeros) and removing everything else:
select regexp_replace('BL 0081', '.*?([1-9][0-9]*)$', '\1')
Output
81
Note you don't need the g flag as you are only making one replacement.
Why not just change the range from 0-9 to 1-9?
regexp_replace('BL 081', '(^[^1-9]+)', '', 'g')
This pattern should do: \D+|(?<=\s)0+
\D - matches characters that are not digits
(?<=\s)0+ - matches a run of zeros that directly follows a whitespace character (the lookbehind itself consumes nothing)
That lets you drop one of the two regexp_replace calls:
select regexp_replace('BL 081', '\D+|(?<=\s)0+', '', 'g')
-- outputs 81
Alternatively, if you are interested in the numeric value, you could use a simpler regex and then cast the result to an integer.
select regexp_replace('BL 081', '\D+', '')::int
-- also outputs 81, but its type is int
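The suggested patterns agree on 'BL 081' but not on every input. The sketch below ports three of them to Python's `re` module purely to compare behavior; it is not PostgreSQL, the function names are hypothetical, and the extra sample inputs are made up.

```python
import re

def two_step(s):
    # Mirrors the nested query from the question: keep only digits and dots,
    # then strip leading zeros.
    return re.sub(r'^0+', '', re.sub(r'[^0-9.]', '', s))

def alternation(s):
    # Mirrors \D+|(?<=\s)0+ : drop non-digits, plus zeros right after whitespace.
    return re.sub(r'\D+|(?<=\s)0+', '', s)

def leading_run(s):
    # Mirrors ^[^1-9]+ : drop only the leading run of characters other than 1-9.
    return re.sub(r'^[^1-9]+', '', s)

samples = ['BL 081', 'BL081', 'BL 081 X']
results = {s: (two_step(s), alternation(s), leading_run(s)) for s in samples}
for s, r in results.items():
    print(repr(s), r)
# 'BL 081'   -> ('81', '81', '81')
# 'BL081'    -> ('81', '081', '81')
# 'BL 081 X' -> ('81', '81', '81 X')
```

In this port, the lookbehind version depends on a space before the zeros, and the leading-run version leaves any trailing non-digits in place, so which one is a safe replacement depends on what the real data looks like.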

Regex expression to separate collapsed title

First-time poster here. I have a text in which lots of title-case text is collapsed without spaces. I'm trying to:
a) keep the full text (not lose any words),
b) use logic to separate 'A' as in 'A Way Forward',
c) avoid separating acronyms such as EPA, DOJ, etc. (which are already in full caps).
My regex comes pretty close, but it leaves 'A' attached to the beginning or end of words:
f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"
re.sub( r"([A-Z][a-z]|[A-Z][A-Z]|\d+)", r" \1", f).split()
output:
['The', 'Curious', 'Incident', 'Of', 'AMan', 'In', 'AWhite', 'House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
The problem is output like 'AMan', 'AWhite', etc.
It should be:
['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House', 'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']
Thank you
Welcome to Stack Overflow, Greg. Good start on your regex.
I'd try something like this:
([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)
Broken down, for explanation:
([A-Z]{2,}(?![a-z]) // 2 or more capital letters, not followed by a lowercase letter
| // OR
[a-zA-Z][a-z]* // Any letter, followed by any number of lowercase letters
| // OR
[0-9]+) // One or more digits
Best used like this:
re.findall(r'([A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+)', s)
Try it online (contains \W* for formatting)
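For reference, a minimal runnable sketch applying this pattern to the string from the question. The outer capture group is dropped here, which does not change what `re.findall` returns; the second input is made up to show an acronym directly followed by a capitalized word.

```python
import re

f = "TheCuriousIncidentOfAManInAWhiteHouseAt1600PennsylvaniaAveAndTheEPA"

# Acronym (2+ capitals not followed by lowercase), or a word, or a number.
pattern = r'[A-Z]{2,}(?![a-z])|[a-zA-Z][a-z]*|[0-9]+'

words = re.findall(pattern, f)
print(words)
# ['The', 'Curious', 'Incident', 'Of', 'A', 'Man', 'In', 'A', 'White', 'House',
#  'At', '1600', 'Pennsylvania', 'Ave', 'And', 'The', 'EPA']

# The lookahead makes the acronym branch give back the capital that starts
# the next word.
acronym_case = re.findall(pattern, "TheEPAReport")
print(acronym_case)
# ['The', 'EPA', 'Report']
```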

Why does this regular expression not capture arithmetic operators?

I'm trying to capture tokens from a script written in a pseudo programming language, but operators such as +, -, * and / are not captured.
I tried this:
[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!
For example, I have this code:
for i = 1 to 10
test_123 = 3.55 + i- -10 * .5
next
msg "this is a ""string"" with quotes in it..."
In this code the regular expression has to highlight:
valid variable names,
strings enclosed in quotes,
operators like (),+-*/!
numbers like 0.1 123 .5 10.
The result of the regular expression has to be:
'for',
'i',
'=',
'1',
'to',
'10',
'test_123',
'=',
'3.55',
'+'
etc....
The problem is that the operators are not matched when I use this regular expression.
We don't know your exact requirements, but your regex seems to cover only some of the tokens in the script.
Try something like this, grouping the tokens you want to capture:
'([a-z_]+)|([\.\d]+)|([\+\-\*\/])|(\=)|([\(\)\[\]\{\}])|(['":,;])'
EDIT: With the new information in your question, I adjusted the regex and tested it with Python. I don't know VBScript.
import re
test_string = r'''for i = 1 to 10:
test_123 = 3.55 + i- -10 * .5
next
msg "this is a 'string' with quotes in it..."'''
pattern = r'''([\da-z_^\.]+|[\.\d]+|[\+\-\*\/]|\=|[\(\)\[\]\{\}]|[:,;]|".*[^"]"|'.*[^']')'''
print(re.findall(pattern, test_string, re.MULTILINE))
And this is the list with the matches:
['for', 'i', '=', '1', 'to', '10', ':', 'test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', 'next', 'msg', '"this is a \'string\' with quotes in it..."']
I think it captures all you need.
This fits my needs, I guess:
"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*
But only whitespace should be left over, so I still need a way to capture everything else that is not whitespace and is not matched by any of the other alternatives in my regular expression.
Characters like "$%µ°" are ignored even when I append "|." to my regular expression :(
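One likely cause of both symptoms is that the number branch can match the empty string, so at an operator or at "$" it succeeds with nothing before any later alternative is tried. The Python sketch below only checks what is matched at a single position, because engines differ in how they continue after an empty match and this has not been verified in VBScript. The `fixed` pattern is a hypothetical rewrite, not taken from the thread.

```python
import re

# The question's pattern: the number branch \d*\.?\d* can match the empty string.
original = r'[a-z_]\w*|"([^"\r\n]+|"")*"|\d*\.?\d*|\+|\*|\/|\(|\)|&|-|=|,|!'
# Alternatives are tried left to right, so at "+" the empty number match
# succeeds before the \+ branch is reached.
at_plus = re.match(original, "+").group(0)
print(repr(at_plus))    # ''

# The follow-up pattern with "|." appended has the same issue at "$".
followup = r'"([^"]+|"")*"|[\-+*/&|!()=,]|[a-z_]\w*|(\d*\.)?\d*|.'
at_dollar = re.match(followup, "$").group(0)
print(repr(at_dollar))  # ''

# Hypothetical fix: require at least one digit in the number branch and
# add \S as a catch-all for any other non-whitespace character.
fixed = r'[a-z_]\w*|"(?:[^"\r\n]+|"")*"|\d+\.?\d*|\.\d+|[-+*/()&=,!]|\S'
line = 'test_123 = 3.55 + i- -10 * .5 $'
tokens = [m.group(0) for m in re.finditer(fixed, line)]
print(tokens)
# ['test_123', '=', '3.55', '+', 'i', '-', '-', '10', '*', '.5', '$']
```

With no branch able to match the empty string, every match consumes at least one character, and only whitespace is left unmatched.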

How to strip a sentence just keep letters, numbers, spaces

I have a lot of sentences that need to be cleaned up from all the special characters and punctuations (I want to keep just the letters and numbers and spaces), for example:
$string = "TB Avrupa ve Türkiye'nin en iyi oranlari ile Lider Bahis Sitesi!!";
$final_title = preg_replace('/[^a-z]+/i', '', $string);
This removes everything, including the spaces.
I need to keep the spaces. Can I add anything to the previous line to achieve this?
Expected output :
TB Avrupa ve Türkiyenin en iyi oranlari ile Lider Bahis Sitesi
I want to keep just the letters and numbers and spaces
You can use this regex to remove everything other than English letters, digits and spaces:
preg_replace('/[^a-z\d ]+/i', '', $string);
Just include any characters you want to keep:
'/[^a-z0-9 ]+/i'
You would need to change your regex to this:
$final_title = preg_replace('/[^a-z0-9 ]+/i', '', $string);
This will keep numbers and spaces.
I do not know exactly what your requirements are, however, ü is a valid letter in some languages.
If you want to keep those as well, negate a Unicode letter class and add the u modifier so the pattern and subject are treated as UTF-8:
$final_title = preg_replace('/[^\p{L}0-9 ]+/u', '', $string);
Try this (only the first ^ negates the class, and the u modifier is needed for the multibyte Turkish letters):
preg_replace('/[^A-Za-z0-9şŞıİçÇöÖüÜĞğ ]+/u', '', $string);
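For comparison only, the same idea of keeping letters from any language can be sketched in Python. This is not PHP and is not taken from the answers above; it relies on `\w` matching Unicode letters and digits for `str` patterns in Python 3.

```python
import re

string = "TB Avrupa ve Türkiye'nin en iyi oranlari ile Lider Bahis Sitesi!!"

# \w covers Unicode letters, digits and the underscore, so remove anything
# that is not \w or a space, and remove underscores explicitly.
final_title = re.sub(r"[^\w ]+|_+", "", string)
print(final_title)
# TB Avrupa ve Türkiyenin en iyi oranlari ile Lider Bahis Sitesi
```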

C# regex @"[;]+"

In c#, there is a line of code such as:
string[] values = Regex.Split(fielddata, @"[;]+");
On what values does this split? I'm getting a bit confused by the mixture of the @ sign in front of the literal and what the square brackets and + mean here. Any ideas?
@ marks a verbatim string literal, meaning you don't have to escape special characters such as backslashes. As Asad already said, it splits on one or more consecutive semicolons, where + stands for "1 or more" in regex grammar.
Here's a runnable example: http://ideone.com/whLqUe
string input = "a;b; ;c;;;d";
string[] values = Regex.Split(input, @";+");
foreach (var value in values)
    Console.WriteLine(value);
This prints five lines: a, b, a line containing a single space, c, and d.
[...] is a character class matching any single character inside the square brackets. In this case it is redundant; just writing @";+" would mean exactly the same.
+ repeats the previous character or pattern 1 or more times.
So this splits on consecutive ; (as many as possible).
The verbatim string (@"...") is used simply as a matter of good practice. Once you need to escape things inside regular expressions, it gets ugly if you use a normal string. Again, in this particular example, leaving out the @ would make no difference. But it's something worth getting used to.
Those brackets are unnecessary. That regex is equivalent to the following:
string[] values = Regex.Split(fielddata, @";+");
It splits on runs of one or more semicolons, so "1;2;;3;;4;;;5;;6;7" would return the array:
['1', '2', '3', '4', '5', '6', '7']
The Split method will split fielddata on runs of one or more semicolons. The @ symbol means that you do not have to escape characters: the string is taken verbatim as written between the double quotes.
if fielddata = "a;b;c;;d;e;;;f"
then
values = ["a","b","c","d","e","f"]
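The same splitting behavior can be tried quickly in Python, shown here only as an illustrative sketch since the thread is about C#. Python's `re.split` is used in place of `Regex.Split`, and the last input is a made-up case with semicolons at both ends.

```python
import re

# The character class around a single ";" changes nothing.
with_class = re.split(r"[;]+", "a;b; ;c;;;d")
without_class = re.split(r";+", "a;b; ;c;;;d")
print(with_class)     # ['a', 'b', ' ', 'c', 'd']
print(without_class)  # ['a', 'b', ' ', 'c', 'd']

# Runs of semicolons count as one separator.
fields = re.split(r";+", "a;b;c;;d;e;;;f")
print(fields)         # ['a', 'b', 'c', 'd', 'e', 'f']

# A separator at the very start or end produces an empty string in Python.
edges = re.split(r";+", ";a;b;")
print(edges)          # ['', 'a', 'b', '']
```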