Regex pattern for localization - regex

I am trying to find a regex pattern to fix a localization issue.
The usual delimiters are ".", "," or "_", which I have stored in an array of delimiters.
I'm trying to find a pattern which matches any of these delimiters in a number that also ends with one or more 0s.
For example: 3,000 or 3,0 or 3.0 or 3.00

You could try positive lookahead
If indeed your data always has one or more 0 after any delimiter, using a positive lookahead ( (?=0+) in this case) might be what you are looking for...
More precisely, for the numbers you gave:
/[_.,](?=0+)/g
should do the trick!
You could try it out and experiment with regex here!
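A minimal sketch of the lookahead idea in Python, assuming the delimiters live in a list as the question describes. The names `delimiters`, `pattern` and `strip_delimiter` are made up for illustration and do not come from the question.

```python
import re

# Delimiters as described in the question; the variable name is an assumption.
delimiters = [".", ",", "_"]

# Build a character class from the list, then require at least one 0 right after it.
delimiter_class = "[" + "".join(re.escape(d) for d in delimiters) + "]"
pattern = re.compile(delimiter_class + r"(?=0+)")


def strip_delimiter(number_text):
    """Remove every delimiter that is immediately followed by a zero."""
    return pattern.sub("", number_text)


for sample in ["3,000", "3,0", "3.0", "3.00", "3,5"]:
    print(sample, "->", strip_delimiter(sample))
```

Note that the lookahead only checks the character right after the delimiter, so "3,05" would also match. If the number must consist of zeros all the way to its end, a stricter lookahead such as (?=0+$) may be closer to what is wanted.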

We could likely start with an expression similar to:
\d+[.,](\d+)?[0]
and add additional boundaries to it, if we like so.
For instance, if we wish to capture the delimiters, we would be adding a capturing group:
\d+([.,])(\d+)?[0]
Demo
Or if we wish to remove delimiters, we would expand it to:
(\d+)([.,])(\d+)?([0])
and replace it with:
$1$3$4
Demo
Test
const regex = /(\d+)([.,])(\d+)?([0])/gm;
const str = `3,000
3,0
3.0
3.00`;
const subst = `$1$3$4`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);

You could put the delimiters in a character class [_.,] and use word boundaries \b to prevent the number from being part of a larger word.
If they are the only value, you might also use anchors to assert the start ^ and the end $ of the string.
\b\d+[_.,]\d*0\b
That will match:
\b Word boundary
\d+ Match 1+ digits
[_.,] Match any of the listed in the character class
\d*0 Match 0+ digits followed by a zero
\b Word boundary
Regex demo
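As a quick check of the word-boundary pattern above, here is a small Python sketch. The sample text and variable names are illustrative assumptions.

```python
import re

# The pattern from the answer: digits, one delimiter, optional digits, a final 0.
number_with_trailing_zero = re.compile(r"\b\d+[_.,]\d*0\b")

# "3,05" does not end in 0 and "x3,0" is glued to a letter, so both are skipped.
text = "3,000 3,0 3.0 3.00 3,05 x3,0"
matches = number_with_trailing_zero.findall(text)
print(matches)

# When the number is the whole value, fullmatch plays the role of ^ and $.
is_whole_value = re.fullmatch(r"\d+[_.,]\d*0", "3,0") is not None
print(is_whole_value)
```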

Related

Regex for replacing anything other than characters, more than one spaces and number only in end with empty char

I want to replace anything other than letters, spaces, and numbers at the end of the string with an empty string. In other words: any number or space that comes at the start or in the middle of the string should be replaced with an empty string.
Example
**Input**      **Output**
Ndd12          Ndd12
12Ndd12        Ndd12
Ndd 12         Ndd 12
Nav G45up      Nav Gup
Attempted Code
regexp_replace(df1[col_name], "(^[A-Za-z]+[0-9 ])", "")
You may use:
\d+(?!\d*$)|[^\w\n]+(?!([A-Z]|$))
RegEx Demo
Explanation:
\d+(?!\d*$): Match 1+ digits that are not followed by 0+ digits and end of line
|: OR
[^\w\n]+(?!([A-Z]|$)): Match 1+ non-word characters that are not followed by an uppercase letter or the end of the line
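A rough check of the first alternative, \d+(?!\d*$), using Python's re module instead of Spark's regexp_replace (an assumption made so the snippet runs standalone). The function name is hypothetical. On the four sample rows from the question, this alternative alone appears to reproduce the expected output.

```python
import re


def strip_non_trailing_digits(value):
    """Remove runs of digits unless only digits follow them up to the end."""
    return re.sub(r"\d+(?!\d*$)", "", value)


# Input -> expected output, copied from the table in the question.
rows = {
    "Ndd12": "Ndd12",
    "12Ndd12": "Ndd12",
    "Ndd 12": "Ndd 12",
    "Nav G45up": "Nav Gup",
}

for source, expected in rows.items():
    result = strip_non_trailing_digits(source)
    print(repr(source), "->", repr(result), "ok" if result == expected else "differs")
```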
If you use Python, you can use regular expressions through the re module.
import re
new_string = re.sub(r"[^a-zA-Z0-9]","",s)
Here the ^ at the start of the character class negates it, so the pattern matches every character that is not a letter or a digit.
Regular expressions exist in other languages too, so a similar pattern should carry over.
I came up with this regex to capture all characters that you want to remove from the string.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+
Use:
regexp_replace(df1[col_name], r"^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+", "")
Regex Demo
Explanation:
^\d+ - captures all digits in a sequence from the start.
(?<=\w)\d+(?![\d\s]) - positive lookbehind for a word character, then a sequence of digits, then a negative lookahead asserting that the next character is neither a digit nor whitespace. (Captures digits in G45up)
(?<=\s)\s+ - positive lookbehind for a whitespace character followed by one or more whitespace characters, capturing all additional spaces.
Note : This regex could be inefficient when matching large strings as it uses expensive look-arounds.
^\d+|(?<=\w)\d+(?![\d\s])|(?<=\s)\s+|(?<=\w)\W|\W(?=\w)|(?<!\w)\W|\W(?!\w)

Matching comma after certain phrase

I'm using Atom's regex search and replace feature and not JavaScript code.
I thought this JavaScript-compatible regex would work (I want to match the commas that have Or rather behind it):
(?!\b(Or rather)\b),
?! = Negative lookahead
\b = word boundary
(...) = search the words as a whole not character by character
\b = word boundary
, = the actual character.
However, if I remove characters from "Or rather" the regex still matches. I'm confused.
https://regexr.com/4keju
You probably meant to use positive lookbehind instead of negative lookahead
(?<=\b(Or rather)\b),
Regex Demo
You can enable lookbehind in Atom using flags; read this thread.
The (?!\b(Or rather)\b), pattern is equivalent to a plain , because the negative lookahead is checked at the position of the comma itself: the next character there is , and never O, so the lookahead always succeeds.
To remove commas after Or rather in Atom, use
Find What: \b(Or rather),
Replace With: $1
Make sure you select the .* option to enable regular expressions (and the Aa is for case sensitivity swapping).
\b(Or rather), matches
\b - a word boundary
(Or rather) - Capturing group 1 that matches and saves the Or rather text in a memory buffer that can be accessed using $1 in the replacement pattern
, - a comma.
JS regex demo:
var s = "Or rather, an image.\nor rather, an image.\nor rather, friends.\nor rather, an image---\nOr rather, another time they.";
console.log(s.replace(/\b(Or rather),/g, '$1'));
// Case insensitive:
console.log(s.replace(/\b(Or rather),/gi, '$1'));
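The claim that the negative-lookahead pattern behaves like a bare comma can be checked with a short Python sketch. Python's re module is used here as a stand-in for Atom's engine, and the sample sentence is made up for illustration.

```python
import re

text = "Or rather, an image, or rather not."

# The original attempt: the lookahead is tested at the comma, so it never fails.
lookahead_hits = [m.start() for m in re.finditer(r"(?!\b(Or rather)\b),", text)]

# A bare comma, for comparison.
plain_hits = [m.start() for m in re.finditer(r",", text)]

# Positive lookbehind: only the comma directly after "Or rather".
lookbehind_hits = [m.start() for m in re.finditer(r"(?<=\bOr rather\b),", text)]

# Capturing-group replacement, as suggested for Atom's find and replace.
cleaned = re.sub(r"\b(Or rather),", r"\1", text)

print(lookahead_hits == plain_hits, lookbehind_hits, cleaned)
```

The lookbehind form and the capturing-group form select the same comma; the group form is the one to reach for when the engine does not support lookbehind.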
To Match any comma after "Or rather" you can simply use
(or rather)(,) and access the second group using match[2]
Or an alternative would be to use or rather as a non capturing group
(?:or rather)(,) so the first group would be commas after "Or rather"

REGEX - how replace word before colon with word itself wrapped into double quotes

I have a "json" like this:
{
example:"hi there",
dateTime: "01/01/1970 bla:bla"
}
I need to wrap every key before a colon in double quotes.
Referring to this response, Regex in ruby to enclose word before : with double quotes, I tried the code, but it was not fully correct, because it also changed the value before the colon inside the dateTime string.
So I need to add to this code
(\w+)(?=:)
another check, which verifies that the word comes right after a comma.
So I would like to turn the "json" into real JSON like this:
{
"example":"hi there",
"dateTime": "01/01/1970 bla:bla"
}
and not like this, as happens now:
{
"example":"hi there",
"dateTime": "01/01/1970 "bla":bla"
}
Here is a solution to select all attribute names and replace them (check the demo):
For letters of either case:
/(?:[a-z]+(?=:[" ]))/ig
For alphanumeric and underscore:
/(?:[\w]+(?=:[" ]))/g
Demo:
https://www.regextester.com/?fam=107535
?: creates a non-capturing group, so the engine does not have to store the group's match for a later backreference.
If you can rely on the values to be replaced being at the start of the line, as in your example, you can use a regex like ^ *([a-zA-Z0-9_]+):, which matches only sequences of alphanumeric characters and underscores before a colon, preceded by zero or more spaces, and captures the sequence as the first group.
Using an online regex validator, you can see what is matched/captured and what is not for the input sample you showed.
Then you can use the captured group to wrap the text you are interested in in double quotes.
Simple runnable example in JS:
var input = `{
example:"hi there",
dateTime: "01/01/1970 bla:bla"
}`
var regexp = /^[ \t]*([a-zA-Z0-9_]+):/mg
var replaced = input.replace(regexp, '"$1":')
console.log(replaced)
EXPLANATION: m flag enables multiline match, g flag enables match all mode
I can't show you a Ruby example, but the provided regexp should help you!
Your pattern (\w+)(?=:) matches 1+ word characters where a colon is on the right side. That will also match bla:
What you might do is match 1+ word characters with \w+ and then use a positive lookahead to assert that what is on the right is a colon, optional whitespace, and a double-quoted value (an opening double quote, the content in between, and a closing double quote).
Note that the matching does not take the data structure into account; it relies only on what is on the right side of the match.
Then in the replacement you can wrap the match between quotes.
\w+(?=:\s*"[^"]+")
That matches:
\w+ Match 1+ word characters
(?= Positive lookahead to assert what is on the right matches
:\s* Match a colon followed by 0+ whitespace characters
"[^"]+" Match from the opening to the closing double quote, using a negated character class for the 1+ characters in between
) Close lookahead
Regex demo | Example using Ruby
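Since the linked example is in Ruby, here is a comparable sketch in Python (chosen only for illustration). It applies the lookahead pattern to the sample from the question and then parses the result with json.loads as a sanity check. The variable names are assumptions.

```python
import json
import re

raw = '''{
example:"hi there",
dateTime: "01/01/1970 bla:bla"
}'''

# A key is a run of word characters followed by a colon and a double-quoted value.
key_pattern = re.compile(r'\w+(?=:\s*"[^"]+")')

# Wrap each matched key in double quotes.
quoted = key_pattern.sub(lambda m: '"' + m.group(0) + '"', raw)
print(quoted)

# If the rewrite worked, the text is now valid JSON.
data = json.loads(quoted)
print(data["dateTime"])
```

The bla:bla inside the dateTime value is left alone because the colon there is not followed by a double-quoted value. Keys whose values are numbers, booleans, or nested objects would not be matched by this lookahead.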

Regex to match and replace a character in a pattern

I would like to replace a character "?" with "fi" in a string.
I could write a generic str replace for this. But I want to replace the "?" only if it appears in between two A-Za-z character and avoid the rest
Eg., "Okay?" should be "Okay?" and not "Okayfi"
but
Modi?es should be Modifies since it has ? in middle
What have I tried?
sentence = re.sub(r"(\?)\b", "fi", sentence)
Please see here.
https://regexr.com/3nvk3
Seems to work fine in regexr. but doesnt work well in code. Am I doing something wrong?
The best approach here is to find the original text with the fi ligature and read it in with proper encoding.
Otherwise, you will have to use some workarounds.
You may use (?<=[a-zA-Z]) / (?=[A-Za-z]) lookarounds:
sentence = re.sub(r"(?<=[a-zA-Z])\?(?=[a-zA-Z])", "fi", sentence)
See the regex demo. The (?<=[a-zA-Z]) positive lookbehind matches a position immediately after an ASCII letter, and the (?=[A-Za-z]) positive lookahead matches a position immediately before an ASCII letter.
Or, you may also use a capturing group with backreferences:
sentence = re.sub(r"([a-zA-Z])\?([a-zA-Z])", r"\1fi\2", sentence)
See another regex demo. Note that \1 references the value captured with the first ([a-zA-Z]) group and \2 references the value captured into Group 2 (([a-zA-Z])).
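Both variants can be compared side by side in a small runnable sketch. The function names are hypothetical, and the third sample string is made up to show where the two approaches differ.

```python
import re


def restore_fi(sentence):
    """Lookaround variant: the neighbouring letters are checked, not consumed."""
    return re.sub(r"(?<=[a-zA-Z])\?(?=[a-zA-Z])", "fi", sentence)


def restore_fi_groups(sentence):
    """Capturing-group variant: the neighbouring letters are consumed and re-inserted."""
    return re.sub(r"([a-zA-Z])\?([a-zA-Z])", r"\1fi\2", sentence)


for sample in ["Modi?es", "Okay?", "a?b?c"]:
    print(sample, "->", restore_fi(sample), "|", restore_fi_groups(sample))
```

The two functions agree on the examples from the question. They can differ when two question marks share a neighbouring letter, as in "a?b?c": the group variant consumes the "b" in its first match, so the second "?" is left untouched, while the lookaround variant replaces both.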

Finding a match one after another

How do I find multiple matches that are (and can only be) separated from each other by whitespaces?
I have this regular expression:
/([0-9]+)\s*([A-Za-z]+)/
And I want each of the matches (not groups) to be surrounded by whitespace or another match. If the condition is not fulfilled, the match should not be returned.
This is valid: 1min 2hours 3days
This is not: 1min, 2hours 3days (1min and 2hours should not be returned)
Is there a simpler way of finding a continuous sequence of matches (in Java preferably) than repeating the whole regex before and after the main one, checking if there is a whitespace, start/end of the string or another match?
I believe this pattern will meet your requirements (provided that only a single space character separates your alphanumeric tokens):
(?<=^|[\w\d]\s)([\w\d]+)(?=\s|$)
    ^^^^^^^^^^  ^^^^^^^  ^^^^
       (2)        (1)    (3)
(1) A capture group that contains an alphanumeric string.
(2) A look-behind assertion: To the left of the capture group must be a) the beginning of the line or b) an alphanumeric character followed by a single space character.
(3) A look-ahead assertion: To the right of the capture group must be a) a space character or b) the end of the line.
See regex101.com demo.
Here is some sample data that I included in the demo. Each bolded alphanumeric string indicates a successful capture:
1min 2hours 3days
1min, 2hours 3days
42min 4hours 2days
String text = "1min 2hours 3days";
boolean match = text.matches("(?:\\s*[0-9]+\\s*[A-Za-z]+\\s*)*");
This checks that the whole string consists of the pattern from your example. The * after the group means zero or more occurrences of the pattern, and ?: makes the group non-capturing.
This will also return true for an empty string. If you don't want an empty string to match, change the * after the group into +.
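The same whole-string check can be sketched in Python, where re.fullmatch plays the role of Java's String.matches. The names below are illustrative, and the + variant is used so that an empty string is rejected.

```python
import re

# One or more "number + unit" items, each optionally surrounded by whitespace.
SEQUENCE = re.compile(r"(?:\s*[0-9]+\s*[A-Za-z]+\s*)+")


def is_valid_sequence(text):
    """True only if the entire string is a run of number/unit items."""
    return SEQUENCE.fullmatch(text) is not None


for sample in ["1min 2hours 3days", "1min, 2hours 3days", "", "1min2hours"]:
    print(repr(sample), is_valid_sequence(sample))
```

This validates the string as a whole rather than returning the individual matches. Because every \s* may match nothing, "1min2hours" is accepted as well; replacing the separators with \s+ would be one way to require whitespace between items.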
I've managed to solve my problem by splitting the string using string.split("\\s+") and then matching the results to the pattern /([0-9]+)\s*([A-Za-z]+)/.
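A sketch of the split-then-match approach in Python (the original used Java, so this is a translation under that assumption, and the names are hypothetical). Since the tokens no longer contain whitespace after splitting, the \s* between the number and the unit is dropped here.

```python
import re

# A token is valid when it is exactly digits followed by letters.
TOKEN = re.compile(r"([0-9]+)([A-Za-z]+)")


def parse_tokens(text):
    """Split on whitespace and keep only tokens that fully match the pattern."""
    pairs = []
    for token in text.split():
        match = TOKEN.fullmatch(token)
        if match:
            pairs.append((int(match.group(1)), match.group(2)))
    return pairs


print(parse_tokens("1min 2hours 3days"))
print(parse_tokens("1min, 2hours 3days"))
```

Note that for "1min, 2hours 3days" this keeps 2hours, because each token is judged on its own. That is looser than the requirement stated in the question, where a match next to an invalid token should also be dropped.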
There is an issue here: the '*' makes the whitespace optional, so the number and the unit do not have to be separated at all
/([0-9]+)\s*([A-Za-z]+)/
Change to
/(\d+)\s+(\w+)/g
which requires at least one whitespace character between the number and the unit. With the g flag this returns every match rather than only the first. There is no need to always write '[0-9]'; '\d' says the same thing. Note that '\w' is broader than '[A-Za-z]', since it also matches digits and the underscore. More shorthands can be found in a regular expression cheat sheet.