How to add formatting commas to a number in Google Refine

How to add formatting commas to a number in Google Refine - regex

Due to what we're using the data for, it's important that long numbers (8+ digits) have commas every 3 digits for formatting and readability.
The issue is I really don't know how to make an expression that does this. Would anyone with some more experience writing these expressions point me in the right direction?
The supported expression languages are GREL (Google Refine Expression Language), Clojure, and Jython.

Using a substitution, this \B(?=(\d{3})+(?!\d)) will insert a comma every three digits
So 12345678 becomes 12,345,678
It uses :
\B : Negated word boundary, it's the negated version of \b and matches at every position where \b does not. Effectively, \B matches at any position between two word characters as well as at any position between two non-word characters. More details
A positive look ahead (?=...) wich make sure the comma will be inserted every three digits starting from the right
Live DEMO

Related

Regex to find last occurrence of pattern in a string

My string being of the form:
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
I only want to match against the last segment of whitespace before the last period(.)
So far I am able to capture whitespace but not the very last occurrence using:
\s+(?=\.\w)
How can I make it less greedy?

In a general case, you can match the last occurrence of any pattern using the following scheme:
pattern(?![\s\S]*pattern)
(?s)pattern(?!.*pattern)
pattern(?!(?s:.*)pattern)
where [\s\S]* matches any zero or more chars as many as possible. (?s) and (?s:.) can be used with regex engines that support these constructs so as to use . to match any chars.
In this case, rather than \s+(?![\s\S]*\s), you may use
\s+(?!\S*\s)
See the regex demo. Note the \s and \S are inverse classes, thus, it makes no sense using [\s\S]* here, \S* is enough.
Details:
\s+ - one or more whitespace chars
(?!\S*\s) - that are not immediately followed with any 0 or more non-whitespace chars and then a whitespace.

You can try like so:
(\s+)(?=\.[^.]+$)
(?=\.[^.]+$) Positive look ahead for a dot and characters except dot at the end of line.
Demo:
https://regex101.com/r/k9VwC6/3

"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=((?<=\S)\s+)).*
replaced by `>\1<`
> <
As a more generalized example
This example defines several needles and finds the last occurrence of either one of them. In this example the needles are:
defined word findMyLastOccurrence
whitespaces (?<=\S)\s+
dots (?<=[^\.])\.+
"as.asd.sd ffindMyLastOccurrencedsfs. dfindMyLastOccurrencefsd d.sdfsd. sdfsdf sd ..COM"
.*(?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)).*
replaced by `>\1<`
>..<
Explanation:
Part 1 .*
is greedy and finds everything as long as the needles are found. Thus, it also captures all needle occurrences until the very last needle.
edit to add:
in case we are interested in the first hit, we can prevent the greediness by writing .*?
Part 2 (?=(findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+|(?<=**Not**NeedlePart)NeedlePart+))
defines the 'break' condition for the greedy 'find-all'. It consists of several parts:
(?=(needles))
positive lookahead: ensure that previously found everything is followed by the needles
findMyLastOccurrence|(?<=\S)\s+|(?<=[^\.])\.+)|(?<=**Not**NeedlePart)NeedlePart+
several needles for which we are looking. Needles are patterns themselves.
In case we look for a collection of whitespaces, dots or other needleparts, the pattern we are looking for is actually: anything which is not a needlepart, followed by one or more needleparts (thus needlepart is +). See the example for whitespaces \s negated with \S, actual dot . negated with [^.]
Part 3 .*
as we aren't interested in the remainder, we capture it and dont use it any further. We could capture it with parenthesis and use it as another group, but that's out of scope here

SIMPLE SOLUTION for a COMMON PROBLEM
All of the answers that I have read through are way off topic, overly complicated, or just simply incorrect. This question is a common problem that regex offers a simple solution for.
Breaking Down the General Problem
THE STRING
The generalized problem is such that there is a string that contains several characters.
THE SUB-STRING
Within the string is a sub-string made up of a few characters. Often times this is a file extension (i.e .c, .ts, or .json), or a top level domain (i.e. .com, .org, or .io), but it could be something as arbitrary as MC Donald's Mulan Szechuan Sauce. The point it is, it may not always be something simple.
THE BEFORE VARIANCE (Most important part)
The before variance is an arbitrary character, or characters, that always comes just before the sub-string. In this question, the before variance is an unknown amount of white-space. Its a variance because the amount of white-space that needs to be match against varies (or has a dynamic quantity).
Describing the Solution in Reference to the Problem
(Solution Part 1)
Often times when working with regular expressions its necessary to work in reverse.
We will start at the end of the problem described above, and work backwards, henceforth; we are going to start at the The Before Variance (or #3)
So, as mentioned above, The Before Variance is an unknown amount of white-space. We know that it includes white-space, but we don't know how much, so we will use the meta sequence for Any Whitespce with the one or more quantifier.
The Meta Sequence for "Any Whitespace" is \s.
The "One or More" quantifier is +
so we will start with...
NOTE: In ECMAS Regex the / characters are like quotes around a string.
const regex = /\s+/g
I also included the g to tell the engine to set the global flag to true. I won't explain flags, for the sake of brevity, but if you don't know what the global flag does, you should DuckDuckGo it.
(Solution Part 2)
Remember, we are working in reverse, so the next part to focus on is the Sub-string. In this question it is .com, but the author may want it to match against a value with variance, rather than just the static string of characters .com, therefore I will talk about that more below, but to stay focused, we will work with .com for now.
It's necessary that we use a concept here that's called ZERO LENGTH ASSERTION. We need a "zero-length assertion" because we have a sub-string that is significant, but is not what we want to match against. "Zero-length assertions" allow us to move the point in the string where the regular expression engine is looking at, without having to match any characters to get there.
The Zero-Length Assertion that we are going to use is called LOOK AHEAD, and its syntax is as follows.
Look-ahead Syntax: (?=Your-SubStr-Here)
We are going to use the look ahead to match against a variance that comes before the pattern assigned to the look-ahead, which will be our sub-string. The result looks like this:
const regex = /\s+(?=\.com)/gi
I added the insensitive flag to tell the engine to not be concerned with the case of the letter, in other words; the regular expression /\s+(?=\.cOM)/gi
is the same as /\s+(?=\.Com)/gi, and both are the same as: /\s+(?=\.com)/gi &/or /\s+(?=.COM)/gi. Everyone of the "Just Listed" regular expressions are equivalent so long as the i flag is set.
That's it! The link HERE (REGEX101) will take you to an example where you can play with the regular expression if you like.
I mentioned above working with a sub-string that has more variance than .com.
You could use (\s*)(?=\.\w{3,}) for instance.
The problem with this regex, is even though it matches .txt, .org, .json, and .unclepetespurplebeet, the regex isn't safe. When using the question's string of...
"as.asd.sd fdsfs. dfsd d.sdfsd. sdfsdf sd .COM"
as an example, you can see at the LINK HERE (Regex101) there are 3 lines in the string. Those lines represent areas where the sub-string's lookahead's assertion returned true. Each time the assertion was true, a possibility for an incorrect final match was created. Though, only one match was returned in the end, and it was the correct match, when implemented in a program, or website, that's running in production, you can pretty much guarantee that the regex is not only going to fail, but its going to fail horribly and you will come to hate it.

You can try this. It will capture the last white space segment - in the first capture group.
(\s+)\.[^\.]*$

What do comma separated numbers in curly braces at the end of a regex mean?

I am trying to understand the following regex, I understand the initial part but I'm not able to figure out what {3,19} is doing here:
/[A-Z][A-Za-z0-9\s]{3,19}$/

That is the custom repetition operation known as the Quantifier.
\d{3} will find exactly three digits.
[a-c]{1,3} will find any occurrance of a, b, or c at least one time, but up to three times.
\w{0,1} means a word character will be optionally found. This is the same as placing a Question Mark, e.g.: \w?
(\d\w){1,} will find any combination of a Digit followed by a Word Character at least One time, but up to infinite times. So it would match 1k1k2k4k1k5j2j9k4h1k5k This is the same as a Plus symbol, e.g.: (\d\w)+
b{0,}\d will optionally find the letter b followed by a digit, but could also match infinite letter b's followed by a digit. So it will match 5, b5, or even bbbbbbb5. This is the same as an Asterisk. e.g.: b*\d
Quantifiers

They are 'quantifiers' - it means 'match previous pattern between 3 and 19 times'
When you are learning regular expressions, it's really use to play with them in an interactive tool which can highlight the matches. I've always liked a tool called Regex Coach, but it is Windows only. Plenty of online tools though - have a play with your regex here, for example.

{n,m} means "repeat the previous element at least n times and at most m times", so the expression
[A-Za-z0-9\s]{3,19} means "match between 3 and 19 characters that are letters, digits, or whitespace". Note that repetition is greedy by default, so this will try to match as many characters as possible within that range (this doesn't come into play here, since the end of line anchor makes it so that there is really only one possibility for each match).

The regular expression you have there /[A-Z][A-Za-z0-9\s]{3,19}$/ breaks up to mean this:
[A-Z] We are looking for a Capital letter
Followed by
[A-Za-z0-9\s]{3,19} a series of letters, digits, or white space that is between 3 and 19 characters
$ Then the end of the line.

It will have to match [A-Za-z0-9\s] between 3 and 19 times.
Here's a good regex reference guide:
http://www.regular-expressions.info/reference.html

what does comma separated numbers in curly brace at the end of regex means
It denotes Quantifier with the range specified in curly brace.
curly brace analogues to function with arguments. Where we can specify single integer or two integers which acts as a range between the two numbers.
/[A-Z][A-Za-z0-9\s]{3,19}$/
Using online regex websites we can get understand as follows:
https://regex101.com/

How to optimise this regex to match string (1234-12345-1)

I've got this RegEx example: http://regexr.com?34hihsvn
I'm wondering if there's a more elegant way of writing it, or perhaps a more optimised way?
Here are the rules:
Digits and dashes only.
Must not contain more than 10 digits.
Must have two hyphens.
Must have at least one digit between each hyphen.
Last number must only be one digit.
I'm new to this so would appreciate any hints or tips.
In case the link expires, the text to search is
----------
22-22-1
22-22-22
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
666666-7777777-1
88888888-88888888-1
1-1-1
88888888-88888888-22
22-333-
333-22
----------
My regex is: \b((\d{1,4}-\d{1,5})|(\d{1,5}-\d{1,4}))-\d{1}\b
I'm using this site for testing: http://gskinner.com/RegExr/
Thanks for any help,
Nick

Here is a regex I came up with:
(?=\b[\d-]{3,10}-\d\b)\b\d+-\d+-\d\b
This uses a look-ahead to validate the information before attempting the match. So it looks for between 3-10 characters in the class of [\d-] followed by a dash and a digit. And then after that you have the actual match to confirm that the format of your string is actually digit(dash)digit(dash)digit.
From your sample strings this regex matches:
22-22-1
333-333-1
333-4444-1
4444-4444-1
4444-55555-1
55555-4444-1
1-1-1
It also matches the following strings:
22-7777777-1
1-88888888-1

Your regexp only allows a first and second group of digits with a maximum length of 5. Therefore, valid strings like 1-12345678-1 or 123456-1-1 won't be matched.
This regexp works for the given requirements:
\b(?:\d\-\d{1,8}|\d{2}\-\d{1,7}|\d{3}\-\d{1,6}|\d{4}\-\d{1,5}|\d{5}\-\d{1,4}|\d{6}\-\d{1,3}|\d{7}\-\d{1,2}|\d{8}\-\d)\-\d\b
(RegExr)

You can use this with the m modifier (switch the multiline mode on):
^\d(?!.{12})\d*-\d+-\d$
or this one without the m modifier:
\b\d(?!.{12})\d*-\d+-\d\b
By design these two patterns match at least three digits separated by hyphens (so no need to put a {5,n} quantifier somewhere, it's useless).
Patterns are also build to fail faster:
I have chosen to start them with a digit \d, this way each beginning of a line or word-boundary not followed by a digit is immediately discarded. Other thing, using only one digit, I know the remaining string length.
Then I test the upper limit of the string length with a negative lookahead that test if there is one more character than the maximum length (if there are 12 characters at this position, there are 13 characters at least in the string). No need to use more descriptive that the dot meta-character here, the goal is to quickly test the length.
finally, I describe the end of string without doing something particular. That is probably the slower part of the pattern, but it doesn't matter since the overwhelming majority of unnecessary positions have already been discarded.

regex negative look-ahead for exactly 3 capital letters arround a char

im trying to write a regex finds all the characters that have
exactly 3 capital letters on both their sides
The following regex finds all the characters that have exactly 3 capital letters on the left side of the char, and 3 (or more) on the right:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})'
When trying to limit the right side to no more then 3 capitals using the regex:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3})(?![A-Z])'
i get no results, there seems to be a fail when adding the (?![A-Z]) to the first regex.
can someone explain me the problem and suggest a way to solve it?
Thanks.

You need to put the negative lookahead inside the positive one:
(?<![A-Z])[A-Z]{3}.(?=[A-Z]{3}(?![A-Z]))
You can do that with the lookbehind, too:
(?<=(?<![A-Z])[A-Z]{3}).(?=[A-Z]{3}(?![A-Z]))
It doesn't violate the "fixed-length lookbehind" rule because lookarounds themselves don't consume any characters.
EDIT (about fixed-length lookbehind): Of all the flavors that support lookbehind, Python is the most inflexible. In most flavors (e.g. Perl, PHP, Ruby 1.9+) you could use:
(?<=^[A-Z]{3}|[^A-Z][A-Z]{3}).
...to match a character preceded by exactly three uppercase ASCII letters. The first alternative - ^[A-Z]{3} - starts looking three positions back, while the second - [^A-Z][A-Z]{3} - goes back exactly four positions. In Java, you can reduce that to:
(?<=(^|[^A-Z])[A-Z]{3}).
...because it does a little extra work at compile time to figure out that the maximum lookbehind length will be four positions. And in .NET and JGSoft, anything goes; if it's legal anywhere, it's legal in a lookbehind.
But in Python, a lookbehind subexpression has to match a single, fixed number of characters. If you've butted your head against that limitation a few times, you might not expect something like this to work:
(?<=(?<![A-Z])[A-Z]{3}).
At least I didn't. It's even more concise than the Java version; how can it work in Python? But it does work, in Python and in every other flavor that supports lookbehind.
And no, there are no similar restrictions on lookaheads, in any flavor.

Taking out the positive lookahead worked for me.
(?<![A-Z])[A-Z]{3}(.)([A-Z]{3})(?![A-Z])
'ABCdDEF' 'ABCfDEF' 'HHHhhhHHHH' 'jjJJjjJJJ' JJJjJJJ
matches
ABCdDEF
ABCfDEF
JJJjJJJ

I'm not sure how the regexp engines should work with multiple lookahead assertions, but the one you're using may have its own opinion on that.
You could as well use a single assertion as follows:
'(?<![A-Z])[A-Z]{3}(.)(?=[A-Z]{3}[^A-Z])'
The same with lookbehind:
'(?<=[^A-Z][A-Z]{3})(.)(?=[A-Z]{3}[^A-Z])'
This will have a problem matching the pattern in the beginning and in the end of the line.
I can't think of a proper solution, but there can be a dirty trick: for instance, add a space (or something else) in the beginning and the end of the whole line, then perform the matching.
$ echo 'ABCdDEF ABCfDEF HHHhhhHHHH AAAaAAAbAAA jjJJJJjJJJ JJJjJJJ' | sed 's/.*/ & /' | grep -oP '(?<=[^A-Z][A-Z]{3})(\S)(?=[A-Z]{3}[^A-Z])'
d
f
a
b
j
Note that I changed (.) to (\S) in the middle, change it back if you want the space to match.
P.S. Are you solving The Python Challenge? :)

Since the look ahead pattern is the same as the look behind pattern, you could also use the continue anchor \G:
/(?:[A-Z]{3}|\G[A-Z]*)(.)[A-Z]{3}/
A match is returned if three capitals precede a single character or where the last match left off (optionally followed by other capitals).

Regex to match name1.name2[.name3]

I am trying to validate user id's matching the example:
smith.jack or smith.jack.s
In other words, any number of non-whitespace characters (except dot), followed by exactly one dot, followed by any number of non-whitespace characters (except dot), optionally followed by exactly one dot followed by any number of non-whitespace characters (except dot). I have come up with several variations that work fine except for allowing consecutive dots! For example, the following Regex
^([\S][^.]*[.]{1}[\S][^.]*|[\S][^.]*[.]{1}[\S][^.]*[.]{1}[\S][^.]*)$
matches "smith.jack" and "smith.jack.s" but also matches "smith..jack" "smith..jack.s" ! My gosh, it even likes a dot as a first character. It seems like it would be so simple to code, but it isn't. I am using .NET, btw.
Frustrating.

that helps?
/^[^\s\.]+(?:\.[^\s\.]+)*$/
or, in extended format, with comments (ruby-style)
/
^ # start of line
[^\s\.]+ # one or more non-space non-dot
(?: # non-capturing group
\. # dot something
[^\s\.]+ # one or more non-space non-dot
)* # zero or more times
$ # end of line
/x
you're not clear on how many times you can have dot-something, but you can replace the * with {1,3} or something, to specify how many repetitions are allowed.
i should probably make it clear that the slashes are the literal regex delimiter in ruby (and perl and js, etc).

^([^.\s]+)\.([^.\s]+)(?:\.([^.\s]+))?$

I'm not familiar with .NET's regexes. This will do what you want in Perl.
/^\w+\.\w+(?:\.\w+)?$/
If .NET doesn't support the non-capturing (?:xxx) syntax, use this instead:
/^\w+\.\w+(\.\w+)?$/
Note: I'm assuming that when you say "non-whitespace, non-dot" you really mean "word characters."

You are using the * duplication, which allows for 0 iterations of the given component.
You should be using plus, and putting the final .[^.]+ into a group followed by ? to represent the possibility of an extra set.
Might not have the perfect syntax, but something similar to the following should work.
^[^.\s]+[.][^.\s]+([.][^.\s]+)?$
Or in simple terms, any non-zero number of non-whitespace non-dot characters, followed by a dot, followed by any non-zero number of non-whitespace non-dot characters, optionally followed by a dot, followed by any non-zero number of non-whitespace non-dot characters.

I realise this has already been solved, but I find Regexpal extremely helpful for prototyping regex's. The site has a load of simple explanations of the basics and lets you see what matches as you adjust the expression.

[^\s.]+\.[^\s.]+(\.[^\s.]+)?
BTW what you asked for allows "." and ".."

I think you'd benefit from using + which means "1 or more", instead of * meaning "any number including zero".

(^.)+|(([^.]+)[.]([^.]+))+
But this would match x.y.z.a.b.c and from your description, I am not sure if this is sufficiently restrictive.
BTW: feel free to modify if I made a silly mistake (I haven't used .NET, but have done plently of regexs)

[^.\s]+\.[^.\s]+(\.([^\s.]+?)?
has unmatched paren. If corrected to
[^.\s]+\.[^.\s]+(\.([^\s.]+?))?
is still too liberal. Matches a.b. as well as a.b.c.d. and .a.b
If corrected to
[^.\s]+\.[^.\s]+(\.([^\s.]+?)?)
doesn't match a.b

^([^.\W]+)\.?([^.\W]+)\.?([^.\W]+)$
This should capture as described, group the parts of the id and stop duplicate periods

I took a slightly different approach. I figured you really just wanted a string of non-space characters followed by only one dot, but that dot is optional (for the last entry). Then you wanted this repeated.
^([^\s\.]+\.?)+$
Right now, this means you have to have at least one string of characters, e.g. 'smith' to match. You, of course could limit it to only allow one to three repetitions with
^([^\s\.]+\.?){1,3}$
I hope that helps.

RegexBuddy Is a good (non-free) tool for regex stuff

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to add formatting commas to a number in Google Refine - regex

Related

Regex to find last occurrence of pattern in a string

What do comma separated numbers in curly braces at the end of a regex mean?

How to optimise this regex to match string (1234-12345-1)

regex negative look-ahead for exactly 3 capital letters arround a char

Regex to match name1.name2[.name3]

Categories

Resources