Put thousand separator using REGEX replace method - regex

I have to insert thousands separators into numbers. Here is what I have so far:
Input string
1852
2589653
586699
8542.28
The find pattern
(?<=\d)(?=(?:\d{3})+(?!\d))
replace-with
,
result
1,852
2,589,653
586,699
8,542.28
TODO
I want to exclude all years in the range 1700 to 2010 from the match collection.
Does anyone have an idea? All suggestions are welcome. Thanks in advance.

This is not a good way to use regular expressions.
Instead, use the string formatting features of your language:
1) Use a simple regular expression to find the numbers, if you have to extract them from text.
2) Convert them to floating point or integer numbers (as appropriate).
3) Use a string format specifier to say you would like them output with a thousands separator.
For example, here's a shell transcript where I extract a number from a string and format it with a comma thousands separator (Python 2.7 or later, where the ',' format option is available):
In [12]: import re
In [13]: number_pattern = re.compile(r"\d+(?:\.\d+)?") #positive integer or floating point number
In [14]: mystring = "The size of the rocket is 3141592.6."
In [15]: number_string = number_pattern.search(mystring).group() #extract the number as a string
In [16]: number_string
Out[16]: '3141592.6'
In [18]: number = float(number_string) #convert to number
In [19]: '{:,}'.format(number) #format with thousands separator
Out[19]: '3,141,592.6'
Doing it this way also makes eliminating ranges of numbers trivial.
if (number >= 1700) and (number <= 2010):
    pass #do something, e.g. leave the number unformatted
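The steps above can be combined into a single pass with re.sub and a callback. This is only a minimal sketch in Python 3, not the answerer's code: the helper name add_separators and the sample sentence are made up for illustration, and the range 1700 to 2010 is taken from the question. Note that a plain number such as 1852 cannot be told apart from a year this way, so it is left untouched.

```python
import re

# Integer or decimal number: digits, optionally followed by a dot and more digits.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def add_separators(text, first_year=1700, last_year=2010):
    """Insert thousands separators, skipping integers that look like years."""
    def fmt(match):
        token = match.group()
        if "." in token:
            # Format only the integer part so the fraction is kept verbatim.
            whole, frac = token.split(".")
            return "{:,}".format(int(whole)) + "." + frac
        value = int(token)
        if first_year <= value <= last_year:
            return token  # assumed to be a year: leave unchanged
        return "{:,}".format(value)
    return NUMBER.sub(fmt, text)

print(add_separators("In 1999 we sold 2589653 units for 8542.28 each."))
# In 1999 we sold 2,589,653 units for 8,542.28 each.
```

Formatting the integer part separately, rather than converting to float, avoids changing the digits after the decimal point.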

Here is a RegEx Example in PowerShell:
[Regex]::Replace(12345, '[0-9](?=(?:[0-9]{3})+(?![0-9]))', '$0''')
12'345
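For comparison, here is a hedged Python 3 sketch of the same lookahead idea. The helper name separate is made up for illustration. It also shows a caveat that applies to the pattern in the question as well: digits after a decimal point get grouped too once the fraction has four or more digits.

```python
import re

# Same idea as the PowerShell pattern: a digit followed by one or more
# complete groups of three digits, with no further digit after them.
pattern = re.compile(r"[0-9](?=(?:[0-9]{3})+(?![0-9]))")

def separate(text, sep="'"):
    # Re-emit the matched digit and append the separator after it.
    return pattern.sub(lambda m: m.group() + sep, text)

print(separate("12345"))      # 12'345
print(separate("8542.28"))    # 8'542.28
print(separate("1234.5678"))  # 1'234.5'678  (the fraction is grouped too)
```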

Related

How to make TfidfVectorizer only learn alphabetical characters as part of the vocabulary (exclude numbers)

I'm trying to extract a vocabulary of unigrams, bigrams, and trigrams using SkLearn's TfidfVectorizer. This is my current code:
max_df_param = .003
use_idf = True
vectorizer = TfidfVectorizer(max_df = max_df_param, stop_words='english', ngram_range=(1,1), max_features=2000, use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
unigrams = vectorizer.get_feature_names()
vectorizer = TfidfVectorizer(max_df = max_df_param, stop_words='english', ngram_range=(2,2), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
bigrams = vectorizer.get_feature_names()
vectorizer = TfidfVectorizer(max_df = max_df_param, stop_words='english', ngram_range=(3,3), max_features=max(1, int(len(unigrams)/10)), use_idf=use_idf)
X = vectorizer.fit_transform(dataframe[column])
trigrams = vectorizer.get_feature_names()
vocab = np.concatenate((unigrams, bigrams, trigrams))
However, I would like to avoid numbers and words that contain numbers and the current output contains terms such as "0
101
110
12
15th
16th
180c
180d
18th
190
1900
1960s
197
1980
1b
20
200
200a
2d
3d
416
4th
50
7a
7b"
I tried to include only words made of alphabetical characters, using the token_pattern parameter with the following regex:
vectorizer = TfidfVectorizer(max_df = max_df_param,
token_pattern=u'(?u)\b\^[A-Za-z]+$\b',
stop_words='english', ngram_range=(1,1), max_features=2000, use_idf=use_idf)
but this returns: ValueError: empty vocabulary; perhaps the documents only contain stop words
I have also tried removing only the numbers, but I still get the same error.
Is my regex incorrect, or am I using the TfidfVectorizer incorrectly? (I have also tried removing the max_features argument.)
Thank you!
That's because your regex is wrong.
1) Your pattern contains ^ and $, which anchor the start and end of the string. A token could then only match if the whole string consisted of letters alone (no numbers, no spaces, no other special characters). You don't want that, so remove them. (As written, \^ is escaped, so it actually demands a literal caret character.)
See the details about special characters here: https://docs.python.org/3/library/re.html#regular-expression-syntax
2) You are writing the pattern as a normal (non-raw) string, so Python processes the backslashes before the regex engine sees them: \b becomes a backspace character rather than a word boundary. Either double the backslashes or use the r prefix.
3) The u prefix marks a unicode string. Unless your pattern contains special unicode characters, it is not needed either.
See more about that here: Python regex - r prefix
So finally your correct token_pattern should be:
token_pattern=r'(?u)\b[A-Za-z]+\b'
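The effect of the r prefix and of the corrected pattern can be checked with the standard library alone, without scikit-learn. This is just an illustrative sketch; the sample sentence is made up.

```python
import re

# In a normal string literal "\b" is the backspace character (one char);
# in a raw string it stays backslash + "b", which the regex engine reads
# as a word boundary.
print(len("\b"), len(r"\b"))   # 1 2

token_pattern = r"(?u)\b[A-Za-z]+\b"
sample = "In the 1960s the 3d model was 180c hot"   # made-up sample text
tokens = re.findall(token_pattern, sample)
print(tokens)   # ['In', 'the', 'the', 'model', 'was', 'hot']
```

Terms such as 1960s, 3d and 180c are skipped entirely, because there is no word boundary between a digit and a letter.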

SAS compress: keep numbers before |

One of my variables was coded as a messy mix of numeric values, text, parentheses and so on. I only need to extract the numeric value at the start, recorded as e.g. 12345 (not limited to a specific number of digits; it could be anything from an (n-k)-digit to an n-digit number), which is followed by || and then a description that might itself contain numeric values. When I applied the SAS compress function, newvar = compress(oldvar, '', 'a'), newvar ended up with ALL the numbers from oldvar, so it looks like 12345|||(789)|| etc. The number of '|' signs (is this a control character indicating line breaks etc.?) varies, though.
I only need to extract the first numeric value before the '|' sign. Any help, please?
Thanks in advance.
Use the SCAN() function to extract the values. It will result in a character value and converting to a numeric should be straightforward.
new_var = input(scan(old_var, 1, "|"), best12.);
This should do it:
substr("12345||45||89||...",1,find("12345||45||89||...","|",1)-1)
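For readers who want to check the logic outside SAS, the same idea (take everything before the first '|' and convert it to a number) can be sketched in Python. The helper name leading_number is hypothetical, and the sketch assumes the text before the first '|' contains only digits.

```python
def leading_number(value):
    """Return the integer that precedes the first '|' (hypothetical helper)."""
    first_field = value.split("|", 1)[0]   # text before the first '|'
    return int(first_field)

print(leading_number("12345|||(789)||"))   # 12345
```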

regex interval with possible characters before and after number VBA

I'm trying to produce a regular expression that can identify a number within an interval in a string in VBA. Sometimes this number has characters around it, other times not (inconsistent notation from a supplier). The expression should identify that 1413 in the three examples below is within the number range 500-2000 (or alternatively that it's not in the number range 0-50 or 51-499).
Example:
Test 12/2014. Tot.flow:1413 m3 or
Test 12/2014. Tot.flow:1413m3 or
Test 12/2014. Tot.flow: 1413
These strings have some identifiers:
there will always be a colon before the number
there may be a white space between the colon and the number
there may be a white space between the number and the m3
m3 is not necessarily always present, and if not, the number is at the end of the string
So far, my attempt at a regex that finds the number range is ([5-9][0-9][0-9]|[1]\d{3}|2000), but this matches all three-digit numbers as well (2001 gives a match on 200). However, I understand that I'm missing a couple of concepts needed to reach the ultimate goal here. I guess my problems are as follows:
How to start the interval at something not being zero (found lots of questions on intervals starting on zero)
How to take into account the variations in notation both for flow: and m3?
I'm only interested in checking that the number lies within the number range. This is driving me bonkers, all help is highly appreciated!
You can just extract the number with regExp.Replace() using the following regex:
^.*:\s*(\d+).*$
The replacement part is $1.
Then, use an ordinary numeric comparison to check whether the value is in the expected range (e.g. If CLng(result) > 499 And CLng(result) < 2001 Then ...).
Test macro:
Dim re As RegExp, tgt As String, src As String
Set re = New RegExp
With re
.pattern = "^.*:\s*(\d+).*$"
.Global = False
End With
src = "Test 12/2014. Tot.flow: 1413"
tgt = re.Replace(src, "$1")
MsgBox (CLng(tgt) > 499 And CLng(tgt) < 2001)
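The same extract-then-compare approach can be sketched in Python for quick experimentation outside VBA. The function name flow_in_range and its default bounds are made up for illustration; the pattern is the one from the macro above.

```python
import re

def flow_in_range(text, low=500, high=2000):
    """Extract the number after the colon and range-check it numerically."""
    match = re.match(r"^.*:\s*(\d+).*$", text)
    if match is None:
        return False          # no "colon + number" found at all
    return low <= int(match.group(1)) <= high

for s in ["Test 12/2014. Tot.flow:1413 m3",
          "Test 12/2014. Tot.flow:1413m3",
          "Test 12/2014. Tot.flow: 1413"]:
    print(s, flow_in_range(s))   # True for all three
```

Keeping the range check in ordinary arithmetic means the bounds can change without touching the regex.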
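You can try with:
:\s?([5-9]\d\d|1\d{3}|2000)\s?(m3|$)
Also, your regex ([5-9][0-9][0-9]|[1]\d{3}|2000) is fine in my opinion once it is anchored like this between the colon and the m3 (or the end of the string): it then no longer matches numbers below 500 or above 2000.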
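A regex-only range check of this kind can be tried out in Python as follows. This is a sketch under the assumption that $ is used for the end-of-string case; the sample strings other than the three from the question are made up.

```python
import re

# Colon, optional space, a number in 500-2000, optional space,
# then either "m3" or the end of the string.
in_range = re.compile(r":\s?([5-9]\d\d|1\d{3}|2000)\s?(?:m3|$)")

samples = [
    "Test 12/2014. Tot.flow:1413 m3",   # in range
    "Test 12/2014. Tot.flow:1413m3",    # in range
    "Test 12/2014. Tot.flow: 1413",     # in range
    "Test 12/2014. Tot.flow:499 m3",    # too small
    "Test 12/2014. Tot.flow: 2001",     # too large
    "Test 12/2014. Tot.flow:5001m3",    # too large, must not match on 500
]
results = [bool(in_range.search(s)) for s in samples]
print(results)   # [True, True, True, False, False, False]
```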

matlab - extracting numbers from (odd) string

I have a series of strings in a CSV file; they all look like the two below:
7336598,"[4125420656L, 2428145712L, 1820029797L, 1501679119L, 1980837904L, 380501274L]"
7514340,"[507707719L, 901144614L, 854823005L]"
....
How can I extract the numbers in it?
As in, to retrieve 7336598, 4125420656, etc.
I tried textscan and regexp, but without much success.
Sorry for the beginner's question, and thank you for having a look! :)
Edit: the size of each line is variable.
You can use textread and regexp to extract only the numbers from your CSV file:
C = textread('file.csv', '%s', 'delimiter', '\n');
C = regexp(C, '\d+', 'match');
The regular expression is quite simple. In MATLAB's regexp pattern, \d denotes a digit, and the + indicates that this digit must occur at least once. The 'match' option tells regexp to return the matched strings.
The result is a cell array of strings. You can go further and convert the strings to numerical values:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false)
The result is still stored in a cell array. If you can guarantee that there's the same amount of numerical values in each row, you can convert the cell array to a matrix:
A = cell2mat(C);
I don't have MATLAB to test with, but doesn't '[0-9]+' do the job?
It works for me outside MATLAB:
echo '7336598,"[4125420656L, 2428145712L, 1820029797L, 1501679119L, 1980837904L, 380501274L]"' | grep -o '[0-9]\+'
7336598
4125420656
2428145712
1820029797
1501679119
1980837904
380501274
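The same digit-run extraction can be sketched in Python, which also shows how rows of different lengths can be kept as separate lists. The two lines are the ones from the question; the variable names are made up.

```python
import re

lines = [
    '7336598,"[4125420656L, 2428145712L, 1820029797L, 1501679119L, 1980837904L, 380501274L]"',
    '7514340,"[507707719L, 901144614L, 854823005L]"',
]

# One list of integers per input line; the trailing "L" is simply not matched.
rows = [[int(n) for n in re.findall(r"\d+", line)] for line in lines]

print(rows[1])   # [7514340, 507707719, 901144614, 854823005]
```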

How to print an integer with a thousands separator in Matlab?

I would like to turn a number into a string using a comma as a thousands separator. Something like:
x = 120501231.21;
str = sprintf('%0.0f', x);
but with the effect
str = '120,501,231.21'
If the built-in fprintf/sprintf can't do it, I imagine a cool solution could be made using regular expressions, perhaps by calling Java (which I assume has some locale-based formatter), or with a basic string-insertion operation. However, I'm not an expert in either Matlab regexps or calling Java from Matlab.
Related question: How can I print a float with thousands separators in Python?
Is there any established way to do this in Matlab?
One way to format numbers with thousands separators is to call the Java locale-aware formatter. The "formatting numbers" article at the "Undocumented Matlab" blog explains how to do this:
>> nf = java.text.DecimalFormat;
>> str = char(nf.format(1234567.890123))
str =
1,234,567.89
where the char(…) converts the Java string to a Matlab string.
voilà!
Here's the solution using regular expressions:
%# 1. create your formatted string
x = 12345678;
str = sprintf('%.4f',x)
str =
12345678.0000
%# 2. use regexprep to add commas
%# flip the string to start counting from the back
%# and make use of the fact that Matlab regexp matches don't overlap
%# The three parts of the regex are
%# (\d+\.)? - looks for any number of digits followed by a dot
%# before starting the match (or nothing at all)
%# (\d{3}) - a packet of three digits that we want to match
%# (?=\S+) - requires that there's at least one non-whitespace character
%# after the match to avoid results like ",123.00"
str = fliplr(regexprep(fliplr(str), '(\d+\.)?(\d{3})(?=\S+)', '$1$2,'))
str =
12,345,678.0000
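The flip-and-replace trick is not specific to Matlab. As a hedged illustration, here is the same regular expression applied in Python; the function name group_thousands is made up, and the input is assumed to be an already formatted, non-negative number string.

```python
import re

def group_thousands(number_string):
    """Insert commas by matching digit triples in the reversed string."""
    reversed_s = number_string[::-1]
    grouped = re.sub(
        r"(\d+\.)?(\d{3})(?=\S+)",
        # Group 1 (the reversed fraction plus dot) is absent after the first match.
        lambda m: (m.group(1) or "") + m.group(2) + ",",
        reversed_s,
    )
    return grouped[::-1]

print(group_thousands("12345678.0000"))  # 12,345,678.0000
print(group_thousands("123.00"))         # 123.00
```

Working on the reversed string lets the regex count triples from the decimal point outward, and the optional first group swallows the fractional digits so they are never split.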