source txt file:
34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
command input:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/g
command output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1,985|6 (4)|China
Desired output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
Basically 1985 is supposed to be 1985 and not 1,985. I tried to put a \? so every time the pattern matches it stops and a °+ after so it has to detect a ° to match the pattern, but no success. It just replaces the ° and everything before that, complete mess.
My knowledge of regular expressions, especially combined with the substitute command, is weak, and I'm stuck here.
EDIT
The first 3 numbers represent heights of mountains; those 3 need a thousands separator (,). The last number (1985) represents a year, which must not be changed.
A numeric cutoff is not going to work as a loophole, since there are mountains with a height of less than 1900.
You haven't told us what the difference is between 1985 and the other numbers, so I assumed that your "small" numbers are less than 2000.
You almost got it:
:%s/\(\d*[2-90]\)\(\d\d\d\)/\1,\2/g
Alternatively, if that isn't what you want, you can use the c flag (:h s_flags) to confirm each substitution:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/gc
This line leaves the last 3 columns untouched and only substitutes in the content before them:
%s/\v(.*)((\|[^|]*){3}$)/\=substitute(submatch(1),'\v(\d+)(\d{3})','\1,\2','g').submatch(2)/g
Note that the above line will change 1000000 into 1000,000 instead of 1,000,000. Vim's printf() doesn't support %'d, which is a pity. If you do have numbers above one million, we can find other solutions.
update
I solved it myself by using 3 separate commands, one for every number string in the file:
%s/^\(\d*|[^|]*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
In case you want to use perl:
:%!perl -F'\|' -lane 'for (@F[2..4]) { s/(\d+)(\d{3})/$1,$2/ } print join "|", @F'
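Outside Vim, the same column-based idea can be sketched in Python: split on |, reformat only the height fields, and leave the year alone. This is a minimal sketch, assuming each record sits on a single line and the heights are the third to fifth fields (indexes 2 to 4); the name add_separators is made up for illustration. Python's "," format specifier also groups numbers above one million correctly.

```python
def add_separators(line, columns=(2, 3, 4)):
    """Insert thousands separators into the given 0-based pipe-delimited columns."""
    fields = line.split("|")
    for i in columns:
        # Only touch fields that consist purely of decimal digits.
        if i < len(fields) and fields[i].isdecimal():
            fields[i] = f"{int(fields[i]):,}"
    return "|".join(fields)


# Assumption: the record from the question, joined onto one line.
record = ("34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|"
          "30°26'19\"N 81°17'48\"E|Dhaulagiri|1985|6 (4)|China")
print(add_separators(record))
```

Because the year is never in the selected columns, no numeric cutoff is needed to protect it.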
Related
I’m currently running into some difficulties with splitting and regular expressions in a Google spreadsheet. I’m attempting to split the contents of a cell across a row, but only pulling out sequences of four consecutive digits (representing years) and only using cell formulas (not functions). Eventually, this formula would apply to an entire column, but I’ve limited it to a single cell for the time being. For example, given a cell “I2” with the contents:
2009; Library of Congress; 1939-1945; 23rd 1984; 16
I need a result (placed in “J2, K2, L2, M2, etc.”) like:
2009 1939 1945 1984
This sample cell is as representative as I’m aware of for various possibilities that are likely to come up, though the number of entries between semicolons varies from one to many. In my own attempts so far, I’ve ended up with two formulas that are close to what I need, but both fall short.
1) The first formula is:
=ArrayFormula(SPLIT(SUBSTITUTE(REGEXREPLACE(I$2, "[^\d\-\;]", ""),"-", ";"), ";"))
which achieves (in "J2, K2, L2, M2, N2"):
2009 1939 1945 231984 16
2) The second formula is:
=ArrayFormula(SPLIT(SUBSTITUTE(REGEXREPLACE(REGEXREPLACE(I$2, "[^\d]", ";"), "[^\d\-\;]", ""),"-", ";"), ";"))
which gets me (in "J2, K2, L2, M2, N2, O2"):
2009 1939 1945 23 1984 16
I’ve been trying to think of a way to limit the formula’s returns with "\d{4}", for example, but no combination or alterations I’ve made so far have been successful. Does anyone have any insight which would solve this problem?
The following seems to work, although I am no expert in Sheets, and there may be more efficient methods.
Apparently if you use capture groups, REGEXEXTRACT will return an array of values. This method, however, seems to require that you know the exact number of matches to be extracted.
So the following seems to work:
=REGEXEXTRACT($I2,REPT("(\b\d{4}\b).*?",(len($I2)-len(REGEXREPLACE($I2,"\b\d{4}\b","")))/4))
How it works:
First compute the number of matches in the string:
=(len(I2)-len(REGEXREPLACE(I2,"\b\d{4}\b","")))/4
Next, create a regex expression incorporating the regex the correct number of times:
REPT("(\b\d{4}\b).*?", ...Above_formula...)
And finally, we put it all together in our final formula above.
Of course, if you know that the number of matches will always be four (4), there is no need for constructing the regex string in this manner, you can just hard code it.
EDIT To eliminate unwanted zeros if there are no matches, test to see if there are any matches using REGEXMATCH, e.g.:
=ArrayFormula(if(REGEXMATCH($I2,"\b\d{4}\b"),(value(REGEXEXTRACT($I2,REPT("(\b\d{4}\b).*?",(len($I2)-len(REGEXREPLACE($I2,"\b\d{4}\b","")))/4)))),""))
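To make the counting trick easier to follow, here is a rough Python sketch of the same steps. It uses Python's re module rather than the RE2 engine behind Sheets, so treat it as an illustration of the logic only; the name extract_years is hypothetical.

```python
import re

YEAR = r"\b\d{4}\b"


def extract_years(text):
    # Count matches the way the formula does: characters removed, divided by 4.
    count = (len(text) - len(re.sub(YEAR, "", text))) // 4
    if count == 0:
        return []  # mirrors the REGEXMATCH guard in the EDIT
    # Equivalent of REPT("(\b\d{4}\b).*?", count): one capture group per match.
    pattern = rf"({YEAR}).*?" * count
    return list(re.search(pattern, text).groups())


print(extract_years("2009; Library of Congress; 1939-1945; 23rd 1984; 16"))
```

The repeated pattern has exactly as many capture groups as there are four-digit runs, which is why the number of matches must be known before the pattern is built.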
Use this formula, perhaps replacing the colon as split character with another character that's not likely to occur in source strings.
=filter(split(regexreplace(I$2, "\D+", ":"), ":"), len(split(regexreplace(I$2, "\D+", ":"), ":"))=4)
Explanation: it's a way around regex limitations in Google RE2 engine. Instead of looking for the pattern, we look for the anti-pattern (anything that is not digit) and replace it with the separator, then split. What remains is only substrings composed of digits, so we filter them so that only 4-character substrings remain.
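The anti-pattern idea can be sketched in Python as follows. This is only an illustration of the split-then-filter logic, with a made-up function name, not a translation of the Sheets formula.

```python
import re


def four_digit_tokens(text):
    # Split on runs of non-digits, then keep only the tokens of length 4.
    return [tok for tok in re.split(r"\D+", text) if len(tok) == 4]


print(four_digit_tokens("2009; Library of Congress; 1939-1945; 23rd 1984; 16"))
```

Note that this keeps any run of exactly four digits, even one glued to letters (such as the digits in "a1234b"), which a \b-based pattern would skip.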
I am close, but I need some help to complete a regex. Here is the goal:
Should succeed:
10.05
3.00
50
Should fail:
55.99 (>50)
3.001 (can't have the "1" at the end)
0.50 (< 3)
.99 (< 3)
$50 (can't have "$")
5.2 (if decimal, must have 2 digits after)
Here's the regex I have so far, but it doesn't quite do all the above correctly:
^([1-4][0-9]|50|[3-9])+(\.[0-9][0-9])?$
Can anyone share the answer? Thanks!
^(50(\.00)?|([1-4][0-9]|[3-9])(\.[0-9][0-9])?)$
There were two issues. Firstly, you had allowed non-zero values after the decimal point even when the value before it was 50, so I separated that case out at the top level. Secondly, just remove the +: with it, you can match much larger numbers (by chaining 50 and 43 together, for instance).
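A quick way to check the pattern against the cases from the question is a small Python harness. This is a sketch using Python's re module, which may differ in details from the engine the asker uses; the helper name is made up.

```python
import re

# The corrected pattern from above, unchanged.
PRICE = re.compile(r"^(50(\.00)?|([1-4][0-9]|[3-9])(\.[0-9][0-9])?)$")


def is_valid_price(text):
    return PRICE.match(text) is not None


should_succeed = ["10.05", "3.00", "50"]
should_fail = ["55.99", "3.001", "0.50", ".99", "$50", "5.2"]

for case in should_succeed + should_fail:
    print(case, is_valid_price(case))
```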
However, as Bergi mentioned in a comment, it would be better to just check the format, and do the range check separately (without regex). This would be the format check:
^\d+(\.\d\d)?$
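Splitting the check in two might look like this in Python: the regex only validates the format, and the range test is ordinary arithmetic. It is a minimal sketch; the function name and the use of Decimal are my assumptions, not part of the original suggestion.

```python
import re
from decimal import Decimal

# Format only: digits, optionally followed by exactly two decimals.
FORMAT = re.compile(r"[0-9]+(\.[0-9][0-9])?")


def is_valid_amount(text, low=Decimal("3"), high=Decimal("50")):
    if not FORMAT.fullmatch(text):
        return False
    # Range check done numerically instead of inside the regex.
    return low <= Decimal(text) <= high


for case in ["10.05", "3.00", "50", "55.99", "3.001", "0.50", ".99", "$50", "5.2"]:
    print(case, is_valid_amount(case))
```

Changing the limits later then means editing two numbers rather than rewriting the pattern.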
I found an online utility that returns a regex for integers when you input the lower and upper limits of the range you want. I used it for the part before the . with limits 3-50 and for the part after the . with limits 0-99. Here is the result:
^0*([3-9]|[1-4][0-9]|50)(\.[0-9]{2})?$
A quick glance... just remove the +
^([1-4][0-9]|50|[3-9])(\.[0-9][0-9])?$
You should remove the + before the potential cents. Also, you will need to handle 50 as a special case, because it can only have .00 after it and not any other cent amount.
Also, I changed the [0-9] to the shortcut for digits: \d
/^((0?[3-9]|[1-4]\d)(\.\d\d)?|50(\.00)?)$/
I am new to regex and struggling to create an expression to parse a csv containing 1 to n values. The values can be integers or real numbers. The sample inputs would be:
1
1,2,3,4,5
1,2.456, 3.08, 0.5, 7
This would be used in c#.
Thanks,
Jerry
Use a CSV parser instead of RegEx.
There are several options - see this SO question and its answers, and this one, for the different options (built into the BCL and third party libraries).
The BCL provides the TextFieldParser (within the VisualBasic namespace, but don't let that put you off it).
A third party library that is liked by many is filehelpers.
Using REGEX for CSV parsing has been a 10 year jihad for me. I have found it remarkably frustrating, due to the boundary cases:
Numbers come in a variety of forms (here in the US, Canada):
1
1.
1.0
1000
1000.
1,000
1e3
1.0e3
1.0e+3
1.0e+003
-1
-1.0 (etc)
But of course, Europe has traditionally been different with regard to commas and decimal points:
1
1,0
1000
1.000e3
1e3
1,0e3
1,0e+3
1,0e+003
Which just ruins everything. So, we ignore the German, French, and general Continental standard, because it is impossible to work out whether a comma is separating values or is part of a value. (The Continent likes TAB instead of COMMA as the separator.)
I'll assume that you're "just" looking for numerical values separated from each other by commas and possible space-padding. The expression:
\s*(\-?\d+(?:\.\d*)?(?:[eE][\-+]?\d*)?)\s*
is a pretty fair parser of A NUMBER. Catches just about every reasonable case. Doesn't deal with embedded commas though! It also trims off spaces on either side of the number.
From there, you can either build an iterative CSV string decomposer (walking each field, absorbing commas, assigning to an array, say), or use the scanf type function to do the same thing. I do prefer the iterative decomposition method - as it also allows you to parse out strings, hexadecimal, and virtually any other pattern you find in the data.
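An iterative decomposer along those lines could be sketched like this in Python (the question targets C#, so take it as an illustration of the walk, not drop-in code). The pattern is the one above, except that I require at least one exponent digit so every match converts cleanly; parse_numbers is a hypothetical name.

```python
import re

# The number pattern from above, with \d+ instead of \d* in the exponent.
NUMBER = re.compile(r"\s*(-?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)\s*")


def parse_numbers(text):
    values = []
    pos = 0
    while True:
        # Match one number (with optional padding) at the current position.
        m = NUMBER.match(text, pos)
        if m is None:
            raise ValueError(f"expected a number at offset {pos}")
        values.append(float(m.group(1)))
        pos = m.end()
        if pos == len(text):
            return values
        # Absorb the comma that separates this field from the next.
        if text[pos] != ",":
            raise ValueError(f"expected ',' at offset {pos}")
        pos += 1


print(parse_numbers("1,2.456, 3.08, 0.5, 7"))
```

As noted above, an input such as 1,000 is read as two fields here, because embedded commas are not handled.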
The regex you want is
@"([+-]?\d+(?:\.\d+)?)(?:$|,\s*)"
...from which you'll want capture group 1. However, don't use regex for something like this. String manipulation is much better when the input is in a very static, predictable format:
string[] nums = strInput.Split(", ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
List<float> results = (from n in nums
select float.Parse(n)).ToList();
If you do use regex, make sure you do a global capture.
I think you would have to loop it to check for an unknown number of ints... or else something like this:
/ *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) *,? *([0-9.]*) */
and you could keep that going ",?([0-9]*)" as far as you wanted to, to account for a lot of numbers. The result would be an array of numbers....
http://jsfiddle.net/8URvL/1/
I know this is a long shot, but I have a huge text file and I need to add a given number to other numbers matching some criteria.
Eg.
identifying text 1.1200
identifying text 1.1400
and I'd like to transform this (by adding say 1.15) to
identifying text 2.2700
identifying text 2.2900
Normally I'd do this in Python, but it's on a Windows machine where I can't install too many things. I've got Vim though :)
Here is a simplification and a fix on hobbs' solution:
:%s/identifying text \zs\d\+\(\.\d\+\)\=/\=(1.15+str2float(submatch(0)))/
Thanks to \zs, there is no need to recall the leading text. Thanks to str2float() a single addition is done on the whole number (in other words, 1.15 + 2.87 will give the expected result, 4.02, and not 3.102).
Of course this solution requires a recent version of Vim (7.3?)
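For comparison, on a machine where Python is available, the same "match, then compute the replacement" idea can be sketched with re.sub and a callback. This is an assumption-laden illustration: add_offset is a made-up name, and I use Decimal so the trailing zeros of the original format survive.

```python
import re
from decimal import Decimal

# Group 1 keeps the leading text, group 2 is the number to change.
PATTERN = re.compile(r"(identifying text )(\d+(?:\.\d+)?)")


def add_offset(text, offset="1.15"):
    delta = Decimal(offset)

    def bump(match):
        # One addition on the whole number, like str2float() in the Vim version.
        return match.group(1) + str(Decimal(match.group(2)) + delta)

    return PATTERN.sub(bump, text)


print(add_offset("identifying text 1.1200\nidentifying text 1.1400"))
```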
You can do a capturing regex and then use a vimscript expression as a replacement, something like
:%s/\(identifying text \)\(\d\+\)\.\(\d\+\)/
\=submatch(1) . (submatch(2) + 1) . "." . (submatch(3) + 1500)
(only without the linebreak).
Your number format seems to be a fixed one, so it's easy to convert to an integer and back: remove the dot, add 11500, and put the dot back.
:%s/\.//
:%normal11500^A " type C-V then C-a
:%s/....$/.&/
If you don't want to do that on all the lines but only on the ones which match 'identifying text', replace every % with 'g/identifying text/'.
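The arithmetic behind this trick, sketched in Python for clarity: treat 1.1200 as the integer 11200, add 11500, and re-insert the dot four digits from the end. The function name is hypothetical, and the sketch assumes non-negative numbers with exactly four decimals, as in the question.

```python
def add_fixed_point(number_text, delta_units=11500, decimals=4):
    # "1.1200" -> 11200, then add the offset expressed in the same units.
    units = int(number_text.replace(".", "", 1)) + delta_units
    # Pad so there is always at least one digit before the dot.
    digits = str(units).rjust(decimals + 1, "0")
    return digits[:-decimals] + "." + digits[-decimals:]


print(add_fixed_point("1.1200"))
print(add_fixed_point("1.1400"))
```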
For integers you can just use n^A to add n to a number (and n^X to subtract it). I doubt whether that works for fractional numbers though.
Well this might not be a solution for vim but I think awk can help:
cat testscript | LC_ALL=C awk '{printf "%s %s %s %s %.3f\n", $1,$2,$3,$4,$5+1.567 }'
and the test
this is a number 1.56
this is a number 2.56
this is a number 3.56
I needed the LC_ALL=C for the correct conversion of the floating point separator, and maybe there is a more elegant solution for printing the beginning/ rest of the string. And the result looks like:
this is a number 3.127
this is a number 4.127
this is a number 5.127
Using macro
qa .......................... start record macro 'a'
/iden<Enter> ................ search 'ident*' press Enter
2w .......................... jump 2 words until number one (before dot)
Ctrl-a ...................... increases the number
2w .......................... jump to number after dot
1500 Ctrl-a ................. perform increases 1500 times
q ........................... stop record to macro 'a'
If you have 300 lines with this pattern, just run:
300@a
I'm using SQL to replace credit card numbers with XXXX and finding that REGEXP_REPLACE does not consistently replace everything. Below is the SET clause I'm using in the SQL:
SET COMMENTS_LONG =
REGEXP_REPLACE (COMMENTS_LONG,'\D[1-6]\d{3}.\d{4}.\d{4}.\d{3}(\d{1}.\d{3})?|\D[1-6]\d{12,15}|\D[1-6]\d{3}.\d{3}.?\d{3}.\d{5}', ' XXXXXXXXXXXXXXXX')
Before
Elizabeth aclled to change address.5430-6000-2111-1931 A
After
Elizabeth aclled to change address XXXXXXXXXXXXXXXX1 A
I tried increasing the number of X's, but the result is the same. I also find that I have to put a space in front of the first X, as the match appears to start 1 character to the left.
I wouldn't make the regex too specific; that increases the chance of accidentally letting real numbers that don't match your expression pass through to the end user.
I would just use a simple regex like this:
(\d+-){3}\d+
btw: why did you include \D at the beginning? The . is not part of the creditcard number, right?
Edit: Just found this regex
\b(?:\d[ -]*?){13,16}\b
at this site: http://www.regular-expressions.info/creditcard.html
You should read the paragraph "Finding Credit Card Numbers in Documents"
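To see both behaviours side by side, here is a small Python sketch. It uses Python's re module, not the database's REGEXP_REPLACE, so engine details may differ; the names are made up. Tracing the original pattern on the sample suggests two things: the first alternative ends in \d{3}, so it consumes only 15 of the 16 digits, and the leading \D consumes the character before the number, which would explain the apparent shift to the left.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"\D[1-6]\d{3}.\d{4}.\d{4}.\d{3}(\d{1}.\d{3})?"
    r"|\D[1-6]\d{12,15}"
    r"|\D[1-6]\d{3}.\d{3}.?\d{3}.\d{5}"
)
# The simpler pattern quoted from regular-expressions.info.
CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def mask_cards(text, mask="XXXXXXXXXXXXXXXX"):
    return CARD.sub(mask, text)


before = "Elizabeth aclled to change address.5430-6000-2111-1931 A"
print(ORIGINAL.sub(" XXXXXXXXXXXXXXXX", before))  # leaves the last digit behind
print(mask_cards(before))
```

Because \b is zero-width, the second pattern does not swallow the character in front of the number, so no extra space is needed in the replacement.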