Notepad++ regular expression find a "," and replace odd "," with "." in every row - regex

I ran into a mess recently with my data acquisition program, which saves four data points separated by commas (CSV format) every couple of milliseconds. I used a PC with NL regional settings, where the decimal separator is a ",", for the data acquisition.
Now when I try to import my CSV file into MATLAB/Excel, it gives me 8 columns (there should be 4), because all the decimals are also written with ",".
Is there a way to use a regular expression in Notepad++ (for example) to find all "," in a row and replace the odd-numbered ones with "."?
Thanks a lot for any help. I have thousands of rows of data, so doing it manually would take ages.
Example raw data:
0,000000,293,625871,331,588659,37,440656
0,049000,294,148003,332,215504,37,400764
0,098000,294,814740,332,944775,37,261284
0,145000,295,683491,333,688803,37,184621
0,193000,296,504183,334,271264,37,058032
0,241000,297,213232,334,704293,37,109150
0,289000,297,595142,335,081749,37,113087
0,339000,297,968663,335,292896,37,088883
0,403000,298,204013,335,796915,37,109307
How the processed data should look:
0.000000,293.625871,331.588659,37.440656
0.049000,294.148003,332.215504,37.400764
0.098000,294.814740,332.944775,37.261284
0.145000,295.683491,333.688803,37.184621
0.193000,296.504183,334.271264,37.058032
0.241000,297.213232,334.704293,37.109150
0.289000,297.595142,335.081749,37.113087
0.339000,297.968663,335.292896,37.088883
0.403000,298.204013,335.796915,37.109307

Just simply do:
Find what: (\d+),(\d+)
Replace with: $1.$2
Then click Replace All (with Search Mode set to Regular expression).

To match only the odd-numbered commas, use a lookahead that asserts that an even number of commas follow on the rest of the line:
,(?=(([^,]*,){2})*[^,]*$)
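Outside Notepad++, both patterns can be checked with a few lines of Python. This is a minimal sketch, assuming every row holds exactly four values written with decimal commas (seven commas per row); the function names are made up for illustration.

```python
import re

# Pattern 1: digits, a comma, digits -> join the two digit runs with a dot.
# Matches do not overlap, so each number consumes only its own decimal comma.
PAIR = re.compile(r"(\d+),(\d+)")

# Pattern 2: a comma that is followed by an even number of commas up to the
# end of the row (the lookahead from the answer above).
ODD_COMMA = re.compile(r",(?=(([^,]*,){2})*[^,]*$)")


def fix_row_pairs(row):
    return PAIR.sub(r"\g<1>.\g<2>", row)


def fix_row_lookahead(row):
    return ODD_COMMA.sub(".", row)


row = "0,049000,294,148003,332,215504,37,400764"
print(fix_row_pairs(row))
print(fix_row_lookahead(row))
```

Applying the functions one row at a time keeps the lookahead from counting commas that belong to a following row.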

Related

SUM multiple values after a substring within all cells in a column in Google Sheets

For an open source chat analyser in Google Sheets, I need to extract all numeric values after a substring (Example), then total them.
For example, if a cell contains Example1 another text 123 Example500 text, Example1 and Example500 should be extracted out, and their numeric values summed to 501.
This is complicated further by needing to obtain the total for a column of messages.
What I've tried already:
=REGEXEXTRACT(A1, "Example(\d+)"): This only extracts the first matching value, but works!
=SUM(SPLIT(A1, "Example")): This works for messages that only include my target string, but falls apart when other strings are included. The output could possibly be filtered to results that start with a number, but this is very messy and possibly a red herring.
CONCATENATEing all my cells together, then searching for numbers. This is error-prone due to additional numbers within messages.
Another idea is to replace each Example(\d+) with $1 (the captured digits) followed by a space, and to replace every other character with just a space, using the alternation Example(\d+)|. (regex101 demo). This works because $1 is unset whenever the right side of the alternation matches. Then split on spaces and sum the numbers (any other digits have already been removed). If Example is a placeholder, replace it with e.g. [[:alpha:]]+ for one or more alphabetic characters.
=IF(ISTEXT(A1);SUM(SPLIT(REGEXREPLACE(A1;"Example(\d+)|.";"$1 ");" "));0)
I added IF(ISTEXT(A1);...) so that only text in the source cell is processed (to avoid errors); if the cell is empty or does not contain text, the result is 0. Remove it if the cell always contains text.
Edit from @TheMaster: As an array formula, we can use BYROW:
=BYROW(A:A; LAMBDA(row; IF(ISTEXT(row); SUM(SPLIT(
REGEXREPLACE(row;"Example(\d+)|.";"$1 ");" "));)))
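To cross-check the sheet's totals outside Google Sheets, the same extraction can be sketched in Python. The function name and the sample messages are assumptions for illustration; only the Example(\d+) pattern comes from the question.

```python
import re


def sum_after_prefix(message, prefix="Example"):
    # Collect every run of digits that directly follows the prefix and add them up.
    pattern = re.escape(prefix) + r"(\d+)"
    return sum(int(digits) for digits in re.findall(pattern, message))


messages = [
    "Example1 another text 123 Example500 text",
    "no match here 42",
    "Example7",
]
per_message = [sum_after_prefix(m) for m in messages]
print(per_message)       # one total per message (per cell)
print(sum(per_message))  # total for the whole column
```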
try:
=LAMBDA(x, REGEXEXTRACT(A1, "(\w+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\w+\d+")),
REGEXEXTRACT(x, "\w+(\d+)"), )))(SPLIT(A1, " "))
update 1:
=LAMBDA(x, REGEXEXTRACT(A1, "(\D+)\d+")&
SUMPRODUCT(IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), )))(SPLIT(A1, " "))
update 2:
=INDEX(LAMBDA(xx, REGEXEXTRACT(xx, "(\D+)\d+")&
BYROW(LAMBDA(x, IF(IFERROR(REGEXMATCH(x, "\D+\d+")),
REGEXEXTRACT(x, "\D+(\d+)"), ))(SPLIT(xx, " ")), LAMBDA(x, SUMPRODUCT(x))))
(A1:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
if you start from A2 just change A1: to A2:

Get just X number of strings from a comma separated cell

I'm having real trouble finding a succinct solution to this simple problem. Currently I have cells which contain many comma-separated items. I just want the first 5.
ie. cell A1 =
text, another string, something else, here's another one, guess what another string here, and another, hello i'm another string, another string etc, etc, etccccc
and I'm trying to grab just the first 5 strings.
Beyond that, I wonder if I can incorporate a formula such as =LEN(A1)>20
Currently I do this with numerous formulas: =IFERROR(INDEX( SPLIT(C31,","),1)), then =IFERROR(INDEX( SPLIT(C31,","),2)), etc., and then run the LEN formula above.
Is there a simpler solution? Thanks so much.
Try,
=split(replace(A1, find("|", SUBSTITUTE(A1, ", ", "|", 5)), len(A1), ""), ", ", false)
For Excel, with data in A1, in B1 enter:
=TRIM(MID(SUBSTITUTE($A1,",",REPT(" ",999)),COLUMNS($A:A)*999-998,999))
and copy across:
To get all 5 substrings into a single cell, use:
=LEFT(A1,FIND(CHAR(1),SUBSTITUTE(A1,",",CHAR(1),5))-1)
=ARRAY_CONSTRAIN(SPLIT(A1,","),1,5)
=REGEXEXTRACT(A1,"((?:.*?,){5})")
=REGEXEXTRACT(A1,REPT("(.*?),",5))
SPLIT to split by delimiter
ARRAY_CONSTRAIN to constrain the array
REGEXEXTRACT to extract 5 comma separated values
. Any character
.*?, Any character repeated unlimited number of times (? as little as possible) followed by a ,
{5} Quantifier
REPT to repeat strings
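For comparison, the split-and-constrain idea and the lazy regex can be sketched in Python. The variable names are illustrative, and the sample cell is a shortened version of the one in the question.

```python
import re

cell = ("text, another string, something else, here's another one, "
        "guess what another string here, and another, hello i'm another string")

# Split approach (like SPLIT + ARRAY_CONSTRAIN): split on commas, keep five items.
first_five = [item.strip() for item in cell.split(",")[:5]]

# Regex approach (like "((?:.*?,){5})"): lazily match five "anything up to a
# comma" chunks. The captured prefix still ends with the fifth comma.
match = re.match(r"((?:.*?,){5})", cell)
prefix = match.group(1) if match else cell

# The length check from the question, applied to each extracted item.
long_items = [item for item in first_five if len(item) > 20]

print(first_five)
print(prefix)
print(long_items)
```

The regex variant only matches when the cell contains at least five commas, which is why the sketch falls back to the whole cell otherwise.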

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} form what is known as a capture group, which puts whatever was captured into a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion: it peeks at the next character to check that it's not a digit (because otherwise this could be a number of 6 or more digits whose first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
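The behaviour of the lookahead is easy to try out in isolation. The following Python sketch uses the same pattern; the sample line is modelled on the anonymised data in the question, and the second call is an invented counter-example with a six-digit number.

```python
import re

# A space, exactly five captured digits, and no further digit right after them.
ZIP = re.compile(r" (\d{5})(?!\d)")


def tab_before_zip(line):
    # Replace the space in front of the ZIP code with a tab, keeping the digits.
    return ZIP.sub("\t" + r"\g<1>", line)


print(repr(tab_before_zip("999 E 99TH ST CITY ST 99999 999/999-9999")))
print(repr(tab_before_zip("ORDER 123456 SHIPPED")))  # six digits: left alone
```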

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If your lines always have such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses the input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) option if you already know them, as they are essentially equivalent to a state machine, just in another form.
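The two-state parser described above could look roughly like this in Python. It is a minimal sketch that assumes quotes are never escaped or doubled inside a field; the function name is made up for illustration.

```python
def split_csv_line(line):
    fields = []
    current = []
    in_quotes = False  # the state: are we inside a quoted field?
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # switch state; the quote itself is dropped
        elif ch == "," and not in_quotes:
            fields.append("".join(current))  # a comma outside quotes ends the field
            current = []
        else:
            current.append(ch)  # any other character (or a quoted comma) is data
    fields.append("".join(current))
    return fields


print(split_csv_line('10,"nikhil,khandare","sachin","rahul",viru'))
```

For real-world files, Python's standard csv module also handles cases this sketch ignores, such as doubled quotes inside quoted fields.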

Format all IP-Addresses to 3 digits

I'd like to use the search & replace dialogue in UltraEdit (Perl Compatible Regular Expressions) to format a list of IPs into a standard Format.
The list contains:
192.168.1.1
123.231.123.2
23.44.193.21
It should be formatted like this:
192.168.001.001
123.231.123.002
023.044.193.021
The RegEx from http://www.regextester.com/regular+expression+examples.html for IPv4 in the PCRE-Format is not working properly:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]){3}$
I'm stuck. Does anybody have a proper solution which works in UltraEdit?
Thanks in advance!
Set the regular expression engine to Perl (in the advanced section) and replace this:
(?<!\d)(\d\d?)(?!\d)
with this:
0$1
twice. That should do it.
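Why two passes are enough can be seen by running the same replacement in Python. This is a sketch under the assumption that the text contains nothing but IP addresses; the function name is illustrative.

```python
import re

# One or two digits that are not part of a longer run of digits.
SHORT_OCTET = re.compile(r"(?<!\d)(\d\d?)(?!\d)")


def pad_ip(ip):
    # Pass 1: "1" -> "01" and "23" -> "023". Pass 2: "01" -> "001".
    # Three-digit octets never match, so they are left untouched.
    for _ in range(2):
        ip = SHORT_OCTET.sub(r"0\g<1>", ip)
    return ip


for address in ["192.168.1.1", "123.231.123.2", "23.44.193.21"]:
    print(pad_ip(address))
```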
If your input is a single IP address (per line) and nothing else (no other text), this approach will work:
I used "Replace All" with Perl style regular expressions:
Replace (?<!\d)(?=\d\d?(?=[.\s]|$))
with 0
Just replace as often as it matches. If there is other text, things will get more complicated. Maybe the "Search in Column" option is helpful here, in case you are dealing with CSV.
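The "replace as often as it matches" step can be written as a loop. The Python sketch below uses the same zero-width pattern, which matches the empty position in front of a one- or two-digit octet and inserts a 0 there; the function name is an assumption.

```python
import re

# Empty position that is not preceded by a digit and is followed by one or two
# digits, which in turn are followed by a dot, whitespace, or the end of the text.
GAP = re.compile(r"(?<!\d)(?=\d\d?(?=[.\s]|$))")


def pad_ip_lookahead(ip):
    while True:
        padded = GAP.sub("0", ip)
        if padded == ip:  # nothing left to pad
            return ip
        ip = padded


print(pad_ip_lookahead("192.168.1.1"))
print(pad_ip_lookahead("23.44.193.21"))
```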
If this is just a one-off data cleaning job, I often just use Excel or OpenOffice Calc for this type of thing:
Open your textfile and make sure only one IP address per line.
Open Excel or whatever and go to "Data|Import External Data" and import your textfile using "." as the separator.
You should now have 4 columns in Excel:
192 | 168 | 1 | 1
Right click and format each column as a number with 3 digits and leading zeroes.
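In column 5, concatenate the previous columns with a "." in between each. Cell formatting does not carry over into a concatenation, so wrap each value in TEXT to keep the leading zeroes:
=TEXT(A1,"000") & "." & TEXT(B1,"000") & "." & TEXT(C1,"000") & "." & TEXT(D1,"000")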
This obviously is a cheap and dirty fix and is not a programmatic way of dealing with this, but I find this sort of technique useful for cleaning up data every now and then.
I'm not sure how you can use a regular expression in the Replace With box in UltraEdit.
You can use this regular expression to find your string:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])$
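Outside UltraEdit, the four capture groups of this pattern can be used directly to rebuild the address. The Python sketch below is one possible way to do that, not an UltraEdit feature; the names are illustrative.

```python
import re

# The octet alternative from the pattern above, repeated four times with dots.
OCTET = r"(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
IPV4 = re.compile("^" + r"\.".join([OCTET] * 4) + "$")


def pad_valid_ip(line):
    match = IPV4.match(line)
    if match is None:
        return line  # not a valid IPv4 address: leave the line untouched
    # Left-pad each captured octet with zeroes to a width of three.
    return ".".join(octet.zfill(3) for octet in match.groups())


print(pad_valid_ip("23.44.193.21"))
print(pad_valid_ip("256.1.1.1"))  # 256 is out of range, so no change
```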