How to split CSV line according to specific pattern - regex

In a .csv file I have lines like the following:
10,"nikhil,khandare","sachin","rahul",viru
I want to split each line on commas (,), but I don't want to split the text between double quotes (" "). If I split on commas, I get an array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.

Ideally, the character used to separate fields should not appear inside the fields themselves. If possible, use ; instead of , as the field separator in the CSV file; it will make your life easier. But if you're stuck with , as the separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']

If your lines always have such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses the input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) option if you already know them, as they are essentially equivalent to a state machine, just in another form.
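The two-state idea can be sketched in Python roughly as follows. The function name split_csv_line is made up for illustration, and the sketch assumes quotes are never escaped inside a field (no "" sequences); Python's standard csv module covers those cases.

```python
def split_csv_line(line, sep=",", quote='"'):
    """Split one line on sep, ignoring separators that appear inside quotes."""
    fields = []
    current = []
    in_quotes = False  # the only state: inside or outside quotes
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes  # toggle the state and drop the quote itself
        elif ch == sep and not in_quotes:
            fields.append("".join(current))  # separator outside quotes ends a field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))  # the last field has no trailing separator
    return fields


line = '10,"nikhil,khandare","sachin","rahul",viru'
print(split_csv_line(line))
# ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
```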

Related

Regex With Colons in Data

I have a text file which I'm looking to remove some data from. The data is separated using a colon ':' as the delimiter. There are approx 9 separations. The data after the 7th column is most often null and thus useless but the additional colons are still there.
An example line from the file would look like this:
column1:column2:column3:column4:column5:column6:column7:column8:column9:column10
I want to remove everything after column8. So the data to be removed would be:
:column9:column10
Could someone advise me how to do so in Regex?
I've been reading, and nowhere have I found a way to isolate a colon and the text following it after a given number of colons.
Any help you could offer would be much appreciated.
$_ = join ":", ( split /:/, $_, -1 )[0..7];
or
s/(?::[^:]*){2}\z//;
The following regex will keep the first 8 columns and discard all others.
s/^[^:]*(?::[^:]*){7}\K.*//;
Assumes simple single line records.
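For comparison, here is a rough Python sketch of the same two ideas (split and rejoin, or a regex that keeps the first 8 columns). Python's re module has no \K, so the sketch uses a capture group instead; the variable names are illustrative only.

```python
import re

line = "column1:column2:column3:column4:column5:column6:column7:column8:column9:column10"

# Split on ":" and rejoin only the first 8 fields.
kept_by_split = ":".join(line.split(":")[:8])

# Capture the first 8 columns and drop the rest of the line.
# Lines with fewer than 8 columns do not match and are left unchanged.
kept_by_regex = re.sub(r"^([^:]*(?::[^:]*){7}).*", r"\1", line)

print(kept_by_split)
print(kept_by_regex)
# both print: column1:column2:column3:column4:column5:column6:column7:column8
```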

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ # # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract a text string that is located between single quotes ' ' and contains lots of special characters, you can do it in general this way:
\'[\S\s]*?\'
In your case, to extract only the middle quoted string from caption': u'text I want to extract', u'likes': try this regex:
(?<=u\')[\V]*?(?=\'\,)
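As a quick check outside OpenRefine, the lookbehind/lookahead idea can be tried in Python. Python's re module does not support \V, so this sketch swaps in a lazy .*? (which also stays on one line); the sample cell is the one from the question.

```python
import re

cell = "caption': u'text I want to extract', u'likes':"

# Lazily match everything after the first u' up to the next ', sequence.
match = re.search(r"(?<=u').*?(?=',)", cell)
text = match.group(0) if match else None

print(text)
# text I want to extract
```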
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using the drop-down menu: Edit column > Split into several columns > by separator, with ' as the separator
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

Url from google: \75, reg exp

I have this piece of code, which I got from Google:
"
\u0438", '/webmasters/tools/backlinks-latest-dl?
hl\75ru\46siteUrl\75http://site.comt/\46security_token\75sMuiouiouWA-
TiuoiuoiuiocDo4:1489898898'); </script></div>
"
\75 is '=', but is it hexadecimal? Why doesn't Google use base64 or URL encoding?
When I use this regexp
/backlinks-latest-dl.*security_token\\\75/
everything is ok.
But when I put the code into a file and then parse the file, I can't find the data with that regexp. I can only find it using
/backlinks-latest-dl.*security_token(\\\)(75)/ HERE 3 '\' NOT 2
When I use
/backlinks-latest-dl.*security_token(\\\75)/
it doesn't work.
What is going on?
I could hardly get your point, but the problem could be that you use three backslashes instead of two or four. If you want to escape the initial backslashes, you should use four of them in a row (\\\\). Or you need only two of them if no additional escaping is needed. I just don't see any logic in using three backslashes.
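To illustrate the two-versus-four backslash point, here is a small Python sketch. It is only an analogy to the asker's environment, which is not stated; the page string is a shortened version of the sample above, and the decoding step assumes two-digit octal escapes such as \75 and \46.

```python
import re

# In an ordinary (non-raw) Python literal, "\75" is an octal escape for "=".
print("\75" == "=")  # True

# Raw string: the text as it sits in the file, with literal backslashes.
page = r"backlinks-latest-dl?hl\75ru\46siteUrl\75http://site.comt/\46security_token\75sMuiouiouWA"

# One literal backslash in the text needs \\ in the regex itself.
raw_pattern = r"backlinks-latest-dl.*security_token\\75(\w+)"
# Without a raw string, each of those backslashes must be doubled again.
plain_pattern = "backlinks-latest-dl.*security_token\\\\75(\\w+)"

token_raw = re.search(raw_pattern, page).group(1)
token_plain = re.search(plain_pattern, page).group(1)

# Decode two-digit octal escapes (assumed format, based on the sample).
decoded = re.sub(r"\\([0-7]{2})", lambda m: chr(int(m.group(1), 8)), page)

print(token_raw, token_plain)
print(decoded)
```

Two backslashes are what the regex engine needs to see; four are only required when the pattern first passes through a string literal that consumes one level of escaping.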

Notepad++ regular expression find a "," and replace odd "," with "." in every row

I recently ran into a mess with my data acquisition program, which saves four data points separated by commas (CSV format) every couple of milliseconds. I used a PC (NL region) where the decimal separator is a "," for data acquisition.
Now when I try to import my CSV file into MATLAB/Excel, it gives me 8 columns (there should be 4), as all the decimals are also printed with ",".
Is there a way to use a regular expression in Notepad++ (for example) to find all "," in a row and replace the odd ones with "."?
Thanks a lot for any help. I have thousands of rows of data, so doing it manually would take ages.
Example raw data:
0,000000,293,625871,331,588659,37,440656
0,049000,294,148003,332,215504,37,400764
0,098000,294,814740,332,944775,37,261284
0,145000,295,683491,333,688803,37,184621
0,193000,296,504183,334,271264,37,058032
0,241000,297,213232,334,704293,37,109150
0,289000,297,595142,335,081749,37,113087
0,339000,297,968663,335,292896,37,088883
0,403000,298,204013,335,796915,37,109307
How the processed data should look:
0.000000,293.625871,331.588659,37.440656
0.049000,294.148003,332.215504,37.400764
0.098000,294.814740,332.944775,37.261284
0.145000,295.683491,333.688803,37.184621
0.193000,296.504183,334.271264,37.058032
0.241000,297.213232,334.704293,37.109150
0.289000,297.595142,335.081749,37.113087
0.339000,297.968663,335.292896,37.088883
0.403000,298.204013,335.796915,37.109307
Just simply do:
Find what: (\d+),(\d+)
Replace with: $1.$2
Then click Replace All
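To match all odd commas, use a lookahead that asserts that an even number of commas follow before the end of the row:
,(?=(([^,]*,){2})*[^,]*$)
If the search runs over a multi-row file, use [^,\r\n] in place of [^,] so the lookahead cannot run past the end of the current row.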
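The lookahead can be checked in Python on the first two sample rows. This sketch applies the pattern one row at a time, so the lookahead never sees more than one row; the variable names are illustrative.

```python
import re

raw = """0,000000,293,625871,331,588659,37,440656
0,049000,294,148003,332,215504,37,400764"""

# A comma followed by an even number of commas before the end of the row.
ODD_COMMA = re.compile(r",(?=(?:(?:[^,]*,){2})*[^,]*$)")

# Process each row separately so "$" always means the end of that row.
fixed = [ODD_COMMA.sub(".", row) for row in raw.splitlines()]

for row in fixed:
    print(row)
# 0.000000,293.625871,331.588659,37.440656
# 0.049000,294.148003,332.215504,37.400764
```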

How to extract line numbers from a multi-line string in Vim?

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)
You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim variable is incorrectly formatted. To use only one backslash, you'd need to write your string with single quotes; with double quotes, you'd need two backslashes here.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")
One can take advantage of the substitute with an expression feature (see
:help sub-replace-\=) to run over all of the target matches, appending them
to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')
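For readers who want to check the expected result outside Vim, the same extraction can be sketched in Python. This is only a cross-check, not a Vimscript solution; the string a is written as a raw string so that it contains single literal backslashes, like the single-quoted Vim variable b above.

```python
import re

# Raw string: each \ below is one literal backslash in the data.
a = r"\%8l............\|\%11l..........\|\%17l.........\|\%20l...."

# Capture the digits that sit between a literal "\%" and "l".
numbers = re.findall(r"\\%(\d+)l", a)

print(numbers)
# ['8', '11', '17', '20']
```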