I have a large set of data I need to clean with OpenRefine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ # -
Each cell has the same format:
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract a text string that includes lots of special characters and is located between single quotes ' ', you can do it in general this way:
\'[\S\s]*?\'
In your case, if you want to extract only the middle quoted string from caption': u'text I want to extract', u'likes': , try this regex:
(?<=u\')[\V]*?(?=\'\,)
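As a rough Python sketch of the lookbehind/lookahead idea above: Python's re module has no \V escape, so .*? stands in for [\V]*? here (that substitution is my assumption, not part of the answer), and the variable names are made up for illustration.

```python
import re

# Sample cell content, copied from the question.
cell = "caption': u'text I want to extract', u'likes':"

# (?<=u') requires u' immediately before the match,
# (?=',) requires ', immediately after it.
# .*? replaces [\V]*? because Python's re does not know \V.
match = re.search(r"(?<=u').*?(?=',)", cell)
caption = match.group(0) if match else None
print(caption)
```

Because the quantifier is lazy, the match stops at the first ', that follows the opening u'.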
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine:
Using the drop-down menu:
Edit column
Split into several columns
By separator, with Separator set to '
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]
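To see why index 2 is the right piece, here is a plain Python analogue of splitting on the single quote. This is only an illustration of the idea, not GREL, and it assumes the caption itself contains no apostrophe (otherwise the pieces shift).

```python
# Sample cell content, copied from the question.
cell = "caption': u'text I want to extract', u'likes':"

# Splitting on ' yields:
#   0: caption
#   1: ": u"
#   2: the caption text
#   3: ", u"
#   4: likes
#   5: ":"
parts = cell.split("'")
extracted = parts[2]
print(extracted)
```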
I have searched but only found Python and related solutions.
I have a string like
"Hello 'how' are % you?"
which I want to convert to the following by removing everything except numbers and letters:
Hello how are you
I am using REGEXREPLACE as follows, but I am not sure what the replacement should be or whether it is the right approach:
=REGEXREPLACE(B2 , "([^A-Za-z0-9]+)")
The main things I want to remove from the string are characters like " or strange symbols.
Can anyone help?
You can use:
=TRIM(REGEXREPLACE(B2,"[\W_]+"," "))
Or, include the space in your character class:
=REGEXREPLACE(B2,"[\W_ ]+"," ")
Where: \W is short for [^A-Za-z0-9_], so to also remove the underscore we added it to the character class.
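For comparison, the same character class can be tried out in Python. This is only a sketch of the regex behaviour, not a Sheets formula; re.ASCII is passed so that \W means [^A-Za-z0-9_] as described above (without it, Python treats accented letters as word characters).

```python
import re

# The sample string from the question, including its double quotes.
text = "\"Hello 'how' are % you?\""

# Replace every run of non-alphanumeric characters (and underscores)
# with a single space, then trim the ends, like TRIM(REGEXREPLACE(...)).
cleaned = re.sub(r"[\W_]+", " ", text, flags=re.ASCII).strip()
print(cleaned)
```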
You can also use:
=TRIM(REGEXREPLACE(A1, "'|%|""", ))
I have a string that is delimited by a comma.
The first 3 fields are static.
Fields 4-20 are dynamic and can contain any string even if it has special characters but cannot be empty.
Field 21 is static
Field 22 is dynamic and can contain any string even if it has special characters.
Fields 23,24 are static.
I need to make sure the string matches the above criteria and is a match, but am wondering on how to make fields 4-20 have the option of containing the special characters and not be blank. (Total of 17 between 4-20)
If I remove the requirement of the special characters this seems to work:
Field1\,Field2\,Field3\,+([\w\s\,]+)F21/C\,[\w\s\,]+(F/23\,)(Field24)
with this string
Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24
Is there a way to accomplish this with fields 4-20 having special characters and not being empty like "" or " " or am I pushing it too far?
I know I can parse it through c# but I'm experimenting with Regex and it seems pretty powerful.
Thanks
I did not fully understand the problem, but I think this is what you want:
s1,s2,s3,([^ ,]+,){17}s21,[^ ,]+,s23,s24
Replace each sX with the relevant static field.
Example:
https://regex101.com/r/EaAPKH/1
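A minimal Python sketch of the same idea, using the static field names from the question. Note that [^ ,]+ rejects any field containing a space (such as 6f 1 in the sample), so this variant instead requires at least one non-whitespace character per field. Treating field 22 as possibly empty is my assumption, and the names NON_BLANK, PATTERN and is_valid are hypothetical.

```python
import re

# One dynamic field: anything except commas, with at least one
# non-whitespace character, so "" and " " are both rejected.
NON_BLANK = r"[^,]*[^,\s][^,]*"

PATTERN = re.compile(
    r"Field1,Field2,Field3,"      # fields 1-3 (static)
    rf"(?:{NON_BLANK},){{17}}"    # fields 4-20 (dynamic, must not be blank)
    r"F21/C,"                     # field 21 (static)
    r"[^,]*,"                     # field 22 (dynamic; assumed it may be empty)
    r"F/23,Field24"               # fields 23-24 (static)
)

def is_valid(line: str) -> bool:
    """Return True if the whole line matches the 24-field layout."""
    return PATTERN.fullmatch(line) is not None

sample = ("Field1,Field2,Field3,F4,f5,6f 1,f72,f8,F9,F10,F1,f12,f13,f14,"
          "f15,f16,f17,f18,f19,f20,F21/C,F22,F/23,Field24")
print(is_valid(sample))
```

Since a dynamic field can never contain a comma here, the {17} count cannot drift: a missing or blank field makes the whole match fail rather than shifting the later fields.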
The data I want to parse has columns with the following format:
Character Big Medium Meaning ImageCode Small Constitutens Lesson Frame Strokes JH JTPL Heisig Story koohiiStory1 koohiiStory2 On-Reading Kun-Reading Examples:
All of those are separated by tabs \t (even though it may not look like it in the browser). Also notice that at the end of each line there is a colon :. The problem is that the columns koohiiStory2 and Examples may or may not exist, and there may also be cases in which the data is corrupt and there is a tab inside Heisig Story, but those are the minority.
What I'm trying to match is the values for On-Reading, Kun-Reading and Examples. All of these are distinct from the rest because they don't use standard English characters (romaji) but Japanese characters instead, with the exception of perhaps a few commas or dots. It is also guaranteed that either Kun-Reading or Examples will end with a colon :, that On-Reading and Kun-Reading will exist, and that all three columns will be consecutive.
Here is some sample data.
How can I parse that to return this?
Alright, I'll give it a shot.
The content you expect is mostly non-ASCII characters sitting between a dot followed by a space or tab and a colon, so something like this should work:
(?<=\.(\s|\t)) // Positive lookbehind for a 'dot' + 'space or tab'
[^\w]+ // Any non words
(?=\:) // Positive lookahead for a ':'
Working sample on regex101
In a .csv file I have lines like the following:
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
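As an alternative to the regular expression, Python's standard csv module already treats double-quoted fields as single values. A minimal sketch, assuming the line follows ordinary CSV quoting rules:

```python
import csv

line = '10,"nikhil,khandare","sachin","rahul",viru'

# csv.reader accepts any iterable of lines; here it is a one-element list.
# Quoted fields keep their embedded commas and lose the surrounding quotes.
fields = next(csv.reader([line]))
print(fields)
```

For a whole file, the open file object can be passed to csv.reader directly instead of a list.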
If your data always has such a simple structure, you can discard the leading number and comma and then split on "," (yes, including the quotes).
If not, you can use a very simple state machine that parses the input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) option if you already know them, as they are essentially an equivalent of a state machine in another form.
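The two-state approach could look roughly like this in Python. It is a sketch only: the function name split_outside_quotes is made up, and it assumes quotes are never escaped inside a field (no "" sequences).

```python
def split_outside_quotes(line, sep=",", quote='"'):
    """Split line on sep, ignoring separators that appear inside quotes."""
    fields = []
    current = []
    in_quotes = False  # state: False = outside quotes, True = inside
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes   # switch state, drop the quote itself
        elif ch == sep and not in_quotes:
            fields.append("".join(current))  # separator ends the field
            current = []
        else:
            current.append(ch)          # ordinary character, keep it
    fields.append("".join(current))     # last field has no trailing separator
    return fields

result = split_outside_quotes('10,"nikhil,khandare","sachin","rahul",viru')
print(result)
```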
I want to tokenize text in my database with RegEx and store the resulting tokens in a table. First I want to split the words by spaces, and then each token by punctuation.
I'm doing this in my application, but executing it in the database might speed it up.
Is it possible to do this?
There are a number of functions for tasks like that.
To retrieve the 2nd word of a text:
SELECT split_part('split this up', ' ', 2);
Split the whole text and return one word per row:
SELECT regexp_split_to_table('split this up', E'\\s+');
(Actually, the last example splits on any stretch of whitespace.)
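For comparison with what the application side might be doing, here is a small Python sketch of the two-stage tokenization described in the question (split on whitespace, then split each word on punctuation). The sample text and variable names are made up for illustration, and this is not a statement about how the database functions are implemented.

```python
import re

text = "Hello, world! It's fine."

# Stage 1: split on any stretch of whitespace, like the E'\\s+' pattern above.
words = re.split(r"\s+", text.strip())

# Stage 2: split each word on runs of punctuation and drop empty pieces.
tokens = [piece
          for word in words
          for piece in re.split(r"[^\w\s]+", word)
          if piece]
print(tokens)
```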