Regex to handle malformed delimited files - regex

I am trying to find a regular expression that will not match a delimiter if it is wrapped in double quotes. But it must also be able to handle values that have a single double quote. I have the first part down with the below expression where DELIMITER could be just about anything but is mainly commas, pipes, and double pipes:
DELIMITER(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)
This handles a properly formed CSV row like apple, "banana, and orange", grape. I can split on the delimiter and get the values:
['apple', 'banana, and orange', 'grape']
My problem is that I may encounter a line like apple, "banana, and orange, grape. In this case I would want to get the values:
['apple', '"banana', 'and orange', 'grape']
However, I get:
['apple, "banana', 'and orange', 'grape']
It basically ignores all of the commas up to the double quote.
The logic that I have in my head is that I want to ignore a comma if it is preceded by a double quote, but only if it has a double quote in front of it as well. My first thought was to play around with a look-behind, but I can't get that to work because look-behinds cannot handle quantifiers (correct me if this is wrong).
I am using Qt QRegExp which I understand is more or less similar to the Perl regex engine. Please let me know if there is more information that I can provide. I know regular expressions can be finicky based on your setup, and I hope I have explained what I'm looking for well enough!

It's not Qt, but boost::tokenizer, which is header-only, supports escaped delimited text formats.
From the example usage at the Boost docs: http://www.boost.org/doc/libs/1_60_0/libs/tokenizer/escaped_list_separator.htm
// simple_example_2.cpp
#include <iostream>
#include <boost/tokenizer.hpp>
#include <string>
int main() {
    using namespace std;
    using namespace boost;
    string s = "Field 1,\"putting quotes around fields, allows commas\",Field 3";
    tokenizer<escaped_list_separator<char> > tok(s);
    for (tokenizer<escaped_list_separator<char> >::iterator beg = tok.begin(); beg != tok.end(); ++beg) {
        cout << *beg << "\n";
    }
}
In the malformed case, tok returns a single token, which isn't what you're looking for. Since you're looking for non-standard[1] parsing, consider writing a small state machine instead of a regular expression.
[1] As much as there is a standard for delimited text.
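As a rough illustration of the hand-written approach, here is a minimal Python sketch (not Qt, and not part of Boost). The function name split_lenient and its rules are assumptions made for this example: a quote only opens a quoted section if a closing quote exists later on the line, otherwise it is kept as a literal character, and surrounding whitespace is stripped from every field.

```python
def split_lenient(line, delimiter=","):
    """Split a line on `delimiter`, honouring quotes only when they are balanced.

    Hypothetical helper for illustration: an unmatched double quote is kept
    as an ordinary character instead of swallowing the rest of the line.
    """
    fields = []
    current = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '"':
            close = line.find('"', i + 1)
            if close == -1:
                # No closing quote anywhere: treat this quote as literal text.
                current.append(ch)
                i += 1
            else:
                # Balanced pair: copy the quoted content without the quotes.
                current.append(line[i + 1:close])
                i = close + 1
        elif line.startswith(delimiter, i):
            # Delimiter outside quotes ends the current field.
            fields.append("".join(current).strip())
            current = []
            i += len(delimiter)
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current).strip())
    return fields


print(split_lenient('apple, "banana, and orange", grape'))
print(split_lenient('apple, "banana, and orange, grape'))
```

Because the delimiter is matched with startswith, multi-character delimiters such as || should work the same way under these assumptions.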

Related

Extract a text string with regex

I have a large set of data I need to clean with open refine.
I am quite bad with regex and I can't think of a way to get what I want,
which is extracting a text string between quotes that includes lots of special characters like " ' / \ @ # -
In each cell, it has the same format
caption': u'text I want to extract', u'likes':
Any help would be highly appreciated!
If you want to extract a text string that includes lots of special characters and is located between quotes ' ', you can do it in general this way:
\'[\S\s]*?\'
In your case, to extract only the middle quoted string from caption': u'text I want to extract', u'likes':, try this regex:
(?<=u\')[\V]*?(?=\'\,)
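For readers who want to try the lookaround idea outside OpenRefine, here is a minimal Python sketch. It is an adaptation rather than the exact pattern above: Python's re module does not support \V, so a plain . (which does not match newlines by default) is assumed to be an acceptable substitute. The variable names are made up for this example.

```python
import re

# Sample cell in the format shown in the question.
cell = "caption': u'text I want to extract', u'likes':"

# Lazily match everything after "u'" up to the first "'," that follows.
match = re.search(r"(?<=u').*?(?=',)", cell)
caption = match.group(0) if match else None
print(caption)
```

The lazy quantifier stops at the first ', it meets, so a caption that itself contains that two-character sequence would be cut short.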
We designed OpenRefine with a few smart functions to handle common cases such as yours without using Regex.
Two other cool ways to handle this in OpenRefine.
Using the drop-down menu:
Edit column > Split into several columns > by separator, with ' as the separator
Using smartSplit
(string s, optional string sep)
returns: array
Returns the array of strings obtained by splitting s with separator sep. Handles quotes properly. Guesses tab or comma separator if "sep" is not given.
value.smartSplit("'")[2]

How to stop Ember.Handlebars.Utils.escapeExpression escaping apostrophes

I'm fairly new to Ember, but I'm on v1.12 and struggling with the following problem.
I'm making a template helper
The helper takes the bodies of tweets and wraps HTML anchors around the hashtags and usernames.
The paradigm I'm following is:
1. use Ember.Handlebars.Utils.escapeExpression(value); to escape the input text
2. do logic
3. use Ember.Handlebars.SafeString(value);
However, step 1 seems to escape apostrophes, which means that any sentences I pass to it come back with escaped characters. How can I avoid this whilst making sure that I'm not introducing potential vulnerabilities?
Edit: Example code
export default Ember.Handlebars.makeBoundHelper(function(value){
// Make sure we're safe kids.
value = Ember.Handlebars.Utils.escapeExpression(value);
value = addUrls(value);
return new Ember.Handlebars.SafeString(value);
});
Where addUrls is a function that uses a regex to find and replace hashtags or usernames. For example, if it were given #emberjs foo, it would return <a href="...">#emberjs</a> foo.
The result of the above helper function would be displayed in an Ember (HTMLBars) template.
escapeExpression is designed to convert a string into the representation which, when inserted in the DOM, with escape sequences translated by the browser, will result in the original string. So
"1 < 2"
is converted into
"1 &lt; 2"
which when inserted into the DOM is displayed as
1 < 2
If "1 < 2" were inserted directly into the DOM (eg with innerHTML), it would cause quite a bit of trouble, because the browser would interpret < as the beginning of a tag.
So escapeExpression converts ampersands, less than signs, greater than signs, straight single quotes, straight double quotes, and backticks. The conversion of quotes is not necessary for text nodes, but could be for attribute values, since they may be enclosed in either single or double quotes while also containing such quotes.
Here's the list used:
var escape = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#x27;",
    "`": "&#x60;"
};
I don't understand why the escaping of the quotes should be causing you a problem. Presumably you're doing the escapeExpression because you want characters such as < to be displayed properly when output into a template using normal double-stashes {{}}. Precisely the same thing applies to the quotes. They may be escaped, but when the string is displayed, it should display fine.
Perhaps you can provide some more information about input and desired output, and how you are "printing" the strings and in what contexts you are seeing the escaped quote marks when you don't want to.
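To make the round trip concrete, here is a small Python analogy. It uses the standard library's html.escape and html.unescape as stand-ins. They are not Ember's escapeExpression, and unescape only approximates what a browser does when it renders the text, but the principle should be the same: the escaped form looks different in the source and decodes back to the original string, apostrophe included.

```python
import html

original = "it's 1 < 2"

# Escape for insertion into HTML; quote=True also escapes ' and ".
escaped = html.escape(original, quote=True)

# Roughly what the browser does when it renders the escaped text.
restored = html.unescape(escaped)

print(escaped)
print(restored == original)
```

If the escaped apostrophe shows up literally on the page, the string is probably being escaped twice or modified after escaping, which is why the input, output, and display context matter.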

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; as the field separator in the CSV file; it will make your life easier. But if you're stuck with , as the separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If your data always has such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses your input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) option if you already know them, as they are basically equivalent to a state machine, just in another form.
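The two-state idea can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: the separator is a single character, quotes are always balanced, and quotes never appear as escaped data inside a field. The name split_csv_line is made up for this example.

```python
def split_csv_line(line, sep=","):
    """Split one line on `sep`, ignoring separators that appear inside double quotes."""
    fields = []
    current = []
    in_quotes = False  # the only state: inside or outside a quoted section
    for ch in line:
        if ch == '"':
            # A quote toggles the state and is not copied to the output.
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            # A separator outside quotes ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_csv_line('10,"nikhil,khandare","sachin","rahul",viru'))
```

For real files, Python's built-in csv module covers the same ground and also handles doubled quotes inside fields, so a hand-rolled loop like this is mostly useful when the format deviates from what csv expects.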

Parsing as string of data but leaving out quotes

I need to use RegEx to run through a string of text but only return the parts that I need. Let's say for example the string is as follows:
1234,Weapon Types,100,Handgun,"This is the text, "and", that is all."""
\d*,Weapon Types,(\d*),(\w+), gets me most of the way, however it is the last part that I am having an issue with. Is there a way for me to capture the rest of the string i.e.
"This is the text, "and", that is all."""
without picking up the quotes? I've tried negating them, however it just stops the string at the quote.
Please keep in mind that the text for this string is unknown so doing literal matches will not work.
You've given us something very difficult to solve. It's okay that you have nested commas inside your string. Once we come across a double quote, we can ignore everything until the end quote. This would gobble up the commas.
But how will your parser know that the next double quote isn't ending the string? How does it know that it is a nested double quote?
If I could slightly modify your input string to make it clear what is a nested quote, then parsing is easy...
var txt = "1234,Weapon Types,100,Handgun,\"This is the text, 'and', that is all.\",other stuff";
var m = Regex.Match(txt, @"^\d*,Weapon Types,(\d*),(\w+),""([^""]+)""");
MessageBox.Show(m.Groups[3].Value);
But if your input string must have nested quotes like that, then we must come up with some other rule for detecting what is the real end of the string. How about this?
var txt = "1234,Weapon Types,100,Handgun,\"This is the text, \"and\", that is all.\",other stuff";
var m = Regex.Match(txt, @"^\d*,Weapon Types,(\d*),(\w+),""(.+)"",");
MessageBox.Show(m.Groups[3].Value);
The result is...
This is the text, "and", that is all.
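The same rule can be tried in Python. This is a sketch of the greedy approach, assuming (as in the C# sample) that the quoted field is followed by a comma and more data: the greedy .+ first runs to the end of the line and then backtracks to the last quote that is followed by a comma, so the nested quotes stay inside the captured group.

```python
import re

txt = '1234,Weapon Types,100,Handgun,"This is the text, "and", that is all.",other stuff'

# Group 3 is greedy, so it extends to the LAST '",' on the line.
m = re.match(r'^\d*,Weapon Types,(\d*),(\w+),"(.+)",', txt)
if m:
    print(m.group(3))
```

If the quoted field is the last one on the line, the trailing comma in the pattern would need to become an end-of-line anchor instead.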

Regex to replace string with another string in MS Word?

Can anyone help me with a regex to turn:
filename_author
to
author_filename
I am using MS Word 2003 and am trying to do this with Word's Find-and-Replace. I've tried the "Use wildcards" feature but haven't had any luck.
Am I only going to be able to do it programmatically?
Here is the regex:
([^_]*)_(.*)
And here is a C# example:
using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
String test = "filename_author";
String result = Regex.Replace(test, @"([^_]*)_(.*)", "$2_$1");
}
}
Here is a Python example:
from re import sub
test = "filename_author"
result = sub('([^_]*)_(.*)', r'\2_\1', test)
Edit: In order to do this in Microsoft Word using wildcards use this as a search string:
(<*>)_(<*>)
and replace with this:
\2_\1
Also, please see Add power to Word searches with regular expressions for an explanation of the syntax I have used above:
The asterisk (*) returns all the text in the word.
The less than and greater than symbols (< >) mark the start and end of each word, respectively. They ensure that the search returns a single word.
The parentheses and the space between them divide the words into distinct groups: (first word) (second word). The parentheses also indicate the order in which you want search to evaluate each expression.
Here you go:
s/^([a-zA-Z]+)_([a-zA-Z]+)$/\2_\1/
Depending on the context, that might be a little greedy.
Search pattern:
([^_]+)_(.+)
Replacement pattern:
$2_$1
In .NET you could use ([^_]+)_([^_]+) as the regex and then $2_$1 as the substitution pattern, for this very specific type of case. If you need more than 2 parts it gets a lot more complicated.
Since you're in MS Word, you might try a non-programming approach. Highlight all of the text, select Table -> Convert -> Text to Table. Set the number of columns at 2. Choose Separate Text At, select the Other radio, and enter an _. That will give you a table. Switch the two columns. Then convert the table back to text using the _ again.
Or you could copy the whole thing to Excel, construct a formula to split and rejoin the text and then copy and paste that back to Word. Either would work.
In C# you could also do something like this.
string[] parts = "filename_author".Split('_');
return parts[1] + "_" + parts[0];
You asked about regex of course, but this might be a good alternative.