Regex to Split by comma and remove text delimiters [duplicate]


I have a text file that is in a comma separated format, delimited by " on most fields. I am trying to get that into something I can enumerate through (Generic Collection, for example). I don't have control over how the file is output nor the character it uses for the delimiter.
In this case, the fields are separated by a comma and text fields are enclosed in " marks. The problem I am running into is that some fields have quotation marks in them (e.g. 8" Tray), and the parser treats such a quote as the start of the next field. Numeric fields don't have quotes around them, but they do start with a + or a - sign (indicating a positive or negative number).
I was thinking of a RegEx, but my skills aren't that great so hopefully someone can come up with some ideas I can try. There are about 19,000 records in this file, so I am trying to do it as efficiently as possible. Here are a couple of example rows of data:
"00","000000112260 ","Pie Pumpkin ","RET","6.99 "," ","ea ",+0000000006.99000
"00","000000304078 ","Pie Apple caramel ","RET","9.99 "," ","ea ",+0000000009.99000
"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200
There are a lot more fields, but you can get the picture....
I am using VB.NET and I have a generic List set up to accept the data. I have tried using CSVReader and it seems to work well until it hits a record like the third one (with a quote in the text field). If I could somehow get it to handle the additional quotes, then the CSVReader option would work great.
Thanks!

I recommend looking at the TextFieldParser class in .NET. You need to include
Imports Microsoft.VisualBasic.FileIO
Here's a quick sample:
Dim afile As New TextFieldParser(FileName)
Dim CurrentRecord As String() ' this array will hold each line of data
afile.TextFieldType = FieldType.Delimited
afile.Delimiters = New String() {","}
afile.HasFieldsEnclosedInQuotes = True
' parse the actual file
Do While Not afile.EndOfData
    Try
        CurrentRecord = afile.ReadFields()
    Catch ex As MalformedLineException
        ' break into the debugger on a line the parser cannot handle
        Stop
    End Try
Loop

From here:
Encoding fileEncoding = GetFileEncoding(csvFile);
// get rid of all doublequotes except those used as field delimiters
string fileContents = File.ReadAllText(csvFile, fileEncoding);
string fixedContents = Regex.Replace(fileContents, @"([^\^,\r\n])""([^$,\r\n])", @"$1$2");
using (CsvReader csv =
    new CsvReader(new StringReader(fixedContents), true))
{
    // ... parse the CSV
}

As this link says... Don't roll your own CSV parser!
Use TextFieldParser as Avi suggested. Microsoft has already done this for you. If you ended up writing one, and you find a bug in it, consider replacing it instead of fixing the bug. I did just that recently and it saved me a lot of time.

Give a look to the FileHelpers library.

You could give CsvHelper (a library I maintain) a try and it's available via NuGet. It follows the RFC 4180 standard for CSV. It will be able to handle any content inside of a field including commas, quotes, and new lines.
CsvHelper is simple to use, but it's also easy to configure it to work with many different types of delimited files.
CsvReader csv = new CsvReader( streamToFile );
IEnumerable<MyObject> myObjects = csv.GetRecords<MyObject>();
If you want to read CSV files on a lower level, you can use the parser directly, which will return each row as a string array.
var parser = new CsvParser( myTextReader );
while( true )
{
    string[] line = parser.ReadLine();
    if( line == null )
    {
        break;
    }
}

I am posting this as an answer so I can explain how I did it and why.... The answer from Mitch Wheat was the one that gave me the best solution for this case and I just had to modify it slightly due to the format this data was exported in.
Here is the VB Code:
Dim fixedContents As String = Regex.Replace(
    File.ReadAllText(csvFile, fileEncoding),
    "(?<!,)("")(?!,)",
    AddressOf ReplaceQuotes)
The RegEx that was used is what I needed to change because certain fields had non-escaped quotes in them and the RegEx provided didn't seem to work on all examples. This one uses 'Look Ahead' and 'Look Behind' to see if the quote is just after a comma or just before. In this case, they are both negative (meaning show me where the double quote is not before or after a comma). This should mean that the quote is in the middle of a string.
In this case, instead of doing a direct replacement, I am using the function ReplaceQuotes to handle that for me. The reason I am using this is because I needed a little extra logic to detect whether it was at the beginning of a line. If I would have spent even more time on it, I am sure I could have tweaked the RegEx to take into consideration the beginning of the line (using MultiLine, etc) but when I tried it quickly, it didn't seem to work at all.
With this in place, using CSV reader on a 32MB CSV file (about 19000 rows), it takes about 2 seconds to read the file, perform the regex, load it into the CSV Reader, add all the data to my generic class and finish. Real quick!!
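The ReplaceQuotes function itself is not shown above, so the following is only a minimal Python sketch of the same idea, not the original code. It uses the same lookbehind/lookahead pattern, and a hypothetical callback `replace_quotes` that leaves a quote alone when it sits at the start or end of a line. Doubling the inner quote (the RFC 4180 escape) is an assumption made here so that a standard CSV reader keeps the character; the original may have removed or replaced it instead.

```python
import csv
import io
import re

# A double quote that is neither directly after nor directly before a comma.
INNER_QUOTE = re.compile(r'(?<!,)"(?!,)')


def replace_quotes(match: re.Match) -> str:
    """Hypothetical stand-in for ReplaceQuotes: escape only quotes inside a field."""
    text = match.string
    start, end = match.start(), match.end()
    at_line_start = start == 0 or text[start - 1] == "\n"
    at_line_end = end == len(text) or text[end] in "\r\n"
    if at_line_start or at_line_end:
        return '"'   # a real field delimiter, keep it as-is
    return '""'      # assumption: escape the stray quote by doubling it


def fix_quotes(contents: str) -> str:
    return INNER_QUOTE.sub(replace_quotes, contents)


sample = '"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200\n'
rows = list(csv.reader(io.StringIO(fix_quotes(sample))))
```

With the third sample row from the question, `rows[0][2]` comes back as `8" Tray of Food ` and the row still has eight fields. A field that contains a quote directly next to a comma would still be misread, which is the inherent ambiguity of this file format.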

RegEx to exclude first and last quote would be (?<!^)(?<!,)("")(?!,)(?!$). Of course, you need to use RegexOptions.Multiline.
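That way there is no need for an evaluator function. My code replaces undesired double quotes with single quotes. Note that the sample below checks for ; rather than , so it targets a semicolon-delimited file; swap in a comma for the data in the question.
Complete C# code is below.
string fixedCSV = Regex.Replace(
    File.ReadAllText(fileName),
    @"(?<!^)(?<!;)("")(?!;)(?!$)", "'", RegexOptions.Multiline);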

There are at least ODBC drivers for CSV files. But there are different flavors of CSV.
What produced these files? It's not unlikely that there's a matching driver based on the requirements of the source application.

Your problem with CSVReader is that the quote in the third record isn't escaped with another quote (aka double quoting). If you don't escape them, then how would you expect to handle ", in the middle of a text field?
http://en.wikipedia.org/wiki/Comma-separated_values
(I did end up having to work with files (with different delimiters) but the quote characters inside a text value weren't escaped and I ended up writing my own custom parser. I do not know if this was absolutely necessary or not.)
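To illustrate the double-quoting convention mentioned above, here is a small sketch using Python's standard `csv` module (an illustration only; the thread itself is about .NET). The sample row is made up for the example: the embedded quotes are doubled, and one field even contains `",` in the middle of its text.

```python
import csv
import io

# Embedded quotes are escaped by doubling them; the third field contains '", '.
escaped = '"00","8"" Tray of Food ","a "", b",+1.5\n'
row = next(csv.reader(io.StringIO(escaped)))

# Writing the row back out re-applies the same escaping.
out = io.StringIO()
csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n").writerow(row)
round_trip = out.getvalue()
```

Here `row` is `['00', '8" Tray of Food ', 'a ", b', '+1.5']`. Once the producer escapes quotes this way, a quote followed by a comma inside a field is no longer ambiguous.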

The logic of this custom approach is: read through the file one line at a time, split each line on the comma, remove the first and last character of each item (removing the outer quotes but not affecting any inside quotes), then add the data to your generic list. It's short and very easy to read and work with.
Dim fr As StreamReader = Nothing
Dim FileString As String = ""
Dim LineItemsArr() As String
Dim FilePath As String = HttpContext.Current.Request.MapPath("YourFile.csv")
fr = New System.IO.StreamReader(FilePath)
While fr.Peek <> -1
    FileString = fr.ReadLine.Trim
    If String.IsNullOrEmpty(FileString) Then Continue While 'Empty Line
    LineItemsArr = FileString.Split(",")
    For Each Item As String In LineItemsArr
        'If every item will have a beginning and closing " (quote) then you can just
        'cut the first and last characters of the string here.
        'i.e. UpdatedItems = Item. remove first and last character
        'Then stick the data into your Generic List (Of String()?)
    Next
End While
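The commented-out step in the loop above could look roughly like the following Python sketch. The function name `parse_line` is made up for the example. It strips the outer quotes only when an item has both, so unquoted numeric fields pass through unchanged.

```python
def parse_line(line: str) -> list[str]:
    """Split on commas, then drop one leading and one trailing quote per item."""
    items = []
    for item in line.strip().split(","):
        if len(item) >= 2 and item[0] == '"' and item[-1] == '"':
            item = item[1:-1]  # inner quotes are left untouched
        items.append(item)
    return items


good = parse_line('"00","StringValue here","8" Tray of Food ","RET",-00000000005.3200')
# A comma inside a quoted field is split in the wrong place:
bad = parse_line('"Tray, large","RET"')
```

For the sample data this keeps the stray quote in `8" Tray of Food `, but as `bad` shows, the approach only holds as long as no text field ever contains a comma.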

public static Encoding GetFileEncoding(String fileName)
{
    Encoding Result = null;
    FileInfo FI = new FileInfo(fileName);
    FileStream FS = null;
    try
    {
        FS = FI.OpenRead();
        Encoding[] UnicodeEncodings = { Encoding.BigEndianUnicode, Encoding.Unicode, Encoding.UTF8 };
        for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
        {
            FS.Position = 0;
            byte[] Preamble = UnicodeEncodings[i].GetPreamble();
            bool PreamblesAreEqual = true;
            for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
            {
                PreamblesAreEqual = Preamble[j] == FS.ReadByte();
            }
            if (PreamblesAreEqual)
            {
                Result = UnicodeEncodings[i];
            }
        }
    }
    catch (System.IO.IOException)
    {
    }
    finally
    {
        if (FS != null)
        {
            FS.Close();
        }
    }
    if (Result == null)
    {
        Result = Encoding.Default;
    }
    return Result;
}
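GetFileEncoding compares the start of the file against the byte-order marks (preambles) of three Unicode encodings and falls back to the system default. A compact Python sketch of the same check follows; the name `detect_encoding` and the returned codec names are choices made for this example. Like the C# version, it does not look for UTF-32 marks.

```python
import codecs
import locale

# Checked in the same order as the C# helper: UTF-16 BE, UTF-16 LE, UTF-8.
# "utf-16" and "utf-8-sig" both consume the byte-order mark while decoding.
_BOMS = [
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF8, "utf-8-sig"),
]


def detect_encoding(head: bytes) -> str:
    """Return a codec name based on the BOM at the start of `head`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    # No BOM found: fall back to the platform default, like Encoding.Default.
    return locale.getpreferredencoding(False)
```

A caller would read the first few bytes of the file in binary mode, pass them to `detect_encoding`, and then reopen or decode the file with the returned codec.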

Related

How to split on a character only if it's outside of quotes in golang?

I need to split a chunk of text on the + symbol, but only when it's outside of single quotes. The text will look something like:
Some.data:'some+value'+some.more.data:9+yet.more.data:'rock+roll'
which should become a slice of three values:
Some.data:'some+value'
some.more.data:9
yet.more.data:'rock+roll'
I've found similar questions that do it using regex, but that requires look ahead which the golang regex engine doesn't have.
I also took a crack at creating my own regex without lookahead:
'.*?'(\+)|[^']*(\+)
But that seems to fall apart on the third item where it splits on the + in 'rock+roll'.
I've thought about potentially doing a string split on + and then validating each slice to make sure it's not a partial expression and then stitching the pieces back together if it is, but it will be fairly involved and i'd like to avoid it if possible.
At the moment I think the best solution would be to identify text that is inside of quotes (which I can easily do with regex), either URL encode that text or do something else with the plus sign, split the text and then URL decode the expression to get the + sign inside of quotes back, but i'm wondering if there is a better way.
Does anyone know of a way to split on a + sign that is outside of quotes using regex without lookahead? Can anyone think of a simpler solution than my URL encoding/decoding method?

Plain code can be easier:
func split(s string) []string {
	var result []string
	inquote := false
	i := 0
	for j, c := range s {
		if c == '\'' {
			inquote = !inquote
		} else if c == '+' && !inquote {
			result = append(result, s[i:j])
			i = j + 1
		}
	}
	return append(result, s[i:])
}
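If a regex is still preferred, one lookahead-free option is to match the pieces instead of splitting on the separator: each piece is a run of quoted chunks and characters that are neither + nor a quote. The sketch below is in Python for brevity; the pattern uses no lookaround, so it should carry over to RE2-style engines such as Go's, though that is untested here.

```python
import re

# One token = one or more of: a complete '...' chunk, or any char except + and '.
TOKEN = re.compile(r"(?:'[^']*'|[^+'])+")


def split_outside_quotes(s: str) -> list[str]:
    return TOKEN.findall(s)


parts = split_outside_quotes(
    "Some.data:'some+value'+some.more.data:9+yet.more.data:'rock+roll'"
)
```

Unlike the loop above, this drops empty segments (for example between two adjacent + signs) and skips an unbalanced quote, so the hand-written scanner is the safer choice when those cases matter.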

Regular expression for custom syntax in text input

I'm supposed to enforce a certain search-syntax in a text input, and after watching several RegEx videos and tutorials, I'm still having difficulties creating a regex for my purpose.
The expression structure should be something like that:
$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8
may start with a free text search that may contain any character other than the delimiter, which is ,. (free text must be first, and the string may be ONLY free text search).
after free text comma-separated parts of field names which consist only [a-z][A-Z], followed by operator: (=|<|>|<>) and followed by field search value that may be anything but ,.
between the commas that separate the parts there may be spaces (\s*).
The free text part or at least one field=value must appear in order for the string to be valid.
Did anyone understand the question? :)

^[^,]*(?:,\s*[a-zA-Z]+(?:[=><]|<>)[^,]+)*$? – Rawing
Thanks, that seems to work. Why did you use non-capturing groups?
He did it most probably because he didn't assume that the groups are to be captured (you didn't specify that).
Plus - if I start out the string with a comma, it is valid, whereas I
want it to not be valid (if there's no free text at the beginning).
That can be accomplished by changing the first * to a +, i. e. ^[^,]+…
I'm using javascript. I want to be able to separate each key=value
pair (including the possible free text as a group), and within that
group I would like to be able to capture key, operator, and value as
separate entities (or groups)
That's not doable with only one RegExp invocation, see e. g. How to capture an arbitrary number of groups in JavaScript Regexp? Here's an example solution:
s = '$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8'
part = /,\s*([a-zA-Z]+)(<>|[=><])([^,]+)/
re = RegExp('^([^,]+)('+part.source+')*$')
freetext = re.exec(s)[1] // validate s and take free text as 1st capture group of re
if (freetext)
{ document.write('free text:', freetext, '<p>')
parts = RegExp(part.source, 'g')
m = s.slice(freetext.length).match(parts) // now get all key=value pairs into m[]
if (m)
{ field = []
for (i = 0; i < m.length; ++i)
{ f = m[i].match(part) // and now capture key, operator and value from m[i]
field[i] = { key:f[1], operator:f[2], value:f[3] }
for (property in field[i]) // display them
document.write(property, ':', field[i][property], '; ')
document.write('<p>')
}
document.write(field.length, ' key/value pairs total<p>')
}
}
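The same two-step idea (validate the whole string first, then walk the key/operator/value parts) can be sketched in Python as follows. The function name `parse_query` is made up for the example, and the patterns are the ones from the JavaScript above, with `<>` tried before the single-character operators.

```python
import re

# One ", key<op>value" part, with key, operator and value captured.
PART = re.compile(r",\s*([a-zA-Z]+)(<>|[=<>])([^,]+)")
# Whole string: mandatory free text, then any number of parts.
FULL = re.compile(r"([^,]+)(?:,\s*[a-zA-Z]+(?:<>|[=<>])[^,]+)*")


def parse_query(s: str):
    """Return (free_text, [(key, operator, value), ...]) or None if invalid."""
    m = FULL.fullmatch(s)
    if m is None:
        return None
    free_text = m.group(1)
    fields = [p.groups() for p in PART.finditer(s, len(free_text))]
    return free_text, fields


result = parse_query(
    "$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8"
)
```

Values keep any trailing whitespace before the next comma (the `url` value ends with a space here), so a caller may want to strip them.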

AS3 Remove characters after a word in a string

AS3 | Adobe AIR 3.5 | Flash CS6
I have html pulled from a source online put into a String that I'm taking apart piece by piece to build an XML file based off of the information I pulled. I need to search the whole String to remove characters following "info" up until the character "&". There are multiple instances of this throughout the string, so I thought it would be best to use a RegExp. Other suggestions are welcomed.
//code from the website put into a full string
var fullString = new String(urlLoader.data);
//search string to find <info> and remove characters AFTER it up until the character '&'
var stringPlusFive:RegExp = /<info>1A2E589EIM&/g;
//should only remove '1A2E589EIM' leaving '<info>&'
fullString = fullString.replace(stringPlusFive,"");
The problem I'm having trouble figuring out is the "1A2E589EIM" is not consistent. They are random numbers and characters, and possibly length, so I can't really use what I have written above. It will always lead up to a "&".
Thanks in advance for any help.

I think the regexp would be more like
//search string to find <info> and remove characters AFTER it up until the character '&'
var stringPlusFive:RegExp = /<info>\w+&/g;
//should only remove '1A2E589EIM' leaving '<info>&'
fullString = fullString.replace(stringPlusFive,"<info>&");
now, if you are going to parse your string to XML, you could wait until you have your XML structure and just remove the info string from 'info' nodes
//code from the website put into a full string
var fullString = new String(urlLoader.data);
//parse to XML
var xml:XML = XML(fullString),
    index:int, text:String;
for each(var info:XML in xml.descendants('info')){
    text = info.*[0].text();
    index = text.indexOf('&');
    if(index != -1){
        info.*[0] = text.substr(index);
    }
}
I'm not sure about the affectation to *[0] but it should be something like that. Hope this helps.
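For readers who want to try the first pattern outside Flash, here is the same substitution as a short Python sketch. The sample string is invented for the example, and `\w+` assumes the ID consists only of letters, digits and underscores, as in the question's sample.

```python
import re

# "<info>" followed by a run of word characters, up to and including "&".
INFO_ID = re.compile(r"<info>\w+&")


def strip_info_ids(html: str) -> str:
    # Put "<info>&" back so only the variable ID is removed.
    return INFO_ID.sub("<info>&", html)


cleaned = strip_info_ids("a <info>1A2E589EIM&x=1 b <info>ZZ9&y=2")
```

If the IDs can contain other characters, `[^&]+` in place of `\w+` would be a looser alternative.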

How to format given string using regex?

So I have defined variables in such a way in my file:
public static final String hello_world = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world= "bye world"
I have many declarations like that.
Is it possible to format them as(All '=' in a line):
public static final String hello_world   = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world     = "bye world"
I can't even think of a way to do it. Any kind of help is appreciated.
P.S If it matters, I use sublime text 2.

If it is a one-time task you might try the following:
Import the text file into, e.g., Excel using the 'text in columns' functionality (separation character: space) so that column A contains "public" in each row, column B "static", ..., column E the variable names, column F the "=" signs, and column G the variable values (strings).
Then put the following formula into cell H1 (and copy it down to the other rows):
="public static final String "&E1&REPT(" ";50-LEN(E1))&" = "&""""&G1&""""
Afterwards, column H contains the following outputs:
public static final String hello_world                                        = "hello world"
public static final String awesome_world                                      = "awesome world"
public static final String bye_world                                          = "bye world"
Please note that the Excel functions REPT and LEN are named differently if your Excel language is not English.

If you're careful with your original layout (so that = signs are separated from the variable name, for example, unlike the third line of data in the example), then this will do the job:
awk '{ if (length($5) > max) max = length($5);
name[NR] = $5; value[NR] = $0; sub(/^[^"]*"/, "\"", value[NR]); }
END { format = sprintf("public static final String %%-%ds = %%s\n", max);
for (i = 1; i <= NR; i++) printf(format, name[i], value[i]); }'
It assumes you are dealing with 'public static final String' throughout (but doesn't verify that). It keeps track of the length of the longest name it reads (line 1), and also the variable name and the material from the open double quote to the end of line (line 2). At the end, it generates a format string which will print the variable names left justified in a field as long as the longest (line 3). It then applies that to the saved data (line 4), generating:
public static final String hello_world   = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world     = "bye world"
To make it bomb-proof (e.g. the original data), you have to work a bit harder, though it shouldn't be insuperable. The simplest fix for the sloppy original format would be to pre-filter the data with:
sed 's/=/ = /'
Extra spaces around properly spaced input won't affect the output, and the missing space in the third sample line of data is fixed. It would be fiddly to do that inside awk because you'd want it to resplit the line after editing it. You could do something very similar in Perl.
Given that the volumes of data to be processed are unlikely to be in the megabyte range, let alone larger, the two-command solution is perfectly reasonable; you're unlikely to be able to measure the cost of the sed process.

There is no single regex that can solve your problem. Your only option would be to run a series of regexes, one to handle each line length:
s/^(.{40})=/\1 =/
s/^(.{39})=/\1  =/
s/^(.{38})=/\1   =/
And even then, that's probably not what you want and it's probably much, much easier to do it by hand.
The problem is that the only way a regex substitution can insert different strings at different times is if what it's inserting is a backref, and there's no backref to give you your 5 - N space characters. Your other option would be to try to capture a variable number of characters, but in this case there's no way to make that do it for you either.
Regexes were not made to do things like that (they don't support arithmetic), but some text editors are, so just find a fancy text editor or do it by hand.
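Since the padding width has to be computed, a few lines of script do the job where a single substitution cannot. The following Python sketch (the function name `align_equals` is made up) pads the text before the first = on each line to the longest such text; lines without an = are passed through unchanged.

```python
def align_equals(lines: list[str]) -> list[str]:
    """Pad the left-hand side of the first '=' on each line to a common width."""
    parts = [line.split("=", 1) for line in lines]
    width = max((len(p[0].rstrip()) for p in parts if len(p) == 2), default=0)
    aligned = []
    for line, p in zip(lines, parts):
        if len(p) == 2:
            aligned.append(f"{p[0].rstrip():<{width}} = {p[1].strip()}")
        else:
            aligned.append(line)  # no '=' on this line
    return aligned


source = [
    'public static final String hello_world = "hello world"',
    'public static final String awesome_world = "awesome world"',
    'public static final String bye_world= "bye world"',
]
aligned = align_equals(source)
```

It also copes with the missing space in the third declaration, because the text on each side of the = is trimmed before it is reassembled.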

Since you're using Sublime Text 2, there's a much easier way to do that.
There's a great package for Sublime Text 2 which will do exactly what you want:
Sublime Alignment
Dead-simple alignment of multi-line selections and multiple selections for Sublime Text 2.
Features:
Align multiple selections to the same column by inserting spaces (or tabs)
Align all lines in a multi-line selection to the same indent level
Align the first = on each line of a multi-line selection to the same column

Notepad++ RegEx group capture syntax

I have a list of label names in a text file I'd like to manipulate using Find and Replace in Notepad++, they are listed as follows:
MyLabel_01
MyLabel_02
MyLabel_03
MyLabel_04
MyLabel_05
MyLabel_06
I want to rename them in Notepad++ to the following:
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
The Regex I'm using in the Notepad++'s replace dialog to capture the label name is the following:
((MyLabel_0)((1)|(2)|(3)|(4)|(5)|(6)))
I want to replace each capture group as follows:
\1 = Label_
\2 = A_One
\3 = A_Two
\4 = A_Three
\5 = B_One
\6 = B_Two
\7 = B_Three
My problem is that Notepad++ doesn't register the syntax of the regex above. When I hit Count in the Replace dialog, it returns 0 occurrences. I'm not sure what's missing in the syntax. And yes, I made sure the Regular Expression radio button is selected. Help is appreciated.
UPDATE:
Tried escaping the parenthesis, still didn't work:
\(\(MyLabel_0\)\((1\)|\(2\)|\(3\)|\(4\)|\(5\)|\(6\)\)\)

Ed's response has shown a working pattern since alternation isn't supported in Notepad++, however the rest of your problem can't be handled by regex alone. What you're trying to do isn't possible with a regex find/replace approach. Your desired result involves logical conditions which can't be expressed in regex. All you can do with the replace method is re-arrange items and refer to the captured items, but you can't tell it to use "A" for values 1-3, and "B" for 4-6. Furthermore, you can't assign placeholders like that. They are really capture groups that you are backreferencing.
To reach the results you've shown you would need to write a small program that would allow you to check the captured values and perform the appropriate replacements.
EDIT: here's an example of how to achieve this in C#
var numToWordMap = new Dictionary<int, string>();
numToWordMap[1] = "A_One";
numToWordMap[2] = "A_Two";
numToWordMap[3] = "A_Three";
numToWordMap[4] = "B_One";
numToWordMap[5] = "B_Two";
numToWordMap[6] = "B_Three";
string pattern = @"\bMyLabel_(\d+)\b";
string filePath = @"C:\temp.txt";
string[] contents = File.ReadAllLines(filePath);
for (int i = 0; i < contents.Length; i++)
{
    contents[i] = Regex.Replace(contents[i], pattern,
        m =>
        {
            int num = int.Parse(m.Groups[1].Value);
            if (numToWordMap.ContainsKey(num))
            {
                return "Label_" + numToWordMap[num];
            }
            // key not found, use original value
            return m.Value;
        });
}
File.WriteAllLines(filePath, contents);
You should be able to use this easily. Perhaps you can download LINQPad or Visual C# Express to do so.
If your files are too large this might be an inefficient approach, in which case you could use a StreamReader and StreamWriter to read from the original file and write it to another, respectively.
Also be aware that my sample code writes back to the original file. For testing purposes you can change that path to another file so it isn't overwritten.
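For anyone without a C# environment at hand, the same lookup-based replacement can be sketched in Python. This mirrors the code above but works on a string rather than a file; the names `NUM_TO_WORD` and `rename_labels` are made up for the example.

```python
import re

NUM_TO_WORD = {
    1: "A_One", 2: "A_Two", 3: "A_Three",
    4: "B_One", 5: "B_Two", 6: "B_Three",
}
LABEL = re.compile(r"\bMyLabel_(\d+)\b")


def _rename(match: re.Match) -> str:
    word = NUM_TO_WORD.get(int(match.group(1)))
    # Key not found: keep the original text.
    return f"Label_{word}" if word else match.group(0)


def rename_labels(text: str) -> str:
    return LABEL.sub(_rename, text)


renamed = rename_labels("MyLabel_01\nMyLabel_04\nMyLabel_09")
```

The callback is what supplies the "A for 1-3, B for 4-6" logic that a plain replacement string cannot express.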

Bar bar bar - Notepad++ thinks you're a barbarian.
(obsolete - see update below.) No vertical bars in Notepad++ regex - sorry. I forget every few months, too!
Use [123456] instead.
Update: Sorry, I didn't read carefully enough; on top of the barhopping problem, @Ahmad's spot-on - you can't do a mapping replacement like that.
Update: Version 6 of Notepad++ changed the regular expression engine to a Perl-compatible one, which supports "|". AFAICT, if you have a version 5.x, auto-update won't update to 6.x - you have to explicitly download it.

A regular expression search and replace for
MyLabel_((01)|(02)|(03)|(04)|(05)|(06))
with
Label_(?2A_One)(?3A_Two)(?4A_Three)(?5B_One)(?6B_Two)(?7B_Three)
works on Notepad++ 6.3.2
The outermost pair of brackets is for grouping, they limit the scope of the first alternation; not sure whether they could be omitted but including them makes the scope clear. The pattern searches for a fixed string followed by one of the two-digit pairs. (The leading zero could be factored out and placed in the fixed string.) Each digit pair is wrapped in round brackets so it is captured.
In the replacement expression, the clause (?4A_Three) says that if capture group 4 matched something then insert the text A_Three, otherwise insert nothing. Similarly for the other clauses. As the 6 alternatives are mutually exclusive only one will match. Thus only one of the (?...) clauses will have matched and so only one will insert text.

The easiest way to do this that I would recommend is to use AWK. If you're on Windows, look for the mingw32 precompiled binaries out there for free download (it'll be called gawk).
BEGIN {
    FS = "_0";
    a[1]="A_One";
    a[2]="A_Two";
    a[3]="A_Three";
    a[4]="B_One";
    a[5]="B_Two";
    a[6]="B_Three";
}
{
    printf("Label_%s\n", a[$2]);
}
Execute on Windows as follows:
C:\Users\Mydir>gawk -f test.awk awk.in
Label_A_One
Label_A_Two
Label_A_Three
Label_B_One
Label_B_Two
Label_B_Three
