Regex for strings in Bibtex - regex

I'm trying to parse BibTeX files using lex/yacc. Strings in a BibTeX database can be surrounded by quotes ("...") or by braces ({...}).
But every entry is also enclosed in braces. How do I differentiate between an entry and a string surrounded by braces?
@Book{sweig42,
Author = { Stefan Sweig },
title = { The impossible book },
publisher = { Dead Poet Society},
year = 1942,
month = mar
}

You have various options:
Lexer start conditions (from a Lex tutorial)
Building on the ideas from Greg Ward, enhance your lex rules with start conditions ('modes', as they are called in the referenced source).
Specifically, you would have the start conditions BASIC, ENTRY and STRING, and the following rules (example taken and slightly enhanced from here):
%START BASIC ENTRY STRING
%%
/* Lexical grammar, mode 1: top-level */
<BASIC>AT @ { BEGIN ENTRY; }
<BASIC>NEWLINE \n
<BASIC>COMMENT \%[^\n]*\n
<BASIC>WHITESPACE [\ \r\t]+
<BASIC>JUNK [^@\n\ \r\t]+
/* Lexical grammar, mode 2: in-entry */
<ENTRY>NEWLINE \n
<ENTRY>COMMENT \%[^\n]*\n
<ENTRY>WHITESPACE [\ \r\t]+
<ENTRY>NUMBER [0-9]+
<ENTRY>NAME [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+ { if (stricmp(yytext, "comment")==0) { BEGIN STRING; } }
<ENTRY>LBRACE \{ { if (delim == '\0') { delim='}'; } else { blevel=1; BEGIN STRING; } }
<ENTRY>RBRACE \} { BEGIN BASIC; }
<ENTRY>LPAREN \( { BEGIN STRING; delim=')'; plevel=1; }
<ENTRY>RPAREN \)
<ENTRY>EQUALS =
<ENTRY>HASH \#
<ENTRY>COMMA ,
<ENTRY>QUOTE \" { BEGIN STRING; blevel=0; plevel=0; }
/* Lexical grammar, mode 3: strings */
<STRING>LBRACE \{ { if (blevel>0) {blevel++;} }
<STRING>RBRACE \} { if (blevel>0) { blevel--; if (blevel == 0) { BEGIN ENTRY; } } }
<STRING>LPAREN \( { if (plevel>0) { plevel++;} }
<STRING>RPAREN \) { if (plevel>0) { plevel--; if (plevel == 0) { BEGIN ENTRY; } } }
<STRING>QUOTE \" { BEGIN ENTRY; }
Please note that the rule set is by no means complete, but it should get you started. More details can be found here.
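The mode-switching idea behind the rules above can be sketched outside lex as a small hand-written scanner. This is only an illustration in Python, not part of btparse or of the rule set: the function name `scan_strings` and the sample entry are made up for the example, and parenthesis-delimited entries and @comment are deliberately left out.

```python
def scan_strings(text):
    """Return the delimited string values found inside BibTeX entries."""
    mode = "BASIC"       # BASIC: outside any entry, ENTRY: inside one, STRING: inside a value
    entry_open = False   # has the entry's own opening brace been seen yet?
    quoted = False       # True while inside a "..." value
    depth = 0            # brace nesting inside the current value
    buf, values = [], []
    for ch in text:
        if mode == "BASIC":
            if ch == "@":
                mode, entry_open = "ENTRY", False
        elif mode == "ENTRY":
            if ch == "{" and not entry_open:
                entry_open = True                  # this brace opens the entry itself
            elif ch == "{":
                mode, quoted, depth, buf = "STRING", False, 1, []
            elif ch == '"':
                mode, quoted, depth, buf = "STRING", True, 0, []
            elif ch == "}":
                mode = "BASIC"                     # this brace closes the entry
        else:  # STRING mode
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            # A quoted value ends at a quote outside nested braces;
            # a braced value ends when its opening brace is balanced.
            closes = (quoted and ch == '"' and depth == 0) or (not quoted and depth == 0)
            if closes:
                values.append("".join(buf).strip())
                mode = "ENTRY"
            else:
                buf.append(ch)
    return values


# Hypothetical sample entry, adapted from the question.
SAMPLE = '''@Book{sweig42,
Author = { Stefan Sweig },
title = { The {impossible} book },
publisher = "Dead Poet Society",
year = 1942
}'''

print(scan_strings(SAMPLE))
```

The first brace after @ is treated as the entry delimiter, and every later brace seen in ENTRY mode starts a string, which is the same distinction the delim/blevel bookkeeping makes in the lex rules.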
btparse
These docs explain in fairly detailed fashion the intricacies of parsing the BibTeX format, and the package comes with a Python parser.
biblex
You might also be interested in employing the Unix toolchain of biblex and bibparse. These tools generate and parse a BibTeX token stream, respectively.
More info can be found here.
best regards, carsten


Angular Input Restriction Directive - Negating Regular Expressions

EDIT: Please feel free to add additional validations that would be useful for others, using this simple directive.
--
I'm trying to create an Angular directive that limits the characters that can be entered into a text box. I've been successful with a couple of common use cases (alphabetical, alphanumeric and numeric), but with popular methods for validating email addresses, dates and currency I can't get the directive to work, since I need it to negate the regex. At least that's what I think it needs to do.
Any assistance for currency (optional thousand separator and cents), date (mm/dd/yyyy) and email is greatly appreciated. I'm not strong with regular expressions at all.
Here's what I have currently:
http://jsfiddle.net/corydorning/bs05ys69/
HTML
<div ng-app="example">
<h1>Validate Directive</h1>
<p>The Validate directive allow us to restrict the characters an input can accept.</p>
<h3><code>alphabetical</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to alphabetical (A-Z, a-z) characters only.</p>
<label><input type="text" validate="alphabetical" ng-model="validate.alphabetical"/></label>
<h3><code>alphanumeric</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to alphanumeric (A-Z, a-z, 0-9) characters only.</p>
<label><input type="text" validate="alphanumeric" ng-model="validate.alphanumeric" /></label>
<h3><code>currency</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to US currency characters with comma for thousand separator (optional) and cents (optional).</p>
<label><input type="text" validate="currency.us" ng-model="validate.currency" /></label>
<h3><code>date</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to the mm/dd/yyyy date format only.</p>
<label><input type="text" validate="date" ng-model="validate.date" /></label>
<h3><code>email</code> <span style="color: red">(doesn't work)</span></h3>
<p>Restricts input to email format only.</p>
<label><input type="text" validate="email" ng-model="validate.email" /></label>
<h3><code>numeric</code> <span style="color: green">(works)</span></h3>
<p>Restricts input to numeric (0-9) characters only.</p>
<label><input type="text" validate="numeric" ng-model="validate.numeric" /></label>
JavaScript
angular.module('example', [])
.directive('validate', function () {
var validations = {
// works
alphabetical: /[^a-zA-Z]*$/,
// works
alphanumeric: /[^a-zA-Z0-9]*$/,
// doesn't work - need to negate?
// taken from: http://stackoverflow.com/questions/354044/what-is-the-best-u-s-currency-regex
currency: /^[+-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?$/,
// doesn't work - need to negate?
// taken from here: http://stackoverflow.com/questions/15196451/regular-expression-to-validate-datetime-format-mm-dd-yyyy
date: /(?:0[1-9]|1[0-2])\/(?:0[1-9]|[12][0-9]|3[01])\/(?:19|20)[0-9]{2}/,
// doesn't work - need to negate?
// taken from: http://stackoverflow.com/questions/46155/validate-email-address-in-javascript
email: /^([\w-]+(?:\.[\w-]+)*)@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$/i,
// works
numeric: /[^0-9]*$/
};
return {
require: 'ngModel',
scope: {
validate: '@'
},
link: function (scope, element, attrs, modelCtrl) {
var pattern = validations[scope.validate] || scope.validate
;
modelCtrl.$parsers.push(function (inputValue) {
var transformedInput = inputValue.replace(pattern, '')
;
if (transformedInput != inputValue) {
modelCtrl.$setViewValue(transformedInput);
modelCtrl.$render();
}
return transformedInput;
});
}
};
});
I am pretty sure there is a better way, and regex is probably not the best tool for this, but here is my proposal.
This way you can only restrict which characters are allowed as input and push the user towards the proper format; you will still need to validate the final input after the user has finished typing, but that is another story.
Alphabetic, numeric and alphanumeric input is quite simple to restrict and validate, as it is clear what you can type and what a proper final input is. But with dates, emails and currency, you cannot check the input while typing against a regex for the complete valid value: the user needs to type it in first, and in the meantime the input is necessarily invalid in terms of the final value. So it is one thing to restrict the user to typing just digits and / for a date format such as 12/12/1988, but in the end you need to check whether they typed a proper date or just 12/12/126, for example. This needs to be checked when the answer is submitted by the user, when the text field loses focus, etc.
To just validate typed character, you can try with this:
JSFiddle DEMO
First change:
var transformedInput = inputValue.replace(pattern, '')
to
var transformedInput = inputValue.replace(pattern, '$1')
then use regular expressions:
/^([a-zA-Z]*(?=[^a-zA-Z]))./ - alphabetic
/^([a-zA-Z0-9]*(?=[^a-zA-Z0-9]))./ - alphanumeric
/(\.((?=[^\d])|\d{2}(?![^,\d.]))|,((?=[^\d])|\d{3}(?=[^,.$])|(?=\d{1,2}[^\d]))|\$(?=.)|\d{4,}(?=,)).|[^\d,.$]|^\$/- currency (allow string like: 343243.34, 1,123,345.34, .05 with or without $)
^(((0[1-9]|1[012])|(\d{2}\/\d{2}))(?=[^\/])|((\d)|(\d{2}\/\d{2}\/\d{1,3})|(.+\/))(?=[^\d])|\d{2}\/\d{2}\/\d{4}(?=.)).|^(1[3-9]|[2-9]\d)|((?!^)(3[2-9]|[4-9]\d)\/)|[3-9]\d{3}|2[1-9]\d{2}|(?!^)\/\d\/|^\/|[^\d/] - date (00-12/00-31/0000-2099)
/^(\d*(?=[^\d]))./ - numeric
/^([\w.$-]+\@[\w.]+(?=[^\w.])|[\w.$-]+\@(?=[^\w.-])|[\w.@-]+(?=[^\w.$@-])).$|\.(?=[^\w-@]).|[^\w.$@-]|^[^\w]|\.(?=@).|@(?=\.)./i - email
Generally, they use this pattern:
([valid characters or structure] captured in group $1)(?= positive lookahead for not allowed characters) any character
In effect, it captures all valid characters in group $1, and if the user types an invalid character, the whole string is replaced with the valid characters already captured in group $1. It is complemented by a part that excludes some obviously invalid sequences, like @@ in an email or 34...2 in a currency amount.
Although these regular expressions look quite complex, once you understand how they work I think it is easy to extend them by adding further allowed or disallowed characters.
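To see the capture-plus-lookahead pattern at work outside the directive, here is a minimal Python sketch using the numeric and alphabetic expressions from the list above. The helper name `restrict` is invented for the example, and `re.sub` with `\1` and `count=1` stands in for JavaScript's `replace(pattern, '$1')`.

```python
import re

# The numeric and alphabetic patterns from the answer, in Python syntax.
NUMERIC = re.compile(r"^(\d*(?=[^\d])).")
ALPHABETIC = re.compile(r"^([a-zA-Z]*(?=[^a-zA-Z])).")


def restrict(value, pattern):
    """Drop the first disallowed character by keeping only capture group 1."""
    # Group 1 holds the valid prefix; the trailing "." consumes the offending
    # character, so replacing the match with \1 removes just that character.
    return pattern.sub(r"\1", value, count=1)


print(restrict("12a", NUMERIC))      # the "a" is removed
print(restrict("123", NUMERIC))      # nothing to remove, no match
print(restrict("ab1", ALPHABETIC))   # the "1" is removed
```

Because the parser runs on every keystroke, removing one offending character per call is enough in the directive; pasted text with several invalid characters would need repeated application.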
Regular expressions for validating currency, dates and emails are easy to find, so I find it redundant to post them here.
Off topic: the currency part in your demo is not working because of validate="currency.us" instead of validate="currency"; at least it works after this modification.
In my opinion it is impossible to create regular expressions that will work for matching things like dates or emails with the
parser you use. This is mainly because you would need non-capturing groups in your
regular expressions (which is possible), which are not replaced by the
inputValue.replace(pattern, '') call you have in your parser function. And this is the
part that is not possible in JavaScript. JavaScript replaces what you put in non-capturing
groups as well.
So... you'll need to go for a different approach. I would suggest to go for positive
regular expressions, which will yield a match when the input is valid.
Then you need of course to change the code of your parser. You could for instance
decide to chop off characters from the end of the input text until what remains passes
the regular expression test. This you could code as follows:
modelCtrl.$parsers.push(function (inputValue) {
var transformedInput = inputValue;
while (transformedInput && !pattern.exec(transformedInput)) {
// validation fails: chop off last character and try again
transformedInput = transformedInput.slice(0, -1);
}
if (transformedInput !== inputValue) {
modelCtrl.$setViewValue(transformedInput);
modelCtrl.$render();
}
return transformedInput;
});
Now life has become a bit easier. Just pay attention that you make your regular
expressions in such a way that they do not reject partial input. So "01/" should be
considered valid for a date, otherwise the user can never get to type in a date. On
the other hand, as soon as it becomes clear that adding characters will no longer
allow for valid input, the regular expression should reject it. So "101" should be
rejected as a date, as you can never add characters at the end to make it a valid date.
Also, all of these regular expressions should check the whole input, so as a consequence
they need to make use of the ^ and $ symbols.
Here is what the regular expression for a (partial) date could look like:
^([0-9]{0,2}|[0-9]{2}[\/]([0-9]{0,2}|[0-9]{2}[\/][0-9]{0,4}))$
This means: an input of 0 to 2 digits is valid, or exactly 2 digits followed by a slash, followed by either:
0 to 2 digits, or
exactly 2 digits followed by a slash, followed by 0 to 4 digits
Admittedly, not as smart as the one you had found, but that one would need a lot of editing to allow for partially entered dates. It is possible, but
it represents a very long expression with a lot of brackets and |.
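The chop-off loop and the partial-date expression can be tried outside Angular. The following Python sketch mirrors the parser function above under the assumption that only the string handling matters; `chop_until_valid` is a name made up for the example.

```python
import re

# Partial-date pattern from the answer, unchanged apart from Python raw-string syntax.
PARTIAL_DATE = re.compile(
    r"^([0-9]{0,2}|[0-9]{2}[\/]([0-9]{0,2}|[0-9]{2}[\/][0-9]{0,4}))$"
)


def chop_until_valid(value, pattern=PARTIAL_DATE):
    """Remove characters from the end until the pattern accepts the value."""
    while value and not pattern.fullmatch(value):
        value = value[:-1]  # validation fails: chop off the last character
    return value


for typed in ["01/", "101", "12/12/1988", "12/12/19888", "ab"]:
    print(repr(typed), "->", repr(chop_until_valid(typed)))
```

Note that this pattern only checks the shape of the input, so a value such as 99/99/0000 still passes and must be rejected by a final validation step.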
Once you have all the regular expressions set up, you could think to further improve
the parser. One idea would be to not let it chop off characters from the end, but to
let it test all strings with one character removed somewhere compared to the original,
and see which one passes the test. If there is no way found to remove one character and have
success, then remove two consecutive characters in any place of the input value,
then three, ... etc, until you find a value that passes the test or arrive at an empty value.
This will work better for cases where the user inserts characters halfway through their input.
Just an idea...
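A rough Python sketch of that idea, assuming the same partial-date pattern as above and a made-up helper name `remove_run_until_valid`: it tries removing one character at every position, then runs of two consecutive characters, and so on.

```python
import re

PARTIAL_DATE = re.compile(
    r"^([0-9]{0,2}|[0-9]{2}[\/]([0-9]{0,2}|[0-9]{2}[\/][0-9]{0,4}))$"
)


def remove_run_until_valid(value, pattern=PARTIAL_DATE):
    """Drop the shortest run of consecutive characters that makes the value valid."""
    if pattern.fullmatch(value):
        return value
    for run in range(1, len(value) + 1):               # length of the removed run
        for start in range(0, len(value) - run + 1):   # where the run begins
            candidate = value[:start] + value[start + run:]
            if pattern.fullmatch(candidate):
                return candidate
    return ""  # nothing passed the test, fall back to an empty value


# A stray character typed in the middle is removed without losing the tail.
print(remove_run_until_valid("12x/12"))
```

Compared with chopping from the end, which would reduce "12x/12" to "12", this keeps the part of the input that was already valid, at the cost of testing up to a quadratic number of candidates.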
import { Directive, ElementRef, EventEmitter, HostListener, Input, Output, Renderer2 } from '@angular/core';
import { ControlValueAccessor, NG_VALUE_ACCESSOR } from '@angular/forms';
import { CurrencyPipe, DecimalPipe } from '@angular/common';
import { ValueChangeEvent } from '@goomTool/goom-elements/events/value-change-event.model';
const noOperation = () => {
};
@Directive({
selector: '[formattedNumber]',
providers: [{
provide: NG_VALUE_ACCESSOR,
useExisting: FormattedNumberDirective,
multi: true
}]
})
export class FormattedNumberDirective implements ControlValueAccessor {
@Input() public configuration;
@Output() public valueChange: EventEmitter<ValueChangeEvent> = new EventEmitter();
public locale: string = process.env.LOCALE;
private el: HTMLInputElement;
// Keeps track of the value without formatting
private innerInputValue: any;
private specialKeys: string[] =
['Backspace', 'Tab', 'End', 'Home', 'Enter', 'Shift', 'ArrowRight', 'ArrowLeft', 'Delete'];
private onTouchedCallback: () => void = noOperation;
private onChangeCallback: (a: any) => void = noOperation;
constructor(private elementRef: ElementRef,
private decimalPipe: DecimalPipe,
private currencyPipe: CurrencyPipe,
private renderer: Renderer2) {
this.el = elementRef.nativeElement;
}
public writeValue(value: any) {
if (value !== this.innerInputValue) {
if (!!value) {
this.renderer.setAttribute(this.elementRef.nativeElement, 'value', this.getFormattedValue(value));
}
this.innerInputValue = value;
}
}
public registerOnChange(fn: any) {
this.onChangeCallback = fn;
}
public registerOnTouched(fn: any) {
this.onTouchedCallback = fn;
}
// On Focus remove all non-digit ,display actual value
@HostListener('focus', ['$event.target.value'])
public onfocus(value) {
if (!!this.innerInputValue) {
this.el.value = this.innerInputValue;
}
}
// On Blur set values to pipe format
@HostListener('blur', ['$event.target.value'])
public onBlur(value) {
this.innerInputValue = value;
if (!!value) {
this.el.value = this.getFormattedValue(value);
}
}
/**
* Allows special key, Unit Interval, value based on regular expression
*
* @param event
*/
@HostListener('keydown', ['$event'])
public onKeyDown(event) {
// Allow Backspace, tab, end, and home keys . .
if (this.specialKeys.indexOf(event.key) !== -1) {
if (event.key === 'Backspace') {
this.updateValue(this.getBackSpaceValue(this.el.value, event));
}
if (event.key === 'Delete') {
this.updateValue(this.getDeleteValue(this.el.value, event));
}
return;
}
const next: string = this.concatAtIndex(this.el.value, event);
if (this.configuration.angularPipe && this.configuration.angularPipe.length > 0) {
if (!this.el.value.includes('.')
&& (this.configuration.min == null || this.configuration.min < 1)) {
if (next.startsWith('0') || next.startsWith('0.') || next.startsWith('.')) {
if (next.length > 1) {
this.updateValue(next);
}
return;
}
}
}
/* pass your pattern in component regex e.g.
* regex = new RegExp(RegexPattern.WHOLE_NUMBER_PATTERN)
*/
if (next && !String(next).match(this.configuration.regex)) {
event.preventDefault();
return;
}
if (!!this.configuration.minFractionDigits && !!this.configuration.maxFractionDigits) {
if (!!next.split('\.')[1] && next.split('\.')[1].length > this.configuration.minFractionDigits) {
return this.validateFractionDigits(next, event);
}
}
this.innerInputValue = next;
this.updateValue(next);
}
private updateValue(newValue) {
this.onTouchedCallback();
this.onChangeCallback(newValue);
if (newValue) {
this.renderer.setAttribute(this.elementRef.nativeElement, 'value', newValue);
}
}
private validateFractionDigits(next, event) {
// create real-time pattern to validate min & max fraction digits
const regex = `^[-]?\\d+([\\.,]\\d{${this.configuration.minFractionDigits},${this.configuration.maxFractionDigits}})?$`;
if (!String(next).match(regex)) {
event.preventDefault();
return;
}
this.updateValue(next);
}
private concatAtIndex(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart) + event.key +
current.slice(event.currentTarget.selectionEnd);
}
private getBackSpaceValue(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart - 1) +
current.slice(event.currentTarget.selectionEnd);
}
private getDeleteValue(current: string, event) {
return current.slice(0, event.currentTarget.selectionStart) +
current.slice(event.currentTarget.selectionEnd + 1);
}
private transformCurrency(value) {
return this.currencyPipe.transform(value, this.configuration.currencyCode, this.configuration.display,
this.configuration.digitsInfo, this.locale);
}
private transformDecimal(value) {
return this.decimalPipe.transform(value, this.configuration.digitsInfo, this.locale);
}
private transformPercent(value) {
return this.decimalPipe.transform(value, this.configuration.digitsInfo, this.locale) + ' %';
}
private getFormattedValue(value) {
switch (this.configuration.angularPipe) {
case ('decimal'): {
return this.transformDecimal(value);
}
case ('currency'): {
return this.transformCurrency(value);
}
case ('percent'): {
return this.transformPercent(value);
}
default: {
return value;
}
}
}
}
----------------------------------
export const RegexPattern = Object.freeze({
PERCENTAGE_PATTERN: '^([1-9]\\d*(\\.)\\d*|0?(\\.)\\d*[1-9]\\d*|[1-9]\\d*)$', // e.g. '.12% ' or 12%
DECIMAL_PATTERN: '^(([-]+)?([1-9]\\d*(\\.|\\,)\\d*|0?(\\.|\\,)\\d*[1-9]\\d*|[1-9]\\d*))$', // e.g. '123.12'
CURRENCY_PATTERN: '\\$?[-]?[0-9]{1,3}(?:,?[0-9]{3})*(?:\\.[0-9]{2})?$', // e.g. '$123.12'
KEY_PATTERN: '^[a-zA-Z\\-]+-[0-9]+', // e.g. ABC-1234
WHOLE_NUMBER_PATTERN: '^([-]?([1-9][0-9]*)|([0]+)$)$' // e.g 1234
});

Finding Noun Phrases in sentiment analysis using stanford POS tagger

I am working on a sentiment analysis project, so I used the Stanford POS tagger to tag the sentences. I want to extract noun phrases from the sentences, but it only tags nouns.
How do I get noun phrases from that? I code in Java.
I searched online and found this way of forming a noun phrase:
For noun phrases, this pattern or regular expression is the following:
(Adjective | Noun)* (Noun Preposition)? (Adjective | Noun)* Noun
i.e. Zero or more adjectives or nouns, followed by an option group of a noun and a preposition, followed again by zero or more adjectives or nouns, followed by a single noun.
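One way to apply this pattern is to encode each POS tag as a single letter and run an ordinary regular expression over the resulting string. The Python sketch below assumes the sentence has already been tagged with Penn Treebank tags (as the Stanford tagger produces); the tags in the example are written by hand, and the helper names `tag_code` and `noun_phrases` are invented for the illustration.

```python
import re

# (Adjective | Noun)* (Noun Preposition)? (Adjective | Noun)* Noun,
# with A = adjective, N = noun, P = preposition.
NP_PATTERN = re.compile(r"[AN]*(?:NP)?[AN]*N")


def tag_code(tag):
    """Map a Penn Treebank tag to a one-letter class."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"
    return "O"  # anything else breaks a noun phrase


def noun_phrases(tagged):
    """Return the phrases matched by the pattern in a list of (word, tag) pairs."""
    codes = "".join(tag_code(tag) for _, tag in tagged)
    # Each letter corresponds to one token, so match offsets index into `tagged`.
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in NP_PATTERN.finditer(codes)]


# Hand-tagged input; a real run would take these pairs from the tagger.
print(noun_phrases([("the", "DT"), ("white", "JJ"), ("tiger", "NN")]))
```

The same encoding works with Java's java.util.regex, since the expression uses only character classes and an optional group.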
I was trying to code it using Java's regular expression library (regex), but couldn't get the desired result.
Does anyone have code for it?
I have coded this, and the solution is below.
It extracts all the noun phrases from a sentence, keeping only the nouns.
E.g. if the NP is "the white tiger", it will extract "white tiger".
public static void maketree(String sent, int sno, Sentences sen)
{
try
{
LexicalizedParser parser = LexicalizedParser.loadModel("stanford-parser-full-2014-01-04\\stanford-parser-3.3.1-models\\edu\\stanford\\nlp\\models\\lexparser\\englishPCFG.ser.gz");
String sent2 = "Picture Quality of this camera is very good";
String sent1[] = sent2.split(" ");
List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent1);
Tree x = parser.apply(rawWords);
x.indexLeaves();
System.out.println(x);
findNP(x,sen);
}
catch (Exception e)
{
e.printStackTrace();
}
}
public static void findNP(Tree t, Sentences sent)
{
if (t.label().value().equals("NP"))
{
noun(t,sent);
}
else
{
for (Tree child : t.children())
{
findNP(child,sent);
}
}
}
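// Note: the counter i and the sent.nouns array are not declared in this snippet;
// they are assumed to be defined elsewhere (for example a static int i and a String[] field).
public static void noun(Tree t,Sentences sent)
{
String noun="";
for(Tree temp : t.children())
{
String val = temp.label().value();
if(val.equals("NN") || val.equals("NNS") || val.equals("NNP") || val.equals("NNPS"))
{
Tree nn[] = temp.children();
String ss = Sentence.listToString(nn[0].yield());
// compare string contents with isEmpty(), not references with ==
if(noun.isEmpty())
{
noun = ss;
}
else
{
noun = noun+" "+ss;
}
}
else
{
// a non-noun child ends the current run of nouns
if(!noun.isEmpty())
{
sent.nouns[i++] = noun;
noun = "";
}
noun(temp,sent);
}
}
if(!noun.isEmpty())
{
sent.nouns[i++] = noun;
}
}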
Could you please check the link and comment on this? Could you please tell me whether
"the white tiger" would give the same result with your code above? The code is probably not complete, and that is why I am getting errors.
For example:
sent.nouns[i++] = noun; // sent.nouns seems to be undefined. Could you please post the complete code, or comment on the link below?
Here is the link:
Extract Noun phrase using stanford NLP
Thanks for the help.

regular expression for ipv4 address in CIDR notation

I am using the regular expression below to match an IPv4 address in CIDR notation.
[ \t]*(((2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)\.){3}(2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)(/(3[012]|[12]?[0-9])))[ \t]*
I have tested the above using http://regexpal.com/
It seems to match the following example: 192.168.5.10/24
However, when I use the same example in flex, it says "unrecognized rule". Is there some limitation in flex, in that it does not support all the features? The regex above seems pretty basic, without any extended features. Can someone point out why flex is not recognizing the rule?
Here is a short, self-contained example that demonstrates the problem:
IPV4ADDRESS [ \t]*(((2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)\.){3}(2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)(/(3[012]|[12]?[0-9])))[ \t]*
SPACE [ \t]
%x S_rule S_dst_ip
%%
%{
BEGIN S_rule;
%}
<S_rule>(dst-ip){SPACE} {
BEGIN(S_dst_ip);
}
<S_dst_ip>\{{IPV4ADDRESS}\} {
printf("\n\nMATCH [%s]\n\n", yytext);
BEGIN S_rule;
}
. { ECHO; }
%%
int main(void)
{
while (yylex() != 0)
;
return(0);
}
int yywrap(void)
{
return 1;
}
When I run flex test.l, it gives an "unrecognized rule" error. I want to match
dst-ip { 192.168.10.5/10 }
The "/" in your IPV4ADDRESS pattern needs to be escaped ("\/").
An un-escaped "/" in a flex pattern is the trailing context operator.
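As a sanity check that the expression itself is sound and that only flex's special treatment of "/" is at fault, the same pattern (without the surrounding whitespace parts) can be run in an engine where "/" has no special meaning. This is a Python sketch, not flex; Python's regex dialect differs from flex's, and the name `is_cidr` is made up for the example.

```python
import re

# The IPv4/CIDR pattern from the question; "/" is an ordinary character here.
IPV4_CIDR = re.compile(
    r"((2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)\.){3}"
    r"(2(5[0-5]|[0-4][0-9])|[01]?[0-9][0-9]?)"
    r"(/(3[012]|[12]?[0-9]))"
)


def is_cidr(text):
    """True if the whole string is an IPv4 address followed by /0 to /32."""
    return IPV4_CIDR.fullmatch(text) is not None


for candidate in ["192.168.5.10/24", "192.168.10.5/10", "256.1.1.1/8", "10.0.0.1/33"]:
    print(candidate, is_cidr(candidate))
```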

How To split a string in c# and keep the delimiter in the array while excluding white space in a name parser

This took me a while to figure out, so I am posting my results here in the question, as this is answered.
Question: How do I split a string using an array of possible delimiters in a name field, while keeping the delimiter in the split array and excluding the white space the split may create in the array?
Example: Sam Washington& Jenna
My issue was that the name parser I created was writing
Firstname:Sam
LastName : Jenna
Using the following code I was able to Parse it out like this
FirstName: Sam
Lastname : Washington
Firstname2 Jenna
Be careful, however: if you are going to use my list of joiners, do not include string values that can be found in common names, such as "AND" and "OR".
These would split your names. Ex: "Andy" would become "And", "y".
Ex2: "Gregory" would become "Greg", "or", "y".
Hope this helps someone. If you have questions, please feel free to shoot me a message.
/// <summary>
/// remove bad name parts
/// </summary>
/// <param name="parts">name parsed for review</param>
public static void CheckBadNames(ref string[] parts)
{
string[] BadName = new string[] {"LIFE", "ESTATE" ,"(",")","*","AN","LIFETIME","INTREST","MARRIED",
"UNMARRIED","MARRIED/UNMARRIED","SINGLE","W/","/W","THE","ET",
"ALS","AS", "TENANT","WIFE", "HUSBAND", "NOT", "DRIVE" ,"INSURED",
"EXCLUDED","DISABLED" ,"LICENSED","TRUSTEE","ATSOT","A T S O T",
"AKA", "-ATSOT","OF","DBA","EVOCABLE","FAMILY","INTEREST","MASTER"};
string[] joiners = new string[9] { "&", @"AND\", @"OR\", "\\", "&/OR", "AND/OR", "&-OR", "/", "OF/AND" };
Restart:
List<string> list = new List<string>(parts); //convert array to list
foreach (string part in list)
{
if (BadName.Any(s => part.ToUpper().Equals(s)) || part == "-")
{
list.Remove(part);
parts = list.ToArray();
goto Restart;
}
//check to see if any part ends with joiner
if (joiners.Any(s => part.ToUpper().EndsWith(s)))
{
//check if by ends with means that it is just a joiner
if (joiners.Any(s => part.ToUpper().Equals(s)))
{
continue;
}
else //name part ends with a joiner EX. Washington&
{
foreach (string div in joiners.Where(s => part.ToUpper().Contains(s))) // each string that contains a joiner
{
var temp = Regex.Split(part, "(" + Regex.Escape(div) + ")").Where(x => x != String.Empty); // escape the joiner so characters like "\" are matched literally; the capturing group keeps the joiner in the result; drop empty pieces
int pos = list.IndexOf(part);
list.Remove(part);
for (int i = 0; i < temp.Count(); i++)
{
list.Insert(pos + i, temp.ElementAt(i));
}
parts = list.ToArray();
goto Restart;
}
}
}
}
if (parts.Count() == 0)
{
return;
}
if (joiners.Any(s => list.Last().ToUpper().Equals(s))) //remove last part if is a joiner
{
list.Remove(list.Last());
}
parts = list.ToArray(); // convert list back to array
}
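The core trick in the code above is that splitting on a capturing group keeps the delimiter in the result. A compact Python sketch of the same idea follows; the joiner list is a symbol-only subset chosen for the example (so that names like "Andy" are not split), and `split_keep_joiners` is a made-up name.

```python
import re

# Symbol-only joiners (an assumption for this example), uppercase for comparison.
JOINERS = ["&/OR", "&-OR", "&", "/"]


def split_keep_joiners(name, joiners=JOINERS):
    """Split a name into words, keeping joiners as separate items."""
    # Try longer joiners first so "&/OR" is not split into "&" and "/".
    ordered = sorted(joiners, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(j) for j in ordered) + ")"
    known = {j.upper() for j in joiners}
    parts = []
    for piece in re.split(pattern, name, flags=re.IGNORECASE):
        if piece.upper() in known:
            parts.append(piece)           # the captured delimiter itself
        else:
            parts.extend(piece.split())   # plain text: split on white space, drop empties
    return parts


print(split_keep_joiners("Sam Washington& Jenna"))
```

Calling `str.split()` with no arguments discards leading, trailing and repeated white space, which covers the "excluding white space" part of the question without a separate clean-up pass.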

Regular expression for csv with commas and no quotes

I'm trying to parse a really complicated CSV, which is generated without any quotes for columns that contain commas.
The only hint I have is that commas with whitespace before or after them are part of the field.
Jake,HomePC,Microsoft VS2010, Microsoft Office 2010
Should be parsed to
Jake
HomePC
Microsoft VS2010, Microsoft Office 2010
Can anybody please advise on how to include "\s," and ",\s" in the column body?
If your language supports lookbehind assertions, split on
(?<!\s),(?!\s)
In C#:
string[] splitArray = Regex.Split(subjectString,
@"(?<!\s) # Assert that the previous character isn't whitespace
, # Match a comma
(?!\s) # Assert that the following character isn't whitespace",
RegexOptions.IgnorePatternWhitespace);
Split by r"(?<!\s),(?!\s)" (a lookbehind before the comma and a lookahead after it).
In Python you can do it like this:
import re
re.split(r"(?<!\s),(?!\s)", s) # s is your string
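As a quick check of the lookbehind/lookahead split `(?<!\s),(?!\s)` on the sample row from the question, here is a small self-contained Python sketch; the function name `split_row` is invented for the example.

```python
import re

# Split only on commas that have no whitespace directly before or after them.
FIELD_SEPARATOR = re.compile(r"(?<!\s),(?!\s)")


def split_row(row):
    """Split an unquoted CSV row, keeping ', ' and ' ,' inside a field."""
    return FIELD_SEPARATOR.split(row)


for field in split_row("Jake,HomePC,Microsoft VS2010, Microsoft Office 2010"):
    print(field)
```

A comma preceded by a space is kept as well, so a row like "a ,b,c" yields the two fields "a ,b" and "c".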
Try this. It gave me the desired result which you have mentioned.
StringBuilder testt = new StringBuilder("Jake,HomePC,Microsoft VS2010, Microsoft Office 2010,Microsoft VS2010, Microsoft Office 2010");
Pattern varPattern = Pattern.compile("[a-z0-9],[a-z0-9]", Pattern.CASE_INSENSITIVE);
Matcher varMatcher = varPattern.matcher(testt);
List<String> list = new ArrayList<String>();
int startIndex = 0, endIndex = 0;
boolean found = false;
while (varMatcher.find()) {
endIndex = varMatcher.start()+1;
if (startIndex == 0) {
list.add(testt.substring(startIndex, endIndex));
} else {
startIndex++;
list.add(testt.substring(startIndex, endIndex));
}
startIndex = endIndex;
found = true;
}
if (found) {
if (startIndex == 0) {
list.add(testt.substring(startIndex));
} else {
list.add(testt.substring(startIndex + 1));
}
}
for (String s : list) {
System.out.println(s);
}
Please note that the code is in Java.