Regex to select semicolons that are not enclosed in double quotes - regex

I have string like
a;b;"aaa;;;bccc";deef
I want to split string based on delimiter ; only if ; is not inside double quotes. So after the split, it will be
a
b
"aaa;;;bccc"
deef
I tried using look-behind, but I'm not able to find a correct regular expression for splitting.

Regular expressions are probably not the right tool for this. If possible, use a CSV library: specify ; as the delimiter and " as the quote character, and it should give you exactly the fields you are looking for.
That being said here is one approach that works by ensuring that there are an even number of quotation marks between the ; we are considering the split at and the end of the string.
;(?=(([^"]*"){2})*[^"]*$)
Example: http://www.rubular.com/r/RyLQyR8F19
This will break down if you can have escaped quotation marks within a string, for example a;"foo\"bar";c.
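As a rough illustration of the lookahead approach, here is a minimal Python sketch. It swaps the capturing groups for non-capturing ones, which is my adjustment rather than part of the regex above: Python's re.split would otherwise return the captured groups as extra list items. The name SPLIT_RE is just for this example.

```python
import re

# Split on ; only when an even number of double quotes follows it,
# i.e. when the ; is not inside a quoted field. The groups are
# non-capturing so that re.split does not add captured text to the result.
SPLIT_RE = re.compile(r';(?=(?:(?:[^"]*"){2})*[^"]*$)')

fields = SPLIT_RE.split('a;b;"aaa;;;bccc";deef')
print(fields)
```

As noted above, this still assumes there are no escaped quotation marks inside the quoted fields.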
Here is a much cleaner example using Python's csv module:
import csv
import io

# Parse the line as CSV with ; as the delimiter and " as the quote character.
reader = csv.reader(io.StringIO('a;b;"aaa;;;bccc";deef'),
                    delimiter=';', quotechar='"')
for row in reader:
    print('\n'.join(row))

Regular expressions will only get messier and break on even minor changes. You are better off using a CSV parser in whatever scripting language you use. Perl has a built-in module, Text::ParseWords (so you don't need to download anything from CPAN if there are any restrictions), that lets you specify the delimiter, so you are not limited to ,. Here is a sample snippet:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::ParseWords;
my $string = 'a;b;"aaa;;;bccc";deef';
my @ary = parse_line(q{;}, 0, $string);
print "$_\n" for @ary;
Output
a
b
aaa;;;bccc
deef

This is kind of ugly, but if you don't have \" inside your quoted strings (meaning you don't have strings that look like "foo bar \"badoo\" goo"), you can split on the " first. Counting from zero, every odd-numbered array element is then a quoted string, and every even-numbered element can be split into its component parts on the ; token.
If you do have \" in your strings, first convert those into some other temporary token, then convert it back after you've performed your operation.
Here's a fiddle...
http://jsfiddle.net/VW9an/
var str = 'abc;def;ghi"some other dogs say \\"bow; wow; wow\\". yes they do!"and another; and a fifth'
var strCp = str.replace(/\\"/g,"--##--");
var parts = strCp.split(/"/);
var allPieces = new Array();
for(var i in parts){
    if(i % 2 == 0){
        var innerParts = parts[i].split(/\;/)
        for(var j in innerParts)
            allPieces.push(innerParts[j])
    }
    else{
        allPieces.push('"' + parts[i] +'"')
    }
}
for(var a in allPieces){
    allPieces[a] = allPieces[a].replace(/--##--/g,'\\"');
}
console.log(allPieces)
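For comparison, here is a minimal Python sketch of a different technique: a single character-by-character scan instead of the split-on-quote trick above. It tracks whether the current position is inside quotes and treats a backslash as escaping the next character, so no temporary token is needed. The function name split_outside_quotes is hypothetical, and the backslash-escape convention is an assumption carried over from the examples in this thread.

```python
def split_outside_quotes(text, delimiter=";"):
    """Split text on delimiter, ignoring delimiters inside double quotes."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep an escaped character (for example \") verbatim.
            current.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            in_quotes = not in_quotes
        if ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


print(split_outside_quotes('a;b;"aaa;;;bccc";deef'))
print(split_outside_quotes(r'a;"foo\"bar";c'))
```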

Match All instead of Splitting
Answering long after the battle because no one used the way that seems the simplest to me.
Once you understand that Match All and Split are Two Sides of the Same Coin, you can use this simple regex:
"[^"]*"|[^";]+
See the matches in the Regex Demo.
The left side of the alternation | matches full quoted strings
The right side matches any chars that are neither ; nor "
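A minimal Python sketch of the match-all idea, using the regex above unchanged (FIELD_RE is just a name for this example):

```python
import re

# Left alternative: a complete quoted string.
# Right alternative: a run of characters that are neither ; nor ".
FIELD_RE = re.compile(r'"[^"]*"|[^";]+')

matches = FIELD_RE.findall('a;b;"aaa;;;bccc";deef')
print(matches)
```

One thing to keep in mind with this approach: empty fields produce no match, so an input such as a;;b yields two items rather than three.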

Related

regex Match a capture group's items only once

So I'm trying to split a string in several options, but those options are allowed to occur only once. I've figured out how to make it match all options, but when an option occurs twice or more it matches every single option.
Example string: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again
Regex: /-{1,2}(split1|split2|split3) [\w|\s]+/g
Right now it is matching all cases and I want it to match --split1, --split2 and --split3 only once (so --split1 split1 again will not be matched).
I'm probably missing something really straightforward, but anyone care to help out? :)
Edit:
Decided to handle the extra occurrences in a script and not through RegEx, for easier error handling. Thanks for the help!
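Handling the repeats in a script, as described in the edit, could look roughly like the following Python sketch. It is not the asker's actual code: the function name first_occurrences is hypothetical, and the character class is written as [\w\s] because the | inside the original [\w|\s] is a literal pipe, not an alternation.

```python
import re

# One match per "--name value" pair; the value runs until the next dash.
OPTION_RE = re.compile(r"-{1,2}(split1|split2|split3) ([\w\s]+)")


def first_occurrences(text):
    """Return {option: value}, keeping only the first match of each option."""
    seen = {}
    for match in OPTION_RE.finditer(text):
        name = match.group(1)
        value = match.group(2).strip()
        if name not in seen:
            seen[name] = value
    return seen


example = ("--split1 testsplit 1 --split2 test split 2 "
           "--split3 t e s t split 3 --split1 split1 again")
print(first_occurrences(example))
```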
EDIT: Somehow I ended up here from the PHP section, hence the PHP code. The same principles apply to any other language, however.
I realise that OP has said they have found a solution, but I am putting this here for future visitors.
function splitter(string $str, int $splits, $split = "--split")
{
    $a = array();
    for ($i = $splits; $i > 0; $i--) {
        if (strpos($str, "$split{$i} ") !== false) {
            $a[] = substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} "));
            $str = substr($str, 0, strpos($str, "$split{$i} "));
        }
    }
    return array_reverse($a);
}
This function will take the string to be split, as well as how many segments there will be. Use it like so:
$array = splitter($str, 3);
It will successfully split the string around the $split parameter.
The parameters are used as follows:
$str
The string that you want to split. In your instance it is: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again.
$splits
This is how many elements of the array you wish to create. In your instance, there are 3 distinct splits.
If a split is not found, then it will be skipped. For instance, if you were to have --split1 and --split3 but no --split2 then the array will only be split twice.
$split
This is the string that will be the delimiter of the array. Note that it must be as specified in the question. This means that if you want to split using --myNewSplit then the function will append a number from 1 to $splits to that string.
All elements end with a space since the function looks for $split and you have a space before each split. If you don't want to have the trailing whitespace then you can change the code to this:
$a[] = trim(substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} ")));
Also, notice that strpos looks for a space after the delimiter. Again, if you don't want the space then remove it from the string.
The reason I have used a function is that it will make it flexible for you in the future if you decide that you want to have four splits or change the delimiter.
Obviously, if you no longer want a numerically changing delimiter then the explode function exists for this purpose.
-{1,2}((split1)|(split2)|(split3)) [\w|\s]+
Something like this? This will, in this case, create 3 arrays which all will have an array of elements of the same name in them. Hope this helps

How can I use regex with sed (or equivalent unix command line tool) to fix title case in LaTeX headings?

regular expression attempt
(\\section\{|\\subsection\{|\\subsubsection\{|\\paragraph[^{]*\{)(\w)\w*([ |\}]*)
search text
\section{intro to installation of apps}
\subsection{another heading for \myformatting{special}}
\subsubsection{good morning, San Francisco}
\paragraph{installation of backend services}
desired output
All initial characters are capitalized, except for articles, prepositions, conjunctions, and the other parts of speech that usually stay lower case in titles.
I suppose I should really narrow this down, so let me borrow from the U.S. Government Printing Office Style Manual:
The articles a, an, and the; the prepositions at, by, for, in, of, on, to, and up; the conjunctions and, as, but, if, or, and nor; and the second element of a compound numeral are not capitalized.
Page 41
\subsection{Installation guide for the server-side app \myapp{webgen}}
changes to
\subsection{Installation Guide for the Server-side App \myapp{Webgen}}
OR
\subsection{Installation Guide for the Server-side App \myapp{webgen}}
How would you name this type of string modification?
Applying REGEX to a string between strings?
Applying REGEX to a part of a string when that part falls between two other strings of characters?
Applying REGEX to a substring that occurs between two
other substrings within a string?
<something else>
problem
I match each LaTeX heading command, including the {. This means that my expression does not match more than the first word of the actual heading text. I cannot surround the whole heading code with an "OR space" because then I will find nearly every word in the document. Also, I have to be careful of formatting commands within the headings themselves.
other helpful related questions
Uppercasing First Letter of Words Using SED
https://superuser.com/questions/749164/how-to-use-regex-to-capitalise-the-first-letter-of-each-word-in-a-sentence
Using Sed to capitalize the first letter of each word
Capitalize first letter of each word in a selection using vim
So it seems to me as if you need to implement pseudo-code like this:
Are we on the first word? If yes, capitalize it and move on.
Is the current word "reserved"? If yes, lower it and move on.
Is the current word a numeral? If yes, lower it and move on.
Are we still in the list? If yes, print the line verbatim and move on.
One other helpful rule might be to leave fully upper-case words as they are, just in case they're acronyms.
The following awk script might do what you need.
#!/usr/bin/awk -f
# Capitalize the first letter of a word and lower-case the rest.
function toformal(subject) {
  return toupper(substr(subject,1,1)) tolower(substr(subject,2))
}
BEGIN {
  # Reserved word list gets split into an array for easy matching.
  reserved="a an the at by for in of on to up and as but if or nor";
  split(reserved,a_reserved," "); for(i in a_reserved) r[a_reserved[i]]=1;
  # Same with the list of compound numerals. If this isn't what you mean, say so.
  numerals="hundred thousand million billion";
  split(numerals,a_numerals," "); for(i in a_numerals) n[a_numerals[i]]=1;
}
# This awk condition matches the lines we're interested in modifying.
/^\\(section|subsection|subsubsection|paragraph)[{]/ {
  # Separate the particular section and the text, then split text to an array.
  # The text runs up to the last closing brace, so nested commands survive.
  section=$0; sub(/\\/,"",section); sub(/[{].*/,"",section);
  text=$0; sub(/^[^{]*[{]/,"",text); sub(/[}][^}]*$/,"",text);
  size=split(text,atext,/[[:space:]]/);
  # First word...
  newtext=toformal(atext[1]);
  for(i=2; i<=size; i++) {
    # Reserved word...
    if (r[tolower(atext[i])]) { newtext=newtext " " atext[i]; continue; }
    # Compound numerals...
    if (n[tolower(atext[i])]) { newtext=newtext " " tolower(atext[i]); continue; }
    # # Acronyms maybe...
    # if (atext[i] == toupper(atext[i])) { newtext=newtext " " atext[i]; continue; }
    # Everything else...
    newtext=newtext " " toformal(atext[i]);
  }
  # Re-wrap the converted text in its heading command before printing.
  print "\\" section "{" newtext "}";
  next;
}
# Print the line if we get this far. This is a non-condition with
# a print-only statement.
1
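For comparison, the same rules can be sketched in Python. This is a minimal illustration under stated assumptions, not a full title-casing tool: the names SMALL_WORDS, title_case and fix_heading are hypothetical, the small-word list is the one quoted from the style manual in the question, words that start with a backslash (nested commands such as \myapp{webgen}) are left untouched, and fully upper-case words are kept as possible acronyms.

```python
import re

# Articles, prepositions and conjunctions that stay lower case.
SMALL_WORDS = {
    "a", "an", "the",
    "at", "by", "for", "in", "of", "on", "to", "up",
    "and", "as", "but", "if", "or", "nor",
}

# A heading command, its text, and the final closing brace.
HEADING = re.compile(
    r"^(\\(?:section|subsection|subsubsection|paragraph)\{)(.*)(\})\s*$"
)


def title_case(text):
    """Capitalize each word except small words and LaTeX commands."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if word.startswith("\\"):
            result.append(word)            # nested command: leave alone
        elif index > 0 and word.lower() in SMALL_WORDS:
            result.append(word.lower())    # small word, not in first position
        elif word.isupper():
            result.append(word)            # probably an acronym
        else:
            result.append(word[:1].upper() + word[1:])
    return " ".join(result)


def fix_heading(line):
    """Title-case the text of a heading line; return other lines unchanged."""
    match = HEADING.match(line)
    if not match:
        return line
    return match.group(1) + title_case(match.group(2)) + match.group(3)


print(fix_heading(r"\subsection{Installation guide for the server-side app \myapp{webgen}}"))
```

Leaving nested commands alone corresponds to the second of the two desired outputs in the question.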
Here is an example of how you could do it in Perl using the module Lingua::EN::Titlecase and recursive regular expressions:
use strict;
use warnings;
use Lingua::EN::Titlecase;
my $tc = Lingua::EN::Titlecase->new();
my $data = do {local $/; <> };
my ($kw_regex) = map { qr/$_/ }
    join '|', qw(section subsection subsubsection paragraph);
$data =~ s/(\\(?: $kw_regex))(\{(?:[^{}]++|(?2))*\})/title_case($tc,$1,$2)/gex;
print $data;
sub title_case {
    my ($tc, $p1, $p2) = @_;
    $p2 =~ s/^\{//;
    $p2 =~ s/\}$//;
    if ($p2 =~ /\\/ ) {
        while ($p2 =~ /\G(.*?)(\\.*?)(\{(?:[^{}]++|(?3))*\})/ ) {
            my $next_pos = $+[0];
            substr($p2, $-[1], $+[1] -$-[1], $tc->title($1));
            substr($p2, $-[3], $+[3] -$-[3], title_case($tc,'',$3));
            pos($p2) = $next_pos;
        }
        $p2 =~ s/\G(.+)$/$tc->title($1)/e;
    }
    else {
        $p2 = $tc->title($p2);
    }
    return $p1 . '{' . $p2 . '}';
}

Splitting a string based on positions with regex

I need to convert this (date) string "12112014" to "12.11.2014".
What I would like to do is:
Split off the first 2 characters "12", add ".",
then split the string at positions 3-4 to get "11", add ".",
and finally split off the last 4 characters (positions 5-8) to get "2014".
I already found out how to get the first 2 characters ( "^\d{2}" ), but I failed to get characters based on a position.
Whatever the programming language, you should extract the digit groups from the string and then join them with a ".".
In perl, it can be done as :
$_ = '12112014';
s/(\d{2})(\d{2})(\d{4})/$1.$2.$3/;
print "$_";
Without you specifying the language you're after, I've picked javascript:
var s = '12012011';
var s2 = s.replace(/(\d{2})(\d{2})(\d{4})/,'$1.$2.$3');
console.log(s2); // prints "12.01.2011"
The gist of it is that you use () to specify groups inside your regular expression and then can use the groups in your replace expression.
Same in Java:
String s = "12012011";
String s2 = s.replaceAll("(\\d{2})(\\d{2})(\\d{4})", "$1.$2.$3");
System.out.println(s2);
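Since the question asks about positions, here is a small Python sketch showing both routes side by side: plain slicing by position, and the same capture-group replacement used in the answers above. The variable names are only for illustration.

```python
import re

raw = "12112014"

# By position: characters 1-2, 3-4 and 5-8 (zero-based slices).
by_position = ".".join([raw[0:2], raw[2:4], raw[4:8]])

# By capture groups: the same pattern as in the answers above.
by_regex = re.sub(r"^(\d{2})(\d{2})(\d{4})$", r"\1.\2.\3", raw)

print(by_position)
print(by_regex)
```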
I don't think that you could do that only with split.
You could expand your expression to:
"(^(\d{2})(\d{2})(\d{4}))"
Then access the groups with the Regex language of your choice and build the string you want.
Note that, regex learning aside, you could alternatively parse the original string into a strongly typed Date or DateTime value and output it using the appropriate locale.
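A minimal Python sketch of that alternative, assuming the digits are in day-month-year order (the question does not say which order is intended):

```python
from datetime import datetime

# Parse the raw digits as a real date, then format it with dots.
# The day-month-year order is an assumption about the input.
parsed = datetime.strptime("12112014", "%d%m%Y")
formatted = parsed.strftime("%d.%m.%Y")

print(formatted)
```

Unlike the pure regex replacement, this rejects strings that are not valid dates, because strptime raises ValueError for them.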

RegEx for a price in £

I have: \£\d+\.\d\d
It should find: £6.95, £16.95, etc.
+ is one or more
\. is the dot
\d is for a digit
Am I wrong? :(
JavaScript for Greasemonkey
// ==UserScript==
// @name CurConvertor
// @namespace CurConvertor
// @description noam smadja
// @include http://www.zavvi.com/*
// ==/UserScript==
textNodes = document.evaluate(
    "//text()",
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
var searchRE = /\£[0-9]\+.[0-9][0-9];
var replace = 'pling';
for (var i=0;i<textNodes.snapshotLength;i++) {
    var node = textNodes.snapshotItem(i);
    node.data = node.data.replace(searchRE, replace);
}
when i change the regex to /Free for example it finds and changes. but i guess i am missing something!
Had this written up for your last question just before it was deleted.
Here are the problems you're having with your GM script.
You're checking absolutely every text node on the page for some reason. This isn't causing it to break, but it's unnecessary and slow. It would be better to look for text nodes inside .price nodes and .rrp .strike nodes instead.
When creating new regexp objects in this way, backslashes must be escaped, e.g.:
var searchRE = new RegExp('\\d\\d','gi');
not
var searchRE = new RegExp('\d\d','gi');
So you can add the backslashes, or create your regex like this:
var searchRE = /\d\d/gi;
Your actual regular expression is only checking for numbers like ##ANYCHARACTER##, and will ignore £5.00 and £128.24
Your replacement needs to be either a string or a callback function, not a regular expression object.
Putting it all together
textNodes = document.evaluate(
    "//p[contains(@class,'price')]/text() | //p[contains(@class,'rrp')]/span[contains(@class,'strike')]/text()",
    document,
    null,
    XPathResult.UNORDERED_NODE_SNAPSHOT_TYPE,
    null);
var searchRE = /£(\d+\.\d\d)/gi;
var replace = function(str,p1){return "₪" + ( (p1*5.67).toFixed(2) );}
for (var i=0,l=textNodes.snapshotLength;i<l;i++) {
    var node = textNodes.snapshotItem(i);
    node.data = node.data.replace(searchRE, replace);
}
Changes:
XPath now includes only p.price and p.rrp span.strike nodes
Search regular expression created with /regex/ instead of new RegExp
Search variable now includes target currency symbol
Replace variable is now a function that replaces the currency symbol with a new symbol and multiplies the first captured group (the amount) by 5.67
for loop sets a variable to the snapshot length at the beginning of the loop, instead of checking textNodes.snapshotLength at the beginning of every loop.
Hope that helps!
[edit]Some of these points don't apply, as the original question changed a few times, but the final script is relevant, and the points may still be of interest to you for why your script was failing originally.
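The callback-replacement idea is not specific to JavaScript. As a rough sketch, here is the same conversion in Python, where re.sub also accepts a function as the replacement. The 5.67 rate and the two sample amounts come from the answer above; the names RATE, PRICE_RE and convert are hypothetical.

```python
import re

RATE = 5.67  # conversion factor taken from the script above
PRICE_RE = re.compile(r"£(\d+\.\d\d)")


def convert(match):
    """Replace one £ amount with the converted amount and the new symbol."""
    amount = float(match.group(1)) * RATE
    return "₪" + format(amount, ".2f")


converted = PRICE_RE.sub(convert, "£5.00 and £128.24")
print(converted)
```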
You are not wrong, but there are a few things to watch out for:
The £ sign is not a standard ASCII character, so you may have encoding issues, or you may need to enable a Unicode option on your regular expression.
The use of \d is not supported in all regular expression engines. [0-9] or [[:digit:]] are other possibilities.
To get a better answer, say which language you are using, and preferably also post your source code.
£[0-9]+(,[0-9]{3})*\.[0-9]{2}$
This will match anything from £d.dd to £d[dd]*,ddd.dd, so it can handle amounts from single pounds up to millions.
The above regexp is not strict about where the commas go. For example, it also accepts £1123213123.23.
Now, if you want a stricter regexp, and you're 100% sure that the prices will follow the comma and period conventions, then use
£[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}$
Try your regexps here to see what works for you and what not http://tools.netshiftmedia.com/regexlibrary/
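To see the difference between the two patterns, here is a small Python check. The patterns are the two from this answer; the sample prices are made up for illustration, apart from the ungrouped number mentioned above.

```python
import re

LOOSE = re.compile(r"£[0-9]+(,[0-9]{3})*\.[0-9]{2}$")
STRICT = re.compile(r"£[0-9]{1,3}(,[0-9]{3})*\.[0-9]{2}$")

samples = ["£6.95", "£1,234,567.00", "£1123213123.23"]

# For each sample: (accepted by LOOSE, accepted by STRICT)
results = {s: (bool(LOOSE.search(s)), bool(STRICT.search(s))) for s in samples}

for sample, (loose_ok, strict_ok) in results.items():
    print(sample, loose_ok, strict_ok)
```

Only the strict pattern rejects the long run of digits without thousands separators.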
It depends on what flavour of regex you are using. What is the programming language?
Some older regex flavours require the + to be escaped, sed and vi for example.
Also, some older regex flavours do not recognise \d as matching a digit.
Most modern regex flavours follow the Perl syntax, and £\d+\.\d\d should do the trick, but it also depends on how the £ is encoded: if the string you are matching encodes it differently from the regex, it will not match.
Here is an example in Python 2, where the £ character is represented differently in a regular (byte) string and a unicode string (prefixed with a u):
>>> "£"
'\xc2\xa3'
>>> u"£"
u'\xa3'
>>> import re
>>> print re.match("£", u"£")
None
>>> print re.match(u"£", "£")
None
>>> print re.match(u"£", u"£")
<_sre.SRE_Match object at 0x7ef34de8>
>>> print re.match("£", "£")
<_sre.SRE_Match object at 0x7ef34e90>
>>>
£ isn't an ASCII character, so you need to work out encodings. Depending on the language, you will either need to escape the byte(s) of £ in the regex, or convert all the strings into Unicode before applying the regex.
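The encoding point can be reproduced in Python 3, where text strings are Unicode and byte strings are separate. This is a minimal sketch with a made-up sample string: matching text against text just works, while matching bytes against bytes only works when the pattern and the data use the same encoding.

```python
import re

text = "Price: £6.95"
pattern = r"£\d+\.\d\d"

# str pattern against str text: no encoding involved, so this matches.
str_match = re.search(pattern, text)

# bytes pattern against bytes text: both sides must use the same encoding.
utf8_pattern = pattern.encode("utf-8")  # £ becomes the two bytes C2 A3
same_encoding = re.search(utf8_pattern, text.encode("utf-8"))
mixed_encoding = re.search(utf8_pattern, text.encode("latin-1"))  # £ is the single byte A3

print(str_match.group())
print(same_encoding is not None, mixed_encoding is None)
```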
In Ruby you could just write the following
/£\d+\.\d{2}/
Using the braces to specify number of digits after the point makes it slightly clearer

How do I assign many values to a particular Perl variable?

I am writing a script in Perl which searches for a motif (substring) in a protein sequence (string). The motif sequence to be searched (or substring) is hhhDDDssEExD, where:
h is any hydrophobic amino acid
s is any small amino acid
x is any amino acid
h,s,x can have more than one value separately
Can more than one value be assigned to one variable? If yes, how should I do that? I want to assign a list of multiple values to a variable.
It seems like you want some kind of pattern matching. This can be done with strings using regular expressions.
You can use character classes in your regular expression. The classes you mentioned would be:
h -> [VLIM]
s -> [AG]
x -> [A-IK-NP-TV-Z]
The last one means "A to I, K to N, P to T, V to Z".
The regular expression for your example would be:
/[VLIM]{3}D{3}[AG]{2}E{2}[A-IK-NP-TV-Z]D/
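To illustrate how "many values for one symbol" turns into a pattern, here is a small Python sketch that builds the regex from the motif string. The character classes are the ones suggested in this answer; the protein sequence and the names CLASSES and motif_to_regex are made up for the example.

```python
import re

# Each motif symbol stands for a set of residues, written as a character class.
CLASSES = {
    "h": "[VLIM]",
    "s": "[AG]",
    "x": "[A-IK-NP-TV-Z]",
}


def motif_to_regex(motif):
    """Expand class symbols; any other letter matches itself."""
    return "".join(CLASSES.get(symbol, symbol) for symbol in motif)


pattern = motif_to_regex("hhhDDDssEExD")
sequence = "MKVLIDDDAGEEKDQ"  # hypothetical sequence
found = re.search(pattern, sequence)

print(pattern)
print(found.group() if found else None)
```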
I am no great expert in perl, so there is quite possibly a quicker way to this, but it seems like the match operator "//" in list context is what you need. When you assign the result of a match operation to a list, the match operator takes on list context and returns a list with each of the parenthesis delimited sub-expressions. If you specify global matches with the "g" flag, it will return a list of all the matches of each sub-expression. Example:
# print a list of each match for "x" in "xxx"
@aList = ("xxx" =~ /(x)/g);
print(join(".", @aList));
Will print out
x.x.x
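For readers more at home in Python, the closest equivalent of a global match in list context is re.findall, which returns every match of the single capture group as a list:

```python
import re

# Collect every match of the capture group, then join them with dots.
all_matches = re.findall(r"(x)", "xxx")
joined = ".".join(all_matches)

print(joined)
```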
I'm assuming you have a regular expression for each of those 5 types h, D, s, E, and x. You didn't say whether each of these parts is a single character or multiple, so I'm going to assume they can be multiple characters. If so, your solution might be something like this:
$h = ""; # Insert regex to match "h"
$D = ""; # Insert regex to match "D"
$s = ""; # Insert regex to match "s"
$E = ""; # Insert regex to match "E"
$x = ""; # Insert regex to match "x"
# Capture each whole repeated run; a quantified group such as ($h){3}
# would keep only its last repetition.
$sequenceRE = "((?:$h){3})((?:$D){3})((?:$s){2})((?:$E){2})($x)($D)";
if ($line =~ /$sequenceRE/) {
    $hPart = $1;
    $sPart = $3;
    $xPart = $5;
    @hValues = ($hPart =~ /($h)/g);
    @sValues = ($sPart =~ /($s)/g);
    @xValues = ($xPart =~ /($x)/g);
}
I'm sure there is something I've missed, and there are some subtleties of perl that I have overlooked, but this should get you most of the way there. For more information, read up on perl's match operator, and regular expressions.
I could be way off, but it sounds like you want an object with a built in method to output as a string.
If you start with a string, like the one you mentioned, you could pass the string to the class as a new object, use regular expressions like everyone has already suggested to parse out the chunks that you would then assign as variables to that object. Finally, you could have it output a string based on the variables of that object, for instance:
$string = "COHOCOHOCOHOCOHOCOHOC";
$sugar = new Organic($string);
class Organic {
    public $hydro;
    public $carb;
    function __construct($chem) {
        $hydro_find = "OHO";
        $carb_find = "C";
        // preg_find is a placeholder for whatever matching you need;
        // it is not a built-in PHP function.
        $this->hydro = preg_find($hydro_find, $chem);
        $this->carb = preg_find($carb_find, $chem);
    }
    function __toString() {
        return $this->carb . "=" . $this->hydro;
    }
}
echo $sugar;
Okay, that kind of fell apart in the end, and it was pseudo-php, not perl. But if I understand your question correctly, you are looking for a way to get all of the info from the string but keep it tied to that string. That would be objects and classes.
You probably want an array (or arrayref) or a pattern (qr//).
Or maybe Quantum::Superpositions.