How to extract line numbers from a multi-line string in Vim? - regex

In my opinion, Vimscript does not have a lot of features for manipulating strings.
I often use matchstr(), substitute(), and less often strpart().
Perhaps there is more than that.
For example, what is the best way to remove all text between line numbers in the following string a?
let a = "\%8l............\|\%11l..........\|\%17l.........\|\%20l...." " etc.
I want to keep only the digits and put them in a list:
['8', '11', '17', '20'] " etc.
(Note that the text between line numbers can be different.)

You're looking for split()
echo split(a, '[^0-9]\+')
EDIT:
Given the new constraint: only the numbers from \%d\+l, I'd do:
echo map(split(a, '|'), "matchstr(v:val, '^%\\zs\\d\\+\\zel')")
NB: your Vim string is probably not written the way you intend. In a double-quoted string the backslash starts an escape sequence, so you would need two backslashes to get a literal one; to keep a single backslash as typed, write the string with single quotes.
So, with
let b = '\%8l............\|\%11l..........\|\%17l.........\|\%20l....'
it becomes
echo map(split(b, '\\|'), "matchstr(v:val, '^\\\\%\\zs\\d\\+\\zel')")

One can take advantage of the substitute with an expression feature (see
:help sub-replace-\=) to run over all of the target matches, appending them
to a list.
:let l=[] | call substitute(a, '\\%\(\d\+\)l', '\=add(l,submatch(1))[1:0]', 'g')
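For comparison, here is a minimal sketch of the same "collect every captured number" idea outside Vim, in Python. It assumes the string has the same content as the single-quoted variable b above; the name line_numbers is only illustrative.

```python
import re

# Same content as the single-quoted Vim string `b` from the answer above.
b = r'\%8l............\|\%11l..........\|\%17l.........\|\%20l....'

# A literal backslash and percent sign, then the captured digits, then "l".
# findall() returns the contents of the single capture group for every match.
line_numbers = re.findall(r'\\%(\d+)l', b)
print(line_numbers)  # ['8', '11', '17', '20']
```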

Related

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If you really have such a simple structure always, you can use splitting with "," (yes, with quotes) after discarding first number and comma
If not, you can use a very simple state machine that parses your input from left to right. You will have two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) way if you already know them, as they are basically equivalent to a state machine, just in another form.
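A rough sketch of that two-state idea in Python might look like the following. The function name split_csv_line is made up for illustration, and the sketch assumes quotes are never escaped inside a field (a doubled "" would simply be dropped), so treat it as a starting point rather than a full CSV parser.

```python
def split_csv_line(line, delimiter=",", quote='"'):
    """Split one CSV line, ignoring delimiters that appear inside quotes."""
    fields = []
    current = []
    inside_quotes = False  # the only state we track
    for ch in line:
        if ch == quote:
            # Toggle between the two states and drop the quote itself.
            inside_quotes = not inside_quotes
        elif ch == delimiter and not inside_quotes:
            # A delimiter outside quotes ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))  # the last field has no trailing delimiter
    return fields


print(split_csv_line('10,"nikhil,khandare","sachin","rahul",viru'))
# ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
```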

Regular expression to match CSV delimiters

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:
1,"abcd",2,"de,fg",3,"hijk"
I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.
See my post that solves this problem for more detail.
^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$ will match the whole line; you can then use match.Groups[1].Captures to get your data out (without the quotes). Also, I let "My name is ""in quotes""" be a valid string.
CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.
What language are you using?
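As one illustration of "use the existing solution", and assuming Python happens to be the language in use, the standard csv module handles the sample line from the question, as well as a field made of a single escaped double quote:

```python
import csv

line = '1,"abcd",2,"de,fg",3,"hijk"'

# csv.reader accepts any iterable of lines, so a one-element list is enough.
row = next(csv.reader([line]))
print(row)  # ['1', 'abcd', '2', 'de,fg', '3', 'hijk']

# A doubled quote inside a quoted field is unescaped to a single quote.
tricky = next(csv.reader(['"""",,"",a,"a,b"']))
print(tricky)  # ['"', '', '', 'a', 'a,b']
```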
As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).
I suggest the tool CSVFIX as likely to do what you need.
To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):
"""",,"",a,"a,b"
Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:
"",,"",a",b c",
The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!
This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:
use strict;
use warnings;
my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );
foreach my $string (@list)
{
print "Pattern: <<$string>>\n";
while ($string =~ m/ (?: " ( (?:""|[^"])* ) " | ( [^,"] [^,]* ) | ( .? ) )
(?: $ | , ) /gx)
{
print "Found QF: <<$1>>\n" if defined $1;
print "Found PF: <<$2>>\n" if defined $2;
print "Found EF: <<$3>>\n" if defined $3;
}
}
Note that as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out enclosing double quotes and nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!
Output:
Pattern: <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern: <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>
We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct.
Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
Without thinking too hard, I would do something like [0-9]+|"[^"]*" to match everything except the comma delimiters. Would that do the trick?
Without context it's impossible to give a more specific solution.
Andy's right: correctly parsing CSV is a lot harder than you probably realise, and has all kinds of ugly edge cases. I suspect that it's mathematically impossible to correctly parse CSV with regexes, particularly those understood by sed.
Instead of sed, use a Perl script that uses the Text::CSV module from CPAN (or the equivalent in your preferred scripting language). Something like this should do it:
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new ( { binary => 1, eol => $/ } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
my $rows = $csv->getline_all(STDIN);
for my $row (@$rows) {
say join("\t", @$row);
}
That assumes that you don't have any tab characters embedded in your data, of course - perhaps it would be better to do the subsequent stages in a Real Scripting Language as well, so you could take advantage of proper lists?
I know this is old, but this RegEx works for me:
/(\"[^\"]+\")|[^,]+/g
It could potentially be used with any language. I tested it in JavaScript, so the g is just the global modifier. It works even with messed-up lines (extra quotes), but empty fields are not dealt with.
Just sharing, maybe this will help someone.

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
More information is available at regular-expressions.info; for a comparison of lookaround support, scroll about two-thirds of the way down this page.
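As a sketch of what the lookbehind looks like in practice, here it is in Python, one of the flavors that supports it as long as the lookbehind has a fixed width. The variable names are illustrative, and the sample text is the output from the question.

```python
import re

output = """Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3"""

# Positive lookbehind: match digits only when preceded by "failures = ".
# The prefix is checked but is not part of the match itself.
match = re.search(r'(?<=failures = )[0-9]+', output)
failures = int(match.group()) if match else None
print(failures)  # 3

# Where lookbehind is unavailable, a capture group gives the same number.
match2 = re.search(r'failures = ([0-9]+)', output)
failures2 = int(match2.group(1)) if match2 else None
```

Either way the number comes out without any manual trimming of the "failures = " prefix.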
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

regular expression matching all 8 character strings except "00000000"

I am trying to figure out a regular expression which matches any string with 8 symbols, which doesn't equal "00000000".
can any one help me?
thanks
In Perl regexes, at least, you can use a negative lookahead assertion: ^(?!0{8}).{8}$, but personally I'd rather write it like so:
length $_ == 8 and $_ ne '00000000'
Also note that if you do use the regexp, depending on the language you might need a flag to make the dot match newlines as well, if you want that. In Perl, that's the /s flag, for "single-line mode".
Unless you are being forced into it for some reason, this is not a regex problem. Just use len(s) == 8 && s != "00000000" or whatever your language uses to compare strings and lengths.
If you need a regex, ^(?!0{8})[A-Za-z0-9]{8}$ will match a string of exactly 8 characters. Changing the values inside the [] will allow you to set the accepted characters.
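For anyone who wants to try the negative lookahead, here is a small sketch in Python that also compares it against the plain length-and-inequality check. The helper name is_valid and the sample strings are just illustrative; re.DOTALL plays the role of the "dot matches newline" flag mentioned in another answer.

```python
import re

# Negative lookahead: fail immediately if the string starts with eight zeros,
# otherwise require exactly eight characters of any kind.
pattern = re.compile(r'(?!0{8}).{8}', re.DOTALL)


def is_valid(s):
    # fullmatch() anchors the pattern at both ends, like ^...$ would.
    return pattern.fullmatch(s) is not None


samples = ['00000000', '00A00000', '10000000', '000000001', '010000']
for s in samples:
    # The regex and the plain check should always agree.
    plain = len(s) == 8 and s != '00000000'
    print(s, is_valid(s), plain)
```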
As mentioned in the other answers, regular expressions are not the right tool for this task. I suspect it is a homework, thus I'll only hint a solution, instead of stating it explicitly.
The regexp "any 8 symbols except 00000000" may be broken down as a sum of eight regexps in the form "8 symbols with non-zero symbol on the i-th position". Try to write down such an expression and then combine them into one using alternative ("|").
Unless you have unspecified requirements, you really don't need a regular expression for this:
if len(myString) == 8 and myString != "00000000":
...
(in the language of your choice, of course!)
If you need to extract all eight-character strings not equal to "00000000" from a larger string, you could use
"(?=.{8})(?!0{8})."
to identify the first character of each sequence and extract eight characters starting with its index.
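Here is a short sketch of that extraction in Python, assuming overlapping eight-character windows are wanted; the sample text and variable names are made up for illustration.

```python
import re

text = "0000000012"

# Each match is a single character that (a) has at least eight characters
# starting at its position and (b) does not start a run of eight zeros.
starts = [m.start() for m in re.finditer(r'(?=.{8})(?!0{8}).', text)]

# Slice out the eight characters beginning at each matching index.
windows = [text[i:i + 8] for i in starts]
print(windows)  # ['00000001', '00000012']
```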
Of course, one would simply check
if stuff != '00000000'
...
but for the record, one could easily employ
heavyweight regex (in Perl) for that ;-)
...
use re 'eval';
my @strings = qw'00000000 00A00000 10000000 000000001 010000';
my $L = 8;
print map "$_ - ok\n",
grep /^(.{$L})$(??{$^N ne '0'x$L ? '' : '^$'})/,
@strings;
...
prints
00A00000 - ok
10000000 - ok
go figure ;-)
Regards
rbo
Wouldn't ([1-9]**|\D*){8} do it? Or am I missing something here (which is actually just the inverse of ndim's, which seems like it oughta work).
I am assuming the characters were chosen to include more than digits.
Ok so that was wrong, so Professor Bolo did I get a passing grade? (I love reg expressions so I am really curious).
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '00000000'):
... print 'match'
...
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10000000'):
... print 'match'
match
>>> if re.match(r"(?:[^0]{8}?|[^0]{7}?|[^0]{6}?|[^0]{5}?|[^0]{4}?|[^0]{3}?|[^0]{2}?|[^0]{1}?)", '10011100'):
... print 'match'
match
>>>
That work?

what do I use to match MS Word chars in regEx

I need to find and delete all the non standard ascii chars that are in a string (usually delivered there by MS Word). I'm not entirely sure what these characters are... like the fancy apostrophe and the dual directional quotation marks and all that. Is that unicode? I know how to do it ham-handed [a-z etc. etc.] but I was hoping there was a more elegant way to just exclude anything that isn't on the keyboard.
Probably the best way to handle this is to work with character sets, yes, but for what it's worth, I've had some success with this quick-and-dirty approach, the character class
[\x80-\x9F]
this works because the problem with "Word chars" for me is the ones which are illegal in Unicode, and I've got no way of sanitising user input.
Microsoft apps are notorious for using fancy characters like curly quotes, em-dashes, etc., that require special handling without adding any real value. In some cases, all you have to do is make sure you're using one of their extended character sets to read the text (e.g., windows-1252 instead of ISO-8859-1). But there are several tools out there that replace those fancy characters with their plain-but-universally-supported equivalents. Google for "demoronizer" or "AsciiDammit".
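To make the character-set point concrete, here is a hedged sketch in Python. It assumes the incoming bytes really are windows-1252 (cp1252 in Python), and the replacement table is only a small illustrative subset, not the list used by either of the tools named above.

```python
# Bytes as they might arrive from a Word paste saved as windows-1252.
raw = b'\x93quoted\x94 \x96 it\x92s'

# Decoding as cp1252 turns 0x80-0x9F into the intended punctuation.
text = raw.decode('cp1252')

# Map the most common "smart" punctuation to plain ASCII equivalents.
replacements = str.maketrans({
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2013': '-', '\u2014': '-',   # en dash, em dash
    '\u2026': '...',                # single-character ellipsis
})
plain = text.translate(replacements)
print(plain)  # "quoted" - it's
```

Anything not listed in the table passes through unchanged, so the table can be extended as new characters turn up.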
I usually use a JEdit macro that replaces the most common of them with a more ascii-friendly version, i.e.:
hyphens and dashes to minus sign;
suspension dots (single char) to multiple dots;
list item dot to asterisk;
etc.
It is easily adaptable to Word/Openoffice/whatever, and of course modified to suit your needs. I wrote an article on this topic:
http://www.megadix.it/node/138
Cheers
What you are probably looking at are Unicode characters in UTF-8 format. If so, just escape them in your regular expression language.
My solution to this problem is to write a Perl script that gives me all of the characters that are outside of the ASCII range (0 - 127):
#!/usr/bin/perl
use strict;
use warnings;
my %seen;
while (<>) {
for my $character (grep { ord($_) > 127 } split //) {
$seen{$character}++;
}
}
print "saw $_ $seen{$_} times, its ord is ", ord($_), "\n" for keys %seen;
I then create a mapping of those characters to what I want them to be and replace them in the file:
#!/usr/bin/perl
use strict;
use warnings;
my %map = (
chr(128) => "foo",
#etc.
);
while (<>) {
s/([\x{80}-\x{FF}])/$map{$1}/g; # /g replaces every occurrence on the line, not just the first
print;
}
What I would do is, use AutoHotKey, or python SendKeys or some sort of visual basic that would send me all possible keys (also with shift applied and unapplied) to a Word document.
In SendKeys it would be a script of the form
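chars = ''.join([chr(i) for i in range(ord('a'), ord('z') + 1)])  # +1 so 'z' is included
nums = ''.join([chr(i) for i in range(ord('0'), ord('9') + 1)])  # +1 so '9' is included
specials = ''.join(['-', '=', '\\', '/', '.', ',', '`'])  # joined so it can be added to the strings
all = chars + nums + specials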
SendKeys.SendKeys("""
{LWIN}
{PAUSE .25}
r
winword.exe{ENTER}
{PAUSE 1}
%(all)s
+(%(all)s)
"testQuotationAndDashAutoreplace"{SPACE}-{SPACE}a{SPACE}{BS 3}{LEFT}{BS}
{Alt}{PAUSE .25}{SHIFT}
changeLanguage
%(all)s
+%(all)s
"""%{'all':all})
Then I would save the document as text, and use it as a database of all displayable keys in your keyboard layout (you might want to switch the default input language more than once to get absolutely all displayable characters).
If the char is in the result text document - it is displayable, otherwise not. No need for regexp. You can of course afterward embed the characters range within a script or a program.