perl -i -pe 's/(,\h*"[^\n"]*)\n/$1 /g' /opt/data-integration/transfer/events/processing/Master_Events_List.csv
What is going on here? I tried a translator but it's a bit vague. What are some examples of input that this would match?
First, don't try and manipulate CSV (or XML or HTML) with regexes. While CSV might seem simple, it can be subtle. Instead use Text::CSV. The exception is if your CSV is malformed and you're fixing it.
Now, for what your regex is doing. First, let's translate it from s/// to s{}{}, which is a bit easier on the eyes, and use the /x flag so we can space things out a bit.
s{
# Capture to $1
(
# A comma.
,
# 0 or more `h` "horizontal whitespace": tabs and spaces
\h*
# A quote.
"
# 0 or more of anything which is not a quote or newline.
[^\n"]*
)
# A newline (not captured)
\n
}
# Put the captured bit in with a space after it.
# The `g` says to do it multiple times over the whole string.
{$1 }gx
It will change foo, "bar\n into foo, "bar followed by a space. I'm guessing it's turning text fields in the CSV with newlines in them into ones with just spaces.
foo, "first
field", "second
field"
Will become
foo, "first field", "second field"
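To see the substitution in isolation, here is a minimal sketch of the same replacement in Python rather than Perl. Python's re module has no \h, so [ \t] stands in for it (an assumption that only spaces and tabs matter here). The name OPEN_FIELD_AT_EOL is made up for the example; the sample string is the one above.

```python
import re

# Same pattern as the Perl one-liner, with [ \t] standing in for \h.
OPEN_FIELD_AT_EOL = re.compile(r'(,[ \t]*"[^\n"]*)\n')

text = 'foo, "first\nfield", "second\nfield"\n'

# Keep the captured text and replace the newline after it with a space.
joined = OPEN_FIELD_AT_EOL.sub(r'\1 ', text)
print(joined)

# A quoted field that closes on the same line is left alone.
untouched = OPEN_FIELD_AT_EOL.sub(r'\1 ', 'a, "b"\n')
```

Note that the pattern only fires when a quote opens after a comma and is still open at the end of the line, which is why the final newline of each record survives.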
This is something better handled with Text::CSV. I suspect the purpose of the transform is to help out CSV parsers which cannot handle newlines. Text::CSV can with a little coercing.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use autodie;
use Text::CSV;
use IO::Scalar;
use Data::Dumper;
# Pretend our scalar is an IO object so we can use `getline`.
my $str = qq[foo, "bar", "this\nthat"\n];
my $io = IO::Scalar->new(\$str);
# Configure Text::CSV
my $csv = Text::CSV->new({
# Embedded newlines normally aren't allowed, this tells Text::CSV to
# treat the content as binary instead.
binary=> 1,
# Allow spaces between the cells.
allow_whitespace => 1
});
# Use Text::CSV->getline() to do the parsing.
while( my $row = $csv->getline($io) ) {
# Dump the contents of the row
say Dumper $row;
}
And it will correctly parse the row and its embedded newlines.
$VAR1 = [
'foo',
'bar',
'this
that'
];
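For comparison, a rough equivalent using Python's standard csv module, as a sketch rather than a translation of the Perl script. It assumes skipinitialspace=True is an acceptable stand-in for allow_whitespace (it only skips spaces after the delimiter), and the variable names are illustrative.

```python
import csv
import io

text = 'foo, "bar", "this\nthat"\n'

# newline='' leaves line endings untouched so the csv module can see them.
# skipinitialspace=True ignores the space that follows each comma.
reader = csv.reader(io.StringIO(text, newline=''), skipinitialspace=True)
rows = list(reader)
print(rows)

# Flatten embedded newlines only after parsing, when it is safe to do so.
flat = [[cell.replace('\n', ' ') for cell in row] for row in rows]
print(flat)
```

Doing the newline-to-space replacement on parsed cells avoids having to guess, from raw text, which newlines sit inside quotes.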
Edited this to second Schwern (also upvoted): regular expressions seem to be a poor fit for manipulating CSV.
As for the regular expression in question, let's dissect it. Starting with the top level:
's/(,\h*"[^\n"]*)\n/$1 /g'
The s/part1/part2/g expression means "substitute the first part with the second part everywhere".
Now let's examine the "first part":
(,\h*"[^\n"]*)\n
The parentheses are enclosing a group. There is only one group, so it becomes group number 1. We'll come back to that in the next step.
Then, check out https://perldoc.perl.org/perlrebackslash.html for explanation of the character classes. \h is a horizontal whitespace and \n is a logical newline character.
The expression inside the group says: "a comma, then any number of horizontal whitespace characters, then a double quote, then any run of characters that are neither a newline nor a quote". The group must then be followed by a newline. So it is basically a comma followed by the start of a quoted CSV field that is still open at the end of the line.
Lastly, the "second part" reads:
$1
This is just a reference to the group number 1 that was captured earlier followed by a space.
Overall, the expression finds a trailing quoted field whose closing quote has not yet appeared on the line, and replaces the newline that follows it with a space.
The best way to fix newlines in quoted fields that masquerade as End-Of-Record :
First, don't try and manipulate CSV (or XML or HTML) with modules. While CSV might seem tricky, it is extremely simple. Don't use Text::CSV. Instead, use a substitute regex with a callback.
Also, you can use the regex to just correctly parse a csv without replacing
newlines, but you probably want to use Perl to fix it for use in some other language.
Regex (with trim)
/((?:^|,|\r?\n))\s*(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"[^\S\r\n]*(?=$|,|\r?\n))|([^,\r\n]*(?=$|,|\r?\n)))/
Explained
( # (1 start), Delimiter (comma or newline)
(?: ^ | , | \r? \n )
) # (1 end)
\s* # Leading optional whitespaces ( this is for trim )
# ( if no trim is desired, remove this, add
# [^\S\r\n]* to end of group 1 )
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
(Note - this requires a script.)
Perl sample
use strict;
use warnings;
$/ = undef;
sub RmvNLs {
my ($delim, $quote, $non_quote) = @_;
if ( defined $non_quote ) {
return $delim . $non_quote;
}
$quote =~ s/\s*\r?\n/ /g;
return $delim . $quote;
}
my $csv = <DATA>;
$csv =~ s/
( # (1 start), Delimiter (comma or newline)
(?: ^ | , | \r? \n )
) # (1 end)
\s* # Leading optional whitespaces ( this is for trim )
# ( if no trim is desired, remove this, add [^\S\r\n]* to end of group 1 )
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
/RmvNLs($1,$2,$3)/xeg;
print $csv;
__DATA__
497,50,2008-08-02T16:56:53Z,469,4,
"foo bar
foo
bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar
bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,
PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING
HARD ON BEING A GOOD
SERVENT."
PRO1235, "KEITH SMITH",ENGINEER,"keith has been working
hard on being a good
servent."
PRO1235,"KENNY SMITH",,"keith has been working
hard on being a good
servent."
PRO1235,"RICK SMITH",,, #
Output
497,50,2008-08-02T16:56:53Z,469,4,"foo bar foo bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING HARD ON BEING A GOOD SERVENT."
PRO1235,"KEITH SMITH",ENGINEER,"keith has been working hard on being a good servent."
PRO1235,"KENNY SMITH",,"keith has been working hard on being a good servent."
PRO1235,"RICK SMITH",,,
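The same callback idea carries over to other languages. Below is a minimal sketch in Python (not the Perl script above), using re.sub with a function as the replacement. The names FIELD and join_lines and the tiny sample input are assumptions made for illustration; the pattern itself is the one explained above.

```python
import re

FIELD = re.compile(r'''
    ( (?: ^ | , | \r?\n ) )                 # (1) delimiter: start, comma or newline
    \s*                                     # leading whitespace (the "trim")
    (?:
        ( " [^"\\]* (?: \\[\S\s] [^"\\]* )* "   # (2) quoted field
          [^\S\r\n]*                            #     trailing horizontal whitespace
          (?= $ | , | \r?\n )
        )
      |
        ( [^,\r\n]* (?= $ | , | \r?\n ) )       # (3) unquoted field
    )
''', re.VERBOSE)

def join_lines(m):
    """Return the field unchanged, except that newlines inside quotes become spaces."""
    delim, quoted, unquoted = m.group(1, 2, 3)
    if unquoted is not None:
        return delim + unquoted
    return delim + re.sub(r'\s*\r?\n', ' ', quoted)

sample = '1,"a\nb",c\n2,"d"\n'
fixed = FIELD.sub(join_lines, sample)
print(fixed)
```

Only group 2 is rewritten, so record-ending newlines (which are matched as delimiters in group 1) are preserved.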
Related
I'm trying to set up a regular expression that will allow me to replace 2 spaces with a tab, but only on lines containing a certain pattern.
foo: here is some sample text
bar: here is some sample text
In the above example I want to replace any groups of 2 spaces with a tab, but only on lines that contain "bar":
foo: here is some sample text
bar: here is some sample text
The closest that I've gotten has been using this:
Find: ^(\s.*)(bar)(.*) (.*)
Replace: \1\2\3\t\4
However, this only replaces one group of two spaces at a time, so I end up with this:
foo: here is some sample text
bar: here is some sample text
I could execute the replace 3 more times and get my desired result, but I am dealing with text files that may contain hundreds of these sequences.
I am using Sublime Text, but I'm pretty sure that it uses PCRE for its Regex.
This works as well
(?m-s)(?:^(?=.*\bbar\b)|(?!^)\G).*?\K[ ]{2}
https://regex101.com/r/vnM649/1
or
https://regex101.com/r/vnM649/2
Explained
(?m-s) # Multi-line mode, not Dot-All mode
(?:
^ # Only test at BOL for 'bar'
(?= .* \b bar \b )
| # or,
(?! ^ ) # Not BOL, must have found 2 spaces in this line before
\G # Start where last 2 spaces left off
)
.*? # Minimal any character (except newline)
\K # Ignore anything that matched up to this point
[ ]{2} # 2 spaces to replace with a \t
possible to translate this to work with Python?
Yes.
The \G construct makes it possible to do it all in a single-pass regex. Python's third-party regex module supports it, but its built-in re module does not. If using the re module, you need to do it in 2 steps.
First, match the line(s) where bar is. Then pass each matched line to a callback that replaces all double spaces with tabs and returns the result as the replacement.
Sample Python code:
https://rextester.com/AYM96859
# Python 2.7.12 (also runs on Python 3)
import re

def replcall(m):
    # Group 1 is a whole line that contains 'bar' and at least one double space.
    contents = m.group(1)
    return re.sub(r'[ ]{2}', "\t", contents)

text = (
    r'foo: here is some sample text' + "\n"
    r'bar: here is some sample text' + "\n"
)
newstr = re.sub(r'(?m)(^(?=.*\bbar\b)(?=.*[ ]{2}).*)', replcall, text)
print(newstr)
The regex to get the line, expanded:
(?m)
( # (1 start)
^
(?= .* \b bar \b )
(?= .* [ ]{2} )
.*
) # (1 end)
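If a single regex is not a requirement, the two steps can also be written without lookaheads. This is a swapped-in, line-by-line sketch for Python 3, not the approach above; tabify_bar_lines and the sample string (with its double spaces written out explicitly) are hypothetical.

```python
import re

HAS_BAR = re.compile(r'\bbar\b')

def tabify_bar_lines(text):
    """Replace each pair of spaces with a tab, but only on lines containing 'bar'."""
    out = []
    for line in text.splitlines(keepends=True):
        if HAS_BAR.search(line):
            line = line.replace('  ', '\t')
        out.append(line)
    return ''.join(out)

sample = 'foo:  here  is  text\nbar:  here  is  text\n'
result = tabify_bar_lines(sample)
print(repr(result))
```

The foo line keeps its double spaces, and every pair of spaces on the bar line becomes one tab in a single pass.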
This will work:
Find: (^(?!.*bar).*)|
Replace: \1\t
(notice the 2 spaces at the end of the "find" regex) but it'll add a tab at the end of the foo line.
See here a PCRE demo.
I am trying to read a .java file into a perl variable, and I want to match a function, say for instance:
public String example(){
return "hello";
}
What would the regex pattern for this look like?
Current Attempt:
use strict;
use warnings;
open ( FILE, "example.java" ) || die "can't open file!";
my @lines = <FILE>;
close (FILE);
my $line;
foreach $line (@lines) {
if($line =~ /String example(.*)}/s){
print $line;
}
}
Adapted from this answer
Regex:
^\s*([\w\s]+\(.*\)\s*(\{((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/(\w+)["']?[^;]+\4;$|[^{}<'"/]++|[^{}]++|(?2))*)}))
Breakdown:
^ \s*
( # (1 start)
[\w\s]+ \( .* \) \s* # How it matches a function definition
( # (2 start)
\{ # Opening curly bracket
( # (3 start)
(?> # Atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # Double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # Single quoted strings
| // .* $ # A comment block starting with //
| /\* [\s\S]*? \*/ # A multi-line comment block /*...*/
( \w+ ) # (4) ^
["']? [^;]+ \4 ; $ # ^
| [^{}<'"/]++ # Force engine to backtrack if it encounters special characters (possessive)
| [^{}]++ # Default matching behavior (possessive)
| (?2) # Recurs 2nd capturing group
)* # Zero to many times of atomic group
) # (3 end)
} # Closing curly bracket
) # (2 end)
) # (1 end)
Revo's regex is the Right Way To Do it (as much as a regex ever can be!).
But sometimes you just need something quick, to manipulate a file you have control over. I find, when using regexes, that it's often important to define "Good enough".
So, it may be "good enough" to assume the indentation is correct. In that case, you can just detect the start of the fn, then read until you find the next closing curly with the same indentation:
( # Capture \1.
^([\t ]*) # Match and capture all leading whitespace to \2 (needs multi-line mode).
(?:\w+\s+)* # Access modifier, return type, etc., if any.
\w+\s*\( # Name and opening round brace: is a function.
.*? # Need Dot-matches-newline, to match fn body.
\n\2} # Curly brace is as indented as start of fn.
) # End capture of \1.
Should work on clean code that you wrote yourself, code you can pass through an auto-formatter first, etc.
Will work with K&R, Horstmann and Allman indent styles.
Will fail with one-line and in-line functions, and indent styles like GNU, Whitesmiths, Pico and Ratliff - things which Revo's answer handles with no problems at all.
Also fails on lambdas, nested functions, and functions which use generics, but even Revo's doesn't recognize those, and they're not that common.
And neither of our regexes captures the comments preceding a function, which is pretty sinful.
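As a quick way to try the "good enough" idea, here is a sketch in Python of an indentation-based pattern. It is a variant, not the exact regex above: the whole indent is captured as one group, any number of modifier or return-type words is allowed, and the function name gets its own group. FUNCTION, JAVA and the Demo class are made up for the example.

```python
import re

FUNCTION = re.compile(
    r'''
    ^([\t ]*)            # (1) indentation of the signature line
    (?:\w+[\t ]+)*       # modifiers and return type, if any
    (\w+)[\t ]*\(        # (2) function name, then the opening parenthesis
    .*?                  # body, as short as possible (DOTALL lets . cross lines)
    \n\1\}               # closing brace at the same indentation
    ''',
    re.MULTILINE | re.DOTALL | re.VERBOSE,
)

JAVA = '''public class Demo {
    public String example(){
        return "hello";
    }

    int other(int x) {
        if (x > 0) {
            return x;
        }
        return 0;
    }
}
'''

matches = list(FUNCTION.finditer(JAVA))
names = [m.group(2) for m in matches]
print(names)
```

The nested if block does not end the second match early because its closing brace is indented more deeply than the signature line. The same caveats apply as above: it relies entirely on consistent indentation.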
How can I "block comment" SQL statements in Notepad++?
For example:
CREATE TABLE gmr_virtuemart_calc_categories (
id int(1) UNSIGNED NOT NULL,
virtuemart_calc_id int(1) UNSIGNED NOT NULL DEFAULT '0',
virtuemart_category_id int(1) UNSIGNED NOT NULL DEFAULT '0'
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
It should be wrapped with /* at the start and */ at the end using regex in Notepad++ to produce:
/*CREATE TABLE ... (...) ENGINE=MyISAM DEFAULT CHARSET=utf8;*/
You only offer one sample input, so I am forced to build the pattern literally. If this pattern isn't suitable because there are alternative queries and/or other interfering text, then please update your question.
Tick the "Match case" box.
Find what: (CREATE[^;]+;) Replace with: /*$1*/
Otherwise, you can use this for sql query blocks that start with a capital and end in semicolon:
Find what: ([A-Z][^;]+;) Replace with: /*$1*/
To improve accuracy, you might include ^ start of line anchors or add \r\n after the semi-colon or match the CHARSET portion before the semi-colon. There are several adjustments that can be made. I cannot be confident of accuracy without knowing more about the larger body of text.
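Outside Notepad++, the first find/replace pair can be tried with a few lines of Python. This is only a sketch of the same literal pattern; the SQL string is a shortened, hypothetical version of the sample, and re.sub is case-sensitive by default, which corresponds to ticking "Match case".

```python
import re

sql = (
    "CREATE TABLE gmr_virtuemart_calc_categories (\n"
    "  id int(1) UNSIGNED NOT NULL\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=utf8;\n"
    "SELECT 1;\n"
)

# Wrap everything from CREATE up to and including the next semicolon.
commented = re.sub(r'(CREATE[^;]+;)', r'/*\1*/', sql)
print(commented)
```

Because [^;] also matches newlines, the multi-line statement is wrapped as one block, while the SELECT that follows is untouched.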
You could use a recursive regex.
I think Notepad++ uses Boost or PCRE.
This works with both.
https://regex101.com/r/P75bXC/1
Find (?s)(CREATE\s+TABLE[^(]*(\((?:[^()']++|'.*?'|(?2))*\))(?:[^;']|'.*?')*;)
Replace /*$1*/
Explained
(?s) # Dot-all modifier
( # (1 start) The whole match
CREATE \s+ TABLE [^(]* # Create statement
( # (2 start), Recursion code group
\(
(?: # Cluster group
[^()']++ # Possessive, not parentheses or quotes
| # or,
' .*? ' # Quotes (can wrap in atomic group if need be)
| # or,
(?2) # Recurse to group 2
)* # End cluster, do 0 to many times
\)
) # (2 end)
# Trailer before the semicolon that ends the statement
(?: # Cluster group, can be atomic (?> ) if need be
[^;'] # Not quote or semicolon
| # or,
' .*? ' # Quotes
)* # End cluster, do 0 to many times
; # Semicolon at the end
) # (1 end)
I just started learning Perl and trying to do regex to break down a token key.
The token itself has multiple "columns" and I only need the KEY section.
token:
{
"token_key":"C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8",
"key_type":"xyz",
"expires":1200
}
but I only need this part
C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8
everything else can be ignored when I output it.
Any suggestions or advice are welcome!
Thank You!
This looks like JSON -- perhaps a proper parser would be better?
use JSON::PP;
my $json = JSON::PP->new->utf8->allow_barekey;
my $token = $json->decode('{' . $str . '}')->{'token'};
print $token->{'token_key'};
In any case, you can extract it (a bit more hackishly) with a regex like so:
$str =~ /['"]token_key['"]:\s*['"]([a-f0-9]+)['"]/i;
print $1;
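For readers more at home in Python, the same two options look roughly like this. It is a sketch that assumes the token is available as a plain JSON object string; the key value is a shortened, hypothetical stand-in for the real one.

```python
import json
import re

# Hypothetical, shortened token; the real key is much longer.
raw = '{"token_key":"C9B3A703ADFE","key_type":"xyz","expires":1200}'

# Preferred: let a JSON parser do the work.
token_key = json.loads(raw)["token_key"]

# Fallback: a regex that assumes a hex-only value in double quotes.
m = re.search(r'"token_key"\s*:\s*"([A-Fa-f0-9]+)"', raw)
fallback = m.group(1) if m else None

print(token_key, fallback)
```

The parser version keeps working if the key order, spacing or escaping changes; the regex version does not.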
A slightly less hackish regex would be:
# (["'])\s*token_key\s*\1\s*:\s*(["'])((?:(?!\2)[\S\s])*)\2
( ["'] ) # (1)
\s* token_key \s*
\1
\s* : \s*
( ["'] ) # (2)
( # (3 start)
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
Assuming the following:
there is no whitespace between "token_key" and its value,
double quotes, not single quotes, are used in the parsed string,
the contents of the file are in the $_ variable.
In that case:
if (/"token_key":"([^"]+)/)
{ print "$1\n" }
I have a regular expression to test whether a CSV cell contains a correct file path:
EDIT The CSV lists file paths that do not yet exist when the script runs (so I cannot use -e), and a file path can include * or %variable% or {$variable}.
my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s\.\*-]+|\{\$\w+}|%\w+%)[\\\/]{0,2})*\1$';
Since CSV cells are sometimes wrapped in double quotes, and sometimes the filename itself needs to be wrapped in double quotes, I made this grouping (|"|""") ... \1
Then using this function:
sub ValidateUNCPath{
my $input = shift;
if ($input !~ /$FILENAME_REGEXP/){
return;
}
else{
return "This is a Valid File Path.";
}
}
I'm trying to test if this phrase is matching my regexp (It should not match):
"""c:\my\dir\lord"
but my dear Perl gets into an infinite loop when:
ValidateUNCPath('"""c:\my\dir\lord"');
EDIT actually it loops on this:
ValidateUNCPath('"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"');
I made sure in http://regexpal.com that my regexp correctly catches those non-symmetric """ ... " wrapping double quotes, but Perl has a mind of its own :(
I even tried the /g and /o flags in
/$FILENAME_REGEXP/go
but it still hangs. What am I missing ?
First off, nothing you have posted can cause an infinite loop, so if you're getting one, it's not from this part of the code.
When I try out your subroutine, it returns true for all sorts of strings that are far from looking like paths, for example:
.....
This is a Valid File Path.
.*.*
This is a Valid File Path.
-
This is a Valid File Path.
This is because your regex is rather loose.
^(|"|""") # can match the empty string
(?:[a-zA-Z]:[\\\/])? # same, matches 0-1 times
[\\\/]{0,2} # same, matches 0-2 times
(?:(?:[\w\s\.\*-]+|\{\$\w+}|%\w+%)[\\\/]?)+\1$ # only this is not optional
Since only the last part actually has to match anything, you are allowing all kinds of strings, mainly through the first character class: [\w\s\.\*-]
In my personal opinion, when you start relying on regexes that look like yours, you're doing something wrong. Unless you're skilled at regexes, and hope no one who isn't will ever be forced to fix it.
Why don't you just remove the quotes? Also, if this path exists in your system, there is a much easier way to check if it is valid: -e $path
If the regex engine was naïve,
('y' x 20) =~ /^.*.*.*.*.*x/
would take a very long time to fail since it has to try
20 * 20 * 20 * 20 * 20 = 3,200,000 possible matches.
Your pattern has a similar structure: many of its components can match a wide range of substrings of your input.
Now, Perl's regex engine is highly optimised, and far from naïve. In the above pattern, it will start by looking for x, and exit very fast. Unfortunately, it doesn't or can't similarly optimise your pattern.
Your pattern is a complete mess. I'm not going to even try to guess what it's supposed to match. You will find that this problem will solve itself once you switch to a correct pattern.
Update
Edit: From trial and error, the sub-expression [\w\s.*-]+ inside the grouping below is causing the backtracking problem:
(?:
(?:
[\w\s.*-]+
| \{\$\w+\}
| %\w+%
)
[\\\/]?
)+
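To get a feel for why that nesting hurts, here is a small Python sketch that counts in how many ways a run of characters can be carved up between the inner + and the outer + when the separator is optional. It is a rough upper-bound estimate under that simplifying assumption, not a measurement of what Perl's engine actually does (real engines prune some of these paths); count_splits and the segment list are illustrative.

```python
from functools import lru_cache

def count_splits(n):
    """Ways to cut a run of n characters into consecutive non-empty chunks."""
    @lru_cache(maxsize=None)
    def ways(remaining):
        if remaining == 0:
            return 1
        # Choose the length of the first chunk, then split the rest.
        return sum(ways(remaining - first) for first in range(1, remaining + 1))
    return ways(n)

# Path components of the input that hung: \aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map
segments = ['aaaaaaaaa', 'bbbbbbb', 'ccccccc', 'Netwxn00.map']

total = 1
for seg in segments:
    total *= count_splits(len(seg))

print(count_splits(9), total)
```

Each run of n characters can be split in 2**(n-1) ways, and the counts multiply across components. When the final \1$ cannot match, a backtracking engine may have to rule out a large share of those combinations before giving up, which looks like a hang.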
Fix #1,
Unrolled method
'
^
( # Nothing
|" # Or, "
|""" # Or, """
)
# Here to end, there is no provision for quotes (")
(?: # If there are no balanced quotes, this will fail !!
[a-zA-Z]
:
[\\\/]
)?
[\\\/]{0,2}
(?:
[\w\s.*-]
| \{\$\w+\}
| %\w+%
)+
(?:
[\\\/]
(?:
[\w\s.*-]
| \{\$\w+\}
| %\w+%
)+
)*
[\\\/]?
\1
$
'
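A quick way to check the unrolled form is to run it against the input that hung. The sketch below uses Python's re module rather than Perl, so it only illustrates the structure of the pattern, not Perl's behaviour; SEGMENT and PATH are names made up for the example.

```python
import re

# One path component: single characters or a {$var} / %var% placeholder.
# The alternatives cannot start with the same character, so there is only
# one way to consume a given component.
SEGMENT = r'(?:[\w\s.*-]|\{\$\w+\}|%\w+%)+'

PATH = re.compile(''.join([
    r'^(|"|""")',               # (1) nothing, one quote, or three quotes
    r'(?:[a-zA-Z]:[\\/])?',     # optional drive letter
    r'[\\/]{0,2}',              # optional leading slashes
    SEGMENT,                    # first component
    r'(?:[\\/]' + SEGMENT + r')*',   # further components, each after one slash
    r'[\\/]?',                  # optional trailing slash
    r'\1$',                     # closing quotes must equal the opening ones
]))

unbalanced = r'"""\aaaaaaaaa\bbbbbbb\ccccccc\Netwxn00.map"'
quoted = r'"c:\my\dir\lord"'
bare = r'c:\my\dir\lord'

print(PATH.match(unbalanced), bool(PATH.match(quoted)), bool(PATH.match(bare)))
```

The unbalanced input is rejected without the combinatorial search because a slash is now required between components, so a run of characters can no longer be split across iterations of the outer group.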
Fix #2, Independent Sub-Expression
'
^
( # Nothing
|" # Or, "
|""" # Or, """
)
# Here to end, there is no provision for quotes (")
(?: # If there are no balanced quotes, this will fail !!
[a-zA-Z]
:
[\\\/]
)?
[\\\/]{0,2}
(?>
(?:
(?:
[\w\s.*-]+
| \{\$\w+\}
| %\w+%
)
[\\\/]?
)+
)
\1
$
'
Fix #3, remove the + quantifier (or add +?)
'
^
( # Nothing
|" # Or, "
|""" # Or, """
)
# Here to end, there is no provision for quotes (")
(?: # If there are no balanced quotes, this will fail !!
[a-zA-Z]
:
[\\\/]
)?
[\\\/]{0,2}
(?:
(?:
[\w\s.*-]
| \{\$\w+\}
| %\w+%
)
[\\\/]?
)+
\1
$
'
Thanks to sln this is my fixed regexp:
my $FILENAME_REGEXP = '^(|"|""")(?:[a-zA-Z]:[\\\/])?[\\\/]{0,2}(?:(?:[\w\s.-]++|\{\$\w+\}|%\w+%)[\\\/]{0,2})*\*?[\w.-]*\1$';
(I also disallowed * char in directories, and only allowed single * in (last) filename)