Let's say I have the following sentence
Apples, "This, is, a test",409, James,46,90
I want to replace the commas inside the quotation marks with ;. Or, alternatively, the ones outside the quotation marks with the same character ;. So far I thought of something like
perl -pe 's/(".*)\K,(?=.*")/;/g' <mystring>
However, this only matches the last comma inside the quotation marks because I am restarting the regex engine with \K. I also tried some regexes to change the commas outside the quotation marks, but I can't get them to work.
Note that the spaces after commas outside the quotation marks are there on purpose, so that
perl -pe 's/,\s/;/g' <mystring>
is not a valid answer.
The desired output would be
Apples, "This; is; a test",409, James,46,90
Or alternatively
Apples; "This, is, a test";409; James;46;90
Any thoughts on how to approach this problem?
I'd use an actual CSV parser instead of trying to hack something up with regular expressions. The very useful Text::AutoCSV module makes it easy to convert the comma field separators to semicolons in a one-liner:
$ echo 'Apples, "This, is, a test",409, James,46,90' |
perl -MText::AutoCSV -e 'Text::AutoCSV->new(out_sep_char => ";")->write()'
Apples;"This, is, a test";409;James;46;90
For a non-perl solution, csvformat from csvkit is another handy tool, though it's harder to get the quoting the same:
$ echo 'Apples, "This, is, a test",409, James,46,90' |
csvformat -S -U2 -D';'
"Apples";"This, is, a test";"409";"James";"46";"90"
Or (Self promotion alert!) my tawk utility (Which also won't get the quotes the same):
$ echo 'Apples, "This, is, a test",409, James,46,90' |
tawk -csv -quoteall 'line { set F(1) $F(1); print }' OFS=";"
"Apples";" This, is, a test";"409";" James";"46";"90"
I've scraped a large amount (10GB) of PDFs and converted them to text files, but due to the format of the original PDFs, there is an issue:
Many of the words which break across lines have a dash in them that artificially breaks up the word.
This happened because the original PDF files have line breaks at those points.
What would be the cleanest and fastest way to "join" every word instance that matches this pattern inside of a .txt file?
Perhaps some sort of Regex search, like for a [a-z]\-\s \w of some kind (word character followed by dash followed by space) would work?
Or would some sort of sed replacement work better?
Currently, I'm trying to get a sed regex to work, but I'm not sure how to translate this to use capture groups to replace the selected text:
sed -n '\%\w\- [a-z]%p' Filename.txt
My input text would look like this:
The dog rolled down the st- eep hill and pl- ayed outside.
And the output would be:
The dog rolled down the steep hill and played outside.
Ideally, the expression would also work for words split up by a newline, like this:
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
To this:
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
It's straightforward in sed:
sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename
This translates roughly as "if the line ends with a dash, read in the next line as well (so that you have a line with a newline in the middle), then excise the dash and the newline, and loop back to the beginning just in case this new line also ends with a dash. Then remove any instances of - ".
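For reference, here is the same script spread out with one command per line and a comment on each step. It is only a sketch: the function name dehyphenate is made up, and the indentation and comment lines inside the script assume a sed that accepts them (GNU sed does).

```shell
# The same sed script as above, one command per line.
dehyphenate() {
  sed '
    # label "a": the loop target
    :a
    # only for lines whose last character is a dash
    /-$/ {
      # append the next input line; the pattern space is now "line1-\nline2"
      N
      # delete the dash together with the embedded newline
      s/-\n//
      # jump back to "a" in case the joined line ends with a dash too
      ba
    }
    # finally, join words split by "- " inside a line
    s/- //g
  ' "$@"
}

printf '%s\n' 'The rule which provided for the consid-' \
              'eration of the resolution' | dehyphenate
```

Note that joining pulls the whole following line up, so the two input lines come out as one line rather than being re-wrapped as in the question's desired output.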
You may use this gnu-awk code:
cat file
The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
Then use awk like this:
awk 'p != "" {
w = $1
$1 = ""
sub(/^[[:blank:]]+/, ORS)
$0 = p w $0
p = ""
}
{
$0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
p = $0
sub(/-$/, "", p)
}
p == ""' file
The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
If you can consider perl, then this may also work for you:
perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
You simply add backslash-parentheses (or use the -r or -E option if available to do away with the requirement to put backslashes before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, etc.
sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt
The \w escape is not standard sed but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_#] or whatever else you want to call "word characters".
I'm guessing not all of the matches will be hyphenated words so perhaps run the result through a spelling checker or something to verify whether the result is an English word. (I would probably switch to a more capable scripting language like Python for that, though.)
I'm using sed (actually ssed with extended "-r", so I have access to most any regex functionality I might need) to process files, one change being to convert doubled backslash characters to single forward slashes inside quoted strings. The problem is that some quoted strings that contain the doubled backslashes should not be converted. All quoted strings I want to target have a certain word "myPhrase" inside the quotes.
So for a file with these two lines:
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
the output would be:
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
I've tried various combinations of lookahead like (?=myPhrase) within a search pattern that finds the desired quoted chunks and replaces a capture group (\\) with / as the replacement, but all my attempts either replace just the first occurrence of the doubled backslashes, or those to the left of myPhrase, etc.
I'm sure there is some combination of lookahead/noncapture/recursion that should do this, but I'm blanking out completely right now.
With GNU awk for the 3rd arg to match():
$ cat file
"dont change \\this" "\\\\server\\dir\\myPhrase\\subdir" "nor\\this"
"Don't change \\something me!"
$ awk 'match($0,/(.*)("[^"]*myPhrase[^"]*")(.*)/,a){gsub(/[\\][\\]/,"/",a[2]); $0=a[1] a[2] a[3]} 1' file
"dont change \\this" "//server/dir/myPhrase/subdir" "nor\\this"
"Don't change \\something me!"
Try this Perl solution
perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' file
with the below inputs
$ cat ykdvd.txt
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
Another line
$ perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' ykdvd.txt
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
Another line
$
Problem: I have a variable with characters I'd like to prepend another character to within the same string stored in a variable
Ex. "[blahblahblah]" ---> "\[blahblahblah\]"
Current Solution: Currently I accomplish what I want with two steps, each step attacking one bracket
Ex.
temp=[blahblahblah]
firstEscaped=$(echo $temp | sed s#'\['#'\\['#g)
fullyEscaped=$(echo $firstEscaped | sed s#'\]'#'\\]'#g)
This gives me the result I want but I feel like I can accomplish this in one line using capturing groups. I've just had no luck and I'm getting burnt out. Most examples I come across involve wanting to extract the text between brackets instead of what I'm trying to do. This is my latest attempt to no avail. Any ideas?
There may be more efficient ways, (only 1 s/s/r/ with a fancier reg-ex), but this works, given your sample input
fully=$(echo "$temp" | sed 's/\([[]\)/\\\1/;s/\([]]\)/\\\1/') ; echo "$fully"
output
\[blahblahblah\]
Note that it is quite OK to chain together multiple sed operations, separated by ; or, if in a sed script file, by newlines.
Read about sed capture-groups using \(...\) pairs, and referencing them by number, i.e. \1.
IHTH
$ temp=[blahblahblah]
$ fully=$(echo "$temp" |sed 's/\[\|\]/\\&/g'); echo "$fully"
\[blahblahblah\]
Brief explanation,
\[\|\]: matches either '[' or ']'. The '[' is escaped to make it literal (escaping ']' is harmless), and \| is GNU sed's alternation operator in basic regular expressions.
&: refers to the text that matched. The \\ in front of it is an escaped backslash, so a literal \ is written before the matched bracket.
As @Gordon Davisson suggested, you can also use a bracket expression, which avoids the GNU-specific \| alternation:
sed 's/[][]/\\&/g'
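Wrapped up for use with a variable, that could look like the sketch below. The function name escape_brackets is made up, and printf is used instead of echo only because echo differs between shells in how it treats backslashes.

```shell
# Sketch: put a backslash before every [ and ] in the first argument.
# In the bracket expression [][] the ] comes first, so it is taken literally.
escape_brackets() {
  printf '%s\n' "$1" | sed 's/[][]/\\&/g'
}

temp='[blahblahblah]'
fullyEscaped=$(escape_brackets "$temp")
printf '%s\n' "$fullyEscaped"
```

This prints \[blahblahblah\] for the sample value, the same result as the two-step version in the question.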
I'm trying to return different replacement results with a perl regex one-liner if it matches a group. So far I've got this:
echo abcd | perl -pe "s/(ab)(cd)?/defined($2)?\1\2:''/e"
But I get
Backslash found where operator expected at -e line 1, near "1\"
(Missing operator before \?)
syntax error at -e line 1, near "1\"
Execution of -e aborted due to compilation errors.
If the input is abcd I want to get abcd out, if it's ab I want to get an empty string. Where am I going wrong here?
You used regex atoms \1 and \2 (match what the first or second capture captured) outside of a regex pattern. You meant to use $1 and $2 (as you did in another spot).
Furthermore, dollar signs inside double-quoted strings have meaning to your shell. It's best to use single quotes around your program[1].
echo abcd | perl -pe's/(ab)(cd)?/defined($2)?$1.$2:""/e'
Simpler:
echo abcd | perl -pe's/(ab(cd)?)/defined($2)?$1:""/e'
Simpler:
echo abcd | perl -pe's/ab(?!cd)//'
[1] Either avoid single-quotes in your program[2], or use '\'' to "escape" them.
[2] You can usually use q{} instead of single-quotes. You can also switch to using double-quotes. Inside of double-quotes, you can use \x27 for an apostrophe.
Why torture yourself, just use a branch reset.
Find (?|(abcd)|ab())
Replace $1
And a couple of even better ways
Find abcd(*SKIP)(*FAIL)|ab
Replace ""
Find (?:abcd)*+\Kab
Replace ""
These use regex wisely.
There is really no need nowadays to use the eval form
of the regex substitution construct s///e in conjunction with defined().
This is especially true when using the perl command line.
Good luck...
I came across this awesome regex:
s/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg
It does magic, but is so obscure I can't understand it. It works very well:
echo 'a\tb\nc\r\n' | perl -lpe 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg'
a b
c
Let us watch it with cat -A :
echo 'a\tb\nc\r\n' | perl -lpe 's/((?:\\[a-zA-Z\\])+)/qq[qq[$1]]/eeg' | cat -A
a^Ib$
c^M$
$
I will keep it for future reference, but it would be really cool to understand it. I know /ee modifier evaluates RHS, but what are those qqs? Is the function qq for double quotes ? I would appreciate if someone could explain.
PS. I found this regex here
In Perl you have single and double quotes, where "$foo" is expanded and '$foo' is literal.
The q operator lets you choose which character acts as '.
The qq operator does the same for ".
So in the awesome example, [ is set as the interpolating quote character, and Perl makes it more readable by pairing ] with [. The variable is therefore wrapped in an interpolating quote twice, which without the brackets would be deeply mysterious; plain " quotes also get very confusing when mixed with shell quoting.
A simple example to try out :
% perl -E '$foo=bar; say qq[$foo];'
bar
%
qq is the interpolating quote operator. It's the same thing as putting a string between double quotes, but can use open-close character pairs like [] here. This has the advantage that you can nest it, which you couldn't do with double quotes.