I'm using sed (actually ssed with extended "-r", so I have access to most any regex functionality I might need) to process files, one change being to convert doubled backslash characters to a single forward slashes inside quoted strings. The problem is that some quoted strings that contain the doubled backslashes should not be converted. All quoted strings I want to target have a certain word "myPhrase" inside the quotes.
So for a file with these two lines:
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
the output would be:
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
I've tried various combinations of lookahead like (?=myPhrase) within a search pattern that finds the desired quoted chunks and replaces a capture group (\\) with / as the replacement, but all my attempts either replace just the first occurance of the doubled backslashes, or those to the left of myPhrase, etc.
I'm sure there is some combination of lookahead/noncapture/recursion that should do this, but I'm blanking out completely right now.
With GNU awk for the 3rd arg to match():
$ cat file
"dont change \\this" "\\\\server\\dir\\myPhrase\\subdir" "nor\\this"
"Don't change \\something me!"
$ awk 'match($0,/(.*)("[^"]*myPhrase[^"]*")(.*)/,a){gsub(/[\\][\\]/,"/",a[2]); $0=a[1] a[2] a[3]} 1' file
"dont change \\this" "//server/dir/myPhrase/subdir" "nor\\this"
"Don't change \\something me!"
Try this Perl solution
perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' file
with the below inputs
$ cat ykdvd.txt
"\\\\server\\dir\\myPhrase\\subdir"
"Don't change \\something me!"
Another line
$ perl -pe ' s{"(.+)?"}{$x=$1; if($x=~m/myPhrase/ ) {$x=~s!\\\\!/!g};sprintf("\x22%s\x22",$x)}ge ' ykdvd.txt
"//server/dir/myPhrase/subdir"
"Don't change \\something me!"
Another line
$
Related
I've scraped a large amount (10GB) of PDFs and converted them to text files, but due to the format of the original PDFs, there is an issue:
Many of the words which break across lines have a dash in them that artificially breaks up the word, like this:
You can see that this happened because the original PDFs files have breaks:
What would be the cleanest and fastest way to "join" every word instance that matches this pattern inside of a .txt file?
Perhaps some sort of Regex search, like for a [a-z]\-\s \w of some kind (word character followed by dash followed by space) would work?
Or would some sort of sed replacement work better?
Currently, I'm trying to get a sed regex to work, but I'm not sure how to translate this to use capture groups to replace the selected text:
sed -n '\%\w\- [a-z]%p' Filename.txt
My input text would look like this:
The dog rolled down the st- eep hill and pl- ayed outside.
And the output would be:
The dog rolled down the steep hill and played outside.
Ideally, the expression would also work for words split up by a newline, like this:
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
To this:
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
It's straightforward in sed:
sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename
This translates roughly as "if the line ends with a dash, read in the next line as well (so that you have a line with a carriage return in the middle) then excise the dash and carriage return, and loop back the beginning just in case this new line also ends with a dash. Then remove any instances of - ".
You may use this gnu-awk code:
cat file
The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
Then use awk like this:
awk 'p != "" {
w = $1
$1 = ""
sub(/^[[:blank:]]+/, ORS)
$0 = p w $0
p = ""
}
{
$0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
p = $0
sub(/-$/, "", p)
}
p == ""' file
The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
If you can consider perl then this may also work for you:
Then use:
perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
You simply add backslash-parentheses (or use the -r or -E option if available to do away with the requirement to put backslashes before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, etc.
sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt
The \w escape is not standard sed but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_#] or whatever else you want to call "word characters".
I'm guessing not all of the matches will be hyphenated words so perhaps run the result through a spelling checker or something to verify whether the result is an English word. (I would probably switch to a more capable scripting language like Python for that, though.)
Let's say I have the following sentence
Apples, "This, is, a test",409, James,46,90
I want to change the commas inside the quotation marks by ;. Or, alternatively, the ones outside the quotation marks by the same character ;. So far I thought of something like
perl -pe 's/(".*)\K,(?=.*")/;/g' <mystring>
However, this is only matching the last comma inside quotation marks because I am restarting the regex engine with \K. I also tried some regex's to change the commas outside quotation marks but I can't get it to work.
Note that the spaces after commas outside the quotation marks are there on purpose, so that
perl -pe 's/,\s/;/g' <mystring>
is not a valid answer.
The desired output would be
Apples, "This; is; a test",409, James,46,90
Or alternatively
Apples; "This, is, a test";409; James;46;90
Any thoughts on how to approach this problem?
I'd use an actual CSV parser instead of trying to hack something up with regular expressions. The very useful Text::AutoCSV module makes it easy to convert the comma field separators to semicolons in a one-liner:
$ echo 'Apples, "This, is, a test",409, James,46,90' |
perl -MText::AutoCSV -e 'Text::AutoCSV->new(out_sep_char => ";")->write()'
Apples;"This, is, a test";409;James;46;90
For a non-perl solution, csvformat from csvkit is another handy tool, though it's harder to get the quoting the same:
$ echo 'Apples, "This, is, a test",409, James,46,90' |
csvformat -S -U2 -D';'
"Apples";"This, is, a test";"409";"James";"46";"90"
Or (Self promotion alert!) my tawk utility (Which also won't get the quotes the same):
$ echo 'Apples, "This, is, a test",409, James,46,90' |
tawk -csv -quoteall 'line { set F(1) $F(1); print }' OFS=";"
"Apples";" This, is, a test";"409";" James";"46";"90"
I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently non-necessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)
[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .
Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
After investigating all the answers and trying to figure them out, I've decided that the base of the problem is that the [^/]+ is greedy. Its greediness causes it to capture the newline; it ignores the $ anchor.
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by #Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically changes the $ anchor so as to behave the way a sed person would expect it to.
The answer by #CrafterKolyan takes care of the greedy [^/] capturing the newline by saying you can't have a forward-slash or a newline. This answer still needs the $ anchor to prevent the following situation
1) .* captures the empty string (0 or more of any character)
2) [^/\n]+ captures . .
The answer by #Borodin takes a quite different approach, but it's a great concept.
#Borodin, in addition, made a great comment that allows a more-precise/more-exact version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like perl (no non-greedy wildcards,) here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.
use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5
Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them
Like this
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
I need to replace within a large text file all occurrences such as 'yw234DV w-23-sDf wef23s-d-f' with the same strings but with underscores instead of spaces for all spaces within quotes, without replacing any spaces outside quotes with underscores.
I'm trying to find a solution for substitution within vim, but a sed solution would also be much appreciated. The number of tokens in each quote-delimited string may vary.
I've been playing with some regexes in vim, but they're pretty elementary and seem to be missing what I need.
My current attempt:
%s/'{[:alnum:] }*/'\0\_/g
And I'm experimenting with variations on that.
This is most similar to my question, though it is Java:
Replacing spaces within quotes
Sample Input:
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
Sample Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
You may try this with VIM, tried this on Macvim:
%s/\%('[^']*'\)*\('[^']*'\)/\=substitute(submatch(1), ' ', '_', 'g')/g
Much simpler solution , Thanks to #SergioAraujo:
#%s/\v%(('[^']*'))/\=substitute(submatch(1),' ', '_', 'g')/g
Not sure however, if below is the outcome you have expected
Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
In perl:
perl -i -pe's{(\x27.*?\x27)}{ (my $subst = $1) =~ tr/ /_/ }ge' yourfile
or with perl5.14 or above:
perl -i -pe's{(\x27.*?\x27)}{ $1 =~ tr/ /_/r }ge'
With this the input file:
$ cat file
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
We can convert all spaces inside of single-quotes into underscores with:
$ sed -E ":a; s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/; ta" file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
How it works
:a
This creates a label a.
s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/
This inserts the underscores where we want them.
^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]
This looks for any odd number of single quotes followed by any number of non-quote characters followed by a space. Everything before that space is saved in group 1.
\1_
This replaces the matched text with group 1 followed by an underscore.
ta
If the previous command put any new underscores in the string, then jump back to label a and try again.
Using FPAT variable in gnu awk you can do this:
awk -v OFS=', ' -v FPAT="'[^']*'" '{for (h=1; h<=NF; h++)
{gsub(/[[:blank:]]/, "_", $h); printf "%s%s", $h, (h < NF ? OFS : ORS)}}' file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want unix sed command such that only basic that is not in quotes should be changed.[change basic to ring]
Expected output:
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ". Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because we matched all of that stuff in the line before basic we would replace that as well. That's why this part is wrapped in another pair of parentheses, thus capturing the line and writing it back in the replacement with \1 followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A solution would be a lookbehind, but since this would be a variable-length lookbehind, which is only supported by the .NET regex engine. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you wanna edit file in place use --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there is no escaped quotes in strings, i.e. "This is a trap \" hehe", awk might be able to solve this problem
awk -F\" 'BEGIN {OFS=FS}
{
for(i=1; i<=NF; i++){
if(i%2)
gsub(/basic/,"ring",$i)
}
print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0