GREP: Extracting all characters from inside double quote

What I did:
grep -E -o -e "[^"]+"
It can extract, for example: "Poland" and "New York" but can't extract "Marcos Juárez" due to the existence of 'á'...it cuts the output to "Marcos Ju" and "rez"
How can I prevent this?

I don't think this is a regex problem per se. It could be a Unicode or wide-char issue.
Your regex should be "[^"]+"; the [^"] part means NOT a double quote.
I don't know the Unix command line, but what delimits the "[^"]+" parameter?
Is it done by just spaces?
Try ".*?"; it should match. If not, it's a Unicode problem.

Try:
grep -Po '(?<=\")(.*?)(?=\")'
For me it outputs all three correctly.
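If the cut really comes from an encoding mismatch between the input and the current locale (an assumption, the question does not say which locale was in use), forcing the C locale is another thing to try. This is only a sketch with made-up sample input; in the C locale grep treats every byte as a character, so the bytes of 'á' are matched by [^"] like any other.

```shell
# Hypothetical sample input: one line with three quoted names.
input='"Poland" "New York" "Marcos Juárez"'

# LC_ALL=C makes grep work byte by byte, so [^"] also matches the
# bytes that make up the accented character.
quoted=$(printf '%s\n' "$input" | LC_ALL=C grep -o '"[^"]*"')
printf '%s\n' "$quoted"
```

Unlike the lookaround version, this keeps the surrounding quotes in the output and does not need grep's -P option.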

Related

sed -e : is it possible to match everything between two quotation marks?

Sed, is it possible to match everything between two chars?
In a script that I have to use there is a bug.
The script has to replace the value of
#define MAPPING,
The line containing the bug is the one below:
sed -i -e "s/#define MAPPING \"\"/#define MAPPING \"$string\"/1" file.hpp
Since in file.hpp MAPPING is defined as:
#define MAPPING ""
the script works, but if I try to call the script again and MAPPING was already redefined, now sed won't match #define MAPPING "" and thus not override anything.
I'm not a sed expert, and with a quick search couldn't find the way to let it match
#define MAPPING "<everything>".
Is it possible to achieve this?

This does what you want:
sed -Ei 's/(#define MAPPING ")[^"]*(")/\1'"$string\2/" file.hpp
[^"]* means zero or more non double quote characters.
I used back references instead of repeating the same text, it's up to you.
The 1 at the end of your example means replace the first occurrence. However, this is the default, so it can be removed.
Be aware: if $string contains sequences like &, \5, or \\, they won't be passed literally, and can even cause an error. Also, C escapes like \t for tab are expanded by many sed implementations (so you'll end up with a literal tab in the file, instead of \t).
For what it's worth, this sed does the same thing, but is more accommodating of varied whitespace:
sed -Ei 's/(^[[:space:]]*#[[:space:]]*define[[:space:]]+MAPPING[[:space:]]+")[^"]*(")/\1'"$string\2/" file.hpp
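One way to work around the caveat about special sequences in $string is to escape the value before handing it to sed. The snippet below is a rough sketch, not a complete solution: the sample value is made up, it only escapes backslash, & and the / delimiter, and it does not handle newlines inside $string.

```shell
# Hypothetical value containing characters that are special in a
# sed replacement: &, / (the delimiter) and a backslash.
string='a&b/c\d'

# Put a backslash in front of every \, / and & so sed takes them literally.
escaped=$(printf '%s' "$string" | sed 's|[\\/&]|\\&|g')

# Same substitution as above, reading a sample line from stdin
# instead of editing file.hpp in place.
result=$(printf '#define MAPPING "old"\n' |
    sed -E 's/(#define MAPPING ")[^"]*(")/\1'"$escaped"'\2/')
printf '%s\n' "$result"
```

The escaped value is only meant for the replacement side of s///; a value used on the pattern side would need a different set of characters escaped.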

You can also try:
sed -i -e "s/#define MAPPING \".*\"/#define MAPPING \"$string\"/1" file.hpp
The dot means anything can go here and the star means at least 0 times so .* accepts any sequence of characters, including an empty string.
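One thing to keep in mind is that .* is greedy, so it runs up to the last double quote on the line, while [^"]* stops at the next one. The sample line below is made up to show the difference; it is a sketch, not something taken from the original file.hpp.

```shell
# Hypothetical line with a second pair of quotes in a trailing comment.
line='#define MAPPING "old" /* see "notes" */'

# .* swallows everything up to the LAST double quote on the line.
greedy=$(printf '%s\n' "$line" |
    sed 's/#define MAPPING ".*"/#define MAPPING "new"/')

# [^"]* stops at the first closing quote, leaving the comment alone.
bounded=$(printf '%s\n' "$line" |
    sed 's/#define MAPPING "[^"]*"/#define MAPPING "new"/')

printf '%s\n%s\n' "$greedy" "$bounded"
```

If the line never contains anything after the closing quote, both forms behave the same.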

s/// returns out of place newline

I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently unnecessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)

[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .

Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
After investigating all the answers and trying to figure them out, I've decided that the base of the problem is that the [^/]+ is greedy. Its greediness causes it to capture the newline; it ignores the $ anchor.
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by @Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically changes the $ anchor so as to behave the way a sed person would expect it to.
The answer by @CrafterKolyan takes care of the greedy [^/] capturing the newline by saying you can't have a forward-slash or a newline. This answer still needs the $ anchor to prevent the following situation
1) .* captures the empty string (0 or more of any character)
2) [^/\n]+ captures . .
The answer by @Borodin takes a quite different approach, but it's a great concept.
@Borodin, in addition, made a great comment that allows a more-precise/more-exact version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like perl (no non-greedy wildcards,) here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.

use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5

Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them
Like this
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101

Select a single character in an alphanumeric string in bash

I have an issue with string manipulation in bash. I have a list of names, each name being composed of two parts, chars and numbers: for example
abcdef01234
I want to cut the last character before the numeric part starts, in this case
f
I think there is a regular expression to help me with this but just can't figure it out. AWK/sed solutions are accepted too. Hope someone can help.
Thank you.

In bash it can be done with parameter expansion with substring removal and string indexes, e.g.,
a=abcdef01234 # your string
tmp=${a%%[0-9]*} # remove all numbers from right
echo ${tmp:(-1)} # output last of remaining chars
Output: f
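Since the question mentions a list of names, the same idea can be wrapped in a small function and run in a loop. This is a sketch only: the function name last_letter and the second sample name are made up, and it uses the POSIX ${var#...} / ${var%?} expansions instead of the bash-only ${tmp:(-1)} so it also runs under plain sh.

```shell
# last_letter NAME: print the last character before the first digit.
# (Hypothetical helper name, not from the original answer.)
last_letter() {
    prefix=${1%%[0-9]*}                      # drop everything from the first digit on
    printf '%s\n' "${prefix#"${prefix%?}"}"  # keep only the last remaining character
}

# Run it over a small list of sample names.
letters=$(for name in abcdef01234 xyz789; do last_letter "$name"; done)
printf '%s\n' "$letters"
```

A name that starts with a digit leaves an empty prefix, so the function prints an empty line for it.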

You can use a regexp like [a-zA-Z]+([a-zA-Z])[0-9]+. If you know how to use sed, it is pretty easy.
Check https://regex101.com/r/XCkKM5/1

The match will be the letter you want.
^\w+([a-zA-Z])\d+$
As a sed command (on OSX) this will be :
echo "abcdef12345" | sed -E "s#^[a-zA-Z]+([a-zA-Z])[0-9]+\$#\1#"

You can also try the following:
echo "abcdef01234" | awk '{match($0,/[a-zA-Z]+/);print substr($0,RLENGTH,1)}'

I assume "a list of names" means a file, here called file. Using grep's PCRE and (positive) lookahead:
$ grep -oP "[a-z](?=[^a-z])" file
f
It prints each lowercase letter that is immediately followed by a character that is not a lowercase letter.

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks

You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'

Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p in s/\.png//p forces the print if the substitution was done, but does not force it otherwise.
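To see the difference between the two commands, it helps to run them on input that contains a line without .png. The middle line below is made up for that purpose; this is only a small sketch.

```shell
# Two lines from the question plus one hypothetical line without ".png".
input='#"Afghanistan.png",
// not an image line
#"Albania.png",'

# Prints every line twice, whether or not it contains ".png".
both=$(printf '%s\n' "$input" | sed 'p; s/\.png//')

# Prints every line once, plus a second copy only where the
# substitution actually happened.
only_matches=$(printf '%s\n' "$input" | sed -n 'p; s/\.png//p')

printf '%s\n---\n%s\n' "$both" "$only_matches"
```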

That is pretty easy to do with sed and you do not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches anything starting with #" followed by non-period characters ([^.]*) and then by .png",. The non-period characters before .png", are captured with the group brackets \( and \), so we can reuse what this group matched. So this is the regular expression to be replaced:
#"\([^.]*\)\.png",
Then follows the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would change in the output. However, following the & there is a newline character, represented by the backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input

I prefer this over Carles Sala's and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
You could just say it's personal preference.

or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input

select part of filename using regex

I got a file that looks like
dcdd62defb908e37ad037820f7 /sp/dir/su1/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/su2/89/sfdh.gz
ee1d443b8a0cc27749f4b31e56 /sp/dir/su3/89/24.gz
33c02e311fd0a894f7f0f8aae4 /sp/dir/su4/89/dfad.gz
43f6cdce067f6794ec378c4e2a /sp/dir/su5/89/adf.gz
2f6c584116c567b0f26dfc8703 /sp/dir/su6/895/895.gz
a864b7e327dac1bb6de59dedce /sp/dir/su7/895/895.gz
How do I use sed to substitute all the su* directories so that I can replace them with a single value, like
sed "s/REGEXP/newfolder/g" myfile
Thanks in advance.

I think you want
sed 's/su./newfolder/g'
If you actually want to keep the number in su1...su7 as a part of newfolder (for example newfolder1...newfolder7), you can do:
sed 's/su\(.\)/newfolder\1/g'
It also depends on how strict you want your patterns to be. The above will match su followed by any character and do the replacement. On the other hand, a command like s#/su\([0-9]\)/#/newfolder\1/#g will only match /su followed by a digit, followed by /. So you may need to adjust your pattern accordingly.
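To make the strictness point concrete, here is a sketch with a made-up line whose file name also happens to contain "su". The loose pattern rewrites parts of the file name too, while the anchored one only touches the directory.

```shell
# Hypothetical line: the file name also contains "su" twice.
line='abc123 /sp/dir/su1/89/result_sum.gz'

# "su." matches su plus any character, anywhere on the line.
loose=$(printf '%s\n' "$line" | sed 's/su./newfolder/g')

# Anchoring on the slashes and a digit limits it to the directory.
strict=$(printf '%s\n' "$line" | sed 's#/su[0-9]/#/newfolder/#g')

printf '%s\n%s\n' "$loose" "$strict"
```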

$ sed -e 's|/su[^/]*|/newfolder|' /tmp/files
dcdd62defb908e37ad037820f7 /sp/dir/newfolder/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/newfolder/89/sfdh.gz
...
If you want to get rid of the checksums as well:
$ sed -r -e 's|/su[^/]*|/newfolder|' -e 's/^[^ ]+ +//' /tmp/files
/sp/dir/newfolder/89/asga.gz
/sp/dir/newfolder/89/sfdh.gz
...

su[0-9] will match su followed by a single digit.

sed requires a dirty amount of metacharacter escaping; some of this may be slightly off. The + quantifier needs extended regular expressions, hence -E.
sed -i -E -e 's/\/su[^\/]+\//\/newFolder\//g' myfile

I vote for Wayne Conrad's answer as the most likely to be what the OP wants, but I'd suggest using an alternate character for the sed expression separator, thus:
sed 's|/su[^/]*|/newfolder|' /tmp/files
That makes it a bit cleaner.
Note also that the trailing 'g' is probably not wanted.

Use awk. Since the path has a delimiter, '/', you can split on it; after that, field 4 is the one you want to change. So if you have paths like /sp/su3dir/su2/89/sfdh.gz, su3dir will not be affected.
awk -F"/" '{$4="newfolder";}1' OFS="/" file
