Using Variables with Regex that contain a space (\s) and sed

I'm trying to create a sort script in bash that uses sed with regexes built from literal string variables. I cannot match the literal strings that contain spaces when they come from variables, although I can match them when I write the regex directly. So:
#!/bin/bash
group1="IRISHFHD"
group2="REGIONAL FHD"
sed -i '/group-title="'${group1}/',+1d' JWLINE.m3u
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
I've tried adding \s into the group variable, but it doesn't work.
John

The problem has nothing to do with regex, it's all down to how the shell treats variables' values. When a variable is expanded without double-quotes around it (i.e. ${group2}), the shell will split it into "words" based on whitespace. It'll also try to expand any words that contain shell wildcards into lists of matching files, and several regex metacharacters look like shell wildcards, which can cause serious chaos.
In this example:
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
It's a little more complicated, because the variable reference is in between two single-quoted sections. In this case, the part before the variable reference gets attached to the first "word" in the variable, and the part after gets attached to the last word. Essentially, it expands into the equivalent of this:
sed -i '/group-title="REGIONAL' 'FHD/,+1d' JWLINE.m3u
                               ^ That's a space between arguments
Anyway, since it gets split on the whitespace, sed gets two partial arguments instead of one whole one, and it doesn't work at all.
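To see the splitting directly, you can count the arguments the shell produces. This is a minimal sketch, assuming bash (or another POSIX-style shell with the default IFS); count_args is a hypothetical helper introduced only for illustration.

```shell
group2="REGIONAL FHD"

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion: the value is split at the space.
unquoted=$(count_args '/group-title="'${group2}/',+1d')

# Double-quoted expansion: the value stays in one piece.
quoted=$(count_args '/group-title="'"${group2}"/',+1d')

echo "unquoted: $unquoted argument(s)"   # 2
echo "quoted:   $quoted argument(s)"     # 1
```

Note that zsh does not word-split unquoted variables by default, so the unquoted count would differ there.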
Solution: as in almost all situations, you should have double-quotes around the variable reference to prevent weird effects like this. There are a few options for this. You could just add double-quotes around the variable part:
sed -i '/group-title="'"${group2}"/',+1d' JWLINE.m3u
...but IMO this is confusing; some of those quotes are syntactic (i.e. parsed by the shell), and one is literal (passed to sed as part of the regex), and it's not obvious which are which. I'd prefer to just use double-quotes around the whole thing, and escape the double-quote that's supposed to be literal:
sed -i "/group-title=\"${group2}/,+1d" JWLINE.m3u
                     ^^ The escape makes this " a literal part of the argument.
(In double-quotes, you'd also need to escape any dollar signs, backslashes, or backticks that were supposed to be literal parts of the argument. But in this case, there aren't any of those.)
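Here is a small end-to-end check of the double-quoted form. It is only a sketch: the playlist lines and URLs are made up for illustration, the file is a temporary one rather than JWLINE.m3u, and the ,+1 address is a GNU sed extension, so GNU sed is assumed.

```shell
group2="REGIONAL FHD"

# Made-up sample playlist written to a temporary file.
playlist=$(mktemp)
printf '%s\n' \
  '#EXTINF:-1 group-title="REGIONAL FHD",Channel A' \
  'http://example.invalid/a' \
  '#EXTINF:-1 group-title="IRISHFHD",Channel B' \
  'http://example.invalid/b' > "$playlist"

# The whole sed script is one double-quoted argument; \" is a literal quote.
# Without -i, the result is printed instead of being written back to the file.
remaining=$(sed "/group-title=\"${group2}/,+1d" "$playlist")

printf '%s\n' "$remaining"
rm -f "$playlist"
```

Only the IRISHFHD entry and its URL should remain. If a group name could ever contain regex metacharacters or a /, it would need escaping before being placed into the pattern.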

Related

Difference between using grep regex pattern with or without quotes?

I'm learning from Linux Academy and the tutorial shows how to use grep and regex.
He is putting his regex pattern in between quotes something like this:
grep 'pattern' file.txt
This seems to be the same as doing it without quotes:
grep pattern file.txt
But when he does something like this, he needs to escape the { and }:
grep '^A\{1,4\}' file.txt
And after doing some testing, these escape characters don't seem to be needed when writing the pattern without the quotes.
grep ^A{1,4} file.txt
So what is the difference between these two methods?
Are the quotations necessary?
Why are the escape characters needed in the first case?
Lastly, I've also seen other methods like grep -E and egrep, which is the most common method that people use to grep with regex?
Edit: Thanks for the reminder that the pattern goes before the file.
Many thanks!

You can sometimes get away with omitting quotes, but it's safest not to. This is because the syntax of regular expressions overlaps that of filename wildcard patterns, and when the shell sees something that looks like a wildcard pattern (and it isn't in quotes), the shell will try to "expand" it into a list of matching filenames. If there are no matching files, it gets passed through unchanged, but if there are matches it gets replaced with the matching filenames.
Here's a simple example. Suppose we're trying to search file.txt for an "a" followed optionally by some "b"s, and print only the matches. So you run:
grep -o ab* file.txt
Now, "ab*" could be interpreted as a wildcard pattern looking for files that start with "ab", and the shell will interpret it that way. If there are no files in the current directory that start with "ab", this won't cause a problem. But suppose there are two, "abcd.txt" and "abcdef.jpg". Then the shell expands this to the equivalent of:
grep -o abcd.txt abcdef.jpg file.txt
...and then grep will search the files abcdef.jpg and file.txt for the regex pattern abcd.txt.
So, basically, using an unquoted regex pattern might work, but is not safe. So don't do it.
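You can watch the expansion happen without involving grep at all. A minimal sketch, assuming a POSIX-style shell and a scratch directory created with mktemp; the file names are the ones from the example above.

```shell
# Scratch directory containing the two files from the example.
workdir=$(mktemp -d)
touch "$workdir/abcd.txt" "$workdir/abcdef.jpg"

# Unquoted: the shell replaces ab* with the matching file names.
unquoted=$(cd "$workdir" && printf '<%s>' ab*)

# Quoted: the pattern reaches the command untouched.
quoted=$(cd "$workdir" && printf '<%s>' 'ab*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
rm -rf "$workdir"
```

The unquoted line lists both file names (in an order that depends on the locale), while the quoted line shows the pattern itself.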
BTW, I'd also recommend using single-quotes instead of double-quotes, because there are some regex characters that're treated specially by the shell even when they're in double-quotes (mostly dollar sign and backslash/escape). Again, they'll often get passed through unchanged, but not always, and unless you understand the (somewhat messy) parsing rules, you might get unexpected results.
BTW^2, for similar reasons you should (almost) always put double-quotes around variable references (e.g. grep -o 'ab*' "$filename" instead of grep -o 'ab*' $filename). Single-quotes don't allow variable references at all; unquoted variable references are subject to word splitting and wildcard expansion, both of which can cause trouble. Double-quoted variables get expanded and nothing else.
BTW^3, there are a bunch of variants of regular expression syntax. The reason the curly braces in your example expression need to be escaped is that, by default, grep uses POSIX "basic" regular expression syntax ("BRE"). In BRE syntax, some regex special characters (including curly brackets and parentheses) must be escaped to have their special meaning (and some others, like alternation with |, are just not available at all). grep -E, on the other hand, uses "extended" regular expression syntax ("ERE"), in which those characters have their special meanings unless they're escaped.
And then there's the Perl-compatible syntax (PCRE), and many other variants. Using the wrong variant of the syntax is a common cause of trouble with regular expressions (e.g. using Perl extensions in an ERE context). It's important to know which variant the tool you're using understands, and write your regex to that syntax.
Here's a simple example: "a", followed by 1 to 3 space-like characters, followed by "b", in various regex syntax variants:
a[[:space:]]\{1,3\}b # BRE syntax
a[[:space:]]{1,3}b # ERE syntax
a\s{1,3}b # PCRE syntax
Just to make things more complicated, some tools will nominally accept one syntax, but also allow some extensions from other syntax variants. In the example above, you can see that perl added the shorthand \s for a space-like character, which is not part of either POSIX standard syntax. But in fact many tools that nominally use BRE or ERE will actually accept the \s shorthand.
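The BRE and ERE spellings can be compared side by side with grep. This is a sketch with made-up sample lines; the PCRE form is left out because grep -P is a GNU extension that is not available everywhere.

```shell
# Made-up sample: one space, four spaces, and no space between a and b.
sample=$(printf '%s\n' 'a b' 'a    b' 'ab')

# BRE (grep's default): the braces must be escaped to act as an interval.
bre=$(printf '%s\n' "$sample" | grep 'a[[:space:]]\{1,3\}b')

# ERE (grep -E): the braces are an interval without escaping.
ere=$(printf '%s\n' "$sample" | grep -E 'a[[:space:]]{1,3}b')

echo "BRE matched: $bre"
echo "ERE matched: $ere"
```

Both commands should select only the line with a single space, since four spaces exceed the 1 to 3 range and "ab" has none.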

Actually, there are two completely unrelated aspects of escaping in your question. The first has to do with how to represent strings in bash. This is about readability, which usually means personal taste. For example, I don't like escaping, hence I prefer writing ab\ cd as 'ab cd'. Hence, I would write
echo 'ab cd'
grep -F 'ab cd' myfile.txt
instead of
echo ab\ cd
grep -F ab\ cd myfile.txt
but there is nothing wrong with either one, and you can choose whichever looks simpler to you.
The other aspect indeed is related to grep, at least as long as you do not use the -F option in grep, which always interprets the search argument literally. In this case, the shell is not involved, and the question is whether a certain character is interpreted as a regexp character or as a literal. Gordon Davisson has already explained this in detail, so I give only an example which combines both aspects:
Say you want to grep for a space, followed by one or more periods, followed by another space. You can't write this as
grep -E  .+  myfile.txt
because the spaces would be eaten by bash and the . would have special meaning to grep. Hence, you have to choose some escape mechanism. My personal style would be
grep -E ' [.]+ ' myfile.txt
but many people dislike the [.] and prefer \. instead. This would then become
grep -E ' \.+ ' myfile.txt
This still uses quotes to salvage the spaces from the shell, but escapes the period for grep. If you prefer to use no quotes at all, you can write
grep -E \ \\.+\  myfile.txt
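Note that you need to prefix the \ which is intended for grep with another \, because the backslash has, like a space, a special meaning for the shell, and if you did not write \\., grep would not see a backslash-period, but just a period. The escaped space at the end of the pattern must also be followed by an ordinary space, which separates the pattern from the file name.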
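To check that the quoted and the backslash-only spellings really hand grep the same pattern, you can print the argument the shell builds. A minimal sketch; show_args is a hypothetical helper and the sample lines are made up.

```shell
# Hypothetical helper: prints each argument wrapped in brackets.
show_args() { printf '[%s]' "$@"; }

# Quoted spelling: the quotes protect the spaces and the backslash.
quoted=$(show_args ' \.+ ')

# Unquoted spelling: every space and the backslash are escaped for the shell.
unquoted=$(show_args \ \\.+\  )

echo "quoted:   $quoted"
echo "unquoted: $unquoted"

# Made-up sample lines: only the first has space, periods, space.
hits=$(printf '%s\n' 'a ... b' 'a...b' 'a x b' | grep -E ' \.+ ')
echo "matched: $hits"
```

Both spellings should print the same bracketed string, and grep should select only the first sample line.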

Trying to do a simple sed substitution but I'm confused about what needs to be escaped

I have this string: '$'nwnwnwnnn
And want to change it to: { bitset<9>(0bnwnwnwnnn), '$'},
I've looked at many similar questions for different shells using their methods but nothing has worked. I'm generally in zsh but I can use bash or another shell.
The general form I've been trying is this:
sed -E -i new s/(\'.\')([nw]+)/{ bitset<9>(0b\2), \1},/g thing.txt
It should work for any character other than $ and any sequence of n or w.
I'm generally confused as to what I need to escape here. Some answers on this site said to escape the parenthesis in the first part of the substitution.
Am I using -i incorrectly?

You need to escape the parentheses to create a capture group if you're using basic regexp; you don't escape them if you're using extended regexp. The -E option (accepted by both BSD and GNU sed) and GNU sed's older -r option enable extended regexp, so you don't need to escape them.
If you only want to match $ rather than allow any character in the quotes, you need an escaped $.
You need to put the entire s/// command inside quotes, as it must be a single argument to the sed command.
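When using -i, it's conventional to put a . before the suffix. Also, the suffix is put on the saved copy of the original file, not the new file that you're creating with the changes, so new is a poor suffix. Attaching the suffix directly (-i.bak) works with both GNU and BSD sed, whereas the separated form (-i .bak) is accepted only by BSD sed.
Because the command is in double quotes, the shell consumes one level of backslash escaping, so the escaped $ has to be written as \\\$ for sed to receive \$.
sed -E -i.bak "s/('\\\$')([nw]+)/{ bitset<9>(0b\2), \1},/g" thing.txt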
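The substitution can be tried on standard input before touching any file. This is a sketch assuming a sed that supports -E (GNU and BSD sed both do); it shows the "any character" form and a "dollar sign only" form, where \\\$ inside double quotes reaches sed as \$.

```shell
# The sample string from the question: '$'nwnwnwnnn
input="'\$'nwnwnwnnn"

# Any single character between the quotes.
any_char=$(printf '%s\n' "$input" |
  sed -E "s/('.')([nw]+)/{ bitset<9>(0b\2), \1},/g")

# Only a literal dollar sign between the quotes.
dollar_only=$(printf '%s\n' "$input" |
  sed -E "s/('\\\$')([nw]+)/{ bitset<9>(0b\2), \1},/g")

echo "$any_char"
echo "$dollar_only"
```

Both commands should print { bitset<9>(0bnwnwnwnnn), '$'}, for this input. Once the output looks right, -i with a backup suffix can be added to edit the file in place.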

Sed script to rewrite certain strings

I'm dealing with a body of XML files containing unstructured texts with semantic markup for personal names.
For reasons to do with the stylesheet that will eventually show them via a web application, I need to replace:
<persName>Fred</persName>'s
<persName>Wilma</persName>'s
with
<persName>Fred's</persName>
<persName>Wilma's</persName>
I have a single line in a shell script, being run in Gitbash for Windows, below. It runs OK, but has no effect. I suppose I'm missing something obvious, perhaps to do with escaping characters, but any help appreciated.
sed -i "s/<\/persName>\'s/\'s<\/persName>/g" test.xml

You may use
sed -i "s,</persName>'s,'s</persName>,g" test.xml
Details
s - we want to replace
, - a delimiter
</persName>'s - this string to find
, - delimiter
's</persName> - replace with this string
, - delimiter
g - multiple times if more than one is found
The -i option makes the replacements directly in the file.
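Note that you do not have to escape ' when defining the sed command inside a double-quoted string. Inside double quotes the shell passes \' through to sed unchanged, and GNU sed treats \' as an end-of-pattern-space anchor rather than a literal quote, which is why the original command ran without matching anything.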
It is a good idea to use a delimiter char other than the common / if there are / chars inside the regex or/and replacement pattern.
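A quick way to confirm the command is to run it on a sample line from standard input, without -i. A minimal sketch using one of the lines from the question:

```shell
# One of the sample lines from the question.
xml="<persName>Fred</persName>'s"

# Comma as the delimiter, so the / in the closing tag needs no escaping.
fixed=$(printf '%s\n' "$xml" | sed "s,</persName>'s,'s</persName>,g")

echo "$fixed"   # <persName>Fred's</persName>
```

Any character that does not occur in the pattern or the replacement can serve as the delimiter; the comma is just one convenient choice here.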

The comment on your question suggests an easier solution, but I guess that there might be names where the suffix differs from 's, such as names ending in s. So I chose a solution where you grab what's on the right and put it in the middle.
As the separator for the search and replace command in sed you can choose whatever you want. I've chosen #, so you don't have to escape the slashes in the text. The escaped parentheses capture what's inside them, and the captures are available as \1 and \2.
sed 's#<persName>\(.*\)</persName>\(.*\)#<persName>\1\2</persName>#g' testfile
Result:
<persName>Fred's</persName>
<persName>Wilma's</persName>
If you want to replace it in file, you can use the -i parameter. But be sure to check the result first.

grep and regex stored in string

my question is quite short:
a="'[0-9]*'"
grep -E '[0-9]*' #for example, line containing 000 will be recognized and printed
but
grep -E $a #line containing 000 WILL NOT be printed, why is that?
Does substitution for grep regex change the command's behaviour or have I missed something from a syntactic point of view? In other words, how do I make it so that grep accepts regex from a string stored in a variable.
Thank you in advance.

Quotes go around data, not in data. That means, when you store data (in this case, a regular expression) in a variable, don't embed quotes in the variable; instead, put double-quotes around the variable when you use it:
a="[0-9]*"
grep -E "$a"
You can sometimes get away with leaving the double-quotes off when using variables (as in Avinash Raj's comment), but it's not generally safe. In this case, it'll work fine provided there are no files or subdirectories in the current working directory with names that happen to start with a digit. You see, without double-quotes around $a, the shell will take its value, try to split it into multiple words (not a problem here), try to expand each word that contains shell wildcards into a list of matching files (potential problem here), and pass that to the command (grep) as its list of arguments. That means that if you happen to have files that start with digits in the current directory, grep thinks you ran a command like this:
grep -E 1file.txt 2file.jpg 3file.etc
... and it treats the first filename as the pattern to search for, and any other filenames as files to be searched. And you'll be scratching your head wondering why your script works or fails depending on which directory you happen to be in.
Note: the pattern [0-9]* is a valid regular expression, and a valid shell glob (wildcard) pattern, but it means very different things in the two contexts. As a regex, it means 0 or more digits in a row. As a shell glob, it means something that starts with a digit. Speaking of which, grep -E '[0-9]*' is not actually going to be very useful, since everything contains strings of 0 or more digits, so it'll match every line of every file you feed it.
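The difference between quotes around the data and quotes in the data can be checked with a few sample lines. This is a sketch with made-up input; it uses [0-9]+ rather than [0-9]* so that the pattern does not match every line.

```shell
# Made-up input: only the second and third lines contain a digit.
lines=$(printf '%s\n' 'abc' '000' 'x1y')

# Regex stored without embedded quotes, double-quoted where it is used.
a='[0-9]+'
matches=$(printf '%s\n' "$lines" | grep -E "$a")

# Quotes embedded in the value become part of the regex itself,
# so grep looks for a literal ' before and after the digits.
b="'[0-9]+'"
embedded=$(printf '%s\n' "$lines" | grep -E "$b" || true)

echo "without embedded quotes: $matches"
echo "with embedded quotes:    ${embedded:-<no match>}"
```

The first command should select 000 and x1y; the second should select nothing, because none of the lines contains a single-quote character.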

Bash script - variable expansion within backtick/grep regex string

I'm trying to expand a variable in my bash script inside of the backticks, inside of the regex search string.
I want the $VAR to be substituted.
The lines I am matching are like:
start....some characters.....id:.....some characters.....[variable im searching for]....some characters....end
var=`grep -E '^.*id:.*$VAR.*$' ./input_file.txt`
Is this a possibility?
It doesn't seem to work. I know I can normally expand a variable with "$VAR", but won't this just search directly for those characters inside the regex? I am not sure what takes precedence here.

Variables do expand in backticks but they don't expand in single quotes.
So you need to either use double quotes for your regex string (I'm not sure what your concern with that is about) or use both quotes.
So either
var=`grep -E "^.*id:.*$VAR.*$" ./input_file.txt`
or
var=`grep -E '^.*id:.*'"$VAR"'.*$' ./input_file.txt`
Also, you might want to use $(grep ...) instead of backticks, since it is the more modern form, has cleaner quoting rules, and can be nested.

You need to have the expression in double quotes (and, then, escape anything which needs to be escaped) in order for the variable to be interpolated.
var=$(grep -E "^.*id:.*$VAR.*\$" ./input_file.txt)
(The backslash is not strictly necessary here, but I put it in to give you an idea. Your real expression is perhaps more complex.)
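The effect of the quoting style is easy to compare on a small input file. A minimal sketch; the value of VAR and the two sample lines are made up, and a temporary file stands in for input_file.txt.

```shell
VAR='needle'

# Made-up input written to a temporary file.
input=$(mktemp)
printf '%s\n' 'start id: 42 needle here end' \
              'start id: 43 other end' > "$input"

# Single quotes: grep receives the literal text $VAR, so nothing matches.
single=$(grep -E '^.*id:.*$VAR.*$' "$input" || true)

# Double quotes: the shell substitutes the value before grep runs.
double=$(grep -E "^.*id:.*${VAR}.*\$" "$input")

echo "single-quoted: ${single:-<no match>}"
echo "double-quoted: $double"
rm -f "$input"
```

Only the double-quoted form should return the first line. If the value of VAR could contain regex metacharacters, it would be matched as a regex, not as literal text.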
