grep and regex stored in string

my question is quite short:
a="'[0-9]*'"
grep -E '[0-9]*' #for example, line containing 000 will be recognized and printed
but
grep -E $a #line containing 000 WILL NOT be printed, why is that?
Does substituting a variable into the grep regex change the command's behaviour, or have I missed something syntactically? In other words, how do I make grep accept a regex stored in a string variable?
Thank you in advance.

Quotes go around data, not in data. That means, when you store data (in this case, a regex expression) in a variable, don't embed quotes in the variable; instead, put double-quotes around the variable when you use it:
a="[0-9]*"
grep -E "$a"
You can sometimes get away with leaving the double-quotes off when using variables (as in Avinash Raj's comment), but it's not generally safe. In this case, it'll work fine provided there are no files or subdirectories in the current working directory with names that happen to start with a digit. You see, without double-quotes around $a, the shell will take its value, try to split it into multiple words (not a problem here), try to expand each word that contains shell wildcards into a list of matching files (a potential problem here), and pass that to the command (grep) as its list of arguments. That means that if you happen to have files that start with digits in the current directory, grep thinks you ran a command like this:
grep -E 1file.txt 2file.jpg 3file.etc
... and it treats the first filename as the pattern to search for, and any other filenames as files to be searched. And you'll be scratching your head wondering why your script works or fails depending on which directory you happen to be in.
Note: the pattern [0-9]* is a valid regular expression, and a valid shell glob (wildcard) pattern, but it means very different things in the two contexts. As a regex, it means 0 or more digits in a row. As a shell glob, it means something that starts with a digit. Speaking of which, grep -E '[0-9]*' is not actually going to be very useful, since everything contains strings of 0 or more digits, so it'll match every line of every file you feed it.
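If you want to watch the glob expansion happen, here's a minimal sketch (assuming a throwaway directory containing a couple of digit-prefixed files):
touch 1file.txt 2file.jpg
a="[0-9]*"
echo "$a"    # prints: [0-9]*                (quoted: passed through literally)
echo $a      # prints: 1file.txt 2file.jpg   (unquoted: glob-expanded by the shell)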

Related

Using Variables with Regex that contain a space (\s) and sed

I'm trying to create a sort script using literal string variables and regex, and a sort using sed in bash. I cannot seem to match the literal strings with spaces when using variables, although I can match them when using the regex directly. So:
#!/bin/bash
group1="IRISHFHD"
group2="REGIONAL FHD"
sed -i '/group-title="'${group1}/',+1d' JWLINE.m3u
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
I've tried adding \s into the group variable but it doesn't work.
John
The problem has nothing to do with regex, it's all down to how the shell treats variables' values. When a variable is expanded without double-quotes around it (i.e. ${group2}), the shell will split it into "words" based on whitespace. It'll also try to expand any words that contain shell wildcards into lists of matching files, and several regex metacharacters look like shell wildcards, which can cause serious chaos.
In this example:
sed -i '/group-title="'${group2}/',+1d' JWLINE.m3u
It's a little more complicated, because the variable reference is in between two single-quoted sections. In this case, the part before the variable reference gets attached to the first "word" in the variable, and the part after gets attached to the last word. Essentially, it expands into the equivalent of this:
sed -i '/group-title="REGIONAL' 'FHD/,+1d' JWLINE.m3u
^ That's a space between arguments
Anyway, since it gets split on the whitespace, sed gets two partial arguments instead of one whole one, and it doesn't work at all.
Solution: as in almost all situations, you should have double-quotes around the variable reference to prevent weird effects like this. There are a few options for this. You could just add double-quotes around the variable part:
sed -i '/group-title="'"${group2}"/',+1d' JWLINE.m3u
...but IMO this is confusing; some of those quotes are syntactic (i.e. parsed by the shell), and one is literal (passed to sed as part of the regex), and it's not obvious which are which. I'd prefer to just use double-quotes around the whole thing, and escape the double-quote that's supposed to be literal:
sed -i "/group-title=\"${group2}/,+1d" JWLINE.m3u
^^ Escape makes this " a literal part of the argument.
(In double-quotes, you'd also need to escape any dollar signs, backslashes, or backticks that were supposed to be literal parts of the argument. But in this case, there aren't any of those.)
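As a quick check, here's a sketch with a hypothetical two-entry playlist (note that the ,+1d address form is a GNU sed extension):
group2="REGIONAL FHD"
printf '%s\n' '#EXTINF:-1 group-title="REGIONAL FHD",Chan1' \
              'http://example.com/1' \
              '#EXTINF:-1 group-title="OTHER",Chan2' \
              'http://example.com/2' > sample.m3u
sed "/group-title=\"${group2}/,+1d" sample.m3u   # only the OTHER entry remains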

Is there a bash script for finding a specific character between two given expressions?

I have a 3-step problem: I need to
find all occurrences of the character : in a LaTeX file, but only when it is in a \ref{} or in a \label{}, in which there can be other characters. Example: The system's total energy (\ref{eq:E}).
replace those : with _. Example becomes: The system's total energy (\ref{eq_E}).
do this for all such occurrences of : in references or labels, in about 100 files.
I've never done this before. I've worked out that I can use regular expressions to find complex occurrences. I can find either \ref{ or \label{ with (\\ref\{|\\label\{), but I can't put it in a lookbehind because it is not fixed width. My other problem with lookbehind and lookahead is that I can only match everything between my assertions, not specific characters (from what I've understood).
I've also worked out that I can use sed for find and replace. I was planning on using a regular expression as my sed "find". Does that make sense?
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
I know that my questions are all over the place, as I said, never done this before and there is a mountain of documentation I'm only beginning to tackle. Any help or pointers would be appreciated.
You can use the following command, which relies on capturing groups to extract the different parts of a ref or label containing a colon and replace the colon with an underscore:
sed -E 's/\\(ref|label)\{([^:]*):([^}]*)}/\\\1\{\2_\3}/g'
The expression captures the whole ref or label tag, matching the tag name in the first capturing group, the part that precedes the colon in the second capturing group and the part that follows the colon in the third capturing group. The replacement pattern uses references to these capturing groups and can be read as \<tagName>{<before colon>_<after colon>}.
Note that it would be preferable to use a parser that understands the LaTeX format; the regex is likely to fail for some edge cases.
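As a quick sanity check on a made-up sample line:
printf '%s\n' 'The total energy (\ref{eq:E}) and \label{sec:intro}.' |
  sed -E 's/\\(ref|label)\{([^:]*):([^}]*)}/\\\1\{\2_\3}/g'
# -> The total energy (\ref{eq_E}) and \label{sec_intro}.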
And finally, I'm not sure how to go about looping on all my files (which have ordered names). Can I do an if or while loop in a bash script?
sed accepts a list of files as parameters and will apply its command to all of them. The list of files can be produced by the expansion of a glob, e.g. sed 'sedCommand' /your/directory/*.txt, which would work on all files of /your/directory/ whose names end in .txt.
In this case you will likely want to use sed's -i ("in place") flag, which asks sed to write its result directly to the target file rather than to its standard output. The flag can be followed by a suffix if you want a backup of the original; for instance, sed -i.bak 'command' file.txt will leave the result in file.txt and the original in file.txt.bak.
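Putting the pieces together for the ~100 files, something like the following should work (a sketch; adjust the glob to match your actual file names, and drop the .bak suffix once you trust the result):
sed -i.bak -E 's/\\(ref|label)\{([^:]*):([^}]*)}/\\\1\{\2_\3}/g' /your/directory/*.tex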

regular expression for "11th to 16th letter"

I am new to regular expressions. I need help reading files on a Unix system; I want to apply a regular expression to the ls command.
I have below files :
DLERMS08001708161708209683.csv.gz
DLERMS13001708161330170816.csv.gz
DLERMS13001708171330170816.csv.gz
and would like to extract the files which have 170816 as the 11th through 16th characters of the filename.
I tried the command ls *170816*.gz. However, I am getting 3 filenames instead of two. I want only the first two filenames instead of all 3. Could you please help?
Also, note that my third filename already contains 170816 at the end (DLERMS13001708171330170816.csv.gz); I want to avoid matching it in my ls command output.
Using bash parameter-expansion alone,
for file in *.csv.gz; do
    [ -e "$file" ] || continue    # skip the literal pattern if the glob matched nothing
    [ "${file:10:6}" == "170816" ] && printf "%s\n" "$file"
done
${PARAMETER:OFFSET:LENGTH}
This one can expand only a part of a parameter's value, given a position to start and maybe a length. If LENGTH is omitted, the parameter is expanded up to the end of the string. If LENGTH is negative, it's taken as a second offset into the string, counting from the end of the string.
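For example, with one of the names from the question (just to illustrate the OFFSET:LENGTH slice; OFFSET is zero-based):
f=DLERMS08001708161708209683.csv.gz
echo "${f:10:6}"   # prints 170816 (the 11th through 16th characters)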
Based on comments below, the OP apparently wants to copy the matching files to an alternate path, in which case the printf should be replaced with cp and the necessary arguments:
[ "${file:10:6}" == "170816" ] && cp -- "$file" path/to/destination
Firstly, be careful not to confuse regular expressions with shell glob patterns (which is what you want here).
Your glob could be:
??????????170816*.gz
Which matches 10 unknown characters followed by the sequence you specified.
Depending on your next step, you might not need to use ls at all, for example you can loop over these files like this:
for file in ??????????170816*.gz; do
    something_with "$file"
done
Or output the files that match using one of the following:
echo ??????????170816*.gz
printf '%s\n' ??????????170816*.gz
If there is a possibility that no files match, then you may wish to consider enabling nullglob (using shopt -s nullglob), which would expand to nothing in that case.
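For instance, a sketch that collects the matches into an array and copes with the no-match case:
shopt -s nullglob
files=( ??????????170816*.gz )
if (( ${#files[@]} )); then
    printf '%s\n' "${files[@]}"
else
    echo "no matching files" >&2
fi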
If you want to use globbing, note that it's not the same as using regular expressions.
In your example you can use ? as a placeholder matching a single character.
Hence, to achieve what you want as output, use ls with the pattern below:
ls ??????????170816*
You want to use the "any single character" wildcard ? (not regex) the appropriate number of times.
ls DLERMS????170816*.csv.gz
Regexes are much more flexible/powerful and overkill for this simple use case.
But as far as I know, ls does not support them, so you would have to go via other bash tools to identify the files in case you ever need to actually use regexes for anything.
I also kept what I perceive to be another common part of your filenames, the DLERMS at the beginning; if that is NOT common, replace those letters with ?, too.
Try this:
ls ??????????170816*
A solution with find and regex
find . -regextype egrep -regex "^.{12}170816.*\.gz"
find matches the regex against the whole path, which starts with ./, so .{12} covers those two characters plus the first ten characters of the filename; 170816 then falls on the 11th through 16th characters of the name.
I don't think you can use regular expressions with ls directly, but with egrep, it works fine.
ls * | egrep "DLERMS[0-9]{4}170816[0-9]{10}.csv.gz"
[0-9]{4} - any number, four times.
[0-9]{10} - any number, ten times.
You could also use grep -E instead of egrep; the -E option enables extended regular expressions, in which metacharacters like {, | and ( work without needing a \ escape.

Do wildcard characters behave differently?

I want to understand whether the behavior of wildcard characters is the same or not. For example:
1) At the Unix prompt, if I type ls *.xml, it lists all files ending with .xml, for example: 1.xml, first.xml etc.
--> So it appears that * matches any sequence of characters.
2) Now, I was trying to find some text in a file using grep, and I executed the following command:
grep -i "*.xml" first.txt
To my utter surprise no results were returned, even though first.txt had contents like first.xml and second.xml.
If I do grep -i "xml" first.txt, then I get results.
This behavior is causing confusion: how does * match any text in case (1) but fail in case (2)?
Does it behave differently in different situations, and if so, where can I find this documented?
Shell uses globbing. https://en.wikipedia.org/wiki/Glob_(programming)
Grep uses regular expressions. https://en.wikipedia.org/wiki/Regular_expression
They are two different languages, and for grep, * is not a wildcard: it means the previous character, repeated zero or more times.
You want
grep '.*\.xml'
which means any character (.), repeated zero or more times (*), followed by a literal '.' (\.), then xml
(For practical purposes you would just use fgrep .xml or grep '\.xml$' of course)
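A quick way to see the two behaviours side by side (a hypothetical test file; GNU grep treats a leading * as a literal asterisk, since there is nothing before it to repeat):
printf '%s\n' first.xml second.xml notes.txt > first.txt
grep '*.xml' first.txt      # no matches: the * is taken literally here
grep '.*\.xml' first.txt    # prints first.xml and second.xml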

Regex (grep) for multi-line search needed [duplicate]

This question already has answers here:
How can I search for a multiline pattern in a file?
(11 answers)
Closed 1 year ago.
I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware that this also adds a trailing NUL char to the output if used with -o.
-o print only the matching part. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; -o prevents that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? find . in non-greedy mode, that is, stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This is a try to find the same indentation of method.
As you can imagine, this search prints the main method in a C (*.c) source file.
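Adapted to the original question, a sketch along the same lines (assumes a GNU grep built with PCRE support):
grep -rlIzP --include='*.sql' '(?si)\bselect\b.*?\bcustomerName\b.*?\bfrom\b' .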
I am not very good with grep, but your problem can be solved using the awk command. Just see:
awk '/select/,/from/' *.sql
The above command prints everything from the first occurrence of select through the next occurrence of from. You then need to verify whether the returned statements contain customerName; for this you can pipe the result and use awk or grep again.
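For example, one hypothetical way to chain them (this shows the matching text, though not which file it came from):
awk '/select/,/from/' *.sql | grep -i customerName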
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
$/ = "\n\n";   # read "paragraphs" at a time
while (<>)
{
    if ($_ =~ m/SELECT.*customerName.*FROM/msi)   # /s lets . match newlines too
    {
        print "$ARGV\n";   # $ARGV holds the current file name
        close ARGV;        # abandon this file and move on to the next
    }
}
That needs to be wrapped into a sub that is then invoked via File::Find's find() function.
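If you'd rather skip the File::Find boilerplate, the same paragraph-at-a-time idea also fits in a one-liner driven by find (a sketch; perl -00 enables paragraph mode, roughly equivalent to setting $/ as above):
find . -name '*.sql' -exec perl -00 -ne 'if (/SELECT.*customerName.*FROM/si) { print "$ARGV\n"; close ARGV }' {} +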