I need to match a pattern across multiple lines with pdfgrep
pdfgrep -in -C line 'CHAPTER 1'[$'\n'][$' ']*'THIS IS THE TITLE' ~/temp.pdf
works ok and outputs
12: CHAPTER 1
THIS IS THE TITLE
Now
$ pattern="CHAPTER 1 - THIS IS THE TITLE"
$ echo "'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12:${#pattern}}'"
'CHAPTER 1'[$'\n'][$' ']*'THIS IS THE TITLE'
$ pdfgrep -in -C line "'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12:${#pattern}}'" ~/temp.pdf
no longer works; it outputs nothing. I guess something is going wrong with the parameter expansion, but I can't figure out what. Can anyone help?
Background info:
From "man pdfgrep"
pdfgrep works much like grep, with one distinction: It operates on pages and not on lines.
"." matches any character, line breaks INCLUDED.
You are using extra ' characters:
"'${pattern:0:9}'[$'\n'][$' ']*'${pattern:12:${#pattern}}'"
^ ^ ^ ^
Also, you are using $'\n' and $' ' inside double quotes, which prevents their expansion: within double quotes, $'...' is not treated as ANSI-C quoting.
The correct expression is:
"${pattern:0:9}"[$'\n'][$' ']*"${pattern:12:${#pattern}}"
In fact:
$ echo 'CHAPTER 1'[$'\n'][$' ']*'THIS IS THE TITLE'
CHAPTER 1[
][ ]*THIS IS THE TITLE
$ pattern="CHAPTER 1 - THIS IS THE TITLE"
$ echo "${pattern:0:9}"[$'\n'][$' ']*"${pattern:12:${#pattern}}"
CHAPTER 1[
][ ]*THIS IS THE TITLE
Note that echo prints the same output for both expressions. If you did things right, echo should print the final string, not a Bash expression.
It's not required, but as a best practice you should quote the *, [ and ] characters (thanks chepner for noticing). Also, $' ' is unnecessary here, since a plain space does the same job:
"${pattern:0:9}["$'\n'"][ ]*${pattern:12:${#pattern}}"
^ ^ ^
This will prevent glob expansion (which is unlikely to happen in your case, but still something to care about).
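As a minimal sketch of the advice above, the pattern can be assembled once and inspected with printf before it is handed to pdfgrep. The variable names head, tail and regex are illustrative and not part of the question.

```shell
#!/usr/bin/env bash
# Build the multi-line regex from the title string.
# head, tail and regex are illustrative names, not from the original question.
pattern="CHAPTER 1 - THIS IS THE TITLE"
head=${pattern:0:9}    # "CHAPTER 1"
tail=${pattern:12}     # "THIS IS THE TITLE"

# Only $'\n' sits outside the double quotes, so it becomes a real line feed;
# the brackets and the * stay quoted and cannot be glob-expanded.
regex="${head}["$'\n'"][ ]*${tail}"

# Inspect the final string; a correct result shows a real line break after "[".
printf '%s\n' "$regex"
```

The result would then be passed as one quoted argument, as in pdfgrep -in -C line "$regex" ~/temp.pdf.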
$'\n' doesn't expand to a line feed when it appears inside a double-quoted string:
prompt $ echo "$'\n'"
$'\n'
prompt $ echo $'\n'
Don't use double-quotes around the string:
prompt $ a='abcd'$'\n''efgc'
prompt $ echo "$a"
abcd
efgc
P.S. Your regular expression looks very strange. Why do you use square brackets around the \n and \s?
I'm trying to use Perl to reorder the content of an md5 file. For each line, I want the filename without the path then the hash. The best command I've come up with is:
$ perl -pe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
The input file (DCIM.md5) is produced by md5sum on Linux. It looks like this:
e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
The hash is matched by the first group ([[:alnum:]]+) in the
regular expression.
Then the spaces and the path to the file are
matched by .*?.
Then the filename is matched by ([^/]+).
The expression is enclosed with ^ (apparently unnecessary here)
and $. Without the $, the expression does not output what I expect.
I use | rather than / as a separator to avoid escaping it in file paths.
That command returns:
IMG_20150201_160548.jpg
e26ff03dc1bac80226e200c0c63d17a2IMG_20150204_190528.jpg
01f92572e4c6f2ea42bd904497e4f939IMG_20151011_193008.jpg
afce027c977944188b4f97c5dd1bd101IMG_20151011_195133.jpg
The matching is correct, the output sequence is correct (filename without path then hash) but the spacing is not: there's a newline after the filename. I expect it after the hash, like this:
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
It seems to me that my command outputs the newline character, but I don't know how to change this behavior.
Or possibly the problem comes from the shell, not the command?
Finally, some version information:
$ perl -version
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-linux-gnu-thread-multi-64int
(with 69 registered patches, see perl -V for more detail)
[^/]+ matches newlines, so the ones in your input are part of $2, which gets put first in your transformed $_ (And there's no newline in $1 so there's no newline at the end of $_...)
Solution: Read up on the -l option from perlrun. In particular:
-l[octnum]
enables automatic line-ending processing. It has two separate effects. First, it automatically chomps $/ (the input record separator) when used with -n or -p. Second, it assigns $\ (the output record separator) to have the value of octnum so that any print statements will have that separator added back on. If octnum is omitted, sets $\ to the current value of $/ .
Alternate solution, which uses lots of concepts from other answers, and comments ...
$ perl -pe 's|(\p{hex}+).*?([^/]+?)$|$2 $1|' DCIM.md5
... and explanation.
After investigating all the answers and trying to figure them out, I've decided that the root of the problem is that [^/]+ is greedy. In Perl, $ can match either at the very end of the string or just before a trailing newline. Because [^/]+ is greedy, it consumes the newline too, and $ then matches at the very end.
This was hard for me to figure out, since I did a lot of parsing using sed before using Perl, and even a greedy wildcard won't capture a newline in sed. Hopefully this post will help those who (being used to sed as I am) are also wondering (as I did) why the $ isn't acting "as I expect it to."
We can see the "greedy" issue by trying what I'll post as another, alternate answer.
Write the file:
$ cat > DCIM.md5<<EOF
> e26ff03dc1bac80226e200c0c63d17a2 ./Path1/IMG_20150201_160548.jpg
> 01f92572e4c6f2ea42bd904497e4f939 ./Path 2/IMG_20150204_190528.jpg
> afce027c977944188b4f97c5dd1bd101 ./Path3/Path 4/IMG_20151011_193008.jpg
> EOF
Get rid of the greedy [^/]+ by changing it to [^/]+?. Parse.
$ perl -pe 's|([[:alnum:]]+).*?([^/]+?)$|$2 $1|' DCIM.md5
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
Desired output accomplished.
The accepted answer, by @Shawn,
$ perl -lpe 's|^([[:alnum:]]+).*?([^/]+)$|$2 $1|' DCIM.md5
basically chomps the newline before the substitution runs and adds it back on output, so the $ anchor behaves the way a sed person would expect it to.
The answer by @CrafterKolyan stops the greedy [^/] from capturing the newline by using a class that excludes both the forward slash and the newline. This answer still needs the $ anchor to prevent the following situation:
1) .*? captures the empty string (0 or more of any character)
2) [^/\n]+ captures " ." (the text between the hash and the first slash)
The answer by @Borodin takes a quite different approach, but it's a great concept.
@Borodin, in addition, made a great comment that allows a more precise version of this answer, which is the version I put at the top of this post.
Finally, if one wants to follow the Perl programming model, here's another alternative.
$ perl -pe 's|([[:xdigit:]]+).*?([^/]+?)(\n\|\Z)|$2 $1$3|' DCIM.md5
P.S. Because sed isn't quite like Perl (it has no non-greedy quantifiers), here's a sed example that shows the behavior I discuss.
$ sed 's|^\([[:alnum:]]\+\).*/\([^/]\+\)$|\2 \1|' DCIM.md5
This is basically a "direct translation" of the perl expression except for the extra '/' before the [^/] stuff. I hope it will help those comparing sed and perl.
Use [^/\n] instead of [^/]:
perl -pe 's|^([[:alnum:]]+).*?([^/\n]+)$|$2 $1|' DCIM.md5
Doing a substitution leaves you having to write a regex pattern that matches everything you don't want as well as everything you do. It's usually much better to match just the parts you need and build another string from them.
Like this:
for ( <> ) {
die unless m< (\w++) .*? ([^/\s]+) \s* \z >x;
print "$2 $1\n";
}
or if you must have a one-liner
perl -ne 'die unless m< (\w++) .*? ([^/\s]+) \s*\z >x; print "$2 $1\n";' myfile.md5
output
IMG_20150201_160548.jpg e26ff03dc1bac80226e200c0c63d17a2
IMG_20150204_190528.jpg 01f92572e4c6f2ea42bd904497e4f939
IMG_20151011_193008.jpg afce027c977944188b4f97c5dd1bd101
I have a simple regular expression that creates a group match for any semicolon contained within double quotes. I'm trying to use sed on Mac OS X to replace the semicolon with 'SEMICOLON'.
However, it's not working.
Here's the command I used:
sed -i.bu "s|.*?(;).*?|SEMICOLON|g" output/html/index.html
The result is that nothing is matched and nothing is replaced.
Desired behavior:
Input
"The man sat; the man cried;" cats; dogs;
Output
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
UPDATE:
Thanks for your help everyone. So my example wasn't very good. In reality, I process a JavaScript file that's been condensed to one line, and make sure each JavaScript statement has its own line. The problem is that the JavaScript is mostly translated text, so trying to make a simple regex that would insert a newline after each ; was difficult, because I obviously don't want a newline added if the semicolon is in quotes.
Long story short... I realized I was trying to reinvent the wheel, and decided to use js-beautify to pretty print the file. It's doing a little more than I need... but it's the best solution for now.
Thanks again!
Let's take this as a test file:
$ cat file
"The man sat; the man cried;" cats; dogs;
1; 2; "man;"; 3; ";dog";
Try this sed command:
$ sed -E ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
1; 2; "manSEMICOLON"; 3; "SEMICOLONdog";
How it works:
:a
This creates a label a that we can refer to later.
s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/
This replaces the last ; that is inside double-quotes with SEMICOLON. Let's look at ^(([^"]*"[^"]*")*[^"]*"[^"]*); in more detail:
^ matches at the beginning of a string.
([^"]*"[^"]*")* matches from the beginning of the line through any number of complete quoted strings.
Because, in sed, regular expressions are greedy (more precisely, leftmost-longest), this will try to match as many complete quoted strings as it can.
[^"]*"[^"]*; matches any non-quotes that follow the complete quoted strings (as above), followed by the next quote character, followed by any number of non-quote characters, followed by ;.
Since the above regex minus the final ; is itself inside parens, it is saved as group 1. We replace the matched text with group 1 followed by SEMICOLON.
ta
If the last command resulted in a substitution (in other words, we found a ; that needed to be replaced), then jump back to label a and repeat.
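The loop described above can be tried on the second test line. This sketch passes each command as its own -e expression so that the label is not followed by a semicolon; that is a portability choice of the sketch, not something the answer requires, and the function name is illustrative.

```shell
#!/usr/bin/env bash
# The label / substitute / branch loop, one -e per command.
# quote_semicolons is an illustrative name.
quote_semicolons() {
  sed -E -e ':a' \
         -e 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/' \
         -e 'ta'
}

# First pass replaces the ; in ";dog", second pass the ; in "man;",
# third pass finds nothing and the loop ends.
result=$(printf '%s\n' '1; 2; "man;"; 3; ";dog";' | quote_semicolons)
printf '%s\n' "$result"
```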
Discussion
Let's consider:
sed "s|.*?(;).*?|SEMICOLON|g"
In Python and elsewhere, .*? is a non-greedy match. Sed, however, has no such concept. For that matter, by default, sed uses Basic Regular Expressions (BRE) in which ? just means a literal question mark.
Also, it is asking for trouble to put sed commands in double-quotes as this invites the shell to modify it.
So, since BRE are obsolete, let's (1) switch to Extended Regular Expressions (ERE) using the -E switch, (2) put the command in single-quotes, and (3) change .*? to .*:
$ sed -E 's|.*(;).*|SEMICOLON|g' file
SEMICOLON
(Compatibility note: if you are on a very old Linux system, you may need to replace -E with -r.)
.*(;).* matches everything up to the last semicolon on the line, followed by the semicolon, followed by whatever follows the last semicolon. In other words, if the line contains a semicolon, .*(;).* matches the whole line. That is why the output is just SEMICOLON.
Also, (;) matches a semicolon and saves it in group 1. Since we never use group 1 anywhere, this does nothing for us. We would get the same result with:
$ sed -E 's|.*;.*|SEMICOLON|g' file
SEMICOLON
If we remove the .*, then every ; will be replaced:
$ sed -E 's|;|SEMICOLON|g' file
"The man satSEMICOLON the man criedSEMICOLON" catsSEMICOLON dogsSEMICOLON
If we want to replace the last ; in the first quoted string, we could use:
$ sed -E 's|^([^"]*"[^"]*);|\1SEMICOLON|g' file
"The man sat; the man criedSEMICOLON" cats; dogs;
If we want to replace all ; that are within any quoted string on the line, then we are back to the command at the top.
Strings spanning across lines
Let's consider a test file with a string spanning 2 lines:
$ cat file2
"man;" cat "dog
;"; ";man";
If you have GNU sed:
$ sed -Ez ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
In general for any POSIX sed:
$ sed -E 'H;1h;$!d;x; :a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
sed is for simple s/old/new, that is all. With any awk:
$ awk 'match($0,/"[^"]+"/) {
str = substr($0,RSTART,RLENGTH)
gsub(/;/,"SEMICOLON",str)
$0 = substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)
} 1' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
That's assuming you actually want all semicolons in the quoted string treated the same way. If not, whatever it is you want to do is an easy tweak, e.g. if you want that last semicolon after cried removed instead of replaced as shown in your sample output:
$ awk 'match($0,/"[^"]+"/) {
str = substr($0,RSTART+1,RLENGTH-2)
sub(/;$/,"",str)
gsub(/;/,"SEMICOLON",str)
$0 = substr($0,1,RSTART) str substr($0,RSTART+RLENGTH-1)
} 1' file
"The man satSEMICOLON the man cried" cats; dogs;
I need to use egrep to obtain an entry in an index file.
In order to find the entry, I use the following command:
egrep "^$var_name" index
$var_name is the variable read from a var list file:
while read var_name; do
egrep "^$var_name" index
done < list
One of the possible keys comes usually in this format:
$ERROR['SOME_VAR']
My index file is in the form:
$ERROR['SOME_VAR'] --> n
Where n is the line where the variable is found.
The problem is that $var_name is automatically escaped when read. When I enable the debug mode, I get the following command being executed:
+ egrep '^$ERRORS['\''SELECT_COUNTRY'\'']' index
The command above doesn't work, because egrep will try to interpret the pattern.
If I don't use the extended version, using grep or fgrep, the command will work only if I remove the ^ anchor:
grep -F "$var_name" index # this actually works
The problem is that I need to ensure that the match is made at the beginning of the line.
Ideas?
set -x shows the command being executed in shell notation.
The backslashes you see do not become part of the argument, they're just printed by set -x to show the executed command in a copypastable format.
Your problem is not too much escaping, but too little: $ in regex means "end of line", so ^$ERROR will never match anything. Similarly, [ ] delimits a bracket expression (a set of characters), and will not match literal square brackets.
The correct regex to match your pattern would be ^\$ERROR\['SOME_VAR'], equivalent to the shell argument in egrep "^\\\$ERROR\['SOME_VAR']".
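A short sketch comparing the two patterns against one line shaped like the index file. The line number 12 and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# One line shaped like the index file from the question (the 12 is made up).
index_line="\$ERROR['SOME_VAR'] --> 12"

# Unescaped: "$" is an end-of-line anchor and [...] is a bracket expression,
# so nothing matches. grep -c prints the count; "|| true" ignores its exit status.
unescaped=$(printf '%s\n' "$index_line" | grep -Ec "^\$ERROR['SOME_VAR']" || true)

# Escaped: \$ and \[ are literal characters, so the line matches.
escaped=$(printf '%s\n' "$index_line" | grep -Ec "^\\\$ERROR\['SOME_VAR']" || true)

echo "unescaped=$unescaped escaped=$escaped"
```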
Your options to fix this are:
If you expect to be able to use regex in your input file, you need to include regex escapes like above, so that your patterns are valid.
If you expect to be able to use arbitrary, literal strings, use a tool that can match flexibly and literally. This requires jumping through some hoops, since UNIX tools for legacy reasons are very sloppy.
Here's one with awk:
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
It passes the string in through the environment (because -v would interpret backslash escape sequences in the value) and then matches it literally against the start of each input line.
Here's an example invocation:
$ cat script
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
$ cat list
$ERRORS['SOME_VAR']
\E and \Q
'"'%##%*'
$ cat index
hello world
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
etc
$ bash script
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
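If an external tool is not required, a literal prefix match can also be sketched in pure Bash. This is an alternative to the awk script, not part of it: inside [[ ... == ... ]] a quoted expansion is compared literally, and the unquoted * after it matches the rest of the line. The function name and the temporary file are illustrative.

```shell
#!/usr/bin/env bash
# print_prefixed KEY FILE: print every line of FILE that starts with the literal KEY.
# (Illustrative helper; not from the original answer.)
print_prefixed() {
  local key=$1 line
  while IFS= read -r line; do
    # "$key" is quoted, so $, [, ], \ and * inside it are not pattern characters.
    if [[ $line == "$key"* ]]; then
      printf '%s\n' "$line"
    fi
  done < "$2"
}

# A throwaway index file with some of the lines from the example above.
index_file=$(mktemp)
printf '%s\n' 'hello world' "\$ERRORS['SOME_VAR'] = 'foo';" \
  '\E and \Q are valid strings' > "$index_file"

found=$(print_prefixed "\$ERRORS['SOME_VAR']" "$index_file")
found_backslash=$(print_prefixed '\E and \Q' "$index_file")
rm -f "$index_file"

printf '%s\n' "$found" "$found_backslash"
```

This reads the index once per key, so for a large index the awk version is likely the better fit.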
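You can use printf "%q", with the caveat that %q applies shell quoting rather than regex quoting, so it is only reliable when the two happen to agree for your keys: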
while read -r var_name; do
egrep "^$(printf "%q\n" "$var_name")" index
done < list
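Update: You can also do the following, if your grep supports Perl-compatible regular expressions (-P, as in GNU grep):
while read -r var_name; do
grep -P "^\Q$var_name\E" index
done < list
Here \Q and \E make the text between them a literal string, removing the special meaning of all regex symbols. This is PCRE syntax, which egrep does not understand, and it breaks if the key itself contains \E.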
This code checks whether a character is an integer or not (I think). I'm trying to understand what each part of that line means by checking the grep man pages, but it's really difficult for me. I found it on the internet. Could anyone explain the grep part, and what each piece of it means?
echo $character | grep -Eq '^(\+|-)?[0-9]+$'
Thanks people!!!
Analyse this regex:
'^(\+|-)?[0-9]+$'
^ - Line Start
(\+|-)? - Optional + or - sign at start
[0-9]+ - One or more digits
$ - Line End
Overall it matches strings like +123 or -98765 or just 9.
Here -E enables extended regular expressions and -q makes grep quiet: it prints nothing and reports the result only through its exit status.
PS: btw you don't need grep for this check and can do this directly in pure bash:
re='^(\+|-)?[0-9]+$'
[[ "$character" =~ $re ]] && echo "its an integer"
I like this cheat sheet for regex:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
It is very useful; with it you can easily analyze
'^(\+|-)?[0-9]+$'
as
^: Line must begin with...
(): grouping
\: escape character (because + means something ... see below)
+|-: plus OR minus signs
?: 0 or 1 repetition
[0-9]: range of digits from 0-9
+: one or more repetitions
$: end of line (no more characters allowed)
so it accepts input like: -312353243 or +1243 or 5678
but does not accept: 3 456 or 6.789 or 56$ (with a literal dollar sign).
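The accepted and rejected inputs listed above can be checked with a small Bash function built on the same regular expression. The function name is_integer is illustrative.

```shell
#!/usr/bin/env bash
# is_integer VALUE: succeed if VALUE is an optionally signed run of digits.
# (Illustrative name; the regex is the one discussed above.)
is_integer() {
  local re='^(\+|-)?[0-9]+$'
  [[ $1 =~ $re ]]
}

# Try the examples from the list above.
for value in -312353243 +1243 5678 '3 456' 6.789 '56$'; do
  if is_integer "$value"; then
    echo "accept: $value"
  else
    echo "reject: $value"
  fi
done
```

Keeping the regex in a variable and leaving $re unquoted on the right of =~ matters: quoting it would make Bash compare it as a literal string.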
Can anyone explain to me how the regular expression works in the sed substitute command?
$ cat path.txt
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/sbin:/sbin:/bin/:/usr/sbin:/usr/bin:/opt/omni/bin:
/opt/omni/lbin:/opt/omni/sbin:/root/bin
$ sed 's/\(\/[^:]*\).**/\1/g' path.txt
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
The sed command above uses a back-reference and the save operator.
Can anyone explain to me how the regular expression, especially /[^:]*, works in the substitute command to get only the first path in each line?
I think you wrote an extra asterisk * in your sed code, so it should be like this:
$ sed 's/\(\/[^:]*\).*/\1/g' file
/usr/kbos/bin
/usr/local/sbin
/opt/omni/lbin
Changing the delimiter makes it a little easier to read:
sed 's#\(/[^:]*\).*#\1#g'
The s#something#otherthing#g is a basic sed command that looks for something and replaces it with otherthing, for every match on each line.
If you do s#\(something\)#\1#g then you "save" that something and can print it back with \1.
Hence, what it does is match a pattern like /[^:]* and print it back. /[^:]* means a / followed by any number of characters other than :. So it takes / plus everything up to the first colon :. It stores that piece of the string and then prints it back.
Small examples:
# get the leading run of lowercase letters
$ echo "hello123bye" | sed 's#\([a-z]*\).*#\1#g'
hello
# get everything until it finds the number 3
$ echo "hello123bye" | sed 's#\([^3]*\).*#\1#g'
hello12
[^:]*
in regex would match all characters except for :, so it would match until this:
/usr/kbos/bin
also it would match these,
/usr/local/bin
/usr/jbin
/usr/bin
/usr/sas/bin
since none of these contain a :.
.* matches any character, zero or more times.
Thus the regex [^:]*.* would match all of these expressions:
/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin
/usr/jbin:/usr/bin:/usr/sas/bin
/usr/bin:/usr/sas/bin
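However, you get only the first field (i.e., /usr/kbos/bin) because the match starts at the leftmost possible position: \(/[^:]*\) captures from the first / up to the first :, .* then consumes the rest of the line, and the whole match is replaced by the captured group \1.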
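To see which part of the line each piece of the pattern takes, this sketch adds a second group around .* and prints both groups in brackets. The second group and the variable names are additions for illustration only.

```shell
#!/usr/bin/env bash
line='/usr/kbos/bin:/usr/local/bin:/usr/jbin:/usr/bin:/usr/sas/bin'

# Group 1 is what \(/[^:]*\) captures; group 2 is what .* consumes afterwards.
split=$(printf '%s\n' "$line" | sed 's#\(/[^:]*\)\(.*\)#[\1] [\2]#')

# Keeping only group 1 gives the first path, as in the corrected command.
first=$(printf '%s\n' "$line" | sed 's#\(/[^:]*\).*#\1#')

printf '%s\n' "$split" "$first"
```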