Why does this regex not work with grep?

I have a text file like this:
"an arbitrary string" = "this is the text one"
"other arbitrary string" = "second text"
"a third arbitrary string" = "the text number three"
I want to obtain only this
an arbitrary string
other arbitrary string
a third arbitrary string
That is, the text inside the first pair of quotes, i.e. between the first " and the following " =. I used this regex:
(?!").*(?=(" =))
This worked when I tried it in RegExr and in this online tool, but in my OS X Terminal it does not work: the output is empty.
grep -o '(?!").*(?=(" =))' input.txt
What is wrong here? Do I have to escape some characters? I tried escaping each of them and nothing changed.
Thank you so much, and please excuse my lack of knowledge about this topic.

Lookaheads and lookbehinds are PCRE features, so you have to use the -P option (available in GNU grep; the BSD grep shipped with recent versions of OS X does not support it):
grep -Po '(?!").*(?=(" =))' input.txt

This should do:
awk -F\" '{print $2}' file
It uses " as separators, and then print second field.

steffen's answer is right: you have to use the -P flag. But there is also a problem with your regex.
Imagine this input:
"an arbitrary string" = " =this is the text one"
Your regex will fail dramatically.
To solve this you have to use something like this:
grep -Po '^"\K.*?(?=(" =))'
^ to prevent other matches that do not begin from the line start.
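\K discards everything matched so far, so the opening quote is left out of the output; it is easier to read than a lookbehind. (Unlike a lookbehind, it also works when the text before it has arbitrary length.)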
.*? to make it non-greedy.
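Where grep -P is unavailable, the same extraction can be sketched with sed and a negated character class instead of lookarounds. This swaps in a different technique from the answers above, and it assumes the first quoted string contains no embedded double quotes; first_quoted is a hypothetical helper name.

```shell
# Hypothetical helper: print the text inside the first pair of quotes on each
# line of the form "key" = "value". Assumes the key itself contains no ".
first_quoted() {
  sed -n 's/^"\([^"]*\)" =.*/\1/p'
}

# The second line is the tricky case: its value also contains the text " =.
printf '%s\n' \
  '"other arbitrary string" = "second text"' \
  '"an arbitrary string" = " =this is the text one"' |
  first_quoted
```

Because [^"]* cannot run past the first closing quote, the extra " = inside the value does not affect the result.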

Bash Script for Concatenating Broken Dashed Words

I've scraped a large amount (10GB) of PDFs and converted them to text files, but due to the format of the original PDFs, there is an issue:
Many of the words which break across lines have a dash in them that artificially breaks up the word.
This happened because the original PDF files have line breaks at those points.
What would be the cleanest and fastest way to "join" every word instance that matches this pattern inside of a .txt file?
Perhaps some sort of Regex search, like for a [a-z]\-\s \w of some kind (word character followed by dash followed by space) would work?
Or would some sort of sed replacement work better?
Currently, I'm trying to get a sed regex to work, but I'm not sure how to translate this to use capture groups to replace the selected text:
sed -n '\%\w\- [a-z]%p' Filename.txt
My input text would look like this:
The dog rolled down the st- eep hill and pl- ayed outside.
And the output would be:
The dog rolled down the steep hill and played outside.
Ideally, the expression would also work for words split up by a newline, like this:
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
To this:
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
It's straightforward in sed:
sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename
This translates roughly as: if the line ends with a dash, read in the next line as well (so that you have a line with a newline in the middle), then excise the dash and the newline, and loop back to the beginning in case this new line also ends with a dash. Then remove any instances of "- ".
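The command can be tried on the sample text from the question. This is a sketch that wraps the same sed script in a hypothetical shell function named dehyphenate. Note that a line joined this way absorbs the whole following line, not just the rest of the broken word.

```shell
# Hypothetical wrapper around the sed script above; reads stdin, writes stdout.
dehyphenate() {
  sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g'
}

# Sample input from the question: one in-line break and one end-of-line break.
printf '%s\n' \
  'The dog rolled down the st- eep hill and pl- ayed outside.' \
  'The rule which provided for the consid-' \
  'eration of the resolution, was agreed to earlier by a' |
  dehyphenate
```

The second and third input lines come out as a single line beginning "The rule which provided for the consideration of the resolution".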
You may use this gnu-awk code:
cat file
The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
Then use awk like this:
awk 'p != "" {
w = $1
$1 = ""
sub(/^[[:blank:]]+/, ORS)
$0 = p w $0
p = ""
}
{
$0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
p = $0
sub(/-$/, "", p)
}
p == ""' file
The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
If you can consider perl, then this may also work for you:
perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
You simply add backslash-parentheses (or use the -r or -E option if available to do away with the requirement to put backslashes before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, etc.
sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt
The \w escape is not standard sed but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_#] or whatever else you want to call "word characters".
I'm guessing not all of the matches will be hyphenated words so perhaps run the result through a spelling checker or something to verify whether the result is an English word. (I would probably switch to a more capable scripting language like Python for that, though.)

Highlight all keys that look like '&name=' in a text with grep console [duplicate]

I want to grep the shortest match and the pattern should be something like:
<car ... model=BMW ...>
...
...
...
</car>
... means any character and the input is multiple lines.
You're looking for a non-greedy (or lazy) match. To get a non-greedy match in regular expressions you need to use the modifier ? after the quantifier. For example you can change .* to .*?.
By default grep doesn't support non-greedy modifiers, but you can use grep -P to use the Perl syntax.
Actually, .*? only works with Perl syntax. I am not sure what the equivalent grep extended regexp syntax would be. Fortunately you can use Perl syntax with grep, so grep -P would work, but grep -E, which is the same as egrep, would not (it would be greedy).
See also: http://blog.vinceliu.com/2008/02/non-greedy-regular-expression-matching.html
grep
For non-greedy match in grep you could use a negated character class. In other words, try to avoid wildcards.
For example, to fetch all links to jpeg files from the page content, you'd use:
grep -o '"[^" ]\+\.jpg"'
To deal with multiple lines, pipe the input through xargs first. For performance, use ripgrep.
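The difference between a greedy wildcard and a negated character class can be seen on a single line that contains two links. A minimal sketch; the sample HTML is made up for illustration, and the pattern uses * rather than \+ so that it stays within basic regular expression syntax.

```shell
# Made-up sample line with two links.
html='<img src="a.jpg"> <img src="b.jpg">'

# Greedy wildcard: .* runs to the last .jpg" on the line, so both links
# come back fused into a single match.
printf '%s\n' "$html" | grep -o '".*\.jpg"'

# Negated class: [^" ]* cannot cross a quote or a space, so each link
# is matched on its own.
printf '%s\n' "$html" | grep -o '"[^" ]*\.jpg"'
```

The first command prints one long match; the second prints "a.jpg" and "b.jpg" on separate lines.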
My grep that works after trying out stuff in this thread:
echo "hi how are you " | grep -shoP ".*? "
Just make sure you append a space to each one of your lines
(Mine was a line by line search to spit out words)
Sorry I am 9 years late, but this might work for the viewers in 2020.
So suppose you have a line like "Hello my name is Jello".
Now you want to find the words that start with 'H' and end with 'o', with any number of characters in between. And we don't want whole lines, just the words. For that we can use the expression (with -o, so that only the matching part is printed):
grep -o "H[^ ]*o" file
This will return all the matching words. The way this works is that it allows any character except a space in between, so a match cannot span multiple words on the same line.
Now you can replace the space character with any other character you want.
Suppose the initial line was "Hello-my-name-is-Jello"; then you can get the words using the expression:
grep -o "H[^-]*o" file
The short answer is using the next regular expression:
(?s)<car .*? model=BMW .*?>.*?</car>
(?s) - makes . match newlines too, so the match can span multiple lines
.*? - matches any character, any number of times, lazily (minimal match)
A (little) more complicated answer is:
(?s)<([a-z\-_0-9]+?) .*? model=BMW .*?>.*?</\1>
This makes it possible to match car1 and car2 in the following text:
<car1 ... model=BMW ...>
...
...
...
</car1>
<car2 ... model=BMW ...>
...
...
...
</car2>
(..) represents a capturing group
\1 in this context matches the same text as most recently matched by capturing group number 1
I know that it's a bit of a dead post, but I just noticed that this works. It removed both clean-up and cleanup from my output.
> grep -v -e 'clean\-\?up'
> grep --version
grep (GNU grep) 2.20

Grep/Sed between two tags with multiline

I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
As a result I need to extract this text: from the first file, "This info i need grep", and from the second file, "This info= i need grep too".
For the first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep", but I cannot get the information from the second file using the same command.
Please help me rewrite the command, or suggest a different one.
Or, if you insist on using grep, you can:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? to match any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since the strings of interest are first converted back to their original single-line representations:
The matches are printed in their original form.
You can use regular (GNU) grep with line-by-line matching; contrast this with:
needing to read the entire file at once, as in Maroun Maroun's helpful answer. Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
needing to install another utility, pcregrep, as in Wintermute's helpful answer. Additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
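The first pass can be tried in isolation. This sketch drops exactly one trailing character (the =) before joining a line with the next one; join_soft_breaks is a hypothetical helper name. So that it runs without PCRE support, the extraction step uses a sed substitution in the style of the one in the question rather than grep -P.

```shell
# Hypothetical helper: undo quoted-printable soft line breaks by dropping the
# trailing "=" and joining the line with the one that follows it.
join_soft_breaks() {
  awk '/=$/ { printf "%s", substr($0, 1, length($0) - 1); next } 1'
}

# Second sample file from the question, fed in on stdin.
printf '%s\n' 'test This info=' 'i need grep too</singleline>' |
  join_soft_breaks |
  sed -n 's/.*test \(.*\)<\/singleline>.*/\1/p'
```

After the join, the text sits on one line, so an ordinary line-by-line match is enough.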

GREP expression

I need help with a GREP expression to find and replace a variable group of words.
The sentence always starts with the same two words (Bold italicized) and always ends with a (colon), but the bit in the middle varies.
So I need to search for:
Bold italicized then any string of words then :
ie. starts with "Bold italicized", then any group of words, ends with ":"
For example:
Bold italicized May 6, 2010:
I will then apply some formatting to that text.
Thank you.
The right tool to do this is not grep but sed:
EXAMPLE in a shell :
$ cat file.txt
Bold italicized foo bar:
Bold italicized qux:
$ sed 's/^Bold italicized\(.*\):/do something with "\1"/g' file.txt
do something with " foo bar"
do something with " qux"
$
NOTE
you will find tons of examples and documentation here or here
the basic sed substitution command is s/regex/substitution/modifier
the regex here uses ^, which means beginning of line, and \( \) to make a capture group
This should do it, although this is a pretty simple one, so it seems like you should have been able to come up with this yourself, even as a beginner.
^Bold italicized.+?:
If you want to learn a little bit more about how to use GREP, I would recommend the InDesign GREP reference.

Substitute words not in double quotes

$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want unix sed command such that only basic that is not in quotes should be changed.[change basic to ring]
Expected output:
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because we matched all of that stuff in the line before basic, we would replace that as well. That's why this part is wrapped in another pair of parentheses, thus capturing the start of the line and writing it back in the replacement with \1 followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A solution would be a lookbehind, but it would have to be a variable-length lookbehind, which few regex engines (such as .NET's) support, and sed supports no lookbehind at all. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
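Running the command repeatedly can be folded into sed itself. This is a sketch, not part of the original answer. It loops with a label and the t command, which branches back whenever the preceding substitution succeeded. It uses -E instead of -r and spells the even repetition as (("[^"]*){2})*. The name replace_unquoted is hypothetical, and the approach still assumes there are no escaped quotes.

```shell
# Hypothetical helper: replace every unquoted "basic" with "ring".
# Assumes no escaped quotes inside the quoted strings.
replace_unquoted() {
  sed -E -e ':a' \
    -e 's/^([^"]*(("[^"]*){2})*)basic/\1ring/' \
    -e 'ta'
}

# First line has two unquoted occurrences around a quoted one;
# second line has only a quoted occurrence and must stay untouched.
printf '%s\n' 'basic "basic" basic' '"basic/strong/bold"' | replace_unquoted
```

Each pass replaces one unquoted occurrence, and the loop ends once no substitution is made.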
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you wanna edit file in place use --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in the strings (e.g. "This is a trap \" hehe"), awk might be able to solve this problem:
awk -F\" 'BEGIN {OFS=FS}
{
for(i=1; i<=NF; i++){
if(i%2)
gsub(/basic/,"ring",$i)
}
print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0