How to search and replace this string with sed? - regex

I'm desperately trying to search the following:
<texit info> author=MySelf title=MyTitle </texit>
and replace it with blank.
What I've tried so far is the following:
sed –I '1,5s/<texit//;s/info>//;s/author=MySelf//;s/title=MyTitle//' test.txt
But it doesn't work.

Don't edit XML with sed -- the right tool would be something like XMLStarlet, with a line like the following:
xmlstarlet ed -u '//texit[@info]' -v 'author=NewAuthor title=NewTitle'
...if your goal were to update the text within the tag.
Regular expressions are not expressive enough to handle XML correctly (even formally: regular expressions suffice to parse regular languages, and XML is not one). For instance, your original would be just as valid written with newlines, as:
<texit
info >author=MySelf title=MyTitle</texit>
...and writing a sed command to handle that case would not be fun. XML-native tools, on the other hand, can correctly handle all of XML's corner cases.
That said, the sed expression you gave does indeed "work", inasmuch as it does exactly what it's written to do.
sed -e '1,5s/<texit//;s/info>//;s/author=MySelf//;s/title=MyTitle//' \
<<<"<texit info>author=MySelf title=MyTitle foo bar</texit>"
returns the output
foo bar</texit>
which is exactly what it should do, as it's removing the <texit string, the info> string, the author=MySelf, title=MyTitle, but leaving the closing </texit> and any excess text, just as you asked. If you expect or desire it to do something different, you should explain what that is.

sed 's/<texit\s\+info>\s*author=MySelf\s\+title=MyTitle\s*<\/texit>//g' test.txt
You should generally not edit XML with a regex, but if you only want to strip these tags, the above will work. You don't need multiple s commands, just use a single pattern with correctly defined whitespace.
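For comparison, here is a minimal Python sketch of the same single-pattern idea. It is a translation of the sed expression above, not a replacement for an XML tool; the names TEXIT_BLOCK and strip_texit and the sample strings are made up for illustration. One difference worth noting: Python's \s also matches newlines, so this version copes with the tag being split across lines, which line-oriented sed would not do without extra work.

```python
import re

# Assumed pattern: a direct translation of the single sed expression above.
TEXIT_BLOCK = re.compile(
    r"<texit\s+info>\s*author=MySelf\s+title=MyTitle\s*</texit>"
)

def strip_texit(text):
    """Remove every complete <texit info>...</texit> block from text."""
    return TEXIT_BLOCK.sub("", text)

# Hypothetical sample inputs: one on a single line, one split across lines.
one_line = "before <texit info> author=MySelf title=MyTitle </texit> after"
multi_line = "before <texit\n  info>author=MySelf title=MyTitle</texit> after"

print(repr(strip_texit(one_line)))
print(repr(strip_texit(multi_line)))
```

Anything that deviates from the exact attribute text (extra attributes, different order) is left untouched, which is the usual limit of the regex approach.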

Related

Bash - Regex for HTML contents

I'm learning about Bash scripting, and need some help understanding regex's.
I have a variable that is basically the html of a webpage (exported using wget):
currentURL="https://www.example.com"
currentPage=$(wget -q -O - "$currentURL")
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
I started with this, but I need to modify the regex:
Test string (this is what currentPage contains, there can be zero to many instances of this):
<img src="./download/file.php?id=123456&t=1">
Current Regex:
.\/download\/file.php\?id=[0-9]{6}\&mode=view
Here's the regex I created, but it doesn't seem to work in bash.
The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.
Don't parse XML/HTML with regex, use a proper XML/HTML parser.
theory :
According to formal language (compiler) theory, HTML can't be parsed with regular expressions, which are based on finite state machines. Because HTML is hierarchical, you need a pushdown automaton, for example an LALR grammar handled by a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
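If you would rather stay in a high-level language, here is a rough sketch of the parser approach using only Python's standard library (html.parser and urllib.parse) instead of lxml. The class name PhotoIdCollector and the helper photo_ids are invented for this example, and it assumes the IDs live in the src attribute of img tags, as in the test string from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class PhotoIdCollector(HTMLParser):
    """Collect the id query parameter from <img src=".../download/file.php?...">."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        parsed = urlparse(src)
        # Assumption: photo links always end in /download/file.php
        if parsed.path.endswith("/download/file.php"):
            self.ids.extend(parse_qs(parsed.query).get("id", []))


def photo_ids(html):
    collector = PhotoIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Test string taken from the question.
sample = '<img src="./download/file.php?id=123456&t=1">'
print(photo_ids(sample))
```

The parser takes care of attribute quoting and ordering, so the only pattern-like logic left is the path check.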
Let's first clarify a couple of misunderstandings.
I'm learning about Bash scripting, and need some help understanding regex's.
You seem to be implying some sort of relation between Bash and regex.
As if Bash was some sort of regex engine.
It isn't. The [[ builtin is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
This suggests an underlying assumption that if you want to extract content from an HTML, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer),
and reaching for regex is not a good reflex,
for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
Some extra tricks in the above regex are good to explain:
The -P flag of grep enables Perl style (powerful) regular expressions
\K is a Perl specific symbol, it means to not include in the match the content before the \K
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead trickery is to work with grep -o, which outputs only the matched part. But without these trickeries the matched part would be for example ./download/file.php?id=123456&mode=view, which is more than what you want.
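The same idea can be sketched in Python, where a capture group plays the role of \K: findall returns only the captured digits. This is an illustration rather than a drop-in for the grep command; ID_PATTERN and the sample markup (which uses the &mode=view form from the regex, not the &t=1 form from the test string) are assumptions.

```python
import re

# Escaped '.' and '?', unescaped '/' and '&', as explained above.
# The capture group stands in for \K; the lookahead is kept as-is.
ID_PATTERN = re.compile(r"\./download/file\.php\?id=(\d+)(?=&mode=view)")

# Hypothetical page fragment with two linked photos.
html = (
    '<img src="./download/file.php?id=123456&mode=view"> '
    '<img src="./download/file.php?id=654321&mode=view">'
)

ids = ID_PATTERN.findall(html)          # only the captured digits
first_full = ID_PATTERN.search(html).group(0)  # whole match, lookahead excluded

print(ids)
print(first_full)
```

Note that the whole match still includes the path prefix; only the group isolates the ID, which is why grep needs \K to get the same effect with -o.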

UNIX: How would I grep in a script using a variable as a search parameter for a file?

Before I Start, this isn't exactly how it seems and I did search the web for a while before coming here. Basically I have a script where the user passes in a string and stores it in a variable. I then have to take that word and search for all the subwords that could be made from it in a dictionary file. The problem I am having is I need to make sure the words are at least 4 characters long. I do not have the best grasp on regular expressions. I'm aware of the techniques you can use just logically can't piece it together sometimes. I will show you the line of code and explain my reasoning behind why I think it should be this way. Then, could someone correct me on my logic? I am not looking for someone to send me the working line of code but perhaps correct my logic so I can understand better and derive the answer on my own.
words=$(grep -iE '(["$text"]{4,})' /usr/dict/words)
echo "$words"
For example if I pass in string college I should get output like
cell
cello
clee
cleg
etc.....
I am storing the command in another variable to echo. I am not sure why exactly, It just seems from what I saw online most people were rather fond of this. Using grep with -i for ignore case and -E for regular expression or (egrep) I believe the expression needs to be enclosed in single quote parenthesis for expressions. $text is the variable I stored the users input in. I know $ usually signifies the ending in and [] is a range and "" makes it read the variable rather than print what is there. Then {4,} meaning four or more characters. then the last part is the path to the file. Any input would be appreciated and again, I do not like being spoon fed answers it's an easy way to learn nothing. I would just like corrections on my logic if all possible. Thanks everyone!!
If by "subwords" you mean permutations of its letters, then your command is fine except for the quotes. Unfortunately you have to do it like this:
words=$(grep -iE '(['"$text"']{4,})' /usr/dict/words)
This way you pass to grep the single quoted string so that the shell doesn't interpret its special symbols. But at the same time you have to expand your $text var, thus you have to make a gap inside your single-quoted string, and in that gap place your variable in double quotes.
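To see what the expanded pattern actually does, here is a small Python sketch. It rebuilds the same character class from the input word and compares three checks; is_subword is a hypothetical helper, not something from the question, and it rests on the assumption that a subword may not use a letter more often than the input word contains it.

```python
import re
from collections import Counter

text = "college"  # stands in for the user's input

# What grep receives once the shell has expanded "$text": [college]{4,}
letters = re.compile("[" + re.escape(text) + "]{4,}", re.IGNORECASE)


def is_subword(candidate, source):
    """Hypothetical stricter check: 4+ letters, each used no more often than in source."""
    need = Counter(candidate.lower())
    have = Counter(source.lower())
    return len(candidate) >= 4 and all(have[c] >= n for c, n in need.items())


for word in ["cell", "cello", "excellent", "eeee"]:
    print(
        word,
        bool(letters.search(word)),      # unanchored, like the grep command
        bool(letters.fullmatch(word)),   # whole word must come from the class
        is_subword(word, text),          # also respects letter counts
    )
```

The unanchored search accepts any word that merely contains a run of four such letters, so anchoring the pattern (or a count-based check) may be closer to what "subwords" is meant to be.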
Hope I didn't spoil it for you.

sed regexp matching in a long line

I have an XML file from which I wish to extract all occurrences of some tag AB. The file is one long line with ~500 000 chars.
Now I do know about regexp and such, but when I try it with sed and try to extract only the characters within the tags I am totally lost regarding the result :).
Here's my command:
sed -r 's/(.*)<my_tag>([A-Z][A-Z])<\/my_tag>(.*)/hello\2/g' myfile.out
This replaces the entire file with just "helloAB", for example, while I expected the output to contain at least 100 matches.
So I'm thinking around the concepts of greedy matching and such but not getting anywhere. Maybe awk is a better idea?
If you have python (2.6+), this should be fairly trivial:
import xml.dom.minidom as MD

# Parse the whole document, then print every <AB> element
tree = MD.parse("yourfile.xml")
for e in tree.getElementsByTagName("AB"):
    print(e.toprettyxml())
In general, trying to parse XML by hand should be avoided as there are much simpler solutions like these. Not to mention, these kinds of libraries will give you easy access to attributes and values without further parsing.
Thank you for your answers.
I tried @MannyD's suggestion and unfortunately the XML didn't seem to be well formed, thus the parsing failed. Since I cannot count on getting only well-formed XML, I made a grep solution, which does the job.
grep -o "<my_tag>[A-Z][A-Z]</my_tag>" myfile.out | sort -u
The -o option flag will print each match on a new line, from there I just sort and print the unique matches from the file.
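For anyone wondering why the original sed attempt collapsed the file to a single "helloAB", here is a short Python sketch that reproduces both behaviours on a made-up one-line input. The sample data is invented; the second half mirrors the grep -o | sort -u pipeline, except that it captures only the two letters rather than the whole tag.

```python
import re

# Hypothetical one-line input with three tagged codes, one of them repeated.
data = "<r><my_tag>AB</my_tag><x/><my_tag>CD</my_tag><my_tag>AB</my_tag></r>"

# Greedy: the leading (.*) swallows everything up to the LAST tag, so the
# whole line is consumed by one match and replaced once.
greedy = re.sub(r"(.*)<my_tag>([A-Z][A-Z])</my_tag>(.*)", r"hello\2", data)

# Collect every occurrence, then de-duplicate and sort (like grep -o | sort -u).
codes = sorted(set(re.findall(r"<my_tag>([A-Z][A-Z])</my_tag>", data)))

print(greedy)
print(codes)
```

Dropping the surrounding (.*) groups is what lets the g flag (or findall here) see each tag as a separate match.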

Is it possible to add characters to a string as part of a regular expression (regex)

I use an application to find specific text patterns in free text fields in XML records. It uses regex to identify the pattern and then it is tagged in the XML. For a specific project, it would be a great time saver (I am working with about 18 million records) if I could add the 2 characters 27 in front of one of the patterns I have to use.
Can this be done or am I just going to have to go the long way around?
No, you can't have a regex match text that isn't there. A regex will only be able to return text that is part of the original text.
However, if you matched into groups, you could potentially use the group name for extra information about what you're matching.
Regex is not the right tool if you'd like to edit an XML file. Instead, use a modern language like Python, Perl, Ruby, PHP or Java with a proper XML parser module. If you work in a Unix-like shell, I recommend xmlstarlet.
That said, if you'd like to go ahead with a substitution, you can try sed (at your own risk):
sed -i -r 's/987654/27&/g' files*.xml
(use the -i switch only if you want to modify the files in place)
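The same substitution can be sketched in Python, where \g<0> in the replacement refers to the whole match, much like & in sed. The function name prepend_27 and the sample record are invented for the example, and 987654 is just the placeholder pattern from the sed line.

```python
import re


def prepend_27(text, pattern=r"987654"):
    """Put the literal characters 27 in front of every match of pattern."""
    # \g<0> is the whole match; plain \0 is not a valid group reference here.
    return re.sub(pattern, r"27\g<0>", text)


# Hypothetical record fragment.
record = "<id>987654</id><note>ref 987654</note>"
print(prepend_27(record))
```

So the regex itself still only matches what is there; the extra characters come from the replacement string.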

Need simple regex for LaTeX

In my LaTeX files, I have literally thousands of occurrences of the following construct:
$\displaystyle{...math goes here...}$
I'd like to replace these with
\mymath{...math goes here...}
Note that the $'s disappear, but the curly braces remain---if not for the trailing $, this would be a basic find-and-replace. If only I knew any regex, I'm sure it would handle this with no problem. What's the regex I need to make this happen?
Many thanks in advance.
Edit: Some issues and questions have arisen, so let me clarify:
Yes, $\displaystyle{ ... }$ can occur multiple times on the same line.
No, nested }$'s (such as $\displaystyle{...{more math}$...}$) cannot occur. I mean, I suppose it could if you put it in an \mbox or something, but I can't imagine why anyone would ever do that inside a $\displaystyle{}$ construct, the purpose of which is to display math inline with text. At any rate, it's not something I've ever done or am likely to do.
I tried using the perl suggestion, but while the shell raised no objections, the files remained unaffected.
I tried using the sed suggestion, but the shell objected to an "unexpected token near `('". I've never used sed before (and "man sed" was obtuse), but here's what I did: navigated to a directory containing .tex files and typed "sed s/\$\\displaystyle({[^}]+})\$/\\mymath\1/g *.tex". No luck. How do I use sed to do what I want?
Again, many many thanks for all offered help.
Be very careful when using REGEX to do this type of substitution
because the theoretical answer is that
REGEX is incapable of matching this type of pattern.
REGEX is a finite state machine; it does not incorporate a pushdown stack so
it cannot work with nested structures such as "{...math goes here...}" if
there is any possibility of nesting such that something like "{more math}$"
can appear as part of a "math goes here" string. You need at a minimum a
context free grammar to describe this type of construct - a state machine
just doesn't cut it!
Now having said that, you may still be able to pull this off using REGEX
provided none of your "math goes here" strings are more complex than
what a state machine can handle.
Give it a shot.... but beware of the results!
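sed (quote the script so the shell leaves it alone, and pass -E for extended regex; this prints the result to stdout):
sed -E 's/\$\\displaystyle(\{[^}]+\})\$/\\mymath\1/g' *.tex
perl (the leading $ must be escaped too, otherwise nothing matches):
perl -pi -e 's/\$\\displaystyle(\{.*)\}\$/\\mymath$1}/g' *.tex
If multiple }$ occur on the same line, you need the non-greedy version:
perl -pi -e 's/\$\\displaystyle(\{.*?)\}\$/\\mymath$1}/g' *.tex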
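Here is a small Python sketch of the two variants, mainly to show where they differ. The function names and sample lines are made up; the first pattern forbids any } inside the math, the second is the non-greedy form. Both assume, as stated in the question, that }$ never appears inside the math itself.

```python
import re

# Body may not contain '}' at all (the [^}]+ form).
SIMPLE = re.compile(r"\$\\displaystyle(\{[^}]+\})\$")
# Non-greedy form: stop at the first '}$' after the opening brace.
LAZY = re.compile(r"\$\\displaystyle(\{.*?)\}\$")


def to_mymath_simple(line):
    return SIMPLE.sub(lambda m: "\\mymath" + m.group(1), line)


def to_mymath_lazy(line):
    return LAZY.sub(lambda m: "\\mymath" + m.group(1) + "}", line)


# Hypothetical sample lines.
line = r"Let $\displaystyle{x+1}$ and $\displaystyle{y^2}$ hold."
nested = r"Half is $\displaystyle{\frac{1}{2}}$."

print(to_mymath_simple(line))
print(to_mymath_simple(nested))  # left alone: inner braces defeat [^}]+
print(to_mymath_lazy(nested))
```

Since ordinary math such as \frac{1}{2} contains inner braces, the non-greedy form is probably the safer default for real files.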