Replacing patterns with grep and regex - regex

I was wondering if with grep and regex we can do something in the spirit of the following example:
Original text:
name_1_extratext
name_2_extratext
Target:
name_extratext_1
name_extratext_2
I am particularly interested in doing this within Vim. Thanks

@bohemian's comment about grep only doing matching also applies within Vim. "grep" and "regex" are not, or should not be, vague buzzwords you throw at a problem. They are tools that may or may not be suited to the class of problem you are having, and a large part of learning is acquiring the correct intuition for which tool to use in which case.
In Vim, what you want to do is a substitution. It doesn't involve grep at all but it definitely involves regular expressions.
In this specific case, you would do something like this:
:%s/\(.*\)\(_\d\+\)\(.*\)/\1\3\2
or this variant of @bohemian's answer:
:%s/_\([^_]\+\)_\(.*\)/_\2_\1/
or anything that works and makes sense to you, really. Ideally not something you copy/pasted from the internet but something you really understand.
Reference:
The :s command is introduced in chapter 10 of the user manual: :help 10.2, and further documented under :help :s.
The % range is also introduced in chapter 10 of the user manual: :help 10.3, and further documented under :help :range.
Vim's own regular expression dialect is extensively documented under :help pattern.
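For readers who want to try the same idea outside Vim, here is a minimal sketch in Python that mirrors the second substitution above. It is only an illustration: the variable names are made up, and Python's re syntax differs from Vim's (groups and + are written without backslashes).

```python
import re

lines = ["name_1_extratext", "name_2_extratext"]

# Same idea as :%s/_\([^_]\+\)_\(.*\)/_\2_\1/ in Vim:
# capture the middle field and the rest of the line, then emit them swapped.
pattern = re.compile(r"_([^_]+)_(.*)")

swapped = [pattern.sub(r"_\2_\1", line) for line in lines]
print(swapped)
```

The first group stops at the next underscore because of [^_], so only the field between the first two underscores is moved.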

grep doesn't "do" anything to what it matches; it only matches.
Use sed:
echo "name_1_extratext" | sed -E 's/_([^_]+)_(.*)/_\2_\1/'

Related

Bash - Regex for HTML contents

I'm learning about Bash scripting, and need some help understanding regex's.
I have a variable that is basically the html of a webpage (exported using wget):
currentURL="https://www.example.com"
currentPage=$(wget -q -O - "$currentURL")
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
I started with this, but I need to modify the regex:
Test string (this is what currentPage contains; there can be zero to many instances of this):
<img src="./download/file.php?id=123456&t=1">
Current Regex:
.\/download\/file.php\?id=[0-9]{6}\&mode=view
Here's the regex I created, but it doesn't seem to work in bash.
The best solution would be to have the ID of each file. In this case, simply 123456. But if we can start with getting the /download/file.php?id=123456, that'd be a good start.
Don't parse XML/HTML with regex, use a proper XML/HTML parser.
theory :
According to compiler theory, HTML can't be parsed with regular expressions, which are based on finite state machines. Because HTML is built hierarchically, you need a pushdown automaton and an LALR grammar handled by a tool like YACC.
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
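If you go the high-level-language route instead, the following is a rough sketch using only Python's standard library (html.parser and urllib.parse) rather than lxml. The class and function names are hypothetical, and it assumes the ids you want sit in the id= query parameter of each img tag's src attribute, as in the test string from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ImageIdCollector(HTMLParser):
    """Collects the id= query parameter from every <img src=...> tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Let a URL parser split the query string instead of a regex.
        query = parse_qs(urlparse(src).query)
        self.ids.extend(query.get("id", []))


def collect_image_ids(html):
    collector = ImageIdCollector()
    collector.feed(html)
    return collector.ids


sample = '<p><img src="./download/file.php?id=123456&t=1"></p>'
print(collect_image_ids(sample))
```

The parser takes care of tag and attribute syntax, so the only assumption left in the code is the shape of the URL.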
Let's first clarify a couple of misunderstandings.
I'm learning about Bash scripting, and need some help understanding regex's.
You seem to be implying some sort of relation between Bash and regex.
As if Bash was some sort of regex engine.
It isn't. The [[ builtin is the only thing I recall in Bash that supports regular expressions, but I think you mean something else.
There are some common commands executed in Bash that support some implementation of regular expressions such as grep or sed and others. Maybe that's what you meant. It's good to be specific and accurate.
I want to get the id's of all linked photos in this page. I just need help figuring out what the RegEx should be.
This suggests an underlying assumption that if you want to extract content from an HTML, then regex is the way to go. That assumption is incorrect.
Although it's best to extract content from HTML using an XML parser (using one of the suggestions in Gilles' answer),
and trying to use regex for it is not a good reflex,
for simple cases like yours it might just be good enough:
grep -oP '\./download/file\.php\?id=\K\d+(?=&mode=view)' file.html
Take note that you escaped the wrong characters in the regex:
/ and & don't have a special meaning and don't need to be escaped
. and ? have special meaning and need to be escaped
Some extra tricks in the above regex are good to explain:
The -P flag of grep enables Perl style (powerful) regular expressions
\K is a Perl specific symbol, it means to not include in the match the content before the \K
The (?=...) is a zero-width positive lookahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in the match.
The \K and the lookahead trickery is to work with grep -o, which outputs only the matched part. But without these trickeries the matched part would be for example ./download/file.php?id=123456&mode=view, which is more than what you want.
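The same extraction can be sketched in Python for comparison. Python's built-in re module has no \K, so this version uses a capture group for the digits and keeps the lookahead. The sample string is an assumption: it uses &mode=view to match the regex, not the &t=1 from the question's test string.

```python
import re

html = '<img src="./download/file.php?id=123456&mode=view">'

# . and ? are escaped because they are regex metacharacters; / and & are not.
# The capture group plays the role of \K: only the digits are returned.
pattern = re.compile(r"\./download/file\.php\?id=(\d+)(?=&mode=view)")

ids = pattern.findall(html)
print(ids)
```

With exactly one group in the pattern, findall returns just the captured ids, which is the same effect grep -o gets from \K and the lookahead.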

underscore to camelCase RegEx

Our standards have changed and I want to do a 'find and replace' for all the underscores and camelCase them, in, say, Dreamweaver (which allows RegEx) or Visual Studio 2010 (which we just got, if it allows searching by RegEx).
What would be the RegEx to do that?
RegEx baffles me. I definitely need to do some studying.
Thanks in advance!
Update: A little more info - I'm searching within my html,aspx,cfm or css documents for any string that contains an underscore and replacing it with the following letter capitalized.
I had this problem, but I also needed to handle converting fields like gap_in_cover_period_d_last_5_yr into gapInCoverPeriodDLast, and found that 9 out of 10 other sed expressions don't like one-letter words or numbers.
So to answer the question, use sed. sed's s command is the equivalent of the :s command in Vim, so the expression below is the body of such a command (i.e. sed 's/.../g'). The trailing c is Vim's confirm flag; sed's s command does not accept it, so drop it there.
This seemed to work, although I did have to run it twice: word_a_word becomes wordA_word on the first pass and wordAWord on the second (sed's backward commands are just too magical for my muggle blood):
s/\([A-Za-z0-9]\+\)_\([0-9a-z]\)/\1\U\2/gc
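The need for a second pass comes from the pattern consuming the word before the underscore: once word_a has been matched, the next underscore has no unconsumed word character left in front of it. Here is a small sketch of that behaviour in Python. It is an approximation of the sed expression, not the same command, because Python's re has no \U and the uppercasing is done in a callback; the function name is made up.

```python
import re

# Same shape as the expression above: a word, an underscore, one character.
pattern = re.compile(r"([A-Za-z0-9]+)_([0-9a-z])")


def one_pass(text):
    # Keep the word, drop the underscore, uppercase the following character.
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


first = one_pass("word_a_word")
second = one_pass(first)
print(first, second)
```

Matches cannot overlap, so the single letter a is used up by the first match and cannot serve as the "word" in front of the second underscore until the next pass.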
I recently had to approach a similar situation you asked about. Here is a regex I've been using in VIM which does the job for me:
%s/_\([a-zA-Z]\)/\u\1/g
As an example:
this_is_a_test becomes thisIsATest
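A minimal Python sketch of the same approach, for trying it outside Vim. The function name is hypothetical. Because the pattern matches only the underscore and the letter after it, one pass is enough even for one-letter words.

```python
import re


def to_camel_case(name):
    # Match only "_" plus the following letter, like the Vim pattern,
    # and replace the pair with the uppercased letter.
    return re.sub(r"_([a-zA-Z])", lambda m: m.group(1).upper(), name)


print(to_camel_case("this_is_a_test"))
```

Underscores followed by a digit are left alone by this character class; widening it to [a-zA-Z0-9] would be one way to remove them too, if that is what your naming standard wants.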
I don't think there is a good way to do this purely with regex. Searching for _ characters is easy, something like ._. should work to find an _ with something on either side, but you need a more powerful scripting system to change the case of the character following the _. I suggest perl :)
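I have a solution in PHP (preg_replace_callback is used because the /e modifier was removed in PHP 7):
preg_replace_callback('/(_)(.)/', function ($m) { return strtoupper($m[2]); }, $str)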
There may be a more elegant selector criteria but I wanted to keep it simple.

Need simple regex for LaTeX

In my LaTeX files, I have literally thousands of occurrences of the following construct:
$\displaystyle{...math goes here...}$
I'd like to replace these with
\mymath{...math goes here...}
Note that the $'s disappear, but the curly braces remain---if not for the trailing $, this would be a basic find-and-replace. If only I knew any regex, I'm sure it would handle this with no problem. What's the regex I need to make this happen?
Many thanks in advance.
Edit: Some issues and questions have arisen, so let me clarify:
Yes, $\displaystyle{ ... }$ can occur multiple times on the same line.
No, nested }$'s (such as $\displaystyle{...{more math}$...}$) cannot occur. I mean, I suppose it could if you put it in an \mbox or something, but I can't imagine why anyone would ever do that inside a $\displaystyle{}$ construct, the purpose of which is to display math inline with text. At any rate, it's not something I've ever done or am likely to do.
I tried using the perl suggestion, but while the shell raised no objections, the files remained unaffected.
I tried using the sed suggestion, but the shell objected to an "unexpected token near `('". I've never used sed before (and "man sed" was obtuse), but here's what I did: navigated to a directory containing .tex files and typed "sed s/\$\\displaystyle({[^}]+})\$/\\mymath\1/g *.tex". No luck. How do I use sed to do what I want?
Again, many many thanks for all offered help.
Be very careful when using REGEX to do this type of substitution
because the theoretical answer is that
REGEX is incapable of matching this type of pattern.
REGEX is a finite state machine; it does not incorporate a pushdown stack so
it cannot work with nested structures such as "{...math goes here...}" if
there is any possibility of nesting such that something like "{more math}$"
can appear as part of a "math goes here" string. You need at a minimum a
context free grammar to describe this type of construct - a state machine
just doesn't cut it!
Now having said that, you may still be able to pull this off using REGEX
provided none of your "math goes here" strings are more complex than
what a state machine can handle.
Give it a shot.... but beware of the results!
sed (quote the expression so the shell leaves it alone, and use -E so the group and + work unescaped):
sed -E 's/\$\\displaystyle(\{[^}]+\})\$/\\mymath\1/g'
perl -pi -e 's/\$\\displaystyle(\{.*)\}\$/\\mymath$1}/g' *.tex
if multiple }$ occur on the same line you need a non-greedy version:
perl -pi -e 's/\$\\displaystyle(\{.*?)\}\$/\\mymath$1}/g' *.tex
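To see why the non-greedy version matters when a line holds several constructs, here is a small sketch in Python. The sample line is invented, and the pattern is a close cousin of the Perl one rather than a copy: the closing brace is kept inside the group.

```python
import re

line = r"a $\displaystyle{x^2}$ and $\displaystyle{\frac{1}{2}}$ b"

# Greedy: .* runs to the LAST }$ on the line, swallowing both constructs.
greedy = re.sub(r"\$\\displaystyle(\{.*\})\$", r"\\mymath\1", line)

# Non-greedy: .*? stops at the FIRST }$ it can, so each construct is handled.
lazy = re.sub(r"\$\\displaystyle(\{.*?\})\$", r"\\mymath\1", line)

print(greedy)
print(lazy)
```

The non-greedy form copes with inner braces such as \frac{1}{2} here only because no inner } is directly followed by $, which is the no-nesting condition stated in the question.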

Regular expression extraction in text editors

I'm kind of new to programming, so forgive me if this is terribly obvious (which would be welcome news).
I do a fair amount of PHP development in my free time using preg_match and writing most of my expressions using the free (open source?) Regex Tester.
However frequently I find myself wanting to simply quickly extract something and the only way I know to do it is to write my expression and then script it, which is probably laughable, but welcome to my reality. :-)
What I'd like is something like a simple text editor that I can feed my expression to (given a file or a buffer full of pasted text) and have it parse the expression and return a document with only the results.
What I find is usually regex search/replace functions, as in Notepad++ I can easily find (and replace) all instances using an expression, but I simply don't know how to only extract it...
And it's probably terribly obvious, can expression match only the inverse? Then I could use something like (just the expression I'm currently working on):
([^<]*)
And replace everything that doesn't match with nothing. But I'm sure this is something common and simple, I'd really appreciate any pointers.
FWIW I know grep and I could do it using that, but I'm hoping there's a better GUI-ified solution I'm simply ignorant of.
Thanks.
Zach
What I was hoping for would be something that worked in a more standard set of gui tools (ie, the tools I might already be using). I appreciate all the responses, but using perl or vi or grep is what I was hoping to avoid, otherwise I would have just scripted it myself (of course I did) since they're all relatively powerful, low-level tools.
Maybe I wasn't clear enough. As a senior systems administrator the cli tools are familiar to me, I'm quite fond of them. Working at home however I find most of my time is spent in a gui, like Netbeans or Notepad++. I just figure there would be a simple way to achieve the regex based data extraction using those tools (since in these cases I'd already be using them).
Something vaguely like what I was referring to would be this which will take an expression on the first line and a url on the second line and then extract (return) the data.
It's ugly (I'll take it down after tonight since it's probably riddled with problems).
Anyway, thanks for your responses. I appreciate it.
If you want a text editor with good regex support, I highly recommend Vim. Vim's regex engine is quite powerful and is well-integrated into the editor. e.g.
:g!/regex/d
This says to delete every line in your buffer which doesn't match pattern regex.
:g/regex/s/another_regex/replacement/g
This says on every line that matches regex, do another search/replace to replace text matching another_regex with replacement.
If you want to use commandline grep, a Perl/Ruby/Python/PHP one-liner, or any other tool, you can filter the current buffer's text through that tool and update the buffer to reflect the results:
:%!grep regex
:%!perl -nle 'print if /regex/'
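The filters above keep whole matching lines. If you want only the matched text, which is what the question asks for, the idea can be sketched in a few lines of Python. The function name and the sample text are made up for illustration.

```python
import re


def extract_matches(pattern, text):
    """Return every match of pattern in text (group 1 if it has one group)."""
    return re.findall(pattern, text)


sample = "<b>one</b><i>two</i>"
results = extract_matches(r">([^<]+)<", sample)

# One result per line, i.e. a "document with only the results".
print("\n".join(results))
```

Wrapped in a script that reads standard input, something like this could be used as the external tool in a :%! filter.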
Have you tried nregex.com ?
http://www.nregex.com/nregex/default.aspx
There's a plugin for Netbeans here, but development looks stalled:
http://wiki.netbeans.org/Regex
http://wiki.netbeans.org/RegularExpressionsModuleProposal
You might also try The Regulator:
http://sourceforge.net/projects/regulator/
Most regex tools will let you match the opposite of a regex,
usually through an inversion option of the tool (for example grep -v, or :g! in Vim) rather than an operator in the regex itself.
I know grep has been mentioned, and you don't want a cli tool, but I think ack deserves to be mentioned.
ack is a tool like grep, aimed at
programmers with large trees of
heterogeneous source code.
ack is written purely in Perl, and
takes advantage of the power of Perl's
regular expressions.
A good text editor can be used to perform the actions you are describing. I use EditPadPro for search and replace functionality and it has some other nice features, including code coloring for most major formats. The search panel includes a regular expression mode that lets you input a regex, search for the first instance to check that your expression matches the appropriate information, and then replace either iteratively or all instances at once.
http://www.editpadpro.com
My suggestion is grep, and cygwin if you're stuck on a Windows box.
echo "text" | grep -oE '([^<]*)'
OR
cat filename | grep -oE '([^<]*)'
What I'd like is something like a
simple text editor that I can feed my
expression to (given a file or a
buffer full of pasted text) and have
it parse the expression and return a
document with only the results.
You have just described grep. This is exactly what grep does. What's wrong with it?

Regex Search and Replace Program

Is there a simple and lightweight program to search over a text file and replace a string with regex?
For searching: grep - simple and fast. Included with Linux, here's a Windows version, not sure about Mac.
For replacing: sed. Here's a Windows version, not sure about Mac.
Of course, if you want to actually open up a file and see its contents while you search and replace, you can use emacs for that. Or ConTEXT. Or vim. Or what have you. ;)
See also this question.
Perl excels at this, with its -i, -n, -p and -e switches. See the slides from my talk Field Guide To The Perl Command Line Switches for examples.
Others have mentioned sed and awk, and it's no surprise that Perl was inspired by them. However, Perl may well be easier to get and install for you and/or your users.
There's also sed, which is a useful tool to learn the basics of - great for doing quick regex based substitutions.
Quick example, to change "foo" to "bar" in input.txt ...
sed -e 's/foo/bar/g' input.txt > output.txt
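If none of these tools is at hand, the same search and replace is only a few lines in a scripting language. This is a rough Python sketch, not a replacement for sed: the function name is hypothetical, and the temporary directory exists only so the example runs without touching real files.

```python
import re
import tempfile
from pathlib import Path


def replace_in_file(src, dst, pattern, replacement):
    """Write a copy of src to dst with every regex match replaced."""
    text = Path(src).read_text()
    Path(dst).write_text(re.sub(pattern, replacement, text))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "output.txt"
    src.write_text("foo and more foo\n")
    replace_in_file(src, dst, r"foo", "bar")
    result = dst.read_text()

print(result)
```

Like the sed example, it writes to a separate output file and leaves the input untouched.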
Many decent text editors have the option as well, vim, emacs, EditPlus and so on.
sed or awk. I recommend the book sed&awk to master the subject or the booklet sed&awk pocket reference for a quick reference. Of course mastering regular expressions is a must...
You didn't mention what platform you're using... If you are interested in a relatively simple GUI tool, there's regexxer. Otherwise, the commandline tools such as sed that were mentioned earlier can be very useful.
It depends if you're dealing with one or many files. At the risk of being pilloried, I'm assuming you're using Windows because you didn't specify a platform.
For one file at a time, Notepad2 does the trick and is extremely fast, lightweight and portable.
For search/replace over multiple files at once, try Agent Ransack.
Try WildGem: http://www.skytopia.com/software/wildgem
I'm the creator. Small, super-fast, portable and self-contained. You can use Regex, but it also has its own simple language syntax to make queries much easier in theory.
I quote:
Unlike similar programs, WildGem is fast with a dual split display, and updates or highlights matches as you type in realtime. A unique colour coded syntax allows you to easily find/replace text without worrying about having to escape special symbols.
Not knowing the platform, I'd say the ad that popped up on this page might be appropriate: PowerGREP. Don't know anything about it, but it sounds similar to what you're looking for.
Use emacs or xemacs. It has a perfect regexp replacement function. You can even use constructions like \1 (or \2 or \3) in your replacement to get back a matched expression that was identified with ( ) around it. To prevent a vi-emacs clash: vi has similar constructions. I'm not sure of any modern editors that support this functionality.
Tip: Try out a simple replacement first; it can be a bit unclear, as you might end up adding '\' to escape the special RegExp constructions...