Regular expression to match string containing character - regex

I have a file that contains several strings bounded by single quotations ('). These strings can contain whitespace and sometimes occur over multiple lines; however, no string contains a quotation (') mark. I'd like to create a regex that finds strings containing the character "$". The regex I had in mind: '[^']*\$[^']* can't search over multiple lines. How can I get it to do so?

Your regex can search over multiple lines. If it doesn't, there is a mistake in your code outside of it (hint: [^'] does include newlines).
How about this expression (it prevents useless backtracking):
'([^'$]*\$[^']*)'

You are not telling us which language you are using, so we are left to speculate. There are two issues here, really:
Many regex engines only process one line at a time by default
Some regex engines cannot process more than one line at a time
If you are in the former group, we can help you. But the problem isn't with the regex so much as with how you are applying it. (But I added the missing closing single quote to your regex below, and the negation to prevent backtracking, as suggested in Tomalak's answer.)
In Python 2.x:
# doesn't work
import re

with open('file', 'r') as f:
    for line in f:
        # This is broken because it examines a single line of input
        if re.search(r"'[^'$]*\$[^']*'", line):
            print "match"
# works
import re

s = ''
with open('file', 'r') as f:
    for line in f:
        s += line
# We have collected all the input lines. Now examine them.
if re.search(r"'[^'$]*\$[^']*'", s):
    print "match"
(That is not the idiomatic, efficient, correct way to read in an entire file in Python. I'm using clumsy code to make the difference obvious.)
Now, more idiomatically, what you want could be
perl -0777 -ne 'while (m/\x27[^\x27$]*\$[^\x27]*\x27/g) { print "$&\n" }' file
(the \x27 is a convenience so I can put the whole script in single quotes for the shell; it's not strictly necessary if you write your Perl program in a file), or
#!/usr/bin/env python
import re

with open('file', 'r') as f:
    # findall returns every non-overlapping match in the whole file at once
    for match in re.findall(r"'[^'$]*\$[^']*'", f.read()):
        print match
Similar logic can be applied in basically any scripting language with a regex engine, including sed. If you are using grep or some other simple, low-level regex tool, there isn't really anything you can do to make it examine more than one line at a time (but some clever workarounds are possible, or you could simply switch to a different tool -- pcregrep comes to mind as a common replacement for grep).
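For instance, pcregrep's -M option lets a match span line boundaries; a rough equivalent of the pattern above (reusing the \x27 trick so the whole thing can sit inside single shell quotes) might be:
pcregrep -M '\x27[^\x27$]*\$[^\x27]*\x27' file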
If you have really large input files, reading it all into memory at once may not be a good idea; perhaps you can devise a way to read only as much as necessary to perform a single match at a time. But that already goes beyond this simple answer.

Related

Inserting a potentially missing line with perl

I'm trying to modify a perl filter to insert a line that may be missing.
I might have inputs that are
A
B
C
or
A
C
A and B are fixed and known in advance. C can vary from file to file.
The real data are more complicated - callstacks generated as part of
a regression test. Depending on the compiler used (and hence the
optimization) there may be tail call elimination which can remove
the 'B' frame. After filtering the files are simply diffed.
In the second case I want to insert the 'B' line. In the first case I don't want to insert a duplicate line. I thought that this was a job for negative lookahead, using the following
s/A.(?!B)/A\nB/s;
However this seems to mean "if any part of A.(?!B) matches the input text then substitute it with A\nB" whereas I need "if all of A.(?!B) matches" then substitute.
No matter what I try it either always substitutes or never substitutes.
In a one-liner, ready for a quick test:
perl -0777 -wpe's/ ^A.*\n \K (?!B.*\n) /B-line\n/xgm' file
The \K makes the regex drop everything matched before it, so we don't have to capture and copy it back in the replacement side. With the -0777 switch the whole file is slurped into a string, available in $_.
In order to match all such A-B?-C groups of lines we need the /g modifier (match "globally"), and for the anchor ^ to also match on internal newlines we need the /m modifier ("multiline").
The /x modifier makes it disregard literal spaces (and newlines and comments), which allows spacing things out for readability.
On the other hand, if a line starting with A must be followed by either a line starting with B, or by a line starting with C if the B-line isn't there, then it's simpler, with no need for a lookahead
perl -0777 -wpe's/ ^A.*\n \K (^C.*\n) /B-line\n$1/xgm' file
Both these work correctly in my (basic) tests.
In either case, the rest of the file is printed as is, so you can use the -i switch to change the input file "in-place," if desired, and with -i.bak you get a backup as well. So
perl -i.bak -0777 -wpe'...' file
Or, if this runs from a script, you can dump (redirect) the output into the same file to overwrite it, since the whole file was read first.
Reading the file line by line is of course far more flexible. For example
use warnings;
use strict;
use feature 'say';

my $just_saw_A_line;

while (<>) {
    if ($just_saw_A_line and not /^B/) {
        say "B-line";
    }
    $just_saw_A_line = /^A/;
    print;
}
This also handles multiple A-(B?)-C line groups. It's far more easily adjusted for variations.
The program acts like a filter, taking either STDIN or lines from file(s) given on the command line, and printing lines to STDOUT. The output can then be redirected to a file, but not to the input file itself. (If the input file itself needs to be changed, then the code needs to be modified for that.)

Extracting substring in linux using expr and regex

So I have just begun learning regular expressions. I have to extract a substring within a large string.
My string is basically one huge line containing a lot of stuff. I have identified the pattern based on which I need to extract. I need the number in this line: A lot of stuff<li>65,435 views</li>a lot of stuff (this number is just an example).
This entire string is in fact one big line and my file views.txt contains a lot of such lines.
So I tried this,
while read p
do
    y=`expr "$p": ".*<li>\(.*\) views "`
    echo $y
done < views.txt
I wished to iterate over all such lines within this views.txt file and print out the numbers.
And I get a syntax error. I really have no idea what is going wrong here. I believe that I have correctly flanked the number by <li> and views including the spaces.
My (limited) interpretation of the above regex leads me to believe that it would output the number.
Any help is appreciated.
The syntax error is because the ":" is not separated from "$p" by a space (or tab). With that fixed, the regex has a trailing blank which will prevent it matching. Fixing those two problems, your sample script works as intended.
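For reference, a sketch of the loop with both of those fixes applied (the colon separated by spaces, and no trailing blank in the pattern) might look like this, assuming one <li>... views</li> per line:
while read p
do
    # ':' must be a separate argument to expr, and the pattern must not end with a stray blank
    y=`expr "$p" : ".*<li>\(.*\) views"`
    echo "$y"
done < views.txt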

switch search pattern in a line to titlecase leaving remainder of line unchanged

I want to convert the UPPERCASE words in the following line:
<h3>XV. THE THOUSAND AND ONE GOALS</h3>
to titlecase, using either sed or ved (vim ed). Googling turned up ways to titlecase whole lines, but not just the text that matches a pattern search (partial text in a line).
I thought this might work, but no dice:
sed -ri '/<h3>/s/([A-Z ]*)<\/h3>/\L\1\END<\/h3>/;s/[[:graph:]]*/\u&/g'
After converting the search pattern to LOWERCASE (no probs of course managing that), I thought I might be able to then convert the same text to Titlecase with something like this, but still no love (I actually thought this made enough sense to work, so I am unsure why it doesn't):
sed -ri 's/(<\/a>[IVX]{1,6}\.[ ]{1,})( [a-z])/\1\u\2/g'
Is there some way to edit only the text in a pattern search to Titlecase and not all the words in an entire line of text? I wonder why there is not a \T to complement the \L and \u case commands. Sure would be handy.
%s/\v<\zs(\u)(\u*)\ze>/\1\L\2/g
In Vim, after you execute this command, your line will be turned into:
<h3>Xv. The Thousand And One Goals</h3>
Vim has substitute() and other powerful functions, which are very handy for doing substitution on matched text if your requirement is complex enough: :s/.../\=(expression here)/
To do the replacement only inside <h3>....</h3>:
%s#<h3>\zs.*\ze</h3>#\=substitute(submatch(0), '\v<\zs(\u)(\u*)\ze>','\1\L\2',"g")#

Regex (grep) for multi-line search needed [duplicate]

This question already has answers here:
How can I search for a multiline pattern in a file?
I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware that this also adds a trailing NUL character to the output if used with -o.
-o print only the matching part. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? matches . in non-greedy mode, that is, it stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This tries to find the same indentation as the method.
As you can imagine, this search prints the main method in a C (*.c) source file.
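Adapting the same flags to the question's actual search is left as a sketch (untested; it assumes GNU grep built with PCRE support, and lists the matching files rather than the matches themselves):
grep -rlIizP --include='*.sql' --exclude-dir='.svn*' '(?s)\bselect\b.*?\bcustomerName\b.*?\bfrom\b' .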
I am not very good with grep, but your problem can be solved using the awk command.
Just see
awk '/select/,/from/' *.sql
The above command prints everything from the first occurrence of select through the first occurrence of from. Now you need to verify whether the returned statements contain customerName or not; for this you can pipe the result and use awk or grep again, as shown below.
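A rough sketch of that pipeline (untested) could be:
awk '/select/,/from/' *.sql | grep -i 'customerName'
which keeps only the lines inside the select ... from ranges that mention customerName.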
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
$/ = "\n\n";            # Paragraphs
while (<>)
{
    if ($_ =~ m/SELECT.*customerName.*FROM/si)   # /s so . can cross newlines
    {
        print "$ARGV\n";    # print the file name
        last;               # go to next file
    }
}
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.
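A minimal sketch of that wrapping, assuming File::Find's find() with a "wanted" callback (the pattern and the *.sql filter come from the question):
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $dir = shift @ARGV || '.';

find(sub {
    return unless -f $_ && /\.sql$/i;        # only regular *.sql files
    local $/ = "\n\n";                       # paragraph mode, as above
    open my $fh, '<', $_ or return;
    while (my $para = <$fh>) {
        if ($para =~ /SELECT.*customerName.*FROM/si) {
            print "$File::Find::name\n";     # report the file once...
            last;                            # ...then move on to the next
        }
    }
    close $fh;
}, $dir);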

Regex Partial String CSV Matching

Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:
Two words,Word,Word,Word,"Number, number"
What I need to do is format it like this...
"Two words","Word",Word","Word","Number, number"
I have had a RegEx pattern of
s/,/","/g
working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma, only the delimited number list.
I managed to write up
s/,[A-Za-z0-9]/","/g
which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that
s/(,)[A-Za-z0-9]\b
should work, but it doesn't.
Anyone have an idea?
My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in perl, I use Text::CSV_XS or DBD::CSV (allowing me SQL to access a CSV file as if it were a table, which, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.
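As a sketch of that approach with Text::CSV_XS (reading CSV from standard input and writing it back out with every field quoted):
use strict;
use warnings;
use Text::CSV_XS;

my $in  = Text::CSV_XS->new({ binary => 1 });
my $out = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => "\n" });

while (my $row = $in->getline(\*STDIN)) {   # parses quoted fields, embedded commas and all
    $out->print(\*STDOUT, $row);            # re-emits each field wrapped in quotes
}
Embedded commas inside quoted fields survive because the parser, not a regex, decides where each field ends.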
s/,([^ ])/","$1/ will match a "," followed by a "not-a-space", capturing the not-a-space, then replacing the whole thing with the captured part.
Depending on which regex engine you're using, you might be writing \1 or other things instead of $1.
If you're using Perl or otherwise have access to a regex engine with negative lookahead, s/,(?! )/","/ (a "," not followed by a space) works.
Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There's lot of other odd corner cases to worry about.
This question is similar to: Replace patterns that are inside delimiters using a regular expression call.
This could work:
s/"([^"]*)"|([^",]+)/"$1$2"/g
Looks like you're using Sed.
While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotations around it. Otherwise, you're looking at areas of computational complexity regular expressions are not meant to handle.
Through sed, your command would be:
sed 's/[ \"]*,[ \"]*/\", \"/g'
Note that you'll still have to put doublequotes at the beginning and end of the string.