Extracting a substring in Linux using expr and regex

So I have just begun learning regular expressions. I have to extract a substring within a large string.
My string is basically one huge line containing a lot of stuff. I have identified the pattern around which I need to extract. I need the number in a line like this: A lot of stuff<li>65,435 views</li>a lot of stuff (this number is just an example).
This entire string is in fact one big line and my file views.txt contains a lot of such lines.
So I tried this,
while read p
do
    y=`expr "$p": ".*<li>\(.*\) views "`
    echo $y
done < views.txt
I wished to iterate over all such lines within this views.txt file and print out the numbers.
And I get a syntax error. I really have no idea what is going wrong here. I believe that I have correctly flanked the number with <li> and views, including the spaces.
My (limited) interpretation of the above regex leads me to believe that it would output the number.
Any help is appreciated.

The syntax error is because the ":" is not separated from "$p" by a space (or tab). With that fixed, the regex has a trailing blank which will prevent it from matching. Fixing those two problems, your sample script works as intended.
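With both of those fixes applied, the loop might look like this (a sketch of the corrected script, keeping the same file name and regex):
while read p
do
    y=`expr "$p" : ".*<li>\(.*\) views"`
    echo "$y"
done < views.txt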

Related

Regular expression to match string containing character

I have a file that contains several strings bounded by single quotations ('). These strings can contain whitespace and sometimes occur over multiple lines; however, no string contains a quotation (') mark. I'd like to create a regex that finds strings containing the character "$". The regex I had in mind: '[^']*\$[^']* can't search over multiple lines. How can I get it to do so?
Your regex can search over multiple lines. If it doesn't, there is a mistake in your code outside of it. (Hint: [^'] does include newlines.)
How about this expression (it prevents useless backtracking):
'([^'$]*\$[^']*)'
You are not telling us which language you are using, so we are left to speculate. There are two issues here, really:
Many regex engines only process one line at a time by default
Some regex engines cannot process more than one line at a time
If you are in the former group, we can help you. But the problem isn't with the regex, so much as it's with how you are applying it. (But I added the missing closing single quote to your regex, below, and the negation to prevent backtracking as suggested in Tomalak's answer.)
In Python 2.x:
# doesn't work
import re

with open('file', 'r') as f:
    for line in f:
        # This is broken because it examines a single line of input
        if re.search(r"'[^'$]*\$[^']*'", line):
            print "match"
# works
import re

s = ''
with open('file', 'r') as f:
    for line in f:
        s += line
# We have collected all the input lines. Now examine them.
if re.search(r"'[^'$]*\$[^']*'", s):
    print "match"
(That is not the idiomatic, efficient, correct way to read in an entire file in Python. I'm using clumsy code to make the difference obvious.)
Now, more idiomatically, what you want could be
perl -0777 -ne 'while (m/\x27[^\x27$]*\$[^\x27]*\x27/g) { print "$&\n" }' file
(the \x27 is a convenience so I can put the whole script in single quotes for the shell, and not strictly necessary if you write your Perl program in a file), or
#!/usr/bin/env python
import re
with open('file', 'r') as f:
    # findall returns every non-overlapping match in the whole file contents
    for match in re.findall(r"'[^'$]*\$[^']*'", f.read()):
        print match
Similar logic can be applied in basically any scripting language with a regex engine, including sed. If you are using grep or some other simple, low-level regex tool, there isn't really anything you can do to make it examine more than one line at a time (but some clever workarounds are possible, or you could simply switch to a different tool -- pcregrep comes to mind as a common replacement for grep).
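For example, a possible pcregrep invocation would be (a sketch; -M enables multi-line matching, and the \x27 escapes are only there so the whole pattern can sit inside the shell's single quotes, as above):
pcregrep -M '\x27[^\x27$]*\$[^\x27]*\x27' file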
If you have really large input files, reading them all into memory at once may not be a good idea; perhaps you can devise a way to read only as much as necessary to perform a single match at a time. But that already goes beyond this simple answer.

Notepad++ mass change using regular expressions

I am having issues performing a mass change in a huge logfile.
Apart from the file size, which causes Notepad++ some trouble, I have a problem using more than 10 parameters in the replacement; up to 9 it works fine.
I need to change numerical values in a file where these values are located within quotation marks and with a leading and trailing comma: ."123,456,789,012.999",
I used this expression to find the value and replace it with this format:
,123456789012.999, (so that there are no quotation marks and no commas within the numeric value)
The expression used to find is:
([,])(["])([0-9]+)([,])([0-9]+)([,])([0-9]+)([,])([0-9]+)([\.])([0-9]+)(["])([,])
and the expression to replace with is:
\1\3\5\7\9\10\11\13
The problem is that parameters \11 and \13 are not working (the characters, e.g. the .999 in the example, do not appear in the changed values).
So now the question is: is there any limit on the number of parameters?
It seems to me that it does not work above 10. For shorter numeric values where I only need up to 9 parameters, the search and replacement strings work fine; for the example above the search works but the replacement does not, and the end of the changed value gets corrupted.
Also, it came to my mind that instead of using Notepad++ I could maybe change the logfile on the Unix server directly; however, I had issues building the correct Perl syntax. Could anyone help with that, maybe?
After having a little play myself, it looks like back-references \11-\99 are invalid in Notepad++ (which is not that surprising, since support for them is commonly omitted from regex languages). However, there are several things you can do to improve that regular expression in order to make this work.
Firstly, you should consider using fewer groups, or alternatively non-capture groups. Did you really need to store 13 variables in that regex in order to do the replacement? Clearly not, since you're not even using half of them!
To put it simply, you could just remove some brackets from the regex:
[,]["]([0-9]+)[,]([0-9]+)[,]([0-9]+)[,]([0-9]+)[.]([0-9]+)["][,]
And replace with:
,\1\2\3\4.\5,
...But that's not all! Why are you using square brackets to say "match anything inside" if there's only one thing inside? We can get rid of these, too:
,"([0-9]+),([0-9]+),([0-9]+),([0-9]+)\.([0-9]+)",
(Note I added a "\" before the ".", so that it matches a literal "." rather than "anything".)
Also, although this isn't a big deal, you can use "\d" instead of "[0-9]".
This makes your final, optimised regex:
,"(\d+),(\d+),(\d+),(\d+)\.(\d+)",
And replace with:
,\1\2\3\4.\5,
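To answer the side question about changing the logfile directly on the Unix server: the same search and replacement could be done with a Perl one-liner along these lines (a sketch; the logfile name is a placeholder, and -i.bak keeps a backup of the original file):
perl -i.bak -pe 's/,"(\d+),(\d+),(\d+),(\d+)\.(\d+)",/,$1$2$3$4.$5,/g' logfile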
Not sure if regex groups have limitations, but you could use lookarounds to save two groups; you could also merge some groups in your example. But first, let's get rid of some useless character classes:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
We could merge those groups:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+)(\.)([0-9]+)(")(,)
                                        ^^^^^^^^^^^^^^^^^^^^
We get:
(\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(,)
Let's add lookarounds:
(?<=\.)(")([0-9]+)(,)([0-9]+)(,)([0-9]+)(,)([0-9]+\.[0-9]+)(")(?=,)
The replacement would be \2\4\6\8.
If you have a fixed length of digits at all times, it's fairly simple to do what you have done. Even though your expression is poorly written, it does the job. If this is the case, look at Tom Lord's answer.
I played around with it a little bit myself, and I would probably use two expressions, which makes it much easier. If you have to do it in one, this would work, but it is pretty unsafe:
(?:"|(\d+),)|(\.\d+)"(?=,) replace by \1\2
Live demo: http://regex101.com/r/zL3fY5

Lua pattern matching for extracting hard coded strings in code base

I'm working with a C++ code base. Right now I'm using C++ code that calls a Lua script to look through the entire code base and, hopefully, return a list of all of the strings which are used in the program.
The strings in question are always preceded by a JUCE macro called TRANS. Here are some examples which should extract a string
TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("spans \
multiple \
lines")
And I'm sure you can imagine some other possible string variants that could occur in a large code base. I'm making an automatic tool to generate JUCE translation formatted files, to automate the process as much as possible.
I've gotten this far, as it stands, with pattern matching in order to find these strings. I've converted the source code into a Lua string:
path = ...
--Open file and read source into string
file = io.open(path, "r")
str = file:read("*all")
and called
for word in string.gmatch(str, 'TRANS%s*%b()') do print(word) end
which finds a pattern that starts with TRANS and has balanced parentheses. This will get me the full macro, including the brackets, but from there I figured it would be pretty easy to split off the fat I don't need and just keep the actual string value.
However, this doesn't work for strings which cause a parenthesis imbalance.
e.g. TRANS(")") will return TRANS("), instead of TRANS(")")
I revised my pattern to
for word in string.gmatch(str, 'TRANS%s*(%s*%b""%s*') do print(word) end
where the pattern should start with TRANS, then zero or more spaces. Then it should have a ( character followed by zero or more spaces. Now that we are inside the brackets, we should have a balanced pair of "" marks, followed by another zero or more spaces, and finally ended by a ). Unfortunately, this does not return a single value when used. But... I think even if it worked as I expected it to, there can be a \" inside, which causes the bracket imbalance.
Any advice on extracting these strings? Should I continue to try and find a pattern matching sequence, or should I try a direct algorithm? Do you know why my second pattern returned no strings? Any other advice? I'm not looking to cover 100% of all possibilities, but being close to 100% would be awesome. Thanks! :D
I love Lua patterns as much as anyone, but you're bringing a knife to a gun fight. This is one of those problems where you really don't want to code the solution as regular expressions. To deal correctly with doublequote marks and backslash escapes, you want a real parser, and LPEG will manage your needs nicely.
In the second case, you forgot to escape parentheses. Try
for word in string.gmatch(str, 'TRANS%s*%(%s*(%b"")%s*%)') do print(word) end

REGEX: How to match several lines?

I have a huge CSV list that needs to be broken up into smaller pieces (say, groups of 100 values each). How do I match 100 lines? The following does not work:
(^.*$){100}
If you must, you can use (flags: multi-line, not global):
(^.*[\r\n]+){100}
But, realistically, using regex to find lines probably is the worst-performing method you could come up with. Avoid.
You don't need regex for this; there should be other tools in your language, and even if there are none, you can do simple string processing to get those lines.
However this is the regular expression that should match 100 lines:
/([^\n]+\n){100}/
But you really shouldn't use that; it's just to show how to do such a task if ever needed (it searches for non-newlines [^\n]+ followed by a newline \n, repeated {100} times).
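If the list lives in a file and standard Unix tools are available, the non-regex route can be as simple as this sketch (the input file name and the chunk_ output prefix are placeholders):
split -l 100 data.csv chunk_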

Regex Partial String CSV Matching

Let me preface this by saying I'm a complete amateur when it comes to RegEx and only started a few days ago. I'm trying to solve a problem formatting a file and have hit a hitch with a particular type of data. The input file is structured like this:
Two words,Word,Word,Word,"Number, number"
What I need to do is format it like this...
"Two words","Word",Word","Word","Number, number"
I have had a RegEx pattern of
s/,/","/g
working, except it also replaces the comma in the already quoted Number, number section, which causes the field to separate and breaks the file. Essentially, I need to modify my pattern to replace a comma with "," [quote comma quote], but only when that comma isn't followed by a space. Note that the other fields will never have a space following the comma; only the delimited number list does.
I managed to write up
s/,[A-Za-z0-9]/","/g
which, while matching the appropriate strings, would replace the comma AND the following letter. I have heard of backreferences and think that might be what I need to use? My understanding was that
s/(,)[A-Za-z0-9]\b
should work, but it doesn't.
Anyone have an idea?
My experience has been that this is not a great use of regexes. As already said, CSV files are better handled by real CSV parsers. You didn't tag a language, so it's hard to tell, but in perl, I use Text::CSV_XS or DBD::CSV (allowing me SQL to access a CSV file as if it were a table, which, of course, uses Text::CSV_XS under the covers). Far simpler than rolling my own, and far more robust than using regexes.
s/,([^ ])/","$1/ will match a "," followed by a "not-a-space", capturing the not-a-space, then replacing the whole thing with the captured part.
Depending on which regex engine you're using, you might be writing \1 or other things instead of $1.
If you're using Perl or otherwise have access to a regex engine with negative lookahead, s/,(?! )/","/ (a "," not followed by a space) works.
Your input looks like CSV, though, and if it actually is, you'd be better off parsing it with a real CSV parser rather than with regexes. There are lots of other odd corner cases to worry about.
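Still, if you do go the regex route, the lookahead substitution above could be run like this (a sketch; the file name is a placeholder, and the opening quote on the first field still has to be added separately):
perl -pe 's/,(?! )/","/g' input.txt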
This question is similar to: Replace patterns that are inside delimiters using a regular expression call.
This could work:
s/"([^"]*)"|([^",]+)/"$1$2"/g
Looks like you're using sed.
While your pattern seems to be a little inconsistent, I'm assuming you'd like every item separated by commas to have quotation marks around it. Otherwise, you're looking at areas of computational complexity that regular expressions are not meant to handle.
Through sed, your command would be:
sed 's/[ \"]*,[ \"]*/\", \"/g'
Note that you'll still have to put double quotes at the beginning and end of the string.