Regex appears properly formed, but not outputting expected results in grep

I'm trying to search my codebase for instances of the following pattern:
m_vParts[foo] =
And similar instances with varying whitespace. So I came up with this regex:
m_vParts\[.*\]\s*=[^=]\s*
When I test this at http://gskinner.com/RegExr/ and other regex-tester type sites, it finds exactly what I want. However, when I actually grep (or egrep) I get no results. My guess is my regex isn't well-formed for grep's dialect of regexes, but I'm not sure exactly where I'm off.
Here is the actual command I give:
[e]grep -Irn "m_vParts\[.*\]\s*=[^=]\s*" .
I've tried with both single and double quotes.
Here's a small sample of code that is representative of the codebase:
pcTab->m_vParts[iLastPart] = pcPart;
if ( m_pcCurrentTab->m_vParts[i]== pcPart )
I would expect that the first line would be a match, and the second line would not.
Also, I should note that I'm using GnuWin32 grep on Windows 7 x64.
Thanks in advance for any guidance here; very much trying to avoid the non-automated search :)

Just add quotes around the re:
$ vim 1.txt
$ egrep 'm_vParts\[.*\]\s*=[^=]\s*' 1.txt
pcTab->m_vParts[iLastPart] = pcPart;
As you can see, it works.
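One portability note, offered as a sketch rather than a diagnosis of the GnuWin32 problem: `\s` is a GNU extension, so a grep build that lacks it may not treat it as whitespace. The POSIX class `[[:space:]]` expresses the same idea in extended syntax. The temporary file below is a hypothetical stand-in for 1.txt and holds the two sample lines from the question.

```shell
# Hypothetical sample file with the two lines quoted in the question.
sample=$(mktemp)
printf '%s\n' \
    'pcTab->m_vParts[iLastPart] = pcPart;' \
    'if ( m_pcCurrentTab->m_vParts[i]== pcPart )' > "$sample"

# [[:space:]] replaces \s; the trailing [^=] rejects the "==" comparison.
matches=$(grep -E 'm_vParts\[.*\][[:space:]]*=[^=]' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Only the assignment line should be printed, because `[^=]` needs a character other than `=` right after the first `=`.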

Related

Unable to make the mentioned regular expression to work in sed command

I am trying to make the following regular expression work in a sed command in bash.
^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*))[^>]?$
I know the regular expression is correct and works as I expect, so no help is needed with that. I tested it on an online regular expression tester and it behaves as I want.
Please find the demo of the above regex in here.
My requirement:
I want to enclose every url inside <>. If the url is already enclosed, it should be appended to the result as-is, as can be seen in the above regex link.
Sample Input: (in the file named websites.txt)
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
Expected Output:(in the file named output.txt)
<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>
What I tried in sed:
I'm not well-versed in bash commands, so at first I could not capture the group properly in sed; after reading this answer, I figured out that the parentheses need to be escaped to capture it.
Somewhere I read that look-arounds are not supported in sed (GNU), so I removed the look-arounds too, but that didn't work either. Without look-arounds, I used this regex and it served my purpose.
This is my latest attempt with the sed command:
sed 's#^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()#:%_\+.~#?&/=]*))[^>]?$#<\1>#gm;t;d' websites.txt > output.txt
My exact problem:
How can I make the above command work properly? If you run the sample command above, you'll see that it does not replace the contents properly. It just dumps the contents of websites.txt to output.txt. But in the regex demo linked above it works properly, i.e. it encloses all the unenclosed websites inside <>. Any suggestions would be helpful. I would prefer a sed solution, but if possible, can the above command also be converted to awk? If you can help me with that too, I'd be much obliged. Thanks

After working on it for a long time, I got my sed command to work. Below is the command that worked.
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t' websites.txt > output.txt
You can find the sample implementation of the command in here.
Since the regex already fulfils the requirement of the person I'm writing this for, I only needed help with the command syntax (although any improvements are heartily welcome); I want the command to work with the same regular expression pattern.
Things which I was unaware previously and learnt now:
I didn't know anything about the -E flag. Now I know that -E uses POSIX "extended" syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.
I didn't know for certain that sed doesn't support look-arounds. Now I know it doesn't. Thanks to @dmitri-chubarov. Further reading.
I didn't know that sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.
I didn't know about GNU sed as a specific command line tool. Thanks to @oguzismail for this. Further reading.
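To make the first point concrete, here is a small sketch of the same substitution in basic and in extended syntax. The string `foo=bar` is made up purely for illustration and has nothing to do with the URL regex.

```shell
# Same substitution written twice; the sample string is hypothetical.
# Basic syntax (no -E): capture groups are written \( ... \).
bre=$(printf 'foo=bar\n' | sed 's/\(foo\)=\(bar\)/\2=\1/')

# Extended syntax (-E): capture groups are plain ( ... ).
ere=$(printf 'foo=bar\n' | sed -E 's/(foo)=(bar)/\2=\1/')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands swap the two captured words. Neither syntax defines `(?:...)`, which is why the non-capturing groups had to go.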

With respect to the command in your answer:
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
Here's a few notes:
Your posted sample input has 1 URL per line so AFAIK the gm;t at the end of your sed command is doing nothing useful so either your input is inadequate or your script is wrong.
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower case letters, upper case letters, and digits then you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to use a locale-independent character class or specify the locale you need on your command line depending on your requirements for which characters to match in your regexp.
Like most characters, the character + is literal inside a bracket expression so it shouldn't be escaped - change \+ to just +.
The bracket expression [^<]? means "1 or 0 occurrences of any character that is not a <" and similarly for [^>]? so if your "url" contained random characters at the start/end it'd be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]?.
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match) then we can help you BUT based on a quick google of what a valid URL is it looks like there are several valid URLs that'd be disallowed by your regexp and several invalid ones that'd be allowed so you might want to ask about that in a question tagged with url or similar (with the tags you currently have we can help you implement your regexp but there may be better people to help with defining your regexp).
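As a rough sketch of the `<?` / `>?` suggestion above, the command below uses a deliberately simplified URL pattern, `https?://[^<>]+`. That pattern is an assumption made for illustration, not the regex from the question, and it validates even less than the original does. The input is the sample from the question.

```shell
# Sample input copied from the question.
input=$(printf '%s\n' \
    '// List of all legal urls' \
    'https://www.google.com/' \
    'https://www.fakesite.co.in' \
    'https://www.fakesite.co.uk' \
    '<https://www.fakesite.co.uk>' \
    '<https://www.google.com/>')

# <? and >? accept an optional literal bracket at each end;
# -n with the p flag prints only the lines where the substitution succeeded.
output=$(printf '%s\n' "$input" |
    sed -n -E 's#^<?(https?://[^<>]+)>?$#<\1>#p')
printf '%s\n' "$output"
```

With `-n` and `p`, the comment line is dropped because it never matches, so no separate `t` or `d` command is needed.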

If the input file is just a comment followed by a list of URLs, try:
sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt
Output:
<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>

Regex ignoring linebreaks and "page layout"

I have an assortment of searchable PDF files and I often search particular patterns in all of them simultaneously, using the pdfgrep command. My regex knowledge is somewhat limited and I'm not sure how to work around linebreaks and page layout.
For example, I would like to find the pattern "ignor.{0,10}layout" in each example below:
This is a rather difficult      You see, I would like to ignore
task that I am trying to        page layout and still find the
achieve.                        pattern I am looking for.
This is a rather difficult      This is because I would like to ig-
task that I am trying to        nore page layout and still find the
achieve.                        pattern I am looking for.
In both examples, I would like the first two lines to be reported by
pdfgrep -n "ignor.{0,10}layout" *
but it fails to do so because:
there is a linebreak in the middle.
in the first example, there are more than 10 characters between ignor and layout.
in the second example, ignor is cut in half.
Is there a regex that would solve this problem entirely?

pdfgrep does not have the -z flag, which grep uses to treat the input as NUL-separated instead of newline-separated so that a pattern can span line breaks. As a workaround, pdftotext can convert the PDF to text and write it to STDOUT, which you can pipe into a regular grep call:
pdftotext SPECIFIC-FILE.pdf - | grep -Pzo "(?s)YOUR\s+QUERY"
This makes it impossible to use globbing efficiently, but you can at least iterate the glob:
for pdf in *.pdf; do echo -n "$pdf:"; pdftotext "$pdf" - | grep -Pzo "(?s)YOUR\s+QUERY"; done
Please note that if you want to match whitespace, you will almost always want to use \s+, which also matches newlines when -z is enabled. See this other answer for an explanation of the flags.
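Here is a small sketch of how the -Pzo approach might handle the line break and the hyphenated word from the question. It runs on plain text rather than a real PDF, and the pattern `ig-?\s*nor.{0,10}layout` is an assumption built from the question's own pattern. It only helps when the extracted text arrives in reading order. If the two columns end up side by side on the same line, the words in between would still break the match.

```shell
# Right-hand column of the first example: a line break inside the match.
first=$(printf '%s\n' \
    'You see, I would like to ignore' \
    'page layout and still find the' |
    grep -Pzo '(?s)ig-?\s*nor.{0,10}layout' | tr '\0' '\n')

# Right-hand column of the second example: "ignore" hyphenated across lines.
second=$(printf '%s\n' \
    'This is because I would like to ig-' \
    'nore page layout and still find the' |
    grep -Pzo '(?s)ig-?\s*nor.{0,10}layout' | tr '\0' '\n')

printf '%s\n---\n%s\n' "$first" "$second"
```

`(?s)` lets `.` cross the newline, `-?\s*` tolerates the hyphen and line break, and `tr` turns the NUL that -z appends to each match back into a newline.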

sed and regular expression: unexpected replacement pattern

I am trying to use a small bash script using sed to append a string, but I do not seem to be able to make it work.
I want to append a string to another string pattern:
Strings in input file:
Xabc
Xdef
Desired output:
XabcZ
XdefZ
Here is the script:
#!/bin/bash
instring="$2"
sed -r "s/${instring}/${instring}Z/g" $1
Where $1 is the file name and $2 is the string pattern I am looking for
Then I run the script:
bash script.test.sh test.txt X
output:
XZabc
XZdef
As expected.
but if I use regular expressions:
bash script.test.sh test.txt X...
All I get is:
X...Z
X...Z
So obviously it is not reading it correctly in the replacement part of the command. Same thing if I use X[a-z09] (but there may be "_" in my strings, and I want to include those as well). I had a look at several previous similar topics, but I do not seem to be able to implement any of the solutions correctly (bear with a newbie...). Thank you for your kind help.
EDIT: After receiving the answers from Glenn Jackman (accepted solution) and RavinderSingh13, I would like to clarify two important points for whoever is having a similar issue:
1) Glenn Jackman's solution did not work at first because I needed to convert the text file from DOS to Unix line endings. I tried dos2unix, but for some reason it did not work (maybe I forgot to overwrite the old file with the output?). I later did it using sed -i 's/\r$//' test.txt ; that solved the issue, and Glenn's solution now works. Having a DOS-formatted text file has been the source of a lot of trouble, for me at least.
2) I probably did not make clear that I only wanted to target specific lines in the input files; my example only has target strings, but the actual file has strings that I do not want to edit. That is probably where the misunderstanding with RavinderSingh13's script came from; it actually works, but it targets every single line.
Hope this can help future readers. Thank you, Stackers, you saved the day once again :)

What you have (sed -r "s/${instring}/${instring}Z/g" $1) uses the variable as a pattern on the left-hand side and as plain text on the right-hand side.
What you want to do is:
sed -r "s/${instring}/&Z/g" $1
# ....................^
where the & marker is replaced by whatever text the pattern matched. In the documentation for The s Command:
[T]he replacement can contain unescaped & characters which reference the whole matched portion of the pattern space.
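A minimal sketch of the difference, using a hypothetical temporary file in place of test.txt. The third line, `other`, is added here only to show that non-matching lines pass through unchanged.

```shell
# Hypothetical stand-in for test.txt.
sample=$(mktemp)
printf '%s\n' 'Xabc' 'Xdef' 'other' > "$sample"

instring='X...'   # a regex, as in the failing example from the question

# & expands to the text that actually matched, not to the pattern itself.
result=$(sed -E "s/${instring}/&Z/g" "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

With `${instring}` on the right-hand side the output would contain the literal text `X...Z`. With `&` it contains `XabcZ` and `XdefZ`.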

EDIT: In case you need to pass a regex to the script, the following may help; my previous solution only appended a character to the end of each line.
cat script.ksh
value="$2"
sed "s/$value/&Z/" "$1"
After running the script:
./script.ksh Input_file 'X.*'
XabcZ
XdefZ
After seeing the OP's comment about matching everything that starts with either a lowercase or an uppercase letter, run the script in the following style:
./script.ksh Input_file '[A-Za-z]+*'
My previous solution was the following; please try it and let me know if it helps.
cat script.ksh
value="$2"
sed "s/$/$value/" "$1"
After running script I am getting following output on terminal too.
./script.ksh Input_file Z
XabcZ
XdefZ
You could use sed -i option in above code in case you want to save output into Input_file itself too.

Linux script to parse each line, check the regex and modify the line

I'm trying to write a Linux bash script that takes as input a CSV file with lines in the following format (some fields can be blank):
something,something,,number,something,something,something,something,something,something,,,
something,something.something,,number,something,something,something,something,something,something,,,
and I need the output in the following format (if a line contains a ., split that field into substring1,substring2 and remove one , character; otherwise do nothing):
something,something,,number,something,something,something,something,something,something,,,
something,something,something,number,something,something,something,something,something,something,,,
I tried to parse each line of the file and check whether it matches a regex, but the command starts a never-ending loop (I don't know why), and moreover I don't know how to split the substring to get substring1,substring2 as output:
for f in /filepath/filename.csv
do
while read p; do
if [[$p == .\..]] ; then echo $p; fi
done <$f
done
Thanks in advance!

I can't provide you with a working code at the moment but a piece of quick advice:
1. Try with tool called sed
2. Learn about "capture groups" for regex to get info on how to divide the text based on expressions.

To separate strings AWK will be useful
echo "Hello.world" | awk -F"." '{print "STR1="$1", STR2="$2 }'
Hope it will help.

As your task is more about transforming lines of text than about parsing the fields of a CSV file, sed is indeed the tool to use.
Learning to use sed properly, even for the most basic tasks, goes hand in hand with learning regular expressions. The following invocation of sed transforms your input sample into your expected output:
sed 's/\.\([^,]*\),/,\1/g' input.csv >output.csv
In the above example, s/// is the replacement command.
From the manpage:
s/regexp/replacement/
Attempt to match regexp against the pattern space. If successful,
replace that portion matched with replacement. [...]
Explaining the regexp and replacement of the above command is probably out of the scope for the question, so I'll finish my answer here... Hope it helps!
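For readers who want to see it run, here is a short sketch that feeds the two sample lines from the question through the same substitution. It reads from a pipe instead of input.csv, which is the only change.

```shell
# The two sample lines from the question.
input=$(printf '%s\n' \
    'something,something,,number,something,something,something,something,something,something,,,' \
    'something,something.something,,number,something,something,something,something,something,something,,,')

# \.       a literal dot
# \([^,]*\) the rest of that field, captured as \1
# ,        the comma that ends the field
# The replacement drops the dot and emits ",field", so the empty field that
# followed is filled and the total number of commas stays the same.
output=$(printf '%s\n' "$input" | sed 's/\.\([^,]*\),/,\1/g')
printf '%s\n' "$output"
```

The first line has no dot and passes through untouched. Note that the pattern would also fire on a dot in any other field, for example a decimal number.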

OK, I managed to use a regexp, but the following command still does not work:
sed '\([^,]*\),\([^,]*\)\.\([^,]*\),,\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\),/\1,\2,\3,\4,\5,\6,\7,\8,\9,\10,\11,\12,'
sed: -e expression #1, char 125: unknown command: `\'

Trying to remove version number from a string using sed in OSX

I have what I hope is a simple issue which is stumping me. I need to take an installer file with a name like:
installer_v0.29_linux.run
installer_v10.22_linux_x64.run
installer_v1.1_osx.app
installer_v5.6_windows.exe
and zip it up into a file with the format
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
I already have a bash script running on OSX which does almost everything else I need in the build chain, and was certain I could achieve this with sed using something like:
ZIP_NAME=`echo "$OUTPUT_NAME" | sed -E 's/_(?:\d*\.)?\d+//g'`
That is, replacing the regex _(?:\d*\.)?\d+ with a blank - the regex should match any decimal number preceded by an underscore.
However, I get the error RE error: repetition-operator operand invalid when I try to run this. At this stage I am stumped - I have Googled around this and can't see what I am doing wrong. The regex I wrote works correctly at Regexr, but clearly some element of it is not supported by the sed implementation in OSX. Does anyone know what I am doing wrong?

You can try this sed:
sed 's/_v[^_]*//; s/\.[[:alnum:]]\+$/.zip/' file
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
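A caveat and a possible alternative, as a sketch: `\+` in basic syntax is a GNU extension, so the command above may behave differently with the sed shipped on OS X. Likewise, `(?:...)` and `\d` in the question are Perl-style constructs that POSIX extended syntax does not define. The variant below sticks to -E with bracket expressions. The pattern `_v[0-9.]+` is an assumption about what the version part looks like, based on the four sample names.

```shell
# The four sample file names from the question.
names=$(printf '%s\n' \
    'installer_v0.29_linux.run' \
    'installer_v10.22_linux_x64.run' \
    'installer_v1.1_osx.app' \
    'installer_v5.6_windows.exe')

# First command drops "_v" plus digits and dots; second swaps the extension.
zip_names=$(printf '%s\n' "$names" |
    sed -E 's/_v[0-9.]+//; s/\.[[:alnum:]]+$/.zip/')
printf '%s\n' "$zip_names"
```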

You don't need sed, just some parameter expansion magic with an extended pattern.
shopt -s extglob
zip_name=${OUTPUT_NAME/_v+([^_])/}
The pattern _v+([^_]) matches a string starting with _v and all characters up to the next _. The extglob option enables the use of the +(...) pattern to match one or more occurrences of the enclosed pattern (in this case, a non-_ character). The parameter expansion ${var/pattern/} removes the first occurrence of the given pattern from the expansion of $var.
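A sketch of how this might be extended to produce the final zip name. The `${stem%.*}.zip` step is an addition for illustration and not part of the answer above, and `stem` is a hypothetical variable name. This needs bash, since both `shopt` and `${var/pattern/}` are bash features.

```shell
shopt -s extglob   # bash-only: enables the +(...) pattern

zip_names=$(
    for OUTPUT_NAME in installer_v0.29_linux.run installer_v10.22_linux_x64.run \
                       installer_v1.1_osx.app installer_v5.6_windows.exe; do
        stem=${OUTPUT_NAME/_v+([^_])/}      # remove "_v" plus everything up to the next "_"
        printf '%s\n' "${stem%.*}.zip"      # strip the last extension, then add .zip
    done
)
printf '%s\n' "$zip_names"
```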

Try this way also
sed 's/_[^_]\+//' FileName
OutPut:
installer_linux.run
installer_linux_x64.run
installer_osx.app
installer_windows.exe
If you want to add a .zip extension as well, use the method below (note that the original extension is kept):
sed 's/\([^_]\+\).*\(_.*\).*/\1\2.zip/' Filename
Output :
installer_linux.run.zip
installer_x64.run.zip
installer_osx.app.zip
installer_windows.exe.zip
