Why is this grep <regex> not working?

The following regex with grep doesn't seem to be working:
grep "(?=\_\(\").*?(?=\"\))" ./testfile.js
testfile.js is as follows:
asoijf oaisdjf _("string 1") fodijsasf _("string 2")
fasdoij _("string 3");
console.log(_("string 4"));
My aim is to grab all the strings enclosed in _() function calls, without grep's -P flag (the option doesn't exist for me). Expected output would be:
string 1
string 2
string 3
string 4
Any ideas?
Update: The -P flag was removed from grep in my version of macOS (see "grep -P no longer works, how can I rewrite my searches").

Could you please try the following and let me know if this helps you.
grep -oE '_\([^\)]*' Input_file | cut -c3-
Output will be as follows.
"string 1"
"string 2"
"string 3"
"string 4"
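EDIT: Since the OP's system doesn't have the -P option, here is an awk approach too.
awk '{
# Find each "_(" call on the line, one match at a time
while(match($0,/_\([^)]*/)){
# Print the match without its leading "_(", skipping empty arguments
if(RLENGTH>2){
print substr($0,RSTART+2,RLENGTH-2)
};
# Drop everything up to the end of this match and keep scanning
$0=substr($0,RSTART+RLENGTH)
}
}
' Input_file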

grep -oP '_\("\K[^"]+' inputfile
string 1
string 2
string 3
string 4
Here, -o prints only the matched text, not the whole line.
\K resets the start of the match: everything to the left of \K must still match, but it is left out of the printed result (similar in effect to a lookbehind). [^"]+ means one or more characters other than ".
Or without -P option:
grep -oE '_\("[^"]+' inputfile|cut -d'"' -f2
string 1
string 2
string 3
string 4
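The two-step idea above (isolate each call with grep -o, then trim the delimiters) can also be written so that it matches the closing ") as well. This is a minimal sketch, not taken from the answers: the helper name extract_strings is made up for illustration, and it assumes the strings contain no embedded double quotes.

```shell
# Hypothetical helper: print every string literal passed to _("...")
# using only POSIX ERE, so grep -P is not required.
extract_strings() {
  # grep -o isolates each complete _("...") call, one per output line;
  # sed then strips the leading _(" and the trailing ").
  grep -oE '_\("[^"]*"\)' "$@" | sed -e 's/^_("//' -e 's/")$//'
}

# Usage: pass a file name, or pipe text in when no file is given.
printf '%s\n' 'asoijf _("string 1") fodijsasf _("string 2")' \
              'console.log(_("string 4"));' | extract_strings
```

Requiring the closing ") means an unterminated call is skipped instead of being printed up to the end of the line.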

Related

Regex pattern for quoted numbers and commas

I'm trying to find the correct regex to search a file for double quoted numbers separated by a comma. For example I'm trying to find "27,422,734" and then replace it in a text editor to correct the comma to be every 4 digits, so the end result would be "2742,2734".
I've tried a few examples I found on SO but none are helping me with this scenario like
"[^"]+"
'\d+'
While the above do find matches, I don't know how to deal with the commas or what to replace the match with.
Thanks for any help!
I found an even shorter solution (works with gnu-sed):
colonmv () {
echo "$@" | sed 's/,//g' | sed -r ':a;s/\B[0-9]{4}\>/,&/;ta'
}
But beware: the first sed command eats every comma, not just those between digits, so improve it or filter your input beforehand.
The second command uses the :a trick.
It matches 4 digits that end a word (\>) and are preceded by another word character (\B), and replaces them with a comma plus the same 4 digits. Whenever a replacement took place, ta jumps back to :a and the substitution is repeated.
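The same label-and-branch loop can be sketched without the GNU-only \B and \> anchors. The following is an illustration under assumptions, not the answer's code: the name regroup4 is hypothetical, sed -E is assumed to be available, and only commas sitting directly between two digits are removed.

```shell
# Hypothetical helper: regroup digit runs into groups of 4 from the right.
regroup4() {
  # Loop "strip": remove one comma between two digits, repeat until none remain.
  # Loop "group": find a digit followed by exactly the last 4 digits of a run
  # (the run ends at a non-digit or at end of line) and put a comma between
  # them; repeat until no run of 5 or more digits remains.
  sed -E -e ':strip' -e 's/([0-9]),([0-9])/\1\2/' -e 't strip' \
         -e ':group' -e 's/([0-9])([0-9]{4})([^0-9]|$)/\1,\2\3/' -e 't group'
}

printf '%s\n' '"27,422,734"' | regroup4    # prints "2742,2734"
```

Each t command branches back only when the preceding substitution changed something, which is what makes the loop terminate.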
Now, let's see colonmv in the wild:
colonmv '"A 3-grouped, pretty long number: 5,127,422,734 and an ungrouped one 5678905567789065778"'
"A 3-grouped pretty long number: 51,2742,2734 and an ungrouped one 567,8905,5677,8906,5778"
There might be a better way of doing this, but I propose the following approach:
INPUT:
$ cat to_transform.txt
abc "27,422,734" def"27,422,734" def
ltu "123,734" abc "345,678,123,734" vtu
xtz "345,678,123,734" vtu "345,678,123,734"
u "1" a
"123"
iu"abc"a "123,734"
CMD:
$ paste -d' ' <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) <(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt | sed -e 's/,//g;:loop s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g; s/,,/,/g; /\([0-9]\{5\}\)/b loop') | awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'
OUTPUT:
$ cat to_transform.txt
abc "2742,2734" def"2742,2734" def
ltu "12,3734" abc "3456,7812,3734" vtu
xtz "3456,7812,3734" vtu "3456,7812,3734"
u "1" a
"123"
iu"abc"a "12,3734"
CODE DETAILS AND EXPLANATIONS:
<(grep -oP '(?<=")(:?\d+,\d+)+(?=")' to_transform.txt) will extract each number to be processed from the input file, the regex used here use lookbehind/lookahead to enforce the surrounded by quotes condition, (:?\d+,\d+)+ is used to extract the numbers like 27,422,734.
The sed command then takes the output of the grep command and performs the following operations:
SED DETAILS:
s/,//g #remove all , in the number
:loop #create a label to loop
s/\([0-9]\{4\}\)\($\|,\)/\2,\1/g #add a comma before every group of 4 digits, starting from the end of the string or from the most recently added comma
s/,,/,/g #remove duplicate commas added by the previous step, if any
/\([0-9]\{5\}\)/b loop #if there are at least 5 digits present successively in the string loop and continue the processing.
Temporary output after the paste operation:
27,422,734 2742,2734
27,422,734 2742,2734
123,734 12,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
345,678,123,734 3456,7812,3734
123,734 12,3734
Last but not least, the awk command reads this paste output and runs a sed command per line to replace each element of the first column with the corresponding value in the second column: awk '{cmd="sed -i 0,/"$1"/s/" $1 "/" $2 "/ to_transform.txt"; system(cmd)}'.
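For comparison, the same transformation can be sketched in a single awk pass that avoids grep -P and one sed -i call per number. This is an alternative illustration, not the approach described above: the name regroup_quoted is hypothetical, the result goes to stdout instead of being edited in place, and double quotes are assumed to be balanced on every line.

```shell
# Hypothetical helper: regroup only numbers that are enclosed in double quotes.
regroup_quoted() {
  awk 'BEGIN { FS = OFS = "\"" }
  {
    # With the quote as separator, even-numbered fields are the quoted parts.
    for (i = 2; i <= NF; i += 2) {
      if ($i ~ /^[0-9]+(,[0-9]+)+$/) {
        n = $i
        gsub(/,/, "", n)              # drop the old grouping
        out = ""
        while (length(n) > 4) {       # peel off 4 digits from the right
          out = "," substr(n, length(n) - 3) out
          n = substr(n, 1, length(n) - 4)
        }
        $i = n out
      }
    }
    print
  }' "$@"
}

printf '%s\n' 'ltu "123,734" abc "345,678,123,734" vtu' | regroup_quoted
```

Quoted values without a comma, such as "1", "123" or "abc", do not match the test and are left untouched.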
Precondition: Your input conforms to "[0-9,]*" and is a "#,###"-format correct number.
#!/bin/bash
colonmv () {
echo $1 | sed -r 's/,([0-9]{3})+/\1/g;' | \
rev | sed -r 's/[^0-9]?([0-9]{4})/\1,/g;s/,"$/"/;s/.*/"&/' | rev
}
colonmv '"734"'
colonmv '"2,734"'
colonmv '"22,734"'
colonmv '"422,734"'
colonmv '"7,422,734"'
colonmv '"27,422,734"'
colonmv '"127,422,734"'
colonmv '"5,127,422,734"'
Test:
colonmv.sh
"734""
"2734"
"2,2734"
"42,2734"
"742,2734"
"2742,2734"
"1,2742,2734"
"51,2742,2734"

How do I grab the part of this string after a `/`?

Here is my Bash code:
echo "Some string/Another string" | grep -o "\/.*"
This returns /Another string.
But I do not want the / included in the value returned by echo.
How do I change the regex to accomplish this?
EDIT: I want to match everything after the /, no matter what is after it. "Another string" is not always after the /.
If you have GNU grep with PCRE support, you can use \K to discard everything matched so far from the output.
$ echo "Some string/Another string" | grep -oP "\/\K.*"
Another string
With sed :
$ sed 's/.*\/\(.*\)/\1/' <<< "Some string/Another string"
Another string
Because .* is greedy, this matches everything up to the last /, then captures and prints the characters that follow it.
It may be more readable in ERE mode (-r option with GNU sed) and with another separator:
sed -r 's|.*/(.*)|\1|'
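When the input contains more than one /, the greedy .* keeps only what follows the last one. A minimal sketch of the difference, with hypothetical function names and an input string made up for illustration:

```shell
# Hypothetical helpers contrasting "after the last /" and "after the first /".
after_last_slash() {
  # Greedy .* runs to the LAST slash, so only the final component survives.
  sed 's|.*/||'
}

after_first_slash() {
  # "No slashes, then one slash" anchored at the start removes only the
  # text up to and including the FIRST slash.
  sed 's|^[^/]*/||'
}

s='Some string/Another string/abc'
printf '%s\n' "$s" | after_last_slash     # prints: abc
printf '%s\n' "$s" | after_first_slash    # prints: Another string/abc
```

For the single-slash example in the question both variants give the same result.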
With parameter expansion:
$ string='Some string/Another string'
$ echo "${string#*/}"
Another string
The expansion with # removes the shortest prefix matching the pattern that follows it (here */, i.e. everything up to and including the first /) from the expanded parameter.
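The # operator has three siblings that follow the same idea. A short sketch of all four, using a made-up string with two slashes so that the differences are visible:

```shell
string='Some string/Another string/abc'

after_first=${string#*/}     # shortest prefix matching */ removed
after_last=${string##*/}     # longest prefix matching */ removed
before_last=${string%/*}     # shortest suffix matching /* removed
before_first=${string%%/*}   # longest suffix matching /* removed

printf '%s\n' "$after_first" "$after_last" "$before_last" "$before_first"
```

This prints Another string/abc, abc, Some string/Another string and Some string, one per line. No external process is started, which makes this the cheapest option inside a shell script.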
With awk:
$ awk -F/ '{print $2}' <<< "$string"
Another string
This sets the field separator to / and prints the second field.
You can do this with cut command:
If you want string between first and second occurrence of /
cut -d '/' -f 2 <<< "Some string/Another string/abc"
output: Another string
If you want entire string after first occurrence of /
cut -d '/' -f 2- <<< "Some string/Another string/abc"
output: Another string/abc

How to match and keep the first number in a line using sed?

Question
Let's say I have one line of text with a number placed somewhere (it could be at the beginning, in the middle or at the end of the line).
How to match and keep the first number found in a line using sed?
Minimal example
Here is my attempt (following this page of a tutorial on regular expressions) and the output for different positions of the number:
$ echo "SomeText 123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$ echo "123SomeText" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
$ echo "SomeText 123" | sed 's:.*\([0-9][0-9]*\).*:\1:'
3
As you can see, only the last digit is kept in the process, whereas the desired output should be 123.
Using sed:
echo "SomeText 123SomeText 456" | sed -r 's/^[^0-9]*([0-9]+).*$/\1/'
123
You can also do this in gnu awk:
echo "SomeText 123SomeText 456" | awk '{print gensub(/^[^0-9]*([0-9]+).*$/, "\\1", 1)}'
123
To complement the sed solutions, here's an awk alternative (assuming that the goal is to extract the 1st number on each line, if any (i.e., ignore lines without any numbers)):
awk -F'[^0-9]*' '/[0-9]/ { print ($1 != "" ? $1 : $2) }'
-F'[^0-9]*' defines any sequence of non-digit chars. (including the empty string) as the field separator; awk automatically breaks each input line into fields based on that separator, with $1 representing the first field, $2 the second, and so on.
/[0-9]/ is a pattern (condition) that ensures that output is only produced for lines that contain at least one digit, via its associated action (the {...} block) - in other words: lines containing NO number at all are ignored.
{ print ($1!="" ? $1 : $2) } prints the 1st field, if nonempty, otherwise the 2nd one; rationale: if the line starts with a number, the 1st field will contain the 1st number on the line (because the line starts with a field rather than a separator); otherwise, it is the 2nd field that contains the 1st number (because the line starts with a separator).
You can also use grep, which is ideally suited to this task. sed is a Stream EDitor, which is only going to indirectly give you what you want. With grep, you only have to specify the part of the line you want.
$ cat file.txt
SomeText 123SomeText
123SomeText
SomeText 123
$ grep -o '[0-9]\+' file.txt
123
123
123
grep -o prints only the matching parts of a line, each on a separate line. The pattern is simple: one or more digits.
If your version of grep is compatible with the -P switch, you can use Perl-style regular expressions and make the command even shorter:
$ grep -Po '\d+' file.txt
123
123
123
Again, this matches one or more digits.
Using grep is a lot simpler and has the advantage that if the line doesn't match, nothing is printed:
$ echo "no number" | grep -Po '\d+' # no output
$ echo "yes 123number" | grep -Po '\d+'
123
edit
As pointed out in the comments, one possible problem is that this prints every matching number, not only the first one on the line. As far as I'm aware, grep -o cannot be limited to the first match per line.
In that case, I'd go with perl:
perl -lne 'print $1 if /.*?(\d+).*/'
This uses lazy matching (the question mark), so the .* at the start of the pattern consumes as few characters as possible, which here means only the non-digit characters before the first number. $1 holds the first capture group, like \1 in sed. If there is more than one number on the line, this only prints the first. If there aren't any at all, it doesn't print anything:
$ echo "no number" | perl -ne 'print "$1\n" if /.*?(\d+).*/'
$ echo "yes123number456" | perl -lne 'print $1 if /.*?(\d+).*/'
123
If for some reason you still really want to use sed, you can do this:
sed -n 's/^[^0-9]*\([0-9]\{1,\}\).*$/\1/p'
Unlike the other answers, this is compatible with all versions of sed and will only print lines that contain a match.
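To see why the attempt in the question keeps only the last digit while this command keeps the whole first number, the two can be run side by side. A small sketch; the function names greedy_attempt and first_number are made up for illustration:

```shell
# The attempt from the question: the leading .* is greedy and swallows
# as much as it can, leaving only one digit for the group.
greedy_attempt() {
  sed 's:.*\([0-9][0-9]*\).*:\1:'
}

# Hypothetical wrapper around the command above: the leading [^0-9]*
# cannot consume digits, so the group starts at the first digit of the
# first number; -n with the p flag drops lines without any number.
first_number() {
  sed -n 's/^[^0-9]*\([0-9]\{1,\}\).*$/\1/p' "$@"
}

echo "SomeText 123SomeText" | greedy_attempt    # prints: 3
printf '%s\n' 'SomeText 123SomeText 456' 'no number' | first_number    # prints: 123
```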
Try this sed command,
$ echo "SomeText 123SomeText" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123
Another example,
$ echo "SomeText 123SomeText 456" | sed -r '/[^0-9]*([0-9][0-9]*)[^0-9]*/ s//\1 /g'
123 456
It prints all the numbers in a file and the captured numbers are separated by spaces while printing.

bash sed/grep extract text between 2 words

My problem is the same as the one here, except I only want the first occurrence and want to ignore all the rest:
How to use sed/grep to extract text between two words?
Suppose the input in that example were:
input: "Here is a String Here is a String"
But I only care about the first "is"
echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
output: "is a String Here is a"
Is this even possible with grep? I could use sed as well for the job.
Thanks
Your regexp happens to be matching the longest string that sits between "Here " and " String". That is, indeed, "is a String Here is a". This is the default (greedy) behaviour of the * quantifier.
$ echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*(?= String)'
is a String Here is a
If you want to match the shortest, put a ? after the * quantifier to make it lazy (non-greedy):
$ echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*?(?= String)'
is a
is a
To get the first word you can use grep -o '^[^ ]*':
echo "Here is a String Here is a String" | grep -Po '(?<=(Here )).*(?= String)' | grep -o '^[^ ]*'
And you can pipe grep to grep multiple times to compose simple commands into complex ones.
sed 's/ String.*//;s/.*Here //'
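If neither PCRE nor regex quoting is wanted, the first occurrence can also be located with plain string searches. A minimal sketch, assuming the two markers are literal strings without backslashes (awk -v interprets escape sequences); the name between_first is hypothetical:

```shell
# Hypothetical helper: print the text between the first occurrence of the
# start marker ($1) and the next occurrence of the end marker ($2).
between_first() {
  awk -v s="$1" -v e="$2" '{
    i = index($0, s)                  # first start marker on the line
    if (i == 0) next
    rest = substr($0, i + length(s))  # text after the start marker
    j = index(rest, e)                # first end marker after that
    if (j == 0) next
    print substr(rest, 1, j - 1)
  }'
}

echo "Here is a String Here is a String" | between_first "Here " " String"
```

This prints "is a" once, because index() always returns the first occurrence and nothing after the first end marker is examined.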

sed: print only matching group

I want to grab the last two numbers (one int, one float; followed by optional whitespace) and print only them.
Example:
foo bar <foo> bla 1 2 3.4
Should print:
2 3.4
So far, I have the following:
sed -n 's/\([0-9][0-9]*[\ \t][0-9.]*[\ \t]*$\)/replacement/p'
will give me
foo bar <foo> bla 1 replacement
However, if I try to replace it with group 1, the whole line is printed.
sed -n 's/\([0-9][0-9]*[\ \t][0-9.]*[\ \t]*$\)/\1/p'
How can I print only the section of the line that matches the regex in the group?
Match the whole line, so add a .* at the beginning of your regex. This causes the entire line to be replaced with the contents of the group.
echo "foo bar <foo> bla 1 2 3.4" |
sed -n 's/.*\([0-9][0-9]*[\ \t][0-9.]*[ \t]*$\)/\1/p'
2 3.4
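One caveat with a leading .* is that it is greedy: if the integer had more than one digit (say 12), the .* would swallow all but its last digit. A hedged sketch of one way around this, assuming sed -E is available; the name last_two_numbers is made up, and the rule "a non-digit must precede the integer" is an assumption about the input:

```shell
# Hypothetical helper: print the trailing "int float" pair of each line.
last_two_numbers() {
  # Group 1 (optional) is everything before the integer and must end in a
  # non-digit, so the greedy .* cannot eat leading digits of the integer.
  sed -n -E 's/^(.*[^0-9])?([0-9]+[[:space:]][0-9.]*[[:space:]]*)$/\2/p' "$@"
}

echo "foo bar <foo> bla 1 2 3.4"  | last_two_numbers    # prints: 2 3.4
echo "foo bar <foo> bla 1 12 3.4" | last_two_numbers    # prints: 12 3.4
```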
grep is the right tool for extracting.
using your example and your regex:
kent$ echo 'foo bar <foo> bla 1 2 3.4'|grep -o '[0-9][0-9]*[\ \t][0-9.]*[\ \t]*$'
2 3.4
And for yet another option, I'd go with awk!
echo "foo bar <foo> bla 1 2 3.4" | awk '{ print $(NF-1), $NF; }'
This will split the input (I'm using STDIN here, but your input could easily be a file) on whitespace, and then print out the last-but-one field, and then the last field. The NF variable holds the number of fields found after splitting, so $NF is the last field and $(NF-1) is the one before it.
The benefit of this is that it doesn't matter if what precedes the last two fields changes, as long as you only ever want the last two it'll continue to work.
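A short sketch of how NF, $NF and $(NF-1) relate on the example line. The function name last_two_fields and the NF >= 2 guard are additions for illustration, so that lines with fewer than two fields are skipped:

```shell
# Show the field count and the last field for the example line.
echo "foo bar <foo> bla 1 2 3.4" | awk '{ print NF; print $NF }'

# Hypothetical helper: print the last two whitespace-separated fields.
last_two_fields() {
  awk 'NF >= 2 { print $(NF-1), $NF }' "$@"
}

echo "foo bar <foo> bla 1 2 3.4" | last_two_fields
echo "x y z 10 20.5"             | last_two_fields
```

The first command prints 7 and 3.4; the helper prints 2 3.4 and 10 20.5, regardless of how many fields precede the last two.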
The cut command is designed for this exact situation. It will "cut" on any delimiter and then you can specify which chunks should be output.
For instance:
echo "foo bar <foo> bla 1 2 3.4" | cut -d " " -f 6-7
Will result in output of:
2 3.4
-d sets the delimiter
-f selects the range of 'fields' to output, in this case, it's the 6th through 7th chunks of the original string. You can also specify the range as a list, such as 6,7.
I agree with @kent that this is well suited for grep -o. If you need to extract a group within a pattern, you can do it with a 2nd grep.
# To extract \1 from /xx([0-9]+)yy/
$ echo "aa678bb xx123yy xx4yy aa42 aa9bb" | grep -Eo 'xx[0-9]+yy' | grep -Eo '[0-9]+'
123
4
# To extract \1 from /a([0-9]+)b/
$ echo "aa678bb xx123yy xx4yy aa42 aa9bb" | grep -Eo 'a[0-9]+b' | grep -Eo '[0-9]+'
678
9
I generally cringe when I see 2 calls to grep/sed/awk piped together, but it's not always wrong. While we should exercise our skills of doing things efficiently, "A foolish consistency is the hobgoblin of little minds", and "Real artists ship".