AND statement using grep/regular expressions - regex

I'm using grep to search for text within a specific directory. I would like to return rows of text that contain stringA AND stringB.
I know that doing grep -E "stringA|stringB" is effectively an OR statement; is there something I can do, maybe using regex, that would allow me to run an AND statement?
Many thanks

If you don't know the order of the items, you could always reverse the pattern using | (note the -E: an unescaped | is matched literally in grep's default BRE mode):
grep -E '1st pattern.*2nd pattern|2nd pattern.*1st pattern' foofile
This works great with two items; with three or more, the number of permutations grows quickly and things would start slowing down for sure...
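For example (a quick sketch; foofile and the strings are placeholder names):
printf 'stringA then stringB\nstringB then stringA\nonly stringA\n' > foofile
grep -E 'stringA.*stringB|stringB.*stringA' foofile
That prints the first two lines, whichever order the two strings appear in.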

You can pipe through two greps:
... | grep "stringA" | grep "stringB"
Note that if your patterns are actually fixed strings and not regular expressions then you can use fgrep (equivalently, grep -F) instead of grep.
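A quick sketch of both variants (test.txt and the strings are placeholder names):
printf 'has stringA and stringB\nhas only stringA\n' > test.txt
grep 'stringA' test.txt | grep 'stringB' # AND via two greps
grep -F 'stringA' test.txt | grep -F 'stringB' # same, with the patterns taken as fixed strings
Both pipelines print only the first line.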

"stringA.*stringB"
This will locate stringA with any number of any characters between stringB at last
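For example (a quick sketch; the input lines are made up):
echo 'stringA then stringB' | grep 'stringA.*stringB' # matches
echo 'stringB then stringA' | grep 'stringA.*stringB' # no match: the order is fixed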

Related

Regular Expression between strings (multiple results?)

I am using a regular expression to filter a link from an HTML page like so:
(?<=data-ng-non-bindable data-src=\")(.*?)(?=\" data-caption)
How do I change it so that I get multiple results, not only the first one?
With sed you replace strings rather than extract them. There are options you can set to output only the replaced substrings, but multiple matches on the same line are always a big problem.
Because of this, the easiest approach is grep with the -o and -P options:
grep -oP '(?<=data-ng-non-bindable data-src=").*?(?=" data-caption)' file > outfile
Double quotation marks are not special inside the pattern, so they don't need to be escaped.
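A quick sketch with a made-up input line (grep must be built with PCRE support for -P to work):
$ echo '<a data-ng-non-bindable data-src="img/one.png" data-caption="x"> <a data-ng-non-bindable data-src="img/two.png" data-caption="y">' > file
$ grep -oP '(?<=data-ng-non-bindable data-src=").*?(?=" data-caption)' file
img/one.png
img/two.png
Because -o prints each match on its own line, you get all the results, not just the first.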

Select a single character in an alphanumeric string in bash

I have an issue with string manipulation in bash. I have a list of names, each name being composed of two parts, chars and numbers: for example
abcdef01234
I want to cut the last character before the numeric part starts, in this case
f
I think there is a regular expression to help me with this but just can't figure it out. AWK/sed solutions are accepted too. Hope someone can help.
Thank you.
In bash it can be done with parameter expansion (suffix pattern removal) and a negative substring index, e.g.:
a=abcdef01234 # your string
tmp=${a%%[0-9]*} # strip the longest suffix that starts with a digit
echo ${tmp:(-1)} # print the last of the remaining chars
Output: f
You can use a regexp like [a-zA-Z]+([a-zA-Z])[0-9]+. If you know how to use sed, it's pretty easy.
Check https://regex101.com/r/XCkKM5/1
The first capturing group will be the letter you want.
^\w+([a-zA-Z])\d+$
As a sed command (on OSX) this will be:
echo "abcdef01234" | sed -E 's#^[a-zA-Z]+([a-zA-Z])[0-9]+$#\1#'
Try the following too:
echo "abcdef01234" | awk '{match($0,/^[a-zA-Z]+/); print substr($0,RSTART+RLENGTH-1,1)}'
match() sets RSTART and RLENGTH, so substr picks out the last character of the leading run of letters.
The list of names is, I assume, in a file named file. Using grep's PCRE (-P) and (positive) lookahead:
$ grep -oP "[a-z](?=[^a-z])" file
f
It prints every (lowercase) letter that is immediately followed by a non-(lowercase)-letter character; for this input that is just the f.

How to replace arbitrary combinations of (special) characters and numbers using sed and regular expressions

I have a csv file with nearly arbitrarily filled columns like this:
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03,123456789,"bla::38594f-47849-h945f",""
and now I want to replace the comma between the two numbers with a point:
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03.123456789,"bla::38594f-47849-h945f",""
I tried a lot but nothing helped. :-(
sed 's/[0-9],[0-9]/./g' data.csv
works, but it deletes the two digits before and after the comma. So I tried things like
sed 's/\(\.[0-9]\),\([0-9]\.\)/\1.\2/g' data.csv
but that changed nothing.
Try with s/\([0-9]\),\([0-9]\)/\1.\2/g:
$ echo '"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03,123456789,"bla::38594f-47849-h945f",""' | sed 's/\([0-9]\),\([0-9]\)/\1.\2/g'
"bla","","blabla","bla::bla::blabla",19.05.16 12:00:03.123456789,"bla::38594f-47849-h945f",""
You don't really need the additional dot \. in the capturing groups.
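One caveat worth knowing (a quick sketch; it does not affect the sample row above): the g flag never revisits characters it has already consumed, so in a run like 1,2,3 the second comma is missed and a second pass is needed:
$ echo '1,2,3' | sed 's/\([0-9]\),\([0-9]\)/\1.\2/g'
1.2,3
$ echo '1,2,3' | sed 's/\([0-9]\),\([0-9]\)/\1.\2/g; s/\([0-9]\),\([0-9]\)/\1.\2/g'
1.2.3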

String simplification for regex (BASH)

I'm looking for a way to simplify multiple strings for the purpose of regular expression searching. Here's an example:
I have a list of several thousand strings, similar to the ones below (text.#######):
area.202264
area.202265
area.202266
area.202267
area.202268
area.202269
area.202270
area.204517
area.204518
area.204519
area.207171
area.207338
area.208842
I've been trying to figure out an automated way to simplify it into something like this:
area.20226(4|5|6|7|8|9)|area.202270|area.20451(7|8|9)|area.207171|area.207338|area.208842
The purpose of this would be to reduce string length when searching these areas. I have absolutely no idea how to approach something like this in a simple, re-usable way.
Thanks in advance! Any solutions or tips on where to start would be appreciated :)
echo "area.202264 area.202265 area.202266 area.202267 area.202268 area.202269 area.202270 area.204517 area.204518 area.204519 area.207171 area.207338 area.208842" | tr ' ' '\n' > list.txt
grep -v "^$" list.txt |         # drop empty lines
  sed -e "s/[0-9] *$//" |       # strip the final digit of each name
  sort -u |                     # one entry per prefix
  while read p; do
    l=$(grep "$p" list.txt | sed -e "s/.*\([0-9]\)$/\1/" | xargs | tr ' ' '|')
    echo "$p($l)"               # prefix(digit|digit|...)
  done |
  sed -e "s/(\(.\))/\1/g" |     # unwrap single-alternative groups
  xargs | tr ' ' '|'            # join the pieces with |
Put the search strings into a file named "filter", one per line:
area.202264
area.202265
area.202266
area.202267
then you can search fast enough with
fgrep -f filter file-to-search-in
I see no easy way to produce a regexp from the samples, and I'm not sure the regexp approach will be faster.
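A minimal end-to-end sketch (filter and data.txt are placeholder names):
$ printf 'area.202264\narea.204517\n' > filter
$ printf 'row with area.202264\nrow with area.999999\n' > data.txt
$ fgrep -f filter data.txt
row with area.202264
fgrep is the same as grep -F: each line of the filter file is matched as a fixed string, not as a regex.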
Here are a couple of things you should know:
Nearly all regex engines build a state machine from their patterns. You can probably just put the various names between vertical bars and get good performance. (It won't look nice, but it will work.)
That is, something like:
(area.202264|area.202265|area.202266|...|area.207338|area.208842)
Even with 4k items, the right engine will just compile it down. (I don't think bash will handle it, because of the length. But perl, grep, fgrep as mentioned elsewhere can do it.)
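For instance, you can build that big alternation straight from the list instead of typing it out (a sketch; list.txt holds one name per line, and the sed step escapes the dots so they stay literal):
pattern=$(sed 's/\./\\./g' list.txt | paste -sd'|' -)
grep -E "($pattern)" file-to-search-in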
You say "BASH", so it's worth pointing out there is a difference between regex and file globbing. If the things you are working with are text, then regex (^area.\d+$) is the way to go. If the things you are working with are filenames, then globbing (*.c) has different rules.
You can simplify greatly if you don't care at all about the numbers, only the format. For regexes:
area\.\d+ # area, dot, one or more digits (0-9)
area\.\d{1,6} # area, dot no less than 1, no more than 6 digits
area\.\d{6} # area, dot, exactly 6 digits
area\.20[234]\d{3} # area, dot, 20 {2,3,4} then 3 more digits
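Note that \d is a PCRE shorthand: grep needs -P for it, or spell it as [0-9] with -E (a quick sketch):
grep -P '^area\.\d{6}$' list.txt # needs a grep built with PCRE support
grep -E '^area\.[0-9]{6}$' list.txt # POSIX ERE equivalent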
If you can use Perl and the Regexp::Assemble module, it can convert multiple patterns into a single, optimized regular expression. For instance, using it on the list of strings in the question yields:
(?-xism:area\.20(?:22(?:6[456789]|70)|7(?:171|338)|451[789]|8842))
That only works, of course, if whatever ends up consuming the pattern accepts Perl-compatible regular expressions.
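A minimal sketch of driving it from the shell, assuming the strings sit in list.txt one per line and the module is installed:
perl -MRegexp::Assemble -ne '
  BEGIN { $ra = Regexp::Assemble->new }
  chomp;
  $ra->add(quotemeta $_);   # quotemeta keeps the dots literal
  END { print $ra->as_string, "\n" }
' list.txt
as_string prints just the assembled pattern; the (?-xism:...) wrapper above comes from stringifying the compiled regex that the re method returns.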

How can I delete all lines that do not begin with certain characters?

I need to figure out a regular expression to delete all lines that do not begin with either "+" or "-".
I want to print a paper copy of a large diff file, but it shows 5 or so lines before and after the actual diff.
In VIM:
:g!/^[+-]/d
Here is the English translation:
globally (g) do something to all lines that do NOT (!) match the regular expression: start of line (^) followed by either + or - ([+-]); and that something to do is to delete (d) those lines.
sed -e '/^[+-]/!d'
(The ! negates the address, which also catches empty lines; matching /^[^+-]/ instead would leave empty lines in place, since they have no first character for the bracket expression to test.)
diff -u <some args here> | grep '^[+-]'
Or you could just not produce the extra lines at all:
diff --unified=0 <some args>
sed '/^[+-]/!d' your_diff_file
egrep "^[+-]" difffile >outputfile
Instead of deleting everything that doesn't match you show only lines that match. :)
If you need to do something more complex in terms of regular expressions, this site can help:
http://txt2re.com/
It also provides code examples for many different languages.
%!grep -E '^[+-]'
does it inline on the current file, keeping only the matching lines, and can be considerably faster than :v for large files.
Tested on Vim 7.4, Ubuntu 14.04, with a 1M-line log file.
Related: delete all lines that do not contain a certain word: https://superuser.com/questions/265085/vim-delete-all-lines-that-do-not-contain-a-certain-word/1187212#1187212