How can I use grep to search for two regexes at the same time? Say I am looking for "My name is" and "my bank account" in a text like:
My name is Mike. I'm 16 years old.
I have no clue how to solve my grep problem,but
if I manage to solve it, then I'll transfer
you some money from my bank account.
I'd like grep to return:
My name is
my bank account
Is it possible to do it with just one grep call or should I write a script to do that for me?
If you do not care about a trailing newline, simply use grep:
< file.txt grep -o "My name is\|my bank account" | tr '\n' ' '
If you would prefer a trailing newline, use awk (note that a regex RS and the RT variable are GNU awk features):
awk -v RS="My name is|my bank account" 'RT != "" { printf "%s ", RT } END { printf "\n" }' file.txt
I'm not quite sure what you're after. The result you give doesn't seem to fit with anything grep can/will do. In particular, grep is line oriented, so if it finds a match in a line, it includes that entire line in the output. Assuming that's what you really want, you can just OR the two patterns together:
grep -E 'My name is|my bank account' file.txt
Given the input above, this should produce:
My name is Mike. I'm 16 years old.
you some money from my bank account.
Alternatively, since you haven't included any meta-characters in your patterns, you could use fgrep (or grep -F) and put your patterns in a file, one per line. For two patterns this probably doesn't make a big difference, but if you want to look for lots of patterns, it'll probably be quite a bit faster (it uses the Aho-Corasick string search to search for all the patterns at once instead of searching for them one at a time).
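For example (a small sketch; patterns.txt is a made-up file name):
printf '%s\n' 'My name is' 'my bank account' > patterns.txt
grep -F -f patterns.txt file.txt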
The other possibility would be that you're looking for a single line that includes both my name is and my bank account. That's what @djechlin's answer would do. From the input above, that would produce no output, so I doubt it's what you want, but if it is, his answer is fairly reasonable. An alternative would be a pattern like 'My name is.*my bank account|my bank account.*My name is' (with grep -E) to allow the two phrases in either order.
Yes, it is possible. I used sed; you can replace S1 and S2 with whatever you want:
sed '/S1/{ s:.*:S1:;H};/S2/{ s:.*:S2:;H};${x;s:\n: :g;p};d'
sed is much more complex than grep, and in this case I used it to simulate the grep behaviour you want.
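With the question's strings and file name substituted in, that becomes:
sed '/My name is/{ s:.*:My name is:;H};/my bank account/{ s:.*:my bank account:;H};${x;s:\n: :g;p};d' file.txt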
For AND, pipe one grep into another: grep expr1 file | grep expr2
For OR, use egrep: egrep '(expr1|expr2)' file
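With the question's strings, for example:
grep 'My name is' file.txt | grep 'my bank account'    # AND: lines containing both
grep -E 'My name is|my bank account' file.txt          # OR: lines containing either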
I'm looking for a way to simplify multiple strings for the purpose of regular expression searching. Here's an example:
I have a list of several thousand strings, similar to the ones below (text.#######):
area.202264
area.202265
area.202266
area.202267
area.202268
area.202269
area.202270
area.204517
area.204518
area.204519
area.207171
area.207338
area.208842
I've been trying to figure out an automated way to simplify it into something like this:
area.20226(4|5|6|7|8|9)|area.202270|area.20451(7|8|9)|area.207171|area.207338|area.208842
The purpose of this would be to reduce string length when searching these areas. I have absolutely no idea how to approach something like this in a simple, re-usable way.
Thanks in advance! Any solutions or tips on where to start would be appreciated :)
echo "area.202264 area.202265 area.202266 area.202267 area.202268 area.202269 area.202270 area.204517 area.204518 area.204519 area.207171 area.207338 area.208842" | tr ' ' '\n' > list.txt
grep -v '^$' list.txt | sed -e 's/[0-9] *$//g' | sort -u | while read -r p; do
  l=$(grep "$p" list.txt | sed -e 's/.*\([0-9]\)$/\1/g' | xargs | tr ' ' '|')
  echo "$p($l)"
done | sed -e 's/(\(.\))/\1/g' | xargs | tr ' ' '|'
Put the search strings into a file named "filter", one per line:
area.202264
area.202265
area.202266
area.202267
then you can search fast enough with:
fgrep -f filter file-to-search-in
I see no easy way to produce a regexp from samples, and I'm not sure the regexp approach will be faster.
Here are a couple of things you should know:
Nearly all regex engines build a state machine from their patterns. You can probably just put the various names between vertical bars and get good performance. (It won't look nice, but it will work.)
That is, something like:
(area.202264|area.202265|area.202266|...|area.207338|area.208842)
Even with 4k items, the right engine will just compile it down. (I don't think bash will handle it, because of the length. But perl, grep, fgrep as mentioned elsewhere can do it.)
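As a rough sketch, such an alternation can be built straight from the list (assuming list.txt holds one string per line; data.txt is a made-up name for the file being searched):
# escape the dots so they match literally, then join the strings with '|'
pattern=$(sed 's/\./\\./g' list.txt | paste -sd'|' -)
grep -E "($pattern)" data.txt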
You say "BASH", so it's worth pointing out there is a difference between regex and file globbing. If the things you are working with are text, then regex (^area.\d+$) is the way to go. If the things you are working with are filenames, then globbing (*.c) has different rules.
You can simplify greatly if you don't care at all about the numbers, only the format. For regexes:
area\.\d+ # area, dot, one or more digits (0-9)
area\.\d{1,6} # area, dot no less than 1, no more than 6 digits
area\.\d{6} # area, dot, exactly 6 digits
area\.20[2-8]\d{3} # area, dot, 20, one of {2..8}, then 3 more digits
If you can use Perl and the Regexp::Assemble module, it can convert multiple patterns into a single, optimized, regular expression. For instance, using it on the list of strings in the question yields:
(?-xism:area\.20(?:22(?:6[456789]|70)|7(?:171|338)|451[789]|8842))
That only works if the database plugin can accept Perl regular expressions.
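For reference, a minimal sketch of invoking the module from the shell (it assumes Regexp::Assemble is installed and that list.txt holds one string per line):
perl -MRegexp::Assemble -e '
    my $ra = Regexp::Assemble->new;                              # build the assembler
    while (my $l = <STDIN>) { chomp $l; $ra->add(quotemeta $l) } # quotemeta so dots match literally
    print $ra->as_string, "\n";                                  # emit the combined regex
' < list.txt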
I have a text file like this:
"an arbitrary string" = "this is the text one"
"other arbitrary string" = "second text"
"a third arbitrary string" = "the text number three"
I want to obtain only this:
an arbitrary string
other arbitrary string
a third arbitrary string
That is, the text inside the first quotes, or between the first " and the " =. I used this regex
(?!").*(?=(" =))
This worked when I tried it in RegExr and in an online tool, but in my OSX Terminal it does not work; the output is empty:
grep -o '(?!").*(?=(" =))' input.txt
What is wrong here? Do I have to escape some characters? I try everyone and nothing changes.
Thank you so much and please excuse my lack of knowledge about this topic.
Lookaheads and lookbehinds are PCRE features, so you have to use the -P flag:
grep -Po '(?!").*(?=(" =))' input.txt
(Note that the stock grep on OS X may not support -P; in that case install GNU grep or use another PCRE-capable tool such as perl.)
This should do:
awk -F\" '{print $2}' file
It uses " as the field separator, then prints the second field.
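Given the sample input, this prints:
an arbitrary string
other arbitrary string
a third arbitrary string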
steffen's answer is right: you have to use the -P flag. But there is also a problem with your regex.
Imagine this input:
"an arbitrary string" = " =this is the text one"
Your regex will fail dramatically.
To solve this you have to use something like this:
grep -Po '^"\K.*?(?=(" =))'
^ prevents matches that do not begin at the line start.
\K drops everything matched so far from the result; it reads more easily than a lookbehind (and, unlike a PCRE lookbehind, it allows a variable-length prefix).
.*? makes the match non-greedy.
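Against the tricky input above, the anchored version still returns only the first quoted string:
$ echo '"an arbitrary string" = " =this is the text one"' | grep -Po '^"\K.*?(?=(" =))'
an arbitrary string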
I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (>32767).
You'd better check it out:
sed -n '117897124{p;q}' file_name
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever the field separator is. The default is $'\t'.
The 'number of fields' is the number of 'columns' in the input file, so if one of the lines is really long, then that could potentially cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command substitutes any leading text which matches the pattern, in place of the entire line, and if it matches, prints out the line).
It seems your awk implementation hit an internal limit (here, the maximum number of fields per record) at record 117,897,124. Such limits can vary according to your implementation and your OS.
Maybe a sane way to approach this problem is to write a script that uses split to break the large file into smaller ones, with no more than 100,000,000 records each.
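A sketch of that idea, reusing the commands from the question (the chunk size and the chunk_ prefix are illustrative):
# split into pieces of 100,000,000 lines each, named chunk_aa, chunk_ab, ...
split -l 100000000 file_name chunk_
# process each piece, then de-duplicate across the whole stream at the end
for f in chunk_*; do
    awk '{print $1}' "$f" | grep -o '/ns/.*'
done | awk '!seen[$0]++' > test1.txt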
Just in case you don't want to split the file, maybe you could look for the limits file corresponding to your awk implementation. Maybe you can define unlimited as the Number of Records value, although I believe that is not a good idea, as you might end up using a lot of resources...
If you have enough free space on disk (because Vim creates a temporary .swp file), I suggest using Vim. Vim's regex dialect differs slightly from standard regex, but you can convert from standard regex to vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that's a problem then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.
So I have a bunch of data that all looks like this:
janitor#1/2 of dorm#1/1
president#4/1 of class#2/2
hunting#1/1 hat#1/2
side#1/2 of hotel#1/1
side#1/2 of hotel#1/1
king#1/2 of hotel#1/1
address#2/2 of girl#1/1
one#2/1 in family#2/2
dance#3/1 floor#1/2
movie#1/2 stars#5/1
movie#1/2 stars#5/1
insurance#1/1 office#1/2
side#1/1 of floor#1/2
middle#4/1 of December#1/2
movie#1/2 stars#5/1
one#2/1 of tables#2/2
people#1/2 at table#2/1
Some lines have prepositions, others don't, so I thought I could use regular expressions to clean it up. What I need is each noun, the # sign and the following number on its own line. So for example, the first lines of output should look like this in the final file:
janitor#1
dorm#1
president#4
etc...
The list is stored in a file called NPs. My code to do this is:
cat NPs | grep -E '\b(\w*[#][1-9]).' >> test
When I open test, however, it's the exact same as the input file. Any input as to what I'm missing? It doesn't seem like it should be a hard operation, so maybe I'm missing something about syntax? I'm using this command from a shell script that is called in bash.
Thanks in advance!
This should do what you need.
The -o option will show only the part of a matching line that matches the PATTERN.
grep -Eo '[a-z#]+[1-9]' NPs > test
or even the -P option, which interprets the PATTERN as a Perl regular expression:
grep -Po '[\w#]*(?=/)' NPs > test
Using grep:
$ grep -o "\w*[#]\w*" inputfile
janitor#1
dorm#1
president#4
class#2
hunting#1
hat#1
side#1
hotel#1
side#1
hotel#1
king#1
hotel#1
address#2
girl#1
one#2
family#2
dance#3
floor#1
movie#1
stars#5
movie#1
stars#5
insurance#1
office#1
side#1
floor#1
middle#4
December#1
movie#1
stars#5
one#2
tables#2
people#1
table#2
grep variants extract entire lines from the text if they match the pattern. If you need to modify lines, you should use sed, like:
sed 's/^\(\b\w*[#][1-9]\).*$/\1/' NPs
You need sed, not grep. (Or awk, or perl.) It looks like this would do what you want:
cat NPs | sed 's?/.*??'
or simply
sed 's?/.*??' NPs
s means "substitute". The next character is the delimiter between regular expressions. Usually it's "/", but since you need to search for "/", I used "?" instead. "." refers to any character, and "*" says "zero or more of what preceded me". Whatever is between the last two delimiters is the replacement string. In this case it's empty, so you're replacing "/" followed by zero or more of any character, with the empty string.
EDIT: Oh, I see now that you wanted to extract the last item on the line, too. Well, I'm sure that others' suggested regexps would work. If it were my problem, I'd probably filter the file in two steps, perhaps piping the results from one step to the next, or using multiple substitutions with sed: First delete the "of"s and middle spaces, and add newlines, and then run sed as above. It's not as cool as doing it all in one regexp, but each step is easier to understand. For even more simplicity and uncoolness, use three steps, replacing " of " with space in the first step. Since others have provided complete solutions, I won't work out the details.
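For what it's worth, one possible two-step version of that idea (a sketch, not necessarily the exact commands meant above):
# step 1: drop the ' of ' connectors and put each item on its own line;
# step 2: strip the trailing /... as above
sed 's/ of / /' NPs | tr ' ' '\n' | sed 's?/.*??'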
Grep by default just searches for the text, so in your case it is printing the lines that match. I think you want to investigate sed instead to perform the replacement. (And you don't need to cat the file, just grep PATTERN filename)
To get your output on separate lines, this worked for me:
sed 's|/.||g' NPs | sed 's/ .. /=/' | tr "=" "\n"
This uses two seds in a row to do different substitutions, and tr to insert line feeds.
The -o option in grep, which causes it to print out only the matching text, as described in another answer, is probably even simpler!
An awk version:
awk '/#/ {print $NF}' RS="/" NPs
janitor#1
dorm#1
president#4
class#2
hunting#1
hat#1
side#1
hotel#1
side#1
hotel#1
king#1
hotel#1
address#2
girl#1
one#2
family#2
dance#3
floor#1
movie#1
stars#5
movie#1
stars#5
insurance#1
office#1
side#1
floor#1
middle#4
December#1
movie#1
stars#5
one#2
tables#2
people#1
table#2
I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in OSX terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexes from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie... but definitely a newbie to regex...)
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!
I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python -c 'import ldap; import sys; print ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0]'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.
Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.
Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* rest
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS=',' -v FS='=' '$1=="CN"{print $2}' foo.txt
I like awk too, so I print the substring from the fourth char. Note that FS has to be set before the first record is read (with -F or in a BEGIN block); assigning it inside the action, as in {FS=","}, is too late for the line currently being split:
DSCL | awk -F',' '{print substr($1,4)}' > filterednames.txt
This regex will parse a distinguished name, giving name and val a capture groups for each match.
When DN strings contain commas, they are meant to be quoted. This regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is, nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))
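If the goal is to print just the CN value on its own, a PCRE-capable grep can use a \K variant of that pattern (dns.txt is a made-up input file name; like the regex above, this handles quoted commas but not backslash-escaped ones):
grep -Po '(?:^|,\s?)CN=\K(?:"(?:[^"]|"")+"|[^,]+)' dns.txt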