Using pcregrep for multiple files - regex

I am trying to run a pcregrep multiline match on a set of files. The files themselves come out of a search over the current directory, something like below:
l | grep -P "\d\.mt.+" | cut -d":" -f 2 | cut -d" " -f 2 | xargs
So, I want to run pcregrep over this set of files, and it is a multiline match, as below:
pcregrep -Mi "index(.+\n)+" list of files
I don't know if it's possible to pass the list of file names like this.
Can someone help?
Regards,
Manu

Try this:
l | grep -P "\d\.mt.+" | cut -d":" -f 2 | cut -d" " -f 2 | xargs pcregrep -Mi "index(.+\n)+"
Your command puts xargs at the end but gives it no command to run.
xargs is what makes this useful: it turns its standard input into arguments, so the command it runs is effectively
pcregrep <*list of all found files*>
That's the idea behind xargs.
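To see what xargs does with its input, here is a tiny stand-alone illustration, with echo in place of actually running pcregrep and made-up file names:

```shell
# xargs reads whitespace-separated items from stdin and appends them
# as arguments to the given command (echo here, pcregrep in practice).
printf 'file1\nfile2\nfile3\n' | xargs echo pcregrep -Mi "index(.+\n)+"
```

The printed line is exactly the command that xargs would have executed.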


Find & Replace String in All Found Files with LaTeX $sim$ -> $\sim$

I know this sort of question has been asked many times before, but I'm running into an odd circumstance where my feeble brain forgot to include a \ while calling $\sim$ in some markdown files. I need to go through and replace all instances of $sim$ with $\sim$. My code is running but not actually replacing any of the words that I want. Here are some variations I have tried:
grep -rl '\$sim\$' . | xargs sed -i 's/\$sim\$/$\sim$/g'
grep -rlF '$sim$' . | xargs sed -i 's/\$sim\$/$\sim$/g'
grep -rlF '$sim$' . | xargs sed -i 's/$sim$/$\sim$/g'
grep -rlF '$sim$' . | xargs sed -i '' -e 's/$sim$/$\sim$/g'
And other odd variations on a theme. The code just runs with no output, but when I check the files nothing has changed. I figure this is either a sed issue (I'm on macOS) or a regex issue.
Like this:
grep -rlF '$sim$' . | xargs sed -i 's/\$sim\$/$\\sim$/g'
For macOS:
grep -rlF '$sim$' . | xargs sed -i '' 's/\$sim\$/$\\sim$/g'
sed -i changes files in place, however you aren't telling sed to operate on any files. You are giving sed its input on stdin.
What you want is something like
find . -type f -exec sed -i 's/\$sim\$/$\\sim\$/g' {} \;
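For reference, the likely reason the original attempts changed nothing: in a sed replacement (with GNU sed, at least), a single backslash before an ordinary letter is dropped, so the \s in $\sim$ collapses and the replacement writes back $sim$ unchanged. The fix is to double the backslash. A minimal check of the corrected escaping on a stream, with no files involved:

```shell
# Pattern side: \$ escapes the (end-of-line) metacharacter $.
# Replacement side: \\ produces one literal backslash in the output.
printf '$sim$\n' | sed 's/\$sim\$/$\\sim$/'
```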

Editing this Script to my needs

I want to use this script to build a custom wordlist.
Wordlist Script
The script builds a wordlist with only lowercase alphabetic characters, but I want lower/upper case characters and numbers as well.
The output should look like this example:
test
123test
test123
Test
123Test
Test123
I don't know how to change it. I would be really happy if you could help me out with this.
I tried some tutorials for grep and regex but I don't understand them yet.
Replace line 18 of the script
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u`;
With this:
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | sort -u`;
If you have a look at the original line, you can see how it
replaces " " with "\n",
changes case,
filters by length,
sorts
You can remove bits from that pipe chain and see how the output changes
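For example, you can watch the case-folding stage in isolation:

```shell
# This is the stage that destroys the capitalization you want to keep;
# deleting it from the pipeline leaves mixed-case words untouched.
printf 'Test123\n' | tr '[:upper:]' '[:lower:]'
```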
Delete this bit from the script:
tr '[:upper:]' '[:lower:]' |
That will leave case alone.
There's also a bit in wordlist.sh that only selects words from 9 to 25 characters, which you could delete, or change if you prefer a different range:
`sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' |`
Or you could try a simpler strategy: download and install w3m, a command-line web browser, and replace the complicated line in wordlist.sh with this:
page=`grep '' -R "./temp/" | w3m -dump -T text/html | grep -o '\w\+' | sort -u`
The grep is a (weird) way to get all the text out of the HTML files, w3m -dump then gets rid of all the HTML tags and other non-display stuff, and grep -o '\w\+' prints every word on its own line.
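A small stand-alone check of that tag-stripping and word-matching idea, on a made-up HTML snippet (using the script's own sed tag-stripper in place of w3m so it runs anywhere):

```shell
# sed removes <...> tags; grep -o '\w\+' prints each run of word
# characters (letters, digits, underscore) on its own line; sort -u
# removes the duplicate.
printf '<p>Test123 hello Test123</p>\n' \
  | sed 's/<[^>]*>//g' \
  | grep -o '\w\+' \
  | sort -u
```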

Extract filenames that matches the pattern and remove duplicates and store in an array

I would like to know the easiest way to list a part of each filename in a directory, without any duplicates.
Example:
A directory has files like this:
Stack1_over_flow.txt
Stack2_exchange.txt
Meta_stack.txt
Stack1_over_flow.txt
Meta_stack.txt
Now I want the result to be:
Stack1
Stack2
Meta
Here, extract the string before the first occurrence of "_" and remove any duplicates.
ls -1 | awk '{split($0,a,"_"); print a[1]}' | sort -b | uniq
Only files, with find:
find . -maxdepth 1 -type f -printf "%f\n" | awk '{split($0,a,"_"); print a[1]}' | sort -b | uniq
Using sed:
ls -1 | sed -r 's/([a-zA-Z0-9])_.*/\1/' | uniq
You can even try this:
ls -1 | cut -d "_" -f1 | uniq
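The question title also asks to store the result in an array; a minimal bash sketch, with the sample names hard-coded for illustration (a real directory cannot contain duplicate names, so in practice you would feed it ls output):

```shell
# ${name%%_*} strips everything from the first "_" onward; an
# associative array (bash 4+) tracks which prefixes were already seen,
# so order of first appearance is preserved without sorting.
declare -A seen
prefixes=()
for name in Stack1_over_flow.txt Stack2_exchange.txt Meta_stack.txt \
            Stack1_over_flow.txt Meta_stack.txt; do
    prefix=${name%%_*}
    if [[ -z ${seen[$prefix]} ]]; then
        seen[$prefix]=1
        prefixes+=("$prefix")
    fi
done
printf '%s\n' "${prefixes[@]}"
```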

How to find all files in a Directory with grep and regex?

I have a directory (Linux/Unix) on an Apache server with a lot of subdirectories containing a lot of files, like this:
- Dir
- 2010_01/
- 142_78596_101_322.pdf
- 12_10.pdf
- ...
- 2010_02/
- ...
How can I find all files with filenames looking like *_*_*_*.pdf, where * is always a digit?
I tried to solve it like this:
ls -1Rl 2010-01 | grep -i '\(\d)+[_](\d)+[_](\d)+[_](\d)+[.](pdf)$' | wc -l
But the regular expression \(\d)+[_](\d)+[_](\d)+[_](\d)+[.](pdf)$ doesn't work with grep.
Edit 1: Trying ls -l 2010-03 | grep -E '(\d+_){3}\d+\.pdf' | wc -l, for example, just returns nothing, so that doesn't work either.
Try using find.
A command that satisfies your specification *_*_*_*.pdf, where * is always a digit (note that find matches its regex against the whole path, and \d needs an extended regex type, so [0-9] is used instead):
find 2010_10/ -regextype posix-extended -regex '.*/([0-9]+_){3}[0-9]+\.pdf'
Based on the regex you tried, though, you seem to want a sequence of four numbers separated by underscores:
(\d+_){3}\d+\.pdf
Or do you want to match all names containing solely numbers/underscores?
[\d_]+\.pdf
First, you should use egrep, or call grep with -E, for extended patterns. Note that POSIX extended regexes have no \d or (?:…); those are Perl (PCRE) features, so the pattern below uses [0-9] and plain groups instead.
So this works for me:
$ cat test2.txt
- Dir
- 2010_01/
- 142_78596_101_322.pdf
- 12_10.pdf
- ...
- 2010_02/
- ...
Now egrep that file:
cat test2.txt | egrep '(([0-9]+_){3}[0-9]+\.pdf$)'
- 142_78596_101_322.pdf
Since there are parentheses around the whole pattern, the entire file name will be captured.
Note that the pattern does NOT work with grep in traditional (basic) mode, where ( ) { } are not grouping operators unless backslashed:
$ cat test2.txt | grep '(([0-9]+_){3}[0-9]+\.pdf$)'
... no return
But it DOES work if you use the extended-pattern switch (the same as calling egrep):
$ cat test2.txt | grep -E '(([0-9]+_){3}[0-9]+\.pdf$)'
- 142_78596_101_322.pdf
Thanks to gbchaosmaster and the wolf, I found a way that works for me:
Inside a directory:
find . | grep -P "(\d+_){3}\d+\.pdf" | wc -l
From the root directory:
find 20*/ | grep -P "(\d+_){3}\d+\.pdf" | wc -l
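A quick stand-alone check of the -P pattern on made-up names:

```shell
# grep -P enables Perl-compatible regexes, where \d means [0-9];
# (\d+_){3}\d+\.pdf requires three "digits_" groups, then digits.pdf.
printf '142_78596_101_322.pdf\n12_10.pdf\nreadme.txt\n' \
  | grep -P '(\d+_){3}\d+\.pdf'
```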

How can I make this script more concise?

I wrote a little script which prints the names of files containing problematic character sequences.
#!/bin/bash
# Finds all files in the repository that contain
# undesired characters or sequences of characters
pushd .. >/dev/null
# Find Windows newlines
find . -type f | grep -v ".git/" | grep -v ".gitmodules" | grep -v "^./lib" | xargs grep -l $'\r'
# Find tabs (should be spaces)
find . -type f | grep -v ".git/" | grep -v ".gitmodules" | grep -v "^./lib" | xargs grep -l $'\t'
# Find trailing spaces
find . -type f | grep -v ".git/" | grep -v ".gitmodules" | grep -v "^./lib" | xargs grep -l " $"
popd >/dev/null
I'd like to combine this into one line, i.e. by having grep look for \r OR \t OR trailing spaces. How would I construct a regex to do this? It seems that for escape characters a special sequence needs to be used ($'\X'), and I'm not sure how to combine these...
I'm running OS X, and am looking for a solution that works on both BSD and GNU based systems.
find . -type f | grep -E -v ".git/|.gitmodules|^./lib" | xargs grep -E -l $'\r|\t| $'
The $'\r|\t| $' uses the shell's ANSI-C quoting, so \r and \t are expanded into real carriage-return and tab characters before grep sees the pattern; a simple test on my system seemed to work.
I'm using the -E (extended reg-exp) to grep, that allows 'OR'ing together multiple search targets.
Older Unixes may or may not support the -E option, so if you get an error message flagging that, replace all grep -E with egrep.
I hope this helps.
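A small stand-alone check of the combined pattern, on fabricated input lines rather than files:

```shell
# $'\t| $' expands \t to a real tab before grep -E sees it, so the
# alternation matches lines containing a tab OR ending with a space
# ("clean line" matches neither and is filtered out).
printf 'has\ttab\nclean line\ntrailing space \n' \
  | grep -E $'\t| $'
```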