Replacing text and duplicates - regex

I have a log file with lines filled with things like this:
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.html
I want to extract only the username portion, and then remove duplicates, so that I am only left with this:
biaxib
hihi
hoho
ihatespam
The ruleset is:
Extract the text between "/home/Users/" and "/....." at the end
Remove duplicate lines after the above rule is applied
Do this inside Linux
Can someone help me create such a script or statement to do this?

Assuming that the username always appears as the 4th component of the path:
$ cat test.txt
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.html
$ cat test.txt | cut -d/ -f 5 | sort | uniq
biaxib
hihi
hoho
ihatespam

cat /path/to/your/log/file.txt | python3 -c '
import sys
for line in sys.stdin.readlines():
    print(line.split("/")[4])
' | sort | uniq
Something more concise is probably achievable in Perl or with other built-in tools (see the other answer), but I personally shy away from the standard Linux text manipulation tools (edit: cut is a useful one, though).
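If you do want a one-liner with those standard tools, here's a hedged awk sketch (assuming, as above, that the username is always the 4th path component):
awk -F/ '{ print $5 }' test.txt | sort -u
sort -u folds the sort | uniq pair into a single step.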

Bash - numbers of multiple lines matching regex (possible oneliner?)

I'm not very fluent in bash but am actively trying to improve, so I'd like to ask some experts here for a little suggestion :)
Let's say I've got the following text file:
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
And I'd like to get the numbers of the lines with these letters, so my desired output should look like:
5 6 7 15
To clarify: I want all lines matching the regex /\s*X./ that occur right after a match of the other regex /\sI want following letters:/
Right now I've got a working solution, which I don't really like:
cat data.txt | grep -oPz "\sI want following letters:((\s*X.)*)" | grep -oPz "\s*X." > tmp.txt
for entry in $(cat tmp.txt); do
grep -n $entry data.txt | cut -d ":" -f1
done
My question is: is there any smart way, any tool I don't know, with the functionality to do this in one line? (I especially don't like having to use a temp file and a loop here.)
You can use awk:
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
Explanation in multiline version:
#!/usr/bin/awk -f
/I want following/{
# Just set a flag and move on with the next line
p=1
next
}
!/^X/ {
# On all other lines that don't start with an X
# reset the flag and continue to process the next line
p=0
next
}
p {
# If the flag p is set it must be a line with X+number.
# print the line number NR
print NR
}
The following may help you here.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1} flag' Input_file
The above will also print the lines containing I want following letters:; in case you don't want those, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag' Input_file
To add the line number to the output, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag{print FNR}' Input_file
First, let's optimize your current script a little bit:
#!/bin/bash
FILE="data.txt"
while read -r entry; do
[[ $entry ]] && grep -n $entry "$FILE" | cut -d ":" -f1
done < <(grep -oPz "\sI want following letters:((\s*X.)*)" "$FILE"| grep -oPz "\s*X.")
And here's some comments:
No need to use cat file|grep ... => grep ... file
Do not use the syntax for i in $(command); it's often the cause of bugs and there's always a smarter solution.
No need to use a tmp file either
And then, there are a lot of shorter possible solutions. Here's one using awk:
$ awk '{ if($0 ~ "I want following letters:") {s=1} else if(!($0 ~ "^X[0-9]*$")) {s=0}; if (s && $0 ~ "^X[0-9]*$") {gsub("X", ""); print}}' data.txt
1
2
3
7

Editing this Script to my needs

I want to use this script to build a custom wordlist.
Wordlist Script
This script builds a wordlist with only lowercase alphabetic characters, but I want lowercase/uppercase characters and numbers.
The Output should be like this example:
test
123test
test123
Test
123Test
Test123
I don't know how to change it. I would be really happy if you could help me out with this.
I tried some tutorials for grep and regex but I don't understand anything.
Replace line 18 of the script
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u`;
With this:
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | sort -u`;
If you have a look at it, you can see how it
replaces " " with "\n",
changes case,
filters by length,
and sorts.
You can remove bits from that pipe chain and see how the output changes.
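If you would rather keep the length filter but accept digits and mixed case, a hedged variant of that line (assuming the rest of wordlist.sh stays unchanged) could be:
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | sed -e '/[^a-zA-Z0-9]/d' -e '/^.\{9,25\}$/!d' | sort -u`;
This drops the tr '[:upper:]' '[:lower:]' case-folding step and widens the character filter from [^a-zA-Z] to [^a-zA-Z0-9], so mixed-case words containing digits are kept (still subject to the 9-25 character length filter).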
Delete this bit from the script:
tr '[:upper:]' '[:lower:]' |
That will leave case alone.
There's also a bit in wordlist.sh that only selects words from 9 to 25 characters, which you could delete, or change if you prefer a different range:
sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' |
Or you could try a simpler strategy: download and install w3m, a command-line web browser, and replace the complicated line in wordlist.sh with this:
page=`grep '' -R "./temp/" | w3m -dump -T text/html | grep -o '\w\+' | sort -u`
The grep is a (weird) way to get all the text from the HTML files, w3m -dump then gets rid of all the HTML tags and other non-display stuff, and grep -o '\w\+' matches any word.

How to remove both matching lines while removing duplicates

I have a large text file containing a list of emails called "main", and I have sent mails to some of them. I have a list of 'sent' emails. Now, I want to remove the 'sent' emails from the list "main".
In other words, I want to remove both matching rows from the text file while removing duplicates. Example:
I have:
email#email.com
test#test.com
email#email.com
I want:
test#test.com
Is there any easier way to achieve this? Please suggest a tool or method to do this, but please consider the text file is larger than 10MB.
In terminal:
cat test | sort | uniq -c | awk -F" " '{if($1==1) print $2}'
I use cygwin a lot for such tasks, as the unix command line is incredibly powerful.
Here's how to achieve what you want:
cat main.txt | sort -u | grep -Fvxf sent.txt
sort -u will remove duplicates (by sorting the main.txt file first), and grep will take care of removing the unwanted addresses.
Here's what the grep options mean:
-F plain text search
-v invert results
-x will force the whole line to match the pattern
-f read patterns from the specified file
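For example, with the sample data from the question (the contents of sent.txt here are just an assumption for illustration):
$ cat main.txt
email#email.com
test#test.com
email#email.com
$ cat sent.txt
email#email.com
$ cat main.txt | sort -u | grep -Fvxf sent.txt
test#test.com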
Oh, and if your files are in the Windows format (CRLF newlines), you'll have to do this instead:
cat main.txt | dos2unix | sort -u | grep -Fvxf <(cat sent.txt | dos2unix)
Just like with the Windows command line, you can simply add:
> output.txt
at the end of the command line to redirect the output to a text file.

Git log stats with regular expressions

I would like to do some stats on my git log to get something like:
10 Daniel Schmidt
5 Peter
1 Klaus
The first column is the count of commits and the second is the committer.
I already got as far as this:
git log --raw |
grep "^Author: " |
sort |
uniq -c |
sort -nr |
less -FXRS
The interesting part is the
grep "^Author: "
which I wanted to modify with a nice regex to exclude the mail address.
With Rubular something like this http://rubular.com/r/mEzP2hFjGb worked, but if I insert it in the grep (or in another piped one) it won't get me the right output.
Side question: is there a possibility to get the count and the author separated by something other than whitespace while staying in this pipe command style? I would like a nicer separator between the two so I can use column later (and maybe some color ^^)
Thanks a lot for your help!
Google git-extras. It has a git summary that does this.
git shortlog -n -s gets you the same data. On the git repository itself, for example (piped to head to show just the top committers):
$ git shortlog -n -s | head -4
11129 Junio C Hamano
1395 Shawn O. Pearce
1103 Linus Torvalds
896 Jeff King
To get a different delimiter, you could pipe it to awk:
$ git shortlog -n -s | awk 'BEGIN{OFS="|";} { $1=$1; print $0 }' | head -4
11129|Junio|C|Hamano
1395|Shawn|O.|Pearce
1103|Linus|Torvalds
896|Jeff|King
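Reassigning $1 makes awk rebuild the whole record with OFS between every field, which is why the names above get split as well. If you want the separator only between the count and the intact name, here is a sketch (assuming the shortlog output is a padded count, a tab, then the name):
$ git shortlog -n -s | awk -F'\t' '{gsub(/^ +/, "", $1); print $1 "|" $2}' | head -4
11129|Junio C Hamano
1395|Shawn O. Pearce
1103|Linus Torvalds
896|Jeff King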
You can get the full power of PCRE (which should match your experiments with Rubular) with a Perl one-liner:
perl -ane 'print if /^Author: /'
Just extend that pattern as necessary.
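For example, a hedged sketch that keeps only the name and drops the address (assuming the usual Author: Name <email> format in git log output):
git log | perl -lne 'print $1 if /^Author: (.*?) </' | sort | uniq -c | sort -nr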
To reformat, you can use awk (e.g. awk '{printf "%5d\t%s\n", $1, $2}').

Using awk sed or grep to parse URLs from webpage source

I am trying to parse the source of a downloaded web-page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:
This seems to leave out parts of the URL from some of the page names.
$ cat file.html | grep -o -E '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'|sort -ut/ -k3
This gets all of the URLs, but I do not want to include links that have/are anchor links. Also, I want to be able to specify the domain.org/folder/:
$ awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
for(o=1;o<=NF;o++){
if ( $o ~ /href/){
gsub(/.*href=\042/,"",$o)
gsub(/\042.*/,"",$o)
print $(o)
}
}
}' file.html
If you are only parsing something like <a> tags, you could just match the href attribute like this:
$ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq
That will ignore the anchor and also guarantee that you have uniques. This does assume that the page has well-formed (X)HTML, but you could pass it through Tidy first.
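For instance, a hedged sketch of running it through Tidy first (assuming HTML Tidy is installed; -q silences informational messages and -asxhtml converts the input to XHTML):
$ tidy -q -asxhtml file.html 2>/dev/null | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq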
lynx -dump http://www.ibm.com
And look for the string 'References' in the output. Post-process with sed if you need to.
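A hedged sketch of that post-processing (assuming your lynx build supports -listonly, which dumps just the numbered references):
lynx -dump -listonly http://www.ibm.com | grep -o 'http[s]*://[^ ]*' | sort -u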
Using a different tool sometimes makes the job simpler. Once in a while, a different tool makes the job dead simple. This is one of those times.