Replacing text and duplicates - regex

I have a log file with lines filled with things like this:
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.html
I want to extract only the username portion, and then remove duplicates, so that I am only left with this:
biaxib
hihi
hoho
ihatespam
The ruleset is:
Extract the text between "/home/Users/" and "/....." at the end
Remove duplicate lines after the above rule is applied
Do this inside Linux
Can someone help me create such a script or statement to do this?

Assuming that the username always appears as the 4th component of the path:
$ cat test.txt
/home/Users/b/biaxib/is-clarithromycin-effective-against-strep.html
/home/Users/b/hihi/low-cost-biaxin-free-shipping.html
/home/Users/b/hoho/no-script-biaxin-fast-delivery.html
/home/Users/b/ihatespam/no-script-low-cost-biaxin.html
$ cat test.txt | cut -d/ -f 5 | sort | uniq
biaxib
hihi
hoho
ihatespam

cat /path/to/your/log/file.txt | python3 -c '
import sys
for line in sys.stdin.readlines():
    print(line.split("/")[4])
' | sort | uniq
Something more concise is probably achievable in Perl or with other built-in tools (see the other answer), but I personally shy away from the standard Linux text manipulation tools (edit: cut is a useful one, though).
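If you do want a one-liner with those standard tools, here's a hedged awk sketch (assuming, as above, that the username is always the 4th path component):
awk -F/ '{ print $5 }' test.txt | sort -u
sort -u folds the sort | uniq pair into a single step.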

Bash - numbers of multiple lines matching regex (possible oneliner?)

I'm not very fluent in bash but am actively trying to improve, so I'd like to ask some experts here for a little suggestion :)
Let's say I've got the following text file:
Some
spam
about which I don't care.
I want following letters:
X1
X2
X3
I do not want these:
X4
X5
Nor this:
X6
But I'd like these, too:
I want following letters:
X7
And so on...
And I'd like to get the numbers of the lines with these letters, so my desired output should look like:
5 6 7 15
To clarify: I want all lines matching the regex /\s*X./ that occur right after a match of the other regex /\sI want following letters:/
Right now I've got a working solution, which I don't really like:
cat data.txt | grep -oPz "\sI want following letters:((\s*X.)*)" | grep -oPz "\s*X." > tmp.txt
for entry in $(cat tmp.txt); do
grep -n $entry data.txt | cut -d ":" -f1
done
My question is: is there any smart way, any tool I don't know, with the functionality to do this in one line? (I especially don't like having to use a temp file and a loop here.)
You can use awk:
awk '/I want following/{p=1;next}!/^X/{p=0;next}p{print NR}' file
Explanation in multiline version:
#!/usr/bin/awk -f
/I want following/{
# Just set a flag and move on with the next line
p=1
next
}
!/^X/ {
# On all other lines that don't start with an X
# reset the flag and continue to process the next line
p=0
next
}
p {
# If the flag p is set it must be a line with X+number.
# print the line number NR
print NR
}
The following may help you here.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1} flag' Input_file
The above will also print the lines containing I want following letters:; in case you don't want those, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag' Input_file
To add the line number to the output, use the following.
awk '!/X[0-9]+/{flag=""} /I want following letters:/{flag=1;next} flag{print FNR}' Input_file
First, let's optimize your current script a little bit:
#!/bin/bash
FILE="data.txt"
while read -r entry; do
[[ $entry ]] && grep -n $entry "$FILE" | cut -d ":" -f1
done < <(grep -oPz "\sI want following letters:((\s*X.)*)" "$FILE"| grep -oPz "\s*X.")
And here's some comments:
No need to use cat file|grep ... => grep ... file
Do not use the syntax for i in $(command); it's often the cause of bugs and there's always a smarter solution.
No need to use a tmp file either
And then, there are a lot of shorter possible solutions. Here's one using awk:
$ awk '{ if($0 ~ "I want following letters:") {s=1} else if(!($0 ~ "^X[0-9]*$")) {s=0}; if (s && $0 ~ "^X[0-9]*$") {gsub("X", ""); print}}' data.txt
1
2
3
7

Editing this Script to my needs

I want to use this script to build a custom wordlist.
Wordlist Script
This script builds a wordlist with only lowercase alphabetic characters, but I want lowercase/uppercase characters and numbers.
The Output should be like this example:
test
123test
test123
Test
123Test
Test123
I don't know how to change it. I would be really happy if you could help me out with this.
I tried some tutorials for grep and regex but I don't understand anything.
Replace line 18 of the script
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u`;
With this:
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | sort -u`;
If you have a look at it, you can see how it
replaces " " with "\n",
changes case,
filters by length,
and sorts.
You can remove bits from that pipe chain and see how the output changes.
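If you would rather keep the length filter but accept digits and mixed case, a hedged variant of that line (assuming the rest of wordlist.sh stays unchanged) could be:
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | sed -e '/[^a-zA-Z0-9]/d' -e '/^.\{9,25\}$/!d' | sort -u`;
This drops the tr '[:upper:]' '[:lower:]' case-folding step and widens the character filter from [^a-zA-Z] to [^a-zA-Z0-9], so mixed-case words containing digits are kept (still subject to the 9-25 character length filter).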
Delete this bit from the script:
tr '[:upper:]' '[:lower:]' |
That will leave case alone.
There's also a bit in wordlist.sh that only selects words from 9 to 25 characters, which you could delete, or change if you prefer a different range:
sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' |
Or you could try a simpler strategy: download and install w3m, a command-line web browser, and replace the complicated line in wordlist.sh with this:
page=`grep '' -R "./temp/" | w3m -dump -T text/html | grep -o '\w\+' | sort -u`
The grep is a (weird) way to get all the text from the HTML files, w3m -dump then gets rid of all the HTML tags and other non-display stuff, and grep -o '\w\+' matches any word.

How to remove both matching lines while removing duplicates

I have a large text file containing a list of emails called "main", and I have sent mails to some of them. I have a list of 'sent' emails. Now, I want to remove the 'sent' emails from the list "main".
In other words, I want to remove both matching rows from the text file while removing duplicates. Example:
I have:
email#email.com
test#test.com
email#email.com
I want:
test#test.com
Is there any easier way to achieve this? Please suggest a tool or method to do this, but please consider the text file is larger than 10MB.
In terminal:
cat test | sort | uniq -c | awk -F" " '{if($1==1) print $2}'
I use cygwin a lot for such tasks, as the unix command line is incredibly powerful.
Here's how to achieve what you want:
cat main.txt | sort -u | grep -Fvxf sent.txt
sort -u will remove duplicates (by sorting the main.txt file first), and grep will take care of removing the unwanted addresses.
Here's what the grep options mean:
-F plain text search
-v invert results
-x will force the whole line to match the pattern
-f read patterns from the specified file
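For example, with the sample data from the question (the contents of sent.txt here are just an assumption for illustration):
$ cat main.txt
email#email.com
test#test.com
email#email.com
$ cat sent.txt
email#email.com
$ cat main.txt | sort -u | grep -Fvxf sent.txt
test#test.com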
Oh, and if your files are in the Windows format (CRLF newlines), you'll have to do this instead:
cat main.txt | dos2unix | sort -u | grep -Fvxf <(cat sent.txt | dos2unix)
Just like with the Windows command line, you can simply add:
> output.txt
at the end of the command line to redirect the output to a text file.

Git log stats with regular expressions

I would like to do some stats on my git log to get something like:
10 Daniel Schmidt
5 Peter
1 Klaus
The first column is the count of commits and the second is the committer.
I already got as far as this:
git log --raw |
grep "^Author: " |
sort |
uniq -c |
sort -nr |
less -FXRS
The interesting part is the
grep "^Author: "
which I wanted to modify with a nice regex to exclude the mail address.
With Rubular something like this http://rubular.com/r/mEzP2hFjGb worked, but if I insert it in the grep (or in another piped one) it won't get me the right output.
Side question: is there a possibility to get the count and the author separated by something other than whitespace while staying in this pipe command style? I would like a nicer separator between the two so I can use column later (and maybe some color ^^)
Thanks a lot for your help!
Google git-extras. It has a git summary that does this.
git shortlog -n -s gets you the same data. On the git repository itself, for example (piped to head to show just the top committers):
$ git shortlog -n -s | head -4
11129 Junio C Hamano
1395 Shawn O. Pearce
1103 Linus Torvalds
896 Jeff King
To get a different delimiter, you could pipe it to awk:
$ git shortlog -n -s | awk 'BEGIN{OFS="|";} { $1=$1; print $0 }' | head -4
11129|Junio|C|Hamano
1395|Shawn|O.|Pearce
1103|Linus|Torvalds
896|Jeff|King
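Reassigning $1 makes awk rebuild the whole record with OFS between every field, which is why the names above get split as well. If you want the separator only between the count and the intact name, here is a sketch (assuming the shortlog output is a padded count, a tab, then the name):
$ git shortlog -n -s | awk -F'\t' '{gsub(/^ +/, "", $1); print $1 "|" $2}' | head -4
11129|Junio C Hamano
1395|Shawn O. Pearce
1103|Linus Torvalds
896|Jeff King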
You can get the full power of PCRE (which should match your experiments with Rubular) with a Perl one-liner:
perl -ane 'print if /^Author: /'
Just extend that pattern as necessary.
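For example, a hedged sketch that keeps only the name and drops the address (assuming the usual Author: Name <email> format in git log output):
git log | perl -lne 'print $1 if /^Author: (.*?) </' | sort | uniq -c | sort -nr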
To reformat, you can use awk (e.g. awk '{printf "%5d\t%s\n", $1, $2}').

Using awk sed or grep to parse URLs from webpage source

I am trying to parse the source of a downloaded web-page in order to obtain the link listing. A one-liner would work fine. Here's what I've tried thus far:
This seems to leave out parts of the URL from some of the page names.
$ cat file.html | grep -o -E '\b(([\w-]+://?|domain[.]org)[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'|sort -ut/ -k3
This gets all of the URLs, but I do not want to include links that have/are anchor links. Also, I want to be able to specify the domain.org/folder/:
$ awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
for(o=1;o<=NF;o++){
if ( $o ~ /href/){
gsub(/.*href=\042/,"",$o)
gsub(/\042.*/,"",$o)
print $(o)
}
}
}' file.html
If you are only parsing something like <a> tags, you could just match the href attribute like this:
$ cat file.html | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq
That will ignore the anchor and also guarantee that you have uniques. This does assume that the page has well-formed (X)HTML, but you could pass it through Tidy first.
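For instance, a hedged sketch of running it through Tidy first (assuming HTML Tidy is installed; -q silences informational messages and -asxhtml converts the input to XHTML):
$ tidy -q -asxhtml file.html 2>/dev/null | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | sort | uniq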
lynx -dump http://www.ibm.com
And look for the string 'References' in the output. Post-process with sed if you need to.
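A hedged sketch of that post-processing (assuming your lynx build supports -listonly, which dumps just the numbered references):
lynx -dump -listonly http://www.ibm.com | grep -o 'http[s]*://[^ ]*' | sort -u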
Using a different tool sometimes makes the job simpler. Once in a while, a different tool makes the job dead simple. This is one of those times.