Regex with perl one liner - regex

I have the following:
XXUM_7_mauve_999119_ser_11.255255
UXUM_566_mauve_999119_ser_11.255255
IXUM_23_mauve_999119_ser_11.255255
and my attempt at a Perl one-liner to extract the first number, which did not work, is as follows:
perl -pi -e "s/\S+_(\.+)_.+/Number$1/g" *.txt
I expected the following results:
Number 007
Number 566
Number 023
pls help

I'd use the -n option instead of the -p option and do the printing and formatting in the code:
perl -i~ -ne 'if (($num) = /[0-9]+/g) {
    printf "Number %03d\n", $num;
} else {
    print
}' *.txt

The problem is that this regex pattern /\S+_(\.+)_.+/ looks for a sequence of one or more literal dots . surrounded by underscores, so something like _..._ would match, but such a sequence doesn't exist in your file. I think you didn't mean to escape the dot. But even then, because the \S+ is greedy, it would find and capture the last field delimited by underscores, and so would capture ser from all three lines. Perhaps you meant to write \d+ instead of \.+, which is pretty much what I have written below.
This will do as you ask. It looks for the first occurrence of an underscore that is followed by a number of decimal digits, and uses printf to format the number as three digits.
You can add the -i option, but I suggest you test it as it is first, to avoid overwriting your data with erroneous results. Of course, you could redirect the output to another file if you wished.
perl -ne'/_(\d+)/ and printf "Number %03d\n", $1' myfile
output
Number 007
Number 566
Number 023
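To see the greedy-match pitfall and the \d+ fix side by side, the same two patterns can be tried in Python's re module, which backtracks the same way for this case. This is only an illustrative sketch: the names greedy_capture and format_line are made up for the example and are not part of any answer above.

```python
import re

LINES = [
    "XXUM_7_mauve_999119_ser_11.255255",
    "UXUM_566_mauve_999119_ser_11.255255",
    "IXUM_23_mauve_999119_ser_11.255255",
]

def greedy_capture(line):
    # The original pattern with the dot left unescaped: the greedy \S+ gives
    # back as little as it can, so the group ends up holding the last
    # underscore-delimited field rather than the first number.
    m = re.search(r"\S+_(.+)_.+", line)
    return m.group(1) if m else None

def format_line(line):
    # First underscore that is followed by digits, zero-padded to three places.
    m = re.search(r"_(\d+)", line)
    return "Number %03d" % int(m.group(1)) if m else line

for line in LINES:
    print(greedy_capture(line), "|", format_line(line))
```

Lines without an underscore-plus-digits sequence are returned unchanged, which mirrors the print branch of the -n answer rather than the shorter one-liner.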

cat > /tmp/test
XXUM_7_mauve_999119_ser_11.255255
UXUM_566_mauve_999119_ser_11.255255
IXUM_23_mauve_999119_ser_11.255255
perl -i -ne 'if ($_=~/^\w+\_(\d+)\_mauve/g) { printf "Number %03d\n", $1; }' /tmp/test
cat /tmp/test
Number 007
Number 566
Number 023

Related

Unable to match multiple digits in regex

I am simply trying to print 5 or 6 digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6-digit numbers: it skips the first digit of each 6-digit number for some reason. It also prints the last line, which I don't need because it contains no 5- or 6-digit number.
grep is a better fit for this, with the -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output.
(^|.*[^0-9]): Match the start of the line, or any text that ends with a non-digit.
([0-9]{5,6}): Match 5 or 6 digits in capture group #2.
.*: Match the remaining text.
\2: The replacement; it puts back only the digits captured in group #2.
/p: Print the line if a substitution was made.
With awk, you could try the following. It uses awk's match function with a regex for 5 or 6 digits on each line; if a match is found, it prints the matched part.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file
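The lost leading digit in the question comes from the greedy leading .*, which takes as much as it can and leaves only the minimum five digits for the group. A small Python sketch can reproduce both the faulty substitution and the grep -o style search; the function names here are hypothetical and chosen only for illustration.

```python
import re

LINES = [
    "Random_something xyz ...64763",
    "Random2 Some String abc-778986",
    "Something something 676347",
    "Random string without numbers",
]

def greedy_sub(line):
    # Mirrors the sed attempt: the greedy .* swallows the first digit of a
    # six-digit number, and a line with no match is returned unchanged.
    return re.sub(r"^.*([0-9]{5,6}).*$", r"\1", line)

def first_run(line):
    # Like grep -o: the first run of 5 or 6 digits, or None if there is none.
    m = re.search(r"[0-9]{5,6}", line)
    return m.group(0) if m else None

for line in LINES:
    found = first_run(line)
    if found is not None:  # lines without a match are skipped
        print(found)
```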

How to output multiple regex matches through comma on the same line

I want to use grep/awk/sed to extract matched strings for each line of a log file. Then place it into csv file.
Highlighted strings (1434,53,http://www.espn.com/)
If the input is:
2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/
2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, fetchId=28, URL=http://www.diply.com/
Expected output for the above log lines:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
This is an example, and the actual Log File will have much more data.
--My-Solution-So-far-
For now I used grep to get all lines containing keyword 'FETCH DONE' (these lines contain strings I am looking for).
I did come up with a regular expression that matches the data I need, but when I grep with it and write the result to the file, each string is printed on a new line, which is not quite what I am looking for.
The grep and regular expression I use (online regex tool: https://regexr.com/42cah):
echo -en 'url,loadtime,object\n'>test1.csv #add header
grep -Po '(?<=loadTime=).{1,5}(?= )|((?<=URL=).*|\/(?=.))|((?<=objects=).{1,5}(?=\,))'>>test1.csv #get matching strings
Actual output:
URL,LoadTime,Objects
http://www.espn.com
1434
53
http://www.diply.com
1301
54
Expected output:
URL,LoadTime,Objects
http://www.espn.com/,1434,53
http://www.diply.com/,1301,54
I was trying using awk to match multiple regex and print comma in between. I couldn't get it to work at all for some reason, even though my regex matches correct strings.
Another idea I have is to use sed to replace some '\n' for ',':
for(i=1;i<=n;i++)
if(i % 3 != 0){
sed REPLACE "\n" with "," on i-th line
}
I'm pretty sure there is a more efficient way of doing it.
Using sed:
sed -n 's/.*loadTime=\([0-9]*\)[^,]*, objects=\([0-9]*\).* URL=\(.*\)/\3,\1,\2/p' input | \
sed 1i'URL,LoadTime,Objects'
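For comparison, the same extraction can be sketched in Python. This assumes, as the sed command does, that each log entry sits on a single line with loadTime, objects and URL in that order; the names PATTERN and to_csv are invented for the example.

```python
import csv
import io
import re

LOG = (
    "2018-10-31 18:48:01.717,INFO,15592.15627,PfbProxy::handlePfbFetchDone(0x1d69850, "
    "pfbId=561, pid=15912, state=4, fd=78, timer=61), FETCH DONE: len=45, "
    "PFBId=561, pid=0, loadTime=1434 ms, objects=53, fetchReqEpoch=0.0, "
    "fetchDoneEpoch:0.0, fetchId=26, URL=http://www.espn.com/\n"
    "2018-10-31 18:48:01.806,DEBUG,15592.15621,FETCH DONE: len=45, PFBId=82, pid=0, "
    "loadTime=1301 ms, objects=54, fetchReqEpoch=0.0, fetchDoneEpoch:0.0, "
    "fetchId=28, URL=http://www.diply.com/\n"
)

# Assumption: the three fields always appear in this order on one line.
PATTERN = re.compile(r"loadTime=(\d+).*?objects=(\d+).*?URL=(\S+)")

def to_csv(text):
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["URL", "LoadTime", "Objects"])
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:  # lines without all three fields are skipped
            load_time, objects, url = m.groups()
            writer.writerow([url, load_time, objects])
    return out.getvalue()

print(to_csv(LOG), end="")
```

Using the csv module rather than joining with commas by hand means a URL that happens to contain a comma would be quoted instead of breaking the column layout.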

Regex substitute multiple lines for a single line

I have a plain text file in which I need to substitute multiple consecutive lines of text with a single replacement line. For example, when I have a date and time, followed by a blank line, followed by a page number,
11/13/2018 08:33:00

Page 1 of 1
I'd like to replace it with a single line (e.g., PAGE BREAK).
I've tried
sed 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
and
perl -pe 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
but it leaves the text unchanged.
Both sed and Perl process the input line by line. You can tell Perl to load the whole file into memory by using -0777 (if it's not too large):
perl -0777 -pe 's=[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\n\nPage [0-9]+ of [0-9]+=PAGE BREAK=g'
Note that I used [0-9], because \d can match ٤, ໖, ६, or 𝟡.
I also used s=== instead of s/// so I don't have to backslash the slashes in the date part.
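The difference between line-by-line and whole-file matching can be sketched in Python as well. This is a minimal illustration under the assumption that the footer always has exactly one blank line between the timestamp and the page number; PAGE_FOOTER and replace_footers are hypothetical names.

```python
import re

# Same pattern as the -0777 one-liner; it spans three lines of input.
PAGE_FOOTER = re.compile(
    r"[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\n\nPage [0-9]+ of [0-9]+"
)

def replace_footers(text):
    # Works only because the whole text, newlines included, is one string.
    return PAGE_FOOTER.sub("PAGE BREAK", text)

sample = (
    "123 45 jh kljl\n"
    "11/13/2018 08:33:00\n"
    "\n"
    "Page 1 of 1\n"
    "ghjgjh hkjhj\n"
)

# No single line can match a pattern that contains \n\n ...
print(any(PAGE_FOOTER.search(line) for line in sample.splitlines()))
# ... but the text as a whole does.
print(replace_footers(sample), end="")
```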
Another Perl variant
$ cat page_break.txt
123 45 jh kljl
11/13/2018 08:33:00

Page 1 of 1
ghjgjh hkjhj
fhfghfghfh
11/13/2018 08:33:00

Page 1 of 2
ghgigkjkj
$ perl -ne '{ if ( (/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}/ and $x++)or ( /^\s*$/ and $x++) or (/Page \d of \d/ and $x++) ){} if($x==0) { print "$_" } if($x==3) { print "PAGE BREAK\n"; $x=0} }' page_break.txt
123 45 jh kljl
PAGE BREAK
ghjgjh hkjhj
fhfghfghfh
PAGE BREAK
ghgigkjkj
$

splitting bash string by delimiter (last line with delimiter) into array

I'm having a hard time splitting a string like this:
444,555,text with, separator
into this:
444
555
text with, separator
i.e. into a 3-element array (last element may contain comma)
I tried sed but I end up having 4 elements due to the last comma.
Any ideas?
Thanks,
With bash and array:
s='444,555,text with, separator'
IFS=, read -r a b c <<< "$s"
array=("$a" "$b" "$c")
declare -p array
Output:
declare -a array='([0]="444" [1]="555" [2]="text with, separator")'
sed can replace just the N-th match of the regexp (i.e. the N-th occurrence of the pattern within a line):
str="444,555,text with, separator"
sed 's/,/\n/1; s/,/\n/1' <<< "$str"
The output:
444
555
text with, separator
s/,/\n/1 - 1 here is a number flag which points to the first occurrence of , to replace with \n
The following gives the same result (the first match is implied for each substitution):
sed 's/,/\n/; s/,/\n/' <<< "$str"
Two consecutive substitutions give 3 lines (chunks).
echo "444,555,text with, separator" | sed "s/\([0-9]*\),\([0-9]*\),\(.*\)/\1\n\2\n\3/"
Output:
444
555
text with, separator
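The idea shared by all of these answers, splitting on only the first two commas, can be shown in a few lines of Python. This is just a sketch for comparison; split_first_two is a made-up helper name.

```python
def split_first_two(s, sep=","):
    # maxsplit=2 stops after two separators, so the third piece keeps
    # any further commas it contains.
    return s.split(sep, 2)

parts = split_first_two("444,555,text with, separator")
print(parts)
```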

How do I remove duplicate characters and keep the unique one only in Perl?

How do I remove duplicate characters and keep the unique one only.
For example, my input is:
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
Expected output is:
EFUAH
UEH
UJHACDEF
I came across perl -pe's/$1//g while/(.).*\1/' which is wonderful, but it removes even the single remaining occurrence of the character from the output.
This can be done using positive lookahead:
perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME
The regex used is: (.)(?=.*?\1)
.                 : matches any char.
first ()          : remembers the matched single char.
(?=...)           : positive lookahead.
.*?               : matches anything in between.
\1                : the remembered match.
(.)(?=.*?\1)      : matches and remembers any char only if it appears again later in the string.
s///              : Perl's substitution operator.
g                 : does the substitution globally, that is, doesn't stop after the first substitution.
s/(.)(?=.*?\1)//g : deletes a char from the input string only if that char appears again later in the string.
This will not maintain the order of the char in the input because for every unique char in the input string, we retain its last occurrence and not the first.
To keep the relative order intact we can do what KennyTM tells in one of the comments:
reverse the input line
do the substitution as before
reverse the result before printing
The Perl one line for this is:
perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME
Since we are doing print manually after reversal, we don't use the -p flag but use the -n flag.
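The effect of the reversal is easy to check with the same regex in Python's re module. This is a sketch for illustration only; keep_last and keep_first are hypothetical names, and the inputs are assumed to contain no newline.

```python
import re

# Same pattern as the Perl one-liner: a char that appears again later.
DUP = re.compile(r"(.)(?=.*?\1)")

def keep_last(s):
    # Without reversal, every occurrence except the last one is deleted.
    return DUP.sub("", s)

def keep_first(s):
    # Reverse, delete, reverse back: now the first occurrence survives.
    return DUP.sub("", s[::-1])[::-1]

for s in ("EFUAHUU", "UUUEUUUUH", "UJUJHHACDEFUCU"):
    print(keep_last(s), keep_first(s))
```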
I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.
If Perl is not a must, you can also use awk. Here's a fun benchmark of the Perl one-liners posted here against awk; awk is 10+ seconds faster on a file with more than 3 million lines.
$ wc -l <file2
3210220
$ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null
real 1m1.761s
user 0m58.565s
sys 0m1.568s
$ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}' file2 > /dev/null
real 1m32.123s
user 1m23.623s
sys 0m3.450s
$ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null
real 1m17.818s
user 1m10.611s
sys 0m2.557s
$ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null
real 1m20.347s
user 1m13.069s
sys 0m2.896s
perl -ne'my%s;print grep!$s{$_}++,split//'
Here is a solution that I think should work faster than the lookahead one; it is not regexp-based and uses a hash table instead.
perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'
It splits every line into characters and prints only the first appearance by counting appearances inside %seen hashtable
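The same seen-set technique looks like this in Python. It is a minimal sketch of the idea rather than a port of any particular answer, and dedupe is a name chosen for the example.

```python
def dedupe(line):
    seen = set()   # characters already emitted for this line
    out = []
    for ch in line:
        if ch not in seen:
            seen.add(ch)
            out.append(ch)
    return "".join(out)

for s in ("EFUAHUU", "UUUEUUUUH", "UJUJHHACDEFUCU"):
    print(dedupe(s))
```

Each character is looked at once, so the work grows linearly with the line length, whereas the lookahead regex may rescan the rest of the line for every character.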
Tie::IxHash is a good module to store hash order (but may be slow, you will need to benchmark if speed is important). Example with tests:
use Test::More 0.88;
use Tie::IxHash;

sub dedupe {
    my $str  = shift;
    my $hash = Tie::IxHash->new(map { $_ => 1 } split //, $str);
    return join('', $hash->Keys);
}

{
    my $str = 'EFUAHUU';
    is(dedupe($str), 'EFUAH');
}
{
    my $str = 'EFUAHHUU';
    is(dedupe($str), 'EFUAH');
}
{
    my $str = 'UJUJHHACDEFUCU';
    is(dedupe($str), 'UJHACDEF');
}
done_testing();
Use uniq from List::MoreUtils:
perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
If the set of characters that can be encountered is restricted, e.g. only letters, then the easiest solution will be with tr
perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
It will replace all the letters by themselves, leaving other characters unaffected and /s modifier will squeeze repeated occurrences of the same character (after replacement), thus removing duplicates
My bad: it removes only adjacent repeats. Disregard this one.
This looks like a classic application of positive lookbehind, but unfortunately Perl's lookbehind does not accept an arbitrary variable-length pattern. In fact, doing this (matching the preceding text of a character in a string with a full regex whose length is indeterminable) can only be done with .NET regex classes, I think.
However, positive lookahead supports full regexes, so all you need to do is reverse the string, apply positive lookahead (like unicornaddict said):
perl -pe 's/(.)(?=.*?\1)//g'
And reverse it back, because without the reverse that'll only keep the duplicate character at the last place in a line.
MASSIVE EDIT
I've been spending the last half an hour on this, and this looks like this works, without the reversing.
perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME
I don't know whether to be proud or horrified. I'm basically doing the positive lookahead, then substituting on the string with \G specified, which makes the regex engine start its matching from the last place matched (internally represented by the pos() variable).
With test input like this:
aabbbcbbccbabb
EFAUUUUH
ABCBBBBD
DEEEFEGGH
AABBCC
The output is like this:
abc
EFAUH
ABCD
DEFGH
ABC
I think it's working...
Explanation - Okay, in case my explanation last time wasn't clear enough - the lookahead will go and stop at the last match of a duplicate variable [in the code you can do a print pos(); inside the loop to check] and the s/\G//g will remove it [you don't need the /g really]. So within the loop, the substitution will continue removing until all such duplicates are zapped. Of course, this might be a little too processor intensive for your tastes... but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.
From the shell, this works:
sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'
In words: mark every linebreak with a <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.
I found the -e :a -e '$!N; s/\n//; ta' part in a forum post, and I don't understand the separate -e :a part or the $!N part, so if anyone can explain those, I'd be grateful.
Hmm, that one does only consecutive duplicates; to eliminate all duplicates you could do this:
cat test.txt | while read -r line ; do echo "$line" | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done
That puts the characters in each line in alphabetical order though.
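The two behaviours described here, collapsing only adjacent repeats versus sorting and then removing every repeat, can be contrasted in a short Python sketch. The function names are hypothetical and only label the two pipelines.

```python
from itertools import groupby

def squeeze(s):
    # Like running uniq over one-character lines: only runs of the same
    # character collapse, so a later repeat survives.
    return "".join(ch for ch, _ in groupby(s))

def sorted_unique(s):
    # Like sort | uniq: every repeat goes, but the original order is lost.
    return "".join(sorted(set(s)))

print(squeeze("EFUAHUU"))
print(sorted_unique("EFUAHUU"))
```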
use strict;
use warnings;

my ($uniq, $seq, @result);
$uniq = '';

sub uniq {
    $seq = shift;
    for (split '', $seq) {
        # \Q...\E keeps regex metacharacters in the input from being interpreted
        $uniq .= $_ unless $uniq =~ /\Q$_\E/;
    }
    push @result, $uniq;
    $uniq = '';
}

while (<DATA>) {
    uniq($_);
}
print @result;
__DATA__
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
The output:
EFUAH
UEH
UJHACDEF
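For a file named foo.txt containing the data you list (Python 3; dict.fromkeys keeps the first-seen order of the characters on each line, which a plain set would not):
python3 -c "for line in open('foo.txt'): print(''.join(dict.fromkeys(line.rstrip('\n'))))"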