perl $1 not getting set - regex

I need to go through some Java files and pull out the authors after every #author tag. I started out looking at awk, but awk can't remove the unneeded parts, so I came across this.
What I'm running
perl -n -e'/author (.*)/ && print $1' *.java
This prints nothing. If I do
perl -n -e'/author (.*)/ && print $_' *.java
it will (correctly) print the entire line.
I can do this, and it does accomplish my goal, but I still want to know why my capture group isn't working.
perl -n -e"/\#author / && print $'" *.java
Example input:
/* HelloWorld.java
 * #author Partner of Winning
 * #author Robert LastName
 */
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

You must have a long prompt. A shorter prompt would have revealed the problem.
$ perl -n -e'/author (.*)/ && print $1' *.java
$ bert LastNameing
Your file has Windows line endings (carriage return + line feed), and you are outputting the carriage return without the line feed, causing lines to be overwritten.
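One quick way to confirm that (assuming the sample file above is saved as HelloWorld.java) is to make the carriage returns visible with cat -v; each author line then ends in something like ^M:
$ cat -v HelloWorld.java | grep author
 * #author Partner of Winning^M
 * #author Robert LastName^M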
You can convert the file to a unix file using dos2unix, or you could change your program to handle CRLF line endings. There are a couple of shortcuts you can take here.
Add newlines, effectively neutralizing the CR.
$ perl -nle'/author (.*)/ && print $1' *.java
Robert LastName
Partner of Winning
But that can output text that causes problems, since it still contains the input's CRs. The following avoids matching them:
$ perl -nle'/author ([^\r\n]*)/ && print $1' *.java
Robert LastName
Partner of Winning
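If you would rather leave the files with their CRLF endings and not depend on the capture stopping before the CR, you could also strip the carriage return explicitly before matching (a sketch along the same lines as the commands above):
$ perl -nle's/\r$//; print $1 if /author (.*)/' *.java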

Related

Can not replace multiple empty lines with one

Why does the following not replace multiple empty lines with one?
$ cat some_random_text.txt
foo



bar



test
and this does not work:
$ cat some_random_text.txt | perl -pe "s/\n+/\n/g"
foo



bar



test
I am trying to replace the multiple new lines (i.e. empty lines) with a single empty line, but the regex I use for that does not work, as you can see in the example snippet.
What am I messing up?
Expected outcome is:
foo

bar

test
The reason it doesn't work is that -p tells perl to process the input line by line, and there's never more than one \n in a single line.
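You can convince yourself of that by counting the newlines in each record Perl reads (a quick check against the same file); it never reports more than 1:
$ perl -ne 'print tr/\n//, "\n"' some_random_text.txt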
Better idea:
perl -00 -lpe 1
-00: Enable paragraph mode (input records are terminated by any sequence of 2+ newlines).
-l: Enable autochomp mode (the input record separators are trimmed automatically, so since we're in paragraph mode, all trailing newlines are removed, and output records get "\n\n" added).
-p: Enable automatic input/output (the main code is executed for each input record; anything left in $_ is printed automatically).
-e 1: Use a dummy main program that does nothing.
Taken all together this does nothing except normalize paragraph terminators to exactly two newlines.
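A quick demonstration with made-up input; the runs of blank lines collapse to a single one, and the output ends with the "\n\n" mentioned above:
$ printf 'foo\n\n\n\nbar\n\n\n\ntest\n' | perl -00 -lpe 1
foo

bar

test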
You are executing the following program:
LINE: while (<>) {
    s/\n+/\n/g;
}
continue {
    die "-p destination: $!\n" unless print $_;
}
Since you are reading one line at a time, and since a line is a sequence of characters that aren't line feeds terminated by a line feed, your pattern will never match more than one newline.
The simple fix is to tell Perl to treat the entire file as one line. Also, you don't want to replace every line feed, just those found in sequences of two or more, and you want to replace each such sequence with two line feeds.
perl -0777pe's/\n\n\K\n+//g; s/^\n+//; s/\n\K\n\z//' some_random_text.txt
The second and third substitutions ensure there are no blank lines at the start and end of the file.
While reading the entire file into memory is easy, it's not necessary. The desired output can also be achieved by maintaining a flag that indicates whether the previous line was blank or not.
perl -ne'if (/\S/) { print "\n" if $f; print; $f=0 } else { $f=1 }' some_random_text.txt
This solution also removes blank lines from the start and end of the file.
Given:
$ echo "$txt"
foo



bar



test
You can use sed to reduce the runs of blank lines to a single \n:
$ echo "$txt" | sed '/^$/N;/^\n$/D'
foo

bar

test
Even easier, you can use cat -s:
$ echo "$txt" | cat -s # same output
In perl the idiomatic one-liner is to use -00 for paragraph mode:
$ echo "$txt" | perl -00pe0 # same output
And in awk you have the flexibility of using paragraph mode by setting RS= and then setting ORS= to whatever you want the replacement for runs of \n to be:
$ echo "$txt" | awk '1' RS= ORS="\n\n" # same output
ikegami correctly states that printf 'a\n\n' | ... will produce two trailing newlines with these solutions. That may or may not be an issue.

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename - everything before the date and suffix. I am not sure of the best way to do it in a bash script. Regex?
The names are part of the filename. I am trying to store it in a shell variable. The prefixes will not contain strange characters. The suffix will be the same. The files are stored in a directory - I will use a loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
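For a quick feel for the difference between those four operators (same sample name as above; the comments note what each one strips):
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo "${file%_*}"    # shortest suffix matching _* removed
EXAMPLE_FILE_2
$ echo "${file%%_*}"   # longest suffix matching _* removed
EXAMPLE
$ echo "${file#*_}"    # shortest prefix matching *_ removed
FILE_2_2017-10-12.out
$ echo "${file##*_}"   # longest prefix matching *_ removed
2017-10-12.out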
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
    [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
    cap="${BASH_REMATCH[1]}"
    printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster and more efficient than spawning sed, awk, etc. for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
    if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
        cap="${BASH_REMATCH[1]}"
        printf "%s => %s\n" "$fn" "$cap"
    else
        echo "$fn no match"
    fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
    cap="${fn%_*}"
    printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
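A sketch of that check, combining the parameter expansion with the equality test just described (my own illustration, not part of the original answer):
for fn in *.out; do
    cap="${fn%_*}"
    if [[ $cap == "$fn" ]]; then
        echo "$fn: no underscore to trim" >&2
        continue
    fi
    printf "%s => %s\n" "$fn" "$cap"
done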
Code
See this code in use here
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
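To apply that pattern straight from the shell, a grep built with PCRE support (GNU grep's -P) can print just the match:
$ grep -oP '^\w+(?=_)' file.txt
EXAMPLE_FILE
EXAMPLE_FILE_2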
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done
Or with awk, dropping the last _-separated field:
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
You could also try the following awk solution, which will take care of all the .out files; note this has been written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
Also, my awk version is old, so I am using --re-interval; if you have the latest version of awk you may not need it.
Explanation and non-one-liner form of solution: adding a non-one-liner form of the solution here too, with explanation.
awk --re-interval '   ##Using --re-interval to support ERE intervals in my OLD awk version; with a newer awk it can be removed.
FNR==1{               ##When the very first line of any Input_file is being read, do the following.
  if(val){            ##If the variable named val is NOT NULL, then do the following.
    close(val)        ##Close the Input_file whose name is stored in val, so that we do not hit the TOO MANY FILES OPENED limit.
  };
  split(FILENAME, array, "_[0-9]{4}-[0-9]{2}-[0-9]{2}");   ##Split FILENAME (the current Input_file name) into array, using a separator of 4 digits-2 digits-2 digits; this takes care of the YYYY-MM-DD part, leaving the name part in array[1].
  print array[1];     ##Print the 1st element of array.
  val=FILENAME;       ##Store the current Input_file name in the variable named val, so that we can close it later.
  nextfile            ##Skip the remaining lines of the current file and jump to the next file to save some CPU cycles.
}
' *.out               ##Mentioning all *.out Input_file(s) here.

Extracting string from html file or curl output

I have HTML files, some of which are "minified", meaning a whole website can be in just one line.
I want to filter out the value of ?idsite=, which contains numbers. So the HTML contains something like this: img src="//stats.domains.com/piwik.php?idsite=44.
So the plain output should be "44".
I tried grep but it echoes the whole line and just highlights the value.
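(As an aside, grep can print just the matching part instead of the whole line; with a grep that supports PCRE, such as GNU grep's -P, something like the following would output only the number, where file.html stands in for your input:)
grep -oP 'idsite=\K[0-9]+' file.html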
With perl it could be something like:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| perl -nE 'say /.*idsite=(..)\"/ '
(This assumes that idsite is always two characters! :-) Your regex will most likely need to be more sophisticated than this.)
Putting the snippet from the page you reference above in an HTML file (non-minified) and substituting 44 for the parameter variable, this bit of perl will extract the "44":
perl -nE 'say /.*idsite=(..)/ if /idsite/ ' idsite.html
Translating the one liner to a sed command line would be similar:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| sed -En "s/^.*idsite=(..)\"/\1/p"
This is POSIX sed from FreeBSD (should work on OS X); the -E switch enables "modern" (extended) regexes.
Doing it in awk is left as an exercise for another community member :-)
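A slightly more general variant of the perl capture above (my own tweak) grabs however many digits follow idsite= instead of assuming exactly two characters:
perl -nE 'say for /idsite=(\d+)/g' idsite.html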
Here is a perl way to extract only the trailing digits of strings like src="//stats.domains.com/piwik.php?idsite=44" and run on a bash command line:
echo $src | perl -ne '$_ =~ m/(\d+$)/; print $1'
Here is a python way to do the same thing:
import re
print ', '.join( re.findall(r'\d+$', src))
If there will be a lot of src strings to process, it would be best to compile the regex when using Python as follows:
import re
p = re.compile(r'\d+$')
print ', '.join(p.findall(src))
The import and the compilation only have to be done once.
Here is a Ruby way to do it:
puts src.scan( /\d+$/ ).first
In all cases the regexes end with "$" which matches the end of the string. That is why they match and extract only digits (\d+) at the end of the string.
If you don't need to check whether the idsite is in the value of a src attribute, then all you need is
perl -nE'say $1 if /\bidsite=(\d+)/' myfile.html
$ cat site.html
lorem ipsum idsite='4934' fasdf a
other line
$ sed -n '/idsite/ { s/.*idsite=\([0-9]\+\).*$/\1/; p }' < site.html
4934
Let me know in case you need an explanation of what is going on.

Replace previous when match regular expression

I need to delete the "end of line" of the previous line when the current line does not start with a number (^[!0-9]); basically, if it matches, append it to the line before. I'm a sed & awk n00b, and really like them btw. Thanks.
edit:
$ cat file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;tex
t
broke
4564;1;1;"also
";12,2121;546465
$ "script" file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;text broke
4564;1;1;"also";12,2121;546465
You didn't post any sample input or expected output so this is a guess but it sounds like what you're asking for:
$ cat file
a
b
3
4
c
d
$ awk '{printf "%s%s",(NR>1 && /^[[:digit:]]/ ? ORS : ""),$0} END{print ""}' file
ab
3
4cd
On the OPs newly posted input:
$ awk '{printf "%s%s",(NR>1 && /^[[:digit:]]/ ? ORS : ""),$0} END{print ""}' file
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;textbroke
4564;1;1;"also";12,2121;546465
This might work for you (GNU sed):
sed -r ':a;$!N;s/\n([^0-9]|$)/\1/;ta;P;D' file
Keep two lines in the pattern space and if the start of the second line is empty or does not start with an integer, remove the newline.
if you have Ruby on your system
array = File.open("file").readlines
array.each_with_index do |val, ind|
  array[ind-1].chomp! if not val[/^\d/] # just chomp off the previous item's \n
end
puts array.join
output
# ruby test.rb
1;1;1;text,1
2;4;;8;some;1;1;1;more
100;textbroke
4564;1;1;"also";12,2121;546465

sed/awk replace in all matches

I want to invert all the color values in a bunch of files. The colors are all in the hex format #ff3300 so the inversion could be done characterwise with the sed command
y/0123456789abcdef/fedcba9876543210/
How can I loop through all the color matches and do the char translation in sed or awk?
EDIT:
sample input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
asdfghj
desired output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
asdfghj
EDIT: I changed my response as per your edit.
OK, sed would make this difficult to do. awk could do the trick more or less easily, but I find perl much easier for this task:
$ perl -pe 's/#[0-9a-f]+/$&=~tr%0123456789abcdef%fedcba9876543210%r/ge' <infile >outfile
Basically you find the pattern, then execute the right-hand side, which executes the tr on the match, and substitutes the value there.
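Run against the sample input from the question (assuming it is saved in infile, and a perl recent enough for tr///r, i.e. 5.14+), this reproduces the desired output:
$ perl -pe 's/#[0-9a-f]+/$&=~tr%0123456789abcdef%fedcba9876543210%r/ge' infile
random text... #000000_random_text_#ffffff__
asdf#ff00ff
asdfghj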
The inversion is really a subtraction. To invert a hex, you just subtract it from ffffff.
With this in mind, you can build a simple script to process each line, extract hexes, invert them, and inject them back to the line.
This is using Bash (see arrays, printf -v, += etc) only (no external tools there):
#!/usr/bin/env bash

[[ -f $1 ]] || { printf "error: cannot find file: %s\n" "$1" >&2; exit 1; }

while read -r; do
    # split line with '#' as separator
    IFS='#' toks=( $REPLY )

    for tok in "${toks[@]}"; do
        # extract hex
        read -n6 hex <<< "$tok"

        # is it really a hex ?
        if [[ $hex =~ [0-9a-fA-F]{6} ]]; then
            # compute inversion
            inv="$((16#ffffff - 16#$hex))"

            # zero pad the result
            printf -v inv "%06x" "$inv"

            # replace hex with inv
            tok="${tok/$hex/$inv}"
        fi

        # build the modified line
        line+="#$tok"
    done

    # print the modified line and clean it for reuse
    printf "%s\n" "${line#\#}"
    unset line
done < "$1"
use it like:
$ ./invhex infile > outfile
test case input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
bdf#cvb_foo
asdfghj
#bdfg
processed output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
bdf#cvb_foo
asdfghj
#bdfg
This might work for you (GNU sed):
sed '/#[a-f0-9]\{6\}\>/!b
s//\n&/g
h
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g
y/0123456789abcdef/fedcba9876543210/
H
g
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta
s/\n//' file
Explanation:
/#[a-f0-9]\{6\}\>/!b bail out on lines not containing the required pattern
s//\n&/g prepend every pattern with a newline
h copy this to the hold space
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g delete everything but the required pattern(s)
y/0123456789abcdef/fedcba9876543210/ transform the pattern(s)
H append the new pattern(s) to the hold space
g overwrite the pattern space with the contents of the hold space
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta replace the old pattern(s) with the new.
s/\n// remove the newline artifact from the H command.
This works...
cat test.txt |sed -e 's/\#\([0123456789abcdef]\{6\}\)/\n\#\1\n/g' |sed -e ' /^#.*/ y/0123456789abcdef/fedcba9876543210/' | awk '{lastType=type;type= substr($0,1,1)=="#";} type==lastType && length(line)>0 {print line;line=$0} type!=lastType {line=line$0} length(line)==0 {line=$0} END {print line}'
The first sed command inserts line breaks around the hex codes, which makes it possible to apply the substitution to all lines starting with a hash. There is probably a more elegant solution to merge the lines back again, but the awk command does the job. The only assumption is that there won't be two hex codes directly following each other. If so, this step has to be revised.