Combine different regex together in xargs command - regex

I have these two regexes:
find ... | xargs perl -pi -e 's/\t/ /g'
find ... | xargs perl -pi -e 's/[^\S\n]+$//g'
First one changes tabs to 4 spaces, and second removes any trailing white space at the end of each line.
I am tempted to combine the two, but don't want to break something. Besides, they are doing different things -- one is adding spaces, another is removing spaces. Is there a safe way to merge these two together or just leave them as is?

You can do this:
find ... | xargs perl -l -pi -e 's/\t/ /g; s/\s+$//'
Since the second find is operating on the results of the first one, it's safe to perform each command in succession in a single perl invocation.

I would leave the expressions separate, but you can perform them both with a single call to perl:
find ... | xargs perl -pi -e 's/\t/ /g;' -e 's/[^\S\n]+$//g;'

Related

Delete any special character using Sed

I have yet another list of subdomain. I want to remove any Wildcard subdomain which include these special characters:
()!&$#*+?
Mostly, the data are prefixly random. Also, could be middle. Here's some sample of output data
(www.imgur.com
***************diet.blogspot.com
*-1.gbc.criteo.com
------------------------------------------------------------i.imgur.com
This has been quite an inconvenience while scanning through the list. As always, I'm trying sed to fix it:
sed -i "/[!()#$&?+]/d" foo.txt ###Didn't work
sed -i "/[\!\(\)\#\$\&\?\+]/d" ###Escaping char didn't work
Performing commands above still result in an unchanged list and the file still on original state. I'm thinking that; to fix this is to pipe series of sed command in order to remove it one by one:
cat foo.txt | sed -e "/!/d" -e "/#/d" -e "/\*/d" -e "/\$/d" -e "/(/d" -e "/)/d" -e "/+/d" -e "/\'/d" -e "/&/d" >> foo2.txt
cat foo.txt | sed -e "/\!/d" | sed -e "/\#/d" | sed -e "/\*/d" | sed -e "/\$/d" | sed -e "/\+/d" | sed -e "/\'/d" | sed -e "/\&/d" >> foo2.txt
If escaping all special char doesn't work, it must've been my false logic. Also tried with /g still doesn't increase my luck.
As a side note: I don't want - to be deleted as some valid subdomain can have - character:
line-apps.com
line-apps-beta.com
line-apps-rc.com
line-apps-dev.com
Any help would be cherished.
Using sed
$ sed '/[[:punct:]]/d' input_file
This should delete all lines with special characters, however, it would help if you provided sample data.
To do what you're trying to do in your answer (which adds [ and ] and more to the set of characters in your question) would be:
sed '/[][!?+,#$&*() ]/d'
or just:
grep -v '[][!?+,#$&*() ]'
Per POSIX to include ] in a bracket expression it must be the first character otherwise it indicates the end of the bracket expression.
Consider printing lines you want instead of deleting lines you do not want, though, e.g.:
grep '^[[:alnum:]_.-]$' file
to print lines that only contain letters, numbers, underscores, dashes, and/or periods.

replace string with underscore and dots using sed or awk

I have a bunch of files with filenames composed of underscore and dots, here is one example:
META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
I want to remove the part that contains .bed.nodup.sortedbed.roadmap.sort.fgwas.gz. so the expected filename output would be META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
I am using these sed commands but neither one works:
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo $stringZ | sed -e 's/\([[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.\)//g'
echo $stringZ | sed -e 's/\[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.//g'
Any solution is sed or awk would help a lot
Don't use external utilities and regexes for such a simple task! Use parameter expansions instead.
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo "${stringZ/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz}"
To perform the renaming of all the files containing .bed.nodup.sortedbed.roadmap.sort.fgwas.gz, use this:
shopt -s nullglob
substring=.bed.nodup.sortedbed.roadmap.sort.fgwas.gz
for file in *"$substring"*; do
echo mv -- "$file" "${file/"$substring"}"
done
Note. I left echo in front of mv so that nothing is going to be renamed; the commands will only be displayed on your terminal. Remove echo if you're satisfied with what you see.
Your regex doesn't really feel too much more general than the fixed pattern would be, but if you want to make it work, you need to allow for more than one lower case character between each dot. Right now you're looking for exactly one, but you can fix it with \+ after each [[:lower:]] like
printf '%s' "$stringZ" | sed -e 's/\([[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.\)//g'
which with
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"
give me the output
META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
Try this:
#!/bin/bash
for line in $(ls -1 META*);
do
f2=$(echo $line | sed 's/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz//')
mv $line $f2
done

sed - exchange words with delimiter

I'm trying swap words around with sed, not replace because that's what I keep finding on Google search.
I don't know if it's the regex that I'm getting wrong. I did a search for everything before a char and everything after a char, so that's how I got the regex.
echo xxx,aaa | sed -r 's/[^,]*/[^,]*$/'
or
echo xxx/aaa | sed -r 's/[^\/]*/[^\/]*$/'
I am getting this output:
[^,]*$,aaa
or this:
[^,/]*$/aaa
What am I doing wrong?
For the first sample, you should use:
echo xxx,aaa | sed 's/\([^,]*\),\([^,]*\)/\2,\1/'
For the second sample, simply use a character other than slash as the delimiter:
echo xxx/aaa | sed 's%\([^/]*\)/\([^/]*\)%\2/\1%'
You can also use \{1,\} to formally require one or more:
echo xxx,aaa | sed 's/\([^,]\{1,\}\),\([^,]\{1,\}\)/\2,\1/'
echo xxx/aaa | sed 's%\([^/]\{1,\}\)/\([^/]\{1,\}\)%\2/\1%'
This uses the most portable sed notation; it should work anywhere. With modern versions that support extended regular expressions (-r with GNU sed, -E with Mac OS X or BSD sed), you can lose some of the backslashes and use + in place of * which is more precisely what you're after (and parallels \{1,\} much more succinctly):
echo xxx,aaa | sed -E 's/([^,]+),([^,]+)/\2,\1/'
echo xxx/aaa | sed -E 's%([^/]+)/([^/]+)%\2/\1%'
With sed it would be:
sed 's#\([[:alpha:]]\+\)/\([[:alpha:]]\+\)#\2,\1#' <<< 'xxx/aaa'
which is simpler to read if you use extended posix regexes with -r:
sed -r 's#([[:alpha:]]+)/([[:alpha:]]+)#\2/\1#' <<< 'xxx/aaa'
I'm using two sub patterns ([[:alpha:]]+) which can contain one or more letters and are separated by a /. In the replacement part I reassemble them in reverse order \2/\1. Please also note that I'm using # instead of / as the delimiter for the s command since / is already the field delimiter in the input data. This saves us to escape the / in the regex.
Btw, you can also use awk for that, which is pretty easy to read:
awk -F'/' '{print $2,$1}' OFS='/' <<< 'xxx/aaa'

Sed substitute recursively

echo ddayaynightday | sed 's/day//g'
It ends up daynight
Is there anyway to make it substitute until no more match ?
My preferred form, for this case:
echo ddayaynightday | sed -e ':loop' -e 's/day//g' -e 't loop'
This is the same as everyone else's, except that it uses multiple -e commands to make the three lines and uses the t construct—which means "branch if you did a successful substitution"—to iterate.
This might work for you:
echo ddayaynightday | sed ':a;s/day//g;ta'
night
The g flag deliberately doesn't re-match against the substituted portion of the string. What you'll need to do is a bit different. Try this:
echo ddayaynightday | sed $':begin\n/day/{ s///; bbegin\n}'
Due to BSD Sed's quirkiness the embedded newlines are required. If you're using GNU Sed you may be able to get away with
sed ':begin;/day/{ s///; bbegin }'
with bash:
str=ddayaynightday
while true; do tmp=${str//day/}; [[ $tmp = $str ]] && break; str=$tmp; done
echo $str
The following works:
$ echo ddayaynightday | sed ':loop;/day/{s///g;b loop}'
night
Depending on your system, the ; may not work to separate commands, so you can use the following instead:
echo ddayaynightday | sed -e ':loop' -e '/day/{s///g
b loop}'
Explanation:
:loop # Create the label 'loop'
/day/{ # if the pattern space matches 'day'
s///g # remove all occurrence of 'day' from the pattern space
b loop # go back to the label 'loop'
}
If the b loop portion of the command is not executed, the current contents of the pattern space are printed and the next line is read.
Ok, here they're: while and strlen in bash.
Using them one may implement my idea:
Repeat until its length will stop changing.
There's neither way to set flag nor way to write such regex, to "substitute until no more match".

Command line perl regex to find dangling Javascript commas

Hello I'm seeking a Perl one-liner if possible, to scan all of our Javascript files, to find so-called "rogue commas". That is, commas that come at the end of an array or object data structure, and therefore commas that come immediately before either an ']' or '}' character.
The main challenge I'm encountering is how to make the regex that checks for ] or } non-greedy. The regex needs to span multiple lines, since the comma could end one line, followed by the } or ] on the next line, but I've figured out how to do that with the help of the book Minimal Perl.
Also, I'd like to be able to pipe a number of files to this Perl regex (via find/xargs), and so I'd like to print the name of the input file, and the line number within that file.
Below are various attempts of mine that are not particularly close to working straight from my bash history. Thanks in advance:
find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+$/ and print $_;' find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+/ and print $_;' find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+\]/ and print $_;' find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+[\]\}]/ and print $_;' find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+[\]\}]/ and print $_;' | wc -l find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+[\]\}]/ and print $_;' | wc -l find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+}/ and print $_;' | wc -l find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+}?/ and print $_;' | wc -l find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,\s+}+?/ and print $_;' | wc -l find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,$/' and print $_;' find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/,$/ and print $_;' find winhome/workspace/SsuExt4Zoura/quotetool/js
-name "*.js" | xargs perl -00 -wnl -e '/\,$/ and print $_;'
With the -00 switch, you change the record separator, and (probably) get the whole file in one line, which allows you to find multi-line trailing commas. However, it also makes the print $_ print the whole line. What you probably want is printing the file name:
print $ARGV if /,\s*[\]\}]/;
Most of these look like a decent approach to the problem, with one small issue. You probably want ,\s*(?:$|[\]\}]) rather than ,\s+(?:$|[\]\}]) as there may not be even one space. Your + quantifier might miss forms like ,].
Having said that, JavaScript can be pretty subtle, and you might well encounter comments and other stuff, which might legitimately end with a comma before something unexpected, like the end of the file or a }. A cheap solution might be to use a perl s/// form to simply remove all the comments before applying your tests.
If you're handling JSON, JSON::XS can enforce validity with its relaxed option.
If you need real validation, something like JSLint is probably the way to go. I've had a lot of success with using Rhino to embed JavaScript (a bit less using Perl with SpiderMonkey) and using this as a set of tests against JavaScript code would be a nice way to ensure reliability over time.
An easy solution to this problem is to use comma-first style. Since commas never come at the end of a line, there is never a 'trailing comma'.
For example:
var myObj = { foo: 1
, bar: 2
, baz: 4
}
You can easily detect if a comma is missing, it's obvious which elements belong to what set of braces, and there's never a 'trailing comma problem'.
See also https://gist.github.com/357981