How to split, map, and join in Bash? - regex

I want to create a simple regular expression to match some files. The command npm ls --dev --parseable prints out a bunch of files, for example:
/Users/chetcorcos/code/dev-tool/node_modules/fsevents/node_modules/tough-cookie
/Users/chetcorcos/code/dev-tool/node_modules/fsevents/node_modules/tunnel-agent
/Users/chetcorcos/code/dev-tool/node_modules/fsevents/node_modules/rimraf
/Users/chetcorcos/code/dev-tool/node_modules/fsevents/node_modules/rimraf/node_modules/glob
/Users/chetcorcos/code/dev-tool/node_modules/fsevents/node_modules/rimraf/node_modules/glob/node_modules/inflight
/Users/chetcorcos/code/dev-tool/node_modules/fsevents/node_modules/rimraf/node_modules/glob/node_modules/inflight/node_modules/wrappy
I want to get back a string that looks something like this:
tough-cookie|tunnel-agent|rimraf|inflight|wrappy
To get this, I want to "split by newline, map over basename, and join with a pipe". In JavaScript with Ramdajs, I'd do something like this:
R.pipe(R.split('\n'), R.map(R.split('/')), R.map(R.nth(-1)), R.join('|'))
Any ideas how to do something like this in Bash? What's the idiomatic way of doing this?

Bash doesn't have functional programming primitives built in. It's possible to build them with a hundred lines of code or so, but also not particularly worth it for this kind of use case.
Consider:
content=$(npm ls --dev --parseable | sed -e 's#.*/##' | paste -s -d '|' -)
echo "$content"
...this routes the stdout of NPM into sed, telling it to replace everything up to the last slash in each line with an empty string, and then routing the stdout of sed into paste, using that to combine all lines into a single string with | separating them.
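As a minimal sketch, the same pipeline can be tried against fixed sample input instead of npm. The function name join_basenames and the sample paths are made up for illustration:

```shell
# join_basenames is a hypothetical helper name: it reads one path per line
# on stdin and prints the basenames joined with "|".
join_basenames() {
  # map: sed strips everything up to the last "/" on each line
  # join: paste -s merges all lines into one, separated by "|"
  sed -e 's#.*/##' | paste -s -d '|' -
}

# Sample input standing in for `npm ls --dev --parseable`.
printf '%s\n' \
  /proj/node_modules/tough-cookie \
  /proj/node_modules/rimraf \
  /proj/node_modules/rimraf/node_modules/glob |
  join_basenames
```

The explicit - operand tells paste to read standard input; BSD/macOS paste insists on a file operand, and GNU paste accepts it as well.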
Alternately, to use no tools not built into bash itself (other than your data source, npm):
#!/bin/bash
# note that this requires bash 4.0 or later
mapfile -t lines < <(npm ls --dev --parseable) # read content into array
lines=( "${lines[@]##*/}" ) # trim everything prior to last / in each
(IFS='|'; printf '%s\n' "${lines[*]}") # emit array as a single string with |s

You could just pipe that thing to awk and have awk pick off the last element:
npm ls --dev --parseable | awk -F"/" '{output=output$(NF)"|"} END { sub(/[|]+$/, "", output); print output }'
That awk script splits incoming records on /, appends the last field $(NF) to the variable output followed by a pipe as delimiter, and then, once all input has been read, strips the trailing pipe using sub and prints the result.
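A variant of the same idea, sketched here with sample input, puts the separator in front of every field except the first, so there is no trailing pipe to strip afterwards. The names join_last_field, out, and sep are arbitrary:

```shell
# join_last_field is a hypothetical helper: it reads paths on stdin and
# prints the last "/"-separated field of each line, joined with "|".
join_last_field() {
  # sep is empty for the first record and "|" for every later one.
  awk -F/ '{ out = out sep $NF; sep = "|" } END { print out }'
}

# Sample input standing in for `npm ls --dev --parseable`.
printf '%s\n' \
  /proj/node_modules/rimraf \
  /proj/node_modules/rimraf/node_modules/glob |
  join_last_field
```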

You already have a 'list' of strings, separated by '\n'.
Just map basename on each item (using xargs) - then you'll get list of basenames separated by '\n' plus final '\n'. Then replace each '\n' with '|' symbol:
anycmd | xargs -r -n1 basename | tr '\n' '|'
You may then remove the last '|' with either sed or a second xargs.
anycmd | xargs -r -n1 basename | tr '\n' '|' | sed 's/|$//'
or
anycmd | xargs -r -n1 basename | xargs | tr ' ' '|'
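One caveat: xargs splits its input on whitespace, so a path containing a space would be handed to basename in pieces. A rough alternative, assuming one path per line, is a read loop. The function name basenames_joined and the sample paths are invented for this sketch:

```shell
# basenames_joined is a hypothetical helper: it reads one path per line and
# prints the basenames joined with "|", keeping spaces inside names intact.
basenames_joined() {
  while IFS= read -r path; do
    basename -- "$path"
  done | paste -s -d '|' -
}

# Sample input with a space in one of the paths.
printf '%s\n' '/tmp/my dir/first item' /tmp/second | basenames_joined
```

This still assumes that no path contains a newline.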

Related

How to find specific text in a text file, and append it to the filename?

I have a collection of plain text files which are named as yymmdd_nnnnnnnnnn.txt, which I want to append another number sequence to the filenames, so that they each become named as yymmdd_nnnnnnnnnn_iiiiiiiii.txt instead, where the iiiiiiiii is taken from the one line in each file which contains the text "GST: 123456789⏎" (or similar) at the end of the line. While I am sure that there will only be one such matching line within each file, I don't know exactly which line it will be on.
I need an elegant one-liner solution that I can run over the collection of files in a folder, from a bash script file, to rename each file in the collection by appending the specific GST number for each filename, as found within the files themselves.
Before even getting to the renaming stage, I have encountered a problem with this. Here is what I tried, which didn't work...
# awk '/\d+$/' | grep -E 'GST: ' 150101_2224567890.txt
The grep command alone works perfectly to find the relevant line within the file, but the awk doesn't return just the final digits group. It fails with the error "warning: regexp escape sequence \d is not a known regexp operator". I had assumed that this regex should return any number of digits which are at the end of the line. The text file in question contains a line which ends with "GST: 112060340⏎". Can someone please show me how to make this work, and maybe also to help with the appropriate coding to move the collection of files to the new filenames? Thanks.
Thanks to a comment from @Renaud, I now have the following code working to obtain just the GST registration number from within a text file, which puts me a step closer towards a workable solution.
awk '/GST: / {printf $NF}' 150101_2224567890.txt
I still need to loop this over the collection instead of just specifying one filename. I also need to be able to use the output from @Renaud's contribution to rename the files. I'm getting closer to a working solution, thanks!
This awk should work for you:
awk '$1=="GST:" {fn=FILENAME; sub(/\.txt$/, "", fn); print "mv", FILENAME, fn "_" $2 ".txt"; nextfile}' *_*.txt | sh
To make it more readable:
awk '$1 == "GST:" {
fn = FILENAME
sub(/\.txt$/, "", fn)
print "mv", FILENAME, fn "_" $2 ".txt"
nextfile
}' *_*.txt | sh
Remove | sh from above to see all mv commands together.
You may try
for f in *_*.txt; do echo mv "$f" "${f%.txt}_$(sed '/.*GST: /!d; s///; q' "$f").txt"; done
Drop the echo if you're satisfied with the output.
As you are sure there is only one matching line, you can try:
$ n=$(awk '/GST:/ {print $NF}' 150101_2224567890.txt)
$ mv 150101_2224567890.txt "150101_2224567890_$n.txt"
Or, for all .txt files:
for f in *.txt; do
n=$(awk '/GST:/ {print $NF}' "$f")
if [[ -z "$n" ]]; then
printf '%s: GST not found\n' "$f"
continue
fi
mv "$f" "${f%.txt}_$n.txt"
done
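The loop body can also be wrapped in a small function and tried in a scratch directory first. This is only a sketch: rename_with_gst and demo_dir are hypothetical names, and it assumes the number is the last whitespace-separated field on the line containing "GST:":

```shell
# rename_with_gst is a hypothetical helper; $1 is the file to rename.
rename_with_gst() {
  f=$1
  # Print the last field of the first matching line, then stop reading.
  n=$(awk '/GST:/ { print $NF; exit }' "$f")
  if [ -z "$n" ]; then
    printf '%s: GST not found\n' "$f" >&2
    return 1
  fi
  mv -- "$f" "${f%.txt}_$n.txt"
}

# Demo in a scratch directory so no real files are touched.
demo_dir=$(mktemp -d)
printf 'Invoice\nGST: 112060340\n' > "$demo_dir/150101_2224567890.txt"
rename_with_gst "$demo_dir/150101_2224567890.txt"
ls "$demo_dir"
```

To run it over a whole folder, something like for f in *_*.txt; do rename_with_gst "$f"; done should do.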
Another one-line solution to consider, although perhaps not so elegant.
for original_filename in *_*.txt; do \
new_filename=${original_filename%'.txt'}_$(
grep -E 'GST: ' "$original_filename" | \
sed -E 's/.*GST//g; s/[^0-9]//g'
)'.txt' && \
mv "$original_filename" "$new_filename"; \
done
Output:
150101_2224567890_123456789.txt
If you are open to a multi-line script:
#!/bin/sh
for f in *.txt; do
prefix=$(echo "${f}" | sed s'#\.txt##')
cp "${f}" f1
sed -i s'#GST#%GST#' "./f1"
cat "./f1" | tr '%' '\n' > f2
number=$(cat "./f2" | sed -n '/GST/'p | cut -d':' -f2 | tr -d ' ')
newname="${prefix}_${number}.txt"
mv -v "${f}" "${newname}"
rm -v "./f1"
rm -v "./f2"
done
In general, if you want to make your files easy to work with, leave as many places as possible where they can be split with newlines. It is much easier to alter files by putting what you want to delete or print on its own line than it is to search for things horizontally with regular expressions.

shell multiline selection from word to character

.textexpandrc
[yoro] よろしくお願いします。
[ohayo] おはようございます。
元気ですか?
[otsu] お疲れさまでします。
Looking for
$ KEY=ohayo; awk "???" ~/.textexpandrc
おはようございます。
元気ですか?
awk or sed is fine, but I'd like to avoid using a mix of awk/sed/perl/tr/cut etc because I'm under the impression that awk is robust enough to handle this on its own.
The best I could find on my own was
$ KEY=ohayo; awk "/\[${KEY}/,/\[otsu/" ~/.textexpandrc | sed "s/\[${KEY}\] //" | grep -v otsu
おはようございます。
元気ですか?
But I need to know the next key in advance (not impossible but ugly). Strangely, if asking awk to search until the square bracket, it fails to select a multiline
$ KEY=ohayo; awk "/\[${KEY}/,/\[/" ~/.textexpandrc
[ohayo] おはようございます。
Currently using a single-line parser solution as follows
#!/usr/bin/env bash
CONFIG=${HOME}/.textexpandrc
ALL_KEYS=$(sed 's/\].*/]/' ${CONFIG} | tr -d '[]')
KEY=$(echo $ALL_KEYS | rofi -sep ' ' -dmenu -p "autocomplete")
grep "\[${KEY}\]" $CONFIG | sed "s/\[${KEY}\] //" | xsel -ib # ← HERE
xdotool key ctrl+shift+v
If you set up the RS and FS variables to match [ and ], this works quite well:
awk 'BEGIN{ RS="\["; FS="\] " }; $1 ~ key { print $2 }' key=ohayo tmp.txt
You pass in the parameter you're searching for using key=.... on the command line instead of setting a variable. This makes it much easier to write the awk script within single quotes.
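As a sketch of the same RS/FS idea wrapped in a function: lookup_snippet is a hypothetical name, the key is passed with -v, and the comparison uses == so that a key such as ohayo does not also match a longer key that merely contains it. It assumes the snippet text itself never contains "[" or "] ":

```shell
# lookup_snippet is a hypothetical helper: $1 is the key, $2 the config file.
lookup_snippet() {
  awk -v key="$1" '
    # Records start at "[" and the key ends at "] ".
    BEGIN { RS = "["; FS = "] " }
    # Exact key match; $2 already ends with a newline.
    $1 == key { printf "%s", $2 }
  ' "$2"
}

# Sample config copied from the question, written to a temporary file.
rc=$(mktemp)
cat > "$rc" <<'EOF'
[yoro] よろしくお願いします。
[ohayo] おはようございます。
元気ですか?
[otsu] お疲れさまでします。
EOF
lookup_snippet ohayo "$rc"
rm -f "$rc"
```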

AWK: get file name from LS

I have a list of file names (name plus extension) and I want to extract the name only without the extension.
I'm using
ls -l | awk '{print $9}'
to list the file names and then
ls -l | awk '{print $9}' | awk /(.+?)(\.[^.]*$|$)/'{print $1}'
But I get an error on escaping the (:
-bash: syntax error near unexpected token `('
The regex (.+?)(\.[^.]*$|$) to isolate the name has a capture group and I think it is correct; what I don't get is why it is not working within awk syntax.
My list of files is like this ABCDEF.ext in the root folder.
Your specific error is caused by the fact that your awk command is incorrectly quoted. The single quotes should go around the whole command, not just the { action } block.
However, you cannot use capture groups like that in awk. $1 refers to the first field, as defined by the input field separator (which in this case is the default: one or more "blank" characters). It has nothing to do with the parentheses in your regex.
Furthermore, you shouldn't start from ls -l to process your files. I think that in this case your best bet would be to use a shell loop:
for file in *; do
printf '%s\n' "${file%.*}"
done
This uses the shell's built-in capability to expand * to the list of everything in the current directory and removes the .* from the end of each name using a standard parameter expansion.
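To illustrate what that expansion does with a few made-up names (strip_ext is a hypothetical helper, not part of the original answer):

```shell
# strip_ext is a hypothetical helper that prints its argument without the
# last extension. ${1%.*} removes the shortest trailing match of ".*".
strip_ext() {
  printf '%s\n' "${1%.*}"
}

strip_ext ABCDEF.ext      # prints ABCDEF
strip_ext archive.tar.gz  # prints archive.tar (only the last extension goes)
strip_ext README          # prints README (no dot, so nothing is removed)
```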
If you really really want to use awk for some reason, and all your files have the same extension .ext, then I guess you could do something like this:
printf '%s\0' * | awk -v RS='\0' '{ sub(/\.ext$/, "") } 1'
This prints all the paths in the current directory, and uses awk to remove the suffix. Each path is followed by a null byte \0 - this is the safe way to pass lists of paths, which in principle could contain any other character.
Slightly less robust but probably fine in most cases would be to trust that no filenames contain a newline, and use \n to separate the list:
printf '%s\n' * | awk '{ sub(/\.ext$/, "") } 1'
Note that the standard tool for simple substitutions like this one would be sed:
printf '%s\n' * | sed 's/\.ext$//'
(.+?) is a PCRE construct. awk uses EREs, not PCREs. Also you have the opening script delimiter ' in the middle of the script AFTER the condition instead of where it belongs, before the start of the script.
The syntax for any command (awk, sed, grep, whatever) is command 'script', so this should be awk 'condition{action}', not awk condition'{action}'.
But, in any case, as mentioned by @Aaron in the comments - don't parse the output of ls, see http://mywiki.wooledge.org/ParsingLs
Try this.
ls -l | awk '{ s=""; for (i=9;i<=NF;i++) { s = s" "$i }; sub(/\.[^.]+$/,"",s); print s}'
Notes:
Parsing the output of ls -l is fragile.
It doesn't check what the items are (files? directories? ...); it strips extensions everywhere.
Read the other answers :D
If the extension is always the same pattern try a sed replacement:
ls -l | awk '{print $9}' | sed 's/\.ext$//'

Editing this Script to my needs

I want to use this Script to build a custom Wordlist.
Wordlist Script
This Script will build a Wordlist with only loweralpha Chars. But I want lower/upper Chars and Numbers.
The Output should be like this example:
test
123test
test123
Test
123Test
Test123
I don't know how to change it. I would be really happy if you could help me out with this.
I tried some tutorials for grep and regex but I don't understand anything.
Replace the line 18 of the script
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | tr '[:upper:]' '[:lower:]' | sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' | sort -u`;
With this:
page=`grep '' -R "./temp/" | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | tr " " "\n" | sort -u`;
If you have a look at it, you can see how it
replaces " " with "\n",
changes cases
filters by length
sorts
You can remove bits from that pipe chain and see how the output changes.
Delete this bit from the script:
tr '[:upper:]' '[:lower:]' |
That will leave case alone.
There's also a bit in wordlist.sh that only selects words from 9 to 25 characters, which you could delete, or change if you prefer a different range. Its first expression also drops every word containing anything other than a letter, so words with digits are removed too:
sed -e '/[^a-zA-Z]/d' -e '/^.\{9,25\}$/!d' |
Or you could try a simpler strategy: download and install w3m, a command-line web browser, and replace the complicated line in wordlist.sh with this:
page=`grep '' -R "./temp/" | w3m -dump -T text/html | grep -o '\w\+' | sort -u`
The grep is a (weird) way to get all the text from the HTML files, then w3m -dump -T text/html reads that HTML on standard input and gets rid of all the HTML tags and other non-display stuff, and grep -o '\w\+' matches any word.
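If installing w3m is not an option, a rough sketch with only standard tools is shown here. The function name extract_words is made up, and the tag stripping assumes that no HTML tag spans more than one line:

```shell
# extract_words is a hypothetical helper: it reads HTML on stdin and prints
# the unique words made of letters and digits, with case preserved.
extract_words() {
  # 1. drop anything that looks like <...> on a single line
  # 2. turn every run of non-alphanumeric characters into one newline
  # 3. drop empty lines
  # 4. sort and de-duplicate (LC_ALL=C keeps the order predictable)
  sed -e 's/<[^>]*>//g' |
    tr -cs '[:alnum:]' '\n' |
    sed -e '/^$/d' |
    LC_ALL=C sort -u
}

printf '<p>Test123 test</p> 123test\n' | extract_words
```

A length filter such as sed -e '/^.\{9,25\}$/!d' can be added before the sort if only words in a given range are wanted.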

Regex w/grep against tnsnames.ora

I am trying to print out the contents of a TNS entry from the tnsnames.ora file to make sure it is correct from an Oracle RAC environment.
So if I do something like:
grep -A 4 "mydb.mydomain.com" $ORACLE_HOME/network/admin/tnsnames.ora
I will get back:
mydb.mydomain.com =
(DESCRIPTION =
(ADDRESS =
(PROTOCOL = TCP)(HOST = myhost.mydomain.com)(PORT = 1521))
  (CONNECT_DATA =(SERVER = DEDICATED)(SERVICE_NAME=mydb)))
Which is what I want. Now I have an environment variable being set for the JDBC connection string by an external program when the shell script gets called like:
export DB_URL=#myhost.mydomain.com:1521/mydb
So I need to get TNS alias mydb.mydomain.com out of the above string. I'm not sure how to do multiple matches and reorder the matches with regex and need some help.
grep #.+: $DB_URL
I assume will get the
#myhost.mydomain.com:
but I'm looking for
mydb.mydomain.com
So I'm stuck at this part. How do I get the TNS alias and then pipe/combine it with the initial grep to display the text for the TNS entry?
Thanks
update:
@mklement0 @Walter A - I tried your ways but they are not exactly what I was looking for.
echo "#myhost.mydomain.com:1521/mydb" | grep -Po "#\K[^:]*"
echo "#myhost.mydomain.com:1521/mydb" | sed 's/.*#\(.*\):.*/\1/'
echo "#myhost.mydomain.com:1521/mydb" | cut -d"#" -f2 | cut -d":" -f1
echo "#myhost.mydomain.com:1521/mydb" | tr "#:" "\t" | cut -f2
echo "#myhost.mydomain.com:1521/mydb" | awk -F'[#:]' '{ print $2 }'
All these methods get me back: myhost.mydomain.com
What I am looking for is actually: mydb.mydomain.com
Note:
- For brevity, the commands below use bash/ksh/zsh here-string syntax to send strings to stdin (<<<"$var"). If your shell doesn't support this, use printf %s "$var" | ... instead.
The following awk command will extract the desired string (mydb.mydomain.com) from $DB_URL (#myhost.mydomain.com:1521/mydb):
awk -F '[#:/]' '{ sub("^[^.]+", "", $2); print $4 $2 }' <<<"$DB_URL"
-F '[#:/]' tells awk to split the input into fields by either # or : or /. With your input, this means that the pieces of interest are in the second field ($2) and the fourth field ($4). The sub() call removes the first .-separated component from $2, and the print call pieces together the result.
To put it all together:
domain=$(awk -F '[#:/]' '{ sub("^[^.]+", "", $2); print $4 $2 }' <<<"$DB_URL")
grep -F -A 4 "$domain" "$ORACLE_HOME/network/admin/tnsnames.ora"
You don't strictly need intermediate variable $domain, but I've added it for clarity.
Note how -F was added to grep to specify that the search term should be treated as a literal, so that characters such as . aren't treated as regex metacharacters.
Alternatively, for more robust matching, use a regex that is anchored to the start of the line with ^, and \-escape the . chars (using shell parameter expansion) to ensure their treatment as literals:
grep -A 4 "^${domain//./\.}" "$ORACLE_HOME/network/admin/tnsnames.ora"
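For comparison, the alias can also be derived with shell parameter expansion alone. This is just a sketch: tns_alias is a hypothetical function name, and it assumes the URL always has the shape #host.domain:port/service shown in the question:

```shell
# tns_alias is a hypothetical helper: $1 is a URL such as
# "#myhost.mydomain.com:1521/mydb"; it prints "mydb.mydomain.com".
tns_alias() {
  url=${1#*'#'}        # drop everything through the first "#"
  host=${url%%:*}      # keep the part before the first ":"
  service=${url##*/}   # keep the part after the last "/"
  # Replace the first host component with the service name.
  printf '%s.%s\n' "$service" "${host#*.}"
}

tns_alias '#myhost.mydomain.com:1521/mydb'
```

The result can then be passed to grep as before, for example grep -F -A 4 "$(tns_alias "$DB_URL")" "$ORACLE_HOME/network/admin/tnsnames.ora".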
You can get a part of a string with
# Only GNU-grep
echo "#myhost.mydomain.com:1521/mydb" | grep -Po "#\K[^:]*"
# or
echo "#myhost.mydomain.com:1521/mydb" | sed 's/.*#\(.*\):.*/\1/'
# or
echo "#myhost.mydomain.com:1521/mydb" | cut -d"#" -f2 | cut -d":" -f1
# or, when the string already is in a var
echo "${DB_URL#*#}" | cut -d":" -f1
# or using a temp var
tmpvar="${DB_URL#*#}"
echo "${tmpvar%:*}"
I had skipped the alternative awk, which was already given by @mklement0:
echo "#myhost.mydomain.com:1521/mydb" | awk -F'[#:]' '{ print $2 }'
The awk solution is straightforward. When you want to use the same approach without awk, you can do something like
echo "#myhost.mydomain.com:1521/mydb" | tr "#:" "\t" | cut -f2
or the ugly
echo "#myhost.mydomain.com:1521/mydb" | (IFS='#:' read -r _ url _; echo "$url")
What is happening here?
After setting the new IFS, I want to take the second word of the input. The first and third word(s) are caught in the dummy variable _ (you could have named them dummyvar1 and dummyvar2). The pipe | creates a subprocess, so you need the ( ) to keep reading and displaying the variable url in the same process.
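If the value is needed later in the same shell, one way around the subprocess is to feed read from a here-document instead of a pipe. A small sketch, with skipped and rest as arbitrary placeholder names for the unused words:

```shell
# The here-document replaces the pipe, so read runs in the current shell
# and url is still set afterwards.
IFS='#:' read -r skipped url rest <<'EOF'
#myhost.mydomain.com:1521/mydb
EOF

printf '%s\n' "$url"
```

Because IFS is set only as a prefix to read, the shell's normal IFS is unchanged for the following commands.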