How to pipe file path for parsing? - regex

I have a file path that I need to parse; however, I am pretty new to shell scripting and am not really sure what the appropriate or conventional approach would be. Let's say I have a variable representing a file path:
EX=/home/directory/this/might/have/numbers23/12349_2348/more/paths
I need to obtain 12349_2348, which will then become a file name for some other things that I already know how to do. How can I extract this? I know a basic way to do this using regex, which would match with /([0-9])\d+/; however, I determined that by playing around with regexr and have no idea what to do with it from there. I have tried using sed as follows:
echo $EX | sed /([0-9])\d+/
but this does not do anything and just gives me an error. What is a better way to do this, and if sed is the best way to do it, what am I doing wrong? I have looked at tutorials and it seems like I should be able to just match the regular expression this way.

It depends on what you know about the string you're looking for. For example, if you know it's some digits followed by an underscore followed by some more digits, you could do this:
dwalker$ EX=/home/directory/this/might/have/numbers23/12349_2348/more/paths
dwalker$ echo $EX | egrep -o '[0-9]+_[0-9]+'
12349_2348
Or, for exactly 5 digits followed by an underscore followed by 4 digits:
dwalker$ EX=/home/directory/this/might/have/numbers23/12349_2348/more/paths
dwalker$ echo $EX | egrep -o '[0-9]{5}_[0-9]{4}'
12349_2348
If you know you need to take two subdirectories off the end, and that the last component of what remains is your name, you can do this:
$ EX1=`dirname $EX`
$ EX1=`dirname $EX1`
$ basename $EX1
12349_2348
So there are a couple of ways to do it.
egrep is "extended" grep. It lets you use \d for digits and other things. You can see the man page for more details, and for the explanation of -o.

Related

Unable to make the mentioned regular expression work in a sed command

I am trying to make the following regular expression work in a sed command in bash.
^[^<]?(https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&\/\/=]*))[^>]?$
I know the regular expression is correct and works as I expect, so no help is needed with that part. I tested it on an online regular expression tester and it behaves as per my expectations.
Please find a demo of the above regex here.
My requirement:
I want to enclose every URL inside <>. If the URL is already enclosed, then pass it through to the result as-is, as can be seen in the regex link above.
Sample input (in a file named websites.txt):
// List of all legal urls
https://www.google.com/
https://www.fakesite.co.in
https://www.fakesite.co.uk
<https://www.fakesite.co.uk>
<https://www.google.com/>
Expected output (in the file named output.txt):
<https://www.google.com/> // Please notice every url is enclosed in the <>.
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk> // Please notice if the url is already enclosed in <> then it is appended as it is.
<https://www.google.com/>
What I tried in sed:
Since I'm not well-versed in bash commands, I previously wasn't able to capture the group properly in sed, but after reading this answer I figured out that we need to escape the parentheses to be able to capture a group.
Somewhere I read that look-arounds are not supported in (GNU) sed, so I removed the look-arounds too, but that also didn't work. Since it doesn't support look-arounds, I used this regex instead, and it served my purpose.
This is my latest try with the sed command:
sed 's#^[^<]?(https?://(?:www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()#:%_\+.~#?&/=]*))[^>]?$#<\1>#gm;t;d' websites.txt > output.txt
My exact problem:
How can I make the above command work properly? If you run the command sample I attached above in point 3, you'll see it is not replacing the contents properly; it just dumps the contents of websites.txt into output.txt. In the regex demo attached above, however, it works properly, i.e. it encloses all the unenclosed websites inside <>. Any suggestions would be helpful. I'd prefer a sed solution, but if the above command can also be converted to awk, please help me with that too; I'll be highly obliged. Thanks.
After working on it for a while, I got my sed command to work. Below is the command that worked.
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t' websites.txt > output.txt
You can find a sample run of the command here.
Since the regex already fulfills the requirement of the person I'm writing it for, I only needed help with the command syntax (although any improvements are heartily welcome); I want the command to keep working with the same regular expression pattern.
Things I was unaware of previously and have learnt now:
I didn't know anything about the -E flag. Now I know that -E selects POSIX "extended" regular expression syntax ("ERE"). Thanks to @GordonDavisson and @Sundeep. Further reading.
I didn't know for certain that sed doesn't support look-arounds. Now I know that it doesn't. Thanks to @dmitri-chubarov. Further reading.
I didn't know that sed doesn't support non-capturing groups either. Thanks to @Sundeep for solving this part. Further reading.
I didn't know about GNU sed as a specific command-line tool. Thanks to @oguzismail for this. Further reading.
With respect to the command in your answer:
sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
Here are a few notes:
Your posted sample input has one URL per line, so AFAIK the gm;t at the end of your sed command is doing nothing useful; either your sample input is inadequate or your script is wrong.
The hard-coded ranges a-z, A-Z, and 0-9 include different characters in different locales. If you meant to include all (and only) lower-case letters, upper-case letters, and digits, you should replace a-zA-Z0-9 with the POSIX character class [:alnum:]. So either change to a locale-independent character class or specify the locale you need on your command line, depending on your requirements for which characters the regexp should match.
Like most characters, + is literal inside a bracket expression, so it shouldn't be escaped: change \+ to just +.
The bracket expression [^<]? means "0 or 1 occurrences of any character that is not a <", and similarly for [^>]?, so if your "url" contained a random character at the start/end it would still be accepted, e.g.:
echo 'xhttp://foo.bar%' | sed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://foo.bar%>
I think you meant to use <? and >? instead of [^<]? and [^>]? (see the sketch after these notes).
Your regexp would allow a "url" that has no letters:
echo 'http://=.9' | gsed -E 's#^[^<]?(https?://(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&=]*))[^>]?$#<\1>#gm;t'
<http://=.9>
If you edit your question to provide more truly representative sample input and expected output (including cases you do not want to match), then we can help you further. Based on a quick google of what a valid URL is, though, it looks like there are several valid URLs that would be disallowed by your regexp and several invalid ones that would be allowed, so you might want to ask about that in a question tagged with url or similar. With the tags you currently have we can help you implement your regexp, but there may be better people to help with defining it.
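Regarding the <? and >? suggestion above, here is a minimal sketch with a deliberately simplified URL pattern (the full pattern from the answer would drop in where [^<>]+ is):
$ echo '<https://www.google.com/>' | sed -E 's#^<?(https?://[^<>]+)>?$#<\1>#'
<https://www.google.com/>
$ echo 'https://www.fakesite.co.in' | sed -E 's#^<?(https?://[^<>]+)>?$#<\1>#'
<https://www.fakesite.co.in>
Already-enclosed and bare URLs both come out enclosed exactly once.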
If the input file is just a comment followed by a list of URLs, try:
sed '1d;s/^[^<]/<&/;s/[^>]$/&>/' websites.txt
Output:
<https://www.google.com/>
<https://www.fakesite.co.in>
<https://www.fakesite.co.uk>
<https://www.fakesite.co.uk>
<https://www.google.com/>
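Since the question also asks about awk, a rough awk equivalent of the simple approach above (skip the comment line, wrap anything not already wrapped) might be:
awk 'NR > 1 { if ($0 !~ /^</) $0 = "<" $0; if ($0 !~ />$/) $0 = $0 ">"; print }' websites.txt > output.txt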

Regex ignoring linebreaks and "page layout"

I have an assortment of searchable PDF files and I often search for particular patterns in all of them simultaneously, using the pdfgrep command. My regex knowledge is somewhat limited and I'm not sure how to work around linebreaks and page layout.
For example, I would like to find the pattern "ignor.{0,10}layout" in each example below:
This is a rather difficult You see, I would like to ignore
task that I am trying to page layout and still find the
achieve. pattern I am looking for.
This is a rather difficult This is because I would like to ig-
task that I am trying to nore page layout and still find the
achieve. pattern I am looking for.
In both examples, I would like the first two lines to be reported by
pdfgrep -n "ignor.{0,10}layout" *
but it fails to do so because:
there is a linebreak in the middle.
in the first example, there are more than 10 characters between ignor and layout.
in the second example, ignor is cut in half.
Is there a regex that would solve this problem entirely?
pdfgrep does not have grep's -z flag, which treats the input as NUL-separated so that newlines become ordinary, matchable characters. As a workaround you can use pdftotext, which converts the PDF to text and streams it to stdout, where you can pipe it into a regular grep call:
pdftotext SPECIFIC-FILE.pdf - | grep -Pzo "(?s)YOUR\s+QUERY"
This makes it impossible to pass several files through globbing in one call, but you can at least iterate over the glob:
for pdf in *.pdf; do echo -n "$pdf:"; pdftotext "$pdf" - | grep -Pzo "(?s)YOUR\s+QUERY"; done
Please note that if you want to match whitespace, you will almost always want to use \s+, which also matches newlines when -z is enabled. See this other answer for an explanation of the flags.
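Applied to the pattern from the question, the workaround might look like this (file name assumed):
pdftotext SPECIFIC-FILE.pdf - | grep -Pzo '(?s)ignor.{0,10}layout'
With (?s) in effect the dot matches newlines too, so the linebreak between the two lines no longer breaks the match. The hyphenated "ig-" / "nore" case in the second example would still need the hyphen handled explicitly, e.g. with something like ig-?\s*nor in place of ignor.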

Simplest, Safe Method for Trimming File Paths

I have a script that does a lot of file processing, and it is careful enough to receive its paths using null characters as separators, for safety.
However, it processes all paths as absolute (which saves some headaches), but these are a bit unwieldy for output purposes, so I'd like to remove a chunk of the path from my output. Plenty of options spring to mind, but the difficulty is in using them in a way that's safe for any arbitrary path I might encounter, which is where things get a bit trickier.
Here's a quick example:
#!/bin/sh
TARGET="$1"
find "$TARGET" -print0 | while IFS= read -rd '' path; do
    # Process path for output here
    path_str="$path"
    echo "$path_str"
done
So in the above script I want to take path and remove TARGET from it, in the most compatible way possible (e.g. nothing bash-specific). It needs to remove the prefix only from the start of the string, i.e. /foo/bar becomes bar, /foo/bar/foo becomes bar/foo, and /bar/foo remains /bar/foo. It should also cope with any possible characters in a file name, including characters that some file systems support such as tildes and colons, as well as pesky inverted quotation characters.
I've hacked together some messy solutions using sed by first escaping any characters that might break my regular expression, but this is a very messy way of doing things, so I'm hoping there are simpler methods out there. In case there aren't, here's my solution so far:
SAFE_CHARS='s:\([[/.*]\):\\\1:g'
target_safe=$(printf '%s' "$TARGET" | sed "$SAFE_CHARS")
path_str=$(printf '%s' "$path" | sed "s/^$target_safe//g")
There are probably a few more characters that I should be escaping in addition to those, and apologies for any typos.
To remove a prefix from a string,
$ TARGET=/foo/
$ path=/foo/bar
$ echo "${path#$TARGET}"
bar
The # operator for parameter expansion is part of the POSIX standard and will work in any POSIX-compliant shell.
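One caveat: the word after # is itself interpreted as a shell pattern, so if $TARGET can contain glob characters such as *, ?, or [, quote the expansion to match it literally. A quick sketch with a hypothetical bracketed directory name:
$ TARGET='/foo[1]/'
$ path='/foo[1]/bar'
$ echo "${path#"$TARGET"}"
bar
Unquoted, the [1] would be treated as a bracket expression and the prefix would fail to match.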
You can try this simple find:
export TARGET="$1"
find "$TARGET" -exec bash -c 'sed "s|^$TARGET\/||" <<< "$1"' - '{}' \;

How to use ls to list out files that end in numbers

I'm not sure if I'm using regular expressions in bash correctly. I'm on a Centos system using bash shell. In our log directory, there are log files with digits appended to them, i.e.
stream.log
stream.log.1
stream.log.2
...
stream.log.nnn
Unfortunately there are also log files with the new naming convention,
stream.log.2014-02-14
stream.log.2014-02-13
I need to get the files with the old log file naming format. I found something that works, but I'm wondering if there's a more elegant way to do it.
ls -v stream.log* | grep -v 2014
I don't know how regular expressions work in bash and/or what command (other than possibly grep) to pipe output to. The cmd/regex I was thinking of is something like this:
ls -v stream.log(\.\d{0,2})+
Not surprisingly, this didn't work. Maybe my logic is incorrect, but what I wanted to express on the command line was: list files named stream.log with an optional .xyz appended at the end, where xyz is a number from 1 to 999. Please let me know if this is doable or if the solution I came up with is the only way to do something like this. Thanks in advance for your help.
EDIT: Thanks for everyone's prompt comments and replies. I just wanted to bring up that there's also a file called stream.log, without any digits appended, that also needs to make it into my ls listing. I tried the tips in the comments and answer and they work, but they leave out that file.
You can do this with extended pattern matching in bash, e.g.
> shopt -s extglob
> ls *'.'+([0-9])
Where
+(pattern-list)
Matches one or more occurrences of the given patterns
Other useful syntaxes:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
#(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
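For this particular problem, !(pattern-list) can even do the whole job while keeping the plain stream.log file, since the empty suffix also matches. A sketch, assuming extglob is enabled and that only the dated logs contain a dash:
> ls stream.log!(*-*)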
Alternatively, without extended pattern matching, you could use a less neat solution:
ls *'.'{1..1000} 2>/dev/null
And replace 1000 with some larger number if you have a lot of log files. Though I would prefer the grep option to this one.
An approach using sed:
ls -v stream.log* | sed -nE '/log(\.[0-9]+)?$/p'
and one using egrep:
ls -v stream.log* | egrep 'log(\.[0-9]+)?$'
These print out lines that end in "log" and optionally a period and any positive number of digits, followed by the end of the line.
You can do this much more simply by focusing on the dash '-' that appears only in the new logfile format, and filtering it out. Here is the minimal version:
ls -v | grep -v -- -
This may be a little safer if there are different types of logfiles in the same directory:
ls -v stream.log* | grep -v -- -
This also takes care of the one extra file without any confusing regex: plain stream.log contains no dash, so it passes straight through the filter.
Referring to @BroSlow's answer, here is a fix which will include stream.log as well.
shopt -s extglob
ls stream.log*(.)*([0-9])
stream.log stream.log.1 stream.log.2
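A slightly stricter extglob variant, which requires any digits to follow exactly one dot, might be:
shopt -s extglob
ls stream.log?(.+([0-9]))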

Looking for a trailing $ sign using Regex

The pattern I'm looking for looks like $guid1$ with the $ signs on each side. Unfortunately, my regex in grep (and probably elsewhere) interprets that last $ as something else.
"\$guid[0-9]\$" works but "\$guid[0-9]\$" does not. What can I do?
You need to use single quotes around your regex so the shell leaves the backslashes alone:
grep '\$guid1\$' file
OR use fgrep for fixed string search:
fgrep '$guid1$' file
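A quick way to see the difference, with the sample line fed in via echo:
$ echo 'x $guid1$ y' | grep '\$guid[0-9]\$'
x $guid1$ y
$ echo 'x $guid1$ y' | grep "\$guid[0-9]\$"
(no output)
Inside double quotes the shell strips the backslashes, so grep receives $guid[0-9]$ and treats the trailing $ as an end-of-line anchor rather than a literal dollar sign.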