Matching regex at specific positions - regex

Is it possible to match strings using a single regex expression where I could define constraints on their position within the text?
For example given a hex encoded file I would like to match hex representations that correspond to characters whose hex representation is larger than 0x40. The position constraint should be that matching should start at even positions.
E.g. 034673911921 should match at 46,73,91 but not at 92.

You can encode the position inside a regex. For your example of only starting at even positions, that could be something like
/^(?:..)*([4-9a-fA-F].)/

It can be done in two easy to understand steps: first split it into fields and then check the size. Here is an example with sed, I hope it will be of help:
echo 034673911921 | sed -nr 's/([0-9][0-9])/\1 /gp' | sed -n 's/[0-3][0-9]//gp'

You can use something like this:
^(?:..)*([4-9A-Fa-f][\da-fA-F])
which will make sure that an even number of characters precedes your capturing group.
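The anchored patterns above report a single match. As a rough sketch of how all aligned matches could be collected, here is a Python variant (Python and the function name even_position_matches are my own choices, not from the answers). It relies on every match consuming exactly two characters, so consecutive matches stay on even offsets. Like the answers, the class [4-9a-f] also accepts 0x40 itself.

```python
import re

# Each match consumes exactly two characters, so successive matches stay
# aligned to even offsets. Group 1 is set only when the first digit is 4-f.
PAIR = re.compile(r'([4-9a-fA-F][0-9a-fA-F])|..')

def even_position_matches(hex_text):
    """Return (offset, pair) for aligned pairs whose first digit is 4 or higher."""
    return [(m.start(), m.group(1))
            for m in PAIR.finditer(hex_text)
            if m.group(1) is not None]

print(even_position_matches("034673911921"))
# [(2, '46'), (4, '73'), (6, '91')]
```

The "92" that starts at offset 9 is never reported, because no match can begin at an odd offset. This assumes the input is a single line; a newline would break the two-character stepping.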

Related

Grep: omit spaces between search and results

Source text file is: Gross Bushels 1225.35
Grep is: (?<=Bushels\s)[\0-9,\.]*.
Desired return is: 1225.35
I am searching for "Gross Bushels" and want to capture the number directly after that, minus any spaces.
However, the source file may have more than one space between the "s" and the first number. I want to truncate 1 or more spaces, not just one. I understand I probably need some switch on the "\s" but cannot figure out what.
In PCRE you can use \K to reset the start of the match, which lets you avoid the lookbehind (a lookbehind cannot be of variable length). You may use this:
s='Gross Bushels 1225.35'
grep -oP '\bBushels\s+\K[\d,.]+' <<< "$s"
1225.35
If you want your regex to be a bit more strict then use:
\bBushels\h+\K\d+([.,]\d+)*
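If you are not using grep -P: to my knowledge Python's standard re module does not accept \K, so a capture group can play the same role there. A minimal sketch, with a hypothetical helper name bushels_value:

```python
import re

def bushels_value(line):
    """Return the number after 'Bushels', skipping any run of whitespace."""
    # The capture group replaces \K: only group 1 is returned.
    m = re.search(r'\bBushels\s+([\d,.]+)', line)
    return m.group(1) if m else None

print(bushels_value("Gross Bushels    1225.35"))  # 1225.35
```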

How to use an RE to match a line of ===== and the line above

I want to match two lines like the following using a Regular Expression:-
abcmnoxyz
=========
The first line is essentially random, the second line will be all the same character of a limited number of possibles (=, - and maybe a couple more). The lines can probably be required to be the same length but it would be nice if they didn't have to be. It would be OK to have multiple REs, one for each possible 'underline' character.
Can anyone come up with a way to do this?
This regex should do what you're trying to do :
regex = "(.*)\n(.)\2{2,}$"
group 1 will give you the line before the repeated line
EXPLANATION
(.*)\n: match anything followed by a new line
(.)\2{2,}: capture any character, then check that it is followed by the same character two or more times. You don't need to worry about which character is repeated.
If only a known set of characters can be repeated, use a character set such as [=-] instead of the dot (.).
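A small sketch of the same pattern in Python, assuming the character-set variant [=-] and multiline mode so that ^ and $ apply to each line. The name underlined_titles is made up for illustration.

```python
import re

# Group 1: the line above. Group 2: the underline character, which must be
# repeated at least twice more to fill the rest of the second line.
UNDERLINED = re.compile(r'^(.*)\n([=-])\2{2,}$', re.MULTILINE)

def underlined_titles(text):
    """Return every line that is directly followed by a line of = or -."""
    return [m.group(1) for m in UNDERLINED.finditer(text)]

sample = "intro\nabcmnoxyz\n=========\nbody\nsub\n---\n"
print(underlined_titles(sample))  # ['abcmnoxyz', 'sub']
```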
Use Grep's -B Flag
Matching with Alternation
Given your example, you can use extended regular expressions with alternations and a range operator. The -B flag tells grep how many lines before the match to include in the output.
$ grep -E -B1 '^(={5,}|-{5,})$' sample.txt
abcmnoxyz
=========
You can add alternations for additional characters if you want, although boundary markers ought to be as consistent as you can make them. You can also adjust the minimum number of sequential characters required for a match to suit your needs. I used a five-character range in the example because that's what was posted as the criterion in your original topic sentence, and because a shorter boundary marker is more likely to accidentally match truly random text.
Matching with a Character Class
Also, note that the following does the same job, but is a bit more concise. It uses a character class and a backreference to avoid alternations, which can get messy if you add many more boundary characters. Both versions are equally effective at matching your example.
$ grep -E -B1 '^([=-])\1{4,}$' sample.txt
abcmnoxyz
=========
A regex like this
^([^=\v]+)\v=+$
will do. Check it out at example 1
Explanation:
^([^=\v]+) # 1 or more matches of anything that is not a '=' or vertical space \v
\v=+$ # match a vertical space followed by 1 or more '='
If you want to extend this to more characters like '-' you could do this:
^([^=\-\v]+)\v(-|=)\2+$
Look at example 2
And, thanks to Ashish Ranjan, suppose you wanted to have = and/or - on the first line, use something like this:
^(.+)\v(-|=)\2+$
which would even allow you to have a first line like "=====". Having my doubts if OP had this in mind, though. Look at example 3
Hope this works
^([a-z]{1,})\n([=-]{1,})
You may have to try both \n and \r\n, depending on the file format (Unix or DOS).
\1 will give you the first line
\2 will give you the second line
If the same pattern occurs repeatedly in the text, you may get a lot of matches.
This answer works regardless of the number of characters in each line.
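Rather than trying both line endings by hand, one pattern can accept either by making the \r optional. A sketch in Python (my choice of language; find_headings is a hypothetical name):

```python
import re

# \r? before each line end lets the same pattern handle Unix (\n)
# and DOS (\r\n) files.
HEADING = re.compile(r'^([a-z]+)\r?\n([=-]+)\r?$', re.MULTILINE)

def find_headings(text):
    """Return (first line, underline) pairs for either line-ending style."""
    return [(m.group(1), m.group(2)) for m in HEADING.finditer(text)]

print(find_headings("abcmnoxyz\r\n=========\r\n"))
# [('abcmnoxyz', '=========')]
```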

RegEx exclude sets while grouping all characters 2 by 2

I want to modify a binary file with a pattern. I've converted the file to a plain hexdump with xxd (from package vim). The plain file looks like this (only 1 line with no trailing LF):
$ xxd -ps file.bin | tr -d '\n' | tee out.txt
3a0a5354...
I want to remove all patterns that match \x01[^\xFF]*\xFF (an opening token and a closing token and everything between them except another closing token) in the original file, but sed doesn't work like this.
Example Input and Desired Match:
020202020101010101feeffeefff0000...
~~~~~~~~~~~~~~~~~~~~
And I'm thinking about doing this:
sed 's/regex//g' in.file > out.file
Now I'm trying to match all characters 2-by-2 while excluding ff. Any ideas?
This should do the trick:
((..)|01([0-9a-e][0-9a-f]|[0-9a-f][0-9a-e])*ff)*
That is, we match pairs of hexadecimal digits where either the first or the second digit can be f but not both. In the surrounding context we must also match everything two characters at a time to ensure that our matches start from an even digit.
Obviously, you must add something that actually removes the inner group from the output, which is specific to your regex engine. I realized only after posting this that a simple s/ won't do.
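One way the removal step might look, sketched in Python rather than sed (an assumption on my part, as is the name strip_tokens). The token branch is tried first and the catch-all pair branch second, and a replacement function drops tokens while keeping ordinary pairs. Because every match is an even number of characters long and matches follow one another, alignment is preserved. It assumes lowercase hex on a single line, as produced by the xxd command in the question.

```python
import re

# Branch 1: 01, then any pairs other than ff, then the closing ff.
# Branch 2: any other pair, kept unchanged so matching stays aligned.
TOKEN_OR_PAIR = re.compile(
    r'(01(?:[0-9a-e][0-9a-f]|[0-9a-f][0-9a-e])*ff)|(..)'
)

def strip_tokens(hex_text):
    """Remove every aligned 01...ff run from a plain hex dump."""
    return TOKEN_OR_PAIR.sub(
        lambda m: '' if m.group(1) else m.group(2), hex_text)

print(strip_tokens("020202020101010101feeffeefff0000"))  # 020202020000
```

An opening 01 with no closing ff is left in place, since the first branch fails and the pair falls through to the second.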

Regular Expression - Capture and Replace Select Sequences

Take the following file...
ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext
EFGH,5678,http://example.com/wer.exthttp://example/ljn.ext
Note that "ext" is a constant file extension throughout the file.
I am looking for an expression to turn that file into something like this...
ABCD,1234,http://example.com/mpe.ext
ABCD,1234,http://example/xyz.ext
EFGH,5678,http://example.com/wer.ext
EFGH,5678,http://example/ljn.ext
In a nutshell I need to capture everything up to the urls. Then I need to capture each URL and put them on their own line with the leading capture.
I am working with sed to do this and I cannot figure out how to make it work correctly. Any ideas?
If the number of URLs in each line is guaranteed to be two, you can use:
sed -r "s/([A-Z0-9,]{10})(.+\.ext)(.+\.ext)/\1\2\n\1\3/" < input
This does not require the first two fields to be a particular width or limit the set of (non-comma) characters between the commas. Instead, it keys on the commas themselves.
sed 's/\(\([^,]*,\)\{2\}\)\(.*\.ext\)\(http:.*\)/\1\3\n\1\4/' inputfile.txt
You could change the "2" to match any number of comma-delimited fields.
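The comma-keyed idea carries over to other regex engines. Here is a sketch in Python, assuming exactly two URLs per line and that the second one starts with http: (split_urls is a made-up name):

```python
import re

# Group 1: the first two comma-terminated fields.
# Group 2: the first URL, ending at the first .ext followed by http:.
# Group 3: the second URL.
TWO_URLS = re.compile(
    r'^((?:[^,\n]*,){2})(.*?\.ext)(http:.*)$', re.MULTILINE)

def split_urls(text):
    """Put each URL on its own line, repeating the leading fields."""
    return TWO_URLS.sub(r'\1\2\n\1\3', text)

print(split_urls("ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext"))
# ABCD,1234,http://example.com/mpe.ext
# ABCD,1234,http://example/xyz.ext
```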
I have no sed available to me at the moment.
Wouldn't
sed -r 's/(....),(....),(.*\.ext)(http.*\.ext)/\1,\2,\3\n\1,\2,\4/g'
do the trick?
Edit: removed the lazy quantifier

Regex in sed to convert ##XXX## to ${XXX}

I need to use sed to convert all occurrences of ##XXX## to ${XXX}. X could be any alphabetic character or '_'. I know that I need to use something like:
's/##/\${/g'
But of course that won't work properly, as it will convert ##FOO## to ${FOO${
Here's a shot at a better replacement regex:
's/##\([a-zA-Z_]\+\)##/${\1}/g'
Or if you assume exactly three characters :
's/##\([a-zA-Z_]\{3\}\)##/${\1}/g'
Encapsulate the alpha and '_' within '\(' and '\)' and then in the right side reference that with '\1'.
'+' to match one or more alpha and '_' (in case you see ####).
Add the 'g' option to the end to replace all matches (which I'm guessing is what you want to do in this case).
's/##\([a-zA-Z_]\+\)##/${\1}/g'
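For comparison, the same capture-and-reference idea sketched in Python (not part of the question, which asks for sed; convert_placeholders is a hypothetical name). In a Python replacement string, $ and the braces are literal and \1 refers to the captured name.

```python
import re

def convert_placeholders(text):
    """Rewrite every ##NAME## as ${NAME}; NAME is letters and underscores."""
    return re.sub(r'##([A-Za-z_]+)##', r'${\1}', text)

print(convert_placeholders("a ##FOO## b ##BAR_BAZ##"))
# a ${FOO} b ${BAR_BAZ}
```

A bare #### is left alone, since the group requires at least one character.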
Use this:
s/##\([^#]*\)##/${\1}/
BTW, there is no need to escape $ in the right side of the "s" operator.
sed 's/##\([a-zA-Z_][a-zA-Z_][a-zA-Z_]\)##/${\1}/'
The \(...\) remembers...and is referenced as \1 in the expansion. Use single quotes to save your sanity.
As noted in the comments below this, this can also be contracted to:
sed 's/##\([a-zA-Z_]\{3\}\)##/${\1}/'
This answer assumes that the example wanted exactly three characters matched. There are multiple variations depending on what is in between the hash marks. The key part is remembering part of the matched string.
echo "##foo##" | sed 's/##/${/;s//}/'
s changes only the first occurrence by default.
s// reuses the last search pattern, so the second s also matches ##, and by then only the second occurrence is left.
echo '##XXX##' | sed 's/##\([^#]*\)##/${\1}/g'
sed 's/\([^a-z]*[^A-Z]*[^0-9]*\)/(&)/pg'