Having trouble parsing tcpdump output with regex

In particular, I'm trying to get the "Host: ..." part of an HTTP header of the HTTP request packet.
One instance is something like this:
.$..2~.:Ka3..E..D'.#.#..M....J}.e...P...q...W................g.o3GET./.HTTP/1.1...$..2~.:Ka3..E..G'.#.#..I....J}.e...P.......W................g..\host:.domain.com..
Another is this:
.$..2~.:Ka3..E..D'.#.#..M....J}.e...P...q...W................g.o3GET./.HTTP/1.1...$..2~.:Ka3..E..G'.#.#..I....J}.e...P.......W................g..\host:.domain.com..Connection:.Keep-Alive....
Note that this is the ASCII output. I want to extract that host. My initial regex was:
[hH]ost:\.(.*)..
This works for the first case, but does not work for the second one. In particular, because .* is greedy, for the second one it will extract: "domain.com..Connection:.Keep-Alive.."
I would appreciate some help with creating a general regex that works in all cases.

Use this:
(?<=host:\.)(?:\.?[^.])+
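The lookbehind (?<=host:\.) asserts that what immediately precedes the match is host: followed by a period.
(?:\.?[^.]) matches an optional period, then one character that is not a period.
The + repeats that group one or more times, so the match stops as soon as it meets two periods in a row (or a period at the very end of the input).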
Reference
Lookahead and Lookbehind Zero-Length Assertions
Mastering Lookahead and Lookbehind
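As a rough check of the pattern above, here is a minimal Python sketch. Only the regex comes from the answer; the extract_host helper and the shortened sample strings are my own, added for illustration.

```python
import re

# Regex taken from the answer; the lookbehind is fixed-width, so Python's re accepts it.
HOST_RE = re.compile(r"(?<=host:\.)(?:\.?[^.])+")

def extract_host(ascii_dump):
    """Return the text after 'host:.' up to the first double period, or None."""
    m = HOST_RE.search(ascii_dump)
    return m.group(0) if m else None

# Shortened versions of the two dumps from the question (hypothetical truncation).
first = r"GET./.HTTP/1.1...g..\host:.domain.com.."
second = r"GET./.HTTP/1.1...g..\host:.domain.com..Connection:.Keep-Alive...."

print(extract_host(first))
print(extract_host(second))
```

Note that the lookbehind is case-sensitive as written; if the capture may contain "Host:", compiling with re.IGNORECASE is one option.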

Confusion regarding regex pattern

I have tried to write a regex to catch certain words in a sentence, but it is not working. The regex below only works when I give an exact match.
[\s]*((delete)|(exec)|(drop\s*table)|(insert)|(shutdown)|(update)|(\bor\b))
Let's say I send an HTTP header: with headerName = insert it works,
but it does not work when I give headerName = awesome insert number
--edit--
@user1180, yes, I can use prepared statements, but we are also looking into the regex part.
@Marcel and Wiktor, yes, it works on that website. I guess my tool is not recognizing the regex. I am using Mulesoft ESB, where "matches" is described as: Matches when the evaluated value fits a given regular expression (regex), specifically a regex "flavor" supported by Java.
It is using something like this,
matches /\+(\d+)\s\((\d+)\)\s(\d+\-\d+)/ and I am not aware of how to write my usecase in this regex format.
My use case is to catch SQL injection patterns, which would check the request header/query param for delete, exec, drop\s*table, insert, shutdown, update, or the word "or".
Since your regex must match the whole input you need to wrap the pattern with .*, something similar to (?s).*(<YOUR PATTERN>).*.
Use
(?s).*\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b.*
Details
(?s) - turns on DOTALL mode where . matches any char
.* - any 0+ chars, as many as possible
\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b - any one of the whole words (note \b is a word boundary construct) in the group
.* - any 0+ chars, as many as possible
I also replaced drop\s*table with drop\s+table since I guess droptable is not expected.
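To see the effect of whole-input matching outside Mule, here is a small Python sketch. It uses re.fullmatch as a stand-in for a matcher that must cover the entire value; that the Mule check behaves the same way is an assumption on my part, and the names BARE, WRAPPED and whole_input_matches are hypothetical.

```python
import re

# The question's pattern, unchanged.
BARE = r"[\s]*((delete)|(exec)|(drop\s*table)|(insert)|(shutdown)|(update)|(\bor\b))"
# The pattern from this answer, wrapped in .* on both sides.
WRAPPED = r"(?s).*\b(delete|exec|drop\s+table|insert|shutdown|update|or)\b.*"

def whole_input_matches(pattern, value):
    """Return True only if the pattern covers the entire value."""
    return re.fullmatch(pattern, value) is not None

for value in ("insert", "awesome insert number", "hello world"):
    print(value, whole_input_matches(BARE, value), whole_input_matches(WRAPPED, value))
```

The bare pattern only passes when the keyword is the whole value, which is the behaviour described in the question; the wrapped one also passes when the keyword sits in the middle of other text. Both are case-sensitive as written.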

Can I exclude Positive Lookaheads and Lookbehinds within a snippet in vscode?

I am having issues excluding parts of a string in a VSCode Snippet. Essentially, what I want is a specific piece of a path but I am unable to get the regex to exclude what I need excluded.
I have recently asked a question about something similar which you can find here: Is there a way to trim a TM_FILENAME beyond using TM_FILENAME_BASE?
As you can see, I am getting mainly tripped up by how the snippets work within vscode and not so much the regular expressions themselves
${TM_FILEPATH/(?<=area)(.+)(?=state)/${1:/pascalcase}/}
Given a file path that looks like abc/123/area/my-folder/state/...
Expected:
/MyFolder/
Actual:
abc/123/areaMyFolderstate/...
You need to match the whole string to achieve that:
"${TM_FILEPATH/.*area(\\/.*?\\/)state.*/${1:/pascalcase}/}"
Details
.* - any 0+ chars other than line break chars, as many as possible
area - a word
(\\/.*?\\/) - Group 1: /, any 0+ chars other than line break chars, as few as possible, and a /
state.* - state substring and the rest of the line.
NOTE: If there must be no other subparts between area and state, replace .*? with [^\\/]* or even [^\\/]+.
The expected output differs from the corresponding part of the input string (my-folder becomes MyFolder). If the regex itself had to do that conversion, the expression might get pretty complicated, such as:
(?:[\s\S].*?)(?<=area\/)([^-])([^-]*)(-)([^\/])([^\/]*).*
and a replacement of something similar to /\U$1\E$2$3\U$4\E$5/, if available.
If the pascalcase transform already takes care of that conversion, which I am guessing it does, this simpler expression might work here:
.*area(\\/.*?\\/).*
and the desired data is in this capturing group $1:
(\\/.*?\\/)
Building on my answer you linked to in your question, remember that lookarounds are "zero-length assertions" and "do not consume characters in the string". See lookarounds are zero-length assertions:
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
So in your snippet transform: /(?<=area)(.+)(?=state)/ the lookaround portions are not actually consumed and so are simply passed through. Vscode treats them, as it should, as not actually being within the "part to be transformed" segment at all.
That is why lookarounds are not excluded from your transform.
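The pass-through behaviour can be reproduced outside VS Code. The following Python sketch compares the lookaround version with the whole-string version using re.sub. The pascal helper is a hypothetical stand-in and not VS Code's actual pascalcase implementation, and the index.ts tail of the path is made up because the question elides it.

```python
import re

def pascal(text):
    # Hypothetical stand-in for the snippet's pascalcase transform:
    # capitalise each alphanumeric run and drop the separators.
    words = re.findall(r"[A-Za-z0-9]+", text)
    return "".join(w[:1].upper() + w[1:] for w in words)

path = "abc/123/area/my-folder/state/index.ts"

# Lookarounds consume nothing, so everything outside group 1 is left in place.
partial = re.sub(r"(?<=area)(.+)(?=state)", lambda m: pascal(m.group(1)), path, count=1)

# Matching the whole string means the replacement covers the whole string.
whole = re.sub(r".*area(/.*?/)state.*", lambda m: pascal(m.group(1)), path, count=1)

print(partial)
print(whole)
```

With this stand-in the slashes are dropped by the transform itself, so whether the final result keeps them depends on how the real pascalcase transform treats separators.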

regex look ahead behind (look around) negative problems

I am having trouble understanding negative regex lookahead / lookbehind. I got the impression from reading tutorials that when you set a criterion to look for, the criterion doesn't form part of the search match.
That seems to hold for the positive lookahead examples I tried, but when I tried these negative ones, they matched the entire test string. First, they shouldn't have matched anything, and second, even if they did, the match wasn't supposed to include the lookahead criterion.
(?<!^And).*\.txt$
with input
And.txt
See: https://regex101.com/r/vW0aXS/1
and
^A.*(?!\.txt$)
with input:
A.txt
See: https://regex101.com/r/70yeED/1
PS: If you're going to ask me which language, I don't know. We've been told to use regex without reference to any specific language. I tried clicking various options on regex101.com and they all came up the same.
Lookarounds only try to match at their current position.
You are using a lookbehind at the beginning of the string in (?<!^And).*\.txt$, and a lookahead at the end of the string in ^A.*(?!\.txt$), which won't work. (.* will always consume the whole string as its first match.)
To disallow "And", for example, you can put the lookahead at the beginning of the string with a greedy quantifier .* inside it, so that it scans the whole string:
(?!.*And).*\.txt$
https://regex101.com/r/1vF50O/1
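A quick way to try this in Python is sketched here. I use re.match so that the pattern is only attempted at the start of the string; that anchoring is my own assumption about the intent, and the accepts helper and the file names are hypothetical.

```python
import re

NO_AND = re.compile(r"(?!.*And).*\.txt$")

def accepts(name):
    # re.match only tries position 0, so the lookahead scans the whole name once.
    return NO_AND.match(name) is not None

for name in ("And.txt", "notes.txt", "notes.And.txt"):
    print(name, accepts(name))

# An unanchored search may still succeed from a later starting position.
print(NO_AND.search("And.txt").group(0))
```

If the engine searches rather than matches from the start, putting ^ in front of the lookahead has the same effect as re.match here.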
Your understanding is correct and the issue is not with the lookbehind/lookahead. The issue is with .*, which matches the entire string in both cases. The period . matches any character, and the * that follows lets it match a string of any length. Remove it and both your regexes will work:
(?<!^And)\.txt$
^A(?!\.txt$)
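For anyone who wants to verify this, here is a minimal Python sketch that runs the four patterns against the inputs from the question. The variable names are mine; the patterns and inputs are taken from the thread.

```python
import re

# With .* in the pattern, the lookaround is checked at a position where it passes.
with_star_behind = re.search(r"(?<!^And).*\.txt$", "And.txt")
with_star_ahead = re.search(r"^A.*(?!\.txt$)", "A.txt")

# Without .*, the lookaround sits right next to the text it is meant to guard.
without_star_behind = re.search(r"(?<!^And)\.txt$", "And.txt")
without_star_ahead = re.search(r"^A(?!\.txt$)", "A.txt")

print(with_star_behind.group(0), with_star_ahead.group(0))
print(without_star_behind, without_star_ahead)
```

In the first pair the lookbehind is tested at position 0 and the lookahead at the end of the string, where neither has anything to reject, which is why the whole input comes back as the match.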

Negative Lookbehind stops after first occurrence of an optional regex

I'm removing protocol from links in HTML files using the following regex in Python:
re.sub(r"((http:|https:)?(\/\/website.com))", r"\3", result)
This works as expected, but I don't want to replace the protocol when the attribute is content. So I started looking into using Regex Negative Lookbehind.
(?<!content=")(http:|https:)?(\/\/website.com)
This regex should basically mean that if the string starts with <content=", then it should not match the rest. But the problem is that it only rejects the optional regex, (http:|https:)?, likely because it's optional. It rejects the whole line if it's not optional.
Here's a screenshot that shows the problem clearly. The first line should be rejected completely, but it only rejected the protocol.
Any suggestions? :)
Thanks!
The problem with the original regex is that it matches a //website.com that does not have content=" directly before it, because the http:/https: part is optional. To work around that, you can include the protocol in the negative lookbehind.
As variable length lookbehinds are not supported in Python, you can do the following:
(?<!content=")(?<!content="https:)(?<!content="http:)((https?:)?(//website.com))
The regex finds a //website.com that does not have content=" directly in front of it, so it returns a match.
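How about the following? Note that it needs a regex engine that accepts alternatives of different lengths inside a lookbehind; Python's built-in re module rejects it with a "look-behind requires fixed-width pattern" error.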
(?<!content="|content="http:|content="https:)(http:|https:)?(\/\/website.com)
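Here is a small Python sketch of the stacked-lookbehind approach, together with a check of whether the single-lookbehind variant compiles under Python's built-in re module. The strip_protocol and compiles helpers and the sample HTML strings are hypothetical; the patterns are the ones from the answers.

```python
import re

# Three fixed-width lookbehinds stacked one after another, as in the first answer.
FIXED = re.compile(
    r'(?<!content=")(?<!content="https:)(?<!content="http:)'
    r'((https?:)?(//website.com))'
)

def strip_protocol(html):
    # Group 3 is the protocol-relative part, so \3 drops "http:" / "https:".
    return FIXED.sub(r"\3", html)

def compiles(pattern):
    # Report whether Python's re module accepts the pattern at all.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

VARIABLE = r'(?<!content="|content="http:|content="https:)(http:|https:)?(\/\/website.com)'

print(strip_protocol('<a href="http://website.com/page">'))
print(strip_protocol('<meta content="https://website.com/img.png">'))
print(compiles(VARIABLE))
```

The unescaped dot in website.com is kept as in the answer; escaping it as website\.com would be stricter.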

Parse with Regex without trailing characters

How can I parse the text below so that I get just
To: User <test#test.com>
and
To: <test#test.com>
When I try to parse the text below with
/To:.*<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>/mi
It grabs
Message-ID <CC2E81A5.6B9%test#test.com>,
which I don't want in my answer.
I have tried using $ and \z and neither works. What am I doing wrong?
Information to parse
To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>
To:
<test#test.com>
This is my parsing information in Rubular http://rubular.com/r/DQMQC4TQLV
Since you haven't specified exactly what your tool/language is, assumptions must be made.
In general, regex quantifiers are greedy: they match the longest possible string. Your pattern starts off with .*, which means that you're going to match the longest possible string that ENDS WITH the remainder of your pattern <[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>, and that pulls <CC2E81A5.6B9%test#test.com> from the Message-ID into the match.
Both Apalala's and nhahtdh's comments give you something to try. Avoid the all-inclusive .* at the start and use something that's a bit more specific: match leading spaces, or match anything EXCEPT the first part of what you're really interested in.
You need to make the wildcard match non-greedy by adding a question mark after it:
To:.*?<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>
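The difference between the greedy and the non-greedy version can be checked with a short Python sketch. I am assuming that Ruby's /m flag (dot also matches newlines) corresponds to re.DOTALL and /i to re.IGNORECASE; the sample text is copied from the question, and the variable names are mine.

```python
import re

text = (
    "To: User <test#test.com> Message-ID <CC2E81A5.6B9%test#test.com>\n"
    "To:\n"
    "<test#test.com>"
)

ADDRESS = r"<[A-Z0-9._+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}>"
FLAGS = re.IGNORECASE | re.DOTALL

# Greedy: .* runs as far as it can and backs up to the last address that fits.
greedy = re.findall(r"To:.*" + ADDRESS, text, FLAGS)

# Non-greedy: .*? stops at the first address after each "To:".
lazy = re.findall(r"To:.*?" + ADDRESS, text, FLAGS)

print(greedy)
print(lazy)
```

With this sample, the greedy version returns a single match spanning everything from the first "To:" to the last address, Message-ID included, while the non-greedy version returns the two "To:" entries separately.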