I am stuck on a small issue: I need to remove the last character "," (if present) from each event in a JSON log file that I am using in Splunk.
It seems simple and I was hoping my regex would work, but it is not working.
My attempts:
1. s/\(,$\)?//g
2. s/,$//g
3. s/\(.*\),/\1/
FYI: My JSON file is nested. Along with removing the last character, I am removing a header and a footer from this file and breaking the single event into multiple events. Because of the event break, each event has a , at the end.
For a better understanding, you can refer to this link, which I posted on the Splunk Community forum:
https://community.splunk.com/t5/Getting-Data-In/Updated-Help-in-event-break-for-json-file/td-p/569676
Actually, there was an extra space at the end, so the regex below works, but it causes another issue.
Working regex: s/\(,\s$\)//g
The issue arises because I am using it together with other regexes and an event break. Now the event break is not working.
Other regexes:
SEDCMD-removefooter = s/(\]\,).*//g
SEDCMD-removeheader = s/\{\"data\": \[//g
LINE_BREAKER = ([\r\n,]*(?:{[^[{]+\[)?){"links"
I resolved the issue
Working regex
SEDCMD-replacequotes = s/'/"/g
SEDCMD-removecomma = s/,\s$//g
SEDCMD-removefooter = s/(\]\,).*//g
SEDCMD-removeheader = s/\{.data.: \[//g
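For anyone who wants to check the comma-stripping pattern outside Splunk, here is a minimal Python sketch. It only imitates the substitution with Python's re module and is not Splunk's SEDCMD engine. The sample events and the function name strip_trailing_comma are made up for illustration, and \s* is used instead of \s as an assumed generalization so that zero or more trailing whitespace characters are handled.

```python
import re

# Hypothetical sample events, as they might look after the event break.
events = [
    '{"links": {"self": "a"}, "id": 1}, ',   # comma followed by a space
    '{"links": {"self": "b"}, "id": 2},',    # comma at the very end
    '{"links": {"self": "c"}, "id": 3}',     # no trailing comma
]

def strip_trailing_comma(event: str) -> str:
    """Drop a final comma and any whitespace after it; inner commas are kept."""
    return re.sub(r",\s*$", "", event)

cleaned = [strip_trailing_comma(e) for e in events]

# The attempt s/,$// cannot match when a space follows the comma,
# which is why it appeared to do nothing.
unchanged = re.sub(r",$", "", events[0])

for line in cleaned:
    print(line)
```

The comparison with unchanged shows why the earlier attempts failed: the comma was not the last character of the event.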
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing characters . and ) from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
And this regex seems to work for extracting the URLs from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
(Edited from https?:[\S]*\/)
A Python script may look something like this:
import re

ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
# Print every URL found in the sample text.
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
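The range problem described above can be checked with a short Python sketch. The variable names are illustrative only, and the sample URL is taken from the question's output. It shows that [$-_] accepts characters such as / and digits, while a class with the hyphen at the end does not, and that the simplified pattern with / added still swallows the trailing ). characters.

```python
import re

# '$-_' inside a character class is a range from '$' (0x24) to '_' (0x5F).
accidental_range = re.compile(r"[$-_]")
# With '-' placed last, the class contains only '$', '_' and '-'.
literal_hyphen = re.compile(r"[$_-]")

for ch in "/9:A-":
    in_range = bool(accidental_range.fullmatch(ch))
    in_literal = bool(literal_hyphen.fullmatch(ch))
    print(repr(ch), "range:", in_range, "literal:", in_literal)

# The simplified pattern from the answer, with '/' added to the class.
simplified = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")
match = simplified.search("http://www.discover-earth.org/index.html).")
print(match.group(0))
```

Because ) and . are both members of the class, the match runs to the end of the sample string, which is the remaining problem the answer points out.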
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Drop everything after the last "/" in each matched URL.
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)
I'm porting my system to another data access library. For that, I'm using regex to replace or remove some code in my source (an example is below).
I need to remove everything between IBOQ_OrderingItems.Strings and ') with a regex, but I can't write a regex that expresses this condition. In my attempts, the regex does not recognize something like #180'asdf' or 'adsf (asdf) asdf' or ' adf '. When it does match, it deletes the entire content of the file.
object SQLCalcula_umaLinha: TFDQuery
IBOQ_OrderingItems.Strings = (
'sf')
end
object SQLCalcula_VariasLinhas: TFDQuery
IBOQ_OrderingItems.Strings = (
'sfdf'
'sdffs'
'sf')
end
object SQLCalcula_parentesesNoMeio: TFDQuery
IBOQ_OrderingItems.Strings = (
'sfdf'
'sdffs ('' asdf '')'
'sf')
end
I found a solution:
IBOQ_.*.Strings.=.\((\s.[\w|\s|('|')|#|!|$|#||&|*|<|>|=|*|~]*.)+'\)
I hope this helps :)
Or you could try something like
IBOQ_.*\.Strings\s*=\s*\((?:'[^']*'|[^)])*\)
which does it in 288 steps, whereas yours takes 48067 steps ;)
Check it out here at regex101.
Edit: Changed to handle parentheses inside quotes.
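As a rough check of the shorter pattern, here is a Python sketch that applies it to a trimmed copy of the sample from the question. The indentation of the sample and the variable names are assumptions for illustration; the original tool used for the port is not stated, so Python's re module stands in for it here.

```python
import re

# Trimmed copy of the sample from the question (indentation is assumed).
dfm_text = """object SQLCalcula_umaLinha: TFDQuery
  IBOQ_OrderingItems.Strings = (
    'sf')
end
object SQLCalcula_parentesesNoMeio: TFDQuery
  IBOQ_OrderingItems.Strings = (
    'sfdf'
    'sdffs ('' asdf '')'
    'sf')
end
"""

# Quoted strings are consumed whole, so a ')' inside quotes does not end the match.
pattern = re.compile(r"IBOQ_.*\.Strings\s*=\s*\((?:'[^']*'|[^)])*\)")

# subn returns the new text and the number of replacements made.
cleaned, count = pattern.subn("", dfm_text)
print(count)
print(cleaned)
```

The quoted-string alternative comes first in the group, which is what lets the line containing ('' asdf '') pass through without ending the match early.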
I am looking for a regular expression to match all references to SuppressMessage in a solution that I took over.
Example:
[SuppressMessage("Microsoft.Globalization", "CA1305:SpecifyIFormatProvider", MessageId = "System.Int32.ToString")]
I tried this to find the SuppressMessage attributes, including the opening and closing square brackets, but it does not handle line feeds, and when there are multiple matches within the same file, it returns the bulk of the file.
\[(SuppressMessage)\((.*)\)\]
\[(SuppressMessage)\((.*?)\)\]
Try it.
Thanks vks - That got me closer, but that finds two groups:
SuppressMessage
"Microsoft.Design", "CA1062:Validate arguments of public methods", MessageId = "0"
What I found that works (without multiple SuppressMessages in the same square brace) is:
\[(SuppressMessage.*?)\]
\[(SuppressMessage\((?:.*?)\))\]
Make your expression non-greedy. In fact, try
\[(SuppressMessage\((?:[^)]*)\))\]
or
\[(SuppressMessage[^)]*\))\]
to make it fail proof.
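The difference between the greedy attempt and the negated-class version can be seen in a small Python sketch. The question does not say which regex engine is in use, so Python's re module is an assumption here, and the sample text (two attributes on one line plus one attribute split across lines) is constructed for illustration.

```python
import re

# Illustrative input: two attributes on one line, then one spanning a line feed.
text = (
    '[SuppressMessage("Microsoft.Globalization", "CA1305:SpecifyIFormatProvider", '
    'MessageId = "System.Int32.ToString")] '
    '[SuppressMessage("Microsoft.Design", "CA1062:Validate arguments of public methods", '
    'MessageId = "0")]\n'
    '[SuppressMessage("Microsoft.Design",\n'
    '    "CA1062:Validate arguments of public methods", MessageId = "0")]\n'
)

# Greedy version from the question: .* runs to the last ")]" on the line
# and cannot cross a line feed.
greedy = re.findall(r'\[(SuppressMessage)\((.*)\)\]', text)

# Negated-class version: [^)]* stops at the first ")" and also matches line feeds.
bounded = re.findall(r'\[(SuppressMessage\((?:[^)]*)\))\]', text)

print(len(greedy), len(bounded))
for attribute in bounded:
    print(attribute)
```

One limitation worth keeping in mind: [^)]* stops at the first closing parenthesis, so an attribute whose arguments themselves contain a ) would be cut short.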
I'm trying to parse a gitolite.conf file, which is a whitespace-oriented conf file with a few regexes. The worst problem is that some options might appear anywhere:
@staff = dilbert alice # line 1
@projects = foo bar # line 2
repo @projects baz # line 3
RW+ = @staff # line 4
- master = ashok # line 5
RW = ashok # line 6
R = wally # line 7
config hooks.emailprefix = '[%GL_REPO] ' # line 8
Check the "master" attribute. Some repos have them, others do not. It's a real pain.
This answer assumes a goal of extracting key/value pairs into capturing groups, where key consists of contiguous non-whitespace before = and value includes everything after = but before #, trimmed of leading/trailing whitespace.
Basic version
([^\s]+)\s*=\s*((?:\s*[^\s#]+)*)
More advanced version
The regex above doesn't handle quoted strings very well (e.g. prefix = ' Quoted with # and leading/trailing whitespace '). Regex isn't great at this kind of thing but simple cases can be handled as follows:
([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))
Here's the demo if you need to see what is captured and play around with it more: Debuggex Demo
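To see what the more advanced pattern captures, here is a minimal Python sketch run against a few lines from the question. The helper name parse_line is hypothetical, and the sample lines assume gitolite's usual @group syntax; the question does not say which language the parser is written in, so Python is only a stand-in.

```python
import re

# The "more advanced" pattern from the answer, unchanged.
KEY_VALUE = re.compile(r"""([^\s]+)\s*=\s*('[^']*'|"[^"]*"|(?:(?:\s*[^\s#]+)*))""")

def parse_line(line):
    """Return (key, value) for a line containing '=', or None otherwise."""
    m = KEY_VALUE.search(line)
    return (m.group(1), m.group(2)) if m else None

sample_lines = [
    "@staff = dilbert alice # line 1",
    "repo @projects baz # line 3",
    "- master = ashok # line 5",
    "config hooks.emailprefix = '[%GL_REPO] ' # line 8",
]

for line in sample_lines:
    print(parse_line(line))
```

Note that the key is only the last run of non-whitespace before the = sign, so line 5 yields master rather than "- master", and the quoted value on line 8 keeps its quotes and trailing space.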
First, you should know that this isn't entirely possible with Regex. Regex is a great tool for parsing regular languages (including some types of configuration files), but as soon as you get into "Well, this line is actually a header line and we need all lines under it, and some lines might have this token, and others might not", it gets quite messy. I'm not saying it's impossible, but you're going to waste a lot of time debugging your Regex pattern instead of just writing a parser in whatever language you're using this with.
Second, if you're going to ask a question about Regex, it is always helpful to say what you want out of the expression. Do you want to tokenize everything, do you only want the configuration keys, do you only want the comments?
That being said, I took my best guess, here's an expression to get you started:
^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))
With this expression, please apply the g and m flags (global and multiline). In PCRE, this would look like:
/^(?:([^=#]+?)\s.?=?\s.?([^=#]+?)\s.?(?:#|$))/gm
There are two capture groups, one is whatever is before the = sign, and the other is whatever is after. If there is no = sign, the first capture group contains everything. Anything after "#" is ignored.
Here's a fiddle to demonstrate: http://www.rexfiddle.net/eQexbZU
I am using vbscript regex to find self-defined tags within a file.
"\[\$[\s,\S]*\$\]"
Unfortunately, I am doing something wrong, so it will grab all of the text between two different tags. I know this is caused by not excluding "$]" between the pre and post tag, but I can't seem to find the right way to fix this. For example:
[$String1$]
useless text
[$String2$]
returns
[$String1$]
useless text
[$String2$]
as one match.
I want to get
[$String1$]
[$String2$]
as two different matches.
Any help is appreciated.
Wade
The RegEx is greedy and will try to match as much as it can in one go.
For this kind of matching where you have a specific format, instead of matching everything until the closing tag, try matching NOT CLOSING TAG until closing tag. This will prevent the match from jumping to the end.
"\[\$[^\$]*\$\]"
Make the * quantifier lazy by adding a ?:
"\[\$[\s\S]*?\$\]"
should work.
Or restrict what you allow to be matched between your delimiters:
"\[\$.*\$\]"
will work as long as there is only one [$String$] section per line, and sections never span multiple lines;
"\[\$(?:(?!\$\])[\s\S])*\$\]"
checks, before matching each character after a [$, that a $] does not start at that position.
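The variants above can be compared side by side in a short Python sketch. Python's re module is used here only as a stand-in for the VBScript RegExp object, on the assumption that the two engines treat these particular constructs the same way; the sample text is the one from the question.

```python
import re

text = "[$String1$]\nuseless text\n[$String2$]\n"

# Greedy: [\s\S]* runs to the last "$]" in the input, giving one long match.
greedy = re.findall(r"\[\$[\s\S]*\$\]", text)

# Lazy: *? stops at the first "$]" after each "[$".
lazy = re.findall(r"\[\$[\s\S]*?\$\]", text)

# Negated class: no "$" is allowed between the delimiters.
negated = re.findall(r"\[\$[^$]*\$\]", text)

# Lookahead: each character is consumed only if "$]" does not start there.
tempered = re.findall(r"\[\$(?:(?!\$\])[\s\S])*\$\]", text)

print(greedy)
print(lazy)
print(negated)
print(tempered)
```

The negated-class variant is the strictest of the three fixes, since it rejects any tag whose content contains a $ character.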
There is no need to use a regex if your tags are always delimited by [$...$]. Try this:
Set objFS = CreateObject( "Scripting.FileSystemObject" )
strFile = WScript.Arguments(0)
Set objFile = objFS.OpenTextFile(strFile)
strContent = objFile.ReadAll
strContent = Split(strContent, "$]")
For i = LBound(strContent) To UBound(strContent)
    m = InStr(strContent(i), "[$")
    If m > 0 Then
        WScript.Echo Mid(strContent(i), m) & "$]"
    End If
Next