regular expression - regex

I am trying to find a regular expression which should satisfy the following needs.
It should identify all space(s) as separators until a doublepoint is passed 2 times. After this pass, it should continue to use spaces as separators until a 3rd doublepoint is identified. This 3rd colon should be used as separator as well. But all spaces before and after this specific colon should not be used as separator. After this special doublepoint has been identified, no more separator should be found even its a space or a colon.
2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController : z rest as async texting: json, special character, spacses.....
I would like to have the separators her identified as following (Separator shown as X)
2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....
2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....
Exactly 8 separtors are found here.
Any ideas how to do this via regular expression?
My current approach does not work as I tried to to this like the following
Any ideas about this?
Update:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s
That's my current syntax to identify the separators.
Update 2:
Regex above is faulty.
Update 3:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
This is the current regex. Targetapproach is to call
df = pd.read_csv(file_name,
sep="(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
engine='python',
nrows=100)
This could be extended later to dask which gives me the change to parse multiple log files in one dataframe.
The last column line is not identified correctly. For unknown reasons yet.

If that log format is sufficiently regular, you can take the lines apart much more easily with str.split.
The assumptions are that none of the first eight fields have an internal space, and that all of them are always present (or, if not all are present, that the last field, which starts after the colon, is also not present). You can then use the maxsplit argument to str.split in order to stop splitting when the ninth field starts:
def separate(logline):
fields = logline.split(maxsplit=8) # 8 space separate fields + the rest
if len(fields) > 8:
# Fix up the ninth field. Perhaps you want to remove the colon:
fields[8] = fields[8][1:]
# or perhaps you want the text starting at the first non-whitespace
# character after the colon:
#
# if fields[8][0] == ':':
# fields[8] = fields[8].split(maxsplit=1)[1]
#
# etc.
return fields
>>> logline = ( "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
... + " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
... + " c.w.f.w.NiceController"
... + " : z rest as async texting: json, special character, spaces.....")
>>> separate(logline)
['2019-12-28', '13:00:00.112', 'DEBUG', 'n-somethingspecial.at', '---',
'[9999-118684]', '3894ß8349ß84930ßaa14e38eae18e3ebf',
'c.w.f.w.NiceController',
' z rest as async texting: json, special character, spaces.....']

Solution
The current outcome of my problem can be solved via the following regular expression.
(?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
Maybe minor adaptions have to be done maybe but for now it works pretty good.

Related

Preprocess words that do not match list of words

I have a very specific case I'm trying to match: I have some text and a list of words (which may contain numbers, underscores, or ampersand), and I want to clean the text of numeric characters (for instance) unless it is a word in my list. This list is also long enough that I can't just make a regex that matches every one of the words.
I've tried to use regex to do this (i.e. doing something along the lines of re.sub(r'\d+', '', text), but trying to come up with a more complex regex to match my case. This obviously isn't quite working, as I don't think regex is meant to handle that kind of case.
I'm trying to experiment with other options like pyparsing, and tried something like the below, but this also gives me an error (probably because I'm not understanding pyparsing correctly):
from pyparsing import *
import re
phrases = ["76", "tw3nty", "potato_man", "d&"]
text = "there was once a potato_man with tw3nty cars and d& 76 different homes"
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(lambda word: re.sub(r'\d+', '', word)))
parser.parseString(text)
What's the best way to approach this sort of matching, or are there other better suited libraries that would be worth a try?
You are very close to getting this pyparsing cleaner-upper working.
Parse actions generally get their matched tokens as a list-like structure, a pyparsing-defined class called ParseResults.
You can see what actually gets sent to your parse action by wrapping it in the pyparsing decorator traceParseAction:
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(traceParseAction(lambda word: re.sub(r'\d+', '', word))))
Actually a little easier to read if you make your parse action a regular def'ed method instead of a lambda:
#traceParseAction
def unnumber(word):
return re.sub(r'\d+', '', word)
parser = OneOrMore(oneOf(phrases) ^ Word(alphanums).setParseAction(unnumber))
traceParseAction will report what is passed to the parse action and what is returned.
>>entering unnumber(line: 'there was once a potato_man with tw3nty cars and d& 76 different homes', 0, ParseResults(['there'], {}))
<<leaving unnumber (exception: expected string or bytes-like object)
You can see that the value passed in is in a list structure, so you should replace word in your call to re.sub with word[0] (I also modified your input string to add some numbers to the unguarded words, to see the parse action in action):
text = "there was 1once a potato_man with tw3nty cars and d& 76 different99 homes"
def unnumber(word):
return re.sub(r'\d+', '', word[0])
and I get:
['there', 'was', 'once', 'a', 'potato_man', 'with', 'tw3nty', 'cars', 'and', 'd&', '76', 'different', 'homes']
Also, you use the '^' operator for your parser. You may get a little better performance if you use the '|' operator instead, since '^' (which creates an Or instance) will evaluate all paths and choose the longest - necessary in cases where there is some ambiguity in what the alternatives might match. '|' creates a MatchFirst instance, which stops once it finds a match and does not look further for any alternatives. Since your first alternative is a list of the guard words, then '|' is actually more appropriate - if one gets matched, don't look any further.

How to remove/replace specials characters from a 'dynamic' regex/string on ruby?

So I had this code working for a few months already, lets say I have a table called Categories, which has a string column called name, so I receive a string and I want to know if any category was mentioned (a mention occur when the string contains the substring: #name_of_a_category), the approach I follow for this was something like below:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today suddenly started to receive an exception unmatched close parenthesis, I realized that the categories names can contain special chars so I decided to not consider special chars or spaces anymore (don't want to add restrictions to the user and at the same time don't want to deal with those cases so the policy is just to ignore it).
So the question is there a clean way of removing these special chars (maintaining the #) and matching the string (don't want to modify the data just ignore it while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p categories.select { |c|
prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip()}\b/i)
}
See the Ruby demo
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') creates a copy of content_received with no special chars and _. Using it once reduced overhead if there are a lot of categories
Then, you iterate over the categories list, and each time check if the prep_content_received matches \b (word boundary) + category with all special chars, _ and leading/trailing whitespace stripped from it + \b in a case insensitive way (see the /i flag, no need to .downcase).
So after looking around I found some answers on the platform but nothing with my specific requirements (maybe I missed something, if so please let me know), and this is how I fix it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i| temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#').match?(/##{category_i.downcase.
gsub(/[^\sa-zA-Z0-9]/, '')}/) }
For the sake of the example, I reduced the categories to a simple array of strings, basically the first gsub, remove any character that is not a letter or a number (any special character) and replace each # with an #, the second gsub is a simpler version of the first one.
You can test the snippet above here

Combine two references into one or some serious regex work [duplicate]

This question already has an answer here:
Parse parameters and values of smarty-like string in PHP
(1 answer)
Closed 2 years ago.
Question, basically
If I have a regex ((key1)(value1)|(key2)(value2)), key1 is ref'd by $2 & key2 by $4. Is it possible to combine these into the same reference? (I'm guessing no)
Thus $7 might be key & $8 might be value, regardless of which capture group it originated in
Any regex masters who can solve the below? I've spent a couple hours on it and am kinda stuck.
I would like it to work across different regex engines with minimal modifications. Been testing with PCRE on regexr.com
What I'm doing
I'm trying to make a file format that is parsed into key/value pairs with a single regex.
There's just a few rules:
Keys are a string of characters at the start of a line, followed by a colon (:).
So far, I'm just using [a-z]+ for the keys, but that will be expanded to some more characters. I don't think that will functionally change the regex.
values can be multi-line
all white-space is trimmed from values
I don't think I've added this to the regex yet
Values end when another key begins
delimiters can be used to wrap values in the format key:DELIM: then the value, then :DELIM: on it's own line.
Delimiter can be an empty string, thus :: serves as a delimiter
The regex I have
Correctly matches non-delimited keys & values
([a-z]+):((?:(?:.|\n|\r)(?!^[a-z]+:))+)
Correctly matches delimited keys & values
([a-z]+):([A-Z]*:)((.|\r|\n)*)^:\2
Matches everything correctly, BUT requires two sets of references
(?:(?:([a-z]+):([A-Z]*:)((.|\r|\n)*)^:\2)|([a-z]+):((?:(?:.|\n|\r)(?!^[a-z]+:))+))
$1 & $5 are keys. $3 & $6 are values
Sample Input
key: value 1
nightmare:DELIM:
notakey:
obviously not a key
notakey:
:DELIM:
abc: value 2
new line
anotherkey:: value
nostring: on this one
::
Which would yield These key/value pairs
key
value1
nightmare
notakey:
obviously not a key
notakey:
abc
value 2
new line
anotherkey
value
nostring: on this one
My latest attempt
My latest attempt got me here, but it doesn't actually match anything:
^([a-z]+): # key CP#1
((?:[A-Z]*:)? # delimiter, optional
(?:\s*(\r?\n|$)) # whitespace, new line OR end of file (line?)
) # CP#2
( # value, CP#3
(?:(?:
(?:.|\n|\r) # characters we want
(?!^[a-z]+:) # But NOT if those characters make up a key
)+)
| # or
((.|\r|\n)*) # characters we want
^:\2 # Ends with delimiter
) # delimited value
Thanks to the commenter for the ?| operator, which turns out to be what I needed.
((key1)(value1)|(key2)(value2)) => (?|(key1)(value1)|(key2)(value2)).
(?|(?:([a-z]+):([A-Z]*:)((.|\r|\n)*)^:\2)|([a-z]+):()((?:(?:.|\n|\r)(?!^[a-z]+:))+)) basically does it, though the final product certainly still needs more work.

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the the actor's name, there could be either the word as, or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
You can do this without using regular expression as well.
Here is the code:
output = [x.split(' as')[0] for x in input.split('\n')]
I guess you can combine the values obtained from two regex matches :
re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Parsing Transact SQL with RegEx

I'm quite inexperienced with RegEx - just an occasional straighforward RegEx for a programming task that I worked out by trial and error, but now I have a serious regEx challenge:
I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and preface the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using regEx (Python 3 or Delphi-PRCE) that will perform this task.
The rules are as follows:
Table names are ALWAYS upperCase - so I'm only looking for upperCase
words;
Column names, SQL expressions and variables are ALWAYS lowerCase;
SQL keywords, Table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';
Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the
previous word, which will be a table name.
Column values (must not be prefixed) will either be numerical values or characters enclosed in
quotes.
Here is some sample text requiring application of all the above mentioned rules:
update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
So far with I've gotten only this far: (not too far...)
(?-i)[[:upper:]]
Any help would be most appreciated.
TIA,
MN
This is not doable with a simple regex-replacement. You will not be able to make a distinction between upper case words that are tables, are string literals or are commented:
update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER'
-- AND NO TABLES HERE AS WELL
EDIT
You seem to think that determining if a word is inside a string literal or not is easy, then consider SQL like this:
-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '
or
update TABLE set x='not '' A '''' table' where y=666
EDIT II
Okay, I (obsessively) hammered on the fact that a simple regex replacements is not doable. But I didn't offer a (possible) solution yet. What you could do is create some sort of "hybrid-lexer" based on a couple of different regex-es. What you do is scan through the input file and at the start of each character, try to match either a comment, a string literal, a keyword, or a capitalized word. And if none of these 4 previous patterns matched, then just consume a single character and repeat the process.
A little demo in Python:
#!/usr/bin/env python
import re
input = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""
regex = r"""(?xs) # x = enable inline comments, s = enable DOT-ALL
(--[^\r\n]*) # [1] comments
| # OR
('(?:''|[^\r\n'])*') # [2] string literal
| # OR
(\b(?:AND|UPDATE|SET)\b) # [3] keywords
| # OR
([A-Z][A-Z_]*) # [4] capitalized word
| # OR
. # [5] fall through: matches any char
"""
output = ''
for m in re.finditer(regex, input):
# append a `#` if group(4) matched
if m.group(4): output += '#'
# append the matched text (any of the groups!)
output += m.group()
# print the adjusted SQL
print output
which produces:
UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'
This may not be the exact output you want, but I'm hoping the script is simple enought for you to adjust to your needs.
Good luck.