Match both of these with regex, not just one of - regex

I'm looking in a sql table through a bunch of names and I want to get a list of all the different titles used. e.g. SNR, MRS, MR, JNR etc
Sometimes there is an entry that might have 2 titles, e.g: MR NAME NAME JNR. I want both of these titles 'MR' & 'JNR'
I thought a good way to do this would be with regex and find any names that have 2 or 3 characters. A title at the front would be followed by a space, while a title at the end would be preceded by one. So I have:
/(^[A-Z]{2,3})\s|\s(^[A-Z]{2,3}$)/
a regex101 example here.
As you can see I've used 'match either A or B' thing. If I throw at it a name with a title at either the start or finish, I end up getting what I want, but I don't know how to tell it to get both. i.e. strings with 2 titles will only give me back one match.
How do I get both?

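Instead of an "OR", you could just match any characters in between. Note that this only matches names that carry a title at both ends, and that the quantifier has to be {2,3} so that two-letter titles such as MR are accepted:
^([A-Z]{2,3})\s.*\s([A-Z]{2,3})$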
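If a name may carry a title at only one end, the question's own alternation can still be used, provided every match is collected rather than just the first. Here is a minimal Python sketch of that idea. Python and the helper name `find_titles` are assumptions for illustration; the question does not say which engine runs the pattern.

```python
import re

# A 2-3 letter capital word at the very start (followed by whitespace)
# or at the very end (preceded by whitespace), as described in the question.
TITLE_RE = re.compile(r"^[A-Z]{2,3}(?=\s)|(?<=\s)[A-Z]{2,3}$")


def find_titles(name):
    """Return every leading/trailing title found in `name` (hypothetical helper)."""
    return TITLE_RE.findall(name)


print(find_titles("MR NAME NAME JNR"))  # ['MR', 'JNR']
print(find_titles("MRS NAME NAME"))     # ['MRS']
```

Like the original heuristic, this also reports short surnames (for example a trailing "LEE") as titles, so the resulting list still needs a manual check.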

Related

Parse a log file to fetch some values in a line

I am reading a log file and trying to fetch some values from lines which contain the substring "edited by:" and end with " bye".
This is how a log file is designed.
Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4
.......
So i am trying to fetch:
James Cooper Person Administrator //from line 2
Harry Rhodes Person External //from line 4
And assign them to variables in my tcl program.
I am assuming the fetched lines are in a list named line2.
like
set splitList [split $line2 " "]
set agent [lindex $splitList 0]
set firstName [lindex $splitList 1]
set lastName [lindex $splitList 2]
set role [lindex $splitList 3]
I understand that having the fetched or extracted lines from log file in a list is not a good idea as they are unstructured input. Using Tcl list functions can lead to weird things when they aren't in proper Tcl list format.
I am very new to Tcl and don't have much experience using regex in Tcl.
So I tried extracting values from the matched line using regex. Suppose line2 is a variable holding the extracted matched line2 from the log file,
regexp -- {edited by:(.*) bye.$} $line2 match agent
I was able to get the expected output like below.
Person Harry Rhodes External
However, on this extracted string I don't know how I can further drill to get my variables assigned values. Any suggestion on this approach or any other functions which are present in tcl library which can help me with this task please let me know.
Updated the question by editing the log format. The format of the log file was not correct.
To err on the safe side, I would modify the regex to look for whitespace ([[:space:]]) between words, using * (= "any amount") and + (= "at least one") as appropriate and storing each variable in a capturing group (surrounded by parentheses ()):
edited[[:space:]]+by[[:space:]]*:[[:space:]]*([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+bye.$
Please note that [^[:space:]] matches any character except whitespace.
Regex101 demo: https://regex101.com/r/78l4HJ/1
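For readers who want to try the same idea outside Tcl, here is a rough Python sketch. Python has no POSIX `[[:space:]]` class, so `\s` and `\S` stand in for it. The name `extract_fields` is hypothetical, and the sample line is the one from the question without its `//Line 2` annotation.

```python
import re

# \s ~ [[:space:]] and \S ~ [^[:space:]]. Two deliberate deviations from the
# pattern above: the final dot is escaped, and each field needs at least one
# character (\S+ instead of \S*).
EDITED_BY_RE = re.compile(
    r"edited\s+by\s*:\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+bye\.$"
)


def extract_fields(line):
    """Return the four words between 'edited by :' and 'bye.', or None."""
    match = EDITED_BY_RE.search(line)
    return match.groups() if match else None


line = "19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye."
print(extract_fields(line))
```

Lines that do not contain the marker, such as "No data match.", simply yield `None`, so they can be skipped while looping over the file.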
First off, taking apart the name of a person into its components is extremely difficult. For example, some people have a multi-word family name. (Yes, I know specific examples of this.) Other people put the parts in different orders. Can you avoid splitting the name?
The other parts of parsing that substring are easier, as we can assume that agent and role contain no spaces. The trick to this RE is that \w+ matches a “word” character sequence, \s+ matches a space character sequence (more robustly than a single space), and .*? matches anything, but as little of it as possible.
regexp {^\s*(\w+)\s+(.*?)\s+(\w+)\s*$} $substring -> agent name role
OK, that's great for the substring, but what about the whole line? It's really just a matter of adjusting the anchors. (\y matches a word boundary, and \s*: allows for the space before the colon in your log.)
regexp {\yedited by\s*:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\y} $line -> agent name role
It's often not a good idea to feed more than a line at a time into a regular expression search, not unless you need to. Fortunately your records are newline-delimited so that's not a problem here.
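The whole-line approach can be sketched in Python as well. This is only an illustration. It assumes the field order of the extracted string shown in the question (Person Harry Rhodes External), `parse_line` is a hypothetical name, and `\b` replaces Tcl's `\y`.

```python
import re

# \b is Python's word boundary (Tcl spells it \y); the lazy .*? lets the name
# span several words while agent and role stay single words.
LINE_RE = re.compile(r"\bedited by\s*:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\b")


def parse_line(line):
    """Return agent, name and role from one log line, or None if it has no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    agent, name, role = match.groups()
    return {"agent": agent, "name": name, "role": role}


sample = "19-06-2021 LOGGER:INFO edited by : Person Harry Rhodes External bye."
print(parse_line(sample))
```

Because the name is captured as one group, a multi-word family name stays intact instead of being split across variables.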

How to remove/replace specials characters from a 'dynamic' regex/string on ruby?

So I had this code working for a few months already. Let's say I have a table called Categories, which has a string column called name. I receive a string and I want to know if any category was mentioned (a mention occurs when the string contains the substring #name_of_a_category). The approach I followed was something like this:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today, when it suddenly started raising an "unmatched close parenthesis" exception. I realized that category names can contain special chars, so I decided to ignore special chars and spaces when matching (I don't want to add restrictions for the user, and at the same time I don't want to handle those cases, so the policy is just to ignore them).
So the question is there a clean way of removing these special chars (maintaining the #) and matching the string (don't want to modify the data just ignore it while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p categories.select { |c|
prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip()}\b/i)
}
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') line creates a copy of content_received with all special chars and _ removed. Doing this once, outside the loop, reduces overhead when there are many categories.
Then you iterate over the categories list and check each time whether prep_content_received matches \b (word boundary) + the category with all special chars, _ and leading/trailing whitespace stripped from it + \b, in a case-insensitive way (see the /i flag; no need for .downcase).
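The same two steps (normalize once, then test each category between word boundaries) can be sketched in Python for comparison. This is an illustrative translation, not the Ruby code above. The names `normalize` and `mentioned_categories` are hypothetical, and Python's `\w` is Unicode-aware by default, unlike Ruby's.

```python
import re

# Anything that is neither a word character nor whitespace, plus the underscore.
SPECIAL_CHARS = re.compile(r"[^\w\s]|_")


def normalize(text):
    """Drop special characters and surrounding whitespace."""
    return SPECIAL_CHARS.sub("", text).strip()


def mentioned_categories(content, categories):
    """Return the categories whose normalized name occurs in the normalized content."""
    cleaned = normalize(content)  # done once, outside the loop
    found = []
    for category in categories:
        key = normalize(category)
        # re.escape is a safeguard; after normalization only word characters
        # and whitespace remain in the key.
        if key and re.search(rf"\b{re.escape(key)}\b", cleaned, re.IGNORECASE):
            found.append(category)
    return found


print(mentioned_categories("pepe is watching a #comedy :)", ["comedy :)", "terror"]))
```

As in the Ruby version, the `#` is stripped along with the other special characters, so a category name counts as mentioned even without the leading `#`.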
So after looking around I found some answers on the platform but nothing with my specific requirements (maybe I missed something, if so please let me know), and this is how I fixed it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i| temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#').match?(/##{category_i.downcase.
gsub(/[^\sa-zA-Z0-9]/, '')}/) }
For the sake of the example, I reduced the categories to a simple array of strings. The first gsub removes every character that is not a letter, a digit or whitespace, except #, which the hash argument maps back to itself. The second gsub is a simpler version of the first one that removes # as well.

Complex regex situation

I have a results list that looks like this:
1lemon_king9mumu (2-1), YearofHell (2-0), kriswithak (2-1)0.44440.75000.4444
2mumu6lemon_king (1-2), MogwaiAC (2-0), Dathanja (2-1)0.66670.62500.5655
3MogwaiAC6Dathanja (2-0), mumu (0-2), Jebnarf (2-1)0.55560.57140.5417
4Jebnarf6YearofHell (2-1), kriswithak (2-0), MogwaiAC (1-2)0.44440.62500.4266
5YearofHell3Jebnarf (1-2), lemon_king (0-2), Mig82 (2-1)0.66670.37500.6012
6Dathanja3MogwaiAC (0-2), Mig82 (2-1), mumu (1-2)0.55560.37500.5417
7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750
8kriswithak0Jebnarf (0-2), lemon_king (1-2)0.83330.20000.6875
I want to be able to pull the username of the person AFTER the rank (first number) but it is mashed together with points gained by the player, as well as their first opponent.
For example, the first person's name is "lemon_king", and his opponents were "mumu", "YearofHell" and "kriswithak". The numbers on the right are irrelevant for me, but the major problem I have is that the number of points won by the player is there. lemon_king wins 9 points for first place. I would normally just get the name by looking for the string between 1 and 9, but players' usernames can contain a 9 as well.
Can anyone think of a good solution to this problem to be able to grab the persons username?
Thanks
I think you'd need a list of the usernames to compare against; it doesn't look like the results list is "regular" enough for a regular expression.
For example the line
7Mig823Bye, Dathanja
Could be "Mig82" 3 points vs "Bye, Dathanja", but it could also be "Mig8", 23 points, "Bye, Dathanja" or "Mig8", 2 points, "3Bye, Dathanja".
Is that correct? Because if it is, you aren't going to get away with a simple solution.
Edit: Wilson commented that getting the list of usernames might be an option. In that case, something like the following might work:
/^\d+?(username1|username2|username3)\d+?(username1|username2|username3)/
It will probably take some fiddling to get right.
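A rough Python sketch of the username-list idea follows, using the usernames that appear in the sample data. The function name `parse_row` is hypothetical, and "Bye" is added as a possible first opponent because it appears in row 7.

```python
import re

# Usernames taken from the results list in the question.
USERNAMES = [
    "lemon_king", "mumu", "MogwaiAC", "Jebnarf",
    "YearofHell", "Dathanja", "Mig82", "kriswithak",
]

# Longest names first, so a name that is a prefix of another cannot win early.
ALTERNATION = "|".join(
    re.escape(name) for name in sorted(USERNAMES, key=len, reverse=True)
)
ROW_RE = re.compile(rf"^(\d+?)({ALTERNATION})(\d+?)({ALTERNATION}|Bye)")


def parse_row(row):
    """Return (rank, player, points, first_opponent), or None if the row does not fit."""
    match = ROW_RE.match(row)
    if match is None:
        return None
    rank, player, points, first_opponent = match.groups()
    return int(rank), player, int(points), first_opponent


print(parse_row("7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750"))
```

With the list of names fixed, the ambiguous row "7Mig823Bye" resolves to rank 7, player Mig82 and 3 points, because "Mig8" is not a known username.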
Here's a plnkr demonstrating it with the data you provided: http://plnkr.co/edit/nJeGfbfHgvh5zJcTWRXS?p=preview
That said, a regex might not be the right tool for this job.
As far as I can tell, you want something like
(?x) # allow whitespace and comments just like
# any real programming language
^ # beginning of line
( \d+ ) # starts with one or more digits: CAPTURE 1
(?= \D ) # must have a non-digit following
( \w+ ) # capture one or more "word" characters: CAPTURE 2
( \d ) # next is a single digit: CAPTURE 3
(?= \D ) # must have a non-digit following
( \w+ ) # capture one or more "word" characters: CAPTURE 4
# now add things for the rest of the line if you want
Your username should now be in the second capture. I’ve been a tad more careful than strictly necessary, but if you end up munging this, you may need that. I’ve also put in all the captures in case you want to move stuff around or pull more stuff out.
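As an illustration, the commented pattern can be written in Python with `re.VERBOSE`. This sketch assumes, as the pattern does, that the points value is a single digit, and `split_row` is a hypothetical helper name.

```python
import re

ROW_RE = re.compile(
    r"""
    ^
    (\d+)      # rank: one or more digits            (capture 1)
    (?=\D)     # the username must start with a non-digit
    (\w+)      # username; greedy, then backtracks   (capture 2)
    (\d)       # points: a single digit              (capture 3)
    (?=\D)     # the opponent must start with a non-digit
    (\w+)      # first opponent                      (capture 4)
    """,
    re.VERBOSE,
)


def split_row(row):
    """Return (rank, username, points, first_opponent) as strings, or None."""
    match = ROW_RE.match(row)
    return match.groups() if match else None


print(split_row("7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750"))
```

The greedy `\w+` first swallows "Mig823Bye" and then backs off until a digit followed by a non-digit is left over, which is why the username comes out as "Mig82" and the points as "3".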
Please provide a bit more information. If you want the thing between the first number and the second number:
[0-9]+([^0-9]+)
The first group will contain the first username, as long as the username itself contains no digits.
Please comment on this (so I can check) and edit your question with more detail though.
I wouldn't use regex. It will be a pain to debug, and you'll never be 100% certain you've covered all the edge cases.
Try doing 'manual' parsing using the built-in string manipulation functions of your language of choice.

variable number of capturing groups

I have a xpath expression which I want to use to extract City and date from a td which contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$2')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, if you only want the date, there is an easier way: extract the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
edit: or you can use substring() and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)
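One caveat with cutting at the first "on ": a city such as "London" contains "on " itself once the following space is included, so the last-10-characters variant is the safer of the two. For comparison, here is a small Python sketch that pulls out both parts with an optional city. It is not XPath, the name `split_city_date` is hypothetical, and it assumes the date always has the form YYYY/MM/DD.

```python
import re

# Lazy city, optional whitespace, then the word "on" and a YYYY/MM/DD date.
# \b keeps the "on" inside a city name such as "London" from matching.
CITY_DATE_RE = re.compile(r"^(.*?)\s*\bon (\d{4}/\d{2}/\d{2})$")


def split_city_date(text):
    """Return (city_or_None, date), or None if the text has another shape."""
    match = CITY_DATE_RE.match(text)
    if match is None:
        return None
    city, date = match.groups()
    return (city or None, date)


print(split_city_date("New York on 2013/07/20"))
print(split_city_date(" on 2013/07/20"))
```

The city group is allowed to be empty, so the number of groups is the same whether or not a city is present.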

Grep for Pattern in File in R

In a document, I'm trying to look for occurrences of a 12-digit string which contains letters and numerals. A sample string is: "PXB111X2206"
I'm trying to get the line numbers that contain this string in R using the below:
FileInput = readLines("File.txt")
prot_pattern="([A-Z0-9]{12})";
prot_string<-grep(prot_pattern,FileInput)
prot_string
This worked fine until it hit a document containing all upper-case titles and returned a line containing the word "CONCENTRATIO"
The string I am trying to look for is: "PXB111X2206". I am expecting the grep to return the line numbers containing the string : "PXB111X2206". It however is returning the line number containing the word: "CONCENTRATIO"
What is wrong with my expression above? Any idea what I am doing wrong here?
Here is some sample input:
Each design objective described herein is significantly important, yet it is just one aspect of what it takes to achieve a successful project.
A successful project is one where project goals are identified early on and where the interdependencies of all building systems are coordinated concurrently from the planning and programming phase.
CONCENTRATION:
The areas of concentration for design objectives: accessible, aesthetics, cost effective, functional/operational, historic preservation, productive, secure/safe, and sustainable and their interrelationships must be understood, evaluated, and appropriately applied.
Each of these design objectives is presented in the design objectives document number. PXB111X2206.
Thanks & Regards,
Simak
You are using a very powerful tool for a very simple task. The expression
[A-Z0-9]{12}
matches any run of 12 uppercase letters or digits, for example the word "CONCENTRATIO". Your "PXB111X2206", however, is only 11 characters long, so it cannot be what is being matched. If you only want to match "PXB111X2206", use it as the regular expression itself. For example, if your file contents are:
foo
CONCENTRATIO.
bazz
foo bar bazz PXB111X2206 foo bar bazz
foo
bar
bazz
and you use:
grep('PXB111X2206',readLines("File.txt"))
then R will only match line 4 as you would wish.
EDIT
If you are looking for that specific pattern try:
grep('[A-Z]{3}[0-9]{3}[A-Z]{1}[0-9]{4}',readLines("File.txt"))
That expression will match strings like 'AAADDDADDDD', where A is a capital letter and D a digit. The regular expression is built from character classes (the symbols inside square brackets), each followed by a quantifier (the number inside the curly braces) that says how many times the preceding class must repeat; if no quantifier is present, it is 1.
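To see the shape-specific pattern at work outside R, here is a small Python sketch that returns 1-based line numbers the way grep() does. The function name `matching_line_numbers` is hypothetical, and the sample lines are the ones listed above.

```python
import re

# Three capitals, three digits, one capital, four digits (e.g. PXB111X2206).
CODE_RE = re.compile(r"[A-Z]{3}[0-9]{3}[A-Z][0-9]{4}")


def matching_line_numbers(lines):
    """Return the 1-based numbers of the lines that contain a code."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if CODE_RE.search(line)
    ]


sample_lines = [
    "foo",
    "CONCENTRATIO.",
    "bazz",
    "foo bar bazz PXB111X2206 foo bar bazz",
    "foo",
    "bar",
    "bazz",
]
print(matching_line_numbers(sample_lines))  # [4]
```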
Let's take a look at what your regular expression means. [A-Z0-9] means any capital letter or digit and {12} means the previous expression must occur exactly 12 times. The string CONCENTRATIO is 12 capital letters, so it's no surprise that grep picks it up. If you want to take out the matches that match to just letters or just numbers you could try something like
allletters <- grep("[A-Z]{12}",strings)
allnumbers <- grep("[0-9]{12}",strings)
both <- grep("[A-Z0-9]{12}",strings)
the matches you wanted would then be something like
both <- both[!both %in% union(allletters,allnumbers)]
Someone with better regexfu might have a more elegant solution, but this will work too.
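One possible single-pattern alternative uses lookaheads to demand at least one letter and at least one digit inside the run. The sketch below is in Python rather than R and swaps in a different technique from the set difference above. The helper `mixed_code_re` is hypothetical, and the length is a parameter because the sample code is 11 characters long rather than 12.

```python
import re


def mixed_code_re(length):
    """Build a pattern for a run of exactly `length` capitals/digits that
    contains at least one capital and at least one digit."""
    inner = length - 1
    return re.compile(
        rf"(?<![A-Z0-9])"                   # not in the middle of a longer run
        rf"(?=[A-Z0-9]{{0,{inner}}}[A-Z])"  # a capital somewhere in the run
        rf"(?=[A-Z0-9]{{0,{inner}}}[0-9])"  # a digit somewhere in the run
        rf"[A-Z0-9]{{{length}}}"            # the run itself
        rf"(?![A-Z0-9])"                    # and it ends here
    )


line = "presented in the design objectives document number. PXB111X2206."
print(mixed_code_re(11).search(line).group())           # PXB111X2206
print(mixed_code_re(12).search("CONCENTRATION:"))       # None
```

All-letter headings such as CONCENTRATION no longer match because the second lookahead finds no digit, and the two boundary checks stop the pattern from matching a window inside a longer token.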