Match words which have given regex around - regex

I need to get all the pieces of text that have two or more spaces ("\s{2,}") around them.
Given the following text:
IP Address    Name             Location            Type
10.1.10.5     USLAXBOWC01RB    Santa Monica, CA    local
I need to extract:
Line1: "IP Address", "Name", "Location", "Type"
Line2: "10.1.10.5", "USLAXBOWC01RB", "Santa Monica, CA", "local"
EDIT:
Text eligible for extraction:
"IP Address" & "Name" are two or more spaces apart so they are eligible to be extracted. Similarly, "Santa Monica, CA" & "local".

You are trying to split your text on the pattern "\s{2,}".
In Python, the regex library re gives you all the tools you need:
import re
line = "IP Address  Name  Location  Type"
result = re.split(r'\s{2,}', line)
Which gives:
['IP Address', 'Name', 'Location', 'Type']
EDIT
I think I understand your question a little better now: you care more about isolating a sequence between \s{2,} than about splitting. In your example, however, the solution above still seems the most suitable.
You asked for a regex; here it is:
reg1 = r"[^\s](?!\s{2,})(?:.(?!\s{2,}))*[^\s]"
It first selects a non-space character that is not followed by two or more spaces, [^\s](?!\s{2,}). To do so, I used the negative lookahead assertion (?!...);
Then it adds a non-capturing group (?:...) built this way: any character . that is not followed by \s{2,};
Repeated with *;
The final character is not selected if we stop there, so we add one more [^\s].
Call re.findall(reg1, line) and you should be done. One drawback: it only detects sequences that are at least two characters long.
In that case, another, simpler regex can complete the job: reg2 = r"\s{2,}([^\s])\s{2,}". It selects single non-space characters surrounded by two or more spaces. The capturing parentheses (...) make findall return only the character.
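As a rough check of the drawback mentioned above, the sketch below runs reg1 against a sample line and against a line with single-character fields. The second pattern, \S+(?:\s\S+)*, is not from the answer; it is an assumed alternative that treats a field as runs of non-space characters joined by single whitespace, so one-character fields are kept as well. The sample lines assume the fields are separated by exactly two spaces.

```python
import re

# reg1 as given above, written as a raw string
reg1 = r"[^\s](?!\s{2,})(?:.(?!\s{2,}))*[^\s]"
# assumed alternative (not from the answer): non-space runs joined by single whitespace
alt = r"\S+(?:\s\S+)*"

line = "10.1.10.5  USLAXBOWC01RB  Santa Monica, CA  local"
short = "a  bb  c"

fields = re.findall(reg1, line)    # every field has at least two characters
missed = re.findall(reg1, short)   # single-character fields are skipped by reg1
kept = re.findall(alt, short)      # the alternative keeps them

print(fields)
print(missed)
print(kept)
```

If single-character fields never occur in the data, re.split on \s{2,} remains the simplest option.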
By the way, I strongly recommend a look at the documentation: https://docs.python.org/2/library/re.html
Hope you found what you are looking for :-)

Related

Match Regex Starting After "X" Number of Characters

I am using regex in a Google script to normalize company names, and while I am getting very close to perfect with a combination of replacing certain words, punctuation, and spaces, my last step was to replace any word with 3 or fewer letters.
But that gets rid of a few companies with acronyms at the start of their name, e.g. AB Holding Company. I don't want this to match AB; I want it to find the rare "the" or a company code (particularly foreign ones like SPA and NV, along with Co and Inc). These codes are not necessarily at the end of the string, but they always seem to be at least 4 characters after the beginning.
I am currently using
text = text.replace(/\b[a-z]{1,3}\b/i," ");
Ignore that [a-z] misses capitals; I've dealt with that separately.
What I think would work is to "skip over" the first few characters, probably 4 to be safe, and maybe learn how to include spaces and/or digits in there for the future. So I wrote this after seeing one other related question here.
text = text.replace(/((.{4})(.*)\b[a-z]{1,3}\b)/i," ");
Scripts does not seem to allow a lookbehind, and my version doesn't seem to work. I'm lost.
I appreciate your help.
Here is a solution:
text = text.replace(/^(.{4}.*)(\b[a-z]{1,3}\b)(.*)/gmi, "$1$3");
What I have changed is:
enclosed all groups in parentheses, so that they can be captured and used in the replacement;
since you mentioned that the word to be replaced might not be at the end of the string, I also added a third group to match everything after it;
included the part before and after the word in the replacement string (group 1 and group 3).
However, note that it might produce false positives: if a company name is Company ABC, Inc., it may also match ABC. Thus, if you know the words you want to replace, it might be better to just use an alternation:
text = text.replace(/^(.{4}.*)\b(Co|Inc|SPA|NV|the)\b(.*)/gmi, "$1$3");
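For comparison, here is a minimal Python sketch of the same idea that avoids lookbehind altogether: a replacement callback checks where each match starts and leaves it alone if it begins within the first few characters. The function name strip_codes, the skip parameter, and the list of codes are illustrative assumptions, not part of the original script.

```python
import re

# hypothetical list of codes to strip; extend it to suit the data
CODES = re.compile(r"\b(?:Co|Inc|SPA|NV|the)\b", re.IGNORECASE)

def strip_codes(name, skip=4):
    """Remove known codes unless they start within the first `skip` characters."""
    def keep_or_drop(match):
        # matches that begin too close to the start are returned unchanged
        return match.group(0) if match.start() < skip else ""
    cleaned = CODES.sub(keep_or_drop, name)
    # collapse the gaps left behind by removed words
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(strip_codes("AB Holding Co"))
print(strip_codes("NV Energy the Co"))
```

Because the callback sees every match, all codes on a line are removed, not only the last one.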

Regex: Removing Space Between Quotes, And Stopping Before a Colon (With Yahoo Pipes)

I've been working on this for a while, but it's beyond my understanding of regex.
I'm using Yahoo Pipes on an RSS, and I want to create hashtags from titles; so, I'd like to remove space from everything between quotes, but, if there's a colon within the quotes, I only want the space removed between the words before the colon.
And, it would be great if I could also capture the unspaced words as a group, to be able to use: #$1 to output the hashtag in one step.
So, something like:
"The New Apple: Worlds Within Worlds" Before We Begin...
Could be substituted like #$1 - with this result:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
After some work, I was able to come up with this regex:
\s(?=\s)?|(‘|’|(Review)|:.*)
("Review" was a word that often came before colons and wouldn't be stripped, if it were later in the title; that's what that's for, but I would like to not require that, to be more universal)
But, it has two problems:
I have to use multiple steps. The result of that regex would be:
"TheNewApple: Worlds Within Worlds" Before We Begin...
And I could then add another regex step, to put the hash # in front
But, it only works if the quotes are first, and I don't know how to fix that...
You can do this all in one step with regex, with a caveat. You run into problems with a repeated capturing group because only the last iteration is available in the replacement string. Searching for ( (\w+))+ and replacing with $2 will replace all the words with just the last match - not what we want.
The way around this is to repeat the pattern an arbitrary number of times that will suffice for your use. Each separate group can be referenced.
Search: "(\w+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?
Replace: "#$1$2$3$4$5$6
This will replace up to 6-word titles, exactly as you need them. First, "(\w+) matches any word following a quote. In the replacement string, it is put back as "#$1, adding the hashtag. The rest is a repeated list of (?: (\w+))? matches, each matching a possible space and word. Notice the space is part of a non-capturing group; only the word is part of the inner capture group. In the replacement string, I have $1$2$3$4$5$6, which puts back the words, without the spaces. Notice that a colon will not match any part of this, so it will stop once it hits a colon.
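To try the fixed-slot approach outside Yahoo Pipes, here is a small Python sketch. It assumes Python 3.5 or later, where an optional group that did not participate in the match expands to an empty string in the replacement. The names SLOTS, REPLACEMENT and hashtag are made up for the illustration.

```python
import re

# the search pattern from above: one required word plus five optional ones
SLOTS = r'"(\w+)' + r'(?: (\w+))?' * 5
REPLACEMENT = r'"#\1\2\3\4\5\6'

def hashtag(title):
    # unmatched optional groups expand to empty strings (Python 3.5+)
    return re.sub(SLOTS, REPLACEMENT, title)

# the pitfall described above: a repeated group only remembers its last iteration
last_only = re.sub(r'( (\w+))+', r'\2', '"The New Apple"')

print(hashtag('"The New Apple: Worlds Within Worlds" Before We Begin...'))
print(last_only)
```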
Examples:
"The New Apple: Worlds Within Worlds" Before We Begin...
"The New Apple" Before We Begin...
"One: Two"
only "One" word
this has "Two Words"
"The Great Big Apple Dumpling"
"The Great Big Apple Dumpling Again: Part 2"
Results:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
"#TheNewApple" Before We Begin...
"#One: Two"
only "#One" word
this has "#TwoWords"
"#TheGreatBigAppleDumpling"
"#TheGreatBigAppleDumplingAgain: Part 2"
You can match the text with
"([^:]*)(.*?)"(.*)
then use some programming language to output the result like this:
'"#' + removeSpace($1) + $2 + '"' + $3
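The match-then-postprocess idea can be sketched in Python with a replacement callback. The pattern here is adapted from the one above, not identical to it: it assumes the quoted part contains no nested quotes and treats everything from the first colon to the closing quote as the untouched tail. remove_space and hashtag_quoted are hypothetical helper names.

```python
import re

# group 1: text before any colon; group 2 (optional): the colon and the rest of the quote
QUOTED = re.compile(r'"([^":]+)(:[^"]*)?"')

def remove_space(text):
    return text.replace(" ", "")

def hashtag_quoted(title):
    def build(match):
        head, tail = match.group(1), match.group(2) or ""
        return '"#' + remove_space(head) + tail + '"'
    return QUOTED.sub(build, title)

print(hashtag_quoted('"The New Apple: Worlds Within Worlds" Before We Begin...'))
```

Unlike the fixed-slot pattern, this does not limit the number of words, at the cost of needing a language with replacement callbacks.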
I have no idea what language you're using, but this seems like a poor choice for regex. In Python I'd do this:
# Python 3
titles = ['''"The New Apple: Worlds Within Worlds" Before We Begin...''',
          '''"Made Up Title: For Example Only" So We Can Continue...''']
hashtagged_titles = []
for title in titles:
    # split once at the first colon; everything after it is left untouched
    hashtagme, restofstring = title.split(":", 1)
    # drop the opening quote, strip the spaces, then put back the quote plus '#'
    hashtag = '"#' + hashtagme[1:].replace(" ", "")
    result = "{}:{}".format(hashtag, restofstring)
    hashtagged_titles.append(result)
Do a global search for
\ (?=.*:)
and replace every match with nothing.
You'll need a second search on the results of that if you want to capture "TheNewApple" as a single word.

Fuzzy string-matching that can "skip"? e.g. "i am (.*)." has 0 distance to "I am here."

I'm writing a Python chatbot. Whatever the technique (Levenshtein, LCS, regex, etc.), I want a pattern like My name is [ A ]. to be smart enough to match strings like:
My name is Tslmy. #Distance should = 0, and groupdict()['a'] outputs "Tslmy"
My name is Tesla Tahomana. #Distance should = 0(!), and groupdict()['a'] outputs "Tesla Tahomana"
my naem ist tslmy . #With a little typo, the distance = 5, and groupdict()['a'] outputs "tslmy "
Allow me to use groupdict()['a'] to refer to what the [ A ] thing (actually (?P<identifier>match)) has captured, please.
Put another way, I'm looking for a "Levenshtein" with omissions/skips/blanks that also picks out what was skipped.
Put yet another way, I'm looking for a fuzzy (a.k.a. approximate) regex that is less strict with the pattern but still provides the good old groupdict(), as well as a "fuzziness" value (or "edit distance", needed later to determine "the best matched pattern for the string").
This is the preferred solution, since it provides a "sufficient" groupdict() if well managed.
However, the TRE library and the regex library, which are the closest solutions I have found, don't seem to provide a "fuzziness" value. If this can be solved, so much the better!
Is that possible? Thanks for paying attention.
Update:
I decided to use the powerful regex module in the end, but I am still unable to get the "fuzziness value".
Since the question on this page is theoretically solved, appending much more to it would be poor form. So I have posted another question about this new issue, and hope you can solve it!
You could use a RegEx for the basic match:
r"My name is (\w+){1,2}."
And then use the TRE library to allow for variations.
DAT REGEX O_O
(?i)(?:(?:my|ym).?|.?(?:my|ym))\s+(?:.?(?:..me|n..e|na..)|(?:..me|n..e|na..).?)\s+(?:(?:is|si).?|.?(?:is|si))\s+(\w[\w\s]*)\s*
Let's split it up:
(?i) : set the i modifier to match case insensitive
(?:(?:my|ym).?|.?(?:my|ym)) : this will match my, ym, My, Ym, may, amy etc...
\s+ : match white space one or more times
(?:.?(?:..me|n..e|na..)|(?:..me|n..e|na..).?) : match name, naao, tame, lame, n99e, names, Naats etc...
\s+ : match white space one or more times
(?:(?:is|si).?|.?(?:is|si)) : Match is, si, ist, sit, siR etc...
\s+ : match white space one or more times
(\w[\w\s]*) : match a word character followed by zero or more word characters or spaces, and capture it (it must start with a word character \w)
\s* : match white spaces zero or more times
Online demo
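Neither answer returns the "fuzziness" value the question asks for. As a rough illustration of the "Levenshtein with skips" idea, the sketch below uses only the standard library: a plain edit-distance table in which one wildcard slot absorbs any run of text at zero cost, followed by a walk back through the table to recover what the slot absorbed. The function name fuzzy_capture and the * slot marker are assumptions made for this example, comparison is case-sensitive, and a pattern with exactly one slot is assumed.

```python
WILD = None  # marker for the capture slot in the tokenised pattern

def fuzzy_capture(pattern, text, slot="*"):
    """Return (edit_distance, captured_text) for a pattern with one slot."""
    toks = [WILD if c == slot else c for c in pattern]
    n, m = len(toks), len(text)
    # dp[i][j]: cheapest alignment of the first i pattern tokens with text[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        tok = toks[i - 1]
        dp[i][0] = dp[i - 1][0] + (0 if tok is WILD else 1)
        for j in range(1, m + 1):
            if tok is WILD:
                # the slot either stays empty here or absorbs text[j - 1] for free
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1])
            else:
                dp[i][j] = min(
                    dp[i - 1][j - 1] + (tok != text[j - 1]),  # match or substitute
                    dp[i - 1][j] + 1,                          # pattern char missing in text
                    dp[i][j - 1] + 1,                          # extra char in text
                )
    # walk back through the table, collecting what the slot absorbed
    i, j, captured = n, m, []
    while i > 0:
        tok = toks[i - 1]
        if tok is WILD:
            if j > 0 and dp[i][j] == dp[i][j - 1]:
                captured.append(text[j - 1])
                j -= 1
            else:
                i -= 1
        elif j > 0 and dp[i][j] == dp[i - 1][j - 1] + (tok != text[j - 1]):
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return dp[n][m], "".join(reversed(captured))

print(fuzzy_capture("My name is *.", "My name is Tslmy."))
```

Running every candidate pattern through such a function and keeping the lowest distance is one way to pick "the best matched pattern", though the quadratic table makes it slow for long inputs.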

how to use a regular expression to extract json fields?

Beginner RegExp question. I have lines of JSON in a textfile, each with slightly different Fields, but there are 3 fields I want to extract for each line if it has it, ignoring everything else. How would I use a regex (in editpad or anywhere else) to do this?
Example:
"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24
I want to extract URL,TITLE,TAGS,
/"(url|title|tags)":"((\\"|[^"])*)"/i
I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:
"
A literal ".
(url|title|tags)
Any of the three literal strings "url", "title" or "tags" - in Regular Expressions, by default Parentheses are used to create groups, and the pipe character is used to alternate - like a logical 'or'. To match these literal characters, you'd have to escape them.
":"
Another literal string.
(
The beginning of another group. (Group 2)
(
Another group (3)
\\"
The literal string \" - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.
|
or...
[^"]
Any single character except a double quote. The brackets denote a character class (or set), a list of characters to match. Any given class matches exactly one character in the string. Using a caret (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.
)
End of group 3...
*
The asterisk causes the previous regular expression (in this case, group 3) to be repeated zero or more times, in this case causing the matcher to match anything that could be inside the double quotes of a JSON string.
)"
The end of group 2, and a literal ".
I've done a few non-obvious things here, that may come in handy:
Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
The i at the end of the expression makes it case insensitive.
Group 1 contains the name of the captured field.
EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.
Your new Regex is:
/"(url|title|tags)":("(\\"|[^"])*"|\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\])/i
All I've done here is alternate the string regular expression I had been using ("((\\"|[^"])*)") with a regular expression for finding arrays (\[("(\\"|[^"])*"(,"(\\"|[^"])*")*)?\]). Not so easy to read, is it? Well, substituting the letter S for our string regex, we can rewrite it as:
\[(S(,S)*)?\]
Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as exactly 0 or 1 matches.
With our same S Notation, here's the whole dirty Regular Expression:
/"(url|title|tags)":(S|\[(S(,S)*)?\])/i
If it helps, here's a view of it in action.
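The S notation translates almost directly into code. The following Python sketch builds the full pattern from a string sub-pattern and runs it on a shortened version of the sample line; the inner groups are made non-capturing here for readability, which is a small departure from the expression above, and the names S, FIELD and found are illustrative.

```python
import re

# S from the text: one JSON string (inner group made non-capturing)
S = r'"(?:\\"|[^"])*"'
# "(url|title|tags)":(S|\[(S(,S)*)?\])
FIELD = re.compile(
    r'"(url|title|tags)":(' + S + r'|\[(?:' + S + r'(?:,' + S + r')*)?\])',
    re.IGNORECASE,
)

line = ('"url":"http://www.netcharles.com/orwell/essays.htm",'
        '"domain":"netcharles.com",'
        '"tags":["orwell","writing"],'
        '"index":2931')

# group 1 is the field name, group 2 the raw value including quotes or brackets
found = {m.group(1): m.group(2) for m in FIELD.finditer(line)}
print(found)
```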
This question is a bit old, but I browsed through my PC and found this expression. I posted it as a Gist; it could be useful to others.
EDIT:
# Expression was tested with PHP and Ruby
# This regular expression finds a key-value pair in JSON formatted strings
# Match 1: Key
# Match 2: Value
# https://regex101.com/r/zR2vU9/4
# http://rubular.com/r/KpF3suIL10
(?:\"|\')(?<key>[^"]*)(?:\"|\')(?=:)(?:\:\s*)(?:\"|\')?(?<value>true|false|[0-9a-zA-Z\+\-\,\.\$]*)
# test document
[
{
"_id": "56af331efbeca6240c61b2ca",
"index": 120000,
"guid": "bedb2018-c017-429E-b520-696ea3666692",
"isActive": false,
"balance": "$2,202,350",
"object": {
"name": "am",
"lastname": "lang"
}
}
]
The JSON string you'd like to extract the field value from:
{"fid":"321","otherAttribute":"value"}
The following regex extracts exactly the "fid" field value, 321:
(?<=\"fid\":\")[^\"]*
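In Python the same lookbehind works because the looked-behind text has a fixed width. This is only a sketch on the sample string above; the json.loads line is included as a cross-check and assumes the input is a complete JSON object.

```python
import json
import re

doc = '{"fid":"321","otherAttribute":"value"}'

# the lookbehind anchors the match right after "fid":" without consuming it
match = re.search(r'(?<="fid":")[^"]*', doc)
fid = match.group(0) if match else None

# cross-check with a real parser
parsed_fid = json.loads(doc)["fid"]
print(fid, parsed_fid)
```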
Please try below expression:
/"(url|title|tags)":("([^""]+)"|\[[^[]+])/gm
Explanation:
1st Capturing Group (url|title|tags): this captures one of the literal strings 'url', 'title' or 'tags' (case sensitive).
2nd Capturing Group ("([^""]+)"|\[[^[]+]):
1st Alternative "([^""]+)" matches everything between " and ", including the quotes themselves
2nd Alternative \[[^[]+] matches everything between [ and ], including the brackets themselves
I have tested here
I adapted regex to work with JSON in my own library. I've detailed algorithm behavior below.
First, stringify the JSON object. Then, you need to store the starts and lengths of the matched substrings. For example:
"matched".search("ch") // yields 3
For a JSON string, this works exactly the same, unless you are searching explicitly for commas and curly brackets (i.e. think :, {, }), in which case I'd recommend some prior transform of your JSON object before performing the regex.
Next, you need to reconstruct the JSON object. The algorithm I authored does this by detecting JSON syntax by recursively going backwards from the match index. For instance, the pseudo code might look as follows:
find the next key preceding the match index, call this theKey
then find the number of all occurrences of this key preceding theKey, call this theNumber
using theNumber (the count of keys with the same name as theKey up to the position of theKey), traverse the object until a key named theKey has been found theNumber times
return this object called parentChain
With this information, it is possible to use regex to filter a JSON object to return the key, the value, and the parent object chain.
You can see the library and code I authored at http://json.spiritway.co/
If your JSON is
{"key1":"abc","key2":"xyz"}
then the regex below will extract key1 or key2, depending on which key you put in the regex:
"key2(.*?)(?=,|}|$)
You can verify it here - regex101.com
Why does it have to be a Regular Expression object?
Here we can just use a Hash object first and then go search it.
mh = {"url":"http://www.netcharles.com/orwell/essays.htm","domain":"netcharles.com","title":"Orwell Essays & Journalism Section - Charles' George Orwell Links","tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],"index":2931,"time_created":1345419323,"num_saves":24}
The output of which would be
=> {:url=>"http://www.netcharles.com/orwell/essays.htm", :domain=>"netcharles.com", :title=>"Orwell Essays & Journalism Section - Charles' George Orwell Links", :tags=>["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"], :index=>2931, :time_created=>1345419323, :num_saves=>24}
Not that I want to avoid using Regexp, but don't you think it would be easier to take it a step at a time until you're getting the data you want to search through further? Just MHO.
mh.values_at(:url, :title, :tags)
The output:
["http://www.netcharles.com/orwell/essays.htm", "Orwell Essays & Journalism Section - Charles' George Orwell Links", ["orwell", "writing", "literature", "journalism", "essays", "politics", "essay", "reference", "language", "toread"]]
Taking the pattern that FrankieTheKneeman gave you:
pattern = /"(url|title|tags)":"((\\"|[^"])*)"/i
we can search the mh hash by converting it to a JSON string (this needs require 'json').
/#{pattern}/.match(mh.to_json)
The output:
=> #<MatchData "\"url\":\"http://www.netcharles.com/orwell/essays.htm\"" 1:"url" 2:"http://www.netcharles.com/orwell/essays.htm" 3:"m">
Of course this is all done in Ruby, which is not one of your tags, but I hope it is still relevant.
But oops! Looks like we can't get all three at once with that pattern, so I will do them one at a time.
pattern = /"(title)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
#<MatchData "\"title\":\"Orwell Essays & Journalism Section - Charles' George Orwell Links\"" 1:"title" 2:"Orwell Essays & Journalism Section - Charles' George Orwell Links" 3:"s">
pattern = /"(tags)":"((\\"|[^"])*)"/i
/#{pattern}/.match(mh.to_json)
=> nil
Sorry about that last one. It will have to be handled differently.
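The parse-first idea from this answer carries over to other languages. Here is a minimal Python sketch, assuming each line of the file is a complete JSON object (the sample in the question lacks the surrounding braces, so they are added here). WANTED and extract_fields are illustrative names.

```python
import json

WANTED = ("url", "title", "tags")

def extract_fields(line):
    """Parse one JSON line and keep only the wanted fields that are present."""
    record = json.loads(line)
    return {key: record[key] for key in WANTED if key in record}

sample = ('{"url":"http://www.netcharles.com/orwell/essays.htm",'
          '"domain":"netcharles.com",'
          '"tags":["orwell","writing"],'
          '"index":2931}')
picked = extract_fields(sample)
print(picked)
```

A parser also handles the tags array and escaped quotes without any special cases.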

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which matches the text up to the :. But I want the part after the email, up to the :
Is there an easy regular expression to do this, or should I use split?
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above and get the database IDs (in the example: 1234, 4567, two separate IDs) and query the tables.
The above is the output (I will get many entries like this) from the tool that I am calling from my Perl script.
My idea was to use a regular expression to get the database IDs. I guess I could use a regular expression for this.
You can fudge the stuff you don't care about to make the expression easier: just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two captured groups, the (1234) and the (1234, 4657); from your pattern I can only assume the second one means "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*). Also, it is not a good idea to rely too heavily on the * quantifier, because it matches the empty string (it matches zero or more times) and should only be used for things which are truly optional. Use + otherwise, which matches one or more times.
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
    my $string = shift;
    if ($string =~ /<[^>]*> *([\d, ]+):/) {
        return $1 =~ /\d+/g;    # return the extracted digits in a list
        # return $1;            # just return the string as-is
    } else { return undef }
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
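For readers who want to try the loose pattern outside Perl, here is an equivalent Python sketch run on the two sample lines from the question. It makes the same assumption that no other <...> tag precedes the email; extract_ids and LOOSE are illustrative names.

```python
import re

# anything in <...>, optional whitespace, then digits, commas and spaces up to a colon
LOOSE = re.compile(r"<[^>]*>\s*([\d, ]+):")

def extract_ids(line):
    """Return the database IDs that follow the email part, as a list of ints."""
    match = LOOSE.search(line)
    if not match:
        return []
    return [int(num) for num in re.findall(r"\d+", match.group(1))]

line1 = "(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf"
line2 = "(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf"
print(extract_ids(line1))
print(extract_ids(line2))
```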
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.