Expression to extract country name? - regex

I have a dataframe of coefficients for countries, where each coefficient looks like:
s = "C(Country)[T.China]"
s2 = "C(Country)[T.Italy]"
s3 = "C(Country)[T.United States]"
How would I go about extracting just the country name (i.e: "China" or "Italy"?)
And can this be done with a "strip" command instead of regex?

This expression will do the job:
re.findall(r'T\.([a-zA-Z ]+)', s)

My guess is that maybe this simple expression would work:
T\.\s*([^]]+)
Test
import re
regex = r"T\.\s*([^]]+)"
test_str = ("C(Country)[T.China]\n"
"C(Country)[T.Italy]\n"
"C(Country)[T.United States]")
print(re.findall(regex, test_str))
Output
['China', 'Italy', 'United States']
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
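On the second part of the question (whether this can be done without regex): a minimal sketch, assuming every coefficient has the shape prefix[T.name]. Note that str.strip removes a set of characters from both ends rather than a fixed prefix, so it is easy to misuse here; str.partition is probably a safer fit. The helper name country_from_coef is hypothetical.

```python
def country_from_coef(coef: str) -> str:
    """Return the text between '[T.' and the closing ']'."""
    # partition splits on the first occurrence of the separator
    _, sep, rest = coef.partition("[T.")
    if not sep:
        raise ValueError(f"no '[T.' marker in {coef!r}")
    # keep everything before the closing bracket
    return rest.partition("]")[0]


coefs = [
    "C(Country)[T.China]",
    "C(Country)[T.Italy]",
    "C(Country)[T.United States]",
]
names = [country_from_coef(c) for c in coefs]
print(names)
```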

Related

Regex MySQL find words separated by punctuation

Take an input like "This is, in text, an example! Cool stuff"
I have some C# code that takes it, removes the punctuation, splits on the spaces, and returns the first 6 elements:
var title = new string(input.Where(c => !char.IsPunctuation(c)).ToArray()).Split(' ').Take(6);
so I get an array of:
["This", "is", "in", "text", "an", "example"]
From that array, how can I work backwards to match it to the original input? I've tried doing:
'This|is|in|text|an|example' but it's not precise enough, as I think it's matching with ORs instead of ANDs.
I'm going to use the regex expression in an SQL query, something like:
SELECT t.*, Max(e.Timestamp) As EventUpdated, Min(e.Timestamp) as Timestamp
From test t
Left Join edithistory e on t.IdTimelineinfo = e.IdTimelineinfo
where t.date = "2020-12-06" and t.Title REGEXP 'Testing|two|events|on|the';
I'm really new to regex and would appreciate any help.
I ended up using REGEX like the following:
DbTitle = string.Join("[^a-zA-Z]*", ArraryOfWords);
var title = $"[^a-zA-Z]*{DbTitle}";
SELECT t.*, Max(e.Timestamp) As EventUpdated, Min(e.Timestamp) as Timestamp
From test t
Left Join edithistory e on t.IdTimelineinfo = e.IdTimelineinfo
where t.date = @date and t.Title Regexp @title and Confirmed = 1;
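The idea behind that pattern is to require the words in order, with any run of non-letters allowed between them. A rough Python sketch of the same construction follows; it checks the pattern with Python's re module, not with MySQL's REGEXP, so treat it as an illustration only. The name title_pattern is hypothetical.

```python
import re


def title_pattern(words):
    """Join words so that any run of non-letters may sit between them."""
    gap = "[^a-zA-Z]*"
    # re.escape guards against regex metacharacters inside a word
    return gap + gap.join(re.escape(w) for w in words)


words = ["This", "is", "in", "text", "an", "example"]
pattern = title_pattern(words)

original = "This is, in text, an example! Cool stuff"
matched = re.search(pattern, original) is not None
# the same words in a different order should not match
shuffled = re.search(pattern, "example an text in is This") is not None
print(pattern, matched, shuffled)
```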

Replace Words proceeding it with Regex

I have two strings like this:
word=list()
word.append('The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3')
word.append('Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG')
I want to remove the words starting from VHSDVDRIP and DVDRIP onward. So from The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3 to The.Eternal.Evil.of.Asia.1995. and Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG to Guzoo.1986.
I tried the following but it doesn't work:
re.findall(r"\b\." + 'DVDRIP' + r"\b\.", word)
You could use re.split for that (regex101):
s = 'The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3'
import re
print( re.split(r'(\.[^.]*dvdrip\.)', s, 1, flags=re.I)[0] )
Prints:
The.Eternal.Evil.of.Asia.1995
Some test cases:
lst = ['The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3',
       'Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG']
import re
for item in lst:
    print( re.split(r'(\.[^.]*dvdrip\.)', item, 1, flags=re.I)[0] )
Prints:
The.Eternal.Evil.of.Asia.1995
Guzoo.1986
If instead you want to remove only the dvdrip token itself (plus the separator after it), that is, replace it with an empty string, which is my guess at the intent, this expression with the i flag may work:
import re
regex = r"(?i)(.*)(?:\w+)?dvdrip\W(.*)"
test_str = """
The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3
Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG
"""
subst = "\\1\\2"
print(re.sub(regex, subst, test_str))
Output
The.Eternal.Evil.of.Asia.1995.x264.AC3
Guzoo.1986.VHSx264.AC3.HS.ES-SHAG
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
Consider re.sub:
import re
films = ["The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3", "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG"]
for film in films:
    print(re.sub(r'(.*)VHSDVDRiP.*|DVDRip.*', r'\1', film))
Output:
The.Eternal.Evil.of.Asia.1995.
Guzoo.1986.
Note: this leaves the trailing period, as requested.
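A slightly more general variant of the same re.sub idea, as a sketch: it assumes the marker is always some spelling of "dvdrip", possibly with a prefix glued on (such as VHS), and that everything from that dot-separated token onward should go. The function name strip_release_tags is hypothetical.

```python
import re


def strip_release_tags(name: str) -> str:
    """Drop the dot-separated token containing 'dvdrip' and everything after it."""
    # [^.]* absorbs a prefix glued to the token (e.g. "VHS" in "VHSDVDRiP");
    # IGNORECASE covers DVDRip, DVDRiP, dvdrip, ...
    return re.sub(r"[^.]*dvdrip.*", "", name, count=1, flags=re.IGNORECASE)


films = [
    "The.Eternal.Evil.of.Asia.1995.DVDRip.x264.AC3",
    "Guzoo.1986.VHSDVDRiP.x264.AC3.HS.ES-SHAG",
]
cleaned = [strip_release_tags(f) for f in films]
print(cleaned)
```

Like the answer above, this keeps the trailing period.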

Entire text is matched but not able to group in named groups

I have following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem is that you are trying to write a single regex that matches ALL the key=value pairs. That's not the way to do it. A better way is a pattern that matches ONLY ONE key=value pair, applied by a function that finds all occurrences of the pattern. Most languages supply such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by = and a value. The value is matched by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
I tested it on regex101 and it worked.
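To make the search/findall distinction concrete, here is a small sketch using the same one-pair pattern with named groups. finditer is used in place of findall so that the group names survive; taking the app name with a plain split is an assumption about the input format.

```python
import re

txt = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
pair = re.compile(r"(?P<key>\w+)=(?P<value>[^|]+)")

# search stops at the first match
first = pair.search(txt)
first_pair = (first.group("key"), first.group("value"))

# finditer walks over every match and keeps the named groups
all_pairs = [m.groupdict() for m in pair.finditer(txt)]

# the app name is simply the text before the first vertical bar
app_name = txt.split("|", 1)[0]

print(app_name, first_pair, all_pairs)
```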
I must comment, though, that the specific text format you are working with doesn't require regex. All high-level languages supply a split function. You can split on the vertical bar, and then split each piece you get (except the first one) on the equals sign.
Use the PyPi regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'key1': 'value1', 'user_id': 'testuser', 'ip_address': '10.10.10.10'})]
See the Python demo online.
The .captures() method returns all the values captured by a group across all of its iterations.
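For comparison, and as a likely explanation of what the question ran into: with most engines, a capture group inside a repetition keeps only its last match. A short sketch with Python's standard re module (which spells named groups as (?P<name>...)) shows the effect:

```python
import re

s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
pattern = re.compile(r"(?P<appName>\w+)(?:\|(?P<key>\w+)=(?P<value>[^|]+))+")

m = pattern.fullmatch(s)
app_name = m.group("appName")
# the repeated groups matched three times, but only the last iteration is kept
last_key = m.group("key")
last_value = m.group("value")
print(app_name, last_key, last_value)
```

The whole string matches, yet the earlier key=value pairs are no longer reachable through the groups.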
Regular expressions may be unnecessary here. Splitting along these lines,
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    # skip the first piece, which is the app name
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
returns output close to what was asked for:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
The dict built in
temp = {"key":element[0],"value":element[1]}
can be swapped for whatever data structure you prefer.

How can I replace a string which comes with some pattern and keeping the rest same?

I have a string something like
val changeMe= "select myTable.x,myTable.y,myTable.myTable from myTable join myTable2 ON myTable.x = myTable2.x"
and I just want to replace the table name myTable with another string like myTable3, while keeping the column name in myTable.myTable unchanged. The output string should look like
val outputString= "select myTable3.x,myTable3.y,myTable3.myTable from myTable3 join myTable2 ON myTable3.x = myTable2.x"
How can I do that using regex in Scala?
Thanks.
Use a lookbehind.
(?<!\\.)\\bmyTable\\b
See demo.
https://regex101.com/r/OFNGdK/1
You can use : (?<=[, ])myTable(?=[^2])
Demo
Sounds like a job for replaceAll().
changeMe.replaceAll("(?<![.])myTable(?!2)", "myTable3")
//res0: String = select myTable3.x,myTable3.y,myTable3.myTable from myTable3 join myTable2 ON myTable3.x = myTable2.x
Use negative look-behind and look-ahead to help isolate the target string from its imitators.
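The same look-behind idea, sketched in Python rather than Scala purely for illustration. It assumes that the table name is never preceded by a dot and that word boundaries are enough to tell myTable from myTable2.

```python
import re

change_me = (
    "select myTable.x,myTable.y,myTable.myTable from myTable "
    "join myTable2 ON myTable.x = myTable2.x"
)

# (?<!\.) rejects the column position (".myTable");
# the trailing \b rejects longer names such as "myTable2"
output = re.sub(r"(?<!\.)\bmyTable\b", "myTable3", change_me)
print(output)
```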

Find and replace between second and third slash

I have urls with following formats ...
/category1/1rwr23/item
/category2/3werwe4/item
/category3/123wewe23/item
/category4/132werw3/item
/category5/12werw33/item
I would like to replace the category numbers with {id} for further processing, for example:
/category1/{id}/item
How do I replace the category numbers with {id}? I have spent the last 4 hours on this without reaching a proper solution.
Assuming you'll be running the regex in JavaScript, it would be:
/^(\/.*?\/)([^/]+)/gm
and replacement string should look like $1whatever
var str = "your url strings ..."
var replStr = 'replacement';
var re = /^(\/.*?\/)([^/]+)/gm;
var result = str.replace(re, '$1'+replStr);
console.log(result);
Based on your input, it should print:
/category1/replacement/item
/category2/replacement/item
/category3/replacement/item
/category4/replacement/item
/category5/replacement/item
See DEMO
We divide it into 3 groups:
1. the part before the replacement
2. the part to be replaced
3. the part after the replacement
yourString.replace(/([^/]*\/[^/]+\/)([^/]+)(\/[^/]+)/g, '$1' + replacement + '$3');
Here is the demo: https://jsfiddle.net/9sL1qj87/
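For readers not working in JavaScript, the same grouping approach can be sketched in Python. It assumes each URL starts with a slash and has at least two path segments; the function name mask_id is hypothetical.

```python
import re


def mask_id(urls: str) -> str:
    """Replace the second path segment of each line with the literal {id}."""
    # group 1: leading slash, first segment, and the slash after it
    # [^/]+  : the second segment, which gets replaced
    return re.sub(r"^(/[^/]+/)[^/]+", r"\g<1>{id}", urls, flags=re.MULTILINE)


urls = "/category1/1rwr23/item\n/category2/3werwe4/item"
masked = mask_id(urls)
print(masked)
```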