Regular expression within count() of a list not working - regex

I am trying to count certain expressions in tokenized texts. My code is:
tokens = nltk.word_tokenize(raw)
print(tokens.count(r"<cash><flow>"))
'tokens' is a list of tokenized texts (partly shown below). But the regex here is not working and the output shows 0 occurrences of 'cash flow', which is not correct, and I receive no error message. If I only count 'cash', it works fine.
'that', 'produces', 'cash', 'flow', 'from', 'operations', ',', 'none', 'of', 'which', 'are', 'currently', 'planned', ',', 'the', 'cash', 'flows', 'that', 'could', 'result', 'from'
Does anyone know what the problem is?

You don't need regex for this.
Just find the matching keywords in tokens and count the elements.
Example:
tokens = ['that','produces','cash','flow','from','operations','with','cash']
keywords = ['cash','flow']
keywords_in_tokens = [x for x in keywords if x in tokens]
count_keywords_in_tokens = len(keywords_in_tokens)
print(keywords_in_tokens)
print(count_keywords_in_tokens)
count_keywords_in_tokens returns 2 because both words are found in the list.
To do it the regex way, you need a string to find the matches based on a regex pattern.
In the example below, the two keywords are separated by an OR (the pipe):
import re
tokens = ['that','produces','cash','flow','from','operations','with','cash']
string = ' '.join(tokens)
pattern = re.compile(r'\b(cash|flow)\b', re.IGNORECASE)
keyword_matches = re.findall(pattern, string)
count_keyword_matches = len(keyword_matches)
print(keyword_matches)
print(count_keyword_matches)
count_keyword_matches returns 3 because there are 3 matches.
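If the goal is to count the exact phrase "cash flow" rather than the individual keywords, one option (a sketch along the same lines, joining the tokens back into a string first) is to search for the two words in sequence:
import re
tokens = ['that','produces','cash','flow','from','operations','with','cash']
string = ' '.join(tokens)
# count non-overlapping occurrences of the two-word phrase
phrase_count = len(re.findall(r'\bcash flow\b', string, re.IGNORECASE))
print(phrase_count)
phrase_count is 1 here, because "cash" and "flow" appear next to each other only once.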

Related

Tokenize a sentence where each word contains only letters using RegexTokenizer Scala

I am using Spark with Scala and trying to tokenize a sentence where each word should contain only letters. Here is my code:
def tokenization(extractedText: String): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  val textDataFrame = existingSparkSession.createDataFrame(Seq(
    (0, extractedText))).toDF("id", "sentence")
  val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
  val regexTokenizer = new RegexTokenizer()
    .setInputCol("sentence")
    .setOutputCol("words")
    .setPattern("\\W")
  val regexTokenized = regexTokenizer.transform(textDataFrame)
  regexTokenized.select("sentence", "words").show(false)
  return regexTokenized;
}
If I provide the sentence "I am going to school5", after tokenization it should contain only [i, am, going, to] and should drop school5. But with my current pattern it doesn't ignore the digits within words. How am I supposed to drop words with digits?
You can use the settings below to get your desired tokenization. Essentially, you extract words that contain only letters, using an appropriate regex pattern.
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setGaps(false)
  .setPattern("\\b[a-zA-Z]+\\b")
val regexTokenized = regexTokenizer.transform(textDataFrame)
regexTokenized.show(false)
+---+---------------------+------------------+
|id |sentence |words |
+---+---------------------+------------------+
|0 |I am going to school5|[i, am, going, to]|
+---+---------------------+------------------+
For the reason why I set gaps to false, see the docs:
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
You want to repeatedly match the regex, rather than splitting the text by a given regex.
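The same distinction can be illustrated in plain Python outside of Spark (just a sketch of the split-versus-match semantics, not the RegexTokenizer API itself):
import re
sentence = "I am going to school5".lower()
# splitting on non-word characters (the gaps=true default) keeps 'school5'
print(re.split(r'\W', sentence))               # ['i', 'am', 'going', 'to', 'school5']
# repeatedly matching letter-only words (gaps=false) drops it
print(re.findall(r'\b[a-zA-Z]+\b', sentence))  # ['i', 'am', 'going', 'to']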

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are newline characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
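Putting it together with example 1 (a quick runnable check; the string from the question is named text here to avoid shadowing the built-in input):
import re
text = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', text, re.MULTILINE)
print(output)  # ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']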
You can do this without using regular expression as well.
Here is the code:
output = [x.split(' as')[0] for x in input.split('\n')]
I guess you can combine the values obtained from the two regex alternatives:
output = re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Entire text is matched but not able to group in named groups

I have the following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in the following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written the following regex for doing this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem is that you are trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is to base it on a pattern that matches ONLY ONE key=value pair, and apply it with a function that finds all occurrences of the pattern. Every language supplies such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric chars, (\w+), together with a value. The value is designated by ([^|]+), that is, everything but a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function: there is a search function to catch a pattern once, and a findall function to catch all occurrences of the pattern within the text.
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you are working on doesn't require regex. All high-level languages supply a split function. You can split by the vertical bar, and then split each slice you get (except the first one) again by the equals sign.
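A minimal sketch of that split-based approach (the variable names are just for illustration):
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
parts = txt.split('|')
app_name = parts[0]
pairs = [p.split('=', 1) for p in parts[1:]]
print(app_name)  # my_app
print(pairs)     # [['key1', 'value1'], ['user_id', 'testuser'], ['ip_address', '10.10.10.10']]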
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The .captures() method returns all the values captured into a group across all iterations.
Not sure, but maybe a regular expression is unnecessary, and splitting along these lines:
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
Using a dict for each pair:
temp = {"key":element[0],"value":element[1]}
temp can be changed to whatever data structure you'd like to have.

Split regex matches into multiple lines

I'm using regex to read a line, gather all the matches and print each match on a new line.
So far I have read the line and extracted the data I need, but the code prints it all as a single line.
Is there a way to print each match separately?
Here is the code I have been using:
import os
import re
msg = "0,0.000000E+000,NCAP,64Q34,39,39,1028,NCAP,1,1,NCAP"
text = [msg.split(',')]
which gives me [['0', '0.000000E+000', 'NCAP', '64Q34', '39', '39', '1028', 'NCAP', '1', '1', 'NCAP']].
Searching for data between ' ' will get me the individual results.
Using the code below will find all matches but it keeps it all as one line, giving me the same as the input.
text = str(text)
line = text.strip()
m = re.findall("'(.+?)'", line)
found = str(m)
print(found+ '\n')
I am unsure what you are trying to capture using regexes, but from what I understand you want to split msg up by commas ',' and print each element on a new line.
msg = "0,0.000000E+000,NCAP,64Q34,39,39,1028,NCAP,1,1,NCAP"
msg = msg.split(',')
for m in msg:
    print(m)
>>> 0
0.000000E+000
NCAP
...
This will print each element of msg on a new line - the elements of msg are split up by ','.
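If you do want to keep the findall approach from the question, you can print each match on its own line by joining the matches with newlines instead of converting the list to a string (a small sketch reusing the question's pattern):
import re
msg = "0,0.000000E+000,NCAP,64Q34,39,39,1028,NCAP,1,1,NCAP"
text = str([msg.split(',')])
m = re.findall("'(.+?)'", text)
print('\n'.join(m))  # each captured field on its own line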
I would also recommend using an online interactive regex tester to test your regexes in real time and to understand which expressions to use (make sure to select the Python flavor).

Python3 - Handling hyphenated words: combine and split

I want to handle hyphenated words. For example, I'd like to handle the word "well-known" in two different ways.
First, combine the word (i.e. "wellknown"), and second, split the word (i.e. "well", "known").
The input would be: "well-known" and the expected output is:
--wellknown
--well
--known
But I can only produce one form at a time, not both. When I loop through my text file and find a hyphenated word, I combine it first.
Then, after I've combined it, I don't know how to go back to the original word again and do the split operation. The following are short pieces from my code (please let me know if you need to see more details):
for text in contents.split():
    if not re.search(r'\d', text):  # remove numbers
        if text not in string.punctuation:  # remove punctuation
            if '-' in term:
                combine = text.replace("-", '')  # ??problem parts (wellknown)
                separate = re.split(r'[^a-z]', text)  # ??problem parts (well, known)
I know why I cannot do both operations at the same time: after I replace the hyphenated word, that word is gone, so I can no longer find it to do the split ("separate" in the code). Does anyone have an idea how to do this, or how to fix the logic?
Why not just use a tuple containing the separated words and the combined word?
Split first, then combine:
Sample Code
separate = text.split('-')
combined = ''.join(separate)
words = (combined, separate[0], separate[1])
Output
('wellknown', 'well', 'known')
Think of tokens as objects rather than strings; then you can create a token with multiple attributes.
For example, we can use the collections.namedtuple container as a simple object to hold the token:
from collections import namedtuple
from nltk import word_tokenize

Token = namedtuple('Token', ['surface', 'splitup', 'combined'])
text = "This is a well-known example of a small-business grant of $123,456."

tokenized_text = []
for token in word_tokenize(text):
    if '-' in token:
        this_token = Token(token, tuple(token.split('-')), token.replace('-', ''))
    else:
        this_token = Token(token, token, token)
    tokenized_text.append(this_token)
Then you can iterate through tokenized_text as a list of Token namedtuples, e.g. if we just need the list of surface strings:
for token in tokenized_text:
    print(token.surface)
[out]:
This
is
a
well-known
example
of
a
small-business
grant
of
$
123,456
.
If you need to access the combined tokens:
for token in tokenized_text:
    print(token.combined)
[out]:
This
is
a
wellknown
example
of
a
smallbusiness
grant
of
$
123,456
.
If you want to access the split-up tokens, use the same loop, but you'll see that you get a tuple instead of a string, e.g.
for token in tokenized_text:
    print(token.splitup)
[out]:
This
is
a
('well', 'known')
example
of
a
('small', 'business')
grant
of
$
123,456
.
You can use a list comprehension to access the attributes of the Token namedtuples too, e.g.
>>> [token.splitup for token in tokenized_text]
['This', 'is', 'a', ('well', 'known'), 'example', 'of', 'a', ('small', 'business'), 'grant', 'of', '$', '123,456', '.']
To identify the tokens that had a hyphen and have been split up, you can simply check the type, e.g.
>>> [type(token.splitup) for token in tokenized_text]
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'tuple'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'tuple'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]
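For instance, a small follow-up sketch that keeps only the tokens that contained a hyphen (using the same tokenized_text from the example above):
hyphenated = [token for token in tokenized_text if isinstance(token.splitup, tuple)]
print(hyphenated)
# [Token(surface='well-known', splitup=('well', 'known'), combined='wellknown'),
#  Token(surface='small-business', splitup=('small', 'business'), combined='smallbusiness')]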