Remove consecutive spaces in RDD lines in Spark - regex

My data set after a lot of programmatic clean up looks like this (showing partial data set here).
ABCD A M#L 79
BGDA F D#L 89
I'd like to convert this into the following for further Spark Dataframe operations
ABCD,A,M#L,79
BGDA,F,D#L,89
val reg = """/\s{2,}/"""
val cleanedRDD2 = cleanedRDD1.filter(x=> !reg.pattern.matcher(x).matches())
But this returns nothing. How do I find runs of whitespace and replace them with a delimiter?
Thanks!
rt

It seems you just want to replace all the non-vertical whitespaces in your string data. I suggest using replaceAll (to replace all the occurrences of the texts that match the pattern) with [\t\p{Zs}]+ regex.
Here is just a sample code:
val s = "ABCD A M#L 79\nBGDA F D#L 89"
val reg = """[\t\p{Zs}]+"""
val cleanedRDD2 = s.replaceAll(reg, ",")
print(cleanedRDD2)
// => ABCD,A,M#L,79
// BGDA,F,D#L,89
The [\t\p{Zs}]+ pattern matches one or more occurrences of a tab (\t) or any Unicode whitespace character from the Space Separator (Zs) category.
To modify the contents of the RDD, just use .map:
val newRDD = yourRDD.map(elt => elt.replaceAll("""[\t\p{Zs}]+""", ","))
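
If you want to try the same idea outside Spark, here is a minimal plain-Python sketch. It is a translation under assumptions, not the Scala code above: Python's re module does not support \p{Zs}, so [^\S\r\n]+ (whitespace other than \r and \n) stands in for it. The helper name to_csv_line and the sample spacing are made up for illustration. In PySpark, a function like this could be passed to rdd.map.

```python
import re

# Runs of whitespace other than \r and \n.
# Assumption: this approximates [\t\p{Zs}]+, which Python's re cannot express.
HSPACE = re.compile(r"[^\S\r\n]+")


def to_csv_line(line: str) -> str:
    """Replace each run of spaces/tabs in a single line with one comma."""
    return HSPACE.sub(",", line.strip())


# Hypothetical input lines with irregular spacing
lines = ["ABCD   A  M#L 79", "BGDA\tF   D#L  89"]
cleaned = [to_csv_line(line) for line in lines]
print(cleaned)
```

Stripping each line first avoids a leading or trailing comma when a line starts or ends with spaces.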

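If you want to apply this directly to an RDD in PySpark, note that str.replace does not accept a regex, so use re.sub inside map instead of filter:
import re
rdd_nopunc = rdd.flatMap(lambda x: x.split()).map(lambda x: re.sub(r"[,.!?:;]", "", x))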

Related

Tokenize a sentence where each word contains only letters using RegexTokenizer Scala

I am using spark with scala and trying to tokenize a sentence where each word should only contain letters. Here is my code
def tokenization(extractedText: String): DataFrame = {
val existingSparkSession = SparkSession.builder().getOrCreate()
val textDataFrame = existingSparkSession.createDataFrame(Seq(
(0, extractedText))).toDF("id", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
.setInputCol("sentence")
.setOutputCol("words")
.setPattern("\\W")
val regexTokenized = regexTokenizer.transform(textDataFrame)
regexTokenized.select("sentence", "words").show(false)
return regexTokenized;
}
If I provide the sentence "I am going to school5", after tokenization it should contain only [i, am, going, to] and should drop school5. But with my current pattern it doesn't ignore words that contain digits. How am I supposed to drop words with digits?
You can use the settings below to get your desired tokenization. Essentially you extract words which only contain letters using an appropriate regex pattern.
val regexTokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words").setGaps(false).setPattern("\\b[a-zA-Z]+\\b")
val regexTokenized = regexTokenizer.transform(textDataFrame)
regexTokenized.show(false)
+---+---------------------+------------------+
|id |sentence |words |
+---+---------------------+------------------+
|0 |I am going to school5|[i, am, going, to]|
+---+---------------------+------------------+
For the reason why I set gaps to false, see the docs:
A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens using a minimal length. It returns an array of strings that can be empty.
You want to repeatedly match the regex, rather than splitting the text by a given regex.
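
To see the difference between the two modes without Spark, here is a small plain-Python sketch. It only imitates the behaviour described in the docs quote (split on the pattern vs. repeatedly match the pattern) using the standard re module. It is not the RegexTokenizer implementation, and the lowercasing is done by hand to mirror the output shown above.

```python
import re

sentence = "I am going to school5"
lowered = sentence.lower()

# "gaps" style: the pattern describes the separators, so the text is split on it.
split_tokens = [tok for tok in re.split(r"\W+", lowered) if tok]

# "matching" style: the pattern describes the tokens themselves.
# \b[a-zA-Z]+\b cannot match any part of "school5", because there is no
# word boundary between a letter and the digit that follows it.
match_tokens = re.findall(r"\b[a-zA-Z]+\b", lowered)

print(split_tokens)
print(match_tokens)
```

Splitting keeps school5 as a token, while matching drops it, which is why the answer switches gaps off.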

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
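Here is a self-contained sketch that runs the pattern against both examples from the question. The function name actor_names is just a label chosen for this example, and the variable is called text rather than input to avoid shadowing the built-in.

```python
import re

# Same pattern as above; re.MULTILINE makes ^ and $ match at line boundaries.
NAME = re.compile(r"^(?:[\w ]+?)(?:(?= as )|$)", re.MULTILINE)


def actor_names(text: str) -> list:
    """Return the actor name at the start of each line, without ' as <role>'."""
    return NAME.findall(text)


example_1 = "Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze"
example_2 = "Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik"

print(actor_names(example_1))
print(actor_names(example_2))
```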
You can do this without using regular expression as well.
Here is the code:
output = [x.split(' as ')[0] for x in input.split('\n')]
You could also combine the values obtained from two alternative patterns:
re.findall('(?:\\n)?(.+)(?:\W[a][s].*?)|(?:\\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Getting the text between delimiters with regex

How can I get the contents between two delimiters using a regular expression? For example, I want to get the stuff between two |. For example, for this input:
|This is what I want|
it should return this:
This is what I want
I already tried /^|(.*)$|/, but that returns This is what I want | instead of just This is what I want (shouldn't have the | at the end)
An unescaped | is the alternation operator, so your pattern never matches a literal pipe. Try escaping the pipes: /\|(.*?)\|/
For example, using JavaScript:
var s = '| This is what I want |';
var m = s.match(/\|\s*(.*?)\s*\|/);
m[1]; // => "This is what I want"
See example here:
https://regex101.com/r/qK6aG2/2
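
The same pattern carries over to other regex flavours. As a rough sketch, this is how it might look in Python; between_pipes is a hypothetical helper name, and returning None when there is no match is my own choice, not something from the question.

```python
import re

# Escaped pipes match literal "|"; the lazy group stops at the first closing pipe.
BETWEEN_PIPES = re.compile(r"\|\s*(.*?)\s*\|")


def between_pipes(text: str):
    """Return the trimmed text between the first pair of pipes, or None."""
    match = BETWEEN_PIPES.search(text)
    return match.group(1) if match else None


print(between_pipes("| This is what I want |"))
```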

Appending Text on the Line Below a Regular Expression Match

I am trying to add a line that reads "[value = xxx]" below every line that contains the word "Letters", and also append a comma to the line containing "Letters". I thought a regular expression in Notepad++ would work, but I can't quite figure it out. The matches are not spaced regularly (i.e., it's not as simple as adding "[value = xxx]" to every 3rd row).
What I have currently looks like:
Properties = "_2nastlsgb",
Letters = "#,S"
textline2
textline3
Properties = "_1,N",
Letters = "A"
I would like for the end result to be something like:
Properties = "_2nastlsgb",
Letters = "#,S",
[value = xxx]
textline2
textline3
Properties = "_1,N",
Letters = "A",
[value = xxx]
I'm really close with the following but it ends up just a bit off:
Find What: letter(.*)
Replace with: \1,\n\t\t\t\t[Value = ###]
Result:
Properties = "_2nastlsgb",
s = "#,S",
[Value = ###]
textline2
textline3
Properties = "_1,N",
s = "A",
[Value = ###]
Any help would be appreciated.
Try using:
^(.*?)(Letters.*)
And replace with:
$1$2,\n$1[Value = ###]
This regex will take the indent of the Letters and apply it to the Value as well.
The issue with your regex was that it was replacing letter and not placing it back, thus the lone s.
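If you ever need to do this outside Notepad++, the same find/replace can be sketched in Python. Assumptions here: the indentation is made of tabs (the sample text is invented, since the original indentation is not visible above), add_value_lines is a hypothetical name, and Python writes backreferences as \1 and \2 rather than $1 and $2.

```python
import re

# Group 1 captures the indent, group 2 the "Letters ..." rest of the line.
LETTERS_LINE = re.compile(r"^(.*?)(Letters.*)$", re.MULTILINE)


def add_value_lines(text: str) -> str:
    """Append a comma to each Letters line and add an indented Value line below it."""
    return LETTERS_LINE.sub(r"\1\2,\n\1[Value = ###]", text)


# Hypothetical sample with tab indentation
sample = 'Properties = "_1,N",\n\tLetters = "A"\n\ttextline2'
result = add_value_lines(sample)
print(result)
```

Because group 1 is reused in the replacement, the new line gets whatever indent the Letters line had.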

Replace using RegEx outside of text markers

I have the following sample text and I want to replace '[core].' with something else but I only want to replace it when it is not between text markers ' (SQL):
PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'
EXECUTE [core].[dbo].[FunctionX]
The result should be:
PRINT 'The result of [core].[dbo].[FunctionX]' + [extended].[dbo].[FunctionX] + '.'
EXECUTE [extended].[dbo].[FunctionX]
I hope someone can understand this. Can this be solved by a regular expression?
With RegLove
Kevin
Not in a single step, and not in an ordinary text editor. If your SQL is syntactically valid, you can do something like this:
First, you remove every string from the SQL and replace with placeholders. Then you do your replace of [core] with something else. Then you restore the text in the placeholders from step one:
1. Replace all occurrences of '(?:''|[^'])+' with 'n', where n is an index number (the number of the match), and store each match in an array at index n. This removes all SQL strings from the input and exchanges them for harmless placeholders without invalidating the SQL itself.
2. Do your replace of [core]. No regex required, normal search-and-replace is enough here.
3. Replace each placeholder with its array item ('0' with the first item, '1' with the second, and so on). Now you have restored the original strings.
The regex, explained:
' # a single quote
(?: # begin non-capturing group
''|[^'] # either two single quotes, or anything but a single quote
)+ # end group, repeat at least once
' # a single quote
In JavaScript, this would look something like this:
var sql = 'your long SQL code';
var str = [];
// step 1 - remove everything that looks like an SQL string
var newSql = sql.replace(/'(?:''|[^'])+'/g, function(m) {
str.push(m);
return "'"+(str.length-1)+"'";
});
// step 2 - actual replacement (a string pattern would replace only the first match, so use a /g regex)
newSql = newSql.replace(/\[core\]/g, "[new-core]");
// step 3 - restore all original strings
// every quoted span left is a placeholder; a function replacement keeps "$" in the strings literal
newSql = newSql.replace(/'(\d+)'/g, function(m, i) {
return str[i];
});
// done.
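For comparison, here is a rough Python sketch of the same three steps. Two things differ from the answer above and are my own assumptions: the string pattern uses * instead of + so that empty SQL strings ('') are also masked, and the placeholders are restored in a single pass with a callback. The function name replace_outside_strings is hypothetical, and the sketch assumes syntactically valid SQL with no comments containing quotes.

```python
import re

# A single-quoted SQL string; '' inside the string is an escaped quote.
# Assumption: * (not +) so that the empty string '' is matched as well.
SQL_STRING = re.compile(r"'(?:''|[^'])*'")


def replace_outside_strings(sql: str, old: str, new: str) -> str:
    saved = []

    def stash(match):
        # Step 1: remember the string literal and leave a numbered placeholder.
        saved.append(match.group(0))
        return "'%d'" % (len(saved) - 1)

    masked = SQL_STRING.sub(stash, sql)

    # Step 2: plain search-and-replace; no string literals are left to hit.
    masked = masked.replace(old, new)

    # Step 3: every quoted span is now a placeholder, so restore them all.
    return re.sub(r"'(\d+)'", lambda m: saved[int(m.group(1))], masked)


sql = (
    "PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'\n"
    "EXECUTE [core].[dbo].[FunctionX]"
)
print(replace_outside_strings(sql, "[core].", "[extended]."))
```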
Here is a single-regex solution (JavaScript) that works for the sample text:
str.replace(/('[^']*'.*)*\[core\]/g, "$1[extended]");