Regular Expression for Curly Brace - regex

I want to convert Text1 into Text2. How can I write a regular expression for this, if it is possible? The text contains nested sub-sections. The new version should be separated by commas.
Text1:
{Any
{White-collar
{Exec-managerial}
{Prof-specialty}
{Sales}
{Adm-clerical}
}
{Blue-collar
{Tech-support}
{Craft-repair}
{Machine-op-inspct}
{Handlers-cleaners}
{Transport-moving}
{Priv-house-serv}
}
{Other
{Protective-serv}
{Armed-Forces}
{Farming-fishing}
{Other-service}
}
}
Text2:
Exec-managerial,White-collar,Any
Prof-specialty,White-collar,Any
Sales,White-collar,Any
Adm-clerical,White-collar,Any
Tech-support,Blue-collar,Any
Craft-repair,Blue-collar,Any
Machine-op-inspct,Blue-collar,Any
Handlers-cleaners,Blue-collar,Any
Transport-moving,Blue-collar,Any
Protective-serv,Other,Any
Armed-Forces,Other,Any
Farming-fishing,Other,Any
Other-service,Other,Any

You could convert your data structure into JSON, and then use your favourite map/reduce methods to traverse it...
// define input text
var Text1 = `{Any
{White-collar
{Exec-managerial}
{Prof-specialty}
{Sales}
{Adm-clerical}
}
{Blue-collar
{Tech-support}
{Craft-repair}
{Machine-op-inspct}
{Handlers-cleaners}
{Transport-moving}
{Priv-house-serv}
}
{Other
{Protective-serv}
{Armed-Forces}
{Farming-fishing}
{Other-service}
}
}`
// define output array to store lines
var output = []
// parse json string into plain javascript object
JSON.parse(
  // wrap input in array
  '[' + Text1
    // replace opening braces with name/children json structure
    .replace(/{([\w-]+)/g, '{"name": "$1", "children": [')
    // replace closing braces with array close
    .replace(/}/g, ']}')
    // add commas between closing and opening braces
    .replace(/}([\n\s]*){/g, '},$1{') + ']'
// loop through outer layer
).forEach(outer => outer.children
  // inner layer
  .forEach(middle => middle.children
    // and finally join all keys with comma and push to output
    .forEach(inner => output.push([inner.name, middle.name, outer.name].join(',')))
  )
)
// join output array with newlines, and assign to Text2
var Text2 = output.join('\n')
/* Text2 =>
Exec-managerial,White-collar,Any
Prof-specialty,White-collar,Any
Sales,White-collar,Any
Adm-clerical,White-collar,Any
Tech-support,Blue-collar,Any
Craft-repair,Blue-collar,Any
Machine-op-inspct,Blue-collar,Any
Handlers-cleaners,Blue-collar,Any
Transport-moving,Blue-collar,Any
Priv-house-serv,Blue-collar,Any
Protective-serv,Other,Any
Armed-Forces,Other,Any
Farming-fishing,Other,Any
Other-service,Other,Any
*/

If it's just the inner braced stuff you want to leave, this should do it.
Find (?s)(?:.*?({[^{}]*})|.*)
Replace $1\r\n
(?s)
(?:
.*?
( { [^{}]* } ) # (1)
|
.*
)
Otherwise, you can't get the nesting info without a complex recursive regex.
Or, use a language with simple function recursion. You would call the function again for each nesting level.
In the function body, take appropriate action based on the regex \s*{([^\s{}]+)\s*|\s*{([^{}]+)}\s*|\s*}\s*
\s* {
( [^\s{}]+ ) # (1)
\s*
|
\s*
{
( [^{}]+ ) # (2)
}
\s*
|
\s* } \s*
If $1 is not empty, push it onto an array, then call the same function (recursion).
If $2 is not empty, create a temp string from $2, append all the items in the array to it,
write it out, then get the next match.
If both $1 and $2 are empty, remove the last item added to the array,
then return from the function.
That's all there is to it.
(pseudo-code)
function recurse( string_thats_left )
{
    while ( match( string_thats_left, regex ) )
    {
        if ( $1 matched )
        {
            push $1 onto array
            recurse( match position to end of string );
        }
        else
        if ( $2 matched )
        {
            write $2 to output
            for ( sizeof array )
                append "," + element to output
        }
        else
        {
            pop the last array element
            return
        }
    }
}
There is actually more to it than this, like matches must be sequential
with no breaks, but this gives the idea.
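As a rough illustration, the pseudo-code can be sketched in Python. The names `TOKEN` and `flatten` are made up for this example, and the sketch makes one assumption beyond the description: it tries the `{leaf}` alternative before the opening `{name` alternative, because with ordered alternation a leaf such as `{Sales}` would otherwise also satisfy the opening branch. Matching with `TOKEN.match(text, pos)` keeps the matches sequential with no gaps.

```python
import re

# Leaf "{name}" is tried before the opening "{name" so that a leaf is not
# mistaken for the start of a group; the last branch is a closing "}".
TOKEN = re.compile(r"\s*(?:\{([^\s{}]+)\}|\{([^\s{}]+)|\})\s*")


def flatten(text):
    """Return one 'leaf,parent,...,root' string per innermost brace group."""
    lines = []
    stack = []  # names of the currently open groups, outermost first

    def recurse(pos):
        while True:
            m = TOKEN.match(text, pos)  # must match exactly at pos
            if m is None:               # end of input or unexpected text
                return len(text)
            pos = m.end()
            leaf, opener = m.group(1), m.group(2)
            if leaf is not None:
                # leaf first, then the open groups from innermost to outermost
                lines.append(",".join([leaf] + stack[::-1]))
            elif opener is not None:
                stack.append(opener)
                pos = recurse(pos)      # resume after the nested group closes
            else:                       # closing brace: leave this level
                if stack:
                    stack.pop()
                return pos

    recurse(0)
    return lines


sample = """{Any
{White-collar
{Sales}
{Adm-clerical}
}
{Other
{Armed-Forces}
}
}"""
print("\n".join(flatten(sample)))
```

For the shortened sample this prints the three lines Sales,White-collar,Any, Adm-clerical,White-collar,Any and Armed-Forces,Other,Any.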

Related

Is it possible to use Regex to make the first of each word Capital, and replace underscores with spaces?

I can do each of them using regex, but I'm unsure how to combine them into a single regex to achieve both.
For example, this is what I'd like to happen:
HELLO_THERE -> Hello There
I found this, but it does the opposite of what I want: it lowercases the first letter, when I'd like it to lowercase everything AFTER the first:
function changeStr(string){
  return string.replace(/(?:_| |\b)(\w)/g, function($1){return $1.toLowerCase().replace('_',' ');});
}
changeStr('HELLO_THERE') -> hELLO tHERE
Not in a single regex, no. But you can make use of an anonymous function for the replacement:
function changeStr( string ){
  // Match underscores or the alphabet
  return string.replace(/[A-Za-z]+|_/g, function( match ){
    if( match === '_' ){
      // If we have an underscore then return a space
      return ' ';
    }
    else{
      // For everything else we capitalize the first char and lowercase the rest
      return match.substr(0, 1).toUpperCase()+match.substr(1).toLowerCase();
    }
  });
}
console.log( changeStr( 'HELLO_THERE_my_frIenD_MonkeyZeus' ) );
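The same callback-based replacement can be sketched in Python for comparison. The function name `change_str` is hypothetical; `re.sub` accepts a function as the replacement, which plays the role of the anonymous function above.

```python
import re


def change_str(s):
    """Turn underscores into spaces and capitalize each run of letters."""
    def fix(m):
        word = m.group(0)
        if word == "_":
            return " "
        # first character upper-case, the rest lower-case
        return word[0].upper() + word[1:].lower()

    return re.sub(r"[A-Za-z]+|_", fix, s)


print(change_str("HELLO_THERE"))  # Hello There
```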

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December
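For comparison, here is a single-pass sketch in Python that keeps the n-th bar outside the capture group for any n. It is only an illustration: the function name `break_every_nth` is invented here, and it assumes the delimiter is always the pipe character.

```python
import re


def break_every_nth(text, n):
    """Replace every n-th '|' in text with a newline."""
    # (n - 1) "field + bar" pairs, one more field, then the n-th bar,
    # which sits outside the capture group and is therefore dropped.
    pattern = r"((?:[^|]*\|){%d}[^|]*)\|" % (n - 1)
    return re.sub(pattern, r"\1\n", text)


txt = ("January|February|March|April|May|June|"
       "July|August|September|October|November|December")
print(break_every_nth(txt, 5))
```

With n = 5 this produces the same three lines as the two-pass version above; a final group with fewer than n fields is left untouched because no n-th bar follows it.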
You could first split the string on '|' to get an array of strings, and then loop through it in groups of two to build the output as required.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
val out = txt.split("\\|")
val output = new StringBuilder
// grouped(2) yields pairs of fields and keeps a trailing single field if the count is odd
for (pair <- out.grouped(2)) {
  output.append(pair.mkString("|")).append("\n")
}
val finalout = output.toString.trim // drop the trailing newline
println(finalout)

IntelliJ: Regular expression to join multiple lines into single CSV line?

Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
input: (lines pasted into some Android Studio editor tab)
Rush
IQ
Saga
Yes
desired output:
'Rush','IQ','Saga','Yes'
Using Edit > Find > Replace, I got close with this regex pattern, which matches the newline character (\n) with the goal of eliminating it:
search: ^(.*)$\n
replace: '$1',
[x] Regex
but produces this undesired output:
'Rush',IQ
'Saga',Yes
because after a newline is eliminated, the following line is already joined to the current one, so it is skipped... hence this "every other line" behavior.
The fastest and easiest way I could think of is to replace \n by ',' and then manually wrap the whole line in quotes:
The result of the first replacement would be:
Rush','IQ','Saga','Yes
And then just manually add first and last quote.
Step 1: Concatenate the lines, use
(.+)(?:\R|\z)
Replace with '$1',.
The (.+)(?:\R|\z) pattern matches any 1+ chars other than line break chars as many as possible (.+) and captures this into Group 1 and (?:\R|\z) matches either a line break sequence (\R) or (|) the very end of the string (\z).
Step 2: Post-process by replacing ,$ with an empty string. This pattern matches , at the end of the line.
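The two steps can also be tried outside the IDE. Below is a minimal Python sketch (the function name `join_lines` is hypothetical). Python's `re` module does not support `\R`, so this sketch assumes `\r?\n` line breaks and uses `\Z` for the end of the string.

```python
import re


def join_lines(text):
    """Wrap each non-empty line in single quotes and join them with commas."""
    # Step 1: each run of non-line-break chars, followed by a line break or
    # the end of the string, becomes 'line', (note the trailing comma).
    quoted = re.sub(r"([^\r\n]+)(?:\r?\n|\Z)", r"'\1',", text)
    # Step 2: remove the comma left at the very end.
    return re.sub(r",\Z", "", quoted)


print(join_lines("Rush\nIQ\nSaga\nYes"))  # 'Rush','IQ','Saga','Yes'
```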
Regex may not be the best solution for this.
CSV library
There are several comma-separated values (CSV) libraries available to make quick work of this.
The libraries will handle a particular problem you may overlook in writing your own code: some of your input lines may contain the single-quote mark within their content. Such cases need to be escaped. Quoting RFC 4180, section 2, item 7:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Here is an example of using Apache Commons CSV library.
We use lambda syntax with a Scanner to get an Iterable of the lines of text from your input.
We specify using a single-quote, as you desire, rather than default of double-quote in standard CSV.
We use try-with-resources syntax to automatically close the CSVPrinter object, whether our code runs successfully or throws an exception.
String input = "Rush\n" +
        "IQ\n" +
        "Saga\n" +
        "Yes";
Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" ); // Lambda syntax to get an `Iterable` of lines from a `String`.
CSVFormat format =
        CSVFormat
                .RFC4180
                .withQuoteMode( QuoteMode.ALL )
                .withQuote( '\'' );
StringBuilder stringBuilder = new StringBuilder();
try (
        CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
)
{
    printer.printRecord( iterable );
}
catch ( IOException e )
{
    e.printStackTrace();
}
String output = stringBuilder.toString();
System.out.println( "output: " + output );
When run:
output: 'Rush','IQ','Saga','Yes'
We can shorten that code.
try (
        CSVPrinter printer = new CSVPrinter( new StringBuilder() , CSVFormat.RFC4180.withQuoteMode( QuoteMode.ALL ).withQuote( '\'' ) ) ;
)
{
    printer.printRecord( ( Iterable < String > ) ( ) -> new Scanner( input ).useDelimiter( "\n" ) );
    System.out.println( printer.getOut().toString() ); // Or: `return printer.getOut()` returning an `Appendable` object.
}
catch ( IOException e )
{
    e.printStackTrace();
}
Not that the shortened version is particularly better. Personally, I would use the longer version wrapped in a method in a utility class. Like this:
public String enquoteLines( String input ) {
    String output = "";
    Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" ); // Lambda syntax to get an `Iterable` of lines from a `String`.
    CSVFormat format =
            CSVFormat
                    .RFC4180
                    .withQuoteMode( QuoteMode.ALL )
                    .withQuote( '\'' );
    StringBuilder stringBuilder = new StringBuilder();
    try (
            CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
    )
    {
        printer.printRecord( iterable );
        output = printer.getOut().toString();
    }
    catch ( IOException e )
    {
        e.printStackTrace();
    }
    return output;
}
Calling it:
String input = "Rush\n" +
"IQ\n" +
"Saga\n" +
"Oui";
String output = this.enquoteLines( input );
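To see the escaping behaviour without pulling in a Java dependency, here is a small sketch using Python's standard `csv` module. It is an illustration of the same idea, not a port of the Commons CSV code; the function name `enquote_lines` is made up, and the empty `lineterminator` is a choice made here so the result is a single line with no trailing line break.

```python
import csv
import io


def enquote_lines(text):
    """Join the lines of text into one CSV record quoted with single quotes."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        quotechar="'",          # single quote instead of the default double quote
        quoting=csv.QUOTE_ALL,  # quote every field
        lineterminator="",      # no line break after the record
    )
    writer.writerow(text.splitlines())
    return buf.getvalue()


print(enquote_lines("Rush\nIQ\nSaga\nYes"))  # 'Rush','IQ','Saga','Yes'
print(enquote_lines("O'Brien\nIQ"))          # 'O''Brien','IQ'
```

The second call shows the point about embedded quote marks: the writer doubles the quote character inside the field, following the same rule as the RFC 4180 passage quoted above.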

Kotlin .split() with multiple regex

Input: """aaaabb\\\\\cc"""
Pattern: ["""aaa""", """\\""", """\"""]
Output: [aaa, abb, \\, \\, \, cc]
How can I split Input to Output using patterns in Pattern in Kotlin?
I found that Regex("(?<=cha)|(?=cha)") helps patterns to remain after splitting, so I tried to use a loop, but some of the patterns, like '\' and '[', require an escaping backslash, so I'm not able to use a loop for splitting.
EDIT:
val temp = mutableListOf<String>()
for (e in Input.split(Regex("(?<=\\)|(?=\\)"))) temp.add(e)
This is what I've been doing, but it does not work for multiple regexes, and it adds an extra "" at the end of temp if Input ends with "\".
You may use the function I wrote for some previous question that splits by a pattern keeping all matched and non-matched substrings:
private fun splitKeepDelims(s: String, rx: Regex, keep_empty: Boolean = true) : MutableList<String> {
    var res = mutableListOf<String>() // Declare the mutable list var
    var start = 0 // Define var for substring start pos
    rx.findAll(s).forEach { // Looking for matches
        val substr_before = s.substring(start, it.range.first()) // Substring before match start
        if (substr_before.length > 0 || keep_empty) {
            res.add(substr_before) // Adding substring before match start
        }
        res.add(it.value) // Adding match
        start = it.range.last()+1 // Updating start pos of next substring before match
    }
    if ( start != s.length ) res.add(s.substring(start)) // Adding text after last match if any
    return res
}
You just need to build a dynamic pattern from your Pattern list items by joining them with |, the alternation operator, while remembering to escape all the items:
val Pattern = listOf("aaa", """\\""", "\\") // Define the list of literal patterns
val rx = Pattern.map{Regex.escape(it)}.joinToString("|").toRegex() // Build a pattern, \Qaaa\E|\Q\\\E|\Q\\E
val text = """aaaabb\\\\\cc"""
println(splitKeepDelims(text, rx, false))
// => [aaa, abb, \\, \\, \, cc]
Note that between \Q and \E, all chars in the pattern are considered literal chars, not special regex metacharacters.
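The same idea (escape each literal, join with `|`, keep the delimiters) can be sketched in Python, where `re.split` keeps the matched text when the pattern is wrapped in a capture group. The function name `split_keep_delims` is hypothetical. As in the Kotlin version, the two-backslash literal is listed before the single backslash, since alternation tries the alternatives in order.

```python
import re


def split_keep_delims(text, literals):
    """Split text on any of the literal strings, keeping the delimiters."""
    # re.escape makes every literal safe to embed in the pattern; the outer
    # capture group tells re.split to return the matched delimiters as well.
    pattern = "(" + "|".join(re.escape(lit) for lit in literals) + ")"
    # Adjacent delimiters produce empty strings between them; drop those.
    return [part for part in re.split(pattern, text) if part]


patterns = ["aaa", "\\\\", "\\"]  # "aaa", two backslashes, one backslash
text = r"aaaabb\\\\\cc"           # raw string containing five backslashes
print(split_keep_delims(text, patterns))
```

The result has six items: aaa, abb, two items of two backslashes each, one single backslash, and cc (the printed list shows each backslash doubled, because Python prints the list items in their escaped form).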

Search for substring and store another part of the string as variable in perl

I am revamping an old mail tool and adding MIME support. I have a lot of it working but I'm a perl dummy and the regex stuff is losing me.
I had:
foreach ( @{$body} ) {
    next if /^$/;
    if ( /NEMS/i ) {
        /.*?(\d{5,7}).*/;
        $nems = $1;
        next;
    }
    if ( $delimit ) {
        next if (/$delimit/ && ! $tp);
        last if (/$delimit/ && $tp);
        $tp = 1, next if /text.plain/;
        $tp = 0, next if /text.html/;
        s/<[^>]*>//g;
        $newbody .= $_ if $tp;
    } else {
        s/<[^>]*>//g;
        $newbody .= $_ ;
    }
} # End Foreach
Now I have $body_text as the plain text mail body thanks to MIME::Parser. So now I just need this part to work:
foreach ( @{$body_text} ) {
    next if /^$/;
    if ( /NEMS/i ) {
        /.*?(\d{5,7}).*/;
        $nems = $1;
        next;
    }
} # End Foreach
The actual challenge is to find NEMS=12345 or NEMS=1234567 and set $nems=12345 if found. I think I have a very basic syntax problem with the test because I'm not exposed to perl very often.
A coworker suggested:
foreach (split(/\n/,$body_text)){
    next if /^$/;
    if ( /NEMS/i ) {
        /.*?(\d{5,7}).*/;
        $nems = $1;
        next;
    }
}
Which seems to be working, but it may not be the preferred way?
edit:
So this is the most current version based on tips here and testing:
foreach (split(/\n/,$body_text)){
    next if /^$/;
    if ( /NEMS/i ) {
        /^\s*NEMS\s*=\s*(\d+)/i;
        $nems = $1;
        next;
    }
}
Match the last two digits as optional and capture the first five, and assign the capture directly
($nems) = /(\d{5}) (?: \d{2} )?/x; # /x allows spaces inside
The construct (?: ) only groups what's inside, without capture. The ? after it means to match that zero or one time. We need parens so that it applies to that subpattern only. So the last two digits are optional -- five digits or seven digits match. I removed the unneeded .*? and .*
However, by what you say it appears that the whole thing can be simplified
if ( ($nems) = /^\s*NEMS \s* = \s* (\d{5}) (?:\d{2})?/ix ) { next }
where there is now no need for if (/NEMS/) and I've adjusted to the clarification that NEMS is at the beginning and that there may be spaces around =. Then you can also say
my $nems;
foreach ( split /\n/, $body_text ) {
# ...
next if ($nems) = /^\s*NEMS\s*=\s*(\d{5})(?:\d{2})?/i;
# ...
}
which includes the clarification that the new $body_text is a multiline string.
It is clear that $nems is declared (needed) outside of the loop and I indicate that.
This allows yet more digits to follow; it will match on 8 digits as well (but capture only the first five). This is what your trailing .* in the regex implies.
Edit It's been clarified that there can only be 5 or 7 digits. Then the regex can be tightened, to check whether input is as expected, but it should work as it stands, too.
A few notes, let me know if more would be helpful
The match operator returns a list so we need the parens in ($nems) = /.../;
The ($nems) = /.../ syntax is a nice shortcut, for ($nems) = $_ =~ /.../;.
If you are matching on a variable other than $_ then you need the whole thing.
You always want to start Perl programs with
use warnings 'all';
use strict;
This directly helps and generally results in better code.
The clarified problem statement says that all digits following = need to be captured into $nems (and there may be 5, 7, 8, 9, or 10 digits, but not 6). Then the regex is simply
($nems) = /^\s*NEMS\s*=\s*(\d+)/i;
where \d+ means a digit, one or more times. So a string of digits (match fails if there are none).
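For readers more at home outside Perl, the final pattern can be tried in Python as a quick check. This is only a sketch: the helper name `find_nems` is invented here, and it searches the whole multiline body with `re.MULTILINE` instead of splitting it into lines first.

```python
import re

# Same pattern as above; MULTILINE lets ^ match at the start of every line.
NEMS_RE = re.compile(r"^\s*NEMS\s*=\s*(\d+)", re.IGNORECASE | re.MULTILINE)


def find_nems(body_text):
    """Return the digits after the first 'NEMS =' line start, or None."""
    m = NEMS_RE.search(body_text)
    return m.group(1) if m else None


print(find_nems("Hello\n\nNEMS=12345\nRegards"))  # 12345
```

As in the Perl version, the match simply fails (here returning None) when no digits follow the equals sign.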