IntelliJ: Regular expression to join multiple lines into single CSV line?

Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
input: (lines pasted into some Android Studio editor tab)
Rush
IQ
Saga
Yes
desired output:
'Rush','IQ','Saga','Yes'
Using Edit > Find > Replace, I got close with this regex pattern, which matches the newline character (\n) with the goal of eliminating it:
search: ^(.*)$\n
replace: '$1',
[x] Regex
but produces this undesired output:
'Rush',IQ
'Saga',Yes
because after a newline is eliminated, the following line already adjoins the previous one, so it's skipped... so we get this "every other line" behavior.

The fastest and easiest way I could think of is to replace \n by ',' and then manually wrap the whole line in quotes:
The result of the first replacement would be:
Rush','IQ','Saga','Yes
And then just manually add first and last quote.

Step 1: Concatenate the lines, use
(.+)(?:\R|\z)
Replace with '$1',.
The (.+)(?:\R|\z) pattern matches any 1+ chars other than line break chars as many as possible (.+) and captures this into Group 1 and (?:\R|\z) matches either a line break sequence (\R) or (|) the very end of the string (\z).
Step 2: Post-process by replacing ,$ with an empty string. This pattern matches a , at the end of the line.
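For readers who want to try the same two-step idea outside the IDE, here is a minimal Python sketch. It is not the answer's exact pattern: Python's re module has no \R or \z, so the sketch spells out the line-break alternatives and uses \Z instead, and the function name join_lines_as_csv is made up for illustration.

```python
import re

def join_lines_as_csv(text: str) -> str:
    # Step 1: wrap every non-empty line in single quotes and append a comma.
    # [^\r\n]+ stands in for ".+" so that a "\r" from a CRLF ending is not captured.
    quoted = re.sub(r"([^\r\n]+)(?:\r\n|\r|\n|\Z)", r"'\1',", text)
    # Step 2: drop the single trailing comma left over from step 1.
    return re.sub(r",\Z", "", quoted)

print(join_lines_as_csv("Rush\nIQ\nSaga\nYes"))
```

Like the IDE version, this does not escape single quotes that appear inside a line.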

Occasionally I need to join multiple lines of data into a single line, and in this case, specifically as comma-separated values on a single line:
Regex may not be the best solution for this.
CSV library
There are several comma-separated values (CSV) libraries available to make quick work of this.
The libraries will handle a particular problem you may overlook when writing your own code: some of your input lines may contain the single-quote mark within their content. Such cases need to be escaped. Quoting RFC 4180 section 2.7:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Here is an example of using Apache Commons CSV library.
We use lambda syntax with a Scanner to get an Iterable of the lines of text from your input.
We specify using a single-quote, as you desire, rather than default of double-quote in standard CSV.
We use try-with-resources syntax to automatically close the CSVPrinter object, whether our code runs successfully or throws an exception.
String input = "Rush\n" +
"IQ\n" +
"Saga\n" +
"Yes";
Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" ); // Lambda syntax to get a `Iterable` of lines from a `String`.
CSVFormat format =
CSVFormat
.RFC4180
.withQuoteMode( QuoteMode.ALL )
.withQuote( '\'' );
StringBuilder stringBuilder = new StringBuilder();
try (
CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
)
{
printer.printRecord( iterable );
}
catch ( IOException e )
{
e.printStackTrace();
}
String output = stringBuilder.toString();
System.out.println( "output: " + output );
When run:
output: 'Rush','IQ','Saga','Yes'
We can shorten that code.
try (
CSVPrinter printer = new CSVPrinter( new StringBuilder() , CSVFormat.RFC4180.withQuoteMode( QuoteMode.ALL ).withQuote( '\'' ) ) ;
)
{
printer.printRecord( ( Iterable < String > ) ( ) -> new Scanner( input ).useDelimiter( "\n" ) );
System.out.println( printer.getOut().toString() ); // Or: `return printer.getOut()` returning an `Appendable` object.
}
catch ( IOException e )
{
e.printStackTrace();
}
Not that the shortened version is particularly better. Personally, I would use the longer version wrapped in a method in a utility class. Like this:
public String enquoteLines( String input ) {
String output = "";
Iterable < String > iterable = ( ) -> new Scanner( input ).useDelimiter( "\n" ); // Lambda syntax to get a `Iterable` of lines from a `String`.
CSVFormat format =
CSVFormat
.RFC4180
.withQuoteMode( QuoteMode.ALL )
.withQuote( '\'' );
StringBuilder stringBuilder = new StringBuilder();
try (
CSVPrinter printer = new CSVPrinter( stringBuilder , format ) ;
)
{
printer.printRecord( iterable );
output = printer.getOut().toString();
}
catch ( IOException e )
{
e.printStackTrace();
}
return output;
}
Calling it:
String input = "Rush\n" +
"IQ\n" +
"Saga\n" +
"Oui";
String output = this.enquoteLines( input );
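For comparison, the same library-based approach can be sketched with Python's standard csv module. This is only an illustration under the same assumptions as the Java code above (single quote as the quote character, every field quoted); the function name enquote_lines simply mirrors the Java method and is not part of any library.

```python
import csv
import io

def enquote_lines(text: str) -> str:
    buffer = io.StringIO()
    # QUOTE_ALL quotes every field; an embedded quote character is doubled,
    # which is the escaping rule RFC 4180 describes for double quotes.
    writer = csv.writer(buffer, quotechar="'", quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(text.splitlines())
    # The writer ends the record with a line terminator; strip it off.
    return buffer.getvalue().rstrip("\n")

print(enquote_lines("Rush\nIQ\nSaga\nYes"))
```

A line such as It's comes out as 'It''s', which a hand-written regex replacement would not do.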

Related

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December
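The single-pass pattern from the first part of this answer can also be generalized to N items per line by repeating the "field plus bar" part N-1 times inside the capture group. The following Python sketch shows one way to build such a pattern; the function name break_every_nth and its parameters are hypothetical, and it assumes fields are non-empty.

```python
import re

def break_every_nth(text: str, n: int, delimiter: str = "|") -> str:
    d = re.escape(delimiter)
    field = f"[^{d}]+"  # one or more characters that are not the delimiter
    # Capture n fields joined by n-1 delimiters, then match the nth delimiter
    # outside the group so that it is dropped from the replacement.
    pattern = rf"((?:{field}{d}){{{n - 1}}}{field}){d}"
    return re.sub(pattern, lambda m: m.group(1) + "\n", text)

months = "January|February|March|April|May|June|July|August|September|October|November|December"
print(break_every_nth(months, 5))
```

The last, shorter group is left untouched because no delimiter follows it.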
You could first split the string using '|' to get the array of string and then loop through it to perform the logic you want and get the output as required.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
val out = txt.split("\\|")
// grouped(2) walks the items two at a time and keeps a trailing unpaired item when the count is odd
val finalout = out.grouped(2).map(_.mkString("|")).mkString("\n")
println(finalout)

CSV Regex skipping first comma

I am using regex for CSV processing where data can be in Quotes, or no quotes. But if there is just a comma at the starting column, it skips it.
Here is the regex I am using:
(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?|)(?=$|,)
Now the example data I am using is:
,"data",moredata,"Data"
Which should have 4 matches ["","data","moredata","Data"], but it always skips the first comma. It works if the first column is quoted or is not blank, but if it is empty with no quotes, it is ignored.
Here is a sample code I am using for testing purposes, it is written in Dart:
void main() {
String delimiter = ",";
String rawRow = ',,"data",moredata,"Data"';
RegExp exp = new RegExp(r'(?:'+ delimiter + r'"|^")(^,|""|[\w\W]*?)(?="'+ delimiter + r'|"$)|(?:'+ delimiter + '(?!")|^(?!"))([^'+ delimiter + r']*?)(?=$|'+ delimiter + r')');
Iterable<Match> matches = exp.allMatches(rawRow.replaceAll("\n","").replaceAll("\r","").trim());
List<String> row = new List();
matches.forEach((Match m) {
//This checks to see which match group it found the item in.
String cellValue;
if (m.group(2) != null) {
//Data found without speech marks
cellValue = m.group(2);
} else if (m.group(1) != null) {
//Data found with speech marks (so it removes escaped quotes)
cellValue = m.group(1).replaceAll('""', '"');
} else {
//Anything left
cellValue = m.group(0).replaceAll('""', '"');
}
row.add(cellValue);
});
print(row.toString());
}
Investigating your expression
(,"|^")
(""|[\w\W]*?)
(?=",|"$)
|
(,(?!")|^(?!"))
([^,]*?|)
(?=$|,)
(,"|^")(""|[\w\W]*?)(?=",|"$) This part matches quoted strings, which seems to work for you.
Going through this part (,(?!")|^(?!"))([^,]*?|)(?=$|,)
(,(?!")|^(?!")) matches a comma not followed by " OR the start of the line not followed by "
([^,]*?|) matches zero or more non-comma characters, non-greedily; the trailing | alternative is redundant, because *? already allows an empty match
(?=$|,) requires the end of the line or a , to follow.
In CSV, the line ,,,3,4,5 should give 6 matches, but the above only gets 5.
You could add (^(?=,)) at the beginning of the second part, the part that matches unquoted sections.
Here is the second part with the start-of-line match added, and with the groups made non-capturing:
(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Complete: (?:,"|^")(?:""|[\w\W]*?)(?=",|"$)|(?:^(?=,))|(?:,(?!")|^(?!"))(?:[^,]*?)(?=$|,)
Here is another that might work
(?:(?:"(?:[^"]|"")*"|(?<=,)[^,]*(?=,))|^[^,]+|^(?=,)|[^,]+$|(?<=,)$)
I described how that works here: Build CSV parser using regex
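When tuning a CSV regex like the ones above, it helps to have a reference result to compare against. As a small sketch (in Python rather than Dart, and using the standard csv module instead of a regex), the expected fields for the two sample rows can be produced like this; parse_row is a made-up helper name.

```python
import csv

def parse_row(raw_row: str) -> list[str]:
    # csv.reader accepts any iterable of lines; we feed it a single row.
    return next(csv.reader([raw_row]))

print(parse_row(',"data",moredata,"Data"'))
print(parse_row(",,,3,4,5"))
```

The first row yields four fields, the first of them empty, and the second row yields six, which matches the counts discussed above.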

Replace every nth instance of character in string

I'm a bit new to Go, but I'm trying to replace every nth instance of my string with a comma. So for example, a part of my data looks as follows:
"2017-06-01T09:15:00+0530",1634.05,1635.95,1632.25,1632.25,769,"2017-06-01T09:16:00+0530",1632.25,1634.9,1631.65,1633.5,506,"2017-06-01T09:17:00+0530",1633.5,1639.95,1633.5,1638.4,991,
I want to replace every 6th comma with a '\n' so it looks like
"2017-06-01T09:15:00+0530",1634.05,1635.95,1632.25,1632.25,769"
"2017-06-01T09:16:00+0530",1632.25,1634.9,1631.65,1633.5,506"
"2017-06-01T09:17:00+0530",1633.5,1639.95,1633.5,1638.4,991"
I've looked at the regexp package and that just seems to be a finder. The strings package does have a replace but I don't know how to use it to replace specific indices. I also don't know how to find specific indices without going through the entire string character by character. I was wondering if there is a regEx solution that is more elegant than me writing a helper function.
Strings are immutable so I'm not able to edit them in place.
EDIT: Convert the string to a []byte. This allows me to edit the data in place. Then the rest is a fairly simple for loop, where dat is the data.
If that is your input, you should replace ," strings with \n". You may use strings.Replace() for this. This will leave a last, trailing comma, which you can remove by slicing.
Solution:
in := `"2017-06-01T09:15:00+0530",1634.05,1635.95,1632.25,1632.25,769,"2017-06-01T09:16:00+0530",1632.25,1634.9,1631.65,1633.5,506,"2017-06-01T09:17:00+0530",1633.5,1639.95,1633.5,1638.4,991,`
out := strings.Replace(in, ",\"", "\n\"", -1)
out = out[:len(out)-1]
fmt.Println(out)
Output is (try it on the Go Playground):
"2017-06-01T09:15:00+0530",1634.05,1635.95,1632.25,1632.25,769
"2017-06-01T09:16:00+0530",1632.25,1634.9,1631.65,1633.5,506
"2017-06-01T09:17:00+0530",1633.5,1639.95,1633.5,1638.4,991
If you want something more flexible:
package main
import (
"fmt"
"strings"
)
func main() {
input := `"2017-06-01T09:15:00+0530",1634.05,1635.95,1632.25,1632.25,769,"2017-06-01T09:16:00+0530",1632.25,1634.9,1631.65,1633.5,506,"2017-06-01T09:17:00+0530",1633.5,1639.95,1633.5,1638.4,991,`
var result []string
for len(input) > 0 {
token := strings.SplitN(input, ",", 7)
s := strings.Join(token[0:6], ",")
result = append(result, s)
input = input[len(s):]
input = strings.Trim(input, ",")
}
fmt.Println(result)
}
https://play.golang.org/p/mm63Hx24ne
So I figured out what I was doing wrong. I initially had the data as a string, but if I convert it to a []byte then I can update it in place.
This allowed me to use the simple for loop below to solve the issue, relying on nothing other than the count of comma occurrences.
count := 0 // commas seen since the last replacement
for i := 0; i < len(dat); i++ {
	if dat[i] == ',' {
		count++
		if count == 6 {
			// every 6th comma becomes a newline
			dat[i] = '\n'
			count = 0
		}
	}
}
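The counting loop above is not specific to Go or to commas. As a rough sketch of the same idea in Python (where strings are also immutable, so the result is built up in a list), with the hypothetical helper name replace_every_nth:

```python
def replace_every_nth(text: str, target: str, replacement: str, n: int) -> str:
    out = []
    count = 0  # occurrences of target seen so far
    for ch in text:
        if ch == target:
            count += 1
            if count % n == 0:
                # every nth occurrence is swapped for the replacement
                out.append(replacement)
                continue
        out.append(ch)
    return "".join(out)

print(replace_every_nth("a,b,c,d,e,f,", ",", "\n", 3))
```

Here target is assumed to be a single character, as in the original question.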

Regular Expression for Curly Brace

I want to change Text1 to Text2. How can I write a regular expression for this, if it is possible? The text contains sub-sections. The new version should be separated with commas.
Text1:
{Any
{White-collar
{Exec-managerial}
{Prof-specialty}
{Sales}
{Adm-clerical}
}
{Blue-collar
{Tech-support}
{Craft-repair}
{Machine-op-inspct}
{Handlers-cleaners}
{Transport-moving}
{Priv-house-serv}
}
{Other
{Protective-serv}
{Armed-Forces}
{Farming-fishing}
{Other-service}
}
}
Text2:
Exec-managerial,White-collar,Any
Prof-specialty ,White-collar,Any
Sales,White-collar,Any
Adm-clerical,White-collar,Any
Tech-support,Blue-collar,Any
Craft-repair,Blue-collar,Any
Machine-op-inspct,Blue-collar,Any
Handlers-cleaners,Blue-collar,Any
Transport-moving,Blue-collar,Any
Protective-serv,Other,Any
Armed-Forces,Other,Any
Farming-fishing,Other,Any
Other-service,Other,Any
You could convert your data structure into JSON, and then use your favourite map/reduce methods to traverse it...
// define input text
var Text1 = `{Any
{White-collar
{Exec-managerial}
{Prof-specialty}
{Sales}
{Adm-clerical}
}
{Blue-collar
{Tech-support}
{Craft-repair}
{Machine-op-inspct}
{Handlers-cleaners}
{Transport-moving}
{Priv-house-serv}
}
{Other
{Protective-serv}
{Armed-Forces}
{Farming-fishing}
{Other-service}
}
}`
// define output array to store lines
var output = []
// parse json string into plain javascript object
JSON.parse(
// wrap input in array
'[' + Text1
// replace opening braces with name/children json structure
.replace(/{([\w-]+)/g, '{"name": "$1", "children": [')
// replace closing braces with array close
.replace(/}/g, ']}')
// add commas between closing and opening braces
.replace(/}([\n\s]*){/g, '},$1{') + ']'
// loop through outer layer
).forEach(outer => outer.children
// inner layer
.forEach(middle => middle.children
// and finally join all keys with comma and push to output
.forEach(inner => output.push([inner.name, middle.name, outer.name].join(',')))
)
)
// join output array with newlines, and assign to Text2
var Text2 = output.join('\n')
/* Text2 =>
Exec-managerial,White-collar,Any
Prof-specialty,White-collar,Any
Sales,White-collar,Any
Adm-clerical,White-collar,Any
Tech-support,Blue-collar,Any
Craft-repair,Blue-collar,Any
Machine-op-inspct,Blue-collar,Any
Handlers-cleaners,Blue-collar,Any
Transport-moving,Blue-collar,Any
Priv-house-serv,Blue-collar,Any
Protective-serv,Other,Any
Armed-Forces,Other,Any
Farming-fishing,Other,Any
Other-service,Other,Any
*/
If it's just the inner braced stuff you want to leave, this should do it.
Find (?s)(?:.*?({[^{}]*})|.*)
Replace $1\r\n
(?s)
(?:
.*?
( { [^{}]* } ) # (1)
|
.*
)
Otherwise, you can't get the nesting info without a complex recursive regex.
Or use a language with simple function recursion. You would call the function recursively and,
in the function body, take the appropriate action based on the regex \s*{([^\s{}]+)\s*|\s*{([^{}]+)}\s*|\s*}\s*
\s* {
( [^\s{}]+ ) # (1)
\s*
|
\s*
{
( [^{}]+ ) # (2)
}
\s*
|
\s* } \s*
If $1 is not empty, push it onto an array, then call the same function (recursion).
If $2 is not empty, create a temp string, append all the items in the array,
get the next match.
If both $1 and $2 are empty, remove the last item added to the array,
then do a return from the function.
That's all there is to it.
(pseudo-code)
function recurse( string_thats_left )
{
while ( match( string_thats_left, regex ) )
{
if ( $1 matched )
{
push $1 onto array
recurse( match position to end of string );
}
else
if ( $2 matched )
{
write $2 to output
for ( sizeof array )
append "," + element to output
}
else
{
pop the last array element
return
}
}
}
There is actually more to it than this, like matches must be sequential
with no breaks, but this gives the idea.
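To make the pseudo-code concrete, here is a minimal Python sketch of the same idea. It departs from the pseudo-code in two ways, both of which are my assumptions rather than the answer's: it uses an explicit stack of ancestor names instead of recursion, and it tries the leaf alternative ({name}) before the opening-brace alternative, because otherwise a leaf such as {Sales} would also match the "{name" branch. The names TOKEN and flatten are made up for illustration.

```python
import re

# Group 1: a leaf such as "{Sales}"; group 2: an opening "{Name"; no group: a closing "}".
TOKEN = re.compile(r"\s*\{([^{}]+)\}\s*|\s*\{([^\s{}]+)\s*|\s*\}\s*")

def flatten(text: str) -> list[str]:
    rows, ancestors, pos = [], [], 0
    while pos < len(text):
        # Matches must be sequential: each one starts where the previous one ended.
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        if m.group(1) is not None:
            # Leaf: write it followed by its ancestors, innermost first.
            rows.append(",".join([m.group(1).strip()] + ancestors[::-1]))
        elif m.group(2) is not None:
            ancestors.append(m.group(2))  # entering a nested section
        else:
            ancestors.pop()  # leaving the current section
        pos = m.end()
    return rows

sample = "{Any\n{White-collar\n{Sales}\n}\n{Other\n{Armed-Forces}\n}\n}"
print("\n".join(flatten(sample)))
```

Unlike the JSON approach earlier, this works for any nesting depth, since the stack simply grows with each level.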

Remove text between two tags

I'm trying to remove some text between two tags [ & ]
[13:00:00]
I want to remove 13:00:00 from the [] tags.
The number is not the same every time.
It's always a time of day, so it contains only digits and : symbols.
Can someone help me?
UPDATE:
I forgot to say something. The time (13:00:00) was picked from a log file. Looks like that:
[10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!
[10:57:25] [Dbutant] misterflo13 : ils coute chere les enchent aura de feu et T2 du spawn??*
[10:57:35] [Amateur] firebow ?.SkyLegend.? : ouai 0
[10:57:38] [Novice] iPasteque : ils sont gratuit me
[10:57:41] [Novice] iPasteque : ils sont gratuit mec *
[10:57:46] [Dbutant] misterflo13 : on ma dit k'ils etait payent :o
[10:57:57] [Novice] iPasteque : on t'a mytho alors
Ignore the other text; I just want to remove the time between [ & ] (it needs to look like []). The time between [ & ] is updated every second.
It looks like your log has a specific format, and you seem to want to get rid of the time and keep all the other information. OK, read the comments in the code.
I didn't test it, but it should work.
' Read log
Dim logLines() As String = File.ReadAllLines("File_path")
If logLines.Length = 0 Then Return
' prepare array to fill sliced data
Dim lines(logLines.Length - 1) As String
For i As Integer = 0 To logLines.Count - 1
' just cut off time part and add empty brackets for each line
lines(i) = "[]" & logLines(i).Substring(10)
Next
The code above relies on the file having a fixed format: it simply uses a known position in the string to cut at.
Note: The code above can be done in one line using LINQ.
If you want to actually get the data out of it, use IndexOf. Since you are looking for the first occurrence of "[" or "]", just use start index 0.
' get position of open bracket in string
Dim openBracketPos As Integer = myString.IndexOf("[", 0, StringComparison.OrdinalIgnoreCase)
' get position of close bracket in string
Dim closeBracketPos As Integer = myString.IndexOf("]", 0, StringComparison.OrdinalIgnoreCase)
' get string between open and close bracket
' (the second argument of Substring is a length, not an end position)
Dim data As String = myString.Substring(openBracketPos + 1, closeBracketPos - openBracketPos - 1)
This is another possibility using Regex:
Public Function ReplaceTime(ByVal Input As String) As String
Dim m As Match = Regex.Match(Input, "(\[)(\d{1,2}\:\d{1,2}(\:\d{1,2})?)(\])(.+)")
Return m.Groups(1).Value & m.Groups(4).Value & m.Groups(5).Value
End Function
It's more of a readability nightmare but it's efficient and it takes only the brackets containing a time value.
I also took the liberty of making it match for example 13:47 as well as 13:47:12.
Test: http://ideone.com/yogWfD
(EDIT) Multiline example:
You can combine this with File.ReadAllLines() (if that's what you prefer) and a For loop to get the replacement done.
Public Function ReplaceTimeMultiline(ByVal TextLines() As String) As String
For x = 0 To TextLines.Length - 1
TextLines(x) = ReplaceTime(TextLines(x))
Next
Return String.Join(Environment.NewLine, TextLines)
End Function
Above code usage:
Dim FinalT As String = ReplaceTimeMultiline(File.ReadAllLines(<file path here>))
Another multiline example:
Public Function ReplaceTimeMultiline(ByVal Input As String) As String
Dim ReturnString As String = ""
Dim Parts() As String = Input.Split(Environment.NewLine)
For x = 0 To Parts.Length - 1
ReturnString &= ReplaceTime(Parts(x)) & If(x < (Parts.Length - 1), Environment.NewLine, "")
Next
Return ReturnString
End Function
Multiline test: http://ideone.com/nKZQHm
If your problem is to remove numeric strings in the format 99:99:99 that appear inside [], I would do the following.
This assumes you want to replace the [......] numeric string with an empty []. Should you want to completely remove the tag, just replace it with string.Empty.
Here's a demo (in C#, not VB, but you get the point: you need the regex, not the syntax anyway):
List<string> list = new List<string>
{
"[13:00:00]",
"[4:5:0]",
"[5d2hu2d]",
"[1:1:1000]",
"[1:00:00]",
"[512341]"
};
string s = string.Join("\n", list);
Console.WriteLine("Original input string:");
Console.WriteLine(s);
Regex r = new Regex(@"\[\d{1,2}?:\d{1,2}?:\d{1,2}?\]");
foreach (Match m in r.Matches(s))
{
Console.WriteLine("{0} is a match.", m.Value);
}
Console.WriteLine();
Console.WriteLine("String with occurrences replaced with an empty string:");
Console.WriteLine(r.Replace(s, string.Empty).Trim());
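The regex itself carries over to other languages more or less unchanged. As a small sketch in Python, this version replaces a bracketed time with an empty [] (which is what the question asks for) and, like the VB answer above, also accepts a time without seconds; strip_times is a hypothetical helper name.

```python
import re

# A bracketed time such as [13:00:00] or [13:47]; anything else in brackets is left alone.
TIME_TAG = re.compile(r"\[\d{1,2}:\d{1,2}(?::\d{1,2})?\]")

def strip_times(text: str) -> str:
    return TIME_TAG.sub("[]", text)

log = "[10:56:49] [Client thread/ERROR]: Item entity 26367127 has no item?!"
print(strip_times(log))
```

Because re.sub replaces every match in the string, the same call works on a whole multi-line log as well as on a single line.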