So I have variables defined like this in my file:
public static final String hello_world = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world= "bye world"
I have many declarations like that.
Is it possible to format them like this (all '=' signs aligned in one column):
public static final String hello_world   = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world     = "bye world"
I can't even think of a way to do it. Any kind of help is appreciated.
P.S. If it matters, I use Sublime Text 2.
If it is a one-time task you might try the following:
Import the text file into, e.g., Excel using the 'Text to Columns' functionality (separation character: space, with " as the text qualifier) so that column A contains "public" in each row, column B "static", ..., column E the variable names, column F the "=" signs, and column G the variable values (strings).
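For instance, after the import, the row for the first declaration would look like this (assuming the import consumed the surrounding quotes as text qualifiers):

A        B        C       D        E             F   G
public   static   final   String   hello_world   =   hello world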
Then put the following formula into cell H1 (and copy it down to the other rows):
="public static final String "&E1&REPT(" ";50-LEN(E1))&" = "&""""&G1&""""
Afterwards, column H contains the following outputs:
public static final String hello_world                                        = "hello world"
public static final String awesome_world                                      = "awesome world"
public static final String bye_world                                          = "bye world"
Please note that the Excel functions REPT and LEN are named differently if your Excel language is not English, and that the argument separator (';' above) may need to be ',' depending on your locale.
If you're careful with your original layout (so that = signs are separated from the variable name, for example, unlike the third line of data in the example), then this will do the job:
awk '{ if (length($5) > max) max = length($5);
       name[NR] = $5; value[NR] = $0; sub(/^[^"]*"/, "\"", value[NR]); }
     END { format = sprintf("public static final String %%-%ds = %%s\n", max);
           for (i = 1; i <= NR; i++) printf(format, name[i], value[i]); }'
It assumes you are dealing with 'public static final String' throughout (but doesn't verify that). It keeps track of the length of the longest name it reads (line 1), and also the variable name and the material from the open double quote to the end of line (line 2). At the end, it generates a format string which will print the variable names left justified in a field as long as the longest (line 3). It then applies that to the saved data (line 4), generating:
public static final String hello_world   = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world     = "bye world"
To make it bomb-proof (e.g. against data like the original, where one '=' abuts the variable name), you have to work a bit harder, though it shouldn't be insuperable. The simplest fix for the sloppy original format would be to pre-filter the data with:
sed 's/=/ = /'
Extra spaces around properly spaced input won't affect the output, and the missing space in the third sample line of data is fixed. It would be fiddly to do that inside awk, because you'd want it to resplit the line after editing it. You could do something very similar in Perl.
Given that the volumes of data to be processed are unlikely to be in the megabyte range, let alone larger, the two-command solution is perfectly reasonable; you're unlikely to be able to measure the cost of the sed process.
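For example, assuming the declarations live in a file called vars.txt and the awk program above is saved as align.awk (both names are just for illustration), the pre-filter and the aligner combine into one pipeline:

sed 's/=/ = /' vars.txt | awk -f align.awk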
There is no single regex that can solve your problem. Your only option would be to run a series of regexes, one to handle each line length:
s/^(.{40})=/\1 =/
s/^(.{39})=/\1  =/
s/^(.{38})=/\1   =/
And even then, that's probably not what you want and it's probably much, much easier to do it by hand.
The problem is that the only way a regex substitution can insert different strings at different times is if what it's inserting is a backref, and there's no backref to give you your 5 - N space characters. Your other option would be to try to capture a variable number of characters, but in this case there's no way to make that do it for you either.
Regexes were not made to do things like that (they don't support arithmetic), but some text editors are, so just find a fancy text editor or do it by hand.
Since you're using Sublime Text 2, there's a much easier way to do that.
There's a great package for Sublime Text 2 which will do exactly what you want:
Sublime Alignment
Dead-simple alignment of multi-line selections and multiple selections for Sublime Text 2.
Features:
- Align multiple selections to the same column by inserting spaces (or tabs)
- Align all lines in a multi-line selection to the same indent level
- Align the first = on each line of a multi-line selection to the same column
Before:
public static final String hello_world = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world= "bye world"
After:
public static final String hello_world   = "hello world"
public static final String awesome_world = "awesome world"
public static final String bye_world     = "bye world"
Related
I have a dumped (.rem) file with 3 entries per line, separated by tabs ("\t"), as shown below.
Hello World Ocaml
I like Ocaml
To read from this file, the expected type (attrbs) is given via a type annotation when reading, like this:
type attrbs = (string * string * string) list

let chi = open_in file in
let (v : attrbs) = input_value chi in
close_in chi
Now, I get a list in "v", which I use further. In fact, it also works if the entries are separated by space.
This works fine as long as none of the 3 entries in a row contains spaces within itself. I would like to use another file which has the first entry as a string with spaces, the second entry as a string without spaces, and the third entry as any string, as shown below:
This is with spaces Thisiswithoutspaces Thisissomestring
Another one with spaces Anotheronewithoutspaces AnotherString
If I use the code mentioned, since it does not differentiate between space and tab, it takes only the first three words - "This", "is", and "with". I want it to include the spaces and consider "This is with spaces" as an entire string.
I tried searching the web, but couldn't find any solution for it.
Update:
The issue was with the way I read them. Specific formats like "%s %s %s" only work if we add the @ character, as in "%s@\t%s@\t%s". This is described under the title "Scanning indications in format strings" in https://caml.inria.fr/pub/docs/manual-ocaml/libref/Scanf.html. The issue is solved.
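For reference, here's a minimal sketch of a reading loop using the corrected format string (assuming the file is plain text with exactly three tab-separated fields per line; read_with_scanf is just an illustrative name):

let read_with_scanf file =
  let ib = Scanf.Scanning.open_in file in
  let rec loop acc =
    if Scanf.Scanning.end_of_input ib then begin
      Scanf.Scanning.close_in ib;
      List.rev acc
    end else
      (* "%s@\t" reads everything up to the next tab, so a field may
         itself contain spaces *)
      let t = Scanf.bscanf ib "%s@\t%s@\t%s@\n" (fun a b c -> (a, b, c)) in
      loop (t :: acc)
  in
  loop []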
Glad you managed to do this yourself.
However, I wouldn't recommend using Scanf for that. You can do this:
match String.split_on_char '\t' (input_line chi) with
| [a;b;c] -> ...
| exception End_of_file -> ...
| l_wrong_size -> ...
This way, you not only avoid relying on the quirky behavior of Scanf, but you can also easily specify what to do with malformed input.
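For instance, a complete reading loop built on that pattern might look like this (a sketch, with read_triples as an illustrative name; here malformed lines are simply skipped):

let read_triples file =
  let chi = open_in file in
  let rec loop acc =
    match String.split_on_char '\t' (input_line chi) with
    | [a; b; c] -> loop ((a, b, c) :: acc)
    | exception End_of_file -> close_in chi; List.rev acc
    | _wrong_size -> loop acc (* skip malformed lines *)
  in
  loop []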
I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
Help!
Thanks in advance (and sorry for my English)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parentheses to create capture groups, while also excluding the digit string at the end, and then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the capture groups automatically push them to their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]); the text in this case is first processed with =REGEXREPLACE(A1,"_+\d*$","") to remove 123421, which will give you a column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for regex only, and for Google Sheets, was because I need to use it in Google Data Studio (which has the same regex functions as Spreadsheets).
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {} (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!
I would like to copy a certain string (out of a longer list of strings in one cell) and show it in a different cell with Google Sheets. This is what is in the initial cell range A1:A:
"String 1","String 2","String 3"
In B1:B I'd like ONLY String 3, so without the "" and the other strings.
Is this possible with spreadsheets?
Or is there any other way of doing so?
Update
So the task is to get the word inside double quotes, and the matching string sits at the end of the text.
You may use regular expressions to deal with that, the basic formula is:
=REGEXEXTRACT(A1,"([^""]+)""$")
This will give a word inside "" from text in cell A1 at the end of text.
For example:
some text...,"Thisthat","https://www.url.com/de/Thisthat"
gives https://www.url.com/de/Thisthat
You may also use arrayformula:
=ArrayFormula(REGEXEXTRACT(A1:A3,"([^""]+)""$"))
Please read more about REGEXEXTRACT and ARRAYFORMULA in the Google Sheets function documentation.
Old answer
if you want strings to be on their rows, use this formula in B1:
=ArrayFormula(IF(A1:A = "String 3", A1:A, ""))
If you have cells in A1:A, which contain 'string 3', and you want to match them too, use this:
=ArrayFormula(if(REGEXMATCH(A1:A , "String 3"),"String 3",""))
I'm trying to write a formula for Google Sheets which will convert Unicode characters with diacritics to their plain ASCII equivalents.
I see that Google uses RE2 in its "REGEXREPLACE" function. And I see that RE2 offers Unicode character classes.
I tried to write a formula (similar to this one):
REGEXREPLACE("público","(\pL)\pM*","$1")
But Sheets produces the following error:
Function REGEXREPLACE parameter 2 value "\pL" is not a valid regular expression.
I suppose I could write a formula consisting of a long set of nested SUBSTITUTE functions (Like this one), but that seems pretty awful.
Can anyone offer a suggestion for a better way to normalize Unicode letters with diacritical/accent marks in a Google Sheets formula?
[[:^alpha:]] (negated ASCII character class) works fine for REGEXEXTRACT formula.
But =REGEXREPLACE("público","([[:alpha:]])[[:^alpha:]]","$1") gives "pblic" as a result. So, I guess, the formula doesn't know which ASCII character should replace "ú".
Workaround
Let's take the word públicē; we need to replace two symbols in it. Put this word in cell A1, and this formula in cell B1:
=JOIN("",ArrayFormula(IFERROR(VLOOKUP(SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"),D:E,2,0),SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"))))
And then make directory of replacements in range D:E:
    D     E
1   ú     u
2   ē     e
3   ...   ...
This formula is still ugly, but more useful because you can control your directory by adding more characters to the table.
Or use JavaScript
I also found a good solution that works in Google Sheets, via Google Apps Script (GAS):
function normalizetext(text) {
  // Characters to replace, and their plain equivalents at the same index.
  // The 17 trailing spaces in 'normalized' map each punctuation character
  // in 'weird' to a space.
  var weird = 'öüóőúéáàűíÖÜÓŐÚÉÁÀŰÍçÇ!#£$%^&*()_+?/*."';
  var normalized = 'ouooueaauiOUOOUEAAUIcC                 ';
  var idoff = -1, new_text = '';
  var lentext = text.toString().length - 1;
  for (var i = 0; i <= lentext; i++) {
    // indexOf, not search(): search() treats its argument as a regular
    // expression, so characters like '$', '*' or '(' would misbehave.
    idoff = weird.indexOf(text.charAt(i));
    if (idoff == -1) {
      new_text = new_text + text.charAt(i);
    } else {
      new_text = new_text + normalized.charAt(idoff);
    }
  }
  return new_text;
}
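Once the function is saved in the spreadsheet's script editor, it can be called straight from a cell like any built-in function, e.g. =normalizetext(A1).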
This answer doesn't require Google Apps Script, it's still fast, and it's relatively simple. It builds on Max's answer by providing a full lookup table, and it also allows for case-sensitive transliteration (normally VLOOKUP is NOT case-sensitive).
Here is a link to the Google Spreadsheet if you want to jump right into it. If you want to use your own sheet, you'll need to copy the TRANS_TABLE sheet into your Spreadsheet.
In the code snippet below, the source cell is A2, so you'd place this formula in any column on row 2.
Using REGEXREPLACE and SPLIT, we split the string in A2 apart into an array of characters. Then, using ARRAYFORMULA, we do the following to each character in the array: first, the character is converted to its decimal CODE equivalent and matched against the lookup column on the TRANS_TABLE sheet by that number; then, using VLOOKUP, the character a given number of columns over (the index value provided, here the 3rd column) on the TRANS_TABLE sheet is returned. When all characters in the array have been transliterated, we finally JOIN the array of characters back into a single string.
I provided examples with named ranges as well.
=iferror(
join(
"",
ARRAYFORMULA(
vlookup(
code(split(REGEXREPLACE($A2,"(.)", "$1;"),";",TRUE)),
TRANS_TABLE!$A$5:$F,3
)
)
)
,)
You'll note on the TRANS_TABLE sheet I made, I created 4 different transliteration columns, which makes it easy to have a column for each of your transliteration needs. To reference a column, just use a different index number in the VLOOKUP. Each column is simply a replacement-character column.
In some cases, you don't want any conversion made (A -> A or 3 -> 3), so you just copy the same character from the source Glyph column. Where you DO want to convert characters, you type in whatever character you want as the replacement (ñ -> n etc). If you want a character removed altogether, you leave the cell blank (? -> '').
You can see examples of the transliteration output on the data sheet, in which I created 4 different transliteration columns (A-D), referencing each of the transliteration tables from the TRANS_TABLE sheet, for different use-case scenarios.
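For orientation, the first few rows of such a table might look like this (the layout is an assumption on my part; the formula above only relies on the CODE values being in column A, starting at row 5, and on the replacement sitting at index 3):

     A (CODE)   B (glyph)   C (replacement)
5    250        ú           u
6    275        ē           e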
I hope this finally answers your question in a fashion that isn't so "ugly." Cheers.
I have a text file that is in a comma separated format, delimited by " on most fields. I am trying to get that into something I can enumerate through (Generic Collection, for example). I don't have control over how the file is output nor the character it uses for the delimiter.
In this case, the fields are separated by a comma and text fields are enclosed in " marks. The problem I am running into is that some fields have quotation marks in them (e.g. 8" Tray) and are accidentally being picked up as the next field. In the case of numeric fields, they don't have quotes around them, but they do start with a + or a - sign (depicting a positive/negative number).
I was thinking of a RegEx, but my skills aren't that great so hopefully someone can come up with some ideas I can try. There are about 19,000 records in this file, so I am trying to do it as efficiently as possible. Here are a couple of example rows of data:
"00","000000112260 ","Pie Pumpkin ","RET","6.99 "," ","ea ",+0000000006.99000
"00","000000304078 ","Pie Apple caramel ","RET","9.99 "," ","ea ",+0000000009.99000
"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200
There are a lot more fields, but you can get the picture....
I am using VB.NET and I have a generic List setup to accept the data. I have tried using CSVReader and it seems to work well until you hit a record like the 3rd one (with a quote in the text field). If I could somehow get it to handle the additional quotes, than the CSVReader option will work great.
Thanks!
I recommend looking at the TextFieldParser class in .NET. You need to include:
Imports Microsoft.VisualBasic.FileIO
Here's a quick sample:
Dim afile As FileIO.TextFieldParser = New FileIO.TextFieldParser(FileName)
Dim CurrentRecord As String() ' this array will hold each line of data
afile.TextFieldType = FileIO.FieldType.Delimited
afile.Delimiters = New String() {","}
afile.HasFieldsEnclosedInQuotes = True
' parse the actual file
Do While Not afile.EndOfData
    Try
        CurrentRecord = afile.ReadFields()
    Catch ex As FileIO.MalformedLineException
        Stop
    End Try
Loop
From here:
Encoding fileEncoding = GetFileEncoding(csvFile);
// get rid of all double quotes except those used as field delimiters
string fileContents = File.ReadAllText(csvFile, fileEncoding);
string fixedContents = Regex.Replace(fileContents, @"([^\^,\r\n])""([^$,\r\n])", @"$1$2");
using (CsvReader csv =
    new CsvReader(new StringReader(fixedContents), true))
{
    // ... parse the CSV
}
As this link says... Don't roll your own CSV parser!
Use TextFieldParser as Avi suggested. Microsoft has already done this for you. If you ended up writing one, and you find a bug in it, consider replacing it instead of fixing the bug. I did just that recently and it saved me a lot of time.
Have a look at the FileHelpers library.
You could give CsvHelper (a library I maintain) a try; it's available via NuGet. It follows the RFC 4180 standard for CSV. It will be able to handle any content inside of a field, including commas, quotes, and new lines.
CsvHelper is simple to use, but it's also easy to configure it to work with many different types of delimited files.
CsvReader csv = new CsvReader( streamToFile );
IEnumerable<MyObject> myObjects = csv.GetRecords<MyObject>();
If you want to read CSV files on a lower level, you can use the parser directly, which will return each row as a string array.
var parser = new CsvParser( myTextReader );
while( true )
{
string[] line = parser.ReadLine();
if( line == null )
{
break;
}
}
I am posting this as an answer so I can explain how I did it and why.... The answer from Mitch Wheat was the one that gave me the best solution for this case and I just had to modify it slightly due to the format this data was exported in.
Here is the VB Code:
Dim fixedContents As String = Regex.Replace(
File.ReadAllText(csvFile, fileEncoding),
"(?<!,)("")(?!,)",
AddressOf ReplaceQuotes)
The RegEx that was used is what I needed to change because certain fields had non-escaped quotes in them and the RegEx provided didn't seem to work on all examples. This one uses 'Look Ahead' and 'Look Behind' to see if the quote is just after a comma or just before. In this case, they are both negative (meaning show me where the double quote is not before or after a comma). This should mean that the quote is in the middle of a string.
In this case, instead of doing a direct replacement, I am using the function ReplaceQuotes to handle that for me. The reason I am using this is because I needed a little extra logic to detect whether it was at the beginning of a line. If I would have spent even more time on it, I am sure I could have tweaked the RegEx to take into consideration the beginning of the line (using MultiLine, etc) but when I tried it quickly, it didn't seem to work at all.
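The ReplaceQuotes function itself isn't shown above, so here is a minimal sketch of what it might look like based on that description (the csvText field and the choice to drop mid-field quotes are my assumptions, not the original code):

' Hypothetical sketch -- the original ReplaceQuotes wasn't posted.
' Assumes Imports System.Text.RegularExpressions, and that csvText holds
' the same file contents that were passed to Regex.Replace.
Private csvText As String

Private Function ReplaceQuotes(ByVal m As Match) As String
    ' Keep a quote that opens a record (start of the text or right after
    ' a line break); drop any other stray quote inside a field.
    If m.Index = 0 OrElse csvText(m.Index - 1) = ControlChars.Lf Then
        Return m.Value
    End If
    Return ""
End Function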
With this in place, using CSV reader on a 32MB CSV file (about 19000 rows), it takes about 2 seconds to read the file, perform the regex, load it into the CSV Reader, add all the data to my generic class and finish. Real quick!!
The regex to exclude the first and last quote would be (?<!^)(?<!,)(")(?!,)(?!$). Of course, you need to use RegexOptions.Multiline.
That way there is no need for an evaluator function. My code replaces undesired double quotes with single quotes.
Complete C# code is as below.
string fixedCSV = Regex.Replace(
    File.ReadAllText(fileName),
    @"(?<!^)(?<!;)("")(?!;)(?!$)", "'", RegexOptions.Multiline);
// note: this version assumes ';'-separated data; use ',' for comma-separated files
There are at least ODBC drivers for CSV files. But there are different flavors of CSV.
What produced these files? It's not unlikely that there's a matching driver based on the requirements of the source application.
Your problem with CSVReader is that the quote in the third record isn't escaped with another quote (aka double quoting). If you don't escape them, then how would you expect to handle ", in the middle of a text field?
http://en.wikipedia.org/wiki/Comma-separated_values
(I did end up having to work with files (with different delimiters) but the quote characters inside a text value weren't escaped and I ended up writing my own custom parser. I do not know if this was absolutely necessary or not.)
The logic of this custom approach is: read through the file one line at a time, split each line on the comma, remove the first and last characters of each item (removing the outer quotes but not affecting any quotes inside), then add the data to your generic list. It's short and very easy to read and work with.
Dim fr As StreamReader = Nothing
Dim FileString As String = ""
Dim LineItemsArr() As String
Dim FilePath As String = HttpContext.Current.Request.MapPath("YourFile.csv")

fr = New System.IO.StreamReader(FilePath)
While fr.Peek <> -1
    FileString = fr.ReadLine.Trim
    If String.IsNullOrEmpty(FileString) Then Continue While 'Empty line
    LineItemsArr = FileString.Split(","c)
    For Each Item As String In LineItemsArr
        'If every item has a beginning and closing " (quote) then you can
        'just cut the first and last characters of the string here, e.g.:
        'Dim UpdatedItem As String = Item.Substring(1, Item.Length - 2)
        'Then stick the data into your generic List (Of String()?)
    Next
End While
fr.Close()
// Detects the file's encoding by checking for a Unicode byte-order mark
// (UTF-16 BE/LE or UTF-8); falls back to the system default encoding.
public static Encoding GetFileEncoding(String fileName)
{
Encoding Result = null;
FileInfo FI = new FileInfo(fileName);
FileStream FS = null;
try
{
FS = FI.OpenRead();
Encoding[] UnicodeEncodings = { Encoding.BigEndianUnicode, Encoding.Unicode, Encoding.UTF8 };
for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
{
FS.Position = 0;
byte[] Preamble = UnicodeEncodings[i].GetPreamble();
bool PreamblesAreEqual = true;
for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
{
PreamblesAreEqual = Preamble[j] == FS.ReadByte();
}
if (PreamblesAreEqual)
{
Result = UnicodeEncodings[i];
}
}
}
catch (System.IO.IOException)
{
}
finally
{
if (FS != null)
{
FS.Close();
}
}
if (Result == null)
{
Result = Encoding.Default;
}
return Result;
}