Format text file on OSX to tab delimited text file - regex

I have a file that looks like the following, which is output by a program I wrote in AppleScript.
AXTitle: Blah
AXSize: Width: 300
AXSize: Height: 44
AXPosition: X: 217
AXPosition: Y: 170
AXHelp: Blah
AXValue: Value, On
AXEnabled: true
AXFocused: false
AXRole: AXStaticText
AXRoleDescription: Blah
AXTopLevelUIElement: Blah
AXWindow: Blah
I need to put this into a format that is compatible with a database, so a tab-delimited format. This is the format I would like to output. I don't need all the data above, just the fields I select.
123 456 0 1 2 3 4 5 text text
123 456 0 1 2 3 4 5 text text
I am thinking of using regular expressions to format the text as I need.
Can I do this in AppleScript? If not, which programs should I look at, considering I am working on OS X? sed, gawk, perl, python? Can I call these programs from my AppleScript program, or will I have to run them separately?

When you are getting your selected items, just add the tabs as you go. You can also put the items into a list, where you can use text item delimiters to convert to a string with tabs between the items.
set X to {123, 456, 0, 1, 2, 3, 4, 5, "text", "text"}
set {tempTID, AppleScript's text item delimiters} to {AppleScript's text item delimiters, tab}
set {X, AppleScript's text item delimiters} to {X as text, tempTID}
X --> result
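Since the question also asks about Python, here is a minimal sketch of the same idea (collect the selected values, then join them with tabs). The helper name `to_tab_row`, the `NESTED_KEYS` tuple, and the combined key names such as "AXSize Width" are assumptions made for illustration; only the "Key: Value" layout comes from the sample in the question.

```python
# Hypothetical helper, not part of the asker's program. It assumes each line
# looks like "Key: Value", and that AXSize/AXPosition carry a sub-key.
NESTED_KEYS = ("AXSize", "AXPosition")


def to_tab_row(lines, wanted):
    """Return the wanted fields of one UI element as a tab-delimited row."""
    values = {}
    for line in lines:
        key, _, value = line.partition(": ")
        if key in NESTED_KEYS:
            # "AXSize: Width: 300" is stored under the key "AXSize Width".
            sub, _, value = value.partition(": ")
            key = f"{key} {sub}"
        values[key] = value
    # Missing fields become empty strings so every row has the same columns.
    return "\t".join(values.get(name, "") for name in wanted)


sample = [
    "AXTitle: Blah",
    "AXSize: Width: 300",
    "AXSize: Height: 44",
    "AXPosition: X: 217",
    "AXPosition: Y: 170",
    "AXValue: Value, On",
]
columns = ["AXSize Width", "AXSize Height", "AXPosition X", "AXPosition Y", "AXTitle"]
print(to_tab_row(sample, columns))
```

A script like this could be run from AppleScript with do shell script, or separately on the saved file.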

Here is a good tutorial:
http://www.grymoire.com/Unix/Sed.html
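set xxx to "Here is my input"
-- quoted form of keeps spaces and shell metacharacters in the input from being interpreted by the shell
do shell script "echo " & quoted form of xxx & " | sed 's/my/your/'"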

Related

Extracting multiple lines of text between delimiters from a single cell

In Google Sheets or Excel, I would like to extract multiple lines of text between the delimiters x/ and / using a single formula.
INPUT:
x/Apple Juice/,Banana,Grape,x/Pear Juice/,Cherry,Orange,Blueberry
OUTPUT expected:
Apple Juice, Pear Juice
The input line of text may be longer or shorter and the position and instances of "x/text/" can vary.
=ARRAYFORMULA(TEXTJOIN(", ", 1, IFERROR(REGEXEXTRACT(SPLIT(A1, ","), "x/(.*)/"))))
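For anyone who needs the same extraction outside a spreadsheet, a rough Python equivalent might look like this. The function name `extract_marked` is made up for the example, and it assumes the text between x/ and / never contains a slash.

```python
import re


def extract_marked(cell):
    """Join every piece of text found between "x/" and the next "/"."""
    # [^/]* stops at the first closing slash, so each marker is matched
    # on its own even when several appear in the same cell.
    return ", ".join(re.findall(r"x/([^/]*)/", cell))


cell = "x/Apple Juice/,Banana,Grape,x/Pear Juice/,Cherry,Orange,Blueberry"
print(extract_marked(cell))  # Apple Juice, Pear Juice
```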

Is the VBScript RegEx Flavor Lookaround Method known to have problems with textfiles exceeding 5MB?

I would like to know why the following RegEx's:
\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT.)
and:
\b\w{7}\b\s[1]\s[\S\s]+?(?=WHAT WHERE WHAT WHERE WHAT\,\sWHERE\sWHAT.|HOW WHO HOW WHO HOW\,\sWHO\sHOW\.)
seem to work perfectly fine on the following test string:
THIS THAT THIS THAT THIS,
THAT
THIS.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
WHAT WHERE WHAT WHERE WHAT,
WHERE
WHAT.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
HOW WHO HOW WHO HOW,
WHO
HOW.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
IF OR IF OR IF.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
TO FOR TO FOR
TO FOR TO FOR.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
IN UNDER IN
UNDER IN UNDER.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
LEFT RIGHT LEFT
RIGHT LEFT.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
UP DOWN UP DOWN UP
DOWN.
CHAPTER 1
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 2
Text text text 2 text text text 3 text text text 4 text text text.
CHAPTER 3
Text text text 2 text text text 3 text text text 4 text text text.
THE END.
But, when I use the same type of Expression on files that exceed 5MB, it fails.
The VBScript that I am using is as follows:
Option Explicit
Dim strPath : strPath = "myFile.txt"
If Instr(1, WScript.FullName, "CScript", vbTextCompare) = 0 Then
With CreateObject("WScript.Shell")
.Run "cmd.exe /k cscript //nologo """ & WScript.ScriptFullName & """", 1, False
WScript.Quit
End With
Else
With CreateObject("Scripting.FileSystemObject")
If .FileExists(strPath) Then
Call Main(strPath)
Else
WScript.Echo "Input file doesn't exist"
End If
End With
End If
Private Sub Main(filePath)
Dim TempDictionary, Books, Book, b
Set TempDictionary = CreateObject("Scripting.Dictionary")
Set Books = RegEx(GetFileContent(filePath),"\b\w{7}\b\s[1]\s[\S\s]+?THE SECOND BOOK OF MOSES")
If Books.Count > 0 Then
For Each Book In Books
WScript.Echo Replace(Left(Book.Value,70),vbCrLf," ")
Next
Else
WScript.Echo "Document didn't contain any valid books"
WScript.Quit
End If
End Sub
Private Function GetFileContent(filePath)
Dim objFS, objFile, objTS
Set objFS = CreateObject("Scripting.FileSystemObject")
Set objFile = objFS.GetFile(filePath)
Set objTS = objFile.OpenAsTextStream(1, 0)
GetFileContent = objTS.Read(objFile.Size)
Set objTS = Nothing
End Function
Private Function RegEx(str,pattern)
Dim objRE, Match, Matches
Set objRE = New RegExp
objRE.Pattern = pattern
objRE.Global = True
Set RegEx = objRE.Execute(str)
WScript.Echo objRE.Test(str)
End Function
Editor that I am using is here: http://www.regexr.com/
Q: What are you trying to do?
A: I want to be able to split any text file into several string chunks, using a regex that captures everything between two strings. The first delimiter is a fixed term, i.e. "CHAPTER 1". The second delimiter is not fixed: it changes, but its possible values are known, so they can be placed into an array and then parsed.
The problem that I am having is that the Lookaround (?=) seems to either escape or get stuck in a loop. I have been playing around with the "|" operator, as you can see in the second RegEx at the start of this OP.
The test file that I am working with seems to parse just fine. No problem. But the larger files that I am working with... I don't know. Something just goes wrong.
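To make the goal concrete, here is a small Python sketch of the splitting described above: a fixed start string, and a lookahead built from an array of known end strings. It is not VBScript and says nothing about the 5MB behaviour; the function name `split_chunks` and the shortened sample markers are assumptions for illustration only.

```python
import re


def split_chunks(text, start, end_markers):
    """Return each chunk that begins at `start` and stops before an end marker."""
    # re.escape makes punctuation in the markers (commas, dots) match literally.
    ends = "|".join(re.escape(marker) for marker in end_markers)
    # The lazy [\s\S]+? grows until the lookahead sees one of the end markers.
    pattern = re.compile(rf"{re.escape(start)}\b[\s\S]+?(?={ends})")
    return [match.group(0) for match in pattern.finditer(text)]


text = (
    "CHAPTER 1\nA\nCHAPTER 2\nB\n"
    "WHAT WHERE,\n"
    "CHAPTER 1\nC\n"
    "HOW WHO.\n"
)
for chunk in split_chunks(text, "CHAPTER 1", ["WHAT WHERE,", "HOW WHO."]):
    print(repr(chunk))
```

Building the alternation from the array keeps each end string escaped consistently, which avoids mixing an escaped `\.` in one alternative with a bare `.` in another.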

Possible to change the record delimiter in R?

Is it possible to manipulate the record/observation/row delimiter when reading in data (i.e. read.table) from a text file? It's straightforward to adjust the field delimiter using sep="", but I haven't found a way to change the record delimiter from an end-of-line character.
I am trying to read in pipe delimited text files in which many of the entries are long strings that include carriage returns. R treats these CRs as end-of-line, which begins a new row incorrectly and screws up the number of records and field order.
I would like to use a different delimiter instead of a CR. As it turns out, each row begins with the same string, so if I could use something like \nString to identify the true end-of-line, the table would import correctly. Here's a simplified example of what one of the text files might look like.
V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,
Should read into R as
V1 V2 V3 V4
String A 5 some text
String B 2 more text and more text
String B 7 some different text
String A N/A N/A
I can open the files in a text editor and clean them with a find/replace before reading in, but a systematic solution within R would be great. Thanks for your help.
We can read the lines in and collapse them afterwards. g will have the value 0 for the header, 1 for the first record line (and for any follow-on lines that belong with it), and so on. tapply collapses the lines according to g, giving L2, and finally we re-read the collapsed lines:
Lines <- "V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,"
L <- readLines(textConnection(Lines))
g <- cumsum(grepl("^String", L))
L2 <- tapply(L, g, paste, collapse = " ")
DF <- read.csv(text = L2, as.is = TRUE)
DF$V4[ DF$V4 == "" ] <- NA
This gives:
> DF
V1 V2 V3 V4
1 String A 5 some text
2 String B 2 more text and more text
3 String B 7 some different text
4 String A NA <NA>
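The grouping step is not specific to R. As a rough illustration, the same collapse-then-parse idea can be sketched in Python; the function name `merge_continuation_lines` and the `record_prefix` parameter are hypothetical, and the sketch assumes, as in the question, that every true record starts with the same string.

```python
import csv


def merge_continuation_lines(lines, record_prefix="String"):
    """Append lines that do not start a new record to the previous line."""
    merged = []
    for line in lines:
        if line.startswith(record_prefix) or not merged:
            # The header (first line) or the start of a new record.
            merged.append(line)
        else:
            # A stray line break inside a field: glue it back with a space.
            merged[-1] += " " + line
    return merged


raw = [
    "V1,V2,V3,V4",
    "String,A,5,some text",
    "String,B,2,more text and",
    "more text",
    "String,B,7,some different text",
    "String,A,,",
]
rows = list(csv.reader(merge_continuation_lines(raw)))
for row in rows:
    print(row)
```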
If you're on Linux/Mac, you could use a command line tool such as sed instead. Here are two slightly different approaches:
# keep the \n
read.csv(pipe('sed \'N; s/\\([^,]*\\)\\n\\([^,]*$\\)/"\\1\\n\\2"/\' test.txt'))
# V1 V2 V3 V4
#1 String A 5 some text
#2 String B 2 more text and\nmore text
#3 String B 7 some different text
#4 String A NA
# get rid of the \n and replace with a space
read.csv(pipe('sed \'N; s/\\([^,]*\\)\\n\\([^,]*$\\)/\\1 \\2/\' test.txt'))
# V1 V2 V3 V4
#1 String A 5 some text
#2 String B 2 more text and more text
#3 String B 7 some different text
#4 String A NA

Highlight columns which differ from previous lines

I have a text file with numerical data written in lines; input lines (unknowns) are interspersed with residuals lines (which are supposed to be minimized). I am investigating how the iterative solver handles various cases, and would like to highlight every (space-delimited) field in a residuals line which is (textually) different from the same field in the previous residuals line (2 lines above, or better, found by a regexp). I am free to decorate the beginnings of the lines as I like, if that helps.
Is this at all possible with Vim and some regexp magic?
Example file:
input 1 2 3 4 5 6
errors .2 .2 .3 .1 0 0
input 1 2.1 2.9 4 5 6 ## here, 2.1 and 2.9 should be highlighted
errors .21 .3 .44 .3 0 0
input 1 2 3 3.9 5.2 6 ## here, 2, 3, 3.9 and 5.2 should be highlighted
errors .2 .2 .34 .9 1 0
Note: I could write a script that extracts the differences in Python, but I want to look at both the actual data and the changes. If this cannot be done in Vim, I will process the file with Python and output highlighted HTML, but I will lose automatic folding capabilities.
I think that if you use arrays rather than a regex for the comparison, it might be a lot easier:
<?php
$lastRow=array();
$count=0; // Line counter, incremented in the loop below
$h=fopen('file','r');
while ($r=fgetcsv($h,0,' ')) // Retrieve a line into an array and split on spaces
{
if (count($lastRow)==0)
{
$lastRow=$r; // Store last line
}
$count++;
if ($r[0]=='input')
{
/*
* this won't find any differences the first run through
*/
$diffs=array_diff_assoc($r,$lastRow); // Fields of this "input" line that differ, position by position, from the last "input" line
$lastRow=$r; // Only copy into $lastRow if it's an "input" line
}
/*
* Put your code to output line here with relevant CSS for highlighting
*/
}
fclose($h);
?>
I've not tested this code, but I think it shows how you could get a solution without delving into regex.
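Since the question mentions Python as the fallback, here is a minimal sketch of the field-by-field comparison in that language. The function name `changed_fields` and the `kind` parameter are hypothetical; the sketch only reports which field positions changed and leaves the actual highlighting (HTML or otherwise) open.

```python
def changed_fields(lines, kind="errors"):
    """For each line whose first field is `kind`, list the changed field indexes.

    A field counts as changed when its text differs from the field at the
    same position in the previous line of the same kind.
    """
    previous = None
    result = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != kind:
            continue  # Skip blank lines and lines of the other kind.
        if previous is None:
            changed = []  # Nothing to compare the first line against.
        else:
            changed = [
                i for i, field in enumerate(fields)
                if i >= len(previous) or field != previous[i]
            ]
        result.append((line, changed))
        previous = fields
    return result


data = [
    "input 1 2 3 4 5 6",
    "errors .2 .2 .3 .1 0 0",
    "input 1 2.1 2.9 4 5 6",
    "errors .21 .3 .44 .3 0 0",
    "input 1 2 3 3.9 5.2 6",
    "errors .2 .2 .34 .9 1 0",
]
for line, changed in changed_fields(data):
    print(changed, line)
```

Passing kind="input" compares the input lines instead, which matches the annotations in the example file.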

Regex not replacing the last occurrence of a match

I have the following text layout:
Heading
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
Heading
Chapter 2:1 This is text
2 This is more text...
and I am trying to add the first Chapter reference and the last one in that Chapter right after the Heading, written in parentheses. Like so:
Heading (Chapter 1:1-15)
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
I've come up with this regular expression so far:
~s/(?s)(Heading)\r(^\d*\w+\s*\d+:\d+|\d+:\d+)(.*?)(\d+)(.*?\r)(?=Heading)/\1 (\2-\4)\r\2\3\4\5/g;
but this is grabbing the first number right after Chapter 1:1 (i.e. "2", "Heading (Chapter 1:1-2)"), instead of the last one ("15" as in "Heading (Chapter 1:1-15)"). Could someone please tell me what's wrong with the regex? Thank you!
Edit for updated question
Here's a regex with explanation that will solve your problem. http://codepad.org/mSIYCw4R
~s/
((?:^|\n)Heading) #Capture Heading into group 1.
#We can't use lookbehind because of (?:^|\n)
(?= #A lookahead, but don't capture.
\nChapter\s #Find the Chapter text.
(\d+:\d+) #Get the first chapter reference and store it in group 2.
.* #Capture the rest of the Chapter line.
(?:\n(\d+).+)+ #Capture every chapter line.
#The last captured chapter number gets stored into group 3.
)
/$1 (Chapter $2-$3)/gx;
An implementation of @FMc's comment could be something like:
#!/usr/bin/perl
use warnings;
use strict;
my $buffer = '';
while (<DATA>) {
if (/^Heading \d+/) { # process previous buffer, and start new buffer
process_buffer($buffer);
$buffer = $_;
}
else { # add to buffer
$buffer .= $_;
}
}
process_buffer($buffer); # don't forget last buffer's worth...
sub process_buffer {
my($b) = @_;
return unless length $b; # don't bother with an unpopulated buffer
my($last) = $b =~ /(\d+)\s.*$/;
my($chap) = $b =~ /^(Chapter \d+:\d+)/m;
$b =~ s/^(Heading \d+)/$1 ($chap-$last)/;
print $b;
}
__DATA__
Heading 1
Chapter 1:1 This is text
2 This is more text
3 This is more text
4 This is more text
5 This is more text
6 This is more text
7 This is more text
8 This is more text
9 This is more text
10 This is more text
11 This is more text
12 This is more text
13 This is more text
14 This is moret text
15 This is more text
Heading 2
Chapter 2:1 This is text
2 This is more text...
3 This is more text
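The buffering approach carries over to other languages. As a rough sketch in Python (the function name `annotate_headings` is made up, and it assumes every block starts with a line beginning with "Heading" and that verse lines start with a number followed by whitespace):

```python
import re


def annotate_headings(text):
    """Append "(Chapter c:v-last)" to the heading line of every block."""
    # Split in front of each line that starts with "Heading"; the zero-width
    # lookahead keeps the heading text inside its own block.
    blocks = re.split(r"(?m)^(?=Heading\b)", text)
    out = []
    for block in blocks:
        first = re.search(r"(?m)^(Chapter \d+:\d+)", block)
        verses = re.findall(r"(?m)^(\d+)\s", block)
        if first and verses:
            # Only the first line (the heading) of the block is modified.
            head, sep, rest = block.partition("\n")
            block = f"{head} ({first.group(1)}-{verses[-1]}){sep}{rest}"
        out.append(block)
    return "".join(out)


sample = (
    "Heading\n"
    "Chapter 1:1 This is text\n"
    "2 This is more text\n"
    "15 This is more text\n"
    "Heading\n"
    "Chapter 2:1 This is text\n"
    "2 This is more text\n"
)
print(annotate_headings(sample))
```

Blocks with no numbered verse lines after the chapter line are left unchanged in this sketch.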