Reading files which contain unquoted newline characters in text fields using R - regex

I am trying to read a large table into R but one of the text fields occasionally contains one or more unquoted, un-escaped newline characters (\n), thus the read.table() function is not able to easily import this file. The file is pipe delimited and the text fields are not quoted.
I can read it in if I pass the argument fill=T with read.table() but, of course, rows with newline characters in a text field are corrupted by this.
I have successfully been able to use f <- readChar(fname, nchars=file.info(fname)["size"], TRUE) to read sub-segments of the file, then use gsub() to search and destroy the offending newline characters. (see code below) However, the full file is > 100mb, so gsub() does little more than turn my laptop into a hand-warmer (it's still trying to gsub all the newline characters as I write this).
Anyone have any suggestions for how to efficiently read in a file like this?
It seems like there should be some way of telling R to expect a certain number of delimiters before expecting a newline, but I haven't been able to find any way to do this in the documentation.
Sorry, this seems like it should be easy, but it's really been stumping me, and I have not been able to find anything in stackoverflow or google offering a solution.
Here is the code I've tried so far:
attempt 1:
fdat = read.table(file=fname,
allowEscapes=F,
stringsAsFactors=F,
quote="",
fill=T,
strip.white=T,
comment.char="",
header=T,
sep="|")
attempt 2:
f <- readChar(fname, nchars=file.info(fname)["size"], TRUE)
f2 = gsub(pattern="\n(?!NCT)",replacement=" ",x=f, perl=T)
fdat = read.table(text=f2,
allowEscapes=F,
stringsAsFactors=F,
quote="",
fill=F,
strip.white=T,
comment.char="",
header=T,
sep="|")
Here are a few lines from the file:
NCT_ID|DOWNLOAD_DATE|DOWNLOAD_DATE_DT|ORG_STUDY_ID|BRIEF_TITLE|OFFICIAL_TITLE|ACRONYM|SOURCE|HAS_DMC|OVERALL_STATUS|START_DATE|COMPLETION_DATE|COMPLETION_DATE_TYPE|PRIMARY_COMPLETION_DATE|PRIMARY_COMPLETION_DATE_TYPE|PHASE|STUDY_TYPE|STUDY_DESIGN|NUMBER_OF_ARMS|NUMBER_OF_GROUPS|ENROLLMENT_TYPE|ENROLLMENT|BIOSPEC_RETENTION|BIOSPEC_DESCR|GENDER|MINIMUM_AGE|MAXIMUM_AGE|HEALTHY_VOLUNTEERS|SAMPLING_METHOD|STUDY_POP|VERIFICATION_DATE|LASTCHANGED_DATE|FIRSTRECEIVED_DATE|IS_SECTION_801|IS_FDA_REGULATED|WHY_STOPPED|HAS_EXPANDED_ACCESS|FIRSTRECEIVED_RESULTS_DATE|URL|TARGET_DURATION|STUDY_RANK
NCT00000105|Information obtained from ClinicalTrials.gov on September 25, 2012|9/25/2012|2002LS032|Vaccination With Tetanus and KLH to Assess Immune Responses.|Vaccination With Tetanus Toxoid and Keyhole Limpet Hemocyanin (KLH) to Assess Antigen-Specific Immune Responses||Masonic Cancer Center, University of Minnesota|Yes|Terminated|July 2002|March 2012|Actual|March 2012|Actual|N/A|Observational|Observational Model: Case Control, Time Perspective: Prospective||3|Actual|112|Samples With DNA|analysis of blood samples before and 4 weeks postvaccination|Both|18 Years|N/A|Accepts Healthy Volunteers|Probability Sample|- Normal volunteers
- Patients with Cancer (breast, melanoma, hematologic)
- Transplant patients (umbilical cord blood transplant, autologous transplant)
- Patients receiving other cancer vaccines|March 2012|March 26, 2012|November 3, 1999|Yes|Yes|Replaced by another study.|No||http://clinicaltrials.gov/show/NCT00000105||6670
NCT00000106|Information obtained from ClinicalTrials.gov on September 25, 2012|9/25/2012|NCRR-M01RR03186-9943|41.8 Degree Centigrade Whole Body Hyperthermia for the Treatment of Rheumatoid Diseases|||National Center for Research Resources (NCRR)||Active, not recruiting||||||N/A|Interventional|Allocation: Randomized, Intervention Model: Parallel Assignment, Primary Purpose: Treatment|||||||Both|18 Years|65 Years|No|||November 2000|June 23, 2005|January 18, 2000||||No||http://clinicaltrials.gov/show/NCT00000106||7998
As can be seen, these sample lines from my problem file include the header (line #1), a problematic line (line #2), and a non-problematic line (line #3). Each non-header line starts with NCT and ends with \n (this was leveraged in gsub's regular expression).
Any suggestions are much appreciated.

It seems there is no way to solve this using read.table alone. Sadly, it doesn't let you change the "record separator" the way awk can, for example.
Your attempt 2 failed because the DOS-format newline is \r\n (0x0d 0x0a) and your pattern only matches the \n. Say you have the following file:
NCTa|b|c
NCT1|how
are
you?|well
NCT2|are
you
sure?|yes
Then look at the output of your second command:
f2 <- gsub(pattern="\n(?!NCT)",replacement=" ",x=f, perl=TRUE)
f2
# [1] "NCTa|b|c\r\nNCT1|how\r are\r you?|well\r\nNCT2|are\r you\r sure?|yes\r "
So you have to remove \r too. Just fix it to:
f2 <- gsub(pattern="\r?\n(?!NCT)",replacement=" ",x=f, perl=TRUE)
And it will work.
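The lookahead behaves the same way in other Perl-style regex engines, so the fix is easy to check outside R. Below is a minimal Python sketch (Python is used purely for illustration; the sample string mirrors the small file above and assumes DOS line endings) comparing the original pattern with the corrected one:

```python
import re

# The small sample file from above, with DOS (\r\n) line endings.
sample = "NCTa|b|c\r\nNCT1|how\r\nare\r\nyou?|well\r\nNCT2|are\r\nyou\r\nsure?|yes\r\n"

# Original pattern: only the \n is replaced, so stray \r characters remain.
only_lf = re.sub(r"\n(?!NCT)", " ", sample)

# Corrected pattern: the optional \r is consumed together with the \n.
fixed = re.sub(r"\r?\n(?!NCT)", " ", sample)

print(repr(only_lf))
print(repr(fixed))
```

The first result still contains the leftover \r characters, while the second has one record per line.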
Regarding performance, you can try to readChar it by smaller chunks in a loop, gsub them and write them back to file, then read.table it. Just an idea.
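One way to avoid loading the whole file at once is to stream it line by line and glue lines together until a record has the expected number of delimiters, which is the behaviour the question wished read.table had. The following is only a sketch in Python, not R: the function name merge_broken_records and its parameters are made up for illustration, and it assumes the last field of a record never contains a newline (true for the sample, where the last field is a number).

```python
def merge_broken_records(lines, expected_delims, sep="|"):
    """Yield complete records, joining physical lines until each record
    contains at least `expected_delims` separators.

    Assumption: the final field never contains a newline, otherwise the
    record would be emitted too early.
    """
    parts = []   # physical lines collected for the current record
    delims = 0   # separators seen so far in the current record
    for raw in lines:
        line = raw.rstrip("\r\n")  # handles both \n and \r\n endings
        parts.append(line)
        delims += line.count(sep)
        if delims >= expected_delims:
            yield " ".join(parts)
            parts, delims = [], 0
    if parts:  # trailing incomplete record, emitted as-is
        yield " ".join(parts)


def clean_file(src_path, dst_path, sep="|"):
    """Write a cleaned copy of src_path with one record per line."""
    with open(src_path, newline="", encoding="utf-8") as fin, \
         open(dst_path, "w", newline="\n", encoding="utf-8") as fout:
        header = fin.readline().rstrip("\r\n")
        fout.write(header + "\n")
        # The header tells us how many separators a full record has.
        for record in merge_broken_records(fin, header.count(sep), sep):
            fout.write(record + "\n")
```

The cleaned file can then be read with read.table(..., sep="|", quote="", fill=FALSE) as in attempt 2. Memory use stays flat because only one record is held at a time.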

Related

Format a text file by regex match and replace

I have a text file that looks like the following:
Chanelle
Jettie
Winnie
Jen
Shella
Krysta
Tish
Monika
Lynwood
Danae
2649
2466
2890
2224
2829
2427
2816
2648
2833
2453
I need to make it look like this
Chanelle 2649
Jettie 2466
... ...
I tried a lot on sublime editor but couldn't figure out the regex to do that. Can somebody demonstrate if it can be done.
I tested the following in Notepad++ but it should work universally.
Use this as the search string:
(?:(\s+[A-Za-z]+)(\r?\n))((?:\s*[A-Za-z]*\r?\n)+)\s+(\d+)
and this as the replacement:
$1 $4$2$3
Running a replace with it once handles one line at a time; if you run it multiple times, it will keep replacing lines until no matching lines are left.
Alternatively, you can use this as the replacement if you want to have the values aligned by tabs, but it's not going to match in all cases:
$1\t\t$4$2$3
While the regex answer by SeinopSys will work, you don't need a regex to do this - instead, you can take advantage of Sublime's multiple cursors.
1. Place your cursor at the beginning of line 1, then hold down Shift+↓ to select all the names.
2. Hit Ctrl+Shift+L (Selection -> Split into Lines) to split the selection into lines.
3. Ctrl+C to copy.
4. Place your cursor on line 11 (the first number line) and press Ctrl+Shift+↓ (Windows/OS X) or Alt+Shift+↓ (Linux) to place a cursor at the beginning of each number line.
5. Hit Ctrl+V to paste the names before the numbers.
You can now delete the names at the top and you're all set. Alternatively, you could use Ctrl+X to cut the names in step 3.
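If an editor is not a requirement, the same pairing can be done with a short script. This is a minimal Python sketch, assuming every non-blank line is either a name or an all-digit number and that there are equally many of each; the function name pair_names_with_numbers is hypothetical.

```python
def pair_names_with_numbers(lines):
    """Pair the i-th name with the i-th number, in order of appearance."""
    entries = [ln.strip() for ln in lines if ln.strip()]
    names = [e for e in entries if not e.isdigit()]
    numbers = [e for e in entries if e.isdigit()]
    if len(names) != len(numbers):
        raise ValueError("expected the same number of names and numbers")
    return [f"{name} {number}" for name, number in zip(names, numbers)]


if __name__ == "__main__":
    sample = ["Chanelle", "Jettie", "Winnie", "2649", "2466", "2890"]
    for row in pair_names_with_numbers(sample):
        print(row)
```

Unlike the regex, this does not need to be run repeatedly, because each name is matched to its number by position rather than by pattern.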

Advanced text replacement (cloze deletion)

Well, I'd like to replace specific texts based on text, yeah sounds funny, so here it is.
The problem is how to replace the tab-separated values. Essentially, what I'd like to do is replace the matching vocabulary string found on the sentence with {...}.
The value before the tab \t is the vocab, the value after the tab is the sentence. The value on the left of the \t is the first column; to its right is the second column.
TL;DR Version (English Version)
Essentially, I want to replace the text on the second column based on the first Column.
Examples:
ABCD \t 19475ABCD_97jdhgbl
would turn into
ABCD \t 19475{...}_97jdhgbl
ABCD is the first column here and 19475ABCD_97jdhgbl is the second one.
If you don't get the context of the Long Version below, solving this ABCD problem would be fine by me. I think it's quite a simple code but given that it's been about 4 years since I last coded in C and I've only recently started learning python, I can't do it.
Long Version: (Japanese-specific text)
1. Case 1: (For pure Kanji)
全部 \t それ、全部ください。
would become
全部 \t それ、{...}ください。
2. Case 2: (For pure Kana)
ああ \t ああうるさい人は苦手です。
would become
ああ \t {...}うるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
would become
あいづち \t 彼の話に私は{...}を打ったの。
For Case 1 and Case 2 it has to be exact matches, especially for kana because otherwise it might replace other kana in the sentence. The coding for Case 3 has to be different (see next).
3. Case 3: (for mixed Kana and Kanji)
This is the most complex one. For this one, I'd like the script/solution to change only the matching strings, i.e., it will ignore what doesn't match and only replace those with found matches. What it does is take the longest possible match and replace it accordingly.
上げる \t 彼は荷物をあみだなに上げた。
would become
上げる \t 彼は荷物をあみだなに{...}た。
Note here that the first column has 上げる but the second column has 上げた because it has changed in tense (First column has る while the second one has た).
So, ideally the solution should take the longest string found in both columns, in this case it is 上げ, so this is the only string replaced with {...}, while it leaves た.
Another example
が増える \t 値段がが増える
would become
が増える \t 値段が{...}
More TL;DR
I'm actually using this for Anki.
I could use excel or notepad++ but I don't think they could replace text based on placeholders.
My goal here is to create pseudo-cloze sentences that I can use as hints hidden in a hint field only to be used for ridiculously hard synonyms or homonyms (I have an Auditory card).
I know I'm missing a fourth case, i.e., pure kana with the possibility of a sentence having changed its tense, hence its spelling. Well, that'd be really hard to code so I'd rather do it manually so as not to mess up the other kana in the sentence.
Update
I forgot to say that the text is contained in a .txt file in this format:
全部 \t それ、全部ください。
ああ \t ああうるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
上げる \t 彼は荷物をあみだなに上げた。
There are about 7000 lines of those things so it has to check the replacements for every line.
Code works, thanks, just a minor bug with sentences including non-full replacements, it creates broken characters.
上げたxxxx 彼は荷物をあみだなに上げあ。
ABCD ABCD123
86876 xx86876h897
全部 それ、全部ください
ああ ああうるさい人は苦手です。
上げたxxxx 彼は荷物をあみだなに上げあ。
務める ああうるさい人は苦手で務めす。
務める ああうるさい務めす人は苦手で。
turns into:
Just edited James' code a bit for testing purposes (I'm using this edited version to check what kind of strings would throw off the code).
So far I've discovered that spaces in the vocabulary could cause some trouble.
This code prints the original line below the parsed line.
Just change this line:
fout.write(output)
to this
fout.write(output+str(line)+'\n')
This regex should deal with the cases you are looking for (including matching the longest possible pattern in the first column):
^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$
You can then go on to use the match groups to make the specific replacement you are looking for. Here is an example solution in python:
import re
regex = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')
with open('output.txt', 'w', encoding='utf-8') as fout:
    with open('file.txt', 'r', encoding='utf-8') as fin:
        for line in fin:
            match = regex.match(line)
            if match:
                hint = match.group(3).replace(match.group(1), '{...}')
                output = '{0}\t{1}\n'.format(match.group(1) + match.group(2), hint)
                fout.write(output)
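For Case 3 (the longest string shared by both columns), an alternative to the regex is to compute the longest common substring directly. The following is a sketch only, using difflib.SequenceMatcher from the standard library; the helper names cloze and cloze_line are hypothetical, and it assumes each line holds exactly one tab between the vocabulary and the sentence.

```python
from difflib import SequenceMatcher

PLACEHOLDER = "{...}"


def cloze(vocab, sentence):
    """Replace the longest substring shared by vocab and sentence."""
    matcher = SequenceMatcher(None, vocab, sentence, autojunk=False)
    m = matcher.find_longest_match(0, len(vocab), 0, len(sentence))
    if m.size == 0:
        return sentence  # nothing in common, leave the sentence alone
    return sentence[:m.b] + PLACEHOLDER + sentence[m.b + m.size:]


def cloze_line(line):
    """Process one 'vocab<TAB>sentence' line (raises if there is no tab)."""
    vocab, sentence = (p.strip() for p in line.rstrip("\n").split("\t", 1))
    return f"{vocab}\t{cloze(vocab, sentence)}"


if __name__ == "__main__":
    print(cloze_line("上げる\t彼は荷物をあみだなに上げた。"))
    print(cloze_line("ABCD\t19475ABCD_97jdhgbl"))
```

Because this always accepts a partial match, it can replace a single shared kana in pure-kana entries; for Cases 1 and 2 an exact sentence.replace(vocab, "{...}") is the safer choice.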

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} are known as a capture group, and are responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion, and it basically peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
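The difference between the two patterns is easiest to see on a few inputs. Here is a small sketch in Python rather than VB.Net (the patterns are identical; only the replacement syntax differs, \1 in Python versus $1 in .NET). The sample strings are invented for illustration.

```python
import re

LOOKAHEAD = re.compile(r" (\d{5})(?!\d)")  # 5 digits not followed by a digit
BOUNDARY = re.compile(r" (\d{5})\b")       # 5 digits followed by a word boundary


def tab_before_zip(text, pattern=LOOKAHEAD):
    """Replace the space in front of a 5-digit group with a tab."""
    return pattern.sub(r"\t\1", text)


if __name__ == "__main__":
    for s in ["CITY ST 99999", "ID 1234567", "REF 12345A"]:
        print(repr(tab_before_zip(s)), repr(tab_before_zip(s, BOUNDARY)))
```

Both versions leave the 7-digit number alone; they differ only on input such as "REF 12345A", where the lookahead version matches (A is not a digit) and the \b version does not (there is no word boundary between 5 and A).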

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by their first letter (P, N, L, S). Each package must be preceded by at least two dedicated lines beginning with *|, which are then read as comments. Blank lines between different letters are filled with *|. Where the lines between different letters do not begin with *|, such lines have to be added. Blank lines and 'random' characters between identical letters are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string into lines. (The newline characters depend on your operating system, so the pattern below accepts both \r\n and \n.)
lines = regexp(text, '\r?\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
You'll be needing to know your way around regular expressions. There's a good guide to them at regular-expressions.info.
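For comparison, the two starter steps above (filling blank lines with a comment and prefixing lines that begin with a pipe) can be sketched in Python as follows. This mirrors only those two steps, not the full cleaning rules from the question; the function name mark_comment_lines is hypothetical.

```python
COMMENT = "*|comment"


def mark_comment_lines(text):
    """Return the lines of text with blank lines replaced by the comment
    string and pipe-leading lines prefixed with it."""
    cleaned = []
    for line in text.splitlines():  # splitlines handles \n and \r\n
        if line == "":
            cleaned.append(COMMENT)
        elif line.startswith("|"):
            # '|||' becomes '*|comment|||', as with the regexprep call
            cleaned.append(COMMENT + line)
        else:
            cleaned.append(line)
    return cleaned


if __name__ == "__main__":
    sample = "N|I|01\r\n\r\n|||\r\nL|I|H01\r\n"
    print("\n".join(mark_comment_lines(sample)))
```

Grouping by first letter and inserting the missing comment lines between packages would need an extra pass that tracks the previous line's first character.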

Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one line?

I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of companies mentioned in that week's issue, plus the page numbers they appear on.
I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in Adobe InDesign CS3 and Adobe InCopy CS3).
I have set up a list of companies I want to search for and, using PowerGREP and using delimited regular expressions, I am able to find most page numbers where a company is mentioned. However, where a company name contains two or more words, the search I am running will not pick up instances where the name appears over more than one line.
For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the text appeared like this:
DTZ beat BNP PRE, CB [line break here]
Richard Ellis and Cushman & [line break here]
Wakefield to secure the contract. [line end here]
Could someone advise me on how to write a regular expression that will ignore white space between words and ignore line endings, OR one that will look for the words including all types of white space (i.e. uneven spaces between words, spaces at the end of lines or line endings, and tabs)? I am guessing that this info is embedded somehow in PDF files.
Here is a sample of the set of terms I have asked PowerGREP to search for:
\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b
[Note that there is a delimited hard return between each \b at the end of each phrase and beginning of the next phrase.]
By the way, I am a production journalist and not usually involved in finding IT-type solutions and am finding it difficult to get to grips with the technical language on the PowerGREP site.
Thanks for assistance
Alison
You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.
E.g.:
CB\s+Richard\s+Ellis
What's happening is, when you have a forced line break it doesn't have that space (" ") character anymore. Instead it has \n or \r\n. Using \s+ means that you are looking for any whitespace character, including carriage-returns and linefeeds, in quantity of one or more.
The regex for matching spaces is \s, so it would be
\bCB\s+Richard\s+Ellis\b
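(\s+ = match at least one whitespace character). Line breaks are \n (newline) and \r (carriage return), depending on your OS. In most regex flavors \s already matches both, so the version above is enough; if you prefer to spell them out, a character class [] containing all of them, [\r\n\s], gives: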
\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b
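With a long list of company names, rewriting every space by hand is tedious, so the patterns can be generated instead. Below is a minimal sketch in Python (not PowerGREP; the helper names flexible_pattern and companies_on_page are hypothetical) that turns each plain name into a \b...\s+...\b pattern and checks which names occur in a page's text.

```python
import re


def flexible_pattern(name):
    """Build a regex that matches the words of `name` separated by any
    run of whitespace, including line breaks."""
    words = name.split()
    body = r"\s+".join(re.escape(w) for w in words)
    return re.compile(r"\b" + body + r"\b")


def companies_on_page(page_text, names):
    """Return the names that occur somewhere in page_text."""
    return [n for n in names if flexible_pattern(n).search(page_text)]


if __name__ == "__main__":
    page = ("DTZ beat BNP PRE, CB\n"
            "Richard Ellis and Cushman &\n"
            "Wakefield to secure the contract.\n")
    names = ["CB Richard Ellis", "Cushman & Wakefield", "Crown Estate"]
    print(companies_on_page(page, names))
```

The generated pattern strings (flexible_pattern(name).pattern) could also be written to a file, one per line, and used as the search list in another tool, assuming that tool's flavor treats \s and \b the same way.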