How to remove CRLF conditionally from a text file, preferably in Notepad++ - regex

I've been looking for this one all day now; this is the closest useful reference I found.
My problem: huge files come from a closed system (which can't be altered at the source) and need to be imported into ours. These files are |-separated and have a CRLF at the end of each line (except the last one). Now they found it funny to include a new record type that can contain text with embedded CR and CRLF (instead of <br>).
So what I need to do before I can process this file in our system is replace all CRLF and CR occurrences that are not preceded by a | with <br>, so that every line starts with a code like 000| ... 600|
The closest I've got in Notepad++:
Find: (?<![\|])[\r\n]+$
Replace: <br>
The problem is that it will not produce a <br> for every CRLF; it misses a CRLF that comes after a CR. Other attempts that matched the |CRLF too forgot the CR altogether.
Any thoughts greatly appreciated. Do keep in mind that the file can be over 500 MB (which complicates things a bit).
Extract of the file:
000|709076|153943|11||1|CRLF
300|709076|153943|11|4|20000729||Majo509|CRLF
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF
600|709076|153943|11||SBV|7103||||20120509|CRLF
600|709076|153943|11||SBV|7105||||20120509|CRLF
600|709076|153943|11||SBV|7607||||20120509|CRLF
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :CR
CRLF
------------------------------CR
CRLF
CRLF
Gezien U het evaluatierooster niet heeft ingevuld, blijft CR
CRLF
CRLF
|||20120509|CRLF
600|709076|153943|11||SBV|7517||||20120509|CRLF
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF
300|709209|154072|9|3|20HEM-AT20120509|CRLF
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF
600|709209|154072|9||SBV|7103||||20120509|CRLF
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET CR
CRLF
(DE) GEBOUW(EN) CR
CRLF
CR
CRLF
indien U huurder of gebruiker bent.|||20120509|CRLF
600|709209|154072|9||MC||DIEFSTAL CRLF
...
Required result (rough copy-paste job ;)):
000|709076|153943|11||1|CRLF
300|709076|153943|11|4|20000729||Majo509|CRLF
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF
600|709076|153943|11||SBV|7103||||20120509|CRLF
600|709076|153943|11||SBV|7105||||20120509|CRLF
600|709076|153943|11||SBV|7607||||20120509|CRLF
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :<BR><BR>---------------------<BR><BR><BR>Gezien U het evaluatierooster niet heeft ingevuld, blijft <BR><BR>||20120509|CRLF
600|709076|153943|11||SBV|7517||||20120509|CRLF
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF
300|709209|154072|9|3|20HEM-AT20120509|CRLF
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF
600|709209|154072|9||SBV|7103||||20120509|CRLF
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET <BR><BR>(DE) GEBOUW(EN) <BR><BR><BR><BR>indien U huurder of gebruiker bent.|||20120509|CRLF
600|709209|154072|9||MC||DIEFSTAL CRLF

Wow, this one fazed me for a little while...
It's tricky to do in one pass.
The N++ constraint probably makes it tougher than it needs to be, but short of writing some code to do what you want, it's a reasonable way to go, I guess.
While I'm not sure it's optimal, I had success with this combination.
Find:
([^|])\r([\r\n])*
Replace:
$1<br>
You need the $1 in the replacement or you lose a character from your replaced lines - probably not what you want!
Ideally, you should look into some Perl (I'm no Perl advocate; other scripting languages that handle regex are available...) or something similar to do this.
Edit:
Just a thought: this makes the assumption that there won't be sections of your file containing |CRLF, |CR, or |CRCR that are not 'real' line endings.

Edit: scrapped my last suggestions - they didn't work.
As suggested by BunjiquoBianco, I think this is not possible in one pass.
It would be much better if you could use awk. If you are using Windows, try http://gnuwin32.sourceforge.net/packages/gawk.htm
If awk is a viable option, re-ask the question and the awk nuts will probably suggest a one-liner you can run from the command prompt to parse the whole file.
awk is fast, too - it would give you a much faster transformation, and it can be included in other scripts more easily, thereby cutting out any manual N++ process.

Related

Advanced text replacement (cloze deletion)

Well, I'd like to replace specific text based on other text - yeah, sounds funny, so here it is.
The problem is how to replace the tab-separated values. Essentially, what I'd like to do is replace the matching vocabulary string found in the sentence with {...}.
The value before the tab (\t) is the vocab; the value after the tab is the sentence. The value on the left of the \t is the first column, and to its right is the second column.
TL;DR Version (English Version)
Essentially, I want to replace the text on the second column based on the first Column.
Examples:
ABCD \t 19475ABCD_97jdhgbl
would turn into
ABCD \t 19475{...}_97jdhgbl
ABCD is the first column here and 19475ABCD_97jdhgbl is the second one.
If you don't get the context of the long version below, solving this ABCD problem would be fine by me. I think it's quite simple code, but given that it's been about four years since I last coded in C and I've only recently started learning Python, I can't do it.
Long Version: (Japanese-specific text)
1. Case 1: (For pure Kanji)
全部 \t それ、全部ください。
would become
全部 \t それ、{...}ください。
2. Case 2: (For pure Kana)
ああ \t ああうるさい人は苦手です。
would become
ああ \t {...}うるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
would become
あいづち \t 彼の話に私は{...}を打ったの。
For Case 1 and Case 2 it has to be exact matches, especially for kana because otherwise it might replace other kana in the sentence. The coding for Case 3 has to be different (see next).
3. Case 3: (for mixed Kana and Kanji)
This is the most complex one. For this one, I'd like the script/solution to change only the matching strings, i.e., it should ignore what doesn't match and replace only what does. It takes the longest possible match and replaces accordingly.
上げる \t 彼は荷物をあみだなに上げた。
would become
上げる \t 彼は荷物をあみだなに{...}た。
Note here that the first column has 上げる but the second column has 上げた because it has changed in tense (First column has る while the second one has た).
So, ideally, the solution should take the longest string found in both columns - in this case 上げ - so that is the only string replaced with {...}, while た is left alone.
Another example
が増える \t 値段がが増える
would become
が増える \t 値段が{...}
More TL;DR
I'm actually using this for Anki.
I could use Excel or Notepad++, but I don't think they can replace text based on placeholders.
My goal here is to create pseudo-cloze sentences that I can use as hints hidden in a hint field only to be used for ridiculously hard synonyms or homonyms (I have an Auditory card).
I know I'm missing a fourth case, i.e., pure kana with the possibility of a sentence having changed its tense, hence its spelling. That would be really hard to code, so I'd rather do it manually so as not to mess up the other kana in the sentence.
Update
I forgot to say that the text is contained in a .txt file in this format:
全部 \t それ、全部ください。
ああ \t ああうるさい人は苦手です。
あいづち \t 彼の話に私はあいづちを打ったの。
上げる \t 彼は荷物をあみだなに上げた。
There are about 7000 lines of those things so it has to check the replacements for every line.
The code works, thanks - there's just a minor bug with sentences involving non-full replacements: it creates broken characters.
上げたxxxx 彼は荷物をあみだなに上げあ。
ABCD ABCD123
86876 xx86876h897
全部 それ、全部ください
ああ ああうるさい人は苦手です。
上げたxxxx 彼は荷物をあみだなに上げあ。
務める ああうるさい人は苦手で務めす。
務める ああうるさい務めす人は苦手で。
turns into:
Just edited James' code a bit for testing purposes (I'm using this edited version to check what kind of strings would throw off the code).
So far I've discovered that spaces in the vocabulary could cause some trouble.
This code prints the original line below the parsed line.
Just change this line:
fout.write(output)
to this
fout.write(output+str(line)+'\n')
This regex should deal with the cases you are looking for (including matching the longest possible pattern in the first column):
^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$
Regex demo here.
You can then go on to use the match groups to make the specific replacement you are looking for. Here is an example solution in python:
import re

regex = re.compile(r'^(\S+)(\S*?)\s+?(\S*?(\1)\S*?)$')
with open('output.txt', 'w', encoding='utf-8') as fout:
    with open('file.txt', 'r', encoding='utf-8') as fin:
        for line in fin:
            match = regex.match(line)
            if match:
                hint = match.group(3).replace(match.group(1), '{...}')
                output = '{0}\t{1}\n'.format(match.group(1) + match.group(2), hint)
                fout.write(output)
Python demo here.
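For the tense-change cases (Case 3, and arguably the asker's "fourth case"), one possible extension is to fall back to the longest common substring when the vocab isn't found verbatim. This is a sketch, not part of the answer above; the function name cloze and the minimum match length of 2 are my own assumptions:

```python
import difflib

def cloze(vocab: str, sentence: str) -> str:
    """Replace the longest run of vocab found in sentence with {...}.

    If the vocab appears verbatim, replace it directly (Cases 1 and 2).
    Otherwise fall back to the longest common substring, which handles
    conjugation changes such as 上げる vs 上げた (Case 3).
    """
    if vocab in sentence:
        return sentence.replace(vocab, '{...}')
    m = difflib.SequenceMatcher(None, vocab, sentence).find_longest_match(
        0, len(vocab), 0, len(sentence))
    if m.size >= 2:  # require at least two shared characters (arbitrary cutoff)
        return sentence[:m.b] + '{...}' + sentence[m.b + m.size:]
    return sentence  # no usable overlap: leave the sentence alone
```

The size cutoff keeps a single shared kana from triggering a bogus replacement; tune it to taste.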

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input = input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working: I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")... which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the state/zip situation. I have a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that helps.
If input Like "Guar*###/###-####" Then
    input = input.Replace("Guar:", "")
    input = input.Replace(" ", ControlChars.Tab)
    input = input.Replace(",", ControlChars.Tab)
    input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If

input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is... before I removed the tabs. Now I need a way to put them back after they've been removed, or to omit this small group of lines from the document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?
I think what you're looking for is
Dim rgx As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
input = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} form what's known as a capture group, and are responsible for putting what's captured into a pseudo-variable named $1. The (?!\d) is a more advanced construct known as a negative lookahead assertion: it peeks at the next character to check that it's not a digit (because then it could be a number of 6 or more digits whose first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.
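If it helps to experiment with the pattern outside VB.Net, the same substitution can be sketched in Python (the function name is mine, purely for illustration; the pattern is the one discussed above):

```python
import re

def tab_before_zip(s: str) -> str:
    """Insert a tab before a standalone 5-digit ZIP code.

    Mirrors the VB.Net pattern " (\\d{5})(?!\\d)": a space, five digits,
    and a negative lookahead so 6-or-more-digit numbers are left alone.
    """
    return re.sub(r' (\d{5})(?!\d)', '\t\\1', s)
```

Running the sample address fragment through it turns "ST 99999" into "ST", a tab, then "99999", which is the missing piece of the tab-delimited output.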

Reading files which contain unquoted newline characters in text fields using R

I am trying to read a large table into R, but one of the text fields occasionally contains one or more unquoted, unescaped newline characters (\n), so the read.table() function is not able to easily import this file. The file is pipe-delimited and the text fields are not quoted.
I can read it in if I pass the argument fill=T to read.table() but, of course, rows with newline characters in a text field are corrupted by this.
I have successfully been able to use f <- readChar(fname, nchars=file.info(fname)["size"], TRUE) to read sub-segments of the file, then use gsub() to search for and destroy the offending newline characters (see code below). However, the full file is > 100 MB, so gsub() does little more than turn my laptop into a hand-warmer (it's still trying to gsub all the newline characters as I write this).
Anyone have any suggestions for how to efficiently read in a file like this?
It seems like there should be some way of telling R to expect a certain number of delimiters before expecting a newline, but I haven't been able to find any way to do this in the documentation.
Sorry, this seems like it should be easy, but it's really been stumping me, and I have not been able to find anything in stackoverflow or google offering a solution.
Here is the code I've tried so far:
attempt 1:
fdat = read.table(file=fname,
                  allowEscapes=F,
                  stringsAsFactors=F,
                  quote="",
                  fill=T,
                  strip.white=T,
                  comment.char="",
                  header=T,
                  sep="|")
attempt 2:
f <- readChar(fname, nchars=file.info(fname)["size"], TRUE)
f2 = gsub(pattern="\n(?!NCT)",replacement=" ",x=f, perl=T)
fdat = read.table(text=f2,
                  allowEscapes=F,
                  stringsAsFactors=F,
                  quote="",
                  fill=F,
                  strip.white=T,
                  comment.char="",
                  header=T,
                  sep="|")
Here are a few lines from the file:
NCT_ID|DOWNLOAD_DATE|DOWNLOAD_DATE_DT|ORG_STUDY_ID|BRIEF_TITLE|OFFICIAL_TITLE|ACRONYM|SOURCE|HAS_DMC|OVERALL_STATUS|START_DATE|COMPLETION_DATE|COMPLETION_DATE_TYPE|PRIMARY_COMPLETION_DATE|PRIMARY_COMPLETION_DATE_TYPE|PHASE|STUDY_TYPE|STUDY_DESIGN|NUMBER_OF_ARMS|NUMBER_OF_GROUPS|ENROLLMENT_TYPE|ENROLLMENT|BIOSPEC_RETENTION|BIOSPEC_DESCR|GENDER|MINIMUM_AGE|MAXIMUM_AGE|HEALTHY_VOLUNTEERS|SAMPLING_METHOD|STUDY_POP|VERIFICATION_DATE|LASTCHANGED_DATE|FIRSTRECEIVED_DATE|IS_SECTION_801|IS_FDA_REGULATED|WHY_STOPPED|HAS_EXPANDED_ACCESS|FIRSTRECEIVED_RESULTS_DATE|URL|TARGET_DURATION|STUDY_RANK
NCT00000105|Information obtained from ClinicalTrials.gov on September 25, 2012|9/25/2012|2002LS032|Vaccination With Tetanus and KLH to Assess Immune Responses.|Vaccination With Tetanus Toxoid and Keyhole Limpet Hemocyanin (KLH) to Assess Antigen-Specific Immune Responses||Masonic Cancer Center, University of Minnesota|Yes|Terminated|July 2002|March 2012|Actual|March 2012|Actual|N/A|Observational|Observational Model: Case Control, Time Perspective: Prospective||3|Actual|112|Samples With DNA|analysis of blood samples before and 4 weeks postvaccination|Both|18 Years|N/A|Accepts Healthy Volunteers|Probability Sample|- Normal volunteers
- Patients with Cancer (breast, melanoma, hematologic)
- Transplant patients (umbilical cord blood transplant, autologous transplant)
- Patients receiving other cancer vaccines|March 2012|March 26, 2012|November 3, 1999|Yes|Yes|Replaced by another study.|No||http://clinicaltrials.gov/show/NCT00000105||6670
NCT00000106|Information obtained from ClinicalTrials.gov on September 25, 2012|9/25/2012|NCRR-M01RR03186-9943|41.8 Degree Centigrade Whole Body Hyperthermia for the Treatment of Rheumatoid Diseases|||National Center for Research Resources (NCRR)||Active, not recruiting||||||N/A|Interventional|Allocation: Randomized, Intervention Model: Parallel Assignment, Primary Purpose: Treatment|||||||Both|18 Years|65 Years|No|||November 2000|June 23, 2005|January 18, 2000||||No||http://clinicaltrials.gov/show/NCT00000106||7998
As can be seen, these sample lines from my problem file include the header (line 1), a problematic line (line 2), and a non-problematic line (line 3). Each non-header line starts with NCT and ends with \n (this is leveraged in gsub's regular expression).
Any suggestions are much appreciated.
It seems there is no way to solve this using read.table. Sadly, it doesn't allow changing the record separator the way awk can, for example.
Your attempt 2 failed because the DOS-format newline is \r\n (0x0d 0x0a) and only the \n is matched by gsub. Say you have the following file:
NCTa|b|c
NCT1|how
are
you?|well
NCT2|are
you
sure?|yes
Then look at the output of your second command:
f2 <- gsub(pattern="\n(?!NCT)",replacement=" ",x=f, perl=TRUE)
f2
# [1] "NCTa|b|c\r\nNCT1|how\r are\r you?|well\r\nNCT2|are\r you\r sure?|yes\r "
So you have to remove \r too. Just fix it to:
f2 <- gsub(pattern="\r?\n(?!NCT)",replacement=" ",x=f, perl=TRUE)
And it will work.
Regarding performance, you can try to readChar it in smaller chunks in a loop, gsub each chunk and write it back to a file, then read.table the result. Just an idea.

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see, the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence may or may not contain quotes; in the end those should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should get the job done of matching the expression over multiple lines. Unfortunately, it doesn't :)
The expression looks for the dot at the end of the sentence and the optional quote, plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those have certain side effects - like delimiting the current line from the next with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,//g;p;}' inputfile
The ? needs to be escaped and . doesn't match newlines.
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
$ # for the last input line
{
p; # print
q # and quit
};
N; # otherwise, append the next line
/\n,/ # if it starts with a comma
{
s/"\?\n//p; # delete an optional quote and the newline and print the result
b # branch to the end to read the next line
};
P; # it doesn't start with a comma so print it
D # delete the first line of the pair (it's just been printed) and loop to the top
' inputfile
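Since the asker said any tool is fine, the same join can also be sketched in Python with a single lookahead substitution (the function name is mine):

```python
import re

def join_titles(text: str) -> str:
    """Pull a ',TitleN' line up onto the preceding sentence.

    Whenever the next line starts with a comma, remove an optional
    trailing quote, any trailing spaces, and the newline - the same
    logic as the sed one-liners above.
    """
    return re.sub(r'"? *\n(?=,)', '', text)
```

For a 400 MB file you would feed it the text in pieces (or line pairs) rather than one giant string, but the pattern itself is the point here.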

Regex query: how can I search PDFs for a phrase where words in that phrase appear on more than one line?

I am trying to set up an index page for the weekly magazine I work on. It is to show readers the names of companies mentioned in that week's issue, plus the page numbers they appear on.
I want to search all the PDF files for the week, where one PDF = one magazine page (originally made in Adobe InDesign CS3 and Adobe InCopy CS3).
I have set up a list of companies I want to search for and, using PowerGREP and delimited regular expressions, I am able to find most page numbers where a company is mentioned. However, where a company name contains two or more words, the search I am running will not pick up instances where the name appears over more than one line.
For example, when looking for "CB Richard Ellis" and "Cushman & Wakefield", I got no result when the text appeared like this:
DTZ beat BNP PRE, CB [line break here]
Richard Ellis and Cushman & [line break here]
Wakefield to secure the contract. [line end here]
Could someone advise me on how to write a regular expression that will ignore whitespace between words and ignore line endings, OR one that will look for the words including all types of whitespace (i.e. uneven spaces between words; spaces at the ends of lines or line endings; and tabs - I am guessing that this info is embedded somehow in the PDF files)?
Here is a sample of the set of terms I have asked PowerGREP to search for:
\bCB Richard Ellis\b
\bCB Richard Ellis Hotels\b
\bCentaur Services\b
\bChapman Herbert\b
\bCharities Property Fund\b
\bChetwoods Architects\b
\bChurch Commissioners\b
\bClive Emson\b
\bClothworkers’ Company\b
\bColliers CRE\b
\bCombined English Stores Group\b
\bCommercial Estates Group\b
\bConnells\b
\bCooke & Powell\b
\bCordea Savills\b
\bCrown Estate\b
\bCushman & Wakefield\b
\bCWM Retail Property Advisors\b
[Note that there is a delimited hard return between each \b at the end of each phrase and the beginning of the next phrase.]
By the way, I am a production journalist, not usually involved in finding IT-type solutions, and I am finding it difficult to get to grips with the technical language on the PowerGREP site.
Thanks for any assistance,
Alison
You have hard-coded spaces in your names. Replace them with \s+ and you should be OK.
E.g.:
CB\s+Richard\s+Ellis
What's happening is that when you have a forced line break, it doesn't have that space (" ") character anymore; instead it has \n or \r\n. Using \s+ means that you are looking for any whitespace character, including carriage returns and line feeds, in a quantity of one or more.
The regex for matching whitespace is \s, so it would be
\bCB\s+Richard\s+Ellis\b
(\s+ = match at least one whitespace character). Line breaks are \n (newline) and \r (carriage return), depending on your OS. Forming a character class [] that includes all of them, [\r\n\s], would result in:
\bCB[\r\n\s]+Richard[\r\n\s]+Ellis\b
(note that \s already matches \r and \n, so this is equivalent to the \s+ version).
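If editing each entry in the PowerGREP list by hand gets tedious, the same idea - turning each literal space into \s+ - can be sketched in Python to generate the patterns from the company list (the function name is an assumption for illustration):

```python
import re

def company_pattern(name: str) -> "re.Pattern":
    """Build a regex that matches a company name across line breaks.

    Each literal space in the name becomes \\s+ (one or more whitespace
    characters of any kind, including newlines), and the words
    themselves are escaped so characters like '&' stay literal.
    """
    parts = [re.escape(word) for word in name.split()]
    return re.compile(r'\b' + r'\s+'.join(parts) + r'\b')
```

Feed it each line of the company list and you get one break-tolerant pattern per company.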