Match beginning of file to string literal - regex

I'm working with a multi-line text block that I need to divide into 3 groups:
1: beginning of the file up to a string literal // don't keep
2: The next line //KEEP THE LINE FOLLOWING STRING LITERAL
3: Everything following that line to the end of file. // don't keep
<<
aFirstLine here
aSecondLine here
MyStringLiteral //marks the next line as the target to keep
What I want to Keep!
all kinds of crap that I don't
<<
I'm finding plenty of ways to pull from the beginning of a line but am unable to see how to include an unknown number of non-blank lines until I reach that string literal.
EDIT: I'm removing the .net-ness to focus on regex only. Perhaps this is a place for understanding backreferences?

Rather than read the entire file into memory, just read what you need:
List<string> TopLines = new List<string>();
string prevLine = string.Empty;
foreach (var line in File.ReadLines(filename))
{
    TopLines.Add(line);
    // Stop once the line after the literal has been added.
    if (prevLine == Literal)
    {
        break;
    }
    prevLine = line;
}
I suppose there's a LINQ solution, although I don't know what it is.
EDIT:
If you already have the text of the email in your application (as a string), you have to split it into lines first. You can do that with String.Split, splitting on newlines, or you can create a StringReader and read it line by line. The logic above still applies, but rather than File.ReadLines, just use foreach on the array of lines.
EDIT 2:
The following LINQ might do it:
TopLines = File.ReadLines(filename).TakeWhile(s => s != Literal).ToList();
TopLines.Add(Literal);
Or, if the strings are already in a list:
TopLines = lines.TakeWhile(s => s != Literal).ToList();
TopLines.Add(Literal);
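For comparison, the same stop-early idea can be sketched in Python (a switch from the C# above, chosen only for illustration). The function name line_after_literal is hypothetical. The sketch assumes the literal occupies a whole line and accepts any iterable of lines, so nothing past the wanted line is consumed.

```python
def line_after_literal(lines, literal):
    """Return the line that follows the first line equal to `literal`, or None."""
    it = iter(lines)
    for line in it:
        if line.rstrip("\r\n") == literal:
            # The very next item is the line to keep; stop reading after it.
            following = next(it, None)
            return None if following is None else following.rstrip("\r\n")
    return None
```

Passing an open file object reads lazily, much like File.ReadLines does.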

.*(^MyStringLiteral\r?\n)([\w|\s][^\r\n]+)(.+) seems to work, provided ^ is allowed to match at line starts and . is allowed to match newlines. The trick wasn't backreferences; it was excluding \r\n from the middle group.
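A smaller pattern that captures only the wanted line may be easier to reason about. The sketch below is in Python rather than .NET, and the helper name line_after is made up for illustration. It assumes the literal occupies a whole line on its own.

```python
import re

def line_after(text, literal):
    """Return the line following the first line that equals `literal`, or None."""
    # ^ anchors at a line start (MULTILINE); [^\r\n]* stops at the line end.
    pattern = re.compile(
        r"^" + re.escape(literal) + r"\r?\n([^\r\n]*)",
        re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None
```

Because only one group is captured, there is no need to match (and discard) the text before and after it.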

File.ReadAllLines() will give you an array you can iterate over until you find your literal, then take the next line
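string[] lines = File.ReadAllLines(filename);
// Stop one short of the end so lines[i + 1] is always valid.
for (int i = 0; i < lines.Length - 1; i++)
{
    if (lines[i] == Literal)
        return lines[i + 1];
}
return null; // literal not found, or nothing follows it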

How to remove newlines inside csv cells using regex/terminal tools?

I have a csv file where some of the cells have newline character inside. For example:
id,name
01,"this is
with newline"
02,no newline
I want to remove all the newline characters inside cells.
How to do it with regex or with other terminal tools generically without knowing number of columns in advance?
This is actually a harder problem than it looks, and in my opinion, means that regex isn't the right solution. Because you're dealing with quoting/escaped strings, spanning multiple 'lines' you end up with a complicated and difficult to read regex. (It's not impossible, it's just messy).
I would suggest instead - use a parser. Perl has one in Text::CSV and it goes a bit like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );
while ( my $row = $csv->getline( \*ARGV ) ) {
s/\n/ /g for @$row;
$csv->print( \*STDOUT, $row );
}
This will take files as piped in/specified on command line - that's what \*ARGV does - it's a special file handle that lets you do ... basically what sed does:
somecommand.sh | myscript.pl
myscript.pl filename_to_process
The ARGV filehandle does either automatically. (You could explicitly open a file or use \*STDIN if you prefer.)
I suspect that instead of removing the newline you actually want to replace it with a space. If your input file is as simple as it looks this should do it for you:
$ awk '{ORS=( (c+=gsub(/"/,"&"))%2 ? FS : RS )} 1' file
id,name
01,"this is with newline"
02,no newline
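The awk one-liner works by keeping a running count of double quotes: while the count is odd, the current line ends inside a quoted cell, so it is joined to the next one. A rough Python sketch of the same parity idea follows; the function name join_quoted_lines is hypothetical, and like the awk version it assumes quotes are only used for CSV quoting.

```python
def join_quoted_lines(lines, joiner=" "):
    """Merge physical lines that belong to one CSV record, using quote parity."""
    records = []
    parts = []      # physical lines collected for the current record
    quotes = 0      # number of double quotes seen in the current record
    for line in lines:
        parts.append(line)
        quotes += line.count('"')
        if quotes % 2 == 0:
            # Every opened quote has been closed, so the record is complete.
            records.append(joiner.join(parts))
            parts = []
            quotes = 0
    if parts:
        # Unterminated quote at end of input: emit what was collected.
        records.append(joiner.join(parts))
    return records
```

An escaped quote inside a cell is written as two quotes, so it does not disturb the parity.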
If you are using this xlsx2csv tool, it has this option:
-e, --escape Escape \r\n\t characters
Use it, and then replace \n as needed, like (if \n should be replaced by the empty string):
sed 's/\\n//g' filein.csv > fileout.csv
In one pass:
PATH/TO/xlsx2csv.py -e filein.xlsx | sed 's/\\n//g' > fileout.csv
How to do it with regex or with other terminal tools generically without knowing number of columns in advance?
I don't think a regex is the most appropriate approach and might end up being quite complicated. Instead, I think a separate program to process the files might be easier to maintain in the long-term.
Since you're OK with any terminal tools, I've chosen python, and the code's below:
#!/usr/bin/python3 -B
import csv
import sys

with open(sys.argv[1]) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        stripped = [col.replace('\n', ' ') for col in row]
        print(','.join(stripped))
I think the code above is very straightforward and easy to understand, without a need for complicated regular expressions.
The input file here has the following contents:
id,name
01,"this is
with newline"
02,no newline
To prove it works, its output is reproduced below:
➜ ~ ./test.py input.csv
id,name
01,this is with newline
02,no newline
You could call the python script from some other program and feed filenames to it. You just need to add a minor update for the python program to write out files, if that's what you really need.
I've replaced the newlines with spaces to avoid a potentially unwanted concatenation (e.g. this iswith newline), but you can replace the newline with whatever you want, including the empty string ''.
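One caveat with joining the columns by hand: a cell that itself contains a comma or a quote would no longer be quoted in the output. A possible variant, sketched here with a hypothetical function name flatten_csv, lets csv.writer decide when quoting is needed. It works on a string for easy testing and also treats \r\n inside a cell as a newline.

```python
import csv
import io

def flatten_csv(text, replacement=" "):
    """Return CSV text with newlines inside cells replaced by `replacement`."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cleaned = [
            col.replace("\r\n", replacement).replace("\n", replacement)
            for col in row
        ]
        # The writer re-adds quotes only where a cell still needs them.
        writer.writerow(cleaned)
    return out.getvalue()
```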
I have written a method to remove the embedded new line inside the cell. The method below returns a java.util.List object that contains all rows in the CSV file
List<String> getAllRowsInCSVFileAsList(File selectedCSVFile){
FileReader fileReader = null;
BufferedReader reader = null;
List<String> values = new ArrayList<String>();
try{
fileReader = new FileReader(selectedCSVFile);
reader = new BufferedReader(fileReader);
String line = reader.readLine();
String previousLine = "";
//
boolean intendLineInCell = false;
while(line != null){
if(intendLineInCell){
if(line.indexOf("\"") != -1 && line.indexOf("\"") == line.lastIndexOf("\"")){
previousLine += line;
values.add(previousLine);
previousLine = "";
intendLineInCell = false;
} else if(line.indexOf("\"") != -1 && line.indexOf("\"") != line.lastIndexOf("\"")){
if(getTotalNumberOfCharacterSequenceOccurrenceInString("\"", line) % 2 == 0){
previousLine += line;
}else{
previousLine += line;
values.add(previousLine);
previousLine = "";
intendLineInCell = false;
}
} else{
previousLine += line;
}
}else{
if(line.indexOf("\"") == -1){
values.add(line);
}else if ((line.indexOf("\"") == line.lastIndexOf("\"")) && line.indexOf("\"") != -1){
intendLineInCell = true;
previousLine = line;
}else if(line.indexOf("\"") != line.lastIndexOf("\"") && line.indexOf("\"") != -1){
values.add(line);
}
}
line = reader.readLine();
}
}catch(IOException ie){
ie.printStackTrace();
}finally{
if(fileReader != null){
try {
fileReader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
if(reader != null){
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return values;
}
int getTotalNumberOfCharacterSequenceOccurrenceInString(String characterSequence, String text){
int count = 0;
while(text.indexOf(characterSequence) != -1){
text = text.replaceFirst(characterSequence, "");
count++;
}
return count;
}
Imagine you are creating a CSV file with one row and five columns, and the 4th cell contains an embedded newline (Enter pressed inside the cell).
Your data will look like the lines below (there is actually only one row in the CSV, but opened in Notepad it looks like 2 rows).
dinesh,kumar,24,"23
tambaram india",green
The cell with the embedded newline looks like this:
"23
tambaram india"
That cell starts with a double quote (") and ends with a double quote (").
So while reading a line, an unmatched double quote (") tells us that the cell contains an embedded newline.
The code concatenates the next line onto that line and checks whether it contains the closing double quote ("). If it does, the joined row is added to the java.util.List object; otherwise the code concatenates the following line and checks again, and so on. I have explained this for one cell, but the method also works if the row has many cells with embedded newlines.
Open the .csv file in Notepad++ and press Ctrl+H. On the Replace tab, enter the newline in the "Find what" box, then enter the replacement text in "Replace with", or leave it empty to delete the newline.

Split string and get values before different delimiters

Given the code:
procedure example {
x=3;
y = z +c ;
while p {
b = a+c ;
}
}
I would like to split the code by using the delimiters {, ;, and }.
After splitting, I would like to get the information before it together with the delimiter.
So for example, I would like to get procedure example {, x=3;, y=z+c;, }. Then I would like to push it into a list<pair<int, string>> sList. Could someone explain how this can be done in c++?
I tried following this example: Parse (split) a string in C++ using string delimiter (standard C++), but I could only get one token. I want the entire line. I am new to c++, and the list, splitting, etc. is confusing.
Edit: So I have implemented it, and this is the code:
size_t openCurlyBracket = lines.find("{");
size_t closeCurlyBracket = lines.find("}");
size_t semiColon = lines.find(";");
if (semiColon != string::npos) {
cout << lines.substr(0, semiColon + 1) + "\n";
}
However, it seems that it can't separate on the semicolon, the open bracket and the close bracket individually. Does anyone know how to separate based on each of these characters?
2nd Edit:
I have done this (code below). It separates correctly on semicolons, and I have a similar loop for the open curly bracket. I was planning to add the value to the list in the commented area below. However, when I think about it, doing that would mess up the order of information in the list, because another while loop separates on the open curly bracket. Any idea how I can add the information in order?
Example:
1. procedure example {
2. x=3;
3. y = z+c
4. while p{
and so on.
while (semiColon != string::npos) {
semiColon++;
//add list here
semiColon = lines.find(';',semiColon);
}
I think that you should read about std::string::find_first_of function.
Searches the string for the first character that matches any of the characters specified in its arguments.
I'm having trouble understanding what you really want to achieve, so here is simply an example of how the find_first_of function can be used.
list<string> split(string lines)
{
    list<string> result;
    size_t position = 0;
    while((position = lines.find_first_of("{};\n")) != string::npos)
    {
        if(lines[position] != '\n')
        {
            result.push_back(lines.substr(0, position+1));
        }
        lines = lines.substr(position+1);
    }
    return result;
}
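To show the intended result without the C++ plumbing, here is a short Python sketch of the same split. The function name split_statements is hypothetical, and the numbered pairs are only a guess at what the list<pair<int, string>> in the question is meant to hold. Each piece runs up to and including the next {, ; or }, so the original order is preserved in a single pass.

```python
import re

def split_statements(source):
    """Split on '{', ';' and '}', keeping each delimiter with the text before it."""
    # Each match is a run of non-delimiters followed by exactly one delimiter.
    pieces = re.findall(r"[^{};]*[{};]", source)
    # Collapse the newlines and repeated spaces left over from the layout.
    statements = [" ".join(piece.split()) for piece in pieces]
    return list(enumerate(statements, start=1))
```

Scanning for all three delimiters at once is what keeps the entries in order; separate loops per delimiter would have to be merged afterwards.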

Notepad++ or UltraEdit: regex remove special duplicates

I need to remove duplicates if
key = anything
but NOT
key=anything
the key can be anything too
e.g.
edit_home=home must be in place
while
edit_home = home or even other string must be removed IF edit_home is a duplicate
for all the lines of the document
thank you
p.s. clearer example:
one=you are
two=we are
three_why=8908908
one = good
two = fine
three_4 = best
three_why = win
from that list i only need to keep:
one=you are
two=we are
three_why=8908908
three_4 = best // because three_4 doesn't have a duplicate
I found a method to do it, but I would need a better search list support by regex or a plugin or a direct regex (which I don't know).
That is: I have two files to compare.
One has the full keys, the other has incomplete.
I merge in a new file all the keys from the first file with those ones of the second, in groups (because the keys are in groups e.g. many keys titled one, many titled two and so on...). Then I regex replace all the keys in the new file by
find (.*)(\s\=\s) replace with \1\=
So they all become key=anything
Then I replace everything after = with empty to isolate the keys.
Then remove the duplicates.
At this point I have trouble running a search like
^.*(^keyone\b|^keytwo\b|^keythree\b).*$
to find all the keys I need in the document, so that I can select them all and replace them with the correct keys.
Why? Because this example has only 3 keys, but in reality there are many, and the find field breaks at a certain point.
How to do it right?
Update: I found Toolbucket plugin which allows to search for many strings, but another issue is that in addition to duplicate, I also have to remove the original.
That is, if I find 2 times the same key "one" I have to remove all the lines containing one.
Ctrl + F
Find tab
Find what: ^.*\S=\S.*$
Find All in Current Document
Copy result from result window to a new window (the list of Line 1: Line 2: Line 3: ...)
Ctrl + F
Replace tab
(the following will remove the leading "Line number:" from every line)
Find what: ^.*?\d:\s
Replace with: Empty
OK, after all that I wrote, one solution (once I have the merged keys) could be
(?m)^(.*)$(?=\r?\n^(?!\1).*(?s).*?\1)
with this i can mark/highlight all the duplicated keys :-) so then i can manage those only, removing them from the first list and adding what remains to the second file...
If someone has a solution with a direct regex will be really appreciated
Here is a commented UltraEdit script for this task.
// Note: This script does not work for large files as it loads the
// entire file content into very limited scripting memory for fast
// processing even with multiple GB of RAM installed.
if (UltraEdit.document.length > 0) // Is any file opened?
{
// Define environment for this script and select entire file content.
UltraEdit.insertMode();
UltraEdit.columnModeOff();
UltraEdit.activeDocument.selectAll();
// Determine line termination used currently in active file.
var sLineTerm = "\r\n";
if (typeof(UltraEdit.activeDocument.lineTerminator) == "number")
{
// The two lines below require UE v16.00 or UES v10.00 or later.
if (UltraEdit.activeDocument.lineTerminator == 1) sLineTerm = "\n";
else if (UltraEdit.activeDocument.lineTerminator == 2) sLineTerm = "\r";
}
else // This version of UE/UES does not offer line terminator property.
{
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\n"; // Not DOS, perhaps UNIX.
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\r"; // Also not UNIX, perhaps MAC.
if (UltraEdit.activeDocument.selection.indexOf(sLineTerm) < 0)
{
sLineTerm = "\r\n"; // No line terminator, use DOS.
}
}
}
}
// Get all lines of active file into an array of strings
// with each string being one line from active file.
var asLines = UltraEdit.activeDocument.selection.split(sLineTerm);
var nTotalLines = asLines.length;
// Process each line in the array.
for(var nCurrentLine = 0; nCurrentLine < asLines.length; nCurrentLine++)
{
// Skip all lines not containing or starting with an equal sign.
if (asLines[nCurrentLine].indexOf('=') < 1) continue;
// Get string left to equal sign with tabs/spaces trimmed.
var sKey = asLines[nCurrentLine].replace(/^[\t ]*([^\t =]+).*$/,"$1");
// Skip lines beginning with just tabs/spaces left to equal sign.
if (sKey.length == asLines[nCurrentLine].length) continue;
// Build the regular expression for the search in all other lines.
var rRegSearch = new RegExp("^[\\t ]*"+sKey+"[\\t ]*=","g");
// Check all remaining lines for a line also starting with
// this key string (case-sensitive) left of an equal sign.
var nLineCompare = nCurrentLine + 1;
while(nLineCompare < asLines.length)
{
// Does this line also have this key left of the equal
// sign with or without surrounding spaces/tabs?
if (asLines[nLineCompare].search(rRegSearch) < 0)
{
nLineCompare++; // No, continue on next line.
}
else // Yes, remove this line from array.
{
asLines.splice(nLineCompare,1);
}
}
}
// Was any line removed from the array?
if (nTotalLines == asLines.length)
{
UltraEdit.activeDocument.top(); // Cancel the selection.
UltraEdit.messageBox("Nothing found to remove!");
}
else
{
// If this version of UE/UES supports direct write to clipboard, use
// user clipboard 9 to paste the lines into the file, overwriting
// everything, as this is much faster than using the write command in
// older versions of UE/UES.
if (typeof(UltraEdit.clipboardContent) == "string")
{
var nActiveClipboard = UltraEdit.clipboardIdx;
UltraEdit.selectClipboard(9);
UltraEdit.clipboardContent = asLines.join(sLineTerm);
UltraEdit.activeDocument.paste();
UltraEdit.clearClipboard();
UltraEdit.selectClipboard(nActiveClipboard);
}
else UltraEdit.activeDocument.write(asLines.join(sLineTerm));
var nRemoved = nTotalLines - asLines.length;
UltraEdit.activeDocument.top();
UltraEdit.messageBox("Removed " + nRemoved + " line" + ((nRemoved != 1) ? "s" : "") + " on updated file.");
}
}
Copy this code and paste it into a new ASCII file using DOS line terminators in UltraEdit.
Next use command File - Save As to save the script file for example with name RemoveDuplicateKeys.js into %AppData%\IDMComp\UltraEdit\MyScripts or wherever you want to have saved your UltraEdit scripts.
Open Scripting - Scripts and add the just saved UltraEdit script to the list of scripts. You can enter a description for this script, too.
Open the file with the list, or make this file active if it is already opened in UltraEdit.
Run the script by clicking on it in menu Scripting, or by opening Views - Views/Lists - Script List and double clicking on the script.
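The rule the script applies (keep the first line for each key, drop every later line with the same key, whatever the spacing around =) can also be sketched outside the editor. The Python below is only an illustration with made-up names. Like the script, it assumes the key=value lines to keep appear before their spaced duplicates.

```python
import re

# Key = text left of the first '=', ignoring surrounding tabs/spaces.
KEY_RE = re.compile(r"^[\t ]*([^\t =]+)[\t ]*=")

def remove_duplicate_keys(lines):
    """Keep the first line for each key and drop later lines with that key."""
    seen = set()
    kept = []
    for line in lines:
        match = KEY_RE.match(line)
        if match is None:
            kept.append(line)   # not a key line, leave untouched
            continue
        key = match.group(1)
        if key in seen:
            continue            # later duplicate of an earlier key
        seen.add(key)
        kept.append(line)
    return kept
```

Using a set of seen keys avoids building one regular expression per key, so keys containing regex metacharacters need no escaping.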

seekg() not working as expected

I have a small program, that is meant to copy a small phrase from a file, but it appears that I am either misinformed as to how seekg() works, or there is a problem in my code preventing the function from working as expected.
The text file contains:
//Intro
previouslyNoted=false
The code is meant to copy the word "false" into a string
std::fstream stats("text.txt", std::ios::out | std::ios::in);
//String that will hold the contents of the file
std::string statsStr = "";
//Integer to hold the index of the phrase we want to extract
int index = 0;
//COPY CONTENTS OF FILE TO STRING
while (!stats.eof())
{
static std::string tempString;
stats >> tempString;
statsStr += tempString + " ";
}
//FIND AND COPY PHRASE
index = statsStr.find("previouslyNoted="); //index is equal to 8
//Place the get pointer where "false" is expected to be
stats.seekg(index + strlen("previouslyNoted=")); //get pointer is placed at 24th index
//Copy phrase
stats >> previouslyNotedStr;
//Output phrase
std::cout << previouslyNotedStr << std::endl;
But for whatever reason, the program outputs:
=false
What I expected to happen:
I believe that I placed the get pointer at the 24th index of the file, which is where the phrase "false" begins. Then the program would've inputted from that index onward until a space character would have been met, or the end of the file would have been met.
What actually happened:
For whatever reason, the get pointer started an index before expected. And I'm not sure as to why. An explanation as to what is going wrong/what I'm doing wrong would be much appreciated.
Also, I do understand that I could simply make previouslyNotedStr a substring of statsStr, starting from where I wish, and I've already tried that with success. I'm really just experimenting here.
The VisualC++ tag means you are on Windows. On Windows the end of a line takes two characters (\r\n). When you read the file one word at a time, this end-of-line sequence is treated as a delimiter, and you replace it with a single space character.
Therefore, after you read the file, your statsStr does not match the contents of the file. Everywhere there is a newline in the file, you have replaced two characters with one. Hence, when you use seekg to position yourself in the file based on offsets computed from statsStr, you end up in the wrong place.
Even if you get the newline handling correct, you will still encounter problems if the file contains two or more consecutive whitespace characters, because these will be collapsed into a single space character by your read loop.
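The off-by-one can be reproduced with plain byte arithmetic. The Python sketch below is an illustration only: it assumes the file is stored with \r\n line endings and that seek offsets count bytes on disk, and the helper names are made up.

```python
FILE_BYTES = b"//Intro\r\npreviouslyNoted=false"   # assumed on-disk contents
KEY = "previouslyNoted="

def word_joined(data):
    """Mimic the question's loop: read words, append each plus one space."""
    return "".join(word + " " for word in data.decode("ascii").split())

def naive_offset(data, key=KEY):
    """Offset computed from the rebuilt string, as the question does."""
    return word_joined(data).find(key) + len(key)

def true_offset(data, key=KEY):
    """Offset computed from the bytes actually stored in the file."""
    return data.find(key.encode("ascii")) + len(key)
```

Here naive_offset(FILE_BYTES) is 24 while true_offset(FILE_BYTES) is 25, and FILE_BYTES[24:] starts with the = sign, which matches the output seen in the question.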
You are reading the file word by word. There are better methods:
while (getline(stats, statsStr))
{
    // An entire line is read into statsStr.
    std::string::size_type posn = statsStr.find("previouslyNoted=");
    // ...
}
By reading entire text lines into a string, there is no need to reposition the file.
Also, there is a white-space issue when reading by word. This will affect where you think the text is in the file. For example, white space is skipped, and there is no telling how many spaces, newlines or tabs were skipped.
By the way, don't even think about replacing the text in the same file. Replacement of text only works if the replacement text has the same length as the original text in the file. Write to a new file instead.
Edit 1:
A better method is to declare your key strings as arrays. This helps with computing positions within a string:
static const char key_text[] = "previouslyNoted=";
while (getline(stats, statsStr))
{
    std::string::size_type key_position = statsStr.find(key_text);
    if (key_position == std::string::npos) continue; // key not on this line
    // sizeof counts the nul terminator, hence the - 1.
    std::string::size_type value_position = key_position + sizeof(key_text) - 1;
    // value_position points to the character after the '='.
    // ...
}
You may want to save programming time by making your data file conform to an existing format, such as INI or XML, and using appropriate libraries to parse it.

C++: Using substr to parse a text file

I just need a little help with file parsing. We have to parse a file that has 6 string entries per row in the format:
"string1", "string2", "string3", "string4", "string5", "string6"
My instructor recently gave us a little piece of code as a "hint," and I'm supposed to use it. Unfortunately, I can't figure out how to get it to work. Here's my file parsing function.
void parseData(ifstream &myFile, Book bookPtr[])
{
string bookInfo;
int start, end;
string bookData[6];
getline(myFile, bookInfo);
start = -2;
myFile.open("Book List.txt");
for (int j = 0; j < 6; j++)
{
start += 3;
end = bookInfo.find('"', start);
bookData[j] = bookInfo.substr(start, end-start);
start = end;
}
}
So I'm trying to read the 6 strings into an array of strings. Can someone please help walk me through the process?
start = -2;
for (int j = 0; j < 6; j++)
{
start += 3;
end = bookInfo.find('"', start);
bookData[j] = bookInfo.substr(start, end-start);
start = end;
}
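Note that ", " is four characters: the closing quote, a comma, a space and the next opening quote. Without the space after the comma it would be three.
At entry to the loop, start points to the closing quote of the previous string. (On first entry it is faked as -2, as if an imaginary "-1th" element had just ended.)
So we advance from that closing quote to the first character of the next string:
start += 3;
Then we use std::string::find to find the closing quote:
end = bookInfo.find('"', start);
The second argument is the position at which the search begins, and that position itself is included. start must therefore already be past the opening quote; otherwise find returns the opening quote and the extracted string is empty. Adding 3 gets past it only when the entries are separated by "," with no space. For the ", " separator shown in the question, use start = -3 and start += 4 instead.
We then have the position of the first character (start) and of the closing quote (end), so we use substr to extract the string: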
bookData[j] = bookInfo.substr(start, end-start);
And we then update start for the next loop to be the last closing quote:
start = end
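The index arithmetic is easy to check in a few lines of Python, whose str.find also starts searching at the given position inclusive. This is only a sketch with hypothetical names; sep_len is the distance from a closing quote to the first character of the next string (4 for ", " and 3 for ",").

```python
def parse_quoted_fields(line, count=6, sep_len=4):
    """Extract `count` quoted strings using the same stepping as the loop above."""
    fields = []
    start = 1 - sep_len          # fake closing quote before the first field
    for _ in range(count):
        start += sep_len         # first character after the opening quote
        end = line.find('"', start)
        if end == -1:
            raise ValueError("closing quote not found")
        fields.append(line[start:end])
        start = end              # remember the closing quote
    return fields
```

With sep_len=3 the initial value becomes -2 and the step 3, which are the constants from the hint, and that only lines up when there is no space after the comma.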
Please, for your own sake, create a minimal example. It should start with a string like the line you gave as an example and end with the different parts in an array. Leave the loading from a file out for now; getline() seems to work for you, doesn't it? Also, do not declare every variable you might want to use at the beginning of a function. This is not ancient C, where you had to do that or introduce additional {} blocks. Another odd thing is the Book bookPtr[] parameter. It is really just a Book* bookPtr, i.e. you are not passing an array to the function but only a pointer. Don't fall for this misleading syntax, it's a lie! In any case, you don't seem to use that pointer at all.
Concerning the splitting of a line into strings, one approach is to locate pairs of double quotes. Everything in between is one of the strings; everything outside is irrelevant. The string class has a find() function which optionally takes a starting position. The starting position is always one past the previously found position.
Your code above seems to assume that exactly one double quote, a comma, a space and another double quote separate two strings. This isn't 100% clear, so I would also be prepared to handle multiple spaces or no space at all. Also, is the comma guaranteed? Are the double quotes guaranteed? Anyway, keep it simple. Unless you get a better spec for the input, just assume that only the parts between the quotes matter.
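The pair-of-quotes approach might look like the following, sketched in Python for brevity; the function name quoted_parts is made up. It makes no assumption about what sits between two strings, so extra spaces or a missing comma do not matter.

```python
def quoted_parts(line):
    """Return everything found between successive pairs of double quotes."""
    parts = []
    pos = 0
    while True:
        open_quote = line.find('"', pos)
        if open_quote == -1:
            break
        close_quote = line.find('"', open_quote + 1)
        if close_quote == -1:
            break                 # unmatched opening quote: stop
        parts.append(line[open_quote + 1:close_quote])
        pos = close_quote + 1     # continue one past the closing quote
    return parts
```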
Then, what exactly works and what doesn't? You need to ask more specific questions and give more detailed information. The code above doesn't look broken per se, although there are a few things a bit off. For example, you don't typically pass ifstreams to a function, but use the istream baseclass. In your case, you read a line from that file and then open another file using the same fstream object, which doesn't make sense to me, since you don't use it after that. If you only needed that stream locally, you would create and open it there (handling errors of course!) and pass in the filename as parameter only.