How to split a string based on empty/blank lines? - c++

I'm writing a C++ application (Qt Widgets) that is supposed to parse an .srt subtitle file. Each part of the file is separated by an empty line, like this:
1
00:00:08,000 --> 00:00:11,000
[Line]

2
00:00:56,034 --> 00:00:57,492
[Line]
[Another line]

3
00:01:13,676 --> 00:01:15,420
[Line]
Basically, I want to read the entire file into a QString and split it by empty lines into a QString array, with each item containing one of those sections, like this:
2
00:00:56,034 --> 00:00:57,492
[Line]
[Another line]
However, I cannot figure out how to do this. I tried splitting the string by \r and \n, but that split everything into separate lines, not by empty lines.
This is the routine I had in mind to get the data from the .srt file:
1. Read all of the contents of the file to a QString (named something along the lines of content).
2. Split the QString by empty lines, and append to a QStringList (named something along the lines of sections).
3. For each item in sections, split the second line by the --> identifier, and assign indexes 0 and 1 to QString variables called startTime and endTime, respectively.
4. Take the rest of the lines (everything after line 2 is the subtitle text), and append them to a QString called subtitleText.
5. Add all the gathered information to an SrtSubtitle instance, and append it to a QList<SrtSubtitle>.
How can I achieve this?

New lines are usually represented as \n.
To split the string when there are 2 new lines without anything between them, you can use \n\n as delimiter.
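As a rough illustration of that idea, here is a minimal sketch in standard C++ rather than Qt (an assumption made so that it compiles without the Qt libraries; with a QString, QString::split with "\n\n" as the separator plays the same role). The helper name splitOn is hypothetical.

```cpp
#include <string>
#include <vector>

// Split `text` at every occurrence of `delimiter`; the delimiter is dropped.
// `splitOn` is a hypothetical helper name, not a library function.
std::vector<std::string> splitOn(const std::string& text,
                                 const std::string& delimiter) {
    std::vector<std::string> parts;
    if (delimiter.empty()) {          // nothing to split on
        parts.push_back(text);
        return parts;
    }
    std::size_t start = 0;
    std::size_t pos;
    while ((pos = text.find(delimiter, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + delimiter.size();
    }
    parts.push_back(text.substr(start));  // remainder after the last delimiter
    return parts;
}
```

Calling splitOn(content, "\n\n") on text that uses "\n" line endings yields one string per section, each still containing its own single line breaks.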

I would improve upon ziarra's answer. You certainly want the solution to be robust and to also work with Windows line endings, which are "\r\n" instead of "\n". In that case ziarra's solution would not suffice.
So my proposal is to do it in two steps:
1. Replace all occurrences of "\r\n" with "\n".
2. Split the text by "\n\n" (as ziarra suggests).
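The two steps, followed by the per-section parsing described in the question, could be sketched as below. This is standard C++ rather than Qt, and it rests on assumptions: the SrtSubtitle struct and the function names are hypothetical (the question names SrtSubtitle but does not define it), and the timing line is assumed to use " --> " with a space on each side, as in the sample.

```cpp
#include <string>
#include <vector>

// Hypothetical record type; the question mentions SrtSubtitle without defining it.
struct SrtSubtitle {
    std::string startTime;
    std::string endTime;
    std::string subtitleText;
};

// Step 1: turn every "\r\n" into "\n".
std::string normalizeNewlines(std::string text) {
    std::size_t pos = 0;
    while ((pos = text.find("\r\n", pos)) != std::string::npos) {
        text.erase(pos, 1);  // drop the '\r', keep the '\n'
        ++pos;               // continue after the kept '\n'
    }
    return text;
}

// Parse one section: index line, "start --> end" line, then the text lines.
// Returns false if the section does not have that shape.
bool parseSection(const std::string& section, SrtSubtitle& out) {
    const std::size_t firstEnd = section.find('\n');
    if (firstEnd == std::string::npos) return false;
    std::size_t secondEnd = section.find('\n', firstEnd + 1);
    if (secondEnd == std::string::npos) secondEnd = section.size();

    const std::string timing =
        section.substr(firstEnd + 1, secondEnd - firstEnd - 1);
    const std::string arrow = " --> ";
    const std::size_t arrowPos = timing.find(arrow);
    if (arrowPos == std::string::npos) return false;

    out.startTime = timing.substr(0, arrowPos);
    out.endTime = timing.substr(arrowPos + arrow.size());
    out.subtitleText = secondEnd < section.size()
                           ? section.substr(secondEnd + 1)
                           : std::string();
    return true;
}

// Step 2: split on "\n\n" and parse each section, skipping malformed ones.
std::vector<SrtSubtitle> parseSrt(const std::string& raw) {
    const std::string text = normalizeNewlines(raw);
    std::vector<SrtSubtitle> subtitles;
    std::size_t start = 0;
    while (start <= text.size()) {
        std::size_t end = text.find("\n\n", start);
        if (end == std::string::npos) end = text.size();
        SrtSubtitle sub;
        if (parseSection(text.substr(start, end - start), sub)) {
            subtitles.push_back(sub);
        }
        start = end + 2;
    }
    return subtitles;
}
```

In Qt, the same two steps would be a QString::replace of "\r\n" with "\n" followed by a QString::split on "\n\n". Note that a trailing newline at the end of the file stays in the last section's subtitleText here.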

Related

How to read CSV file with newline and comma characters inside cells in C++

I've got a CSV file containing cells with line breaks ("\n") and/or commas, which are enclosed in double quotes.
When I use the getline() function to get each row, it treats each line inside a cell as a new row of the CSV file. In addition, when I use splitIntoVec to get a vector for each row, it treats a comma inside a cell as the start of a new vector element.
I want to store the content of the CSV file in a vector of vectors, where each row is a vector of the strings in its cells.
For instance, for the following CSV file content:
"Row 1 cell 1
With break line","Row1 cell2, with comma"
"Row 2 cell 1
With break line","Row2 cell2, with comma"
Row 3 cell 1,Row3 cell 2
I get a result vector of 4 string vectors, of which the first has only one element and the second has 3 elements.
Here is my code :
// splitIntoVec is my own helper that splits a row string on colDelimiter.
vector<vector<string>> readFromCsv(string &fileName, char rowDelimiter = '\n', char colDelimiter = ',') {
    ifstream file(fileName); // declare file stream
    string value;
    vector<vector<string>> contentVec;
    vector<string> rowVec;
    string rowStr;
    while (getline(file, rowStr, rowDelimiter)) {
        rowVec = splitIntoVec(rowStr, colDelimiter);
        contentVec.push_back(rowVec);
    }
    return contentVec;
}
Is there any other function (in libraries like Boost) available to resolve these issues? Any help would be appreciated.
In PHP, I get the content of the CSV file correctly with fgetcsv(). Is there an equivalent function in C++?
@Simone already said in his comment that this is not a CSV file. Given your problem, though, you will need to get your hands dirty and do some text processing to separate the cells. You can read the complete file into a string and then break it up using loops, or whichever way you see fit. To do this, keep track of each " you encounter while traversing the string, and break only when you are not inside double quotes.
For example:
(opening quote)"Row 1 cell 1
With break line"(closing quote),(opening quote)"Row1 cell2, with comma"(closing quote)
You will have to keep track of opening and closing double quotes using an index or a counter, and break rows only where '\n' is found outside a pair of opening and closing quotes.
You can also use a regex if you are sure there are no " characters inside the cells.
Thanks @Alex. Useful link if someone else faces the same issue: http://mybyteofcode.blogspot.nl/2010/11/parse-csv-file-with-embedded-new-lines.html
You have to split completely on ", keeping track of 2 states: inside "" and outside. The comma and EOL have different meanings depending on the state.
You can use getline(file, rowStr, '"') to read in everything up to the ", but your logic to separate the input into records will be a bit more complex. If numbers are allowed without quotation marks, then it becomes even more complex.
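The two-state approach described in these answers can be sketched as a single pass over the whole file content. This is a minimal illustration, not a complete CSV reader: the function name parseCsv is hypothetical, and treating a doubled quote ("") inside a quoted cell as one literal quote is an assumption the question does not cover.

```cpp
#include <string>
#include <vector>

// Parse CSV text in which a cell may be wrapped in double quotes and may then
// contain the column and row delimiters. Assumption: "" inside a quoted cell
// stands for one literal quote character.
std::vector<std::vector<std::string>> parseCsv(const std::string& text,
                                               char rowDelimiter = '\n',
                                               char colDelimiter = ',') {
    std::vector<std::vector<std::string>> rows;
    std::vector<std::string> row;
    std::string cell;
    bool inQuotes = false;  // state: inside or outside a quoted cell
    bool pending = false;   // true while the current row has unsaved content

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (inQuotes) {
            if (c == '"') {
                if (i + 1 < text.size() && text[i + 1] == '"') {
                    cell += '"';       // escaped quote
                    ++i;
                } else {
                    inQuotes = false;  // closing quote
                }
            } else {
                cell += c;             // delimiters are plain text in here
            }
        } else if (c == '"') {
            inQuotes = true;           // opening quote
            pending = true;
        } else if (c == colDelimiter) {
            row.push_back(cell);
            cell.clear();
            pending = true;
        } else if (c == rowDelimiter) {
            row.push_back(cell);
            cell.clear();
            rows.push_back(row);
            row.clear();
            pending = false;
        } else if (c != '\r') {        // skip '\r' of Windows line endings
            cell += c;
            pending = true;
        }
    }
    if (pending) {                     // last row without a trailing delimiter
        row.push_back(cell);
        rows.push_back(row);
    }
    return rows;
}
```

For the sample content in the question this gives 3 rows of 2 cells each. The text would first be read from the file into one string, since getline() on its own cannot tell a line break inside quotes from a row break.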

Format a text file by regex match and replace

I have a text file that looks like the following:
Chanelle
Jettie
Winnie
Jen
Shella
Krysta
Tish
Monika
Lynwood
Danae
2649
2466
2890
2224
2829
2427
2816
2648
2833
2453
I need to make it look like this
Chanelle 2649
Jettie 2466
... ...
I tried a lot in the Sublime editor but couldn't figure out the regex to do that. Can somebody demonstrate whether it can be done?
I tested the following in Notepad++ but it should work universally.
Use this as the search string:
(?:(\s+[A-Za-z]+)(\r?\n))((?:\s*[A-Za-z]*\r?\n)+)\s+(\d+)
and this as the replacement:
$1 $4$2$3
Running a replace with it once will do one line at a time; if you run it multiple times, it will continue to replace lines until there are no matching lines left.
Alternatively, you can use this as the replacement if you want to have the values aligned by tabs, but it's not going to match in all cases:
$1\t\t$4$2$3
While the regex answer by SeinopSys will work, you don't need a regex to do this - instead, you can take advantage of Sublime's multiple cursors.
1. Place your cursor at the beginning of line 1, then hold down Shift+↓ to select all the names.
2. Hit Ctrl+Shift+L (Selection -> Split into Lines) to split the selection into lines.
3. Ctrl+C to copy.
4. Place your cursor on line 11 (the first number line) and press Ctrl+Shift+↓ (Windows/OS X) or Alt+Shift+↓ (Linux) to place a cursor at the beginning of each number line.
5. Hit Ctrl+V to paste the names before the numbers.
You can now delete the names at the top and you're all set. Alternatively, you could use Ctrl+X to cut the names in step 3.
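If the file is too large to handle comfortably in an editor, the same pairing can be done with a few lines of code. The sketch below is in C++ and assumes the file has already been read into a vector of lines, with exactly as many numbers in the second half as names in the first; the function name pairHalves is hypothetical.

```cpp
#include <string>
#include <vector>

// Join line i of the first half with line i of the second half.
// Assumes an even number of lines; a trailing odd line is ignored.
std::vector<std::string> pairHalves(const std::vector<std::string>& lines) {
    const std::size_t half = lines.size() / 2;
    std::vector<std::string> paired;
    paired.reserve(half);
    for (std::size_t i = 0; i < half; ++i) {
        paired.push_back(lines[i] + " " + lines[half + i]);
    }
    return paired;
}
```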

How to split multiple line text by regex

I have a multi-line text:
SUBJECT=Testing001
TEXT=TestingLine001-Test
TEXT=TestingLine002-Test
REFER=Reference001
SUBJECT=Testing002
TEXT=TestingLine003-Test
SUBJECT=Testing003
TEXT=TestingLine004-Test
REFER=Reference002
I just want to split it into text blocks (in this case three, where the SUBJECT line is the first line of each block), like this:
SUBJECT=Testing001
TEXT=TestingLine001-Test
TEXT=TestingLine002-Test
REFER=Reference001

SUBJECT=Testing002
TEXT=TestingLine003-Test

SUBJECT=Testing003
TEXT=TestingLine004-Test
REFER=Reference002
(?=\bSUBJECT\b)(?!^)
You can split on this pattern. See the demo:
https://regex101.com/r/mG8kZ9/9
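For comparison, here is a regex-free sketch of the same split in C++. It is not the lookahead approach above but a swapped-in technique: it walks the text line by line and starts a new block whenever a line begins with the marker. The function name splitBlocks is hypothetical, and unlike the pattern it only reacts to SUBJECT at the start of a line.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Group lines into blocks; a line starting with `marker` opens a new block.
// Lines before the first marker (if any) form a block of their own.
std::vector<std::string> splitBlocks(const std::string& text,
                                     const std::string& marker = "SUBJECT") {
    std::vector<std::string> blocks;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const bool startsBlock = line.compare(0, marker.size(), marker) == 0;
        if (startsBlock || blocks.empty()) {
            blocks.push_back(line);        // open a new block
        } else {
            blocks.back() += "\n" + line;  // extend the current block
        }
    }
    return blocks;
}
```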

Replace the whole string if it contains specific letters/character

I have a text file (myFile.txt) that contains multiple lines, for example:
The hotdog
The goal
The goat
What I want to do is the following:
If any word/string in the file contains the characters 'go' then, replace it with a brand new word/string ("boat"), so the output would look like this:
The hotdog
The boat
The boat
How can I accomplish this in Python 2.7?
It sounds like you want something like this:
with open('myFile.txt', 'r+') as word_bank:
    new_lines = []
    for line in word_bank:
        new_line = []
        for word in line.strip().split():
            if 'go' in word:
                new_line.append('boat')
            else:
                new_line.append(word)
        new_lines.append('%s\n' % ' '.join(new_line))
    word_bank.truncate(0)
    word_bank.seek(0)
    word_bank.writelines(new_lines)
Open the file for reading and writing, then iterate through it, splitting each line into its component words and looking for instances of 'go' to replace. Keep the results in a list, because you do not want to modify something you're iterating over; you will have a bad time. Once the list is constructed, truncate the file (erase it) and write what you came up with. Notice I switched to sticking an explicit '\n' on the end, because writelines will not do that for you.

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by the first letter of each line (P, N, L, S). Each package must have at least two dedicated lines starting with *|, which are then read as comments. Blank lines between packages with different letters are filled with *|. Where the lines between different letters do not begin with *|, such lines have to be added. Blank lines and 'random' characters between lines with the same letter are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string up by splitting on the new lines. (The newlines characters depend on your operating system.)
lines = regexp(text, '\r\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
You'll be needing to know your way around regular expressions. There's a good guide to them at regular-expressions.info.