Find Lines with N occurrences of a char - regex

I have a txt file that I’m trying to import as flat file into SQL2008 that looks like this:
“123456”,”some text”
“543210”,”some more text”
“111223”,”other text”
etc…
The file has more than 300.000 rows and the text is large (usually 200-500 chars), so scanning the file by hand is very time consuming and prone to error. Other similar (and even more complex files) were successfully imported.
The problem with this one, is that “some lines” contain quotes in the text… (this came from an export from an old SuperBase DB that didn’t let you specify a text quantifier, there’s nothing I can do with the file other than clear it and try to import it).
So the “offending” lines look like this:
“123456”,”this text “contains” a quote”
“543210”,”And the “above” text is bad”
etc…
You can see the problem here.
Now, 300.000 is not too much if I could perform a search using a text editor that can use regex, I’d manually remove the quotes from each line. The problem is not the number of offending lines, but the impossibility to find them with a simple search. I’m sure there are less than 500, but spread those in a 300.000 lines txt file and you know what I mean.
Based upon that, what would be the best regex I could use to identify these lines?
My first thought is: Tell me which lines contain more than 4 quotes (“).
But I couldn’t come up with anything (I’m not good at Regex beyond the basics).

this pattern ^("[^"]+){4,} will match "lines containing more than 4 quotes"
you can experiment with replacing 4 with 5 or more, depending on your data.

I think that you can be more direct with a Regex than you're planning to be. Depending on your dialect of Regex, something like this should do it:
^"\d+",".*".*"

You could also use a regex to remove the outside quotes and use a better delimeter instead. For example, search for ^"([0-9]+)","(.*)"$ and replace it with \1+++++DELIM+++++\2.
Of course, this doesn't directly answer your question, but it might solve the problem.

Related

Regex-Match while ignoring a char from Searchword

I am using an Engineering Program which lets me Code formulas in order to filter out specific lines in a database. I am trying to look for a certain line in the database which contains e.g. "concrete" as a property.
In the Code I can use regular expressions.
The regex I was using so far looked like this:
".*(concrete).*";
so if the line in the database contains concrete, I will get the wanted result.
Now the Problem is: i would like to switch the word concrete with a variable, so that it Looks like this:
".*(#VARIABLE1).*";
(the Syntax with the # works in the program btw.)
the Problem is: if i set the variable as concrete, the program automatically switches it for 'concrete' . Obviously, the word concrete cant be found anymore, since the searchterm now contains the two ' Symbols in the beginning and i the end.
Is there a way to ignore those two characters using the Right regex?
what I want it to do is the following:
If a line in the database contains "25cm concrete in Grey"
I should get a match from the regex.
with the searchterm ".*(concrete).*"; it works, with the variable ".*(#VARIABLE1).*"; it doesnt.
EDIT:
the whole "Formula" in the program Looks like that:
if(Match(QTO(Typ:="Attribut{FloorsLayer_02_MaterialName}");".*(#V_QUALITY).*" ;"regex") ;QTO(Typ:="Attribut{Fläche}");0)
I want the if-condition to be true, when the match inside is true.
the whole QTO function is just the programs Syntax to use a certain Attribute into the match-function, the middle part is my Problem. I really don't know the programming language or anything,I'm new to this. hope it helps!
Thats more of a hack than a real solution and i'm not sure if it even works:
if you use the regex
.*(#VARIABLE1)?).*
and the string ?concrete(
this will result in a regex looking like this:
.*('?concrete(')?).*
which makes the additional characters optional.
This uses the following assumtption:
the string (#VARIABLE1) gets replaced by the ('<content of VARIABLE1>')

use RegEx to ignore lines that contain a specific phrase but do not end with an extension and period

Struggling to write a RegEx to use with my ANT FilterChain.
I am cleaning up a log file that contains entries such as (BAD ENTRY):
017Z WARN Broken: D:/folder/subfolder/subsubfolder/../Images/Filename (stuff about file).
The log also contains entries like this (GOOD ENTRIES):
017Z WARN Broken: D:/folder/subfolder/subsubfolder/../Images/Filename (stuff about file).ai.
I am able to successfully delete the "bad" entry and keep the "good" ones if I do something like this:
"^(?=.*WARN Broken:)(?!.*[.]ai.).*"
That, however, would force me to write a regex for all possible extensions. I actually don't know how to properly do the OR regex, in this case, to even do that. I would like to be able to do jpg, gif, eps, pdf in addition to ai.
Ideally, I would like to be able to keep everything that ends with ".extension."
That last period really makes things complicated for me.
Here is how you would add the desired OR clause to your regex:
^(?=.*WARN Broken:)(?!.*[.](ai|jpg|gif|eps|pdf).).*

Notepad++ - Selecting or Highlighting multiple sections of repeated text IN 1 LINE

I have a text file in Notepad++ that contains about 66,000 words all in 1 line, and it is a set of 200 "lines" of output that are all unique and placed in 1 line in the basic JSON form {output:[{output1},{output2},...}]}.
There is a set of characters matching the RegEx expression "id":.........,"kind":"track" that occurs about 285 times in total, and I am trying to either single them out, or copy all of them at once.
Basically, without some super complicated RegEx terms, I am stuck because I can't figure out how to highlight all of them at once, and also the Remove Unbookmarked Lines feature does not apply because this is all in one line. I have only managed to be able to Mark every single occurrence.
So does this require a large number of steps to get the file into multiple lines and work from there, or is there something else I am missing?
Edit: I have come up with a set of Macro schemes that make the process of doing this manually work much faster. It's another alternative but still takes a few steps and quite some time.
Edit 2: I intended there to be an answer for actually just highlighting the different sections all at once, but I guess that it not possible. The answer here turns out to be more useful in my case, allowing me to have a list of IDs without everything else.
You seem to already have a regex which matches single instances of your pattern, so assuming it works and that we must use Notepad++ for this:
Replace .*?("id":.........,"kind":"track").*?(?="id".........,"kind":"track"|$) with \1.
If this textfile is valid JSON, this opens you up to other, non-notepad++ options, like using Python with the json module.
Edited to remove unnecessary steps

Maven replacer plugin - repeat while matches exist

I am using the maven replacer plugin and I've run into a situation where I have a regular expression that matches across lines which I need to run on the input file until all matches have been replaced. The configuration for this expression looks like this:
<regexFlags>
<regexFlag>DOTALL</regexFlag>
</regexFlags>
<replacements>
<replacement>
<token>\#([^\n\r=\#]+)\#=([^\n\r]*)(.*)(\#default\.\1\#=[^\n\r]*)(.*)</token>
<value>#$1#=$2$3$5</value>
<replacement>
<replacements>
The input could look like this:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#default.a.b.c#=QQQ
#asdfasd.fasdfs.asdfa#=23423
#default.h.i.j#=234
#default.RR.TT#=393993
and I want the output to look like this:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#asdfasd.fasdfs.asdfa#=23423
#default.RR.TT#=393993
The intention is to re-write the file, but without the tokens with a #default prefix, where another token without the prefix has already been defined.
#default.a.b.c#=QQQ and #default.h.i.j#=234 have been removed from the output because other tokens already contains a.b.c and h.i.j.
The current problem I have is that the replacer plugin only replaces the first match, so my output looks like this:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#asdfasd.fasdfs.asdfa#=23423
#default.h.i.j#=234
#default.RR.TT#=393993
Here, #default.a.b.c=QQQ is gone, which is correct, but #default.h.i.j#=234 is still present.
If I were writing this in code, I think I could probably just loop while attempting to match on the entire output, and break when there are no matches. Is there a way to do this with the replacer plugin?
Edit: I may have over simplified my example. A more realistic one is:
#d.e.f#=y
#a.b.c#=x
#h.i.j#=aaaa
#default.a.b.c#=QQQ
#asdfasd.fasdfs.asdfa#=23423
#default.h.i.j#=234
#default.RR.TT#=393993
#x.y.z#=0
#default.q.r.s#=1
#l.m.n#=8.3
#q.r.s#=78
#blah.blah.blah#=blah
This shows that it's possible for a default.x.x.x=y to precede a x.x.x=y token (as #default.q.r.s#=1 preceedes #q.r.s#=78`), my prior example wasn't clear about this. I do actually have an expression to capture this, it looks a bit like this:
\#default\.([^\n\r=#|]+)#=([^\n\r|]*)(.*)#\1#=([^\n\r|]*)(.*)
I know line separators are missing from this even though they were in the other one - I was experimenting with removing all line separators and treating it as a single line but that hasn't helped. I can resolve this problem simply by running each replacement multiple times by copying and pasting the configurations a few times, but that is not a good solution and will fail eventually.
I don't believe you could solve this problem as is, a work-around is to reverse the order of the file top to bottom, perform lookahead regex and then reverse the result order
pattern = #default\.(.*?)#[^\r\n]+(?=[\s\S]*#\1#) Demo
another way (depending on the capabilities of "Maven") is to run this pattern
#(.*)(#[\s\S]*)#default\.\1.*
and replace with #$1$2 Demo in a loop until there are no matches
then run this pattern
#default\.(.*)#.*(?=[\s\S]*\1)
and replace with nothing Demo in a loop until there are no matches
It doesn't look like the replacer plugin can actually do what I want. I got around this by using regular expressions to build multiple filter files, and then applying them to the resource files.
My original goal had been to use regular expressions to build a single, clean, and tidy filter file. In the end, I discovered that I was able to get away with just using multiple filters (not as clean or tidy) and apply them in the correct order.

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z_])+ .*?,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make a user has entered a correct email address (syntactically speaking) into a web form this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
The first thing I do with any editor is try to figure out it's Regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using a anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live with out support for regex but could not live without anonymous keyboard macros.