I'm using Atom to format some text data for analysis (I know there are probably better ways of doing it than this so I'm all ears) but it doesn't seem to be recognizing my regular expression.
The text is POS tagged tokens with sentences being delineated with newlines, formatted as such:
good\tJJ\n
workout\tNN\n
.\t.\n
''\t''\n
\n
Perhaps\tRB\n
the\tDT\n
I was able to replace all of the tabs (\t) with a forward slash (/) no problem, but I'm now trying to replace all newlines that DON'T delineate sentences with just a space. I tried \S\n and it "wasn't found". I also tried to highlight all delineating newlines with ^\n$ but there were only two matches and only at the end of the document.
Am I doing this wrong? My only usage of regex is with Python, so maybe there's just a different way to do it in Atom.
EDIT: I'm just giving up and gonna use Python to process it. Nothing suggested worked. The search function seemed to just be bugging out in general (e.g. one search would not work, but if I closed the search function and reopened it, the same search would work), probably because it's a long file (700,000+ lines) despite not being a large file data-wise (6,235 KB). If anyone can recommend a large-file text editor, though, it would be appreciated.
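Since the asker ended up going the Python route, here is a minimal sketch of what that could look like. It assumes the file still has tab-separated token/tag pairs and that the goal is one sentence per output line; the function name sentences_to_lines is made up for illustration.

```python
def sentences_to_lines(text):
    """Join the tokens of each blank-line-delimited sentence with spaces."""
    sentences = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # A token line such as "good\tJJ" becomes "good/JJ".
            current.append(line.strip().replace("\t", "/"))
        elif current:
            # A blank line closes the sentence collected so far.
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return "\n".join(sentences)


sample = "good\tJJ\nworkout\tNN\n.\t.\n''\t''\n\nPerhaps\tRB\nthe\tDT\n"
print(sentences_to_lines(sample))
```

Working line by line avoids the question of how the editor treats \n in its search box altogether, and 700,000 lines is small for this kind of loop.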
Hey stackoverflow community. I need help with a huge information file. Is it possible with a regular expression to find in this tag:
<category_name><![CDATA[Prekiniai ženklai&gt;Adler|Kita buitinė technika&gt;Buičiai naudingi prietaisai|Kita buitinė technika&gt;Lygintuvai]]></category_name>
Somehow replace all the other data and leave only 'Adler' or 'Lygintuvai'. I'm using Altova to edit XML files, so I can't find any other way than find-replace. And I'm new to the regex stuff, so I thought maybe you could help me.
#\<category_name\>.+?gt\;([\w]+?)\|.+?gt;([\w]+?)\]\]\>\<\/category_name\>#i
\1 - Adler
\2 - Lygintuvai
PHP
regex101.com
Fields may contain alphanumeric characters without spaces.
If you want to modify the scope of acceptable characters, change [\w] to something else:
[a-z] - only letters
[0-9] - only digits
etc.
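For anyone who wants to try the pattern outside PHP, here is a rough Python port. It is only a sketch: the sample string assumes the CDATA text contains the literal characters "&gt;" (which is what the gt\; in the pattern looks for), and the names CATEGORY_RE and SAMPLE are made up.

```python
import re

# Same pattern as above without the PHP "#" delimiters; re.IGNORECASE
# stands in for the trailing "i" flag.
CATEGORY_RE = re.compile(
    r"<category_name>.+?gt;(\w+?)\|.+?gt;(\w+?)\]\]></category_name>",
    re.IGNORECASE,
)

# Assumed input: the tag from the question with literal "&gt;" separators.
SAMPLE = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)

match = CATEGORY_RE.search(SAMPLE)
first, last = match.groups()
print(first, last)  # Adler Lygintuvai
```

Group 1 is the first leaf category and group 2 the last one, matching \1 and \2 above. A leaf name that contains a space would not be captured by \w.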
It's possible, but use of regular expressions to process XML will never be 100% correct (you can prove that using computer science theory), and it may also be very inefficient. For example, the solution given by Luk is incorrect because it doesn't allow whitespace in places where XML allows it. Much better to use XQuery or XSLT, both of which are designed for the job (and both work in Altova). You can then use XPath expressions to locate the element or attribute nodes you are interested in, and you can still use regular expressions (e.g. in the XPath replace() function) to process the content of text or attribute nodes.
Incidentally, your input is rather strange because it uses escape sequences like &gt; within a CDATA section; but XML escape sequences are not recognized in a CDATA section.
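To illustrate the "let a parser find the node, then work on its text" idea, here is a small sketch. Note it uses Python's xml.etree.ElementTree as a stand-in, not XQuery or XSLT, and the helper name leaf_categories and the sample string are assumptions. Because escape sequences inside CDATA stay literal, the text is unescaped by hand before splitting.

```python
import html
import xml.etree.ElementTree as ET

# Assumed input: the tag from the question with literal "&gt;" inside CDATA.
SAMPLE = (
    "<category_name><![CDATA[Prekiniai ženklai&gt;Adler|"
    "Kita buitinė technika&gt;Buičiai naudingi prietaisai|"
    "Kita buitinė technika&gt;Lygintuvai]]></category_name>"
)


def leaf_categories(xml_text):
    """Return the last segment of every '|'-separated category path."""
    element = ET.fromstring(xml_text)
    # The parser hands back the CDATA text verbatim, so "&gt;" is still
    # literal here; html.unescape turns it into ">".
    raw = html.unescape(element.text or "")
    return [path.split(">")[-1].strip() for path in raw.split("|") if path]


leaves = leaf_categories(SAMPLE)
print(leaves[0], leaves[-1])
```

Unlike the regex, this does not care about whitespace inside the tags or about leaf names that contain spaces.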
I'm trying to clean up a long CSV file using SublimeText instead of Excel.
I've created a RegExp which uses a greedy expression like
^.somehting.com.au.$
The search pattern works fine, but when it comes to replacing everything with a blank string, Sublime returns an error in the bottom bar that I can barely read, as it immediately disappears without anything happening.
I do suspect it's an error, and I think it said something about the rule being "too generic".
Any help?
Ok, Sublime Text uses a particular syntax for regular expressions that slightly differs from the one used in coding.
In my specific circumstance, to find a domain in a string using a greedy expression including the carriage return (useful to clean a huge amount of rubbish in an SEO backlinks spreadsheet) I ended up using the following.
.*://leaderlive.co.uk/.*\n
Dots don't require escaping (an unescaped dot just matches any character, including a literal dot) ... no need to add the ^ and $ anchors ... it simply works and I didn't spend time investigating the reasons.
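If the editor keeps misbehaving, the same cleanup can be sketched in Python. This is not what Sublime does internally, just an equivalent of the pattern above with the dots escaped so they only match literal dots; the names DOMAIN_LINE and drop_domain_lines are made up.

```python
import re

# Whole lines that contain the domain, including the trailing newline.
# The "\n?" lets the pattern also remove a last line without a newline.
DOMAIN_LINE = re.compile(r"^.*://leaderlive\.co\.uk/.*\n?", re.MULTILINE)


def drop_domain_lines(text):
    """Delete every line that mentions the domain."""
    return DOMAIN_LINE.sub("", text)


rows = (
    "a,http://leaderlive.co.uk/x,1\n"
    "b,http://example.com/,2\n"
    "c,https://leaderlive.co.uk/y\n"
)
print(drop_domain_lines(rows), end="")
```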
I'm working with an enterprise CMS and in order to properly create our weekly-updated dropdown menu without republishing our entire site, I have an XML document being created which has a various number of useful XML elements. However, when pulling in a link with the CMS, the generated XML also outputs the link's contents (the entire HTML for the page). Needless to say, with roughly 50 items, the XML file is too big for use on the web (as it stands I think it's over 600KB). The element is <page-content>filler here</page-content>.
What I'm trying to do is use TextWrangler to find and replace all <page-content> tags as well as their containing content.
I've tried a few different regexes, but I can't seem to match the closing tag, so it will just trail on.
Here's what I've tried:
(<page-content>)(.*?)
The above will match up until the next starting <page-content> tag, which is not what I want.
(<page-content>)(.*?)(<\/page-content>)
(<page-content>)(.*?)(<\/page\-content>)
The above finds no matches, even though the below will find the 7 matches it should.
(<content>)(.*?)(<\/content>)
I don't know if there's a special way to deal with hyphens (I'm inexperienced in regular expressions), but if anyone could help me out, it would be greatly appreciated.
Thanks!
EDIT: Before you tell me that regex isn't meant to parse HTML, I know that, but there seems to be no other way for me to easily find and replace this. There are too many occurrences to manually delete them and save the file again every week.
It seems the problem is that your . is not matching newlines that exist between your open and close tags.
An easy solution for this would be to add the s flag in order for your . to match over newlines. TextWrangler appears to support inline modifiers (?s). You could do it like this:
(<page-content>)(?s)(.*?)(<\/page-content>)
More information on modifiers here.
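The effect of the s flag is easy to check outside TextWrangler. Here is a small Python sketch (re.DOTALL is Python's name for that flag; the sample XML and the function name strip_page_content are invented for the demo):

```python
import re

# Without DOTALL, "." stops at newlines, so multi-line content never matches.
NO_FLAG = re.compile(r"<page-content>.*?</page-content>")
# With DOTALL, "." also matches newlines, like the inline (?s) modifier.
WITH_FLAG = re.compile(r"<page-content>.*?</page-content>", re.DOTALL)


def strip_page_content(xml_text):
    """Remove every <page-content> element together with its content."""
    return WITH_FLAG.sub("", xml_text)


doc = (
    "<item><title>A</title><page-content><html>\n"
    "<body>x</body>\n"
    "</html></page-content></item>"
)
print(NO_FLAG.search(doc))      # None
print(strip_page_content(doc))  # <item><title>A</title></item>
```

This also shows the hyphen was never the problem: the only difference between the two patterns is whether the dot may cross a line break.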
I was told to add some text to an existing paragraph on a webpage. I opened the corresponding file and copied some of existing text and searched for it in the source code. Search found nothing and I was going nuts. It turned out in the source code some of the words had extra spaces between them and since this wasn't displayed in the browser it screwed up searching for them in the code.
Anyone have tips to avoid this? For example is there a way to ignore white spaces, perhaps using regular expressions? It should be simple to add a sentence of text but I ended up using the design view in Dreamweaver.
Using regular expressions, you can search for multiple spaces using +a +pattern +like +this by putting a + following each space, which will instruct the search to look for one or more.
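Typing the + by hand gets tedious for long phrases, so here is a sketch of building such a pattern automatically in Python. It uses \s+ instead of a literal space followed by +, which is my own variation so that tabs and line breaks between words are tolerated too; flexible_phrase is a made-up helper name.

```python
import re


def flexible_phrase(phrase):
    """Compile a pattern that matches the phrase with any run of
    whitespace between its words."""
    words = phrase.split()
    # re.escape keeps punctuation in the copied text from acting as regex syntax.
    return re.compile(r"\s+".join(re.escape(word) for word in words))


source = "<p>Please add   some\n   text to this paragraph.</p>"
found = flexible_phrase("add some text").search(source)
print(found is not None)  # True
```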
Problem:
^.+ matches only the first line of the source code in dreamweaver. I need it to match each line so that I can wrap each full line in P tags. I have 500 files to do this in.
I know ^ should match the beginning of a line and I also know that multi-line mode must be enabled for it to work on each line and not just at the beginning of the file. I also know dreamweaver uses javascript source code.
Is lack of multi-line mode the problem? Is there any way to turn it on in dreamweaver? I tried using /m at the beginning search to enable multi-line mode, but that didn't work either.
I'm open to any solution for my current problem, even if it involves a different program. However, a fix for dreamweaver is ideal, 2nd place is a way to do this in notepad++, 3rd place is a way to do this in python or something (I only know javascript, you'll have to spell it out exactly in another language).
Thank you,
robert
p.s.
I found I can "select all > right click > selection > indent" to add two spaces to the beginning of each line in dreamweaver. This allows me to find the beginning of each line with / {2,}/. I really don't want to select all > indent on all 500 files, but i'm about to start since I've already spent a few hours bludgeoning dreamweaver.
Don't use Dreamweaver for this - use Notepad++ (since you are familiar with it) as its regular expression support is superior.
If you are comfortable with a more robust scripting language (Python, Ruby, Perl, etc.) then that would be an even better way to do it.
The way that I might do this in DW would not involve using the find-replace tool's "Regular Expression" option, but instead using just plain old matching on a CrLf.
In the Find portion, since you can't directly enter a CrLf, you'll have to copy one to your clipboard beforehand and paste it in where needed.
In the Replace portion, replace with:
</p>[CrLf]
<p>
Again, be sure to paste in a proper "[CrLf]". This will work on all but the very first and very last lines of your document, so I know this isn't a 100% solution. There are probably better solutions using other tools that someone else can recommend!
Good luck!
-Mike
I had a flash of insight right after posting. (isn't that the way of it?)
Dreamweaver can find the end of each line with \r\n, so instead of trying to work forward, I should have just worked backwards.
search: (.+)(\r\n)
replace: <p>$1</p>$2
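Since Python was the asker's third choice, here is a rough equivalent spelled out. It is a sketch, not a drop-in tool: wrap_lines_in_p is a made-up name, and looping over the 500 files (for example with pathlib) is left out.

```python
import re

# (?m) turns on multi-line mode so "^" matches at the start of every line.
# [^\r\n]+ grabs the line content but stops before \r\n or \n, so the
# original line endings stay untouched and blank lines are skipped.
LINE = re.compile(r"(?m)^([^\r\n]+)")


def wrap_lines_in_p(text):
    """Wrap every non-empty line in <p>...</p>."""
    return LINE.sub(r"<p>\1</p>", text)


page = "first line\r\nsecond line\r\n\r\nlast line"
print(repr(wrap_lines_in_p(page)))
```

Unlike the (.+)(\r\n) search, this also catches a final line that has no line break after it.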
[\w\W]* matches anything, including a newline. It's greedy, so in fact it matches everything.