Change `"` quotation marks to LaTeX style - regex

I'm editing a book in LaTeX and its quotation marks syntax is different from the simple " characters. So I want to convert "quoted text here" to ``quoted text here''.
I have 50 text files with lots of quotations inside. I tried to write a regular expression to substitute the first " with `` and the second " with '', but I failed. I searched the internet and asked some friends, but had no success. The closest I got, for replacing the opening quotation mark, is
s/"[a-z]/``/g
but this is clearly wrong, since
"quoted text here"
will become
``uoted text here"
How can I solve my problem?

I'm a little confused by your approach. Shouldn't it be the other way round with s/``/"[a-z]/g? But then, I think it'll be better with:
s/``(.*?)''/"\1"/g
(.*?) captures what's between `` and ''.
\1 contains this capture.
If it's the opposite that you're looking for (i.e. I wrongly interpreted your question), then I would suggest this:
s/"(.*?)"/``\1''/g
Which works on the same principles as the previous regex.
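
For anyone scripting this outside sed or Perl, here is a minimal Python sketch of the same lazy-capture substitution. The function name to_latex_quotes is made up for illustration, and it assumes every quotation opens and closes on the same line, since . does not match a newline by default.

```python
import re

def to_latex_quotes(text: str) -> str:
    # "(.*?)" lazily captures the shortest run between two straight quotes;
    # \1 in the replacement puts the captured text back between `` and ''.
    return re.sub(r'"(.*?)"', r"``\1''", text)

print(to_latex_quotes('"Quote" she said, "again."'))
```

Quotations that span a line break are left untouched by this version.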

Use the following to tackle multiple quotations, replacing all " in one step.
echo '"Quote" she said, "again."' | sed "s/\"\([^\"]*\)\"/\`\`\1''/g"
The [^\"]* avoids the need for non-greedy matching, which sed does not support.

If you are using the TeXmaker software, you could use a regular expression with the Replace command (CTRL+R), and put the following into the Find field:
"([^"]*)"
and into the Replace field:
``$1''
And then just press the Replace All button. But after that, you still have to check that everything is fine, and maybe you need to do some corrections. This has worked pretty well for me.

Try grouping the word:
sed 's/"\([a-z]\)/``\1/'
On my PC:
abhishekm71#PC:~$ echo \"hello\" | sed 's/"\([a-z]\)/``\1/'
``hello"

It depends a little on your input file (are quotes always paired, or can there be omissions?). I suggest the following robust approach:
sed 's/"\([0-9a-zA-Z]\)/``\1/g'
sed "s/\([0-9a-zA-Z]\)\"/\1\'\'/g"
Assumption: An opening quotation mark is always immediately followed by a letter or digit, and a closing quotation mark is always immediately preceded by one. Quotations can span several words and even several input lines (some of the other solutions don't work when this happens).
Note that I also replace the closing quotation mark: depending on the fonts you use, the double quotation mark can be typeset as a neutral straight quotation mark.
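
The two passes above could be sketched in Python roughly as follows. This is an illustration rather than a port of the sed commands: it uses lookarounds instead of capture groups, and the name convert_by_neighbour is invented here.

```python
import re

def convert_by_neighbour(text: str) -> str:
    # Pass 1: a straight quote directly before a letter or digit opens a quotation.
    text = re.sub(r'"(?=[0-9A-Za-z])', "``", text)
    # Pass 2: a straight quote directly after a letter or digit closes one.
    text = re.sub(r'(?<=[0-9A-Za-z])"', "''", text)
    return text

print(convert_by_neighbour('He said "hello\nworld" twice'))
```

Because each quote is judged only by its neighbouring character, a quotation may run over a line break. A closing quote that follows punctuation, as in "again.", does not meet the stated assumption and stays a straight quote.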

You are looking for text enclosed in straight quotation marks that does not itself contain a quotation mark, so the best regex is "([^"]*?)". Replace it with ``\1''. In Perl this can be written as s/"([^"]*?)"/``\1''/g. I would be very careful with this approach: it only works if every opening quotation mark has a matching closing one, for example in "one" two "three" four. It will fail on "one" t"wo "three" four, producing ``one'' t``wo ''three" four.
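
One way to guard against that failure is to flag suspicious lines before replacing anything. The sketch below is only a heuristic under the assumption that quotations never span lines; the helper names find_unbalanced_lines and convert_pairs are made up for illustration.

```python
import re

def find_unbalanced_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines holding an odd count of straight quotes."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count('"') % 2 == 1
    ]

def convert_pairs(line: str) -> str:
    # Pair up straight quotes left to right; the negated class stops at the next quote.
    return re.sub(r'"([^"]*)"', r"``\1''", line)

sample = '"one" two "three" four\n"one" t"wo "three" four'
print(find_unbalanced_lines(sample))   # lines worth checking by hand
print(convert_pairs(sample.splitlines()[0]))
```

An even count does not prove the pairing is right, so this narrows down the manual check rather than replacing it.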

Related

Select last character of a substring in regexp

I'm trying to clean a huge GeoJSON data file. I need to change the format of the "text" field from
"text": "(2:Placename,Placename)"
to
"text": "Placename".
In Sublime Text I managed to write a regular expression that let me select and remove the first part, leaving something like this:
"text": "Placename)"
With the following regexp I can select the text above, but I need to narrow it down to the last character:
text\": \".*?\)
No matter what I try, I can't figure out how to select the ")" character at the end of the Placename string throughout the file and remove it. Note that "Placename" here can be any place name, like New York, London, etc.
I tried to build an expression where first part finds the text field, then ignores n-amount of characters until it finds the ")" character.
After experimenting and Googling I couldn't find a solution here.
You can capture the value of the second placemark field with the following regexp:
/"text": "+\(\d+:[^,]+,(.*?)\)/
This captures "Placename" in $1.
More info on capturing parentheses: http://www.regular-expressions.info/brackets.html
The trick is to use negated character classes and to escape any parentheses you want to match literally.
HTH
I do not know if you are using a Unix system, but probably sed can do much of the work for you. It can interpret regular expressions, capture groups, and substitute by other groups of characters. I have tried an example with sed and the following sed command worked for me:
echo "\"text\": \"(2:Placename,Placename)\"" | sed -r 's/(\"text\": )\"\([[:digit:]]:[^0-9]+,([^0-9]+)\)\"/\1\"\2\"/g'
-r enables extended regular expressions (ERE), so grouping parentheses and + work without backslashes, while \( and \) match literal parentheses. I use parentheses to capture groups for later use in the substitution (here, one group for "text" and one for the second place name). In the replacement part, \n refers to group number n. This expression should give you the desired result.
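
If sed is not available, the same capture-and-substitute idea might look like this in Python. It is a sketch under assumptions: the names PATTERN and simplify_text_field are hypothetical, the prefix is taken to be digits followed by a colon, and the first place name is assumed to contain no comma.

```python
import re

# Group 1 keeps the '"text": ' prefix; group 2 is the second place name.
# \( and \) match the literal parentheses inside the quoted value.
PATTERN = re.compile(r'("text": )"\(\d+:[^,]+,([^)]*)\)"')

def simplify_text_field(line: str) -> str:
    return PATTERN.sub(r'\1"\2"', line)

print(simplify_text_field('"text": "(2:Placename,Placename)"'))
```

Lines that do not match the pattern pass through unchanged, so it can be applied to every line of the file.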

Regex optimization: negative character class "[^#]" nullifies multiline flag "m"

I'm trying to parse a text line by line, catching everything EXCEPT what's after a specific marker, # for example. No escaping to take into account, pretty basic.
For instance, if the input text is:
Multiline input text
Mid-sentence# cut, this won't be matched
Hey There
I want to retrieve
['Multiline input text',
'Mid-sentence',
'Hey There']
This is working fine with /(.*?)(?:#.*$|$)/mg (even though there are a few empty matches). However, if I try to improve the regex (by avoiding backtracking and getting rid of empty matches) with /([^#]++)(?:#.*$|$)/mg, it returns
[
"Multiline input text
Mid-sentence",
"
Hey There"
]
As if [^#] was including linebreaks, even with the multiline flag on. As far as I can tell I can fix that by adding \n and \r to the character class ([^#\n\r]), but this makes the multiline option kind of useless, and I'm afraid it could break on unusual linebreaks in some environments or encodings.
Would any of you know the reason for this behavior, and if there's another workaround? Thanks!
Edit
Originally, it happens in PCRE. But even in Javascript with /([^#]+)(?:#.*$|$)/mg, same unwanted multiline behavior. I know I could probably use the language to parse the text line by line, but I'd like to do it with regex only.
It seems you got your definition of /m wrong. The only thing this flag does is change what ^ and $ match, so that they also match at the beginning and end of each line, respectively. It does not affect anything else. If you don't want to match line breaks you should do as you suggested and use [^#\n\r].
The regex that will work for you is:
^(.*?)(?:#.*|)$
Online Demo: http://regex101.com/r/aP8eV6
The difference is the use of .*? instead of [^#]+.
[^#]+ by definition matches anything but #, and that includes newlines.
The multiline flag m only makes the anchors ^ and $ match at line boundaries in multiline input.
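
The behaviour described in these answers can be reproduced with Python's re module, whose re.M flag plays the role of /m. This is only a demonstration on the sample input from the question; the variable names are arbitrary.

```python
import re

text = "Multiline input text\nMid-sentence# cut, this won't be matched\nHey There"

# [^#]+ crosses line breaks: re.M changes only where ^ and $ may match.
crossing = re.findall(r'([^#]+)(?:#.*$|$)', text, flags=re.M)

# Excluding \r and \n from the class keeps every match on a single line.
per_line = re.findall(r'([^#\r\n]+)(?:#.*$|$)', text, flags=re.M)

# The anchored lazy pattern from the answer above, one match per line.
anchored = re.findall(r'^(.*?)(?:#.*|)$', text, flags=re.M)

print(crossing)
print(per_line)
print(anchored)
```

The first list shows the unwanted merge of lines reported in the question, while the other two give one entry per line.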

JSP Tag Spacing Regex

We are supposed to migrate all our apps from one type of server to another. The new servers do not accept invalid JSP tags where a space is missing between attributes. For example, the following.
<input type="text"name="myField" />
The following regex was given to us to use, but it seems to not be perfect.
[\w.-]+[\s]*=[\s]*"[^"]+"[^\s/%>]
For example, it returns string assignments like the following.
span.style.fontWeight = "bold";
Can anyone suggest a better regex for locating just the invalid JSP code?
UPDATE
I want this regex to work using the Eclipse Search > File functionality.
Try simply this RegEx: (<.+?[^" ]+?="[^"]+?")([^ ]+?)(.+?>). It will locate all "tags" with a " not followed by a space. Then you can replace the captured groups like this: $1 $2$3 to add a space.
Tenub's answer is nearly correct, but as Rachel G. mentioned, it will return false positives when the closing bracket immediately follows the closing quotation mark.
(<[^?%].+?[^" ]+?="[^"]+?")([^/ >]+?)([^>]*(?:/|\?|%)?>)
Should give you the results you're after.
Disclaimer: This is not a strict checker. You could have a tag such as <..." asdf/> go undetected, but as the tags are presumably well formed enough to work under the old system, this should be sufficient.
Simple version:
Find: (=\s*"[^"]*")(\w)
Replace with: $1 $2
Explanation
The find regex looks for = followed by optional whitespace followed by "...", immediately followed by a single alphanumeric character or underscore.
It's separated out into two capturing groups, which are represented by $1 and $2 in the replace expression - with a space inserted between them.
[Minor Issue: This won't work for attribute values that include escaped double quotation marks. I haven't addressed this, as I am assuming it is pretty unlikely. However, it justifies doing a manual find/replace rather than "replace all", just in case.]
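
To try the simple version outside Eclipse, a rough Python equivalent is sketched below. The names ATTR_GAP and add_missing_space are invented for this example, and the same caveat about escaped quotation marks applies.

```python
import re

# Group 1: '=' plus optional whitespace plus a quoted value.
# Group 2: the word character that follows it with no space in between.
ATTR_GAP = re.compile(r'(=\s*"[^"]*")(\w)')

def add_missing_space(line: str) -> str:
    return ATTR_GAP.sub(r'\1 \2', line)

print(add_missing_space('<input type="text"name="myField" />'))
print(add_missing_space('span.style.fontWeight = "bold";'))
```

The JavaScript assignment from the question is left alone because the closing quote is followed by a semicolon rather than a word character.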

Regex -- replace all spaces before a particular character

The goal is to make something like
This is some text=This is some text
become:
This\ is\ some\ text=This is some text
I've been playing with variations of things I know will grab spaces/whitespace (like "\ " or \s) in front of (?==), which seems to select up to the = character, but nothing seems to work in IntelliJ IDEA's search and replace.
Any suggestions?
Copying the answer from the comments in order to remove this question from the "Unanswered" filter:
This - (\s)(?=.*=) should work. Replace it with \$1
~ answer per Rohit Jain
This was additionally confirmed by the OP:
That worked, though I used a literal space instead of the \s because it was picking up some additional white space I didn't want replaced. Also had to do some silly escaping for the replace (\\$1 )
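
Outside the IDE, the same lookahead can be checked with a few lines of Python. This sketch uses a literal space, as the OP ended up doing; the function name escape_key_spaces is hypothetical, and it assumes the value after = contains no further = sign.

```python
import re

def escape_key_spaces(line: str) -> str:
    # A space is replaced only when an '=' still follows later on the line,
    # so spaces in the value part are left alone.
    # In the replacement template, \\ stands for one literal backslash.
    return re.sub(r' (?=.*=)', r'\\ ', line)

print(escape_key_spaces('This is some text=This is some text'))
```

If a value can itself contain =, spaces before the last = on the line would be escaped as well, so the pattern would need tightening.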

RegEx Expression to find strings with quotation marks and a backslash

I am using a program that pastes what is in the clipboard in a modified format according to what I specify.
I would like for it to paste paths (i.e. "C:\folder\My File") without the pair of double quotes.
This, which doesn't use RegEx, works: Find " (I simply enter that on one line) and replace with nothing. I enter nothing in the second field; I leave it blank.
Now, though that works, it will remove double quotes in this scenario: Bob said "What are you doing?"
I would like the program to remove the quotes only if the words enclosed in the double quotes contain a backslash.
So, once again, just to make sure I am clear, I need the following:
1) A RegEx expression to find strings that have both double quotes and a backslash within that set of quotes.
2) A RegEx Expression that says: replace the backslashes with backslashes (i.e. leave them there).
Thank you for the fast response. This program has two fields. One for what to find and the other for what to replace. So, what would go in the 2nd field?
The program came with the Remove HTML entry, which has
<[^>]*> in the match pattern
and nothing (it's blank) in the Replacement field.
You didn't say which language you use, here's an example in Javascript:
> s = 'say "hello" and replace "C:\\folder\\My File" thanks'
"say "hello" and replace "C:\folder\My File" thanks"
> s.replace(/"([^"\\]*\\[^"]*)"/g, "$1")
"say "hello" and replace C:\folder\My File thanks"
This should work in .NET:
^".*?\\.*?"$
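
For comparison, here is the JavaScript pattern from the earlier answer tried in Python. It is only a sketch (the program in the question may use a different regex flavour), and the names QUOTED_PATH and unquote_paths are made up.

```python
import re

# A quoted run that contains at least one backslash; group 1 is the text
# between the quotes, which is all that is kept.
QUOTED_PATH = re.compile(r'"([^"\\]*\\[^"]*)"')

def unquote_paths(text: str) -> str:
    return QUOTED_PATH.sub(r'\1', text)

print(unquote_paths('say "hello" and replace "C:\\folder\\My File" thanks'))
print(unquote_paths('Bob said "What are you doing?"'))
```

In a two-field find/replace tool this would correspond to the pattern in the Find field and a reference to group 1 in the Replace field. How that reference is written ($1 or \1) depends on the program.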