String replace on a very large file - regex

I have a giant text file that is JSON. You can see it here: http://api.mtgdb.info/cards/. I have saved this JSON to a file called cards.json.
In cards.json, I need to escape every single quote ' with a backslash \.
So I need to replace ' with \'.
Usually this is trivial in any editor, however the file is too large. How can I escape all single quotes in this string?
What I've tried:
I tried using sed. My command was sed s/\'/\\\'/ cards.json > cards_cleaned.json. However, the cards_cleaned.json file did not have any escaped '; it was just an exact copy of cards.json. Sed works when I do sed s/\'/foobar/ cards.json > cards_cleaned.json, so I'm assuming something is wrong with how I escape the backslashes.
I tried using vim. I opened cards.json in vim ($ vi cards.json). Then I tried a global string replace using :%s/'/\'/g. This did not change anything in the file.

While @anubhava's and @gboffi's answers work, they produce INVALID JSON.
JSON allows only a few characters after the backslash:
\"
\\
\/
\b
\f
\n
\r
\t
\u four-hex-digits
E.g., from the following part of the original (correct) JSON
[
{
"description" : "Whenever a land enters the battlefield, Ankh of Mishra deals 2 damage to that land's controller.",
"rarity" : "Rare",
"name" : "Ankh of Mishra"
}
]
you want to get
[
{
"description" : "Whenever a land enters the battlefield, Ankh of Mishra deals 2 damage to that land\'s controller.",
"rarity" : "Rare",
"name" : "Ankh of Mishra"
}
]
# i.e., instead of land's you want land\'s
But this is INVALID JSON.
So, if you (for some strange reason) want to have the backslash, you need to use a double \\, like this:
[
{
"description" : "Whenever a land enters the battlefield, Ankh of Mishra deals 2 damage to that land\\'s controller.",
"rarity" : "Rare",
"name" : "Ankh of Mishra"
}
]
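The escape list above can be turned into a rough check for backslashes that JSON would reject. This is only a sketch: has_invalid_escape is a hypothetical helper name, it accepts any \u without verifying the four hex digits, and it inspects a raw string rather than parsing JSON.

```shell
# Hypothetical helper: succeeds (exit 0) if the text still contains a
# backslash that does not start one of JSON's permitted escapes.
has_invalid_escape() {
  # 1. Delete every permitted two-character escape (\" \\ \/ \b \f \n \r \t \u).
  # 2. Any backslash that survives is not a legal JSON escape.
  printf '%s\n' "$1" | sed 's|\\["\\/bfnrtu]||g' | grep -q '\\'
}

# Inside double quotes the shell turns \\ into \, so the first argument
# below is land\'s and the second is land\\'s.
has_invalid_escape "land\\'s"   && printf '%s\n' "single backslash: invalid"
has_invalid_escape "land\\\\'s" || printf '%s\n' "double backslash: ok"
```

A real JSON parser remains the authoritative check; this only flags the specific mistake discussed here.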
Solution (for both)
with perl
perl -pE "s/'/\\\'/g" < mtg_cards.json > cards.malformed.json
#changes "land's" to wrong "land\'s"
and
perl -pE "s/'/\\\\\\\\'/g" < mtg_cards.json > card_with_double_BS.json
#changes "land's" to "land\\'s" (eight backslashes: the shell halves them to four, perl halves them again to two)
PS: Because your file is a single long (30 MB) line, vim has problems with it. You can pretty-print (fold and indent) the JSON before editing. Many tools can do this; I'm using the json_xs command from the JSON::XS Perl package. After "prettifying" you can use vim safely.

You need to use double quotes in the shell so that the single quote character needs no quoting, but then you have to be careful, because inside a double-quoted string the shell itself uses the backslash as a quoting character
$ echo "eoieriou'iouou'oiuiouiuo"|sed "s/'/\\'/g"
eoieriou'iouou'oiuiouiuo
The command that sed actually receives is s/'/\'/g, but sed's own quoting character is also the backslash, so you substitute each single quote with a single quote...
The backslash must still be quoted when it reaches sed, so let's try
$ echo "eoieriou'iouou'oiuiouiuo"|sed "s/'/\\\\'/g" # Four (4) backslashes in a row
eoieriou\'iouou\'oiuiouiuo
$
That's OK, isn't it? It works because sed is instructed to do s/'/\\'/g, so the quoted character, from sed's point of view, is the backslash itself...
Please note that the quotes, single or double, are not special characters from the POV of sed, they're special only in the context of the shell.
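One way to see the two layers of quoting described above is to print the script exactly as the shell hands it to sed. A minimal sketch; escape_single_quotes is a hypothetical wrapper name, not part of sed.

```shell
# printf shows the argument after the shell has processed the double quotes.
printf '%s\n' "s/'/\\'/g"      # sed receives: s/'/\'/g   (replacement is just ')
printf '%s\n' "s/'/\\\\'/g"    # sed receives: s/'/\\'/g  (replacement is \ then ')

# Hypothetical wrapper around the working form; reads stdin, writes stdout.
escape_single_quotes() {
  sed "s/'/\\\\'/g"
}

printf '%s\n' "land's controller" | escape_single_quotes
# prints: land\'s controller
```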

In Vi you will need to escape the \ character.
Try using
:%s/'/\\'/g
For me it worked.
Test.txt
\'\'\' \'\'\'

You need to double-escape the backslash, so use:
sed -i.bak "s/'/\\\\'/g" cards.json

You can do it like this in vim:
:%s/'/\\\'/g
In sed,
sed "s/'/\\\'/g" filename

Here is an awk version:
cat file
hi'more data here'
awk '{gsub(g,"\\"g)}1' g="'" file
hi\'more data here\'
Or if you need double backslash:
awk '{gsub(g,"\\\\"g)}1' g="'" file
hi\\'more data here\\'

sed "s/'/\\\\&/g" cards.json > cards_cleaned.json
There is no need to escape the quote in the search pattern (\') as in your first attempt.
Surround the script with double quotes (single quotes would do if the single quote were not the character being replaced), and escape the backslash itself, because the shell processes backslashes inside double quotes.

Related

GNU sed regex to fix mySQL db inserts for SQLite

I am trying to translate a huge mySQL database dump file from mySQL syntax into SQLite syntax.
At https://regex101.com/ I have successfully created an ECMAScript flavor regex to turn something like:
,'foo\'s bar!',
into:
,"foo\'s bar!"
with this regular expression:
/,'([^']+)\\'([^']+)',/"$1\\'$2"/g
testing against this short file:
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n','Henry\'s_1.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h','Henry\'s_2.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
Resulting in:
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n'"Henry\'s_1.csv"'text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h'"Henry\'s_2.csv"'text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
but for the life of me I cannot translate this into a GNU sed flavor regex.
For example, this command does not make any substitutions in the output:
sed -r s/,'([^']+)\\'([^']+)',/"$1\\'$2"/g <test.sql
...
sed -r s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g <test.sql: doesn't work either.
I have looked for an online regex tool that translates between different flavors of regex but cannot find one that supports GNU sed (the one shipped with Git: sed (GNU sed) 4.8). PCRE seems to be close to what sed has, but that doesn't work. I tried perl as well, with no luck.
Anyone know a regex expression that works or a translator tool that works?
I am just about ready to write a nodejs program to do this for me.
Also, for extra credit, how can I write a sed script to handle any number of escaped quotes within a quoted string? I have that issue to deal with as well in my DB dump file.
Examples:
'foo\'-bar' // one instance
'foo\'and\'bar' // two instances
'foo\'and\'bar\'s on the deck' // three instances
and so on...
Thanks!
You can use
sed -E "s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g test.sql
The "s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g consists of
"s/,'([^']+)\\\\'([^']+)',/" - a s/,'([^']+)\\'([^']+)',/ part (inside double quotes, so backslashes need doubling)
'"' - a " char (inside single quotes)
"\\1\\\\'\\2" - \1\\'\2 pattern (inside double quotes, so backslashes are doubled)
'"' - a " char (inside single quotes)
/g - the global flag (no need quoting here).
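To check how the quoted pieces listed above concatenate, the argument can be stored in a variable and printed before it is passed to sed. A sketch only; the variable name sed_script and the one-field sample are illustrative.

```shell
# Same concatenation of double- and single-quoted pieces as in the command above.
sed_script="s/,'([^']+)\\\\'([^']+)',/"'"'"\\1\\\\'\\2"'"'/g

# Show what sed will actually receive.
printf '%s\n' "$sed_script"
# prints: s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g

# Apply it to one field from the sample data ("\\'" is \' after shell quoting).
printf '%s\n' ",'Henry\\'s_1.csv'," | sed -E "$sed_script"
# prints: "Henry\'s_1.csv"
```

Note that the surrounding commas are consumed by the match, exactly as in the question's original pattern.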
First look at your command
sed -r s/,'([^']+)\\'([^']+)',/"\1\\'\2"/g test.sql
I prefer writing the whole sed command in single quotes. When you need a single quote, you must close the string ('), use an escaped single quote (\') and open the next string with a ', all joined: '\''.
I also added two , characters.
sed -r 's/,'\''([^'\'']+)\\'\''([^'\'']+)'\'',/,"\1\\'\''\2",/g' test.sql
# Shorter
sed -r 's/,'\''([^'\'']+\\'\''[^'\'']+)'\'',/,"\1",/g' test.sql
# Using another way to write the single quotes, with the hex notation
sed -r 's/,\x27([^\x27]+\\\x27[^\x27]+)\x27,/,"\1",/g' test.sql
This works for simple cases, not for 'foo\'and\'bar\'s on the deck'.
I think you want to replace the quotes in the simple fields too.
Suppose you want to transform
(1058,'gpl5q0x51349lmdq3e0ijm4k9b6n','Henry\'s_1.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33854,'mUVk0/XGX+afIpkrqBm7LQ==','2021-01-06 03:07:23'),
(1059,'xzj8mivsenkakkrurfjytxjsaj1h','Henry\'s_2.csv','text/csv','{\"identified\":true,\"analyzed\":true}',33555,'KfRYqfAWtSIYXZ6oQZyYbA==','2021-01-06 03:07:23'),
(2000,'extra credit from question','foo\'and\'bar\'s on the deck','text/csv','{\"identified\":true,\"analyzed\":true}',33999,'KgSBFstbdthdsssssstvbA==','2022-01-02 13:07:23'),
into
(1058,"gpl5q0x51349lmdq3e0ijm4k9b6n","Henry\'s_1.csv","text/csv","{\"identified\":true,\"analyzed\":true}",33854,"mUVk0/XGX+afIpkrqBm7LQ==","2021-01-06 03:07:23"),
(1059,"xzj8mivsenkakkrurfjytxjsaj1h","Henry\'s_2.csv","text/csv","{\"identified\":true,\"analyzed\":true}",33555,"KfRYqfAWtSIYXZ6oQZyYbA==","2021-01-06 03:07:23"),
(2000,"extra credit from question","foo\'and\'bar\'s on the deck","text/csv","{\"identified\":true,\"analyzed\":true}",33999,"KgSBFstbdthdsssssstvbA==","2022-01-02 13:07:23"),
In this answer I don't use the '\'' but the hexadecimal notation \x27.
First "backup" the \' combinations (replace them by an unused character like \r), replace all normal quotes by double quotes and "restore the backup" (change back the \r).
sed 's/\\\x27/\r/g; s/\x27/"/g; s/\r/\\\x27/g' test.sql
# or hex value for double quote "
sed 's/\\\x27/\r/g; s/\x27/\x22/g; s/\r/\\\x27/g' test.sql
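The backup/replace/restore idea can be tried on a single line before running it on the whole dump. A minimal sketch, assuming GNU sed (for \x27 and \r in the script) and input that contains no carriage returns of its own; convert_quotes is a hypothetical function name.

```shell
# Hypothetical helper: reads stdin, writes stdout.
convert_quotes() {
  # 1. Park every \' as a carriage return (assumed not to occur in the input).
  # 2. Turn the remaining single quotes into double quotes.
  # 3. Restore the parked \' sequences.
  sed 's/\\\x27/\r/g; s/\x27/"/g; s/\r/\\\x27/g'
}

# "\\'" inside double quotes is \' once the shell has processed it.
printf '%s\n' "(2000,'foo\\'and\\'bar\\'s on the deck','text/csv')," | convert_quotes
# prints: (2000,"foo\'and\'bar\'s on the deck","text/csv"),
```

If the dump has CRLF line endings, the \r placeholder would collide with real carriage returns, so a different unused character would be needed.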

How to use sed to add double quotes around every word, excluding colons and commas

I want to alter a string so that I have double quotes around every "word," excluding colons and commas ':,'.
For example, my input may look like:
[ANALYSIS:true, RESTRICTED:false, STRING_PARAMETER:World,
JOB_NAME:Hello_Jenkins]
but I want it to appear as
["ANALYSIS":"true", "RESTRICTED":"false", "STRING_PARAMETER":"World",
"JOB_NAME":"Hello_Jenkins"]
I've been using something like (using '_' as the delimiter)
'echo ${params} | sed -i "s_\'/\\([^:]*\\):/i\'_\'"$1" :\'_g" '
based off of what I've found online, yet it makes no changes to my string.
> sed -r 's/[^], :[]+/"&"/g' file
["ANALYSIS":"true", "RESTRICTED":"false", "STRING_PARAMETER":"World", "JOB_NAME":"Hello_Jenkins"]
In the above sed we exclude colons, commas, the brackets and the spaces, as your example says so. If your case is not fully represented by your example, you could modify the excluded characters, but the order of the brackets in the expression is important.
$ echo '[ANALYSIS:true, RESTRICTED:false, STRING_PARAMETER:World, JOB_NAME:Hello_Jenkins]' |
sed 's/[[:alnum:]_]\+/"&"/g'
["ANALYSIS":"true", "RESTRICTED":"false", "STRING_PARAMETER":"World", "JOB_NAME":"Hello_Jenkins"]
or if you have to exclude instead of include chars in the regexp:
$ echo '[ANALYSIS:true, RESTRICTED:false, STRING_PARAMETER:World, JOB_NAME:Hello_Jenkins]' |
sed 's/[^][,: ]\+/"&"/g'
["ANALYSIS":"true", "RESTRICTED":"false", "STRING_PARAMETER":"World", "JOB_NAME":"Hello_Jenkins"]
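The remark about bracket order can be made concrete: inside a bracket expression, ] is taken literally only when it is the first character after [ or [^. A minimal sketch, with quote_words as a hypothetical function name:

```shell
# Hypothetical helper: wrap every run of characters that are not
# ']', '[', ',', ':' or a space in double quotes.
quote_words() {
  # ']' comes first after '^' so it is a literal member of the set.
  sed -E 's/[^][,: ]+/"&"/g'
}

printf '%s\n' '[ANALYSIS:true, JOB_NAME:Hello_Jenkins]' | quote_words
# prints: ["ANALYSIS":"true", "JOB_NAME":"Hello_Jenkins"]
```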

Replacing spaces with underscores within quotes

I need to replace within a large text file all occurrences such as 'yw234DV w-23-sDf wef23s-d-f' with the same strings but with underscores instead of spaces for all spaces within quotes, without replacing any spaces outside quotes with underscores.
I'm trying to find a solution for substitution within vim, but a sed solution would also be much appreciated. The number of tokens in each quote-delimited string may vary.
I've been playing with some regexes in vim, but they're pretty elementary and seem to be missing what I need.
My current attempt:
%s/'{[:alnum:] }*/'\0\_/g
And I'm experimenting with variations on that.
This is most similar to my question, though it is Java:
Replacing spaces within quotes
Sample Input:
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
Sample Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
You may try this in Vim (I tested it in MacVim):
%s/\%('[^']*'\)*\('[^']*'\)/\=substitute(submatch(1), ' ', '_', 'g')/g
A much simpler solution, thanks to @SergioAraujo:
%s/\v%(('[^']*'))/\=substitute(submatch(1),' ', '_', 'g')/g
Not sure however, if below is the outcome you have expected
Output:
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
In perl:
perl -i -pe's{(\x27.*?\x27)}{ (my $subst = $1) =~ tr/ /_/; $subst }ge' yourfile
or, with Perl 5.14 or above:
perl -i -pe's{(\x27.*?\x27)}{ $1 =~ tr/ /_/r }ge' yourfile
With this the input file:
$ cat file
'wiUEF7-gvouw ow wo24-RTeih we', 'yt23IT iug-76'
We can convert all spaces inside of single-quotes into underscores with:
$ sed -E ":a; s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/; ta" file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'
How it works
:a
This creates a label a.
s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/
This inserts the underscores where we want them.
^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]
This looks for any odd number of single quotes followed by any number of non-quote characters followed by a space. Everything before that space is saved in group 1.
\1_
This replaces the matched text with group 1 followed by an underscore.
ta
If the previous command put any new underscores in the string, then jump back to label a and try again.
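The need for the loop becomes visible when the substitution is run once without it: each pass replaces only one space, and because the match is greedy it is the last space inside quotes. A small sketch; underscore_quoted_spaces is a hypothetical wrapper name and the sample line is made up.

```shell
# Single pass, no loop: only the last in-quote space changes.
printf '%s\n' "'a b c', 'd e'" |
  sed -E "s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/"
# prints: 'a b c', 'd_e'

# Hypothetical wrapper around the looping form; reads stdin, writes stdout.
underscore_quoted_spaces() {
  # ta jumps back to :a as long as the s command keeps finding a match.
  sed -E ":a; s/^(([^']*'[^']*')*[^']*'[^']*)[[:space:]]/\1_/; ta"
}

printf '%s\n' "'a b c', 'd e'" | underscore_quoted_spaces
# prints: 'a_b_c', 'd_e'
```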
Using FPAT variable in gnu awk you can do this:
awk -v OFS=', ' -v FPAT="'[^']*'" '{for (h=1; h<=NF; h++)
{gsub(/[[:blank:]]/, "_", $h); printf "%s%s", $h, (h < NF ? OFS : ORS)}}' file
'wiUEF7-gvouw_ow_wo24-RTeih_we', 'yt23IT_iug-76'

Sed with both " and ' in insert string

I am using sed command in Ubuntu for making shell script.
I have a problem because the string I am inserting has both single and double quotes. Dashes also. This is the example:
sed -i "16i$('#myTable td:contains("Trunk do SW-BG-26,
GigabitEthernet0/22")').parents("tr").remove();" proba.txt
It should insert
$('#myTable td:contains("Trunk do SW-BG-26, GigabitEthernet0/22")').parents("tr").remove();
in line 16 of the file proba.txt
but instead it inserts
$('#myTable td:contains(
because it exits prematurely. How can I resolve this? I cannot find a solution here on the site because I have both kinds of quotation marks, and the explanations cover only one kind.
2nd try
I set \ in front every double quote except the outermost ones but I still didn't get what I want. Result is:
.parents("tr").remove();
Then I put \ in front of every ' too but the result was an error in script. This is the 4th row:
sed -i "16i$(\'#myTable td:contains(\"QinQ tunnel - SCnet wireless\")\').parents(\"tr\").remove();" proba.txt
This is the error:
4: skripta.sh: Syntax error: "(" unexpected (expecting ")")
Maybe there is easier way to insert line into the file at the exact line if that line has ", ', /?
3rd time is a charm
While inserting many lines yesterday, I came across another problem using sed. I want to insert this text:
$(document).ready( function() {
with command:
sed -i "16i$(document).ready( function() {" proba.txt
and as a result I get this text inserted, as if document were something special, or perhaps because of the $:
.ready( function() {
Any thoughts about that?
There are two ways around this. The easy way out is to put the script into a file and use that on the command line. For example, sed.script contains:
16i\
$('#myTable td:contains("Trunk do SW-BG-26, GigabitEthernet0/22")').parents("tr").remove();
and you run:
sed -f sed.script ...
If you want to do it without the file, then you have to decide whether to use single quotes or double quotes around your sed -e expression. Using single quotes is usually easier; there are no other special characters to worry about. Each embedded single quote is replaced by '\'':
sed -e '16i\
$('\''#myTable td:contains("Trunk do SW-BG-26, GigabitEthernet0/22")'\'').parents("tr").remove();' ...
If you want to use double quotes, then each embedded double quote needs to be replaced by \", but you also have to escape embedded back quotes `, dollar signs $ and backslashes \:
sed -e "16i\\
\$('#myTable td:contains(\"Trunk do SW-BG-26, GigabitEthernet0/22\")').parents(\"tr\").remove();" ...
(To the point: I forgot to escape the $ before I checked the script with double quotes; I got the script with single quotes right first time.)
Because of all the extra checking, I almost invariably use single quotes, unless I need to get shell variables substituted into the script.
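The script-file approach can also be driven from within a shell script by writing the sed script through a here-document with a quoted delimiter, which switches off all shell expansion for its body. A minimal sketch; insert_js_line and the temporary file are illustrative, and the function prints the result instead of editing in place.

```shell
# Hypothetical helper: print FILE with the JavaScript line inserted before line 16.
insert_js_line() {
  script=$(mktemp) || return 1
  # The quoted delimiter ('EOF') stops the shell from touching $, quotes, or backslashes.
  cat > "$script" <<'EOF'
16i\
$('#myTable td:contains("Trunk do SW-BG-26, GigabitEthernet0/22")').parents("tr").remove();
EOF
  sed -f "$script" "$1"
  rm -f "$script"
}

# Usage: insert_js_line proba.txt > proba.new
```

Nothing is inserted if the file has fewer than 16 lines, since the address never matches.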
sed -i "16 i\\
\$('#myTable td:contains(\"Trunk do SW-BG-26, GigabitEthernet0/22\")').parents(\"tr\").remove();" proba.txt
Escape the double quotes, the backslash (followed by a newline) required after the i command, and the $, because the shell interprets all of these inside double quotes.

How do I escape a left paren in a Perl regex?

If I have a file containing some escaped parens, how can I replace all instances with an unescaped paren using Perl?
i.e. turn this:
.... foo\(bar ....
into this
.... foo(bar ....
I tried the following but received this error message:
perl -pe "s/\\\(/\(/g" ./file
Unmatched ( in regex; marked by <-- HERE in m/\\( <-- HERE / at -e line 1.
You're forgetting that backslashes mean something to the shell, too. Try using single quotes instead of double quotes. (Or put your script in a file, where you won't need to worry about shell quoting.)
Gah. From command line, no less. Way too many levels of metacharacter interpretation.
Try replacing your double quotes with single quotes, see if that helps.
cjm's answer is probably the best. If you must do it at the command line, try using quotemeta() or the metaquoting escape sequence (\Q...\E). This worked for me in a bash prompt:
perl -pe "s/\Q\(\E/(/g" ./file