Regular expression in postgresql - regex

I have a table mytable with a column images, where I store strings that look like JSON objects.
In some records this column contains an invalid string; the string itself is stored fine, but the query fails when I try to cast it to JSON. Example:
`SELECT images::JSON->0 FROM mytable WHERE <any filter>`
If every element of the JSON array is valid, that query works fine, but if a string contains a " in the wrong place (specifically, in this case, inside the title key), the query raises an error.
Good strings look like this:
[
  {
    "imagen": "http://www.myExample.com/asd1.png",
    "amazon": "http://amazonExample.com/asd1.jpg",
    "title": "A title 1."
  },
  {
    "imagen": "http://www.myExample.com/asd2.png",
    "amazon": "http://amazonExample.com/asd2.jpg",
    "title": "A title 2."
  },
  {
    "imagen": "http://www.myExample.com/asd3.png",
    "amazon": "http://amazonExample.com/asd3.jpg",
    "title": "A title 3."
  }
]
Bad ones look like this:
[
  {
    "imagen": "http://www.myExample.com/asd1.png",
    "amazon": "http://amazonExample.com/asd1.jpg",
    "title": "A "title" 1."
  },
  {
    "imagen": "http://www.myExample.com/asd2.png",
    "amazon": "http://amazonExample.com/asd2.jpg",
    "title": "A title 2."
  },
  {
    "imagen": "http://www.myExample.com/asd3.png",
    "amazon": "http://amazonExample.com/asd3.jpg",
    "title": "A title 3."
  }
]
Note the difference in the title keys.
I need a regular expression to convert bad strings into good ones in PostgreSQL.

Doing this in a single regexp would be very complicated, if it is possible at all, but it is easy to do in two or more steps.
For example, replace all the double quotes with \" and then replace {\" with {", \":\" with ":", \",\" with ",", and \"} with "}. The quotes that are left escaped at the end are exactly the ones that were breaking the JSON.
Alternatively, replace "(?=[^}:]*"[\s]*}) (the quotes in title only) with \" and then replace ":\" with ":". See details: https://regex101.com/r/pB6rD9/1
Crafting a replacement that can do it all in one go would require lookbehinds, and I suppose that PostgreSQL does not support them.
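A minimal sketch of the first approach in SQL, assuming the column is plain text and the stored values always look like the examples above (these are whitespace-tolerant versions of the same four structural patterns, so the stray quotes inside the title values are the only ones left escaped):

    SELECT regexp_replace(
             regexp_replace(
               regexp_replace(
                 regexp_replace(
                   replace(images, '"', '\"'),     -- step 1: escape every "
                   '(\{\s*)\\"', '\1"', 'g'),      -- step 2: restore {"
                 '\\"(\s*:\s*)\\"', '"\1"', 'g'),  --         restore ":"
               '\\"(\s*,\s*)\\"', '"\1"', 'g'),    --         restore ","
             '\\"(\s*\})', '"\1', 'g')::json -> 0  --         restore "}
    FROM mytable;

The same nested expression (without the cast) can be used in an UPDATE to fix the stored strings in place.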

Related

How to extract more than label text items in a single annotation using Google NLP

I have created a dataset using Google NLP entity extraction and uploaded the input data (train, test, and validation JSONL files) in the NLP format to a Google Storage bucket.
Sample Annotation:
{
  "annotations": [{
    "text_extraction": {
      "text_segment": {
        "end_offset": 10,
        "start_offset": 0
      }
    },
    "display_name": "Name"
  }],
  "text_snippet": {
    "content": "JJ's Pizza\n "
  }
}
{
  "annotations": [{
    "text_extraction": {
      "text_segment": {
        "end_offset": 9,
        "start_offset": 0
      }
    },
    "display_name": "City"
  }],
  "text_snippet": {
    "content": "San Francisco\n "
  }
}
Here is the input text for which I want to predict the labels "Name", "City" and "State":
Best J J's Pizza in San Francisco, CA
The result I got is shown in the screenshot; I expected the predictions to be the following:
Name : JJ's Pizza
City : San Francisco
State: CA
According to the sample annotation you provided, you're setting the whole text_snippet to be a name (or whatever field you want to extract).
This can lead the model to believe that the entire text is that entity.
It would be better to have training data similar to the example in the documentation: there, a large chunk of text is given and the entities we want extracted are annotated inside it.
As an example, let's say that from these text snippets I tell the model that each restaurant name is an entity called a, while each city name is an entity called b:
JJ Pizza
LL Burritos
Kebab MM
Shushi NN
San Francisco
NY
Washington
Los Angeles
Then, when the model reads Best JJ Pizza, it treats the whole string as a single entity (that is the assumption we trained it with) and just picks whichever entity it matches best (in this case, it would likely say it's an a entity).
However, if I provide the following text samples (again annotated so that the restaurant names are entity a and the city names are entity b):
The best pizza place in San Francisco is JJ Pizza.
For a luxurious experience, do not forget to visit LL Burritos when you're around NY.
I once visited Kebab MM, but there are better options in Washington.
You can find Shushi NN in Los Angeles.
You can see how you're training the model to find the entities within a piece of text, and it will try to extract them according to the context.
The important part about training the model is providing training data as similar to real-life data as possible.
In the example you provided, if the data in your real-life scenario is going to be in the format <ADJECTIVE> <NAME> <CITY>, then your training data should have that same format:
{
  "annotations": [{
    "text_extraction": {
      "text_segment": {
        "end_offset": 16,
        "start_offset": 6
      }
    },
    "display_name": "Name"
  },
  {
    "text_extraction": {
      "text_segment": {
        "end_offset": 33,
        "start_offset": 20
      }
    },
    "display_name": "City"
  }],
  "text_snippet": {
    "content": "Worst JJ's Pizza in San Francisco\n "
  }
}
Note that the point of a Natural Language ML model is to process natural language. If your inputs are going to be as simple and short as that, it might not be worth going the ML route; a simple regex should be enough. Without the natural-language part, it is going to be hard to train a model properly. More details are in the beginner's guide.
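For instance, here is a hedged sketch of the regex route, assuming the inputs really do all look like the single example in the question (the pattern below is an illustrative guess, not a tested solution):

    // Hypothetical pattern for "<adjective> <name> in <city>, <state>"
    const input = "Best JJ's Pizza in San Francisco, CA";
    const match = input.match(/^\w+\s+(.+)\s+in\s+(.+),\s*([A-Z]{2})$/);
    if (match) {
      const [, name, city, state] = match;
      console.log({ name, city, state }); // { name: "JJ's Pizza", city: 'San Francisco', state: 'CA' }
    }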

How to swap two words in Visual Studio Code with a find and replace?

The project I'm working on has a number of yaml files, where all the instances of lat: and long: need to be swapped, since the data is incorrectly labeled.
So for instance, the following:
- lat: "-82.645672"
long: '44.941747'
title: "Item 1"
- lat: "-82.645744"
long: '44.940731'
title: "Item 2"
- lat: "-82.645744"
long: '44.940731'
title: "Item 3"
- lat: "-82.646599"
long: '44.941441'
title: "Item 4"
Would need to look like this:
- long: "-82.645672"
lat: '44.941747'
title: "Item 1"
- long: "-82.645744"
lat: '44.940731'
title: "Item 2"
- long: "-82.645744"
lat: '44.940731'
title: "Item 3"
- long: "-82.646599"
lat: '44.941441'
title: "Item 4"
I'm struggling to figure out how to swap these two words globally. I looked at the available plugins, but they only seem to work on the current file, and only when a couple of words are highlighted (e.g. https://marketplace.visualstudio.com/items?itemName=davidmart.swap-word). I was looking into regex as a possible solution, but can only find ways to reorder words on the same line. Is there a regex that can be used in a find and replace to swap two words across all files in a project?
To swap the words across files (see the end of this answer for an easy way to swap them in a single file):
Try this regex:
^(-\s+)(lat)(.*)(\n\s*)(long) // I made a small change here
and replace with:
$1$5$3$4$2
See regex101 demo.
This works perfectly fine for me in the find/replace widget but not in the search/replace across files panel. Why? See this "resolved" issue: issue: regex search and replace.
The issue seems to indicate it was provisionally "fixed" but it doesn't appear that it has been.
I was going to open a new issue but found this from earlier this week: issue: capture groups don't work when regex includes newline. So hopefully it will be fixed this iteration.
I am happy to report that this bug has been fixed in the Insiders Build 2019-09-16!
To swap words in a single file only, you can use this extension I wrote: Find and Transform and this keybinding:
{
  "key": "alt+s",                      // whatever keybinding you want
  "command": "findInCurrentFile",
  "args": {
    "find": "(lat)|(long)",
    "replace": "${1:+long}${2:+lat}",  // swap here
    "isRegex": true
  }
}
There is no reason you couldn't extend that to swap three or more words in whatever sequence you want.
${1:+long} is a conditional which says if there is a capture group 1, replace it with the text long.
You can use just the Replace feature.
If you are using Windows, the shortcut is Ctrl+H.
Ctrl+H, replace lat with dummy.
Ctrl+H, replace long with lat.
Ctrl+H, replace dummy with long.
With the Replace Rules extension
"replacerules.rules": {
"Swap lat-long 1": {
"find": ["lat","long"],
"replace": ["XYZ","ABC"]
},
"Swap lat-long 2": {
"find": ["XYZ","ABC"],
"replace": ["long","lat"]
}
},
"replacerules.rulesets": {
"Swap lat-long": {
"rules": [
"Swap lat-long 1",
"Swap lat-long 2"
]
}
}
Then execute command: Replace Rules: Run Ruleset...
Dude.
Remember the algorithm to swap 2 strings?
temp=str1
str1=str2
str2=temp
replace "long" with "TEMP".
replace "lat" with "long".
replace "TEMP" with "lat".
That's it.

AWS DynamoDB Golang issue with inserting items into table

I've been following Miguel C's tutorial on setting up a DynamoDB table in Golang, but modified my JSON to look like the sample below instead of using movies. I changed the Movie struct into a Fruit struct (so there is no extra info field), and in my schema I defined the partition key as "Name" and the sort key as "Price". But when I run my code it says
"ValidationException: One of the required keys was not given a value"
despite me printing out the input as
map[name:{
S: "bananas"
} price:{
N: "0.25"
}]
which clearly shows that String bananas and Number 0.25 both have values in them.
My JSON looks like this:
[
  {
    "name": "bananas",
    "price": 0.25
  },
  {
    "name": "apples",
    "price": 0.50
  }
]
It was a capitalization issue: changing "name" to "Name" made it work.
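A hedged Go sketch of what that typically looks like (the struct, tags, and field names below are assumptions based on the question, not the asker's actual code): the aws-sdk-go v1 dynamodbattribute marshaller falls back to json tags when no dynamodbav tag is present, so a json:"name" tag produces a lowercase "name" attribute that never satisfies a key schema of "Name"/"Price". Capitalizing the tags, or adding dynamodbav tags as below, makes the marshalled attribute names match the keys.

    package main

    import (
        "fmt"

        "github.com/aws/aws-sdk-go/service/dynamodb/dynamodbattribute"
    )

    // Fruit keeps the lowercase JSON keys but marshals to the capitalized
    // attribute names that the table's key schema expects.
    type Fruit struct {
        Name  string  `json:"name" dynamodbav:"Name"`
        Price float64 `json:"price" dynamodbav:"Price"`
    }

    func main() {
        item, err := dynamodbattribute.MarshalMap(Fruit{Name: "bananas", Price: 0.25})
        if err != nil {
            panic(err)
        }
        fmt.Println(item) // the map keys are now "Name" and "Price", matching the key schema
    }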

Notepad++ and Regex to put a bunch of words into [ ], each word in quotation marks and separated by commas

I have a list, about a thousand rows, like this:
"Categories": "Action, Adventure, Comedy, Fantasy",
"Categories": "Action, Adventure",
"Categories": "Action, Adventure, Comedy, Drama,Fantasy, Martial Arts, Mystery, Supernatural",
"Categories": "Action,Adventure, Comedy, Fantasy,Psychological, School Life, Supernatural",
and I'd like to turn it into this:
"Categories": ["Action", "Adventure", "Comedy", "Fantasy"]
"Categories": ["Action", "Adventure"]
"Categories": ["Action", "Adventure", "Comedy", "Drama", "Fantasy", "Mystery", "Supernatural"]
"Categories": ["Action", "Adventure", "Comedy", "Fantasy", "Psychological", "Supernatural"]
I've tried a bunch of regular expressions, such as
("Categories":) "(\b.*?), (\b.*?), (.*), (.*), (\w+?)",
and I'm still stuck, because I am still green at this stuff.
Please help me solve this with regex, and thank you for the answer.
In two steps:
step 1: you replace the string with an array of strings when there is more than one item
search: "Categories":\s*\K("[^",]*+[^"]+")
replace: [$1]
step 2: you replace all the commas in the string
search: (\G(?!^)|"Categories":\s*\[")[^",]+?\K\s*,\s*
replace: ", "
Try:
pattern: ("Categories":) ("[^"]*")
substitute with: $1[$2]
bye

RegEx:Replace: Dequote integers

So, I have a file with a large JSON array of objects, and unfortunately every field is wrapped in double quotes. Two fields in particular (Latitude and Longitude) need to have the quotes removed.
I just want to use RegEx within an editor's find/replace feature to remove the quotes, but I am struggling to come up with the RegEx.
This is very specific; I am just hoping there is a RegEx guru out there who could point me in the right direction on how to free the 37 and the -122 below from their quoted prisons.
{
  "ClubId": "TestWith01",
  "ClubName": "TestWith01",
  "_DistrictNumber": "K05",
  "MeetingDay1": "2nd & 4th MO",
  "MeetingTime1": "6:30 PM",
  "MeetingDay2": "",
  "URL": "http://www.someurl.com",
  "Latitude": "37",
  "Longitude": "-122",
  "MeetingAddress": {
    "Address1": "Sample With Quotes",
    "Address2": "",
    "Address3": "",
    "City": "Treasure Island",
    "State": "FL",
    "PostalCode": "33706",
    "Country": "United States"
  }
},
result = subject.replace(/"(-?\d+)"/g, "$1");
This should replace any quoted value consisting of an optional minus followed by one or more digits. You did not specify your language, so I guessed JavaScript.
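If you want to touch only those two fields (and also allow decimals), a hedged variant of the same idea anchors the match on the key names; the pattern below is an assumption based on the sample object:

    // Dequote only Latitude/Longitude, keeping the original spacing around the colon
    result = subject.replace(/"(Latitude|Longitude)"(\s*:\s*)"(-?\d+(?:\.\d+)?)"/g, '"$1"$2$3');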