Determine if string is valid XML in Postgres - regex

I have a text field in a table that contains JSON data as well as XML data. As I want to work with XML data only if it's valid XML, I want a way to make sure I can cast the string as XML without producing an error when '{"key":"val"}'::XML is possible.
Basically I want a function select isxml('{"key":"val"}) to return false, and select isxml('<key>1</key>') to be true.
I checked existing Postgres functions such as xml_is_well_formed, but they still return true when checking JSON strings. Maybe I can catch the error and deal with it in exceptions after a bad cast? Is there a good way to do this?

One possibility would be to use the xml_is_well_formed together with a function that checks whether or not the text content is a valid json. I.e.:
create or replace function is_valid_json(content text)
returns boolean
as
$$
begin
return (content::json is not null);
exception
when others then
return false;
end;
And in your query, you do [...] xml_is_well_formed(content) and not is_valid_json(content) [...].

My temporary solution is as follows
CREATE OR REPLACE FUNCTION isjson(p_json text)
RETURNS integer
LANGUAGE plpgsql
IMMUTABLE
AS $function$
begin
perform (p_json::json is not null);
return 1;
exception
when others then
return 0;
end;
$function$;
CREATE OR REPLACE FUNCTION isxml(p_xml text)
RETURNS boolean
LANGUAGE plpgsql
IMMUTABLE
AS $function$
BEGIN
PERFORM (p_xml::XML IS NOT NULL);
IF (xml_is_well_formed(p_xml)
AND NOT (CASE WHEN isjson(p_xml) = 1 THEN TRUE ELSE false END)
AND (SELECT p_xml ~ '^<.*>') )THEN -- regex matches <>, this may have uncovered edge cases
RETURN true;
ELSE
RETURN false;
END IF;
EXCEPTION
WHEN OTHERS THEN
RETURN false;
END;
$function$;
Note: my isjson function returns integer due to other legacy compatibility reasons, it would be easier to use boolean for this specific case. This should rule out most problematic cases but have lots of limitations in the regex used, accepting suggestions for improvement.

Related

Exception handling in PostgreSQL function with Django

have write the following store procedure in Postgres. This SP simply accept the incoming parameters, insert it into the table and return current identity. More I have also declare an addition variable that will tell is the sp runs successfully or not.
I am new to Postgres and have not much knowledge about Postgres way to do this. I want some thing like BEGIN TRY, END TRY and BEGIN CATCH, END CATCH like we do in MSSQL.
CREATE OR REPLACE FUNCTION usp_save_message(msg_sub character varying(80), msg_content text, msg_type character(12), msg_category character(255),msg_created_by character(255),msg_updated_by character(255))
RETURNS msg_id character, success boolean AS
$BODY$
DECLARE
msg_id character;
success boolean;
BEGIN
BEGIN TRY:
set success = 0
set msg_id = INSERT INTO tbl_messages(
message_subject, message_content, message_type, message_category,
created_on, created_by, updated_on, updated_by)
VALUES (msg_sub, msg_cont, msg_type,msg_category, LOCALTIMESTAMP,
msg_created_by, LOCALTIMESTAMP, msg_updated_by) RETURNING message_id;
set success = 1
RETURN msg_id,success;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
I want something like this:
begin proc()
BEGIN
BEGIN TRY:
set success = 0
execute the query
set success = 1
END TRY
BEGIN CATCH:
set success = 0
END CATCH
set success = 1
END
More I have to catched both these return values in django views.
I have updated the question and it is as now;
Here is the table,
CREATE TABLE tbl_messages
(
message_subject character varying(80),
message_content text,
message_type character(12),
message_category character(255),
created_on timestamp without time zone,
created_by character(255),
updated_on timestamp without time zone,
updated_by character(255),
message_id serial NOT NULL,
CONSTRAINT tbl_messages_pkey PRIMARY KEY (message_id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE tbl_messages
OWNER TO gljsxdlvpgfvui;
Here is the function i created;
CREATE FUNCTION fn_save_message(IN msg_sub character varying, IN msg_cont text, IN msg_type character varying, IN msg_category character varying, IN msg_created_by character varying, IN msg_updated_by character varying, OUT success boolean, OUT msg_id integer) RETURNS integer AS
$BODY$BEGIN
BEGIN
INSERT INTO tbl_messages
(message_subject, message_content, message_type, message_category,
created_on, created_by, updated_on, updated_by)
VALUES
(msg_sub, msg_cont, msg_type, msg_category, LOCALTIMESTAMP,
msg_created_by, LOCALTIMESTAMP, msg_updated_by)
returning message_id
into msg_id;
success := true;
EXCEPTION
WHEN others then
success := false;
msg_id := null;
END;
return msg_id,success;
END;$BODY$
LANGUAGE plpgsql VOLATILE NOT LEAKPROOF
COST 100;
ALTER FUNCTION public.fn_save_message(IN character varying, IN text, IN character varying, IN character varying, IN character varying, IN character varying)
OWNER TO gljsxdlvpgfvui;
But it is not still working... i don't know what id have done wrong now, any django/postgres expert here kindly help me out.
There are several problems with your function:
Statements need to be terminated with a ; - always
Variable assignments are done using := (see: http://www.postgresql.org/docs/current/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-ASSIGNMENT)
You can't return more than one value from a function (unless you create a set returning function, return an object or use out parameters)
Boolean values are true or false. Not 0 or 1 (those are numbers)
The result of an automatically generated ID value is better obtained using lastval() or `` INSERT ... RETURNING expressions INTO ...not through aSET` statement.
Exception handling is done using the exception clause as documented in the manual
So you need something like this:
DECLARE
....
BEGIN
BEGIN
INSERT INTO tbl_messages
(message_subject, message_content, message_type, message_category,
created_on, created_by, updated_on, updated_by)
VALUES
(msg_sub, msg_cont, msg_type,msg_category, LOCALTIMESTAMP,
msg_created_by, LOCALTIMESTAMP, msg_updated_by)
returning message_id
into msg_id;
success := true;
EXCEPTION
WHEN others then
success := false;
msg_id := null;
END;
return msg_id;
END;
But as I said: you can't return more than one value from a function. The only way to do this is to declare OUT parameters, but personally I find them a bit hard to handle in SQL clients.
You have the following options to report an error to the caller:
let the caller handle the exception/error that might arise (which is what I prefer)
define a new user defined data type that contains the message_id and the success flag and return that (but that means you lose the error message!)
return a NULL for the message_id to indicate that something went wrong (but that also means you lose the error information)
Use out parameters to pass both values. An example is available in the manual: http://www.postgresql.org/docs/current/static/xfunc-sql.html#XFUNC-OUTPUT-PARAMETERS
CREATE FUNCTION fn_save_message3(IN msg_sub character varying, IN msg_cont text, IN msg_type character varying, IN msg_category character varying, IN msg_created_by character varying, IN msg_updated_by character varying) RETURNS integer AS
$BODY$ DECLARE msg_id integer := 0;
BEGIN
INSERT INTO tbl_messages
(message_subject, message_content, message_type, message_category,
created_on, created_by, updated_on, updated_by)
VALUES
(msg_sub, msg_cont, msg_type, msg_category, LOCALTIMESTAMP,
msg_created_by, LOCALTIMESTAMP, msg_updated_by);
Select into msg_id currval('tbl_messages_message_id_seq');
return msg_id;
END;$BODY$
LANGUAGE plpgsql VOLATILE NOT LEAKPROOF
COST 100;
ALTER FUNCTION public.fn_save_message(IN character varying, IN text, IN character varying, IN character varying, IN character varying, IN character varying)
OWNER TO gljsxdlvpgfvui;
SELECT fn_save_message3('Test','fjaksdjflksadjflas','email','news','taqi#gmail.com','');

PL/SQL optimize searching a date in varchar

I have a table, that contains date field (let it be date s_date) and description field (varchar2(n) desc). What I need is to write a script (or a single query, if possible), that will parse the desc field and if it contains a valid oracle date, then it will cut this date and update the s_date, if it is null.
But there are one more condition - there are must be exactly one occurence of a date in the desc. If there are 0 or >1 - nothing should be updated.
By the time I came up with this pretty ugly solution using regular expressions:
----------------------------------------------
create or replace function to_date_single( p_date_str in varchar2 )
return date
is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
begin
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)((.|\n|\t|\s)*((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d))?';
pResStr := regexp_substr(p_date_str, pRegEx);
if not (length(pResStr) = 10)
then return null;
end if;
l_date := to_date(pResStr, 'dd.mm.yyyy');
return l_date;
exception
when others then return null;
end to_date_single;
----------------------------------------------
update myTable t
set t.s_date = to_date_single(t.desc)
where t.s_date is null;
----------------------------------------------
But it's working extremely slow (more than a second for each record and i need to update about 30000 records). Is it possible to optimize the function somehow? Maybe it is the way to do the thing without regexp? Any other ideas?
Any advice is appreciated :)
EDIT:
OK, maybe it'll be useful for someone. The following regular expression performs check for valid date (DD.MM.YYYY) taking into account the number of days in a month, including the check for leap year:
(((0[1-9]|[12]\d|3[01])\.(0[13578]|1[02])\.((19|[2-9]\d)\d{2}))|((0[1-9]|[12]\d|30)\.(0[13456789]|1[012])\.((19|[2-9]\d)\d{2}))|((0[1-9]|1\d|2[0-8])\.02\.((19|[2-9]\d)\d{2}))|(29\.02\.((1[6-9]|[2-9]\d)(0[48]|[2468][048]|[13579][26])|((16|[2468][048]|[3579][26])00))))
I used it with the query, suggested by #David (see accepted answer), but I've tried select instead of update (so it's 1 regexp less per row, because we don't do regexp_substr) just for "benchmarking" purpose.
Numbers probably won't tell much here, cause it all depends on hardware, software and specific DB design, but it took about 2 minutes to select 36K records for me. Update will be slower, but I think It'll still be a reasonable time.
I would refactor it along the lines of a single update query.
Use two regexp_instr() calls in the where clause to find rows for which a first occurrence of the match occurs and a second occurrence does not, and regexp_substr() to pull the matching characters for the update.
update my_table
set my_date = to_date(regexp_subtr(desc,...),...)
where regexp_instr(desc,pattern,1,1) > 0 and
regexp_instr(desc,pattern,1,2) = 0
You might get even better performance with:
update my_table
set my_date = to_date(regexp_subtr(desc,...),...)
where case regexp_instr(desc,pattern,1,1)
when 0 then 'N'
else case regexp_instr(desc,pattern,1,2)
when 0 then 'Y'
else 'N'
end
end = 'Y'
... as it only evaluates the second regexp if the first is non-zero. The first query might also do that but the optimiser might choose to evaluate the second predicate first because it is an equality condition, under the assumption that it's more selective.
Or reordering the Case expression might be better -- it's a trade-off that's difficult to judge and probably very dependent on the data.
I think there's no way to improve this task. Actually, in order to achieve what you want it should get even slower.
Your regular expression matches text like 31.02.2013, 31.04.2013 outside the range of the month. If you put year in the game,
it gets even worse. 29.02.2012 is valid, but 29.02.2013 is not.
That's why you have to test if the result is a valid date.
Since there isn't a full regular expression for that, you would have to do it by PLSQL really.
In your to_date_single function you return null when a invalid date is found.
But that doesn't mean there won't be other valid dates forward on the text.
So you have to keep trying until you either find two valid dates or hit the end of the text:
create or replace function fn_to_date(p_date_str in varchar2) return date is
l_date date;
pRegEx varchar(150);
pResStr varchar(150);
vn_findings number;
vn_loop number;
begin
vn_findings := 0;
vn_loop := 1;
pRegEx := '((0[1-9]|[12][0-9]|3[01])[.](0[1-9]|1[012])[.](19|20)\d\d)';
loop
pResStr := regexp_substr(p_date_str, pRegEx, 1, vn_loop);
if pResStr is null then exit; end if;
begin
l_date := to_date(pResStr, 'dd.mm.yyyy');
vn_findings := vn_findings + 1;
-- your crazy requirement :)
if vn_findings = 2 then
return null;
end if;
exception when others then
null;
end;
-- you have to keep trying :)
vn_loop := vn_loop + 1;
end loop;
return l_date;
end;
Some tests:
select fn_to_date('xxxx29.02.2012xxxxx') c1 --ok
, fn_to_date('xxxx29.02.2012xxx29.02.2013xxx') c2 --ok, 2nd is invalid
, fn_to_date('xxxx29.02.2012xxx29.02.2016xxx') c2 --null, both are valid
from dual
As you are going to have to do try and error anyway one idea would be to use a simpler regular expression.
Something like \d\d[.]\d\d[.]\d\d\d\d would suffice. That would depend on your data, of course.
Using #David's idea you could filter the ammount of rows to apply your to_date_single function (because it's slow),
but regular expressions alone won't do what you want:
update my_table
set my_date = fn_to_date( )
where regexp_instr(desc,patern,1,1) > 0

How to read semicolon separated certain values from a QString?

I am developing an application using Qt/KDE. While writing code for this, I need to read a QString that contains values like ( ; delimited)
<http://example.com/example.ext.torrent>; rel=describedby; type="application/x-bittorrent"; name="differentname.ext"
I need to read every attribute like rel, type and name into a different QString. The apporach I have taken so far is something like this
if (line.contains("describedby")) {
m_reltype = "describedby" ;
}
if (line.contains("duplicate")) {
m_reltype = "duplicate";
}
That is if I need to be bothered only by the presence of an attribute (and not its value) I am manually looking for the text and setting if the attribute is present. This approach however fails for attributes like "type" and name whose actual values need to be stored in a QString. Although I know this can be done by splitting the entire string at the delimiter ; and then searching for the attribute or its value, I wanted to know is there a cleaner and a more efficient way of doing it.
As I understand, the data is not always an URL.
So,
1: Split the string
2: For each substring, separate the identifier from the value:
id = str.mid(0,str.indexOf("="));
value = str.mid(str.indexOf("=")+1);
You can also use a RegExp:
regexp = "^([a-z]+)\s*=\s*(.*)$";
id = \1 of the regexp;
value = \2 of the regexp;
I need to read every attribute like rel, type and name into a different QString.
Is there a gurantee that this string will always be a URL?
I wanted to know is there a cleaner and a more efficient way of doing it.
Don't reinvent the wheel! You can use QURL::queryItems which would parse these query variables and return a map of name-value pairs.
However, make sure that your string is a well-formed URL (so that QURL does not reject it).

Validating a Salesforce Id

Is there a way to validate a Salesforce ID, maybe using RegEx? They are normally 15 chars or 18 chars but do they follow a pattern that we can use to check that it's a valid id.
There are two levels of validating salesforce id:
check format using regular expression [a-zA-Z0-9]{15}|[a-zA-Z0-9]{18}
for 18-characted ids you can check the the 3-character checksum:
Code examples provided in comments:
C#
Go
Javascript
Ruby
Something like this should work:
[a-zA-Z0-9]{15,18}
It was suggested that this may be more correct because it prevents Ids with lengths of 16 and 17 characters to be rejected, also we try to match against 18 char length first with 15 length as a fallback:
[a-zA-Z0-9]{18}|[a-zA-Z0-9]{15}
Just use instanceOf to check if the string is an instance of Id.
String s = '1234';
if (s instanceOf Id) System.debug('valid id');
else System.debug('invalid id');
The easiest way I've come across, is to create a new ID variable and assign a String to it.
ID MyTestID = null;
try {
MyTestID = MyTestString; }
catch(Exception ex) { }
If MyTestID is null after trying to assign it, the ID was invalid.
This regex has given me the optimal results so far.
\b[a-z0-9]\w{4}0\w{12}|[a-z0-9]\w{4}0\w{9}\b
You can also check for 15 chars, and then add an extra 3 chars optional, with an expression similar to:
^[a-z0-9]{15}(?:[a-z0-9]{3})?$
on i mode, or not:
^[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?$
Demo
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
Javascript: /^(?=.*?\d)(?=.*?[a-z])[a-z\d]{18}$/i
These were the Salesforce Id validation requirements for me.
18 characters only
At least one digit
At least one alphabet
Case insensitive
Test cases
Should fail
1
a
1234
abgcde
1234aDcde
12345678901234567*
123456789012345678
abcDefghijabcdefgh
Should pass
1234567890abcDeFgh
1234abcd1234abcd12
abcd1234abcd1234ab
1abcDefhijabcdefgf
abcDefghijabcdefg1
12345678901234567a
a12345678901234567
For understanding the regex, please refer this thread
The regex provided by Daniel Sokolowski works perfectly to verify if the id is in the correct format.
If you want to verify if an id corresponds to an actual record in the database, you'll need to first find the object type from the first three characters (commonly known as prefix) and then query the object type:
boolean isValidAndExists(String key) {
Map<String, Schema.SObjectType> objTypes = Schema.getGlobalDescribe();
for (Schema.SObjectType objType : objTypes.values()) {
Schema.DescribeSObjectResult objDesc = objType.getDescribe();
if (objDesc.getKeyPrefix() == key.substring(0,3)) {
String objName = objDesc.getName();
String query = 'SELECT Id FROM ' + objName + ' WHERE Id = \'' + key + '\'';
SObject[] objs = Database.query(query);
return !objs.isEmpty();
}
}
return false;
}
Be aware that Schema.getGlobalDescribe can be an expensive operation and degrade the performance of your application if you use that often.
If you need to check that often, I recommend creating a Custom Setting or Custom Metadata to store the relation between prefixes and object types.
Assuming you want to validate Ids in Apex, there are a few approaches discussed in the other answers. Here is an alternative, with notes on the various approaches.
The try-catch method (credit to #matt_k) certainly works, but some folks worry about overhead, especially if testing many Ids.
I used instanceof Id for a long time (credit to #melani_s), until I discovered that it sometimes gives the wrong answer (e.g., '481D0B74-41CF-47E9').
Multiple answers suggest regexen. As the accepted answer correctly points out (credit to #zacheusz), 18 character Ids are only valid if their checksums are correct, which means the regex solutions can be wrong. That answer also helpfully provides code in several languages to test Id checksums. But not in Apex.
I was going to implement the checksum code in Apex, but then I realized the Salesforce had already done the work, so instead I just convert 18 digit Ids to 15 digit Ids (via .to15() which uses the checksum to fix capitalization, as opposed to truncating the string) and then back to 18 digits to let SF do the checksum calc, then I compare the original checksum and the new one. This is my method:
static Pattern ID_REGEX = Pattern.compile('[a-zA-Z0-9]{15}(?:[A-Z0-5]{3})?');
/**
* #description Determines if a string is a valid SalesforceId. Confirms checksum of 18 digit Ids.
* Works for cases where `x instanceof id` returns the wrong answer, like '481D0B74-41CF-47E9'.
* Does NOT check for the existence of a record with the given Id.
* #param s a string to validate
*
* #return true if the string `s` is a valid Salesforce Id.
*/
public static Boolean isValidId(String s) {
Matcher m = ID_REGEX.matcher(s);
if (m.matches() == false) return false; // if it doesn't match the regex it cannot be valid
if (s.length() == 15) return true; // if 15 char string matches the regex, assume it must be valid
String check = (Id)((Id)s).to15(); // Convert to 15 char Id, then to Id and back to string, giving correct 18-char Id
return s.right(3) == check.right(3); // if 18 char string matches the regex, valid if checksum correct
}
Additionally checking getSObjectType() != null would be perfect if we are dealing with Salesforce records
public static boolean isRecordId(string recordId){
try{
return string.isNotBlank(recordId) && ((Id)recordId.trim()).getSObjectType() != null;
}catch(Exception ex){
return false;
}
}

Elegant way to distinct Path or Entry key

I have an application loading CAD data (Custom format), either from the local filesystem specifing an absolute path to a drawing or from a database.
Database access is realized through a library function taking the drawings identifier as a parameter.
the identifiers have a format like ABC 01234T56-T, while my paths a typical windows Paths (eg x:\Data\cadfiles\cadfile001.bin).
I would like to write a wrapper function Taking a String as an argument which can be either a path or an identifier which calls the appropriate functions to load my data.
Like this:
Function CadLoader(nameOrPath : String):TCadData;
My Question: How can I elegantly decide wether my string is an idnetifier or a Path to a file?
Use A regexp? Or just search for '\' and ':', which are not appearing in the Identifiers?
Try this one
Function CadLoader(nameOrPath : String):TCadData;
begin
if FileExists(nameOrPath) then
<Load from file>
else
<Load from database>
end;
I would do something like this:
function CadLoader(nameOrPath : String) : TCadData;
begin
if ((Pos('\\',NameOrPath) = 1) {UNC} or (Pos(':\',NameOrPath) = 2) { Path })
and FileExists(NameOrPath) then
begin
// Load from File
end
else
begin
// Load From Name
end;
end;
The RegEx To do the same thing would be: \\\\|.:\\ I think the first one is more readable.
In my opinion, the K.I.S.S. principle applies (or Keep It Simple Stupid!). Sounds harsh, but if you're absolutely certain that the combination :\ will never be in your identifiers, I'd just look for it on the 2nd position of the string. Keeps things understandable and readable. Also, one more quote:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems. - Jamie Zawinski
You should pass in an additional parameter that says exactly what the identifier actually represents, ie:
type
CadLoadType = (CadFromPath, CadFromDatabase);
Function CadLoader(aType: CadLoadType; const aIdentifier: String): TCadData;
begin
case aType of
CadFromPath: begin
// aIdentifier is a file path...
end;
CadFromDatabase: begin
// aIdentifier is a database ID ...
end;
end;
end;
Then you can do this:
Cad := CadLoader(CadFromFile, 'x:\Data\cadfiles\cadfile001.bin');
Cad := CadLoader(CadFromDatabase, 'ABC 01234T56-T');