Discussion: Escaping post meta values
Dan Phiffer
2013-05-22 15:46:46 UTC
Hi wp-hackers,

What's the deal with post meta value escaping? I didn't see any mention of it in the documentation, but it seems important if you're ever going to store JSON data in the postmeta table (i.e., {"key":"value with \"quotes\" in the content."})

Reduced example:

// $str1 contains a backslash followed by a double quote: \"
$str1 = '\\"';

update_post_meta($post->ID, "test", $str1);
$str2 = get_post_meta($post->ID, "test", true);

// The backslash is stripped on the way in, so $str2 differs from $str1:
echo "$str1<br>$str2";

/*

Result:
\"
"

*/

Thanks,
Dan
Otto
2013-05-22 15:54:59 UTC
Post by Dan Phiffer
Hi wp-hackers,
What's the deal with post meta value escaping? I didn't see any mention of it in the documentation, but it seems important if you're ever going to store JSON data in the postmeta table (i.e., {"key":"value with \"quotes\" in the content."})
The meta functions expect unescaped data to be sent to them.

Basically, that means you shouldn't store JSON data directly, but
instead the PHP form of the data. So, json_decode it before saving it
as meta, then json_encode it if you need to send it back to a browser
or elsewhere.
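Something like this, as a rough sketch (the meta key and sample JSON
here are placeholders, not anything from Dan's actual code):

// Decode once when the data comes in (true = associative array).
$json = '{"key":"value with \"quotes\" in the content."}';
$data = json_decode($json, true);

// Store the PHP array; WordPress serializes and unserializes it for you.
update_post_meta($post->ID, 'my_api_data', $data);
$data = get_post_meta($post->ID, 'my_api_data', true);

// Re-encode only at the point where JSON is actually needed.
echo json_encode($data);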

-Otto
Andrew Nacin
2013-05-22 15:58:22 UTC
Post by Otto
Post by Dan Phiffer
Hi wp-hackers,
What's the deal with post meta value escaping? I didn't see any mention
of it in the documentation, but it seems important if you're ever going to
store JSON data in the postmeta table (i.e., {"key":"value with \"quotes\"
in the content."})
Post by Otto
The meta functions expect unescaped data to be sent to them.
Just to add to this, this is stupid. See
http://core.trac.wordpress.org/ticket/21767.
Post by Otto
Basically, meaning that you shouldn't be storing JSON data directly,
but instead storing the PHP form of the data. So, json_decode it
before saving it as meta, then json_encode it if you need to send it
back to a browser or elsewhere.
I'd agree this is a pretty good workaround. PHP can serialize what is
effectively a superset of JSON.
Dan Phiffer
2013-05-22 16:29:05 UTC
Post by Andrew Nacin
Just to add to this, this is stupid. See
http://core.trac.wordpress.org/ticket/21767.
I'd agree this is a pretty good workaround. PHP can serialize what is
effectively a superset of JSON.
"What? JSON you say? Who would ever think to use *that* for encoding metadata?"

I would argue this is insane behavior to create a workaround for, but in the meantime I think the docs should clearly explain what the deal is. As soon as I re-encode the hundreds of post metadata entries I've stored from Flickr/Instagram/Twitter I'll see about helping the next person avoid my fate.

Thanks list!
Dan
Drew
2013-05-22 16:42:37 UTC
Dan,

A simple way for you to help others "avoid [your] fate" would be to take a
few minutes and improve the docs yourself.

The Codex is a community effort and anyone with a WP.org username can edit
it.
--
-- I've kinda got a thing for WordPress > http://www.drewapicture.com
Dan Phiffer
2013-05-22 17:06:16 UTC
Post by Drew
A simple way for you to help others "avoid [your] fate" would be to take a
few minutes and improve the docs yourself.
The Codex is a community effort and anyone with a WP.org username can edit
it.
Absolutely, that's what I was talking about. I'll write it up. But first I need to fix my site.

And to be clear, the thing that's frustrating about all of this is that I'd moved on from a bit of code that seemingly worked, only to discover certain "unexpected" content was not being handled properly. If certain things can't be stored, it's important that developers know to avoid storing them.

Dan
Dan Phiffer
2013-05-23 22:49:34 UTC
Post by Dan Phiffer
Post by Drew
A simple way for you to help others "avoid [your] fate" would be to take a
few minutes and improve the docs yourself.
The Codex is a community effort and anyone with a WP.org username can edit
it.
Absolutely, that's what I was talking about. I'll write it up. But first I need to fix my site.
I've written this up here:
http://codex.wordpress.org/Function_Reference/update_post_meta#Character_Escaping

Feedback is welcome.

Thanks,
Dan

Otto
2013-05-22 16:53:32 UTC
Post by Dan Phiffer
"What? JSON you say? Who would ever think to use *that* for encoding metadata?"
I would argue this is insane behavior to create a workaround for, but in the meantime I think the docs should clearly explain what the deal is. As soon as I re-encode the hundreds of post metadata entries I've stored from Flickr/Instagram/Twitter I'll see about helping the next person avoid my fate.
I agree that it's not ideal (and indeed, stupid in a way), but I
wouldn't go so far as to call it insane.

The question is whether it makes sense to be storing pre-encoded data
or not. I agree that ideally, no matter what you pass in, you'd get
exactly the same thing back. Due to one thing and another, this turns
out to be somewhat problematic.

But in the sense of "if it's going to have limitations", the current
limitations are sensible and somewhat consistent ones. If you pass it
an array('whatever') in the raw PHP form, it will happily store it and
deal with it. But if you pre-encode it using some method, then it's
going to encode it again for storage, and the double-encoding there
makes less sense.
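Something like this, as an untested sketch (the 'demo' meta keys are
placeholders):

// Raw PHP form: stored and retrieved transparently.
update_post_meta($post_id, 'demo', array('whatever'));
var_dump(get_post_meta($post_id, 'demo', true)); // array('whatever')

// Pre-encoded form: the JSON string is stored as a string, and you
// now need an extra json_decode after every read -- two layers of
// encoding where one would do.
update_post_meta($post_id, 'demo_json', json_encode(array('whatever')));
var_dump(json_decode(get_post_meta($post_id, 'demo_json', true)));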

JSON is an encoded form of data, same as PHP serialized data. Encoding
it again for storage and then decoding it twice when you pull it out
to use that data is somewhat silly. Better to decode the JSON into a
PHP variable and store that, so the code only needs to decode it once
on the way out (which get_post_meta does for you). Then you have a raw
PHP variable, and you encode to JSON again only if that is actually
needed. Usually it isn't (you want the data, not necessarily a JSON
form of it), and decoding it only once (when you get the data from the
external service) makes more sense.

The thing is that it works fine as long as you're always storing the
basic form of the data that you're working with. Any encoded form of
that data, whether it be serialized or JSON or foo-encoded, might
cause an issue. Since you generally have to decode whatever it is to
manipulate it anyway, adding multiple layers of encode/decode is
unhelpful.

-Otto
Ryan McCue
2013-05-23 05:05:24 UTC
Post by Otto
I agree that it's not ideal (and indeed, stupid in a way), but I
wouldn't go so far as to call it insane.
I'd say that it's definitely insane. SQL escaping should be moved down
the stack as much as possible, and it should be opaque to the point that
I'd have no idea that user meta is stored in an SQL database without
looking at the code.

Escaping the data at such a high level is definitely insane. At the
opposite end of the spectrum, you end up with magic quotes, and I think
we all know why that's a horrible idea.
--
Ryan McCue
<http://ryanmccue.info/>
Justas Butkus
2013-05-22 17:06:14 UTC
Hello, list.

I'll just address part of this question: I think a quick test reveals
to the developer what's going on.

But I have a different question.
Otto suggests that it is viable to increase processing by some unknown
amount, just to accommodate the behaviour of a function that is meant
to plainly store the data, not modify it.
And Andrew suggests that PHP serialize is as good as JSON encoding, or
probably better, because it can encode more complex data structures
(PHP objects, to some extent).
I recognize that this is related to the nature of WordPress - keeping
its behaviour consistent and all.

How do you feel about the performance aspect of this question?
I am not questioning a fundamental feature of WordPress (namely,
backwards compatibility), just asking whether this could be considered
when talking about such functions.
--
Regards,
Justas
Otto
2013-05-22 17:51:31 UTC
Post by Justas Butkus
How do you feel about the performance aspect of this question?
I am not questioning a fundamental feature of WordPress (namely,
backwards compatibility), just asking whether this could be considered
when talking about such functions.
To tackle the performance question, you have to examine the more
common behavior being used.

For the basic case, I'm storing information gained from some simple
process in PHP. My data is generated or gathered from user input, and
is, say, plain text. In that case, the storage of the meta is simple,
and has no issues.

For a more complex case, I'm storing information gained from some
complex PHP process. Say, an array, or an object. In this case, the
data is serialized on saving to the DB, and unserialized when
retrieving it. Again, no issues.

The only question of performance comes from when the data is gathered
from an external source. To pick an example, let's say I get data from
the Flickr API.

The data comes back from Flickr in JSON format. Now, what am I doing
with that data? This is the key question that makes the answer
possible. There are two possibilities for usage of this Flickr JSON
structure:

1) I'm decoding it and using pieces of it to display something in the post.
2) I'm passing some or all of the data on to a Javascript process, or
making it otherwise available via an external API call.

For the first case, I can either json_decode it when I receive the
data from Flickr, or every time I display the data in some manner.
Obviously, decoding takes time, and though that time is small, it
makes more sense to decode once, get the data I actually need, and
save that in the meta storage. Saving the whole blob means I'm a)
saving data I may not use and b) having to decode it every time I need
it. Performance is better if I convert to PHP variables first and
discard the unnecessary pieces (see the sketch below).
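As a rough sketch of that pattern (the endpoint, field names, and meta
key are all made up for illustration, not Flickr's real response
format):

// Fetch and decode once, keep only the fields we actually use.
$response = wp_remote_get('https://api.flickr.com/example'); // placeholder URL
$photo = json_decode(wp_remote_retrieve_body($response), true);

$meta = array(
    'title' => $photo['title'], // hypothetical field
    'url'   => $photo['url'],   // hypothetical field
);

// WordPress serializes the array on save and unserializes it on read.
update_post_meta($post_id, 'flickr_photo', $meta);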

For the second case, it would indeed make more sense to store the raw
data, if all of that raw data is needed by the final process using it.
API calls often return more than we actually need, though, so in terms
of space savings in the DB, it makes more sense to decode the data and
pare it down to what I need to use, then pass that and only that data
along later. We still have to json_encode the data later, but it's
probably substantially less data. This tradeoff is difficult to
measure in the general sense, and you'd need to profile your exact
case to know the faster approach.

Realistically, the first case is probably more common. The reason to
get data from an external service is to use that data, generally
speaking. It's rare that you pass that data on to some third system.
And even if you are passing it to a browser via JS, it's better to
pass small amounts of data instead of relaying large API responses.

Storing whole API responses from external calls, unaltered, rarely
makes sense from a performance viewpoint. Sometimes, yes. But not
often. Better to decode the moment you receive the data, then
manipulate it there, then store just the pieces you need. Smaller.
Faster.

-Otto
Justas Butkus
2013-05-22 20:15:35 UTC
Hello, Otto.

Thank you for taking the time to write this extensive response.

Actually I was thinking along the lines of the "more complex case" you
described, where complex data is generated by some PHP process. For
example, if I were to store some measurement values, I would choose an
array, or an associative array. A structure that is simple and complex
at the same time.

I might pass it directly to the storage function, and I would get what I
expect: array stored, array retrieved.

But then, beneath this, there is a `serialize()`.
It is great, as it supports callbacks and can serialize data of any
complexity that PHP may handle.
On the other hand, because of this its performance is not as good as
that of `json_encode()`, for example, when serializing rather simple
structures (lists, hashes, etc.).
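(A quick micro-benchmark along these lines is easy to run; this is
just a sketch, and the numbers will vary by PHP version and data
shape:)

// Compare serialize() and json_encode() on a simple list of hashes.
$data = array_fill(0, 1000, array('id' => 1, 'value' => 3.14, 'label' => 'test'));

$start = microtime(true);
for ($i = 0; $i < 100; $i++) { serialize($data); }
printf("serialize:   %.4fs\n", microtime(true) - $start);

$start = microtime(true);
for ($i = 0; $i < 100; $i++) { json_encode($data); }
printf("json_encode: %.4fs\n", microtime(true) - $start);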

So, given that, I might use `json_encode()` and then pass the
resulting data to the meta storage function, where I expect it to be
stored as-is (as a string, which is what it looks like).

That's just one use case I came up with.
It seems rather legitimate to me for a developer to build upon the
WordPress foundation and use some subset of its features, where
possible, to achieve better performance, even in small parts.

Closing note: thank you for sharing your view, that was really insightful.
--
Regards,
Justas
Otto
2013-05-22 20:52:18 UTC
Post by Justas Butkus
On the other hand, because of this its performance is not as good as
that of `json_encode()`, for example, when serializing rather simple
structures (lists, hashes, etc.).
Tests show that serialize is faster for most common data. json_encode
doesn't catch up until the data gets large. Too large to store as
meta, honestly.

See http://techblog.procurios.nl/k/618/news/view/34972/14863/Cache-a-large-array-JSON-serialize-or-var_export.html
for some numbers and graphs and so forth.
Post by Dan Phiffer
But I'm running into a new problem: when I pass objects straight into update_post_meta(), any data structure that includes Emoji characters results in a postmeta value that comes back empty from get_post_meta(). Is there a known workaround for this?
The json_encode function does have the advantage that it encodes UTF-8
characters into escape sequences and back, while serialize does not.
The serialize/unserialize functions also are non-forgiving of
malformed data, while json is quite forgiving. And depending on the
underlying MySQL version, character set, etc, it's possible that the
data being stored gets munged up by MySQL and thus doesn't get stored
properly (or undergoes a character set conversion), and so when you
get it back, the unserialize fails, and you get nothing.

This is indeed a problem. There is not a "good" workaround that I know
of. You could try using iconv() to convert the problem data to a
different character set before passing it to the meta functions.
Converting problem strings from UTF-8 to ISO-8859-1 has worked for me
in the past.
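(For what it's worth, json_encode sidesteps part of this: by default
it escapes non-ASCII characters into plain-ASCII \u sequences, which
even a 3-byte 'utf8' MySQL table can hold. My guess is that 4-byte
Emoji are what's getting munged here, though that's an assumption.
Note the backslashes in those escapes still need the slash-stripping
workaround before update_post_meta:)

// json_encode escapes multibyte characters to ASCII \u sequences.
$data = array('caption' => 'Sunglasses: 😎');
$json = json_encode($data);
echo $json . "\n";
// {"caption":"Sunglasses: \ud83d\ude0e"}  -- pure ASCII

// Decoding restores the original 4-byte character.
$decoded = json_decode($json, true);
echo $decoded['caption'] . "\n"; // Sunglasses: 😎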

-Otto
Dan Phiffer
2013-05-22 21:43:16 UTC
Post by Otto
Post by Dan Phiffer
But I'm running into a new problem: when I pass objects straight into update_post_meta(), any data structure that includes Emoji characters results in a postmeta value that comes back empty from get_post_meta(). Is there a known workaround for this?
The json_encode function does have the advantage that it encodes UTF-8
characters into escape sequences and back, while serialize does not.
The serialize/unserialize functions also are non-forgiving of
malformed data, while json is quite forgiving. And depending on the
underlying MySQL version, character set, etc, it's possible that the
data being stored gets munged up by MySQL and thus doesn't get stored
properly (or undergoes a character set conversion), and so when you
get it back, the unserialize fails, and you get nothing.
From my Sequel Pro connection, the wp_postmeta table info shows up as 'utf8' encoding. I've heard of this problem, and I recall there are several potential points in the MySQL pipeline where the encoding might be corrupted. The fact that json_encode handles UTF-8 properly is a pretty strong advantage for me, so I think double-escaping the JSON encoding wins the day, despite its hackiness.
Post by Otto
This is indeed a problem. There is not a "good" workaround that I know
of. You could try using iconv() to convert the problem data to a
different character set before passing it to the meta functions.
Converting problem strings from UTF-8 to ISO-8859-1 has worked for me
in the past.
Ah, but the conversion process would likely strip out the good stuff. The Emoji are to be preserved! 😎

For what it's worth, here are the various incoming data I'm working with:
http://www.momaps1.org/expo1/instagram/
http://www.momaps1.org/expo1/flickr/
http://www.momaps1.org/expo1/twitter/

Thanks again,
Dan
Dan Phiffer
2013-05-22 20:19:11 UTC
Thanks Otto, I appreciate your thoughtful responses here. This stuff is tricky, and I understand that WordPress doesn't attempt to cover each and every possible edge case.

I hadn't considered passing in non-string values directly, but that seems like an elegant solution on the face of it. But I'm running into a new problem: when I pass objects straight into update_post_meta(), any data structure that includes Emoji characters results in a postmeta value that comes back empty from get_post_meta(). Is there a known workaround for this?

I've found another solution -- go back to JSON, but double-up on my escaping. It seems to be working for me so far.
Like this: http://wordpress.stackexchange.com/a/60441
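(Roughly the idea, as an untested sketch: update_post_meta() strips
one level of slashes on the way in, so add one before saving. The
'my_json' key is a placeholder:)

// Double-up on escaping: addslashes() compensates for the slash
// stripping that the meta functions perform.
$json = json_encode($data);
update_post_meta($post_id, 'my_json', addslashes($json));

// The value comes back as the original JSON string.
$json = get_post_meta($post_id, 'my_json', true);
$data = json_decode($json, true);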

Thanks again,
Dan