If you’ve ever posted a question in a forum, particularly about XML, and found that the main part of your question, the XML itself, had disappeared when it was published, then this short article is for you! I see this all the time on ProZ.com, or when people ask me questions in the comments of this blog, and I can imagine the frustration they must feel as they post it again once or twice… all to no avail!
The reason is that many forums and blogs require you to write the reserved characters in HTML as HTML entities. Of course everyone knows this, and the forums and blogs in question always make it really obvious and provide guidance on how to overcome it… not!
So what we see is this sort of thing:
I have posted the problem elements here:
The Intricacies of Using HTML Entities
Reserved characters in HTML must be replaced with character entities.
Please explain how I extract the content of the attributes for
translation.
Or sometimes this:
I have posted the problem element here:
[heading author="Cameron László" ]The Intricacies of Using HTML Entities
[/heading]
[body subject="Blog Posts" ]Reserved characters in HTML must be replaced
with character entities.[/body]
Please explain how I extract the content of the attribute for translation.
PS: I used these square brackets [ ] because this stupid forum won't take
the proper less than or greater symbols in my file!
The second one gets the point across of course, but if you’ve had this problem I bet you’d like to know how to get around it, wouldn’t you? It is very frustrating when the code you really wanted to show was this, especially if there’s a lot of it:
<heading author="Cameron László" >The Intricacies of Using HTML Entities
</heading>
<body subject="Blog Posts" >Reserved characters in HTML must be replaced
with character entities.</body>
The problem is that the less than and greater than symbols are what are known as “reserved characters” in HTML and in XML. So if you want to use them in a tool that converts the text you write into HTML or XML, and that tool doesn’t check for them and do the conversion for you, then you have to use character entities instead. In this case you only need two, one for the less than symbol and one for the greater than symbol. So you would replace the characters as follows:
< should be written as &lt;
> should be written as &gt;
The easiest way to do this is to write the code first, and then search and replace in your editor. So it should be posted into the offending forum, or comments window, like this:
&lt;heading author="Cameron László" &gt;The Intricacies of Using HTML
Entities&lt;/heading&gt;
&lt;body subject="Blog Posts" &gt;Reserved characters in HTML must be
replaced with character entities.&lt;/body&gt;
Looks messy, but it will do the trick. One thing to watch out for though is that some forums offer a preview feature… in fact ProZ.com is a case in point. If you use the preview it will display your text exactly as you want, and now you’ll be feeling really pleased with yourself and even beginning to enjoy the fact you can do something that others may not be able to. So you post it and refresh the page to admire your handiwork…
…oh bloody hell! The damn post is still looking as bad as before! This is because the preview feature not only previews it perfectly, it also replaces the entities you typed so carefully with the converted characters, so when you post you get what you had before you learned this little trick. So my tip is, before you use the preview, copy the entire text into your clipboard or into a text file. Then after checking the preview looks good you can paste it back in before pressing the submit button!
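If you would rather not do the search and replace by hand, the same conversion can be scripted. The following is only a minimal sketch in Python, and the function name escape_for_forum is my own invention for illustration. It relies on the standard library's html.escape, which, as well as the less than and greater than symbols, also converts the ampersand. That goes a little further than the two replacements described earlier, but it keeps any ampersands in your code from being misread as the start of an entity.

```python
import html


def escape_for_forum(snippet: str) -> str:
    """Replace HTML reserved characters with entities.

    Hypothetical helper for illustration: quote=False leaves single and
    double quotes untouched, so only &, < and > are converted.
    """
    return html.escape(snippet, quote=False)


xml = (
    '<heading author="Cameron László" >'
    "The Intricacies of Using HTML Entities</heading>"
)

# Paste the printed text into the forum instead of the raw XML.
print(escape_for_forum(xml))
```

Running html.unescape on the result should give back the original XML, which is a handy way to check that nothing was lost before you post.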
Reserved Characters
These are not the only reserved characters, so now that you know why entities are needed I recommend you bookmark these pages. They provide a simple and useful guide to what HTML entities are all about and which other characters need to be replaced as well!
w3schools.com – HTML Entities
freeformatter.com – extensive list of html entities
wikipedia.org – List of XML and HTML character entity references
These may not be the most complete or technical references for the real experts, but for the purpose of this article I think they do the job!
Ha ha! Been there, done that! 🙂
Nice one, Paul. I’m sure that for many this information will solve this mystery.
Personally, I would also recommend that users of technical forums where code snippets are often posted urge the forum owners to upgrade the parser to better support HTML, or at least have it support BB Code or the (X)HTML Code tag for enclosing code blocks.
This post shouldn’t be needed in this day and age (but sadly it is, from a practical standpoint).
Yes, I remember Jerzy Czopik explained this to me several years ago, and I put it onto a Post It sticker on my monitor, saving many headaches.
Another good trick: if you come across a post in ProZ where a big chunk is missing (as in your 1st example), just click the “quote” button in the same post and you’ll see the missing symbols and text.
Good tip on the quote button, Emma. I do this sometimes so I can correct the quoted question at least.