Junk Character Problem with ISO-8859-1 (Latin1) Encoding

Here is a ramble about the junk characters that appear on a page from time to time, either because of wrong editor settings or for some other reason.

Take this text for example:

Europa im Mietwagen bis zu 15 % GÜNSTIGER
Große Tour – große Ersparnis!

Here is an image of the above text, in case a setting in your browser or elsewhere prevents you from seeing the real text.

[Image: problem-text]

The above text is taken from this page:

http://www.alamo.de/Hotdeals/732/de/autovermietung-europa/

Alamo has another site, National.de, and that is where I ran into the problem. I corrected the output and saw the right characters there, but after a few days the junk characters appeared again (it looked like the UTF-8 encoding problem shown in the second image below). This can happen because someone's editor is set to a different encoding, because of a tool, or during transfer. Jonas suggested encoding the text as HTML entities, and I wondered why I should encode it when the characters are already Latin1 and display properly for me. This was not the first time the problem had appeared, and I was aware of the situation, but I kept wondering how I could know beforehand that a certain character would cause this kind of problem.

The bad part of this trial and error is that I could not find a way to determine each character's encoding exactly; I learned that detecting the encoding is only a best guess in PHP. It may be the same in other languages. The good part is that I tried to develop an encoding tool that preserves the HTML tags, so that I or others can use it even after the HTML page has been written. The tool tries to convert text to named HTML entities first and only then falls back to numeric character references, which are less readable; search engines may also read named entities better than numeric references. So, while trying to write a script that detects a safe encoding for any character, I ended up developing something else. 😉

My EditPlus encoding is set to ANSI. The browser shows the page as Latin1 because I set that encoding with a meta tag, and the output comes out properly.

But when I tried to set Latin1 for the above text in EditPlus, it warned that characters may be lost. Similarly, if you switch the browser to UTF-8, the text may show some junk characters.

[Image: ANSI-view-as-latin1]
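A likely explanation for the EditPlus warning is that "ANSI" on a Western European Windows system usually means Windows-1252, which stores the euro sign and the en dash at bytes that ISO-8859-1 reserves for control characters. The sketch below checks this with a round trip through PHP's mbstring extension. The helper name survives_round_trip is made up for this illustration, and the code assumes PHP 7 or later with mbstring installed.

```php
<?php
// Hypothetical helper (not from the original post): does $char, given as
// UTF-8, survive a conversion to $encoding and back again?
function survives_round_trip(string $char, string $encoding): bool
{
    $bytes = mb_convert_encoding($char, $encoding, 'UTF-8');
    return mb_convert_encoding($bytes, 'UTF-8', $encoding) === $char;
}

// Characters are written as \u{...} escapes so the script does not depend
// on the encoding of its own source file.
$samples = [
    'u-umlaut' => "\u{00FC}",
    'en dash'  => "\u{2013}",
    'euro'     => "\u{20AC}",
];

foreach ($samples as $name => $char) {
    printf(
        "%-9s ISO-8859-1: %-3s Windows-1252: %s\n",
        $name,
        survives_round_trip($char, 'ISO-8859-1') ? 'yes' : 'no',
        survives_round_trip($char, 'Windows-1252') ? 'yes' : 'no'
    );
}
```

The umlaut survives in both encodings, while the en dash and the euro sign only survive in Windows-1252, which would match the warning about lost characters.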

Next, copy the above text into a new file and save it with UTF-8 encoding. Watch the text: it shows junk characters again.

[Image: file-encoded-as-utf-8]
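The junk in this case can be reproduced in a few lines. This is only a small sketch, assuming PHP 7 or later with mbstring: it takes the UTF-8 bytes of one umlaut and reads them as Latin1, the way a browser does when the meta tag says ISO-8859-1, and then shows the opposite mismatch.

```php
<?php
// "\u{00FC}" is u-umlaut; written as an escape so the script does not
// depend on the encoding of its own source file.
$utf8 = "\u{00FC}";            // two bytes in UTF-8: 0xC3 0xBC

// Read the UTF-8 bytes as if they were Latin1: each byte becomes its own
// character, so one letter turns into two junk characters.
$junk = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8), ' shown as ', $junk, "\n";

// The opposite mismatch: a Latin1 file holds the single byte 0xFC, which
// is not a valid UTF-8 sequence, so a UTF-8 viewer cannot decode it.
$latin1 = "\xFC";
var_dump(mb_check_encoding($latin1, 'UTF-8'));
```

In both directions the bytes are fine; only the encoding used to read them is wrong.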

So, as you see, a few characters cause problems in both ISO-8859-1 and UTF-8 even though they fit in ANSI. UTF-8 itself should not be the problem, since it includes all these characters; the junk appears when the bytes in the file and the encoding declared for the page do not match. Encoding is a slightly mysterious thing. PHP's existing functions cannot determine a string's encoding with certainty; the result is just a guess. I think the same is true in other languages as well.
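Detection stays a guess because the same bytes are valid in several single-byte encodings at once, so functions such as mb_detect_encoding() can only pick a candidate. One check that is fairly dependable is whether the bytes are valid UTF-8. The sketch below builds on that: the helper to_utf8 is hypothetical, and the Windows-1252 fallback is an assumption about where the text came from, not something PHP can find out.

```php
<?php
// Hypothetical helper (not from the original post). Assumption: anything
// that is not valid UTF-8 was saved as $fallback, here Windows-1252.
function to_utf8(string $bytes, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

// The single byte 0x80 is the euro sign in Windows-1252 ...
echo to_utf8("147, \x80"), "\n";
// ... and text that is already UTF-8 passes through unchanged.
echo to_utf8("147, \u{20AC}"), "\n";
```

If the fallback assumption is wrong (for example, the text was really ISO-8859-15), the result will still be junk, which is exactly the limitation described above.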

The tool mentioned above is meant to solve this problem. It encodes all the problem characters: it first tries to find a named HTML entity for each one and, where none exists, writes the character as a numeric character reference. The resulting output should avoid the problem described above.
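The idea can be sketched roughly as follows. This is not the actual tool, only an illustration of the approach under some assumptions: the function name encode_html_text is made up, the input is expected to be valid UTF-8, and the tag pattern is deliberately naive (it does not handle comments, scripts, or a ">" inside an attribute value).

```php
<?php
// Hypothetical sketch of the idea, not the author's actual tool.
// Assumes $html is valid UTF-8 and that tags contain no literal ">".
function encode_html_text(string $html): string
{
    // Split into text and tags; with PREG_SPLIT_DELIM_CAPTURE the tags
    // end up at the odd indexes of the result.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a tag: keep it exactly as written
        }
        // Named entities first; double_encode=false keeps existing entities.
        $part = htmlentities($part, ENT_NOQUOTES | ENT_HTML401, 'UTF-8', false);
        // Anything still outside ASCII has no HTML 4.01 name: go numeric.
        $parts[$i] = mb_encode_numericentity($part, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
    }
    return implode('', $parts);
}

echo encode_html_text("<p>Gro\u{00DF}e Tour \u{2013} 147 \u{20AC}</p>"), "\n";
```

Because the result contains only ASCII, it reads the same whether the file is later saved or served as ANSI, Latin1, or UTF-8. Non-ASCII characters inside tag attributes are left alone in this sketch.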

If the character-set problem I faced is still not clear, here is a little more explanation:

I used htmlentities() on this text (taken from nationalcar.de):

Mit über 3000 Stationen in über 80 Ländern haben wir sicher auch eine Stationen in Ihrer Nähe! Miami Airport  - Mietwagen schon ab 147, €

I received this HTML source:

Mit &uuml;ber 3000 Stationen in &uuml;ber 80 L&auml;ndern haben wir sicher auch eine Stationen in Ihrer N&auml;he! Miami Airport - Mietwagen schon ab 147, €

In the above output the euro sign was left as it was and not encoded. This text caused the problem I described above: the euro sign produced junk characters because of someone's editor setting or during transfer.
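One plausible reason the euro sign slipped through is that htmlentities() interprets its input according to its encoding argument, and before PHP 5.4 that argument defaulted to ISO-8859-1, which has no euro sign. The sketch below passes the encoding explicitly and compares three cases. It assumes PHP 7 or later, and the variable names are only for illustration.

```php
<?php
$utf8Euro   = "\u{20AC}"; // euro sign as UTF-8 (three bytes)
$cp1252Euro = "\x80";     // euro sign as the single Windows-1252 byte

// Input and declared encoding agree: the euro sign gets its named entity.
$fromUtf8   = htmlentities($utf8Euro, ENT_QUOTES, 'UTF-8');
$fromCp1252 = htmlentities($cp1252Euro, ENT_QUOTES, 'Windows-1252');

// Same byte, but declared as ISO-8859-1: there it is a control character
// with no entity, so no &euro; can be produced.
$asLatin1 = htmlentities($cp1252Euro, ENT_QUOTES, 'ISO-8859-1');

echo $fromUtf8, ' ', $fromCp1252, ' ', bin2hex($asLatin1), "\n";
```

So the function is only as good as the encoding it is told about; declaring the one the text was really saved in is what makes the euro sign come out as an entity.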

This, however, works for the above text:

mb_convert_encoding():

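<?php
// Candidate source encodings; mb_convert_encoding() tries to detect which one applies.
$encodings = array("iso-8859-1", "utf-8");
$text = 'Mit über 3000 Stationen in über 80 Ländern haben wir sicher auch eine Stationen in Ihrer Nähe! Miami Airport  - Mietwagen schon ab 147, €';
// Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
echo mb_convert_encoding($text, 'HTML-ENTITIES', $encodings);
?>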

I have been trying to understand character encoding for a long time. With difficult problems it sometimes still looks like a bit of a mystery to me. Encoding problems can also show up in email text and in databases.

During this testing I developed a small tool: Convert character to html entity. It produces &euro; for the euro symbol rather than the numeric reference.
