I came across a mail on a ColdFusion forum by Brett Suwyn:
I am trying to read the contents of a file and then output it verbatim.
The problem is that if I use cffile to read the file into a variable and
then output it (or readBinary into a variable and then ToString to output
it), I have to specify an encoding. But I don’t necessarily know the
encoding (and if I specify the wrong one, the file contents are altered on
output) so I just want to output it exactly as it came in.
I thought this was an interesting question because character encodings are something I don’t have a deep understanding of. Here was an excuse to find out more!
Usually when I want to have a deep understanding of a subject I start with first principles.
So what is character encoding? There are several good resources on the web, but I found
Jukka “Yucca” Korpela’s site (http://www.cs.tut.fi/~jkorpela/chars.html) on the issue to be very comprehensive, with many links for further explanation.
So basically the only way to determine the correct encoding of a file is if the program that created the file specifies it in the contents of the file (in MS Word, for example, or in the encoding attribute of an XML declaration)
<?xml version="1.0" encoding="UTF-8"?>
An example of encoding specified in XML
or if the information is passed along with the file when it is received, with a charset specified in an email for example.
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
An example of encoding specified in a MIME header in an email
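To make the two cases above concrete, here is a minimal Python sketch that pulls the declared encoding out of an XML declaration and out of a Content-Type header value. The function names are my own, not from any library, and the XML check assumes the file starts with an ASCII-compatible declaration (so it would miss a UTF-16 file, for instance). Treat it as an illustration rather than a complete detector.

```python
import re
from email.message import Message

# Matches the encoding attribute inside an XML declaration at the very
# start of the data. Assumes an ASCII-compatible byte encoding.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def mime_declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None.

    The standard library lower-cases the charset it returns.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


if __name__ == "__main__":
    print(xml_declared_encoding(b'<?xml version="1.0" encoding="UTF-8"?><a/>'))
    print(mime_declared_charset("text/plain; charset=ISO-8859-1; format=flowed"))
```

If neither function returns a value, the encoding simply was not declared, and you are back to guessing.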
So back to Brett. In his email he didn’t specify the origin of the files that he was including. But since he is talking about a ColdFusion application, I’m going to make an assumption: the files are uploaded via the browser.
RFC 1867 is the RFC that deals with Form-based File Upload in HTML
(http://www.faqs.org/rfcs/rfc1867.html)
Section 3.3 deals with encoding
The value supplied for a part may need to be
encoded and the “content-transfer-encoding” header supplied if the
value does not conform to the default encoding.
The default encoding here refers to 7BIT encoding.
Of course, RFCs are not always conformed to, so some testing will need to be done.
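One simple test along those lines is to check whether the bytes of an uploaded part really fit in 7 bits. The sketch below is a rough approximation with a function name of my own choosing: it only checks that every byte is in the range 1 to 127. The full MIME definition of 7bit also constrains line length and line endings, which this ignores.

```python
def looks_7bit(data: bytes) -> bool:
    """True if every byte is between 1 and 127 (no NUL, no high bit set)."""
    return all(0 < b < 128 for b in data)


if __name__ == "__main__":
    print(looks_7bit(b"plain ASCII text\r\n"))
    print(looks_7bit("caf\u00e9".encode("utf-8")))
```

A part that fails this check while carrying no content-transfer-encoding header would be a sign that the sender is not following the RFC.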
A note on the Byte Order Mark:
One other piece of information I came across in the field of encoding determination was the byte order mark (BOM). Basically the BOM may be used to indicate the encoding of unlabeled text in many Unicode encodings. However, it is of limited use for determining encoding in general, as the program creating the file must insert it. If there is a BOM, however, you know the file is Unicode, and for UTF-16 and UTF-32 you can also tell the endianness of the text.
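As a rough illustration of BOM sniffing, the Python sketch below compares the first bytes of a file against the BOM constants in the standard codecs module. The function name and the returned labels are my own choices. A None result only means no BOM was found; it says nothing about the actual encoding.

```python
import codecs

# Longer marks are listed first, because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
    print(sniff_bom(sample))
    print(sniff_bom(b"no mark here"))
```

For Brett's original problem, reading the first few bytes in binary mode and running a check like this before choosing an encoding would at least catch the files that do carry a BOM.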