- HTTP responses usually carry a header of the form "Content-Type: xxxx; charset=yyy"
- The charset parameter tells the client which encoding to use when decoding the body
- This requires the web server to either
- know the encoding up front,
- or work it out by reading the first part of the document
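- An XML document normally begins with a declaration such as <?xml version="1.0" encoding="…"?>; the encoding attribute is optional, and without it a parser assumes UTF-8 or UTF-16
- An HTML document can declare its encoding in a meta tag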
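As a minimal sketch of the header case above, the charset parameter can be read out of a Content-Type value with Python's standard email package. The helper name charset_from_content_type is an illustrative assumption, not part of any HTTP library.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value.

    Returns `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset(default)


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

When the function returns None, the client has to fall back to what the document declares about itself, or to guessing.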
In the Document itself
|<meta http-equiv="Content-Type" content="text/html; charset=utf-8">|
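A rough sketch of how a server or client might read the declared encoding from the first bytes of a document. It works because the declaration itself is written in ASCII-compatible bytes. The function name declared_encoding, the 1024-byte window, and the regular expressions are assumptions for illustration; real HTML and XML parsers follow much more detailed rules.

```python
import re

# Matches encoding="..." inside an XML declaration.
_XML_DECL = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._-]+)[\"']"
)
# Matches charset=... anywhere inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE
)


def declared_encoding(head: bytes):
    """Return the encoding named in the first 1024 bytes, or None."""
    head = head[:1024]
    for pattern in (_XML_DECL, _META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return None
```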
Browser Inferring the Encoding
- For many encodings, the browser tries to infer the encoding from the distribution of characters in the document
- This applies mainly to the variants of the code-page encodings
- Each language gets its own set of mappings, and each has its own characteristic distribution in typical documents
- If the browser guesses wrong, the reader can change the encoding manually in the browser and then read the document.
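The idea of inferring a code page from character distribution can be sketched as a toy scorer: decode the bytes with each candidate and prefer the result in which frequent letters of the expected language dominate. The candidate list, the set of "common" Russian letters, and the function names are all illustrative assumptions; browsers use far richer statistical models.

```python
# Lower-case letters assumed to be frequent in Russian text (illustrative).
COMMON_RUSSIAN = set("оеаинтсрвлкмдпу")

# Three code pages that map Cyrillic letters to different byte values.
CANDIDATES = ("koi8_r", "cp1251", "iso8859_5")


def score(data: bytes, encoding: str) -> float:
    """Fraction of decoded characters that are common Russian letters."""
    text = data.decode(encoding, errors="replace")
    if not text:
        return 0.0
    return sum(ch in COMMON_RUSSIAN for ch in text) / len(text)


def guess_encoding(data: bytes, candidates=CANDIDATES) -> str:
    """Pick the candidate whose decoding looks most like Russian text."""
    return max(candidates, key=lambda enc: score(data, enc))
```

Decoding with the wrong code page turns most lower-case letters into upper-case or rare letters, so the correct candidate scores clearly higher on ordinary text. Very short inputs can still be ambiguous, which is why the manual override remains useful.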
- URL encoding is relevant in two places
- In the URL of the HTTP request (both GET and POST)
- In the posted contents of a form
- The encoding for URLs is
- Convert the text to UTF-8 first
- Then replace every non-ASCII byte, and every reserved character that is used as data, with its %-escaped sequence (% followed by two hex digits)
- Other characters may also be %-escaped.
- During form submit, the payload may be encoded as application/x-www-form-urlencoded.
- This follows the URL encoding rules for the most part; the main difference is that a space is sent as +.
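The two steps above (UTF-8 first, then %-escapes) can be sketched by hand and compared with Python's standard urllib.parse functions. The helper percent_encode and the sample string are illustrative assumptions; the unreserved set is the one from RFC 3986.

```python
from urllib.parse import quote, urlencode

# Bytes that never need escaping (RFC 3986 "unreserved" characters).
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode(text: str) -> str:
    """Encode text as UTF-8, then %-escape every byte that is not unreserved."""
    return "".join(
        chr(b) if b in UNRESERVED else f"%{b:02X}"
        for b in text.encode("utf-8")
    )


if __name__ == "__main__":
    value = "café & crème"
    print(percent_encode(value))      # hand-written version
    print(quote(value, safe=""))      # standard library, same rules
    print(urlencode({"q": value}))    # form encoding: space becomes +
```

The last call shows the form-submit variant: the same escapes, except that the space is written as + rather than %20.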
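The UTF encodings, UTF-8 in particular, have the following interesting features that make them very good encodings
- In UTF-8, the first byte of a character has 0 as its first bit (a single-byte character) or 11 as its first two bits (the lead byte of a multi-byte character); every continuation byte starts with 10
- This makes it easy to resynchronize when reading starts in the middle of a byte stream
- For a multi-byte character, the number of bytes occupied equals the number of contiguous 1s at the start of the first byte.
- This makes it easy to skip over this character and move to the next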
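- Also, the first few bytes of a text show which UTF encoding is being used. JSON relies on this: its first two characters are always ASCII, so the pattern of zero bytes in the first four bytes distinguishes UTF-8, UTF-16, and UTF-32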
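The leading-bit rules for UTF-8 can be sketched in a few lines. The function names sequence_length, resync, and split_chars are illustrative assumptions, and the sketch assumes well-formed input rather than validating it fully.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in the character that starts with this lead byte."""
    if (lead & 0b1000_0000) == 0:
        return 1                      # 0xxxxxxx: single-byte character
    if (lead & 0b1110_0000) == 0b1100_0000:
        return 2                      # 110xxxxx
    if (lead & 0b1111_0000) == 0b1110_0000:
        return 3                      # 1110xxxx
    if (lead & 0b1111_1000) == 0b1111_0000:
        return 4                      # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always look like 10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000


def resync(data: bytes, index: int) -> int:
    """Advance from an arbitrary offset to the start of the next character."""
    while index < len(data) and is_continuation(data[index]):
        index += 1
    return index


def split_chars(data: bytes) -> list:
    """Walk the bytes character by character using only the lead bytes."""
    chars, i = [], 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n].decode("utf-8"))
        i += n
    return chars
```

For example, split_chars("aé€😀".encode("utf-8")) walks over characters of one, two, three, and four bytes without decoding anything beyond the lead byte to find each boundary.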