Original question: http://stackoverflow.com/questions/2534391/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Understanding character encoding in typical Java web app
Asked by Marcus Leon
Some pseudocode:
String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response
When you save the Java String (UTF-16) to an Oracle VARCHAR(15) column, does Oracle also store it as UTF-16? Does the length of an Oracle VARCHAR refer to the number of Unicode characters (and not the number of bytes)?
When we write b to the ServletResponse, is it written as UTF-16, or are we by default converting to another encoding such as UTF-8?
Answered by Dishayloo
Instead of UTF-16, think of the 'internal representation' of your string. A String in Java is a sequence of characters, and you do not need to care which encoding is used internally. Encoding becomes relevant when you interact with the world outside the program. In your example, saveTextInDb, readTextFromDb and write do that. Every time you exchange strings with the outside, an encoding is used for the conversion. saveTextInDb and readTextFromDb look like self-made methods (at least I don't know them), so you should look up which encoding these methods use. The write method of a Writer always produces bytes in the encoding associated with that writer. If you get your Writer from an HttpServletResponse, the associated encoding is the one used for outputting the response (it will be sent in the headers).
response.setCharacterEncoding("UTF-8"); // must be called before getWriter()
Writer out = response.getWriter();
This code returns, in out, a Writer that encodes strings as UTF-8. It is similar if you write to a file:
Writer fileout = new OutputStreamWriter(new FileOutputStream(myfile), "ISO-8859-1");
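To make the point about a Writer's associated encoding concrete, here is a minimal sketch that writes the same string through two OutputStreamWriter instances and compares the bytes produced. The class name WriterEncodingDemo and the helper encodeWithWriter are made up for illustration, and an in-memory ByteArrayOutputStream stands in for the file or HTTP response stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class WriterEncodingDemo {

    // Writes the text through a Writer bound to the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] encodeWithWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // 4 characters, the last one is outside ASCII
        // Same String, different byte counts depending on the Writer's charset.
        System.out.println(encodeWithWriter(text, StandardCharsets.UTF_8).length);      // 5
        System.out.println(encodeWithWriter(text, StandardCharsets.ISO_8859_1).length); // 4
    }
}
```

The String itself never changes; only the bytes leaving the program depend on the charset chosen for the Writer.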
If you access a DB, the framework you use should ensure a consistent exchange of strings with the database.
Answered by Vineet Reynolds
The ability of Oracle to store (and later retrieve) Unicode text relies only on the character set of the database (usually specified when the database is created). Choosing AL32UTF8 as the character set is recommended for storing Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2), because it gives you access to all Unicode code points while typically consuming less storage space than UTF-16 or UTF-32 based encodings such as AL16UTF16.
Assuming this is done, it is the Oracle JDBC driver that is responsible for converting UTF-16 encoded data into AL32UTF8. The same "automatic" conversion between encodings happens in the other direction when data is read from the database. To answer the question about the length of a VARCHAR: the definition of a VARCHAR2 column in Oracle uses byte semantics by default. VARCHAR2(n) defines a column that can store n bytes (this default is governed by the NLS_LENGTH_SEMANTICS parameter of the database); if you need to define the size in characters, use VARCHAR2(n CHAR).
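The difference between byte and character semantics can be checked on the Java side before inserting. The following is a rough sketch, assuming the database character set is AL32UTF8 (so the stored size matches the UTF-8 encoded size); the class ColumnLengthDemo and its methods are hypothetical helpers, not part of any Oracle or JDBC API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class ColumnLengthDemo {

    // Bytes the text occupies once encoded as UTF-8
    // (assumed here to match what an AL32UTF8 database stores).
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points in the text.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    // True if the text would fit a column declared with byte semantics, e.g. VARCHAR2(15).
    static boolean fitsByteSemantics(String text, int maxBytes) {
        return utf8ByteLength(text) <= maxBytes;
    }

    public static void main(String[] args) {
        String ascii = "A bunch of text";            // 15 characters, 15 bytes
        String accented = "na\u00efve caf\u00e9 menu"; // 15 characters, 17 bytes
        System.out.println(fitsByteSemantics(ascii, 15));    // true
        System.out.println(fitsByteSemantics(accented, 15)); // false
    }
}
```

Both strings contain 15 characters, but only the pure ASCII one fits into 15 bytes, which is why a column meant to hold 15 arbitrary characters would be declared as VARCHAR2(15 CHAR).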
The encoding of the data written to the ServletResponse object depends on the default character encoding, unless one is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls. All in all, for a complete Unicode solution involving an Oracle database, one must know:
- The encoding of the incoming data (i.e. the encoding of the data read via the ServletRequest object). This can be controlled by specifying the accepted encoding in HTML forms via the accept-charset attribute. If the encoding is unknown, the application can attempt to set it to a known value via the ServletRequest.setCharacterEncoding() method. This method doesn't change the existing encoding of the bytes in the stream; it only determines how they are decoded. If the input stream is in ISO-Latin1, specifying a different encoding will most likely produce incorrectly decoded characters. Knowing the encoding is important, since the Java runtime libraries need to know the original encoding of the stream if its contents are to be treated as character primitives or Strings. In particular, this is required when you invoke ServletRequest.getParameter or similar methods that process the stream and return String objects. The decoding process creates characters in Java's internal representation (UTF-16).
- The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from streams cannot be changed. There is, however, a decoding process that converts characters in supported encodings to UTF-16 characters whenever such data is accessed as a character primitive or as a String. New String objects, on the other hand, can be created from bytes with a defined encoding. This matters when you write the contents of one stream out onto another stream (the HttpServletResponse object's output stream, for instance). If the contents of the input stream are treated as a sequence of bytes, and not as characters or Strings, then the JVM performs no decoding. This implies that the bytes written to the output stream must not be altered if no intermediate character or String objects are created. Otherwise, it is quite possible that the contents of the output stream will be malformed and parsed incorrectly by the corresponding decoder. In simpler words:
  - if one is writing String objects or characters to the servlet's output stream, then one must specify the encoding that the browser must use to decode the response. Appropriate encoders might be used to encode the sequence of characters as specified in the desired response.
  - if one is writing a sequence of bytes that will be interpreted as characters, then the encoding to be specified in the HTTP header must be known beforehand.
  - if one is writing a sequence of bytes that will be parsed as a sequence of bytes (for images and other binary data), then the concept of encoding is immaterial.
- The database character set of the Oracle instance. As indicated previously, data will be stored in the Oracle database in the defined character set (for CHAR datatypes). The Oracle JDBC driver takes care of converting data between UTF-16 and AL32UTF8 (the database character set in this case) for CHAR and NCHAR datatypes. When you invoke resultSet.getString(), the JDBC driver returns a String with UTF-16 characters. The converse is true when you send data to the database. If another database character set is used, an additional level of conversion (from UTF-16 to UTF-8 to the database character set) is performed transparently by the JDBC driver.
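The first two points in the list above come down to one rule: bytes only become characters once a charset is chosen, and choosing the wrong one corrupts the text. The sketch below illustrates this with plain byte arrays standing in for a request stream; the class DecodeDemo and its decode helper are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, standing in for what a servlet container
// does when it turns request bytes into String parameters.
class DecodeDemo {

    // Decodes raw bytes into a String using the charset the reader assumes.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // Bytes as a client would send them if it encoded "café" as UTF-8.
        byte[] wire = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Decoding with the encoding that was actually used recovers the text.
        System.out.println(decode(wire, StandardCharsets.UTF_8).equals("caf\u00e9")); // true

        // Decoding the same bytes as ISO-8859-1 does not fail;
        // each of the two UTF-8 bytes of 'é' turns into its own character.
        System.out.println(decode(wire, StandardCharsets.ISO_8859_1).length()); // 5
    }
}
```

The bytes on the wire are identical in both calls, so the only way to get the right String is to know which encoding the sender used.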
Answered by David Gelhar
The ServletResponse will use ISO 8859-1 (Latin 1) by default. UTF-8 is the most common encoding used for HTTP responses that require Unicode, but you have to set that encoding explicitly.
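To see why the Latin 1 default matters, the following sketch encodes a string and decodes it again with the same charset, which is roughly what happens between a response Writer and a browser that honours the declared encoding. The class Latin1LossDemo and its roundTrip method are hypothetical; no servlet API is involved here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class Latin1LossDemo {

    // Encodes the text to bytes and decodes it back using the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String price = "\u20ac5"; // euro sign followed by '5'

        // ISO-8859-1 has no euro sign, so the encoder substitutes '?'.
        System.out.println(roundTrip(price, StandardCharsets.ISO_8859_1)); // ?5

        // UTF-8 can represent every Unicode character, so the text survives.
        System.out.println(roundTrip(price, StandardCharsets.UTF_8).equals(price)); // true
    }
}
```

Characters that Latin 1 cannot represent are silently replaced rather than reported, so the problem tends to show up only as question marks in the rendered page.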
According to this document, Oracle can support either UTF-8 or UTF-16 in the database. Your methods that read from and write to Oracle will need to use the encoding that matches how the database is set up, and translate it to and from the Java internal representation.