node.js: convert streamed buffers to a UTF-8 string
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors on Stack Overflow (not to this site).
Original: http://stackoverflow.com/questions/12121775/
convert streamed buffers to utf8-string
Asked by Biggie
I want to make an HTTP request using node.js to load some text from a web server. Since the response can contain a lot of text (several megabytes), I want to process each text chunk separately. I can achieve this using the following code:
var req = http.request(reqOptions, function(res) {
    ...
    res.setEncoding('utf8');
    res.on('data', function(textChunk) {
        // process utf8 text chunk
    });
});
This seems to work without problems. However, I also want to support HTTP compression, so I use zlib:
var zip = zlib.createUnzip();
// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
    // do something like checking the number of bytes downloaded
    zip.write(chunk); // give the raw bytes to zlib (see below)
});
zip.on('data', function(chunk) {
    // convert chunk to utf8 text:
    var textChunk = chunk.toString('utf8');
    // process utf8 text chunk
});
This can be a problem for multi-byte characters like '\u00c4', which consists of two bytes: 0xC3 and 0x84. If the first byte ends up in the first chunk (Buffer) and the second byte in the second chunk, then chunk.toString('utf8') will produce incorrect characters at the end/beginning of the text chunks. How can I avoid this?
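The effect described above can be reproduced without any network traffic. The following is a minimal sketch, assuming a Node.js version that provides Buffer.from; the sample bytes and variable names are made up for illustration:

```javascript
// 'A', then the two UTF-8 bytes of '\u00c4' (0xC3 0x84), then 'B',
// split so that the multi-byte character straddles the chunk boundary.
var first = Buffer.from([0x41, 0xC3]);  // 'A' + first byte of '\u00c4'
var second = Buffer.from([0x84, 0x42]); // second byte of '\u00c4' + 'B'

// Decoding each chunk on its own yields replacement characters (U+FFFD).
var naive = first.toString('utf8') + second.toString('utf8');

// Decoding the concatenated bytes yields the intended text.
var joined = Buffer.concat([first, second]).toString('utf8');

console.log(naive === 'A\u00c4B');  // false
console.log(joined === 'A\u00c4B'); // true
```

Concatenating everything first is only practical for small inputs, which is why it does not fit the chunk-by-chunk processing wanted here.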
Hint: I still need the buffer (more specifically, the number of bytes in the buffer) to limit the number of downloaded bytes. So using res.setEncoding('utf8') as in the first example above for non-compressed data does not suit my needs.
Answered by Biggie
Single Buffer
If you have a single Buffer, you can use its toString method, which converts all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don't provide a parameter, but I've explicitly set the encoding in this example.
var req = http.request(reqOptions, function(res) {
    ...
    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});
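The "all or part" wording refers to the optional start and end byte offsets that toString accepts after the encoding. A small sketch with made-up sample text, assuming Buffer.from is available:

```javascript
var buf = Buffer.from('Hello, W\u00f6rld', 'utf8');

var all = buf.toString('utf8');        // whole buffer
var part = buf.toString('utf8', 0, 5); // bytes 0 to 4 only
var dflt = buf.toString();             // encoding defaults to utf8

console.log(part);         // Hello
console.log(all === dflt); // true
```

The offsets count bytes, not characters, so slicing through the middle of a multi-byte character gives the same garbled result as a split chunk.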
Streamed Buffers
If you have streamed buffers as in the question above, where the first byte of a multi-byte UTF-8 character may be contained in the first Buffer (chunk) and the second byte in the second Buffer, then you should use a StringDecoder:
var StringDecoder = require('string_decoder').StringDecoder;
var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');
    res.on('data', function(chunk) {
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });
});
This way, the bytes of incomplete characters are buffered by the StringDecoder until all required bytes have been written to the decoder.
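The buffering can be observed directly by feeding the decoder two hand-made chunks. This is only an illustrative sketch; the byte values and variable names are chosen for the example:

```javascript
var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');

// First chunk ends with 0xC3, the first byte of '\u00c4'.
var a = decoder.write(Buffer.from([0x41, 0xC3]));
// Second chunk starts with 0x84, which completes the character.
var b = decoder.write(Buffer.from([0x84, 0x42]));
// end() returns whatever is still buffered (nothing in this case).
var rest = decoder.end();

console.log(JSON.stringify(a));    // "A"
console.log(b === '\u00c4B');      // true
console.log(JSON.stringify(rest)); // ""
```

The first write returns only 'A' and holds back 0xC3. The second write returns the completed character followed by 'B'.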
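For the compressed case from the question, the same decoder can be attached to the zlib stream instead of the response. The following is a rough sketch, not part of the original answer: the helper name readCompressedText, its parameters, and the maxBytes limit are all hypothetical, and source stands for any readable stream of raw bytes such as res.

```javascript
var zlib = require('zlib');
var StringDecoder = require('string_decoder').StringDecoder;

// Hypothetical helper (name and signature are assumptions).
// source:   readable stream of raw, compressed bytes (e.g. res)
// maxBytes: assumed limit on the number of raw bytes to accept
// onText:   called with each decoded text chunk
// onDone:   called once with (err, rawByteCount)
function readCompressedText(source, maxBytes, onText, onDone) {
    var zip = zlib.createUnzip();
    var decoder = new StringDecoder('utf8');
    var received = 0;
    var finished = false;

    function finish(err) {
        if (finished) return;
        finished = true;
        onDone(err || null, received);
    }

    source.on('data', function(chunk) {
        if (finished) return;
        received += chunk.length; // count raw bytes before decompression
        if (received > maxBytes) {
            finish(new Error('byte limit exceeded'));
            return;
        }
        zip.write(chunk); // give the raw bytes to zlib
    });
    source.on('end', function() { zip.end(); });
    source.on('error', finish);

    zip.on('data', function(chunk) {
        if (finished) return;
        // write() holds back the bytes of an incomplete character
        var textChunk = decoder.write(chunk);
        if (textChunk) onText(textChunk);
    });
    zip.on('end', function() {
        if (finished) return;
        var rest = decoder.end(); // flush whatever is still buffered
        if (rest) onText(rest);
        finish();
    });
    zip.on('error', finish);
}
```

Inside the http.request callback this could be called as readCompressedText(res, limit, processText, done), where limit, processText and done are supplied by the caller. The sketch only stops processing when the limit is hit; a real client would probably also want to abort the request at that point.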
Answered by user3398092
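var fs = require("fs");
var StringDecoder = require("string_decoder").StringDecoder;

// Reads a file as UTF-8 and calls processline(line) for every line.
function readFileLineByLine(filename, processline) {
    var stream = fs.createReadStream(filename);
    // The decoder holds back incomplete multi-byte characters between chunks.
    var decoder = new StringDecoder("utf8");
    var s = "";
    stream.on("data", function(data) {
        s += decoder.write(data);
        var lines = s.split("\n");
        // Every element except the last one is a complete line.
        for (var i = 0; i < lines.length - 1; i++)
            processline(lines[i]);
        // Keep the unfinished last line for the next chunk.
        s = lines[lines.length - 1];
    });
    stream.on("end", function() {
        s += decoder.end(); // flush any bytes still held by the decoder
        var lines = s.split("\n");
        for (var i = 0; i < lines.length; i++)
            processline(lines[i]);
    });
}

// Assumption: the file path is passed as the first command-line argument.
var filename = process.argv[2];
var linenumber = 0;
readFileLineByLine(filename, function(line) {
    console.log(++linenumber + " -- " + line);
});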

