node.js: Convert streamed buffers to a utf8 string

Question

Asked by Biggie

I want to make an HTTP request using node.js to load some text from a web server. Since the response can contain a lot of text (several megabytes), I want to process each text chunk separately. I can achieve this with the following code:

var req = http.request(reqOptions, function(res) {
    ...
    res.setEncoding('utf8');
    res.on('data', function(textChunk) {
        // process utf8 text chunk
    });
});

This seems to work without problems. However, I also want to support HTTP compression, so I use zlib:

var zip = zlib.createUnzip();

// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
    // do something like checking the number of bytes downloaded
    zip.write(chunk); // give the raw bytes to zlib, see below
});

zip.on('data', function(chunk) {
    // convert chunk to utf8 text:
    var textChunk = chunk.toString('utf8');

    // process utf8 text chunk
});

This can be a problem for multi-byte characters like '\u00c4', which consists of two bytes in UTF-8: 0xC3 and 0x84. If the first byte ends up at the end of the first chunk (Buffer) and the second byte at the start of the second chunk, then chunk.toString('utf8') will produce incorrect characters at the end and beginning of the respective text chunks. How can I avoid this?

Hint: I still need the buffer (more specifically, the number of bytes in the buffer) to limit the number of downloaded bytes. So using res.setEncoding('utf8'), as in the first code example above for non-compressed data, does not suit my needs.
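
The split described above can be reproduced without any network code. The following is a minimal sketch, assuming a Node.js version that provides Buffer.from; the variable names are illustrative and not part of the original code:

```javascript
// 'Ä' (U+00C4) is encoded in UTF-8 as the two bytes 0xC3 0x84.
var whole = Buffer.from('\u00c4', 'utf8');

// Simulate a chunk boundary that falls between the two bytes.
var firstChunk = whole.subarray(0, 1);  // holds only 0xC3
var secondChunk = whole.subarray(1);    // holds only 0x84

// Decoding each chunk on its own cannot reassemble the character:
// each half is replaced by U+FFFD (the replacement character).
var naive = firstChunk.toString('utf8') + secondChunk.toString('utf8');

console.log(naive === '\u00c4');                  // false
console.log(whole.toString('utf8') === '\u00c4'); // true
```

Decoding the undivided buffer works, so the corruption comes only from where the chunk boundary falls, not from the data itself.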

Answer 1

Answered by Biggie

Single Buffer

If you have a single Buffer, you can use its toString method, which converts all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don't provide a parameter, but I've explicitly set the encoding in this example.

var req = http.request(reqOptions, function(res) {
    ...

    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});

Streamed Buffers

If you have streamed buffers, as in the question above, where the first byte of a multi-byte UTF-8 character may be contained in the first Buffer (chunk) and the second byte in the second Buffer, then you should use a StringDecoder:

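var StringDecoder = require('string_decoder').StringDecoder;

var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');

    res.on('data', function(chunk) {
        // returns only complete characters; trailing partial bytes stay buffered
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });

    res.on('end', function() {
        // flush whatever is still buffered in the decoder
        var lastChunk = decoder.end();
        // process the final utf8 text chunk, if any
    });
});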

This way, the bytes of incomplete characters are buffered by the StringDecoder until all required bytes have been written to the decoder.
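
The buffering behaviour can be checked in isolation. The sketch below is only an illustration under stated assumptions: the sample text is made up, zlib.gzipSync and zlib.gunzipSync stand in for the compressed HTTP response, and the loop feeds single bytes to imitate the worst-case chunking that zip.on('data') could deliver. It also counts the bytes of each chunk, as the question requires.

```javascript
var zlib = require('zlib');
var StringDecoder = require('string_decoder').StringDecoder;

// Hypothetical sample payload: 'Ä' takes 2 bytes and '€' takes 3 bytes in UTF-8.
var original = '\u00c4pfel kosten 3 \u20ac';

// Stand-in for the decompressed response body (assumption: no real request is made).
var unzipped = zlib.gunzipSync(zlib.gzipSync(Buffer.from(original, 'utf8')));

var decoder = new StringDecoder('utf8');
var byteCount = 0; // raw byte counter, still available because we handle Buffers
var text = '';

// Feed one byte per chunk so that every multi-byte character is split.
for (var i = 0; i < unzipped.length; i++) {
    var chunk = unzipped.subarray(i, i + 1);
    byteCount += chunk.length;
    text += decoder.write(chunk); // '' until a character is complete
}
text += decoder.end(); // flush anything left in the decoder

console.log(text === original); // true
console.log(byteCount);         // 19
```

In a real request the same decoder.write call would sit inside the zip.on('data') handler, with the byte counting kept on the raw chunks.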

Answer 2

Answered by user3398092

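var fs = require("fs");

// Reads a file as a stream and calls processline once per line.
function readFileLineByLine(filename, processline) {
    // Passing an encoding makes the stream emit strings and keeps
    // multi-byte characters intact across chunk boundaries.
    var stream = fs.createReadStream(filename, { encoding: "utf8" });
    var s = ""; // text after the last newline seen so far
    stream.on("data", function(data) {
        s += data;
        var lines = s.split("\n");
        // Every element except the last is a complete line.
        for (var i = 0; i < lines.length - 1; i++)
            processline(lines[i]);
        s = lines[lines.length - 1];
    });

    stream.on("end", function() {
        // Emit the final line if the file does not end with a newline.
        if (s.length > 0)
            processline(s);
    });
}

// filename must hold the path of the file to read; it is not defined in this snippet.
var linenumber = 0;
readFileLineByLine(filename, function(line) {
    console.log(++linenumber + " -- " + line);
});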
