javascript nodejs synchronously read a large file line by line?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/7545147/

Date: 2020-10-26 00:29:37  Source: igfitidea

nodejs synchronization read large file line by line?

Tags: javascript, node.js, filesystems, mojibake

Asked by nroe

I have a large file (utf8). I know fs.createReadStream can create a stream to read a large file, but that is not synchronous. So I tried to use fs.readSync, but the text it reads comes out broken, like "迈?".


var fs = require('fs');
var util = require('util');
var textPath = __dirname + '/people-daily.txt';
var fd = fs.openSync(textPath, "r");
// Legacy signature: fs.readSync(fd, length, position, encoding).
// Decoding a fixed 4 bytes can cut a multi-byte UTF-8 character in half.
var text = fs.readSync(fd, 4, 0, "utf8");
console.log(util.inspect(text, true, null));

Answered by Peace Makes Plenty

For large files, readFileSync can be inconvenient, as it loads the whole file into memory. A different synchronous approach is to iteratively call readSync, reading small bits of data at a time, and processing the lines as they come. The following bit of code implements this approach and synchronously processes one line at a time from the file 'test.txt':


var fs = require('fs');
var filename = 'test.txt';

var fd = fs.openSync(filename, 'r');
var bufferSize = 1024;
var buffer = Buffer.alloc(bufferSize);  // new Buffer() is deprecated

var leftOver = '';
var read, line, idxStart, idx;
while ((read = fs.readSync(fd, buffer, 0, bufferSize, null)) !== 0) {
  leftOver += buffer.toString('utf8', 0, read);
  idxStart = 0;
  while ((idx = leftOver.indexOf("\n", idxStart)) !== -1) {
    line = leftOver.substring(idxStart, idx);
    console.log("one line read: " + line);
    idxStart = idx + 1;
  }
  leftOver = leftOver.substring(idxStart);
}
if (leftOver.length > 0) {              // handle a final line with no trailing newline
  console.log("one line read: " + leftOver);
}
fs.closeSync(fd);

Answered by Divam Gupta

Use https://github.com/nacholibre/node-readlines

var lineByLine = require('n-readlines');
var liner = new lineByLine('./textFile.txt');

var line;
var lineNumber = 0;
while (line = liner.next()) {
    console.log('Line ' + lineNumber + ': ' + line.toString('utf8')); // decode as utf8, not ascii, since the file is UTF-8
    lineNumber++;
}

console.log('end of line reached');

Answered by Tom

Use readFileSync:


fs.readFileSync(filename, [encoding]) Synchronous version of fs.readFile. Returns the contents of the filename.

If encoding is specified then this function returns a string. Otherwise it returns a buffer.


On a side note, since you are using node, I'd recommend using asynchronous functions.


Answered by srkleiman

I built a simpler version of JB Kohn's answer that uses split() on the buffer. It works on the larger files I tried.


var fs = require('fs');

/*
 * Synchronously call fn(text, lineNum) on each line read from file descriptor fd.
 */
function forEachLine (fd, fn) {
    var bufSize = 64 * 1024;
    var buf = Buffer.alloc(bufSize);    // new Buffer() is deprecated
    var leftOver = '';
    var lineNum = 0;
    var lines, n;

    while ((n = fs.readSync(fd, buf, 0, bufSize, null)) !== 0) {
        lines = buf.toString('utf8', 0 , n).split('\n');
        lines[0] = leftOver+lines[0];       // add leftover string from previous read
        while (lines.length > 1) {          // process all but the last line
            fn(lines.shift(), lineNum);
            lineNum++;
        }
        leftOver = lines.shift();           // save last line fragment (may be '')
    }
    if (leftOver) {                         // process any remaining line
        fn(leftOver, lineNum);
    }
}

Answered by user943702

Two potential problems:


  1. a 3-byte UTF-8 BOM at the beginning of the file that you did not skip
  2. the first 4 bytes may not end on a UTF-8 character boundary, so they cannot decode cleanly (UTF-8 is a variable-length encoding)