Original question: http://stackoverflow.com/questions/2218005/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
How to get a random line of a text file in Java?
Asked by Fluffy
Say there is a file too big to fit in memory. How can I get a random line from it? Thanks.
Update: I want the probability of getting each line to be equal.
Accepted answer by Itay Maman
Here's a solution. Take a look at the choose() method, which does the real work (the main() method repeatedly exercises choose() to show that the distribution is indeed quite uniform).
The idea is simple: when you read the first line, it has a 100% chance of being chosen as the result. When you read the 2nd line, it has a 50% chance of replacing the first line as the result. When you read the 3rd line, it has a 33% chance of becoming the result. The fourth line has a 25% chance, and so on.
import java.io.*;
import java.util.*;

public class B {
    public static void main(String[] args) throws FileNotFoundException {
        // Tally how often each line is chosen over 1000 runs.
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (int i = 0; i < 1000; ++i) {
            String s = choose(new File("g:/temp/a.txt"));
            if (!map.containsKey(s))
                map.put(s, 0);
            map.put(s, map.get(s) + 1);
        }
        System.out.println(map);
    }

    // Returns a uniformly random line, or null if the file is empty.
    public static String choose(File f) throws FileNotFoundException {
        String result = null;
        Random rand = new Random();
        int n = 0;
        // try-with-resources closes the Scanner (and the file) on every call.
        try (Scanner sc = new Scanner(f)) {
            while (sc.hasNextLine()) {
                ++n;
                String line = sc.nextLine();
                // The n-th line replaces the current result with probability 1/n.
                if (rand.nextInt(n) == 0)
                    result = line;
            }
        }
        return result;
    }
}
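A short derivation of why this selection rule is uniform may help. It assumes a file of n lines, where line k is selected with probability 1/k when it is read and each later line j leaves it in place with probability 1 - 1/j (the symbols n, k and j are introduced here for illustration):

```latex
P(\text{line } k \text{ is returned})
  = \frac{1}{k} \prod_{j=k+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{1}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{k+2} \cdots \frac{n-1}{n}
  = \frac{1}{n}
```

The product telescopes, so every line ends up with the same probability 1/n regardless of its position in the file.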
Answer by ZeissS
Use a BufferedReader and read line by line. Use a java.util.Random object to stop randomly ;)
Answer by Will
Either you
- read the file twice: once to count the number of lines, the second time to extract a random line, or
- use reservoir sampling
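The two-pass option could look roughly like the sketch below. The class name TwoPassLineSampler, the method randomLine and the UTF-8 encoding are assumptions made for illustration, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

// Hypothetical helper: picks one line uniformly by reading the file twice.
class TwoPassLineSampler {
    static String randomLine(Path path, Random rand) throws IOException {
        // Pass 1: count the lines without keeping them in memory.
        long count = 0;
        try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            while (in.readLine() != null) {
                count++;
            }
        }
        if (count == 0) {
            return null; // empty file: nothing to choose from
        }
        // Choose a zero-based line index in [0, count).
        long target = (long) (rand.nextDouble() * count);
        // Pass 2: skip the lines before the target, then return the target line.
        try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            for (long i = 0; i < target; i++) {
                in.readLine();
            }
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java TwoPassLineSampler <file>");
            return;
        }
        System.out.println(randomLine(Paths.get(args[0]), new Random()));
    }
}
```

Only the line counter and one line are held in memory at a time, at the cost of reading the file twice.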
Answer by meriton
Reading the entire file if you want only one line seems a bit excessive. The following should be more efficient:
- Use RandomAccessFile to seek to a random byte position in the file.
- Seek left and right to the next line terminator. Let L be the line between them.
- With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at the first step.
This is a variant of rejection sampling.
Line lengths include the line terminator character(s), hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter bound on line length.)
It is worth noting that the runtime of this algorithm does not depend on file size, only on line length, i.e. it scales much better than reading the entire file.
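The steps above might be sketched as follows. This is only an illustration under stated assumptions: lines end in '\n', the text is UTF-8, each line fits in an int-sized buffer, and the names RejectionLineSampler, randomLine, byteAt and minLineLength are hypothetical. Reading single bytes through RandomAccessFile is slow; a real implementation would probably buffer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical helper illustrating rejection sampling over byte positions.
class RejectionLineSampler {
    // minLineLength must not exceed the byte length (terminator included)
    // of any line in the file; 1 is always a safe value.
    static String randomLine(RandomAccessFile file, int minLineLength, Random rand)
            throws IOException {
        long len = file.length();
        if (len == 0) {
            return null;
        }
        while (true) {
            // Random byte position in [0, len).
            long pos = (long) (rand.nextDouble() * len);
            // Scan left to the first byte of the line containing pos.
            long start = pos;
            while (start > 0 && byteAt(file, start - 1) != '\n') {
                start--;
            }
            // Scan right to just past the line terminator (or to end of file).
            long end = pos;
            while (end < len && byteAt(file, end) != '\n') {
                end++;
            }
            if (end < len) {
                end++; // count the terminator as part of the line
            }
            long lineLength = end - start;
            // A line is hit with probability lineLength / len, so accepting it
            // with probability minLineLength / lineLength evens things out.
            if (rand.nextDouble() * lineLength < minLineLength) {
                byte[] buf = new byte[(int) lineLength];
                file.seek(start);
                file.readFully(buf);
                return new String(buf, StandardCharsets.UTF_8).replaceAll("[\\r\\n]+$", "");
            }
        }
    }

    static int byteAt(RandomAccessFile file, long position) throws IOException {
        file.seek(position);
        return file.read();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java RejectionLineSampler <file>");
            return;
        }
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            System.out.println(randomLine(file, 1, new Random()));
        }
    }
}
```

With minLineLength = 1 the result is still uniform, but long lines are rejected more often, so a tighter bound means fewer retries.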
Answer by Pureferret
Looking over Itay's answer, it looks as though it reads the file a thousand times over, sampling one line each time, whereas true reservoir sampling should only go over the 'tape' once. I've devised some code to go over the file once with real reservoir sampling, based on this and the various descriptions on the web.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class ReservoirSampling {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        Sampler mySampler = new Sampler();
        // draw a sample of 10 lines in a single pass over the file
        List<String> myList = mySampler.sampler(10);
        for (int index = 0; index < myList.size(); index++) {
            System.out.println(myList.get(index));
        }
    }
}
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

public class Sampler {
    public Sampler() {}

    public List<String> sampler(int reservoirSize) throws FileNotFoundException, IOException {
        // reservoirList is where our selected lines are stored
        List<String> reservoirList = new ArrayList<String>(reservoirSize);
        // we will use this counter to count the current line number while iterating
        int count = 0;
        Random ra = new Random();
        // try-with-resources closes the file once the single pass is done
        try (Scanner sc = new Scanner(new File("Open_source.html"))) {
            while (sc.hasNextLine()) {
                String currentLine = sc.nextLine();
                count++;
                if (count <= reservoirSize) {
                    // fill the reservoir with the first reservoirSize lines
                    reservoirList.add(currentLine);
                } else {
                    // keep the new line with probability reservoirSize / count
                    int randomNumber = ra.nextInt(count);
                    if (randomNumber < reservoirSize) {
                        reservoirList.set(randomNumber, currentLine);
                    }
                }
            }
        }
        return reservoirList;
    }
}
The basic premise is that you fill up the reservoir with the first ReservoirSize lines, and then, for each later line, overwrite a random reservoir entry with probability ReservoirSize/count, where count is the number of lines read so far. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I've literally knocked it together in half an hour.
Answer by NBCurrieLL
Use RandomAccessFile:
- Construct a RandomAccessFile, file
- Get the length of that file, filelen, by calling file.length()
- Generate a random number, pos, between 0 and filelen
- Call file.seek(pos) to seek to the random position
- Call file.readLine() to get to the end of the current line
- Read the next line by calling file.readLine() again
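A minimal sketch of these steps, assuming the names SeekLineSampler and randomLine (both hypothetical). The wrap-around to the first line when the random position falls in the last line is also an assumption, since the steps above do not say what to do in that case. Note that a line is reached by landing anywhere in the line before it, so the chance of picking a line grows with the length of its predecessor rather than being equal for all lines.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

// Hypothetical helper following the seek-and-readLine steps.
class SeekLineSampler {
    static String randomLine(RandomAccessFile file, Random rand) throws IOException {
        long filelen = file.length();
        if (filelen == 0) {
            return null;
        }
        // Random byte position in [0, filelen).
        long pos = (long) (rand.nextDouble() * filelen);
        file.seek(pos);
        // Discard the remainder of the line we landed in.
        file.readLine();
        // RandomAccessFile.readLine() maps each byte to one char,
        // so non-ASCII text would need separate decoding.
        String line = file.readLine();
        if (line == null) {
            // Assumption: we landed in the last line, so wrap to the first one.
            file.seek(0);
            line = file.readLine();
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java SeekLineSampler <file>");
            return;
        }
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r")) {
            System.out.println(randomLine(file, new Random()));
        }
    }
}
```

Each call does one seek and reads at most a couple of lines, which is why the cost stays small even for large files.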
Using this method, I've been sampling lines from the Brown Corpus at random, and can easily retrieve 1000 random samples from randomly chosen files in a few seconds. If I tried to do the same by reading through each file line by line, it would take me much longer.
The same principle can be used for selecting random elements from a list. Rather than reading through the list and stopping at a random place, if you generate a random number between 0 and the length of the list, then you can index directly into the list.
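For the list case, a minimal sketch might look like this (the class name RandomElement and the method pick are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper: picks a random element by index instead of iterating.
class RandomElement {
    static <T> T pick(List<T> list, Random rand) {
        // nextInt(size) returns an index in [0, size); it throws
        // IllegalArgumentException if the list is empty.
        return list.get(rand.nextInt(list.size()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(pick(words, new Random()));
    }
}
```

For lists with constant-time get(), such as ArrayList, this avoids walking the list at all.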