How to quickly search a .csv file in Python

Question

Asked by Iceland_Hyman

I'm reading a 6 million entry .csv file with Python, and I want to be able to search through this file for a particular entry.

Are there any tricks to search the entire file? Should you read the whole thing into a dictionary or should you perform a search every time? I tried loading it into a dictionary but that took ages so I'm currently searching through the whole file every time which seems wasteful.

Could I possibly utilize that the list is alphabetically ordered? (e.g. if the search word starts with "b" I only search from the line that includes the first word beginning with "b" to the line that includes the last word beginning with "b")

I'm using import csv.

(A side question: is it possible to make csv go to a specific line in the file? I want to make the program start at a random line.)

Edit: I already have a copy of the list as an .sql file as well, how could I implement that into Python?

Answer 1

Answered by JimB

If the csv file isn't changing, load it into a database, where searching is fast and easy. If you're not familiar with SQL, you'll need to brush up on that, though.

Here is a rough example of inserting from a csv into a sqlite table. Example csv is ';' delimited, and has 2 columns.

import csv
import sqlite3

con = sqlite3.connect('newdb.sqlite')
cur = con.cursor()
cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')

# newline='' lets the csv module handle line endings itself
with open('stuff.csv', newline='') as f:
    csv_reader = csv.reader(f, delimiter=';')
    # executemany consumes the reader row by row; each row must have exactly 2 fields
    cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)

con.commit()
cur.close()
con.close()
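
Once the rows are in the table, a lookup could look roughly like the sketch below. It assumes the table and column names from the example above. The index name stuff_one_idx, the helper find_rows, and the sample rows are made up for illustration, and an in-memory database stands in for newdb.sqlite so the snippet runs on its own.

```python
import sqlite3


def find_rows(con, key):
    """Return all rows whose "one" column equals key (hypothetical helper)."""
    cur = con.execute('SELECT one, two FROM stuff WHERE one = ?', (key,))
    return cur.fetchall()


# In-memory database with made-up sample rows, standing in for newdb.sqlite
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')
con.executemany('INSERT INTO stuff VALUES (?, ?)',
                [('apple', '1'), ('bacon', '2'), ('cherry', '3')])
# An index on the searched column lets SQLite avoid scanning the whole table
con.execute('CREATE INDEX IF NOT EXISTS stuff_one_idx ON stuff (one)')
con.commit()

print(find_rows(con, 'bacon'))
print(find_rows(con, 'zebra'))
```

With 6 million rows, creating the index after the bulk insert is usually the cheaper order, since the index is then built in one pass instead of being updated per row.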

Answer 2

Answered by ghostdog74

You can use memory mapping for really big files.

import mmap
import re

with open("big_file", "rb") as report_file:
    # length 0 maps the whole file; ACCESS_READ works on both Unix and Windows
    with mmap.mmap(report_file.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
        # mmap exposes bytes, so the pattern must be a bytes pattern.
        # Example pattern: every line starting with "b"; substitute your own.
        pat = re.compile(rb"^b[^\r\n]*", re.M)
        print(pat.findall(mapping))

Answer 3

Answered by dan04

You can't go directly to a specific line in the file because lines are variable-length, so the only way to know when line #n starts is to search for the first n newlines. And it's not enough to just look for '\n' characters because CSV allows newlines in table cells, so you really do have to parse the file anyway.
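
As a rough sketch of what "parse the file anyway" can look like, the snippet below lets csv.reader do the parsing and uses itertools.islice to skip ahead to a given record. The helper name rows_from and the sample data are made up for illustration. Skipped records are still parsed, so this is linear in the starting position, not a true jump.

```python
import csv
import io
import random
from itertools import islice


def rows_from(f, start):
    """Yield parsed CSV rows beginning at record index `start` (0-based)."""
    reader = csv.reader(f)
    # islice still parses the skipped records, so quoted newlines are handled
    return islice(reader, start, None)


# Made-up sample data; the second record contains a newline inside a quoted cell
sample = 'a,1\n"b\nb",2\nc,3\nd,4\n'
rows = list(rows_from(io.StringIO(sample, newline=''), 2))
print(rows)

# Starting at a random record requires knowing the record count first
total = sum(1 for _ in csv.reader(io.StringIO(sample, newline='')))
start = random.randrange(total)
print(next(rows_from(io.StringIO(sample, newline=''), start)))
```

For a real file, replace the io.StringIO objects with open('yourfile.csv', newline=''), where the filename is a placeholder.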

Answer 4

Answered by Justin Peel

Well, if your words aren't too big (meaning they'll fit in memory), then here is a simple way to do this (I'm assuming that they are all words).

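from bisect import bisect_left

words = []
with open('myfile.csv') as f:
    for line in f:
        words.extend(line.strip().split(','))

# bisect needs sorted input; the file is assumed to be in alphabetical order already

wordtofind = 'bacon'
ind = bisect_left(words, wordtofind)
# ind can equal len(words) when the word sorts after every entry
if ind < len(words) and words[ind] == wordtofind:
    print('%s was found!' % wordtofind)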

It might take a minute to load all of the values from the file. This uses binary search to find your words, which requires the list to be sorted. In this case I was looking for bacon (who wouldn't look for bacon?). If there are repeated values, you might also want to use bisect_right to find the index one past the rightmost element equal to the value you are searching for. You can still use this if you have key:value pairs; you'll just have to make each object in your words list a [key, value] list.
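
To make the bisect_right and key:value remarks concrete, here is a small sketch with made-up data. Note one deliberate difference from the suggestion above: instead of bisecting directly on [key, value] lists, it bisects a separate list of keys, which avoids having to build a probe pair. The helper name lookup is hypothetical.

```python
from bisect import bisect_left, bisect_right

# Made-up, already sorted sample data with a repeated value
words = ['apple', 'bacon', 'bacon', 'bacon', 'cherry']
lo = bisect_left(words, 'bacon')    # index of the first 'bacon'
hi = bisect_right(words, 'bacon')   # one past the last 'bacon'
print(lo, hi, hi - lo)              # hi - lo is the number of occurrences

# Key/value pairs: keep the pairs sorted by key, plus a parallel list of keys
pairs = [('apple', 1), ('bacon', 2), ('cherry', 3)]
keys = [k for k, _ in pairs]


def lookup(key):
    """Return the value stored for key, or None if absent (hypothetical helper)."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return pairs[i][1]
    return None


print(lookup('bacon'))
print(lookup('zebra'))
```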

Side Note

I don't think that you can really go from line to line in a csv file very easily. You see, these files are basically just long strings with \n characters that indicate new lines.
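
If the file is known to contain no newlines inside quoted cells (an assumption, and one that does not hold for CSV in general), one workaround is to scan the file once, record the byte offset of every line, and then seek to any line directly. The sketch below illustrates that idea; the helper names build_line_index and read_line and the sample file are made up.

```python
import os
import random
import tempfile


def build_line_index(path):
    """Return the byte offset at which each physical line starts."""
    offsets = []
    pos = 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def read_line(path, offsets, n):
    """Seek straight to physical line n (0-based) and return it as text."""
    with open(path, 'rb') as f:
        f.seek(offsets[n])
        return f.readline().decode('utf-8').rstrip('\r\n')


# Made-up sample file so the snippet runs on its own
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False, newline='') as tmp:
    tmp.write('apple,1\nbacon,2\ncherry,3\n')
    path = tmp.name

offsets = build_line_index(path)
second = read_line(path, offsets, 1)
random_line = read_line(path, offsets, random.randrange(len(offsets)))
print(offsets)
print(second)
print(random_line)
os.remove(path)
```

The index costs one full pass over the file, so it only pays off when many lookups follow and the file does not change in between.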

Answer 5

Answered by vicky

My idea is to use the Python ZODB module to store dictionary-type data and then create a new csv file from that data structure. Do all your operations on it at that point.

Answer 6

Answered by TheOneWhoLikesToKnow

There is a fairly simple way to do this. Depending on how many columns you want Python to print, you may need to add or remove some of the print lines.

import csv

search = input('Enter string to search: ')
# open for reading in text mode; newline='' is what the csv module expects
with open('FileName.csv', newline='') as stock:
    reader = csv.reader(stock)
    for row in reader:
        for field in row:
            if field == search:
                print('Record found!\n')
                print(row[0])
                print(row[1])
                print(row[2])
I hope this managed to help.
