Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/6621303/
How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?
Asked by T. Brian Jones
I am running many instances of a webcrawler in parallel.
Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling the domain.
Other parallel crawlers check the log table to see what domains are already being crawled before selecting their own domain to crawl.
I need to prevent other crawlers from selecting a domain that has just been selected by another crawler but doesn't have a log entry yet. My best guess at how to do this is to lock the database from all other read/writes while one crawler selects a domain and inserts a row in the log table (two queries).
How the heck does one do this? I'm afraid this is terribly complex and relies on many other things. Please help get me started.
This code seems like a good solution (see the error below, however):
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT companies.id FROM companies
LEFT OUTER JOIN crawlLog
ON companies.id = crawlLog.companyId
WHERE crawlLog.companyId IS NULL
LIMIT 1
),
now()
)
But I keep getting the following MySQL error:
You can't specify target table 'crawlLog' for update in FROM clause
Is there a way to accomplish the same thing without this problem? I've tried a couple different ways. Including this:
INSERT INTO crawlLog (companyId, timeStartCrawling)
VALUES
(
(
SELECT id
FROM companies
WHERE id NOT IN (SELECT companyId FROM crawlLog) LIMIT 1
),
now()
)
Accepted answer by T. Brian Jones
I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.
INSERT INTO crawlLog (companyId, timeStartCrawling)
SELECT id, now()
FROM companies
WHERE id NOT IN
(
SELECT companyId
FROM crawlLog AS crawlLogAlias
)
LIMIT 1
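The statement above can be exercised end to end. The sketch below simulates it with Python's sqlite3 module; SQLite stands in for MySQL here purely so the example is self-contained, and the SQL is the same shape as the answer's. Each call claims a distinct company, and once every company is logged, a further call inserts nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY);
    CREATE TABLE crawlLog (companyId INTEGER, timeStartCrawling TEXT);
    INSERT INTO companies (id) VALUES (1), (2), (3);
""")

def claim_company(conn):
    # The claim and the log entry are a single INSERT ... SELECT statement,
    # so no other claim can slip in between the "select" and the "insert".
    cur = conn.execute("""
        INSERT INTO crawlLog (companyId, timeStartCrawling)
        SELECT id, datetime('now')
        FROM companies
        WHERE id NOT IN (SELECT companyId FROM crawlLog)
        LIMIT 1
    """)
    if cur.rowcount == 0:  # nothing left to claim
        return None
    return conn.execute(
        "SELECT companyId FROM crawlLog WHERE rowid = ?", (cur.lastrowid,)
    ).fetchone()[0]

claimed = {claim_company(conn) for _ in range(3)}  # each call claims a new company
leftover = claim_company(conn)                     # all claimed: inserts nothing
```

Note that SQLite does not need the crawlLogAlias workaround from the accepted answer; that alias is there for MySQL's restriction on referencing the insert target in a subquery.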
Answered by qbert220
You can lock tables using the MySQL LOCK TABLES command like this:
LOCK TABLES tablename WRITE;
# Do other queries here
UNLOCK TABLES;
See the MySQL LOCK TABLES documentation for details.
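LOCK TABLES is a MySQL server-side feature, so a fully self-contained demo can't use it directly; as an illustration of the same mutual-exclusion idea, the sketch below deliberately swaps in SQLite's BEGIN EXCLUSIVE via Python's sqlite3 module. While one connection holds the exclusive lock, a second connection can neither read nor write; COMMIT plays the role of UNLOCK TABLES.

```python
import os
import sqlite3
import tempfile

# The two "crawlers" must share an on-disk database to contend for its lock.
path = os.path.join(tempfile.mkdtemp(), "crawl.db")

writer = sqlite3.connect(path, isolation_level=None)  # autocommit; explicit BEGIN below
writer.execute("CREATE TABLE crawlLog (companyId INTEGER)")

# Rough analogue of LOCK TABLES crawlLog WRITE: take an exclusive lock.
writer.execute("BEGIN EXCLUSIVE")
writer.execute("INSERT INTO crawlLog (companyId) VALUES (1)")

# Meanwhile, another connection is shut out of both reads and writes.
other = sqlite3.connect(path, timeout=0.1, isolation_level=None)
try:
    other.execute("SELECT * FROM crawlLog").fetchall()
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

writer.execute("COMMIT")  # analogue of UNLOCK TABLES
visible = other.execute("SELECT companyId FROM crawlLog").fetchall()
```

This also illustrates the cost the next answer points out: while the lock is held, every other worker is stalled, not just the ones touching the same row.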
Answered by wonk0
Well, table locks are one way to deal with that, but this makes parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.
BEGIN;
SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE;
# do whatever you have to do
COMMIT;
Please note that you will need an index on domainname (or whatever column you use in the WHERE clause) for this to work, but this makes sense in general and I assume you will have that anyway.
Answered by ratsbane
You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."
Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:
START TRANSACTION;
SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
INSERT INTO loggingtable SET url=@URL;
COMMIT;
Or something close to that.
[edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:
INSERT INTO loggingtable (url) SELECT u.url FROM tablewithurls u LEFT JOIN loggingtable l ON l.url=u.url WHERE {some criterion used to pick the url to work on} AND l.url IS NULL;
Answered by Eljakim
I wouldn't use locking, or transactions.
The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.
Assume you have tblcrawels (cra_id), which is filled with your crawlers, tblurl (url_id), which is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.
You would run the following query if crawler 1 wants to start crawling url 2:
INSERT INTO tbllogging (log_cra_id, log_url_id)
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id=log_url_id
WHERE url_id=2 AND log_url_id IS NULL;
The next step is to check whether this record has been inserted.
SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1
If you get any results then crawler 1 can crawl this url. If you don't get any results this means that another crawler has inserted in the same line and is already crawling.
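The insert-then-check pattern above can be sketched with Python's sqlite3 module (SQLite again stands in for MySQL so the example runs anywhere): crawler 1's conditional INSERT matches the unclaimed url, so its follow-up check finds the row; crawler 2's INSERT matches nothing, so its check comes back empty and it moves on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblurl (url_id INTEGER PRIMARY KEY);
    CREATE TABLE tbllogging (log_cra_id INTEGER, log_url_id INTEGER);
    INSERT INTO tblurl (url_id) VALUES (1), (2), (3);
""")

def try_claim(conn, crawler_id, url_id):
    # Step 1: insert a log row only if no crawler has logged this url yet.
    conn.execute("""
        INSERT INTO tbllogging (log_cra_id, log_url_id)
        SELECT ?, url_id
        FROM tblurl LEFT JOIN tbllogging ON url_id = log_url_id
        WHERE url_id = ? AND log_url_id IS NULL
    """, (crawler_id, url_id))
    # Step 2: check that *our* row is the one that made it in.
    row = conn.execute(
        "SELECT 1 FROM tbllogging WHERE log_url_id = ? AND log_cra_id = ?",
        (url_id, crawler_id),
    ).fetchone()
    return row is not None

won_first = try_claim(conn, 1, 2)   # crawler 1 claims url 2
won_second = try_claim(conn, 2, 2)  # crawler 2 arrives too late
```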