PHP: How to check if a value already exists to avoid duplicates?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/61033/

Time: 2020-08-24 21:20:40  Source: igfitidea

How to check if a value already exists to avoid duplicates?

Tags: php, sql, mysql

Asked by Gilean

I've got a table of URLs and I don't want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?

Answer by aku

If you don't want to have duplicates, you can do the following:

If multiple users can insert data into the DB, the method suggested by @Jeremy Ruten can lead to an error: after you perform the check, someone else can insert the same data into the table before your own insert runs.

Answer by Mez

To answer your initial question, the easiest way to check whether there is a duplicate is to run an SQL query against what you're trying to add!

For example, if you wanted to check for the URL http://www.example.com/ in the table links, then your query would look something like

SELECT * FROM links WHERE url = 'http://www.example.com/';

Your PHP code would look something like

$conn = mysql_connect('localhost', 'username', 'password');
if (!$conn)
{
    die('Could not connect to database');
}
if(!mysql_select_db('mydb', $conn))
{
    die('Could not select database mydb');
}

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    die('This URL already exists in the database');
}

I've written this out longhand here, with all the connecting to the database, etc. It's likely that you'll already have a connection to a database, so you should use that rather than starting a new connection (replace $conn in the mysql_query command and remove the stuff to do with mysql_connect and mysql_select_db).

Of course, there are other ways of connecting to the database, like PDO, or using an ORM, or similar, so if you're already using those, this answer may not be relevant (and it's probably a bit beyond the scope to give answers related to this here!)
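Since the mysql_* functions used above were removed in PHP 7, here is a rough sketch of the same existence check using PDO and a prepared statement. The helper name urlExists is made up for this example and is not part of the original answer; the links table and url column are the ones used above.

```php
<?php
// Hypothetical helper (not from the original answer): returns true when
// $url is already stored in the `links` table.
function urlExists(PDO $pdo, string $url): bool
{
    // A prepared statement keeps the URL out of the SQL text itself,
    // so quotes inside the URL cannot break or alter the query.
    $stmt = $pdo->prepare('SELECT 1 FROM links WHERE url = ? LIMIT 1');
    $stmt->execute([$url]);

    // fetchColumn() returns false when no row matched.
    return $stmt->fetchColumn() !== false;
}
```

You would call it with an existing connection, for example one created with new PDO('mysql:host=localhost;dbname=mydb', 'username', 'password'), where the DSN and credentials are placeholders. Note that a check like this still leaves a window between the check and a later INSERT, so it is not a substitute for a unique key.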

However, MySQL provides many ways to prevent this from happening in the first place.

Firstly, you can mark a field as "unique".

Let's say I have a table where I want to just store all the URLs that are linked to from my site, and the last time they were visited.

My definition might look something like this:-

CREATE TABLE links
(
    url VARCHAR(255) NOT NULL,
    last_visited TIMESTAMP
)

This would allow me to add the same URL over and over again, unless I wrote some PHP code similar to the above to stop this happening.

However, were my definition to change to

CREATE TABLE links
(
  url VARCHAR(255)  NOT NULL,
  last_visited TIMESTAMP,
  PRIMARY KEY (url)
)

Then this would make MySQL throw an error when I tried to insert the same value twice.

An example in PHP would be

$result = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);

if (!$result)
{
    die('Could not Insert Row 1');
}

$result2 = mysql_query("INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW())", $conn);

if (!$result2)
{
    die('Could not Insert Row 2');
}

If you ran this, you'd find that on the first attempt, the script would die with the comment Could not Insert Row 2. However, on subsequent runs, it'd die with Could not Insert Row 1.

This is because MySQL knows that the url is the primary key of the table. A primary key is a unique identifier for that row. Most of the time, it's useful to set the unique identifier for a row to be a number, because MySQL is quicker at looking up numbers than it is at looking up text. Within MySQL, keys (and especially primary keys) are used to define relationships between two tables. For example, if we had a table for users, we could define it as

CREATE TABLE users (
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40) NOT NULL,
  PRIMARY KEY (username)
)

However, when we wanted to store information about a post the user had made, we'd have to store the username with that post to identify that the post belonged to that user.

I've already mentioned that MySQL is faster at looking up numbers than strings, so this would mean we'd be spending time looking up strings when we didn't have to.

To solve this, we can add an extra column, user_id, and make that the primary key (so when looking up the user record based on a post, we can find it quicker)

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (`user_id`)
)

You'll notice that I've also added something new here - AUTO_INCREMENT. This basically allows us to let that field look after itself. Each time a new row is inserted, it adds 1 to the previous number, and stores that, so we don't have to worry about numbering, and can just let it do this itself.

So, with the above table, we can do something like

INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');

and then

INSERT INTO users (username, password) VALUES('User', '988881adc9fc3655077dc2d4d757d480b5ea0e11');

When we select the records from the database, we get the following:-

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
+---------+----------+------------------------------------------+
2 rows in set (0.00 sec)

However, here - we have a problem - we can still add another user with the same username! Obviously, this is something we don't want to do!

mysql> SELECT * FROM users;
+---------+----------+------------------------------------------+
| user_id | username | password                                 |
+---------+----------+------------------------------------------+
|       1 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
|       2 | User     | 988881adc9fc3655077dc2d4d757d480b5ea0e11 |
|       3 | Mez      | d3571ce95af4dc281f142add33384abc5e574671 |
+---------+----------+------------------------------------------+
3 rows in set (0.00 sec)

Let's change our table definition!

CREATE TABLE users (
  user_id INT(10)  NOT NULL AUTO_INCREMENT,
  username VARCHAR(255)  NOT NULL,
  password VARCHAR(40)  NOT NULL,
  PRIMARY KEY (user_id),
  UNIQUE KEY (username)
)

Let's see what happens when we now try to insert the same user twice.

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO users (username, password) VALUES('Mez', 'd3571ce95af4dc281f142add33384abc5e574671');
ERROR 1062 (23000): Duplicate entry 'Mez' for key 'username'

Huzzah!! We now get an error when we try and insert the username for the second time. Using something like the above, we can detect this in PHP.

Now, let's go back to our links table, but with a new definition.

CREATE TABLE links
(
    link_id INT(10)  NOT NULL AUTO_INCREMENT,
    url VARCHAR(255)  NOT NULL,
    last_visited TIMESTAMP,
    PRIMARY KEY (link_id),
    UNIQUE KEY (url)
)

and let's insert "http://www.example.com" into the database.

INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());

If we try and insert it again....

ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'

But what happens if we want to update the time it was last visited?

Well, we could do something complex with PHP, like so:-

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $result = mysql_query("UPDATE links SET last_visited = NOW() WHERE url = 'http://www.example.com/'", $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

Or, even grab the id of the row in the database and use that to update it.

$result = mysql_query("SELECT * FROM links WHERE url = 'http://www.example.com/'", $conn);

if (!$result)
{
    die('There was a problem executing the query');
}

$number_of_rows = mysql_num_rows($result);

if ($number_of_rows > 0)
{
    $row = mysql_fetch_assoc($result);

    $result = mysql_query('UPDATE links SET last_visited = NOW() WHERE link_id = ' . intval($row['link_id']), $conn);

    if (!$result)
    {
        die('There was a problem updating the links table');
    }
}

But MySQL has a nice built-in feature called REPLACE INTO.

Let's see how it works.

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       1 | http://www.example.com/ | 2011-08-19 23:48:03 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
ERROR 1062 (23000): Duplicate entry 'http://www.example.com/' for key 'url'
mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.example.com/', NOW());
Query OK, 2 rows affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------+---------------------+
| link_id | url                     | last_visited        |
+---------+-------------------------+---------------------+
|       2 | http://www.example.com/ | 2011-08-19 23:55:55 |
+---------+-------------------------+---------------------+
1 row in set (0.00 sec)

Notice that when using REPLACE INTO, it's updated the last_visited time, and not thrown an error!

This is because MySQL detects that you're attempting to replace a row. It knows the row that you want, as you've set url to be unique. MySQL figures out the row to replace by using the bit that you passed in that should be unique (in this case, the url) and updating the other values for that row. It's also updated the link_id, which is a bit unexpected! (In fact, I didn't realise this would happen until I just saw it happen! It makes sense, though: REPLACE deletes the old row and inserts a new one, which is also why it reported 2 rows affected.)

But what if you wanted to add a new URL? Well, REPLACE INTO will happily insert a new row if it can't find a matching unique row!

mysql> REPLACE INTO links (url, last_visited) VALUES ('http://www.stackoverflow.com/', NOW());
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM links;
+---------+-------------------------------+---------------------+
| link_id | url                           | last_visited        |
+---------+-------------------------------+---------------------+
|       2 | http://www.example.com/       | 2011-08-20 00:00:07 |
|       3 | http://www.stackoverflow.com/ | 2011-08-20 00:01:22 |
+---------+-------------------------------+---------------------+
2 rows in set (0.00 sec)
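If the changing link_id is a problem, one possible alternative is MySQL's INSERT ... ON DUPLICATE KEY UPDATE, which updates the existing row in place. The sketch below wraps it in a small PDO helper; the function name touchLink is made up for this example, and it assumes the links table defined above with its UNIQUE KEY on url.

```php
<?php
// Hypothetical helper: inserts the URL, or refreshes last_visited if the
// UNIQUE key on url already holds it. Unlike REPLACE INTO, the existing row
// is updated in place, so its link_id stays the same.
function touchLink(PDO $pdo, string $url): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO links (url, last_visited) VALUES (?, NOW())
         ON DUPLICATE KEY UPDATE last_visited = NOW()'
    );
    $stmt->execute([$url]);
}
```

Because no row is deleted, this form also avoids firing any ON DELETE behaviour on tables that reference links.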

I hope this answers your question, and gives you a bit more information about how MySQL works!

Answer by Mike Sherrill 'Cat Recall'

First, prepare the database.

  • Domain names aren't case-sensitive, but you have to assume the rest of a URL is. (Not all web servers respect case in URLs, but most do, and you can't easily tell by looking.)
  • Assuming you need to store more than a domain name, use a case-sensitive collation.
  • If you decide to store the URL in two columns--one for the domain name and one for the resource locator--consider using a case-insensitive collation for the domain name, and a case-sensitive collation for the resource locator. If I were you, I'd test both ways (URL in one column vs. URL in two columns).
  • Put a UNIQUE constraint on the URL column. Or on the pair of columns, if you store the domain name and resource locator in separate columns, as UNIQUE (url, resource_locator).
  • Use a CHECK() constraint to keep encoded URLs out of the database. This CHECK() constraint is essential to keep bad data from coming in through a bulk copy or through the SQL shell.

Second, prepare the URL.

  • Domain names aren't case-sensitive. If you store the full URL in one column, lowercase the domain name on all URLs. But be aware that some languages have uppercase letters that have no lowercase equivalent.
  • Think about trimming trailing characters. For example, these two URLs from amazon.com point to the same product. You probably want to store the second version, not the first.

    http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X/ref=sr_1_1?ie=UTF8&qid=1313583998&sr=8-1

    http://www.amazon.com/Systemantics-Systems-Work-Especially-They/dp/070450331X

  • Decode encoded URLs. (See PHP's urldecode() function. Note carefully its shortcomings, as described in that page's comments.) Personally, I'd prefer to handle these kinds of transformations in the database rather than in client code. That would involve revoking permissions on the tables and views, and allowing inserts and updates only through stored procedures; the stored procedures handle all the string operations that put the URL into a canonical form. But keep an eye on performance when you try that. CHECK() constraints (see above) are your safety net.

Third, if you're inserting only the URL, don't test for its existence first. Instead, try to insert and trap the error that you'll get if the value already exists. Testing and inserting hits the database twice for every new URL. Insert-and-trap just hits the database once. Note carefully that insert-and-trap isn't the same thing as insert-and-ignore-errors. Only one particular error means you violated the unique constraint; other errors mean there are other problems.
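To make the difference between insert-and-trap and insert-and-ignore-errors concrete, here is a minimal sketch using PDO. The helper names isDuplicateKeyError and insertUrl are made up for this example; it assumes a links table with a UNIQUE constraint on url and a connection created with PDO::ERRMODE_EXCEPTION. The one fact it relies on is that MySQL reports a duplicate-key violation as error 1062.

```php
<?php
// MySQL reports a UNIQUE/PRIMARY KEY violation as error 1062 (ER_DUP_ENTRY).
const MYSQL_ER_DUP_ENTRY = 1062;

// Hypothetical helper: true only for the duplicate-key error, so that
// every other failure can still be reported as a real problem.
function isDuplicateKeyError(PDOException $e): bool
{
    // errorInfo is [SQLSTATE, driver-specific code, driver message].
    return isset($e->errorInfo[1]) && (int) $e->errorInfo[1] === MYSQL_ER_DUP_ENTRY;
}

// Hypothetical helper: tries the INSERT once. Returns true if the row was
// inserted and false if the URL was already there.
function insertUrl(PDO $pdo, string $url): bool
{
    try {
        $stmt = $pdo->prepare('INSERT INTO links (url, last_visited) VALUES (?, NOW())');
        $stmt->execute([$url]);
        return true;
    } catch (PDOException $e) {
        if (isDuplicateKeyError($e)) {
            return false;   // expected: the URL already exists
        }
        throw $e;           // anything else is a different problem
    }
}
```

The database is hit once per URL, and only the duplicate-key error is swallowed; a missing table or a lost connection still surfaces as an exception.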

On the other hand, if you're inserting the URL along with some other data in the same row, you need to decide ahead of time whether you'll handle duplicate urls by

REPLACE eliminates the need to trap duplicate key errors, but it might have unfortunate side effects if there are foreign key references.

Answer by Rob Walker

Are you concerned purely about URLs that are the exact same string? If so, there is a lot of good advice in other answers. Or do you also have to worry about canonicalization?

For example, http://google.com and http://go%4fgle.com are the exact same URL, but would be allowed as duplicates by any of the database-only techniques. If this is an issue, you should preprocess the URLs to resolve any character escape sequences.

Depending on where the URLs are coming from, you will also have to worry about parameters and whether they are significant in your application.
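As a small illustration of the escape-sequence problem, the sketch below resolves percent-escapes in the host and lowercases it, so that both spellings of the google.com example compare equal. The helper name normalizeHost is made up for this example, and it deliberately looks at the host only; paths and parameters need their own rules.

```php
<?php
// Hypothetical helper: lowercases the host and decodes percent-escapes in it,
// so that two spellings of the same host compare equal.
function normalizeHost(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        throw new InvalidArgumentException("No host found in: $url");
    }
    // rawurldecode() turns %4f into "O"; strtolower() then folds the case.
    return strtolower(rawurldecode($host));
}
```

With this, normalizeHost('http://google.com') and normalizeHost('http://go%4fgle.com') give the same string, which can then be compared or stored.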

Answer by Joe Mahoney

To guarantee uniqueness you need to add a unique constraint. Assuming your table name is "urls" and the column name is "url", you can add the unique constraint with this alter table command:

alter table urls add constraint unique_url unique (url);

The alter table will probably fail (who really knows with MySQL) if you've already got duplicate urls in your table.

Answer by Steve Buzonas

The simple SQL solutions require a unique field; the logic solutions do not.

You should normalize your urls to ensure there is no duplication. Functions in PHP such as strtolower() and urldecode() or rawurldecode() can help with this.

Assumptions: Your table name is 'websites', the column name for your url is 'url', and the arbitrary data to be associated with the url is in the column 'data'.

Logic Solutions

SELECT COUNT(*) AS UrlResults FROM websites WHERE url='http://www.domain.com'

Test the previous query with if statements in SQL or PHP to ensure that it is 0 before you continue with an INSERT statement.

Simple SQL Statements

Scenario 1: Your db is a first-come, first-served table and you have no desire to have duplicate entries in the future.

ALTER TABLE websites ADD UNIQUE (url)

This will prevent any entries from being able to be entered in to the database if the url value already exists in that column.

Scenario 2: You want the most up-to-date information for each url and don't want to duplicate content. There are two solutions for this scenario. (These solutions also require 'url' to be unique, so the solution in Scenario 1 will also need to be carried out.)

REPLACE INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')

If a row with that url already exists, this will trigger a DELETE action followed by an INSERT; otherwise it will just INSERT. Be careful with ON DELETE declarations.

INSERT INTO websites (url, data) VALUES ('http://www.domain.com', 'random data')
ON DUPLICATE KEY UPDATE data='random data'

This will trigger an UPDATE action if a row exists and an INSERT if it does not.

Answer by Daniel Trebbien

In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.

There are at least two definitions:

  1. Two URLs are considered duplicates if they represent the same resource knowing nothing about the corresponding web service that generates the corresponding content. Some considerations include:
    • The scheme and domain name portion of the URLs are case-insensitive, so HTTP://WWW.STACKOVERFLOW.COM/ is the same as http://www.stackoverflow.com/.
    • If one URL specifies a port, but it is the conventional port for the scheme and they are otherwise equivalent, then they are the same (http://www.stackoverflow.com/ and http://www.stackoverflow.com:80/).
    • If the parameters in the query string are simple rearrangements and the parameter names are all different, then they are the same; e.g. http://authority/?a=test&b=test and http://authority/?b=test&a=test. Note that http://authority/?a%5B%5D=test1&a%5B%5D=test2 is not the same, by this first definition of sameness, as http://authority/?a%5B%5D=test2&a%5B%5D=test1.
    • If the scheme is HTTP or HTTPS, then the hash portions of the URLs can be removed, as this portion of the URL is not sent to the web server.
    • A shortened IPv6 address can be expanded.
    • Append a trailing forward slash to the authority only if it is missing.
    • Unicode canonicalization changes the referenced resource; e.g. you can't conclude that http://google.com/?q=%C3%84 (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
    • If the scheme is HTTP or HTTPS, 'www.' in one URL's authority can not simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you can not conclude that the referenced resources are the same.
  2. Apply basic URL canonicalization (e.g. lower case the scheme and domain name, supply the default port, stable sort query parameters by parameter name, remove the hash portion in the case of HTTP and HTTPS, ...), and take into account knowledge of the web service. Maybe you will assume that all web services are smart enough to canonicalize Unicode input (Wikipedia is, for example), so you can apply Unicode Normalization Form Canonical Composition (NFC). You would strip 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).
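Some of the Definition 1 rules can be sketched in PHP on top of parse_url(). This is only an illustration under stated assumptions, not a complete canonicalizer: the function name canonicalizeUrl is made up, it is meant for http and https URLs only, it refuses URLs with a login part, and it does not expand IPv6 addresses or touch Unicode.

```php
<?php
// A sketch of some "Definition 1" rules for http/https URLs (hypothetical helper).
function canonicalizeUrl(string $url): string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        throw new InvalidArgumentException("Cannot parse URL: $url");
    }
    if (isset($p['user'])) {
        throw new InvalidArgumentException('Login parts are not handled in this sketch');
    }

    // Scheme and domain name are case-insensitive.
    $scheme = strtolower($p['scheme']);
    $host   = strtolower($p['host']);

    // Drop the port when it is the conventional one for the scheme.
    $defaultPorts = ['http' => 80, 'https' => 443];
    $port = $p['port'] ?? null;
    if ($port !== null && ($defaultPorts[$scheme] ?? null) === $port) {
        $port = null;
    }

    // A missing path on the authority becomes "/".
    $path = $p['path'] ?? '/';

    // Sort query parameters by name; the original position breaks ties,
    // so repeated names such as a[]=1&a[]=2 keep their relative order.
    $query = '';
    if (isset($p['query']) && $p['query'] !== '') {
        $pairs = [];
        foreach (explode('&', $p['query']) as $i => $pair) {
            $pairs[] = [explode('=', $pair, 2)[0], $i, $pair];
        }
        usort($pairs, function (array $a, array $b): int {
            return strcmp($a[0], $b[0]) ?: ($a[1] <=> $b[1]);
        });
        $query = '?' . implode('&', array_column($pairs, 2));
    }

    // The hash portion is never sent to the server, so it is left out.
    return $scheme . '://' . $host . ($port !== null ? ":$port" : '') . $path . $query;
}
```

The tie-break on the original position is what keeps the a%5B%5D=test1&a%5B%5D=test2 example distinct from its reordered form, while plain rearrangements of differently named parameters collapse to one string.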

Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.

Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a case-insensitive character collation (MySQL collations whose names end in _ci), but the columns for the login and path need to use a binary, case-sensitive collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.

EDIT: Here are example table definitions:

CREATE TABLE `urls1` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. Also, we want 'utf8mb4_unicode_ci'
rather than 'utf8mb4_general_ci' because 'utf8mb4_general_ci' treats accented characters as equivalent. */
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `scheme`)
) ENGINE = 'InnoDB';


CREATE TABLE `urls2` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `canonical_scheme` VARCHAR(20) NOT NULL,
    `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `port` INT UNSIGNED,
    `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    `orig_scheme` VARCHAR(20) NOT NULL, 
    `orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
    `orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
    `orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',

    PRIMARY KEY (`id`),
    INDEX (`canonical_host`(10), `canonical_scheme`),
    INDEX (`orig_host`(10), `orig_scheme`)
) ENGINE = 'InnoDB';

Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.
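To illustrate how a URL might be split into the columns of urls1, here is a small sketch built on PHP's parse_url(). The helper name urlToUrls1Row is made up for this example, and since the table definitions above have no separate query column, it assumes the query string is stored together with the path.

```php
<?php
// Hypothetical helper: maps a URL onto the column names of `urls1`.
function urlToUrls1Row(string $url): array
{
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        throw new InvalidArgumentException("Cannot parse URL: $url");
    }

    // Assumption: the query string is kept with the path.
    $path = $p['path'] ?? '/';
    if (isset($p['query'])) {
        $path .= '?' . $p['query'];
    }

    return [
        'scheme'          => strtolower($p['scheme']),
        'canonical_login' => $p['user'] ?? null,   // case-sensitive, left as-is
        'canonical_host'  => strtolower($p['host']),
        'port'            => $p['port'] ?? null,   // null when no port was given
        'canonical_path'  => $path,                // case-sensitive, left as-is
    ];
}
```

The returned array can be bound directly to a prepared INSERT with matching column names.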

Unfortunately you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`) as MySQL limits the length of InnoDB keys to 767 bytes.

Answer by roman m

I don't know the syntax for MySQL, but all you need to do is wrap your INSERT in an IF statement that queries the table to see whether a record with the given url EXISTS; if it exists, don't insert a new record.

In MSSQL you can do this:

IF NOT EXISTS (SELECT 1 FROM YOURTABLE WHERE URL = 'URL')
INSERT INTO YOURTABLE (...) VALUES (...)

Answer by Kibbee

First things first. If you haven't already created the table, or you created a table but do not have data in it, then you need to add a unique constraint or a unique index. More information about choosing between an index and a constraint follows at the end of the post. But they both accomplish the same thing: enforcing that the column only contains unique values.

To create a table with a unique index on this column, you can use:

CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,UNIQUE INDEX IDX_URL(URL)
);

If you just want a unique constraint, and no index on that table, you can use

CREATE TABLE MyURLTable(
ID INTEGER NOT NULL AUTO_INCREMENT
,URL VARCHAR(512)
,PRIMARY KEY(ID)
,CONSTRAINT UNIQUE UNIQUE_URL(URL)
);

Now, if you already have a table, and there is no data in it, then you can add the index or constraint to the table with one of the following pieces of code.

ALTER TABLE MyURLTable
ADD UNIQUE INDEX IDX_URL(URL);

ALTER TABLE MyURLTable
ADD CONSTRAINT UNIQUE UNIQUE_URL(URL);

Now, you may already have a table with some data in it. In that case, you may already have some duplicate data in it. You can try creating the constraint or index shown above, and it will fail if you already have duplicate data. If you don't have duplicate data, great; if you do, you'll have to remove the duplicates. You can see a list of urls with duplicates using the following query.

SELECT URL,COUNT(*),MIN(ID) 
FROM MyURLTable
GROUP BY URL
HAVING COUNT(*) > 1;

To delete rows that are duplicates, and keep one, do the following:

DELETE RemoveRecords
FROM MyURLTable AS RemoveRecords
LEFT JOIN
(
    -- Keep the lowest ID for every URL, duplicated or not.
    -- (Selecting a bare ID with GROUP BY URL, as in a UNION of the
    -- COUNT(*) > 1 and COUNT(*) = 1 cases, is rejected when
    -- ONLY_FULL_GROUP_BY is enabled.)
    SELECT MIN(ID) AS ID
    FROM MyURLTable
    GROUP BY URL
) AS KeepRecords
ON RemoveRecords.ID = KeepRecords.ID
WHERE KeepRecords.ID IS NULL;

Now that you have deleted all the duplicate records, you can go ahead and create your index or constraint. Now, if you want to insert a value into your database, you should use something like:

INSERT IGNORE INTO MyURLTable(URL)
VALUES('http://www.example.com');

That will attempt to do the insert, and if it finds a duplicate, nothing will happen. Now, let's say you have other columns; you can do something like this:

INSERT INTO MyURLTable(URL,Visits) 
VALUES('http://www.example.com',1)
ON DUPLICATE KEY UPDATE Visits=Visits+1;

That will try to insert the value, and if it finds the URL, it will update the record by incrementing the visits counter. Of course, you can always do a plain old insert and handle the resulting error in your PHP code.

Now, as for whether you should use constraints or indexes, that depends on a lot of factors. Indexes make for faster lookups, so your performance will be better as the table gets bigger, but storing the index will take up extra space. Indexes also usually make inserts and updates take longer, because the index has to be updated as well. However, since the value will have to be looked up either way to enforce the uniqueness, in this case it may be quicker to just have the index anyway. As with anything performance related, the answer is to try both options and profile the results to see which works best for your situation.

Answer by Jean Paul Galea

If you want to insert urls into the table, but only those that don't exist already, you can add a UNIQUE constraint on the column and add IGNORE to your INSERT query so that you don't get an error.

Example: INSERT IGNORE INTO urls SET url = 'url-to-insert'