MySQL: Are there disadvantages to using a generic varchar(255) for all text-based fields?
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep it under the same license and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/262238/
Are there disadvantages to using a generic varchar(255) for all text-based fields?
Asked by Olly
I have a contacts table which contains fields such as postcode, first name, last name, town, country, phone number etc, all of which are defined as VARCHAR(255) even though none of these fields will ever come close to having 255 characters. (If you're wondering, it's this way because Ruby on Rails migrations map String fields to VARCHAR(255) by default and I never bothered to override it.)
Since VARCHAR will only store the number of actual characters of the field (along with the field length), is there any distinct advantage (performance or otherwise) to using, say, VARCHAR(16) over VARCHAR(255)?
Additionally, most of these fields have indexes on them. Does a larger VARCHAR size on the field affect the size or performance of the index at all?
FYI I'm using MySQL 5.
Answered by Bill Karwin
In storage, VARCHAR(255) is smart enough to store only the length you need on a given row, unlike CHAR(255) which would always store 255 characters.
But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: as rows are copied from the storage engine layer to the SQL layer, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. So the strings in memory become padded out to the maximum length of your declared VARCHAR column.
When your query implicitly generates a temporary table, for instance while sorting or GROUP BY, this can use a lot of memory. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large.
You may also like to know that this "padding out" behavior means that a string declared with the utf8 character set pads out to three bytes per character even for strings you store with single-byte content (e.g. ascii or latin1 characters). And likewise utf8mb4 character set causes the string to pad out to four bytes per character in memory.
So a VARCHAR(255) in utf8 storing a short string like "No opinion" takes 11 bytes on disk (ten lower-charset characters, plus one byte for length) but it takes 765 bytes in memory, and thus in temp tables or sorted results.
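To make that arithmetic concrete, here is a minimal Python sketch that reproduces the numbers in the answer. The function names and the character-set table are made up for illustration; it follows the answer's own accounting (one length byte on disk, padding to the declared maximum in memory) rather than inspecting MySQL itself.

```python
# Maximum bytes per character that the answer attributes to each
# character set when a value is padded to its declared width in memory.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def disk_bytes(value: str, length_prefix: int = 1) -> int:
    """Bytes stored on disk: the encoded string plus the length prefix."""
    return len(value.encode("utf-8")) + length_prefix


def padded_memory_bytes(declared_chars: int, charset: str) -> int:
    """Bytes used once the value is padded to the declared maximum."""
    return declared_chars * MAX_BYTES_PER_CHAR[charset]


print(disk_bytes("No opinion"))             # 11
print(padded_memory_bytes(255, "utf8"))     # 765
print(padded_memory_bytes(255, "utf8mb4"))  # 1020
print(padded_memory_bytes(16, "utf8"))      # 48
```

The last line shows what the same value would cost in memory if the column were declared as VARCHAR(16) instead.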
I have helped MySQL users who unknowingly created 1.5GB temp tables frequently and filled up their disk space. They had lots of VARCHAR(255) columns that in practice stored very short strings.
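A rough back-of-the-envelope estimate shows how quickly this adds up. The row and column counts below are invented for illustration (they are not the figures from those users' systems), and the sketch only counts the padded VARCHAR columns, ignoring every other kind of overhead.

```python
def temp_table_bytes(rows: int, varchar_columns: int,
                     declared_chars: int, bytes_per_char: int) -> int:
    """Padded size of the VARCHAR columns in a fixed-width temp table."""
    return rows * varchar_columns * declared_chars * bytes_per_char


# Hypothetical workload: 200,000 rows with ten utf8 string columns.
wide = temp_table_bytes(200_000, 10, 255, 3)   # all VARCHAR(255)
narrow = temp_table_bytes(200_000, 10, 32, 3)  # all VARCHAR(32)

print(f"{wide / 1e9:.2f} GB")    # 1.53 GB
print(f"{narrow / 1e9:.2f} GB")  # 0.19 GB
```

Under these assumptions the stored data is identical in both cases; only the declared width changes.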
It's best to define the column based on the type of data that you intend to store. Doing so lets you enforce application-related constraints, as other folks have mentioned. But it also has the physical benefit of avoiding the memory waste I described above.
It's hard to know what the longest postal address is, of course, which is why many people choose a long VARCHAR that is certainly longer than any address. And 255 is customary because it is the maximum length of a VARCHAR for which the length can be encoded with one byte. It was also the maximum VARCHAR length in MySQL older than 5.0.
Answered by shufler
Beyond the size and performance considerations of choosing a varchar size, and possibly more important now that storage and processing get cheaper every second, the disadvantage of using varchar(255) "just because" is reduced data integrity.
Defining maximum limits for strings is a good thing to do. It prevents longer-than-expected strings from entering the RDBMS and causing buffer overruns or exceptions/errors later, when values that are longer (more bytes) than expected are retrieved from the database and parsed.
For example, if you have a field that accepts two-character strings for country abbreviations then you have no conceivable reason to expect your users (in this context, programmers) to input full country names. Since you don't want them to enter "Antigua and Barbuda" (AG) or "Heard Island and McDonald Islands" (HM), you don't allow it at the database layer. Also, it is likely some programmers have not yet RTFMed the design documentation (which surely exists) to know not to do this.
Set the field to accept two characters and let the RDBMS deal with it (either gracefully by truncating or ungracefully by rejecting their SQL with an error).
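The two outcomes can be modelled in a few lines of Python. This is only a toy model of the behaviour described above, not MySQL code: in MySQL, strict SQL mode rejects an over-long value with an error, while non-strict mode truncates it with a warning. The function and exception names are hypothetical.

```python
class DataTooLongError(ValueError):
    """Stands in for the error a database raises for over-long data."""


def store_varchar(value: str, max_chars: int, strict: bool = True) -> str:
    """Return the value as a VARCHAR(max_chars) column would keep it."""
    if len(value) <= max_chars:
        return value
    if strict:
        # Ungraceful path: reject the statement outright.
        raise DataTooLongError(f"value longer than {max_chars} characters")
    # Graceful path: silently keep only the leading characters.
    return value[:max_chars]


print(store_varchar("AG", 2))                                 # AG
print(store_varchar("Antigua and Barbuda", 2, strict=False))  # An
try:
    store_varchar("Antigua and Barbuda", 2)
except DataTooLongError as exc:
    print("rejected:", exc)
```

Note that truncation keeps "An", which is not the code for Antigua and Barbuda, so rejecting with an error is usually the safer of the two.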
Examples of real data that has no reason to exceed a certain length:
- Canadian postal codes are of the format A1A1A1 and are always 6 characters in length, even for Santa Claus (6 characters excludes the space that can be specified for legibility).
- Email addresses: up to 64 bytes before the @, up to 255 bytes after. Never more, lest you break the Internet.
- North American phone numbers are never more than 10 digits (excluding the country code).
- Computers running (recent versions of) Windows cannot have computer names longer than 63 bytes, though more than 15 is not recommended and will break your Windows NT server farm.
- State abbreviations are 2 characters (like the country codes in the example above).
- UPS tracking numbers are either 18, 12, 11, or 9 characters long. The 18-character numbers start with "1Z" and the 11-character numbers start with "T", which makes you wonder how they deliver all those packages if they don't know the difference between letters and numbers.
And so on...
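Limits like these are easy to write down once and check in application code as well. The sketch below is one possible way to do that in Python; the field names are hypothetical, the lengths are the ones listed above, and it counts characters, which equals bytes only for ASCII input.

```python
import re

# Letter-digit-letter-digit-letter-digit, with the optional space removed.
CANADIAN_POSTAL_CODE = re.compile(r"[A-Za-z]\d[A-Za-z]\d[A-Za-z]\d")

# Maximum lengths taken from the list above (field names are made up).
MAX_LENGTHS = {
    "postal_code": 6,
    "email_local_part": 64,
    "email_domain": 255,
    "phone_number": 10,
    "state": 2,
}


def too_long(record: dict) -> list:
    """Return the names of fields whose values exceed their limit."""
    return [name for name, value in record.items()
            if name in MAX_LENGTHS and len(value) > MAX_LENGTHS[name]]


def is_canadian_postal_code(code: str) -> bool:
    """Check the A1A1A1 format, ignoring the space used for legibility."""
    return CANADIAN_POSTAL_CODE.fullmatch(code.replace(" ", "")) is not None


print(too_long({"state": "California", "phone_number": "5551234567"}))
print(is_canadian_postal_code("H0H 0H0"))
```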
Take the time to think about your data and its limits. If you're an architect, developer, or programmer, it's your job, after all.
By using a varchar(n) instead of varchar(255) you eliminate the problem where users (end-users, programmers, other programs) enter unexpectedly long data that will come back to haunt your code later.
And I didn't say you shouldn't also implement this restriction in the business logic code used by your application.
Answered by S.Lott
I'm with you. Fussy attention to detail is a pain in the neck and has limited value.
Once upon a time, disk was a precious commodity and we used to sweat bullets to optimize it. The price of storage has fallen by a factor of 1,000, making the time spent on squeezing every byte less valuable.
If you use only CHAR fields, you can get fixed-length rows. This can save some disk real estate if you picked accurate sizes for fields. You might get more densely-packed data (fewer I/Os for table scans) and faster updates (easier to locate open spaces in a block for updates and inserts).
However, if you over-estimate your sizes, or your actual data sizes are variable, you'll wind up wasting space with CHAR fields. The data will wind up less densely packed (leading to more I/O's for big retrievals).
Generally, the performance benefits from attempting to put a size on variable fields are minor. You can easily benchmark by using VARCHAR(255) compared with CHAR(x) to see if you can measure the difference.
However, sometimes, I need to provide a "small", "medium", "large" hint. So I use 16, 64, and 255 for the sizes.
Answered by Will Hartung
Nowadays, I can't imagine it really matters any more.
There's a computational overhead to using variable length fields, but with the excess CPU capacity available today, it's not even worth considering. I/O systems are so slow as to make any computational cost of handling varchars effectively non-existent. In fact, the computational price of a varchar is probably more than repaid by the disk space saved by using variable length fields over fixed length fields. You most likely have greater row density.
Now, the complexity of varchar fields is that you can't easily locate a record via its record number. When you have a fixed length row size (with fixed length fields), it's trivial to compute the disk block that a row id points to. With a variable length row size, that kind of goes out the window.
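The fixed-length case really is just integer arithmetic. Here is a small Python sketch of it, assuming rows are numbered from zero, stored back to back, and never split across blocks; the function name and the sizes are illustrative and do not describe any particular storage engine.

```python
def locate_fixed_row(row_number: int, row_size: int, block_size: int):
    """Return (block index, byte offset in block) for a fixed-length row."""
    rows_per_block = block_size // row_size
    block = row_number // rows_per_block
    offset = (row_number % rows_per_block) * row_size
    return block, offset


# 100-byte rows in 4096-byte blocks: 40 rows fit in each block.
print(locate_fixed_row(0, 100, 4096))     # (0, 0)
print(locate_fixed_row(41, 100, 4096))    # (1, 100)
print(locate_fixed_row(1000, 100, 4096))  # (25, 0)
```

With variable-length rows there is no such formula, because the position of a row depends on the lengths of all the rows stored before it.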
So, now you need to maintain some kind of record number index, just like any other primary key, OR you need to make a robust row identifier that encodes details (such as the block, etc.) into the identifier. If you do that, though, the id would have to be recalculated if ever the row is moved on persistent storage. No big deal, you just need to rewrite all of the index entries and make sure that you either a) never expose it to the consumer or b) never assert that the number is reliable.
But since we have varchar fields today, the only value of varchar(16) over varchar(255) is that the DB will enforce the 16 char limit on the varchar(16). If the DB model is supposed to be actually representative of the physical data model, then having field lengths can be of value. If, however, it's simply "storage" rather than a "model AND storage", there's no need whatsoever.
Then you simply need to discern between a text field that is indexable (such as varchar) vs something that is not (like a text or CLOB field). The indexable fields tend to have a limit on size to facilitate the index whereas the CLOB fields do not (within reason).
Answered by BradC
In my experience, if you allow a datatype of 255 characters, some stupid user (or some experienced tester) will actually fill that up.
Then you have all sorts of problems, including how much space you allow for those fields in reports and on-screen displays in your application. Not to mention the possibility of exceeding the per-row limit for data in your database (if you had more than a few of these 255 character fields).
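For MySQL, the per-row limit can be estimated with simple arithmetic. The sketch below assumes MySQL's documented 65,535-byte maximum row size and counts each VARCHAR at its declared maximum plus one or two length bytes; it ignores NULL flags, other column types, and any tighter limits a storage engine imposes, so treat the results as rough upper bounds.

```python
MYSQL_MAX_ROW_BYTES = 65_535  # documented MySQL row-size limit


def varchar_row_bytes(declared_chars: int, bytes_per_char: int) -> int:
    """Worst-case bytes one VARCHAR column counts toward the row size."""
    data = declared_chars * bytes_per_char
    length_prefix = 1 if data <= 255 else 2
    return data + length_prefix


def max_columns(declared_chars: int, bytes_per_char: int) -> int:
    """How many such columns fit under the row-size limit."""
    return MYSQL_MAX_ROW_BYTES // varchar_row_bytes(declared_chars,
                                                    bytes_per_char)


print(max_columns(255, 3))  # utf8 VARCHAR(255): 85
print(max_columns(255, 4))  # utf8mb4 VARCHAR(255): 64
```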
Much easier to pick a reasonable limit at the beginning, then enforce that through the application and database.
Answered by staabm
In a MySQL context this can become important when working with indexes on said varchar columns, as MySQL has a maximum limit of 767 bytes per index row.
This means that when adding an index across several varchar(255) columns you can reach this limit rather quickly, and even faster on utf8 or utf8mb4 columns, as pointed out in the answers above.
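A quick calculation shows how little room that leaves. This Python sketch takes the 767-byte figure from the answer at face value (the real limit depends on the storage engine, row format and MySQL version) and counts only the maximum character bytes of each indexed column; the function names are made up.

```python
INDEX_KEY_LIMIT_BYTES = 767  # limit cited in the answer above


def index_key_bytes(columns) -> int:
    """Worst-case key size for (declared_chars, bytes_per_char) columns."""
    return sum(chars * bytes_per_char for chars, bytes_per_char in columns)


def fits_in_index(columns) -> bool:
    return index_key_bytes(columns) <= INDEX_KEY_LIMIT_BYTES


def max_indexable_chars(bytes_per_char: int) -> int:
    """Longest single column that still fits under the limit."""
    return INDEX_KEY_LIMIT_BYTES // bytes_per_char


print(fits_in_index([(255, 3)]))            # one utf8 VARCHAR(255): True
print(fits_in_index([(255, 3), (255, 3)]))  # two of them: False
print(fits_in_index([(255, 4)]))            # one utf8mb4 VARCHAR(255): False
print(max_indexable_chars(4))               # 191
```

Under this limit, a single utf8mb4 column can be fully indexed only if it is declared with 191 characters or fewer.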
Answered by dove
It's good practice to allocate only a little over what you need. Phone numbers would never go this large.
One reason is that unless you validate against large entries, no doubt someone will use all there is. Then you might run out of space in your row. I'm not sure about MySQL limit but 8060 is the max rowsize in MS SQL.
A more normal default would be 50 imho, and then increase where need proves it.