database: What's the best separator/delimiter character(s) for a plaintext db file?
Note: this page reproduces a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): Stack Overflow
Source: http://stackoverflow.com/questions/6319551/
Asked by Meng Lu
What's the best separator/delimiter character(s) for a plaintext db file?
I considered using |, ,, <TAB>, ;, etc., but they all seem liable to break when nearby entries contain special enough characters.
So, experienced database users, what delimiter character(s) do you suggest?
Accepted answer by p.campbell
No matter which character you choose as your separator, you'll want to escape any instance of that character in your data.
Perhaps tilde (~), or go to a high-ASCII character.
Either way, if there's any chance that it could sneak into your data, you'd want to escape it before writing to your plaintext file.
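To make the escaping advice concrete, here is a minimal sketch in Python. It assumes tilde as the delimiter and backslash as the escape character; the backslash and the helper names (escape_field, join_fields, split_fields) are my own choices for illustration and not something the answer specifies.

```python
DELIM = "~"   # assumed delimiter, following the tilde suggestion
ESC = "\\"    # assumed escape character (a single backslash)


def escape_field(value):
    """Escape the escape character first, then the delimiter."""
    return value.replace(ESC, ESC + ESC).replace(DELIM, ESC + DELIM)


def join_fields(fields):
    """Build one line of the plaintext file from raw field values."""
    return DELIM.join(escape_field(f) for f in fields)


def split_fields(line):
    """Reverse join_fields: split on unescaped delimiters and unescape."""
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == ESC and i + 1 < len(line):
            current.append(line[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == DELIM:
            fields.append("".join(current))  # unescaped delimiter ends the field
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


row = ["a~b", "path\\to", "plain"]
assert split_fields(join_fields(row)) == row
```

Note that this sketch only protects the field delimiter. If fields can contain newlines and you store one record per line, the newline would need the same treatment.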
Answer by Emis
I think the best way is to join strings with three cherries: '@@@'.
Answer by Michas
Well, there are a few separator characters in US-ASCII: hex 1c, 1d, 1e and 1f. Plain text shouldn't contain them.
Hex  Abbr  Symbol  Caret  Name
1c   FS    ␜       ^\     File Separator
1d   GS    ␝       ^]     Group Separator
1e   RS    ␞       ^^     Record Separator
1f   US    ␟       ^_     Unit Separator
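As a rough sketch of how these control characters could be used, the Python below joins fields with the Unit Separator (1f) and records with the Record Separator (1e). The function names dumps and loads are hypothetical, and rejecting fields that already contain a separator is just one possible policy.

```python
UNIT_SEP = "\x1f"    # US: separates fields within a record
RECORD_SEP = "\x1e"  # RS: separates records


def dumps(records):
    """Serialize a list of records (each a list of strings) into one string."""
    for record in records:
        for field in record:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(record) for record in records)


def loads(text):
    """Parse a string produced by dumps back into a list of records."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, pipes, tabs and even newlines in the data need no escaping here.
records = [["Smith, John", "a|b;c"], ["line one\nline two", "tab\there"]]
assert loads(dumps(records)) == records
```

Because the record separator is not the newline, such a file will not display one record per line in a text editor.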
Answer by japage
For a particular data warehousing situation where we had control over the source file, but escaping and qualifying were onerous, we were able to make the business decision that one extended ASCII character would be stripped from the data (if it ever occurs, which it hasn't).
On creation of the delimited source file, we stripped out any instances of █ (alt+219) in the data and used that character as the delimiter. Bonus: that character is really easy to spot.
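The strip-then-delimit approach described above might look like the following in Python. This is only a sketch: the helper names are made up, and unlike escaping it is lossy by design, since any █ in the data is silently dropped.

```python
DELIM = "\u2588"  # the █ (full block) character reserved as the delimiter


def write_row(fields):
    """Strip the reserved character from every field, then join the fields."""
    return DELIM.join(field.replace(DELIM, "") for field in fields)


def read_row(line):
    """Split a line written by write_row; no unescaping is needed."""
    return line.split(DELIM)


line = write_row(["plain", "has \u2588 inside", "a,b|c"])
assert read_row(line) == ["plain", "has  inside", "a,b|c"]
```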
Answer by Cesar Bourdain Costa
Personally, I like using ? as a delimiter character to split data in CSV files. I don't think I've ever found a naturally occurring instance of ? and ? personally, so here are my two cents about it.
Answer by Wouter
You could use the special separator characters (hex 1c to 1f), yet they are non-printable, and some technologies have issues processing data containing them.
So, plan B: if your data is in UTF-8, you could pick a random UTF-8 character that is extremely unlikely to appear in any source data you receive.
Yet, even then, if you want to be sure you won't run into issues, you had better always scan your entire dataset for this character, and if it appears, simply pick another UTF-8 character.
I tend to hate encapsulation with a passion, and avoid it whenever possible, as explained in my post under the chapter 'encapsulation' here: https://theonemanitdepartment.wordpress.com/2014/12/15/the-absolute-minimum-everyone-working-with-data-absolutely-positively-must-know-about-file-types-encoding-delimiters-and-data-types-no-excuses/
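The "scan the dataset, and pick another character if it appears" step could be sketched like this in Python. The function name pick_delimiter and the candidate list are hypothetical, and candidates are assumed to be single characters.

```python
def pick_delimiter(rows, candidates):
    """Return the first candidate character that occurs in no field of any row."""
    used = set()
    for row in rows:
        for field in row:
            used.update(field)  # add every character of the field to the set
    for candidate in candidates:
        if candidate not in used:
            return candidate
    raise ValueError("every candidate appears in the data")


# Hypothetical candidates: a snowman, a section sign, a double dagger.
candidates = ["\u2603", "\u00a7", "\u2021"]
rows = [["snow \u2603 man", "x"], ["y", "z"]]
delimiter = pick_delimiter(rows, candidates)
assert delimiter == "\u00a7"  # the snowman is taken, so the next candidate wins
```

The scan has to be repeated whenever new data arrives, since a character that was absent yesterday can show up in tomorrow's file.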
Answer by dim_user
I usually prefer non-printable characters like "\u0001"; for instance, I use this as a column delimiter in most of my Azure Data Analytics U-SQL Scripts. That is assuming you can use a multi-character custom delimiter.
Answer by Coder Absolute
Actually, it depends on the type of data you are trying to separate. We needed a separator for machine events data, and a couple of them were proposed:
=) or ^_^.
We chose ^_^ because it actually worked on the samples we tested, and it also looks cute!
Answer by Fierascu Gheorghe
I propose the interrobang character "‽". More details: https://en.wikipedia.org/wiki/Interrobang
Answer by svargh
If you have the option of a string as column separator, use "" as delimiter. You can make up any string for that matter, which gives you flexibility.

