Is there a URL parser in Java?

Note: This page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/5869295/

Date: 2020-10-30 13:12:49  Source: igfitidea


Tags: java, parsing, url, extract

Asked by Renato Dinhani

I know there is a URL class in Java, but I need methods to get the file extension of the page (html, php, asp, etc.), the country of the domain (ca, au, br, jp, fr, etc.), the type of the page (.net, .org, .gov, etc.), and others. I implemented some of these with string handling, but I think a class written specifically for this would be more reliable.


Accepted answer by Codemwnci

I am not sure there is a specific class to do what you are asking. Take a look at the URL class first, and the post below.


Could you share a link to an URL parsing implementation?


I think you will need to combine the data returned by the URL class with your own parsing to get the small pieces of data it does not expose. This should be fairly simple, though: it sounds like what you want is everything after the last dot in the host and in the path (if they actually exist, which is not guaranteed).

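The approach described in this answer can be sketched roughly as follows. This is only a minimal illustration, not the answerer's own code: it uses `java.net.URI` to split the URL and then takes the text after the last dot of the host and of the last path segment. The class name `UrlParts` and the helpers `afterLastDot`, `topLevelDomain`, and `fileExtension` are hypothetical names chosen for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlParts {

    // Returns the text after the last dot, or "" if there is no dot
    // (or nothing follows it). Hypothetical helper for this sketch.
    static String afterLastDot(String s) {
        if (s == null) {
            return "";
        }
        int dot = s.lastIndexOf('.');
        return (dot < 0 || dot == s.length() - 1) ? "" : s.substring(dot + 1);
    }

    // Last label of the host, e.g. "br" for "www.example.com.br".
    static String topLevelDomain(URI uri) {
        return afterLastDot(uri.getHost());
    }

    // Extension of the last path segment, e.g. "php" for "/docs/index.php".
    // Returns "" when the last segment has no dot, which is common.
    static String fileExtension(URI uri) {
        String path = uri.getPath();
        if (path == null) {
            return "";
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        return afterLastDot(lastSegment);
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("http://www.example.com.br/docs/index.php?x=1");
        System.out.println(topLevelDomain(uri)); // br
        System.out.println(fileExtension(uri));  // php
    }
}
```

Both helpers return an empty string when there is nothing to extract, which matches the caveat that these parts are not guaranteed to exist. Note that a numeric host such as an IP address would still yield its last octet from `topLevelDomain`, so real code would likely need an extra check.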

Answered by nostromo

I created a simple Java class that makes URL parsing in Java much easier.


https://github.com/juliuss/urlplus


It can be used to build URLs and modify them programmatically. It also handles relative URLs.


You can see from the unit test that it is very comprehensive:


// build a URL (URL here is the library's own class, not java.net.URL)
URL u = new URL("http://www.shopobot.com/?search=ipod");

// check the parts of the url were set correctly
assertEquals(u.getProtocol().name(), "http");

u.setFragment("login");
assertEquals(u, "http://www.shopobot.com/?search=ipod#login");

// add a parameter
u.addParameter("abc", "123");
assertEquals(u, "http://www.shopobot.com/?search=ipod&abc=123#login");

// add a duplicate parameter
u.addParameter("abc", "456");
assertEquals(u, "http://www.shopobot.com/?search=ipod&abc=123&abc=456#login");

// remove a parameter
u.removeParameter("search");
assertEquals(u, "http://www.shopobot.com/?abc=123&abc=456#login");

// reset fragment
u.setFragment("");
assertEquals(u, "http://www.shopobot.com/?abc=123&abc=456");

// test an encoded parameter
u.addParameter("encoding", "this code = awesome!");
assertEquals(u, "http://www.shopobot.com/?abc=123&abc=456&encoding=this+code+%3D+awesome%21");

// remove both duplicate parameters
u.removeParameter("abc");
assertEquals(u, "http://www.shopobot.com/?encoding=this+code+%3D+awesome%21");

// change host and port
u.setHost("localhost").setPort(8080);
assertEquals(u, "http://localhost:8080/?encoding=this+code+%3D+awesome%21");

// remove a parameter and add a page number (int parameter)
u.removeParameter("encoding").addParameter("page", 2);
assertEquals(u, "http://localhost:8080/?page=2");

// set the path
u.setPath("electronics/");
assertEquals(u, "http://localhost:8080/electronics/?page=2");
u.setPath("/electronics/");
assertEquals(u, "http://localhost:8080/electronics/?page=2");

// increment a parameter 3 times
u.incrementParameter("page").incrementParameter("page").incrementParameter("page");
assertEquals(u, "http://localhost:8080/electronics/?page=5");
// make sure the correct page number is returned
assertEquals(u.getParameter("page", 1), 5);

// set the page number to 2, then decrement it to 1;
// since 1 is considered the default, the parameter is removed completely
u.setParameter("page", 2).decrementParameter("page");
assertEquals(u, "http://localhost:8080/electronics/");

// make sure that page will not be decremented since we're at 1
u.decrementParameter("page");
assertEquals(u, "http://localhost:8080/electronics/");

// test that defaults work
assertEquals(u.getParameter("page", 1), 1);
assertEquals(u.getParameter("page", 10), 10);

// test relative paths
u.setPath("/electronics/photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/")), "electronics/photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/electronics/")), "photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/electronics/photography/")), "");
assertEquals(u.toStringRelative(new URL("http://localhost:8080/electronics/mp3-players/")), "../photography/");
// make sure that when the paths match but the authority doesn't, the full url is returned
assertEquals(u.toStringRelative(new URL("http://www.shopobot.com/electronics/photography/")), "http://localhost:8080/electronics/photography/");
assertEquals(u.toStringRelative(new URL("http://localhost:80/electronics/photography/")), "http://localhost:8080/electronics/photography/");
assertEquals(u.toStringRelative(new URL("https://localhost:8080/electronics/photography/")), "http://localhost:8080/electronics/photography/");

// try some more complicated relative paths
u.setHost("x.com").setPath("/a/b/c/d/e.html").setPort(80);
assertEquals(u.toStringRelative(new URL("http://x.com/")),"a/b/c/d/e.html");
assertEquals(u.toStringRelative(new URL("http://x.com/a/b")),"c/d/e.html");
assertEquals(u.toStringRelative(new URL("http://x.com/a/b?q=1")),"/a/b/c/d/e.html");
u.addParameter("q", 1);
assertEquals(u.toStringRelative(new URL("http://x.com/a/b/c/d/e.html")),"?q=1");
assertEquals(u.toStringRelative(new URL("http://x.com/a/b/c/d/e/f/g/h")),"../../../../e.html?q=1");
assertEquals(u.toStringRelative(new URL("x.com/x/y/z/")),"../../../a/b/c/d/e.html?q=1");
assertEquals(u.toStringRelative(new URL("x.com/a/b/c/d/x/y/e.html")),"../../../e.html?q=1");
u.addParameter("f", "a b c");
assertEquals(u.toStringRelative(new URL("x.com/a/b/c/d/x/y/e.html")),"../../../e.html?q=1&f=a+b+c");
u.setFragment("hello").removeParameter("f");
assertEquals(u.toStringRelative(new URL("x.com/a/b/c/d/x/y/e.html")),"../../../e.html?q=1#hello");
assertEquals(u.toStringFull(),"/a/b/c/d/e.html?q=1#hello");

//test parameters with relative paths
u = new URL("facebook.com");
u.addParameter("test", "hi");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi")),"");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi&hello=hey")),"?test=hi");
u.addParameter("hello", "hey");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi&hello=hey")),"");
assertEquals(u.toStringRelative(new URL("facebook.com/?test=hi&hello=hey#wow")),"?test=hi&hello=hey");
assertEquals(u.toStringRelative(new URL("facebook.com/")),"?test=hi&hello=hey");
assertEquals(u.toStringRelative(new URL("facebook.com/#yo")),"?test=hi&hello=hey");
u = new URL("facebook.com/#yo");
assertEquals(u.toStringRelative(new URL("facebook.com/")),"#yo");

//test relative paths with parameter changes
u = new URL("example.com/?param=1");
assertEquals(u.toStringRelative(new URL("example.com/?param=2")),"?param=1");
u = new URL("example.com/?param=1&param=2");
assertEquals(u.toStringRelative(new URL("example.com/?param=1&param=4")),"?param=1&param=2");
u.removeParameter("param");
assertEquals(u.toStringRelative(new URL("example.com/?param=1&param=4")),"/");

// build a new URL to test empty and null parameter values
u = new URL("http://www.google.com/");
u.addParameter("test", "");
assertEquals(u, "http://www.google.com/?test");
assertEquals(u.getParameter("test", "this is not returned"), "");
u.addParameter("this is a test", null);
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.addParameter("", "");
assertEquals(u, "http://www.google.com/?test&this+is+a+test");    
u.addParameter(null, "");
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.addParameter("", null);
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.addParameter(null, null);
assertEquals(u, "http://www.google.com/?test&this+is+a+test");
u.removeParameter("this is a test");
assertEquals(u, "http://www.google.com/?test");
u.removeParameter("");
assertEquals(u, "http://www.google.com/?test");
String[] nullGuy = null;
u.removeParameter(nullGuy);
assertEquals(u, "http://www.google.com/?test");
u.removeParameter("test");
assertEquals(u, "http://www.google.com/");
u.addParameter(" "," ");
assertEquals(u, "http://www.google.com/?+=+");
u.addParameter("+","+");
assertEquals(u.getParameter("+", ""), "+");
assertEquals(u, "http://www.google.com/?+=+&%2B=%2B");
u.removeParameter(" ").removeParameter("+");
assertEquals(u, "http://www.google.com/");

//test fragment encoding
u.setFragment("short");
assertEquals(u, "http://www.google.com/#short");
assertEquals(u.getFragment(),"short");
u.setFragment("/this/is/a/#/<long>/( fragment )/");
assertEquals(u, "http://www.google.com/#/this/is/a/%23/%3Clong%3E/(+fragment+)/");
u.setFragment(null);
assertEquals(u, "http://www.google.com/");

u = new URL("www.wikipedia.org/wiki/USA");
assertEquals(u.matchesAuthority("org"), true);
assertEquals(u.matchesAuthority(".org"), true);
assertEquals(u.matchesAuthority("pedia.com"), false);
assertEquals(u.matchesAuthority("wikipedia.org"), true);
assertEquals(u.matchesAuthority("uwikipedia.org"), false);
assertEquals(u.matchesAuthority(".wikipedia.org"), true);
assertEquals(u.matchesAuthority("en.wikipedia.org"), false);
u.setHost("sub.en.wiki.com");
assertEquals(u.matchesAuthority("com"), true);
assertEquals(u.matchesAuthority("wiki.com"), true);
assertEquals(u.matchesAuthority("en.wiki.com"), true);
assertEquals(u.matchesAuthority("sub.en.wiki.com"), true);
assertEquals(u.matchesAuthority("asub.en.wiki.com"), false);
assertEquals(u.matchesAuthority("a.sub.en.wiki.com"), false);
assertEquals(u.matchesAuthority("sub.en.wiki.com","asub.en.wiki.com"), true);
assertEquals(u.matchesAuthority("a.sub.en.wiki.com","asub.en.wiki.com"), false);

//test no protocol on factory style methods
u = URL.get("www.wikipedia.org/wiki/USA");
u = URL.get("www.wikipedia.org/wiki/USA", u);

u = new URL("shopobot.com");
u.setParameter("will this <#> be encoded?","we've gone batshit crazy! seriously!");
u.setFragment("what's our # again?");
assertEquals(u.toString(),"http://shopobot.com/?will+this+%3C%23%3E+be+encoded%3F=we%27ve+gone+batshit+crazy%21+seriously%21#what's+our+%23+again?");
assertEquals(u.getParameter("will this <#> be encoded?", ""), "we've gone batshit crazy! seriously!");
assertEquals(u.getFragment(), "what's our # again?");

u = new URL("www.en.shopobot.com");
assertEquals(u.getAuthoritySize(), 4);
assertEquals(u.getAuthority(-1),"");
assertEquals(u.getAuthority(0),"");
assertEquals(u.getAuthority(1),"com");
assertEquals(u.getAuthority(2),"shopobot.com");
assertEquals(u.getAuthority(3),"en.shopobot.com");
assertEquals(u.getAuthority(4),"www.en.shopobot.com");
assertEquals(u.getAuthority(5),"www.en.shopobot.com");

u = new URL("en.wikipedia.org:90210/a/b/c/d/e.html?test=true");
assertEquals(u.getChildDirectory("a"),"b");
assertEquals(u.getChildDirectory("b"),"c");
assertEquals(u.getChildDirectory("c"),"d");
assertEquals(u.getChildDirectory("d"),"e.html");
assertEquals(u.getChildDirectory("e.html"),"");
assertEquals(u.getChildDirectory("g"),"");

assertEquals(u.getParentDirectory("a"),"");
assertEquals(u.getParentDirectory("b"),"a");
assertEquals(u.getParentDirectory("c"),"b");
assertEquals(u.getParentDirectory("d"),"c");
assertEquals(u.getParentDirectory("e.html"),"d");
assertEquals(u.getParentDirectory("e"),"");

//test relative url creation
u = new URL("http://www.example.com");
URL u2 = u.resolveRelative("q.html");
assertEquals(u2.toString(), "http://www.example.com/q.html");
u2 = u.resolveRelative("/q.html");
assertEquals(u2.toString(), "http://www.example.com/q.html");
u = new URL("http://www.example.com/abc/");
u2 = u.resolveRelative("q.html");
assertEquals(u2.toString(), "http://www.example.com/abc/q.html"); 

Answered by Ernest Friedman-Hill

No, there's no such class. Some of these things (country code) are ill-posed and ambiguous, and often can't be determined from the URL alone. They're not parsing so much as lookup or inference. Other things (file extension) are not defined for most pages.

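To illustrate why the country code is more inference than parsing, here is a small sketch under stated assumptions. It treats a two-letter last host label as a country-code TLD, since two-letter TLDs are by convention reserved for country codes, and returns nothing otherwise. The class `UrlGuess` and the method `countryCodeGuess` are hypothetical names for this example, and the result says nothing about where a site is actually hosted or operated.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;

class UrlGuess {

    // Guesses a country code from the last host label.
    // Assumption: a two-letter ASCII label is a country-code TLD;
    // anything else (com, org, an IP address, no host) yields empty.
    static Optional<String> countryCodeGuess(URI uri) {
        String host = uri.getHost();
        if (host == null) {
            return Optional.empty();
        }
        String tld = host.substring(host.lastIndexOf('.') + 1);
        return tld.matches("[A-Za-z]{2}")
                ? Optional.of(tld.toLowerCase(Locale.ROOT))
                : Optional.empty();
    }

    public static void main(String[] args) {
        // A two-letter TLD is found, but it does not tell you where the site really is.
        System.out.println(countryCodeGuess(URI.create("http://bit.ly/abc")));
        // prints Optional[ly]

        // A generic TLD carries no country information at all.
        System.out.println(countryCodeGuess(
                URI.create("http://stackoverflow.com/questions/5869295/")));
        // prints Optional.empty
    }
}
```

Returning an `Optional` makes the "often can't be determined" case explicit to callers. The second URL also has no file extension in its path, which is the other situation this answer warns about.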