JSoup: get wikipedia page summary
I used the MediaWiki API to fetch a Wikipedia page. After getting the HTML content, I tried the selector
p:not(h2 ~ p)
to get the page summary paragraphs, i.e. the paragraphs that come before the table of contents element. It returns the part I want, but also additional paragraphs. Where is the problem?
html css-selectors jsoup
asked Nov 23 '18 at 2:23 by Amr Lotfy
edited Nov 24 '18 at 14:27 by Amr Lotfy
2 Answers
p:not(h2 ~ p) matches every paragraph on the page that has no h2 before it in the same parent. That includes nested paragraphs, paragraphs outside the main content altogether, and so on, because none of those paragraphs share a parent element with an h2. You don't want those; you only want the paragraphs that precede the first h2 within the main content element.
For that, anchor the outer p selector to the parent element. The parent element you want is .mw-parser-output:
.mw-parser-output > p:not(h2 ~ p)
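To make the difference concrete, here is a minimal sketch that models the two selectors on a toy element tree using only the Java standard library. It does not call jsoup, and it is not how jsoup evaluates selectors; the Node class, the method names, and the sample page are all hypothetical and exist only to illustrate the sibling rule described above.

```java
import java.util.ArrayList;
import java.util.List;

class SelectorDemo {
    // Toy element (hypothetical, not jsoup's API): tag, optional class, label, children.
    static class Node {
        final String tag, cssClass, label;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String cssClass, String label, Node... kids) {
            this.tag = tag;
            this.cssClass = cssClass;
            this.label = label;
            for (Node k : kids) children.add(k);
        }
    }

    // True when siblings.get(i) is a <p> with no earlier <h2> sibling,
    // which is the condition p:not(h2 ~ p) checks for each paragraph.
    static boolean isPNotAfterH2(List<Node> siblings, int i) {
        if (!siblings.get(i).tag.equals("p")) return false;
        for (int j = 0; j < i; j++) {
            if (siblings.get(j).tag.equals("h2")) return false;
        }
        return true;
    }

    // Models "p:not(h2 ~ p)": every parent in the tree is considered.
    static void selectUnanchored(Node parent, List<String> out) {
        for (int i = 0; i < parent.children.size(); i++) {
            Node child = parent.children.get(i);
            if (isPNotAfterH2(parent.children, i)) out.add(child.label);
            selectUnanchored(child, out);
        }
    }

    // Models ".mw-parser-output > p:not(h2 ~ p)": only direct children of
    // elements carrying the given class are candidates.
    static void selectAnchored(Node node, String cssClass, List<String> out) {
        if (cssClass.equals(node.cssClass)) {
            for (int i = 0; i < node.children.size(); i++) {
                if (isPNotAfterH2(node.children, i)) out.add(node.children.get(i).label);
            }
        }
        for (Node child : node.children) selectAnchored(child, cssClass, out);
    }

    // A made-up page: summary, a heading, section text, a nested paragraph, a footer.
    static Node samplePage() {
        return new Node("body", null, "body",
            new Node("div", "mw-parser-output", "content",
                new Node("p", null, "summary"),
                new Node("h2", null, "History"),
                new Node("p", null, "history text"),
                new Node("div", null, "thumb",
                    new Node("p", null, "nested caption"))),
            new Node("div", "footer", "footer",
                new Node("p", null, "footer note")));
    }

    public static void main(String[] args) {
        List<String> loose = new ArrayList<>();
        selectUnanchored(samplePage(), loose);
        System.out.println(loose);    // [summary, nested caption, footer note]

        List<String> anchored = new ArrayList<>();
        selectAnchored(samplePage(), "mw-parser-output", anchored);
        System.out.println(anchored); // [summary]
    }
}
```

The nested caption and the footer note slip through the unanchored version because their own parents contain no h2 at all, so nothing excludes them. Restricting candidates to direct children of the content element leaves only the summary.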
answered Nov 23 '18 at 5:55 by BoltClock♦
edited Nov 24 '18 at 21:24 by Amr Lotfy
After testing, I found that changing #toc to h2 covers summary extraction for a broader range of pages, such as en.wikipedia.org/wiki/Philosophy (which would not be perfectly covered with #toc).
– Amr Lotfy, Nov 24 '18 at 14:34
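Code:
import java.io.IOException;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class WikiSummary {
    public static void main(String[] args) {
        String url = "https://en.wikipedia.org/wiki/Nico_Ditch";
        Document doc;
        try {
            // Fetch and parse the page; the URL also serves as the base URI
            doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", url);
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to select from if the download failed
        }
        // Direct <p> children of the content element with no preceding <h2> sibling
        Elements els = doc.select(".mw-parser-output > p:not(h2 ~ p)");
        System.out.println(els);
    }
}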
Run output:
<p class="mw-empty-elt"> </p>
<p><b>Nico Ditch</b> is a six-mile (9.7 km) long linear <a href="/wiki/Earthworks_(archaeology)" title="Earthworks (archaeology)">earthwork</a> between <a href="/wiki/Ashton-under-Lyne" title="Ashton-under-Lyne">Ashton-under-Lyne</a> and <a href="/wiki/Stretford" title="Stretford">Stretford</a> in Greater Manchester, England. It was dug as a defensive fortification, or possibly a boundary marker, between the 5th and 11th centuries. </p>
<p>The ditch is still visible in short sections, such as a 330-yard (300 m) stretch in <a href="/wiki/Denton,_Greater_Manchester" title="Denton, Greater Manchester">Denton</a> Golf Course. In the parts which survive, the ditch is 4–5 yards (3.7–4.6 m) wide and up to 5 feet (1.5 m) deep. Part of the earthwork is protected as a <a href="/wiki/Scheduled_Ancient_Monument" class="mw-redirect" title="Scheduled Ancient Monument">Scheduled Ancient Monument</a>. </p>
Process finished with exit code 0
answered Nov 23 '18 at 11:33 by bluetata
edited Nov 28 '18 at 9:17 by Amr Lotfy