Jsoup: trouble extracting formatting from HTML tables
<tr>
<th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order
Theorems</span>
</th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>
</th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>
</th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>
</th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>
</th></tr>
So let's say I want to extract bgcolor, align, and the text contained in the span. For the first cell, for example, that would be GREY, LEFT, Higher-order Theorems.
At the very least I want to extract bgcolor, but ideally all three. How would I do so?
I've tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor), as well as doc.select([style]), and all of them either returned no output or threw a parse error. I can extract the contents of the span just fine; the problem is also extracting bgcolor and align.
java html jsoup
asked Nov 25 '18 at 18:05
Beat.
For further information or a more detailed answer, you should update your question and add the URL of the site you are trying to scrape.
– alvarobartt
Nov 26 '18 at 15:37
1 Answer
Parse the HTML you want to scrape with Jsoup, select the th elements, and read each attribute you need with the attr method of Element. It returns the value of that attribute for each th, or an empty string if the attribute is absent. To also retrieve the text inside the span tags, select the nested span within the th and call .text().
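import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html holds the HTML you want to parse, as a String
Document document = Jsoup.parse(html);
System.out.println(document); // print the parsed document to check it
// every th that is a direct child of a tr
Elements elements = document.select("tr > th");
for (Element element : elements) {
    String align = element.attr("align");
    String color = element.attr("bgcolor");
    String spanText = element.select("span").text();
    System.out.println("Align is " + align +
        "\nBackground Color is " + color +
        "\nSpan Text is " + spanText);
}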
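As a self-contained illustration of the same idea, here is a minimal sketch that also uses the attribute selector th[bgcolor], which appears to be what the attempts in the question were aiming for. The class name HeaderCellAttributes and the method describe are made up for this example. It assumes you only have the bare row as a string; in that case the row is wrapped in table tags first, because Jsoup's HTML parser, as far as I know, discards tr and th tags that are not inside a table. If you parse the whole page, the wrapping is not needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class HeaderCellAttributes {

    // Hypothetical helper: returns one "bgcolor,align,spanText" line
    // per <th> that carries a bgcolor attribute.
    static List<String> describe(String rowHtml) {
        // Assumption: rowHtml is a bare <tr>...</tr>, so wrap it in <table>
        // to keep the row and cell tags during parsing.
        Document doc = Jsoup.parse("<table>" + rowHtml + "</table>");
        List<String> lines = new ArrayList<>();
        // th[bgcolor] matches only <th> elements that have a bgcolor attribute.
        for (Element th : doc.select("th[bgcolor]")) {
            String bgcolor = th.attr("bgcolor");
            String align = th.attr("align"); // empty string when the attribute is absent
            String spanText = th.select("span").text();
            lines.add(bgcolor + "," + align + "," + spanText);
        }
        return lines;
    }

    public static void main(String[] args) {
        String row = "<tr>"
            + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
            + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
            + "</tr>";
        describe(row).forEach(System.out::println);
    }
}
```

For the first cell this gives GREY,LEFT,Higher-order Theorems. The second cell has no align attribute, so that field is empty.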
For any further information feel free to ask me! Hope this helped you!
Updated Answer to comment:
To get the full text of a cell (for example "Satallax 3.2" rather than just "3.2"), use this line inside the for-each loop:
String fullText = element.text();
That way you get all the text contained between the selected Element tags, but you should look up this blog and fit your desired query to it. You will probably also need to check whether the String is empty and run a separate query for each possible case, using if conditionals.
That means one query for the structure tr > th > span, another for tr > th > em, and another for tr > th.
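To make the case distinction concrete, here is a rough sketch of one way to do it. In the prover cells the name sits in the a element and the version in the span, which is why selecting only span yields 3.2 for Satallax 3.2. The class HeaderCellText and its methods are invented for this example. It assumes a Jsoup version that provides Element.selectFirst (1.11.1 or later, I believe) and, as a further assumption, that the row is available as a bare string and is therefore wrapped in table tags before parsing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class HeaderCellText {

    // Hypothetical helper: returns the visible label of one header cell.
    static String label(Element th) {
        Element link = th.selectFirst("a");
        Element span = th.selectFirst("span");
        if (link != null && span != null) {
            // Prover cells: name in <a>, version in <span>.
            return link.text() + " " + span.text();
        }
        // Plain cells (th > span, or th with text only): take all text in the cell.
        return th.text();
    }

    // Hypothetical helper: labels of all header cells in a bare <tr>...</tr> string.
    static List<String> labels(String rowHtml) {
        // Assumption: wrap the bare row in <table> so the parser keeps <tr>/<th>.
        Document doc = Jsoup.parse("<table>" + rowHtml + "</table>");
        List<String> result = new ArrayList<>();
        for (Element th : doc.select("tr > th")) {
            result.add(label(th));
        }
        return result;
    }

    public static void main(String[] args) {
        String row = "<tr>"
            + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
            + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
            + "</tr>";
        labels(row).forEach(System.out::println);
    }
}
```

Checking for the a element first keeps the logic to a single branch. Cells that have neither a link nor a span fall through to element.text().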
The only problem this seems to have is that, for example, for Satallax 3.2 it only extracts 3.2. Any idea why? Other than that, the rest works fine. Thank you for your response.
– Beat.
Nov 26 '18 at 15:18
@Beat. I solved your new question. Look at it! Hope that helped you!
– alvarobartt
Nov 26 '18 at 15:34
edited Nov 26 '18 at 15:33
answered Nov 26 '18 at 11:04
alvarobartt