Jsoup Trouble extracting formatting from html tables












<tr>

<th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order Theorems</span>

</th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>

</th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>

</th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>

</th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>

</th></tr>


So let's say I want to extract bgcolor, align, and the text contained in the span. For example: GREY,LEFT,Higher-order Theorems.

If I wanted to extract at the very least bgcolor, but ideally all three, how would I do so?



So I was attempting to extract just the bgcolor and



I've tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor) , as well as doc.select([style]), and all of them either returned no output or produced a parse error. I can extract the text in the span just fine; the problem is also extracting bgcolor and align.










  • For further information or a more detailed answer, you should update your question and add the URL of the site you are trying to scrape.

    – alvarobartt
    Nov 26 '18 at 15:37
















java html jsoup






asked Nov 25 '18 at 18:05









Beat.

1 Answer
You just need to parse the HTML you want to scrape with jsoup and then read the attributes you want from each th element using the attr method of Element, which returns the value of that attribute for the element. To also retrieve the text between the span tags, select the nested span inside the th and call .text().



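    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // html is a String holding the markup to parse. A bare <tr> fragment
    // should be wrapped in <table>...</table>, otherwise the HTML parser
    // discards the row and cell tags.
    Document document = Jsoup.parse(html);
    System.out.println(document); // optional: dump the parsed tree to check it
    Elements elements = document.select("tr > th");

    for (Element element : elements) {
        // attr() returns an empty String when the attribute is absent
        String align = element.attr("align");
        String color = element.attr("bgcolor");
        // text of every <span> nested inside this <th>
        String spanText = element.select("span").text();

        System.out.println("Align is " + align +
                "\nBackground Color is " + color +
                "\nSpan Text is " + spanText);
    }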
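As a side note on the selectors tried in the question: :contains(...) matches on an element's text, :containsData(...) on script or style data, and [style] on a style attribute that this markup does not have, so none of them target bgcolor. The selector also has to be passed as a quoted String. Below is a minimal sketch of an attribute selector, assuming a shortened sample row; the class name BgcolorSelector and the method headerColors are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class BgcolorSelector {

    // Returns the bgcolor value of every <th> that carries the attribute.
    // (Hypothetical helper, not part of jsoup.)
    static List<String> headerColors(String html) {
        Document doc = Jsoup.parse(html);
        List<String> colors = new ArrayList<>();
        // "th[bgcolor]" matches <th> elements that have a bgcolor attribute.
        for (Element th : doc.select("th[bgcolor]")) {
            colors.add(th.attr("bgcolor"));
        }
        return colors;
    }

    public static void main(String[] args) {
        // Shortened sample row (an assumption), wrapped in <table> so the
        // parser keeps the <tr> and <th> tags.
        String html = "<table><tr>"
                + "<th align=\"LEFT\" bgcolor=\"GREY\">"
                + "<span class=\"smallfont\">Higher-order Theorems</span></th>"
                + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br>"
                + "<span class=\"xxsmallfont\">3.2</span></em></th>"
                + "</tr></table>";
        System.out.println(headerColors(html));
    }
}
```

The same pattern should work for align with "th[align]", or for any element with "[bgcolor]".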


For any further information feel free to ask me! Hope this helped you!



Updated Answer to comment:



To do that, you'll need to use this line inside the for-each loop:



    String fullText = element.text();


That way you get all the text contained between the tags of the selected Element, but you should look up this blog and fit your desired query to it. You will probably also need to check whether the String is empty, and run a separate query for each possible case using if conditionals.



That implies having one query for this structure: tr > th > span, another for this one: tr > th > em, and another for: tr > th.
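If the goal is simply the full visible text of each header cell, the three cases above may collapse into one, since text() called on the th itself also collects the text of any nested em, a, and span. Here is a rough sketch under that assumption; the class name HeaderSummary, the method summarize, and the "(none)" placeholder are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class HeaderSummary {

    // Builds one "bgcolor,align,text" entry per non-empty header cell.
    // (Hypothetical helper, not part of jsoup.)
    static List<String> summarize(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element th : doc.select("tr > th")) {
            // text() on the <th> joins the text of all descendants, so
            // <a>Satallax</a><br><span>3.2</span> yields "Satallax 3.2".
            String text = th.text();
            if (text.isEmpty()) {
                continue; // skip header cells without visible text
            }
            // Most cells in the sample have no align attribute.
            String align = th.hasAttr("align") ? th.attr("align") : "(none)";
            rows.add(th.attr("bgcolor") + "," + align + "," + text);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened sample row (an assumption), wrapped in <table> so the
        // parser keeps the <tr> and <th> tags.
        String html = "<table><tr>"
                + "<th align=\"LEFT\" bgcolor=\"GREY\">"
                + "<span class=\"smallfont\">Higher-order Theorems</span></th>"
                + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br>"
                + "<span class=\"xxsmallfont\">3.2</span></em></th>"
                + "</tr></table>";
        for (String row : summarize(html)) {
            System.out.println(row);
        }
    }
}
```

The separate per-structure queries would still be needed if the name and the version number have to end up in different fields.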






  • The only problem this seems to have is that, for example, Satallax 3.2 is only extracting 3.2. Any idea why? Other than that, the rest works fine. Thank you for your response.

    – Beat.
    Nov 26 '18 at 15:18











  • @Beat. I solved your new question. Look at it! Hope that helped you!

    – alvarobartt
    Nov 26 '18 at 15:34











edited Nov 26 '18 at 15:33

























answered Nov 26 '18 at 11:04









alvarobartt
