Jsoup Trouble extracting formatting from html tables

    <tr>
    <th align="LEFT" bgcolor="GREY"> <span class="smallfont">Higher-order Theorems</span>
    </th><th bgcolor="PINK"> <em><a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.2">Satallax</a><br><span class="xxsmallfont">3.2</span></em>
    </th><th bgcolor="SKYBLUE"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Satallax---3.3">Satallax</a><br><span class="xxsmallfont">3.3</span>
    </th><th bgcolor="LIME"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#Leo-III---1.3">Leo‑III</a><br><span class="xxsmallfont">1.3</span>
    </th><th bgcolor="YELLOW"> <a href="http://www.tptp.org/CASC/J9/SystemDescriptions.html#LEO-II---1.7.0">LEO‑II</a><br><span class="xxsmallfont">1.7.0</span>
    </th></tr>

Let's say I want to extract bgcolor, align, and the text contained in the span. For the first cell, for example, that would be GREY, LEFT, Higher-order Theorems.

If I wanted to extract at the very least bgcolor, but ideally all three, how would I do so?

I was attempting to extract just the bgcolor, and I've tried doc.select("tr:contains([bgcolor]"), doc.select(th, [bgcolor), doc.select([bgcolor]), doc.select(tr:containsdata(bgcolor) , as well as doc.select([style]). All of them either returned no output or returned a parse error. I can extract the text in the span just fine; the problem is also extracting bgcolor and align.

asked Nov 25 '18 at 18:05

Beat.

For further information or a more detailed answer, you should update your question and add the URL of the site you are trying to scrape.

– alvarobartt
Nov 26 '18 at 15:37

java html jsoup


1 Answer

You just need to parse the HTML you want to scrape with jsoup, select the th elements, and read the attributes you want with the attr method of each Element. That gives you the value of the attribute for every th tag in the HTML (an empty string when the attribute is absent). To also retrieve the text contained between the span tags, select the span nested in the th and call .text() on it.

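    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // html is a String holding the markup to parse (your HTML goes here),
    // including the <table> that encloses the row.
    Document document = Jsoup.parse(html);

    // Print the parsed document to check what jsoup actually sees.
    System.out.println(document);

    // One element per header cell.
    Elements elements = document.select("tr > th");

    for (Element element : elements) {
        // attr() returns an empty string when the attribute is absent.
        String align = element.attr("align");
        String color = element.attr("bgcolor");
        // Text of the span nested inside this cell.
        String spanText = element.select("span").text();

        System.out.println("Align is " + align +
                "\nBackground Color is " + color +
                "\nSpan Text is " + spanText);
    }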
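As a complement to the loop above, here is a minimal self-contained sketch of the attribute-selector route the question was reaching for. It assumes jsoup is on the classpath; the class name `BgcolorSelectorSketch`, the method `describeColoredHeaders`, and the shortened sample markup are hypothetical and exist only for illustration. The sample is wrapped in `<table>` because jsoup's HTML parser follows the HTML5 parsing rules, under which `<tr>` and `<th>` tags that appear outside a table are dropped.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class BgcolorSelectorSketch {

    // Shortened, hypothetical sample. The enclosing <table> keeps the parser
    // from discarding the <tr>/<th> tags.
    static final String SAMPLE_HTML =
            "<table><tr>"
            + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
            + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
            + "<th>No color here</th>"
            + "</tr></table>";

    /** Collects "bgcolor,align,span text" for every th that has a bgcolor attribute. */
    static List<String> describeColoredHeaders(String html) {
        Document document = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        // th[bgcolor] is an attribute selector: th elements carrying a bgcolor attribute.
        for (Element th : document.select("th[bgcolor]")) {
            // hasAttr() distinguishes a missing attribute from an empty one.
            String align = th.hasAttr("align") ? th.attr("align") : "(none)";
            String spanText = th.select("span").text();
            rows.add(th.attr("bgcolor") + "," + align + "," + spanText);
        }
        return rows;
    }

    public static void main(String[] args) {
        describeColoredHeaders(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

The selector string is passed as a single quoted argument, with the attribute name in square brackets directly after the tag name. The third sample cell has no bgcolor, so it is skipped.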

For any further information feel free to ask me! Hope this helped you!

Updated Answer to comment:

To do that, you'll need to use this line inside the for each loop:

String fullText = element.text();

That way you get all the text contained within the selected Element's tags, but you should look up this blog and fit your desired query to it. You will probably also need to check whether the String is empty, and run a separate query for each possible case, using if conditionals.

That implies one query for the structure tr > th > span, another for tr > th > em, and another for tr > th.
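The per-structure handling described above could look roughly like the following sketch. It assumes jsoup is on the classpath; `HeaderLabelSketch`, `label`, `labels`, and the shortened sample markup are hypothetical names and data chosen for illustration. The idea is that the system name sits in the a tag and only the version sits in the span, which would explain why selecting the span alone yields just "3.2".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class HeaderLabelSketch {

    // Shortened, hypothetical sample covering the three structures:
    // th > span, th > em > (a, span), and th > (a, span).
    static final String SAMPLE_HTML =
            "<table><tr>"
            + "<th align=\"LEFT\" bgcolor=\"GREY\"><span class=\"smallfont\">Higher-order Theorems</span></th>"
            + "<th bgcolor=\"PINK\"><em><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.2</span></em></th>"
            + "<th bgcolor=\"SKYBLUE\"><a href=\"#\">Satallax</a><br><span class=\"xxsmallfont\">3.3</span></th>"
            + "</tr></table>";

    /** Builds a readable label for one header cell. */
    static String label(Element th) {
        // selectFirst returns null when nothing matches.
        Element link = th.selectFirst("a");
        if (link == null) {
            // Plain header cell: the whole cell text is the label.
            return th.text();
        }
        // System cell: name from the link, version from the span (if any).
        String version = th.select("span").text();
        return version.isEmpty() ? link.text() : link.text() + " " + version;
    }

    /** Returns the label of every header cell in the given markup. */
    static List<String> labels(String html) {
        List<String> result = new ArrayList<>();
        for (Element th : Jsoup.parse(html).select("tr > th")) {
            result.add(label(th));
        }
        return result;
    }

    public static void main(String[] args) {
        labels(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Because select("a") and select("span") search all descendants of the cell, the same branch covers both the cell wrapped in em and the cell without it.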

edited Nov 26 '18 at 15:33

answered Nov 26 '18 at 11:04

alvarobartt


The only problem this seems to have is that, for example, Satallax 3.2 is extracted as only 3.2. Any idea why? Other than that, the rest works fine. Thank you for your response.

– Beat.
Nov 26 '18 at 15:18

@Beat. I solved your new question. Look at it! Hope that helped you!

– alvarobartt
Nov 26 '18 at 15:34


