lxml/python reading xml with CDATA section












My XML contains a CDATA section. I want to keep the CDATA part when parsing, and then strip it. Can someone help with the following?

The default parser does not keep it:

>>> from io import StringIO
>>> from lxml import etree
>>> xml = '<Subject> My Subject: 美海軍研究船勘查台海水文? 船<![CDATA[é]]>€ </Subject>'
>>> tree = etree.parse(StringIO(xml))
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '

This post seems to suggest that the parser option strip_cdata=False may keep the CDATA, but it has no effect:

>>> parser = etree.XMLParser(strip_cdata=False)
>>> tree = etree.parse(StringIO(xml), parser=parser)
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '

Using strip_cdata=True, which should be the default, yields the same result:

>>> parser = etree.XMLParser(strip_cdata=True)
>>> tree = etree.parse(StringIO(xml), parser=parser)
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文? 船é€ '


































  • If you add enough of the relevant XML, we might be able to test.

    – usr2564301
    Nov 23 '18 at 23:19











  • Is that example not enough? I can add more.

    – Sudipta Basak
    Nov 23 '18 at 23:27











  • Ah, sorry. It's hard to read, with those numbers before your actual code and data. If they are not an important part of your question, remove them.

    – usr2564301
    Nov 23 '18 at 23:28
















python python-3.x lxml elementtree cdata














edited Nov 24 '18 at 1:02







Sudipta Basak

















asked Nov 23 '18 at 23:17









Sudipta Basak


























1 Answer
































CDATA sections are not preserved in the text property of an element, even if strip_cdata=False is used when the XML content is parsed, as you have noticed. See https://lxml.de/api.html#cdata.



If the document was parsed with strip_cdata=False, CDATA sections are preserved in these cases:





  1. When serializing with tostring():



    print(etree.tostring(tree.getroot(), encoding="UTF-8").decode())



  2. When writing to a file:



    tree.write("subject.xml", encoding="UTF-8")
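The difference can be seen by parsing the same string twice, as in the minimal sketch below. The sample XML string and the variable names are illustrative and not taken from the question. The sketch assumes that the default parser merges CDATA into ordinary text, while a parser created with strip_cdata=False keeps the CDATA nodes in the tree, so that they reappear on serialization.

```python
from lxml import etree

# Illustrative input (not from the question): the CDATA section contains
# characters that would otherwise have to be escaped.
xml = "<Subject>Title: <![CDATA[é & <b>]]> end</Subject>"

# Default parser (strip_cdata=True): CDATA content becomes ordinary text.
default_root = etree.fromstring(xml)

# strip_cdata=False keeps the CDATA section as a CDATA node in the tree.
keeping_parser = etree.XMLParser(strip_cdata=False)
kept_root = etree.fromstring(xml, parser=keeping_parser)

# The .text property looks the same in both cases: it never shows the markers.
print(repr(default_root.text))
print(repr(kept_root.text))

# Only serialization reveals whether the CDATA section survived parsing.
default_out = etree.tostring(default_root, encoding="unicode")
kept_out = etree.tostring(kept_root, encoding="unicode")
print(default_out)
print(kept_out)
```

So to "keep" the CDATA, parse with strip_cdata=False and inspect the serialized output; to "strip" it, parse with the default parser (or read .text, which carries only the character data).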































  • Thanks for that. I read that part, but did not realise etree.tostring serialises.

    – Sudipta Basak
    Nov 24 '18 at 14:26

























answered Nov 24 '18 at 7:02









mzjn













