lxml/python reading xml with CDATA section

My XML contains a CDATA section. I want to preserve the CDATA section when reading the element, and strip it afterwards. Can someone help with the following?

The default parser does not work:

>>> from io import StringIO
>>> from lxml import etree
>>> xml = '<Subject> My Subject: 美海軍研究船勘查台海水文？ 船<![CDATA[é]]>€ </Subject>'
>>> tree = etree.parse(StringIO(xml))
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船é€ '

This post seems to suggest that the parser option strip_cdata=False keeps the CDATA section, but it has no effect:

>>> parser = etree.XMLParser(strip_cdata=False)
>>> tree = etree.parse(StringIO(xml), parser=parser)
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船é€ '

Using strip_cdata=True, which should be the default, yields the same result:

>>> parser = etree.XMLParser(strip_cdata=True)
>>> tree = etree.parse(StringIO(xml), parser=parser)
>>> tree.getroot().text
' My Subject: 美海軍研究船勘查台海水文？ 船é€ '

edited Nov 24 '18 at 1:02

asked Nov 23 '18 at 23:17

Sudipta Basak


If you add enough of the relevant XML, we might be able to test.

– usr2564301
Nov 23 '18 at 23:19

Is that example not enough? I can add more.

– Sudipta Basak
Nov 23 '18 at 23:27

Ah, sorry. It's hard to read, with those numbers before your actual code and data. If they are not an important part of your question, remove them.

– usr2564301
Nov 23 '18 at 23:28

python python-3.x lxml elementtree cdata

1 Answer

CDATA sections are not preserved in the text property of an element, even if strip_cdata=False is used when the XML content is parsed, as you have noticed. See https://lxml.de/api.html#cdata.

When the content is parsed with strip_cdata=False, CDATA sections are preserved in these cases:

When serializing with tostring():

print(etree.tostring(tree.getroot(), encoding="UTF-8").decode())

When writing to a file:

tree.write("subject.xml", encoding="UTF-8")
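The two cases above can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the shortened sample string and the variable names are made up for the example, and an in-memory `io.BytesIO` buffer stands in for the `subject.xml` file so that nothing is written to disk.

```python
import io

from lxml import etree

# Shortened version of the question's sample (illustrative data).
xml = '<Subject> My Subject: 船<![CDATA[é]]>€ </Subject>'

# strip_cdata=False asks the parser to keep CDATA nodes in the tree.
keep_parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml, keep_parser)

# .text still returns plain text; the CDATA markup is not visible here.
text = root.text

# Serializing to a string reproduces the CDATA markup.
serialized = etree.tostring(root, encoding="unicode")

# Writing the tree out (a BytesIO buffer stands in for a real file).
buffer = io.BytesIO()
etree.ElementTree(root).write(buffer, encoding="UTF-8")
written = buffer.getvalue().decode("utf-8")

# For comparison: the default parser replaces CDATA with ordinary text.
default_root = etree.fromstring(xml)
default_serialized = etree.tostring(default_root, encoding="unicode")

print(repr(text))
print(serialized)
print(default_serialized)
```

So the parser option changes what comes out on serialization, not what `.text` returns.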

answered Nov 24 '18 at 7:02

mzjn


Thanks for that. I read that part, but did not realise etree.tostring serialises.

– Sudipta Basak
Nov 24 '18 at 14:26
