Scrapy xml pipeline


I need to write a spider that outputs a separate XML file for each article.

The pipeline.py:

from scrapy.exporters import XmlItemExporter
from datetime import datetime


class CommonPipeline(object):
    def process_item(self, item, spider):
        return item


class XmlExportPipeline(object):
    def __init__(self):
        # Maps each spider to the file currently open for it
        self.files = {}

    def process_item(self, item, spider):
        # One file per item, named after the spider plus a timestamp
        file = open((spider.name + datetime.now().strftime("_%H%M%S%f.xml")), 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()
        self.exporter.export_item(item)
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        return item

The output:

<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <text_img> Nelson Argaña. Foto: Gustavo Velázquez 970AM. Hace 1 hora </text_img>
    <title>Nelson Argaña lamentó que Mario Abdo esté rodeado de corruptos </title>
    <url>https://www.lanacion.com.py/politica/2019/03/23/nelson-argana-lamento-que-mario-abdo-este-rodeado-de-corruptos/</url>
    <content> Nelson Argaña, hijo de Luis María Arg ...</content>
    <sum_content>4805</sum_content>
    <time>14:30:06</time>
    <date>20190323</date>
  </item>
</items>

But I need output like this:

<?xml version="1.0" encoding="iso-8859-1"?>
<article>
  <text_img> Nelson Argaña. Foto: Gustavo Velázquez 970AM. Hace 1 hora </text_img>
  <title>Nelson Argaña lamentó que Mario Abdo esté rodeado de corruptos </title>
  <url>https://www.lanacion.com.py/politica/2019/03/23/nelson-argana-lamento-que-mario-abdo-este-rodeado-de-corruptos/</url>
  <content> Nelson Argaña, hijo de Luis María Arg ...</content>
  <sum_content>4805</sum_content>
  <time>14:30:06</time>
  <date>20190323</date>
</article>

The settings.py:

ITEM_PIPELINES = {
    'common.pipelines.XmlExportPipeline': 300,
}

FEED_EXPORTERS_BASE = {
    'xml': 'scrapy.contrib.exporter.XmlItemExporter',
}

I tried adding this to settings.py:

FEED_EXPORT_ENCODING = 'iso-8859-1'
FEED_EXPORT_FIELDS = ["article"]

But it doesn't work.

I am using Scrapy 1.4.0.

edited Mar 23 at 17:39

asked Mar 23 at 17:29

Juan Manuel

133

Try (inside process_item) - self.exporter = XmlItemExporter(file, item_element="article", root_element="articles"). See docs.scrapy.org/en/latest/topics/exporters.html#xmlitemexporter

– balderman
Mar 24 at 8:06

Thanks for your comment. I tried that option, but I need only the <article> tag, not <articles><article>. This is a mandatory requirement, and so is the encoding.

– Juan Manuel
Mar 24 at 15:38

So use only the 'item_element'

– balderman
Mar 24 at 15:42

It does not work. The root_element appears by default as <items>. I tried root_element=False, root_element=None, root_element='' but it does not work. The same happens in reverse.

– Juan Manuel
Mar 24 at 18:16
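
Since the comments above suggest the root element of XmlItemExporter cannot simply be switched off, one possible workaround is to skip the exporter and build each document by hand. The following is only a minimal sketch under stated assumptions, not a confirmed fix for this setup: it uses the standard library's xml.etree.ElementTree, the names ArticleXmlPipeline, out_dir and written are hypothetical, and it assumes every item converts cleanly with dict(item) and that every field name is a valid XML element name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from datetime import datetime
from types import SimpleNamespace


class ArticleXmlPipeline:
    """Hypothetical pipeline: one XML file per item, with <article> as the root."""

    def __init__(self, out_dir=".", encoding="iso-8859-1"):
        self.out_dir = out_dir    # assumed output directory, not from the question
        self.encoding = encoding  # the encoding the question asks for
        self.written = []         # paths written so far, kept only for inspection

    def process_item(self, item, spider):
        # The item itself becomes the root element, so there is no <items> wrapper.
        root = ET.Element("article")
        for key, value in dict(item).items():
            ET.SubElement(root, key).text = str(value)

        # Same naming scheme as in the question: spider name plus a timestamp.
        name = spider.name + datetime.now().strftime("_%H%M%S%f.xml")
        path = os.path.join(self.out_dir, name)

        # xml_declaration=True writes the <?xml ...?> header with the chosen encoding.
        ET.ElementTree(root).write(path, encoding=self.encoding, xml_declaration=True)
        self.written.append(path)
        return item


if __name__ == "__main__":
    # Stand-in for a real spider: the pipeline only reads its .name attribute.
    fake_spider = SimpleNamespace(name="demo")
    with tempfile.TemporaryDirectory() as tmp:
        pipeline = ArticleXmlPipeline(out_dir=tmp)
        pipeline.process_item({"title": "Nelson Argaña", "sum_content": 4805}, fake_spider)
        with open(pipeline.written[0], "rb") as fh:
            print(fh.read().decode("iso-8859-1"))
```

Because the class only needs a process_item(item, spider) method, it could be listed in ITEM_PIPELINES in place of XmlExportPipeline. Characters that iso-8859-1 cannot represent are written by ElementTree as numeric character references instead of raising an error.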

python xml scrapy pipeline
