Scrapy xml pipeline
I need to write a spider that outputs one XML file per article.
The pipeline.py:
from scrapy.exporters import XmlItemExporter
from datetime import datetime


class CommonPipeline(object):
    def process_item(self, item, spider):
        return item


class XmlExportPipeline(object):
    def __init__(self):
        # Open file handles, keyed by spider
        self.files = {}

    def process_item(self, item, spider):
        # Each item goes to its own timestamped XML file
        file = open((spider.name + datetime.now().strftime("_%H%M%S%f.xml")), 'w+b')
        self.files[spider] = file
        self.exporter = XmlItemExporter(file)
        self.exporter.start_exporting()
        self.exporter.export_item(item)
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        return item
The output:
<?xml version="1.0" encoding="utf-8"?>
<items>
  <item>
    <text_img> Nelson Argaña. Foto: Gustavo Velázquez 970AM. Hace 1 hora </text_img>
    <title>Nelson Argaña lamentó que Mario Abdo esté rodeado de corruptos </title>
    <url>https://www.lanacion.com.py/politica/2019/03/23/nelson-argana-lamento-que-mario-abdo-este-rodeado-de-corruptos/</url>
    <content> Nelson Argaña, hijo de Luis María Arg ...</content>
    <sum_content>4805</sum_content>
    <time>14:30:06</time>
    <date>20190323</date>
  </item>
</items>
But I need output like this:
<?xml version="1.0" encoding="iso-8859-1"?>
<article>
  <text_img> Nelson Argaña. Foto: Gustavo Velázquez 970AM. Hace 1 hora </text_img>
  <title>Nelson Argaña lamentó que Mario Abdo esté rodeado de corruptos </title>
  <url>https://www.lanacion.com.py/politica/2019/03/23/nelson-argana-lamento-que-mario-abdo-este-rodeado-de-corruptos/</url>
  <content> Nelson Argaña, hijo de Luis María Arg ...</content>
  <sum_content>4805</sum_content>
  <time>14:30:06</time>
  <date>20190323</date>
</article>
The settings.py:
ITEM_PIPELINES = {
    'common.pipelines.XmlExportPipeline': 300,
}
FEED_EXPORTERS_BASE = {
    'xml': 'scrapy.contrib.exporter.XmlItemExporter',
}
I tried adding this to settings.py:
FEED_EXPORT_ENCODING = 'iso-8859-1'
FEED_EXPORT_FIELDS = ["article"]
But it doesn't work.
I am using Scrapy 1.4.0.
python xml scrapy pipeline
Try this inside process_item: self.exporter = XmlItemExporter(file, item_element="article", root_element="articles"). See docs.scrapy.org/en/latest/topics/exporters.html#xmlitemexporter
– balderman
Mar 24 at 8:06
Thanks for your comment. I tried that option, but I need only the <article> tag, not <articles><article>. That is a mandatory requirement, and so is the encoding.
– Juan Manuel
Mar 24 at 15:38
So use only the 'item_element'
– balderman
Mar 24 at 15:42
It does not work. The root_element appears by default as <items>. I tried root_element=False, root_element=None, root_element='' but it does not work. The same happens in reverse.
– Juan Manuel
Mar 24 at 18:16
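For reference, the target layout described in the question (a single <article> root and an iso-8859-1 declaration) can be sketched with the standard library alone. This is only an illustration of the desired file format, not Scrapy's exporter API: it bypasses XmlItemExporter entirely, and the helper name write_article_xml and its parameters are hypothetical. It assumes the item can be treated as a flat mapping of field names to values.

```python
import io
import xml.etree.ElementTree as ET


def write_article_xml(fileobj, item, root_tag="article", encoding="iso-8859-1"):
    """Write one item as a single-root XML document (hypothetical helper).

    `fileobj` must be a binary file object; `item` is treated as a flat
    mapping of field name -> value.
    """
    root = ET.Element(root_tag)
    for name, value in item.items():
        # Each field becomes a direct child of the root element
        child = ET.SubElement(root, name)
        child.text = str(value)
    # xml_declaration=True emits the <?xml ...?> header with the chosen encoding
    ET.ElementTree(root).write(fileobj, encoding=encoding, xml_declaration=True)


# Usage: serialize a sample item into an in-memory buffer
buffer = io.BytesIO()
write_article_xml(buffer, {"title": "Nelson Argaña", "sum_content": 4805})
xml_bytes = buffer.getvalue()
```

Inside a pipeline, the same helper could presumably be called from process_item with the file opened in binary mode, in place of the three exporter calls. Whether that fits depends on how nested or multi-valued fields should be serialized, which this sketch does not handle.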
edited Mar 23 at 17:39
Juan Manuel
asked Mar 23 at 17:29
Juan Manuel