What's the correct Scrapy XPath for <p> elements incorrectly placed within <h> tags?


I am setting up my first Scrapy Spider, and I'm having some difficulty using xpath to extract certain elements.

My target is http://www.cbooo.cn/m/641515 (a Chinese website similar to Box Office Mojo). I can extract the Chinese name of the film, 阿龙浴血记, with no problem, but I can't figure out how to get the information below it. I believe this is because the HTML is not standard, as discussed here. There are several paragraph elements nested beneath the header.

I have tried the solution in the link above, and also here, to no avail.

def parse(self, response):
 chinesetitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/text()').extract()
 englishtitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/p').extract()
 chinesereleasedate = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[4]').extract()
 productionregions = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[6]').extract()
 chineseboxoffice = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[1]/span/text()[2]').extract()
 yield {
  'chinesetitle': chinesetitle,
  'englishtitle': englishtitle,
  'chinesereleasedate': chinesereleasedate,
  'productionregions': productionregions,
  'chineseboxoffice': chineseboxoffice,
 }

When I run the spider in the Scrapy shell, the spider finds the Chinese title as expected. However, the remaining items return either a [], or a weird mish-mash of text on the page.

Any advice? This is my first amateur programming project, so I appreciate your patience with my ignorance and your help. Thank you!

EDIT

I tried to implement the text-cleaning method from the comments. The example in the comments worked, but when I tried to apply it to other fields I got "AttributeError: 'list' object has no attribute 'split'" (please see the China box office, country of origin, and genre examples below).

def parse(self, response):
 chinesetitle = response.css('.cont h2::text').extract_first()
 englishtitle = response.css('.cont h2 + p::text').extract_first()
 chinaboxoffice = response.xpath('//span[@class="m-span"]/text()[2]').extract_first() 
 chinaboxoffice = chinaboxoffice.split('万')[0]
 chinareleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
 chinareleasedate = chinareleasedate.split('：')[1].split('（')[0]
 countryoforigin = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
 countryoforigin = countryoforigin.split('：')[1]
 genre = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"类型")]/text()').extract_first()
 genre = genre.split('：')[1]
 director = response.xpath('//*[@id="tabcont1"]/dl/dd[1]/p/a/text()').extract()

edited Mar 26 at 13:58

asked Mar 25 at 12:35

Eric Johnson

284 bronze badges

Those are actually allowed in html5, scrapy just doesn't support it and probably never will.

– pguardiario
Mar 26 at 1:13

python html xpath web-scraping scrapy

2 Answers

Here are some examples from which you can infer the remaining one. Prefer a class or id attribute to identify an HTML element. A positional path such as /div[3]/div[2]/div/div[1]/... is not good practice, because it breaks as soon as the page layout changes.

chinesetitle = response.xpath('//div[@class="ziliaofr"]/div/h2/text()').extract_first()
englishtitle = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract_first()
chinesereleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
productionregions = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()

To find chinesereleasedate I took the p element whose text contains '上映时间'. The extracted string still includes the label and surrounding whitespace, so it has to be cleaned up to get the exact value.

To find productionregions I took the seventh selector from the list, response.xpath('//div[@class="ziliaofr"]/div/p')[6], and selected its text. A better method would be to check whether the text contains '国家及地区', just like above.
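The label-based lookup suggested above can also be sketched in plain Python, applied to the list of paragraph texts after extraction. This is a minimal sketch, not part of the original answer: the helper name `value_for_label` and the sample strings are assumptions made for illustration, and it assumes each field is a single text node of the form "label：value".

```python
from typing import Iterable, Optional


def value_for_label(texts: Iterable[str], label: str, sep: str = '：') -> Optional[str]:
    """Return the text after `sep` in the first string that contains `label`."""
    for text in texts:
        if label in text and sep in text:
            # Split only once so a separator inside the value is preserved.
            return text.split(sep, 1)[1].strip()
    return None


# In a spider, the list would come from something like:
#   texts = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract()
# The strings below are made-up stand-ins for that list.
sample_texts = [
    '\r\n 上映时间：2017-7-27（中国）\r\n ',
    '\r\n 国家及地区：中国\r\n ',
]

print(value_for_label(sample_texts, '国家及地区'))
```

Looking fields up by label keeps the spider working if the site inserts or removes a paragraph, which an index such as [6] would not survive. If a value sits inside a child element rather than directly in the p, its text will not be in this list and the lookup returns None.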

Edit: To answer the question in the comments,

response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()

returns a string like '\r\n 上映时间：2017-7-27（中国）\r\n ', which is not what you are looking for. You can clean it up like this:

chinesereleasedate = chinesereleasedate.split('：')[1].split('（')[0]

This gives us the correct date.
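The chained split works when the string has exactly the expected shape, but it raises an exception if extract_first() returned None or if a separator is missing. A slightly more defensive version might look like the sketch below. The function name `parse_release_date` is hypothetical, and the assumption that the date is always written as year-month-day is based only on the sample string above.

```python
from datetime import date
from typing import Optional


def parse_release_date(raw: Optional[str]) -> Optional[date]:
    """Turn a string like '上映时间：2017-7-27（中国）' into a date, or None if it does not fit."""
    if not raw:
        return None
    # Everything after the full-width colon is the value part.
    _, sep, rest = raw.partition('：')
    if not sep:
        return None
    # Drop the parenthesised region, e.g. '（中国）', and surrounding whitespace.
    date_part = rest.split('（')[0].strip()
    try:
        year, month, day = (int(part) for part in date_part.split('-'))
        return date(year, month, day)
    except ValueError:
        # Wrong number of parts, non-numeric parts, or an impossible date.
        return None


print(parse_release_date('\r\n 上映时间：2017-7-27（中国）\r\n '))
```

Returning None instead of raising lets the spider still yield the other fields of an item when one page is formatted differently.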

edited Mar 25 at 16:31

answered Mar 25 at 13:00

Nihal Sangeeth

1,039 reputation, 6 silver badges, 18 bronze badges

Understood your point about the class attribute, very useful information. When you say "You have to parse this to get the exact value", what do you mean?

– Eric Johnson
Mar 25 at 13:27

Parsing was not the right word. Added it to the answer. Please accept the answer if you find it useful.

– Nihal Sangeeth
Mar 25 at 16:32

Thank you very much. I now see what you mean. Your method worked for me. I tried to redo it for other fields, but I now get an error stating "Attribute Error: 'list' object has no attribute 'split'". I've edited my original post, I'd appreciate it if you could take a look.

– Eric Johnson
Mar 26 at 13:56
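The error in the comment above is what Python raises when .split is called on a list rather than a string. In Scrapy, .extract() returns a list of strings, while .extract_first() returns a single string or None, so one plausible cause is a field that was extracted with .extract() before splitting. A minimal sketch of a guard follows; the helper names `first_text` and `before_marker` are hypothetical and the sample values are made up.

```python
from typing import List, Optional, Union

Extracted = Union[str, List[str], None]


def first_text(extracted: Extracted) -> Optional[str]:
    """Accept the result of .extract() or .extract_first() and return one string or None."""
    if extracted is None:
        return None
    if isinstance(extracted, str):
        return extracted
    # A list: take the first item if there is one.
    return extracted[0] if extracted else None


def before_marker(extracted: Extracted, marker: str) -> Optional[str]:
    """Return the text before `marker`, whatever form the extraction result took."""
    text = first_text(extracted)
    if text is None:
        return None
    return text.split(marker)[0].strip()


# Made-up value standing in for the box office text node.
print(before_marker([' 1234.5万 '], '万'))  # list, as returned by .extract()
print(before_marker(' 1234.5万 ', '万'))    # string, as returned by .extract_first()
```

Both calls give the same result, so the cleaning step no longer depends on which extraction method produced the value.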


By the way, you don't have to torture yourself with XPath; you can use CSS selectors:

response.css('.cont h2::text').extract_first()
# '战狼2'
response.css('.cont h2 + p::text').extract_first()
# 'Wolf Warriors 2'

answered Mar 26 at 1:17

pguardiario

37.7k reputation, 11 gold badges, 82 silver badges, 118 bronze badges
