What's the correct Scrapy XPath for <p> elements incorrectly placed within <h> tags?

I am setting up my first Scrapy Spider, and I'm having some difficulty using xpath to extract certain elements.



My target is http://www.cbooo.cn/m/641515 (a Chinese website similar to Box Office Mojo). I can extract the Chinese name of the film, 阿龙浴血记, with no problem, but I can't figure out how to get the information below it. I believe this is because the HTML is not standard, as discussed here. There are several paragraph elements nested beneath the header.



I have tried the solution in the link above, and also here, to no avail.



def parse(self, response):
    chinesetitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/text()').extract()
    englishtitle = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/h2/p').extract()
    chinesereleasedate = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[4]').extract()
    productionregions = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[6]').extract()
    chineseboxoffice = response.xpath('//*[@id="top"]/div[3]/div[2]/div/div[1]/div[2]/div[1]/p[1]/span/text()[2]').extract()
    yield {
        'chinesetitle': chinesetitle,
        'englishtitle': englishtitle,
        'chinesereleasedate': chinesereleasedate,
        'productionregions': productionregions,
        'chineseboxoffice': chineseboxoffice,
    }



When I run the spider in the Scrapy shell, the spider finds the Chinese title as expected. However, the remaining items return either a [], or a weird mish-mash of text on the page.



Any advice? This is my first amateur programming project, so I appreciate your patience with my ignorance and your help. Thank you!



EDIT



I tried to implement the text-cleaning method from the comments. The example in the comments worked, but when I tried to reimplement it for other fields I got "AttributeError: 'list' object has no attribute 'split'" (please see the China box office, country of origin, and genre examples below).



def parse(self, response):
    chinesetitle = response.css('.cont h2::text').extract_first()
    englishtitle = response.css('.cont h2 + p::text').extract_first()
    chinaboxoffice = response.xpath('//span[@class="m-span"]/text()[2]').extract_first()
    chinaboxoffice = chinaboxoffice.split('万')[0]
    chinareleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
    chinareleasedate = chinareleasedate.split(':')[1].split('（')[0]
    countryoforigin = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()
    countryoforigin = countryoforigin.split(':')[1]
    genre = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"类型")]/text()').extract_first()
    genre = genre.split(':')[1]
    director = response.xpath('//*[@id="tabcont1"]/dl/dd[1]/p/a/text()').extract()









python html xpath web-scraping scrapy






edited Mar 26 at 13:58

asked Mar 25 at 12:35
Eric Johnson












  • Those are actually allowed in HTML5; Scrapy just doesn't support them and probably never will.

    – pguardiario
    Mar 26 at 1:13

















2 Answers

Here are some examples from which you can work out the last one. Prefer a class or id attribute to identify the HTML element; a positional path such as /div[3]/div[2]/div/div[1]/.. is not good practice, because it breaks as soon as the page layout shifts.



chinesetitle = response.xpath('//div[@class="ziliaofr"]/div/h2/text()').extract_first()
englishtitle = response.xpath('//div[@class="ziliaofr"]/div/p/text()').extract_first()
chinesereleasedate = response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()
productionregions = response.xpath('//div[@class="ziliaofr"]/div/p')[6].xpath('text()').extract_first()


To find chinesereleasedate I took the p element whose text contains '上映时间'. You then have to clean up the returned string to get the exact value.



To find productionregions I took the 7th selector from the list, response.xpath('//div[@class="ziliaofr"]/div/p')[6], and selected its text. A better method would be to check whether the text contains '国家及地区', just like above.
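As a rough illustration of matching on the label rather than on the position, the sketch below works on plain strings standing in for the text of each p element (what .extract() on the p/text() nodes would return). The sample values and the helper name field_by_label are hypothetical and not taken from the page.

```python
def field_by_label(paragraphs, label):
    """Return the text after the colon in the first paragraph containing label.

    `paragraphs` is a list of strings, e.g. the text of each <p> element.
    Returns None when no paragraph mentions the label.
    """
    for text in paragraphs:
        if label in text:
            # Accept both the full-width colon and the ASCII colon.
            normalized = text.replace('：', ':')
            _, _, value = normalized.partition(':')
            return value.strip()
    return None


# Hypothetical sample data, deliberately out of order to show that
# the position of a paragraph no longer matters.
sample = [
    '\r\n 类型:动作\r\n ',
    '\r\n 国家及地区:中国\r\n ',
    '\r\n 上映时间:2017-7-27（中国）\r\n ',
]

print(field_by_label(sample, '国家及地区'))  # 中国
print(field_by_label(sample, '类型'))        # 动作
print(field_by_label(sample, '导演'))        # None
```

Looking fields up by label this way keeps working if the site adds or removes a paragraph, which an index such as [6] would not survive.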



Edit: To answer the question in the comments,

response.xpath('//div[@class="ziliaofr"]/div/p[contains(text(),"上映时间")]/text()').extract_first()

returns a string like '\r\n 上映时间:2017-7-27（中国）\r\n ' which is not what you are looking for. You can clean it up like:

chinesereleasedate = chinesereleasedate.split(':')[1].split('（')[0]

This gives us the correct date.
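The same clean-up could be wrapped in a small helper that also copes with a missing value, since extract_first() returns None when nothing matches. This is only a sketch: clean_release_date is a hypothetical name, and the input is assumed to look like the sample string above.

```python
def clean_release_date(raw):
    """Extract the date part from a 'label:date（region）' style string."""
    if raw is None:
        # extract_first() returns None when the selector matches nothing.
        return None
    # Keep what follows the first colon (or the whole string if there is none).
    after_label = raw.split(':', 1)[-1]
    # Treat the full-width and ASCII opening parenthesis alike,
    # then drop everything from the parenthesis onwards.
    date = after_label.replace('（', '(').split('(')[0]
    return date.strip()


print(clean_release_date('\r\n 上映时间:2017-7-27（中国）\r\n '))  # 2017-7-27
print(clean_release_date(None))  # None
```

Returning None instead of raising keeps one missing field from aborting the whole parse callback.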







edited Mar 25 at 16:31

answered Mar 25 at 13:00
Nihal Sangeeth












    • Understood your point about the class attribute, very useful information. When you say "You have to parse this to get the exact value", what do you mean?

      – Eric Johnson
      Mar 25 at 13:27












    • Parsing was not the right word. Added it to the answer. Please accept the answer if you find it useful.

      – Nihal Sangeeth
      Mar 25 at 16:32











    • Thank you very much. I now see what you mean. Your method worked for me. I tried to redo it for other fields, but I now get an error stating "AttributeError: 'list' object has no attribute 'split'". I've edited my original post; I'd appreciate it if you could take a look.

      – Eric Johnson
      Mar 26 at 13:56
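
For what it's worth, the error quoted in the last comment is what Python raises when .split() is called on a list. In Scrapy, .extract() returns a list of strings, whereas .extract_first() returns a single string (or None if nothing matched). The sketch below uses a plain list as a stand-in for the result of .extract(); the sample value and the helper name first_text are made up for illustration.

```python
def first_text(values):
    """Mimic extract_first() on a plain list: first item, or None if empty."""
    return values[0] if values else None


# Stand-in for what .extract() would return: always a list of strings.
extracted = ['\r\n 类型:动作\r\n ']

try:
    extracted.split(':')  # a list has no .split method
except AttributeError as exc:
    print(exc)  # 'list' object has no attribute 'split'

# Take a single string first, and guard against None before splitting.
genre = first_text(extracted)
if genre is not None:
    genre = genre.split(':')[1].strip()
print(genre)  # 动作
```

So any field that is going to be passed through .split() should come from .extract_first() rather than .extract(), with a None check before splitting.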

















You don't have to torture yourself with XPath, by the way; you can use CSS selectors:

response.css('.cont h2::text').extract_first()
# '战狼2'
response.css('.cont h2 + p::text').extract_first()
# 'Wolf Warriors 2'






answered Mar 26 at 1:17
pguardiario


























