How can I get proper response back from scrapy?

I am trying to scrape some search results from this company register, but when I try to scrape the company name, my results don't seem to come back properly. It's as if the company name is split into two HTML items based on the search keyword.

Is there a way to join these together? This is my spider:

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'gov2'
        start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

        def parse(self, response):
            for i in response.css('ul.results-list'):
                yield {
                    'company_name': i.css('li.type-company h3 a::text').extract(),
                    'address': i.css('li.type-company p::text').extract(),
                }

As you can see, my results are missing some parts.

I hope one of you can see what's going on. Thank you!

asked Mar 24 at 20:44

Hi tE

235

python web-scraping scrapy


2 Answers

As I understand it, you want to fetch all the text within the a and p tags, and there are further tags nested inside those tags.

Try this, which also removes the unnecessary whitespace with a regex:

    import re

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'gov2'
        start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

        def parse(self, response):
            for i in response.css('ul.results-list'):
                yield {
                    # 'a ::text' (with a space) also selects text inside nested tags;
                    # join the fragments, then collapse runs of whitespace.
                    'company_name': re.sub(r'\s+', ' ', ''.join(i.css('li.type-company h3 a ::text').extract())),
                    'address': re.sub(r'\s+', ' ', ''.join(i.css('li.type-company p ::text').extract())),
                }
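To see why the name comes back in pieces, here is a minimal sketch that uses only the standard library rather than Scrapy itself. The markup is an assumption made for illustration (the real page may differ): it supposes the search keyword is wrapped in its own `<strong>` tag inside the link. Each run of text between tags is a separate text node, which is why `a::text` (direct text children only) misses the highlighted part, while `a ::text` returns every fragment.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node in the fragment, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once per run of text between two tags.
        self.chunks.append(data)


# Hypothetical markup: the matched keyword "A" sits in a nested <strong> tag.
html = '<h3><a href="/company/1">\n  <strong>A</strong> LIMITED\n</a></h3>'

collector = TextCollector()
collector.feed(html)
collector.close()

chunks = collector.chunks      # three separate text nodes
joined = ''.join(chunks)       # one string, still with stray whitespace
print(chunks)
print(repr(joined))
```

The joined string still carries the newlines and indentation from the page, which is what the regex substitution in the spider cleans up.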

edited Mar 25 at 5:44

answered Mar 25 at 1:22

Pankaj

856714

Thank you very much, that is what I was looking for! I'll look some more into regex! That's amazing, haha!

– Hi tE
Mar 25 at 11:37


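Using the same regex approach, I just modified the code for better output: iterating over each .type-company element yields one item per company instead of a single item for the whole results list.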

    import re

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = 'gov2'
        start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

        def parse(self, response):
            # One iteration (and one yielded item) per company in the results.
            for i in response.css('.type-company'):
                yield {
                    'company_name': re.sub(r'\s+', ' ', ''.join(i.css('h3 a ::text').extract())),
                    'address': re.sub(r'\s+', ' ', ''.join(i.css('p ::text').extract())),
                }
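Since the same join-and-collapse expression is repeated for every field, it could be pulled into a small helper. This is only a sketch: `clean_text` is a hypothetical name, not part of Scrapy, and the final `.strip()` is an addition that the code above does not perform.

```python
import re


def clean_text(fragments):
    """Join text fragments and collapse each run of whitespace into one space."""
    joined = ''.join(fragments)
    # \s+ matches spaces, tabs and newlines; strip() drops the leftover
    # leading and trailing space (an addition to the original answers).
    return re.sub(r'\s+', ' ', joined).strip()


# Example input shaped like what .extract() returns for a split company name.
print(clean_text(['\n  ', 'A', ' LIMITED\n']))
```

In the spider, each field would then read `clean_text(i.css('h3 a ::text').extract())`.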

answered Mar 25 at 2:49

Arun Augustine

304110

Thanks, the output looks way better, haha.

– Hi tE
Mar 25 at 11:39
