How can I get proper response back from scrapy?
I am trying to scrape some search results from this company register, but when I try to scrape the company name, the results don't come back properly. It looks as if the company name is split into two HTML items around the search keyword.
Is there a way to join these together? This is my spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

    def parse(self, response):
        for i in response.css('ul.results-list'):
            yield {
                'company_name': i.css('li.type-company h3 a::text').extract(),
                'address': i.css('li.type-company p::text').extract(),
            }
My results are missing some parts, as you can see.
I hope one of you can see what's going on. Thank you!
python web-scraping scrapy
asked Mar 24 at 20:44 by Hi tE
2 Answers
As I see it, you want to fetch all the text inside the a and p tags, and there are further tags nested within those tags.
Try this, which also removes the unnecessary whitespace with a regex:
import re

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

    def parse(self, response):
        for i in response.css('ul.results-list'):
            yield {
                # "a ::text" (with a space) also selects text inside nested tags;
                # r'\s+' collapses each run of whitespace into a single space.
                'company_name': re.sub(r'\s+', ' ', ''.join(i.css('li.type-company h3 a ::text').extract())),
                'address': re.sub(r'\s+', ' ', ''.join(i.css('li.type-company p ::text').extract())),
            }
answered Mar 25 at 1:22 by Pankaj (edited Mar 25 at 5:44)
Thank you very much, that is what I was looking for! I'll look some more into regex! That's amazing, haha!
– Hi tE
Mar 25 at 11:37
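The backslash in the pattern matters: `r'\s+'` matches runs of whitespace, whereas `'s+'` would match runs of the letter s. Below is a minimal standard-library sketch of the join-then-normalize step used in the answer above. The helper name `clean_text` and the sample fragments are hypothetical; the fragments only imitate the kind of list `.extract()` might return when a name is split around a highlighted keyword, and the trailing `.strip()` is an addition that the answer's code does not perform.

```python
import re


def clean_text(fragments):
    """Join extracted text fragments and collapse whitespace runs."""
    joined = ''.join(fragments)
    # r'\s+' matches one or more whitespace characters (spaces, tabs, newlines).
    return re.sub(r'\s+', ' ', joined).strip()


# Hypothetical fragments, shaped like a name split around the search keyword.
fragments = ['\n        ', 'A', ' & B TRADING LIMITED\n      ']
print(clean_text(fragments))
```

Inside a spider, the same helper could wrap the list returned by `.extract()` before the value is yielded.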
Using the same regex, I just modified the code to give better output:
import re

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'gov2'
    start_urls = ['https://beta.companieshouse.gov.uk/search/companies?q=a']

    def parse(self, response):
        # Loop over each result item so every company is yielded as its own record.
        for i in response.css('.type-company'):
            yield {
                'company_name': re.sub(r'\s+', ' ', ''.join(i.css('h3 a ::text').extract())),
                'address': re.sub(r'\s+', ' ', ''.join(i.css('p ::text').extract())),
            }
answered Mar 25 at 2:49 by Arun Augustine
Thanks, the output looks way better, haha!
– Hi tE
Mar 25 at 11:39
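To see why looping over each result item gives one record per company, here is a rough offline sketch. It uses the standard library's `xml.etree.ElementTree` instead of Scrapy so that it runs without a network request, and the markup is a hypothetical, simplified stand-in for the results page (it assumes the keyword is wrapped in a nested tag, which is what the split name in the question suggests). The names `MARKUP`, `all_text` and `parse_results` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the results markup (not the real page).
MARKUP = """
<ul class="results-list">
  <li class="type-company">
    <h3><a href="#"><strong>A</strong> ONE LIMITED</a></h3>
    <p>1 High Street, London</p>
  </li>
  <li class="type-company">
    <h3><a href="#"><strong>A</strong> TWO LIMITED</a></h3>
    <p>2 Low Road, Leeds</p>
  </li>
</ul>
"""


def all_text(element):
    """Join every text node under element and collapse whitespace runs."""
    return ' '.join(''.join(element.itertext()).split())


def parse_results(markup):
    root = ET.fromstring(markup.strip())
    records = []
    # One iteration per result item, so each company becomes its own record.
    for item in root.iter('li'):
        records.append({
            'company_name': all_text(item.find('h3/a')),
            'address': all_text(item.find('p')),
        })
    return records


for record in parse_results(MARKUP):
    print(record)
```

Looping over the outer list instead would produce a single record with every name and address joined into one string, which is the difference between the two answers.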