Nutch is not parsing the entire website, only the first URL
I'm trying to use the Nutch Fetcher to fetch the entire website, but it only loads the first URL:
import org.apache.nutch.fetcher.Fetcher;
// conf is the Nutch configuration, segment is the Path of the generated segment
new Fetcher(conf).fetch(segment, 1); // 1 = number of fetcher threads
This is what I see in the log:
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: starting at 2019-03-29 00:11:47
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: segment: /var/folders/vl/633jwjvn2jvbj9zfg1sgglhw0000gp/T/1198814103175176756/segments/20190329001146
[WARN] org.apache.hadoop.mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
[INFO] org.apache.nutch.fetcher.FetchItemQueues: Using queue mode : byHost
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: threads: 1
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: time-out divisor: 2
[INFO] org.apache.nutch.fetcher.QueueFeeder: QueueFeeder finished: total 1 records hit by time limit : 0
[INFO] org.apache.nutch.net.URLExemptionFilters: Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 129 Using queue mode : byHost
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 fetching http://www.zerocracy.com/ (queue crawl delay=5000ms)
[INFO] org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not configured.
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.host = null
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.port = 8080
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.exception.list = false
[INFO] org.apache.nutch.protocol.http.Http: http.timeout = 10000
[INFO] org.apache.nutch.protocol.http.Http: http.content.limit = 65536
[INFO] org.apache.nutch.protocol.http.Http: http.agent = yc/Nutch-1.15
[INFO] org.apache.nutch.protocol.http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
[INFO] org.apache.nutch.protocol.http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
[INFO] org.apache.nutch.protocol.http.Http: http.enable.cookie.header = true
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 has no more work available
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 -finishing thread FetcherThread, activeThreads=0
[INFO] org.apache.nutch.fetcher.Fetcher: -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
[INFO] org.apache.nutch.fetcher.Fetcher: -activeThreads=0
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: finished at 2019-03-29 00:11:49, elapsed: 00:00:02
What am I missing? It's Nutch 1.15.
Tags: java, nutch
asked Mar 28 at 21:13 by yegor256
1 Answer
The Fetcher class is only responsible for fetching/downloading the URLs present in the segment, using a configured number of threads. In other words, the fetcher does not parse the fetched content or extract URLs from it; the fetch method only downloads the content, nothing more. For your use case you would need to parse the HTML content yourself (or with the org.apache.nutch.parse tools) and generate a new segment for fetching the newly discovered links.
This is how Nutch usually works: you provide one or more seed URLs, those URLs are fetched and parsed, and the newly discovered links are stored for the next iteration.
answered Mar 28 at 22:07 by Jorge Luis
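For example, continuing from the fetch call in the question, a minimal sketch of the missing parse and crawldb-update steps could look like the following. Here crawlDb is an assumed Path pointing at your crawldb, and the exact method signatures vary slightly between Nutch 1.x releases, so check the Javadoc of your version:
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.parse.ParseSegment;
// after new Fetcher(conf).fetch(segment, 1) has downloaded the pages:
new ParseSegment(conf).parse(segment);                    // extract text and outlinks from the raw content
new CrawlDb(conf).update(crawlDb,                         // merge the discovered outlinks into the crawldb
    new Path[] { segment },
    true,                                                 // normalize URLs
    true);                                                // apply URL filters
// the next Generator run can then build a fresh segment from those new links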
Hm... sounds a bit weird. How many iterations will I have? Can you please provide a link to a tutorial explaining this concept? All articles I've seen so far suggest doing generate+fetch+parse+dump and that's it: one iteration and the entire website is supposed to be there.
– yegor256, Mar 29 at 6:35
Exactly, what you described is the normal cycle: generate/fetch/parse/update. This is what the bin/crawl script does (github.com/apache/nutch/blob/master/src/bin/crawl#L340-L352). The issue is that fetch is only the step that downloads the content, nothing more. The link extraction from the raw HTML happens in the parse step; then the crawldb is updated and the new links can go through the same cycle.
– Jorge Luis, Mar 29 at 9:53
One more thing: it is very difficult to crawl an entire website in one cycle, or even in multiple cycles. Nutch has no way of knowing how many links it will find on a website, so the crawl is bounded either by specifying how many URLs you want to fetch or by how "deep" the crawl should go.
– Jorge Luis, Mar 29 at 10:15
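Putting the answer and the comments together, a rough Java equivalent of that bin/crawl loop might look like the sketch below. The paths, the three rounds, and topN = 1000 are placeholder assumptions, the method signatures are approximate for Nutch 1.15 (check the Javadoc of your release), and it assumes the seed URL has already been injected into the crawldb, e.g. with bin/nutch inject crawl/crawldb urls.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class SimpleCrawlLoop {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path crawlDb = new Path("crawl/crawldb");        // placeholder locations
    Path segmentsDir = new Path("crawl/segments");

    for (int round = 0; round < 3; round++) {        // 3 rounds, i.e. a crawl "depth" of 3
      // generate a segment with up to 1000 URLs that are due for fetching
      Path[] segments = new Generator(conf).generate(
          crawlDb, segmentsDir, 1, 1000L, System.currentTimeMillis());
      if (segments == null) {
        break;                                       // nothing left to fetch
      }
      Path segment = segments[0];
      new Fetcher(conf).fetch(segment, 1);           // download the content
      new ParseSegment(conf).parse(segment);         // extract text and outlinks
      new CrawlDb(conf).update(crawlDb,              // feed the outlinks back into the crawldb
          new Path[] { segment }, true, true);
    }
  }
}
Each pass through the loop fetches the links discovered in the previous one, which is why a single generate/fetch/parse round only ever yields the seed URL.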