Nutch is not parsing the entire website, only the first URL
I'm trying to use the Nutch Fetcher to fetch the entire website, but it only loads the first URL:
import org.apache.nutch.fetcher.Fetcher;
// conf is the Nutch configuration, segment is the Path of the generated segment
new Fetcher(conf).fetch(segment, 1); // 1 = number of fetcher threads
This is what I see in the log:
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: starting at 2019-03-29 00:11:47
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: segment: /var/folders/vl/633jwjvn2jvbj9zfg1sgglhw0000gp/T/1198814103175176756/segments/20190329001146
[WARN] org.apache.hadoop.mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
[INFO] org.apache.nutch.fetcher.FetchItemQueues: Using queue mode : byHost
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: threads: 1
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: time-out divisor: 2
[INFO] org.apache.nutch.fetcher.QueueFeeder: QueueFeeder finished: total 1 records hit by time limit : 0
[INFO] org.apache.nutch.net.URLExemptionFilters: Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 129 Using queue mode : byHost
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 fetching http://www.zerocracy.com/ (queue crawl delay=5000ms)
[INFO] org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not configured.
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.host = null
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.port = 8080
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.exception.list = false
[INFO] org.apache.nutch.protocol.http.Http: http.timeout = 10000
[INFO] org.apache.nutch.protocol.http.Http: http.content.limit = 65536
[INFO] org.apache.nutch.protocol.http.Http: http.agent = yc/Nutch-1.15
[INFO] org.apache.nutch.protocol.http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
[INFO] org.apache.nutch.protocol.http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
[INFO] org.apache.nutch.protocol.http.Http: http.enable.cookie.header = true
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 has no more work available
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 -finishing thread FetcherThread, activeThreads=0
[INFO] org.apache.nutch.fetcher.Fetcher: -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
[INFO] org.apache.nutch.fetcher.Fetcher: -activeThreads=0
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: finished at 2019-03-29 00:11:49, elapsed: 00:00:02
What am I missing? It's Nutch 1.15.
Tags: java, nutch
asked Mar 28 at 21:13 by yegor256
1 Answer
The Fetcher class is only responsible for fetching/downloading the URLs present in the segment, using a configured number of threads. In other words, the fetcher does not parse the fetched content or extract URLs from it; the fetch method only downloads the content, nothing more. For your use case you would need to parse the HTML content yourself (or with the org.apache.nutch.parse tools) and generate a new segment for fetching the newly discovered links.
This is how Nutch usually works: you provide one or more seed URLs, those URLs are fetched and parsed, and the newly discovered links are stored for the next iteration.
answered Mar 28 at 22:07 by Jorge Luis
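For example, continuing from the fetch call in the question, a minimal sketch of the missing parse and crawldb-update steps could look like the following. Here crawlDb is an assumed Path pointing at your crawldb, and the exact method signatures vary slightly between Nutch 1.x releases, so check the Javadoc of your version:
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.parse.ParseSegment;
// after new Fetcher(conf).fetch(segment, 1) has downloaded the pages:
new ParseSegment(conf).parse(segment);                    // extract text and outlinks from the raw content
new CrawlDb(conf).update(crawlDb,                         // merge the discovered outlinks into the crawldb
    new Path[] { segment },
    true,                                                 // normalize URLs
    true);                                                // apply URL filters
// the next Generator run can then build a fresh segment from those new links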
Hm... sounds a bit weird. How many iterations will I have? Can you please provide a link to a tutorial explaining this concept? All articles I've seen so far suggest doing generate+fetch+parse+dump and that's it: one iteration and the entire website is supposed to be there.
– yegor256, Mar 29 at 6:35
Exactly, what you described is the normal cycle: generate/fetch/parse/update. This is what the bin/crawl script does (github.com/apache/nutch/blob/master/src/bin/crawl#L340-L352). The issue is that fetch is only the step that downloads the content, nothing more. The link extraction from the raw HTML happens in the parse step; then the crawldb is updated and the new links can go through the same cycle.
– Jorge Luis, Mar 29 at 9:53
One more thing: it is very difficult to crawl an entire website in one cycle, or even in multiple cycles. Nutch has no way of knowing how many links it will find on a website, so the crawl is bounded either by specifying how many URLs you want to fetch or by how "deep" the crawl should go.
– Jorge Luis, Mar 29 at 10:15
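Putting the answer and the comments together, a rough Java equivalent of that bin/crawl loop might look like the sketch below. The paths, the three rounds, and topN = 1000 are placeholder assumptions, the method signatures are approximate for Nutch 1.15 (check the Javadoc of your release), and it assumes the seed URL has already been injected into the crawldb, e.g. with bin/nutch inject crawl/crawldb urls.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class SimpleCrawlLoop {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Path crawlDb = new Path("crawl/crawldb");        // placeholder locations
    Path segmentsDir = new Path("crawl/segments");

    for (int round = 0; round < 3; round++) {        // 3 rounds, i.e. a crawl "depth" of 3
      // generate a segment with up to 1000 URLs that are due for fetching
      Path[] segments = new Generator(conf).generate(
          crawlDb, segmentsDir, 1, 1000L, System.currentTimeMillis());
      if (segments == null) {
        break;                                       // nothing left to fetch
      }
      Path segment = segments[0];
      new Fetcher(conf).fetch(segment, 1);           // download the content
      new ParseSegment(conf).parse(segment);         // extract text and outlinks
      new CrawlDb(conf).update(crawlDb,              // feed the outlinks back into the crawldb
          new Path[] { segment }, true, true);
    }
  }
}
Each pass through the loop fetches the links discovered in the previous one, which is why a single generate/fetch/parse round only ever yields the seed URL.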