
Nutch is not parsing the entire website, only the first URL


I'm trying to use Nutch Fetcher to fetch the entire website, but it only loads the first URL:



import org.apache.nutch.fetcher.Fetcher;

// Fetch the URLs already listed in this segment, using one fetcher thread.
new Fetcher(conf).fetch(segment, 1);


This is what I see in the log:



[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: starting at 2019-03-29 00:11:47
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: segment: /var/folders/vl/633jwjvn2jvbj9zfg1sgglhw0000gp/T/1198814103175176756/segments/20190329001146
[WARN] org.apache.hadoop.mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
[INFO] org.apache.nutch.fetcher.FetchItemQueues: Using queue mode : byHost
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: threads: 1
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: time-out divisor: 2
[INFO] org.apache.nutch.fetcher.QueueFeeder: QueueFeeder finished: total 1 records hit by time limit : 0
[INFO] org.apache.nutch.net.URLExemptionFilters: Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 129 Using queue mode : byHost
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 fetching http://www.zerocracy.com/ (queue crawl delay=5000ms)
[INFO] org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not configured.
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.host = null
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.port = 8080
[INFO] org.apache.nutch.protocol.http.Http: http.proxy.exception.list = false
[INFO] org.apache.nutch.protocol.http.Http: http.timeout = 10000
[INFO] org.apache.nutch.protocol.http.Http: http.content.limit = 65536
[INFO] org.apache.nutch.protocol.http.Http: http.agent = yc/Nutch-1.15
[INFO] org.apache.nutch.protocol.http.Http: http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
[INFO] org.apache.nutch.protocol.http.Http: http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
[INFO] org.apache.nutch.protocol.http.Http: http.enable.cookie.header = true
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 has no more work available
[INFO] org.apache.nutch.fetcher.FetcherThread: FetcherThread 133 -finishing thread FetcherThread, activeThreads=0
[INFO] org.apache.nutch.fetcher.Fetcher: -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
[INFO] org.apache.nutch.fetcher.Fetcher: -activeThreads=0
[INFO] org.apache.nutch.fetcher.Fetcher: Fetcher: finished at 2019-03-29 00:11:49, elapsed: 00:00:02


What am I missing? It's Nutch 1.15.










java nutch

asked Mar 28 at 21:13 by yegor256

























1 Answer


































The Fetcher class is only responsible for fetching/downloading the URLs present in the segment, using a configured number of threads. In other words, the fetcher does not parse the fetched content or extract URLs from it; the fetch method only downloads the content, nothing more. For your use case you would need to parse the HTML content yourself (or use the org/apache/nutch/parse tools) and generate a new segment for fetching the newly discovered links.



This is how Nutch usually works: you provide one or more seed URLs, these URLs are fetched and parsed, and the newly discovered links are stored for the next iteration.
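To make the cycle concrete, here is a minimal sketch of one full inject/generate/fetch/parse/updatedb iteration driven from Java. It invokes the Nutch 1.x tool classes through Hadoop's ToolRunner with the same arguments the bin/nutch CLI takes (which also addresses the ToolRunner warning in the log above). The urls seed directory, the crawl/crawldb and crawl/segments paths, and the newestSegment() helper are illustrative assumptions, not Nutch API, and exact tool signatures can differ across Nutch versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class OneCrawlIteration {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        String crawlDb = "crawl/crawldb";   // assumed directory layout
        String segments = "crawl/segments"; // assumed directory layout

        // Seed the crawldb from a directory of seed-URL files ("bin/nutch inject").
        ToolRunner.run(conf, new Injector(), new String[] {crawlDb, "urls"});
        // Create a new timestamped segment holding a fetch list ("bin/nutch generate").
        ToolRunner.run(conf, new Generator(), new String[] {crawlDb, segments});
        String segment = newestSegment(segments);

        // Fetch only downloads the content of the listed URLs...
        ToolRunner.run(conf, new Fetcher(), new String[] {segment});
        // ...parsing is what extracts the outlinks from the fetched HTML...
        ToolRunner.run(conf, new ParseSegment(), new String[] {segment});
        // ...and updatedb feeds those outlinks back into the crawldb,
        // so the next generate call can pick them up.
        ToolRunner.run(conf, new CrawlDb(), new String[] {crawlDb, segment});
    }

    // Hypothetical helper: generated segments are named by timestamp, so the
    // lexicographically largest directory is the one just created.
    static String newestSegment(String segmentsDir) {
        String[] names = new java.io.File(segmentsDir).list();
        java.util.Arrays.sort(names);
        return segmentsDir + "/" + names[names.length - 1];
    }
}

Running this once fetches only the seed URLs, exactly as in the question's log; repeating the generate/fetch/parse/updatedb steps is what makes the crawl expand to the rest of the site.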






answered Mar 28 at 22:07 by Jorge Luis

























• Hm... sounds a bit weird. How many iterations will I have? Can you please provide a link to some tutorial explaining this concept? All the articles I've seen so far suggest doing generate+fetch+parse+dump and that's it: one iteration, and the entire website is supposed to be there.

  – yegor256
  Mar 29 at 6:35






• Exactly, what you described is the normal cycle: generate/fetch/parse/update; this is what the bin/crawl script does (github.com/apache/nutch/blob/master/src/bin/crawl#L340-L352). The issue is that fetch is only the step that downloads the content, nothing more. Link extraction from the raw HTML happens in the parse step; the crawldb is then updated, and the new links can go through the same cycle.

  – Jorge Luis
  Mar 29 at 9:53






• One more thing: it is very difficult to crawl an entire website in one cycle, or even in multiple cycles, because Nutch has no way of knowing in advance how many links it will find on a website. Instead, the crawl is bounded either by specifying how many URLs you want to fetch or by limiting how "deep" the crawl should go (see the sketch after these comments).

  – Jorge Luis
  Mar 29 at 10:15
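A sketch of that bounding, mirroring the loop in bin/crawl: the round count acts as the crawl "depth" and -topN caps how many URLs are fetched per round. It reuses conf, crawlDb, segments, and the hypothetical newestSegment() helper from the sketch above; both limit values are illustrative:

int rounds = 3;        // crawl "depth": number of generate/fetch cycles
String topN = "1000";  // fetch at most this many URLs per round
for (int i = 0; i < rounds; i++) {
    int rc = ToolRunner.run(conf, new Generator(),
        new String[] {crawlDb, segments, "-topN", topN});
    if (rc != 0) {
        break; // no new segment generated (nothing left to fetch, or an error)
    }
    String segment = newestSegment(segments);
    ToolRunner.run(conf, new Fetcher(), new String[] {segment});
    ToolRunner.run(conf, new ParseSegment(), new String[] {segment});
    ToolRunner.run(conf, new CrawlDb(), new String[] {crawlDb, segment});
}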











