Does robots.txt prevent humans from gathering data?


I understand that robots.txt is a file intended for "robots", that is, automated crawlers. However, does it prevent a human from typing in the URL of a "forbidden" page and gathering the data by hand?

Maybe it's clearer with an example. I cannot crawl this page:

https://www.drivy.com/search?address=Gare+de+Li%C3%A8ge-Guillemins&address_source=&poi_id=&latitude=50.6251&longitude=5.5659&city_display_name=&start_date=2019-04-06&start_time=06%3A00&end_date=2019-04-07&end_time=06%3A00&country_scope=BE

Can I still retrieve the JSON file containing the data "manually", via my web browser's developer tools?







Tags: browser, scrapy, robots.txt






edited Mar 23 at 2:15 by unor

asked Mar 21 at 15:50 by M. Coppée






















          2 Answers
robots.txt files are guidelines; they do not prevent anyone, human or machine, from accessing any content.
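As a rough illustration of why robots.txt is only advisory, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules, the "example-bot" user agent, the example.com URLs and the is_allowed helper are all made up for the example. The point is that the check runs in the client, which is free to skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
# Parse the rules from a list of lines instead of downloading them.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = "example-bot") -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/search?q=cars"))
print(is_allowed("https://example.com/about"))
```

Nothing here blocks a request. A well-behaved crawler calls something like is_allowed before fetching, and a client that never asks is not stopped by the file itself.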



The settings.py file that is generated for a new Scrapy project sets ROBOTSTXT_OBEY to True, so spiders in such a project honor robots.txt. You can set it to False if you wish.
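For reference, the relevant line in a project's settings.py looks roughly like the fragment below. ROBOTSTXT_OBEY is the setting named above. The comments are my own summary of its effect, not text from the generated file.

```python
# settings.py (fragment of a Scrapy project's settings module)

# True:  Scrapy consults the site's robots.txt and filters out requests
#        that it disallows.
# False: robots.txt is not consulted at all.
ROBOTSTXT_OBEY = False
```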



Keep in mind, though, that websites may employ anti-scraping measures to prevent you from scraping those pages. But that is a whole other topic.







answered Mar 21 at 16:05 by Gallaecio












• Oh OK, so I guess that if I use the Scrapy shell on a URL and the response appears, it automatically means that my bot is respecting robots.txt (since the default is True)?
  – M. Coppée, Mar 21 at 16:23











• Actually, the default value of ROBOTSTXT_OBEY is False for historical reasons, but if you create a project with the scrapy command-line tool, the generated settings.py file overrides the default value and sets it to True. Check your settings.py file if you are unsure. If you do not have a settings.py file, then ROBOTSTXT_OBEY is False, and whatever issue you are having must be unrelated to the robots.txt file.
  – Gallaecio, Mar 22 at 12:24

















Based on the original robots.txt specification from 1994, the rules in a robots.txt file only target robots:

"WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

[…]

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed."

So, robots are programs that automatically retrieve documents linked/referenced in other documents.



If a human retrieves a document (using a browser or some other program), or if a human feeds a list of manually collected URLs to some program (and the program doesn’t add/follow references in the retrieved documents), the rules in robots.txt do not apply.
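To make the second case concrete, here is a minimal sketch of a program that is fed a hand-collected list of URLs and never follows references found in the responses. The names fetch_url and fetch_listed are hypothetical, and any URLs passed in would be placeholders. Whether such a program falls outside the definition of a robot is the reading given above, not something the code decides.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_url(url: str) -> bytes:
    """Download one URL and return the raw response body."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_listed(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
) -> Dict[str, bytes]:
    """Fetch exactly the URLs given, in order.

    The bodies are stored as-is; they are never parsed for links, so no
    URL outside the supplied list is ever requested.
    """
    results: Dict[str, bytes] = {}
    for url in urls:
        results[url] = fetch(url)
    return results
```

Calling fetch_listed with a list of URLs typed in by hand downloads just those documents. The fetch parameter exists only so the download step can be swapped out, for example in a test.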



The FAQ "What is a WWW robot?" confirms this:

"Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images)."








answered Mar 23 at 2:36 by unor


























