Does robots.txt prevent humans from gathering data?
I understand that robots.txt is a file intended for "robots", that is, automated crawlers. However, does it prevent a human from typing the URL of a "forbidden" page and gathering the data by hand?
Maybe it's clearer with an example: I cannot crawl this page:
https://www.drivy.com/search?address=Gare+de+Li%C3%A8ge-Guillemins&address_source=&poi_id=&latitude=50.6251&longitude=5.5659&city_display_name=&start_date=2019-04-06&start_time=06%3A00&end_date=2019-04-07&end_time=06%3A00&country_scope=BE
Can I still retrieve the JSON file containing the data manually, via my web browser's developer tools?
browser scrapy robots.txt
edited Mar 23 at 2:15 by unor
asked Mar 21 at 15:50 by M. Coppée
2 Answers
robots.txt files are guidelines; they do not prevent anyone, human or machine, from accessing any content.
The default settings.py file generated for a Scrapy project sets ROBOTSTXT_OBEY to True. You can set it to False if you wish.
Mind that websites may nonetheless employ anti-scraping measures to prevent you from scraping those pages. But that is a whole other topic.
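A minimal sketch of what "guideline" means in practice, using Python's standard urllib.robotparser module. The robots.txt content, the bot name, and the example.com URLs are invented for illustration. The point is that the check runs inside the client: a program that never calls it is not stopped by the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="example-bot"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# The client asks the question itself; nothing on the server enforces the answer.
search_allowed = may_fetch("https://www.example.com/search?q=cars")
about_allowed = may_fetch("https://www.example.com/about")
```

Here `search_allowed` is False and `about_allowed` is True, but only a crawler that chooses to consult `may_fetch` acts on that result.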
answered Mar 21 at 16:05 by Gallaecio
Oh OK, so I guess that if I use the Scrapy shell for a URL and the response appears, it automatically means that my bot is respecting robots.txt (since the default is True)?
– M. Coppée
Mar 21 at 16:23
Actually, the default value of ROBOTSTXT_OBEY is False for historical reasons, but if you create a project with the scrapy command-line tool, the generated settings.py file overrides the default value and sets it to True. Check your settings.py file if you are unsure. If you do not have a settings.py file, then ROBOTSTXT_OBEY is False, and whatever issue you are having must be unrelated to the robots.txt file.
– Gallaecio
Mar 22 at 12:24
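For reference, the setting discussed in this comment is a single assignment in a project's settings.py. The fragment below is a sketch of that one line, not a complete settings file. Running `scrapy settings --get ROBOTSTXT_OBEY` from inside the project directory should print the value actually in effect.

```python
# settings.py (fragment; only the line relevant to robots.txt is shown)

# Scrapy's built-in default for this setting is False. The settings.py
# generated by the scrapy command-line tool sets it to True, so a project
# created that way obeys robots.txt unless this line is changed or removed.
ROBOTSTXT_OBEY = True
```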
Based on the original robots.txt specification from 1994, the rules in a robots.txt file target only robots:
"WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages."
[…]
"These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed."
So, robots are programs that automatically retrieve documents linked or referenced in other documents.
If a human retrieves a document (using a browser or some other program), or if a human feeds a list of manually collected URLs to some program (and the program doesn't add or follow references in the retrieved documents), the rules in the robots.txt file do not apply.
The FAQ "What is a WWW robot?" confirms this:
"Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images)."
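The distinction drawn above can be sketched in Python. Everything here is an assumption made for illustration: the in-memory site, its rules, the bot name, and the function names `fetch_listed` and `crawl`. Going by the definition quoted above, only `crawl` behaves like a robot, because it discovers and follows links on its own; `fetch_listed` retrieves just the paths a human handed to it.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory site (path -> HTML), invented for this sketch.
SITE = {
    "/": '<a href="/cars">cars</a> <a href="/search">search</a>',
    "/cars": '<a href="/">home</a>',
    "/search": "search results",
}

# Hypothetical robots.txt rules for that site.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /search"])


def fetch(path):
    """Stand-in for an HTTP GET; reads from the in-memory site."""
    return SITE[path]


def fetch_listed(paths):
    """Fetch exactly the paths a human supplied; never follows links."""
    return {path: fetch(path) for path in paths}


def crawl(start, user_agent="example-bot"):
    """Follow links recursively, skipping paths that robots.txt disallows."""
    seen, queue, visited = {start}, [start], []
    while queue:
        path = queue.pop(0)
        if not rules.can_fetch(user_agent, path):
            continue  # a well-behaved robot voluntarily skips this path
        visited.append(path)
        for link in re.findall(r'href="([^"]+)"', fetch(path)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


crawled = crawl("/")
listed = fetch_listed(["/search"])
```

In this sketch `crawl("/")` visits "/" and "/cars" and skips "/search", while `fetch_listed(["/search"])` returns that page because it never consults the rules.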
answered Mar 23 at 2:36 by unor