Does robots.txt prevent humans from gathering data?


I understand that robots.txt is a file intended for "robots", or should I say "automated crawlers". However, does it prevent a human from typing the URL of a "forbidden" page and gathering the data by hand?

Maybe it's clearer with an example: I cannot crawl this page:

https://www.drivy.com/search?address=Gare+de+Li%C3%A8ge-Guillemins&address_source=&poi_id=&latitude=50.6251&longitude=5.5659&city_display_name=&start_date=2019-04-06&start_time=06%3A00&end_date=2019-04-07&end_time=06%3A00&country_scope=BE

Can I still retrieve the JSON file containing the data "manually", via my web browser's developer tools?

asked Mar 21 at 15:50 by M. Coppée, edited Mar 23 at 2:15 by unor


browser scrapy robots.txt


2 Answers

robots.txt files are guidelines; they do not prevent anyone, human or machine, from accessing any content.
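To illustrate the advisory nature of the file, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `my-bot` user agent, and the `example.com` URLs are made up for illustration. The parser only answers the question "is this allowed?"; acting on the answer is left entirely to the client.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what the robots.txt rules say about fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The rules disallow /search, but nothing here blocks a request:
# a polite crawler checks the answer and chooses to skip the URL.
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/search?q=cars"))
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/about"))
```

The check is purely client-side: a program that never calls `can_fetch` is not stopped by the file in any way.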

The default settings.py file that is generated for a Scrapy project sets ROBOTSTXT_OBEY to True. You can set it to False if you wish.
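As a sketch, the relevant line in a project's `settings.py` would look like this (the rest of the generated file is omitted here):

```python
# settings.py (fragment of a Scrapy project's settings module)

# The settings.py generated for a new project sets this to True.
# Setting it to False tells Scrapy not to consult robots.txt before requests.
ROBOTSTXT_OBEY = False
```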

Mind that websites may nonetheless employ anti-scraping measures to prevent you from scraping those pages, but that is a whole other topic.

answered Mar 21 at 16:05 by Gallaecio

Oh OK. So I guess that if I use the Scrapy shell on a URL and a response appears, it automatically means that my bot is respecting robots.txt (since the default is True)?

– M. Coppée
Mar 21 at 16:23

Actually, the default value of ROBOTSTXT_OBEY is False for historical reasons, but if you create a project with the scrapy command-line tool, the generated settings.py file overrides the default value and sets it to True. Check your settings.py file if you are unsure. If you do not have a settings.py file, then ROBOTSTXT_OBEY is False, and whatever issue you are having must be unrelated to the robots.txt file.

– Gallaecio
Mar 22 at 12:24
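The override behaviour described in this comment can be modelled with a small, self-contained sketch. This is not Scrapy's actual implementation; `BUILTIN_DEFAULTS` and `effective_settings` are hypothetical names used only to show how a project-level value takes priority over a built-in default.

```python
from collections import ChainMap

# Built-in default as described in the comment (False for historical reasons).
BUILTIN_DEFAULTS = {"ROBOTSTXT_OBEY": False}


def effective_settings(project_settings=None):
    """Return a view where project values take priority over the defaults."""
    return ChainMap(project_settings or {}, BUILTIN_DEFAULTS)


# A generated settings.py overrides the default and sets the value to True.
generated_project = {"ROBOTSTXT_OBEY": True}
print(effective_settings(generated_project)["ROBOTSTXT_OBEY"])  # True

# Without any settings.py, only the built-in default applies.
print(effective_settings()["ROBOTSTXT_OBEY"])  # False
```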


Based on the original robots.txt specification from 1994, the rules in a robots.txt only target robots (bold emphasis mine):

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

[…]

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed.

So, robots are programs that automatically retrieve documents linked/referenced in other documents.

If a human retrieves a document (using a browser or some other program), or if a human feeds a list of manually collected URLs to some program (and the program doesn’t add/follow references in the retrieved documents), the rules in the robots.txt do not apply.
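The second case, a program fed a manually collected list of URLs, might look like the following minimal sketch. The function names are hypothetical, and the fetcher is passed in as a parameter so the example can be exercised without network access. The important property is that the program retrieves exactly the listed URLs and never inspects the responses for further links.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_with_urllib(url: str) -> bytes:
    """Download one URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def fetch_listed(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_with_urllib,
) -> Dict[str, bytes]:
    """Retrieve exactly the given URLs, one request per entry.

    The response bodies are stored as-is; they are never parsed for
    links, so no document is retrieved that the human did not list.
    """
    return {url: fetch(url) for url in urls}


# Usage (performs real requests, so it is left commented out):
# pages = fetch_listed(["https://example.com/a", "https://example.com/b"])
```

A crawler, by contrast, would extract the links from each response and queue them for retrieval, which is the recursive behaviour the specification describes.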

The FAQ "What is a WWW robot?" confirms this:

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

answered Mar 23 at 2:36 by unor
