Using rendered page links via a Java client
I am given a URL. I need to fetch that page's HTML and extract the site's links from it.
I thought about using a headless browser. I'm using Java, so I would like to do all of this within a Java process.
An example site would be CNN.
So far I have tried the following:
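testCompile 'net.sourceforge.htmlunit:htmlunit:2.32'

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.List;
import org.junit.Test;

@Test
public void htmlUnitTest() throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        webClient.waitForBackgroundJavaScriptStartingBefore(20000);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        final HtmlPage page = webClient.getPage(URL); // URL: the page address, defined elsewhere
        WebResponse response = page.getWebResponse();
        String content = response.getContentAsString();
        List<HtmlAnchor> anchors = page.getAnchors();
        System.out.println("anchors.size() : " + anchors.size());
        System.out.println("***********");
        System.out.println(content);
        System.out.println("***********");
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("htmlUnit.txt"))) {
            writer.write(content);
        }
    }
}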
But the response I get is the original HTML, not the rendered page: the JavaScript has not run, so the page anchors it creates (in my case) are missing.
Can someone recommend another library, or tell me whether I am misusing HtmlUnit and suggest a working solution? That would be very helpful.
java htmlunit
asked Mar 26 at 8:09
yoav.str
Please provide the URL so that we have a chance to reproduce your case.
– RBRi
Mar 26 at 10:37
Any site that renders its content with JavaScript; try, for instance, edition.cnn.com.
– yoav.str
Mar 27 at 8:31
1 Answer
The waitForBackgroundJavaScriptXX methods are not options that you configure up front; you have to call them AFTER getPage(URL), or after any other interaction such as click().
One of the major differences between HtmlUnit and Selenium is how tightly the parts are integrated. In HtmlUnit the JavaScript engine is part of the implementation, which means the API can get information about the current state of script execution. As a result, the waitForBackgroundJavaScriptXX methods only wait if there is some JavaScript pending. If there is none, they are no-ops.
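To make the ordering point concrete, here is a small toy model in plain Java. It is only a sketch and not HtmlUnit code: ToyClient, ToyPage, waitForBackgroundJobs and the two link paths are hypothetical names invented for the illustration, and the real methods additionally take a timeout, which is left out here. It assumes only what is described above: loading a page schedules the script work, and a wait call processes whatever is pending at the moment it is called.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only: none of these classes are HtmlUnit types.
class WaitOrderSketch {

    /** Hypothetical stand-in for a page whose links are created by scripts after loading. */
    static final class ToyPage {
        final List<String> anchors = new ArrayList<>();
    }

    /** Hypothetical stand-in for a client with an integrated script-job queue. */
    static final class ToyClient {
        private final Queue<Runnable> pendingJobs = new ArrayDeque<>();

        /** Loading returns the static markup at once and only schedules the script work. */
        ToyPage getPage() {
            ToyPage page = new ToyPage();
            pendingJobs.add(() -> page.anchors.add("/world"));
            pendingJobs.add(() -> page.anchors.add("/politics"));
            return page;
        }

        /** Runs the jobs pending right now; with none pending it returns immediately. */
        int waitForBackgroundJobs() {
            int executed = 0;
            Runnable job;
            while ((job = pendingJobs.poll()) != null) {
                job.run();
                executed++;
            }
            return executed;
        }
    }

    /** Order used in the question: wait first, then load. */
    static int anchorsWhenWaitingBeforeLoad() {
        ToyClient client = new ToyClient();
        client.waitForBackgroundJobs(); // nothing is pending yet, so this is a no-op
        ToyPage page = client.getPage();
        return page.anchors.size();
    }

    /** Order recommended above: load first, then wait. */
    static int anchorsWhenWaitingAfterLoad() {
        ToyClient client = new ToyClient();
        ToyPage page = client.getPage();
        client.waitForBackgroundJobs(); // the scheduled jobs run now and add the anchors
        return page.anchors.size();
    }

    public static void main(String[] args) {
        System.out.println("wait before load: " + anchorsWhenWaitingBeforeLoad()); // prints 0
        System.out.println("wait after load:  " + anchorsWhenWaitingAfterLoad());  // prints 2
    }
}
```

In the question's test, the equivalent change would be to move webClient.waitForBackgroundJavaScriptStartingBefore(20000) so that it follows webClient.getPage(URL). Note also that, as far as I know, response.getContentAsString() returns the body the server originally sent, so even with the corrected order it would not show script-generated links; page.getAnchors() and page.asXml() read the current DOM instead.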
edited Mar 29 at 17:44
answered Mar 27 at 18:19
RBRi
The JavaScript in question renders the page as soon as it is interpreted on page load, without any event being triggered. The motivation is to be website-agnostic, meaning I don't want to depend on each site's architecture. How do real-world crawlers such as Yahoo's and Google's do it?
– yoav.str
Mar 28 at 19:05