Using rendered page links via a Java client

I am given a URL. I need to fetch the HTML of that page and extract the links it contains.
I thought about using a headless browser. I am using Java, so I would like to do all of this inside a Java process.

An example would be the CNN site.
So far I have tried HtmlUnit:




    testCompile 'net.sourceforge.htmlunit:htmlunit:2.32'

    @Test
    public void htmlUnitTest() throws Exception {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.waitForBackgroundJavaScriptStartingBefore(20000);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            final HtmlPage page = webClient.getPage(URL);
            WebResponse response = page.getWebResponse();
            String content = response.getContentAsString();

            List<HtmlAnchor> anchors = page.getAnchors();

            System.out.println("anchors.size() : " + anchors.size());
            System.out.println("***********");
            System.out.println(content);
            System.out.println("***********");

            try (BufferedWriter writer = new BufferedWriter(new FileWriter("htmlUnit.txt"))) {
                writer.write(content);
            }
        }
    }





But the response I get is the original HTML, not the rendered page (the JavaScript has not run, so the anchors it creates are missing in my case).

Can someone recommend another library, or tell me whether I am misusing HtmlUnit and suggest a working solution? It would be very helpful.










  • Please provide the URL so that we have a chance to reproduce your case.

    – RBRi
    Mar 26 at 10:37











  • Any site that renders its content with JavaScript; try edition.cnn.com, for instance.

    – yoav.str
    Mar 27 at 8:31

















Tags: java, htmlunit






asked Mar 26 at 8:09 by yoav.str












1 Answer
The waitForBackgroundJavaScriptXX methods are not options; you have to call them AFTER getPage(URL) or any other interaction like click().



One of the major differences between HtmlUnit and Selenium is the integration of all parts. In HtmlUnit the JavaScript engine is part of the implementation, which means the API can query the engine's current state. As a result, the waitForBackgroundJavaScriptXX methods only wait if some JavaScript is pending. If there is none, they are no-ops.
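As a minimal sketch of that ordering (not the asker's final code): the example below loads a page, waits for pending background JavaScript only after getPage() has returned, and then reads the anchors from the live DOM. To stay runnable offline it serves a small made-up page through HtmlUnit's MockWebConnection. The class name RenderedLinks, the DEMO_HTML page, the example.test URL and the 5 second timeout are all assumptions for illustration. The package names are those of HtmlUnit 2.x, the version used in the question.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.ArrayList;
import java.util.List;

public class RenderedLinks {

    // Hypothetical demo page: one static link, plus one link that a
    // script appends to the body 300 ms after the page has loaded.
    static final String DEMO_HTML =
        "<html><body>"
        + "<a href='/static'>static</a>"
        + "<script>"
        + "setTimeout(function () {"
        + "  var a = document.createElement('a');"
        + "  a.href = '/late';"
        + "  a.innerHTML = 'late';"
        + "  document.body.appendChild(a);"
        + "}, 300);"
        + "</script>"
        + "</body></html>";

    // Loads the given HTML as if it had been served from url and returns
    // the absolute URLs of all anchors present after background scripts ran.
    static List<String> renderedLinks(String html, String url) throws Exception {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // This one really is an option, so it is set before loading.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // Assumption for the demo: answer every request with html
            // instead of going to the network.
            MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(html);
            webClient.setWebConnection(connection);

            HtmlPage page = webClient.getPage(url);

            // Wait AFTER getPage(): blocks until pending background jobs
            // are done or the timeout (in milliseconds) expires.
            webClient.waitForBackgroundJavaScript(5_000);

            // Read anchors from the page object, which reflects DOM changes
            // made by scripts, and resolve each href against the page URL.
            List<String> links = new ArrayList<>();
            for (HtmlAnchor anchor : page.getAnchors()) {
                links.add(page.getFullyQualifiedUrl(anchor.getHrefAttribute()).toString());
            }
            return links;
        }
    }

    public static void main(String[] args) throws Exception {
        for (String link : renderedLinks(DEMO_HTML, "http://example.test/")) {
            System.out.println(link);
        }
    }
}
```

For a real site, drop the MockWebConnection lines and pass the site URL to getPage(). Note also that page.getWebResponse().getContentAsString() returns the response as the server sent it; to dump the markup after scripts have run, page.asXml() is the usual choice. Whether a given site renders fully still depends on how well HtmlUnit's JavaScript engine copes with that site's scripts.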






  • This JavaScript renders the page as soon as it is interpreted on page load, without any event being triggered. My motivation is to be website-agnostic, meaning I don't want to depend on knowledge of a site's architecture. How do real-world crawlers such as Yahoo and Google do it?

    – yoav.str
    Mar 28 at 19:05











answered Mar 27 at 18:19 by RBRi, edited Mar 29 at 17:44











