Take the following HTML file for example:
<html>
<head></head>
<body>
<a href="http://example.com/plain-html-a-link">html-link</a>
<a id=jsalink href=placeholder>jsalink</a>
<script>
var jslink = document.getElementById("jsalink")
jslink.href = "http://example.com/js_a_link"
</script>
</body>
</html>
There are obviously two A links there, but one of them is modified by JS. This kind of dynamic modification of elements is extremely common today. So what happens if you use the old method of grabbing A links?
$ curl localhost:8000/jsalink.html 2>/dev/null | grep '<a'
<a href="http://example.com/plain-html-a-link">html-link</a>
<a id=jsalink href=placeholder>jsalink</a>
Well, that obviously didn't work... How about using the BeautifulSoup Python module?
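A minimal sketch of such a script (call it atu-getlinks.py, matching the invocation below; the body here is an assumption, since only the script name appears in the transcript):

#!/usr/bin/env python3
# Sketch: fetch a page and print the href of every <a> tag found in the
# raw HTML. This parses the markup only; it never runs the page's JS.
import sys
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = sys.argv[1]
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a", href=True):
    # Resolve relative hrefs (like "placeholder") against the page URL.
    print(urljoin(url, a["href"]))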
$ python3 atu-getlinks.py http://localhost:8000/jsalink.html
http://example.com/plain-html-a-link
http://localhost:8000/placeholder
Also no... The best way I've found to do it is to have an actual browser engine parse the entire file and execute the JS, and then grab all the A links by issuing a command to the JS interpreter. I wrote the following script to do exactly that. It uses the Chrome browser in headless mode to do all the parsing, and then, via Selenium, issues a JS statement to grab all the A links.
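A minimal sketch of that script (assuming the selenium Python package and a chromedriver on the PATH; the exact details here are illustrative):

#!/usr/bin/env python3
# Sketch: load a page in headless Chrome, let it parse the HTML and run
# its JS, then ask the JS interpreter for the final href of every <a>.
import sys

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = sys.argv[1]

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
try:
    driver.get(url)
    # Run a JS statement in the page to collect the post-script hrefs.
    hrefs = driver.execute_script(
        "return Array.prototype.map.call("
        "document.getElementsByTagName('a'), function(a) { return a.href; });"
    )
    for href in hrefs:
        print(href)
finally:
    driver.quit()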
Running this results in:
$ python3 selenium-getlinks.py http://localhost:8000/jsalink.html
http://example.com/plain-html-a-link
http://example.com/js_a_link
That's more like it.
PS. This is still not "perfect", since certain frameworks change content from inside event handlers. This approach handles some of them (e.g. DOMContentLoaded) but not others (e.g. onclick handlers). You mostly just have to live with that; writing a script that tried to trigger every event handler in order to catch the changes would likely be extremely risky.
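For example, a link whose href is only rewritten inside a click handler (hypothetical snippet below) would still report its original placeholder href, because the handler never fires during a plain page load:

<a id=clicklink href=placeholder onclick="this.href='http://example.com/click_a_link'">clicklink</a>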