Wednesday, February 12, 2020

A Better, More Modern, HTML Link Grabber

Lots of examples of HTML <a> link grabbers simply parse the page's source code for <a> links and output them. I'm sure I don't need to say that this technique is antiquated and doesn't work well with modern web applications and front-end frameworks. Everybody and their mother just loves modifying HTML using JavaScript, and the old method misses all of that.

Take the following HTML file for example:

<html>
    <head>
        <body>
            <a href="http://example.com/plain-html-a-link">html-link</a>
            <a id=jsalink href=placeholder>jsalink</a>
        <script>
            var jslink = document.getElementById("jsalink")
            jslink.href = "http://example.com/js_a_link"
        </script>
        </body>
    </head>
</html>

There are obviously two A links there, but one of them has its href rewritten by JS as the page loads. This dynamic modification of elements is extremely common today. So what happens if you use the old method of getting A links?

$ curl localhost:8000/jsalink.html 2>/dev/null | grep '<a'
            <a href="http://example.com/plain-html-a-link">html-link</a>
            <a id=jsalink href=placeholder>jsalink</a>

Well, that obviously didn't work: grep only sees the placeholder href, not the URL that the script sets. How about using the BeautifulSoup Python module?

$ python3 atu-getlinks.py http://localhost:8000/jsalink.html
http://example.com/plain-html-a-link
http://localhost:8000/placeholder
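
To see why a static parser lands on the placeholder, here is a minimal sketch of one. This is not the atu-getlinks.py script used above: it swaps BeautifulSoup for Python's built-in html.parser so it runs with no extra dependencies, and the names LinkParser, static_links and SAMPLE are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags exactly as written in the source."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative hrefs against the page URL, like a browser would.
            self.links.append(urljoin(self.base_url, href))


def static_links(html, base_url):
    """Return every <a href> in the raw HTML. No JavaScript is executed."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


# The test page from above, inlined so the sketch is self-contained.
SAMPLE = """
<a href="http://example.com/plain-html-a-link">html-link</a>
<a id=jsalink href=placeholder>jsalink</a>
<script>
    var jslink = document.getElementById("jsalink")
    jslink.href = "http://example.com/js_a_link"
</script>
"""

for link in static_links(SAMPLE, "http://localhost:8000/jsalink.html"):
    print(link)
```

The parser treats the <script> body as opaque text, so the assignment to jslink.href never happens and the relative placeholder is all it can report.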

Also no... The best way I've found is to have a browser engine parse the entire file and execute the JS, and then grab all the A links by issuing a command to the JS interpreter. I wrote the following script to do exactly that. It uses the Chrome browser in headless mode to perform all the parsing and then, via Selenium, issues a JS statement to grab all the A links:
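
A minimal sketch of that approach, assuming Selenium 4 with a local Chrome install. The names JS_COLLECT_LINKS, rendered_links and headless_chrome are illustrative, and the details may differ from the script that produced the output below.

```python
import sys

# JS run inside the page after it has loaded: a.href is the resolved,
# current value of the link, including any changes made by scripts.
JS_COLLECT_LINKS = (
    "return Array.from(document.querySelectorAll('a[href]'), a => a.href);"
)


def rendered_links(driver, url):
    """Load url in the given WebDriver and return the hrefs of all <a> tags."""
    driver.get(url)  # waits for the page load event by default
    return driver.execute_script(JS_COLLECT_LINKS)


def headless_chrome():
    """Start headless Chrome (assumes Selenium 4 can locate a Chrome driver)."""
    # Imported here so rendered_links can be used with any driver object.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


if __name__ == "__main__" and len(sys.argv) == 2:
    driver = headless_chrome()
    try:
        for link in rendered_links(driver, sys.argv[1]):
            print(link)
    finally:
        driver.quit()  # always shut the browser down
```

Because the query runs in the live DOM rather than against the source text, it reports whatever href each link has at the moment the statement executes.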


Running this results in:

$ python3 selenium-getlinks.py http://localhost:8000/jsalink.html
http://example.com/plain-html-a-link
http://example.com/js_a_link

That's more like it.

PS. This is still not "perfect", since some frameworks change content from event handlers. This approach catches changes made by handlers that fire during page load (e.g. DOMContentLoaded), but not ones that need user interaction (e.g. onclick events). You kinda just have to deal with that. Making a script to identify changes based on all event handlers would likely be extremely risky.