Thursday, August 2, 2018

XPATH Notes (how to grep xpath)

XPATH is a query language for XML document trees. Lots of web scrapers use it, since an HTML page can be parsed into the same kind of tree.

A basic grep-like XPATH query looks something like the following:

  • //*[@itemprop="recipeIngredient"]
Breakdown:

  • // = search the whole tree from the root down, matching at any depth rather than only direct children
  • * = any tag; replace it with a tag name to match only that tag
  • [blah] = evaluate the condition blah inside the brackets
  • @itemprop = This is how you reference attributes instead of tags
  • [@itemprop] = the condition is: if the itemprop attribute exists in some tag
  • [@itemprop="recipeIngredient"] = condition is: if itemprop attribute's value is exactly "recipeIngredient" (the comparison is case-sensitive)
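To try the breakdown above, here is a minimal sketch using Python's built-in xml.etree.ElementTree. The sample document is made up for illustration, and ElementTree only understands a subset of XPATH: it needs well-formed XML, and it wants a relative path on an element, so "//*" is written ".//*" here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not from any real page).
SAMPLE = """
<html>
  <body>
    <h1 itemprop="name">Pancakes</h1>
    <ul>
      <li itemprop="recipeIngredient">2 cups flour</li>
      <li itemprop="recipeIngredient">1 egg</li>
      <li>not an ingredient</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Any tag, at any depth, whose itemprop attribute equals "recipeIngredient".
ingredients = [
    el.text for el in root.findall(".//*[@itemprop='recipeIngredient']")
]

# [@itemprop] on its own only checks that the attribute exists.
tagged = [el.tag for el in root.findall(".//*[@itemprop]")]

# The value comparison is case-sensitive, so the lowercase spelling finds nothing.
lowercase = root.findall(".//*[@itemprop='recipeingredient']")

print(ingredients)
print(tagged)
print(len(lowercase))
```

Real-world HTML is often not well-formed XML, so an actual scraper would probably need an HTML-aware parser in front of the query.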
Another example: if I wanted to find anything that references example.com in an XML document, I'd search for any href attribute that contains "example.com". A plain = only matches when the whole value is exactly "example.com", so the contains() function is needed:
  • //*[contains(@href, 'example.com')]
Or limit it just to direct hyperlinks like "a" tags
  • //a[contains(@href, 'example.com')]
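A substring search on href can be sketched in Python as well. One caveat: the standard library's xml.etree.ElementTree does not implement contains(), so this sketch selects every tag that has an href with XPATH and does the "contains" part in plain Python. A full XPATH 1.0 engine (lxml, for example) should accept the contains(@href, 'example.com') form directly. The sample document and the hrefs_containing helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration.
SAMPLE = """
<page>
  <a href="https://example.com/about">About</a>
  <a href="https://other.test/">Elsewhere</a>
  <link href="https://example.com/style.css"/>
  <a name="anchor-without-href">No link</a>
</page>
"""

root = ET.fromstring(SAMPLE)


def hrefs_containing(root, needle, tag="*"):
    """Return href values containing `needle`, optionally limited to one tag."""
    # XPATH picks the elements that have an href at all; Python does the
    # substring test because ElementTree lacks contains().
    return [
        el.get("href")
        for el in root.findall(f".//{tag}[@href]")
        if needle in el.get("href")
    ]


any_tag = hrefs_containing(root, "example.com")
only_links = hrefs_containing(root, "example.com", tag="a")

# An exact-match predicate finds nothing, since no href is exactly "example.com".
exact = root.findall(".//a[@href='example.com']")

print(any_tag)
print(only_links)
print(len(exact))
```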
XPATH has a lot more functionality than this but this is mostly what I need it for.

PS.
The expression in the condition brackets "[blah]" can also call functions such as contains() and starts-with(): https://www.w3schools.com/xml/xpath_syntax.asp
