How to save the results of XPath queries locally in Firefox and Chrome
It happens relatively often that, while browsing the internet like a normal person, I want to extract some data from a webpage, save it locally, and manipulate it in some way. Since it is a one-off operation, I really do not want to bother writing a web scraper in Python. Instead, here is a simple way of doing this through the developer console in Firefox or Chrome!
Simple scraping tasks can often be achieved by navigating to a page and executing some XPath queries to extract the elements of interest.
Python and Selenium can be used to write complex web scrapers that automate this kind of web navigation and data gathering, but this approach is too cumbersome for small, one-off scraping tasks.
I have been looking for a way of doing this directly in the developer console of my browser as I navigate to the page I am interested in, but while executing XPath is trivial via $x('//some/path'), saving the results is not.
The trick
At last, I found this solution on Stack Overflow, which allows one to save objects as JSON directly from the console:
function downloadObjectAsJson(exportObj, exportName){
    var dataStr = "data:text/json;charset=utf-8," +
        encodeURIComponent(JSON.stringify(exportObj));
    var downloadAnchorNode = document.createElement('a');
    downloadAnchorNode.setAttribute("href", dataStr);
    downloadAnchorNode.setAttribute("download", exportName + ".json");
    document.body.appendChild(downloadAnchorNode); // required for firefox
    downloadAnchorNode.click();
    downloadAnchorNode.remove();
}
Essentially, this snippet:
- Serializes the object to be saved into JSON,
- Adds to the page a temporary a element whose href attribute is set to the encoded data to be saved,
- Simulates a click from the user, tricking the browser into downloading the data to a file,
- Finally removes this element from the page.
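The first step can be tried on its own, even outside a browser. The following is a minimal sketch, assuming a hypothetical helper named buildJsonDataUrl (not part of the original snippet) that builds the same kind of data URL and then decodes it again to show that nothing is lost along the way:

```javascript
// Hypothetical helper (not in the original snippet): builds the same kind of
// data URL that downloadObjectAsJson assigns to the href attribute.
function buildJsonDataUrl(exportObj) {
    return "data:text/json;charset=utf-8," +
        encodeURIComponent(JSON.stringify(exportObj));
}

// Sample data, made up for illustration.
const url = buildJsonDataUrl(["<h2>Usage</h2>", "plain text"]);
console.log(url);

// Everything after the first comma is the percent-encoded JSON payload;
// decoding and parsing it recovers the original array.
const payload = decodeURIComponent(url.slice(url.indexOf(",") + 1));
console.log(JSON.parse(payload));
```

Percent-encoding matters here because characters such as # or % inside the JSON would otherwise be misinterpreted when the browser parses the URL.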
XPath queries executed via $x return arrays of HTML elements, which are not JSON-serializable.
Converting them to an appropriate representation is however very easy:
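function convertElementArrayToStringArray(element_array) {
    var converted = []; // declared with var to avoid leaking a global
    for(var i = 0; i < element_array.length; i++) {
        if("outerHTML" in element_array[i]) {
            // Element nodes: keep their full HTML markup
            converted.push(element_array[i].outerHTML);
        }
        else {
            // Text nodes (and other non-element nodes): keep their value
            converted.push(element_array[i].nodeValue);
        }
    }
    return converted;
}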
This function converts HTML nodes to their outerHTML representation, while keeping text nodes as they are.
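To see both branches in action without opening a browser, here is a small sketch of the same idea written with Array.prototype.map. The name nodesToStrings and the two stand-in objects are hypothetical; real results from $x would be actual DOM nodes rather than plain objects:

```javascript
// Hypothetical variant of convertElementArrayToStringArray using map.
function nodesToStrings(nodes) {
    return Array.from(nodes).map(function (node) {
        // Elements expose outerHTML; text nodes only have nodeValue.
        return "outerHTML" in node ? node.outerHTML : node.nodeValue;
    });
}

// Stand-ins for DOM nodes (assumption: only these two properties matter here).
const fakeElement = { outerHTML: "<h2 id=\"usage\">Usage</h2>" };
const fakeTextNode = { nodeValue: "Usage" };

const strings = nodesToStrings([fakeElement, fakeTextNode]);
console.log(JSON.stringify(strings));
```

Since the output is an array of plain strings, JSON.stringify can handle it without any further work.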
Executing the query and saving the result is then just a matter of chaining these two functions:
function saveSelectorQuery(result) {
    var conv = convertElementArrayToStringArray(result);
    downloadObjectAsJson(conv, "selector-query");
}
Usage
For ease of use, here are the previous functions as a single snippet:
function downloadObjectAsJson(exportObj, exportName){
    var dataStr = "data:text/json;charset=utf-8," +
        encodeURIComponent(JSON.stringify(exportObj));
    var downloadAnchorNode = document.createElement('a');
    downloadAnchorNode.setAttribute("href", dataStr);
    downloadAnchorNode.setAttribute("download", exportName + ".json");
    document.body.appendChild(downloadAnchorNode); // required for firefox
    downloadAnchorNode.click();
    downloadAnchorNode.remove();
}
function convertElementArrayToStringArray(element_array) {
    var converted = [];
    for(var i = 0; i < element_array.length; i++) {
        if("outerHTML" in element_array[i]) {
            converted.push(element_array[i].outerHTML);
        }
        else {
            converted.push(element_array[i].nodeValue);
        }
    }
    return converted;
}
function saveSelectorQuery(result) {
    var conv = convertElementArrayToStringArray(result);
    downloadObjectAsJson(conv, "selector-query");
}
Simply copy-paste these into the developer console, then call the last function with your selector to download the results!
For example, executing saveSelectorQuery($x("//h2")) on this very web page (try it!) will download a file called selector-query.json with the following contents:
["<h2 id=\"the-trick\">The trick</h2>","<h2 id=\"usage\">Usage</h2>","<h2 class=\"footer-heading\">Emilio's Blog</h2>"]
which are exactly the second-level headers in the post. To get only the titles of the headers, without the surrounding HTML, simply append /text() to the end of the previous query, i.e. $x("//h2/text()").
After this, read the JSON file with your favorite programming language and have fun!
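As one possible starting point, here is a minimal sketch in JavaScript that parses the example contents shown above and strips the tags. The stripTags helper is hypothetical, and its regular expression is a crude shortcut that is only adequate for simple markup like these headers, not a general HTML parser:

```javascript
// Contents of selector-query.json from the example above, inlined here so
// the sketch is self-contained. String.raw keeps the backslashes intact.
const jsonText = String.raw`["<h2 id=\"the-trick\">The trick</h2>","<h2 id=\"usage\">Usage</h2>","<h2 class=\"footer-heading\">Emilio's Blog</h2>"]`;

// Hypothetical helper: removes anything that looks like an HTML tag.
function stripTags(html) {
    return html.replace(/<[^>]*>/g, "");
}

const headings = JSON.parse(jsonText).map(stripTags);
console.log(headings); // [ 'The trick', 'Usage', "Emilio's Blog" ]
```

When running this in Node.js against the real download, the inlined string could be replaced with require("fs").readFileSync("selector-query.json", "utf8").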