About

Previous notes on website automation discussed using Selenium to create a sequential workflow in which a website is navigated programmatically and headlessly (without any UI). The main reason for using Selenium is convenience: the Selenium IDE lets users record a sequence of steps while browsing a website, and that sequence can then be played back even on a different computer.

However, one of the problems with that approach is that running a browser in the background, even headlessly, incurs a massive performance penalty compared to plain HTTP tools such as wget, curl or xidel. Selenium also seems best suited to pages that are heavily JavaScript-dependent, where interacting with them requires session persistence, something that falls outside the stateless HTTP protocol.

The following notes describe a Node-RED flow that checks a remote website, sifts through the data and sends a notification once a certain keyword is detected. Originally, something similar was used to collect memes, but it should work reliably for just about any public forum.

Requirements

It is unfortunate that Node-RED does not ship more nodes for processing DOM documents in a dataflow-like fashion. Only Node-RED is required, and it has no further dependencies itself. Node-RED's built-in HTTP node pulls data from the website, and then "cheerio", a JavaScript DOM parser, is used to navigate through the elements and extract the data. Finally, the data is sent to Gotify via a REST call in order to generate a notification that can be displayed on a desktop using a client such as Winify.

For this project in particular, a Cloudflare-bypassing remote proxy is used because the website to scan has "Cloudflare security" turned all the way up.

The Flow

Unfortunately, compared to Selenium, there is no "user-friendly" way to traverse the website, so the flow requires programming and some knowledge of website layouts and HTML/CSS elements. In this particular instance, the flow is designed to retrieve the 4chan /pol/ (politically incorrect) catalog at https://boards.4chan.org/pol/catalog and then run through all the threads while looking for keywords. Once a keyword is found within a thread, a link, an image and the summary text for the thread are collected in order to generate a Gotify notification.

The flow consists of a cron node (which can be replaced by Node-RED's built-in inject node, if need be) that periodically triggers the flow. The http request node makes a call through the Cloudflare proxy to the destination website. When the body of the response is received, the threads node uses the "cheerio" JavaScript DOM parser to walk the website HTML with CSS locators and extract the data the user needs. The data is then passed to the gotify template, which constructs a notification body that the next http request node sends to a Gotify server.
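For reference, the gotify template produces a JSON body along these lines (the values filled in here are illustrative), which the final http request node POSTs to the Gotify message endpoint; the "client::display" extra tells Gotify clients to render the message as Markdown so the thumbnail link displays inline:

```json
{
  "extras": {
    "client::display": {
      "contentType": "text/markdown"
    }
  },
  "title": "4chan - humor",
  "message": "example teaser text<br> [![thumbnail](https://i.4cdn.org/pol/1700000000000s.jpg)](https://boards.4chan.org/pol/thread/570212779)"
}
```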

The smaller flow at the top runs once when Node-RED starts and stores the list of terms from the search terms template in a flow variable, which the threads node later reads when it attempts to match content against the supplied keywords. The set node first compiles the list of keywords emitted by the search terms node into regular expression objects and then stores the list of regular expressions in a flow variable.
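Stripped of the Node-RED context API, the logic of the set node and the matching step inside the threads node boils down to the following (function names here are illustrative, not part of the flow):

```javascript
// What the "set" node does: compile plain keyword strings into
// case-insensitive RegExp objects once, so that every polling cycle
// can test thread text cheaply.
function compileTerms(keywords) {
    return keywords.map((keyword) => new RegExp(keyword, 'i'));
}

// What the "threads" node does per thread: return the first compiled
// term that matches the teaser text, or undefined when nothing matches.
function findMatch(terms, text) {
    return terms.find((term) => term.test(text));
}
```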

[{"id":"0be3fb8d97f78784","type":"http request","z":"da8f19901d01a81b","name":"","method":"GET","ret":"txt","paytoqs":"ignore","url":"http://docker:8000/html?url=https://boards.4chan.org/pol/catalog","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":350,"y":340,"wires":[["09cf04e6d0530c78","55e684c97f29f3a7"]]},{"id":"364b7fa0496f6dd8","type":"inject","z":"da8f19901d01a81b","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","x":180,"y":400,"wires":[["0be3fb8d97f78784"]]},{"id":"02faf083d591625d","type":"debug","z":"da8f19901d01a81b","name":"debug 23","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":820,"y":280,"wires":[]},{"id":"09cf04e6d0530c78","type":"function","z":"da8f19901d01a81b","name":"threads","func":"var terms = flow.get('search')\nif(!Array.isArray(terms)) {\n    return\n}\n\nconst $ = cheerio.load(msg.payload)\n$('#threads > div.thread').each((_, element) => {\n    const thread = $(element).attr('id')\n    const rem = flow.get('rem')\n    if(Array.isArray(rem) && rem.includes(thread)) {\n        return\n    }\n    msg = {}\n    const text = $(`#${thread}`).find('div.teaser').text()\n    const term = terms.find(element => element.test(text))\n    if(typeof term === 'undefined') {\n        return\n    }\n    msg.thread = thread\n    msg.term = term\n    msg.text = text\n    msg.icon = $(`#${thread}`).find('img.thumb').attr('src')\n    msg.link = $(`#${thread}`).find('a').attr('href')\n    node.send(msg)\n})\n","outputs":1,"timeout":0,"noerr":0,"initialize":"","finalize":"","libs":[{"var":"cheerio","module":"cheerio"}],"x":520,"y":340,"wires":[["549b51c66d314f88","e5eaaead809c3364","08ec045be3841eb7"]]},{"id":"549b51c66d314f88","type":"template","z":"da8f19901d01a81b","name":"gotify","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"{\n  \"extras\": {\n    \"client::display\": {\n      \"contentType\": \"text/markdown\"\n    }\n  },\n  \"title\": \"4chan - {{{term}}}\",\n  \"message\": \"{{{text}}}<br> [![thumbnail](https:{{{icon}}})](https:{{{link}}})\"\n}\n","output":"json","x":670,"y":340,"wires":[["fa9cb11e6d0fe5e0","02faf083d591625d"]]},{"id":"fa9cb11e6d0fe5e0","type":"http request","z":"da8f19901d01a81b","name":"","method":"POST","ret":"txt","paytoqs":"ignore","url":"https://gotify.tld/message?token=...-","tls":"","persist":false,"proxy":"","insecureHTTPParser":false,"authType":"","senderr":false,"headers":[],"x":830,"y":340,"wires":[[]]},{"id":"e35f68f5b2aa9abc","type":"cronplus","z":"da8f19901d01a81b","name":"","outputField":"payload","timeZone":"","storeName":"file","commandResponseMsgOutput":"output1","defaultLocation":"","defaultLocationType":"default","outputs":1,"options":[{"name":"schedule1","topic":"topic1","payloadType":"default","payload":"","expressionType":"cron","expression":"0 0 * * * * *","location":"","offset":"0","solarType":"all","solarEvents":"sunrise,sunset"}],"x":180,"y":280,"wires":[["0be3fb8d97f78784"]]},{"id":"e5eaaead809c3364","type":"function","z":"da8f19901d01a81b","name":"rem","func":"let rem = flow.get('rem')\nif(!Array.isArray(rem)) {\n    rem = []\n}\nrem.push(msg.thread)\nflow.set('rem', rem)\n","outputs":0,"timeout":0,"noerr":0,"initialize":"","finalize":"","libs":[],"x":670,"y":260,"wires":[]},{"id":"08ec045be3841eb7","type":"debug","z":"da8f19901d01a81b","name":"debug 6","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":660,"y":460,"wires":[]},{"id":"7c415728bd978cf6","type":"inject","z":"da8f19901d01a81b","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"","crontab":"","once":true,"onceDelay":"1","topic":"","payload":"","payloadType":"date","x":270,"y":140,"wires":[["86a534a88e7863e3"]]},{"id":"86a534a88e7863e3","type":"template","z":"da8f19901d01a81b","name":"search terms","field":"payload","fieldType":"msg","format":"handlebars","syntax":"mustache","template":"[\n    \"humor\",\n    \"webm\",\n]","output":"json","x":450,"y":140,"wires":[["61845fd7de080777"]]},{"id":"61845fd7de080777","type":"function","z":"da8f19901d01a81b","name":"set","func":"let terms = msg.payload\n\nterms.forEach((element, index) => {\n    terms[index] = new RegExp(element, 'i');\n})\n\nflow.set('search', terms)\nreturn msg","outputs":1,"timeout":0,"noerr":0,"initialize":"","finalize":"","libs":[],"x":610,"y":140,"wires":[["890865451d1f18f9"]]},{"id":"890865451d1f18f9","type":"debug","z":"da8f19901d01a81b","name":"debug 7","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":760,"y":140,"wires":[]},{"id":"55e684c97f29f3a7","type":"debug","z":"da8f19901d01a81b","name":"debug 19","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"true","targetType":"full","statusVal":"","statusType":"auto","x":500,"y":560,"wires":[]}]

Conclusions

The beauty of website automation solutions like changedetection.io or Selenium is that they make it easy to set up a visually guided way of playing back website actions. One drawback, however, is that they are not meant for data processing; rather, they are made to carry out actions and perhaps extract the text of some element. For any purpose beyond that, an in-house solution remains the valid fallback.

However, Node-RED and dataflow programming offer an intuitive way of using building blocks that can be seen as logical functions in the functional-programming sense, where data flows from one function to the next in order to accomplish a certain goal. It would be possible to create a standalone script instead, perhaps by programming Selenium directly through its library API, but it is much more convenient to write the data extractor using dataflow programming.