For the last couple of months, I've been working on a side project called WebWording. The goal of WebWording is to experiment with processing the text of news headlines. To do any processing, you need data, and I chose RSS as my data source because every major news outlet still supports this friendly and accessible format. RSS crawling and parsing is a solved problem in itself; there are plenty of packages and modules that do the heavy lifting, so there's no real need for me to explain it in detail.

In the following, I will revisit the code that has already fetched 100k articles for me over the past month.

For my specific use case, I came up with the following requirements: the RSS crawler should check multiple feeds periodically and push the fetched data to the API-Service. The feeds should be configurable so that different sources can be added over time. The API-Service endpoint also needs to be configurable.

Out of personal taste, I decided to use Node.js with TypeScript for the implementation.

Configuring the feeds

I don't plan on adding new sources dynamically. Currently, there are 13 feeds configured in my project, which is why I decided to simply put the configuration into a JSON file that is committed to the repository.
As JSON itself carries no type information, I created a small wrapper module that assigns the proper types to the configuration entries.

import f from "../feedConfig.json";

export interface FeedConfig {
    name: string;   // human-readable name of the feed
    url: string;    // URL of the RSS feed
    cron: string;   // check rate as a cron expression
}

export const feeds: FeedConfig[] = f;
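
The feedConfig.json itself is just an array of such objects. The entries below are made-up examples rather than my real configuration, but they show the shape of the file:

[
    {
        "name": "Example News",
        "url": "https://example.com/rss.xml",
        "cron": "*/15 * * * *"
    },
    {
        "name": "Example Tech Blog",
        "url": "https://blog.example.org/feed",
        "cron": "0 * * * *"
    }
]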

To define when the feeds should be checked, I decided to use the npm package node-cron. On startup, the service loads all feed configurations and registers their cron triggers.

async function run() {
    // register one cron job per configured feed
    for (const f of feeds) {
        const cronJob = createJob(f);
        cronJob.start();
    }
}

Getting the Data

To fetch the data, we use the class FeedFetcher, a wrapper around the package rss-parser. To instantiate it, we pass a FeedConfig and an array of FeedResultProcessors to its constructor. A FeedResultProcessor is an implementation that does something with the results of a fetch. Currently, I have two implementations in place: the ConsoleProcessor simply logs the results to the console, and the ApiStorageProcessor transfers the article data to our storage backend. One thing to note here: if the storage backend does not know the configured source yet, the ApiStorageProcessor will create it over a REST interface. Bringing all this together, the call to create a fetch job looks like this.

function createJob(feed: FeedConfig): CronJob {
    return new CronJob(feed.cron, async () => {
        try {
            // fetch the feed and hand the results to all registered processors
            await new FeedFetcher(feed, [
                new ConsoleProcessor(),
                new ApiStorageProcessor(feed, config.api_url)
            ]).fetch();
        } catch (e) {
            console.error(e);
        }
    });
}
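
The full implementations of FeedFetcher and the processors are not part of this post, but a minimal sketch of how such a setup can look is below. It is simplified: the Article shape, the process method name, and the ConsoleProcessor output are assumptions, while the constructor arguments and the use of rss-parser match what was described above.

import Parser from "rss-parser";

// minimal shape of an article handed to the processors (assumed fields)
interface Article {
    title: string;
    link: string;
    publishedAt?: string;
}

// a FeedResultProcessor does something with the fetched articles
interface FeedResultProcessor {
    process(articles: Article[]): Promise<void>;
}

// the simplest processor just logs what was fetched (assumed output format)
class ConsoleProcessor implements FeedResultProcessor {
    async process(articles: Article[]): Promise<void> {
        console.log(`Fetched ${articles.length} articles`, articles);
    }
}

// FeedConfig is the interface defined in the configuration section above
class FeedFetcher {
    private parser = new Parser();

    constructor(
        private feed: FeedConfig,
        private processors: FeedResultProcessor[]
    ) {}

    async fetch(): Promise<void> {
        // rss-parser downloads and parses the feed in one call
        const result = await this.parser.parseURL(this.feed.url);
        const articles: Article[] = result.items.map(item => ({
            title: item.title ?? "",
            link: item.link ?? "",
            publishedAt: item.pubDate
        }));
        // hand the fetched articles to every registered processor
        for (const processor of this.processors) {
            await processor.process(articles);
        }
    }
}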

Finally, the whole application is configured via Node.js's process.env. To make this easier during development, I highly recommend checking out the package dotenv. Dotenv loads a .env file and puts the parsed values into process.env, saving the developer a lot of typing on the command line.
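
The config object used in createJob above is just a thin wrapper around process.env. A minimal sketch could look like this; the variable name API_URL and the localhost fallback are assumptions, not my actual setup:

import dotenv from "dotenv";

// load variables from a local .env file into process.env (does not throw if the file is missing)
dotenv.config();

export const config = {
    // API_URL is an assumed variable name; the fallback is only a development default
    api_url: process.env.API_URL ?? "http://localhost:3000"
};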

Conclusion

After revisiting the code, which has done a lot of heavy lifting for me over the past months, I feel good about what I built. Sure, there's room for improvement; there are no tests and no docs. For the implementation itself, the only thing that feels awkward is that the FeedConfigs are defined statically, which is why I don't want to open-source the code. Should you still be interested in the code, feel free to send me a mail.