Infopost | 2024.02.15

Cowboy Bebop hyperspace Netflix

My web 1.1 indexer has been chugging away, though not like 24/7 or anything.
Growth

Knight arrow meme web crawler filtering

I mentioned this previously: keeping out the bad web isn't easy. Commercial links are everywhere and even the honest cases need scrutiny, e.g. a tech blog linking to Apache docs or something. This is particularly bad for SEO-minded sites that want to create outbound links for rank.

On the subject of wheat and chaff, to paraphrase:

RobRob Ick, why are there non-blog links in your recommendations?

He's right, of course, I've often described this effort as linking the blogosphere, not the corposphere. Referring back to my original plan:

The bottom right illustrates a hypothetical post with external recommendations that are both indieweb (blue color scheme) and selective normieweb (red color scheme).

In the original design, each recommendation set would offer a handful of personal web links and a separate handful of mainstream links. I wasn't sure how to approach the latter other than: "The Atlantic, not Newsweek", "IGN, not Electronic Arts". I should come back to this quandary once I've fried some bigger fish.
Scaling search

Back when I had an experimental (small) post corpus I could quickly iterate over every indexed page. Searching for recommendations was quick despite the unmodest computational demands of trigram comparison with extra steps. As you might imagine, the post database eventually grew up and search began to take too long. At this point I could have pivoted to a vectorization/clustering approach but felt I could accomplish something similar with my existing trigram method. Recommendations were looking good, so why not stick with it?

The problem is a clustering one, though. Each post has a ranked list of similar post, though posts #2 and #3 might not be like each other. In thinking about search/recommendation, the vast majority of posts have nothing to do with the subject post, I really just want to consider the cluster.

Since 50% of this project is therapeutic, I set out to create an iteration on my existing design rather than spend hours crawling github libraries. As opposed to the traditional clustering approach of mapping input to a point in n-dimensional space, I decided to treat each post as a node in a graph.

Graph node napkin illustration

Simple enough, each yellow circle is a node/post.

Graph bidirectional connected napkin illustration
The edges have lengths but I didn't write a number for them.

Each node can link to n nearest neighbors, nearness being determined by the trigram comparison algorithm. For simplicity I'm using n=2 in these napkin diagrams. The edges are directed but often form bidirectional links. This means:
Building the graph was as simple as: for each node, traverse the entire graph and keep the closest n. This retraced the scalability issue that brought me here, but just once per page insert rather than every query.

For the objective of search/recommendation, I could examine connected clusters of nodes to find the recommendation candidates. And I could have a high degree of confidence that something 500 edges away was probably not worth considering.

Graph bidirectional connected napkin illustration longest edges

Getting to those related clusters for an arbitrary post wasn't obvious. Often, with graphs, you'll have an entry node and traverse until you know you should stop. That's not the case here; if I'm looking for posts similar to footgolf (based on text content), is traversing from Yoshi's Island to ride-on lawnmowers a step in the right direction? The traversal needs to start in or near a cluster and there are often to multiple clusters if the subject post is about footgolf and electrical box paintings.

The search feels more like gradient descent. In short, if I delivered you via helicopter into the Sierra Nevadas and asked you to find the lowest point in a twenty-mile square, how would you proceed? You could walk downhill and find a low point, but you would need to traverse the entire 400 square miles to know it wasn't just a low-ish valley. Since the exhaustive approach is, well, exhausting, you might instead ask for a helicopter ride to a dozen points on the map, walk downhill, and then choose the lowest. And you'd be something like twelve times as confident as you would with just one descent.

Based on this view, I started tracking nodes that might provide a geographically-diverse span of the data. There might be a better way to accomplish this, but based on something Rob sent me, I assembled a list of entry points by recording nodes with the longest non-infinite edges. The search/recommendation process would still need to figure out how to best walk downhill, but this would give me a well-spaced set of starting points.

Graph bidirectional connected napkin illustration entry ngrams

All this graph stuff was good but totally decoupled from the actual thing I was doing: comparing post text. What's more, if I selected 64 or 128 starting nodes to span the thousands or millions of webpages stored in the graph, these entry points could be unhelpful in plenty of situations.

Another idea I had kicked around before going the graph route was some sort of hierarchical trigram index. That seemed less fun to draw so I went with the graph approach, but a simple version of the index model could help here. I created an index of trigrams that appear between [small number] and [larger number] times in the entire corpus. Each index entry has pointers to the representative posts. It's a smallish, fast data structure that often points the search process in the right direction.

Graph bidirectional connected napkin illustration text query

Back to my use case: finding recommendations for a post not contained in the graph. Illustrated above, the subject post is the (!) circle, my traversal starts with the gray arrows:
The closest nodes to (!) are the would-be recommendations, you can see since it's a really small example I don't have to do too much work to get there from the starting nodes.

Graph bidirectional connected napkin illustration node traversal search

The traversal process is just a matter of checking adjacent, unvisited nodes and deciding whether to continue. Naively, I could just say, "keep going as long as the results are getting better". For (A)->(B), is the (!)/(B) similarity greater than the (!)/(A) similarity?

In the case illustrated above, the gray x's show traversals stopping, either because the node has already been visited or because the neighbor is less like the (!) node. Once traversal is finished, the algo has a list of visited nodes and their similarity to (!), analogous to the list of altitude measurements from the Sierras hike.



Related - internal

Some posts from this site with similar content.

Post
2023.12.20

Rim worlds

Visiting and connecting the fringes of the web.
Post
2023.12.30

Feature complete

My static site generator can now recommend external blog/smallweb posts with similar subject matter.
Post
2023.07.02

Feature request

I just want to link other people's blogs.

Related - external

Risky click advisory: these links are produced algorithmically from a crawl of the subsurface web (and some select mainstream web). I haven't personally looked at them or checked them for quality, decency, or sanity. None of these links are promoted, sponsored, or affiliated with this site. For more information, see this post.

blog.burntsushi.net

ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} - Andrew Gallant's Blog

I blog mostly about my own programming projects.
404ed
forrestthewoods.com

Learning Rust via Advent of Code

How I used Advent of Code to learn about the programming language Rust.
cprimozic.net

Growing Bonsai Networks with RNNs - Casey Primozic's Homepage

A summary of my research and experiments on growing sparse computational graphs by training small RNNs. This post describes the architecture, training process, and pruning method used to create the graphs and then examines some of the learned solutions to a variety of objectives.

Created 2024.05 from an index of 211,884 pages.