Infopost | 2023.10.08

[Photo: Mt Shasta glacier climbing, crampons, grainy film shot]

A brief recap:
  1. I added a blog widget that automatically links similar posts on my site.
  2. I said, "gee, it'd be neat to link to the rest of the web this way" and submitted a feature request to the ether.
  3. I ingested a bunch of RSS and listed blogs with post titles/descriptions similar to my tag list.
  4. Since tags, titles, and descriptions aren't much data, I started estimating similarity based on post text. <-- You are here.
Tokenizing kilroy (design/code)

It's all chronicled by my meta tag, but the tldr is that I have my own static site generator with markup that I use to write posts. My Element abstract base type is inherited by text sections, images, galleries, tables, etc. So it was pretty straightforward to add getPlaintext() to this type and let the child classes decide what goes into tokenization.
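The Element hierarchy itself isn't shown here; as a rough sketch of the idea (only Element and getPlaintext() come from this post, the concrete subclasses and fields are my own illustration):

```java
import java.util.List;

// Hypothetical sketch: each child class decides what it contributes to
// tokenization. Subclass names and fields are illustrative, not kilroy's.
abstract class Element {
    abstract String getPlaintext();
}

class TextSection extends Element {
    private final String body;
    TextSection(String body) { this.body = body; }
    String getPlaintext() { return body; }  // prose contributes as-is
}

class Image extends Element {
    private final String caption;
    Image(String caption) { this.caption = caption; }
    String getPlaintext() { return caption; }  // images contribute captions only
}

class Gallery extends Element {
    private final List<Image> images;
    Gallery(List<Image> images) { this.images = images; }
    String getPlaintext() {
        StringBuilder sb = new StringBuilder();
        for (Image img : images) sb.append(img.getPlaintext()).append(' ');
        return sb.toString().trim();  // concatenate child captions
    }
}
```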

Hitting the markup was considerably easier than trying to tokenize the output html. And so each post object now has:

class Post {

...

   Set<String> getTokens() {
      Set<String> tokens = new HashSet<>();
      for (Element element : postElements) {
         tokens.addAll(tokenize(element.getPlaintext()));
      }
      tokens.removeAll(stopwords);
      return tokens;
   }
}

It's easy to find generic lists of stopwords ('a', 'the', 'therefore'...), but I tailored mine by running getTokens() on every post and scrutinizing the top of the histogram.
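That histogram pass is just a frequency count over every post's tokens; a minimal sketch (the class name and plumbing are mine, not the site's code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: count how many posts contain each token, then eyeball the top of
// the sorted histogram for site-specific stopword candidates.
class StopwordSurvey {
    static List<Map.Entry<String, Integer>> topTokens(
            List<Set<String>> tokensPerPost, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> tokens : tokensPerPost) {
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);  // one count per post
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted.subList(0, Math.min(limit, sorted.size()));
    }
}
```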

So my earlier post recommender code became:

class Post {

...

   static double getSimilarity(Post a, Post b) {
      // return computeIntersection(a.getTags(), b.getTags());
      return computeIntersection(a.getTokens(), b.getTokens());
   }
}
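The post doesn't show computeIntersection() itself; a plausible stand-in that produces the kind of 0-to-1 scores seen later is Jaccard similarity, intersection size over union size. A sketch under that assumption:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: Jaccard similarity, |A ∩ B| / |A ∪ B|, one common way to score
// the overlap of two token (or tag) sets on [0, 1]. This is an assumed
// implementation, not necessarily kilroy's computeIntersection().
class SetSimilarity {
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```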
Words and n-grams

functions | visual | size | machine learning | layers | neural | rgb |
maxpool | transpose | input | white | helpful | used for | dimensionality |
generation | images | sounds | kernels | pixel | output | convolutional |
machine | convolution | the output | finding | needed | learning |
examples | upscaling | edges

Since 'cowboy' means one thing, 'bebop' means another, and 'cowboy bebop' means yet another, I used words (1-grams), 2-grams, and 3-grams for the tokenize() operations mentioned above. Dumping the token intersections (above and below) shows reasonable results: most of the words have significance and can indicate two posts are alike.
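A tokenize() that emits 1-, 2-, and 3-grams might look like this (the splitting regex and class name are my own assumptions):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: lowercase, split into words, then emit every 1-, 2-, and 3-gram
// so 'cowboy', 'bebop', and 'cowboy bebop' are all distinct tokens.
class Tokenizer {
    static Set<String> tokenize(String plaintext) {
        List<String> words = new ArrayList<>();
        for (String w : plaintext.toLowerCase().split("[^a-z0-9']+")) {
            if (!w.isEmpty()) words.add(w);
        }
        Set<String> tokens = new HashSet<>();
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i + n <= words.size(); i++) {
                tokens.add(String.join(" ", words.subList(i, i + n)));
            }
        }
        return tokens;
    }
}
```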

that would | pretend | happen | gme | were just | rounds | that they |
what's the | plot | ticker | injection | gift | trading | product | only
the | posts | plus | position | the plus | led | stats | the minus | old |
night | minus | believe | shares | that would have | for that | imagine |
page | interest | purchase | money | price | stuck | options | favorite |
requests | web | premiums | hits | the gme | certain | better when |
holding | did the
But what about tags?

Indeed, tags are really good data that shouldn't be ignored just because I have a more expansive dataset. While the tag dictionary is only a few hundred words long, those words were chosen specifically to label the subject matter of each post.

Plus the code is already there. So, what, like weight them 50/50?

I dumped some example data; here is the similarity calculator reporting new high scores as it iterates through all the other posts:

Tag similarity   Token similarity
--------------   ----------------
0.057            0.021
0.285            0.022
0.357            0.020
0.400            0.028
0.428            0.013
0.461            0.011
0.529            0.027
0.571            0.047
0.689            0.017

Happily, the values diverge quite a bit, so this isn't just a more computationally intensive way to do what I was doing with tags. But since the tag similarity values tend to be an order of magnitude greater than the token ones, naively adding the metrics would give the tags a lot of weight. With a little extra code, I normalized each measurement (by its own maximum) and then weighted them equally.
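The normalize-then-average step is small; a sketch of it (array plumbing and names are illustrative):

```java
// Sketch: normalize each metric by its own maximum across all candidates,
// then blend with equal weight. Assumes parallel arrays of tag and token
// similarity scores for the same candidate posts.
class BlendedScore {
    static double[] blend(double[] tagSims, double[] tokenSims) {
        double tagMax = max(tagSims);
        double tokenMax = max(tokenSims);
        double[] out = new double[tagSims.length];
        for (int i = 0; i < out.length; i++) {
            double tagNorm = tagMax == 0 ? 0 : tagSims[i] / tagMax;
            double tokenNorm = tokenMax == 0 ? 0 : tokenSims[i] / tokenMax;
            out[i] = 0.5 * tagNorm + 0.5 * tokenNorm;  // 50/50 weighting
        }
        return out;
    }

    static double max(double[] xs) {
        double m = 0;
        for (double x : xs) m = Math.max(m, x);
        return m;
    }
}
```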
Looking outward

Rendered celestial surface

Finding similar posts on other websites is conceptually the same as this, but since it requires a lot of automated page visits, it's a post for another time. For now I still have the rss/xml data that I previously compared against kilroy tags. Some of the token-based results:

Latest post           Similarity (pct)   Similarity (nom)
--------------------  ----------------   ----------------
wizardzines.com       0.113              817
mooreds.com           0.108              369
jhey.dev              0.107              112
zander.wtf            0.106              236
tonmeister.ca         0.099              453
vianegativa.us        0.098              499
usuf.fyi              0.097              370
engineersneedart.com  0.097              144

Clicking through them, it's not exactly like looking in a mirror, but then a lot of the matched tokens looked like:

game | meme | question | learn | link | played | users | twitter |
computer | security | legal | action | playing | boot | results | keeps

While 'game' and 'meme' aren't bland enough to qualify as stopwords, there may be more significance in matches on uncommon tokens/n-grams. That may be something to add to the backlog; more immediately, I need to grab the actual posts rather than rely on the descriptions.
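One backlog-friendly way to favor uncommon tokens would be inverse-document-frequency weighting: instead of counting each shared token as 1, weight it by how rare it is across the corpus. A hedged sketch of the idea (nothing here exists in the generator yet):

```java
import java.util.Map;
import java.util.Set;

// Sketch: IDF-weighted overlap. A rare shared token ('maxpool') counts for
// more than a common one ('game'). documentCounts maps token -> number of
// posts containing it; totalDocs is the corpus size.
class WeightedOverlap {
    static double score(Set<String> a, Set<String> b,
                        Map<String, Integer> documentCounts, int totalDocs) {
        double sum = 0.0;
        for (String token : a) {
            if (b.contains(token)) {
                int df = documentCounts.getOrDefault(token, 1);
                sum += Math.log((double) totalDocs / df);  // idf weight
            }
        }
        return sum;
    }
}
```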

For now, here's a fun cheat sheet from the list:

[Image: Julia Evans' Linux network tool poster]



Related - internal

Some posts from this site with similar content.

2023.09.27 | Enwebbed
Parsing RSS feeds to find peers.

2023.12.30 | Feature complete
My static site generator can now recommend external blog/smallweb posts with similar subject matter.

2022.08.03 | Keras cheat sheet
Examples of keras merging layers, convolutional layers, and activation functions in L, RGB, HSV, and YCbCr.

Related - external

Risky click advisory: these links are produced algorithmically from a crawl of the subsurface web (and some select mainstream web). I haven't personally looked at them or checked them for quality, decency, or sanity. None of these links are promoted, sponsored, or affiliated with this site. For more information, see this post.

404ed
vickiboykis.com

Data ex machina | Vicki Boykis

Washing Machine (1961), Roy Lichtenstein Recently, Andrej Karpathy, head of AI at Tesla, wrote "Software 2.0.", describing what he sees as a fundamental shift in the way software works. Software 1.0 is "explicit instructions to the computer written by a programmer." Instead of writing lines of code, he writes that, "Software 2.0 is written in neural network weights." No human is involved in writing this code because there are a lot of weights (typical networks might have millions), an...
cdevroe.com

Building Tuff | A static site generator just for me | Colin Devroe

Just about a month ago, for some unknown and undoubtedly a sleep deprived reason, I began building my own static site generator (SSG). And I did it entirely wrong. This is that story. A sensible
timnash.co.uk

Blog like a confused Hacker (2020-2021 Edition) by Tim Nash

Tim takes a look at how you can build WordPress static sites, with just PHP, no Apache or MySQL in sight.

Created 2024.05 from an index of 235,542 pages.