Experimenting with embeddings for webpage recommendations had been in my backlog for a while, but it wasn't until I read this post that I realized I could knock out a simple implementation in a weekend.
There are plenty of good resources on what embeddings are and how they work, so I'll skip to the parts that apply to my use case:
- An embedding table pairs each text token with hundreds of floating point values. For example, "flower" -> [0.44, -0.07, 0.15, ...]
- The values are determined by training done on a big computer run by someone who does it as their day job. How they get there is complicated but the result is simple: 'car' and 'Fiat' have similar vectors, 'porcupine' and 'envy' have different vectors.
- From the post linked above, adding the vectors of multiple words (elementwise) produces an aggregate meaning, so you can compare "the quick brown fox jumps over the lazy dog" with "how vexingly quick daft zebras jump". Caveat: I don't know if addition is a best practice or just some blogger's simplified approach.
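To make that last point concrete, here's a minimal sketch with made-up three-dimensional vectors (a real table has hundreds of values per word, but the mechanics are the same):

import math

# Toy 3-dimensional vectors, invented purely for illustration.
toy = {
    "quick":  [0.2, 0.7, -0.1],
    "fox":    [0.5, 0.1,  0.3],
    "zebras": [0.4, 0.2,  0.4],
    "jump":   [0.3, 0.6, -0.2],
}

def sentence_vector(words):
    """Elementwise sum of the word vectors, skipping words not in the table."""
    total = [0.0] * 3
    for w in words:
        vec = toy.get(w)
        if vec:
            total = [t + v for t, v in zip(total, vec)]
    return total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

s1 = sentence_vector("the quick brown fox jumps over the lazy dog".split())
s2 = sentence_vector("how vexingly quick daft zebras jump".split())
print(cosine(s1, s2))  # closer to 1.0 = more similar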
The subtasks to get to Hello Embeddings were not substantial:
- Get an embedding table.
- Encode webpages as an aggregate vector.
- Compute similarity using vector math.
My baseline implementation
I already have code to match/recommend webpages: an unsophisticated solution that does a set intersection of 1-grams, 2-grams, and 3-grams from each page. It has worked well so far but could benefit from the fuzzy/semantic correlation an ML approach provides. Most importantly, the code and data from the baseline matcher were quite helpful in getting past some challenges with doing embeddings on a humble gaming PC.
Downsizing
Lists of embeddings are not small: vocabulary size * 100-500 floating point numbers. At 100 dimensions of 32-bit floats, the roughly 4.5 million tokens in my list already work out to nearly two gigabytes. A minimum of two gigs is a lot to ask, and searching for a bunch of arbitrary strings isn't the kind of thing you want to do on disk. Since I don't believe in databases, I decided to look at the data itself.
It didn't take long to see that
the list had clutter: stopwords, ambiguous acronyms, and numeric values. Borrowing the list of 1-gram stopwords from my naive implementation, I got the following:
2,684,369 stopwords
1,845,662 okay words
Cutting the list by more than half was a good start but 1.8M words is still a few hundred megs.
Perhaps the low-SNR terms not in my stopword list could be determined from their vector data?
I printed a few words and the max value from their corresponding vectors, the idea being that a word with strong semantic meaning might have a very high max value. Of course, a word's meaning could be split across multiple vector elements, but it's a start.
word       max value (largest magnitude, sign kept)
----       ----------------------------------------
the -0.4605
of -0.4943
is 0.5202
by 0.5663
born 0.8322
category -1.4007
medalists -2.0290
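For reference, a sketch of the per-word scan behind this table, the histograms below, and the eventual pruning (assuming the embeddings are loaded into a plain dict of word -> list of floats):

def vector_stats(embeddings):
    """For each word, record the element with the largest magnitude (sign kept,
    as in the table above) and the median of the absolute values."""
    stats = {}
    for word, vec in embeddings.items():
        peak = max(vec, key=abs)
        magnitudes = sorted(abs(v) for v in vec)
        median_abs = magnitudes[len(magnitudes) // 2]  # upper-middle element; fine for a sketch
        stats[word] = (peak, median_abs)
    return stats

# The pruning further down keeps words with abs(peak) >= 1.2 and median_abs >= 0.24.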
Stopwords showed values close to 0, while more meaningful words seemed to go strongly positive or negative. I compiled a histogram of max values (absolute value, multiplied by 100 for readability):
Max Count
------- -----
[<= 19] : 2
[20:24] : 108
[25:29] : 4249
[30:34] : 31849
[35:39] : 95061
[40:44] : 161162
[45:49] : 193109
[50:54] : 191068
[55:59] : 170870
[60:64] : 148120
[65:69] : 128406
[70:74] : 111729
[75:79] : 99438
[80:84] : 89559
[85:89] : 79901
[90:94] : 69961
[95:99] : 58415
[100:104]: 46316
[105:109]: 34446
[110:114]: 24920
[115:119]: 16796
[120:124]: 11412
[125:129]: 7083
[130:134]: 4524
[135:139]: 2848
[140:144]: 1754
[145:149]: 1056
[150:154]: 702
[155:159]: 435
[160:164]: 269
[165:169]: 182
[170:174]: 122
[175:179]: 89
[180:184]: 64
[185:189]: 50
[>= 190 ]: 112
So if I wanted to get down to the n-thousand most meaningful(?) words, perhaps I could just set a threshold using the histogram counts. The code also spat out a histogram of median embedding values, the idea being that a high median might indicate strong semantic meaning across multiple dimensions:
Median Count
------- -----
[0:2] : 0
[3:5] : 74
[6:8] : 108774
[9:11] : 514049
[12:14]: 443230
[15:17]: 297473
[18:20]: 220397
[21:23]: 142686
[24:26]: 48517
[27:29]: 8585
[30:32]: 1688
[33:35]: 425
[36:38]: 164
[39:41]: 79
[42:44]: 29
[45:47]: 12
[48:50]: 1
[51:53]: 2
[54:56]: 0
Using max >= 1.2 (120 as listed in the histogram) and median >= 0.24 (24), I created a pruned list of embeddings. The results looked like this:
internationalis
insat
inmortales
imathia
osteotomy
ousia
unomig
urayasu
zien
yakan
xylophanes
46kbb
satánico
sénégalaise
moerdijk
While these words no doubt
have ample meaning, they might not come up so frequently in blog posts and other
Outer Web pages.
Existing work came to my rescue once again. I had
a trigram whitelist (the opposite of stopwords) as well as an index of terms used for search. The search terms are constructed by iterating over every indexed Outer Web page and tossing out terms that appear too frequently or too infrequently.
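A sketch of that filter, assuming a GloVe-style text file (one line per word: the token followed by its floats); applying the whitelist while parsing means the full multi-gigabyte table never has to sit in memory:

def load_filtered_embeddings(path, keep):
    """Parse a 'word f1 f2 ... fn' text file, keeping only whitelisted words."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, _, rest = line.rstrip("\n").partition(" ")
            if word in keep:
                table[word] = [float(v) for v in rest.split()]
    return table

# keep = trigram_whitelist | search_terms   # placeholder names for the existing lists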
Filtering the embeddings using these lists gave me 41,000 words and a totally reasonable 30 MB memory footprint. But is a 41k vocabulary enough to characterize the meaning of an arbitrary webpage? The answer is nuanced, but it starts with taking a look at some random pages and seeing which words hit:
https://ahistoryofjapan.com/2024/11/11/the-meiji-reformation/
edwards christianity temple shogunate japan damaged reformation bronze lynne temples sentiment illustration smashed buddhist burning season
https://moitereisbuntewelt.blogspot.com/2020/01/british-paratroopers-colour-guide.html
welt sicily sport sand wehrmacht uniform models soldier russian paint wooden airborne trousers unit soviet kitchen goodness scrim translate strokes 28mm moons cloth documented replacement saddle disruptive figures british painting roman shading window irony pale paratroopers lighter fractions patterns guides cheers maroon wwii military germans superb legion gaiters darker scarlet gaming recesses cavalry obscure faded pulp tutorial bunte brackets ferrari famous vallejo wargamer glazes paints darkest italy transitions scifi rifle patches dried tricky earth beige overlap elaborate
http://alifeinphotography.blogspot.com/2022/08/a-peterbilt-prototype-in-infrared.html
crosses america display jenkins minolta byways stashed highway corporations truck infrared prototype parked photographing barns glory keeper organizations spanning scanner travel canon photographed gloria camera countries film records assignments philosophy antique photography agencies nature career amazon 1920s historian mission humanitarian
https://vickiboykis.com/2018/02/19/building-a-twitter-art-bot-with-python-aws-and-socialist-realism-art/
concrete firing library polling models workflow shoulder credentials uploading cloud wikiart amazon news software vicki profiles console painting script maintenance leads tweets captions scrape metadata forms records functions partisan extract cabin stalin json systemic followers imports string sleeping handler supportive distribution code overload directory realistic stream interval genre programming environment twitter tweaking acces processs prototype lambdas python technology libraries locally surrounded realism aims kinesis instructions developer developed arguing downloading noisy decode socialist artwork posting websites development triggered russian genres classifier functionality soviet kitchen filename docs animal headlines hashtag repeats emotion education artist scenes painter analytical tears artists classical scraping party travis toxic scraper javascript bird paintings uploaded documentation propaganda nudity refactor regime scroll communism dictionary clears client immobilized government machines science engineer array waterfall artworks testing experimenting dependent linux database asleep boundaries component pencils feasible revolution scripts architecture execution operations dictionaries humanity hooks trial
Comparison
With a slimmed-down list of embeddings loaded into memory, I could
iterate over every page and produce a single embedding vector by summing the vectors of its component words. Comparing two pages is then just a matter of computing the cosine similarity of their embedding vectors.
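A sketch of both steps, with a deliberately naive tokenizer standing in for whatever the real page-text extraction does:

import math
import re

def page_vector(text, embeddings):
    """Sum the vectors of every page token that survives the pruned vocabulary."""
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    for token in re.findall(r"[a-z0-9']+", text.lower()):
        vec = embeddings.get(token)
        if vec:
            total = [t + v for t, v in zip(total, vec)]
    return total

def cosine_similarity(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

# score = cosine_similarity(page_vector(text_a, vocab), page_vector(text_b, vocab))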
Heuristic search
Cosine similarity comparison is reasonable for comparing two webpages or even for finding related posts within my blog (n = 700ish). It's not reasonable for
"here's an arbitrary webpage, find me similar ones in the hundreds of thousands of Outer Web pages". But neither was trigram comparison, and that's why
Outer Web search is a two-step process:
- Query a keyword index for posts with matching terms.
- Starting with the pages found in Step 1, search the connected graph of pages until a 'done' condition is met.
To reiterate, cosine similarity is an easy win for making Step 2 better; can it do anything for Step 1? Perhaps.
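Roughly what a cosine-scored Step 2 could look like, written as a best-first walk; the 'page_vecs' map (numpy arrays), the 'links' adjacency map, and the 'done' condition are placeholders for the real crawl data:

import heapq
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def related_pages(query_vec, seed_pages, page_vecs, links, limit=25):
    """Best-first walk outward from the keyword-index hits (Step 1),
    always expanding the most similar page seen so far."""
    seen = set(seed_pages)
    frontier = [(-cos(query_vec, page_vecs[p]), p) for p in seed_pages]
    heapq.heapify(frontier)
    results = []
    while frontier and len(results) < limit:  # 'done' condition: enough results collected
        neg_sim, page = heapq.heappop(frontier)
        results.append((page, -neg_sim))
        for neighbor in links.get(page, ()):
            if neighbor not in seen and neighbor in page_vecs:
                seen.add(neighbor)
                heapq.heappush(frontier, (-cos(query_vec, page_vecs[neighbor]), neighbor))
    return results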
One-hot index
I created a list of the posts that have the highest similarity to each possible one-hot vector (one vector per embedding dimension). For instance, the vector [0.0 0.0 0.0 1.0 0.0 ... 0.0] is most similar to the following posts:
- Laura Kalbag - The Destiny Machine
- Imposter Syndrome, Dunning-Kruger, or Just Do It · For my friendgineers
- Ghostbusters: Frozen Empire - Jason Gaston
- Blog | Thevioletwest
- Cinderella Moments: Peony Manor Custom 1/12 Scale Dollhouse
Now, I have no idea why those posts are semantically similar and neither does the embedding data. It is, after all,
an entire internet of words distilled into a hundred categories. Some of the other one-hot entries yielded results that look more alike at a glance:
- The Arcade / Blog
- How to Become Better at the Keyboard - Sebastian Daschner
- java.io.UnsupportedEncodingException problem & solution - Avi Zurel - @KensoDev
- Physical Memory Attributes (PMAs) · Daniel Mangum
- 16BPP.net: Blog / Added a Contact, Page
One more:
- Photography in zero visibility at the Sonoma Coast - aows
- Trip to Goa | Deepak's Views
- Small and Sunny - Learning to Surf
- Once this month - Learning to Surf
- Fjodin's 15mm World: Armies Army - VDV vehicles Pre release Sale now on!
With a little instrumentation, I could determine if one-hot graph entry points result in good matches relative to a basic keyword index.
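A sketch of how such an index might be built, treating the page vectors as rows of a numpy matrix ('page_vecs' is again a placeholder for the per-page embedding map):

import numpy as np

def one_hot_index(page_vecs, top_n=5):
    """For each embedding dimension, list the pages most similar to that one-hot vector.
    Against a unit one-hot vector, cosine similarity is just the page vector's value
    in that dimension divided by the vector's length."""
    pages = list(page_vecs)
    matrix = np.array([page_vecs[p] for p in pages], dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-9  # normalize rows
    return {
        dim: [pages[i] for i in np.argsort(-matrix[:, dim])[:top_n]]
        for dim in range(matrix.shape[1])
    }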
Cosine similarity maths
Cosine similarity's Hello World is the word-analogy trick, usually quoted as "King - Man + Woman = Queen": subtract the vector for 'man' from the vector for 'king', add 'woman', and the nearest vector is 'queen'. I did vector addition and subtraction on some randomly-selected terms to see what would happen. Randomly plucking words from the index gave me a lot of meaningless results, like 'taco' + 'dignity' = a variety of words vaguely resembling each input.
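The arithmetic itself is just elementwise addition or subtraction of two word vectors, followed by ranking the rest of the vocabulary by cosine similarity against the result. A sketch, treating the vectors as numpy arrays so + and - work directly:

import numpy as np

def nearest(embeddings, query_vec, top_n=7, exclude=()):
    """Rank vocabulary words by cosine similarity to an arbitrary vector."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-9)
    scored = []
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        scored.append((float(np.dot(vec, q) / (np.linalg.norm(vec) + 1e-9)), word))
    return sorted(scored, reverse=True)[:top_n]

# e.g. nearest(emb, emb["blogosphere"] + emb["recap"], exclude={"blogosphere", "recap"})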
To illustrate, here's an addition that makes sense:
blogosphere + recap
1. 0.750: tweets
2. 0.738: bloggers
3. 0.729: newsblog
4. 0.728: podcast
5. 0.727: lamestream
6. 0.725: podcasts
7. 0.723: scobleizer
But the subtraction goes off the rails a bit:
blogosphere - recap
1. 0.509: nationalism
2. 0.461: separatism
3. 0.458: feminists
4. 0.448: nationalist
5. 0.438: globalization
6. 0.438: activists
7. 0.437: nationalists
Another illustrative result:
embeddings are trained on all the possible meanings of a word, e.g. 'chihuahua' is both a dog and a Mexican state:
chihuahua + bear
1. 0.762: coyote
2. 0.724: sonora
3. 0.701: durango
4. 0.680: bobcat
5. 0.676: wolf
6. 0.674: sinaloa
7. 0.667: juarez
'Bear' has a variety of meanings but they appear to be dominated by the animal - at least when added to a chihuahua.
The
random keyword selection was pretty funny at times, e.g. vacation - dystopia:
vacation - dystopia
1. 0.439: homestays
2. 0.405: picnics
3. 0.395: resort
4. 0.390: resorts
5. 0.378: meals
6. 0.371: honeymoons
7. 0.370: cookouts
And though the similarity values are pretty low, I will personally confirm that cookouts and honeymoons are a lot like vacations without the dystopic elements. Speaking of dystopias:
tiktok + america
1. 0.664: effed
2. 0.649: firetv
3. 0.646: globl
4. 0.642: asshat
5. 0.638: jnpr
6. 0.632: bridgerton
7. 0.628: bedazzle
A few others:
- zoolander + honeymoon = shrek
- idiocracy - movie = misanthropy
- rants + censorship = deadnaming
- atlanta + waves = seattle
- atlanta - waves = gwinett
- motorsport - klm = supercars
Next month's follow-up post about being more efficient with this implementation:
2025.06.14 - Compressed embeddings: adding efficiency to a simple embeddings implementation for webpage linking.
Moment of zen
Embedding math: tiktok + america = effed