Infopost | 2020.01.26

Machine learning deep learning autoencoder Horizon Zero Dawn tiling

Machine learning deep learning autoencoder bottleneck

I did some more Keras dabbling since there was cold weather and time off. I continued down the autoencoder thread that I'd explored a bit previously.

To recap

Machine learning deep learning autoencoder Magic the Gathering card fix

Autoencoders are primarily used for image denoising (and as ML examples). AE compression doesn't stand up to traditional compression algorithms, but neural networks can learn graphical features and draw them into an image. An autoencoder can also generate/repair images by tweaking the 'latent space', which is basically the network's abstract idea of what is in the image.

Machine learning deep learning autoencoder kernels classification napkin diagram

I tried to draw my best understanding of a convolutional autoencoder on a cocktail napkin (with oversimplifications described later). An autoencoder might learn what a type of ball looks like and use that information to redraw it, given partial or complete information. The latent layer amounts to a classification or set of classifications.

The encoding portion consists of convolutional 'kernels' (m-by-n filters) that get dragged over the image and produce a 'feature map' that is basically how the image is seen by the kernel's understanding of the world. Convolution is necessary to see features in a position-independent way.

The latent space reads all of these filters and decides what is in the image; kernels that don't recognize anything will not be as loud as ones that do. The decoder then takes this information and redraws the image using transpose convolution (roughly the inverse of the encoding portion). By comparing the input and output, the network can learn without labels or additional supervision.

The above drawing uses kernels that see an entire ball. In practice, they only recognize certain portions of each object. By having many layers (left to right), the interpretation of the image gets more abstract.
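
To make the napkin concrete, here's a minimal Keras sketch of the same idea - a toy, not the model I actually trained, assuming a 64x64 RGB input with a couple of conv+pool steps down and transpose convolutions back up:

from tensorflow.keras import layers, models

# Toy convolutional autoencoder (illustrative only), assuming 64x64 RGB input.
toy = models.Sequential([
    # Encoder: kernels produce feature maps, pooling shrinks them.
    layers.Conv2D(16, (3, 3), activation='relu', padding='same', input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(8, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),   # 16x16x8: the compressed, abstract view
    # Decoder: transpose convolution upscales/redraws from the abstract maps.
    layers.Conv2DTranspose(8, (3, 3), strides=(2, 2), activation='relu', padding='same'),
    layers.Conv2DTranspose(16, (3, 3), strides=(2, 2), activation='relu', padding='same'),
    layers.Conv2D(3, (3, 3), activation='sigmoid', padding='same'),   # back to RGB
])
toy.compile(optimizer='adam', loss='mse')
# toy.fit(x_train, x_train, ...)  # the input is also the target, so no labels needed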


I don't quite remember where I left off with my previous attempt, but in giving it another go, I hit something that felt familiar: the network would produce a flat gray output image that seemed to be shooting the median pixel value for all of the training data. So I was starting from scratch.

I don't think using RMS loss was my issue; it's a pretty standard measure of error for networks that produce images. Optimizer choice matters too, so I switched between Adagrad and RMSprop.
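
For reference, the optimizer swap is just a compile-time change (a sketch; 'm' is the model defined below, and I'm assuming mean-squared-error as the reconstruction loss):

# Same model, same loss; only the optimizer changes between runs.
m.compile(optimizer='rmsprop', loss='mse')
# ...or...
m.compile(optimizer='adagrad', loss='mse')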

Most autoencoder examples use images no larger than ~120 pixels per side. This makes sense for a lot of applications (e.g. the standard MNIST number recognizer) and certainly cuts down on training time (especially relevant since most available code is sample code). My original goal was a bit higher: 512x512 squares with larger convolutional kernels. I was aiming to have a network with 40k-1M weights, and perhaps this simply isn't enough to handle such large input. Then again, the point of convolution is to be highly independent of input dimensions, so why would a 64x64 input be any worse than four 32x32s? Among other things, the answer is that each kernel produces a same-size feature map and that quickly eats up available memory.
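
Some rough arithmetic on that memory point (assuming float32 activations and a modest batch of 16):

# One conv layer with 8 kernels on a 512x512 input produces 8 same-size feature maps:
per_image = 512 * 512 * 8 * 4   # ~8.4 MB of activations, per layer, per image
per_batch = per_image * 16      # ~134 MB for a batch of 16, for a single layer's output
# Multiply by the number of conv layers (and roughly double for the gradients kept
# around during backprop) and it adds up fast.

Anyway, here's the 512x512 model I was fighting with: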

from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

# Assumed setup (these definitions weren't in my snippet): RGB 512x512 in and out.
input_shape = (512, 512, 3)
input_channels = output_channels = 3

m = Sequential()
# 512x512
m.add(layers.Conv2D(8, (12, 12), activation='relu', padding='same', input_shape=input_shape))
m.add(layers.Conv2D(8, kernel_size=(8, 8), padding='same'))
m.add(layers.Dense(input_channels, activation='relu', kernel_initializer='glorot_uniform'))
m.add(layers.MaxPooling2D(2, 2))

# 256x256
m.add(layers.Conv2D(8, kernel_size=(6, 6), padding='same'))
m.add(layers.Conv2D(16, kernel_size=(4, 4), padding='same'))
m.add(layers.Dense(input_channels, kernel_initializer='glorot_uniform'))
m.add(layers.MaxPooling2D(2, 2))

# 128x128
m.add(layers.Conv2D(8, kernel_size=(4, 4), padding='same'))
m.add(layers.Conv2D(16, kernel_size=(3, 3), padding='same'))
m.add(layers.Dense(input_channels, activation='relu', kernel_initializer='glorot_uniform'))
m.add(layers.MaxPooling2D(2, 2))

# 64x64
m.add(layers.Conv2D(8, kernel_size=(3, 3), padding='same'))
m.add(layers.Conv2D(8, kernel_size=(2, 2), padding='same'))
m.add(layers.Dense(input_channels, activation='relu', kernel_initializer='glorot_uniform'))

# A flatten-to-latent-vector step would go here; I ended up dropping it (see below).

# Reshape to 64x64x(L or RGB); transpose convolution will then upscale.
m.add(layers.Reshape((64, 64, output_channels)))

# 64x64
m.add(layers.Conv2DTranspose(64, (1, 1), strides=(2, 2), activation='relu', kernel_initializer='glorot_uniform'))
m.add(layers.Dense(output_channels, activation='relu', kernel_initializer='glorot_uniform'))

# 128x128
m.add(layers.Conv2DTranspose(64, (4, 4), strides=(2, 2), activation='relu', kernel_initializer='glorot_uniform'))      
m.add(layers.Cropping2D(((1, 1),(1, 1))))
m.add(layers.Dense(output_channels, activation='relu', kernel_initializer='glorot_uniform'))

# 256x256
m.add(layers.Conv2DTranspose(64, (4, 4), strides=(2, 2), activation='relu', kernel_initializer='glorot_uniform'))      
m.add(layers.Cropping2D(((1, 1),(1, 1))))
m.add(layers.Dense(output_channels, activation='relu', kernel_initializer='glorot_uniform'))

# 512x512
m.add(layers.Conv2DTranspose(64, (1, 1), strides=(1, 1), activation='relu', kernel_initializer='glorot_uniform'))      

m.add(layers.Dense(output_channels, name='output'))

Proceeding, for a time, with that image size naturally meant smaller training batches. I had hoped to address this simply by doing overnight runs, but after hours my output was still monotone gray and occasionally a checkerboard. I was expecting to see something resembling progress (e.g. blotchy noise) early on.

I ping-ponged between sample code. Everyone seems to do things a little differently - except for the people who just copy and paste the official Keras example and call it their own. It really doesn't help that so many examples are hardwired to use existing models or input data. Swapping snippets isn't hard; it just takes time to get everything rewired for a given method of doing input/batching/display/normalization.

I eventually dialed it down to 32x32 so I could hasten the trial-and-error process. Previous attempts used a bunch of noise/dropout layers that are generally good practice, but I was concerned some of these might be cranked up too high. Certainly, upon re-reading the SpatialDropout2D docs, I realized that cutting an entire feature map from a layer with eight kernels might be a bit heavy-handed. I had quite a few batch normalizers in there as well.
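
To illustrate the difference (a sketch; the rates here are made up): regular dropout zeroes individual activations, while SpatialDropout2D zeroes entire feature maps, which is a big deal when a layer only has eight of them.

from tensorflow.keras import layers

layers.Dropout(0.25)           # zeroes ~25% of individual activations
layers.SpatialDropout2D(0.25)  # zeroes ~25% of whole feature maps - on an
                               # 8-kernel layer that's ~2 of its 8 maps per pass

Either way, the slimmed-down 32x32 model ended up summarizing like this: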

Layer (type)                 Output Shape              Param #
conv2d (Conv2D)              (None, 32, 32, 16)        784
conv2d_1 (Conv2D)            (None, 32, 32, 16)        1040
max_pooling2d (MaxPooling2D) (None, 32, 32, 16)        0
conv2d_2 (Conv2D)            (None, 32, 32, 16)        2320
conv2d_3 (Conv2D)            (None, 32, 32, 16)        1040
dropout (Dropout)            (None, 32, 32, 16)        0
batch_normalization (BatchNo (None, 32, 32, 16)        64
conv2d_transpose (Conv2DTran (None, 35, 35, 32)        8224
cropping2d (Cropping2D)      (None, 33, 33, 32)        0
batch_normalization_1 (Batch (None, 33, 33, 32)        128
conv2d_transpose_1 (Conv2DTr (None, 36, 36, 32)        16416
cropping2d_1 (Cropping2D)    (None, 32, 32, 32)        0
batch_normalization_2 (Batch (None, 32, 32, 32)        128
dense (Dense)                (None, 32, 32, 512)       16896
dense_1 (Dense)              (None, 32, 32, 3)         1539
Total params: 48,579
Trainable params: 48,419
Non-trainable params: 160

I wasn't sure about the latent layer so I removed that, having seen a number of examples that simply went from convolution to transpose convolution.

What I really noodled on was the output layer(s). The decoder portion of the network is a bunch of transpose convolutional layers that take abstract information and redraw the features that were abstractized by the encoder. So as the decoder traces out a bunch of feature maps from latent space (and maybe enlarges them), you end up with a bunch of semi-images that must be recombined into a single output image. This seems like a crucial step, and all of the examples I saw took a different approach without much in the way of explanation.

A common approach is an n-channel dense or convolutional layer that takes all the feature maps (width, height, channels, kernels) and spits out an image. This makes sense - the layer uses each kernel's output to decide whether or not to fire. I have, however, been wary of model summaries that show dense layers having very few weights, particularly when troubleshooting an autoencoder that produces no detail.
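
Some quick parameter arithmetic (pulled from the summary above) shows why: a Keras Dense layer on a (32, 32, channels) tensor only connects along the channel axis, so the channel count is all that drives its size.

# From the 32x32 model summary above:
dense_wide = 32 * 512 + 512   # 16,896 params: 32 channels fanned out to 512
dense_out  = 512 * 3 + 3      #  1,539 params: 512 back down to RGB
# Going straight from 32 channels to RGB would be a mere 32 * 3 + 3 = 99 weights
# responsible for turning all of that decoder output into an image.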

I found another implementation that used a single-kernel (per channel) 3x3 convolutional layer. While this makes sense for output shape, it means you're producing the output image by dragging a 3x3 conv box across a bunch of feature maps, which feels like a massive dumbing down of the elaborately-constructed data. Here's where I ended up instead:

m.add(layers.Conv2DTranspose(32, (4, 4), strides=(1, 1), activation='relu', kernel_initializer='glorot_uniform'))
m.add(layers.Cropping2D(((2, 2),(2, 2))))
m.add(layers.Dense(512, activation='relu', kernel_initializer='glorot_uniform'))
m.add(layers.Dense(output_channels, activation='linear', kernel_initializer='glorot_uniform'))

What I settled on for the back end of the transpose convolutional block was a dense layer with a lot of units. Typically you'll see units = 3 for RGB and 1 for monochrome, since that's the dimensionality of the output tensor. I'm sure this is the right approach for many cases and takes advantage of the magic of a dense (weight-shared, not fully-connected) layer, but I was worried about bottlenecks and concerned by how few parameters a dense(3) layer gets.

More precisely, I used dense(units=a lot, relu) followed by dense(channels, linear). The first, large, relu layer was meant to combine all of the transpose convolutional feature maps in a manner that allowed a large number of knobs to be turned by the magic of neuroscience, training, and cuda. Then an output-size final layer to decide how to interpret the mess of activations before it. Linear activation is scary since it's combining a bunch of inputs, but makes sense for a 0-255 output.

Trying it out

Machine learning deep learning autoencoder Horizon Zero Dawn sample files

I have a few data sets to choose from. Using my Java graphics library, I wrote an image sampler and set it upon my Horizon: Zero Dawn screencap directory to produce 1000-ish 32x32 squares. Training output started noisy, then got better and better:

Machine learning deep learning autoencoder Horizon Zero Dawn training epochs

40/40 [==============================] - 66s 2s/step - loss: 0.0131 - val_loss: 0.0096
Epoch 12/16
40/40 [==============================] - 66s 2s/step - loss: 0.0132 - val_loss: 0.0096
Epoch 13/16
40/40 [==============================] - 65s 2s/step - loss: 0.0132 - val_loss: 0.0096
Epoch 14/16
40/40 [==============================] - 65s 2s/step - loss: 0.0131 - val_loss: 0.0096
Epoch 15/16
40/40 [==============================] - 67s 2s/step - loss: 0.0131 - val_loss: 0.0095
Epoch 16/16
40/40 [==============================] - 66s 2s/step - loss: 0.0130 - val_loss: 0.0095

After an hour or so with a batch size of 160, I was getting down into the 0.00x loss territory. In retrospect, this was despite the fact that my input/output training sets were being independently rotated/mirrored. Whoops. Regardless, output from these small pieces started looking like input, but neither really looked like anything since they were 1/2000th of an image.
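
For what it's worth, the augmentation fix is just to derive one random transform and apply it to both sides of the pair (a NumPy sketch, assuming the input tile and target tile start out identical):

import numpy as np

def augment_pair(tile, rng=np.random):
    """Rotate/mirror the tile once and return it as both input and target."""
    k = rng.randint(4)          # 0-3 quarter turns
    aug = np.rot90(tile, k)
    if rng.rand() < 0.5:
        aug = np.fliplr(aug)
    return aug, aug.copy()      # input and target stay in lockstep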

Machine learning deep learning autoencoder Horizon Zero Dawn tiling

So the next thing was to write a script to apply the autoencoder to an entire image, square by square. Not only would this reconstruct something that actually looked like a thing, it'd also be a good test of whether the autoencoder was overfitting the training set. The output looked way better than a flat gray png.

Machine learning deep learning autoencoder Horizon Zero Dawn tiling overlap

The square-by-square application left obvious borders in the decoded tiles. This was likely because edges of each tile have less data than the middle portions. The simple solution was to modify the stitcher to crop the few boundary pixels and change the step size to remove the gaps. The right way to do this, however, is to have a 32x32 autoencoder only produce a 24x24 output, thereby saving on network size and computation.
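
The crop-and-overlap stitcher amounts to something like this (a rough sketch with made-up helper names, assuming a 32x32 model and a 4-pixel border crop; the outermost border of the full image is simply left unfilled):

import numpy as np

TILE, CROP = 32, 4              # assumed tile size and border to discard
STEP = TILE - 2 * CROP          # overlap tiles so the kept centers butt together

def reconstruct(model, image):
    """Run the autoencoder tile-by-tile, keeping only each tile's center."""
    h, w, _ = image.shape
    out = np.zeros_like(image, dtype=np.float32)
    for y in range(0, h - TILE + 1, STEP):
        for x in range(0, w - TILE + 1, STEP):
            tile = image[y:y + TILE, x:x + TILE]
            pred = model.predict(tile[np.newaxis])[0]
            out[y + CROP:y + TILE - CROP, x + CROP:x + TILE - CROP] = \
                pred[CROP:TILE - CROP, CROP:TILE - CROP]
    return out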

Next steps

Machine learning deep learning convolutional autoencoder layers diagram

There are a few things to try next:

Other people's (neural) fails

LG image tagging fails

My phone automatically tags images (thanks). Pretty much everything on it is a memebank (because using gif services is cheating). This has resulted in a few awesome fails. Apparently weapons look a lot like musical instruments and a heart looks like a fruit. Stormtroopers are basically mannequins - we all knew that anyway - and jazz addicts use their (jazz?) hands.

LG image tag fails

On the flip side, I could go for a sculpture of Christian Pulisic riding a dragon.

Guy who looks like Trump Giuliani lovechild Andy McCarthy

Fox is clearly using a deepfake merge of Donald Trump and Rudy Giuliani for its contributors. Nice try.

Deep Dream weimaraner puppy

Perhaps I simply don't appreciate it, but it seems like some of these neural algorithms are pretty overhyped. We've already covered neural style transfer, though I kind of like that one. I ran the default example of Deep Dream; it's... a crappy kaleidoscope. And both kind of cheat by modifying the original image with an activated style layer - sidestepping the difficulty of generating a wholly new image, while also being extremely slow to apply.

Tool concert January 2020 SDSU stage lasers

Lee scored some tickets to Tool earlier this month. It was a great show. They were strict about use of cellphones (which was kind of nice) until the encore. I grabbed a couple shots and then went back to enjoying it.

Tool concert San Diego State University Tool concert San Diego State University band Tool concert San Diego State University crowd panorama

Gloomhaven bear archers board scenario

The Unnatural Ones' list of exploits keeps growing.

Gloomhaven spoilers log whiteboard events epic prose notes

Dogs weimaraner brewery cornhole table bandana

The Society explored the Miralani Makers District this month.

San Diego sake brewery art

Thunderhawk Alements was standard Miramar fare; Serpentine Cider was - well - cider. I wasn't especially fond of most of the sakes at Setting Sun, but the place had style.

Poster Late Apex Jeremy Deconcini

I happened upon a poster and wanted to know more. From Katherina Michael's Amazon review:

Ben Adams is a surfer who finds himself responsible for neutralizing a terrorist threat. Well, obviously he's not just any surfer... he's a former FBI agent and also felon, who's now retired and living a life of surfing and drinking beers on the beach. Trying to stay off the radar, he keeps to himself, with few - if any - friends, and a loyal dog. But the FBI has some leverage from his past, and has decided that his recklessness is a special talent, so has forced him to go undercover on an insane mission.

If he doesn't comply, ISIS may get a critical nuclear weapon trigger and Ben would certainly go back to jail. But doing the job will involve high-speeds, high-stakes, and Ben's own high-jinks thrown into the mix.

Late Apex is the third novel in Jeremy DeConcini's Ben Adams trilogy. It's an action-packed and fast paced page-turner in its own right - you don't need to have read the first two novels to dive right into this one - with a rough and tumble hero you'll love for his irreverence.

Thanks to his work as a Special Agent in the Department of Homeland Security, DeConcini's experiences and personal political opinions permeate his plot and convince the reader that Late Apex's storyline is truly plausible. All this to say... this highly enjoyable and suspenseful read seems realistic enough, even as we cringe at the American negligence.

Motorcycle, dog, fallout zone, racing phrase, surfer, FBI... it's like a perfect storm of awesome things. I don't think I'm going to read it, though; it sounds like it's trying to be too awesome and can't possibly make it work.

Denver airport bar Switch beer

What's better than an east coast trip during an impeachment trial?

Hotel lamp looks like goatse

You thought that was rhetorical. Answer: finding a goatse-inspired lamp. If you don't know what that is, don't look it up.

thumbnail Dog Money brewery Leesburg Virginia logo dog monocle thumbnail Virginia Pain Quotidien cappucino thumbnail Virginia Maryland border bridge Potomac thumbnail Virginia Maryland Vanish brewery tasters
thumbnail Virginia Maryland lock house

This trip included stops at Dog Money brewery (great graphic design, not great beer), Vanish (good beer, great spot out in the boonies), the Potomac locks, and some other places. J and I finally vanquished Wotan, thanks in small part to perfected builds and in large part to the DLC being scaled to player count.

PUBG map Karakin lobby

It's here, the new speed map. Complete with IEDs, rocket strikes, and...

PUBG RPG panzerfaust load screen Karakin

... RPGs???

PUBG spike strip trap molotov cocktail galaxy brain play tactics

Also I perfected the genius tactic of spike-stripping a door. Then camping it. The spikes are mostly for entertainment.


