Studio711.com – Ben Martens

Geek

DAWG Part 2

This post assumes that you’ve read at least the radix tree portion of the post from two days ago. Given that tree built from my dictionary of 100,000 words, what word would require the most nodes to be touched? One key is that the children of every node are sorted alphabetically.

So thinking logically about this, it’s probably going to be a long word. It’s probably going to contain a good number of letters that are towards the end of the alphabet. “aadvark” for example, is not going to work because right away we find the string “aa” after looking at only two nods. “zoo” on the other hand takes 26 nodes just to find the z.

I solved this by pumping all 100,000 words through the radix tree and counting how many nodes I touched for each word. The winner was “superstructures". Here are the nodes I touched at each level:

a b c d e f g h i j k l m n o p q r s
a c e f g h i k l m n o p q s t u
a b c d e f g i k l m n p
e
r
a b c e f g h i l m n o p r s
a c e i m o p t
a i o r
a e o u
c
t
u
r
a e
s

Now you have learned your random piece of trivia for the day. I can’t imagine the situation where this would be useful knowledge. Also, this answer would vary depending on the dictionary you use.

PS. I haven’t had a ton to write about so you get stuck with whatever thoughts fly through my brain. Scary, eh? I wonder what normal people think about.

Directed Acyclic Word Graph

If that title wasn’t warning enough, you should know that this is going to be a pretty geeky post. I thought I’d do a little writeup of the project I’ve been toying with for the last couple weeks and I think a handful of you will enjoy it.

It all began when I started thinking about how you would write a program to find all the words on a 5×5 grid of letters (like Boggle.) There are many apps that do this already, but how do they work? How do you take a dictionary of 100,000 words and quickly apply to it to a bazillion possible letter combinations on a 5×5 grid and find all the valid words?

Radix Tree

I started off by reading in all the words and storing them in a radix tree. There is one root node and then the children of that root node are the distinct first letters in your word list. For a full dictionary you’ll probably end up with that first level of the tree having 26 nodes, one for each letter. This pattern continues down the tree so if you grab any node in the tree, the nodes above it correspond to the prefix of a words and all possible routes down through the children correspond to valid suffixes. Take any path from top to bottom and you have a word in the dictionary.

Each node also contains a flag that we’ll call Terminal which is true if that node can end a word and false if it cannot. So in the example on the right, the “n” in “one” would have Terminal = true because “on” is a valid word.

This turns out to be a very quick way to determine if a string of letters is a valid word. But there is a lot of extra data in the tree. For example, in the diagram on the right, the ending letters in “one” and “three” are identical nodes. There’s room for improvement.

Directed Acyclic Word Graph

As I was thinking about ways to compress this tree a little bit, DaveM pointed me to DAWGs (Directed Acyclic Word Graphs.) To create the DAWG, I create an array of pointers to all the nodes in the tree sorted by their depth in the tree. Starting at the leaf nodes (depth = 0) and moving up to the root, I apply the following algorithm

For each pair of nodes at depth x, if the two nodes have the same character and the two nodes have the same terminal flag and the two nodes have identical subtrees then remove node2 from its parent and add node1 to the childlist of the parent of node2.

Done correctly in a large word list, you can throw away a half to two thirds of the nodes in you tree!

Storing the Graph

It takes a while to build and compress the graph so it makes sense to serialize the tree so that I can just read it in from disc and be ready to go. I encoded each node as a 32-bit number. The first 5 bits represent the character (enough to store the 26 letters in the alphabet), 1 bit represents the terminal flag on/off value, and the last 26 bits represent the array index of the list of the node’s children. I flatten the tree into a jagged array where each node has a slot in the array and the value of that slot is a series of 32-bit numbers which represent it’s children with the above encoding scheme. The algorithm looks like this:

Get a list of the nodes in breadth first order.

Foreach each node in the BFS list look at each of the node’s children. If the child does not have a slot in the array, create an array slot and store that index in the child node. Encode the child node into the 32-bit number and append it to the list of the parent node’s children.

So for any array index N, the list of 32 bit numbers in that array slot represent the children of that node.

Writing this out to disc is pretty straightforward. The only special steps I took were to start the file with the length of the array and then when I write each array slot, I start with the number of children at that position. This speeds things up when I read in the file because I can allocate the right sized array.

Putting It to Use

Now it’s time to put the DAWG to use. I begin by generating a random 5×5 board. I use a weighting scheme to favor the more common letters and produce boards with more words. I then start at look at each node and follow this algorithm:

If this spot hasn’t been visited and the current string plus the letter on this spot is a valid substring in the DAWG, mark this spot as visited and search again looking in all 7 directions. If the current string is a valid word in the DAWG, add it to the list of found words.

Once that’s done I sort the found word list for quicker lookup. Now the player can start the game. When they punch in a word, it’s a simple lookup in that sorted list to determine if it’s a valid word. Since I know what all the valid words are, I can tell the user they have found X% of the words and then show the full word list at the end of the game.

Performance

The speed of the DAWG generation and serialization doesn’t really matter since it only happens once, but I can plow through 100K words in roughly 30 seconds. I’m writing this game for my phone which has a 528MHz processor in it. I can read in the DAWG from disc and load the array into memory in about 9 tenths of a second. Generating the random board and finding all the words happens in only 1 tenth of a second!

The next step is to give this game a little nicer GUI and add a timer, but I always get sidetracked playing the game instead of trying to improve it. This might be as good as it gets before I move on to the next hobby project.

Ski Geek

Somehow, every activity I participate in is turned into a geek project. Skiing is no different. This season, I have been carrying around the Garmin GPS that I purchased earlier in the year. It keeps a signal inside my coat and dutifully tracks my position every few seconds.

I can load this into my National Geographic topographical map program and get an interesting view, but I wanted more. I want to know my top speed, how much time I spent in the lift lines, how fast the lift was moving, which lifts I rode the most, etc.

To that end, I’ve started writing a program to analyze the GPS data. The raw statistics are fairly simple and I was able to get a display churned out pretty quickly. Lately I’ve been stuck on trying to automatically figure out when I was on a lift. On the surface, it seems easy: you’re on a lift when you’re going up. That’s not always true. Runs have rises in them and lifts have dips in them. So then I tried to say that any time I’m heading in the same direction for X miles and Y vertical feet then I’m on a lift. Even that has problems. What happens when you get off a lift and keep skiing straight down the backside of the hill? What happens when you get a couple errant GPS points that aren’t in line with the lift? There is enough drift in the data to make it very complicated. If I can’t get the automatic solution figure out, I’m going to have the user tell me where the lifts are the first time and then I’ll save that data. I want to move on to getting either a 2D or 3D map working next. After that I’ll work out a good way to display all the statistics about the day and each individual run.

If you’re interested in seeing the code and/or helping out, it’s all available on codeplex.

The picture below shows the track from my last ski day at Crystal Mt in the National Geographic software. My software will end up looking something like this but with more data and information on the screen.

Office Decoration

When I moved into the new office, I came up with an interesting way to decorate the back wall. I used the National Geographic Topographic maps software to create a bunch of 8.5×11 PDFs. I printed everything off at Kinko's and then painstakingly placed each sheet on the wall with thumbtacks.

The whole process took forever. If I had it to do over again, I would pay extra to have it done on a plotter or some other large format. This would have saved a lot of time and also I think it would have looked better because some of the sheets didn't quite line up when I tried to place them on the wall.

That being said, I still think it looks pretty good. I get a lot of comments from people as they walk by my office. The map goes from Seattle in the lower left corner, east to Stevens Pass, and then north up to almost the north end of Whidbey Island. The smaller map on the side wall shows Mt. Rainier. As I go on various trips, I print off a small picture and stick it on the map.

I don't think I'll do it again (at least in this manner), but I'm happy with the way it came out.

Geohashing

Rachel and I have been chatting about trying to do a little geocaching instead of regular hiking. Today's XKCD comic presents an idea called geohashing. There is a formula to figure out a geohash location. The inputs to the formula are the date, that date's (or most recent) Dow opening, and the integer lat/lon of your current location. The formula includes an MD5 hash and is pictured below. There is an app already written to calculate each meeting spot and there is a wiki with more info.

I'm not going to bother to explain what all that means, but if you need more help, you can click on the links for extra learnin'.

Tivo Hack

Last night, I hacked the Tivo and lived to tell about it. The Tivo HD comes with a 160GB drive which is good for 20 hours of high def recording (184 standard def hours.) That's quite a bit, but on busy sports weekends, 20 hours fills up pretty quickly. Plus, this is supposed to be a pretty easy hack and I wanted to give it a shot.

I ordered a 500GB Western Digital hard drive from NewEgg along with an external SATA USB drive enclosure. Once I had the material, it was pretty straightforward. I opened up the Tivo and took out the hard drive (make sure you have a Torx wrenches!) I connected the 500GB drive to my system via the USB enclosure and connected the Tivo drive to a SATA connection inside my computer. I fired up WinMFS and ran mfscopy. I goofed a little and forgot to delete all the old TV shows from the Tivo so the mfscopy command took the time to copy over all those recorded shows. It took about an hour, but it would have gone much faster if I had remembered to delete the shows first.

After mfscopy finished, I made a backup of the original Tivo drive (something I should have done at the beginning.) I'll keep the Tivo drive untouched for a while to make sure the new one is ok, but if it works then I'll wipe the old Tivo drive and use it for regular storage. Theoretically I can restore that backup at any time.

The mfscopy is supposed to copy over all the Season Passes, Cable Card settings, etc. It's a bit for bit copy of the whole drive. I really didn't want to go through the Cable Card pairing again so I was hoping that it really did work. I put the 500GB drive into the Tivo and fired it up. It worked perfectly! My Tivo now says that I have 64 hours of HD space and 607 hours of standard def space. It's going to be hard to fill that up unless I leave the house for a month.

There are some firmware changes you can make to hard drives to make them run slower and be a bit quieter. I didn't do any of those and it turned out ok. I do hear the hard drive a little, but that doesn't bother me. If you're really concerned about the noise, find that tool and/or pay a little extra for one of the Seagate DB35 drives. TivoCommunity.com has all the info you need.

All in all this was a very simple "hack." After Tim saw how it worked he said, "That's not even a hack." He's right. It's really simple as long as you're willing to void your warranty and risk destroying your investment. Beyond that you only have to click a few buttons and wait. Go for it!

Web Cam Time Lapse

I've done a couple time lapse videos recently, and I think I'm becoming a bit of a time lapse fan. This latest app combines time lapse photography with the skiing web cams that I watch every day (via my Vista gadget!)

The app watches a list of web cams which you provide and saves a copy of the image at intervals which you specify. My plan is to let a bunch of these run for the next few months and then stitch them together into some type of movie. Maybe none of them will turn out well, but it was an interesting little application to write nonetheless.

The program writes out a file called usersettings.xml and that contains all the information that you provided when you set up the web cams via the Add button. If you need to edit any of the information you provided, you'll have to close the program, edit the xml, and then restart. New threads are spawned to download all the images. There are a lot of potential holes in the program, but it has been running fine on my machine for a few days.

If you've been looking for something similar, feel free to check it out. The source code and binaries are posted on Codeplex at http://www.codeplex.com/webcamtimelapse.

OnDemand

I was flipping through the OnDemand section of our digital cable subscription and was surprised at some of the content they have on there. You can get a lot of the new movies that are out on DVD as well as some of the more popular TV shows, but of course you have to pay for all that. The free selections aren't too bad. For example, what happens when the karaoke bar changes their karaoke night? Just head on home and flip on the cable box. There are karaoke songs out there! And what happens when you get tired of hearing your friends belt out barely recognizable tunes and you want some authentic NJ lovin'? Flip on over to the section full of Bon Jovi music videos! They have music lessons for guitar and piano as well as a couple movies that I've actually heard of (including Warren Miller and Monty Python.) Comcast has a website that lets you search through all the content which is a lot easier than using the on screen guide. There's a lot of junk out there but there are a few gems worth finding. Now if only I could get OnDemand HD content.

iPod Update

I finished off the iPod project last night. When you sit down in my car you just see one small white cable that pops up from the floor near the cup holder. Plug that into the dock port on the iPod and voila! 30GB of tunes at your fingertips. I'm really happy with the way the project went. It could have been a nightmare but it was pretty simple. Check out the photos here.