In A*, the most common operation is finding the next cheapest node to explore in the open list, which makes pushing and popping nodes the hottest part of the search. Therefore, these are the operations you want to be cheap.
The most popular way of implementing the open list is to store the nodes in a heap, which lets you push and pop in O(log n) time. While that is a good general solution, our project was guaranteed to be a square 2D graph with a maximum number of nodes. This allowed me to use a more specialized data structure for the open list, which I will refer to as a bucket list. It is essentially a 2D array where you define how nodes are placed into buckets and keep track of the current cheapest bucket to pop from.
A bucket list has the added benefit of not needing to sort the individual buckets: since each bucket generally holds only a few nodes (fewer than 20), it is faster to do a linear scan of the bucket for its cheapest node.
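The idea can be sketched as follows. This is not the project's actual code; the names, the bucket width, and the f-cost cap are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node type: grid position plus total estimated cost.
struct Node { int x = 0, y = 0, fCost = 0; };

class BucketList {
    static constexpr int BucketWidth = 16;   // assumed cost range per bucket
    static constexpr int NumBuckets  = 64;   // assumed f-cost cap of 1024
    std::vector<std::vector<Node>> buckets{NumBuckets};
    int cheapest = NumBuckets;               // index of cheapest non-empty bucket

public:
    void push(const Node& n) {
        int b = n.fCost / BucketWidth;       // how nodes are placed into buckets
        buckets[b].push_back(n);
        if (b < cheapest) cheapest = b;      // track the cheapest bucket
    }

    bool empty() const { return cheapest == NumBuckets; }

    // Buckets stay unsorted: with only a handful of nodes per bucket,
    // a linear scan beats keeping the bucket ordered on every push.
    Node pop() {
        auto& b = buckets[cheapest];
        std::size_t best = 0;
        for (std::size_t i = 1; i < b.size(); ++i)
            if (b[i].fCost < b[best].fCost) best = i;
        Node n = b[best];
        b[best] = b.back();                  // swap-and-pop removal
        b.pop_back();
        while (cheapest < NumBuckets && buckets[cheapest].empty()) ++cheapest;
        return n;
    }
};
```

Because every cost in bucket *i* is lower than every cost in bucket *i+1*, popping from the cheapest bucket still returns nodes in globally sorted order.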
Another important factor is the size of the nodes. Ideally you want each node to be as small as possible. To achieve this, I packed neighbor and parent information into bit flags. This added only a single byte to each node while still encoding 256 possibilities.
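One plausible packing, sketched below, uses the byte as eight neighbor-validity bits, one per direction on the grid, which is exactly where the 256 possibilities come from; the parent could similarly be stored as a 3-bit direction. The exact layout here is an assumption, not the project's actual one.

```cpp
#include <cassert>
#include <cstdint>

// Eight grid directions, clockwise from north.
enum Dir : uint8_t { N = 0, NE, E, SE, S, SW, W, NW };

struct PackedNode {
    uint8_t neighbors = 0;   // 8 bits -> 256 possible neighbor subsets

    void setNeighbor(Dir d)       { neighbors |= uint8_t(1u << d); }
    void clearNeighbor(Dir d)     { neighbors &= uint8_t(~(1u << d)); }
    bool hasNeighbor(Dir d) const { return (neighbors >> d) & 1u; }
};
```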
After implementing the above, the average A* search took approximately 2000 microseconds (μs), or 2 milliseconds (ms), to complete. While I was extremely happy with this time, I wanted to see if I could push it even further.
After profiling the tests, I realized that resetting nodes was a costly operation. To improve upon this, I added visited nodes to a list and would reset only the nodes in that list. The time this saved was impressive, cutting the average A* search time in half to approximately 1000 μs.
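The technique looks roughly like this minimal sketch (names and the node layout are assumptions): rather than clearing the whole grid between searches, record which nodes were touched and reset only those.

```cpp
#include <cassert>
#include <vector>

struct GridNode { int gCost = 0; bool visited = false; };

struct Grid {
    std::vector<GridNode> nodes;
    std::vector<int> touched;            // indices visited during this search

    explicit Grid(int size) : nodes(size) {}

    void visit(int i, int g) {
        if (!nodes[i].visited) {
            nodes[i].visited = true;
            touched.push_back(i);        // remember it for the cheap reset
        }
        nodes[i].gCost = g;
    }

    // Reset cost is proportional to nodes actually visited,
    // not to the total size of the grid.
    void resetVisited() {
        for (int i : touched) nodes[i] = GridNode{};
        touched.clear();
    }
};
```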
I wanted to see if I could take this even further by eliminating the need to reset nodes altogether. I did this by increasing the size of each node slightly, adding an iteration count to it. When visiting a node, I could then check its iteration count to determine whether the node had already been visited during the current search. This showed an improvement of about 50 μs in the test results, but overall it had a minor effect.
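A sketch of the iteration-count trick, under the same assumed node layout: each node stamps the ID of the search that last touched it, so bumping a single counter "resets" every node at once.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct StampedNode { int gCost = 0; uint32_t searchId = 0; };

struct StampedGrid {
    std::vector<StampedNode> nodes;
    uint32_t currentSearch = 0;

    explicit StampedGrid(int size) : nodes(size) {}

    // Incrementing the counter invalidates every stale stamp in O(1);
    // no per-node reset work is needed between searches.
    void beginSearch() { ++currentSearch; }

    bool isVisited(int i) const { return nodes[i].searchId == currentSearch; }

    void visit(int i, int g) {
        nodes[i].searchId = currentSearch;
        nodes[i].gCost = g;
    }
};
```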
Now that my data structures were optimized, it was time to turn to the search algorithm and see how I could optimize that. The first step I took was to eliminate backtracking. Once you move to a new node, you do not need to visit the previous node again. Normally you search all open nodes connected to a node, but by always moving towards the goal and factoring out backtracking you can eliminate 3-5 neighboring nodes from a single node's neighbor search. This surprised me with a 200 μs speed improvement.
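One way to express this pruning, assuming the 8-direction encoding above (the exact set of skipped neighbors is an assumption): given the direction we arrived from, mask out the node behind us and the diagonals beside it.

```cpp
#include <cassert>
#include <cstdint>

// 8 directions, clockwise from north (0 = N, 4 = S, etc.).
constexpr uint8_t oppositeOf(uint8_t d) { return (d + 4) & 7; }

// Bitmask of neighbors still worth expanding after arriving via `arrival`:
// drop the direction we came from plus the two diagonals flanking it.
constexpr uint8_t forwardMask(uint8_t arrival) {
    uint8_t back = oppositeOf(arrival);
    uint8_t skip = uint8_t((1u << back)
                 | (1u << ((back + 1) & 7))
                 | (1u << ((back + 7) & 7)));
    return uint8_t(~skip);
}
```

ANDing this mask with a node's neighbor bitmask trims the backward neighbors before the expansion loop ever sees them.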
The final improvement I made was creating a function lookup table. Instead of iterating over all neighbors and skipping the invalid ones, it turns out it is faster to jump to a function that evaluates only the valid neighbors. To do this, I wrote a separate generator that used the concept of subsets to produce the 256 required functions for the lookup table. This improvement was even better than eliminating backtracking, yielding a 300 μs reduction in search time.
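One way to generate such a table in C++, sketched below, is to instantiate one function template per neighbor bitmask so the compiler bakes the subset into each function; the original generator and function signature are not shown in the source, so this is an illustrative stand-in.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Simplified signature: the real functions would expand A* neighbors;
// here each one just reports how many neighbors it evaluated.
using ExpandFn = int (*)();

// One function per subset of the 8 neighbors. Because Mask is a
// compile-time constant, the compiler can strip the invalid branches.
template <uint8_t Mask>
int expand() {
    int evaluated = 0;
    for (int d = 0; d < 8; ++d)
        if (Mask & (1u << d)) ++evaluated;   // stand-in for real neighbor work
    return evaluated;
}

template <std::size_t... I>
constexpr std::array<ExpandFn, 256> makeTable(std::index_sequence<I...>) {
    return {{ &expand<uint8_t(I)>... }};
}

// 256 functions, indexed directly by a node's neighbor bitmask.
const auto table = makeTable(std::make_index_sequence<256>{});
```

At search time, the dispatch is a single indexed call: `table[node.neighbors]()`.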
At this point I had spent about 30 hours on the project and had gone from a 2000 μs average down to a 500 μs average (0.5 ms), and I was satisfied with my accomplishment. The effort paid off, allowing me to win the speed contest.