Howfar It Is from Here to There

My redistricting algorithm selects census blocks based on their distance from the district center. The coordinates of the district center and of the next census block are given as latitude and longitude in integer ten-millionths of a degree. Of the many methods for calculating that distance, which is best?

A first method assumes the earth is flat and calculates a hypotenuse with the Pythagorean equation: the square root of the sum of the squares. The calculation is reasonably quick, but it is accurate only for very small distances.
A second method assumes the earth is a sphere and uses spherical trigonometry. Because the answer comes from taking the arccosine, the accuracy is impaired for very small distances.
A third method, the kind my grandfather must have used, is based on haversines, which were developed precisely to avoid that impairment at small distances. It also assumes the earth is a sphere and uses spherical trigonometry. Unfortunately, the greater complexity of this method means more computer time for each calculation. (Both spherical methods are sketched below.)
A fourth method, used by the Federal Communications Commission in 47 CFR 73.208, multiplies the horizontal and vertical distances by correction factors, then calculates the hypotenuse as the square root of the sum of the squares. The FCC specifies that this method is good only up to 475 kilometers. I had hoped that this distance limit was imposed because of the distances of interest for signal propagation, and that the method was actually good for greater distances. After playing with the numbers, that hope is dimmed.
Other methods exist today that are splendidly accurate, but iterative and complex, meaning too much computer time would be needed for my purposes.
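
For the record, here is a bare-bones sketch of the second and third methods on a sphere. The 6371 kilometer radius is just the usual mean value, and this is not the code Sandboxwalls actually runs.

   ! Sketch of methods two and three on a sphere of radius 6371 km.
   ! The arguments are in radians, not the integer ten-millionths of a
   ! degree that Sandboxwalls stores.
   module sphere_distance
     use iso_fortran_env, only: dp => real64
     implicit none
     real(dp), parameter :: rearth = 6371.0_dp      ! mean Earth radius, km
   contains
     ! Method two: spherical law of cosines; loses accuracy when the
     ! arccosine argument is very close to 1, i.e. at tiny distances.
     function dist_arccos(lat1, lon1, lat2, lon2) result(d)
       real(dp), intent(in) :: lat1, lon1, lat2, lon2
       real(dp) :: d, c
       c = sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2)*cos(lon2 - lon1)
       d = rearth*acos(min(1.0_dp, max(-1.0_dp, c)))
     end function dist_arccos

     ! Method three: haversine; well conditioned for small distances.
     function dist_haversine(lat1, lon1, lat2, lon2) result(d)
       real(dp), intent(in) :: lat1, lon1, lat2, lon2
       real(dp) :: d, a
       a = sin((lat2 - lat1)/2)**2 + cos(lat1)*cos(lat2)*sin((lon2 - lon1)/2)**2
       d = 2.0_dp*rearth*asin(sqrt(a))
     end function dist_haversine
   end module sphere_distance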

To meet my goal of quickness, I tried a different method similar to the fourth method. Instead of calculating the correction factors by a trigonometric Taylor series, use a lookup table. For every positive integer degree of latitude, the table lists a horizontal correction factor and a vertical correction factor. To improve accuracy, simple linear interpolation is used. I fudged together a table for a distance range up to 2000 kilometers. The resulting error for this method was just under 1000 parts per million. The biggest contribution to error seems to be the difference between the assumed corrected shape and the actual shape of planet Earth.
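
Here is a minimal sketch of the lookup-table idea. The table entries below are crude spherical stand-ins filled in by a loop, not my fudged-together factors, and the name dist_table is only for illustration.

   ! Sketch of the lookup-table method.  Inputs are integer ten-millionths
   ! of a degree (northern hemisphere assumed); output is kilometers.
   function dist_table(lat1, lon1, lat2, lon2) result(d)
     use iso_fortran_env, only: dp => real64, int64
     implicit none
     integer(int64), intent(in) :: lat1, lon1, lat2, lon2
     real(dp) :: d
     real(dp), parameter :: deg2rad = 0.0174532925199433_dp
     real(dp) :: tabx(0:90), taby(0:90)   ! km per degree at each whole degree of latitude
     real(dp) :: lat, frac, kx, ky, dlon, dlat
     integer :: i, j
     do j = 0, 90                         ! stand-in values; a real table would be calibrated
        tabx(j) = 111.32_dp*cos(real(j,dp)*deg2rad)   ! east-west km per degree
        taby(j) = 111.13_dp                           ! north-south km per degree
     end do
     lat = 0.5e-7_dp*real(lat1 + lat2, dp)            ! mean latitude in degrees
     i = min(int(lat), 89)                            ! whole degrees, leaving room to interpolate
     frac = lat - real(i, dp)
     kx = tabx(i) + frac*(tabx(i+1) - tabx(i))        ! simple linear interpolation
     ky = taby(i) + frac*(taby(i+1) - taby(i))
     dlon = 1.0e-7_dp*real(lon2 - lon1, dp)
     dlat = 1.0e-7_dp*real(lat2 - lat1, dp)
     d = sqrt((kx*dlon)**2 + (ky*dlat)**2)            ! FCC-style flat hypotenuse
   end function dist_table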

Perhaps trading off some quickness to gain accuracy would be a good idea. If most distances are short, then the lookup table could be used for quickness. For longer distances, the second method, taking the arccosine, could be used to maintain accuracy. To decide whether a distance is short or long, get a really rough distance based on my blog of October 2010. Since the correction factors in the current table were meant to cover up to 2000 kilometers, a new table is needed, calibrated to shorter distances. How much shorter depends on the crossover range between short and long, which should also be determined carefully for the best trade-off between quickness and accuracy.
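
A sketch of that crossover, reusing the two sketches above. The 300 kilometer crossover and the quick overestimate used to pick a branch are placeholders, not numbers I have settled on.

   function howfar(lat1, lon1, lat2, lon2) result(d)
     use sphere_distance, only: dist_arccos
     use iso_fortran_env, only: dp => real64, int64
     implicit none
     integer(int64), intent(in) :: lat1, lon1, lat2, lon2
     real(dp) :: d, rough
     real(dp), parameter :: deg2rad = 0.0174532925199433_dp
     real(dp), parameter :: crossover = 300.0_dp     ! km; placeholder, to be tuned
     real(dp), external :: dist_table
     ! really rough overestimate: one degree is never more than about 112 km;
     ! the quick estimate from my October 2010 blog would slot in here instead
     rough = 112.0_dp*1.0e-7_dp*real(abs(lat2 - lat1) + abs(lon2 - lon1), dp)
     if (rough < crossover) then
        d = dist_table(lat1, lon1, lat2, lon2)        ! short hop: lookup table
     else                                             ! long haul: arccosine method
        d = dist_arccos(1.0e-7_dp*deg2rad*real(lat1,dp), 1.0e-7_dp*deg2rad*real(lon1,dp), &
                        1.0e-7_dp*deg2rad*real(lat2,dp), 1.0e-7_dp*deg2rad*real(lon2,dp))
     end if
   end function howfar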

The problem then is to write a new FORTRAN howfar function with a crossover and to create a new table of correction factors. To save time and improve accuracy, one way to solve the problem would be to write a computer program specific to the task. The technique of writing a robot computer program whose purpose is computer programming is not new. Although the programming so far is in FORTRAN, another computer language such as BASIC could be used for the robot program. I prefer interpreted to compiled, and a high-level language to a lower-level one. Presumably, though, it would be best for other people if FORTRAN were also used for the robot program.

The robot computer program could set the vertical distance to zero. The actual distance can be computed, for example, by modifying http://geographiclib.sourceforge.net/html/Fortran/index.html. Substituting into the FCC equation then yields a value for the horizontal correction factor. That factor is similar in shape to the cosine, and the interpolation line lies below the curve, so the error due to the cosine curvature is always positive. Adding slightly to the horizontal correction factor can reduce the error magnitude by letting that curvature error swing both plus and minus. The horizontal correction factor should also be rounded to match the actual accuracy. Accuracy calculations could then set the crossover range.
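
What the robot would do for each whole degree of latitude might look something like this. The reference distance here is the spherical arccosine from the earlier sketch, standing in for a geodesic routine such as geographiclib, and the nudging and rounding described above are left out.

   ! Sketch of the table-building robot for one row per degree of latitude:
   ! hold latitude fixed, step out a known longitude span, get a reference
   ! distance, and back out km per degree of longitude.
   program build_one_row
     use sphere_distance, only: dist_arccos
     use iso_fortran_env, only: dp => real64
     implicit none
     real(dp), parameter :: deg2rad = 0.0174532925199433_dp
     real(dp), parameter :: dlon = 0.5_dp        ! half-degree test span; arbitrary
     real(dp) :: lat, ref, kx
     integer :: ideg
     do ideg = 30, 45                            ! whatever latitudes the state spans
        lat = real(ideg, dp)
        ref = dist_arccos(lat*deg2rad, 0.0_dp, lat*deg2rad, dlon*deg2rad)
        kx  = ref/dlon                           ! horizontal correction factor, km per degree
        print '(i4,f10.4)', ideg, kx
     end do
   end program build_one_row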

The reality, of course, is that because of limited time I will simply ignore this problem. I am still ignoring writing a cake-cutter algorithm, and that has higher priority. The summer of California’s dangerous drought and wicked wildfires is finally winding down, and Silicon Valley actually had light rain. Time to prepare for whatever winter may bring.

Split from Splitline Redistricting Algorithm

Splitting off from Splitline Algorithm for Redistricting.

Back in August 2010, I blogged about how Sandboxwalls could use the Splitline Algorithm to locate the starting seeds. I had hoped that someone would be upgrading the Splitline Algorithm software, or even better, that someone would help me create software that would locate the starting seeds for Sandboxwalls or Voronoi diagram redistricting. Splitline seems to have died. I have started once again to consider creating an algorithm to determine where to place the starting seeds. Splitline Algorithm nicely divides a state into areas, and one seed could be planted in the center of each area. Although I understand the basic concept of the Splitline Algorithm, I still don’t understand the details of how it should work.

Sandboxwalls will cut up a state into horizontal stripes. Each stripe has the same thickness in integer ten-millionths of a degree of latitude. To initially split the state into two pieces, it is simple to start from the bottom of the State, move north, and accumulate stripes into the newly created area until the desired population is exceeded. It is not difficult to start from the West and move a vertical line eastward until the desired population is reached in the newly created area west of the vertical line. The shorter of the horizontal line or vertical line can then be chosen to split the State into two areas. The Splitline Algorithm would have chosen the shortest from among many dividing lines at many angles of slope instead of merely a pair.
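
The horizontal half of that split is a short loop. The sketch below assumes the stripe populations have already been tallied from south to north; the names are made up.

   subroutine split_by_latitude(stripe_pop, nstripes, target_pop, cut)
     implicit none
     integer, intent(in)  :: nstripes
     integer, intent(in)  :: stripe_pop(nstripes)   ! population of each stripe, south to north
     integer, intent(in)  :: target_pop             ! population wanted below the dividing line
     integer, intent(out) :: cut                    ! number of stripes below the line
     integer :: k, total
     total = 0
     do k = 1, nstripes
        total = total + stripe_pop(k)
        if (total >= target_pop) exit               ! desired population just exceeded
     end do
     cut = min(k, nstripes)
   end subroutine split_by_latitude

The same loop run west to east over vertical stripes gives the other candidate, and the shorter dividing line wins.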

I hope to justify this drastic simplification of Splitline by noting that I am just trying to locate the starting seeds instead of creating a final map. By using only horizontal and vertical dividing lines, the required software is greatly simplified. The resulting areas from a split would be convenient boxes instead of complicated polygons. After all, it seems that I will have to spend the time to create the software, and time is limited. I may need a new name for such a drastically degraded algorithm since it would not meet the specifications for Splitline.

Conceptually, the final split of a remaining area could be done by choosing from among more than just horizontal and vertical dividing lines. The vertical line could accumulate triangles instead of vertical stripes. The choice would then be between a triangle that increases the tilt of the dividing line and a triangle that decreases it. Additionally, dividing lines could start at various angles of slope instead of merely horizontal and vertical. The question is whether this concept is worth the extra time needed to write the extra FORTRAN, as well as the extra time spent by Sandboxwalls during execution.

Next blog, I will again examine computing the distance between points on the surface of planet earth. Perhaps I should have a robot write the software.

The Input files

For the computer to do redistricting, the computer must receive certain data. Sandboxwalls used three data files as input. The first file contained the census block data. The second file contained the borders of the State as a polygon. The third file contained the information on what census blocks were immediately adjacent to any given census block.

The first file was the bloated census data file from the Census Bureau. The only information needed by Sandboxwalls from that file was the census block data. The data needed for each census block was only the following: the latitude, the longitude, the county, the city, and of course the population. Also included were the census tract and census block group, but those weren’t really needed, so Sandboxwalls won’t use them in the future. I am going to add the possibility of a neighborhood or district or parish or ward or whatever else you would call the subdivision of a city. As I suggested in my last blog, the use of an extractor program will considerably reduce the size of the file that replaces the first file.

The second file was a database file from the Internet that gave the borders of the State. Digital decay takes its toll, and I can no longer find such files. Back then, I had considered the possibility of using the latitude and longitude of the census blocks to re-create a State border. One reason I didn’t do that was the existence of the database file for each State. Also, the actual State border would inevitably be different from the result achieved by an algorithm that would re-create the State border from the census blocks. I am now writing such an algorithm. At least that will make the second file superfluous. The algorithm will be included in Sandboxwalls.

The third file on adjacent blocks was made from various files available from the Census Bureau. Hopefully, some day others will become interested in creating such a file on adjacent blocks. The format I suggested in my blog of October 2011 still seems reasonable, but creating such a file hasn’t gotten any easier. It also hasn’t attracted any interest since then.

Meanwhile, there are other problems to consider, such as saving water during this drought here in Silicon Valley.

Forward with pain

I can’t resist the temptation of following the gerrymandering soap opera. Our politicians have provided way more than enough entertainment in their nothing-is-too-immoral quest for power over the ballot box results. One of my cynical observations is that when you get old, injuries no longer heal; instead they accumulate. A serious injury a couple of months ago left me unable to work. As frustrating as losing a paycheck is, worse is that I was too injured to play, so I couldn’t even enjoy the time off. Fortunately, I could still bang on a keyboard, and I suddenly found myself with time to do just that.
My redistricting software is buggy, incomplete, barely coherent, and generally not ready for prime time. I decided to start at the beginning and try to clean up the software. Computer power has continued to advance as predicted, but so has digital decay. The type of data supplied by the Census Bureau, as well as the format of that data, changes every decade. I suspect that is what impeded the use of the Splitline Algorithm on the 2010 census data.
I have concluded that I need a little software program that will extract the needed redistricting data from the bloated gerrymandering data. Sandboxwalls can be designed to use a consistent form of the redistricting data. The extractor program can be used to supply that redistricting data by extracting it from the gerrymandering data. That way Sandboxwalls can be the same for all decades. Theoretically, only a different extractor program will be needed for each decade.
Fortunately, I am now back to work.  Next blog I will consider what files are needed.

Summer is not enough to redistrict

      Summer went by really fast. I was busy enough with other problems. Since my blog doesn’t really have an audience, it is more like a diary to be reread someday.   A relative of mine peeked at it.

         I downloaded census and other data for Arizona. I compiled code I wrote in Fortran.  Sorting latitudes into bins seems to work, and I recommend that approach.

      Everybody insists that districts must be contiguous, but nobody publishes a table of adjacent census blocks.  Creating a table of adjacent census blocks was very time-consuming.  I tried to create a table with eight columns: columns 1 through 7 were for data, and column 8 was a pointer to more data in case a census block has more than seven sides.  Each row belongs to the LOGRECNO of a census block, and the data in the row are the LOGRECNOs of all adjacent census blocks; LOGRECNO fits in a four-byte integer, whereas the GEOID does not.  After the data, the next column in the row is a zero to signal the end of the data.  NaNs or undefineds are not allowed in the table.  I expected that few census blocks would have more than seven adjacent neighbors.
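
      Reading the neighbors back out of such a table would go roughly like this; the layout matches the description above, but the names are made up.

   subroutine list_neighbors(adj, row, neighbors, nfound)
     implicit none
     integer, intent(in)  :: adj(:,:)         ! adjacency table: one row per LOGRECNO, 8 columns
     integer, intent(in)  :: row              ! LOGRECNO of the block of interest
     integer, intent(out) :: neighbors(:)     ! collected neighbor LOGRECNOs
     integer, intent(out) :: nfound
     integer :: r, c
     nfound = 0
     r = row
     do while (r /= 0)
        do c = 1, 7
           if (adj(r,c) == 0) return           ! a zero signals the end of the data
           nfound = nfound + 1
           neighbors(nfound) = adj(r,c)
        end do
        r = adj(r,8)                           ! column 8 points to an overflow row (0 if none)
     end do
   end subroutine list_neighbors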

      Results were poor.  A few edges seem to have the same census block on both sides.  Also, some census blocks apparently had dozens of adjacent census blocks.  It is easier for me to believe these anomalies are errors in the census data than to admit that they are due to my buggy code.  It was a stretch to extract the necessary data and trapeze all the way from sides to faces to GEOIDs to LOGRECNOs.  The trapeze apparently dropped some data along the way.  Sandboxwalls juggled the data yet again to match the latitudes and longitudes in the bins before putting the data into an internal eight-column array for its use.

      It would have been nice if someone else had published such a table of adjacent census blocks.  The GIS companies, being well funded by the politicians, have staff available to extract this sort of data for their gerrymanders.  I suppose the excuse for nobody publishing such a table is that there aren’t that many people attempting to create software to do automated redistricting.  This person, however, needs all the help he can get.

     In Sandboxwalls, when one district’s balloon grows into another district’s balloon, vector forces are created.  The vector forces accumulate, but are all released at the same time.  Having the release at intervals of population increase seems to work better than having a release caused by various events.  It can consume a fair amount of time for the vectors to move the district centers to their new positions and re-create the arrays.  Wall height was determined such that the effective eccentricity of district to city to county radius was 1 to 2 to 3.  When the sand overflowed a wall, a new radius was created at the point of overflow.  I never got a complete result, but I did expand far enough that the population difference between all districts was under 1%.  Some districts expanded to their maximum radius.  I suspect that was not because the district was hemmed in by other districts, an eventuality for which I was prepared.  I am afraid it was because of too many holes in the adjacent census blocks data.  I ran out of time before I had a chance to investigate this.  I never did create a .CSV file for a 30-district map of Arizona.

      Time to forget about redistricting by computer, and get ready to help put on the upcoming consolidated elections November 8.  After that, well, summer vacation is over, so playtime is over.  Maybe next year if I can afford it.

Sorting the census blocks for redistricting

 

Another method for sorting the census data for redistricting has occurred to me. Back in ancient times, there were machines that could sort punch cards. See: http://en.wikipedia.org/wiki/IBM_80_series_Card_Sorters.  For example, cards could be sorted into ranges of latitude. The cards would be conveyed horizontally until they dropped into the proper bin below.  Today, the same sort of thing can be done in software, known as a radix sort. This could be faster than even the merge sort that I was considering, and the code could be more compact as well. I hadn’t considered radix sort seriously before, since potentially a huge memory would be required. I remember monolithic 2K-byte chips, so half-gigabyte memories are huge to me. Ancient FORTRAN required arrays to be dimensioned at the beginning of the program.  Today, modern FORTRAN allows calculating efficient dimensions for an array before allocating memory to it.  The old way would have been to use a linear array and then calculate the two-dimensional subscripts for that array.
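
A sketch of the card-sorter idea in modern FORTRAN, counting the occupancy of each latitude stripe and tagging each census block with its stripe; the stripe width and the names are placeholders.

   subroutine bin_by_latitude(lat, lat_south, stripe, bin_of, occupancy)
     use iso_fortran_env, only: int64
     implicit none
     integer(int64), intent(in) :: lat(:)        ! block latitudes, in the census integer units
     integer(int64), intent(in) :: lat_south     ! southern edge of the state, same units
     integer(int64), intent(in) :: stripe        ! stripe thickness, same units
     integer, intent(out) :: bin_of(size(lat))   ! stripe number assigned to each block
     integer, allocatable, intent(out) :: occupancy(:)  ! how many blocks landed in each stripe
     integer :: i, nbins
     nbins = int(maxval((lat - lat_south)/stripe)) + 1  ! size the bins before allocating
     allocate(occupancy(nbins))
     occupancy = 0
     do i = 1, size(lat)
        bin_of(i) = int((lat(i) - lat_south)/stripe) + 1  ! drop the card into its bin
        occupancy(bin_of(i)) = occupancy(bin_of(i)) + 1
     end do
   end subroutine bin_by_latitude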

 I intend to write code for a radix-type sort to separate the census blocks into stripes of latitude ranges.  The census blocks in each stripe can then be sorted by longitude.  Merge sort will hopefully work well enough for the stripes.  Other methods of sorting may be faster, but I doubt it is worth my time to code them. After all, the goal is redistricting by computer. I have already learned way too much about sorting, but help is not on the way.  I am still writing code and haven’t started debugging.  A compile now and then reveals tons of errors that are easily fixed.

bug is i=1 was shifted left a column

Decades ago, some academics tried computerized redistricting by growing districts from a seed.  Sandboxwalls also grows districts from a seed.  Even though that research was paid for by taxpayer dollars, the information is inaccessible, stuck up in ivory towers.  I imagine those experiments mostly failed, as researchers afterward pursued heuristic methods for redistricting instead.  Circles don’t fit together well, but I’m hoping my experiment with Sandboxwalls will do better.  Time will tell, but spare time is hard to come by.  I envy the several contestant teams in the state of Virginia.  Each team had over half a dozen undergraduates and a faculty advisor.  The good news is that slowly more people are doing redistricting.  Unfortunately, gerrymandering is still the clear winner.

Closs Enuf 4 Govment Work

  For redistricting, random processes should not be used.  Given identical census data and run on a compatible computer, the software should yield identical results.  That way all parties can verify that the results are not just something cooked up in a smoke-filled room.  If the results from different computers aren’t precisely identical, careful analysis will reveal the cause of the discrepancy.  This is better than an independent redistricting commission.  A commissioner is likely to know that San Francisco is liberal and Orange County is conservative.  Even dissecting the brain of a commissioner will not reveal whether that knowledge or some other knowledge improperly biased a decision.

 There has always been a trade-off in computing between accuracy and speed.  Double precision takes longer to calculate than single precision.  Consider finding the closer of two points to a center.  If the distances to the two points are almost identical, then greater precision is required to resolve which is closer.  If using 64-bit integers, the squares of each distance can be compared instead of the distances themselves.  Since I am using a borrowed six-year-old home computer, approximations and shortcuts will keep the running time reasonable.  There is always the danger that not being precise enough will result in an incorrect comparison.  A way to save the situation is to measure how nearly identical the distances are.  If they are too close to identical, the calculation can be redone with higher precision and without so many approximations.  Hopefully, this should rarely happen.  Still, it takes more time to measure how nearly identical they are, recalculation or not.  Until I actually run some software, I won’t know how long a time is needed for any of this.
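
 A sketch of that comparison with 64-bit integers; the tie margin is arbitrary and the function name is made up.

   function closer_point(x1, y1, x2, y2, margin) result(which)
     use iso_fortran_env, only: int64
     implicit none
     integer(int64), intent(in) :: x1, y1, x2, y2   ! corrected offsets of two points from the center
     integer(int64), intent(in) :: margin           ! squared-distance tolerance for "too close to call"
     integer :: which
     integer(int64) :: d1, d2
     d1 = x1*x1 + y1*y1                             ! squared distances; no square root needed
     d2 = x2*x2 + y2*y2
     if (abs(d1 - d2) <= margin) then
        which = 0                                   ! too close to call; redo with more precision
     else if (d1 < d2) then
        which = 1                                   ! first point is closer
     else
        which = 2                                   ! second point is closer
     end if
   end function closer_point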

 There is no cure for the worst-case scenario.  More cosine terms could always be used, and more decimal places of Pi could be added without limit.  I didn’t like the idea of a hidden MSB, but IEEE 754 is crucial for portability of software among computers.  Still, there comes a point where the results should be accepted if they were fairly generated.  In gambling lotteries, we analyze the fairness of the selection process, not the precision of the results.  In gerrymandering, the end results justify any shady means.

It is time for me to prepare for the incoming data.  No, not the census block data; instead, my W-2’s and 1099’s must be collected.  My taxes are unpredictable, as I haven’t figured out yet what Congress has done to us taxpayers this time.  I will have to do my taxes before I can attempt to apply Sandboxwalls to Congress.
 

Some Census data will be released next month. I doubt I will be ready in time.

Out of Sorts with Redistricting

      My ideas for redistricting with a Sandboxwalls program involve a lot of data sorting.  First, the census blocks of a state are sorted by latitude.  However, states are two-dimensional, not one-dimensional.  Consider a television picture.  The picture is made up of horizontal stripes.  Using that concept, the latitude range of the State is divided up into stripes.  The census blocks within each stripe are then sorted by longitude.  A box is placed on the picture.  The center of the box is where a district will grow outward in all directions from a dot.  The box then contains segments from a number of horizontal stripes.  All of the census blocks within the box must be sorted by distance from the center dot.  That way, the district can grow by acquiring the closest census blocks in order of distance.  See my last blog on the square root.  To keep all of the districts at about the same population, a tally records the current population of each and every growing district.  The district with the lowest population grows by acquiring a census block.  To know which district has the lowest population, the tally must be kept sorted by population.  To correlate PL 94-171 data with Shapefile data, they can both be sorted to match.
      To get this far has involved a lot of sorting.  If the census blocks are delivered by the Census Bureau in some kind of order, that can either help or severely hinder, depending on the order versus the chosen sorting algorithm.  I would call the order apparently being used “thoughtlessly random”, and chose accordingly.  How to implement all this sorting is a job for a software engineer.  Unfortunately, I have never found a software engineer interested in playing with my sandboxwalls.
      After some thought, I realized that the classic bubble sort algorithm should work well for sorting the tally each time.  But bubble sort would be horribly time-consuming for sorting a state’s worth of census blocks by latitude, and a different sorting algorithm is clearly needed there.  I use a binary search insertion sort algorithm for that.
      Actually, there isn’t exactly a best sorting algorithm for a given sorting problem.  My analysis of the sort-the-box problem resulted in a solution that is a combination of algorithms.  Each segment in the box is split into two pieces.  Each piece becomes a list, which is then bubble sorted.  The two lists are then merge sorted into a single list.  Each merge-sorted list is then binary search insertion sorted, as a batch of census blocks, into the final result.  This requires more than the usual amount of code, of course, but it is the quickest way I know to get the result.  I hope it will work.
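
      The middle step, merging two already-sorted pieces into one list, is the kind of thing I mean.  In a sketch like the one below the arrays hold only distances; the real code has to carry the census block indices along with them.

   subroutine merge_two(a, b, c)
     implicit none
     integer, intent(in)  :: a(:), b(:)              ! two pieces, each already sorted ascending
     integer, intent(out) :: c(size(a) + size(b))    ! the merged result
     integer :: i, j, k
     i = 1
     j = 1
     do k = 1, size(c)
        if (j > size(b)) then                        ! b is used up; take from a
           c(k) = a(i);  i = i + 1
        else if (i > size(a)) then                   ! a is used up; take from b
           c(k) = b(j);  j = j + 1
        else if (a(i) <= b(j)) then                  ! take whichever head is smaller
           c(k) = a(i);  i = i + 1
        else
           c(k) = b(j);  j = j + 1
        end if
     end do
   end subroutine merge_two
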
      The public library eventually got me the book I put on hold.  The title is “The Art of Computer Programming, Volume 3” by Donald E. Knuth.  I learned several things about sorting from this book.  Much of what was discussed is over my head, as the subject of sorting has depth.  I already knew that I needed help, but the book is the best help I have been able to get.  Also, it seems that some version of the merge sort algorithm is more appropriate for quickly sorting the census blocks by latitude than what I have now.
      The question is: Do I have enough time to go back and rewrite the FORTRAN, or would it be better to spend the time attempting to debug what I have now?  The reality is that I still have no idea how to extract the data needed for redistricting from the big knotted clump of data supplied by the Census Bureau for gerrymandering.  I may be limited to rectangular states if I am unable to obtain the data or figure out how to use the TIGER Shapefiles to limit the length of horizontal stripes to within a state border.
      The Holidays are almost over, and soon I will be back to earning a living instead of playing with FORTRAN programming for redistricting.  Happy New Year seems like a wish, not a reality in this economy.  When I get a chance, the next blog will be about accuracy.

Faster than a speeding square root

In the data from the Census Bureau, the locations of the census blocks are given by latitude and longitude. Instead of degrees, minutes, and seconds, the latitude and longitude are given in millionths of a degree. Fortunately, these are conveniently expressed as 32-bit integers. Finding the geographical distance between two points usually means having the computer employ the Pythagorean theorem. But if fast is desired and sloppy is acceptable, the square root can be avoided. The first method is to simply add the horizontal “X” distance to the vertical “Y” distance to yield an approximate hypotenuse. This first method is three times faster than the Pythagorean theorem, but the error is up to 42%. One reason for the great speed is that there is no need for a squared term, which could overrange the integer. Another reason is that the square root requires floating-point values.

 For Sandboxwalls in the early stages, all of the census blocks near the district center will be included in the district anyway. The order in which they are included is unimportant. In spite of the sloppiness, the result is the same. In the late stages of the Sandboxwalls algorithm, such sloppiness would cause some census blocks to wind up in a different district. Although the Pythagorean theorem is needed in the end, the Sandboxwalls algorithm can be sped up by not using the square root in the beginning. The trade-off is how much speed is gained for an allowed sloppiness.

 A second method works as follows: take the shorter of “X” and “Y”, and cut it in half. Add that half to the longer of “X” and “Y” to yield an approximate hypotenuse. Although it takes a little more computer time, the error is 0% to 12%. Although taking half is a division operation, an optimizing compiler will accomplish it quickly with a shift operation. Keeping in mind the need to use shift operations, method three subtracts an additional term: besides the half of the shorter of “X” and “Y”, a sixteenth of the shorter is also subtracted. Method three is still almost twice as fast as using the Pythagorean theorem, and the error is 0% to 9.2%.

Method four subtracts an eighth instead of a sixteenth. The speed is the same, but the error becomes about -2.8% to 7%. Method five is an attempt to get the best of both method three and method four. As the code below shows, it uses method four until the shorter distance exceeds 7/8 of the longer distance (a large angle), and then switches to method three. The error for method five is 0% to 7%, but the price for the greater accuracy is much less speed.

 I wondered if the quick value obtained by one of these methods could be used as the initial value in the Newton-Raphson algorithm for square roots. Unfortunately, that approach consumed more computer time in Fortran than the standard optimized Pythagorean theorem.
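
For the record, the experiment looked roughly like this; the sketch assumes the seed is never below the true hypotenuse, which the method-five estimate guarantees.

   function nr_hypot(ix, iy, seed) result(h)
     use iso_fortran_env, only: int64
     implicit none
     integer(int64), intent(in) :: ix, iy    ! legs of the right triangle
     integer(int64), intent(in) :: seed      ! quick estimate of the hypotenuse, e.g. from method 5
     integer(int64) :: h, s, x, xnew
     s = ix*ix + iy*iy
     if (s == 0) then
        h = 0
        return
     end if
     x = max(seed, 1_int64)
     do
        xnew = (x + s/x)/2                   ! Newton-Raphson step toward sqrt(s)
        if (xnew >= x) exit                  ! no further improvement; x is the integer square root
        x = xnew
     end do
     h = x
   end function nr_hypot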

Here is a chart of my results:

Method 1: long + short integers                     78 time units    error:  0%  to  42%
Method 2: long + half short integers                94 time units    error:  0%  to  12%
Method 3: long + half short - sixteenth short      156 time units    error:  0%  to  9.2%
Method 4: long + half short - eighth short         156 time units    error: -3%  to   7%
Method 5: long + switch sixteenth/eighth           203 time units    error:  0%  to   7%
Pythagorean: single-precision square root          281 time units    assumed accurate

This is the Fortran code cut from a test program using method 5:

       ! i = corrected horizontal distance, j = corrected vertical distance
       ! (both assumed non-negative here); kx and ky are the correction
       ! factors, and mz and sf mark the two test points (mz = Martinez)
       i=kx*(longmz-longsf);  j=ky*(latmz-latsf)
       if(i<j) then                     ! i is the shorter distance
          if(8*i<7*j) then              ! 7/8 ratio chosen since 8=2**3
               id=j+i/2-i/8             ! method 4: subtract an eighth
          else
               id=j+i/2-i/16            ! method 3: subtract a sixteenth
          endif
       else                             ! j is the shorter distance
          if(7*i>8*j) then
               id=i+j/2-j/8
          else
               id=i+j/2-j/16
          endif
       endif

Another time-consuming process is sorting the census blocks. I’ll consider that in my next blog.

Make my districts contiguous please

In redistricting, contiguous is a fancy word meaning the district is one lump without any loose flakes floating around. If a district includes part of a county, I see  no reason for that district to have little pieces of that county that are unconnected with the main part of the district. This assumes that the county was contiguous. Consider a city that is not contiguous, since it has small unconnected parcels of land. If a district includes part of that city, then perhaps that district has a valid excuse to not be contiguous if that district includes some of those small unconnected parcels of land.

For sandboxwalls to keep districts contiguous, before adding a particular census block to the district, it must  confirm that the particular census block is adjacent to the district. This could be done by checking a contiguous blocks table with records as follows:
The first integer identifies the particular census block. The next integers identify the census blocks adjacent to that particular census block. The rest of the record is padded out with zeros.  Every census block would have a record, and the records would be in a searchable order.  Such a table would allow the software to verify that no district has unconnected fragments.
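
The check itself would be small. Here is a sketch, assuming the record layout just described and a membership flag for each census block; the names are made up.

   function touches_district(record, in_district) result(touches)
     implicit none
     integer, intent(in) :: record(:)         ! one row of the contiguous-blocks table
     logical, intent(in) :: in_district(:)    ! membership flag for every census block
     logical :: touches
     integer :: k
     touches = .false.
     do k = 2, size(record)                   ! entry 1 is the block itself
        if (record(k) == 0) exit              ! zero padding marks the end of the neighbors
        if (in_district(record(k))) then
           touches = .true.                   ! at least one neighbor is already in the district
           return
        end if
     end do
   end function touches_district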

Such a table could I presume be constructed from data in the census TIGER files. I looked at the following website.

http://www.census.gov/geo/www/tiger/tgrshp2008/ats2008ch3.txt

3.9.1.1  All Lines Shapefile Record Layout

The shapefile name is: tl_2008__edges.shp

The shapefile is county-based.

The following is the shapefile’s attribute table layout:

Field      Length   Type      Description
STATEFP    2        String    Current state FIPS code
COUNTYFP   3        String    Current county FIPS code
TLID       10       Integer   Permanent edge ID
TFIDL      10       Integer   Permanent face ID on the left of the edge
TFIDR      10       Integer   Permanent face ID on the right of the edge

When they say edge, I presume they mean a segment of the polygon that is the perimeter of a census block. I have no idea how to take the permanent face ID and get a latitude and longitude of that census block. The census files are usually massive and are geared for use by States and large companies, not hobbyists.  I have not attempted to download any of them. Having such a table is important enough to me that, if desperate, I could create a fake table listing a certain number of the nearest census blocks, even if they weren’t adjacent. Nearest could be determined by using the latitude and longitude of the census blocks.  Hopefully someone can give me clues on how to construct an Adjacent Blocks Table.

To find the distance from a district center to the census block, do the following:
Subtract the longitude of the district center from the longitude of the census block to get the horizontal distance X. Next, subtract the latitude of the district center from the latitude of the census block to get the vertical distance Y.  Since X and Y are on planet Earth instead of a flat surface, they must be multiplied by correction factors. Calculating the correction factors for every distance would require cosines, so to save computation time Sandboxwalls does not do that. Instead, the correction factors are calculated once for each district center, and those are used repeatedly as an approximation. Now take the corrected X and Y and use the Pythagorean theorem: the hypotenuse is equal to the square root of the sum of the squares of X and Y.
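
Put together, the recipe is only a few lines. Here is a sketch, with kx and ky standing for the correction factors already computed for this district center; the names are made up.

   function block_distance(lat_c, lon_c, lat_b, lon_b, kx, ky) result(d)
     use iso_fortran_env, only: dp => real64
     implicit none
     integer, intent(in)  :: lat_c, lon_c     ! district center, integer latitude and longitude
     integer, intent(in)  :: lat_b, lon_b     ! census block, same integer units
     real(dp), intent(in) :: kx, ky           ! correction factors computed once for this center
     real(dp) :: d, x, y
     x = kx*real(lon_b - lon_c, dp)           ! corrected horizontal distance
     y = ky*real(lat_b - lat_c, dp)           ! corrected vertical distance
     d = sqrt(x*x + y*y)                      ! Pythagorean hypotenuse
   end function block_distance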

Unfortunately, that means finding every distance requires a square root. To save computation time, in my next blog I will try a sloppy but fast way to find the square root.
