The Input files

For a computer to do redistricting, it must receive certain data. Sandboxwalls used three data files as input. The first file contained the census block data. The second file contained the borders of the State as a polygon. The third file contained the information on which census blocks were immediately adjacent to any given census block.

The first file was the bloated census data file from the Census Bureau. The only information Sandboxwalls needed from that file was the census block data, and for each census block only the following: the latitude, the longitude, the county, the city, and of course the population. The census tract and census block group were also included, but those weren't really needed, so Sandboxwalls won't use them in the future. I am going to add the possibility of a neighborhood or district or parish or ward or whatever else you would call the subdivision of a city. As I suggested in my last blog, the use of an extractor program will considerably reduce the size of the file used to replace the first file.
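As a sketch of what such an extractor might look like (in Python for illustration; the column names here are hypothetical stand-ins, not the real Census Bureau field names):

```python
import csv
import io

def extract_blocks(raw_file):
    """Pull only the fields Sandboxwalls needs from a census dump:
    latitude, longitude, county, city, and population.
    Every other column is discarded."""
    blocks = []
    for row in csv.DictReader(raw_file):
        blocks.append({
            "lat": int(row["LAT_MICRODEG"]),   # hypothetical column names,
            "lon": int(row["LON_MICRODEG"]),   # not the real Census fields
            "county": row["COUNTYFP"],
            "city": row["PLACEFP"],
            "pop": int(row["POP100"]),
        })
    return blocks

# Tiny demonstration with fabricated data:
sample = io.StringIO(
    "LAT_MICRODEG,LON_MICRODEG,COUNTYFP,PLACEFP,POP100,TRACT\n"
    "33448377,-112074037,013,04000,42,9999\n"
)
blocks = extract_blocks(sample)
print(blocks[0]["pop"])   # 42 -- the bloat (here, TRACT) is gone
```

The point is only that the extractor, not Sandboxwalls, absorbs each decade's format changes.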

The second file was a database file from the Internet that gave the borders of the State. Digital decay takes its toll, and I can no longer find such files. Back then, I had considered the possibility of using the latitudes and longitudes of the census blocks to re-create a State border. One reason I didn't do that was the existence of a database file for each State. Also, the actual State border would inevitably differ from the result of an algorithm that re-creates the State border from the census blocks. I am now writing such an algorithm, which will at least make the second file superfluous. The algorithm will be included in Sandboxwalls.
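As an illustration of the idea (a Python sketch; a real state border is not convex, so the actual algorithm would have to do much better than this), the crudest possible re-created border is a convex hull of the block coordinates:

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b; sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: build lower and upper hulls of the points,
    returned counterclockwise. Interior points drop out of the 'border'."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Fabricated block coordinates: the interior points vanish from the hull.
pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(convex_hull(pts))   # [(0, 0), (4, 0), (4, 4), (0, 4)]
```

A convex hull cuts straight across every concave stretch of a real border, which is exactly why the re-created border would differ from the actual one.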

The third file on adjacent blocks was made from various files available from the Census Bureau. Hopefully, some day others will become interested in creating such a file on adjacent blocks. The format I suggested in my blog of October 2011 still seems reasonable, but creating such a file hasn’t gotten any easier. It also hasn’t attracted any interest since then.

Meanwhile, there are other problems to consider, such as saving water during this drought here in Silicon Valley.

Forward with pain

I can’t resist the temptation of following the gerrymandering soap opera. Our politicians have provided way more than enough entertainment in their nothing-is-too-immoral quest for power over the ballot box results. One of my cynical observations is that when you get old, injuries no longer heal; instead they accumulate. A serious injury a couple of months ago left me unable to work. As frustrating as losing a paycheck is, worse is that I was too injured to play, so I couldn’t even enjoy the time off. Fortunately, I could still bang on a keyboard, and suddenly found myself with time to do just that.
My redistricting software is buggy, incomplete, barely coherent, and generally not ready for prime time. I decided to start at the beginning and try to clean up the software. Computer power has continued to advance as predicted, but so has digital decay. The type of data supplied by the Census Bureau as well as the format of that data changes every decade. I suspect that is what impeded the use of the split line algorithm on the 2010 census data.
I have concluded that I need a little software program that will extract the needed redistricting data from the bloated gerrymandering data. Sandboxwalls can be designed to use a consistent form of the redistricting data. The extractor program can be used to supply that redistricting data by extracting it from the gerrymandering data. That way sandboxwalls can be the same for all decades. Theoretically, only a different extractor program will be needed for each decade.
Fortunately, I am now back to work.  Next blog I will consider what files are needed.

Summer is not enough to redistrict

      Summer went by really fast. I was busy enough with other problems. Since my blog doesn’t really have an audience, it is more like a diary to be reread someday.   A relative of mine peeked at it.

         I downloaded census and other data for Arizona. I compiled code I wrote in Fortran.  Sorting latitudes into bins seems to work, and I recommend that approach.

     Everybody insists that districts must be contiguous, but nobody publishes a table of adjacent census blocks.  Creating a table of adjacent census blocks was very time-consuming. I tried to create a table with eight columns: columns 1 through 7 were for data, and column 8 was a pointer to more data in case a census block had more than seven sides. LOGRECNO could be a four-byte integer, whereas the GEOID could not. Each row was for the LOGRECNO of a census block, and the data in the row was the LOGRECNOs of all adjacent census blocks. After the data, the next column in the row was a zero to signal the end of the data. NaNs or undefineds were not allowed in the table.  I expected that few census blocks would have more than seven adjacent neighbors.
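A sketch of that eight-column layout (in Python for illustration; the actual code is Fortran, and the details here are my reconstruction of the scheme described above):

```python
def build_rows(neighbors, num_blocks):
    """Build the eight-column adjacency table: columns 1-7 of a row hold
    the LOGRECNOs adjacent to that row's census block, a zero signals the
    end of the data, and column 8 points to an overflow row for blocks
    with more than seven neighbors (0 means no overflow)."""
    table = [[0] * 8 for _ in range(num_blocks + 1)]   # row 0 unused
    for logrecno, adjacent in neighbors.items():
        row, remaining = table[logrecno], list(adjacent)
        while True:
            chunk, remaining = remaining[:7], remaining[7:]
            row[:len(chunk)] = chunk
            if not remaining:
                break
            table.append([0] * 8)        # allocate an overflow row...
            row[7] = len(table) - 1      # ...and point column 8 at it
            row = table[row[7]]
    return table

# Fabricated LOGRECNOs: block 2 has eight neighbors and needs an overflow row.
adjacency = {1: [2, 3], 2: [1, 3, 4, 5, 6, 7, 8, 9], 3: [1, 2]}
table = build_rows(adjacency, 3)
print(table[1])      # [2, 3, 0, 0, 0, 0, 0, 0]
print(table[2][7])   # 4 -- the index of block 2's overflow row
```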

      Results were poor.  A few sides seemed to have the same census block on both sides.  Also, some census blocks apparently had dozens of adjacent census blocks.  It is easier for me to believe these anomalies are errors in the census data than to admit that they are due to my buggy code.  It was a stretch to extract the necessary data and trapeze all the way from sides to faces to GEOIDs to LOGRECNOs.  The trapeze apparently dropped some data along the way.  Sandboxwalls juggled the data yet again to match the latitudes and longitudes in the bins before putting the data into an internal eight-column array for its use.

      It would have been nice if someone else had published such a table of adjacent census blocks. The GIS companies, being well funded by the politicians, have staff available to extract this sort of data for their gerrymanders. I suppose the excuse for nobody publishing such a table is that there aren't that many people attempting to create software to do automated redistricting. This person, however, needs all the help he can get.

     In Sandboxwalls, when a district seed balloons into another district balloon, vector forces are created. The vector forces accumulate, but are all released at the same time. Having the release at intervals of population increase seems to work better than having a release caused by various events. It can consume a fair amount of time for the vectors to move the district centers to their new positions and re-create the arrays. Wall height was determined such that the effective eccentricity of district to city to county radius was 1 to 2 to 3. When the sand overflowed a wall, a new radius was created at the point of overflow. I never got a complete result, but did expand far enough that the population difference between all districts was under 1%. Some districts expanded to their maximum radius.  I suspect that was not because those districts were hemmed in by other districts, an eventuality for which I was prepared; I am afraid it was because of too many holes in the adjacent census blocks data.  I ran out of time before I had a chance to investigate this.  I never did create a dotCSV file for a 30-district map of Arizona.

      Time to forget about redistricting by computer, and get ready to help put on the upcoming consolidated elections November 8.  After that, well, summer vacation is over, so playtime is over.  Maybe next year if I can afford it.

Sorting the census blocks for redistricting


Another method for sorting the census data for redistricting has occurred to me. Back in ancient times, there were machines that could sort punch cards. For example, cards could be sorted into ranges of latitude: the cards would be conveyed horizontally until they dropped into the proper bin below.  Today, the same sort of thing can be done in software, known as a radix sort. This could be faster than even the merge sort that I was considering, and the code could be more compact as well. I hadn't considered radix sort seriously before, since potentially a huge memory would be required. I remember monolithic 2K-byte memory chips, so half-gigabyte memories are huge to me. Ancient FORTRAN required arrays to be dimensioned at the beginning of the program.  Modern FORTRAN allows calculating efficient dimensions for an array before allocating memory to it.  The old way would have been to use a linear array and then calculate the two-dimensional subscripts for that array.

I intend to write code for a radix-type sort to separate the census blocks into stripes of latitude ranges.  The census blocks in each stripe can then be sorted by longitude.  Merge sort will hopefully work well enough for the stripes.  Other methods of sorting may be faster, but I doubt it is worth my time to code them. After all, the goal is redistricting by computer. I have already learned way too much about sorting, but help is not on the way.  I am still writing code and haven't started debugging.  A compile now and then reveals tons of errors that are easily fixed.
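The binning pass can be sketched like this (in Python for illustration; the real code is Fortran, and the stripe count and coordinates here are made up):

```python
def bin_by_latitude(blocks, num_stripes):
    """Radix-style first pass: drop each census block into a latitude
    stripe (bin), the software analogue of the punch-card sorter.
    blocks: list of (lat_microdeg, lon_microdeg, payload) tuples."""
    lats = [b[0] for b in blocks]
    lo, hi = min(lats), max(lats)
    width = (hi - lo) // num_stripes + 1   # +1 so the max lands in the last bin
    stripes = [[] for _ in range(num_stripes)]
    for b in blocks:
        stripes[(b[0] - lo) // width].append(b)
    # Second pass: sort the blocks within each stripe by longitude.
    for s in stripes:
        s.sort(key=lambda b: b[1])
    return stripes

blocks = [(33_500_000, -112_000_000, "a"),
          (33_100_000, -111_900_000, "b"),
          (33_120_000, -112_200_000, "c")]
stripes = bin_by_latitude(blocks, 4)
print([b[2] for b in stripes[0]])   # ['c', 'b'] -- west to east
```

Like the punch-card machine, the binning pass touches each block once; the per-stripe longitude sort is where a merge sort would go.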

bug is i=1 was shifted left a column

Decades ago, some academics tried computerized redistricting by growing districts from a seed.  Sandboxwalls also grows districts from a seed.  Even though that research was paid for by taxpayer dollars, the information is inaccessible, stuck up in ivory towers.  I imagine those experiments mostly failed, as later researchers pursued heuristic methods for redistricting instead.  Circles don't fit together well, but I'm hoping my experiment with sandboxwalls will do better.  Time will tell, but spare time is hard to come by. I envy the several contestant teams in the state of Virginia; each team had over half a dozen undergraduates and a faculty advisor. The good news is that slowly more people are doing redistricting. Unfortunately, gerrymandering is still the clear winner.

Closs Enuf 4 Govment Work

  For redistricting, random processes should not be used.  Given identical census data and run on a compatible computer, the software should yield identical results.  That way all parties can verify that the results are not just something cooked up in a smoke-filled room.  If the results from different computers aren't precisely identical, careful analysis will reveal the cause of the discrepancy.  This is better than an independent redistricting commission.  A commissioner is likely to know that San Francisco is liberal and Orange County is conservative.  Even dissecting the brain of a commissioner will not reveal whether that knowledge or some other knowledge improperly biased a decision.

 There has always been a trade-off in computing between accuracy and speed.  Double precision takes longer to calculate than single precision.  Consider finding the closer of two points to a center.  If the distances to the two points are almost identical, then greater precision is required to resolve which is closer.  If using 64-bit integers, the squares of the distances can be compared instead of the distances themselves.  Since I am using a borrowed six-year-old home computer, approximations and shortcuts will keep the running time reasonable.  There is always the danger that not being precise enough will result in an incorrect comparison.  A way to save the situation is to measure how close to identical the distances are.  If too close, the calculation can be redone with higher precision and without so many approximations.  Hopefully, this should rarely happen.  Still, measuring how close they are takes extra time, recalculation or not.  Until I actually run some software, I won't know how much time any of this needs.
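The squared-distance comparison can be sketched as follows (in Python for illustration; Python integers never overflow, but with 64-bit integers the same comparison is exact as long as the squares fit):

```python
def closer_of_two(center, p1, p2):
    """Decide which of p1, p2 is closer to center by comparing squared
    distances as integers -- no square root and no floating point, so
    the comparison itself loses no precision."""
    dx1, dy1 = p1[0] - center[0], p1[1] - center[1]
    dx2, dy2 = p2[0] - center[0], p2[1] - center[1]
    d1 = dx1 * dx1 + dy1 * dy1
    d2 = dx2 * dx2 + dy2 * dy2
    if d1 < d2:
        return p1
    if d2 < d1:
        return p2
    return None   # exactly equidistant -- the caller must break the tie

print(closer_of_two((0, 0), (3, 4), (6, 8)))   # (3, 4)
```

Squaring a coordinate difference doubles the number of significant bits, which is why 64-bit integers are needed for coordinates given in millionths of a degree.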

 There is no cure for the worst-case scenario.  More cosine terms could always be used, and more decimal places of Pi could be added without limit.  I didn't like the idea of a hidden MSB, but IEEE 754 is crucial for portability of software among computers.  Still, there comes a point where the results should be accepted if they were fairly generated.  In gambling lotteries, we analyze the fairness of the selection process, not the precision of the results.  In gerrymandering, the end results justify any shady means.

It is time for me to prepare for the incoming data.  No, not the census block data; instead, my W-2's and 1099's must be collected.  My taxes are unpredictable, as I haven't figured out yet what Congress has done to us taxpayers this time.  I will have to do my taxes before I can attempt to apply Sandboxwalls to Congress.

Some Census data will be released next month. I doubt I will be ready in time.

Out of Sorts with Redistricting

      My ideas for redistricting with a sandboxwalls program involve a lot of data sorting.  First, the census blocks of a state are sorted by latitude.  However, the states are two dimensional, not one-dimensional.  Consider a television picture.  The picture is made up of horizontal stripes.  Using that concept, the latitude of the State is then divided up into stripes.  The census blocks within each stripe are then sorted by longitude.  A box is placed on the picture.  The center of the box is where a district will grow outward in all directions from a dot.  The box then contains segments from a number of horizontal stripes.  All of the census blocks within the box must be sorted by distance from the center dot.  That way, the district can grow by acquiring the closest census blocks in order by distance.  See my last blog on the square root.  To keep all of the districts at about the same population, a tally records the current population for each and every growing district.  The district with the lowest population grows by acquiring a census block.  To know which district has the lowest population, the tally must be kept sorted by population.  To correlate PL94-171 data with Shapefiles data, they can both be sorted to match.
      To get this far has involved a lot of sorting.  If the census blocks are delivered by the Census Bureau in some kind of order, that can either help or severely hinder, depending on the order versus the chosen sorting algorithm.  I call the order seemingly being used "thoughtlessly random", and chose accordingly.  How to implement all this sorting is a job for a software engineer.  Unfortunately, I have never found a software engineer interested in playing with my sandboxwalls.
      After some thought, I realized that the classic bubble sort algorithm should work well for sorting the tally each time.  But bubble sort would be horribly time-consuming for sorting a state's worth of census blocks by latitude, and a different sorting algorithm is clearly needed there.  I use a binary search insertion sort algorithm for that.
      Actually, there isn't exactly a best sorting algorithm for a given sorting problem.  My analysis of the box-sorting problem resulted in a solution that is a combination of algorithms.  Each segment in the box is split into two pieces.  Each piece becomes a list, which is then bubble sorted.  The two lists are then merge sorted into a single list.  Each merge-sorted list is then binary-search-insertion sorted as a batch of census blocks into the final result.  This requires more than the usual amount of code, of course, but it is the quickest way I know to get the result.  I hope it will work.
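The merge and batch-insertion steps can be sketched like this (in Python for illustration; library sorts stand in for the bubble sort of each piece):

```python
import bisect

def merge(a, b):
    """Merge two already-sorted lists (the two sorted pieces of one
    stripe segment) into a single sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def insert_batch(result, batch):
    """Binary-search insertion of each element of a batch into the
    growing final result."""
    for x in batch:
        bisect.insort(result, x)

result = []
segment = [5, 1, 4, 2, 3, 6]                       # one segment from the box
left, right = sorted(segment[:3]), sorted(segment[3:])  # stand-ins for bubble sort
insert_batch(result, merge(left, right))
print(result)   # [1, 2, 3, 4, 5, 6]
```

In the real program this would repeat for every segment the box cuts from the stripes, each batch landing in the district's distance-sorted list.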
      The public library eventually got me the book I put on hold.  The title is "The Art of Computer Programming, Volume 3" by Donald E. Knuth.  I learned several things about sorting from this book.  Much of what was discussed is over my head, as the subject of sorting has depth.  I already knew that I needed help, but the book is the best help I have been able to get.  Also, it seems that some version of the merge sort algorithm is more appropriate for quickly sorting the census blocks by latitude than what I have now.
      The question is: Do I have enough time to go back and rewrite the FORTRAN, or would it be better to spend the time attempting to debug what I have now?  The reality is that I still have no idea how to extract the data needed for redistricting from the big knotted clump of data supplied by the Census Bureau for gerrymandering.  I may be limited to rectangular states if I am unable to obtain the data or figure out how to use the TIGER Shapefiles to limit the length of horizontal stripes to within a state border.
      The Holidays are almost over, and soon I will be back to earning a living instead of playing with FORTRAN programming for redistricting.  Happy New Year seems like a wish, not a reality in this economy.  When I get a chance, the next blog will be about accuracy.

Faster than a speeding square root

In the data from the Census Bureau, the locations of the census blocks are given by latitude and longitude. Instead of degrees, minutes, and seconds, the latitude and longitude are given in millionths of a degree. Fortunately, these are conveniently expressed as 32-bit integers. Finding the geographical distance between two points usually means having the computer employ the Pythagorean theorem. But if fast is desired and sloppy is acceptable, the square root can be avoided. The first method is to simply add the horizontal "X" distance to the vertical "Y" distance to yield an approximate hypotenuse. This first method is three times faster than the Pythagorean theorem, but the error is up to 42%. One reason for the great speed is that no squared term is needed, which could overflow the integer. Another reason is that the square root requires floating-point values.

 For Sandboxwalls in the early stages, all of the census blocks near the district center will be included in the district anyway. The order in which they are included is unimportant; in spite of the sloppiness, the result is the same. In the late stages of the Sandboxwalls algorithm, such sloppiness would cause some census blocks to wind up in a different district. Although the Pythagorean theorem is needed in the end, the Sandboxwalls algorithm can be sped up by not using the square root in the beginning. The trade-off is how much speed is gained for an allowed sloppiness.

 A second method works as follows: take the shorter of "X" and "Y", and cut it in half. Add that half to the longer of "X" and "Y" to yield an approximate hypotenuse. Although taking a little more computer time, the error is 0% to 12%. Although taking half is a division operation, an optimizing compiler will accomplish it quickly with a bit-shift operation. Keeping in mind the need to use shift operations, method 3 subtracts an additional term: besides half of the shorter of "X" and "Y", a sixteenth of the shorter is also subtracted. Method 3 is still almost twice as fast as using the Pythagorean theorem, and the error is 0% to 9.2%.

Method 4 subtracts an eighth instead of a sixteenth. The speed is the same, but the error becomes from -2.8% to about 7%. Method 5 is an attempt to get the best of both method 3 and method 4. Mostly method 5 uses method 3, but it switches to method 4 if the angle is large. The error for method 5 is 0% to 7%, but the price for the greater accuracy is much less speed.

 I wondered if the quick value obtained by one of these methods could be used as the initial value in the Newton-Raphson algorithm for square roots. Unfortunately, this algorithm consumes more computer time in Fortran than the standard optimized Pythagorean theorem.

Here is a chart of my results:

Method 1: long + short integers                  78 time units    error:  0%  to  42%
Method 2: long + half short                      94 time units    error:  0%  to  12%
Method 3: long + half short - sixteenth short   156 time units    error:  0%  to  9.2%
Method 4: long + half short - eighth short      156 time units    error: -3%  to   7%
Method 5: long + switch sixteenth/eighth        203 time units    error:  0%  to   7%
Pythagorean: single precision square root       281 time units    assumed accurate

This is the Fortran code cut from a test program using method 5, reconstructed here so the if blocks are complete (the symmetric branch for i >= j is omitted):

       i = abs(kx*(longmz-longsf));  j = abs(ky*(latmz-latsf)) !-Martinez
       if (i < j) then                   ! j is the longer leg
          if (8*i < 7*j) then            ! used 7/8 ratio since 8=2**3
             d = j + i/2 - i/16          ! small angle: method 3
          else
             d = j + i/2 - i/8           ! large angle: method 4
          end if
       end if
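For readers who want to experiment, here is a sketch of methods 1 through 5 (in Python for illustration; the switch condition for method 5 uses the 7/8 ratio mentioned in the Fortran comment above):

```python
import math

def approx_hyp(x, y, method):
    """Approximate the hypotenuse of |x|, |y| using only adds and shifts."""
    longer, shorter = max(abs(x), abs(y)), min(abs(x), abs(y))
    if method == 1:
        return longer + shorter
    if method == 2:
        return longer + (shorter >> 1)
    if method == 3:
        return longer + (shorter >> 1) - (shorter >> 4)
    if method == 4:
        return longer + (shorter >> 1) - (shorter >> 3)
    # Method 5: method 3 for small angles, method 4 for large ones.
    if 8 * shorter < 7 * longer:
        return longer + (shorter >> 1) - (shorter >> 4)
    return longer + (shorter >> 1) - (shorter >> 3)

x, y = 300, 400
print(math.hypot(x, y))        # 500.0, the true distance
print(approx_hyp(x, y, 1))     # 700
print(approx_hyp(x, y, 3))     # 532
```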

Another time consuming process is sorting the census blocks. I’ll consider that in my next blog.

Make my districts contiguous please

In redistricting, contiguous is a fancy word meaning the district is one lump without any loose flakes floating around. If a district includes part of a county, I see  no reason for that district to have little pieces of that county that are unconnected with the main part of the district. This assumes that the county was contiguous. Consider a city that is not contiguous, since it has small unconnected parcels of land. If a district includes part of that city, then perhaps that district has a valid excuse to not be contiguous if that district includes some of those small unconnected parcels of land.

For sandboxwalls to keep districts contiguous, before adding a particular census block to the district, it must confirm that the census block is adjacent to the district. This could be done by checking a contiguous-blocks table with records as follows:
The first integer identifies the particular census block. The next integers identify the census blocks adjacent to it. The rest of the record is padded with zeros.  Every census block would have a record, and the records would be in a searchable order.  Such a table would allow the software to verify that no district has unconnected fragments.
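The contiguity check against such a table might look like this (a Python sketch with made-up block IDs; the real table would use LOGRECNOs):

```python
def is_adjacent_to_district(block, district_blocks, adjacency):
    """Before adding `block` to the district, confirm it touches a block
    already in the district. `adjacency` maps a block ID to its record:
    the IDs of adjacent blocks, zero-padded as described above."""
    for neighbor in adjacency[block]:
        if neighbor == 0:              # zero padding marks end of data
            return False
        if neighbor in district_blocks:
            return True
    return False

adjacency = {101: [102, 103, 0, 0],
             102: [101, 0, 0, 0],
             103: [101, 104, 0, 0],
             104: [103, 0, 0, 0]}
district = {101, 102}
print(is_adjacent_to_district(103, district, adjacency))   # True
print(is_adjacent_to_district(104, district, adjacency))   # False
```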

Such a table could, I presume, be constructed from data in the census TIGER files. I looked at the following description of the All Lines Shapefile record layout:

            The shapefile name is: tl_2008__edges.shp

            The shapefile is county-based.

            The following is the shapefile’s attribute table layout:

Field     Length  Type     Description

STATEFP   2       String   Current state FIPS code
COUNTYFP  3       String   Current county FIPS code
TLID      10      Integer  Permanent edge ID
TFIDL     10      Integer  Permanent face ID on the left of the edge
TFIDR     10      Integer  Permanent face ID on the right of the edge

When they say edge, I presume they mean a segment of the polygon that is the perimeter of a census block. I have no idea how to take the permanent face ID and get the latitude and longitude of that census block. The census files are usually massive and are geared for use by States and large companies, not hobbyists.  I have not attempted to download any of them. Having such a table is important enough to me that, if desperate, I could create a fake table that had a certain number of the nearest census blocks, even if they weren't adjacent. Nearest could be determined by using the latitude and longitude of the census blocks.  Hopefully someone can give me clues on how to construct an Adjacent Blocks Table.

To find the distance from a district center to a census block, do the following:
Subtract the longitude of the district center from the longitude of the census block to get the horizontal distance X. Next, subtract the latitude of the district center from the latitude of the census block to get the vertical distance Y.  Since X and Y are on planet Earth instead of a flat surface, they must be multiplied by correction factors. To save computation time, sandboxwalls does not calculate the correction factors for every distance; that requires cosines. Instead, the correction factors are calculated once for each district center, and those are used repeatedly as an approximation. Now take the corrected X and Y and use the Pythagorean theorem: the hypotenuse is equal to the square root of the sum of the squares of X and Y.
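Here is a sketch of that per-center approximation (in Python for illustration; the 111 km-per-degree constant is approximate):

```python
import math

def center_factors(center_lat_microdeg):
    """Correction factors computed once per district center: a degree of
    longitude shrinks by cos(latitude), while latitude needs no such
    correction. Roughly 111 km per degree on the ground."""
    lat_rad = math.radians(center_lat_microdeg / 1_000_000)
    ky = 111.0 / 1_000_000          # km per microdegree of latitude
    kx = ky * math.cos(lat_rad)     # km per microdegree of longitude
    return kx, ky

def distance_km(center, block, kx, ky):
    """Corrected X and Y, then the Pythagorean theorem."""
    x = (block[1] - center[1]) * kx
    y = (block[0] - center[0]) * ky
    return math.hypot(x, y)

center = (37_000_000, -122_000_000)        # (lat, lon) in microdegrees
kx, ky = center_factors(center[0])
d = distance_km(center, (37_010_000, -122_010_000), kx, ky)
print(round(d, 2))   # about 1.42 km
```

The cosine is evaluated only when a district center moves, so every block-to-center distance after that costs two multiplies, two subtracts, and one square root.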

Unfortunately, that means to find every distance requires a square root function. To save computation time, in my next blog I try a sloppy but fast way to find square root.

Redistricting on the Internet

The World Wide Web meant easy access to previously hidden information. No longer was it necessary to hunt through libraries to find information on unusual subjects, such as redistricting. I found that there were a few others out there currently pursuing computerized redistricting. Better yet, blogging offers the chance to make my findings available to the world. But the hope of finding others to help me in my pursuit of redistricting has not happened. Academic libraries don’t make all their materials available to the public, even though their research is funded by taxpayer dollars. What I have seen from the Ivory Tower is about how computerized redistricting is such a difficult problem to solve. Nothing about how to download and use the census data.

Over the decades, I have reached some conclusions different from others in this niche. I consider it impractical to achieve some perfect redistricting plan. The gerrymanderers can simply have the computer analyze millions of plans to find the one that steals the most votes in their favor. I reject the idea that there is a comparably easy measurement of the "goodness" of a redistricting plan. A minimum-district-perimeter plan does not take into account the city and county boundaries, which I consider a terrible weakness. There is no perfect value of how much account should be taken. The values I have chosen are a bit arbitrary; experimentation will reveal whether or not a chosen value is practical. Too little account would lead to a redistricting plan similar to that created by a minimum-perimeter algorithm. Too much account could lead to stretched-out, uneven districts, though I doubt they would look as brutally mangled as typical gerrymandered districts. The trick then would be to get a consensus on a practical value that could be enacted into law.

Simulated Annealing redistricting is too heuristic for me. Splitline redistricting is another algorithm which I consider to be a competitor. It ignores census blocks in favor of smooth straight lines. Sandboxwalls grows districts from a seed. Originally, the idea was to plant the seed in the center of the existing districts. Unfortunately, the present existing districts are mangled by gerrymandering. Thus, an awful lot of evolution would be required by Sandboxwalls to produce good districts from bad. It has occurred to me to start with something better than such mangled districts. A lot of time could be saved by planting the seed in the center of each of the Splitline districts. I have no idea how to get the latitude and longitude out of the Splitline districts. Even if I did, I am not interested in a fight with whoever or whatever owns Splitline.

But experimentation is a long way off. Concepts have to be turned into working FORTRAN code. That takes time, and staying employed is more important than playing with lost causes like computerized redistricting. Somehow data has to be extracted from the Census Bureau files. In my next blog, I will ask for clues on how to do this.

The Joy and Toll of Moore’s Law

Redistricting was also affected by Moore's Law.  I saw no reason that Gordon Moore's prediction should be a law, as I saw it as a mere coincidence that would be temporarily true. I still don't understand why Moore's Law is true, but I was obviously wrong about it being temporary.  The result of ever-increasing computer power was not just that a computer could win a chess game.  Somehow, the promise of more computing power on everyone's desktop (even mine!) didn't mean I had the computer power to do automated redistricting.  The reality was major gains for gerrymandering: it gave those who gerrymander an increasingly powerful tool.  By having the Census Bureau include voter districts, the GIS companies could more accurately correlate voter data with census data.  The addition of voter data made gerrymandering software stunningly accurate, so that most incumbent candidates for district office won before the first vote was cast.

Politicians have strained to ignore that a computer can do redistricting.  Ironically, the precision of gerrymandering is made possible by gerrymandering software.  Redistricting software is limited to curious academic research and hobbyists, even though gerrymandering software is more difficult to write, since it must also include voter registration data.  The difference is that gerrymandering software is profitable, since there is a strong demand for it from politicians.  Thanks to increased computer power, the purchase price of a safe seat is way less than the cost of an election campaign.

 Another problem was the fragility of software.  If the software is to be widely understood, a high-level language is needed.  More powerful computers were an excuse to upgrade software, making previous software obsolete.  Digital decay and data rot set in.  BASIC went from MSBASIC to QBASIC to VBASIC, which demonstrated a problem for redistricting software: a decade from now, software written today could be useless just because of a language change.  If a computer program to do redistricting is actually used, it would have to be backed by law. What was needed was a computer language that would be stable for decades, not merely years.  Otherwise, the law would be useless, as the computer program it was based on couldn't be run in the future.  I decided to abandon BASIC in favor of FORTRAN for that reason.  FORTRAN isn't dependent on a particular operating system, and it runs on open-source operating systems, which are more stable as well.  Of course, FORTRAN is out of the mainstream, where obscuring the software code contributes to profits.  And yes, FORTRAN has been around so long that many FORTRAN programmers are now in graveyards.

The huge increase in computer power was just one revolution, and Moore's Law says that revolution will continue.  But another, bigger revolution is my next topic.

