Data
JGeocoder uses Tiger/Line data files from the U.S. Census Bureau. Many commercial geocoders also use these data files (or data sources derived from them), but they generally do not tell their users. The Tiger/Line data files are free and contain all the information needed to build a geocoder for U.S. addresses.
JGeocoder uses only a subset of the Tiger/Line data. This subset is first loaded into a relational database, which JGeocoder then queries in order to geocode an input address. The database is expected to have one data table for each state, in the following format:
--this is the primary table for PA
create table TIGER_PA (
TLID numeric not null,
FEDIRP varchar(2) ,
FENAME varchar(30) ,
FETYPE varchar(4) ,
FEDIRS varchar(2) ,
FRADDL numeric,
TOADDL numeric,
FRADDR numeric,
TOADDR numeric,
ZIPL varchar(5) ,
ZIPR varchar(5) ,
FRLONG numeric not null,
FRLAT numeric not null,
TOLONG numeric not null,
TOLAT numeric not null,
LONG1 numeric ,
LAT1 numeric ,
LONG2 numeric ,
LAT2 numeric ,
LONG3 numeric ,
LAT3 numeric ,
LONG4 numeric ,
LAT4 numeric ,
LONG5 numeric ,
LAT5 numeric ,
LONG6 numeric ,
LAT6 numeric ,
LONG7 numeric ,
LAT7 numeric ,
LONG8 numeric ,
LAT8 numeric ,
LONG9 numeric ,
LAT9 numeric ,
LONG10 numeric ,
LAT10 numeric );
The table definition above is for the state of PA. JGeocoder queries this table when the input is a PA address. Tables for other states have exactly the same schema definition but are named after the state. For example, the California data table is named TIGER_CA (table names are assumed to be case-insensitive).
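The naming rule can be sketched in a few lines of Python. The function name tiger_table_name is hypothetical and is not part of JGeocoder; it only illustrates the TIGER_<STATE> convention described above, and it rejects anything that is not a two-letter abbreviation because a table name has to be spliced into the SQL text rather than bound as a parameter.

```python
import re


def tiger_table_name(state_abbr):
    """Return the per-state table name, e.g. 'ca' -> 'TIGER_CA'.

    Hypothetical helper for illustration; not a JGeocoder API.
    """
    abbr = state_abbr.strip().upper()
    # Table names cannot be passed as bound parameters, so only accept
    # exactly two ASCII letters before building the name.
    if not re.fullmatch(r"[A-Z]{2}", abbr):
        raise ValueError("not a two-letter state abbreviation: %r" % state_abbr)
    return "TIGER_" + abbr
```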
Loading the Data
If you just want to test JGeocoder, you can download a pre-populated database from our SourceForge site and point JGeocoder to it. Currently this pre-populated database contains only PA address data, so only PA addresses can be geocoded from street data. All other addresses will be geocoded using either the ZIP centroid or the city/state centroid. See Quick Start for instructions.
If you need to geocode addresses outside of PA, you will have to load the data into a relational database yourself. This is not difficult, because the raw data files are available on JGeocoder's SourceForge download page. Currently only the PA, CA, GA, and IL data files are available for download, but I will continue to release raw data files for other states in the near future.
These raw data files are in CSV format, and their fields match the TIGER_<STATE> table schema shown above exactly. All you need to do is create a database table with the appropriate name (the CA data table needs to be named TIGER_CA, for example) and load the raw data file into it.
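As a rough illustration of the load step, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for whatever relational database you actually use. Several things here are assumptions rather than facts about the released files: that the CSV has no header row, that its columns appear in the same order as the schema, and that an empty field means NULL. The create_table helper also simplifies the schema (it drops the varchar lengths and the not null constraints), and the sample row in the demo is made up.

```python
import csv
import io
import sqlite3

TEXT_COLUMNS = {"FEDIRP", "FENAME", "FETYPE", "FEDIRS", "ZIPL", "ZIPR"}

# Column order assumed to match the TIGER_<STATE> schema.
COLUMNS = [
    "TLID", "FEDIRP", "FENAME", "FETYPE", "FEDIRS",
    "FRADDL", "TOADDL", "FRADDR", "TOADDR", "ZIPL", "ZIPR",
    "FRLONG", "FRLAT", "TOLONG", "TOLAT",
] + [axis + str(i) for i in range(1, 11) for axis in ("LONG", "LAT")]


def create_table(conn, table):
    """Create a simplified version of the per-state table."""
    cols = ", ".join(
        f"{name} {'varchar' if name in TEXT_COLUMNS else 'numeric'}"
        for name in COLUMNS
    )
    conn.execute(f"create table {table} ({cols})")


def load_csv(conn, table, csv_file):
    """Insert every CSV row into `table` and return the number of rows added."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"insert into {table} ({', '.join(COLUMNS)}) values ({placeholders})"
    rows = (
        [field if field != "" else None for field in row]  # empty -> NULL
        for row in csv.reader(csv_file)
        if row  # skip blank lines
    )
    with conn:  # commit on success, roll back on error
        cur = conn.executemany(sql, rows)
    return cur.rowcount


if __name__ == "__main__":
    # Made-up sample row: 15 populated fields followed by 20 empty ones.
    sample = io.StringIO(
        "1001,N,MAIN,ST,,101,199,100,198,19104,19104,"
        "-75.20,39.95,-75.19,39.96" + "," * 20 + "\n"
    )
    conn = sqlite3.connect(":memory:")
    create_table(conn, "TIGER_PA")
    print(load_csv(conn, "TIGER_PA", sample), "row(s) loaded")
```

Most databases also ship a bulk CSV import tool, which will usually be faster than row-by-row inserts for a full state file.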
Geocoding Query
Once the Tiger/Line data is loaded into a relational database, estimating the lat/lon of a parsed and normalized address is not hard. Given the schema described above, the geocoding query looks something like the following:
--here we are querying the PA table
select t.tlid, t.fraddr, t.fraddl, t.toaddr, t.toaddl,
t.zipL, t.zipR, t.tolat, t.tolong, t.frlong, t.frlat,
t.long1, t.lat1, t.long2, t.lat2, t.long3, t.lat3, t.long4, t.lat4,
t.long5, t.lat5, t.long6, t.lat6, t.long7, t.lat7, t.long8, t.lat8,
t.long9, t.lat9, t.long10, t.lat10, t.fedirp, t.fetype, t.fedirs from TIGER_PA t
where t.fename = $street and
(
(t.fraddL <= $num and t.toaddL >= $num) or (t.fraddL >= $num and t.toaddL <= $num)
or (t.fraddR <= $num and t.toaddR >= $num) or (t.fraddR >= $num and t.toaddR <= $num)
)
and (t.zipL = $zip or t.zipR = $zip)
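The query returns street segments whose address range contains the house number, and a position still has to be estimated from each match. One common way to do that is linear interpolation along the segment, sketched below in Python. This is an assumption for illustration and not necessarily what JGeocoder does: it treats the segment as a straight line from (frlat, frlong) to (tolat, tolong), ignores the long1..lat10 columns, and does not check odd/even house-number parity. The function name interpolate and the dict-shaped row are hypothetical.

```python
def interpolate(row, num):
    """Estimate (lat, lon) for house number `num` on one matched segment.

    `row` maps the lower-case column names from the geocoding query to
    their values. Returns None if neither side's range contains `num`.
    Illustrative only: straight-line interpolation between the endpoints.
    """
    # Try the left side of the street first, then the right side.
    for fr_key, to_key in (("fraddl", "toaddl"), ("fraddr", "toaddr")):
        fr, to = row.get(fr_key), row.get(to_key)
        if fr is None or to is None:
            continue
        if min(fr, to) <= num <= max(fr, to):
            # Fraction of the way from the "from" end to the "to" end;
            # a single-number range is placed at the midpoint.
            frac = 0.5 if fr == to else (num - fr) / (to - fr)
            lat = row["frlat"] + frac * (row["tolat"] - row["frlat"])
            lon = row["frlong"] + frac * (row["tolong"] - row["frlong"])
            return lat, lon
    return None
```

Because the ranges may run in either direction (the query allows both fraddL <= toaddL and fraddL >= toaddL), the fraction is computed relative to the "from" end rather than to the smaller number.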
To make the geocoding query more efficient, you should create indexes on the columns being queried. For example, add the following indexes to the TIGER_PA table:
create index IDX0_TIGER_PA on TIGER_PA(tlid);
create index IDX1_TIGER_PA on TIGER_PA(fename);
create index IDX2_TIGER_PA on TIGER_PA(fraddL);
create index IDX3_TIGER_PA on TIGER_PA(toaddL);
create index IDX4_TIGER_PA on TIGER_PA(fraddR);
create index IDX5_TIGER_PA on TIGER_PA(toaddR);
create index IDX6_TIGER_PA on TIGER_PA(zipL);
create index IDX7_TIGER_PA on TIGER_PA(zipR);
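Since every state table has the same schema, the same eight indexes apply to each of them. The Python sketch below generates the statements for any state by following the IDX<n>_TIGER_<STATE> naming pattern used above; the function name index_statements is hypothetical and only illustrates the pattern.

```python
INDEXED_COLUMNS = [
    "tlid", "fename", "fraddL", "toaddL", "fraddR", "toaddR", "zipL", "zipR",
]


def index_statements(state_abbr):
    """Return the create-index statements for one state's table.

    Hypothetical helper for illustration; not a JGeocoder API.
    """
    abbr = state_abbr.strip().upper()
    # The abbreviation is spliced into SQL text, so keep it to two letters.
    if len(abbr) != 2 or not (abbr.isascii() and abbr.isalpha()):
        raise ValueError("not a two-letter state abbreviation: %r" % state_abbr)
    table = "TIGER_" + abbr
    return [
        f"create index IDX{i}_{table} on {table}({col});"
        for i, col in enumerate(INDEXED_COLUMNS)
    ]


if __name__ == "__main__":
    print("\n".join(index_statements("ca")))
```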
