[Gretl-devel] Gretl and PUMS Data

Allin Cottrell cottrell at wfu.edu
Thu Oct 25 12:14:01 EDT 2007


I recently responded to this question from a gretl user:

> I have been trying to figure out how to use Gretl for Public Use 
> Micro Data Sample (PUMS). I am wondering if you can point me in 
> the right direction. Your response is greatly appreciated.

My response is below (you may have seen it on gretl-users), 
followed by a design question.

<initial response>

I haven't made much use of PUMS data myself, but here's what I 
found on quick experimentation.  I went to 

http://factfinder.census.gov/home/en/acs_pums_2006.html

and downloaded the 2006 Population Records for North Carolina in 
CSV format.  Gretl was close to being able to read this straight 
off, but there was one problem.  

When gretl encounters non-numeric data for a particular variable 
in a CSV import it treats the values of that variable as strings, 
constructs a numeric coding, and creates a "string table" that 
presents the coding to the user.  BUT this is done only if 
non-numeric data are encountered in the first data row for the 
variable in question.  That is, if we read (apparently) numeric 
data on rows 1 to k-1, then encounter non-numeric data on row k, 
we flag an error and stop reading.
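
As a rough illustration only (this is Python, not gretl's actual C 
import code), the first-row rule can be sketched as follows.  The 
names is_numeric and read_column are hypothetical, and treating an 
empty field as a missing value is an assumption made for the sketch.

```python
def is_numeric(field):
    """Return True if the field parses as a number.

    Assumption for this sketch: an empty field counts as a missing value.
    """
    if field.strip() == "":
        return True
    try:
        float(field)
        return True
    except ValueError:
        return False


def read_column(name, fields):
    """Apply the first-row rule to one column of raw CSV fields.

    Returns (values, string_table); string_table is None for a
    column that is read as numeric.
    """
    if not is_numeric(fields[0]):
        # First value is non-numeric: treat the whole column as strings
        # and assign integer codes in order of first appearance.
        table = {}
        codes = [table.setdefault(f, len(table) + 1) for f in fields]
        return codes, table

    # First value looked numeric: any later non-numeric value is an error.
    values = []
    for row, f in enumerate(fields, start=1):
        if not is_numeric(f):
            raise ValueError(
                f"{name}, observation {row}: non-numeric value {f!r}")
        values.append(float(f) if f.strip() else None)
    return values, None
```

With this sketch, a column such as ["M", "F", "M"] is coded as 
[1, 2, 1], while ["1133", "113M"] raises an error at observation 2.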

The trouble is that some of the PUMS variables are codings, some 
but not all values of which contain non-numeric characters.  For 
example, NAICSP, the "NAICS Industry Code", which has values 
(among others) of 1133 and 113M.  

Here's a solution, perhaps not permanent if we can think of 
something better: I've added a new parameter to the "set" command, 
namely "codevars".  You can do, for example,

 set codevars NAICSP SOCP

prior to importing a CSV file.  This tells gretl that the 
variables NAICSP and SOCP should be interpreted as string-coded, 
even if the first values look to be numeric.

(In general you say: "set codevars <varnames>", where <varnames> 
is a space-separated list of names.  You can say "set codevars 
null" to clean out the list.)
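
To make the effect of the list concrete, here is a small Python 
sketch (again, not gretl's real code) of how a list of forced names 
could override the first-row check.  The function classify_columns 
and its arguments are hypothetical names used only for illustration.

```python
def looks_numeric(field):
    """True if the field is empty (assumed missing) or parses as a number."""
    if field.strip() == "":
        return True
    try:
        float(field)
        return True
    except ValueError:
        return False


def classify_columns(header, first_row, codevars=()):
    """Map each column name to True if it should be read as string-coded.

    A column is string-coded if its name is in codevars, or if its
    first data value does not look numeric.
    """
    forced = set(codevars)
    return {
        name: (name in forced) or not looks_numeric(field)
        for name, field in zip(header, first_row)
    }
```

For a first row where NAICSP is "1133", the column is classed as 
numeric unless "NAICSP" is passed in codevars.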

For the North Carolina PUMS data, this now works to open the file 
in gretl:

 set codevars NAICSP SOCP
 open ss06pnc.csv

This feature is in CVS gretl, and also in the current Windows 
snapshot at

http://ricardo.ecn.wfu.edu/pub/gretl/gretl_install.exe

You may have to engage in some trial and error.  I've beefed up 
the error reporting a little.  So, in relation to the example 
above, if you do

 set codevars NAICSP 
 open ss06pnc.csv

you then see:

 Variable 106 (SOCP), observation 12, '434XXX':
 Extraneous character 'X' in data

which in effect tells you that you need to add SOCP to the 
"codevars" list -- if it seems to you that 434XXX is a legitimate 
value for that variable.

</initial response>

Now here's my question.  I wonder if it might be better (or 
complementary, perhaps) to add an option flag to open/import, that 
forces gretl to treat all data columns containing non-numeric 
values as legitimate codings.  (There could be a corresponding 
checkbox in the GUI.)

Internally, this would require two passes through the file, one to 
assess which variables need special treatment, and a second to 
actually read (and code) the data.
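
A minimal sketch of the two-pass idea, in Python rather than 
gretl's C, might look like this.  It is only meant to show the 
shape of the approach: the function name two_pass_read is 
hypothetical, the input is taken as a string for simplicity, and 
empty fields are assumed to be missing values.

```python
import csv
import io


def looks_numeric(field):
    """True if the field is empty (assumed missing) or parses as a number."""
    if field.strip() == "":
        return True
    try:
        float(field)
        return True
    except ValueError:
        return False


def two_pass_read(text):
    """Read CSV text, string-coding every column with a non-numeric value.

    Returns (data, tables): data maps column name to a list of values,
    tables maps each coded column name to its string table.
    """
    # Pass 1: find the columns that need string coding.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    coded = set()
    for row in reader:
        for name, field in zip(header, row):
            if not looks_numeric(field):
                coded.add(name)

    # Pass 2: read the data, building a string table per coded column.
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    tables = {name: {} for name in header if name in coded}
    data = {name: [] for name in header}
    for row in reader:
        for name, field in zip(header, row):
            if name in coded:
                table = tables[name]
                # Codes are assigned in order of first appearance.
                data[name].append(table.setdefault(field, len(table) + 1))
            else:
                data[name].append(float(field) if field.strip() else None)
    return data, tables
```

Here a column with values 1133, 113M, 1133 would be coded 1, 2, 1 
even though its first value looks numeric, which is exactly the 
case the first-row rule cannot handle.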

The general issue here is that non-numeric values are sometimes 
legit, but sometimes reflect a screwed-up data file.  It might be 
useful for the user to be able to say, "I know that anything 
non-numeric in this file is in fact legit".

Allin.


