[Gretl-devel] Gretl and PUMS Data

Allin Cottrell cottrell at wfu.edu
Fri Oct 26 09:36:59 EDT 2007

On Fri, 26 Oct 2007, Sven Schreiber wrote:

> Riccardo (Jack) Lucchetti schrieb:
> > 
> > A possible alternative may be the following: first, read all 
> > the data as if they were all strings. Then, with the data 
> > already in RAM, convert to numeric whenever possible. This 
> > way, you read the datafile only once, and the way stays open 
> > if we want to, for instance, flag some of the variables as 
> > dummies or discrete variables straight away.
> Jack's idea sounds good. If I understand correctly, it's an 
> approach to convert as much as possible to usable variables and 
> data, and inform the user about the rest. (Rather than throwing 
> errors and stopping.) That would be good.

I like Jack's idea too, with a couple of reservations.  

First, I'm not too keen on reading all the data into RAM as 
strings.  To ensure no data loss, these strings would have to be 
fairly long -- say 32 characters.  Now with something like PUMS 
you can have tens or hundreds of thousands of observations on 
hundreds of variables.  This makes for a big memory chunk when 
stored as doubles, and perhaps 4 times as big when stored as 
strings.  So I tend to favour two passes.
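
To make the two-pass idea concrete, here is a minimal sketch in 
Python (not gretl's actual C code). It assumes a comma-separated 
file with a header row, and the names scan_columns and load_numeric 
are made up for illustration. The first pass only records which 
columns are fully numeric, so nothing is held in RAM as strings; 
the second pass stores just those columns as floating-point values 
(8 bytes each as C doubles, against 32 bytes for a string buffer, 
which is where the factor of 4 comes from).

```python
import csv
import io
import math

def is_number(s):
    """True if s parses as a float; an empty field counts as missing, not text."""
    s = s.strip()
    if s == "":
        return True
    try:
        float(s)
        return True
    except ValueError:
        return False

def scan_columns(f):
    """Pass 1: note, per column, whether every field is numeric. Stores no data."""
    reader = csv.reader(f)
    names = next(reader)
    numeric = [True] * len(names)
    nobs = 0
    for row in reader:
        nobs += 1
        for j, field in enumerate(row[:len(names)]):
            if numeric[j] and not is_number(field):
                numeric[j] = False
    return names, numeric, nobs

def load_numeric(f, numeric):
    """Pass 2: keep only the numeric columns, as floats (NaN for missing)."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    cols = [[] for ok in numeric if ok]
    for row in reader:
        k = 0
        for j, field in enumerate(row[:len(numeric)]):
            if numeric[j]:
                field = field.strip()
                cols[k].append(float(field) if field else math.nan)
                k += 1
    return cols

# Hypothetical three-row sample; a real file would be opened twice instead.
sample = "age,state,income\n34,NC,51000\n27,VA,\n45,NC,62000.5\n"
names, numeric, nobs = scan_columns(io.StringIO(sample))
cols = load_numeric(io.StringIO(sample), numeric)
```

Here the "state" column is flagged as non-numeric in the first pass 
and could be reported to the user instead of stopping the import.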

Second, I think that attempting to parse all non-numeric stuff as 
coded data should probably be governed by an explicit option.  
It'll work fine on a well-formed PUMS file, but could cause a 
nasty mess with a very large data file that has a few extraneous 
non-numeric characters in it, with 100% CPU for a long time.  Think of 
a file with 200000 observations and a stray 'x' on the last row.
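
A rough sketch of what "governed by an explicit option" might look 
like, again in Python rather than gretl's C. The function name 
convert_column and the allow_codes flag are hypothetical. With the 
flag off, the first non-numeric field is reported as an error; only 
with the flag on is the whole column recoded as integer codes with 
an accompanying string table.

```python
def convert_column(fields, allow_codes=False):
    """Convert a column of text fields to numbers.

    Returns (values, table). table is None for a plain numeric column;
    otherwise it maps each distinct string to its integer code.
    """
    values = []
    for i, s in enumerate(fields):
        try:
            values.append(float(s))
        except ValueError:
            if not allow_codes:
                # Option off: fail fast instead of recoding everything.
                raise ValueError(
                    f"non-numeric field {s!r} at row {i + 1}"
                ) from None
            break
    else:
        return values, None
    # Option on: treat every field in the column as a coded string.
    table = {}
    coded = [table.setdefault(s, len(table) + 1) for s in fields]
    return coded, table

# A genuinely coded column, converted because the option was requested.
coded, table = convert_column(["NC", "VA", "NC"], allow_codes=True)

# A numeric column with a stray 'x' on the last row: rejected by default.
try:
    convert_column(["1.5", "2.0", "x"])
    stray_rejected = False
except ValueError:
    stray_rejected = True
```

Note that with allow_codes=True the stray 'x' case would turn every 
distinct numeric string into its own code, which is the mess the 
option is meant to prevent by default.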

> BTW, on a (only loosely) related issue, it would be useful if 
> gretl could handle files like some I recently downloaded from 
> the US BLS site; they report quarterly data with an additional 
> row for year averages, like so:
> 1950Q01 3.5
> 1950Q02 4.2
> 1950Q03 9.4
> 1950Q04 5.3
> 1950Q05 <you do the calc ;-)>

Yes, I've seen data of that sort too.  I'll think about that.
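
For illustration only, one possible way to treat such rows, sketched 
in Python: split off the "Q05" lines as annual averages and keep 
Q01 to Q04 as the quarterly series. The function name 
split_quarterly is hypothetical, the layout is assumed to be 
"period value" separated by whitespace as in the example above, and 
the Q05 value in the sample is simply the mean of the four quarters 
shown.

```python
import re

# Period labels such as "1950Q01": four-digit year, 'Q', two-digit period.
PERIOD = re.compile(r"^(\d{4})Q(\d{2})$")

def split_quarterly(lines):
    """Separate true quarterly rows (Q01-Q04) from annual-average rows (Q05)."""
    quarterly, annual = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "period value" row
        m = PERIOD.match(parts[0])
        if m is None:
            continue
        year, q = int(m.group(1)), int(m.group(2))
        value = float(parts[1])
        if 1 <= q <= 4:
            quarterly.append((year, q, value))
        elif q == 5:
            annual.append((year, value))
    return quarterly, annual

rows = [
    "1950Q01 3.5",
    "1950Q02 4.2",
    "1950Q03 9.4",
    "1950Q04 5.3",
    "1950Q05 5.6",  # assumed value: mean of the four quarters above
]
quarterly, annual = split_quarterly(rows)
```

The quarterly list can then be imported as an ordinary quarterly 
series, with the annual rows either dropped or kept for checking.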


