[Gretl-devel] Gretl and PUMS Data
Allin Cottrell
cottrell at wfu.edu
Fri Oct 26 09:36:59 EDT 2007
On Fri, 26 Oct 2007, Sven Schreiber wrote:
> Riccardo (Jack) Lucchetti schrieb:
> >
> > A possible alternative may be the following: first, read all
> > the data as if they were all strings. Then, with the data
> > already in RAM, convert to numeric whenever possible. This
> > way, you read the datafile only once, and the way stays open
> > if we want, for instance, to flag some of the variables as
> > dummies or discrete variables straight away.
>
> Jack's idea sounds good. If I understand correctly, it's an
> approach to convert as much as possible to usable variables and
> data, and inform the user about the rest. (Rather than throwing
> errors and stopping.) That would be good.
I like Jack's idea too, with a couple of reservations.
First, I'm not too keen on reading all the data into RAM as
strings. To ensure no data loss, these strings would have to be
fairly long -- say 32 characters. Now with something like PUMS
you can have tens or hundreds of thousands of observations on
hundreds of variables. This makes for a big memory chunk when
stored as doubles, and perhaps 4 times as big when stored as
strings. So I tend to favour two passes.
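
To put rough numbers on the memory argument and show the general shape of a two-pass read, here is a minimal Python sketch. It is not gretl's code. The sizes (100000 observations on 200 variables), the 32-byte string width, the CSV layout, and the function names `classify_columns` and `load_numeric` are all assumptions made for illustration.

```python
import csv
import io

# Assumed sizes, for illustration only.
N_OBS, N_VARS = 100_000, 200
DOUBLE_BYTES, STRING_BYTES = 8, 32

bytes_as_doubles = N_OBS * N_VARS * DOUBLE_BYTES  # 160,000,000 bytes
bytes_as_strings = N_OBS * N_VARS * STRING_BYTES  # 640,000,000 bytes


def is_number(field):
    """Return True if the text field parses as a float."""
    try:
        float(field)
        return True
    except ValueError:
        return False


def classify_columns(stream):
    """Pass 1: scan every row, keeping only one flag per column."""
    reader = csv.reader(stream)
    header = next(reader)
    numeric = [True] * len(header)
    for row in reader:
        for j, field in enumerate(row[:len(numeric)]):
            if numeric[j] and not is_number(field):
                numeric[j] = False
    return header, numeric


def load_numeric(stream, numeric):
    """Pass 2: store only the columns flagged numeric, as floats."""
    reader = csv.reader(stream)
    next(reader)  # skip the header row
    return [[float(f) for f, ok in zip(row, numeric) if ok] for row in reader]


# Hypothetical three-column sample; "state" is the non-numeric column.
sample = "age,state,income\n34,NC,51000\n58,VA,42000.5\n"
header, numeric = classify_columns(io.StringIO(sample))
data = load_numeric(io.StringIO(sample), numeric)
```

The first pass holds only one boolean per column, so the full data set is never kept in memory as strings. The cost is reading the file twice.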
Second, I think that attempting to parse all non-numeric stuff as
coded data should probably be governed by an explicit option.
It'll work fine on a well-formed PUMS file, but on a very large
data file that has a few extraneous non-numeric characters in it,
it could cause a nasty mess: 100% CPU for a long time. Think of
a file with 200000 observations and a stray 'x' on the last row.
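
One way to picture the explicit option is the following Python sketch. It is only an illustration, not gretl's behaviour: the function `convert_column` and its `code_strings` flag are hypothetical names. By default a stray non-numeric field is reported with its row number. Only when the flag is set are the distinct strings mapped to integer codes.

```python
def first_bad_row(values):
    """Return (row, value) for the first non-numeric field, or None."""
    for i, v in enumerate(values, start=1):
        try:
            float(v)
        except ValueError:
            return i, v
    return None


def convert_column(values, code_strings=False):
    """Convert one column of text fields to floats.

    Returns (numbers, codes). `codes` is empty for a purely numeric
    column. A non-numeric field raises ValueError unless the caller
    has opted in to string coding with code_strings=True.
    """
    bad = first_bad_row(values)
    if bad is None:
        return [float(v) for v in values], {}
    if not code_strings:
        row, value = bad
        raise ValueError(f"non-numeric value {value!r} at row {row}")
    codes = {}
    coded = [float(codes.setdefault(v, len(codes) + 1)) for v in values]
    return coded, codes


# A numeric column with a stray 'x' on the last row: reported, not coded.
try:
    convert_column(["1", "2", "3", "x"])
    message = ""
except ValueError as err:
    message = str(err)

# A genuinely coded column, converted only because the option is set.
coded, codes = convert_column(["NC", "VA", "NC"], code_strings=True)
```

With the option off, the large file with one bad character fails quickly with a pointer to the offending row, rather than having every value in the column treated as a code.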
> BTW, on a (only loosely) related issue, it would be useful if
> gretl could handle files like some I recently downloaded from
> the US BLS site; they report quarterly data with an additional
> row for year averages, like so:
>
> 1950Q01 3.5
> 1950Q02 4.2
> 1950Q03 9.4
> 1950Q04 5.3
> 1950Q05 <you do the calc ;-)>
Yes, I've seen data of that sort too. I'll think about that
issue.
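
For reference, the kind of handling in question could look like the Python sketch below. It assumes, following the quoted example, that a period label ending in Q05 marks the annual-average row. The function name `split_bls_rows` is hypothetical, and the 5.6 in the sample is simply the mean of the four quarterly values quoted above, not a figure taken from BLS.

```python
def split_bls_rows(lines):
    """Separate quarterly observations from annual-average rows.

    Each line is assumed to look like '1950Q01 3.5', with quarter
    number 5 standing for the annual average.
    """
    quarterly, annual = [], []
    for line in lines:
        period, value = line.split()
        year, quarter = period.split("Q")
        target = annual if int(quarter) == 5 else quarterly
        target.append((int(year), int(quarter), float(value)))
    return quarterly, annual


rows = [
    "1950Q01 3.5",
    "1950Q02 4.2",
    "1950Q03 9.4",
    "1950Q04 5.3",
    "1950Q05 5.6",  # annual average, computed here for the example
]
quarterly, annual = split_bls_rows(rows)

# Recompute the annual mean from the four quarters (approximately 5.6).
mean_1950 = sum(v for _, _, v in quarterly) / len(quarterly)
```

Dropping the annual rows leaves a regular quarterly series, and the recomputed mean offers a cheap consistency check against the value in the file.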
Allin.