[Gretl-devel] Gretl and PUMS Data

Riccardo (Jack) Lucchetti r.lucchetti at univpm.it
Fri Oct 26 10:43:21 EDT 2007


On Fri, 26 Oct 2007, Allin Cottrell wrote:

> On Fri, 26 Oct 2007, Sven Schreiber wrote:
>
>> Riccardo (Jack) Lucchetti schrieb:
>>>
>>> A possible alternative may be the following: first, read all
>>> the data as if they were all strings. Then, with the data
>>> already in RAM, convert to numeric whenever possible. This
>>> way, you read the datafile only once, and the way stays open
>>> if we want to, for instance, flag some of the variables as
>>> dummies or discrete variables straight away.
>>
>> Jack's idea sounds good. If I understand correctly, it's an
>> approach that converts as much as possible into usable
>> variables and data, and informs the user about the rest
>> (rather than throwing errors and stopping). That would be
>> good.
>
> I like Jack's idea too, with a couple of reservations.
>
> First, I'm not too keen on reading all the data into RAM as
> strings.  To ensure no data loss, these strings would have to be
> fairly long -- say 32 characters.  Now with something like PUMS
> you can have tens or hundreds of thousands of observations on
> hundreds of variables.  This makes for a big memory chunk when
> stored as doubles, and perhaps 4 times as big when stored as
> strings.  So I tend to favour two passes.

True. Still, it's not inconceivable to allow the in-RAM policy for small 
files and the two-pass policy for larger ones. To put numbers on Allin's 
point: 100000 observations on 200 variables come to roughly 160MB when 
stored as doubles, and four times that as 32-character strings, so the 
cutoff matters. Clearly, this would require some heuristics, but...
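
To make that concrete, here's the sort of thing I have in mind (an 
untested sketch; the POLICY_* constants and the 32MB threshold are 
invented for the example):

#include <sys/stat.h>

enum { POLICY_STRINGS_IN_RAM, POLICY_TWO_PASSES };

#define RAM_POLICY_MAX (32 * 1024 * 1024) /* cutoff: pure guesswork */

/* pick an import strategy from the file size alone */
static int choose_import_policy (const char *fname)
{
    struct stat buf;

    if (stat(fname, &buf) != 0) {
        return -1; /* can't stat the file: let the caller bail out */
    }

    return (buf.st_size <= RAM_POLICY_MAX)?
        POLICY_STRINGS_IN_RAM : POLICY_TWO_PASSES;
}

Of course, file size is only a crude proxy for memory use, but it's 
available for free before any parsing starts.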

> Second, I think that attempting to parse all non-numeric stuff as
> coded data should probably be governed by an explicit option.
> It'll work fine on a well-formed PUMS file, but could cause a
> nasty mess with a very large data file that has a few extraneous
> non-numeric characters in it: 100% CPU for a long time.  Think of
> a file with 200000 observations and a stray 'x' on the last row.
>
>> BTW, on a (only loosely) related issue, it would be useful if
>> gretl could handle files like some I recently downloaded from
>> the US BLS site; they report quarterly data with an additional
>> row for year averages, like so:
>>
>> 1950Q01 3.5
>> 1950Q02 4.2
>> 1950Q03 9.4
>> 1950Q04 5.3
>> 1950Q05 <you do the calc ;-)>
>
> Yes, I've seen data of that sort too.  I'll think about that
> issue.

The last two points are related IMO. It's very nice from the user's point 
of view to have gretl handle such cases sensibly, but in the end it's the 
user's responsibility to feed a decently-formed CSV file into gretl. 
No-one can reasonably complain if gretl (or any other program, for that 
matter) refuses to read a CSV file which contains a stray 'x' at the end. 
As for Sven's case, it'd be rather easy to do a

grep -v Q05 originalfile.csv > modifiedfile.csv

(pity those poor souls who lack Unix tools). My point is that we should 
not try to cover internally all possible cases that occur in practice; 
there's always going to be one more special case, and there are tools for 
this.
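
That said, protecting against Allin's stray-'x' scenario doesn't take 
much: if each field is checked strictly while parsing, the bad row is 
caught at once instead of eating CPU. Something along these lines (an 
untested sketch; the function name is invented):

#include <errno.h>
#include <stdlib.h>

/* strict field check: return 1 and store the value only if the
   whole string parses as a number */
static int parse_numeric_field (const char *s, double *px)
{
    char *endp;
    double x;

    errno = 0;
    x = strtod(s, &endp);

    if (endp == s || *endp != '\0' || errno == ERANGE) {
        return 0; /* e.g. a stray 'x': report it, don't guess */
    }

    *px = x;
    return 1;
}

On failure the caller can report the offending row and column and stop, 
which is arguably all a program can be expected to do with such a file.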

Riccardo (Jack) Lucchetti
Dipartimento di Economia
Università Politecnica delle Marche

r.lucchetti at univpm.it
http://www.econ.univpm.it/lucchetti

