
Friday, October 15, 2010

An example of processing a large volume of text data

I have some data saved in text format with dimensions 300*7*100*150, which represent time*components*latitudes*longitudes. I want to find the value at a specific time and location in this domain.

Luckily I can find the indices of the point from other calculations, i.e., time_ind, component_ind, latitude_ind and longitude_ind. The question is how to find the value at this point efficiently. The old way was to read in the whole dataset and then look up the value. Since we have several large arrays like this, that method takes a huge amount of memory.

The more efficient way is (for example, in Python):
--First compute the flat index of the data point in the domain, which is time_ind*components*latitudes*longitudes + component_ind*latitudes*longitudes + latitude_ind*longitudes + longitude_ind

--f = open('file.txt','r') ---create the file object
temp = f.next() -----read the next value
--Loop over the file with the next() method, incrementing a counter 'count'. When 'count' equals the flat index of the data point, stop the loop.
--Use the value of temp as the output result.
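The steps above can be sketched as follows. This is a minimal sketch, not the original script: the function name, the default dimensions, and the assumption that the file stores one value per line in C order (longitude varying fastest) are all mine. It uses modern file iteration, which does the same job as the f.next() loop.

```python
def read_value(path, time_ind, component_ind, latitude_ind, longitude_ind,
               components=7, latitudes=100, longitudes=150):
    # Flat index of the requested point, as computed in the first step.
    target = (time_ind * components * latitudes * longitudes
              + component_ind * latitudes * longitudes
              + latitude_ind * longitudes
              + longitude_ind)
    # Read the file one line at a time; only one small string is ever
    # held in memory, no matter how large the file is.
    with open(path) as f:
        for count, line in enumerate(f):
            if count == target:
                return float(line)
    raise IndexError("index is beyond the end of the file")
```

For example, read_value('file.txt', time_ind, component_ind, latitude_ind, longitude_ind) returns the single requested value without ever loading the full array.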

With this method the machine reads only a small portion of the text data at a time (the length is determined by the text format) and uses very little memory; the f.next() method is also fast. Looking at the 'top' output, only about 1% of memory is used, compared to possibly 30% with the first method. A test with the 'time' command shows the time cost is much reduced as well.

A tip: when looping over a large range of values, 'xrange' uses less memory than 'range', and is slightly faster.
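This tip applies to Python 2, where range() builds a full list. In Python 3, range() itself is lazy (xrange is gone), which can be checked by comparing object sizes; the size comparison below is my own illustration:

```python
import sys

# range() in Python 3 stores only start/stop/step, not the numbers themselves.
lazy = range(10**6)          # constant-size object
eager = list(range(10**6))   # materializes a million integers

print(sys.getsizeof(lazy))   # a few dozen bytes
print(sys.getsizeof(eager))  # several megabytes
```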
