In which I learn how not to set my computer on fire by trying to read a 3GB CSV file.

Here I'll talk about some tricks for working with larger-than-memory data sets in Python, using our good friend pandas as well as the standard library module sqlite3 for interfacing with local (on your machine) databases. If you have a laptop with 4 or even 8 GB of RAM, you could find yourself in this position with even just a normal Kaggle competition (I certainly did, hence this post). We'll cover:

- Pandas nrows and skiprows kwargs for read_csv
- Pandas dtypes and astype for "compressing" features
- sqlite relational database software and super basic SQL syntax
- Pandas read_sql_query for pulling from a database to a dataframe
- Pandas to_sql for writing a dataframe to a database

For the purposes of this post I'm going to use a laughably small data set: a set of bank customers with some demographic information, stored in data.csv, which has about 15 rows (pretend I said 15 million).

There is actually a lot more that can be done in terms of processing data during pandas read_csv. Check out this SO answer and also the official read_csv docs for more info. Among other things, you can:

- parse columns to datetime (parse_dates kwarg)
- apply arbitrary conversion functions to each column (converters kwarg)
- specify values that should be recognized as NaN (na_values kwarg)
- specify reading a selection of columns (usecols kwarg)

A lot of the above settings for read_csv will not work with columns that have NaN. The pandas representation for missing data, np.nan, is actually a float64 data type, so if you have e.g. a binary column of 0s and 1s, then even one NaN will be enough to force it to be typed float64 rather than int8.
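To make the NaN/dtype point concrete, here's a minimal sketch (the column name is_active is invented for illustration):

```python
import numpy as np
import pandas as pd

# A clean binary column can be stored as int8: one byte per value.
clean = pd.Series([0, 1, 1, 0], dtype="int8")
print(clean.dtype)  # int8

# A single np.nan forces the column up to float64,
# because np.nan is itself a float64 value.
dirty = pd.Series([0, 1, np.nan, 0])
print(dirty.dtype)  # float64

# "Compress" a clean integer column (read as int64 by default) down to int8.
df = pd.DataFrame({"is_active": [0, 1, 1, 0]})
df["is_active"] = df["is_active"].astype("int8")
print(df["is_active"].dtype)  # int8
```

The int8 version uses one eighth of the memory of the default int64, which adds up quickly across millions of rows.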
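Here's a sketch of those read_csv kwargs in action, using an in-memory stand-in for a CSV file (the column names and values are invented for illustration):

```python
import io
import pandas as pd

csv = io.StringIO(
    "customer_id,signup_date,age,status\n"
    "1,2020-01-15,34,active\n"
    "2,2020-02-20,?,inactive\n"
)

df = pd.read_csv(
    csv,
    usecols=["signup_date", "age", "status"],  # read only these columns
    parse_dates=["signup_date"],               # parse to datetime64
    na_values=["?"],                           # treat "?" as NaN
    converters={"status": str.upper},          # arbitrary per-column function
)
print(df.dtypes)
```

Note that the "?" in the age column becomes NaN, which (as discussed above) forces that column to float64.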
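A minimal sketch of peeking at a big file with nrows and then stepping through it with skiprows; an in-memory buffer with invented contents stands in for the post's data.csv:

```python
import io
import pandas as pd

# Stand-in for data.csv: a header line plus 15 data rows.
raw = "customer_id,age\n" + "\n".join(f"{i},{20 + i}" for i in range(1, 16))

# Peek at the first 5 rows without loading everything.
preview = pd.read_csv(io.StringIO(raw), nrows=5)
print(len(preview))  # 5

# Grab the next 5 rows: skip data rows 1..5 but keep the header (line 0).
chunk = pd.read_csv(io.StringIO(raw), nrows=5, skiprows=range(1, 6))
print(chunk["customer_id"].tolist())  # [6, 7, 8, 9, 10]
```

Passing a range (rather than an int) to skiprows is what lets you keep the header row while skipping data rows you've already processed.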
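And here's a sketch of the to_sql / read_sql_query round trip with sqlite3, using a toy dataframe (table and column names are invented; an in-memory database stands in for a file on disk):

```python
import sqlite3
import pandas as pd

# Toy stand-in for the bank-customer data.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 27],
})

conn = sqlite3.connect(":memory:")  # a real file path works the same way

# Write the dataframe to a table called "customers".
df.to_sql("customers", conn, index=False, if_exists="replace")

# Pull back only the rows we need, letting the database do the filtering.
young = pd.read_sql_query(
    "SELECT customer_id, age FROM customers WHERE age < 40", conn
)
print(young["customer_id"].tolist())  # [1, 3]
conn.close()
```

The payoff is that the WHERE clause runs inside the database, so only the matching rows ever have to fit in memory.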