What format to export pandas dataframe while retaining data types? Not CSV; Sqlite? Parquet?


My workflow typically involves loading data, usually from CSV files, into a pandas dataframe, cleansing it, defining the right data type for each column, and then exporting it to a SQL server.



For those situations when a SQL server is not available, what are good alternatives to store the cleansed data and the explicit definition of the data type for each column?



  • The only real solution I have tested is to export to a sqlite .db file, using the answer here to make sure dates are read as dates.

  • How about Feather, HDF5, Parquet? Pandas supports them, but I don't know much about these formats. I have read that Feather is not recommended for long-term storage (because the API may change? It's not clear). A minimal Parquet round-trip sketch is shown after this list.


  • I am not sure about using pickle: I understand it's not a secure format, and the API keeps changing and breaking backwards compatibility.



  • CSV is not really an option because inferring data types on my data is often a nightmare; when reading the data back into pandas, I'd need to explicitly declare the formats, including the date format, otherwise:



    • pandas can create columns where one row is dd-mm-yyyy and another row is mm-dd-yyyy (see here). Plus

    • I have many text columns where the first 10k rows seem to be numbers, and the next 100 are text, so most software will infer the column is numeric, then fail on the import. Maybe I'd need to create a function which exports an ancillary file with all the data type definitions, date formats, etc? Feasible but cumbersome.
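For the Feather/HDF5/Parquet bullet above, here is a minimal Parquet round-trip sketch (my own illustration, not part of the original post; the column names and values are made up, and it assumes pyarrow or fastparquet is installed):

# illustrative data only; to_parquet/read_parquet need pyarrow or fastparquet installed
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['a', 'b', 'c'],
    'when': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01']),
})

df.to_parquet('data.parquet')           # the column dtypes are stored in the file's schema
df2 = pd.read_parquet('data.parquet')
print(df2.dtypes)                       # int64, object, datetime64[ns] come back unchanged

Because Parquet stores the schema with the data, numeric, string and datetime64 columns generally come back with their original dtypes and no ancillary type file is needed.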


UPDATE: This is an interesting comparison, according to which HDF5 was the fastest format: https://medium.com/@bobhaffner/gist-to-medium-test-db3d51b8ba7b



I seem to understand that another difference between HDF5 and Parquet is that datetime64 has no direct equivalent in HDF5; most people seem to store their dates in HDF5 as ISO-formatted (yyyy-mm-dd) strings.
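(Side note added for illustration, not part of the original question: pandas' own HDF5 layer, to_hdf/read_hdf via the PyTables package, stores pandas-specific dtype metadata alongside the raw data, so datetime64 columns usually survive a round trip even though plain HDF5 has no native datetime type. A minimal sketch, reusing the df from the Parquet example above:)

# requires the PyTables package ('tables'); 'df' is the illustrative frame from above
df.to_hdf('data.h5', key='df', mode='w')
df_back = pd.read_hdf('data.h5', 'df')
print(df_back.dtypes)                   # datetime64[ns] is typically restored from pandas' metadata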







python pandas parquet feather






asked Mar 25 at 17:27, edited Mar 25 at 18:20 – Pythonista anonymous












  • How big is your data?

    – Erfan
    Mar 25 at 17:30











  • Not "big data" territory! In 80-85% of cases I deal with tables which are not huge: 10 to 100 MB. In 15-20% of cases I deal with tables in the 100 MB to 1 GB range. I have, so far, never dealt with tables > 1 GB. I'm talking about the size of uncompressed CSVs.

    – Pythonista anonymous
    Mar 25 at 17:31












  • I think you will have the best luck with xlsx, since it will mostly retain the data type in the broad sense of numeric, text and dates. But to be 100% sure, you will need a SQL server, which is not an option for you.

    – Erfan
    Mar 25 at 18:07











  • Apart from the fact that reading xlsx is much slower than reading most other formats into pandas, what would be the advantages of using xlsx over a sqlite .db file, HDF5 or Parquet? You cannot define data types in Excel, which is a deal breaker for me. I have long numbers (> 16 digits) which Excel cannot handle, so it chops off the last digits and converts them to zeros. Gene names are reformatted as dates (look it up). All these things are dealbreakers and make xlsx unacceptable for me.

    – Pythonista anonymous
    Mar 25 at 18:11

















2 Answers






If you really want to avoid pickle and saving a CSV (I don't fully agree with your statements about those not being feasible options) then you could run a local database server to save the data in and do a dump/restore process when the SQL server is available again. Otherwise:



Use the to_pickle method of the DataFrame object.



Or, save a data type json file with your data types and specify your date format when saving the CSV:



# export
import json
data_types = df.dtypes.astype(str).to_dict()
with open('data_type_key.json', 'w') as f:
    json.dump(data_types, f)
df.to_csv('data.csv', date_format='%Y%m%d')

# import
with open('data_type_key.json') as f:
    data_types = json.load(f)   # json.load reads the file; json.loads would try to parse the path string itself
data_frame = pd.read_csv(your_csv_path, dtype=data_types)
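One caveat (my addition, not part of the original answer): read_csv does not accept datetime64 entries in the dtype mapping, so date columns have to be split out of the saved key and passed through parse_dates instead. A sketch reusing the data_type_key.json and data.csv written above:

import json
import pandas as pd

with open('data_type_key.json') as f:
    saved_types = json.load(f)

# datetime columns go to parse_dates, everything else stays in dtype
date_cols = [col for col, dt in saved_types.items() if dt.startswith('datetime64')]
other_types = {col: dt for col, dt in saved_types.items() if col not in date_cols}

data_frame = pd.read_csv('data.csv', dtype=other_types, parse_dates=date_cols)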





answered Mar 25 at 18:00, edited Mar 25 at 18:37 – chet-the-wizard
  • Did you read the question? He says CSV is not an option since he has to specify the dtype manually.

    – Erfan
    Mar 25 at 18:04






  • Like I said very clearly above, I am not convinced by pickle, partly because it may be unsafe but mostly because it is not recommended for long-term storage, as backward compatibility is not guaranteed.

    – Pythonista anonymous
    Mar 25 at 18:04






  • It seems you didn't really read my question. I also explained why using CSVs (at least in the way you describe) wouldn't work for me.

    – Pythonista anonymous
    Mar 25 at 18:05






  • @Pythonistaanonymous pickle is only unsafe if you are loading data from untrusted sources, because it can run arbitrary Python code. If that isn't an issue, you may as well say "I can't use Python source code because it is unsafe". The major compatibility issues with pickle are more to do with Python 2 vs 3. But you can always explicitly fix a pickle protocol, and as long as you aren't trying to make it 2/3-compatible, there shouldn't be an issue.

    – juanpa.arrivillaga
    Mar 25 at 18:12







  • @Pythonistaanonymous you can use DataFrame.dtypes to create a series of datatypes and save that as JSON as a key for reloading a CSV, if you are concerned about there being too many columns to explicitly state the dtypes.

    – chet-the-wizard
    Mar 25 at 18:15
































If your data is a 2-dimensional table and is meant for big-data processing like Apache Spark, use Parquet. HDF5 is not good for handling date/time, as you mentioned.



If your data has 3 or more dimensions, HDF5 will be a good choice - especially for long-term archiving, portability, and sharing.



Feather (from Apache Arrow) is the fastest if performance matters.
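(Illustrative addition, not from the original answer.) A minimal Feather round-trip sketch, assuming pyarrow is installed; the reset_index is there because to_feather refuses non-default indexes in the pandas versions I have seen:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'when': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01'])})

# Feather (backed by Apache Arrow) keeps column dtypes, including datetime64
df.reset_index(drop=True).to_feather('data.feather')
df_back = pd.read_feather('data.feather')
print(df_back.dtypes)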






answered Apr 3 at 4:25 – HDFEOS.org
  • I have seen many comments that Parquet would be better than feather for long-term storage, but it's not really clear to me why. stackoverflow.com/questions/48083405/…

    – Pythonista anonymous
    Apr 3 at 11:45











  • Also, with none of these formats is it particularly easy to read the data in a Windows app, or to import it into a SQL server while skipping Python altogether: stackoverflow.com/questions/50933429/…

    – Pythonista anonymous
    Apr 3 at 11:47











  • Parquet takes up a third to a half of the space of the equivalent Feather file. That's the only difference I've noticed related to storage.

    – user108569
    2 days ago












