What format to export pandas dataframe while retaining data types? Not CSV; Sqlite? Parquet?


My workflow typically involves loading data, usually from CSV files, into a pandas dataframe, cleansing it, defining the right data type for each column, and then exporting it to a SQL server.



For those situations when a SQL server is not available, what are good alternatives to store the cleansed data and the explicit definition of the data type for each column?



  • The only real solution I have tested is to export to a sqlite .db file, using the answer here to make sure dates are read as dates.

  • How about Feather, HDF5, Parquet? Pandas supports them, but I don't know much about these formats. I have read that Feather is not recommended for long-term storage (because the API may change? It's not clear). A minimal Parquet round-trip sketch is shown after this list.


  • I am not sure about using pickle: I understand it's not a secure format, and the API keeps changing and breaking backwards compatibility.



  • CSV is not really an option because inferring data types on my data is often a nightmare; when reading the data back into pandas, I'd need to explicitly declare the formats, including the date format, otherwise:



    • pandas can create columns where one row is dd-mm-yyyy and another row is mm-dd-yyyy (see here). Plus

    • I have many text columns where the first 10k rows seem to be numbers, and the next 100 are text, so most software will infer the column is numeric, then fail on the import. Maybe I'd need to create a function which exports an ancillary file with all the data type definitions, date formats, etc? Feasible but cumbersome.
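For the Feather/HDF5/Parquet bullet above, here is a minimal Parquet round-trip sketch (my own illustration, not part of the original post; the column names and values are made up, and it assumes pyarrow or fastparquet is installed):

# illustrative data only; to_parquet/read_parquet need pyarrow or fastparquet installed
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['a', 'b', 'c'],
    'when': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01']),
})

df.to_parquet('data.parquet')           # the column dtypes are stored in the file's schema
df2 = pd.read_parquet('data.parquet')
print(df2.dtypes)                       # int64, object, datetime64[ns] come back unchanged

Because Parquet stores the schema with the data, numeric, string and datetime64 columns generally come back with their original dtypes and no ancillary type file is needed.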


UPDATE: This is an interesting comparison, according to which HDF5 was the fastest format: https://medium.com/@bobhaffner/gist-to-medium-test-db3d51b8ba7b



I seem to understand that another difference between HDF5 and Parquet is that datetime64 has no direct equivalent in HDF5; most people seem to store their dates in HDF5 as ISO-formatted (yyyy-mm-dd) strings.
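(Side note added for illustration, not part of the original question: pandas' own HDF5 layer, to_hdf/read_hdf via the PyTables package, stores pandas-specific dtype metadata alongside the raw data, so datetime64 columns usually survive a round trip even though plain HDF5 has no native datetime type. A minimal sketch, reusing the df from the Parquet example above:)

# requires the PyTables package ('tables'); 'df' is the illustrative frame from above
df.to_hdf('data.h5', key='df', mode='w')
df_back = pd.read_hdf('data.h5', 'df')
print(df_back.dtypes)                   # datetime64[ns] is typically restored from pandas' metadata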







python pandas parquet feather






asked Mar 25 at 17:27, edited Mar 25 at 18:20 – Pythonista anonymous












  • How big is your data?

    – Erfan
    Mar 25 at 17:30











  • Not "big data" territory! In 80-85% of cases I deal with tables which are not huge: 10 to 100 MB. In 15-20% of cases I deal with tables in the 100 MB to 1 GB range. I have, so far, never dealt with tables > 1 GB. I'm talking about the size of uncompressed CSVs.

    – Pythonista anonymous
    Mar 25 at 17:31












  • I think you will have the best luck with xlsx, since it will mostly retain the data type in the broad sense of numeric, text and dates. But to be 100% sure, you will need a SQL server, which is not an option for you.

    – Erfan
    Mar 25 at 18:07











  • Apart from the fact that reading xlsx is much slower than reading most other formats into pandas, what would be the advantages of using xlsx over a sqlite .db file, HDF5 or Parquet? You cannot define data types in Excel, which is a deal breaker for me. I have long numbers (> 16 digits) which Excel cannot handle, so it chops off the last digits and converts them to zeros. Gene names are reformatted as dates (look it up). All these things are dealbreakers and make xlsx unacceptable for me.

    – Pythonista anonymous
    Mar 25 at 18:11

















2 Answers






If you really want to avoid pickle and saving a CSV (I don't fully agree with your statements about those not being feasible options) then you could run a local database server to save the data in and do a dump/restore process when the SQL server is available again. Otherwise:



Use the to_pickle method of the DataFrame object.



Or, save a data type json file with your data types and specify your date format when saving the CSV:



# export
import json
data_types = df.dtypes.astype(str).to_dict()
with open('data_type_key.json', 'w') as f:
    json.dump(data_types, f)
df.to_csv('data.csv', date_format='%Y%m%d')

# import
with open('data_type_key.json') as f:
    data_types = json.load(f)   # json.load reads the file; json.loads would try to parse the path string itself
data_frame = pd.read_csv(your_csv_path, dtype=data_types)
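One caveat (my addition, not part of the original answer): read_csv does not accept datetime64 entries in the dtype mapping, so date columns have to be split out of the saved key and passed through parse_dates instead. A sketch reusing the data_type_key.json and data.csv written above:

import json
import pandas as pd

with open('data_type_key.json') as f:
    saved_types = json.load(f)

# datetime columns go to parse_dates, everything else stays in dtype
date_cols = [col for col, dt in saved_types.items() if dt.startswith('datetime64')]
other_types = {col: dt for col, dt in saved_types.items() if col not in date_cols}

data_frame = pd.read_csv('data.csv', dtype=other_types, parse_dates=date_cols)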





answered Mar 25 at 18:00, edited Mar 25 at 18:37 – chet-the-wizard
  • Did you read the question? He says CSV is not an option since he has to specify the dtype manually.

    – Erfan
    Mar 25 at 18:04






  • Like I said very clearly above, I am not convinced by pickle, partly because it may be unsafe but mostly because it is not recommended for long-term storage, as backward compatibility is not guaranteed.

    – Pythonista anonymous
    Mar 25 at 18:04






  • It seems you didn't really read my question. I also explained why using CSVs (at least in the way you describe) wouldn't work for me.

    – Pythonista anonymous
    Mar 25 at 18:05






  • @Pythonistaanonymous pickle is only unsafe if you are loading data from untrusted sources, because it can run arbitrary Python code. If that isn't an issue, you may as well say "I can't use Python source code because it is unsafe". The major compatibility issues with pickle are more to do with Python 2 vs 3. But you can always explicitly fix a pickle protocol, and as long as you aren't trying to make it 2/3-compatible, there shouldn't be an issue.

    – juanpa.arrivillaga
    Mar 25 at 18:12







  • @Pythonistaanonymous you can use DataFrame.dtypes to create a series of datatypes and save that as JSON as a key for reloading a CSV, if you are concerned about there being too many columns to explicitly state the dtypes.

    – chet-the-wizard
    Mar 25 at 18:15
































If your data is a 2-dimensional table and is meant for big-data processing like Apache Spark, use Parquet. HDF5 is not good for handling date/time, as you mentioned.



If your data has 3 or more dimensions, HDF5 will be a good choice - especially for long-term archiving, portability, and sharing.



Feather (from Apache Arrow) is the fastest if performance matters.
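(Illustrative addition, not from the original answer.) A minimal Feather round-trip sketch, assuming pyarrow is installed; the reset_index is there because to_feather refuses non-default indexes in the pandas versions I have seen:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'when': pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01'])})

# Feather (backed by Apache Arrow) keeps column dtypes, including datetime64
df.reset_index(drop=True).to_feather('data.feather')
df_back = pd.read_feather('data.feather')
print(df_back.dtypes)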






answered Apr 3 at 4:25 – HDFEOS.org
  • I have seen many comments that Parquet would be better than feather for long-term storage, but it's not really clear to me why. stackoverflow.com/questions/48083405/…

    – Pythonista anonymous
    Apr 3 at 11:45











  • Also, with none of these formats is it particularly easy to read the data in a Windows app, or to import it into a SQL server while skipping Python altogether: stackoverflow.com/questions/50933429/…

    – Pythonista anonymous
    Apr 3 at 11:47











  • Parquet takes up a third to a half of the space of the equivalent Feather file. That's the only difference I've noticed related to storage.

    – user108569
    2 days ago












