Time Optimization for pandas dataframe reconstruction (random to fixed sampling)


I am very new to Python and pandas, and my limited experience has led me to an inefficient solution that makes my code too slow.



I have some data corresponding to stock market prices.
Sampling is random, at nanosecond resolution.

What I am trying to achieve is to transform it into a new data set with a fixed sampling rate.



I am transforming my data set as follows:



  • I'm setting a time_delta as a static time step of 0.5 seconds.

  • I'm dropping records that fall on the same nanosecond.

  • I'm generating timestamps from my start_time to the calculated end_time.

  • I'm iterating through my original dataframe, copying (and duplicating when needed) the last known record within each time_delta step into a new dataframe (see the toy sketch after this list).
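
To make the last step concrete, here is a minimal toy sketch of my reading of it ("take the last known record at or before each fixed timestamp"); the 'Price' column and the tiny 500 ns grid are invented purely for illustration:

import numpy as np
import pandas as pd

# Toy version of the irregular data: integer nanosecond timestamps plus one
# value column ('Price' is a made-up placeholder; only 'TimeStamp' matches my data).
df_toy = pd.DataFrame({
    'TimeStamp': np.array([100, 340, 410, 990, 1830], dtype=np.int64),
    'Price': [10.0, 10.5, 10.4, 10.6, 10.2],
})

# Fixed grid: one sample every 500 ns.
grid = pd.DataFrame({'TimeStamp': np.arange(500, 2500, 500, dtype=np.int64)})

# "Last known record at or before each grid point" expressed as an as-of merge.
resampled = pd.merge_asof(grid, df_toy, on='TimeStamp', direction='backward')
print(resampled)
#    TimeStamp  Price
# 0        500   10.4
# 1       1000   10.6
# 2       1500   10.6
# 3       2000   10.2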

I believe my issue is that I am appending the records one by one to my new dataframe, but I haven't been able to figure out a way to optimize the code using pandas built-ins.



Runtime is currently ~4 min for a day's data (turning ~30K samples into 57,600) when executing on Google Colab.

I've also tested locally without any improvement.




import numpy as np
from tqdm import tqdm

# df (the raw records) and ceil_to_minute() are defined earlier in my script.

# ====================================================================
# Rate Re-Definition
# ====================================================================

SAMPLES_PER_SECOND = 2
dt = 1000000000 / SAMPLES_PER_SECOND  # Time delta in nanoseconds
SECONDS_IN_WORK_DAY = 28800  # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES

start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD

fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64)


# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================

df1 = df.drop_duplicates(subset='TimeStamp', keep="last")


# ====================================================================
# Construct new dataframe
# ====================================================================

df2 = df1.iloc[0:1]
index_bounds_limit = df1.shape[0] - 1
index = 0

for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
    # Advance until the current record's timestamp reaches the grid point
    # (or we run out of records), then copy that record to the new dataframe.
    while index < index_bounds_limit and df1['TimeStamp'].iloc[index] < fixed_timestamps[i]:
        index += 1
    df2 = df2.append(df1.iloc[index], ignore_index=True)

df2['TimeStamp'] = fixed_timestamps
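
For reference, this is roughly the kind of vectorized replacement for the construction loop that I imagine should be possible (an untested sketch that reuses the variables above and assumes df1 is sorted by 'TimeStamp'); I am not sure it is the idiomatic approach:

# Hypothetical vectorized version of the loop above (not the code I am timing).
# side='left' mirrors the while condition: stop at the first timestamp >= grid point.
positions = np.searchsorted(df1['TimeStamp'].values, fixed_timestamps[1:], side='left')
positions = np.minimum(positions, index_bounds_limit)  # same bound as the while loop
df2 = df1.iloc[np.concatenate(([0], positions))].reset_index(drop=True)
df2['TimeStamp'] = fixed_timestamps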



I need to reduce the time as much as possible (while maintaining readability/maintainability, no need to use "hacks").



I would appreciate any help and pointers in the right direction.



Thanks in advance










python-3.x dataframe





