Time Optimization for pandas dataframe reconstruction (random to fixed sampling)


I am very new to Python and pandas, and my limited experience has led me to an inefficient solution that makes my code too slow.



I have some data corresponding to stock market prices.
Sampling is random, at nanosecond resolution.

What I am trying to achieve is to transform it into a new data set with a fixed sampling rate.



I am transforming my data set as follows:



  • I'm setting a time_delta as a static time step of 0.5 seconds.

  • I'm dropping records that fall on the same nanosecond.

  • I'm generating timestamps from my start_time to the calculated end_time.

  • I'm iterating through my original dataframe, copying (and duplicating when needed) the last known record within each time_delta step into a new dataframe (see the toy sketch after this list).
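
To make the last step concrete, here is a minimal toy sketch of my reading of it ("take the last known record at or before each fixed timestamp"); the 'Price' column and the tiny 500 ns grid are invented purely for illustration:

import numpy as np
import pandas as pd

# Toy version of the irregular data: integer nanosecond timestamps plus one
# value column ('Price' is a made-up placeholder; only 'TimeStamp' matches my data).
df_toy = pd.DataFrame({
    'TimeStamp': np.array([100, 340, 410, 990, 1830], dtype=np.int64),
    'Price': [10.0, 10.5, 10.4, 10.6, 10.2],
})

# Fixed grid: one sample every 500 ns.
grid = pd.DataFrame({'TimeStamp': np.arange(500, 2500, 500, dtype=np.int64)})

# "Last known record at or before each grid point" expressed as an as-of merge.
resampled = pd.merge_asof(grid, df_toy, on='TimeStamp', direction='backward')
print(resampled)
#    TimeStamp  Price
# 0        500   10.4
# 1       1000   10.6
# 2       1500   10.6
# 3       2000   10.2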

I believe my issue is that I am appending the records one by one to my new dataframe, but I haven't been able to figure out a way to optimize the code using pandas built-ins.



Runtime is currently ~4 min for a day's data (turning ~30K samples into 57,600) when executing on Google Colab.

I've also tested locally without any improvement.




import numpy as np
from tqdm import tqdm

# df (the raw records) and ceil_to_minute() are defined earlier in my script.

# ====================================================================
# Rate Re-Definition
# ====================================================================

SAMPLES_PER_SECOND = 2
dt = 1000000000 / SAMPLES_PER_SECOND  # Time delta in nanoseconds
SECONDS_IN_WORK_DAY = 28800  # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES

start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD

fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64)


# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================

df1 = df.drop_duplicates(subset='TimeStamp', keep="last")


# ====================================================================
# Construct new dataframe
# ====================================================================

df2 = df1.iloc[0:1]
index_bounds_limit = df1.shape[0] - 1
index = 0

for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
    # Advance until the current record's timestamp reaches the grid point
    # (or we run out of records), then copy that record to the new dataframe.
    while index < index_bounds_limit and df1['TimeStamp'].iloc[index] < fixed_timestamps[i]:
        index += 1
    df2 = df2.append(df1.iloc[index], ignore_index=True)

df2['TimeStamp'] = fixed_timestamps
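
For reference, this is roughly the kind of vectorized replacement for the construction loop that I imagine should be possible (an untested sketch that reuses the variables above and assumes df1 is sorted by 'TimeStamp'); I am not sure it is the idiomatic approach:

# Hypothetical vectorized version of the loop above (not the code I am timing).
# side='left' mirrors the while condition: stop at the first timestamp >= grid point.
positions = np.searchsorted(df1['TimeStamp'].values, fixed_timestamps[1:], side='left')
positions = np.minimum(positions, index_bounds_limit)  # same bound as the while loop
df2 = df1.iloc[np.concatenate(([0], positions))].reset_index(drop=True)
df2['TimeStamp'] = fixed_timestamps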



I need to reduce the time as much as possible (while maintaining readability/maintainability, no need to use "hacks").



I would appreciate any help and pointers in the right direction.



Thanks in advance










python-3.x dataframe





