Time Optimization for pandas dataframe reconstruction (random to fixed sampling)
I am very new to Python and pandas, and my limited experience has led me to an inefficient solution that makes my code too slow.
I have data corresponding to stock market prices. The samples arrive at random (irregular) times, recorded with nanosecond timestamps.
What I am trying to achieve is to transform this into a new data set with a fixed sampling rate.
I am transforming my data set as follows (a small made-up example of the intended result follows this list):
- I set a time_delta as a static time step of 0.5 seconds.
- I drop records that share the same nanosecond timestamp.
- I generate timestamps from my start_time to the calculated end_time.
- I iterate through my original dataframe, copying (and duplicating when needed) the last known record within each time_delta step into a new dataframe.
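To make the intended result concrete, here is a toy example with made-up values (the Price column and the numbers are purely illustrative); each fixed grid point carries forward the most recent record seen so far:
import numpy as np
import pandas as pd

# Made-up irregular samples: uint64 nanosecond timestamps plus one value column
raw = pd.DataFrame({
    'TimeStamp': np.array([120, 310, 1650], dtype=np.uint64),
    'Price':     [10.0, 10.5, 11.0],
})

# Fixed grid: one sample every 500 ns
grid = np.arange(500, 2001, 500, dtype=np.uint64)

# For each grid point, pick the last record at or before it (forward fill).
# (Assumes the first grid point is not earlier than the first record.)
pos = np.searchsorted(raw['TimeStamp'].to_numpy(), grid, side='right') - 1
fixed = raw.iloc[pos].reset_index(drop=True)
fixed['TimeStamp'] = grid
print(fixed)
# Grid points 500, 1000 and 1500 all repeat the record from t=310;
# grid point 2000 picks up the record from t=1650.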
I believe the problem is that I am appending records one by one to the new dataframe, but I haven't been able to figure out a way to optimize the code using pandas built-ins.
Runtime is currently ~4 minutes for one day's data (turning ~30K samples into 57,600) when running on Google Colab. I've also tested locally without any improvement.
import numpy as np
import pandas as pd
from tqdm import tqdm

# df is the raw dataframe of irregularly sampled records; its 'TimeStamp' column
# holds uint64 nanosecond timestamps in ascending order. ceil_to_minute() is a
# small helper of mine (defined elsewhere) that rounds a timestamp up to the
# next whole minute.

# ====================================================================
# Rate Re-Definition
# ====================================================================
SAMPLES_PER_SECOND = 2
dt = 1000000000 / SAMPLES_PER_SECOND    # Time delta in nanoseconds
SECONDS_IN_WORK_DAY = 28800             # 60 seconds * 60 minutes * 8 hours
TOTAL_SAMPLES = SECONDS_IN_WORK_DAY * SAMPLES_PER_SECOND
SAMPLING_PERIOD = dt * TOTAL_SAMPLES

start_of_day_timestamp = ceil_to_minute(df['TimeStamp'].iloc[0])
end_of_day_timestamp = start_of_day_timestamp + SAMPLING_PERIOD
fixed_timestamps = np.arange(start_of_day_timestamp,
                             end_of_day_timestamp,
                             dt,
                             dtype=np.uint64)

# ====================================================================
# Drop records corresponding to the same timestamps
# ====================================================================
df1 = df.drop_duplicates(subset='TimeStamp', keep="last")

# ====================================================================
# Construct new dataframe
# ====================================================================
df2 = df1.iloc[0:1]                  # seed with the first record
index_bounds_limit = df1.shape[0] - 1
index = 0
for i in tqdm(range(1, TOTAL_SAMPLES), desc="Constructing fixed sampling rate records... "):
    # Advance to the first record whose timestamp is not before this grid point
    while index < index_bounds_limit and df1['TimeStamp'].iloc[index] < fixed_timestamps[i]:
        index += 1
    # Append that record (row-by-row append is the suspected bottleneck)
    df2 = df2.append(df1.iloc[index], ignore_index=True)

df2['TimeStamp'] = fixed_timestamps
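For what it's worth, the kind of vectorized lookup I imagine should be possible looks roughly like the sketch below, but I haven't managed to verify that it reproduces my loop's output exactly, so please treat it as a rough idea rather than working code (it assumes df1 is sorted by TimeStamp):
# Untested sketch: replace the Python loop with a single searchsorted lookup.
# For each fixed timestamp, take the last record at or before it, clipped so
# that grid points outside the data range reuse the first/last record.
positions = np.searchsorted(df1['TimeStamp'].to_numpy(), fixed_timestamps, side='right') - 1
positions = np.clip(positions, 0, len(df1) - 1)
df2 = df1.iloc[positions].reset_index(drop=True)
df2['TimeStamp'] = fixed_timestamps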
I need to reduce the runtime as much as possible, while keeping the code readable and maintainable (no need for "hacks").
I would appreciate any help and pointers in the right direction. Thanks in advance.
python-3.x dataframe
asked Mar 28 at 21:54 by ioakeimo