How to detect contiguous spans in which data changes linearly within a DataFrame?
I'm trying to detect contiguous spans in which the relevant variable changes linearly within certain data in a DataFrame. There may be many spans within the data that satisfy this. I started my approach using RANSAC, based on Robust linear model estimation using RANSAC. However, I'm having issues using the example for my data.
Objective
Detect contiguous spans in which the relevant variable changes linearly within the data. The spans to be detected consist of more than 20 consecutive data points. The desired output is the date ranges in which the contiguous spans lie.
Toy example
In the toy example code below I generate random data and then overwrite two portions of it to create contiguous spans that vary linearly. Then I try to fit a linear regression model to the data. The rest of the code I used (not shown here) is just the remainder of the code on the Robust linear model estimation using RANSAC page. However, I know I would need to change that remaining code in order to reach the goal.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets
import numpy as np
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.data[date_range1_start:date_range1_end]
value_start1 = 10
values1 = range(value_start1,value_start1+len(line1))
df.data[date_range1_start:date_range1_end] = values1
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.data[date_range2_start:date_range2_end]
values2 = range(value_start2,value_start2-len(line2),-1)
df.data[date_range2_start:date_range2_end] = values2
## 4. Plot data
df.plot()
plt.show()
## 5. Create arrays
X = np.asarray(df.index)
y = np.asarray(df.data.tolist())
## 6. Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)
For this toy example code a desired output (which I wasn't able to code yet) would be a DataFrame like this:
>>> out
              start               end
0  2016-08-10 08:15  2016-08-10 15:00
1  2016-08-10 17:00  2016-08-10 22:30
The graph generated looks like:
Error code
However, when step 6 is executed I get the error below:
ValueError: Expected 2D array, got 1D array instead: ... Reshape your
data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
I would like to be able to detect in this example both contiguous spans in which the relevant variable changes linearly (line1 and line2). But I'm not able to adapt the RANSAC code example to do so.
Question
What should I modify in my code to be able to continue? And is there a better approach to detecting the contiguous spans in which the relevant variable changes linearly?
python pandas numpy scikit-learn ransac
We need sample data
– Setop
Nov 26 '18 at 17:52
I created sample data in the toy example. I may provide real data but not sure how I can do that. I have some data in a pickle file.
– Cedric Zoppolo
Nov 26 '18 at 18:00
Sorry. Then I don't get what you call a graph. To my mind, a graph is a set of nodes and edges.
– Setop
Nov 26 '18 at 18:20
By "linear graphs within the data," I believe @CedricZoppolo means "contiguous spans in which the relevant variable changes linearly." He means graph as in plot, not graph as in nodes and edges.
– Peter Leimbigler
Nov 26 '18 at 18:57
@PeterLeimbigler is correct. I may have used the wrong term. I will try to rephrase my question to ensure everyone understands my question.
– Cedric Zoppolo
Nov 26 '18 at 19:22
asked Nov 22 '18 at 17:05 by Cedric Zoppolo
edited Nov 26 '18 at 19:29
2 Answers
To get past the error and fit your linear regression, do the following:
lr.fit(X.reshape(-1,1), y)
This is because sklearn expects a 2D array of values, with each row holding the features of one sample.
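As a minimal sketch of what the reshape does (the array below is made-up illustration data, not the question's DataFrame): reshape(-1, 1) turns a 1D array of n values into an n-by-1 column, which is the (n_samples, n_features) layout the error message asks for.

```python
import numpy as np

# Hypothetical 1D feature values; in the question this role is played by the index.
X_1d = np.arange(5, dtype=float)

# -1 lets NumPy infer the number of rows; 1 means a single feature column.
X_2d = X_1d.reshape(-1, 1)

print(X_1d.shape)  # (5,)
print(X_2d.shape)  # (5, 1)
```

The target y can stay 1D; only the feature array needs the extra axis.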
So after this would you like to fit models for many different ranges and see if you find spans of linear change?
If you are looking for exactly linear ranges (which is possible to detect in the case of integers for example, but not for floats), then I would do something like:
dff = df.diff()
dff['block'] = (dff.data.shift(1) != dff.data).astype(int).cumsum()
out = pd.DataFrame(list(dff.reset_index().groupby('block')['index'].apply(lambda x:
[x.min(), x.max()] if len(x) > 20 else None).dropna()))
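Output would be the following. Each span is reported from its second point, because the first constant difference appears one sample after the span begins:
>>> out
                    0                   1
0 2016-08-10 08:30:00 2016-08-10 15:00:00
1 2016-08-10 17:15:00 2016-08-10 22:30:00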
If you are trying to do something similar for float data, I would use diff the same way, but with some acceptable error tolerance. Please let me know if this is what you would like to achieve. You could also use RANSAC on different ranges, but it simply discards the points that are not well aligned, so if some element broke a span you would still detect it as a span. Everything depends on what exactly you are interested in.
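One way the tolerance idea could look is sketched below. This is only an illustration under stated assumptions: the function name linear_spans, its parameters min_points and tol, and the zig-zag toy data are all made up here and are not part of the answer's code. It compares consecutive steps against a tolerance instead of using !=, and it reports each span from its first point rather than its second.

```python
import numpy as np
import pandas as pd


def linear_spans(series, min_points=20, tol=1e-6):
    """Find runs of more than min_points samples whose step is constant within tol.

    Hypothetical helper: assumes evenly spaced samples and no NaNs.
    """
    values = series.to_numpy(dtype=float)
    step = np.diff(values)                    # step[i] = values[i + 1] - values[i]
    breaks = np.abs(np.diff(step)) > tol      # breaks[j]: step[j + 1] differs from step[j]
    # Index of the first step of every run of (nearly) equal steps.
    starts = np.flatnonzero(np.concatenate(([True], breaks)))
    # Index of the last step of every run.
    ends = np.concatenate((starts[1:], [len(step)])) - 1
    rows = []
    for s, e in zip(starts, ends):
        n_points = e - s + 2                  # steps s..e join points s..e + 1
        if n_points > min_points:
            rows.append({"start": series.index[s], "end": series.index[e + 1]})
    return pd.DataFrame(rows, columns=["start", "end"])


# Toy data (assumption): a zig-zag background with one 25-point float ramp.
times = pd.date_range("2016-08-10", periods=60, freq="15min")
values = np.where(np.arange(60) % 2 == 0, 0.0, 100.0)
values[10:35] = 5.0 + 0.5 * np.arange(25)
spans = linear_spans(pd.Series(values, index=times))
print(spans)
```

Because each step is only compared with its neighbour, a slope that drifts slowly by less than tol per sample would still be merged into one span; comparing against the first step of the run would be a stricter alternative.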
I used (abs(dff.data.shift(1)-dff.data) >= 1e-6) instead of (dff.data.shift(1) != dff.data) as I was working with floats
– Cedric Zoppolo
Dec 11 '18 at 19:36
ValueError
To answer the question about the ValueError: the array you originally create has shape (100,1) (like the example), but the X you pass to the model, np.asarray(df.index), has shape (100,). A 1D y is fine; it is X that must be 2D. This can be fixed by reshaping X with X = X.reshape(-1,1). The next error will be that the X values cannot be in datetime64 format. This can be fixed by converting the times to seconds. For example, a standard epoch to use is 1970-01-01T00:00Z, so that all data points are expressed as seconds since that date and time. This conversion can be done with the line below (the trailing Z is dropped because NumPy's datetime64 has no time zone support and warns about time-zone-aware strings):
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
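The conversion can be checked in isolation with a small sketch. The four-sample index below is made up for illustration, and the names epoch, X_seconds and X_2d are chosen here rather than taken from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical index: four samples, 15 minutes apart.
times = pd.date_range("2016-08-10", periods=4, freq="15min")

X = np.asarray(times)                             # a datetime64 array
epoch = np.datetime64("1970-01-01T00:00:00")      # naive timestamp, no time zone suffix
X_seconds = (X - epoch) / np.timedelta64(1, "s")  # float seconds since the epoch
X_2d = X_seconds.reshape(-1, 1)                   # (n_samples, n_features) for sklearn

print(X_2d.shape)
print(np.diff(X_seconds))  # 15-minute spacing shows up as 900-second steps
```

Any fixed reference time would work equally well for a linear fit; the Unix epoch is simply a common convention.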
Here's the full code, which also plots the linear fit:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.loc[date_range1_start:date_range1_end, "data"]
value_start1 = 10
values1 = np.arange(value_start1, value_start1 + len(line1))
# .loc avoids chained assignment, which may silently fail to modify df
df.loc[date_range1_start:date_range1_end, "data"] = values1
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.loc[date_range2_start:date_range2_end, "data"]
values2 = np.arange(value_start2, value_start2 - len(line2), -1)
df.loc[date_range2_start:date_range2_end, "data"] = values2
## 4. Create arrays
X = np.asarray(df.index)
# Seconds since the Unix epoch; datetime64 values cannot be fed to the model directly
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
X = X.reshape(-1, 1)  # sklearn expects shape (n_samples, n_features)
y = df["data"].to_numpy()
## 5. Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)
## 6. Predict values
z = lr.predict(X)
df['linear fit'] = z
## 7. Plot
df.plot()
plt.show()
Detecting Contiguous Spans
To detect the spans of linear data, as you stated, RANSAC is a good method to use. To do this, change the linear model to lr = linear_model.RANSACRegressor(). However, this only returns one span, whereas you need to detect all spans. This means you need to repeat the span detection, removing each span after it is detected so it doesn't get detected again. Repeat until the number of points in a detected span is no more than 20.
The residual threshold for the RANSAC fit needs to be very small so as not to pick up points outside the span. The residual_threshold can be raised if there is noise in the real data. However, this is not always going to be sufficient, and false inliers are likely to be found, which will affect the recorded span ranges.
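To illustrate what a single RANSAC fit returns, here is a small sketch on made-up data: 30 points that lie exactly on a line followed by 10 points pushed far off it. The data, the threshold of 0.5 and the random_state are assumptions chosen for this illustration only.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Hypothetical data: 30 points on y = 2x + 1, then 10 points moved far away from the line.
x = np.arange(40, dtype=float)
y = 2.0 * x + 1.0
offsets = np.array([40, -35, 55, -60, 45, -50, 70, -45, 65, -55], dtype=float)
y[30:] += offsets

# residual_threshold is the largest absolute residual a point may have and still be an inlier.
ransac = RANSACRegressor(residual_threshold=0.5, random_state=0)
ransac.fit(x.reshape(-1, 1), y)

mask = ransac.inlier_mask_   # boolean array, True for points in the consensus set
print(mask.sum())
```

The mask says nothing about whether the inliers are adjacent in time, which is why the contiguity check discussed next is needed.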
False Inliers
Since RANSAC does not check whether the in-span points are consecutive, outliers can be falsely included in a span. To guard against this, points marked as in-span should be changed to outliers if they are surrounded by outliers. The fastest way to do this is to convolve lr.inlier_mask_ with [1,1,1]. Any solitary "inlier" will have a value of 1 after the convolution (and is thus really an outlier), while points that are part of a span will be 2 or 3. Thus, the following will fix false inliers:
lr.inlier_mask_ = np.convolve(lr.inlier_mask_.astype(int), [1,1,1], mode='same') > 1
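The effect of the convolution can be seen on a small hand-made mask. The mask below and the strict variant are assumptions of this sketch, not part of the answer. Note that a single outlier flanked by two inliers also scores 2, so the thresholded convolution alone marks it as an inlier; AND-ing with the original mask is one way to avoid that if it matters for your data.

```python
import numpy as np

# Hypothetical inlier mask: a solitary inlier (index 0), a run (3-5),
# a single outlier (index 6) between inliers, and another run (7-8).
mask = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1], dtype=bool)

# Each entry becomes the number of inliers among itself and its two neighbours.
counts = np.convolve(mask.astype(int), [1, 1, 1], mode='same')

as_in_answer = counts > 1            # drops the solitary inlier, but fills index 6
strict = mask & (counts > 1)         # variant: only keeps points that were inliers

print(counts)        # [1 1 1 2 3 2 2 2 2]
print(as_in_answer)  # index 0 removed, index 6 now True
print(strict)        # index 0 removed, index 6 still False
```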
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.loc[date_range1_start:date_range1_end, "data"]
value_start1 = 10
values1 = np.arange(value_start1, value_start1 + len(line1))
# .loc avoids chained assignment, which may silently fail to modify df
df.loc[date_range1_start:date_range1_end, "data"] = values1
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.loc[date_range2_start:date_range2_end, "data"]
values2 = np.arange(value_start2, value_start2 - len(line2), -1)
df.loc[date_range2_start:date_range2_end, "data"] = values2
## 4. Create arrays
X = np.asarray(df.index)
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
X = X.reshape(-1, 1)
y = df["data"].to_numpy()
## 5. Fit RANSAC with a very small residual threshold
lr = linear_model.RANSACRegressor(residual_threshold=0.001)
lr.fit(X, y)
# Placeholders for start/end times
start_times = []
end_times = []
# Repeat fit and check if number of span inliers is greater than 20
while np.sum(lr.inlier_mask_) > 20:
    # Remove false inliers
    lr.inlier_mask_ = np.convolve(lr.inlier_mask_.astype(int), [1,1,1], mode='same') > 1
    # Store start/end times
    in_span = np.squeeze(np.where(lr.inlier_mask_))
    start_times.append(str(times[in_span[0]]))
    end_times.append(str(times[in_span[-1]]))
    # Keep only the outliers and check them for another span
    outliers = np.logical_not(lr.inlier_mask_)
    X = X[outliers]
    y = y[outliers]
    times = times[outliers]
    # Fit to remaining points
    lr.fit(X, y)
out = pd.DataFrame({'start':start_times, 'end':end_times}, columns=['start','end'])
out = out.sort_values('start')
The out DataFrame holds the start and end time of each detected span. You can also plot the spans to verify:
plt.plot(df['data'],c='b')
for idx,row in out.iterrows():
    x0 = np.datetime64(row['start'])
    y0 = df.loc[x0]['data']
    x1 = np.datetime64(row['end'])
    y1 = df.loc[x1]['data']
    plt.plot([x0,x1],[y0,y1],c='r')
This answer is greatly detailed and works as I expect. Hence I will set the bounty to this answer. However I will accept the other solution that uses a simple but effective method.
– Cedric Zoppolo
Nov 29 '18 at 20:00
The diff-based answer: answered Nov 26 '18 at 20:46 by zsomko, edited Nov 30 '18 at 1:09 by Cedric Zoppolo
1
I used(abs(dff.data.shift(1)-dff.data) >= 1e-6)
instead of(dff.data.shift(1) != dff.data)
as I was working with floats
– Cedric Zoppolo
Dec 11 '18 at 19:36
add a comment |
ValueError
To answer the question about the ValueError: the reason you are getting the error and the example isn't is that, while you originally create an array with shape (100,1) (like the example), the linear model is fitting to df.data.tolist(), which has a shape (100,). This can be fixed by reshaping X to 2D with X = X.reshape(-1,1). The next error will be that the X values cannot be in datetime64 format. This can be fixed by converting the times to seconds. For example, a standard epoch to use is 1970-01-01T00:00Z, so that all data points are expressed as seconds since that date and time. NumPy's datetime64 is timezone-naive, so the epoch is written without the trailing Z. This conversion can be done by:
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
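To make the conversion concrete, here is a small sketch on four timestamps taken from the same date range as the toy example. The variable names epoch, seconds and X_2d are only for illustration.

```python
import numpy as np
import pandas as pd

# Four timestamps at 15-minute spacing, as in the toy example
times = pd.date_range('2016-08-10', periods=4, freq='15min')
X = np.asarray(times)                      # datetime64 array

# Subtract the epoch and divide by one second to get plain floats
epoch = np.datetime64('1970-01-01T00:00:00')
seconds = (X - epoch) / np.timedelta64(1, 's')

# One feature column, ready for sklearn
X_2d = seconds.reshape(-1, 1)
print(seconds)
```

Each step of 15 minutes becomes a step of 900 seconds, so the spacing of the data is preserved and only the units change.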
Here's the full code showing the linear fit in the plot below:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data (assign through .loc to avoid chained assignment)
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.loc[date_range1_start:date_range1_end, "data"]
value_start1 = 10
values1 = list(range(value_start1,value_start1+len(line1)))
df.loc[date_range1_start:date_range1_end, "data"] = values1
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.loc[date_range2_start:date_range2_end, "data"]
values2 = list(range(value_start2,value_start2-len(line2),-1))
df.loc[date_range2_start:date_range2_end, "data"] = values2
## 4. Create arrays (times as seconds since the epoch, reshaped to 2D for sklearn)
X = np.asarray(df.index)
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
X = X.reshape(-1,1)
y = np.asarray(df.data.tolist())
## 5. Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)
## 6. Predict values
z = lr.predict(X)
df['linear fit'] = z
## 7. Plot
df.plot()
plt.show()
Detecting Contiguous Spans
To detect the spans of linear data, as you stated, RANSAC is a good method to use. To do this, the linear model would be changed to lr = linear_model.RANSACRegressor(). However, this only returns one span, whereas you need to detect all spans. This means you need to repeat the span detection, removing each span after it is detected so it doesn't get detected again. This should be repeated until the number of points in a detected span is 20 or fewer.
The residual threshold for the RANSAC fit needs to be very small so as not to pick up points outside the span. The residual_threshold can be increased if there is noise in the real data. However, this is not always going to be sufficient, and false inliers are likely to be found, which will affect the recorded span ranges.
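To see what a tight residual_threshold does on a single fit, here is a small sketch on made-up data (not the question's): 30 points lie exactly on a line and the other 30 are pushed off it by a fixed, invented pattern. The random_state=0 argument is only there to make the run repeatable.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Hypothetical data: the first 30 points lie exactly on y = 2x + 1
x = np.arange(60, dtype=float)
y = 2.0 * x + 1.0
# The last 30 points are moved at least 5 units off the line
y[30:] += 5.0 + 3.0 * (x[30:] % 7)

# A tiny residual threshold: only points essentially on the fitted line count as inliers
ransac = RANSACRegressor(residual_threshold=0.001, random_state=0)
ransac.fit(x.reshape(-1, 1), y)

mask = ransac.inlier_mask_      # boolean array, True for points on the line
print(mask.sum(), ransac.estimator_.coef_, ransac.estimator_.intercept_)
```

Only one line comes back from a single fit, which is why the inliers have to be removed and the fit repeated to find further spans.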
False Inliers
Since RANSAC does not check whether the in-span points are consecutive, it is possible for outliers to be falsely included in a span. To guard against this, points marked as in-span should be changed to outliers if they are surrounded by outliers. The fastest way to do this is to convolve lr.inlier_mask_ with [1,1,1]. Any solitary "inliers" will have a value of 1 after the convolution (and are thus really outliers), while points that are part of a span run will be 2 or 3. The result is combined with the original mask so that an outlier sitting between two inliers (which also gets a value of 2) is not turned into an inlier. Thus, the following will fix false inliers:
lr.inlier_mask_ = (np.convolve(lr.inlier_mask_.astype(int), [1,1,1], mode='same') > 1) & lr.inlier_mask_
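The effect of this filter can be checked on a toy mask. This is a small sketch with mask values invented for illustration. It compares the plain convolution test with a version that is additionally ANDed with the original mask, which is an assumption of this sketch about the intended behaviour: only points that RANSAC already marked as inliers should survive.

```python
import numpy as np

# Hypothetical inlier mask: isolated inliers at positions 0 and 8,
# a run of inliers at positions 2..5, and a lone outlier at position 1
mask = np.array([True, False, True, True, True, True, False, False, True, False])

# Each entry counts the inliers among the point and its two neighbours
counts = np.convolve(mask.astype(int), [1, 1, 1], mode='same')

# Plain test: drops the isolated inliers, but position 1 (an outlier
# between two inliers) also reaches a count of 2 and is switched on
naive = counts > 1

# ANDing with the original mask keeps only points that were inliers to begin with
cleaned = naive & mask

print(counts)
print(naive)
print(cleaned)
```

With the AND in place, the first and last True entries of the cleaned mask are always genuine inliers, which matters because those two positions are what get recorded as the span's start and end.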
Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import numpy as np
## 1. Generate random data for toy sample
times = pd.date_range('2016-08-10', periods=100, freq='15min')
df = pd.DataFrame(np.random.randint(0,100,size=(100, 1)), index=times, columns=["data"])
## 2. Set line1 within random data (assign through .loc to avoid chained assignment)
date_range1_start = "2016-08-10 08:15"
date_range1_end = "2016-08-10 15:00"
line1 = df.loc[date_range1_start:date_range1_end, "data"]
value_start1 = 10
values1 = list(range(value_start1,value_start1+len(line1)))
df.loc[date_range1_start:date_range1_end, "data"] = values1
## 3. Set line2 within random data
date_range2_start = "2016-08-10 17:00"
date_range2_end = "2016-08-10 22:30"
value_start2 = 90
line2 = df.loc[date_range2_start:date_range2_end, "data"]
values2 = list(range(value_start2,value_start2-len(line2),-1))
df.loc[date_range2_start:date_range2_end, "data"] = values2
## 4. Create arrays (times as seconds since the epoch, reshaped to 2D for sklearn)
X = np.asarray(df.index)
X = (X - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
X = X.reshape(-1,1)
y = np.asarray(df.data.tolist())
## 5. Fit line using all data
lr = linear_model.RANSACRegressor(residual_threshold=0.001)
lr.fit(X, y)
# Placeholders for start/end times
start_times = []
end_times = []
# Repeat fit and check if number of span inliers is greater than 20
while np.sum(lr.inlier_mask_) > 20:
    # Remove false inliers (inliers with no inlier neighbour)
    lr.inlier_mask_ = (np.convolve(lr.inlier_mask_.astype(int), [1,1,1], mode='same') > 1) & lr.inlier_mask_
    # Store start/end times
    in_span = np.squeeze(np.where(lr.inlier_mask_))
    start_times.append(str(times[in_span[0]]))
    end_times.append(str(times[in_span[-1]]))
    # Get outliers and check for another span
    outliers = np.logical_not(lr.inlier_mask_)
    X = X[outliers]
    y = y[outliers]
    times = times[outliers]
    # Fit to remaining points
    lr.fit(X, y)
out = pd.DataFrame({'start':start_times, 'end':end_times}, columns=['start','end'])
# sort_values returns a new dataframe, so keep the result
out = out.sort_values('start')
The out dataframe holds the start and end time of each detected span.
You can also plot the spans to verify.
plt.plot(df['data'],c='b')
for idx,row in out.iterrows():
    x0 = pd.Timestamp(row['start'])
    y0 = df.loc[x0, 'data']
    x1 = pd.Timestamp(row['end'])
    y1 = df.loc[x1, 'data']
    plt.plot([x0,x1],[y0,y1],c='r')
plt.show()
edited Nov 27 '18 at 15:13
answered Nov 26 '18 at 21:20 by A Kruger
This answer is greatly detailed and works as I expect. Hence I will award the bounty to this answer. However, I will accept the other solution, which uses a simple but effective method.
– Cedric Zoppolo
Nov 29 '18 at 20:00
We need sample data
– Setop
Nov 26 '18 at 17:52
I created sample data in the toy example. I may provide real data but not sure how I can do that. I have some data in a pickle file.
– Cedric Zoppolo
Nov 26 '18 at 18:00
Sorry. Then I don't get what you call a graph. To my mind, a graph is a set of nodes and edges.
– Setop
Nov 26 '18 at 18:20
By "linear graphs within the data," I believe @CedricZoppolo means "contiguous spans in which the relevant variable changes linearly." He means graph as in plot, not graph as in nodes and edges.
– Peter Leimbigler
Nov 26 '18 at 18:57
@PeterLeimbigler is correct. I may have used the wrong term. I will try to rephrase my question to ensure everyone understands my question.
– Cedric Zoppolo
Nov 26 '18 at 19:22