Does anybody know if OrcTableSource supports the S3 file system?
I'm having trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS). The code fragment is provided below:
OrcTableSource soORCTableSource = OrcTableSource.builder()
    // path to ORC
    .path("s3://orders/so.orc") // s3://orders/so.csv
    // schema of ORC files
    .forOrcSchema(OrderHeaderORCSchema)
    .withConfiguration(orcconfig)
    .build();
It seems the path is incorrect. Can anyone help? Much appreciated!
Caused by: java.io.FileNotFoundException: File /so.orc does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)
By the way, I've already set up flink-s3-fs-presto-1.6.2, and the following code runs correctly. The question is limited to OrcTableSource only.
DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
    env.readCsvFile("s3://orders/so.csv")
        .types(String.class, String.class, String.class,
               String.class, String.class);
apache-flink
edited Nov 22 '18 at 7:55
pirho
asked Nov 22 '18 at 3:31
TUTU
More information?
– cellepo
Nov 22 '18 at 3:47
If I use the path s3:///bucket/object instead of s3://bucket/object in the OrcTableSource builder, I see the following exception: Caused by: org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: ed4025ca-91a8-4795-8068-948fbfc3508f), S3 Extended Request ID: null at org.apache.flink.fs.s3presto.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)
– TUTU
Nov 22 '18 at 4:20
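One possible reading of the two error messages, sketched with plain java.net.URI only (Flink's and Hadoop's own Path classes may differ in detail, so treat this as an illustration rather than what Flink does internally): in s3://orders/so.orc the bucket is the URI authority and only /so.orc is the path, which would match the local-file-system lookup of /so.orc in the stack trace. In s3:///bucket/object the authority is empty, which would be consistent with the NoSuchBucket error. The class and method names below are hypothetical.

```java
import java.net.URI;

final class S3PathParts {
    /** Returns the URI authority (the bucket for s3:// URIs), or null if there is none. */
    static String bucketOf(String location) {
        return URI.create(location).getAuthority();
    }

    /** Returns only the path component, i.e. what is left once scheme and bucket are dropped. */
    static String pathOf(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        String[] locations = {"s3://orders/so.orc", "s3:///orders/so.orc"};
        for (String location : locations) {
            System.out.println(location
                    + " -> bucket=" + bucketOf(location)
                    + ", path=" + pathOf(location));
        }
    }
}
```

For the first form this reports bucket=orders and path=/so.orc; for the second it reports bucket=null and path=/orders/so.orc.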
1 Answer
The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need to add the following snippet to Hadoop's core-site.xml configuration:
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
See this link for more information about setting up S3 for Hadoop.
This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
answered Nov 22 '18 at 16:49
Till Rohrmann
Thanks, Till, for your reply. However, I'm not running Flink with a dedicated Hadoop environment; I just use the S3 wrapper with the shaded Hadoop package, so the additional manual configuration doesn't apply to my case. Is there any other workaround? Or do you have a target date for fixing OrcRowInputFormat in Flink?
– TUTU
Nov 23 '18 at 3:39
Then I guess it would be good to add core-site.xml to your class path as well as the S3 Hadoop FS implementation (or put the implementation in the lib directory of Flink's home directory).
– Till Rohrmann
Nov 23 '18 at 7:12
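If you go the class path route, a quick sanity check can help. The sketch below uses only the JDK to look for core-site.xml on the class path and read the fs.s3.impl property from it. It does not involve Flink or Hadoop, and it assumes that the class loader doing the lookup sees the same class path as the job, which may not hold in every deployment. The class and method names are hypothetical.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class CoreSiteCheck {
    /** Returns the value of the named property in a Hadoop-style configuration XML, or null. */
    static String propertyValue(InputStream xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                if (name.equals(textOf(property, "name"))) {
                    return textOf(property, "value");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    /** Returns the trimmed text of the first child element with the given tag, or null. */
    private static String textOf(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Look the file up the same way a class path resource would be found.
        try (InputStream in = CoreSiteCheck.class.getClassLoader().getResourceAsStream("core-site.xml")) {
            if (in == null) {
                System.out.println("core-site.xml is not on the class path");
            } else {
                System.out.println("fs.s3.impl = " + propertyValue(in, "fs.s3.impl"));
            }
        }
    }
}
```

If this reports that the file is missing, or prints null for fs.s3.impl, the snippet from the answer is probably not being picked up.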