Does anybody know if OrcTableSource supports the S3 file system?



























I'm running into trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS). The code fragment is below:



OrcTableSource soORCTableSource = OrcTableSource.builder()
    // path to ORC file
    .path("s3://orders/so.orc") // s3://orders/so.csv
    // schema of ORC files
    .forOrcSchema(OrderHeaderORCSchema)
    .withConfiguration(orcconfig)
    .build();


It seems this path is incorrect. Can anyone help? Much appreciated!




Caused by: java.io.FileNotFoundException: File /so.orc does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)




By the way, I've already set up flink-s3-fs-presto-1.6.2, and the following code runs correctly. The question is limited to OrcTableSource only.



DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
    env.readCsvFile("s3://orders/so.csv")
        .types(String.class, String.class, String.class,
               String.class, String.class);


































  • More information?

    – cellepo
    Nov 22 '18 at 3:47











  • If I use the path s3:///bucket/object instead of s3://bucket/object in the OrcTableSource builder, I see the following exception: Caused by: org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: ed4025ca-91a8-4795-8068-948fbfc3508f), S3 Extended Request ID: null at org.apache.flink.fs.s3presto.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)

    – TUTU
    Nov 22 '18 at 4:20






















apache-flink














edited Nov 22 '18 at 7:55









pirho











asked Nov 22 '18 at 3:31









TUTU


























1 Answer














The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former, it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need to add the following snippet to Hadoop's core-site.xml configuration:



<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>


See this link for more information about setting up S3 for Hadoop.



This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
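
As a side illustration of the symptoms reported in the question, here is a minimal sketch using only java.net.URI. It is not Flink's or Hadoop's actual code path, and the class and method names (S3UriParts, bucketOf, pathOf) are hypothetical. It shows that the path component of s3://orders/so.orc is /so.orc, which matches the "File /so.orc does not exist" message and is consistent with the scheme and bucket being lost before the read reaches Hadoop's local file system. It also shows that s3:///bucket/object has no authority at all, which would fit the NoSuchBucket error mentioned in the comments.

```java
import java.net.URI;

public class S3UriParts {

    /** Returns the authority (for S3-style URIs, the bucket), or null if the URI has none. */
    static String bucketOf(String uri) {
        return URI.create(uri).getAuthority();
    }

    /** Returns only the path component, i.e. what is left once scheme and authority are dropped. */
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        String[] uris = {"s3://orders/so.orc", "s3:///bucket/object"};
        for (String u : uris) {
            System.out.println(u + " -> bucket=" + bucketOf(u) + ", path=" + pathOf(u));
        }
        // Prints:
        // s3://orders/so.orc -> bucket=orders, path=/so.orc
        // s3:///bucket/object -> bucket=null, path=/bucket/object
    }
}
```

In other words, the two-slash form is the correct one for the S3 file system; the failure comes from which file system ends up handling the path, not from the URI syntax.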






























  • Thanks, Till, for your reply. However, I'm not running Flink with a dedicated Hadoop environment; I just use the S3 wrapper with the shaded Hadoop package, so the additional manual configuration is not applicable in this case. Is there any other way to work around this? Or do you have a target date for fixing OrcRowInputFormat in Flink?

    – TUTU
    Nov 23 '18 at 3:39











  • Then I guess it would be good to add core-site.xml to your classpath, as well as the S3 Hadoop FS implementation (or put the implementation in the lib directory of Flink's home directory).

    – Till Rohrmann
    Nov 23 '18 at 7:12
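
Hadoop's Configuration picks up core-site.xml as a classpath resource, so one quick way to verify the suggestion above is to check whether the file is visible to the class loader of the running job. The following is a minimal sketch using only the JDK; the class and method names (ClasspathCheck, find) are hypothetical, and it only confirms visibility, not that the file's contents are correct.

```java
import java.net.URL;
import java.util.Optional;

public class ClasspathCheck {

    /** Looks up a resource by name on the context class loader; empty if it is not visible. */
    static Optional<URL> find(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the system class loader if no context class loader is set.
            cl = ClassLoader.getSystemClassLoader();
        }
        return Optional.ofNullable(cl.getResource(resourceName));
    }

    public static void main(String[] args) {
        String name = "core-site.xml";
        System.out.println(find(name)
                .map(url -> name + " found at " + url)
                .orElse(name + " is not on the classpath"));
    }
}
```

Running the same check inside the Flink job (rather than only on the client) would show what the task managers actually see, since their classpath can differ.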














































answered Nov 22 '18 at 16:49









Till Rohrmann































