Does anybody know if OrcTableSource supports the S3 file system?

I'm running into trouble using OrcTableSource to fetch an ORC file from cloud object storage (IBM COS). The code fragment is below:

OrcTableSource soORCTableSource = OrcTableSource.builder()
    // path to ORC
    .path("s3://orders/so.orc") // s3://orders/so.csv
    // schema of ORC files
    .forOrcSchema(OrderHeaderORCSchema)
    .withConfiguration(orcconfig)
    .build();

It seems the path is incorrect. Can anyone help? I'd appreciate it a lot!

Caused by: java.io.FileNotFoundException: File /so.orc does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:428)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:142)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
    at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:528)
    at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:370)
    at org.apache.orc.OrcFile.createReader(OrcFile.java:342)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:225)
    at org.apache.flink.orc.OrcRowInputFormat.open(OrcRowInputFormat.java:63)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:170)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:748)

By the way, I've already set up flink-s3-fs-presto-1.6.2, and the following code runs correctly. The question is limited to OrcTableSource only.

DataSet<Tuple5<String, String, String, String, String>> orderinfoSet =
    env.readCsvFile("s3://orders/so.csv")
        .types(String.class, String.class, String.class,
               String.class, String.class);

edited Nov 22 '18 at 7:55 by pirho

asked Nov 22 '18 at 3:31 by TUTU

More information?

– cellepo
Nov 22 '18 at 3:47

If I use the path s3:///bucket/object instead of s3://bucket/object in the OrcTableSource builder, I see the following exception: Caused by: org.apache.flink.fs.s3presto.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: ed4025ca-91a8-4795-8068-948fbfc3508f), S3 Extended Request ID: null at org.apache.flink.fs.s3presto.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1587)

– TUTU
Nov 22 '18 at 4:20

apache-flink

1 Answer

The problem is that Flink's OrcRowInputFormat uses two different file systems: one for generating the input splits and one for reading the actual input splits. For the former it uses Flink's FileSystem abstraction, and for the latter it uses Hadoop's FileSystem. Therefore, you need to configure Hadoop's core-site.xml to contain the following snippet:

<property>
  <name>fs.s3.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

See this link for more information about setting up S3 for Hadoop.

This is a limitation of Flink's OrcRowInputFormat and should be fixed. I've created the corresponding issue.
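As a side illustration of why the error message in the question mentions /so.orc rather than the full S3 URL, here is a minimal sketch using only java.net.URI. It does not call Flink or Hadoop, and the class name S3UriParts is hypothetical. It only shows how the two path spellings split into scheme, authority (the bucket), and path. That the Hadoop side ends up seeing only the path component is an assumption that happens to be consistent with the stack trace, not something this code proves.

```java
import java.net.URI;

/** Illustrative only: shows how java.net.URI splits an S3-style location. */
final class S3UriParts {

    /** Returns "scheme|authority|path" for the given location string. */
    static String describe(String location) {
        URI uri = URI.create(location);
        // getAuthority() returns null when the authority component is empty.
        return uri.getScheme() + "|" + uri.getAuthority() + "|" + uri.getPath();
    }

    public static void main(String[] args) {
        // Two slashes: "orders" is the authority (bucket), "/so.orc" is the path.
        System.out.println(describe("s3://orders/so.orc"));   // s3|orders|/so.orc
        // Three slashes: the authority is empty, so the bucket name moves into the path.
        System.out.println(describe("s3:///orders/so.orc"));  // s3|null|/orders/so.orc
    }
}
```

With two slashes the path component alone is /so.orc, which matches the file name in the FileNotFoundException. With three slashes there is no authority at all, which may be related to the NoSuchBucket error reported in the comments.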

answered Nov 22 '18 at 16:49 by Till Rohrmann

Thanks, Till, for your reply. However, I'm not running Flink with a dedicated Hadoop environment; I just use the S3 wrapper with the shaded Hadoop package, so the additional manual configuration is not applicable in this case. Is there any other way to work around this? Or do you have a target date for fixing OrcRowInputFormat in Flink?

– TUTU
Nov 23 '18 at 3:39

Then I guess it would be good to add core-site.xml to your classpath, as well as the S3 Hadoop FS implementation (or put the implementation in the lib directory of Flink's home directory).

– Till Rohrmann
Nov 23 '18 at 7:12
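
For setups without a dedicated Hadoop installation, one way to follow the suggestion above is to generate a small core-site.xml and put its directory on the classpath. The following is a minimal sketch under stated assumptions: the class name CoreSiteWriter and the default directory name hadoop-conf are hypothetical, and only the fs.s3.impl property from the answer is written. Further S3A settings (credentials, endpoint for IBM COS) are probably needed as well and are not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: writes a core-site.xml containing the fs.s3.impl property. */
final class CoreSiteWriter {

    /** Builds the file content; <configuration> is the root element Hadoop expects. */
    static String coreSiteXml() {
        return String.join("\n",
            "<?xml version=\"1.0\"?>",
            "<configuration>",
            "  <property>",
            "    <name>fs.s3.impl</name>",
            "    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>",
            "  </property>",
            "</configuration>",
            "");
    }

    /** Creates confDir if needed and writes core-site.xml into it. */
    static Path write(Path confDir) throws IOException {
        Files.createDirectories(confDir);
        Path target = confDir.resolve("core-site.xml");
        Files.write(target, coreSiteXml().getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        // "hadoop-conf" is an assumed default; pass another directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "hadoop-conf");
        System.out.println("Wrote " + write(dir).toAbsolutePath());
    }
}
```

The resulting directory would then need to be on the classpath of the Flink processes, together with a jar that provides org.apache.hadoop.fs.s3a.S3AFileSystem, as the comment above describes.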
