Version: v3.0

Lakeformation Datasets

Lakeformation extends S3-Athena datasets with added security and supports CSV, TSV, XLSX, JSON and Parquet files. It also checks data integrity and offers ACID transactions, data compaction, and time-travel queries.

Lakeformation includes optional partial data validation, which is enabled by default. The validation process helps detect and correct corrupt or invalid data files.

The AWS Athena CSV Parser/SerDe has the following limitations:

  • Embedded line breaks in CSV files are not supported.
  • Empty fields in columns defined as a numeric data type are not supported.

As a workaround, users can import such columns as string columns and create views on top of the dataset that cast them to the required data types.
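
The workaround above can be sketched as a small helper that builds the view statement. This is a minimal sketch, not an Amorphic or AWS API: the function name, the view name `orders_typed`, the table name `orders_raw`, and the column names are all hypothetical. It assumes the numeric columns were imported as strings and wraps each one in `NULLIF` before `CAST`, so that empty fields become NULL instead of failing the cast.

```python
def build_cast_view_sql(view_name, source_table, column_types):
    """Build a CREATE OR REPLACE VIEW statement that casts string columns.

    column_types maps each column name to a target SQL type,
    or to None when the column should stay a string.
    """
    select_items = []
    for column, sql_type in column_types.items():
        quoted = f'"{column}"'
        if sql_type is None:
            # Keep the column as imported (string).
            select_items.append(quoted)
        else:
            # NULLIF turns empty strings into NULL before the cast is applied.
            select_items.append(
                f"CAST(NULLIF({quoted}, '') AS {sql_type}) AS {quoted}"
            )
    columns_sql = ",\n  ".join(select_items)
    return (
        f'CREATE OR REPLACE VIEW "{view_name}" AS\n'
        f"SELECT\n  {columns_sql}\n"
        f'FROM "{source_table}"'
    )


# Hypothetical dataset: order_id and amount were imported as strings.
print(build_cast_view_sql(
    "orders_typed",
    "orders_raw",
    {"order_id": "BIGINT", "amount": "DOUBLE", "note": None},
))
```

The generated statement can then be run from the query editor. Whether NULL is the right replacement for an empty numeric field depends on the data, so treat that choice as an assumption of this sketch.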

For Datasets with JSON file type:

  • AWS Limitations
    • It uses the OpenX JSON SerDe with the following limitations:
      • It expects each JSON record to be on a single line (not pretty-printed), with records separated by a newline character.
      • A trailing comma is not allowed at the end of a line.
      • The full data in the file should not be enclosed in square brackets.
    • Views are not supported on top of Lakeformation JSON datasets.
  • Amorphic features that are not applicable to JSON datasets
    • Malware Detection
    • Data Profiling

Below is an example of an invalid JSON file:

[
{
"EmailId": "test-cwdl@cloudwick.com",
"IsAdmin": "no",
"UserId": "testuser"
},
{
"EmailId": "test1-cwdl@cloudwick.com",
"IsAdmin": "no",
"UserId": "testuser1"
}
]

Below is an example of a valid JSON file:

{ "EmailId": "test-cwdl1@cloudwick.com", "IsAdmin": "no", "UserId": "testuser1" }
{ "EmailId": "test-cwdl2@cloudwick.com", "IsAdmin": "no", "UserId": "testuser2" }
{ "EmailId": "test-cwdl3@cloudwick.com", "IsAdmin": "yes", "UserId": "testuser3" }
{ "EmailId": "test-cwdl4@cloudwick.com", "IsAdmin": "no", "UserId": "testuser4" }
{ "EmailId": "test-cwdl5@cloudwick.com", "IsAdmin": "yes", "UserId": "testuser5" }
Note

For JSON files, if dataset validation is enabled, the column names in the files must exactly match the column names in the dataset schema.
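
A file in the invalid array form shown earlier can be converted to the valid one-record-per-line form with a few lines of Python. The sketch below uses only the standard library, and both function names are hypothetical. The second helper reports records whose keys differ from the schema column names. It assumes that "exactly match" means the set of keys equals the set of schema columns, including letter case, which is an interpretation of the note above rather than a documented rule.

```python
import json


def array_to_json_lines(text):
    """Convert a top-level JSON array into one JSON object per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps writes each record on a single line without a trailing comma.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


def records_with_mismatched_columns(json_lines, schema_columns):
    """Return line numbers whose keys differ from the schema columns."""
    expected = set(schema_columns)
    mismatched = []
    for number, line in enumerate(json_lines.splitlines(), start=1):
        if line.strip() and set(json.loads(line)) != expected:
            mismatched.append(number)
    return mismatched


array_text = """[
  {"EmailId": "test-cwdl@cloudwick.com", "IsAdmin": "no", "UserId": "testuser"},
  {"EmailId": "test1-cwdl@cloudwick.com", "IsAdmin": "no", "UserId": "testuser1"}
]"""
converted = array_to_json_lines(array_text)
print(converted)
print(records_with_mismatched_columns(converted, ["EmailId", "IsAdmin", "UserId"]))
```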

Create Lakeformation Datasets

Users can create Lakeformation datasets by selecting Lake Formation as the target location and CSV, TSV, XLSX, JSON or Parquet as the file type. A dataset can be created in any of the following three ways:

  • Using already defined LakeFormation Datasets Templates
  • Importing required JSON payload
  • Using the form and entering the required details

Create LakeFormation Dataset

Loading Data into Lakeformation Datasets

Loading data into Lakeformation datasets works the same way as for S3-Athena datasets. For details, refer to Athena Datasets.

Fine grained permissions with Lakeformation Datasets

Lakeformation datasets provide an additional layer of security for the data stored in Amorphic. Currently, three levels of access control are available: Owners, Editors and Read-only.

Owners and Editors of a dataset have full column access by default, and this access cannot be modified.

Fine-grained access control lets dataset owners and editors specify which users with read-only access can access specific columns or rows.

The following examples show how user permissions are applied based on Authorized Users and Tags.

EffectivePermissions_Lakeformation_Datasets

This feature has certain limitations. For more information, refer to the Limitations section.

The animation below shows how to apply fine-grained permissions on a Lakeformation dataset:

Apply Fine Grained Permissions

Query Datasets

Once data is loaded into the dataset, users can query and analyze it directly from the Playground.

Query Dataset

If a user has read-only permissions, the displayed results are limited to the allowed columns and rows.

Query Dataset
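
The effect of column and row restrictions on query results can be illustrated with a small in-memory sketch. This is only a conceptual model with hypothetical names. The actual enforcement happens inside the service when the query runs, not in client code, and the row condition used here is an invented example.

```python
def apply_read_only_filter(rows, allowed_columns, row_filter=None):
    """Return only the rows and columns a read-only user may see."""
    visible = []
    for row in rows:
        # Drop rows that do not satisfy the row-level condition, if one is set.
        if row_filter is not None and not row_filter(row):
            continue
        # Keep only the columns the user is allowed to read.
        visible.append({col: row[col] for col in allowed_columns if col in row})
    return visible


rows = [
    {"EmailId": "test-cwdl1@cloudwick.com", "IsAdmin": "no", "UserId": "testuser1"},
    {"EmailId": "test-cwdl3@cloudwick.com", "IsAdmin": "yes", "UserId": "testuser3"},
]

# Hypothetical grant: the user may read UserId and IsAdmin for non-admin rows only.
result = apply_read_only_filter(
    rows,
    allowed_columns=["UserId", "IsAdmin"],
    row_filter=lambda row: row["IsAdmin"] == "no",
)
print(result)
```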

Note

Lakeformation governed datasets are deprecated as of v2.3. Use Iceberg datasets instead, which provide the same features and more:

  • Read the data
  • Upsert records
  • Delete records
  • Time travel and version travel queries
  • View history and snapshots

Limitations

  • If a user has fine-grained permissions applied at the authorized user level and also has read-only permissions through tags, Amorphic applies the narrower permissions, i.e. the fine-grained row and column permissions set at the authorized user level. Please refer to the "How Are Permissions Applied" section above.
  • Views:
    • View permissions need to be aligned with dataset permissions, i.e. the owner of the view needs to grant a user access to the underlying Lakeformation dataset before granting access to the view.
    • When the owner of the Lakeformation dataset updates the access control using authorized users or tags, querying the view fails with a message saying "view is stale; it must be re-created". The owner of the view needs to either use the CREATE OR REPLACE statement to recreate the view, or delete and re-create the view with the necessary user permissions. For more details, please check the AWS Documentation on this topic.
    • When creating a view from Lake Formation datasets, if the user does not have access to all columns in the source dataset, they will be able to create the view but will not be able to query it. However, they can still edit the view using only the columns they have access to.
  • DMS tasks do not support loading data into Lakeformation target datasets.
  • Currently, due to character limitations on IAM policies, AWS can only register up to 500 Lakeformation datasets. If you receive the error message DS-1061 - Failed to register dataset in the lakeformation catalog, error message: Unable to register the following path: s3://..., this may be due to the limit. As a workaround, you can remove unnecessary Lakeformation datasets from Amorphic and try again to publish a new Lakeformation dataset.