Amazon S3 transfers

The BigQuery Data Transfer Service for Amazon S3 allows you to automatically schedule and manage recurring load jobs from Amazon S3 into BigQuery.

Before you begin

Before you create an Amazon S3 transfer:

  • Verify that you have completed all actions required to enable the BigQuery Data Transfer Service.
  • Create a BigQuery dataset to store your data.
  • Create the destination table for your transfer and specify the schema definition. The destination table must follow the table naming rules. Destination table names also support parameters. (A minimal sketch of creating the dataset and table with the Python client follows this list.)
  • Retrieve your Amazon S3 URI, your access key ID, and your secret access key. For information on managing your access keys, see the AWS documentation.
  • If you intend to set up transfer run notifications for Pub/Sub, you must have pubsub.topics.setIamPolicy permissions. Pub/Sub permissions are not required if you only set up email notifications. For more information, see BigQuery Data Transfer Service run notifications.
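
The following is a minimal sketch of the dataset and destination table prerequisites using the google-cloud-bigquery Python client; the dataset name mydataset, table name MyTable, and the two-column schema are illustrative assumptions, not values required by the transfer service:

from google.cloud import bigquery

# Assumed names; replace with your own dataset, table, and schema.
client = bigquery.Client()
dataset_id = f"{client.project}.mydataset"
table_id = f"{dataset_id}.MyTable"

# Create the dataset that will hold the transferred data.
dataset = bigquery.Dataset(dataset_id)
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)

# Create the destination table with an explicit schema definition.
schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("post_abbr", "STRING"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)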

Limitations

Amazon S3 transfers are subject to the following limitations:

  • Currently, the bucket portion of the Amazon S3 URI cannot be parameterized.
  • Transfers from Amazon S3 are always triggered with the WRITE_APPEND preference, which appends data to the destination table. See configuration.load.writeDisposition in the load job configuration for additional details.
  • Depending on the format of your Amazon S3 source data, there may be additional limitations. For more information, see:

    • CSV limitations
    • JSON limitations
    • Limitations on nested and repeated data
  • The minimum interval time between recurring transfers is 24 hours. The default interval for a recurring transfer is 24 hours.

Required permissions

Before creating an Amazon S3 transfer:

  • Ensure that the person creating the transfer has the following required permissions in BigQuery:

    • bigquery.transfers.update permissions to create the transfer
    • Both bigquery.datasets.get and bigquery.datasets.update permissions on the target dataset

    The bigquery.admin predefined IAM role includes bigquery.transfers.update, bigquery.datasets.update, and bigquery.datasets.get permissions. For more information on IAM roles in BigQuery Data Transfer Service, see Access control reference.

  • Consult the documentation for Amazon S3 to ensure you have configured any permissions necessary to enable the transfer. At a minimum, the Amazon S3 source data must have the AWS managed policy AmazonS3ReadOnlyAccess applied to it (a sketch of attaching this policy follows).
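
As an illustration only, a hedged sketch of attaching that managed policy to the AWS IAM user whose access keys the transfer will use, via boto3; the user name transfer-user is an assumption:

import boto3

# Assumed IAM user name; replace with the user whose access keys you will give to the transfer.
iam = boto3.client("iam")
iam.attach_user_policy(
    UserName="transfer-user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)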

Setting up an Amazon S3 data transfer

To create an Amazon S3 data transfer:

Console

  1. Go to the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. Click Transfers.

  3. Click Create a Transfer.

  4. On the Create Transfer page:

    • In the Source type section, for Source, choose Amazon S3.

      Transfer source

    • In the Transfer config name section, for Display name, enter a name for the transfer such as My Transfer. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.

      Transfer name

    • In the Schedule options section, for Schedule, leave the default value (Start now) or click Start at a set time.

      • For Repeats, choose an option for how often to run the transfer. Options include:

        • Daily (default)
        • Weekly
        • Monthly
        • Custom
        • On-demand

        If you choose an option other than Daily, additional options are available. For example, if you choose Weekly, an option appears for you to select the day of the week.

      • For Start date and run time, enter the date and time to start the transfer. If you choose Start now, this option is disabled.

        Transfer schedule

    • In the Destination settings section, for Destination dataset, choose the dataset you created to store your data.

      Transfer dataset

    • In the Data source details section:

      • For Destination table, enter the name of the table you created to store the data in BigQuery. Destination table names support parameters.
      • For Amazon S3 URI, enter the URI in the following format: s3://mybucket/myfolder/.... URIs also support parameters.
      • For Access key ID, enter your access key ID.
      • For Secret access key, enter your secret access key.
      • For File format, choose your data format: newline delimited JSON, CSV, Avro, Parquet, or ORC.

        S3 source details

    • In the Transfer options - all formats section:

      • For Number of errors allowed, enter an integer value for the maximum number of bad records that can be ignored.
      • (Optional) For Decimal target types, enter a comma-separated list of possible SQL data types that the source decimal values could be converted to. Which SQL data type is selected for conversion depends on the following conditions:
        • The data type selected for conversion will be the first data type in the following list that supports the precision and scale of the source data, in this order: NUMERIC, BIGNUMERIC, and STRING.
        • If none of the listed data types will support the precision and the scale, the data type supporting the widest range in the specified list is selected. If a value exceeds the supported range when reading the source data, an error will be thrown.
        • The data type STRING supports all precision and scale values.
        • If this field is left empty, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.
        • This field cannot contain duplicate data types.
        • The order of the data types that you list in this field is ignored.

      Transfer options all format

    • If you chose CSV or JSON as your file format, in the JSON, CSV section, check Ignore unknown values to accept rows that contain values that do not match the schema. Unknown values are ignored. For CSV files, this option ignores extra values at the end of a line.

      Ignore unknown values

    • If you chose CSV as your file format, in the CSV section, enter any additional CSV options for loading data.

      CSV options

    • (Optional) In the Notification options section:

      • Click the toggle to enable email notifications. When you enable this option, the transfer administrator receives an email notification when a transfer run fails.
      • For Select a Pub/Sub topic, choose your topic name or click Create a topic to create one. This option configures Pub/Sub run notifications for your transfer.
  5. Click Save.

bq

Enter the bq mk command and supply the transfer creation flag — --transfer_config.

bq mk \
--transfer_config \
--project_id=project_id \
--data_source=data_source \
--display_name=name \
--target_dataset=dataset \
--params='parameters'

Where:

  • project_id: Optional. Your Google Cloud project ID. If --project_id isn't supplied to specify a particular project, the default project is used.
  • data_source: Required. The data source — amazon_s3.
  • display_name: Required. The display name for the transfer configuration. The transfer name can be any value that allows you to easily identify the transfer if you need to modify it later.
  • dataset: Required. The target dataset for the transfer configuration.
  • parameters: Required. The parameters for the created transfer configuration in JSON format. For example: --params='{"param":"param_value"}'. The following are the parameters for an Amazon S3 transfer:

    • destination_table_name_template: Required. The name of your destination table.
    • data_path: Required. The Amazon S3 URI, in the following format:

      s3://mybucket/myfolder/...

      URIs also support parameters.

    • access_key_id: Required. Your access key ID.

    • secret_access_key: Required. Your secret access key.

    • file_format: Optional. Indicates the type of files you wish to transfer: CSV, JSON, AVRO, PARQUET, or ORC. The default value is CSV.

    • max_bad_records: Optional. The number of allowed bad records. The default is 0.

    • decimal_target_types: Optional. A comma-separated list of possible SQL data types that the source decimal values could be converted to. If this field is not provided, the data type will default to "NUMERIC,STRING" for ORC, and "NUMERIC" for the other file formats.

    • ignore_unknown_values: Optional, and ignored if file_format is not JSON or CSV. Whether to ignore unknown values in your data.

    • field_delimiter: Optional, and applies only when file_format is CSV. The character that separates fields. The default value is a comma.

    • skip_leading_rows: Optional, and applies only when file_format is CSV. Indicates the number of header rows you don't want to import. The default value is 0.

    • allow_quoted_newlines: Optional, and applies only when file_format is CSV. Indicates whether to allow newlines within quoted fields.

    • allow_jagged_rows: Optional, and applies only when file_format is CSV. Indicates whether to accept rows that are missing trailing optional columns. The missing values will be filled in with NULLs.

For example, the following command creates an Amazon S3 transfer named My Transfer using a data_path_template value of s3://mybucket/myfile/*.csv, target dataset mydataset, and file_format CSV. This example includes non-default values for the optional params associated with the CSV file_format.

The transfer is created in the default project:

bq mk --transfer_config \
--target_dataset=mydataset \
--display_name='My Transfer' \
--params='{"data_path_template":"s3://mybucket/myfile/*.csv", "destination_table_name_template":"MyTable", "file_format":"CSV", "max_bad_records":"1", "ignore_unknown_values":"true", "field_delimiter":"|", "skip_leading_rows":"1", "allow_quoted_newlines":"true", "allow_jagged_rows":"false", "delete_source_files":"true"}' \
--data_source=amazon_s3

After running the command, you receive a message like the following:

[URL omitted] Please copy and paste the above URL into your web browser and follow the instructions to retrieve an authentication code.

Follow the instructions and paste the authentication code on the command line.

API

Use the projects.locations.transferConfigs.create method and supply an instance of the TransferConfig resource.
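
For example, a minimal sketch of creating the same configuration with the Python client library for the BigQuery Data Transfer Service; the project, dataset, table, bucket, and credential values are placeholders:

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder values; replace with your own project, dataset, and AWS details.
project_id = "my-project"
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="mydataset",
    display_name="My Transfer",
    data_source_id="amazon_s3",
    params={
        "destination_table_name_template": "MyTable",
        "data_path": "s3://mybucket/myfolder/*.csv",
        "access_key_id": "AWS_ACCESS_KEY_ID",
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "file_format": "CSV",
    },
    schedule="every 24 hours",
)

# Create the transfer configuration in the given project.
transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")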

Querying your data

When your data is transferred to BigQuery, the data is written to ingestion-time partitioned tables. For more information, see Introduction to partitioned tables.

If you query your tables directly instead of using the auto-generated views, you must use the _PARTITIONTIME pseudo-column in your query. For more information, see Querying partitioned tables.
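
For instance, a sketch of filtering on the pseudo-column with the google-cloud-bigquery Python client; the table name and partition date are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Restrict the scan to a single ingestion-time partition of the destination table.
query = """
    SELECT *
    FROM `my-project.mydataset.MyTable`
    WHERE _PARTITIONTIME = TIMESTAMP("2024-01-01")
"""
for row in client.query(query).result():
    print(row)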

Impact of prefix matching versus wildcard matching

The Amazon S3 API supports prefix matching, but not wildcard matching. All Amazon S3 files that match a prefix will be transferred into Google Cloud. However, only those that match the Amazon S3 URI in the transfer configuration will actually get loaded into BigQuery. This could result in excess Amazon S3 egress costs for files that are transferred but not loaded into BigQuery.

As an example, consider this data path:

s3://bucket/folder/*/subfolder/*.csv

Along with these files in the source location:

s3://bucket/folder/any/subfolder/file1.csv
s3://bucket/folder/file2.csv

This will result in all Amazon S3 files with the prefix s3://bucket/folder/ being transferred to Google Cloud. In this example, both file1.csv and file2.csv will be transferred.

However, only files matching s3://bucket/folder/*/subfolder/*.csv will actually load into BigQuery. In this example, only file1.csv will be loaded into BigQuery.
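
The split between what is transferred (prefix match) and what is loaded (wildcard match) can be sketched with Python's fnmatch module; this is only an illustration of the matching rules above, not the service's actual implementation:

from fnmatch import fnmatch

uri = "s3://bucket/folder/*/subfolder/*.csv"
prefix = uri.split("*", 1)[0]  # everything before the first wildcard: "s3://bucket/folder/"

files = [
    "s3://bucket/folder/any/subfolder/file1.csv",
    "s3://bucket/folder/file2.csv",
]

transferred = [f for f in files if f.startswith(prefix)]  # both files are transferred
loaded = [f for f in files if fnmatch(f, uri)]            # only file1.csv is loaded

print(transferred)
print(loaded)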

Troubleshooting

The following provides information about common errors and the recommended resolution.

Amazon S3 PERMISSION_DENIED errors

Error: The AWS Access Key Id you provided does not exist in our records.
Recommended action: Ensure the access key exists and the ID is correct.

Error: The request signature we calculated does not match the signature you provided. Check your key and signing method.
Recommended action: Ensure that the transfer configuration has the correct corresponding Secret Access Key.

Error: Failed to obtain the location of the source S3 bucket. Additional details: Access Denied
Error: Failed to obtain the location of the source S3 bucket. Additional details: HTTP/1.1 403 Forbidden
Error: S3 error message: Access Denied
Recommended action: Ensure the AWS IAM user has permission to perform the following:
  • List the Amazon S3 bucket.
  • Get the location of the bucket.
  • Read the objects in the bucket.

Error: Server unable to initialize object upload.; InvalidObjectState: The operation is not valid for the object's storage class
Error: Failed to obtain the location of the source S3 bucket. Additional details: All access to this object has been disabled
Recommended action: Restore any objects that are archived to Amazon Glacier. Objects in Amazon S3 that are archived to Amazon Glacier are not accessible until they are restored.

Error: All access to this object has been disabled
Recommended action: Confirm that the Amazon S3 URI in the transfer configuration is correct.

Amazon S3 transfer limit errors

Error: Number of files in transfer exceeds limit of 10000.
Recommended action: Evaluate if the number of wildcards in the Amazon S3 URI can be reduced to just one. If this is possible, retry with a new transfer configuration, as the maximum number of files per transfer run will be higher. Alternatively, evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data.

Error: Size of files in transfer exceeds limit of 16492674416640 bytes.
Recommended action: Evaluate if the transfer configuration can be split into multiple transfer configurations, each transferring a portion of the source data.

General issues

Error: Files are transferred from Amazon S3 but not loaded into BigQuery. The transfer logs may look similar to this:

Moving data from Amazon S3 to Google Cloud complete: Moved <NNN> object(s).
No new files found matching <Amazon S3 URI>.

Recommended action: Confirm that the Amazon S3 URI in the transfer configuration is correct. If the transfer configuration was meant to load all files with a common prefix, ensure that the Amazon S3 URI ends with a wildcard. For example, to load all files in s3://my-bucket/my-folder/, the Amazon S3 URI in the transfer configuration must be s3://my-bucket/my-folder/*, not just s3://my-bucket/my-folder/.

Other issues: See Troubleshooting transfer configurations.

What's next

  • For an introduction to Amazon S3 transfers, see Overview of Amazon S3 transfers.
  • For an overview of BigQuery Data Transfer Service, see Introduction to BigQuery Data Transfer Service.
  • For information on using transfers, including getting information about a transfer configuration, listing transfer configurations, and viewing a transfer's run history, see Working with transfers.